
Introduction

MLC++ utilities take their options from environment variables or from the command line. All examples assume that the shell is csh or tcsh. Text starting with the pound sign (#) is a comment. By default, the utilities prompt for any required option that is not set.

Datasets are assumed to be in the MLC++ format, which is very similar to the C4.5 [] format. Each dataset should include a names file describing how to parse the data, a data file containing the data, and an optional test file for estimating accuracy.
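
As a rough illustration of the names/data split, the Python sketch below parses C4.5-style text. It assumes the conventions of C4.5 itself (class labels on the first line, one "name: type." entry per attribute, "|" starting a comment, comma-separated instances with the label last); the MLC++ format may differ in details. The function names and the sample text are made up for this example and are not part of MLC++.

```python
# Sketch only: a minimal reader for C4.5-style names and data text.
# Assumption: MLC++ files follow the same conventions; check the format
# documentation before relying on this.

def strip_comment(line):
    """Drop a '|' comment (C4.5 convention) and surrounding whitespace."""
    return line.split("|", 1)[0].strip()

def parse_names(text):
    """Return (class_labels, attributes) from names-file text."""
    lines = [s for s in map(strip_comment, text.splitlines()) if s]
    # First entry: comma-separated class labels, ended by a period.
    classes = [c.strip() for c in lines[0].rstrip(".").split(",")]
    attributes = []
    for line in lines[1:]:
        name, spec = line.rstrip(".").split(":", 1)
        spec = spec.strip()
        if spec == "continuous":
            attributes.append((name.strip(), "continuous"))
        else:
            # Nominal attribute: keep its list of allowed values.
            attributes.append((name.strip(), [v.strip() for v in spec.split(",")]))
    return classes, attributes

def parse_data(text, attributes):
    """Return a list of (values, label) pairs from data-file text."""
    instances = []
    for line in filter(None, map(strip_comment, text.splitlines())):
        *values, label = [f.strip() for f in line.rstrip(".").split(",")]
        row = [float(v) if kind == "continuous" else v
               for v, (_, kind) in zip(values, attributes)]
        instances.append((row, label))
    return instances

# Illustrative sample text, not copied from the MLC++ distribution.
NAMES = """
Iris-setosa, Iris-versicolor, Iris-virginica.   | class labels
sepal length: continuous.
sepal width: continuous.
petal length: continuous.
petal width: continuous.
"""
DATA = """
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""

classes, attributes = parse_names(NAMES)
instances = parse_data(DATA, attributes)
print(classes)
print(instances[0])
```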

[Running ID3]
Fisher's iris dataset [] contains four attributes of iris plants: sepal length, sepal width, petal length, and petal width. The task is to categorize each instance into one of the three classes: Iris Setosa, Iris Versicolour, and Iris Virginica.

To run the ID3 induction algorithm [] on the iris dataset in directory /u/mlc/db/ consisting of iris.names, iris.data, and iris.test, one can type:

   setenv DATAFILE iris              # The dataset stem
   setenv INDUCER ID3                # pick ID3
   setenv ID3_UNKNOWN_EDGES no       # Don't bother with unknown edges
   setenv DISP_CONFUSION_MAT yes     # Show confusion matrix
   setenv DISPLAY_STRUCT dotty       # Show the tree using dotty
   Inducer
   dot -Tps -Gpage="8.5,11" -Gmargin="0,0" Inducer.dot > iris.ps
The output is:
   Classifying (% done): 10%  20%  30%  40%  50%  60%  70%  80%  90%  100%  done.
   Number of training instances: 100
   Number of test instances: 50.  Unseen: 50,  seen 0.
   Number correct: 47.  Number incorrect: 3
   Generalization accuracy: 94.00%.  Memorization accuracy: unknown
   Accuracy: 94.00% +- 3.39% [83.78% - 97.94%]
   
   Displaying confusion matrix... 
    (a)  (b)  (c)    <-- classified as 
   ---- ---- ---- 
     15    0    0    (a): class Iris-setosa
      0   15    2    (b): class Iris-versicolor
      0    1   17    (c): class Iris-virginica

If you have dot installed (see Appendix A), you can generate the iris.ps file shown in Figure 1. If you have an X-terminal and dotty, MLC++ will display the graph on the screen.

  
Figure 1: The file iris.ps depicting the decision tree induced by ID3 on the iris dataset.

The generalization accuracy is the accuracy on unseen instances, and the memorization accuracy is the accuracy on test-set instances that also appeared in the training set. The accuracy is followed by +- and the theoretical standard deviation; the range after it is the 95% confidence interval. See Section 3 for details.
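
The figures in the ID3 output can be checked by hand from the confusion matrix. The Python sketch below does this under three assumptions that are not stated in this section: rows of the matrix are true classes and columns are predicted classes (the C4.5 convention), the standard deviation is that of a binomial proportion with an n - 1 denominator, and the interval is a Wilson score interval with z = 1.96. These choices happen to reproduce the printed numbers; Section 3 remains the authoritative description of what MLC++ computes.

```python
import math

# Confusion matrix from the ID3 run on the iris test set.
# Assumption: rows are true classes, columns are predicted classes.
confusion = [
    [15, 0, 0],   # Iris-setosa
    [0, 15, 2],   # Iris-versicolor
    [0, 1, 17],   # Iris-virginica
]

n = sum(map(sum, confusion))                                   # test instances
correct = sum(confusion[i][i] for i in range(len(confusion)))  # diagonal
acc = correct / n

# Assumption: binomial standard deviation with an n - 1 denominator.
std = math.sqrt(acc * (1 - acc) / (n - 1))

def wilson_interval(p, n, z=1.96):
    """Wilson score interval for a proportion p observed on n trials."""
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - half) / denom, (centre + half) / denom

low, high = wilson_interval(acc, n)
print(f"Number correct: {correct}.  Number incorrect: {n - correct}")
print(f"Accuracy: {acc:.2%} +- {std:.2%} [{low:.2%} - {high:.2%}]")
```

Note that the interval is not symmetric around 94.00%, which is why it cannot be obtained by simply adding and subtracting a multiple of the standard deviation.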

[Cross-Validation]

To cross-validate an inducer (induction algorithm) on a dataset, one can type:

   setenv DATAFILE iris.all   # ".all" contains all the data
   setenv INDUCER ID3  
   setenv ACC_ESTIMATOR cv
   AccEst
The output is:
   10 folds: 1 2 3 4 5 6 7 8 9 10 
   Method: cv
   Trim: 0
   Seed: 7258789
   Folds: 10,  Times: 1
   Accuracy: 94.00% +- 2.10% (80.00% - 100.00%)

Cross-validating a different inducer requires only changing the value of the INDUCER environment variable:

   setenv DATAFILE iris.all
   setenv INDUCER IB         # A nearest-neighbor algorithm
   setenv ACC_ESTIMATOR cv
   AccEst
The output is:
   10 folds: 1 2 3 4 5 6 7 8 9 10 
   Method: cv
   Trim: 0
   Seed: 7258789
   Folds: 10,  Times: 1
   Accuracy: 96.00% +- 1.47% (86.67% - 100.00%)
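
For readers unfamiliar with the procedure, the Python sketch below shows the general shape of k-fold cross-validation: shuffle the instances, split them into k folds, and for each fold train on the other k - 1 folds and test on the held-out one. It is not the MLC++ implementation. The nearest-neighbor function is a hypothetical stand-in inducer, the data are randomly generated, and reading the "+-" figure as the standard error of the fold mean and the parenthesised range as the lowest and highest fold accuracy is an assumption about the output format.

```python
import random
import statistics

def k_fold_indices(n, k, seed):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, induce, k=10, seed=0):
    """Return the per-fold accuracies of an inducer on data.

    data is a list of (features, label) pairs; induce takes a training
    list and returns a function mapping features to a predicted label.
    """
    accuracies = []
    for held_out in k_fold_indices(len(data), k, seed):
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        classify = induce(train)
        hits = sum(classify(data[i][0]) == data[i][1] for i in held_out)
        accuracies.append(hits / len(held_out))
    return accuracies

def nearest_neighbor(train):
    """Hypothetical stand-in inducer: 1-nearest-neighbor, squared Euclidean distance."""
    def classify(x):
        closest = min(train, key=lambda inst: sum((a - b) ** 2
                                                  for a, b in zip(inst[0], x)))
        return closest[1]
    return classify

# Made-up two-class data: 30 points around (0, 0) and 30 around (5, 5).
rng = random.Random(0)
data = [([rng.gauss(c * 5, 1), rng.gauss(c * 5, 1)], c)
        for c in (0, 1) for _ in range(30)]

accs = cross_validate(data, nearest_neighbor, k=10)
mean = statistics.mean(accs)
std_err = statistics.stdev(accs) / len(accs) ** 0.5
print(f"Folds: {len(accs)}")
print(f"Accuracy: {mean:.2%} +- {std_err:.2%} ({min(accs):.2%} - {max(accs):.2%})")
```

Because every instance is held out exactly once, each one contributes to the estimate while never being classified by a model that saw it during training.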

If you set LOGLEVEL to 1, you will see all the available options. If you set PROMPTLEVEL to "basic" or "all" (the default is "required-only"), MLC++ will prompt you to fill in the options. Type '?' at any prompt for help.






Ronny Kohavi
Sun Oct 6 23:17:50 PDT 1996