MLC++ utilities take their options from environment
variables or from the command line. All examples assume
tcsh is the shell. Text starting
with the pound sign (#) is a comment. By default, required
options that are not set are prompted for.
Datasets are assumed to be in the MLC++ format, which is very similar
to the C4.5 format. Each
dataset should include a
names file describing how to parse
the data, a
data file containing the data, and an optional
test file for estimating accuracy.
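As a rough illustration, a C4.5-style names file for the iris dataset might look like the following. This is only a sketch: the attribute names and the exact MLC++ names syntax here are assumptions, not taken from the manual.

```shell
# Sketch of a C4.5-style names file for the iris data.
# Attribute names and exact syntax are illustrative assumptions.
cat > iris.names << 'EOF'
Iris-setosa, Iris-versicolor, Iris-virginica.

sepal-length: continuous.
sepal-width:  continuous.
petal-length: continuous.
petal-width:  continuous.
EOF
cat iris.names
```

The first line lists the class labels; each subsequent line declares one attribute and its type.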
Fisher's iris dataset  contains four attributes of iris plants: sepal length, sepal width, petal length, and petal width. The task is to categorize each instance into one of the three classes: Iris Setosa, Iris Versicolour, and Iris Virginica.
To run the ID3 induction algorithm on the iris dataset
in /u/mlc/db/, consisting of iris.names,
iris.data, and iris.test, one can type:
setenv DATAFILE iris            # The dataset stem
setenv INDUCER ID3              # pick ID3
setenv ID3_UNKNOWN_EDGES no     # Don't bother with unknown edges
setenv DISP_CONFUSION_MAT yes   # Show confusion matrix
setenv DISPLAY_STRUCT dotty     # Show the tree using dotty
Inducer
dot -Tps -Gpage="8.5,11" -Gmargin="0,0" Inducer.dot > iris.ps

The output is:
Classifying (% done): 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% done.
Number of training instances: 100
Number of test instances: 50. Unseen: 50, seen 0.
Number correct: 47. Number incorrect: 3
Generalization accuracy: 94.00%. Memorization accuracy: unknown
Accuracy: 94.00% +- 3.39% [83.78% - 97.94%]
Displaying confusion matrix...
  (a)   (b)   (c)   <-- classified as
 ----  ----  ----
   15     0     0   (a): class Iris-setosa
    0    15     2   (b): class Iris-versicolor
    0     1    17   (c): class Iris-virginica
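Reading the confusion matrix: each row is an actual class and each column (a)-(c) is the predicted class, so the off-diagonal entries are the errors. A quick consistency check against the reported accuracy:

```latex
\text{errors} = 2\;(\text{versicolor}\to\text{virginica})
             + 1\;(\text{virginica}\to\text{versicolor}) = 3,
\qquad
\text{accuracy} = \frac{15+15+17}{50} = \frac{47}{50} = 94.00\%.
```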
If you have dot (see
Appendix A), you can generate the
iris.ps file shown in Figure 1. If you have an
X-terminal, dotty
will display the graph on the screen.
Figure 1: The file iris.ps depicting the decision tree induced by ID3 on the iris dataset.
The generalization accuracy indicates the accuracy on unseen instances, and the memorization accuracy indicates the accuracy on instances in the test set that were also in the training set. The accuracy is followed by the theoretical standard deviation, and the range that follows is the 95% confidence interval. See Section 3 for details.
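As a sketch of where the standard deviation comes from, the usual binomial estimate (which may differ slightly from the exact formula MLC++ uses; see Section 3) applied to 47 correct out of 50 gives:

```latex
s = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
  = \sqrt{\frac{0.94 \times 0.06}{50}}
  \approx 3.36\%,
```

which is close to the reported 3.39%.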
To cross-validate an inducer (induction algorithm) and a dataset, one can do:
setenv DATAFILE iris.all   # ".all" contains all the data
setenv INDUCER ID3
setenv ACC_ESTIMATOR cv
AccEst

The output is:
10 folds: 1 2 3 4 5 6 7 8 9 10
Method: cv  Trim: 0  Seed: 7258789
Folds: 10, Times: 1
Accuracy: 94.00% +- 2.10% (80.00% - 100.00%)
Cross-validating a different inducer can be done by simply changing the value of the environment variable:
setenv DATAFILE iris.all
setenv INDUCER IB          # A nearest-neighbor algorithm
setenv ACC_ESTIMATOR cv
AccEst

The output is:
10 folds: 1 2 3 4 5 6 7 8 9 10
Method: cv  Trim: 0  Seed: 7258789
Folds: 10, Times: 1
Accuracy: 96.00% +- 1.47% (86.67% - 100.00%)
If you set the LOGLEVEL to 1, you will see all the available options. If you set the PROMPTLEVEL to ``basic'' or ``all'' (the default is ``required-only''), MLC++ will prompt you to fill in the options. Type '?' at any prompt for help.
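For example, to list the options and be prompted for the basic ones, one can set (using the variable names and values described above):

```tcsh
setenv LOGLEVEL 1          # show all the available options
setenv PROMPTLEVEL basic   # prompt for basic options; type '?' at a prompt for help
```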