MLC++ utilities take their options from environment
variables or from the command line. All examples assume
that the shell is csh or tcsh. Text starting
with the pound sign (#) is a comment. By default, MLC++ prompts
for any required option that is not set.
Datasets are assumed to be in the MLC++ format, which is very similar
to the C4.5 [] format. Each
dataset should include a names file describing how to parse
the data, a data file containing the data, and an optional
test file for estimating accuracy.
[Running ID3]
Fisher's iris dataset [] contains four attributes of
iris plants: sepal length, sepal width, petal length, and petal
width. The task is to categorize each instance into one of the
three classes: Iris Setosa, Iris Versicolor, and Iris Virginica.
To run the ID3 induction algorithm [] on the iris dataset
in directory /u/mlc/db/ consisting of iris.names,
iris.data, and iris.test, one can type:
setenv DATAFILE iris            # The dataset stem
setenv INDUCER ID3              # pick ID3
setenv ID3_UNKNOWN_EDGES no     # Don't bother with unknown edges
setenv DISP_CONFUSION_MAT yes   # Show confusion matrix
setenv DISPLAY_STRUCT dotty     # Show the tree using dotty
Inducer
dot -Tps -Gpage="8.5,11" -Gmargin="0,0" Inducer.dot > iris.ps

The output is:
Classifying (% done): 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% done.
Number of training instances: 100
Number of test instances: 50. Unseen: 50, seen 0.
Number correct: 47. Number incorrect: 3
Generalization accuracy: 94.00%. Memorization accuracy: unknown
Accuracy: 94.00% +- 3.39% [83.78% - 97.94%]
Displaying confusion matrix...
  (a)   (b)   (c)    <-- classified as
 ----  ----  ----
   15     0     0    (a): class Iris-setosa
    0    15     2    (b): class Iris-versicolor
    0     1    17    (c): class Iris-virginica
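
The summary lines can be cross-checked against the confusion matrix. The following Python sketch is not part of MLC++; it assumes the C4.5-style layout in which each row is an actual class and each column is the class assigned by the classifier, and it copies the counts from the output above. The per-class figure it prints (the fraction of each actual class that was classified correctly) is an added illustration, not something the Inducer utility reports.

```python
# Confusion matrix copied from the Inducer output above.
# Assumption: rows are the actual classes and columns are the predicted
# classes, as suggested by the "<-- classified as" header.
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
confusion = [
    [15, 0, 0],
    [0, 15, 2],
    [0, 1, 17],
]

# Every test instance falls in exactly one cell; the diagonal holds the hits.
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

# Fraction of each actual class that was classified correctly.
recall = {
    labels[i]: confusion[i][i] / sum(confusion[i])
    for i in range(len(labels))
}

print(f"Number correct: {correct}. Number incorrect: {total - correct}")
print(f"Accuracy: {accuracy:.2%}")
for name, value in recall.items():
    print(f"{name}: {value:.2%} of its instances classified correctly")
```

The diagonal sums to 47 of 50 instances, which agrees with the "Number correct" and accuracy lines in the output.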
If you have dot installed (see Appendix A), you can generate
the iris.ps file shown in Figure 1. If you have an
X-terminal and dotty, MLC++
will display the graph on the screen.
Figure 1: The file iris.ps depicting the decision tree
induced by ID3 on the iris dataset.
The generalization accuracy is the accuracy on unseen instances,
and the memorization accuracy is the accuracy on instances
in the test set that were also in the training set. Here no test
instance was seen during training, so the memorization accuracy is
reported as unknown. The accuracy
is followed by +- and the theoretical standard deviation, and
the range afterwards is the 95% confidence interval. See
Section 3 for details.
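
As a rough check on the accuracy line, the Python sketch below recomputes the standard deviation and the interval from 47 correct out of 50. The formulas are assumptions inferred from the printed numbers, not taken from the MLC++ source: a binomial standard deviation with n - 1 in the denominator, and a Wilson score interval with z = 1.96. Section 3 remains the authoritative description; the function name accuracy_summary is hypothetical.

```python
import math


def accuracy_summary(correct, n, z=1.96):
    """Return (accuracy, std_dev, (low, high)) for `correct` hits out of `n`."""
    p = correct / n
    # Standard deviation of the accuracy estimate. Dividing by n - 1 is an
    # assumption, chosen because it matches the value printed by Inducer.
    std_dev = math.sqrt(p * (1 - p) / (n - 1))
    # Wilson score interval for a binomial proportion (also an assumption).
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, std_dev, (center - half_width, center + half_width)


acc, sd, (low, high) = accuracy_summary(47, 50)
print(f"Accuracy: {acc:.2%} +- {sd:.2%} [{low:.2%} - {high:.2%}]")
```

Under these assumptions the interval is not symmetric around 94%, which is why the printed range extends further below the estimate than above it.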
[Cross-Validation]
To estimate the accuracy of an inducer (induction algorithm) on a dataset by cross-validation, one can type:
setenv DATAFILE iris.all        # ".all" contains all the data
setenv INDUCER ID3
setenv ACC_ESTIMATOR cv
AccEst

The output is:
10 folds: 1 2 3 4 5 6 7 8 9 10
Method: cv
Trim: 0
Seed: 7258789
Folds: 10, Times: 1
Accuracy: 94.00% +- 2.10% (80.00% - 100.00%)
Cross-validating a different inducer only requires changing the value of the INDUCER environment variable:
setenv DATAFILE iris.all
setenv INDUCER IB               # A nearest-neighbor algorithm
setenv ACC_ESTIMATOR cv
AccEst

The output is:
10 folds: 1 2 3 4 5 6 7 8 9 10
Method: cv
Trim: 0
Seed: 7258789
Folds: 10, Times: 1
Accuracy: 96.00% +- 1.47% (86.67% - 100.00%)
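
To make the procedure behind these runs concrete, here is a minimal Python sketch of 10-fold cross-validation. It is not the AccEst implementation: a plain 1-nearest-neighbor rule stands in for the inducer, the data are a hypothetical two-cluster toy set rather than iris, and all function names are made up for illustration. The summary line treats the number after +- as the standard deviation of the mean over folds and the range as the lowest and highest fold accuracy. Both readings are assumptions, although the printed bounds are consistent with whole-instance counts in folds of 15 (12/15 = 80.00%, 13/15 = 86.67%).

```python
import math
import random


def k_fold_indices(n, k, seed):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_neighbor_predict(train, point):
    """Return the label of the training instance closest to `point` (1-NN)."""
    nearest = min(train, key=lambda inst: math.dist(inst[0], point))
    return nearest[1]


def cross_validate(data, k=10, seed=0):
    """Return the accuracy on each held-out fold, training on the other folds."""
    accuracies = []
    for held_out in k_fold_indices(len(data), k, seed):
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        hits = sum(
            nearest_neighbor_predict(train, data[i][0]) == data[i][1]
            for i in held_out
        )
        accuracies.append(hits / len(held_out))
    return accuracies


def summarize(accuracies):
    """Mean, standard deviation of the mean, and lowest/highest fold accuracy."""
    k = len(accuracies)
    mean = sum(accuracies) / k
    variance = sum((a - mean) ** 2 for a in accuracies) / (k - 1)
    return mean, math.sqrt(variance / k), min(accuracies), max(accuracies)


# Hypothetical toy data: two Gaussian clusters labeled "a" and "b".
rng = random.Random(1)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(30)]
data += [((rng.gauss(6, 1), rng.gauss(6, 1)), "b") for _ in range(30)]

fold_accuracies = cross_validate(data, k=10, seed=0)
mean, std_of_mean, low, high = summarize(fold_accuracies)
print(f"Folds: {len(fold_accuracies)}")
print(f"Accuracy: {mean:.2%} +- {std_of_mean:.2%} ({low:.2%} - {high:.2%})")
```

Swapping in another classifier only requires replacing nearest_neighbor_predict, which mirrors how the MLC++ runs above differ only in the value of INDUCER.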
If you set LOGLEVEL to 1, you will see all the available options. If you set PROMPTLEVEL to "basic" or "all" (the default is "required-only"), MLC++ will prompt you to fill in the options. Type '?' at any prompt for help.