The main change is the policy regarding the status of MLC++:
MLC++ 2.0 and above is no longer public domain, but research
domain. This means that it can be used
for research purposes but cannot be used in any commercial
product without prior agreement from Silicon Graphics. For
more details, see the MLC++
The preferred reference to MLC++ has changed from
mlc-old-intro to mlc-new-intro.
The distribution is compiled in FAST mode, which is about
30% faster than non-fast mode (non-fast mode was the mode
The utilities distribution now uses dynamically shared
objects, which save space. The compressed tar file is now
about half the size of the last version.
Persistent categorizers are now supported. Persistent
decision trees and Naive-Bayes are implemented. This allows
a categorizer to be saved and later read in. Assimilation
code allows instances with more attributes than required
by the categorizer to be assimilated and categorized.
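The save/load and assimilation ideas above can be sketched in a few lines. This is a minimal illustration in Python, not the MLC++ file format or API: the JSON layout and the names save_categorizer, load_categorizer, assimilate, and categorize are all hypothetical, and the "categorizer" is just a lookup table.

```python
import json
import os
import tempfile


def save_categorizer(path, attributes, rules, default_label):
    """Persist a toy categorizer: required attributes plus a lookup table."""
    with open(path, "w") as f:
        json.dump({"attributes": attributes, "rules": rules,
                   "default": default_label}, f)


def load_categorizer(path):
    """Read back a categorizer written by save_categorizer."""
    with open(path) as f:
        return json.load(f)


def assimilate(instance, categorizer):
    """Project an instance onto the attributes the categorizer requires.

    Extra attributes are dropped; a missing required attribute is an error.
    """
    missing = [a for a in categorizer["attributes"] if a not in instance]
    if missing:
        raise KeyError(f"instance lacks required attributes: {missing}")
    return [instance[a] for a in categorizer["attributes"]]


def categorize(instance, categorizer):
    """Look up the label for the assimilated instance, else the default."""
    key = "|".join(str(v) for v in assimilate(instance, categorizer))
    return categorizer["rules"].get(key, categorizer["default"])


# Usage: save, reload, then categorize an instance with an extra attribute.
demo_path = os.path.join(tempfile.mkdtemp(), "categorizer.json")
save_categorizer(demo_path, ["outlook", "humidity"], {"sunny|high": "no"}, "yes")
loaded = load_categorizer(demo_path)
```

The instance passed to categorize may carry attributes the categorizer never saw (for example "wind"); assimilation simply ignores them.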
Decision trees were improved as follows:
Decision trees now provide pruning in a way similar to C4.5.
Branch replacement is not done, and the fudge factors
(C4.5 adds 0.1 in certain places) are not in the code.
The MC4 inducer defaults to a setting very similar
to C4.5's setting.
A new option, adjust thresholds, allows splits in decision
trees to be adjusted to actual data elements as in C4.5.
The option is implemented much more efficiently than in C4.5.
Gain ratio is supported as a splitting criterion. This is
implemented exactly as the C4.5 version (with all the hacks),
so that except for unknown handling and tie breakers, the
unpruned trees are the same.
More statistics are provided about the number of attributes
and depth of tree.
Improved output for MineSet
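Two of the decision-tree items above, gain ratio and threshold adjustment, can be illustrated with a short sketch. It uses the textbook definition of gain ratio for a binary threshold split and does not reproduce the C4.5-specific hacks mentioned above; the threshold adjustment assumes that "adjusted to actual data elements" means moving the threshold down to the largest observed value not exceeding it. The function names are hypothetical and are not MLC++ identifiers.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Textbook gain ratio of the binary split value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # degenerate split: nothing is separated
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - remainder
    # Split information is the entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info


def adjust_threshold(threshold, values):
    """Move a threshold down to the largest observed value not exceeding it.

    Raises ValueError if no observed value is <= threshold.
    """
    return max(v for v in values if v <= threshold)
```

Adjusting the threshold this way does not change how the observed values are partitioned, so the gain ratio of the split is unaffected; only the reported split point changes.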
Naive-Bayes uses a value of 0 as a default for
NO_MATCHES_FACTOR, which is the value used when there are
no records matching a given attribute value and label value.
This change was made to keep the probability distribution consistent,
and (surprisingly?) results sometimes improve. The previous
default was 0.5 over the number of instances.
Naive-Bayes now supports Laplace corrections.
Naive-Bayes now outputs MineSet
Evidence Visualizer format files.
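To make the effect of NO_MATCHES_FACTOR and the Laplace correction concrete, here is a small sketch of a conditional probability estimate. It assumes the no-match value is the factor divided by the number of instances (as the old default of 0.5 over the number of instances suggests) and uses the standard add-one form of the Laplace correction; the exact formulas in MLC++ may differ, and the function name and parameters are hypothetical.

```python
def conditional_estimate(match_count, label_count, num_instances,
                         num_attr_values, no_matches_factor=0.0,
                         laplace=False):
    """Estimate P(attribute = v | label = c) from counts.

    match_count:     records with attribute value v and label c
    label_count:     records with label c
    num_instances:   total number of training records
    num_attr_values: number of distinct values of the attribute
    """
    if laplace:
        # Add-one smoothing (assumed form of the Laplace correction).
        return (match_count + 1) / (label_count + num_attr_values)
    if match_count == 0:
        # Assumed reading: factor scaled by the number of instances.
        return no_matches_factor / num_instances
    return match_count / label_count
```

With 100 instances, the new default of 0 gives an estimate of exactly 0 for an unseen combination, whereas the old default of 0.5 would give 0.005.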
The biasVar utility has been added for the bias-variance
decomposition based on kohavi-wolpert-bias-var.
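As a rough illustration of what such a decomposition computes, the sketch below evaluates the commonly cited zero-one-loss form for a single test instance: squared bias is half the squared distance between the target and prediction distributions, and variance is half of one minus the sum of squared prediction probabilities. It assumes a noise-free target (one true label) and plain frequency estimates, and it does not reproduce the biasVar utility's options or output; the function name is hypothetical.

```python
from collections import Counter


def kw_bias_variance(predictions, true_label):
    """Bias^2 and variance at one test instance under zero-one loss.

    predictions: labels predicted for this instance by models trained
                 on different training samples.
    true_label:  the instance's label (target assumed noise-free).
    """
    n = len(predictions)
    # Empirical distribution of the predicted label.
    p_hyp = {y: c / n for y, c in Counter(predictions).items()}
    classes = set(p_hyp) | {true_label}
    bias2 = 0.5 * sum(
        ((1.0 if y == true_label else 0.0) - p_hyp.get(y, 0.0)) ** 2
        for y in classes
    )
    variance = 0.5 * (1.0 - sum(p * p for p in p_hyp.values()))
    return bias2, variance
```

With a noise-free target, the two terms sum to the fraction of predictions that were wrong for the instance, which is a convenient sanity check.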
NBTree, described in kohavi-nbtree, can be set up
with the following parameters:
``Dribble'' support was added to show progress on big files.
Dribble is done for decision trees and discretization.
Unlabelled instance lists are partially supported. The syntax
is to say ``nolabel'' in the names file.
Options in OODG allow you to build an oblivious decision
tree to determine the cumulative purity of chosen attributes.
Control-c and kill signals are caught and handled.
A cleanup of temporary files is done.
Governors remove attributes with too many values and avoid
runs with too many label values. These limits are artificial
and can be increased by changing the appropriate options
in the message. With this change, dynamic projections
of instance lists are allowed during reading (mainly
an efficiency issue).
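The governor behaviour can be pictured as a filter applied while the data is read. The sketch below is a simplified illustration with made-up limits: the option names, default limits, and error text in MLC++ are not shown here, and apply_governors with its parameters is hypothetical.

```python
def apply_governors(instances, label, max_attr_values=3, max_label_values=2):
    """Drop attributes with too many values; refuse too many label values.

    instances: list of dicts mapping attribute name to value.
    label:     name of the label attribute.
    Returns (projected_instances, removed_attributes).
    """
    label_values = {inst[label] for inst in instances}
    if len(label_values) > max_label_values:
        raise ValueError(
            f"{len(label_values)} label values exceed the limit of "
            f"{max_label_values}; raise the limit to proceed"
        )
    attrs = [a for a in instances[0] if a != label]
    removed = [a for a in attrs
               if len({inst[a] for inst in instances}) > max_attr_values]
    kept = [a for a in attrs if a not in removed]
    # Project each instance onto the surviving attributes plus the label.
    projected = [{a: inst[a] for a in kept + [label]} for inst in instances]
    return projected, removed
```

An identifier-like attribute with a distinct value per record is the typical case such a filter removes.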
The URL for dotty has changed with the breakup of AT&T.
Major source code changes:
At the source level Bag, BagCounter, List, and CounterList
have been unified into one smart class, making a lot of the
The GNU library is no longer used.
Code is available to upload data into a database (Oracle,
Sybase, and Informix variations are available) through
database loaders (not a straight load, but the sql files are