MLC++ Updates 10/6/96
_______________

SGI MLC++ utilities 2.0 are RESEARCH DOMAIN only. This means they can
be used for research/educational purposes. See
http://www.sgi.com/tech/mlc/terms.html for details.

Appended is a list of new features in version 2.0

-- Ronny Kohavi (ronnyk@sgi.com)

Main Differences Between MLC++ 1.3.2 and 2.0
____________________________________________

- The distribution is compiled in FAST mode, which is about 30% faster
  than non-fast mode (non-fast mode was the mode previously
  distributed).
- The utilities distribution is given using dynamically shared objects
  that save space. The compressed tar file is now about half the size
  of the last version.
- Persistent categorizers are now supported. Persistent decision trees
  and Naive-Bayes are implemented. This allows a categorizer to be
  saved and later read in. Assimilation code allows instances with
  more attributes than required by the categorizer to be assimilated
  and categorized.
- Decision trees were improved as follows:
  - Decision trees now provide pruning in a way similar to C4.5.
    Branch replacement is not done, and the fudge factors (C4.5 adds
    0.1 in certain places) are not in the code. The MC4 inducer
    defaults to a setting very similar to C4.5's setting.
  - A new option, adjust thresholds, allows splits in decision trees
    to be adjusted to actual data elements as in C4.5. The option is
    implemented much more efficiently than in C4.5.
  - Gain ratio is supported as a splitting criterion. This is
    implemented exactly as in C4.5 (with all the hacks), so that
    except for unknown handling and tie breakers, the unpruned trees
    are the same.
  - More statistics are provided about the number of attributes and
    depth of tree.
  - Improved output for MineSet(TM) Tree Visualizer.
- Naive-Bayes changes:
  - Naive-Bayes uses a value of 0 as a default for NO_MATCHES_FACTOR,
    which is the value used when there are no records matching a
    given attribute value and label value.
    This change makes the probability distribution consistent and
    (surprisingly?) results sometimes improve. The previous default
    was 0.5 over the number of instances.
  - Naive-Bayes now supports Laplace corrections.
  - Naive-Bayes now outputs MineSet(TM) Evidence Visualizer format
    files.
- The biasVar utility has been added for the bias-variance
  decomposition based on the Kohavi & Wolpert ICML-96 paper.
- NBTree, described in Kohavi, KDD-96, can be set up with the
  following parameters:
    setenv INDUCER catdt
    setenv CATDT_LEAF_INDUCER naive
    setenv LBOUND_MIN_SPLIT 30
    setenv CATDT_CV_FOLDS 5
    setenv CATDT_CV_TIMES 1
    setenv CATDT_LEAF_NB_NO_MATCHES_FACTOR 0.5
    setenv IMPROVE_RATIO 0.05
- ``Dribble'' support was added to show progress on big files. Dribble
  is done for decision trees and discretization.
- Unlabelled instance lists are partially supported. The syntax is to
  say ``nolabel'' in the names file.
- Options in OODG allow you to build an oblivious decision tree to
  determine the cumulative purity of chosen attributes.
- Control-c and kill signals are caught and handled. A cleanup of
  temporary files is done.
- Governors remove attributes with too many values and avoid runs with
  too many label values. These limits are artificial and can be
  increased by changing the appropriate options in the message. With
  this change, dynamic projections of instance lists are allowed
  during reading (mainly an efficiency issue).
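
To illustrate the Naive-Bayes changes listed above, here is a small
sketch of how the estimate of P(attribute value | label) could behave
under the old and new NO_MATCHES_FACTOR defaults and with a Laplace
correction. This is not MLC++ code: the function cond_prob and all of
its parameter names are hypothetical, and the Laplace form shown is the
textbook one, which may differ in detail from what MLC++ implements.

```python
def cond_prob(match_count, label_count, total_instances, num_values,
              no_matches_factor=0.0, laplace=False):
    """Estimate P(attribute value | label) from counts.

    Hypothetical helper for illustration only, not part of MLC++.
    match_count     -- records with this attribute value and this label
    label_count     -- records with this label
    total_instances -- records in the training set
    num_values      -- distinct values the attribute can take
    """
    if laplace:
        # Textbook Laplace correction: add one to every count.
        return (match_count + 1) / (label_count + num_values)
    if match_count == 0:
        # No matching records: fall back to factor / number of instances.
        # A factor of 0 (the 2.0 default) leaves the estimate at zero.
        return no_matches_factor / total_instances
    return match_count / label_count


if __name__ == "__main__":
    # Attribute with 4 values, 10 records of this label, 100 in total.
    print(cond_prob(0, 10, 100, 4))                         # new default
    print(cond_prob(0, 10, 100, 4, no_matches_factor=0.5))  # old default
    print(cond_prob(0, 10, 100, 4, laplace=True))           # Laplace
```

With a factor of 0 the per-label estimates over all values of an
attribute still sum to one, which is one way to read the note that the
distribution becomes consistent.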