
Differences from 1.3.2

  1. The main change is the policy regarding the status of MLC++. SGI MLC++, which is MLC++ 2.0 and above, is no longer public domain but research domain. This means that it can be used for research purposes but cannot be used in any commercial product without prior agreement from Silicon Graphics. For more details, see the MLC++ home page.

  2. The preferred reference to MLC++ changed from mlc-old-intro to mlc-new-intro.

  3. The distribution is compiled in FAST mode, which is about 30% faster than non-fast mode (non-fast mode was the mode previously distributed).

  4. The utilities are now distributed as dynamically shared objects, which saves space. The compressed tar file is now about half the size of the last version.

  5. Persistent categorizers are now supported. Persistent decision trees and Naive-Bayes are implemented. This allows a categorizer to be saved and later read in. Assimilation code allows instances with more attributes than required by the categorizer to be assimilated and categorized.
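The save, reload, and assimilate cycle described in item 5 can be sketched as follows. This is a minimal illustration in Python, not MLC++ code: the `TableCategorizer` class, the JSON file layout, and the `assimilate` function are all hypothetical names invented for the example, and the actual MLC++ persistent file format is not shown here.

```python
import json


class TableCategorizer:
    """Toy lookup-table categorizer (hypothetical, not an MLC++ class)."""

    def __init__(self, attributes, table, default):
        self.attributes = list(attributes)  # attributes the categorizer requires
        self.table = dict(table)            # "value1|value2" -> label
        self.default = default              # label for unseen combinations

    def categorize(self, instance):
        key = "|".join(str(instance[a]) for a in self.attributes)
        return self.table.get(key, self.default)

    def save(self, path):
        # Persist everything needed to rebuild the categorizer later.
        state = {"attributes": self.attributes,
                 "table": self.table,
                 "default": self.default}
        with open(path, "w") as f:
            json.dump(state, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            state = json.load(f)
        return cls(state["attributes"], state["table"], state["default"])


def assimilate(categorizer, instance):
    """Project an instance onto the attributes the categorizer requires.

    Extra attributes are dropped; a missing required attribute is an error.
    """
    missing = [a for a in categorizer.attributes if a not in instance]
    if missing:
        raise KeyError(f"instance lacks required attributes: {missing}")
    return {a: instance[a] for a in categorizer.attributes}
```

An instance carrying an extra attribute (say `humidity`) can be passed through `assimilate` first and then handed to `categorize` on the reloaded object.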

  6. Decision trees were improved as follows:

    1. Decision trees now provide pruning in a way similar to C4.5. Branch replacement is not done, and the fudge factors (C4.5 adds 0.1 in certain places) are not in the code. The MC4 inducer defaults to a setting very similar to C4.5's setting.

    2. A new option, adjust thresholds, allows splits in decision trees to be adjusted to actual data elements as in C4.5. The option is implemented much more efficiently than in C4.5.

    3. Gain ratio is supported as a splitting criterion. It is implemented exactly as in C4.5 (with all the hacks), so that except for unknown handling and tie breakers, the unpruned trees are the same.

    4. More statistics are provided about the number of attributes and depth of tree.

    5. Improved output for MineSet Tree Visualizer.
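For readers unfamiliar with the two C4.5 ideas mentioned above, here is a minimal Python sketch of the textbook gain ratio and of threshold adjustment. It is an assumption-laden illustration, not the MLC++ or C4.5 source: it omits the C4.5 "hacks", unknown-value handling, and tie breaking, and the function names are made up for the example. The threshold rule shown (move the split point to the largest observed value not exceeding it) is the commonly described C4.5 behavior.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of `items`."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of a multiway split on `values`, divided by split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the split itself
    return gain / split_info if split_info > 0 else 0.0


def adjust_threshold(threshold, observed_values):
    """Move a candidate threshold to the largest observed value not exceeding it."""
    candidates = [v for v in observed_values if v <= threshold]
    return max(candidates) if candidates else threshold
```

With this sketch, a midpoint threshold of 2.5 over the observed values 1, 2, 3, 4 is adjusted to 2, so the split always lands on a value that actually occurs in the data.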

  7. Naive-Bayes changes:

    1. Naive-Bayes now uses 0 as the default for NO_MATCHES_FACTOR, the value used when no records match a given attribute value and label value. This change was made to make the probability distribution consistent, and (surprisingly?) results sometimes improve. The previous default was 0.5 over the number of instances.

    2. Naive-Bayes now supports Laplace corrections.

    3. Naive-Bayes now outputs MineSet Evidence Visualizer format files.
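The difference between the no-matches factor and a Laplace correction can be illustrated with a small Python sketch. This is not the MLC++ implementation. It assumes, following the description of the previous default, that the factor is divided by the total number of instances, and it uses the standard add-one form of the Laplace correction, which may differ from the form MLC++ uses. All parameter names are hypothetical.

```python
def conditional_probability(match_count, label_count, total_instances,
                            num_attr_values, no_matches_factor=0.0,
                            laplace=False):
    """Estimate P(attribute value | label) from counts.

    match_count:     records with this attribute value and this label
    label_count:     records with this label
    total_instances: records in the training set
    num_attr_values: number of distinct values of the attribute
    """
    if laplace:
        # Add-one smoothing: never zero, and sums to 1 over attribute values.
        return (match_count + 1) / (label_count + num_attr_values)
    if match_count == 0:
        # Assumed reading of NO_MATCHES_FACTOR: factor over number of instances.
        return no_matches_factor / total_instances
    return match_count / label_count
```

With a factor of 0 an unseen combination gets probability 0, so the estimates for one label still sum to 1; with a nonzero factor they sum to slightly more than 1, which is presumably the inconsistency the new default removes.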

  8. The biasVar utility has been added for the bias-variance decomposition based on kohavi-wolpert-bias-var.
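As a rough guide to what such a decomposition computes, the sketch below evaluates the zero-one-loss bias, variance, and noise terms at a single test point, in the form usually attributed to Kohavi and Wolpert. It is an illustration only, not the biasVar utility: the function name and input format are assumptions, and how biasVar samples training sets or averages over test points is not shown.

```python
from collections import Counter


def kohavi_wolpert_terms(target_dist, predictions):
    """Return (bias_squared, variance, noise) at one test point, zero-one loss.

    target_dist: dict mapping label -> probability of that true label at the point
    predictions: labels predicted for the point by classifiers trained on
                 different training samples
    """
    n = len(predictions)
    pred_dist = {y: c / n for y, c in Counter(predictions).items()}
    labels = set(target_dist) | set(pred_dist)
    bias_sq = 0.5 * sum(
        (target_dist.get(y, 0.0) - pred_dist.get(y, 0.0)) ** 2 for y in labels
    )
    variance = 0.5 * (1.0 - sum(p * p for p in pred_dist.values()))
    noise = 0.5 * (1.0 - sum(p * p for p in target_dist.values()))
    return bias_sq, variance, noise
```

For a noise-free target, the three terms add up to the fraction of the classifiers that misclassify the point, which is a convenient sanity check.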

  9. NBTree, described in kohavi-nbtree, can be set up with the following parameters:
          setenv INDUCER catdt
          setenv CATDT_LEAF_INDUCER naive
          setenv LBOUND_MIN_SPLIT 30
          setenv CATDT_CV_FOLDS 5
          setenv CATDT_CV_TIMES 1
          setenv CATDT_LEAF_NB_NO_MATCHES_FACTOR 0.5
          setenv IMPROVE_RATIO 0.05

  10. ``Dribble'' support was added to show progress on big files. Dribbling is done for decision trees and discretization.

  11. Unlabelled instance lists are partially supported. The syntax is to say ``nolabel'' in the names file.

  12. Options in OODG allow you to build an oblivious decision tree to determine the cumulative purity of chosen attributes.

  13. Control-C and kill signals are now caught and handled, and temporary files are cleaned up.

  14. Governors remove attributes with too many values and avoid runs with too many label values. These limits are artificial and can be raised by changing the options named in the message. With this change, dynamic projections of instance lists are allowed during reading (mainly an efficiency issue).
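A governor of this kind can be pictured with the short Python sketch below. It is a hypothetical illustration rather than the MLC++ logic: the function name, the two limit parameters, and the choice to raise an error for too many label values are assumptions, and the real option names and default limits are not given here.

```python
def apply_governors(instances, label, max_attr_values, max_label_values):
    """Drop attributes with too many distinct values; refuse too many labels.

    instances: list of dicts mapping attribute name -> value
    label:     name of the label attribute
    Returns (projected_instances, dropped_attribute_names).
    """
    distinct = {}
    for inst in instances:
        for name, value in inst.items():
            distinct.setdefault(name, set()).add(value)

    if len(distinct.get(label, ())) > max_label_values:
        raise ValueError(
            f"label '{label}' has more than {max_label_values} values; "
            "raise the (hypothetical) max_label_values limit to proceed"
        )

    dropped = sorted(
        name for name, vals in distinct.items()
        if name != label and len(vals) > max_attr_values
    )
    # Project every instance onto the attributes that survived.
    projected = [
        {name: value for name, value in inst.items() if name not in dropped}
        for inst in instances
    ]
    return projected, dropped
```

An identifier-like attribute with a distinct value per record is the typical case such a limit removes.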

  15. Misc changes:

    1. The URL for dotty has changed with the breakup of AT&T.

  16. Major source code changes:

    1. At the source level, Bag, BagCounter, List, and CounterList have been unified into one smart class, which makes much of the programming easier.

    2. The GNU library is no longer used.

    3. Code is available to upload data into a database (Oracle, Sybase, and Informix variations are available) through database loaders. This is not a straight load; instead, the SQL files are written out.


Ronny Kohavi
Sun Oct 6 23:17:50 PDT 1996