
Feature Subset Selection

 

Feature subset selection is a wrapper inducer that selects a good subset of features to improve accuracy [,,].

All options in accuracy estimation (Section 3) can be used with the extra options listed below.
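For example, one could change the internal accuracy estimation to five-fold cross-validation repeated twice. The option names below appear in the log output later in this section; the values here are only illustrative:

```shell
setenv FSS_ACC_ESTIMATOR cv
setenv FSS_CV_FOLDS 5
setenv FSS_CV_TIMES 2
```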

Table: Feature Subset Selection options

To run feature subset selection with the IB inducer on the monk1 dataset, one can do:

   setenv LOGLEVEL 1
   setenv INDUCER FSS
   setenv FSS_INDUCER IB
   setenv DATAFILE    monk1
   setenv FSS_DOT_FILE IBFSS.dot
   Inducer
The output is:
   MLC++ Debug level is 0, log level is 1
   OPTION PROMPTLEVEL = required-only
   OPTION INDUCER = FSS
   OPTION INDUCER_NAME = FSS
   OPTION FSS_INDUCER = IB
   OPTION FSS_INDUCER_NAME = IB
   OPTION FSS_NUM_NEIGHBORS = 1
   OPTION FSS_EDITING = false
   OPTION FSS_NNKVALUE = num-distances
   OPTION FSS_NORMALIZATION = extreme
   OPTION FSS_NEIGHBOR_VOTE = inverse-distance
   OPTION FSS_MANUAL_WEIGHTS = false
   OPTION FSS_DOT_FILE = IBFSS.dot
   OPTION FSS_SEARCH_METHOD = best-first
   OPTION FSS_EVAL_LIMIT = 0
   OPTION FSS_SHOW_REAL_ACC = best-only
   OPTION FSS_MAX_STALE = 5
   OPTION FSS_EPSILON = 0.001
   OPTION FSS_USE_COMPOUND = true
   OPTION FSS_CMPLX_PENALTY = 0
   OPTION FSS_ACC_ESTIMATOR = cv
   OPTION FSS_ACC_EST_SEED = 7258789
   OPTION FSS_ACC_TRIM = 0
   OPTION FSS_CV_FOLDS = 10
   OPTION FSS_CV_TIMES = 1
   OPTION FSS_CV_FRACT = 1
   Method: cv
   Trim: 0
   Seed: 7258789
   Folds: 10,  Times: 1
   OPTION FSS_DIRECTION = forward
   OPTION DATAFILE = monk1
   OPTION NAMESFILE = monk1.names
   OPTION REMOVE_UNKNOWN_INST = false
   OPTION CORRUPT_UNKNOWN_RATE = 0
   Reading monk1.data.. done.
   OPTION TESTFILE = monk1.test
   Reading monk1.test..... done.
   
   New best node (1 evals) #0[]: accuracy: 39.49% +- 2.45% (30.77% - 50.00%).
   Test Set: 50.00% +- 2.41% [45.31% - 54.69%].  Bias: -10.51% cost: 10 complexity: 0
   .......
   New best node (8 evals) #5[4]: accuracy: 73.21% +- 2.92% (58.33% - 84.62%).
   Test Set: 75.00% +- 2.09% [70.71% - 78.85%].  Bias: -1.79% cost: 10 complexity: 1
   ......
   New best node (14 evals) #12[0, 1, 4]: accuracy: 99.17% +- 0.83% (91.67% - 100.00%).
   Test Set: 100.00% +- 0.00% [99.12% - 100.00%].  Bias: -0.83% cost: 10 complexity: 3
   ...................
   Final best node #12[0, 1, 4]: accuracy: 99.17% +- 0.83% (91.67% - 100.00%).
   Test Set: 100.00% +- 0.00% [99.12% - 100.00%].  Bias: -0.83% cost: 10 complexity: 3
   Expanded 8 nodes
   Accuracy: 100.00% +- 0.00% [99.12% - 100.00%]
This example shows that one can improve the accuracy from 75% to 100% by looking at only three features. In this case we know that these are the only three relevant features, but it is important to note that they were found automatically. Figure 3 shows the nodes visited and their information. The graph is automatically stored in the file named by FSS_DOT_FILE (IBFSS.dot here). The edges show the difference in estimated accuracy between the two nodes. The information in each node of the graph is the following:

  1. The top line shows the node number, which indicates the order in which nodes were evaluated, followed by the set of features used, in brackets (numbered from feature 0).

  2. The second line shows the estimated accuracy from whatever accuracy estimation method was used (e.g., cross-validation, bootstrap, holdout), with the standard deviation of the mean.

  3. The third line appears only for nodes where the real accuracy (accuracy on the test set) was computed. Which nodes get this evaluation depends on the setting of FSS_SHOW_REAL_ACC; by default, only nodes that were ``best'' at some stage will have this number. Note that this accuracy is never used by the search algorithm.
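The wrapper search described above can be approximated outside MLC++. Below is a minimal Python sketch (not the MLC++ implementation) of greedy forward selection wrapped around a 1-nearest-neighbor learner standing in for IB, with cross-validated accuracy estimation, an improvement threshold analogous to FSS_EPSILON, and a staleness limit analogous to FSS_MAX_STALE. All function names are hypothetical:

```python
def nn_learner(train):
    """1-nearest-neighbor (Hamming distance) stand-in for the IB inducer."""
    def model(x):
        return min(train, key=lambda d: sum(a != b for a, b in zip(x, d[0])))[1]
    return model

def cv_accuracy(learner, data, features, folds=10):
    """Estimate accuracy via k-fold cross-validation, using only `features`.
    `data` is a list of (x, y) pairs, where x is a tuple of feature values."""
    project = lambda x: tuple(x[f] for f in features)
    correct = 0
    for i in range(folds):
        test = data[i::folds]
        train = [(project(x), y) for j, (x, y) in enumerate(data) if j % folds != i]
        model = learner(train)
        correct += sum(model(project(x)) == y for x, y in test)
    return correct / len(data)

def forward_select(learner, data, n_features, max_stale=5, epsilon=0.001):
    """Greedy forward feature-subset selection (the wrapper approach)."""
    best_set = frozenset()
    best_acc = cv_accuracy(learner, data, [])  # start from the empty subset
    current, stale = best_set, 0
    while stale < max_stale:
        # Expand the current node: try adding each unused feature.
        children = [current | {f} for f in range(n_features) if f not in current]
        if not children:
            break
        scored = [(cv_accuracy(learner, data, sorted(s)), s) for s in children]
        acc, current = max(scored, key=lambda t: t[0])
        if acc > best_acc + epsilon:  # a real improvement resets staleness
            best_acc, best_set, stale = acc, current, 0
        else:
            stale += 1
    return sorted(best_set), best_acc
```

The real best-first search in MLC++ also keeps a queue of open nodes to expand and can apply compound operators (FSS_USE_COMPOUND); the sketch above is a hill-climbing simplification of the same idea.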

  
Figure 3: The search space for IB on the monk1 dataset



Ronny Kohavi
Sun Oct 6 23:17:50 PDT 1996