
Info

The info utility provides basic statistical information about a dataset. It reports the number of instances in the ``.data'' file, the ``.test'' file, and the ``.all'' file (the ``.all'' file is optional and should contain the instances of both the ``.data'' and ``.test'' files, for use in AccEst).

It also reports the class probabilities, the number of attributes, and the type of each attribute (continuous or nominal). If the option SHOW_ATTR_INFO is set to yes, the number of distinct values for each attribute is given as well. This may help pinpoint inappropriate attribute declarations, or continuous attributes that simply have very few values.
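
The kind of check this enables can be sketched in a few lines of Python. This is only an illustration, not the MLC++ implementation: the function names, the ``?'' marker for missing values, and the threshold of three distinct values are all assumptions made for the example.

```python
def distinct_value_counts(rows, missing="?"):
    """Count distinct non-missing values per column.

    `rows` is a list of equal-length lists of strings (one list per
    instance). The "?" missing-value marker is an assumption.
    """
    rows = list(rows)
    if not rows:
        return []
    seen = [set() for _ in rows[0]]
    for row in rows:
        for index, value in enumerate(row):
            if value != missing:
                seen[index].add(value)
    return [len(values) for values in seen]


def suspicious_continuous(counts, types, threshold=3):
    """Indices of attributes declared continuous that have at most
    `threshold` distinct values (hypothetical helper; the threshold
    is an arbitrary choice for illustration)."""
    return [
        index
        for index, (count, kind) in enumerate(zip(counts, types))
        if kind == "continuous" and count <= threshold
    ]


# Tiny made-up dataset: two attribute columns plus one with a missing value.
rows = [["1", "a", "?"], ["2", "a", "x"], ["1", "b", "x"]]
counts = distinct_value_counts(rows)
flagged = suspicious_continuous([3, 17, 4], ["continuous", "continuous", "nominal"])
```

With counts like those shown for labor-neg later in this section, such a check would flag attribute #0 (duration), which is declared continuous but has only 3 distinct values.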

Declaring attributes with only two values as nominal is generally suggested to gain a speedup. For example, the running time for C4.5 (excluding MLC++ overhead) on the StatLog DNA dataset [] is 14 seconds on an SGI Indy if the attributes are declared continuous and 4.7 seconds if they are declared nominal, roughly a threefold speedup. Minor accuracy differences may result because the two attribute types are handled in slightly different ways.

[The ``info'' utility]

To get information about the attributes in the datafile ``labor-neg'', one can type:

   setenv DATAFILE labor-neg
   setenv SHOW_ATTR_INFO yes
   info
The output is:
   Data + Test == All
   Number of instances in labor-neg.all = 57
      Duplicate or conflicting instances : 0
   Number of instances in labor-neg.data = 40
      Duplicate or conflicting instances : 0
   Number of instances in labor-neg.test = 17
      Duplicate or conflicting instances : 0
   Class probabilities for labor-neg.all file
   Probability for the label 'good' : 64.91%
   Probability for the label 'bad' : 35.09%
   Majority accuracy: 64.91% on value good
   Number of attributes = 16 (continuous : 8 nominal : 8)
   Information about .all file : 
      3 distinct values for attribute #0 (duration) continuous
     17 distinct values for attribute #1 (wage increase first year) continuous
     15 distinct values for attribute #2 (wage increase second year) continuous
      9 distinct values for attribute #3 (wage increase third year) continuous
      4 distinct values for attribute #4 (cost of living adjustment) nominal
      8 distinct values for attribute #5 (working hours) continuous
      4 distinct values for attribute #6 (pension) nominal
      7 distinct values for attribute #7 (standby pay) continuous
     10 distinct values for attribute #8 (shift differential) continuous
      3 distinct values for attribute #9 (education allowance) nominal
      6 distinct values for attribute #10 (statutory holidays) continuous
      4 distinct values for attribute #11 (vacation) nominal
      3 distinct values for attribute #12 (longterm disability assistance) nominal
      4 distinct values for attribute #13 (contribution to dental plan) nominal
      3 distinct values for attribute #14 (bereavement assistance) nominal
      4 distinct values for attribute #15 (contribution to health plan) nominal
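
The class probabilities and the majority accuracy in the output above follow directly from the label counts. The Python sketch below reproduces the arithmetic; it is not MLC++ code, the function name is hypothetical, and the counts of 37 ``good'' and 20 ``bad'' instances are inferred from the reported percentages and the total of 57, not read from the data file.

```python
from collections import Counter


def class_summary(labels):
    """Return (probabilities, majority_label, majority_accuracy).

    Majority accuracy is the accuracy of always predicting the most
    frequent label, i.e. that label's relative frequency.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    probabilities = {label: count / total for label, count in counts.items()}
    majority_label, majority_count = counts.most_common(1)[0]
    return probabilities, majority_label, majority_count / total


# Assumed counts: 37 + 20 = 57 instances, inferred from the output above.
labels = ["good"] * 37 + ["bad"] * 20
probabilities, majority_label, majority_accuracy = class_summary(labels)

for label, probability in probabilities.items():
    print(f"Probability for the label '{label}' : {probability:.2%}")
print(f"Majority accuracy: {majority_accuracy:.2%} on value {majority_label}")
```

The majority accuracy is a useful baseline: an induced classifier for this dataset should do better than 64.91% to be of any interest.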



Ronny Kohavi
Sun Oct 6 23:17:50 PDT 1996