The info utility provides basic statistical information about a dataset. It reports the number of instances in the ``.data'' file, the ``.test'' file, and the ``.all'' file (the ``.all'' file is optional and should contain the contents of both the ``.data'' and ``.test'' files, for use in AccEst).
It also reports the class probabilities, the number of attributes, and their types (continuous or nominal). If the option SHOW_ATTR_INFO is set to yes, the number of distinct values for each attribute is given as well. This may help pinpoint attributes that are declared inappropriately, or continuous attributes that simply have very few values.
Declaring attributes that have only two values as nominal is generally recommended because it reduces running time. For example, the running time for C4.5 (excluding MLC++ overhead) on the StatLog DNA dataset [] is 14 seconds on an SGI Indy if the attributes are declared continuous and 4.7 seconds if they are declared nominal. Minor accuracy differences may result because the two attribute types are handled in slightly different ways.
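The kind of check described above can be approximated outside MLC++. The following is a minimal Python sketch, not MLC++ code: the function name `distinct_value_report`, the list-of-dicts data layout, and the `max_values` threshold are all assumptions made for illustration. It counts the distinct values of each attribute and flags attributes declared continuous that take two or fewer values.

```python
def distinct_value_report(rows, declared_types, max_values=2):
    """Count distinct values per attribute and flag suspicious declarations.

    rows: list of dicts mapping attribute name -> value (None means unknown).
    declared_types: dict mapping attribute name -> "continuous" or "nominal".
    Returns {name: (distinct_count, declared_type, flagged)}.
    """
    report = {}
    for name, kind in declared_types.items():
        # Unknown values are ignored when counting distinct values.
        values = {row[name] for row in rows if row.get(name) is not None}
        # Flag continuous attributes with very few distinct values.
        flagged = kind == "continuous" and len(values) <= max_values
        report[name] = (len(values), kind, flagged)
    return report


# Hypothetical toy data, not taken from any real dataset.
rows = [
    {"a": 0.0, "b": 1.5, "c": "x"},
    {"a": 1.0, "b": 2.5, "c": "y"},
    {"a": 0.0, "b": 3.5, "c": "x"},
    {"a": 1.0, "b": None, "c": "z"},
]
declared = {"a": "continuous", "b": "continuous", "c": "nominal"}
report = distinct_value_report(rows, declared)
for name, (count, kind, flagged) in report.items():
    note = "  <- consider declaring nominal" if flagged else ""
    print(f"{count} distinct values for attribute {name} {kind}{note}")
```

In this toy data only attribute `a` is flagged, since it is declared continuous but takes just two values.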
[The ``info'' utility]
To get information about the attributes in the datafile ``labor-neg'' one can type:
setenv DATAFILE labor-neg
setenv SHOW_ATTR_INFO yes
info
The output is:
Data + Test == All
Number of instances in labor-neg.all = 57
Duplicate or conflicting instances : 0
Number of instances in labor-neg.data = 40
Duplicate or conflicting instances : 0
Number of instances in labor-neg.test = 17
Duplicate or conflicting instances : 0
Class probabilities for labor-neg.all file
Probability for the label 'good' : 64.91%
Probability for the label 'bad' : 35.09%
Majority accuracy: 64.91% on value good
Number of attributes = 16 (continuous : 8 nominal : 8)
Information about .all file :
3 distinct values for attribute #0 (duration) continuous
17 distinct values for attribute #1 (wage increase first year) continuous
15 distinct values for attribute #2 (wage increase second year) continuous
9 distinct values for attribute #3 (wage increase third year) continuous
4 distinct values for attribute #4 (cost of living adjustment) nominal
8 distinct values for attribute #5 (working hours) continuous
4 distinct values for attribute #6 (pension) nominal
7 distinct values for attribute #7 (standby pay) continuous
10 distinct values for attribute #8 (shift differential) continuous
3 distinct values for attribute #9 (education allowance) nominal
6 distinct values for attribute #10 (statutory holidays) continuous
4 distinct values for attribute #11 (vacation) nominal
3 distinct values for attribute #12 (longterm disability assistance) nominal
4 distinct values for attribute #13 (contribution to dental plan) nominal
3 distinct values for attribute #14 (bereavement assistance) nominal
4 distinct values for attribute #15 (contribution to health plan) nominal
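The class probabilities and majority accuracy in the listing above are relative label frequencies. The sketch below (Python, not MLC++ code) reproduces those figures. The label counts of 37 'good' and 20 'bad' are an assumption inferred from the reported percentages and the 57 instances, and the function name `class_summary` is hypothetical.

```python
from collections import Counter


def class_summary(labels):
    """Return ({label: percent}, majority_label, majority_accuracy_percent)."""
    counts = Counter(labels)
    total = len(labels)
    probs = {label: 100.0 * n / total for label, n in counts.items()}
    # Majority accuracy is the accuracy of always predicting the most
    # frequent label.
    majority_label, majority_count = counts.most_common(1)[0]
    return probs, majority_label, 100.0 * majority_count / total


# Assumed counts (37 + 20 = 57), inferred from the percentages above.
labels = ["good"] * 37 + ["bad"] * 20
probs, majority, accuracy = class_summary(labels)
for label, p in probs.items():
    print(f"Probability for the label '{label}' : {p:.2f}%")
print(f"Majority accuracy: {accuracy:.2f}% on value {majority}")
```

Under these assumed counts, 37/57 rounds to 64.91% and 20/57 to 35.09%, matching the reported values.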