One important aspect of understanding Machine Learning is the critical role that data plays in ML. In this assignment I want you to familiarize yourself with one of the common formats (C4.5) for the standard data sets that can be found at the UCI Machine Learning repository. The C4.5 format allows researchers to create data sets of interest and deposit them at the UCI repository so that other researchers can use the data in their learning programs and compare their results.
In this assignment you will build two small data sets of your own and run a simple decision tree program I will provide for you on these data sets. More details on the C4.5 format are given below and then I will discuss the characteristics I want to see in YOUR data sets.
I have ported a number of the data sets from the UCI repository to my home machine so that we can easily access these data sets without having to ftp them. The files for the data sets can be found in the directory:
Each data set in the C4.5 format consists of two files, one ending in an extension .names which gives the feature names, possible feature values, and classification values for a data set and a second file ending in .data that lists the actual data points in a data set. For example, one data set is the labor data set which is stored in the files labor.names and labor.data. labor.names looks like this:
good, bad. | Classes

duration: continuous
wage increase first year: continuous
wage increase second year: continuous
wage increase third year: continuous
cost of living adjustment: none,tcf,tc
working hours: continuous
pension: none,ret_allw,empl_contr
standby pay: continuous
shift differential: continuous
education allowance: yes,no
statutory holidays: continuous
vacation: belowaverage,average,generous
longterm disability assistance: yes,no
contribution to dental plan: none,half,full
bereavement assistance: yes,no
contribution to health plan: none,half,full
The first line of any .names file lists the classes into which the data points are divided. In labor.names, the first line indicates that points are labeled good or bad (good or bad is their classification). The first line has the format:
Name1, Name2, ..., NameN.
Each of the names is a class that a point can be labeled with (and is the focus of our learning in an inductive learning system). In the labor data set, data points are labeled "good" or "bad". In the labor.names file a comment is added at the end of the first line (the characters " | Classes") which is ignored by the parser. Following the first line is a blank line and then a list of the feature names and the possible values of those features.
A feature name is simply any string of characters ending in ":". Some of the feature names in labor.names are "duration", "cost of living adjustment", and "contribution to dental plan". Following the ":" is one of two things, either the single word "continuous" or a list of names separated by commas. If the single word "continuous" appears then this feature is assumed to have values that are real numbers (this includes features with integer values). Examples of such features in labor.names include "duration", "wage increase first year", "wage increase second year", etc. On the other hand, if a list of names appears after the ":" then this list is assumed to indicate all of the different possible "discrete" values the feature may take on. For example, in labor.names, "cost of living adjustment" has "none,tcf,tc" following it. This means that the possible values of this feature for each data point are "none", "tcf" or "tc". The feature "pension" has possible values of "none", "ret_allw" or "empl_contr". Such features are generally called nominal or discrete features.
The .names file for a data set describes the features, feature values and class values for each point in the data set. The .data file actually lists the data points making up that data set. The first five (out of 57) data points in the labor data set data file (labor.data) are:
1,5.0,?,?,?,40,?,?,2,?,11,average,?,?,yes,?,good
2,4.5,5.8,?,?,35,ret_allw,?,?,yes,11,belowaverage,?,full,?,full,good
?,?,?,?,?,38,empl_contr,?,5,?,11,generous,yes,half,yes,half,good
3,3.7,4.0,5.0,tc,?,?,?,?,yes,?,?,?,?,yes,?,good
3,4.5,4.5,5.0,?,40,?,?,?,?,12,average,?,half,yes,half,good
Each data point is simply a list of the values for each feature in the data set (in the order they appear in the .names file) followed by the class value for that data point, with the values separated by commas. So, for example, the first line indicates a data point with the following feature/feature value pairs:
duration = 1
wage increase first year = 5.0
wage increase second year = ?
wage increase third year = ?
cost of living adjustment = ?
working hours = 40
pension = ?
standby pay = ?
shift differential = 2
education allowance = ?
statutory holidays = 11
vacation = average
longterm disability assistance = ?
contribution to dental plan = ?
bereavement assistance = yes
contribution to health plan = ?
The final value on the line is the class this point has been labeled with (in this case "good"). Each data point must have one value for each feature. The values must be listed in the order they appear in the .names file and must be of the appropriate type (a number for continuous features or one of the possible feature values for discrete features). The only exception to this rule is that if a feature value is not known for a particular data point, a "?" may be included to indicate that the value is unknown for this data point. In the first data point, several of the feature values are unknown (including "wage increase second year", "wage increase third year", "cost of living adjustment", etc.). Some data sets (especially the labor data set) have many examples with unknown values and others have none.
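The rule above (one value per feature, in .names order, then the class, with "?" for unknowns) can be sketched as a small line parser. The function name and the choice of `None` for unknowns are my own conventions, not prescribed by C4.5.

```python
def parse_data_line(line, features):
    """features: list of (name, values) pairs taken from the .names file,
    where values is "continuous" or a list of discrete values.
    Returns (dict mapping feature name -> value, class label);
    unknown values ("?") become None."""
    parts = [p.strip() for p in line.strip().rstrip(".").split(",")]
    *values, label = parts                     # class label comes last
    assert len(values) == len(features), "one value per feature, then the class"
    point = {}
    for (name, kind), raw in zip(features, values):
        if raw == "?":
            point[name] = None                 # unknown value
        elif kind == "continuous":
            point[name] = float(raw)           # numeric feature
        else:
            point[name] = raw                  # discrete feature
    return point, label
```

Applied to the first labor.data line above, this would produce `duration = 1.0`, `vacation = "average"`, `None` for each "?", and the label "good".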
I want each of you to construct TWO data sets. Each data set should have at least five features and at least 25 data points. You may choose any type of data you are interested in but please try to avoid offensive concepts. In the first of the two data sets you should use only features with discrete values (no "continuous" features), you should not allow any unknown values and you should only have two possible class values. In the second data set you should have at least one continuous feature and may have unknown values or more than two class values if you like. For each data set you should construct two files, a DATASETNAME.names file and DATASETNAME.data file where DATASETNAME is the name you give the data set.
You should also make sure that neither data set is trivial, where we define a trivial data set as one whose points can be classified using only one feature. To check this, I have made available a working version of a program you will be implementing later, ID3. Copy the archive file cds.tar.Z to somewhere in your home directory (you should do this on one of the Computer Science department machines in HH314 such as csdev01). Then do the following:
uncompress cds.tar.Z
tar xvf cds.tar
cd check_data_set
In the directory check_data_set you will find a script named check_data which runs the program train. Run it on your first data file by typing:
check_data DATASETNAME

substituting DATASETNAME with the name (and path) of your data set. This should print out the line "Tree for class 0" followed by a representation of a decision tree. Your decision tree should have more than one layer or it is trivial (it has only one layer if there is a single feature name listed in the entire tree).
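The idea behind the triviality check can be sketched directly: a data set is trivial when some single feature already determines the class on its own. The sketch below is my own illustration of that idea, not the provided check_data script, and it ignores unknown ("?") values for simplicity.

```python
def feature_determines_class(points, labels, feature_index):
    """True if each value of the given feature maps to exactly one class."""
    seen = {}
    for point, label in zip(points, labels):
        value = point[feature_index]
        if seen.setdefault(value, label) != label:
            return False   # same feature value, two different classes
    return True

def is_trivial(points, labels):
    """True if any single feature classifies the data set by itself."""
    n_features = len(points[0])
    return any(feature_determines_class(points, labels, i)
               for i in range(n_features))
```

If `is_trivial` returns True for a data set, a decision tree can get away with a single layer, which is exactly what the check_data output would reveal.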
Print out a copy of each of the files making up your data sets and the result of running check_data on each of the data sets. Then write a short report discussing the interesting aspects of your data sets and why you chose them (and what they mean). Also, you should email the files making up your data sets to Hari.