Computer Science 8751
Machine Learning

Programming Assignment 2
Decision Trees for Discrete Features (35 points)
Due Tuesday, October 14, 2003

Introduction

In this lab you will implement a decision tree algorithm similar to the ID3 algorithm discussed in class, but using a different gain formula, described below. In this version of decision tree learning you may assume that all features are discrete, that there are no unknown values, and that no pruning needs to be performed.

You will actually be implementing two separate programs. The first program should take as command-line arguments the .names and .data files of a dataset and should create a decision tree from that data, storing the result in a file named by the third command-line argument. Your program should be run like this:

  dt_create DATASET.names DATASET.data DATASET.tree

This would take a dataset defined by the files DATASET.names and DATASET.data, learn a tree from that data, and store the resulting tree in DATASET.tree. Your code should also print a nicely formatted version of the learned tree, something like this (you may make it look more impressive if you like):

Outlook
 =Rain
    Wind
      =Weak
         Class=Yes
      =Strong
         Class=No
 =Overcast
    Class=Yes
 =Sunny
    Humidity
      =Normal
         Class=Yes
      =High
         Class=No
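
One simple way to produce output like this is a recursive node structure. Below is a minimal sketch in Python; the Node class and the function names are illustrative assumptions, not a required design. The save_preorder function shows one way to write the tree out with a pre-order traversal, as suggested in the next paragraph.

  # A minimal sketch (Python) of one possible tree representation.
  # Node, print_tree, and save_preorder are illustrative names, not
  # part of the assignment specification.

  class Node:
      def __init__(self, attribute=None, label=None):
          self.attribute = attribute  # attribute tested at an internal node
          self.label = label          # class label at a leaf (None otherwise)
          self.children = {}          # maps a feature value to a child Node

      def is_leaf(self):
          return self.label is not None

  def print_tree(node, depth=0):
      # Print the tree in an indented format like the example above.
      indent = "   " * depth
      if node.is_leaf():
          print(indent + "Class=" + node.label)
          return
      print(indent + node.attribute)
      for value, child in node.children.items():
          print(indent + " =" + value)
          print_tree(child, depth + 1)

  def save_preorder(node, out):
      # Write the tree to an open file using a pre-order traversal.
      # Recording the number of children makes the file easy to read
      # back with a matching recursive loader.
      if node.is_leaf():
          out.write("leaf %s\n" % node.label)
      else:
          out.write("node %s %d\n" % (node.attribute, len(node.children)))
          for value, child in node.children.items():
              out.write("value %s\n" % value)
              save_preorder(child, out)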

Your code will need to save a representation of the resulting tree in the third filename supplied to the code (I suggest writing out the tree using a pre-order traversal). Your second program should read in a dataset and a previously learned decision tree and determine how accurate that tree is on that dataset. This program should work like this:

  dt_predict DATASET.names DATASET.data DATASET.tree

where the DATASET.names and DATASET.data files define the set of data and DATASET.tree contains the previously learned tree (you may assume that the tree file will always have been generated by your code and will always be correct for the dataset supplied). For each example in the dataset, this program should record both the actual class and the class predicted by the supplied tree, and print the resulting totals for each possible combination (the result is called a confusion matrix). It should also print the overall accuracy, which is simply the sum of the counts on the diagonal divided by the total number of examples. For example, your output might look like this:

               Actual Class
                  0   1
               ---------
Predicted   0 |   5   1
Class       1 |   0  10

  Accuracy = 93.75%
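
The tallying itself can be quite short. Here is a minimal sketch in Python, assuming the Node representation sketched earlier, that each example is a (features, actual_class) pair with features keyed by attribute name, and that classes is a list of the class labels; all of these names and layouts are illustrative, not part of the assignment.

  # A minimal sketch of the confusion matrix and accuracy computation,
  # assuming the Node class sketched earlier. Each example is assumed
  # to be a (features, actual_class) pair, where features maps
  # attribute names to values; these layouts are illustrative only.

  def classify(node, features):
      # Follow the tree from the root to a leaf; return the leaf's class.
      while not node.is_leaf():
          node = node.children[features[node.attribute]]
      return node.label

  def confusion_and_accuracy(tree, examples, classes):
      # counts[(a, p)] = number of examples of actual class a
      # that the tree predicted as class p.
      counts = {(a, p): 0 for a in classes for p in classes}
      for features, actual in examples:
          counts[(actual, classify(tree, features))] += 1
      correct = sum(counts[(c, c)] for c in classes)
      return counts, 100.0 * correct / len(examples)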

Gain Formula

In this program we will use a different notion of gain than the Information Gain formula used in ID3. For a particular set of data S and an attribute A, the gain Gain(A,S) is defined as follows:

            #classes  #classes
               ---       ---
  G(S)    =    \         \     p(Class=X,S) * p(Class=Y,S)
               /         /
               ---       ---
               X=1       Y=X+1

            #values(A)                / #classes  #classes                                            \
               ---                    |   ---       ---                                               |
  G(A,S)  =    \      p(Value=V,S) *  |   \         \     p(Class=X|Value=V,S) * p(Class=Y|Value=V,S) |
               /                      |   /         /                                                 |
               ---                    |   ---       ---                                               |
               V=1                    \   X=1       Y=X+1                                             /

  Gain(A,S) = G(S) - G(A,S)

where p(Class=X,S) is the proportion of the points in S that have class X, p(Value=V,S) is the proportion of the points in S that have value V for attribute A, and p(Class=X|Value=V,S) is the proportion of the points in S with value V that have class X.
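
To make the formulas concrete, here is a minimal sketch in Python of computing Gain(A,S) from a list of examples; the names impurity and gain and the (features, class) data layout are illustrative assumptions, not part of the assignment.

  # A minimal sketch of the gain computation above. The names and the
  # (features, class) example layout are illustrative assumptions.

  def impurity(examples, classes):
      # G(S): sum over class pairs X < Y of p(Class=X,S) * p(Class=Y,S).
      n = len(examples)
      if n == 0:
          return 0.0
      p = {c: sum(1 for _, cls in examples if cls == c) / n
           for c in classes}
      return sum(p[x] * p[y]
                 for i, x in enumerate(classes)
                 for y in classes[i + 1:])

  def gain(examples, attribute, classes):
      # Gain(A,S) = G(S) - sum over values V of p(Value=V,S) * G(S_V),
      # where S_V is the subset of S with value V for the attribute.
      subsets = {}
      for features, cls in examples:
          subsets.setdefault(features[attribute], []).append((features, cls))
      weighted = sum(len(sub) / len(examples) * impurity(sub, classes)
                     for sub in subsets.values())
      return impurity(examples, classes) - weighted

As a sanity check, note that with two classes split evenly, G(S) = 0.5 * 0.5 = 0.25, the maximum possible for two classes, while a pure set gives G(S) = 0.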

What To Turn In

You should write up a short report (at least one page, no more than three) discussing your design decisions in implementing the decision tree code and how your version of the code works.

Comment your code and turn in hard copies of all of your code (and the report described above). You should also test your code on the 8 pairs of data files in the directory ~rmaclin/public/datasets/8751. Copy all of the .data and .names files from that directory to your code directory; then run the following commands using your code and print out all of the results:

  dt_create promoters.names promoters.train-0.data promoters.tree0
  dt_predict promoters.names promoters.test-0.data promoters.tree0
  dt_create promoters.names promoters.train-1.data promoters.tree1
  dt_predict promoters.names promoters.test-1.data promoters.tree1
  dt_create promoters.names promoters.train-2.data promoters.tree2
  dt_predict promoters.names promoters.test-2.data promoters.tree2
  dt_create promoters.names promoters.train-3.data promoters.tree3
  dt_predict promoters.names promoters.test-3.data promoters.tree3

  dt_create soybean.names soybean.train-0.data soybean.tree0
  dt_predict soybean.names soybean.test-0.data soybean.tree0
  dt_create soybean.names soybean.train-1.data soybean.tree1
  dt_predict soybean.names soybean.test-1.data soybean.tree1
  dt_create soybean.names soybean.train-2.data soybean.tree2
  dt_predict soybean.names soybean.test-2.data soybean.tree2
  dt_create soybean.names soybean.train-3.data soybean.tree3
  dt_predict soybean.names soybean.test-3.data soybean.tree3

You must also submit your code electronically. To do this, go to the link https://webapps.d.umn.edu/service/webdrop/rmaclin/cs8751-1-f2003/upload.cgi and follow the directions for uploading a file (you can do this multiple times, though it would be helpful if you would tar your files and upload a single archive).

To make your code easier to check and grade, please use the following procedure for collecting the code before uploading it: