Computer Science 5751
Machine Learning

Programming Assignment 3
ID3 (85 points)
Due Thursday, March 8, 2001

Introduction

In this lab you will implement the ID3 decision tree algorithm (see the textbook and class notes for more details on this algorithm). Your team assignments can be found here.

The Provided Code and ID3

The code you will be completing is in the archive sid3.tar.Z. To use this code, download the archive to your account and then unpack it as follows:

  uncompress sid3.tar.Z
  tar xvf sid3.tar.Z

This will create the directory student_id3 and a set of files in that directory. The directory includes a makefile that compiles the provided files and produces two programs, train and classify. To call these programs correctly, use the scripts contained in the files id3_create, id3_predict, and id3_nfold. id3_create constructs an ID3 decision tree. To run it, type:

  id3_create NAMESFILE DATAFILE MODELFILE

where NAMESFILE and DATAFILE are the .names and .data files of the data set for which you want to construct a decision tree. Some samples from the book are included (such as table3.2.names and table3.2.data). MODELFILE should be the name of the file where you want to save the decision tree you construct. A sample run of id3_create should produce something like the following:

% id3_create table3.2.names table3.2.data table3.2.model

Tree for class 0
     Outlook
       =Rain
          Wind
            =Weak
               [0.000000,1.000000]
            =Strong
               [1.000000,0.000000]
       =Overcast
          [0.000000,1.000000]
       =Sunny
          Humidity
            =Normal
               [0.000000,1.000000]
            =High
               [1.000000,0.000000]




Confusion matrix for training set:
               Actual Class
                  0   1 
               --------
Predicted   0 |   5   0 
Class       1 |   0   9 

  Accuracy = 100.00%



Predicted Output Vectors:
  Example_0: Actual=[1.000,0.000]   Predicted=[1.000,0.000]
  Example_1: Actual=[1.000,0.000]   Predicted=[1.000,0.000]
  Example_2: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_3: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_4: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_5: Actual=[1.000,0.000]   Predicted=[1.000,0.000]
  Example_6: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_7: Actual=[1.000,0.000]   Predicted=[1.000,0.000]
  Example_8: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_9: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_10: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_11: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_12: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_13: Actual=[1.000,0.000]   Predicted=[1.000,0.000]

Note that the decision tree is now saved in the file table3.2.model.

id3_predict reads in a decision tree that was created by running id3_create and uses it to classify a set of data points. To run it, type:

  id3_predict MODELFILE NAMESFILE DATAFILE

The file MODELFILE should already exist when you run this program. An example of the result this produces is:

% id3_predict table3.2.model table3.2.names alt3.2.data


Tree for class 0
     Outlook
       =Rain
          Wind
            =Weak
               [0.000000,1.000000]
            =Strong
               [1.000000,0.000000]
       =Overcast
          [0.000000,1.000000]
       =Sunny
          Humidity
            =Normal
               [0.000000,1.000000]
            =High
               [1.000000,0.000000]




Confusion matrix for test set:
               Actual Class
                  0   1 
               --------
Predicted   0 |   5   1 
Class       1 |   0  10 

  Accuracy =  93.75%



Predicted Output Vectors:
  Example_0: Actual=[1.000,0.000]   Predicted=[1.000,0.000]
  Example_1: Actual=[1.000,0.000]   Predicted=[1.000,0.000]
  Example_2: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_3: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_4: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_5: Actual=[1.000,0.000]   Predicted=[1.000,0.000]
  Example_6: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_7: Actual=[1.000,0.000]   Predicted=[1.000,0.000]
  Example_8: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_9: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_10: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_11: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_12: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_13: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_14: Actual=[1.000,0.000]   Predicted=[1.000,0.000]
  Example_15: Actual=[0.000,1.000]   Predicted=[1.000,0.000]

You also have a script called id3_nfold that lets you run 10-fold cross-validation tests (save this until you are certain all aspects of your code are working). To call this script, type:

  id3_nfold NAMESFILE DATAFILE
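Conceptually, an n-fold cross-validation test partitions the examples into n folds, then repeatedly trains on n-1 of the folds and tests on the held-out one. As a rough illustration (the provided script's actual splitting strategy may differ), one simple way to assign examples to folds is:

```cpp
// Sketch of one way to assign examples to folds for n-fold cross validation:
// example i goes to fold i % n. This is illustrative only; the id3_nfold
// script may partition the data differently.
#include <vector>

std::vector<std::vector<int> > make_folds(int num_examples, int n) {
    std::vector<std::vector<int> > folds(n);
    for (int i = 0; i < num_examples; ++i)
        folds[i % n].push_back(i);  // each fold holds the indices of its test examples
    return folds;
}
```

Each fold is then used once as the test set, and the accuracies are averaged over the n runs.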

Completing the Code

You can read more about the code you will be completing here. Assume that there is just one output class in your problems (and therefore a model is a single decision tree). You may assume that there are no unknown data values, but you should allow for the possibility of continuous features (use the approach discussed in the book for selecting split points for such features).
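To make the split-selection step concrete, here is a sketch (not part of the provided code) of the quantities ID3 uses to choose a feature: entropy, information gain, and candidate thresholds for continuous features. The function names are illustrative, not dictated by the skeleton.

```cpp
// Illustrative helpers for ID3's split selection.
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Entropy of a set, given the number of examples in each class.
double entropy(const std::vector<int>& counts) {
    int total = 0;
    for (size_t i = 0; i < counts.size(); ++i) total += counts[i];
    double h = 0.0;
    for (size_t i = 0; i < counts.size(); ++i) {
        if (counts[i] == 0) continue;
        double p = static_cast<double>(counts[i]) / total;
        h -= p * std::log(p) / std::log(2.0);
    }
    return h;
}

// Information gain of a split: parent entropy minus the weighted
// entropy of each child (one child per feature value).
double info_gain(const std::vector<int>& parent,
                 const std::vector<std::vector<int> >& children) {
    int total = 0;
    for (size_t i = 0; i < parent.size(); ++i) total += parent[i];
    double g = entropy(parent);
    for (size_t c = 0; c < children.size(); ++c) {
        int n = 0;
        for (size_t i = 0; i < children[c].size(); ++i) n += children[c][i];
        g -= static_cast<double>(n) / total * entropy(children[c]);
    }
    return g;
}

// For a continuous feature: sort the (value, class) pairs and take the
// midpoint between adjacent examples of different classes as a candidate
// threshold -- the approach described in the book.
std::vector<double> candidate_thresholds(std::vector<std::pair<double, int> > vals) {
    std::sort(vals.begin(), vals.end());
    std::vector<double> t;
    for (size_t i = 1; i < vals.size(); ++i)
        if (vals[i - 1].second != vals[i].second)
            t.push_back((vals[i - 1].first + vals[i].first) / 2.0);
    return t;
}
```

For the 14 examples of table 3.2 (9 of one class, 5 of the other), the entropy is about 0.940, and splitting on Outlook (child counts [2,3], [4,0], and [3,2]) gives a gain of about 0.247 — the values worked out in the book.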

To complete the provided code you will need to implement four functions (learn, classify, read, and write) and the constructor and destructor for the class ID3. The skeletons of these functions can be found in the file id3.C. You will also need to add some fields (representing the decision tree) to the class ID3 in the file id3.h. A short description of what each function is supposed to do can be found in id3.C. Hari will give you more details on these functions and answer your questions during lab.
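One possible shape for the fields you add is a recursive node structure, with classify walking the tree from the root. This is only a sketch under assumed names (Node, dist, kids, and so on are illustrative, not required by id3.h); your representation may differ.

```cpp
// A possible decision-tree representation and a classify sketch.
// All names here are illustrative; id3.h/id3.C dictate the real interfaces.
#include <map>
#include <string>
#include <vector>

struct Node {
    bool leaf;                          // is this node a leaf?
    std::vector<double> dist;           // class distribution at a leaf, e.g. [1.0, 0.0]
    int feature;                        // index of the feature tested at an internal node
    std::map<std::string, Node*> kids;  // one subtree per feature value
    Node() : leaf(true), feature(-1) {}
};

// classify walks from the root, following the branch matching the example's
// value for the tested feature, until it reaches a leaf.
std::vector<double> classify(const Node* n, const std::vector<std::string>& example) {
    while (!n->leaf) {
        std::map<std::string, Node*>::const_iterator it =
            n->kids.find(example[n->feature]);
        if (it == n->kids.end()) break;  // a full version must handle unseen values
        n = it->second;
    }
    return n->dist;
}
```

learn would recursively build such nodes (stopping when a subset is pure or no features remain), and read/write would serialize this structure to and from MODELFILE.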

What to Turn In

Print out a copy of your team's version of the files id3.h and id3.C (plus any extra files you add). Hand in printouts demonstrating how your program works by running id3_create and id3_predict as shown above. You should also run id3_create on all of your own data sets (if they contain unknown values, replace them). Next, show the resulting decision trees produced on three test data sets from ~rmaclin/public/datasets: breast-cancer-wisconsin, credit-g, and promoters-936. Then run id3_nfold at least twice on each of these three data sets. For each of these experiments, save your results and print them out.

Next, your team should write a short report (at least one page, no more than three) discussing your design decisions in implementing the ID3 code and how your version of the code works.

Finally, each member of your team should write a short report (at least half a page and no more than one page) discussing what you contributed to the effort and how the overall team approach worked for you.