Computer Science 8751
Machine Learning

Programming Assignment 2
Decision Trees (55 points)
Due Monday, March 3, 2008

Introduction

Decision trees are one of the simplest machine learning methods, yet also one of the most effective. In this assignment you are to implement a variation of the ID3 decision tree method using gain ratio (see Details) as your criterion for selecting the best feature. You should test your method on the data used in class (the coolcars dataset, your own personal dataset, the promoters-936 dataset, the house-votes-84 dataset, and the letter dataset).

Details

Your method should make use of the dataset class you created in Assignment 1. Note that you must allow for both missing and continuous feature values.

You should implement a variation of ID3 where the criterion for splitting is GainRatio rather than information gain.
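As a rough guide, the computation might look something like the following Python sketch (the function names and the use of plain label lists are one possible design, not a requirement; note this version ignores the fractional example weights discussed below, and extending it to weighted examples is part of the assignment):

    import math
    from collections import Counter

    def entropy(labels):
        # H(S) = -sum over classes c of p(c) * log2(p(c))
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def gain_ratio(parent_labels, child_label_lists):
        # Information gain: parent entropy minus the weighted average
        # entropy of the children created by the split.
        total = len(parent_labels)
        gain = entropy(parent_labels) - sum(
            (len(child) / total) * entropy(child)
            for child in child_label_lists if child)
        # Split information: the entropy of the branch sizes themselves;
        # dividing by it penalizes splits with many small branches.
        split_info = -sum((len(child) / total) * math.log2(len(child) / total)
                          for child in child_label_lists if child)
        return gain / split_info if split_info > 0 else 0.0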

To use continuous features you should implement the splitting approach discussed in class (adding decision choices by sorting the feature's values and choosing a value to divide the data based on how each candidate split scores).
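Building on the gain_ratio sketch above, candidate thresholds might be scored like this (again a sketch, not required code; midpoints between adjacent distinct values are one common choice of candidates):

    def best_threshold(values, labels):
        # Candidate thresholds are midpoints between adjacent distinct
        # sorted values; each candidate is scored as a two-way
        # (<= threshold / > threshold) split using gain_ratio above.
        pairs = sorted(zip(values, labels), key=lambda p: p[0])
        all_labels = [lab for _, lab in pairs]
        best_score, best_thresh = 0.0, None
        for i in range(1, len(pairs)):
            if pairs[i - 1][0] == pairs[i][0]:
                continue  # no decision boundary between equal values
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            left = [lab for val, lab in pairs if val <= threshold]
            right = [lab for val, lab in pairs if val > threshold]
            score = gain_ratio(all_labels, [left, right])
            if score > best_score:
                best_score, best_thresh = score, threshold
        return best_thresh, best_score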

For unknown feature values, an example encountered during training should be divided across the branches in proportion to how the examples of the same class are distributed among the known values. For example, if at a node we have 2 red positive (+) examples, 3 blue +, 5 green +, 5 red negative (-), and 3 blue - examples, plus one + example that has no color information, then we should count that + example as 0.2 of a red example, 0.3 of a blue example, and 0.5 of a green example.
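A sketch of how this weighting might be computed (the dict-based example representation and the 'class'/'weight' keys are assumptions for illustration; your dataset class from Assignment 1 will look different):

    from collections import defaultdict

    def split_with_unknowns(examples, feature):
        # Each example is assumed to be a dict with a 'class' key, a
        # 'weight' key (1.0 for a whole example), and one key per feature;
        # a value of None means the feature value is missing.
        known = defaultdict(float)        # (feature value, class) -> weight
        class_total = defaultdict(float)  # class -> weight of known examples
        for ex in examples:
            if ex[feature] is not None:
                known[(ex[feature], ex['class'])] += ex['weight']
                class_total[ex['class']] += ex['weight']
        values = {v for (v, _) in known}
        branches = defaultdict(list)
        for ex in examples:
            if ex[feature] is not None:
                branches[ex[feature]].append(ex)
                continue
            if class_total[ex['class']] == 0:
                continue  # no same-class examples with a known value
            # Send fractional copies down each branch, in proportion to
            # the class distribution over the known values (the 0.2 red /
            # 0.3 blue / 0.5 green split in the example above).
            for v in values:
                frac = known[(v, ex['class'])] / class_total[ex['class']]
                if frac > 0:
                    copy = dict(ex)
                    copy[feature] = v
                    copy['weight'] = ex['weight'] * frac
                    branches[v].append(copy)
        return branches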

During testing, an example with an unknown value for the feature being tested should be sent down every branch of that node, and the resulting class values used to vote on a prediction.
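For instance (a sketch; the node attributes is_leaf, prediction, feature, and children are assumptions about your tree representation):

    from collections import defaultdict

    def classify(node, example):
        # Returns a dict of class -> vote weight.
        if node.is_leaf:
            return {node.prediction: 1.0}
        value = example[node.feature]
        if value is None:
            # Unknown value: follow every branch and pool the votes.
            votes = defaultdict(float)
            for child in node.children.values():
                for cls, w in classify(child, example).items():
                    votes[cls] += w
            return dict(votes)
        return classify(node.children[value], example)

The final prediction is then the class with the largest total vote, e.g. max(votes, key=votes.get).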

Your code should produce a clearly formatted, human-readable printout of the decision tree.
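Something as simple as a recursive, indented printout is fine, along these lines (using the same assumed node attributes as above):

    def print_tree(node, indent=""):
        # One line per branch; children are indented two extra spaces.
        if node.is_leaf:
            print(indent + "-> " + str(node.prediction))
            return
        for value, child in node.children.items():
            print(indent + str(node.feature) + " = " + str(value) + ":")
            print_tree(child, indent + "  ")

which would produce output such as:

    color = red:
      size = big:
        -> +
      size = small:
        -> -
    color = blue:
      -> +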

What to Hand In

You should hand in a documented copy of your code (including your dataset class files).

In addition, hand in the decision trees produced for your dataset, the coolcars dataset, the promoters-936 dataset, the house-votes-84 dataset, and the letter dataset.

You must also submit your code electronically. To do this, create a tar file of all of your code and submit it to the class webdrop: go to https://webdrop.d.umn.edu/, log in, and pick the webdrop for 8751.

Extra Credit

5 points - implement a post-pruning process that, for each branch, considers both removing the branch (replacing it with a leaf) and replacing the branch with one of its children.
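One possible shape for this is a reduced-error-style pass over the tree using a held-out validation set (a rough sketch only: the score function and the majority_class attribute are hypothetical pieces you would supply, and the node.__dict__ manipulation assumes simple attribute-based node objects like those sketched earlier):

    def prune(node, root, validation, score):
        # score(root, validation) is assumed to return the tree's
        # accuracy on a held-out validation set.
        if node.is_leaf:
            return
        for child in list(node.children.values()):
            prune(child, root, validation, score)  # prune bottom-up
        saved = node.__dict__.copy()
        best = (score(root, validation), saved)    # keep-as-is baseline
        # Candidate 1: remove the branch, i.e. collapse it to a leaf
        # predicting the majority class of the training examples that
        # reached it (assumed to be stored during training).
        node.is_leaf, node.prediction = True, node.majority_class
        best = max(best, (score(root, validation), node.__dict__.copy()),
                   key=lambda c: c[0])
        # Candidates 2..n: replace the branch with one of its children.
        for child in saved['children'].values():
            node.__dict__.clear()
            node.__dict__.update(child.__dict__)
            best = max(best, (score(root, validation), node.__dict__.copy()),
                       key=lambda c: c[0])
        # Keep whichever version scored best on the validation set
        # (ties favor the earlier, less aggressive option).
        node.__dict__.clear()
        node.__dict__.update(best[1])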