Computer Science 8751
Machine Learning

Programming Assignment 1
Using WEKA and Creating your own Data Set (30 points)
Due Wednesday, February 25, 2009

Introduction

For this class we will be using the WEKA machine learning software, which is implemented in Java. You can get WEKA from http://www.cs.waikato.ac.nz/ml/weka. Download the code and familiarize yourself with it. WEKA has online documentation built in. You can also find a chapter from Witten & Frank's Data Mining book that discusses the code at http://www.cs.waikato.ac.nz/ml/weka/book.html. Note that a new edition of Witten and Frank's book has come out with more extensive documentation; you may want to purchase a copy.

In addition to downloading WEKA, you will develop your own dataset. I would prefer that this dataset relate to your thesis research topic, but if you have difficulty creating an appropriate dataset, you may create one on a topic of interest to you. Note: NO cricket datasets.

To Do

  1. Pick a dataset from the UCI ML dataset repository that you want to work with. This dataset should have at least 200 examples and at least 10 features, of which at least one should be continuous and at least one nominal. You may have to convert the dataset into a format appropriate for WEKA.
  2. Create a second dataset based on your research. It should have at least 50 examples and at least 5 features, including at least one continuous and at least one nominal feature.
  3. Pick a learning method other than J48.
  4. Using both the J48 decision tree method and the method you chose, perform five three-fold cross-validation experiments on each dataset.
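
Step 1 notes that a dataset may need to be converted into a format WEKA can read. WEKA's native format is ARFF, and the following is a minimal sketch of one way to produce it from comma-separated text. The function names csv_to_arff and is_number, the relation name, and the sample rows are all made up for illustration. The sketch assumes the first row holds feature names without spaces, that "?" marks a missing value, and that any column whose known values all parse as numbers is continuous. A nominal feature coded with integers would therefore be declared numeric and would need to be corrected by hand.

```python
import csv
import io


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation="mydata"):
    """Convert CSV text (first row = feature names) to ARFF text.

    Assumption: a column is numeric when every known value parses as a
    number; otherwise it is nominal with the observed values as labels.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]

    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        # "?" is treated as a missing value and ignored when typing a column.
        known = [row[col] for row in data if row[col] != "?"]
        if known and all(is_number(v) for v in known):
            lines.append("@attribute %s numeric" % name)
        else:
            labels = sorted(set(known))
            lines.append("@attribute %s {%s}" % (name, ",".join(labels)))

    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Made-up sample rows, for illustration only.
sample = (
    "temperature,outlook,play\n"
    "85,sunny,no\n"
    "64,overcast,yes\n"
    "70,rainy,yes\n"
)
arff = csv_to_arff(sample, relation="weather")
print(arff)
```

Writing the returned string to a file with an .arff extension gives a file that can be checked by opening it in WEKA. In this sketch the last column is meant to be the class variable, so it is worth confirming which attribute WEKA treats as the class when the file is loaded.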

To Hand In

  1. Write up a description of your dataset and include a printout of its data points. Your description should include a discussion of each feature, a discussion of the class variable, and some expectations regarding the class models likely to be learned from this dataset.
  2. Using confusion matrices and appropriate bar graphs, present a discussion of the accuracy of the two learning algorithms on each of the datasets you used for testing. Also include documentation of your results.
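
As a reminder of how a confusion matrix relates to accuracy, here is a small sketch that tallies one from lists of actual and predicted class labels. The function names and the label lists are hypothetical and exist only for illustration; they are not results from any experiment. The sketch uses the convention that rows are actual classes and columns are predicted classes.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return a matrix where entry [i][j] counts examples whose actual
    class is labels[i] and whose predicted class is labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Fraction of examples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0


# Made-up labels, for illustration only (not experimental results).
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "no", "yes", "yes"]
labels = ["yes", "no"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
print("accuracy = %.3f" % accuracy(matrix))
```

With five cross-validation experiments per method and dataset, one option is to compute the accuracy for each experiment in this way and plot the per-method averages as the bars in your graphs.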