Computer Science 8751
Machine Learning

Programming Assignment 1
Using WEKA and Creating your own Data Set (20 points)
Due Thursday, September 29, 2005


For this class we will be using the WEKA machine learning code, which is implemented in Java. To get WEKA, go to the WEKA webpage; you should download the code and familiarize yourself with it. WEKA has online documentation built into it. You can also find a copy of a chapter from Witten & Frank's Data Mining book that discusses the code (note that a new edition of Witten and Frank's book has come out with more extensive documentation; you may want to purchase a copy).

In addition to downloading WEKA you will also be developing your own dataset. I would prefer that this dataset relate to your thesis research topic, but if you are having difficulty creating an appropriate dataset you may create one of interest to you.

To Do

  1. Pick a dataset from the UCI ML dataset repository that you want to work with. This dataset should have at least 200 examples and at least 10 features, of which at least one is continuous and at least one is nominal. You may have to convert the dataset into a format appropriate for WEKA.
  2. Create a second dataset based on your research. It should have at least 50 examples, with at least 5 features, with at least one continuous and one nominal feature.
  3. Pick a learning algorithm (other than J48 -- preferably not a tree algorithm as J48 is a tree algorithm) to use for learning in WEKA.
  4. Perform ten 10-fold cross-validation experiments on each dataset using the J48 decision tree method from WEKA and the other learning algorithm you chose.
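
As an illustration of the format conversion mentioned in step 1, here is a minimal sketch in Python that writes a small table in ARFF, the text format WEKA reads. The function name to_arff, the relation name, and the example attributes are all hypothetical; the sketch assumes that values contain no spaces or commas (such values would need quoting) and that there are no missing values.

```python
def to_arff(relation, attributes, rows):
    """Return the ARFF text for a small dataset.

    attributes is a list of (name, spec) pairs, where spec is either the
    string "numeric" (continuous feature) or a list of allowed values
    (nominal feature). rows is a list of tuples, one per example.
    """
    lines = ["@relation " + relation, ""]
    for name, spec in attributes:
        if spec == "numeric":
            lines.append("@attribute " + name + " numeric")
        else:
            # Nominal attributes list their values inside braces.
            lines.append("@attribute " + name + " {" + ",".join(spec) + "}")
    lines.append("")
    lines.append("@data")
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical example: one continuous feature, one nominal class.
example = to_arff(
    "demo",
    [("temperature", "numeric"), ("play", ["yes", "no"])],
    [(20.5, "yes"), (18, "no")],
)
print(example)
```

The last attribute is conventionally used as the class variable, so it is simplest to put the class column last.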

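To make step 4 concrete, the following Python sketch shows what "ten 10-fold cross-validation experiments" means in terms of train/test splits: the data is shuffled with a different seed for each of the ten runs, and within a run each of the ten folds is held out once. WEKA carries out the splitting itself, so this is only an illustration; the function names are hypothetical, and the sketch does not stratify the folds by class.

```python
import random


def kfold_indices(n_examples, k, seed):
    """Shuffle the example indices and deal them into k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]


def repeated_cv_splits(n_examples, k=10, repeats=10):
    """Yield (run, train_indices, test_indices) for repeated k-fold CV."""
    for run in range(repeats):
        # A different seed per run gives a different partition each time.
        folds = kfold_indices(n_examples, k, seed=run)
        for test_fold in folds:
            held_out = set(test_fold)
            train = [i for i in range(n_examples) if i not in held_out]
            yield run, train, test_fold


# With 50 examples: 10 runs x 10 folds = 100 train/test splits,
# each training on 45 examples and testing on 5.
splits = list(repeated_cv_splits(50))
print(len(splits))
```

Each learning algorithm is trained and tested once per split, and the per-run accuracies are what you compare between J48 and your second algorithm.
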
To Hand In

  1. Write up a description of your dataset and include a printout of your data points for this dataset. Your description should include a discussion of each feature, a discussion of the class variable, and some expectations regarding likely class models that will be learned from this dataset.
  2. Using confusion matrices and appropriate bar graphs, present a discussion of the accuracy of the two learning algorithms for each of the datasets you used for testing. Also include documentation of your results.
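
For reference, here is a small Python sketch of how a confusion matrix relates to accuracy. It assumes the convention that rows are actual classes and columns are predicted classes; the function names and the example labels are hypothetical and are not taken from any dataset.

```python
def confusion_matrix(actual, predicted, labels):
    """Count examples by (actual class, predicted class)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        # Row = actual class, column = predicted class.
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of examples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical predictions for five examples.
actual = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "no", "yes"]
matrix = confusion_matrix(actual, predicted, ["yes", "no"])
print(matrix)            # [[2, 1], [0, 2]]
print(accuracy(matrix))  # 0.8
```

The off-diagonal entries show which classes are confused with which, which is often more informative in your discussion than the accuracy alone.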