CS 5751 (Spring 2001) Program 2

Computer Science 5751
Machine Learning

Programming Assignment 2
Version Spaces (75 points)
Due Tuesday, February 20, 2001

Introduction

In this lab you will implement the Version Space algorithm (see the textbook and class notes for more details on this algorithm) by completing a set of code provided to you. You will work in teams of three students (except for graduate students, who will complete the code individually). Team assignments will be based on a survey I will conduct after the second week of class and will be shown here.

The Provided Code and Version Spaces

The code you will be completing is in the archive svs.tar.Z. To use this code, download the archive to your account and then unpack it as follows:

  uncompress svs.tar.Z
  tar xvf svs.tar

This will create the directory student_version_space and a set of files in that directory. The directory includes a makefile that will compile the provided files and produce two programs, train and classify. To call these programs correctly you will use the scripts contained in the files vs_create and vs_predict. vs_create is used to construct a version space. To run it you type:

  vs_create NAMESFILE DATAFILE MODELFILE

where NAMESFILE and DATAFILE are the .names and .data file names of the data set for which you want to construct a version space. Some samples from the book are included (such as table2.1.names and table2.1.data). MODELFILE should be the name of the file in which you want to save the version space you construct. A sample run of vs_create should produce something like the following:

% vs_create table2.1.names table2.1.data table2.1.model
For positive example Example_0
  Sky=Sunny AirTemp=Warm Humidity=Normal Wind=Strong Water=Warm Forecast=Same 
  G: { <?,?,?,?,?,?> }
  S: { <Sunny,Warm,Normal,Strong,Warm,Same> }
For positive example Example_1
  Sky=Sunny AirTemp=Warm Humidity=High Wind=Strong Water=Warm Forecast=Same 
  G: { <?,?,?,?,?,?> }
  S: { <Sunny,Warm,?,Strong,Warm,Same> }
For negative example Example_2
  Sky=Rainy AirTemp=Cold Humidity=High Wind=Strong Water=Warm Forecast=Change 
  G: { <Sunny,?,?,?,?,?> <?,Warm,?,?,?,?> <?,?,?,?,?,Same> }
  S: { <Sunny,Warm,?,Strong,Warm,Same> }
For positive example Example_3
  Sky=Sunny AirTemp=Warm Humidity=High Wind=Strong Water=Cool Forecast=Change 
  G: { <?,Warm,?,?,?,?> <Sunny,?,?,?,?,?> }
  S: { <Sunny,Warm,?,Strong,?,?> }

Note that the resulting version space is now saved in the file table2.1.model.
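The boundary updates in the trace above can be reproduced with a short sketch. This is only an illustration under stated assumptions, not the provided code: the names Hypothesis, Example, covers, generalize and specialize are hypothetical, a hypothesis is assumed to be one string per feature with "?" accepting any value, and S is assumed to hold a single hypothesis that starts out as the first positive example. The sketch also omits the step that removes members of G that fail to cover a later positive example (as happens for Example_3 in the trace).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical representation: one string per feature, "?" accepts any value.
using Hypothesis = std::vector<std::string>;
using Example = std::vector<std::string>;

// True if hypothesis h covers example x.
bool covers(const Hypothesis& h, const Example& x) {
    assert(h.size() == x.size());
    for (std::size_t i = 0; i < h.size(); ++i) {
        if (h[i] != "?" && h[i] != x[i]) return false;
    }
    return true;
}

// Minimal generalization of s that covers the positive example x:
// every feature that disagrees with x is replaced by "?".
Hypothesis generalize(const Hypothesis& s, const Example& x) {
    assert(s.size() == x.size());
    Hypothesis out = s;
    for (std::size_t i = 0; i < out.size(); ++i) {
        if (out[i] != "?" && out[i] != x[i]) out[i] = "?";
    }
    return out;
}

// Minimal specializations of g that exclude the negative example x
// while remaining more general than the specific boundary s.
std::vector<Hypothesis> specialize(const Hypothesis& g, const Hypothesis& s,
                                   const Example& x) {
    assert(g.size() == x.size() && s.size() == x.size());
    std::vector<Hypothesis> out;
    if (!covers(g, x)) {          // g already excludes x; keep it unchanged
        out.push_back(g);
        return out;
    }
    for (std::size_t i = 0; i < g.size(); ++i) {
        // Only a feature where s commits to a value different from x can
        // be used to rule out x without dropping below s.
        if (g[i] == "?" && s[i] != "?" && s[i] != x[i]) {
            Hypothesis h = g;
            h[i] = s[i];
            out.push_back(h);
        }
    }
    return out;
}
```

Applied to the table2.1 trace, generalize turns S into <Sunny,Warm,?,Strong,Warm,Same> after Example_1, and specialize applied to <?,?,?,?,?,?> with the negative Example_2 yields the three members of G shown for that step.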

vs_predict reads in a version space that was created by running vs_create and uses it to classify a set of data points. To run it you type:

  vs_predict MODELFILE NAMESFILE DATAFILE

The file MODELFILE should already exist when you run this program. An example of the result this produces is:

Classifying patterns:
  G: { <Sunny,?,?,?,?,?> <?,Warm,?,?,?,?> }
  S: { <Sunny,Warm,?,Strong,?,?> }
  Other: { <?,Warm,?,Strong,?,?> <Sunny,?,?,Strong,?,?> <Sunny,Warm,?,?,?,?> }
For positive example Example_0
  Sky=Sunny AirTemp=Warm Humidity=Normal Wind=Strong Water=Cool Forecast=Change 
  Prediction: positive
For negative example Example_1
  Sky=Rainy AirTemp=Cold Humidity=Normal Wind=Light Water=Warm Forecast=Same 
  Prediction: negative
For positive example Example_2
  Sky=Sunny AirTemp=Warm Humidity=Normal Wind=Light Water=Warm Forecast=Same 
  Prediction: negative (0.500000)
For negative example Example_3
  Sky=Sunny AirTemp=Cold Humidity=Normal Wind=Strong Water=Warm Forecast=Same 
  Prediction: negative (0.666667)


Confusion matrix for test set:
               Actual Class
                  0   1 
               --------
Predicted   0 |   2   1 
Class       1 |   0   1 

  Accuracy =  75.00%



Predicted Output Vectors:
  Example_0: Actual=[0.000,1.000]   Predicted=[0.000,1.000]
  Example_1: Actual=[1.000,0.000]   Predicted=[1.000,0.000]
  Example_2: Actual=[0.000,1.000]   Predicted=[0.500,0.500]
  Example_3: Actual=[1.000,0.000]   Predicted=[0.667,0.333]
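The fractions in the sample output are consistent with a vote over the six hypotheses listed under G, S and Other: three of the six cover Example_2 (0.5 for each class) and four of the six reject Example_3 (0.667 negative). The sketch below shows such a vote. It is an inference from the sample output rather than a description of the provided code; the names Hypothesis, Example, covers, positiveFraction and predictPositive are hypothetical, and sending ties to the negative class is an assumption based on the Example_2 prediction.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical representation: one string per feature, "?" accepts any value.
using Hypothesis = std::vector<std::string>;
using Example = std::vector<std::string>;

// True if hypothesis h covers example x.
bool covers(const Hypothesis& h, const Example& x) {
    assert(h.size() == x.size());
    for (std::size_t i = 0; i < h.size(); ++i) {
        if (h[i] != "?" && h[i] != x[i]) return false;
    }
    return true;
}

// Fraction of the hypotheses in the version space that classify x as positive.
double positiveFraction(const std::vector<Hypothesis>& versionSpace,
                        const Example& x) {
    assert(!versionSpace.empty());
    std::size_t votes = 0;
    for (const Hypothesis& h : versionSpace) {
        if (covers(h, x)) ++votes;
    }
    return static_cast<double>(votes) /
           static_cast<double>(versionSpace.size());
}

// Predict positive only on a strict majority; a tie goes to negative
// (an assumption that matches the Example_2 line in the sample output).
bool predictPositive(const std::vector<Hypothesis>& versionSpace,
                     const Example& x) {
    return positiveFraction(versionSpace, x) > 0.5;
}
```

With the six hypotheses from the sample output, positiveFraction gives 1.0 for Example_0, 0.0 for Example_1, 0.5 for Example_2 and one third for Example_3, which lines up with the Predicted Output Vectors listing.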

Completing the Code

You can read more about the code you will be completing here. For this program you should make several assumptions. First, assume that there is only one class for each data point (as in C4.5 data sets) and that the class has exactly two possible values (such as "good" and "bad" or "yes" and "no"). Furthermore, you should assume that the data sets you will apply this code to contain only discrete features and no unknown values. A function testing many of these conditions is included in the archived code.

To complete the provided code you will need to complete four functions (learn, classify, read and write) and the Constructor and Destructor for the class VersionSpace. This is not as easy as it sounds, so plan to start early. The skeletons of these functions can be found in the file version_space.C. You will also need to add some fields to the class VersionSpace in the file version_space.h (representing the VersionSpace's S and G lists). A short description of what each function is supposed to do can be found in version_space.C. Hari will give you more details on these functions and answer your questions during lab.
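As one possible starting point for the S and G fields, the sketch below stores each boundary as a list of hypotheses. Everything in it is an assumption made for illustration: the class name VersionSpaceSketch, the Hypothesis alias, the accessor names and the constructor argument are hypothetical and do not reflect the actual declarations in version_space.h, which your fields must fit into.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical representation: one string per feature, "?" accepts any value.
using Hypothesis = std::vector<std::string>;

// Illustrative stand-in for the boundary-set fields of a version space.
class VersionSpaceSketch {
public:
    // G starts as the single most general hypothesis <?,?,...,?>.
    // S starts empty here and would be filled by the first positive example.
    explicit VersionSpaceSketch(std::size_t numFeatures)
        : general_(1, Hypothesis(numFeatures, "?")) {
        assert(numFeatures > 0);
    }

    const std::vector<Hypothesis>& specific() const { return specific_; }
    const std::vector<Hypothesis>& general() const { return general_; }

    // The version space has collapsed once no general hypothesis remains.
    bool collapsed() const { return general_.empty(); }

private:
    std::vector<Hypothesis> specific_;  // the S list
    std::vector<Hypothesis> general_;   // the G list
};
```

Because the standard containers release their own storage, a class built this way needs no hand-written cleanup; if you choose raw arrays or linked lists instead, the Destructor must free them explicitly.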

What to Turn In

Print out a copy of your team's version of the files version_space.h and version_space.C (plus any extra files you add). You should hand in printouts demonstrating how your program works by running vs_create and vs_predict as shown above. You should also run vs_create on the table3.1 data and on your own first data sets (one for each team member). In all likelihood, all of these data sets will cause the version space to become empty (hand in printouts demonstrating this). Finally, you should create one other test data set with at least five discrete features and data points matching a concept based on those five features, where at least two of the features take a particular discrete value and at least two are ?. Show that your version of Version Spaces converges to the correct concept by running vs_create on enough data to cause convergence (and hand in printouts of these results).

Next, your team should write up a short report (at least one page, no more than three) discussing your design decisions in implementing the Version Spaces code and how your version of the code works.

Finally, each member of your team should write a short report (at least half a page and no more than one page) discussing what you contributed to the effort and how the overall team approach worked for you.
