CS 8751 (Spring 2003) Program 4

Computer Science 8751
Machine Learning and Knowledge Discovery in Databases

Programming Assignment 4
Genetic Algorithms (45 points)
Due Monday, March 31, 2003

Introduction

In this lab you will implement a genetic algorithm that produces a set of rules to predict a concept.

Hypotheses: Representing the Bit String

The format of your hypothesis as a bit string should be based on the GABIL algorithm, with one generalization: your rules should also be allowed to reference continuous variables.

To represent clauses corresponding to discrete features, use the standard GABIL representation: one bit for each value of the feature, where a 1 indicates that the value is allowed and a 0 indicates that it is not. For example, for feature A with three possible values (a1, a2, a3), three bits would be used. If the three bits were set to 101 in a hypothesis, that would indicate a condition of (A = a1 or A = a3).
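
The discrete encoding can be sketched in a few lines. This is only an illustration written in Python for brevity, not submission code; the function names `decode_discrete` and `matches_discrete` are made up for this example.

```python
def decode_discrete(bits, values):
    """Return the feature values allowed by a GABIL-style bit substring."""
    if len(bits) != len(values):
        raise ValueError("need exactly one bit per feature value")
    return [v for b, v in zip(bits, values) if b == "1"]


def matches_discrete(bits, values, observed):
    """True when the observed value is one of the allowed values."""
    return observed in decode_discrete(bits, values)


# Feature A with values (a1, a2, a3) and bits 101: A = a1 or A = a3.
print(decode_discrete("101", ["a1", "a2", "a3"]))
print(matches_discrete("101", ["a1", "a2", "a3"], "a2"))
```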

For continuously valued attributes, you should use a representation with 2 bits for an operation and 32 bits to represent a floating point value. If both of the operation bits are set to 1, then any value of the continuously valued feature is allowed. If the 2 operation bits are 10, then values that are less than or equal to the floating point value are allowed, and if the 2 operation bits are 01, then values that are greater than or equal to the floating point value are allowed. For example, if the 34 bits for continuous feature B were 01 followed by the 32 bits to represent 1.5, this would indicate a test of B >= 1.5.
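
One possible way to build and test the 34-bit substring is sketched below in Python (for illustration only). It assumes the 32 bits are an IEEE 754 single-precision value written most significant bit first, which the assignment does not specify. It also assumes that the operation bits 00, which the text above leaves undefined, allow no value at all. The function names are hypothetical.

```python
import struct


def encode_continuous(op_bits, threshold):
    """Return 2 operation bits followed by the 32 bits of a float32 threshold."""
    if len(op_bits) != 2 or set(op_bits) - {"0", "1"}:
        raise ValueError("op_bits must be two characters, each '0' or '1'")
    (as_int,) = struct.unpack(">I", struct.pack(">f", threshold))
    return op_bits + format(as_int, "032b")


def matches_continuous(bits, value):
    """Apply the test encoded in a 34-bit substring to a feature value."""
    op = bits[:2]
    (threshold,) = struct.unpack(">f", struct.pack(">I", int(bits[2:], 2)))
    if op == "11":
        return True                  # any value is allowed
    if op == "10":
        return value <= threshold
    if op == "01":
        return value >= threshold
    return False                     # "00": assumed here to allow nothing


b_bits = encode_continuous("01", 1.5)    # the test B >= 1.5
print(len(b_bits), matches_continuous(b_bits, 2.0), matches_continuous(b_bits, 1.0))
```

Note that after mutation or cross-over the 32 value bits may decode to NaN or infinity; in this sketch a NaN threshold simply fails both comparisons.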

Genetic Operators

You should plan on implementing one cross-over operator, plus the point mutation, AddAlternative, and DropCondition operators. For cross-over, you should use a variation on the GABIL cross-over operator as follows:

Pick two parents stochastically based on the fitness of each parent divided by the sum of fitness across all members of the population.
Pick two cross-over points as follows from the first parent:
- Select a rule randomly from however many rules the first parent has (if 5 rules, pick one of the 5 randomly with equal probability)
- Select an attribute of that rule randomly (if 10 possible attributes including the class, select from amongst the 10 attributes randomly).
- Select a cross-over point with respect to that attribute randomly (if the attribute has 34 bits, pick randomly from among the 34 possible cross-over points: before the 1st bit, before the 2nd bit, before the 3rd bit, etc.)
Then repeat this process to pick the second cross-over point in the first parent (note: do not allow the system to pick the same cross-over point twice). Order the two cross-over points so that the one occurring earlier in the string is the first.
Pick two cross-over points for the second parent. These should be based on the cross-over points from the first parent. Enumerate the possible pairs of cross-over points in the second parent that match those of the first parent in attribute and bit, though not necessarily in rule (and that are ordered correctly), and pick randomly among these.
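
The procedure above might be sketched as follows (Python, for illustration only). Hypotheses are assumed to be strings of '0'/'1' characters made of fixed-length rules, and `widths` is a hypothetical list giving the number of bits of each attribute in a rule. Several choices here are assumptions rather than requirements: parents are drawn with replacement, the two points in the second parent must be strictly ordered, and `None` is returned when the second parent has no valid pair, a case the text above does not address.

```python
import random


def select_parents(population, fitnesses, rng):
    """Fitness-proportionate choice of two parents (drawn with replacement)."""
    return rng.choices(population, weights=fitnesses, k=2)


def pick_point(n_rules, widths, rng):
    """Pick a rule, then an attribute, then a bit; return (rule, offset in rule)."""
    rule = rng.randrange(n_rules)
    attr = rng.randrange(len(widths))
    bit = rng.randrange(widths[attr])        # the point lies before this bit
    return rule, sum(widths[:attr]) + bit


def first_parent_points(n_rules, widths, rng):
    """Two distinct points in parent 1, as ordered absolute bit positions."""
    rule_len = sum(widths)
    if n_rules * rule_len < 2:
        raise ValueError("need at least two possible cross-over points")
    a = pick_point(n_rules, widths, rng)
    b = pick_point(n_rules, widths, rng)
    while b == a:                            # never reuse the same point
        b = pick_point(n_rules, widths, rng)
    return sorted(rule * rule_len + off for rule, off in (a, b))


def second_parent_points(p1_points, n_rules2, rule_len, rng):
    """Pick among pairs in parent 2 with the same in-rule offsets, in order."""
    off1, off2 = (p % rule_len for p in p1_points)
    candidates = [
        (ra * rule_len + off1, rb * rule_len + off2)
        for ra in range(n_rules2)
        for rb in range(n_rules2)
        if ra * rule_len + off1 < rb * rule_len + off2
    ]
    return rng.choice(candidates) if candidates else None


def crossover(p1, p2, pts1, pts2):
    """Swap the segments between the cross-over points of the two parents."""
    (a1, b1), (a2, b2) = pts1, pts2
    return p1[:a1] + p2[a2:b2] + p1[b1:], p2[:a2] + p1[a1:b1] + p2[b2:]


rng = random.Random(0)
widths = [3, 34, 1]      # hypothetical rule: discrete A, continuous B, 1-bit class
rule_len = sum(widths)
parent1, parent2 = "0" * (3 * rule_len), "1" * (2 * rule_len)
pts1 = first_parent_points(3, widths, rng)
pts2 = second_parent_points(pts1, 2, rule_len, rng)
if pts2 is not None:
    child1, child2 = crossover(parent1, parent2, pts1, pts2)
    print(len(child1) // rule_len, len(child2) // rule_len)
```

Because the offsets inside the rule are the same in both parents, each child is again a whole number of rules, although the number of rules may differ from either parent.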

Fitness

The fitness of each individual should be the correctness (on the training set) of its hypothesis.
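
As a minimal sketch (Python, for illustration only), correctness can be read as the fraction of training examples classified correctly. That reading is an assumption. How a rule set classifies an example is left to a hypothetical `predict` function supplied by the caller; the toy threshold classifier below exists only to make the example runnable.

```python
def fitness(hypothesis, examples, predict):
    """Fraction of (features, label) training examples predicted correctly."""
    if not examples:
        raise ValueError("training set is empty")
    correct = sum(
        1 for features, label in examples if predict(hypothesis, features) == label
    )
    return correct / len(examples)


def toy_predict(threshold, x):
    """Stand-in classifier: predict class 1 when x >= threshold."""
    return int(x >= threshold)


training = [(1.0, 0), (2.0, 1), (3.0, 1), (0.5, 1)]
print(fitness(1.5, training, toy_predict))
```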

General Genetic Algorithm

Your code should take a dataset and several parameters. The following parameters should be set by the user when calling the program:

Length of evolution (how many evolutionary populations to go through)
Size of the population
Probability of cross-over (the probability that the cross-over operator will be used in creating children -- if no cross-over occurs simply use the parents as is -- possibly mutating them)
Probability of point mutation, of adding an alternative (AddAlternative), and of dropping a condition (DropCondition)
Percentage of the population to replace (for each evolutionary step, you should stochastically choose this percentage of the population to replace, weighting each member by the inverse of its fitness -- its error squared).
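
The replacement step could look roughly like the sketch below (Python, for illustration only). It assumes fitness is a value in [0, 1], so that error is 1 - fitness, and that the percentage is passed as a fraction between 0 and 1. The function name and the fallback to a uniform choice when every remaining member has zero error are assumptions, not part of the assignment.

```python
import random


def choose_replaced(fitnesses, fraction, rng):
    """Pick distinct population indices to replace, weighted by error squared."""
    n_replace = min(len(fitnesses), round(fraction * len(fitnesses)))
    remaining = list(range(len(fitnesses)))
    chosen = []
    for _ in range(n_replace):
        weights = [(1.0 - fitnesses[i]) ** 2 for i in remaining]
        if sum(weights) == 0:                # every remaining member is perfect
            weights = [1.0] * len(remaining)
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        remaining.remove(pick)               # sample without replacement
        chosen.append(pick)
    return chosen


print(choose_replaced([0.9, 0.5, 0.8, 0.2], 0.5, random.Random(1)))
```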

A sample run of your code on a data set should show the top 11 hypotheses (the rules for the 11 highest fitness members of the population) after evolution has completed.

You should set up your code so that it is able to save the top 11 hypotheses in a file. You should then be able to use the resulting hypotheses to predict the class for a separate set of data (a test file) and produce a confusion matrix for that data. Your output should look something like this:

Confusion matrix for hypothesis 1 from file XXX on data in file XXX:

               Actual Class
                  0   1 
               --------
Predicted   0 |   5   1 
Class       1 |   0  10 

  Accuracy =  93.75%

Confusion matrix for hypothesis 2 from file XXX on data in file XXX:
...
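
The counts and accuracy in a report like the one above could be computed along these lines (a Python sketch for illustration only; the function names are made up, and the lists below are chosen simply to reproduce the sample counts).

```python
def confusion_matrix(actual, predicted, classes=(0, 1)):
    """counts[p][a] = number of examples predicted as p whose actual class is a."""
    counts = {p: {a: 0 for a in classes} for p in classes}
    for a, p in zip(actual, predicted):
        counts[p][a] += 1
    return counts


def accuracy(counts):
    """Percentage of examples on the diagonal of the confusion matrix."""
    total = sum(sum(row.values()) for row in counts.values())
    correct = sum(counts[c][c] for c in counts)
    return 100.0 * correct / total


# 5 examples of class 0 predicted 0, 1 of class 1 predicted 0, 10 of class 1 predicted 1.
actual = [0] * 5 + [1] * 1 + [1] * 10
predicted = [0] * 5 + [0] * 1 + [1] * 10
counts = confusion_matrix(actual, predicted)
for p in (0, 1):
    print(p, "|", counts[p][0], counts[p][1])
print(f"Accuracy = {accuracy(counts):.2f}%")
```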

What to Turn In

Print out a copy of all of your code files. You should hand in printouts demonstrating how your program works by running your program on several data sets, including your own. For the rules produced from your own data set, try to analyze the resulting rules and determine how accurately you think they capture the concept expressed by your data.

You should also write up a short report (at least one page, no more than three) discussing your design decisions in implementing the genetic algorithm and how your version of the code works.

You must also submit your code electronically. To do this, go to https://webapps.d.umn.edu/service/webdrop/rmaclin/cs8751-1-s2003/upload.cgi and follow the directions for uploading a file (you can do this multiple times, though it would be helpful if you would tar your files and upload a single archive).

To make your code easier to check and grade, please use the following procedure for collecting the code before uploading it:

Store all of the code in a subdirectory named prog04_PLcode under a subdirectory named for your login. The PLcode value should be "java" for Java code, "cc" for C++ code and "c" for C code. For example, if your login is rmaclin, and you wrote your code in C++, all of the code should be stored in the directory
```
rmaclin/prog04_cc
```
Note that the suffix of all C++ code files (not .h files) should be ".cc". Only code files (for example, in C++, only .cc and .h files) should be stored in this directory.
Tar the contents of this directory using the command:
```
tar cf prog04.tar login/prog04_PLcode
```
Upload the resulting file (named prog04.tar) to the web site as discussed above.