Computer Science 5751
Machine Learning

Programming Assignment 5
Clustering (K-Means) (75 points)
Due Thursday, April 19, 2001

Introduction

In this lab you will implement the K-Means clustering algorithm discussed in class and in the clustering paper available from the class web page. Your team assignments can be found here.

Our Version of the K-Means Algorithm

We will be implementing a K-Means algorithm that includes a number of extra features. In the basic K-Means algorithm, we pick a set of K cluster centers (or centroids) and then repeatedly do the following (one epoch of this loop is sketched in code after the list):

  For each data point, determine the closest centroid and assign the
    point to that centroid
  For each centroid, calculate a simple gradient consisting of the sum of
    the differences between that centroid and the points that make up its
    cluster
  Move each centroid towards the "center" of its cluster (step an amount in
    the gradient direction based on the gradient calculated above and a
    movement or learning rate)
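
As a rough illustration, one epoch of this loop might look like the following C++ sketch. The data layout (vectors of doubles) and all names here are assumptions for illustration only; the provided student_kmeans code defines its own classes and representation:

  #include <algorithm>
  #include <cmath>
  #include <vector>

  // One epoch of the basic K-Means loop described above. Returns the
  // largest distance any centroid moved, which is what the maxmove
  // termination test in the sample runs below is based on.
  double kmeans_epoch(const std::vector<std::vector<double> > &points,
                      std::vector<std::vector<double> > &centroids,
                      double rate)   // the movement or learning rate
  {
      size_t k = centroids.size(), dim = centroids[0].size();

      // Step 1: assign each point to its closest centroid.
      std::vector<size_t> assign(points.size(), 0);
      for (size_t p = 0; p < points.size(); p++) {
          double best = -1;
          for (size_t c = 0; c < k; c++) {
              double d = 0;
              for (size_t i = 0; i < dim; i++) {
                  double diff = points[p][i] - centroids[c][i];
                  d += diff * diff;
              }
              if (best < 0 || d < best) { best = d; assign[p] = c; }
          }
      }

      // Step 2: the "simple gradient" -- for each centroid, sum the
      // differences between its cluster's points and the centroid.
      std::vector<std::vector<double> > grad(k, std::vector<double>(dim, 0.0));
      for (size_t p = 0; p < points.size(); p++)
          for (size_t i = 0; i < dim; i++)
              grad[assign[p]][i] += points[p][i] - centroids[assign[p]][i];

      // Step 3: step each centroid toward the "center" of its cluster
      // and track the largest move for the termination test.
      double max_move = 0.0;
      for (size_t c = 0; c < k; c++) {
          double move2 = 0.0;
          for (size_t i = 0; i < dim; i++) {
              double step = rate * grad[c][i];
              centroids[c][i] += step;
              move2 += step * step;
          }
          max_move = std::max(max_move, std::sqrt(move2));
      }
      return max_move;
  }

Your learn routine would repeat such an epoch until the returned value falls below the maxmove parameter, which is what produces the "Terminating learning early" messages in the sample runs below.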

We will use a basic set of parameters to control this algorithm. As the sample runs below illustrate, these include k (the number of centroids) and maxmove (learning terminates once no centroid moves more than this amount in an epoch), along with the movement or learning rate mentioned above.

We will also have a set of extra parameters that control other aspects of the algorithm. The sample runs below exercise three of them: split (a cluster whose largest within-cluster distance exceeds this threshold is split in two), combine (the two closest centroids are merged when the distance between them falls below this threshold), and movewhenbelow (a centroid whose cluster contains fewer than this many points is reinitialized).

K-Means and the Training Code

Although K-Means is an unsupervised learning method, we will still be using the same code we have used previously for supervised learning. We will even use the class information associated with the data points to do some evaluation of our clusters.

The code you will be completing is in the archive skm.tar.Z. To use this code, download the archive to your account and then unpack it as follows:

  uncompress skm.tar.Z
  tar xvf skm.tar.Z

This will create the directory student_kmeans and a set of files in that directory. The directory includes a makefile that will compile the provided files and produce the program train. Note that although the code includes placeholders for a number of routines, the only one you will need to complete is the learn member function of the KMEANS class. The other functions can be left blank.
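
A hypothetical skeleton for learn is sketched below. The stand-in struct and placeholder helper exist only to make the control flow concrete; the real KMEANS class in student_kmeans declares its own members, and learn's actual signature may differ:

  #include <cstdio>

  // Stand-in for the provided class -- illustrative only.
  struct KMEANS {
      double maxmove;   // termination threshold from the command line

      // Placeholder for one epoch of assignment + centroid movement
      // (see the epoch sketch above); returns the largest move made.
      double run_one_epoch() { return 0.0; /* placeholder */ }

      void learn() {
          int epochs = 0;
          double moved;
          do {
              epochs++;
              moved = run_one_epoch();
          } while (moved > maxmove);
          std::printf("  Terminating learning early (%d epochs), "
                      "no centroid moved more than %.3f\n",
                      epochs, maxmove);
      }
  };

The printed message format matches the sample transcripts below; the real routine would of course also build the clusters and print the centroid summaries.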

As with the previous assignment, I cannot provide you with simple scripts to run train. To run the train program, you will need to provide at least the following arguments:

  train -dataset c4.5 <NAMESFILE> <DATAFILE> -save model <DUMMYNAME> -learn kmeans

The name provided after "-save model" can be any dummy name (an empty file will be stored under this name). As with the previous program, train has a number of options (most discussed above) that you should experiment with to get a feel for how the K-Means algorithm works.

After your code runs, it should print each resulting centroid and the class distribution of the points assigned to it. For example, your code might produce the following:

% train -dataset c4.5 breast-cancer-wisconsin.names breast-cancer-wisconsin.data -save model junk -learn kmeans maxmove 0.01
Generating clusters:
  Terminating learning early (31 epochs), no centroid moved more than 0.010
Centroid 1:
  [0.652,0.410,0.353,0.693,0.372,0.944,0.462,0.222,0.137]
  Class (0) Distribution: [3,48]
Centroid 2:
  [0.144,0.018,0.021,0.018,0.103,0.049,0.246,0.011,0.001]
  Class (0) Distribution: [94,0]
Centroid 3:
  [0.145,0.012,0.022,0.017,0.112,0.034,0.058,0.007,0.010]
  Class (0) Distribution: [213,0]
Centroid 4:
  [0.581,0.763,0.705,0.586,0.492,0.858,0.666,0.519,0.078]
  Class (0) Distribution: [2,71]
Centroid 5:
  [0.734,0.461,0.524,0.197,0.429,0.504,0.391,0.533,0.119]
  Class (0) Distribution: [9,60]
Centroid 6:
  [0.443,0.065,0.091,0.077,0.134,0.060,0.133,0.068,0.010]
  Class (0) Distribution: [137,12]
Centroid 7:
  [0.754,0.914,0.911,0.764,0.710,0.720,0.739,0.899,0.408]
  Class (0) Distribution: [0,50]

When the more advanced features are invoked, you should indicate this during the learning process:

% train -dataset c4.5 breast-cancer-wisconsin.names breast-cancer-wisconsin.data -save model junk -learn kmeans k 2 maxmove 0.01 split 1.8
Generating clusters:
  Terminating learning early (22 epochs), no centroid moved more than 0.010
  Largest distance within cluster too large : 2.577  -- splitting cluster
  Terminating learning early (24 epochs), no centroid moved more than 0.010
  Largest distance within cluster too large : 2.146  -- splitting cluster
  Terminating learning early (26 epochs), no centroid moved more than 0.010
  Largest distance within cluster too large : 2.146  -- splitting cluster
  Terminating learning early (53 epochs), no centroid moved more than 0.010
  Largest distance within cluster too large : 1.902  -- splitting cluster
  Terminating learning early (22 epochs), no centroid moved more than 0.010
  Largest distance within cluster too large : 1.902  -- splitting cluster
  Terminating learning early (28 epochs), no centroid moved more than 0.010
  Largest distance within cluster too large : 1.985  -- splitting cluster
  Terminating learning early (23 epochs), no centroid moved more than 0.010
  Largest distance within cluster too large : 1.869  -- splitting cluster
  Terminating learning early (21 epochs), no centroid moved more than 0.010
Centroid 1:
  [0.206,0.021,0.034,0.029,0.111,0.037,0.112,0.014,0.007]
  Class (0) Distribution: [436,1]
Centroid 2:
  [0.732,0.427,0.455,0.400,0.327,0.955,0.472,0.251,0.077]
  Class (0) Distribution: [5,66]
Centroid 3:
  [0.635,0.821,0.829,0.728,0.608,0.966,0.707,0.694,0.128]
  Class (0) Distribution: [0,42]
Centroid 4:
  [0.873,0.812,0.757,0.178,0.509,0.539,0.412,0.686,0.107]
  Class (0) Distribution: [1,24]
Centroid 5:
  [0.794,0.890,0.885,0.737,0.777,0.732,0.638,0.852,0.855]
  Class (0) Distribution: [0,24]
Centroid 6:
  [0.606,0.287,0.314,0.245,0.273,0.268,0.313,0.195,0.103]
  Class (0) Distribution: [9,23]
Centroid 7:
  [0.701,0.846,0.759,0.646,0.524,0.250,0.735,0.705,0.063]
  Class (0) Distribution: [0,25]
Centroid 8:
  [0.514,0.290,0.331,0.146,0.520,0.285,0.354,0.855,0.032]
  Class (0) Distribution: [6,8]
Centroid 9:
  [0.338,0.510,0.483,0.756,0.505,0.919,0.683,0.682,0.130]
  Class (0) Distribution: [1,28]

or

% train -dataset c4.5 breast-cancer-wisconsin.names breast-cancer-wisconsin.data -save model junk -learn kmeans k 20 maxmove 0.01 combine 0.5 movewhenbelow 10
Generating clusters:
  Reinitializing centroid with 7 cluster points
  Reinitializing centroid with 0 cluster points
  Reinitializing centroid with 9 cluster points
  Reinitializing centroid with 0 cluster points
  Reinitializing centroid with 5 cluster points
  Reinitializing centroid with 8 cluster points
  Reinitializing centroid with 9 cluster points
  Reinitializing centroid with 5 cluster points
  Terminating learning early (48 epochs), no centroid moved more than 0.010
  Combining two closest thresholds - distance : 0.136
  Terminating learning early (7 epochs), no centroid moved more than 0.010
  Combining two closest thresholds - distance : 0.168
  Terminating learning early (1 epochs), no centroid moved more than 0.010
  Combining two closest thresholds - distance : 0.186
  Terminating learning early (1 epochs), no centroid moved more than 0.010
  Combining two closest thresholds - distance : 0.202
  Terminating learning early (1 epochs), no centroid moved more than 0.010
  Combining two closest thresholds - distance : 0.203
  Terminating learning early (2 epochs), no centroid moved more than 0.010
  Combining two closest thresholds - distance : 0.243
  Terminating learning early (2 epochs), no centroid moved more than 0.010
  Combining two closest thresholds - distance : 0.292
  Terminating learning early (2 epochs), no centroid moved more than 0.010
  Combining two closest thresholds - distance : 0.352
  Terminating learning early (1 epochs), no centroid moved more than 0.010
  Combining two closest thresholds - distance : 0.453
  Terminating learning early (8 epochs), no centroid moved more than 0.010
Centroid 1:
  [0.828,0.447,0.520,0.299,0.356,0.923,0.642,0.793,0.106]
  Class (0) Distribution: [1,25]
Centroid 2:
  [0.710,0.446,0.400,0.900,0.432,0.906,0.524,0.266,0.165]
  Class (0) Distribution: [1,27]
Centroid 3:
  [0.192,0.023,0.046,0.035,0.110,0.125,0.111,0.011,0.007]
  Class (0) Distribution: [433,1]
Centroid 4:
  [0.339,0.255,0.254,0.282,0.350,0.256,0.346,0.583,0.011]
  Class (0) Distribution: [9,12]
Centroid 5:
  [0.702,0.838,0.778,0.530,0.481,0.225,0.689,0.792,0.066]
  Class (0) Distribution: [0,27]
Centroid 6:
  [0.758,0.290,0.325,0.111,0.279,0.213,0.346,0.136,0.152]
  Class (0) Distribution: [7,17]
Centroid 7:
  [0.866,0.777,0.734,0.112,0.580,0.558,0.337,0.720,0.297]
  Class (0) Distribution: [1,18]
Centroid 8:
  [0.773,0.969,0.930,0.659,0.718,0.949,0.769,0.405,0.136]
  Class (0) Distribution: [0,26]
Centroid 9:
  [0.378,0.656,0.668,0.778,0.530,0.905,0.619,0.829,0.101]
  Class (0) Distribution: [2,31]
Centroid 10:
  [0.666,0.426,0.473,0.236,0.331,0.955,0.396,0.166,0.045]
  Class (0) Distribution: [4,38]
Centroid 11:
  [0.783,0.947,0.913,0.918,0.760,0.738,0.690,0.884,0.812]
  Class (0) Distribution: [0,19]
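
To make the three heuristics concrete, here is a rough sketch of how they might be checked. The data layout and helpers are assumptions rather than the provided interface, and exactly when each check is applied (the transcripts suggest reinitialization happens during learning, while splitting and combining happen after convergence) is a design decision left to you:

  #include <cmath>
  #include <vector>

  typedef std::vector<double> Point;   // illustrative layout only

  double dist(const Point &a, const Point &b)
  {
      double d = 0;
      for (size_t i = 0; i < a.size(); i++)
          d += (a[i] - b[i]) * (a[i] - b[i]);
      return std::sqrt(d);
  }

  // clusters[c] holds the points currently assigned to centroid c.
  // Returns true if any heuristic fired (so K-Means should be rerun).
  // The thresholds correspond to the movewhenbelow, split, and combine
  // command-line parameters from the sample runs.
  bool adjust_clusters(std::vector<Point> &centroids,
                       const std::vector<std::vector<Point> > &clusters,
                       int movewhenbelow, double split, double combine)
  {
      // movewhenbelow: reinitialize a centroid with too few points.
      for (size_t c = 0; c < centroids.size(); c++)
          if ((int) clusters[c].size() < movewhenbelow) {
              // ... reinitialize centroid c (e.g., at a random point)
              return true;
          }

      // split: if some cluster's largest within-cluster distance is
      // above the threshold, split that cluster in two. (This sketch
      // measures point-to-centroid distance; work out for yourself
      // exactly what "distance within cluster" should mean.)
      for (size_t c = 0; c < centroids.size(); c++)
          for (size_t p = 0; p < clusters[c].size(); p++)
              if (dist(clusters[c][p], centroids[c]) > split) {
                  // ... add a second centroid near cluster c
                  return true;
              }

      // combine: merge the two closest centroids when they are nearer
      // to each other than the threshold.
      for (size_t a = 0; a + 1 < centroids.size(); a++)
          for (size_t b = a + 1; b < centroids.size(); b++)
              if (dist(centroids[a], centroids[b]) < combine) {
                  // ... replace centroids a and b with one centroid
                  return true;
              }

      return false;   // nothing fired; clustering is final
  }

After any of these fires, learning resumes, which matches the alternating "Terminating learning early" and "splitting cluster" / "Combining" lines in the transcripts above.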

kmeans also has one further option, show. When this option is used, you should also print the names of the data points that are part of each cluster:

% train -dataset c4.5 labor.names labor.data -save model junk -learn kmeans maxmove 0.01 k 4 show
Generating clusters:
  Terminating learning early (60 epochs), no centroid moved more than 0.010
Centroid 1:
  [0.583,0.256,0.378,0.415,0.495,0.281,0.420,0.947,0.612,0.313,0.313,0.413,0.300,0.275,0.725,0.250,0.904,0.096,0.005,0.542,0.458,0.150,0.850,0.150,0.695,0.305,0.374,0.402,0.448]
  Class (0) Distribution: [4,9]
  Members:
    Example_12 Example_45 Example_21 Example_7 Example_25
    Example_8 Example_17 Example_43 Example_37 Example_34
    Example_31 Example_30 Example_16
Centroid 2:
  [0.569,0.472,0.445,0.550,0.713,0.286,0.168,0.794,0.402,0.402,0.521,0.542,0.314,0.503,0.497,0.412,0.157,0.337,0.546,0.714,0.286,0.267,0.228,0.694,0.764,0.236,0.150,0.188,0.811]
  Class (0) Distribution: [22,3]
  Members:
    Example_13 Example_24 Example_53 Example_29 Example_27
    Example_41 Example_56 Example_50 Example_54 Example_20
    Example_46 Example_9 Example_48 Example_10 Example_35
    Example_14 Example_28 Example_11 Example_52 Example_1
    Example_15 Example_39 Example_19 Example_23 Example_49
Centroid 3:
  [0.201,0.240,0.500,0.500,0.800,0.000,0.200,0.869,0.800,0.200,0.000,0.433,0.248,0.500,0.500,0.267,0.401,0.399,0.200,0.000,1.000,1.000,0.000,0.000,0.201,0.799,1.000,0.000,0.000]
  Class (0) Distribution: [0,5]
  Members:
    Example_36 Example_33 Example_18 Example_44 Example_40
Centroid 4:
  [0.744,0.301,0.357,0.623,0.231,0.526,0.474,0.683,0.263,0.263,0.737,0.500,0.444,0.655,0.345,0.404,0.228,0.609,0.310,0.693,0.307,0.406,0.594,0.242,0.822,0.178,0.321,0.679,0.244]
  Class (0) Distribution: [11,3]
  Members:
    Example_26 Example_51 Example_3 Example_32 Example_38
    Example_47 Example_42 Example_22 Example_4 Example_2
    Example_5 Example_6 Example_55 Example_0

You can read more about the code you will be completing here. Assume that there is just one output class in your problems. You should plan on using the same process you used in the previous lab to turn data points into vectors. Hari will give you more details on these functions and answer your questions during lab.

Experiments

Conduct experiments to determine an appropriate number of clusters for the breast-cancer-wisconsin data. A rough estimate of how good a set of clusters is can be obtained by totaling, over all clusters, the smaller of the class counts in each cluster's class breakdown (these points can be thought of as errors). Include output from several runs to demonstrate how your parameter choices change the results.
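
For example, in the first seven-centroid run shown earlier, the smaller class counts are 3, 0, 0, 2, 9, 12, and 0, giving a rough error estimate of 26 points. A small helper for this tally (assuming two classes, as in the sample runs) might look like:

  #include <algorithm>
  #include <utility>
  #include <vector>

  // Sum the smaller of the two class counts over all clusters; this is
  // the rough error estimate described above (two-class case).
  int cluster_errors(const std::vector<std::pair<int, int> > &dists)
  {
      int errors = 0;
      for (size_t i = 0; i < dists.size(); i++)
          errors += std::min(dists[i].first, dists[i].second);
      return errors;
  }

Called with the distributions from that run ([3,48], [94,0], [213,0], [2,71], [9,60], [137,12], [0,50]), it returns 26.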

Next, run experiments to see whether the cluster splitting and combining mechanisms can be made to approximate your choice for the number of clusters.

Also run your code on your own datasets (note that you will likely need to use a smaller number of clusters). What conclusions can you draw from your results?

What to Turn In

Print out a copy of your team's version of the files kmeans.h and kmeans.C (plus any extra files you add). Also construct a report presenting the results of your experiments from the previous section.

Next, your team should write a short report (at least one page, no more than three) discussing your design decisions in implementing the K-Means code and how your version of the code works.

Finally, each member of your team should write a short report (at least half a page and no more than one page) discussing what you contributed to the effort and how the overall team approach worked for you.