### Introduction

In this lab you will implement the K-Means clustering algorithm discussed in class and in the clustering paper available from the class web page.

### Our Version of the K-Means Algorithm

We will implement a K-Means algorithm that includes a number of extra features. In the basic K-Means algorithm we pick a set of K cluster centers (or centroids) and then repeatedly do the following:

```
For each data point, determine the closest centroid and assign that
    data point to that centroid.
For each centroid, calculate a simple gradient consisting of the sum
    of the differences between that centroid and the points that make
    up its cluster.
Move each centroid toward the "center" of its cluster (step an amount
    determined by the movement, or learning, rate).
```
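The steps above can be sketched in a few lines of Python (a sketch only, under the assumption that each point and centroid is a list of feature values; the function names here are mine, not part of the assignment):

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans_epoch(points, centroids, moverate):
    """One epoch: assign each point to its closest centroid, compute each
    centroid's gradient (sum of point-minus-centroid differences over its
    cluster), then step each centroid by moverate times its gradient."""
    dim = len(centroids[0])
    assign = [closest(p, centroids) for p in points]
    new_centroids = []
    for i, c in enumerate(centroids):
        grad = [0.0] * dim
        for p, a in zip(points, assign):
            if a == i:
                for d in range(dim):
                    grad[d] += p[d] - c[d]
        new_centroids.append([c[d] + moverate * grad[d] for d in range(dim)])
    return new_centroids, assign
```

Repeating `kmeans_epoch` for the requested number of epochs gives the basic algorithm; your real implementation will wrap this loop with the stopping and restructuring options described below.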

We will use a basic set of parameters to control this algorithm:

• k count - the initial number of clusters (e.g., k 5 indicates that the initial number of centroids is 5)
• moverate rate - the rate of movement, i.e., how much to move the centroids (e.g., moverate 0.05 means that we multiply each gradient vector by 0.05 when changing each centroid's location)
• nepochs count - the number of times to repeat the above steps (e.g., nepochs 1000 means that we repeat the above steps 1000 times to recenter the cluster centroids)
• initialize random|data - whether to initialize centroids randomly in data space or by picking random data patterns (e.g., initialize random for the former and initialize data for the latter)
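The two initialization modes might be sketched as follows (a Python illustration; the helper name and the choice of sampling uniformly within each dimension's observed range for the "random" mode are my assumptions):

```python
import random

def init_centroids(points, k, mode):
    """Create k centroids, either at random locations in data space
    ("random") or by copying k randomly chosen data points ("data")."""
    dim = len(points[0])
    if mode == "random":
        # Random location within the observed range of each dimension.
        lo = [min(p[d] for p in points) for d in range(dim)]
        hi = [max(p[d] for p in points) for d in range(dim)]
        return [[random.uniform(lo[d], hi[d]) for d in range(dim)]
                for _ in range(k)]
    elif mode == "data":
        # Copy k distinct random data points as the starting centroids.
        return [list(p) for p in random.sample(points, k)]
    raise ValueError("mode must be 'random' or 'data'")
```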

We will also have a set of extra parameters that will control other aspects of the algorithm:

• maxmove distance - this value is used to stop learning early (before nepochs is reached); if no centroid moved a distance larger than this value during an epoch, learning is terminated
• movewhenbelow count - this value is used to move an unproductive centroid; if, during the learning above, any centroid represents fewer than count points, that centroid is reinitialized by setting it to be the same as a random data point. Note that this option has to be turned on; if the user does not supply a value for this option, the condition is not checked (it is then acceptable for a centroid to represent 0 points).
• combine distance - when this option is selected the learning phase may occur multiple times. First, a set of clusters is generated normally. Then we examine those centroids and see if any pair is closer together than distance. If any are, we pick the two closest together and combine them (by eliminating one of the centroids and positioning the other at the average of the two centers). Then we repeat the initial learning process (with one less cluster) and again test whether any pair is too close together. This continues until no pair is too close. Note that this feature has to be turned on; if the user does not use this option, we do not attempt to combine centroids.
• split distance - when this option is selected the learning phase may occur multiple times. First, a set of clusters is generated normally. Then we examine each cluster of points. If any cluster contains two points that are farther than distance apart we will split one cluster. We pick the cluster to split by picking the one containing the two points that are farthest apart. To split the cluster we create a second centroid whose center is set to the location of a random member of that cluster (the original cluster center remains in place). Then we repeat the initial learning process (with one more cluster) and again test whether any cluster is too spread out. This continues until none are too spread out. Note that this feature has to be turned on; if the user does not use this option, we do not attempt to split centroids.
• Note that the behavior is undefined if the user attempts to use both the combine and split options.
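Both the combine and split tests reduce to pairwise-distance searches. A sketch of the three pieces of geometry involved (hypothetical helper names; the merge-by-averaging follows the combine description above):

```python
import math

def closest_centroid_pair(centroids):
    """Return (dist, i, j) for the closest pair of centroids."""
    best = (float("inf"), -1, -1)
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            d = math.dist(centroids[i], centroids[j])
            best = min(best, (d, i, j))
    return best

def widest_pair_in_cluster(cluster_points):
    """Largest distance between any two points in one cluster
    (compared against the split threshold)."""
    widest = 0.0
    for i in range(len(cluster_points)):
        for j in range(i + 1, len(cluster_points)):
            widest = max(widest,
                         math.dist(cluster_points[i], cluster_points[j]))
    return widest

def combine(centroids, i, j):
    """Merge centroids i and j into their average, leaving one fewer."""
    merged = [(a + b) / 2 for a, b in zip(centroids[i], centroids[j])]
    return [c for k, c in enumerate(centroids) if k not in (i, j)] + [merged]
```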

Although K-Means is not a supervised learning method, we will still use the same dataset code we have used previously. We will use the class information associated with the data points to do some evaluation of our clusters.

After your code runs it should print:

• The centroid location for each cluster
• The distribution of example classes in that cluster (how many points from class value 1 were included in the cluster, how many from class value 2, etc.)
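The class distribution in the second bullet can be tallied from the per-point cluster assignments and the class labels; a minimal sketch (function name is mine, and labels are assumed to be 0-based class indices):

```python
def class_distribution(assign, labels, k, num_classes):
    """For each of the k clusters, count how many assigned points
    carry each class label."""
    dist = [[0] * num_classes for _ in range(k)]
    for cluster, label in zip(assign, labels):
        dist[cluster][label] += 1
    return dist
```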

For example, your code might produce the following:

```
% kmeans -dataset breast-cancer-wisconsin -maxmove 0.01
Generating clusters:
Terminating learning early (31 epochs), no centroid moved more than 0.010
Centroid 1:
[0.652,0.410,0.353,0.693,0.372,0.944,0.462,0.222,0.137]
Class (0) Distribution: [3,48]
Centroid 2:
[0.144,0.018,0.021,0.018,0.103,0.049,0.246,0.011,0.001]
Class (0) Distribution: [94,0]
Centroid 3:
[0.145,0.012,0.022,0.017,0.112,0.034,0.058,0.007,0.010]
Class (0) Distribution: [213,0]
Centroid 4:
[0.581,0.763,0.705,0.586,0.492,0.858,0.666,0.519,0.078]
Class (0) Distribution: [2,71]
Centroid 5:
[0.734,0.461,0.524,0.197,0.429,0.504,0.391,0.533,0.119]
Class (0) Distribution: [9,60]
Centroid 6:
[0.443,0.065,0.091,0.077,0.134,0.060,0.133,0.068,0.010]
Class (0) Distribution: [137,12]
Centroid 7:
[0.754,0.914,0.911,0.764,0.710,0.720,0.739,0.899,0.408]
Class (0) Distribution: [0,50]
```

When the more advanced features are invoked you should indicate this during the learning process:

```
% kmeans -dataset breast-cancer-wisconsin -k 2 -maxmove 0.01 -split 1.8
Generating clusters:
Terminating learning early (22 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 2.577  -- splitting cluster
Terminating learning early (24 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 2.146  -- splitting cluster
Terminating learning early (26 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 2.146  -- splitting cluster
Terminating learning early (53 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 1.902  -- splitting cluster
Terminating learning early (22 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 1.902  -- splitting cluster
Terminating learning early (28 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 1.985  -- splitting cluster
Terminating learning early (23 epochs), no centroid moved more than 0.010
Largest distance within cluster too large : 1.869  -- splitting cluster
Terminating learning early (21 epochs), no centroid moved more than 0.010
Centroid 1:
[0.206,0.021,0.034,0.029,0.111,0.037,0.112,0.014,0.007]
Class (0) Distribution: [436,1]
Centroid 2:
[0.732,0.427,0.455,0.400,0.327,0.955,0.472,0.251,0.077]
Class (0) Distribution: [5,66]
Centroid 3:
[0.635,0.821,0.829,0.728,0.608,0.966,0.707,0.694,0.128]
Class (0) Distribution: [0,42]
Centroid 4:
[0.873,0.812,0.757,0.178,0.509,0.539,0.412,0.686,0.107]
Class (0) Distribution: [1,24]
Centroid 5:
[0.794,0.890,0.885,0.737,0.777,0.732,0.638,0.852,0.855]
Class (0) Distribution: [0,24]
Centroid 6:
[0.606,0.287,0.314,0.245,0.273,0.268,0.313,0.195,0.103]
Class (0) Distribution: [9,23]
Centroid 7:
[0.701,0.846,0.759,0.646,0.524,0.250,0.735,0.705,0.063]
Class (0) Distribution: [0,25]
Centroid 8:
[0.514,0.290,0.331,0.146,0.520,0.285,0.354,0.855,0.032]
Class (0) Distribution: [6,8]
Centroid 9:
[0.338,0.510,0.483,0.756,0.505,0.919,0.683,0.682,0.130]
Class (0) Distribution: [1,28]
```

or

```
% kmeans -dataset breast-cancer-wisconsin -k 20 -maxmove 0.01 -combine 0.5 -movewhenbelow 10
Generating clusters:
Reinitializing centroid with 7 cluster points
Reinitializing centroid with 0 cluster points
Reinitializing centroid with 9 cluster points
Reinitializing centroid with 0 cluster points
Reinitializing centroid with 5 cluster points
Reinitializing centroid with 8 cluster points
Reinitializing centroid with 9 cluster points
Reinitializing centroid with 5 cluster points
Terminating learning early (48 epochs), no centroid moved more than 0.010
Combining two closest thresholds - distance : 0.136
Terminating learning early (7 epochs), no centroid moved more than 0.010
Combining two closest thresholds - distance : 0.168
Terminating learning early (1 epochs), no centroid moved more than 0.010
Combining two closest thresholds - distance : 0.186
Terminating learning early (1 epochs), no centroid moved more than 0.010
Combining two closest thresholds - distance : 0.202
Terminating learning early (1 epochs), no centroid moved more than 0.010
Combining two closest thresholds - distance : 0.203
Terminating learning early (2 epochs), no centroid moved more than 0.010
Combining two closest thresholds - distance : 0.243
Terminating learning early (2 epochs), no centroid moved more than 0.010
Combining two closest thresholds - distance : 0.292
Terminating learning early (2 epochs), no centroid moved more than 0.010
Combining two closest thresholds - distance : 0.352
Terminating learning early (1 epochs), no centroid moved more than 0.010
Combining two closest thresholds - distance : 0.453
Terminating learning early (8 epochs), no centroid moved more than 0.010
Centroid 1:
[0.828,0.447,0.520,0.299,0.356,0.923,0.642,0.793,0.106]
Class (0) Distribution: [1,25]
Centroid 2:
[0.710,0.446,0.400,0.900,0.432,0.906,0.524,0.266,0.165]
Class (0) Distribution: [1,27]
Centroid 3:
[0.192,0.023,0.046,0.035,0.110,0.125,0.111,0.011,0.007]
Class (0) Distribution: [433,1]
Centroid 4:
[0.339,0.255,0.254,0.282,0.350,0.256,0.346,0.583,0.011]
Class (0) Distribution: [9,12]
Centroid 5:
[0.702,0.838,0.778,0.530,0.481,0.225,0.689,0.792,0.066]
Class (0) Distribution: [0,27]
Centroid 6:
[0.758,0.290,0.325,0.111,0.279,0.213,0.346,0.136,0.152]
Class (0) Distribution: [7,17]
Centroid 7:
[0.866,0.777,0.734,0.112,0.580,0.558,0.337,0.720,0.297]
Class (0) Distribution: [1,18]
Centroid 8:
[0.773,0.969,0.930,0.659,0.718,0.949,0.769,0.405,0.136]
Class (0) Distribution: [0,26]
Centroid 9:
[0.378,0.656,0.668,0.778,0.530,0.905,0.619,0.829,0.101]
Class (0) Distribution: [2,31]
Centroid 10:
[0.666,0.426,0.473,0.236,0.331,0.955,0.396,0.166,0.045]
Class (0) Distribution: [4,38]
Centroid 11:
[0.783,0.947,0.913,0.918,0.760,0.738,0.690,0.884,0.812]
Class (0) Distribution: [0,19]
```

kmeans should also support one further option, show. When this option is used you should also print the names of the data points that are part of each cluster:

```
% kmeans -dataset labor -maxmove 0.01 -k 4 -show
Generating clusters:
Terminating learning early (60 epochs), no centroid moved more than 0.010
Centroid 1:
[0.583,0.256,0.378,0.415,0.495,0.281,0.420,0.947,0.612,0.313,0.313,0.413,0.300,0.275,0.725,0.250,0.904,0.096,0.005,0.542,0.458,0.150,0.850,0.150,0.695,0.305,0.374,0.402,0.448]
Class (0) Distribution: [4,9]
Members:
Example_12 Example_45 Example_21 Example_7 Example_25
Example_8 Example_17 Example_43 Example_37 Example_34
Example_31 Example_30 Example_16
Centroid 2:
[0.569,0.472,0.445,0.550,0.713,0.286,0.168,0.794,0.402,0.402,0.521,0.542,0.314,0.503,0.497,0.412,0.157,0.337,0.546,0.714,0.286,0.267,0.228,0.694,0.764,0.236,0.150,0.188,0.811]
Class (0) Distribution: [22,3]
Members:
Example_13 Example_24 Example_53 Example_29 Example_27
Example_41 Example_56 Example_50 Example_54 Example_20
Example_46 Example_9 Example_48 Example_10 Example_35
Example_14 Example_28 Example_11 Example_52 Example_1
Example_15 Example_39 Example_19 Example_23 Example_49
Centroid 3:
[0.201,0.240,0.500,0.500,0.800,0.000,0.200,0.869,0.800,0.200,0.000,0.433,0.248,0.500,0.500,0.267,0.401,0.399,0.200,0.000,1.000,1.000,0.000,0.000,0.201,0.799,1.000,0.000,0.000]
Class (0) Distribution: [0,5]
Members:
Example_36 Example_33 Example_18 Example_44 Example_40
Centroid 4:
[0.744,0.301,0.357,0.623,0.231,0.526,0.474,0.683,0.263,0.263,0.737,0.500,0.444,0.655,0.345,0.404,0.228,0.609,0.310,0.693,0.307,0.406,0.594,0.242,0.822,0.178,0.321,0.679,0.244]
Class (0) Distribution: [11,3]
Members:
Example_26 Example_51 Example_3 Example_32 Example_38
Example_47 Example_42 Example_22 Example_4 Example_2
Example_5 Example_6 Example_55 Example_0
```

### Experiments

Conduct experiments to try to determine an appropriate number of clusters for the breast-cancer-wisconsin data. A rough estimate of how good a set of clusters is can be obtained by totaling, over all clusters, the smaller of the class counts in each cluster's class breakdown (these points can be thought of as errors). Show runs from several experiments to demonstrate how your parameter choices change things.
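The rough quality estimate described above amounts to summing the minority class count in each cluster, which could be computed directly from the printed distributions (a sketch; the function name is mine):

```python
def cluster_errors(distributions):
    """Total the smaller class count in each cluster's class breakdown.
    E.g., a cluster with distribution [3, 48] contributes 3 'errors'."""
    return sum(min(d) for d in distributions)
```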

Next, run experiments to see whether the splitting or combining options can be made to approximate your chosen number of clusters.

Also run your code on your own datasets (note you will likely need to use a smaller number of clusters). What conclusions can you draw from your results?

### What to Turn In

Print out a copy of all of your code files. You should also hand in printouts demonstrating how your program works by running it on several datasets, including your own.

You should also write up a short report (at least one page, no more than three) discussing your design decisions in implementing the K-Means algorithm and how your version of the code works.

`rmaclin/prog05_kmeans_cc`
`tar cf prog05_kmeans.tar login/prog05_kmeans_PLcode`