CHAPTER 5

 

 

CLASSIFICATION METHODS

 

This section briefly describes the various classification methods used in order to categorize the email messages into various folders. We have made use of three supervised and one unsupervised method. The supervised methods used are Naļve Bayes classifier, J48 Decision Trees and Support Vector Machines, whereas the unsupervised method is an adaptation of the K-means clustering method. Let us now see how each method works.

 

 

SUPERVISED METHODS:

 

Naļve Bayes Classifier: -

 

The Naļve Bayes classifier works on a simple, but comparatively intuitive concept. Also, in some cases it is also seen that Naļve Bayes outperforms many other comparatively complex algorithms. It makes use of the variables contained in the data sample, by observing them individually, independent of each other.

 

The Naļve Bayes classifier is based on the Bayes rule of conditional probability. It makes use of all the attributes contained in the data, and analyses them individually as though they are equally important and independent of each other.  For example, consider that the training data consists of various animals (say elephants, monkeys and giraffes), and our classifier has to classify any new instance that it encounters. We know that elephants have attributes like they have a trunk, huge tusks, a short tail, are extremely big, etc. Monkeys are short in size, jump around a lot, and can climb trees; whereas giraffes are tall, have a long neck and short ears.

 

The Naļve Bayes classifier will consider each of these attributes separately when classifying a new instance. So, when checking to see if the new instance is an elephant, the Naļve Bayes classifier will not check whether it has a trunk and has huge tusks and is large. Rather, it will separately check whether the new instance has a trunk, whether it has tusks, whether it is large, etc. It works under the assumption that one attribute works independently of the other attributes contained by the sample.

 

In our experiments, it is seen that the Naļve Bayes classifier performs almost at par with the other classifiers in most of the cases. Of the 26 different experiments carried out on various datasets, the Naļve Bayes classifier shows a drop in performance in only 3-4 cases, when compared with J48 and Support Vector Machines. This proves the widely held belief that though simple in concept, the Naļve Bayes classifier works well in most data classification problems.

 

J48 Decision Trees: -

 

A decision tree is a predictive machine-learning model that decides the target value (dependent variable) of a new sample based on various attribute values of the available data. The internal nodes of a decision tree denote the different attributes, the branches between the nodes tell us the possible values that these attributes can have in the observed samples, while the terminal nodes tell us the final value (classification) of the dependent variable.

 

The attribute that is to be predicted is known as the dependent variable, since its value depends upon, or is decided by, the values of all the other attributes. The other attributes, which help in predicting the value of the dependent variable, are known as the independent variables in the dataset.

 

The J48 Decision tree classifier follows the following simple algorithm. In order to classify a new item, it first needs to create a decision tree based on the attribute values of the available training data. So, whenever it encounters a set of items (training set) it identifies the attribute that discriminates the various instances most clearly. This feature that is able to tell us most about the data instances so that we can classify them the best is said to have the highest information gain. Now, among the possible values of this feature, if there is any value for which there is no ambiguity, that is, for which the data instances falling within its category have the same value for the target variable, then we terminate that branch and assign to it the target value that we have obtained.

 

For the other cases, we then look for another attribute that gives us the highest information gain. Hence we continue in this manner until we either get a clear decision of what combination of attributes gives us a particular target value, or we run out of attributes. In the event that we run out of attributes, or if we cannot get an unambiguous result from the available information, we assign this branch a target value that the majority of the items under this branch possess.

 

Now that we have the decision tree, we follow the order of attribute selection as we have obtained for the tree. By checking all the respective attributes and their values with those seen in the decision tree model, we can assign or predict the target value of this new instance. The above description will be more clear and easier to understand with the help of an example. Hence, let us see an example of J48 decision tree classification.

 

Consider that the captain of a cricket team has to decide whether to bat or field first in the event that they win the toss and have to make a decision. He decides to collect the statistics of the last ten matches when the winning captain has decided to bat first, and compare them in order to decide what to do, so as to make conditions most favorable for a win. The following table represents the data that he has collected, in order to help him make a decision. In the following table, Outlook, Humidity and the number of regular batsmen in the team are the independent variables, whereas the dependent variable is the final outcome of the game.

 

Table showing the various values of the attributes for the last ten games when the winning captain decided to bat first and the final outcome of the game.

 

Independent Variables

Dependent Variable

Outlook

Humidity

Number of batsmen in team > 6

Final Outcome

Sunny

High

Yes

Won

Overcast

High

No

Lost

Sunny

Low

No

Lost

Sunny

High

No

Won

Overcast

Low

Yes

Lost

Sunny

Low

Yes

Won

Sunny

Low

No

Lost

Sunny

High

No

Won

Sunny

Low

Yes

Won

Sunny

Low

Yes

Won

 

 

Though this data is apparently confusing, it looks as though a decision tree model may help the captain get a clearer picture of the underlying situation. Hence, let us build a decision tree model for the available data. As we can see from the table, it is obvious that whenever the winning captain decides to bat first on a day that is not sunny, the team loses the match. As such, this attribute gives us the most information. Let us make this our first attribute for splitting.

 

Then, we look at the branch where some ambiguity still exists, that is it still has a number of instances with both values of the dependent variable. We then realize that the attribute Humidity will give us more clear information about what happens when the day is sunny. We can see that the batting side wins when the day is sunny and the humidity is high. Thus, Humidity is ideally found to be the next attribute based on which the instances should be split.

 

Now we realize that the number of regular batsmen in the team plays an important role at this stage. If a team had more than six batsmen, it won the game. If not, it lost. Therefore, we now have a decision tree that looks like the diagram given on the following page. From this, the captain knows that statistically, it is better to field first if the day is overcast. If the day is sunny and the humidity is high, batting is advisable. Also, if the day is sunny, the humidity is low, and the team contains at least six regular batsmen, it is safe to bat first; otherwise they should field. Hence, the captain can now make an informed decision after winning the toss.

 

 

 

 

 

 

 

Dependent Variable: Game Won or Lost

 

 

 

 


Humidity

 
                                     

                         Overcast                         Sunny

 

 

                          Lost

 

                                                      High                      Low

                                                       

No. of batsmen > 6

 
                                                   Won

 

                   

                                                                   Yes                            No   

 

                                                                 Won                                  Lost             

 

 

Thus, whenever he comes across a new instance, the captain will just have to compare the attribute values for outlook, humidity and number of batsmen to arrive at the best decision in the event that he wins the toss.

 

In our experiments it was seen that J48 Decision trees performed really well. In most cases, their performance was almost at par with that of Support Vector Machines. In fact in several cases, it was seen that J48 Decision Trees had a higher accuracy than either Naļve Bayes, or Support Vector Machines.

 

 

 

Support Vector Machines: -

 

Support Vector Machines are supervised learning methods used for classification, as well as regression. The advantage of Support Vector Machines is that they can make use of certain kernels in order to transform the problem, such that we can apply linear classification techniques to non-linear data. Applying the kernel equations arranges the data instances in such a way within the multi-dimensional space, that there is a hyper-plane that separates data instances of one kind from those of another.

The kernel equations may be any function that transforms the linearly non-separable data in one domain into another domain where the instances become linearly separable. Kernel equations may be linear, quadratic, Gaussian, or anything else that achieves this particular purpose.

 

Once we manage to divide the data into two distinct categories, our aim is to get the best hyper-plane to separate the two types of instances. This hyper-plane is important because it decides the target variable value for future predictions. We should decide upon a hyper-plane that maximizes the margin between the support vectors on either side of the plane. Support vectors are those instances that are either on the separating planes on each side, or a little on the wrong side. The explanatory diagrams that follow will make these ideas a little more clear.

 

One important thing to note about Support Vector Machines is that the data to be separated needs to be binary. Even if the data is not binary, Support Vector Machines handles it as though it is, and completes the analysis through a series of binary assessments on the data.

 

Let us now see an example of how Support Vector Machines work. Since it is easier to understand the concept visually, we shall see the data instances in their original form on a graph, then we shall see how the data instances are separated upon the application of kernel functions on them, and how the best hyper-plane is found. We shall also see what support vectors exactly are and how does a margin look.

 

 

Original Space

 

 

 

 


                                 A              B   

                                                                             B  

                                       A         B           A    

                                                                            B

                                                             A    B                          B          

                                                                         

                                  A                                       B

                                                 A              B                              

                                   A                                 A               

 

 


Idea for diagram taken from class notes of Dr. Rich Maclin (Advanced Machine Learning and Knowledge Discovery Databases)

 

 

 

 

New Space

 

 


 

                                                                f(B)

                                                                         f(B)   f(B)             f(B)

                                        f(A)                                                f(B)

 Support                                                                                         f(B)       f(B)

  Vectors                            f(A)   f(A) 

                                                                                                  f(B)

                                                     f(A)

                                                                       f(A)

                                                                                              Margin       

                                          f(A)                       f(A)

                                                          f(A)                                                          

 

 

 

Idea for diagram taken from class notes of Dr. Rich Maclin (Advanced Machine Learning and Knowledge Discovery Databases)

 

 

 

As can be seen from the above diagrams, the data instances which were not linearly separable in the original domain have become linearly separable in the new domain, due to the application of a function (kernel) that transforms the position of the data points from one domain to another. This is the basic idea behind Support Vector Machines and their kernel techniques. Whenever a new instance is encountered in the original domain, the same kernel function is applied to this instance too, and its position in the new domain is found out. This position determines the binary target value to which the new instance belongs.

 

In many cases, it is often seen that Support Vector Machines perform the best among all machine-learning methods. It may be interesting to recall that Ron Bekkerman also came to the conclusion that Support Vector machines achieve a higher accuracy than Naļve Bayes, Maximum Entropy or Wide Margin Winnow. Though Wide Margin Winnow does perform faster and in some cases better than Support Vector Machines, on the whole Support Vector Machines outperform any other classifier for the task of email classification.

 

In our experiments too, it is seen that Support Vector Machines usually have the highest accuracy among any of the other classification methods. Of the 26 experiments that we carried out, it is seen that Support Vector Machines have the highest accuracy in 16 cases, while in most others it is a close second.

 

UNSUPERVISED METHOD:

 

SenseClusters (an adaptation of the K-means clustering algorithm): -

 

We have made use of SenseClusters in order to classify the email messages into different user-defined folders. SenseClusters is a freely available package of Perl programs, developed at the University of Minnesota Duluth, which can be used for automatic text and document classification. The advantage of SenseClusters is that it does not need any training data; it makes use of unsupervised learning methods in order to classify the available data. To know more about SenseClusters, kindly visit http://senseclusters.sourceforge.net/

 

In this section we will try to understand the K-means clustering algorithm that has been used in SenseClusters. Clustering is the process in which we divide the available data instances into a given number of sub-groups. These sub-groups are called clusters, and hence the name “Clustering”. To put it simply, the K-means algorithm outlines a method to cluster a particular set of instances into K different clusters, where K is a positive integer. It should be noted here that the K-means clustering algorithm requires to know the number of clusters from the user. It cannot identify the number of clusters by itself. However, SenseClusters now has the facility of automatically identifying the number of clusters that the data may comprise of.

 

The K-means clustering algorithm starts by placing K centroids as far away from each other as possible within the available space. Then each of the available data instances is assigned a particular centroid, depending on a metric like Euclidian distance, Manhattan distance, Minkowski distance, etc. The position of the centroid is recalculated every time an instance is added to the cluster and this continues until all the instances are grouped into the final required number of clusters. Since recalculating the cluster centroids may alter the cluster membership, the cluster memberships are also verified once the position of the centroid changes. This process continues till there is no further change in the cluster membership, and there is as little change in the positions of the centroids as possible.

 

The initial position of the centroids is thus very important since this position affects all the future steps in the K-means clustering algorithm. Hence, it is always advisable to keep the cluster centers as far away from each other as possible. If there are too many clusters, then clusters that closely resemble each other and are in the vicinity of each other are clubbed together. If there are too few clusters then clusters that are too big and may contain two or more sub-groups of different data instances are divided. The K-means clustering algorithm is thus a simple to understand, fairly intuitive method by which we can divide the available data into sub-categories.

 

In our experiments, it was seen that in many cases the accuracy of the unsupervised method was not far behind that of the supervised methods. In fact, in a couple of cases, the unsupervised method outperformed any of the supervised methods. This is very encouraging when we consider the fact that we do not require a training set for unsupervised learning methods. In many cases, it may be possible that when faced with the trade-off between making training data available and sacrificing the accuracy by 5-6%, people would prefer the latter option and hence move towards unsupervised methods of learning, as against supervised learning methods. As such, the results we have obtained look pretty encouraging, though unsupervised learning methods may not perform as well as supervised learning methods.