CHAPTER
5
CLASSIFICATION
METHODS
This
section briefly describes the various classification methods used in order to
categorize the email messages into various folders. We have made use of three
supervised and one unsupervised method. The supervised methods used are Naļve
Bayes classifier, J48 Decision Trees and Support Vector Machines, whereas the
unsupervised method is an adaptation of the K-means clustering method. Let us
now see how each method works.
SUPERVISED METHODS:
Naļve Bayes Classifier: -
The
Naļve Bayes classifier works on a simple, but comparatively intuitive concept.
Also, in some cases it is also seen that Naļve Bayes outperforms many other
comparatively complex algorithms. It makes use of the variables contained in
the data sample, by observing them individually, independent of each other.
The
Naļve Bayes classifier is based on the Bayes rule of conditional probability.
It makes use of all the attributes contained in the data, and analyses them
individually as though they are equally important and independent of each
other. For example, consider that the
training data consists of various animals (say elephants, monkeys and
giraffes), and our classifier has to classify any new instance that it encounters.
We know that elephants have attributes like they have a trunk, huge tusks, a
short tail, are extremely big, etc. Monkeys are short in size, jump around a
lot, and can climb trees; whereas giraffes are tall, have a long neck and short
ears.
The
Naļve Bayes classifier will consider each of these attributes separately when
classifying a new instance. So, when checking to see if the new instance is an
elephant, the Naļve Bayes classifier will not check whether it has a trunk and
has huge tusks and is large. Rather, it will separately check whether the new
instance has a trunk, whether it has tusks, whether it is large, etc. It works
under the assumption that one attribute works independently of the other
attributes contained by the sample.
In
our experiments, it is seen that the Naļve Bayes classifier performs almost at
par with the other classifiers in most of the cases. Of the 26 different
experiments carried out on various datasets, the Naļve Bayes classifier shows a
drop in performance in only 3-4 cases, when compared with J48 and Support
Vector Machines. This proves the widely held belief that though simple in
concept, the Naļve Bayes classifier works well in most data classification
problems.
J48 Decision Trees: -
A
decision tree is a predictive machine-learning model that decides the target
value (dependent variable) of a new sample based on various attribute values of
the available data. The internal nodes of a decision tree denote the different
attributes, the branches between the nodes tell us the possible values that
these attributes can have in the observed samples, while the terminal nodes
tell us the final value (classification) of the dependent variable.
The
attribute that is to be predicted is known as the dependent variable, since its
value depends upon, or is decided by, the values of all the other attributes.
The other attributes, which help in predicting the value of the dependent
variable, are known as the independent variables in the dataset.
The
J48 Decision tree classifier follows the following simple algorithm. In order
to classify a new item, it first needs to create a decision tree based on the
attribute values of the available training data. So, whenever it encounters a
set of items (training set) it identifies the attribute that discriminates the
various instances most clearly. This feature that is able to tell us most about
the data instances so that we can classify them the best is said to have the
highest information gain. Now, among the possible values of this feature, if
there is any value for which there is no ambiguity, that is, for which the data
instances falling within its category have the same value for the target
variable, then we terminate that branch and assign to it the target value that
we have obtained.
For
the other cases, we then look for another attribute that gives us the highest
information gain. Hence we continue in this manner until we either get a clear
decision of what combination of attributes gives us a particular target value,
or we run out of attributes. In the event that we run out of attributes, or if
we cannot get an unambiguous result from the available information, we assign
this branch a target value that the majority of the items under this branch
possess.
Now
that we have the decision tree, we follow the order of attribute selection as
we have obtained for the tree. By checking all the respective attributes and
their values with those seen in the decision tree model, we can assign or
predict the target value of this new instance. The above description will be
more clear and easier to understand with the help of an example. Hence, let us
see an example of J48 decision tree classification.
Consider that the captain of a cricket team has to decide whether to bat or field first in the event that they win the toss and have to make a decision. He decides to collect the statistics of the last ten matches when the winning captain has decided to bat first, and compare them in order to decide what to do, so as to make conditions most favorable for a win. The following table represents the data that he has collected, in order to help him make a decision. In the following table, Outlook, Humidity and the number of regular batsmen in the team are the independent variables, whereas the dependent variable is the final outcome of the game.
Table showing the various values of the attributes for the last ten games when the winning captain decided to bat first and the final outcome of the game.
|
Independent Variables |
Dependent Variable |
||
|
Outlook |
Humidity |
Number of batsmen in team > 6 |
Final Outcome |
|
Sunny |
High |
Yes |
Won |
|
Overcast |
High |
No |
Lost |
|
Sunny |
Low |
No |
Lost |
|
Sunny |
High |
No |
Won |
|
Overcast |
Low |
Yes |
Lost |
|
Sunny |
Low |
Yes |
Won |
|
Sunny |
Low |
No |
Lost |
|
Sunny |
High |
No |
Won |
|
Sunny |
Low |
Yes |
Won |
|
Sunny |
Low |
Yes |
Won |
Though
this data is apparently confusing, it looks as though a decision tree model may
help the captain get a clearer picture of the underlying situation. Hence, let
us build a decision tree model for the available data. As we can see from the
table, it is obvious that whenever the winning captain decides to bat first on
a day that is not sunny, the team loses the match. As such, this attribute
gives us the most information. Let us make this our first attribute for
splitting.
Then,
we look at the branch where some ambiguity still exists, that is it still has a
number of instances with both values of the dependent variable. We then realize
that the attribute Humidity will give us more clear information about what
happens when the day is sunny. We can see that the batting side wins when the
day is sunny and the humidity is high. Thus, Humidity is ideally found to be
the next attribute based on which the instances should be split.
Now
we realize that the number of regular batsmen in the team plays an important
role at this stage. If a team had more than six batsmen, it won the game. If
not, it lost. Therefore, we now have a decision tree that looks like the
diagram given on the following page. From this, the captain knows that
statistically, it is better to field first if the day is overcast. If the day is
sunny and the humidity is high, batting is advisable. Also, if the day is
sunny, the humidity is low, and the team contains at least six regular batsmen,
it is safe to bat first; otherwise they should field. Hence, the captain can
now make an informed decision after winning the toss.
Dependent Variable: Game Won or Lost




Humidity
Overcast Sunny
Lost
High Low
Won
No. of batsmen > 6
Yes No
Won
Lost
Thus, whenever he comes across a new instance, the
captain will just have to compare the attribute values for outlook, humidity
and number of batsmen to arrive at the best decision in the event that he wins
the toss.
In our experiments it was seen that J48 Decision
trees performed really well. In most cases, their performance was almost at par
with that of Support Vector Machines. In fact in several cases, it was seen
that J48 Decision Trees had a higher accuracy than either Naļve Bayes, or
Support Vector Machines.
Support
Vector Machines are supervised learning methods used for classification, as
well as regression. The advantage of Support Vector Machines is that they can
make use of certain kernels in order to transform the problem, such that we can
apply linear classification techniques to non-linear data. Applying the kernel
equations arranges the data instances in such a way within the
multi-dimensional space, that there is a hyper-plane that separates data
instances of one kind from those of another.
The
kernel equations may be any function that transforms the linearly non-separable
data in one domain into another domain where the instances become linearly
separable. Kernel equations may be linear, quadratic, Gaussian, or anything
else that achieves this particular purpose.
Once
we manage to divide the data into two distinct categories, our aim is to get
the best hyper-plane to separate the two types of instances. This hyper-plane
is important because it decides the target variable value for future
predictions. We should decide upon a hyper-plane that maximizes the margin
between the support vectors on either side of the plane. Support vectors are
those instances that are either on the separating planes on each side, or a
little on the wrong side. The explanatory diagrams that follow will make these
ideas a little more clear.
One
important thing to note about Support Vector Machines is that the data to be
separated needs to be binary. Even if the data is not binary, Support Vector
Machines handles it as though it is, and completes the analysis through a
series of binary assessments on the data.
Let
us now see an example of how Support Vector Machines work. Since it is easier
to understand the concept visually, we shall see the data instances in their
original form on a graph, then we shall see how the data instances are
separated upon the application of kernel functions on them, and how the best
hyper-plane is found. We shall also see what support vectors exactly are and
how does a margin look.
Original Space
![]()
A B
B
A B A
B
A B
B
A B
A B
A A
![]()
Idea
for diagram taken from class notes of Dr. Rich Maclin (Advanced Machine
Learning and Knowledge Discovery Databases)
![]()

![]()
![]()
f(B)
f(B) f(B) f(B)
f(A) f(B)
Support
f(B) f(B)
Vectors
f(A) f(A)
f(B)
f(A)
f(A)
Margin
f(A) f(A)
f(A)
Idea
for diagram taken from class notes of Dr. Rich Maclin (Advanced Machine
Learning and Knowledge Discovery Databases)
As can be seen from the above diagrams, the data
instances which were not linearly separable in the original domain have become
linearly separable in the new domain, due to the application of a function
(kernel) that transforms the position of the data points from one domain to
another. This is the basic idea behind Support Vector Machines and their kernel
techniques. Whenever a new instance is encountered in the original domain, the
same kernel function is applied to this instance too, and its position in the
new domain is found out. This position determines the binary target value to
which the new instance belongs.
In many cases, it is often seen that Support Vector
Machines perform the best among all machine-learning methods. It may be
interesting to recall that Ron Bekkerman also came to the conclusion that
Support Vector machines achieve a higher accuracy than Naļve Bayes, Maximum
Entropy or Wide Margin Winnow. Though Wide Margin Winnow does perform faster
and in some cases better than Support Vector Machines, on the whole Support
Vector Machines outperform any other classifier for the task of email
classification.
In our experiments too, it is seen that Support
Vector Machines usually have the highest accuracy among any of the other
classification methods. Of the 26 experiments that we carried out, it is seen
that Support Vector Machines have the highest accuracy in 16 cases, while in
most others it is a close second.
UNSUPERVISED METHOD:
We
have made use of SenseClusters in order to classify the email messages into
different user-defined folders. SenseClusters is a freely available package of
Perl programs, developed at the University of Minnesota Duluth, which can be
used for automatic text and document classification. The advantage of
SenseClusters is that it does not need any training data; it makes use of
unsupervised learning methods in order to classify the available data. To know
more about SenseClusters, kindly visit http://senseclusters.sourceforge.net/
In this section we will try to understand the
K-means clustering algorithm that has been used in SenseClusters. Clustering is
the process in which we divide the available data instances into a given number
of sub-groups. These sub-groups are called clusters, and hence the name
Clustering. To put it simply, the K-means algorithm outlines a method to
cluster a particular set of instances into K different clusters, where K is a
positive integer. It should be noted here that the K-means clustering algorithm
requires to know the number of clusters from the user. It cannot identify the
number of clusters by itself. However, SenseClusters now has the facility of
automatically identifying the number of clusters that the data may comprise of.
The K-means clustering algorithm starts by placing K
centroids as far away from each other as possible within the available space.
Then each of the available data instances is assigned a particular centroid,
depending on a metric like Euclidian distance, Manhattan distance, Minkowski
distance, etc. The position of the centroid is recalculated every time an
instance is added to the cluster and this continues until all the instances are
grouped into the final required number of clusters. Since recalculating the
cluster centroids may alter the cluster membership, the cluster memberships are
also verified once the position of the centroid changes. This process continues
till there is no further change in the cluster membership, and there is as
little change in the positions of the centroids as possible.
The initial position of the centroids is thus very
important since this position affects all the future steps in the K-means
clustering algorithm. Hence, it is always advisable to keep the cluster centers
as far away from each other as possible. If there are too many clusters, then
clusters that closely resemble each other and are in the vicinity of each other
are clubbed together. If there are too few clusters then clusters that are too
big and may contain two or more sub-groups of different data instances are
divided. The K-means clustering algorithm is thus a simple to understand, fairly
intuitive method by which we can divide the available data into sub-categories.
In our experiments, it was seen that in many cases
the accuracy of the unsupervised method was not far behind that of the
supervised methods. In fact, in a couple of cases, the unsupervised method
outperformed any of the supervised methods. This is very encouraging when we
consider the fact that we do not require a training set for unsupervised
learning methods. In many cases, it may be possible that when faced with the
trade-off between making training data available and sacrificing the accuracy
by 5-6%, people would prefer the latter option and hence move towards
unsupervised methods of learning, as against supervised learning methods. As
such, the results we have obtained look pretty encouraging, though unsupervised
learning methods may not perform as well as supervised learning methods.