Anagha Kulkarni and Mahesh Joshi
M.S. students in Computer Science
University of Minnesota, Duluth

(Practice talks for the IICAI conference in Pune, India, December 2005)

Tuesday, December 13, 2005
4pm, HH 302

TITLE:
------
Name Discrimination and Email Clustering using Unsupervised Clustering and Labeling of Similar Contexts

AUTHORS:
--------
Anagha Kulkarni and Ted Pedersen

ABSTRACT:
---------
In this paper, we apply an unsupervised word sense discrimination technique based on clustering similar contexts (Purandare and Pedersen, 2004) to the problems of name discrimination and email clustering. Names of people, places, and organizations are not always unique, which can create a problem when we refer to or seek out information about such entities. When this occurs in written text, we show that we can cluster ambiguous names into distinct groups by identifying which contexts are similar to each other. Pedersen, Purandare, and Kulkarni (2005) previously showed that this approach can successfully discriminate among names with two-way ambiguity. Here we show that it can be extended to multiway distinctions as well, and we adapt the cluster labeling technique introduced by Kulkarni (2005) to these multiway name discrimination problems. Along similar lines, we observe that email messages can be treated as contexts, and that in clustering them together we are able to group them by their underlying content rather than by the occurrence of specific strings.

TITLE:
------
A Comparative Study of Support Vector Machines Applied to the Supervised Word Sense Disambiguation Problem in the Medical Domain

AUTHORS:
--------
Mahesh Joshi, Ted Pedersen, and Rich Maclin

ABSTRACT:
---------
We have applied five supervised learning approaches to word sense disambiguation in the medical domain.
Our objective is to evaluate Support Vector Machines (SVMs) against other well-known supervised learning algorithms: the naive Bayes classifier, C4.5 decision trees, decision lists, and boosting. Based on these results we introduce further refinements of these approaches. We use unigram and bigram features selected with different frequency cut-off values and window sizes, along with the log-likelihood measure as a test of statistical significance for bigrams. Our results show that, overall, the best SVM model was most accurate in 27 of 60 cases, compared to 22, 14, 10, and 14 for the naive Bayes, C4.5 decision tree, decision list, and boosting methods respectively.
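The second abstract mentions selecting bigram features with a log-likelihood test of statistical significance. The idea can be sketched as computing the G-squared statistic over a 2x2 contingency table of bigram counts; this is a generic illustration with invented counts, not the authors' code or data.

```python
# Hedged sketch: scoring a candidate bigram feature with the
# log-likelihood ratio (G^2) over a 2x2 contingency table.
# All counts below are invented for illustration.
from math import log

def log_likelihood(n11, n12, n21, n22):
    """G^2 statistic for a bigram (w1, w2):
    n11 = count(w1 w2), n12 = count(w1, not w2),
    n21 = count(not w1, w2), n22 = count(not w1, not w2)."""
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    # Expected counts under the independence hypothesis.
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    observed = [n11, n12, n21, n22]
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

# A bigram whose words co-occur far more often than chance scores high;
# 3.84 is the usual chi-square cutoff at p < 0.05 with one degree of freedom.
score = log_likelihood(30, 10, 15, 945)
print(score > 3.84)  # prints True
```

A bigram is kept as a feature only when its score exceeds the chosen significance cutoff; a perfectly independent table (all expected counts equal to the observed ones) scores zero.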
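The first abstract's core idea, grouping occurrences of an ambiguous name by the similarity of their surrounding contexts, can be illustrated with a toy sketch. This is not the authors' method (which uses the Purandare and Pedersen (2004) feature representations): it is a minimal stand-in using bag-of-words vectors, cosine similarity, and a greedy single-pass grouping, with an invented corpus and an arbitrary similarity threshold.

```python
# Minimal sketch of clustering ambiguous-name contexts by lexical
# similarity. Corpus, threshold, and clustering strategy are all
# illustrative assumptions, not the paper's actual approach.
from collections import Counter
from math import sqrt

def vectorize(context):
    """Bag-of-words count vector for one context (lowercased tokens)."""
    return Counter(context.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def cluster(contexts, threshold=0.4):
    """Greedy single-pass grouping: join a context to the first cluster
    whose first member is similar enough, else start a new cluster."""
    clusters = []  # each cluster is a list of (index, vector) pairs
    for i, ctx in enumerate(contexts):
        v = vectorize(ctx)
        for members in clusters:
            if cosine(v, members[0][1]) >= threshold:
                members.append((i, v))
                break
        else:
            clusters.append([(i, v)])
    return [[i for i, _ in members] for members in clusters]

# Two senses of "George Miller": the psychologist and the film director.
contexts = [
    "george miller the psychologist developed wordnet at princeton",
    "wordnet was created by the psychologist george miller",
    "george miller directed the road warrior and mad max films",
    "the director george miller made another mad max film",
]
print(cluster(contexts))  # prints [[0, 1], [2, 3]]
```

The contexts about the psychologist share words like "wordnet" and "psychologist", while the director contexts share "mad", "max", and "the", so the two senses fall into separate clusters even though the name string itself is identical in all four.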