Anagha Kulkarni and Mahesh Joshi
M.S. students in Computer Science
University of Minnesota, Duluth

(Practice talks for the IICAI conference in Pune, India, December 2005)

Tuesday, December 13, 2005
4pm, HH 302

TITLE:
------
Name Discrimination and Email Clustering using Unsupervised Clustering and Labeling of Similar Contexts

AUTHORS:
--------
Anagha Kulkarni and Ted Pedersen

ABSTRACT:
---------
In this paper, we apply an unsupervised word sense discrimination technique based on clustering similar contexts (Purandare and Pedersen, 2004) to the problems of name discrimination and email clustering. Names of people, places, and organizations are not always unique, which can create a problem when we refer to or seek out information about such entities. When this occurs in written text, we show that we can cluster ambiguous names into distinct groups by identifying which contexts are similar to each other. Pedersen, Purandare, and Kulkarni (2005) previously showed that this approach can successfully discriminate among names with two-way ambiguity. Here we show that it can be extended to multiway distinctions as well, and we adapt the cluster labeling technique introduced by Kulkarni (2005) to these multiway name discrimination problems. Along similar lines, we observe that email messages can be treated as contexts, and that in clustering them together we are able to group them by their underlying content rather than by the occurrence of specific strings.

TITLE:
------
A Comparative Study of Support Vector Machines Applied to the Supervised Word Sense Disambiguation Problem in the Medical Domain

AUTHORS:
--------
Mahesh Joshi, Ted Pedersen, and Rich Maclin

ABSTRACT:
---------
We have applied five supervised learning approaches to word sense disambiguation in the medical domain.
Our objective is to evaluate Support Vector Machines (SVMs) against other well-known supervised learning algorithms: the naive Bayes classifier, C4.5 decision trees, decision lists, and boosting. Based on these results we introduce further refinements of these approaches. We use unigram and bigram features selected with different frequency cut-off values and window sizes, along with the log-likelihood measure as a test of statistical significance for bigrams. Our results show that, overall, the best SVM model was most accurate in 27 of 60 cases, compared to 22, 14, 10, and 14 for the naive Bayes, C4.5 decision tree, decision list, and boosting methods respectively.
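The second abstract mentions selecting bigram features with a log-likelihood test of statistical significance. The idea can be sketched as computing the G-squared statistic over a 2x2 contingency table of bigram counts; this is a generic illustration with invented counts, not the authors' code or data.

```python
# Hedged sketch: scoring a candidate bigram feature with the
# log-likelihood ratio (G^2) over a 2x2 contingency table.
# All counts below are invented for illustration.
from math import log

def log_likelihood(n11, n12, n21, n22):
    """G^2 statistic for a bigram (w1, w2):
    n11 = count(w1 w2), n12 = count(w1, not w2),
    n21 = count(not w1, w2), n22 = count(not w1, not w2)."""
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    # Expected counts under the independence hypothesis.
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    observed = [n11, n12, n21, n22]
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

# A bigram whose words co-occur far more often than chance scores high;
# 3.84 is the usual chi-square cutoff at p < 0.05 with one degree of freedom.
score = log_likelihood(30, 10, 15, 945)
print(score > 3.84)  # prints True
```

A bigram is kept as a feature only when its score exceeds the chosen significance cutoff; a perfectly independent table (all expected counts equal to the observed ones) scores zero.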
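The first abstract's core idea, grouping occurrences of an ambiguous name by the similarity of their surrounding contexts, can be illustrated with a toy sketch. This is not the authors' method (which uses the Purandare and Pedersen (2004) feature representations): it is a minimal stand-in using bag-of-words vectors, cosine similarity, and a greedy single-pass grouping, with an invented corpus and an arbitrary similarity threshold.

```python
# Minimal sketch of clustering ambiguous-name contexts by lexical
# similarity. Corpus, threshold, and clustering strategy are all
# illustrative assumptions, not the paper's actual approach.
from collections import Counter
from math import sqrt

def vectorize(context):
    """Bag-of-words count vector for one context (lowercased tokens)."""
    return Counter(context.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def cluster(contexts, threshold=0.4):
    """Greedy single-pass grouping: join a context to the first cluster
    whose first member is similar enough, else start a new cluster."""
    clusters = []  # each cluster is a list of (index, vector) pairs
    for i, ctx in enumerate(contexts):
        v = vectorize(ctx)
        for members in clusters:
            if cosine(v, members[0][1]) >= threshold:
                members.append((i, v))
                break
        else:
            clusters.append([(i, v)])
    return [[i for i, _ in members] for members in clusters]

# Two senses of "George Miller": the psychologist and the film director.
contexts = [
    "george miller the psychologist developed wordnet at princeton",
    "wordnet was created by the psychologist george miller",
    "george miller directed the road warrior and mad max films",
    "the director george miller made another mad max film",
]
print(cluster(contexts))  # prints [[0, 1], [2, 3]]
```

The contexts about the psychologist share words like "wordnet" and "psychologist", while the director contexts share "mad", "max", and "the", so the two senses fall into separate clusters even though the name string itself is identical in all four.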