Thursday, September 12
5:00pm 
Campus Center 120

Unlabeling and Annotating: What I Did on my Sabbatical Vacation
Rich Maclin

Associate Professor
Department of Computer Science
University of Minnesota, Duluth

This talk will cover two of the research topics I pursued during my 
recent sabbatical. In the first part of the talk I will discuss results  
from my poster with Mark Craven presented at ISMB 2002 entitled  
``Automatically Extracting Keyphrases for Clusters of Genes.'' With the  
growing use and importance of high-throughput methods such as
microarrays, one emerging problem is how to find commonalities
among the clusters of genes that respond similarly in a high-throughput
experiment. This task normally involves an extensive literature search by an
expert. In this work we have developed a server, available on the web,  
that can assist in this task by analyzing the literature pertaining to a  
cluster of genes and finding keyphrases that seem to be characteristic of  
that cluster. Initial results from this work are promising, and this is an
ongoing area of research.

In the second part of my talk I will discuss results from my KDD 2002
paper with Kristin Bennett and Ayhan Demiriz entitled ``Exploiting 
Unlabeled Data in Ensemble Methods.'' This work presents a general method  
for taking advantage of unlabeled data available for a classification  
task. A classification task generally involves learning a model of a  
classification problem (e.g., which of these patients will develop  
cancer) based on information we can measure from each instance (e.g.,
results of tests performed or family history data). Standard
classification methods, however, usually require an expert to ``label'' a
large number of examples as members or non-members of the class of
interest. This process is time-consuming and in some cases impossible.
It is often the case, though, that one can obtain large amounts of data
that have not been labeled with a class. We present a technique for making use of
such data in a general manner based on ensemble methods. The resulting  
method won the NIPS 2001 contest for semi-supervised methods.