Thursday, October 24
Campus Center 120
5pm

Using the Web for Word Sense Disambiguation

Rada Mihalcea
Department of Computer Science
University of North Texas

Word Sense Disambiguation (WSD) is well known to be one of the hardest problems in Natural Language Processing, and yet it is a necessary step in a wide range of applications, including Machine Translation, Information Retrieval, and Knowledge Acquisition. While humans usually have no difficulty identifying the correct sense of an ambiguous word, the task turns out to be tremendously harder when it has to be performed by a computer.

So far, data-driven methods have produced the highest accuracy among WSD algorithms, as shown in recent WSD evaluation exercises. However, these corpus-based methods rely on manually annotated data, so their applicability is limited to a handful of tagged words. To get a sense of the size of the problem, consider that of the 20,000 ambiguous words in the common English vocabulary, manually annotated data has been produced for only about 150. Moreover, the shortage of tagged data grows by an order of magnitude when languages other than English are taken into consideration.

In this talk, I will address the data bottleneck in automatic WSD and describe several possible solutions for the automatic or semi-automatic acquisition of sense-tagged corpora. Specifically, I will present three methods for building annotated data:

(1) Substitution, which automatically identifies examples of non-ambiguous related words and then substitutes the ambiguous target word for them.
(2) Bootstrapping, which attempts to identify, in an iterative manner, reliable clues for classifying new examples of ambiguous words.
(3) Data collection over the Web, where millions of Web users can contribute their own knowledge to a data repository.

I will also compare the three methods in terms of their performance and cost.
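
As a rough illustration of the substitution idea in (1), here is a minimal Python sketch. It is not the speaker's implementation: the function name substitute_examples, the sense labels, the tiny corpus, and the hand-written list of unambiguous related words are all assumptions made for this example. In practice the related words would presumably come from a lexical resource and the sentences from a large corpus or the Web.

```python
import re

def substitute_examples(sentences, target, relatives_by_sense):
    """Build sense-tagged examples for `target` (hypothetical helper).

    For each sense, look for sentences containing an unambiguous related
    word, replace that word with the ambiguous target, and label the
    resulting sentence with the sense.
    """
    tagged = []
    for sense, relatives in relatives_by_sense.items():
        for relative in relatives:
            # Match the related word or phrase on word boundaries only.
            pattern = re.compile(r"\b" + re.escape(relative) + r"\b", re.IGNORECASE)
            for sentence in sentences:
                if pattern.search(sentence):
                    new_sentence = pattern.sub(lambda m: target, sentence)
                    tagged.append((new_sentence, sense))
    return tagged

# Toy data, assumed for illustration only.
corpus = [
    "She sat on the riverbank and watched the water.",
    "The savings bank raised its interest rates.",
    "He walked home in the rain.",
]
relatives = {
    "bank/river": ["riverbank"],
    "bank/finance": ["savings bank"],
}

examples = substitute_examples(corpus, "bank", relatives)
for sentence, sense in examples:
    print(sense, "|", sentence)
```

Each output pair is a sentence that now contains the ambiguous word "bank" together with a sense label obtained without manual annotation. The quality of such data depends on how closely the related word matches the intended sense.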