Sense Tagged Text

This page contains versions of the Senseval-1, Senseval-2, line, hard, serve, and interest data that have been converted to a common format (Senseval-2), POS tagged, and parsed. We have also created a page that provides disambiguated name discrimination data and a topic-annotated version of the Enron Corpus.

Our general strategy has been to convert sense-tagged text to the Senseval-1 format using tools provided below, and then rely on the program Sval1to2 to convert from Senseval-1 to Senseval-2 format.

The POS tagging was done with the Brill tagger using our package posSenseval. The parsing was done with the Collins parser using our package parseSenseval.

Misc. Notes

The official DTD file as provided by the Senseval-2 organizers is here. Please note that our converted data will not "parse" as true XML text. This is because the original sense-tagged text does not escape characters that require special handling in XML, among other problems. We are considering ways to make this data "true" XML, and would be most grateful for any feedback on how best to do this. [TDP Feb 9, 2003]
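One possible first step toward "true" XML is to escape bare ampersands in the text while leaving existing entities and the Senseval-2 tags alone. The following is only a minimal sketch in Python, not part of any package on this page. The function name escape_bare_ampersands is hypothetical, and the sketch does not handle stray "<" or ">" characters in the contexts, which would need separate treatment.

```python
import re

# Assumption: an ampersand followed by a name or a numeric reference and a
# semicolon (e.g. &amp; or &#38;) is already an entity and is left untouched.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace ampersands that do not start an entity with &amp;."""
    return BARE_AMP.sub("&amp;", text)


if __name__ == "__main__":
    sample = "<context>AT&T and R&D are <head>lines</head> &amp; more</context>"
    print(escape_bare_ampersands(sample))
```

Because the pattern only touches ampersands, tags such as <head> and <context> pass through unchanged, so the output stays in the same Senseval-2 layout.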

We have a program OMtoSval2 that converts sense-tagged text in the Open Mind format to Senseval-2 format, although we do not provide any versions of the Open Mind data since that data is continually evolving.

We also provide several programs to help us verify Senseval-2 formatted data. These are found in the Sval2Check package, and will check the validity of a Senseval-2 formatted file (sval2parser.pl) and also identify duplicate instance ids and contexts (sval2dups.pl) which may signal problems in the data.
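To illustrate the kind of check described above, here is a minimal Python sketch that reports instance ids occurring more than once. It is not sval2dups.pl and makes no claim about how that program works. It assumes each instance opens with a tag of the form <instance id="...">, and the function name duplicate_instance_ids is hypothetical. It uses a regular expression rather than an XML parser because the converted data may not parse as XML.

```python
import re
from collections import Counter
from typing import List

# Assumption: instances open with <instance id="..."> (other attributes may follow).
INSTANCE_ID = re.compile(r'<instance\s+id="([^"]+)"')


def duplicate_instance_ids(text: str) -> List[str]:
    """Return the sorted instance ids that appear more than once in text."""
    counts = Counter(INSTANCE_ID.findall(text))
    return sorted(iid for iid, n in counts.items() if n > 1)


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py some-senseval2-file.xml
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for iid in duplicate_instance_ids(handle.read()):
            print(iid)
```

Duplicate contexts could be found the same way by counting the text between the context tags instead of the ids.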

Please note that we have only converted, tagged, and parsed the data; we did not do any of the sense-tagging! Please see the associated READMEs for proper crediting of the sense-taggers.

By: Ted Pedersen - tpederse AT d umn edu