Sense Tagged Text

This page contains versions of the Senseval-1, Senseval-2, line, hard, serve, and interest data that have been converted to a common format (Senseval-2), POS tagged, and parsed. A separate page provides disambiguated name discrimination data and a topic-annotated version of the Enron Corpus.

Our general strategy has been to convert sense-tagged text to the Senseval-1 format using tools provided below, and then rely on the program Sval1to2 to convert from Senseval-1 to Senseval-2 format.

The POS tagging was done with the Brill tagger using our package posSenseval. The parsing was done with the Collins parser using our package parseSenseval.

Senseval-1

12,000+ instances of 35 words as used in the Senseval-1 evaluation exercise.

There are some anomalies in the Senseval-1 data that are described in this README. We recommend that you use the corrected data (with fixes).

Senseval-1 format (with fixes)
Senseval-2 format (with fixes)
Senseval-2 format (with fixes and pos tags)
Senseval-2 format (with fixes and parsed)

Senseval-1 format (no fixes)
Senseval-2 format (no fixes)

Senseval-1 Dry Run

20,000+ instances of 38 words. Distributed prior to Senseval-1 as a practice run. It uses a different format than the Senseval-1 exercise, so a conversion program is included with the data. See the README for general information.

Senseval-2 format. Conversion tool.

Senseval-2 Data

12,000+ instances of 73 words. The original format is officially distributed here.

Senseval-2 data cleaned by posSenseval
Senseval-2 data with pos tags from posSenseval
Senseval-2 data parsed via parseSenseval

Senseval-2 keys in standard and SenseClusters format
Senseval-2 training data in plain text format (no xml markup)

Senseval-3 Data

57 different words, with the nouns and adjectives tagged with WordNet senses and the verbs with Wordsmyth senses. The official distribution of the data is here. Note that the Senseval-3 data uses the Senseval-2 format, so no conversion is necessary.

Senseval-3 data parsed

line

4000+ instances of the noun line, tagged with 6 WordNet senses. See the README for general information.

original format
original format with two duplicate instances removed. Conversion to Senseval-1 tool.
Senseval-1 format
Senseval-2 format
Senseval-2 format with pos tags
Senseval-2 format parsed

hard

4000+ instances of the adjective hard, tagged with 3 WordNet senses. See the README for general information.

original format
original format with ^M characters (carriage returns) and duplicate instances removed. Conversion to Senseval-1 tool.
Program to create unique instance ids in original hard data.
Senseval-1 format
Senseval-2 format
Senseval-2 format with pos tags
Senseval-2 format parsed

serve

4000+ instances of the verb serve, tagged with 4 WordNet senses. See the README for general information.

original format
original format with ^M characters (carriage returns) removed. Conversion to Senseval-1 tool.
Senseval-1 format
Senseval-2 format
Senseval-2 format with pos tags
Senseval-2 format parsed
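
The ^M characters removed from the hard and serve data are carriage returns, typically left over from DOS-style line endings. The following is a minimal Python sketch of that kind of cleanup, not the procedure actually used to prepare the files above; the function name is hypothetical.

```python
def strip_carriage_returns(data: bytes) -> bytes:
    """Return data with carriage returns (shown as ^M in many editors) removed."""
    # Convert DOS line endings to Unix ones first, then drop any stray CRs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"")
```

Working on bytes rather than decoded text avoids having to guess the encoding of the original files.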

interest

2369 instances of the noun interest from the ACL/DCI Treebank, tagged with 6 LDOCE senses. See the README for general information.

original format (ftp from nmsu) or a local copy. Conversion to Senseval-1 tool.
original format without POS tags

Senseval-1 format with original pos tags
Senseval-2 format with original pos tags
Senseval-2 format with pos tags
Senseval-2 format parsed

Senseval-1 format without POS tags
Senseval-2 format without POS tags

Misc. Notes

The official DTD file as provided by the Senseval-2 organizers is here. Please note that our converted data will not "parse" as true XML text. This is mainly because characters that require special handling in XML (such as & and <) are not escaped in the original sense-tagged text. We are considering ways to make this data "true" XML, and would be most grateful for any feedback on how to best do this. [TDP Feb 9, 2003]
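
One possible approach to the escaping problem, offered only as a hedged sketch and not as something applied to the data on this page, is to escape bare ampersands and any "<" that does not begin a known markup tag. The tag list below is an assumption about which elements appear in Senseval-2 formatted files and should be checked against the DTD; the function and pattern names are hypothetical.

```python
import re

# Tag names assumed to occur in Senseval-2 formatted files (an assumption;
# verify against the official DTD before relying on this list).
KNOWN_TAGS = ("corpus", "lexelt", "instance", "answer", "context", "head")

# An ampersand that does not already begin an entity such as &amp; or &#38;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

# A "<" that does not open or close a known tag, an XML declaration,
# a DOCTYPE declaration, or a comment.
BARE_LT = re.compile(
    r"<(?!/?(?:%s)\b|\?xml|!DOCTYPE|!--)" % "|".join(KNOWN_TAGS)
)


def escape_line(line: str) -> str:
    """Escape characters that would break an XML parser, leaving markup alone."""
    line = BARE_AMP.sub("&amp;", line)
    line = BARE_LT.sub("&lt;", line)
    return line
```

This leaves existing entities and the assumed markup tags untouched, so running it twice on the same line should give the same result as running it once. It does not address other reasons a file might fail to parse.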

We have a program OMtoSval2 that converts sense tagged text in the Open Mind format to Senseval-2 format, although we do not provide any versions of the Open Mind data since that is continually evolving.

We also provide several programs to help verify Senseval-2 formatted data. These are found in the Sval2Check package. They check the validity of a Senseval-2 formatted file (sval2parser.pl) and identify duplicate instance ids and contexts (sval2dups.pl), which may signal problems in the data.
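
To illustrate the kind of duplicate check described above, here is a minimal Python sketch. It is not sval2dups.pl and makes no claim about how that program works; it assumes instances look like <instance id="..."> ... <context> ... </context> ... </instance>, and it uses regular expressions rather than an XML parser because the converted data may not parse as true XML. All names are hypothetical.

```python
import re
from collections import Counter

# Assumed shape of an instance in a Senseval-2 formatted file.
INSTANCE = re.compile(r'<instance\s+id="([^"]+)"[^>]*>(.*?)</instance>', re.DOTALL)
CONTEXT = re.compile(r"<context>(.*?)</context>", re.DOTALL)


def find_duplicates(text):
    """Return (duplicate instance ids, duplicate contexts) found in text."""
    ids = Counter()
    contexts = Counter()
    for inst_id, body in INSTANCE.findall(text):
        ids[inst_id] += 1
        match = CONTEXT.search(body)
        if match:
            # Collapse whitespace so layout differences do not hide duplicates.
            contexts[" ".join(match.group(1).split())] += 1
    dup_ids = sorted(i for i, n in ids.items() if n > 1)
    dup_contexts = sorted(c for c, n in contexts.items() if n > 1)
    return dup_ids, dup_contexts
```

Calling find_duplicates on the contents of a data file returns two sorted lists; both being empty suggests that no instance id or context is repeated.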

Please note that we have only converted, tagged, and parsed the data; we did not do any of the sense-tagging! Please see the associated READMEs for proper crediting of the sense-taggers.

By: Ted Pedersen - tpederse AT d umn edu