Senseval-2 Format Converters
These programs convert data from a variety of formats into the
Senseval-2 lexical sample format, which is the format that all of our
word sense disambiguation tools use.
WePS Format to Senseval-2 Format
The weps2sval2 program takes the official format of the
Web People Search (WePS)
task of Semeval-2007 and converts it into Senseval-2 format. This
allows us to cluster and evaluate the WePS data using SenseClusters.
weps2sval2 (v 0.02) (released 05/27/08)
English Giga Word or Plain Text to Senseval-2 with Conflated Words
The nameconflate program creates an artificially ambiguous word by
conflating N words found in the English Giga Word corpus or in plain text
into a single word. The output is in the lexical sample format of
Senseval-2: the conflated "ambiguous" word is placed in the text, and
the information about the original "unambiguous" word is maintained.
This can then be used as test data for lexical sample word
sense disambiguation programs.
nameconflate (v 0.16) (released 04/10/06)
Readme and some
Name Conflate data created from English Giga Word text.
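The conflation idea described above can be illustrated with a minimal Python sketch. This is not the nameconflate code: the function names, the pseudo-word naming scheme (names joined with underscores), the instance id scheme, and the simplified Senseval-2 style markup are all assumptions made for illustration. It marks only the first matching name on each line and keeps the original name as the sense tag.

```python
import re
from xml.sax.saxutils import escape

def conflate(lines, names, pseudo=None):
    """Replace the first occurrence of any name on each line with one pseudo-word.

    Returns the pseudo-word and a list of
    (instance_id, original_name, context_markup) tuples.
    """
    # Hypothetical naming scheme: join the conflated names with underscores.
    pseudo = pseudo or "_".join(names)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    instances = []
    for line in lines:
        m = pattern.search(line)
        if m is None:
            continue  # lines without any of the names are skipped
        before, after = line[:m.start()], line[m.end():]
        context = f"{escape(before)}<head>{escape(pseudo)}</head>{escape(after)}"
        instances.append((f"{pseudo}.{len(instances) + 1}", m.group(1), context))
    return pseudo, instances

def to_sval2(pseudo, instances):
    """Render instances in a simplified Senseval-2 lexical sample layout."""
    out = ['<corpus lang="english">', f'<lexelt item="{pseudo}">']
    for inst_id, original, context in instances:
        out.append(f'<instance id="{inst_id}">')
        # The original, unambiguous name serves as the "sense" of the pseudo-word.
        out.append(f'<answer instance="{inst_id}" senseid="{original}"/>')
        out.append(f'<context>{context}</context>')
        out.append('</instance>')
    out += ['</lexelt>', '</corpus>']
    return "\n".join(out)

if __name__ == "__main__":
    text = ["bush met reporters", "no names here", "then clinton spoke"]
    pseudo, insts = conflate(text, ["bush", "clinton"])
    print(to_sval2(pseudo, insts))
```

Because the original name is kept as the answer, a disambiguation or clustering system can be scored on how well it recovers which name each conflated occurrence came from.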
English Giga Word to Plain Text
The program giga2plain.pl allows you to convert English GigaWord corpus files into plain text.
It normalizes the text by removing numeric values and punctuation,
removing date and title lines, and converting all text to lower case.
giga2plain.pl (released 04/03/11)
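As a rough illustration of the kind of normalization described above, here is a small Python sketch. It is not a port of giga2plain.pl: the function name is hypothetical, the exact rules (for example, replacing punctuation with spaces rather than deleting it) are assumptions, and the removal of date and title lines is omitted because it depends on the corpus markup.

```python
import re
import string

# Map every ASCII punctuation character to a space (an assumed policy).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalize(text):
    """Lower-case text, drop digit runs and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove numeric values
    text = text.translate(_PUNCT_TO_SPACE)  # remove punctuation
    return " ".join(text.split())           # collapse runs of whitespace

if __name__ == "__main__":
    print(normalize("The U.S. spent $3,000 in 2005!"))
```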
Plain Text to HeadLess Senseval-2 Format
The text2headless program takes plain text and converts it into our
headless format, which means it is formatted like Senseval-2 text, except
that no target word is specified. This is useful for the headless
clustering supported in SenseClusters.
text2headless (v 0.03) (released 10/15/05)
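To make the idea of a headless format concrete, the following Python sketch wraps each non-empty line of plain text in a Senseval-2 style instance that contains no head tag. It is only an approximation under stated assumptions: the function name, the placeholder lexelt value, and the instance id scheme are hypothetical, and the actual output of text2headless may differ.

```python
from xml.sax.saxutils import escape

def text_to_headless(lines, lexelt="LEXELT", prefix="ctx"):
    """Wrap each non-empty line in an instance whose context has no target word."""
    out = ['<corpus lang="english">', f'<lexelt item="{lexelt}">']
    n = 0
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        n += 1
        out.append(f'<instance id="{prefix}.{n}">')
        # No <head> element is emitted: the context has no marked target word.
        out.append(f'<context>{escape(text)}</context>')
        out.append('</instance>')
    out += ['</lexelt>', '</corpus>']
    return "\n".join(out)

if __name__ == "__main__":
    print(text_to_headless(["first context", "", "a < b"]))
```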
National Library of Medicine WSD Test Collection to Senseval-2
The nlm2sval2 program converts the
NLM WSD Test Collection into the
Senseval-2 format. The NLM WSD Test Corpus consists of 50 ambiguous
words that occur in medical journal abstracts (from Medline) that have
been manually sense tagged. Please note that this uses the PMID basic
nlm2sval2 (v 0.01) (released 1/19/05)
See the Readme!
Senseval-1 Data Conversion to Senseval-2 Format
The Senseval-1 and Senseval-2 data formats are different. We provide
Perl code that converts the Senseval-1 data into the Senseval-2
format, for both the preliminary "dry run" data and the actual
competition data. We have also converted the widely used "line", "hard",
"serve" and "interest" corpora into this format. You can find
ready-to-use converted versions of this data here! The
software that carried out the conversion is available below, and has
various options that allow you to customize the output.
Sval1to2-v0.31 (released 02/06/03)
Open Mind Data Conversion to Senseval-2 Format
The sense-tagged data collected by the
Open Mind Word Expert is formatted differently from the Senseval-2
data. We provide Perl code that converts the Open Mind data into the
Senseval-2 format and also reports the agreement rate among the
Open Mind taggers.
OMtoSVAL2-v0.1 (released 01/06/03)
Sample Agreement Report
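One common way to measure agreement among taggers is the fraction of tagger pairs that assign the same sense to the same item. The Python sketch below illustrates that measure only. It is an assumption that this resembles what OMtoSVAL2 reports, and the function name and input layout are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(tags_by_item):
    """Fraction of tagger pairs, over all items, that chose the same sense.

    tags_by_item maps an item id to the list of sense tags assigned to it.
    Returns 0.0 when no item has at least two tags.
    """
    agree = 0
    total = 0
    for tags in tags_by_item.values():
        for a, b in combinations(tags, 2):
            total += 1
            if a == b:
                agree += 1
    return agree / total if total else 0.0

if __name__ == "__main__":
    data = {
        "item1": ["s1", "s1"],
        "item2": ["s1", "s2"],
        "item3": ["s2", "s2", "s1"],
    }
    print(pairwise_agreement(data))
```

This raw rate does not correct for chance agreement. A chance-corrected statistic such as kappa would be a different calculation.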
- tpederse AT d umn edu