=============================
OMtoSVAL2 Package
=============================
version 0.01
Copyright (C) 2002
Amruta Purandare, pura0010@d.umn.edu
Ted Pedersen, tpederse@umn.edu
University of Minnesota, Duluth
===============
1. Introduction
===============
We have developed two Perl programs that operate on the Open Mind sense-tagged
corpus [1]. The first converts this data into the Senseval-2 format [2], and
the second finds the rate of agreement among the taggers who contributed to the
Open Mind data.
======================
2. Package Description
======================
Our package consists of the following programs:
----------------
2.1 omwe2sval.pl
----------------
This program converts the Open-Mind sense tagged data into Senseval-2 format.
-----------------------------
2.1.1 How to run this program
-----------------------------
This program can be run using the command shown below:
omwe2sval.pl TAG_FILE INSTANCE_FILE
--------------
2.1.2 Input
--------------
The program accepts two files as described below.
------------------
2.1.2.a TAG_FILE
------------------
(called the OMWE-tagging file by the Open Mind team)
Each line in the TAG_FILE is a space separated, "instance tag" pair that shows
a tag assigned to an instance by a contributor. The sense tags are the
WordNet 1.7 sense keys. Instances which are tagged by multiple users will have
multiple entries in this file.
e.g. TAG_FILE ->
act.n.la.003 act%1:10:01::
act.n.la.003 act%1:10:02::
act.n.la.017 act%1:10:01::
act.n.la.018 act%1:10:01::
act.n.la.018 act%1:10:01::
act.n.la.020 act%1:10:02::
act.n.la.024 unclear
act.n.la.024 unlisted-sense
act.n.la.024 act%1:10:01::
act.n.la.024 unlisted-sense
The first column shows the instance ids and the second column shows the sense
tag assigned by a contributor.
--------------
SOME COMMENTS
--------------
When the same instance is tagged by multiple contributors, the tags assigned
may or may not match. In some cases the contributors agree and attach the same
tag to an instance, while in other cases they do not agree on a particular tag,
and hence the same instance is given different tags by different contributors.
A contributor may select 'unclear' if they are not clear about the meaning of
the word in an instance, or may choose 'unlisted-sense' if the meaning they
have in mind is not among the senses shown at the time of tagging.
This file should be sorted on the instance ids before giving it as the input to
the program.
[Warning - If the TAG_FILE or the Instance and Sense ids in the TAG_FILE do not
follow the Open Mind specified format, the behavior of our programs is
unpredictable.]
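Since the TAG_FILE is just sorted "instance tag" pairs, collecting the tags
assigned to each instance is a single pass over the file. The package itself is
written in Perl; the following Python sketch (the function name
`read_tag_file` is ours, not part of the package) illustrates the parsing
logic:

```python
from collections import defaultdict

def read_tag_file(lines):
    """Collect the sense tags assigned to each instance id.

    Each TAG_FILE line is a space-separated "instance tag" pair;
    instances tagged by several contributors occupy several lines.
    """
    tags = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        instance_id, tag = line.split()
        tags[instance_id].append(tag)
    return dict(tags)

sample = [
    "act.n.la.003 act%1:10:01::",
    "act.n.la.003 act%1:10:02::",
    "act.n.la.017 act%1:10:01::",
]
print(read_tag_file(sample)["act.n.la.003"])
```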
-------------------------
2.1.2.b INSTANCE_FILE
-------------------------
(called the 'ids-to-sentences' file by the Open Mind Team)
This file lists all instances in the Open-Mind database and follows the format
described in the README that comes with the Open-Mind data.
Each instance should be on a separate line showing
I target_word ? target_location Word/POS[/NE] [Word/POS[/NE] ..]
where I = Instance Id
target_word = Target word as it appears in Instance I
target_location = Location at which the target_word is found
in instance I when the words are counted from 0.
Word/POS[/NE] = Each word in Instance I with its POS tag and optional
Named Entity information separated by /
This information about the target word position is very useful as there could
be instances having multiple occurrences of the target word in the same form.
Example of an instance -
bum.n.la.017 bum ? 14 The/DT price/NN is/VBZ right/NN ,/, the/DT food/NN is/VBZ
good/NN and/CC nobody/NN gives/VBZ you/PRP the/DT bum/NN 's/POS rush/NN ./.
Here bum.n.la.017 is an instance id which uniquely identifies this instance in
the corpus.
'bum' is the target word and appears in exactly this form in the instance at
location 14 (the number specified after the ?). Remember that token counting
starts from 0, i.e. 'The' appears at the 0th location. This number (the
location of the target word) is then followed by the POS-tagged tokens of the
instance. The POS tag appears after the first '/', while some tokens also
carry Named Entity information after a second '/'.
For further information on these file formats, please refer to the README file
that comes with the Open Mind corpus.
This file should be sorted so that all the instances for the same target word
appear consecutively.
[Warning - If the INSTANCE_FILE or the Instance and Sense ids in the
INSTANCE_FILE do not follow the Open Mind specified format, the behavior of our
programs is unpredictable.]
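Parsing an INSTANCE_FILE line amounts to splitting off the first four fields
and stripping the POS (and optional NE) annotations from the remaining tokens.
The package is Perl; this Python sketch (the function name `parse_instance` is
ours) shows the idea using the bum.n.la.017 example from this section:

```python
def parse_instance(line):
    """Split an INSTANCE_FILE line into its parts.

    Format: I target_word ? target_location Word/POS[/NE] ...
    Token positions are counted from 0.
    """
    instance_id, target_word, _qmark, location, *tokens = line.split()
    words = [t.split("/")[0] for t in tokens]  # drop POS and optional NE
    return instance_id, target_word, int(location), words

line = ("bum.n.la.017 bum ? 14 The/DT price/NN is/VBZ right/NN ,/, the/DT "
        "food/NN is/VBZ good/NN and/CC nobody/NN gives/VBZ you/PRP the/DT "
        "bum/NN 's/POS rush/NN ./.")
iid, target, loc, words = parse_instance(line)
print(words[loc])  # the token at position 14 is the target word 'bum'
```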
---------------
2.1.3 Output
---------------
The program converts the instances listed in the Instance file (passed as the
2nd argument to this program) into Senseval-2 format using the Tag information
specified in the Tag file which is the first command line argument to this
program.
The following shows an example of this conversion.
Let's assume that the Instance file contains only the following instance,
which has 2 entries in the Tag file as shown below:
Instance File =>
act.n.tb.138 acts ? 11 Under/IN current/JJ law/NN ,/, such/JJ
suspects/NNS are/VBP immune/JJ from/IN prosecution/NN for/IN acts/NNS
committed/VBN while/IN not/RB British/JJ citizens/NNS ./.
Tag File =>
act.n.tb.138 act%1:10:02::
act.n.tb.138 unlisted-sense
Output =>
<instance id="act.n.tb.138">
<answer instance="act.n.tb.138" senseid="act%1:10:02::"/>
<answer instance="act.n.tb.138" senseid="unlisted-sense"/>
<context>
Under current law , such suspects are immune from prosecution for
<head>acts</head> committed while not British citizens .
</context>
</instance>
Note that the target word here is 'acts' at position 11 in the instance and is
marked with <head> and </head> tags in the Senseval-2 data file.
For more information on the Senseval-2 format, please refer to
http://www.senseval.org
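The conversion can be sketched as follows. The package itself is Perl; this
Python sketch (the function name `to_senseval2` is ours) assembles a
Senseval-2 lexical-sample entry — an <instance> element containing one
<answer> per tag and the plain-word context with the target token wrapped in
<head> tags — for the act.n.tb.138 example from this section:

```python
def to_senseval2(instance_id, tags, words, target_location):
    """Assemble one Senseval-2 lexical-sample entry: an <instance>
    element with one <answer> per assigned tag and the plain-word
    context, in which the target token is wrapped in <head> tags."""
    answers = "\n".join(
        f'<answer instance="{instance_id}" senseid="{tag}"/>' for tag in tags)
    context = " ".join(
        f"<head>{w}</head>" if i == target_location else w
        for i, w in enumerate(words))
    return (f'<instance id="{instance_id}">\n{answers}\n'
            f"<context>\n{context}\n</context>\n</instance>")

words = ("Under current law , such suspects are immune from prosecution "
         "for acts committed while not British citizens .").split()
print(to_senseval2("act.n.tb.138",
                   ["act%1:10:02::", "unlisted-sense"], words, 11))
```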
-----------------
2.1.4 By Products
-----------------
The program also creates various output files as byproducts in a directory
named 'output'.
----------------------
2.1.4.a notag.txt File
----------------------
This file will list the instance ids of the instances which are not yet
tagged. If an instance with instance id I in the Instance file doesn't have
any sense tag in the Tag file, the program will print I to the notag.txt file.
----------------------
2.1.4.b repeated.txt
----------------------
This file will list the instance ids which are repeated in the Instance file,
along with the number of times each is repeated. This occurs when two or more
instances use the same instance id, or when the same instance appears more
than once. When an instance id appears multiple times, the first instance
that uses it is kept while the rest are ignored.
---------------------
2.1.4.c mismatch.txt
---------------------
When the target word appears multiple times in the same form in the same
instance, the target location information tells us which occurrence is meant.
As a double check, we verify that the token present in the instance at the
specified target location (given at the beginning of the instance, after the ?
symbol) matches the target word specified (earlier in the instance, before the
? symbol). If this check fails, the program does not skip the instance; it
takes the target word location as correct and reports the instance id in the
mismatch.txt file.
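The double check itself is a one-line comparison. This Python sketch (the
function name `location_matches` is illustrative, not a routine from the
package) shows the test applied to the POS-tagged tokens:

```python
def location_matches(target_word, location, tokens):
    """Return True if the token at the given 0-based location (with its
    POS annotation stripped) equals the target word. A mismatch is only
    reported; the stated location is still trusted by the converter."""
    return tokens[location].split("/")[0] == target_word

tokens = "The/DT bum/NN gives/VBZ the/DT bum/NN rush/NN".split()
print(location_matches("bum", 4, tokens))   # token 4 is 'bum'
print(location_matches("bum", 2, tokens))   # token 2 is 'gives'
```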
----------------------
2.1.4.d noinstance.txt
----------------------
If an instance id found in the tag file has no corresponding entry in the
instance file, it will be reported to the noinstance.txt file.
-----------------
2.2 omwe-agree.pl
-----------------
This program shows the statistical distribution of the tagged instances for
each target word in the TAG_FILE, as described below.
The program divides the total tagged instances for each target word into
2 categories: ONE-TAG and MULTI-TAG.
If an instance is tagged by more than one contributor, it will have multiple
entries in the TAG_FILE and is called a MULTI-TAG instance.
On the other hand, if an instance is shown to a single contributor and has a
single entry in the TAG_FILE, it is counted as a ONE-TAG instance.
If an instance is tagged by more than one contributor (a MULTI-TAG instance)
and all the contributors assign the same sense tag, we say that the
contributors agreed on the tag. Otherwise, we say that they disagree, and the
instance is assigned multiple tags.
The omwe-agree.pl program computes the total number of MULTI-TAG instances for
which the contributors agree and those for which they disagree. In other words,
the output of this program shows the agreement and disagreement rates of the
MULTI-TAG instances per target word.
Here, the case in which an instance is shown to the same contributor multiple
times is treated the same as the case in which the instance is shown to
different contributors.
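The ONE-TAG/MULTI-TAG split and the agree/disagree counts can be computed in
one pass over the tag pairs. The program itself is Perl; this Python sketch
(the function name `agreement_stats` and the way the word is derived from the
instance id are our assumptions) illustrates the counting:

```python
from collections import defaultdict

def agreement_stats(tag_pairs):
    """Count ONE-TAG and MULTI-TAG instances per word and, among the
    MULTI-TAG ones, how many were tagged identically by everyone."""
    tags = defaultdict(list)
    for instance_id, tag in tag_pairs:
        tags[instance_id].append(tag)
    stats = defaultdict(lambda: {"instances": 0, "one": 0, "multi": 0,
                                 "agree": 0, "disagree": 0})
    for instance_id, assigned in tags.items():
        word = ".".join(instance_id.split(".")[:2])  # e.g. 'act.n.la.003' -> 'act.n'
        row = stats[word]
        row["instances"] += 1
        if len(assigned) == 1:
            row["one"] += 1            # a ONE-TAG instance
        else:
            row["multi"] += 1          # a MULTI-TAG instance
            if len(set(assigned)) == 1:
                row["agree"] += 1      # all contributors chose the same tag
            else:
                row["disagree"] += 1   # at least one contributor differed
    return dict(stats)

pairs = [
    ("act.n.la.003", "act%1:10:01::"),
    ("act.n.la.003", "act%1:10:02::"),
    ("act.n.la.017", "act%1:10:01::"),
    ("act.n.la.018", "act%1:10:01::"),
    ("act.n.la.018", "act%1:10:01::"),
]
print(agreement_stats(pairs)["act.n"])
```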
-------------
2.2.1 Input
-------------
The program accepts the Tag file as its input; this file is the same as the
one described earlier in section 2.1.2.a of this README.
-----------------------------
2.2.2 How to run this program
-----------------------------
The program can be run using the command shown below:
omwe-agree.pl OMWE-tagging
where OMWE-tagging is an input Tag file (format described in section 2.1.2.a
of this README).
-------------
2.2.3 Output
-------------
The program writes its output to the standard output device, showing the 2
tables described below.
---------------
2.2.3.a Table1
---------------
Table columns
WORD #INSTANCES ONE-TAG MULTI-TAG AGREE DISAGREE %AGREE %DISAGREE
-------------------------
Column Header Description
-------------------------
WORD #INSTANCES
----------------
These columns show various words found in the input TAG_FILE along with the
total number of tagged instances for these words.
ONE-TAG
--------
This column shows how many of the #INSTANCES have a single tag in the
TAG_FILE. These instances were shown to only one contributor and only once.
MULTI-TAG
-----------
This column shows how many of the #INSTANCES have multiple tags, i.e. were
shown to multiple contributors.
AGREE
------
This column shows the number of MULTI-TAG instances which have a single
distinct tag assigned by the various contributors. All contributors assigning
a tag to these instances agree on the same tag.
DISAGREE
--------
This column shows the number of MULTI-TAG instances which have more than one
distinct tag assigned by the contributors. When at least one contributor
assigning a tag to an instance disagrees with the others tagging the same
instance, we say the contributors disagree on the sense tag.
%AGREE
-------
This shows the percentage of the instances having multiple tags for which all
the contributors agree, i.e. (AGREE/MULTI-TAG)*100.
%DISAGREE
---------
This shows the percentage of the instances having multiple tags for which at
least one contributor disagrees, i.e. (DISAGREE/MULTI-TAG)*100.
e.g.
WORD #INSTANCES ONE-TAG MULTI-TAG AGREE DISAGREE %AGREE %DISAGREE
act.n 5 1 4 1 3 25.00 75.00
totals 5 1 4 1 3 25.00 75.00
This shows -
(1) A total of 5 instances are tagged for the word act.n
(2) Out of 5, one instance has just one tag while 4 have multiple tags
(3) Out of the 4 having multiple tags, the contributors agree on one
    instance, while for the other three the contributors disagree.
(4) 25% of the multi-tag instances (4) have agreement (1) and 75% have
    disagreement (3).
---------------
2.2.3.b Table2
---------------
This displays a histogram showing the number of instances with a specific
number of tags assigned.
Columns-
INSTANCES TAGS ASSIGNED
e.g.
INSTANCES TAGS
1 1
4 2
This shows that there is just 1 instance with 1 tag and 4 instances with 2
tags each. In other words, one instance has a single entry in the TAG_FILE
while the other 4 each have two entries in the TAG_FILE.
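Table2 is a histogram over the per-instance tag counts. This Python sketch
(the function name `tag_histogram` is ours; the program itself is Perl) shows
the two counting steps:

```python
from collections import Counter

def tag_histogram(tag_pairs):
    """Table2: map each tag count k to the number of instances that
    received exactly k tags in the TAG_FILE."""
    # First count how many TAG_FILE entries each instance has...
    per_instance = Counter(instance_id for instance_id, _tag in tag_pairs)
    # ...then count how many instances share each tag count.
    return Counter(per_instance.values())

pairs = [
    ("act.n.la.003", "act%1:10:01::"),
    ("act.n.la.003", "act%1:10:02::"),
    ("act.n.la.017", "act%1:10:01::"),
    ("act.n.la.018", "act%1:10:01::"),
    ("act.n.la.018", "act%1:10:01::"),
]
print(tag_histogram(pairs))  # one instance with 1 tag, two with 2 tags
```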
------------------
2.2.3.c notag2.txt
------------------
This is an output file which lists all the instances which have no tag in the
Input TAG_FILE.
--------------
2.2.4 Options
--------------
--agree A
Set A to a numeric value in [0-100] to see only those words which have
%AGREE greater than or equal to the threshold A. This is provided to filter
out the words whose %AGREE is below some threshold value; a special case is
to see only the words with 100% agreement.
--disagree D
Set D to a numeric value in [0-100] to see only those words which have
%DISAGREE greater than or equal to the threshold D. This is provided to
filter out the words whose %DISAGREE is below some threshold value; a special
case is to see only the words with 100% disagreement.
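Conceptually, both options act as row filters over Table1. A minimal Python
sketch (the row dictionaries and the field names `pct_agree`/`pct_disagree`
are illustrative, not the program's actual data structures):

```python
def filter_by_agreement(rows, agree=None, disagree=None):
    """Keep only the words whose %AGREE (resp. %DISAGREE) meets the
    given threshold, mirroring the --agree/--disagree options."""
    kept = rows
    if agree is not None:
        kept = [r for r in kept if r["pct_agree"] >= agree]
    if disagree is not None:
        kept = [r for r in kept if r["pct_disagree"] >= disagree]
    return kept

rows = [
    {"word": "act.n", "pct_agree": 25.0, "pct_disagree": 75.0},
    {"word": "bum.n", "pct_agree": 100.0, "pct_disagree": 0.0},
]
print(filter_by_agreement(rows, agree=100))     # only fully agreed words
print(filter_by_agreement(rows, disagree=50))   # only contentious words
```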
============
3. Copying
============
This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your option)
any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place - Suite 330, Boston, MA 02111-1307, USA.
Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.
=====================
4. ACKNOWLEDGMENTS
=====================
This work has been partially supported by a National Science Foundation
Faculty Early CAREER Development award (#0092784).
=================
5. REFERENCES
=================
[1] Open Mind Word Expert, [online] 2002, Available from
    http://www.teach-computers.org/word-expert.html, Accessed on
    12/14/2002.
[2] SENSEVAL: Evaluation exercises for Word Sense Disambiguation,
[online] 2002, Available from http://www.senseval.org/, Accessed on
12/14/2002.
==============
6. Contact us
==============
Thanks for using OMtoSVAL2. Please feel free to contact us if you have any
difficulty in using this software or if you have any additional comments and
suggestions to enhance its functionality.
Amruta Purandare
pura0010@d.umn.edu
(README last updated on 12/14/2002 -Amruta)