=============================
      OMtoSVAL2 Package
=============================

version 0.01

Copyright (C) 2002

Amruta Purandare, pura0010@d.umn.edu
Ted Pedersen, tpederse@umn.edu

University of Minnesota, Duluth

===============
1. Introduction
===============

We have developed two Perl programs that operate on the Open Mind sense
tagged corpus [1]. The first converts this data into the Senseval-2
format [2], and the second finds the rate of agreement among the taggers
who contributed to the Open Mind data.

======================
2. Package Description
======================

Our package consists of the following programs:

----------------
2.1 omwe2sval.pl
----------------

This program converts the Open Mind sense tagged data into Senseval-2
format.

-----------------------------
2.1.1 How to run this program
-----------------------------

This program can be run using the command shown below:

 omwe2sval.pl TAG_FILE INSTANCE_FILE

-----------
2.1.2 Input
-----------

The program accepts two files, as described below.

----------------
2.1.2.a TAG_FILE
----------------

(called the OMWE-tagging file by the Open Mind team)

Each line in the TAG_FILE is a space-separated "instance tag" pair that
shows a tag assigned to an instance by a contributor. The sense tags are
WordNet 1.7 sense keys. Instances that are tagged by multiple
contributors will have multiple entries in this file.

e.g. TAG_FILE ->

 act.n.la.003 act%1:10:01::
 act.n.la.003 act%1:10:02::
 act.n.la.017 act%1:10:01::
 act.n.la.018 act%1:10:01::
 act.n.la.018 act%1:10:01::
 act.n.la.020 act%1:10:02::
 act.n.la.024 unclear
 act.n.la.024 unlisted-sense
 act.n.la.024 act%1:10:01::
 act.n.la.024 unlisted-sense

This shows instance ids in the first column and the sense tags assigned
by a contributor in the second column.

-------------
SOME COMMENTS
-------------

When the same instance is tagged by multiple contributors, the tags
assigned may or may not match. In some cases the contributors agree and
attach the same tag to an instance, while in other cases they do not
agree on a particular tag, and the same instance is given different tags
by different contributors. A contributor may select 'unclear' if the
meaning of the word in an instance is not clear, or 'unlisted-sense' if
the meaning they have in mind is not among the senses shown at the time
of tagging.

This file should be sorted on the instance ids before it is given as
input to the program.

[Warning - If the TAG_FILE or the Instance and Sense ids in the TAG_FILE
do not follow the Open Mind specified format, the behavior of our
programs is unpredictable.]

---------------------
2.1.2.b INSTANCE_FILE
---------------------

(called the 'ids-to-sentences' file by the Open Mind team)

This file lists all instances in the Open Mind database and follows the
format described in the README that comes with the Open Mind data. Each
instance should be on a separate line showing

 I target_word ? target_location Word/POS[/NE] [Word/POS[/NE] ..]

where

 I               = Instance id
 target_word     = Target word as it appears in instance I
 target_location = Location at which the target_word is found in
                   instance I, counting the words from 0
 Word/POS[/NE]   = Each word in instance I with its POS tag and optional
                   Named Entity information, separated by /

This target location information is needed because an instance could
contain multiple occurrences of the target word in the same form.

Example of an instance -

 bum.n.la.017 bum ? 14 The/DT price/NN is/VBZ right/NN ,/, the/DT
 food/NN is/VBZ good/NN and/CC nobody/NN gives/VBZ you/PRP the/DT
 bum/NN 's/POS rush/NN ./.

Here bum.n.la.017 is an instance id that uniquely identifies this
instance in the corpus. 'bum' is the target word and appears in the same
form in the instance at location 14 (the number specified after the ?).
Remember that token counting starts from 0, i.e. 'The' appears at
location 0. This number (the location of the target word) is then
followed by the POS-tagged tokens of the instance. The POS tag is shown
after the first '/', while some tokens also carry Named Entity
information after a second '/'.
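As an illustration, the following minimal Perl sketch (not part of this
package) splits the example line above into its instance id, target
word, target location, and tokens:

 #!/usr/bin/perl -w
 # parse_instance.pl - illustrative sketch, not part of OMtoSVAL2
 use strict;

 my $line = "bum.n.la.017 bum ? 14 The/DT price/NN is/VBZ right/NN ,/, " .
            "the/DT food/NN is/VBZ good/NN and/CC nobody/NN gives/VBZ " .
            "you/PRP the/DT bum/NN 's/POS rush/NN ./.";

 # split on whitespace: id, target word, '?', target location, tokens
 my ($id, $target, $qmark, $location, @tagged) = split /\s+/, $line;

 # strip the POS (and optional NE) information to recover the raw words
 my @words = map { (split /\//)[0] } @tagged;

 print "instance id    : $id\n";
 print "target word    : $target\n";
 print "target location: $location\n";
 print "word at target : $words[$location]\n";   # prints 'bum'

Because the POS (and optional NE) information is attached with '/',
keeping only the text before the first '/' recovers the raw words, and
the word at index 14 is indeed 'bum'.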
For further information on these file formats, please refer to the
README file that comes with the Open Mind corpus.

This file should be sorted so that all the instances for the same target
word appear consecutively.

[Warning - If the INSTANCE_FILE or the Instance and Sense ids in the
INSTANCE_FILE do not follow the Open Mind specified format, the behavior
of our programs is unpredictable.]

------------
2.1.3 Output
------------

The program converts the instances listed in the Instance file (passed
as the 2nd argument to this program) into Senseval-2 format using the
tag information specified in the Tag file, which is the first command
line argument to this program. The following shows an example of this
conversion. Let us assume that the following is the only instance in the
Instance file and that it has 2 entries in the Tag file, as shown below.

Instance File =>

 act.n.tb.138 acts ? 11 Under/IN current/JJ law/NN ,/, such/JJ
 suspects/NNS are/VBP immune/JJ from/IN prosecution/NN for/IN acts/NNS
 committed/VBN while/IN not/RB British/JJ citizens/NNS ./.

Tag File =>

 act.n.tb.138 act%1:10:02::
 act.n.tb.138 unlisted-sense

Output =>

 Under current law , such suspects are immune from prosecution for
 <head>acts</head> committed while not British citizens .

Note that the target word here is 'acts' at position 11 in the instance
and is marked with <head> tags in the Senseval-2 data file. For more
information on the Senseval-2 format, please refer to
http://www.senseval.org
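For reference, a complete Senseval-2 lexical sample entry for this
instance could look roughly as follows. This is only an illustration of
the general Senseval-2 markup (instance, answer, and context elements,
with the target word inside <head> tags); the exact attributes written
by omwe2sval.pl may differ.

 <instance id="act.n.tb.138">
 <answer instance="act.n.tb.138" senseid="act%1:10:02::"/>
 <answer instance="act.n.tb.138" senseid="unlisted-sense"/>
 <context>
 Under current law , such suspects are immune from prosecution for
 <head>acts</head> committed while not British citizens .
 </context>
 </instance>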
-----------------
2.1.4 By Products
-----------------

The program also creates various output files as byproducts in a
directory named output.

----------------------
2.1.4.a notag.txt File
----------------------

This file lists the instance ids of the instances that are not yet
tagged. If an instance with instance id I in the Instance file does not
have any sense tag in the Tag file, the program prints I to the
notag.txt file.

--------------------
2.1.4.b repeated.txt
--------------------

This file lists the instance ids that are repeated in the Instance file,
along with the number of times they are repeated. This will occur if two
or more instances use the same instance id, or if the same instance is
repeated more than once. When an instance id appears multiple times, the
first instance that uses this instance id is kept while the rest of the
instances are ignored.

--------------------
2.1.4.c mismatch.txt
--------------------

When the target word appears multiple times in the same form in the same
instance, we can decide which occurrence is meant using the target
location information. As a double check, the program verifies that the
token present in the instance at the specified target location (the
number given after the ? symbol) matches the target word specified
earlier in the instance (before the ? symbol). If this check fails, the
program does not skip the instance; it takes the target word location as
correct and reports the instance id in the mismatch.txt file.

----------------------
2.1.4.d noinstance.txt
----------------------

If an instance id found in the Tag file has no corresponding entry in
the Instance file, it is reported to the noinstance.txt file.

-----------------
2.2 omwe-agree.pl
-----------------

This program shows the statistical distribution of the tagged instances,
for each target word in the TAG_FILE, as described below.

The program divides the total tagged instances for each target word into
2 categories, ONE-TAG and MULTI-TAG. If an instance is tagged by more
than one contributor, it will have multiple entries in the TAG_FILE and
is called a MULTI-TAG instance. On the other hand, if an instance is
shown to a single contributor and has a single entry in the TAG_FILE, it
is counted as a ONE-TAG instance.

If an instance is tagged by more than one contributor (a MULTI-TAG
instance) and all the contributors assign the same sense tag, we say
that the contributors agree on the tag. Otherwise, we say that they
disagree, and the instance is assigned multiple distinct tags. The
omwe-agree.pl program computes the total number of MULTI-TAG instances
for which the contributors agree and those for which they disagree. In
other words, the output of this program shows the agreement and
disagreement rate of the MULTI-TAG instances per target word. Here, the
case in which an instance is shown to the same contributor multiple
times is treated the same as the case in which an instance is shown to
different contributors.

-----------
2.2.1 Input
-----------

The program accepts the Tag file as its input; this file is the same as
the one described in section 2.1.2.a of this README.

-----------------------------
2.2.2 How to run this program
-----------------------------

The program can be run using the command shown below:

 omwe-agree.pl OMWE-tagging

where OMWE-tagging is an input Tag file (format described in section
2.1.2.a of this README).

------------
2.2.3 Output
------------

The output of this program is written to the standard output and shows 2
tables, as described below.

--------------
2.2.3.a Table1
--------------

Table columns

 WORD #INSTANCES ONE-TAG MULTI-TAG AGREE DISAGREE %AGREE %DISAGREE

-------------------------
Column Header Description
-------------------------

WORD #INSTANCES
---------------

These columns show the various words found in the input TAG_FILE along
with the total number of tagged instances for these words.

ONE-TAG
-------

This column shows how many of the #INSTANCES have a single tag in the
TAG_FILE. These are shown to only one contributor and only once.

MULTI-TAG
---------

This column shows how many of the #INSTANCES have multiple tags, i.e.
are shown to multiple contributors (or to the same contributor more than
once).

AGREE
-----

This column shows the number of MULTI-TAG instances that have a single
distinct tag assigned by the various contributors. All contributors
assigning a tag to these instances agree on the same tag.

DISAGREE
--------

This column shows the number of MULTI-TAG instances that have more than
one distinct tag assigned by the contributors. When at least one
contributor assigning a tag to an instance disagrees with the others
assigning tags to the same instance, we say the contributors disagree on
the sense tag.

%AGREE
------

This shows the percentage of the instances having multiple tags for
which all the contributors agree, i.e.

 AGREE/MULTI-TAG*100

%DISAGREE
---------

This shows the percentage of the instances having multiple tags for
which at least one contributor disagrees, i.e.

 DISAGREE/MULTI-TAG*100

e.g.

 WORD    #INSTANCES  ONE-TAG  MULTI-TAG  AGREE  DISAGREE  %AGREE  %DISAGREE
 act.n   5           1        4          1      3         25.00   75.00
 totals  5           1        4          1      3         25.00   75.00

Shows -

(1) A total of 5 instances are tagged for the word act.n.
(2) Out of 5, one instance has just one tag while 4 have multiple tags.
(3) Out of the 4 having multiple tags, the contributors agree for one
    instance, while for the other three the contributors disagree.
(4) 25% of the 4 multi-tag instances show agreement (1 instance) and 75%
    show disagreement (3 instances).
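The counts above can be derived from a TAG_FILE as in the following
minimal Perl sketch (illustrative only, not the omwe-agree.pl source).
For simplicity it pools all instances together instead of breaking the
counts down per target word as omwe-agree.pl does:

 #!/usr/bin/perl -w
 # agree_counts.pl - illustrative sketch, not part of OMtoSVAL2
 # usage: perl agree_counts.pl TAG_FILE
 use strict;

 my %tags;                       # instance id => list of assigned tags
 while (<>) {
     chomp;
     my ($instance, $tag) = split ' ';
     push @{ $tags{$instance} }, $tag;
 }

 my ($one, $multi, $agree, $disagree) = (0, 0, 0, 0);
 foreach my $instance (keys %tags) {
     my @assigned = @{ $tags{$instance} };
     if (@assigned == 1) {
         $one++;                 # ONE-TAG: a single entry in the TAG_FILE
     }
     else {
         $multi++;               # MULTI-TAG: tagged more than once
         # identical tags from repeated entries count as agreement
         my %distinct = map { $_ => 1 } @assigned;
         if (keys %distinct == 1) { $agree++ } else { $disagree++ }
     }
 }

 printf "ONE-TAG %d  MULTI-TAG %d  AGREE %d  DISAGREE %d\n",
        $one, $multi, $agree, $disagree;
 if ($multi > 0) {
     printf "%%AGREE %.2f  %%DISAGREE %.2f\n",
            $agree / $multi * 100, $disagree / $multi * 100;
 }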
--------------
2.2.3.b Table2
--------------

This displays a histogram showing the number of instances with a
specific number of tags assigned.

Columns -

 INSTANCES  TAGS

e.g.

 INSTANCES  TAGS
 1          1
 4          2

This shows that there is just 1 instance with 1 tag, while 4 instances
have 2 tags each. In other words, one instance has a single entry in the
TAG_FILE while the other 4 have double entries in the TAG_FILE.

------------------
2.2.3.c notag2.txt
------------------

This is an output file that lists all the instances that have no tag in
the input TAG_FILE.

-------------
2.2.4 Options
-------------

--agree A

 Set A to a numeric value in [0-100] to see only those words that have
 %agreement greater than or equal to the threshold A. This is provided
 to filter out the words that have %agreement less than some threshold
 value. A special case is to see the words with 100% agreement.

--disagree D

 Set D to a numeric value in [0-100] to see only those words that have
 %disagreement greater than or equal to the threshold D. This is
 provided to filter out the words that have %disagreement less than some
 threshold value. A special case is to see the words with 100%
 disagreement.

==========
3. Copying
==========

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.

==================
4. ACKNOWLEDGMENTS
==================

This work has been partially supported by a National Science Foundation
Faculty Early CAREER Development award (#0092784).

=============
5. REFERENCES
=============

[1] Open Mind Word Expert, [online] 2002. Available from
    http://www.teach-computers.org/word-expert.html. Accessed on
    12/14/2002.

[2] SENSEVAL: Evaluation exercises for Word Sense Disambiguation,
    [online] 2002. Available from http://www.senseval.org/. Accessed on
    12/14/2002.

=============
6. Contact us
=============

Thanks for using OMtoSVAL2. Please feel free to contact us if you have
any difficulty in using this software, or if you have any additional
comments or suggestions to enhance its functionality.

Amruta Purandare
pura0010@d.umn.edu

(README last updated on 12/14/2002 - Amruta)