=============================
      OMtoSVAL2 Package
=============================

version 0.01

Copyright (C) 2002

Amruta Purandare, pura0010@d.umn.edu
Ted Pedersen, tpederse@umn.edu

University of Minnesota, Duluth

===============
1. Introduction
===============

We have developed two Perl programs that operate on the Open Mind sense
tagged corpus [1]. The first converts this data into the Senseval-2
format [2], and the second finds the rate of agreement among the taggers
who contributed to the Open Mind data.

======================
2. Package Description
======================

Our package consists of the following programs:

----------------
2.1 omwe2sval.pl
----------------

This program converts the Open Mind sense tagged data into Senseval-2
format.

-----------------------------
2.1.1 How to run this program
-----------------------------

This program can be run using the command shown below:

 omwe2sval.pl TAG_FILE INSTANCE_FILE

-----------
2.1.2 Input
-----------

The program accepts two files, as described below.

----------------
2.1.2.a TAG_FILE
----------------

(called the OMWE-tagging file by the Open Mind team)

Each line in the TAG_FILE is a space-separated "instance tag" pair that
shows a tag assigned to an instance by a contributor. The sense tags are
WordNet 1.7 sense keys. Instances that are tagged by multiple
contributors will have multiple entries in this file.

e.g. TAG_FILE ->

 act.n.la.003 act%1:10:01::
 act.n.la.003 act%1:10:02::
 act.n.la.017 act%1:10:01::
 act.n.la.018 act%1:10:01::
 act.n.la.018 act%1:10:01::
 act.n.la.020 act%1:10:02::
 act.n.la.024 unclear
 act.n.la.024 unlisted-sense
 act.n.la.024 act%1:10:01::
 act.n.la.024 unlisted-sense

This shows instance ids in the first column and the sense tags assigned
by a contributor in the second column.

-------------
SOME COMMENTS
-------------

When the same instance is tagged by multiple contributors, the tags
assigned may or may not match. In some cases the contributors agree and
attach the same tag to an instance, while in other cases they do not
agree on a particular tag, and the same instance is given different tags
by different contributors. A contributor may select 'unclear' if the
meaning of the word in an instance is not clear, or 'unlisted-sense' if
the meaning they have in mind is not among the senses shown at the time
of tagging.

This file should be sorted on the instance ids before it is given as
input to the program.

[Warning - If the TAG_FILE or the Instance and Sense ids in the TAG_FILE
do not follow the Open Mind specified format, the behavior of our
programs is unpredictable.]

---------------------
2.1.2.b INSTANCE_FILE
---------------------

(called the 'ids-to-sentences' file by the Open Mind team)

This file lists all instances in the Open Mind database and follows the
format described in the README that comes with the Open Mind data. Each
instance should be on a separate line showing

 I target_word ? target_location Word/POS[/NE] [Word/POS[/NE] ..]

where

 I               = Instance id
 target_word     = Target word as it appears in instance I
 target_location = Location at which the target_word is found in
                   instance I, counting the words from 0
 Word/POS[/NE]   = Each word in instance I with its POS tag and optional
                   Named Entity information, separated by /

This target location information is needed because an instance could
contain multiple occurrences of the target word in the same form.

Example of an instance -

 bum.n.la.017 bum ? 14 The/DT price/NN is/VBZ right/NN ,/, the/DT
 food/NN is/VBZ good/NN and/CC nobody/NN gives/VBZ you/PRP the/DT
 bum/NN 's/POS rush/NN ./.

Here bum.n.la.017 is an instance id that uniquely identifies this
instance in the corpus. 'bum' is the target word and appears in the same
form in the instance at location 14 (the number specified after the ?).
Remember that token counting starts from 0, i.e. 'The' appears at
location 0. This number (the location of the target word) is then
followed by the POS-tagged tokens of the instance. The POS tag is shown
after the first '/', while some tokens also carry Named Entity
information after a second '/'.
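As an illustration, the following minimal Perl sketch (not part of this
package) splits the example line above into its instance id, target
word, target location, and tokens:

 #!/usr/bin/perl -w
 # parse_instance.pl - illustrative sketch, not part of OMtoSVAL2
 use strict;

 my $line = "bum.n.la.017 bum ? 14 The/DT price/NN is/VBZ right/NN ,/, " .
            "the/DT food/NN is/VBZ good/NN and/CC nobody/NN gives/VBZ " .
            "you/PRP the/DT bum/NN 's/POS rush/NN ./.";

 # split on whitespace: id, target word, '?', target location, tokens
 my ($id, $target, $qmark, $location, @tagged) = split /\s+/, $line;

 # strip the POS (and optional NE) information to recover the raw words
 my @words = map { (split /\//)[0] } @tagged;

 print "instance id    : $id\n";
 print "target word    : $target\n";
 print "target location: $location\n";
 print "word at target : $words[$location]\n";   # prints 'bum'

Because the POS (and optional NE) information is attached with '/',
keeping only the text before the first '/' recovers the raw words, and
the word at index 14 is indeed 'bum'.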
For further information on these file formats, please refer to the
README file that comes with the Open Mind corpus.

This file should be sorted so that all the instances for the same target
word appear consecutively.

[Warning - If the INSTANCE_FILE or the Instance and Sense ids in the
INSTANCE_FILE do not follow the Open Mind specified format, the behavior
of our programs is unpredictable.]

------------
2.1.3 Output
------------

The program converts the instances listed in the Instance file (passed
as the 2nd argument to this program) into Senseval-2 format using the
tag information specified in the Tag file, which is the first command
line argument to this program. The following shows an example of this
conversion. Let us assume that the following is the only instance in the
Instance file and that it has 2 entries in the Tag file, as shown below.

Instance File =>

 act.n.tb.138 acts ? 11 Under/IN current/JJ law/NN ,/, such/JJ
 suspects/NNS are/VBP immune/JJ from/IN prosecution/NN for/IN acts/NNS
 committed/VBN while/IN not/RB British/JJ citizens/NNS ./.

Tag File =>

 act.n.tb.138 act%1:10:02::
 act.n.tb.138 unlisted-sense

Output =>

 Under current law , such suspects are immune from prosecution for
 <head>acts</head> committed while not British citizens .

Note that the target word here is 'acts' at position 11 in the instance
and is marked with <head> tags in the Senseval-2 data file. For more
information on the Senseval-2 format, please refer to
http://www.senseval.org
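For reference, a complete Senseval-2 lexical sample entry for this
instance could look roughly as follows. This is only an illustration of
the general Senseval-2 markup (instance, answer, and context elements,
with the target word inside <head> tags); the exact attributes written
by omwe2sval.pl may differ.

 <instance id="act.n.tb.138">
 <answer instance="act.n.tb.138" senseid="act%1:10:02::"/>
 <answer instance="act.n.tb.138" senseid="unlisted-sense"/>
 <context>
 Under current law , such suspects are immune from prosecution for
 <head>acts</head> committed while not British citizens .
 </context>
 </instance>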
-----------------
2.1.4 By Products
-----------------

The program also creates various output files as byproducts in a
directory named output.

----------------------
2.1.4.a notag.txt File
----------------------

This file lists the instance ids of the instances that are not yet
tagged. If an instance with instance id I in the Instance file does not
have any sense tag in the Tag file, the program prints I to the
notag.txt file.

--------------------
2.1.4.b repeated.txt
--------------------

This file lists the instance ids that are repeated in the Instance file,
along with the number of times they are repeated. This will occur if two
or more instances use the same instance id, or if the same instance is
repeated more than once. When an instance id appears multiple times, the
first instance that uses this instance id is kept while the rest of the
instances are ignored.

--------------------
2.1.4.c mismatch.txt
--------------------

When the target word appears multiple times in the same form in the same
instance, we can decide which occurrence is meant using the target
location information. As a double check, the program verifies that the
token present in the instance at the specified target location (the
number given after the ? symbol) matches the target word specified
earlier in the instance (before the ? symbol). If this check fails, the
program does not skip the instance; it takes the target word location as
correct and reports the instance id in the mismatch.txt file.

----------------------
2.1.4.d noinstance.txt
----------------------

If an instance id found in the Tag file has no corresponding entry in
the Instance file, it is reported to the noinstance.txt file.

-----------------
2.2 omwe-agree.pl
-----------------

This program shows the statistical distribution of the tagged instances,
for each target word in the TAG_FILE, as described below.

The program divides the total tagged instances for each target word into
2 categories, ONE-TAG and MULTI-TAG. If an instance is tagged by more
than one contributor, it will have multiple entries in the TAG_FILE and
is called a MULTI-TAG instance. On the other hand, if an instance is
shown to a single contributor and has a single entry in the TAG_FILE, it
is counted as a ONE-TAG instance.

If an instance is tagged by more than one contributor (a MULTI-TAG
instance) and all the contributors assign the same sense tag, we say
that the contributors agree on the tag. Otherwise, we say that they
disagree, and the instance is assigned multiple distinct tags. The
omwe-agree.pl program computes the total number of MULTI-TAG instances
for which the contributors agree and those for which they disagree. In
other words, the output of this program shows the agreement and
disagreement rate of the MULTI-TAG instances per target word. Here, the
case in which an instance is shown to the same contributor multiple
times is treated the same as the case in which an instance is shown to
different contributors.

-----------
2.2.1 Input
-----------

The program accepts the Tag file as its input; this file is the same as
the one described in section 2.1.2.a of this README.

-----------------------------
2.2.2 How to run this program
-----------------------------

The program can be run using the command shown below:

 omwe-agree.pl OMWE-tagging

where OMWE-tagging is an input Tag file (format described in section
2.1.2.a of this README).

------------
2.2.3 Output
------------

The output of this program is written to the standard output and shows 2
tables, as described below.

--------------
2.2.3.a Table1
--------------

Table columns

 WORD #INSTANCES ONE-TAG MULTI-TAG AGREE DISAGREE %AGREE %DISAGREE

-------------------------
Column Header Description
-------------------------

WORD #INSTANCES
---------------

These columns show the various words found in the input TAG_FILE along
with the total number of tagged instances for these words.

ONE-TAG
-------

This column shows how many of the #INSTANCES have a single tag in the
TAG_FILE. These are shown to only one contributor and only once.

MULTI-TAG
---------

This column shows how many of the #INSTANCES have multiple tags, i.e.
are shown to multiple contributors (or to the same contributor more than
once).

AGREE
-----

This column shows the number of MULTI-TAG instances that have a single
distinct tag assigned by the various contributors. All contributors
assigning a tag to these instances agree on the same tag.

DISAGREE
--------

This column shows the number of MULTI-TAG instances that have more than
one distinct tag assigned by the contributors. When at least one
contributor assigning a tag to an instance disagrees with the others
assigning tags to the same instance, we say the contributors disagree on
the sense tag.

%AGREE
------

This shows the percentage of the instances having multiple tags for
which all the contributors agree, i.e.

 AGREE/MULTI-TAG*100

%DISAGREE
---------

This shows the percentage of the instances having multiple tags for
which at least one contributor disagrees, i.e.

 DISAGREE/MULTI-TAG*100

e.g.

 WORD    #INSTANCES  ONE-TAG  MULTI-TAG  AGREE  DISAGREE  %AGREE  %DISAGREE
 act.n   5           1        4          1      3         25.00   75.00
 totals  5           1        4          1      3         25.00   75.00

Shows -

(1) A total of 5 instances are tagged for the word act.n.
(2) Out of 5, one instance has just one tag while 4 have multiple tags.
(3) Out of the 4 having multiple tags, the contributors agree for one
    instance, while for the other three the contributors disagree.
(4) 25% of the 4 multi-tag instances show agreement (1 instance) and 75%
    show disagreement (3 instances).
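The counts above can be derived from a TAG_FILE as in the following
minimal Perl sketch (illustrative only, not the omwe-agree.pl source).
For simplicity it pools all instances together instead of breaking the
counts down per target word as omwe-agree.pl does:

 #!/usr/bin/perl -w
 # agree_counts.pl - illustrative sketch, not part of OMtoSVAL2
 # usage: perl agree_counts.pl TAG_FILE
 use strict;

 my %tags;                       # instance id => list of assigned tags
 while (<>) {
     chomp;
     my ($instance, $tag) = split ' ';
     push @{ $tags{$instance} }, $tag;
 }

 my ($one, $multi, $agree, $disagree) = (0, 0, 0, 0);
 foreach my $instance (keys %tags) {
     my @assigned = @{ $tags{$instance} };
     if (@assigned == 1) {
         $one++;                 # ONE-TAG: a single entry in the TAG_FILE
     }
     else {
         $multi++;               # MULTI-TAG: tagged more than once
         # identical tags from repeated entries count as agreement
         my %distinct = map { $_ => 1 } @assigned;
         if (keys %distinct == 1) { $agree++ } else { $disagree++ }
     }
 }

 printf "ONE-TAG %d  MULTI-TAG %d  AGREE %d  DISAGREE %d\n",
        $one, $multi, $agree, $disagree;
 if ($multi > 0) {
     printf "%%AGREE %.2f  %%DISAGREE %.2f\n",
            $agree / $multi * 100, $disagree / $multi * 100;
 }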
--------------
2.2.3.b Table2
--------------

This displays a histogram showing the number of instances with a
specific number of tags assigned.

Columns -

 INSTANCES  TAGS

e.g.

 INSTANCES  TAGS
 1          1
 4          2

This shows that there is just 1 instance with 1 tag, while 4 instances
have 2 tags each. In other words, one instance has a single entry in the
TAG_FILE while the other 4 have double entries in the TAG_FILE.

------------------
2.2.3.c notag2.txt
------------------

This is an output file that lists all the instances that have no tag in
the input TAG_FILE.

-------------
2.2.4 Options
-------------

--agree A

 Set A to a numeric value in [0-100] to see only those words that have
 %agreement greater than or equal to the threshold A. This is provided
 to filter out the words that have %agreement less than some threshold
 value. A special case is to see the words with 100% agreement.

--disagree D

 Set D to a numeric value in [0-100] to see only those words that have
 %disagreement greater than or equal to the threshold D. This is
 provided to filter out the words that have %disagreement less than some
 threshold value. A special case is to see the words with 100%
 disagreement.

==========
3. Copying
==========

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.

==================
4. ACKNOWLEDGMENTS
==================

This work has been partially supported by a National Science Foundation
Faculty Early CAREER Development award (#0092784).

=============
5. REFERENCES
=============

[1] Open Mind Word Expert, [online] 2002. Available from
    http://www.teach-computers.org/word-expert.html. Accessed on
    12/14/2002.

[2] SENSEVAL: Evaluation exercises for Word Sense Disambiguation,
    [online] 2002. Available from http://www.senseval.org/. Accessed on
    12/14/2002.

=============
6. Contact us
=============

Thanks for using OMtoSVAL2. Please feel free to contact us if you have
any difficulty in using this software, or if you have any additional
comments or suggestions to enhance its functionality.

Amruta Purandare
pura0010@d.umn.edu

(README last updated on 12/14/2002 - Amruta)