Readme for Program compoundSval2.pl
-----------------------------------

December 18th, 2001
version 0.1

Copyright (c) 2001-2002

Satanjeev Banerjee, bane0025@d.umn.edu
Ted Pedersen, tpederse@d.umn.edu

University of Minnesota, Duluth


1. Introduction:
----------------

This program detects compound words in the text within the tags of
SENSEVAL-2 [1] lexical sample files, and connects the components of
such compound words with underscores. For example, it converts the
sentence "the prime interest rate in new york city is in high spirits"
to "the prime_interest_rate in new_york_city is in_high_spirits".


2. Running the program:
-----------------------

To get acquainted with the command-line syntax, run the program
without any arguments:

    perl compoundSval2.pl

This prints some help messages. More help can be obtained by running
the program like so:

    perl compoundSval2.pl --help

To run the program, provide it with a file containing compound words,
which we call the WORDSFILE, and a source lexical sample file in the
SENSEVAL-2 format, which we call the LEXFILE:

    perl compoundSval2.pl WORDSFILE LEXFILE

We provide a sample WORDSFILE, the file list_of_compounds.txt. It
contains all the compound words in the online lexical database
WordNet [2]. We also provide a test XML file, test.xml, which is
formatted the way SENSEVAL-2 lexical sample files are usually
formatted.

The output of this program is sent to stdout. All compound words
defined in the WORDSFILE are identified in this output.


3. Compound Identification:
---------------------------

Following are a few features of the compound identification algorithm:

1. Compound words are identified in a longest-compound-first greedy
   manner. Thus in the sentence "this is the big bang theory", the
   program would output "this is the big_bang_theory", although
   "big_bang" is a compound too!

2. Sometimes, a choice needs to be made between two different compounds.
   In these situations the compound found "first" is chosen, in keeping
   with the greedy approach. Thus, in the sentence "what a work of art
   teacher", the word "art" can be part of either of the two compounds
   "work_of_art" and "art_teacher". However, since "work_of_art" comes
   "before" "art_teacher", it is chosen, and the output of this program
   is "what a work_of_art teacher".

3. If the word within the tags, called the "head-word", is part of a
   compound word with components outside the tags, then the entire
   compound becomes the new head-word. For example, the sentence "this
   is my art teacher" gets converted to "this is my art_teacher".

4. Punctuation marks are ignored, and they do not get in the way of
   finding compounds. Thus, "work of! art" would become "work_of_art".


4. About File list_of_compounds.txt:
------------------------------------

This file contains all the compound words in the online lexical
database WordNet [2]. The list was obtained by searching for words
containing an underscore in the files index.noun, index.verb,
index.adj and index.adv from the WordNet 1.7 installation.


5. Example Output:
------------------

For the input test.xml file:

    this is a work of art. i love this work of art?
    i am not your art teacher boy!
    the prime interest rate in new york city is in high spirits and art
    this is a work of. art!
    what a work of art teacher!
    what a big bang theory.
    it begins with a big bang!

Following is the output, when using the words file
list_of_compounds.txt:

    this is a work_of_art i love this work_of_art
    i am not your art_teacher boy
    the prime_interest_rate in new_york_city is in_high_spirits and art
    this is a work_of_art
    what a work_of_art teacher
    what a big_bang_theory
    it begins with a big_bang

In instance "art.40001", we detect work_of_art twice. In the first of
the two, the compound becomes the new head-word! In "art.40002", we
detect art_teacher, which again becomes the new head-word.
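The matching rules of Section 3 (longest compound first, leftmost match
wins, punctuation ignored) can be illustrated with a short sketch. This
is Python written for illustration only; it is not the code of
compoundSval2.pl, the function name join_compounds is hypothetical, and
the head-word handling of rule 3 is left out because it depends on the
SENSEVAL-2 markup. It assumes the compounds are given as a set of
underscore-joined strings, as in list_of_compounds.txt.

```python
import re


def join_compounds(text, compounds):
    """Join known compounds in `text` with underscores, longest match first."""
    # Assumption: punctuation is ignored by keeping only runs of word characters.
    tokens = re.findall(r"\w+", text)
    # Length, in component words, of the longest compound in the list.
    max_len = max((c.count("_") + 1 for c in compounds), default=1)
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting at token i, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = "_".join(tokens[i:i + n])
            if candidate in compounds:
                out.append(candidate)
                i += n  # skip past the matched components
                break
        else:
            # No compound starts at this position; keep the single token.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


compounds = {"work_of_art", "art_teacher", "big_bang", "big_bang_theory"}
print(join_compounds("what a work of art teacher!", compounds))
# what a work_of_art teacher
print(join_compounds("what a big bang theory.", compounds))
# what a big_bang_theory
```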
Instance "art.40003" is the standard example of multiple matches. We
find prime_interest_rate, new_york_city and in_high_spirits. Instance
"art.40004" demonstrates that punctuation marks are ignored. Instance
"art.40005" demonstrates the choice of "work_of_art" over
"art_teacher", and instance "art.40006" shows how "big_bang_theory" is
chosen instead of just "big_bang".

--------

This suite of programs is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
USA.

Note: The text of the GNU General Public License is provided in the
file GPL.txt that you should have received with this distribution.