SenseTools
==========

Version 0.3

Copyright (C) 2001-2003

Ted Pedersen, tpederse@d.umn.edu
University of Minnesota, Duluth

Satanjeev Banerjee, Satanjeev.Banerjee@cs.cmu.edu
Carnegie Mellon University

http://www.d.umn.edu/~tpederse/sensetools.html

1. Introduction:
----------------

SenseTools is a suite of programs that convert sense-tagged training data and untagged test/evaluation data into feature vectors suitable for use by the Weka machine learning package [3]. The training and test data are assumed to be formatted like the lexical-sample data used in the SENSEVAL-2 [4] word sense disambiguation exercise. The features used to represent the training and test data can be manually specified by the user, or automatically identified using the N-gram Statistics Package [1].

This suite consists of the following programs:

preprocess.pl: Takes an xml file in SENSEVAL-2 lexical-sample format and splits it apart into as many files as there are lexical elements in the original file. Each lexical element usually corresponds to a word used in a particular part of speech. It also performs various other preprocessing tasks on the data.

nsp2regex.pl: Takes n-word sequences and represents them as regular expressions. These are then used by xml2arff.pl to identify lexical features in the training and test data, and to convert lexical element files from text into feature vectors.

xml2arff.pl: Takes a lexical element file (created by preprocess.pl above) and a set of regular expressions (created by nsp2regex.pl above) and converts the lexical element file into a feature vector representation, where each instance in the lexical element file becomes a feature vector. The output of xml2arff.pl takes the form of an .arff file, which is a Weka-specific representation of training and test data.
WekaClassify.java: Takes an .arff file associated with the test/evaluation data for a particular word, and a machine learning model created by Weka. WekaClassify.java will classify each instance in the test file based on the specified model and will output an answer file that complies with the SENSEVAL-2 format for answer files. This allows us to use the standard SENSEVAL-2 scoring program (scorer.python) to score our answers.

tilde.pl: Replaces the '%' in the senseid's in the training examples with a '~'. This is required because Weka treats '%' as a comment character, while in SENSEVAL-2 it is included as part of the senseid's for the English lexical sample data (these senseid's are actually WordNet [5] mnemonics, the '%' being part of their standard format). If the senseid's do not include '%', as is the case with the Spanish lexical sample data, then tilde.pl is not necessary.

ensembleByDist.pl: Takes the output from a number of models created by WekaClassify.java and creates a simple ensemble by summing the probabilities or confidence measures associated with each possible sense of each instance. It produces a single file where each possible sense of each instance is assigned a score, namely the sum of the probability or confidence measures associated with that particular sense value.

winnerTakeAll.pl: Takes the output of WekaClassify.java and outputs the answer that has the highest score. It will output multiple answers in the case of ties.

xml2key.pl: Takes one or more lexical element files where the correct senseid's have already been assigned to instances and creates a key file that provides the correct sense-tag for each instance. This key is used with the scorer.python program used during SENSEVAL-2.

Programs with a .pl extension are written in Perl (version 5.6) while WekaClassify.java is written in Java.

This README continues with a brief description of the format of SENSEVAL-2 data and a very short (and contrived) example SENSEVAL-2 file.
We then describe each of the above programs in detail, explaining how each works with the help of our example SENSEVAL-2 file.

2. Format of SENSEVAL-2 Data File:
----------------------------------

The lexical sample tasks of SENSEVAL-2 are formatted using XML. Please consult the SENSEVAL website [4] for a complete description of the format. We describe below the facts about the format relevant to SenseTools.

There are two lexical sample files provided per language in SENSEVAL-2: a training file where multiple occurrences of the target words in the lexical sample have been manually sense tagged, and a test/evaluation file that contains instances of the words in the lexical sample that must be sense-tagged. The following description applies to both these files, unless otherwise mentioned.

The lexical sample files include XML tags that either provide additional information about the data, or demarcate regions of the text and give some regions special significance. XML tags that demarcate regions come in pairs, one to start the region and one to end it. Usually the tag that marks the beginning of a region looks like <tag> and is called a start tag, and the one that marks the end of the region looks like </tag> and is called an end tag. Informational XML tags look like <tag attribute="value"/>.

The first "important" tag in the file is the one that describes the corpus language. This is a tag that denotes a region, usually the whole file. Since it marks a region, it comes in a pair. The tag that starts the region is <corpus lang='english'> (where 'english' denotes the language in this file) and the tag that ends the region is </corpus>.

Within this tag pair there are one or more "lexical elements" or lexelts. Each such lexelt represents all the data available for a single word that has to be disambiguated. This region has the tag <lexelt item="word"> in the beginning (where "word" is the word to be disambiguated) and the tag </lexelt> at its end. Within each lexelt region there are several "instances".
An instance consists of a sentence containing the word to be disambiguated as well as a few sentences around the example sentence. This region is demarcated with the tag <instance id="word.XXXXX"> in the beginning and the tag </instance> in the end. Here, "word.XXXXX" is the target word followed by a few digits that uniquely identify this instance among all the instances for this word.

Within the instance region of the "training" sample file we have one or more answer tags that look like this: <answer instance="word.XXXXX" senseid="sense"/>. Here "word.XXXXX" matches the instance id in the <instance> tag while the quoted string after "senseid=" refers to a mnemonic from WordNet [5] that represents the meaning of the target word in this instance. The "test" sample file does not have answer tags since these tags are created via human sense-tagging.

Each instance has one <context> - </context> tag pair that marks the start and end of the group of sentences that make up the context of the target word. The target word itself is marked by the tags <head> and </head>.

3. An Example SENSEVAL-2 File:
------------------------------

The following is an example SENSEVAL-2 file that we will refer to later in this README as example.xml:

<corpus lang='english'>
<lexelt item="art.n">
<instance id="art.40001">
<answer instance="art.40001" senseid="art~1:06:00::"/>
<answer instance="art.40001" senseid="fine_art~1:06:00::"/>
<context>
<head>Art</head> you can dance to from the creative group called Halo.
</context>
</instance>
<instance id="art.40002">
<answer instance="art.40002" senseid="art_gallery~1:06:00::"/>
<context>
There's always one to be heard somewhere during the summer in the piazza in front of the <head>art</head> gallery and town hall or in a park.
</context>
</instance>
<instance id="art.40007">
<answer instance="art.40007" senseid="..."/>
<context>
Paintings, drawings and sculpture from every period of <head>art</head> during the last 350 years will be on display.
</context>
</instance>
</lexelt>
<lexelt item="authority.n">
<instance id="authority.40001">
<answer instance="authority.40001" senseid="..."/>
<context>
Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work, it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice.
</context>
</instance>
</lexelt>
</corpus>

Here we have two lexelts, "art.n" and "authority.n", where "n" denotes that these are noun senses of the words. We have three instances of art with instance id's art.40001, art.40002 and art.40007 respectively, and one instance of authority with instance id authority.40001. The first instance has two answers, while the others have one each.

3.1. Assumptions Made About the Input Xml File:
-----------------------------------------------

The input XML file should be valid xml! Thus, for example, tags that appear in pairs (like <instance> and </instance>) should be correctly paired up, and should be properly nested. Programs in this suite DO NOT check for the correctness of the input xml file, and so their behavior is undefined for "wrong" xml files. New tags not defined in the current Senseval-2 format can be included, as long as they follow the above restrictions. Multiple tags are allowed to be on the same line. However, all the characters in a single tag should be on the same line. That is, a tag (from the '<' to the '>') must not have a new-line character in it.

4. Program preprocess.pl:
-------------------------

This is the first program to use after obtaining the SENSEVAL-2 lexical sample data and before attempting to use any of the other programs in the suite. This program splits the lexical sample files up into separate files per lexical element, and also tokenizes the actual text to be processed so it can be converted to a feature vector representation more conveniently.

Recall that the lexical-sample data for a given language consists of two files: the "training" file and the "test" file. Each of these consists of multiple lexical elements, each element containing multiple instances of a word to be disambiguated. By default, preprocess.pl creates a separate xml file for each lexical element, meaning that in general there is a separate file for all the instances associated with each word. Each such xml file is in the same general format as the lexical sample file, except that it only contains instances for one word and is specific to one lexical element. Preprocess.pl also creates a "count" file for each lexical element that only contains the text between the <context> tags in the corresponding "xml" file.
This is created for the convenience of the Ngram Statistics Package, which is geared towards plain text and has no convenient mechanism for ignoring XML tags in data.

Finally, preprocess.pl has the very important role of identifying tokens in the text that occurs in the region defined by the <context> tags. Tokens are any sequences of characters that can be defined through regular expressions. These regular expressions should be provided by the user; although preprocess.pl provides default regular expressions, we suggest that the user not depend on them... see README.details.txt for more on this. This section continues with a description of the tokenization process, and then a brief description of the various ways this program may be used.

4.1 Tokenization of Text:
-------------------------

Program preprocess.pl accepts regular expressions from the user and then "tokenizes" the text between the <context> tags. This is done to simplify the construction of regular expressions in program nsp2regex.pl and to achieve optimum regular expression matching in xml2arff.pl. Following is a description of the tokenization process.

The text within the <context> tags is considered as one string, the "input" string. This algorithm takes this input string and creates an "output" string where tokens are identified and separated from each other by a SINGLE space. Regex's provided by the user are checked against the input string to see if a sequence of characters starting with the first character of the string matches against any of these regex's. As soon as we find a regular expression that does match, this checking is halted, the matched sequence of characters is removed from the input string and appended to the "output" string with exactly one space to its left and right. If none of the regex's match against the starting characters of the input string, the first character is considered a "non-token".
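The matching loop just described can be sketched as follows. This is an illustrative Python sketch, not the actual Perl implementation; in particular, the assumption that whitespace merely separates candidate tokens (rather than becoming non-tokens itself) is ours. The wrapping of non-tokens in angle brackets, described next, is included for completeness:

```python
import re

def tokenize(text, token_regexes):
    """Greedy left-to-right tokenization in the style of preprocess.pl:
    the first regex that matches at the start of the remaining input
    wins; characters matched by no regex become "non-tokens"."""
    output = []
    while text:
        text = text.lstrip()          # assumption: whitespace only separates candidates
        if not text:
            break
        for pattern in token_regexes:
            m = re.match(pattern, text)
            if m and m.group(0):      # first matching regex wins
                output.append(m.group(0))
                text = text[m.end():]
                break
        else:
            output.append("<" + text[0] + ">")   # non-token, wrapped in <>
            text = text[1:]
    return " ".join(output)           # tokens separated by a SINGLE space
```

For example, with the single token regex \w+, tokenize("No, he has no authority on this!", [r"\w+"]) yields "No <,> he has no authority on this <!>".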
By default this non-token is placed in angular brackets (<>) and then put into the output string with one space to its left and right. This process is continued until the input string becomes empty, and is restarted for the next instance.

For example, assume we provide the following regular expressions to preprocess.pl:

<head>\w+</head>
\w+

The first regular expression says that a sequence of characters starting with "<head>", having an unbroken sequence of alphanumeric characters and finally ending with "</head>" is a valid token. Also, an unbroken sequence of alphanumeric characters makes a token. Then, assuming that the following text occurs within the <context> tags of an instance:

No, he has no <head>authority</head> on this!

preprocess.pl would then convert this text to:

No <,> he has no <head>authority</head> on this <!>

Observe that "No", "he", "has", "no", "<head>authority</head>", etc. are all tokens, while "," and "!" aren't tokens and so have been put into angular brackets. Further observe that each token has exactly one space to its left and right.

One can provide a file containing regular expressions to preprocess.pl using the switch --token. In this file, each regular expression should be on a line of its own and should be preceded and followed by '/' signs. Further, these should be perl regular expressions (see [6] for more on what constitutes a perl regular expression). Thus our regular expressions above would look like so:

/<head>\w+<\/head>/
/\w+/

We shall call the file in which these regular expressions lie "token.txt". Then, we would run preprocess.pl on example.xml with this token file like so:

preprocess.pl example.xml --token token.txt

Preprocess.pl produces two kinds of output: "xml" output and "count" output.

4.2. XML output:
----------------

By default, for each lexical element "word" in the training or test file (in the lexical sample of SENSEVAL-2), preprocess.pl creates a file of the name "word".xml.
For example, for the file example.xml, preprocess.pl will create files art.n.xml and authority.n.xml if it is run as follows:

preprocess.pl example.xml --token token.txt

File art.n.xml:

<corpus lang='english'>
<lexelt item="art.n">
<instance id="art.40001">
<answer instance="art.40001" senseid="art~1:06:00::"/>
<answer instance="art.40001" senseid="fine_art~1:06:00::"/>
<context>
<head>Art</head> you can dance to from the creative group called Halo <.>
</context>
</instance>
<instance id="art.40002">
<answer instance="art.40002" senseid="art_gallery~1:06:00::"/>
<context>
There <'> s always one to be heard somewhere during the summer in the piazza in front of the <head>art</head> gallery and town hall or in a park <.>
</context>
</instance>
<instance id="art.40007">
<answer instance="art.40007" senseid="..."/>
<context>
Paintings <,> drawings and sculpture from every period of <head>art</head> during the last 350 years will be on display <.>
</context>
</instance>
</lexelt>
</corpus>

File authority.n.xml:

<corpus lang='english'>
<lexelt item="authority.n">
<instance id="authority.40001">
<answer instance="authority.40001" senseid="..."/>
<context>
Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work <,> it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice <.>
</context>
</instance>
</lexelt>
</corpus>

Observe of course that the text within the <context> region has been tokenized as described previously, according to the regular expressions in file token.txt. This default behavior can be stopped either by using the switch --xml FILE, by which only one FILE is created, or by using the switch --noxml, by which no xml file is created.

4.3. Count output:
------------------

Besides creating xml output, this program also creates output that can be used directly with the program count.pl (from the Ngram Statistics Package [1]). After tokenizing the region within the <context> tags of each instance, the program puts together ONLY these pieces of text to create "count.pl ready" output. This is because count.pl assumes that all tokens in the input file need to be "counted", and generally we are only interested in the "contextual" material provided in each instance, and not the tags that occur outside the <context> region of text. By default, for each lexical element "word", this program creates a file of the name word.count. For example, for the file example.xml, we would get the files art.n.count and authority.n.count.
File art.n.count:

<head>Art</head> you can dance to from the creative group called Halo <.>
There <'> s always one to be heard somewhere during the summer in the piazza in front of the <head>art</head> gallery and town hall or in a park <.>
Paintings <,> drawings and sculpture from every period of <head>art</head> during the last 350 years will be on display <.>

File authority.n.count:

Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work <,> it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice <.>

This default behavior can be stopped either by using the switch --count FILE, by which only one FILE is created, or by using the switch --nocount, by which no count file is created.

Note that the --xml/--noxml switches and the --count/--nocount switches are independent of each other. Thus, although providing the --xml FILE or --noxml switches produces a single xml FILE or no xml file at all, you will still get all the count files, unless you also give the --count FILE or --nocount switches. Similarly, providing the --count FILE or --nocount switches does not affect the production of the xml files.

4.4. Further Details in File README.details.txt:
------------------------------------------------

File README.details.txt continues with more subtle variations on running the program preprocess.pl as well as some extra switches that preprocess.pl provides for advanced uses. Specifically, it fleshes out in detail the effect of various regular expressions on tokenization, and some sundry issues with regular expression files. It also discusses the switches --senseid, --lexelt, --split, --seed, --removeNotToken, and --nontoken.

5. Program nsp2regex.pl:
------------------------

We convert our lexical sample files into a feature vector representation where each instance of the lexical sample file is represented by a vector of binary features.
Each feature is represented by a regular expression and, for a given instance, gets a value of 1 or 0 depending on whether or not the regular expression corresponding to the feature matches the text within the <context> tags of the instance. While program xml2arff.pl does the actual conversion of files from lexical sample format to feature vector representation, program nsp2regex.pl is concerned with creating regular expressions for features. A "feature" could be a single word, a set of words, an XML tag, or some combination of these that occurs within the <context> region of a given instance.

5.1 Features Based on Bigrams:
------------------------------

In our examples below, we assume that our instance is the following:

Paintings , drawings and sculpture from every period of <head> art </head> during the last 350 years will be on display .

As mentioned above, a feature could represent whether or not a set of words occurs in the instance. Say our feature is the bigram "period of". Since it is indeed present, this feature would get a value of 1 for this instance. The input to nsp2regex.pl would be

period<>of<>

Note that the diamond <> must terminate every word or token we wish to match. Whatever comes after the last diamond on the line is ignored. Note too that we must not have the <> itself as a token we wish to match. (We describe and explain the output of nsp2regex.pl in README.details.txt.) Further note that the order of the words in the bigram is important. Thus the feature "of period" would get a value of 0 for this instance since it is not present!

5.2. Features Based on Bigrams with Intervening Skipped Tokens:
---------------------------------------------------------------

Normally, a sequence of words must be consecutive for it to exist as a feature. As such, the feature "drawings sculpture" does not exist in the instance above, since those two words are separated by "and" in the instance.
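The consecutive-bigram case can be approximated with an ordinary regular expression over the tokenized text. The sketch below is illustrative Python, not nsp2regex.pl's actual output (which additionally handles windows and the skipping of <...> items, described next):

```python
import re

def bigram_regex(w1, w2):
    # Consecutive bigram over tokenized text: the two tokens must be
    # adjacent, separated by exactly one space.
    return r"(?:^| )" + re.escape(w1) + " " + re.escape(w2) + r"(?: |$)"

instance = ("Paintings , drawings and sculpture from every period "
            "of art during the last 350 years will be on display .")

present = bool(re.search(bigram_regex("period", "of"), instance))          # True
out_of_order = bool(re.search(bigram_regex("of", "period"), instance))     # False: order matters
split_pair = bool(re.search(bigram_regex("drawings", "sculpture"), instance))  # False: "and" intervenes
```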
However, we could specify in a regular expression that the words in a feature are allowed to have other words between them. Thus we may modify the "drawings sculpture" feature so that at most one word is allowed between its two words. In that case this feature exists! This can be done in nsp2regex.pl by adding a "directive" before the bigram in the input to the program:

@count.WindowSize=3
drawings<>sculpture<>

This specifies that the "window" within which we allow our tokens to match (in this case "drawings" and "sculpture") is 3 words long. Thus the tokens "drawings" and "sculpture" may be separated by at MOST one token. Program nsp2regex.pl takes the above input and produces a regular expression that would match the instance above. Similarly, if we have the following as the input to nsp2regex.pl:

@count.WindowSize=3
Paintings<>drawings<>

then the regular expression that nsp2regex.pl produces would match the instance above because the "," is a token too! However, for this, nsp2regex.pl must be provided with a token file in the same way as preprocess.pl was. Moreover, this should be the same file provided to preprocess.pl so that the token definitions of the two programs match. One may provide the token file, say token.txt, to nsp2regex.pl using the --token switch. This ensures that when nsp2regex.pl creates regex's that ignore intervening tokens, its definition of a token coincides with preprocess.pl's definition of a token.

5.3. Skipping of "Tags", and Other Kinds of Features:
-------------------------------------------------------

By default, we assume that any string enclosed within < and > is ignored during feature identification. This is because such strings are either non-tokens or they are XML tags. We call this "skipping". Thus the feature corresponding to the tokens "of art" does get a value of 1 in the instance above, even without requiring a window of 3 tokens, since the XML tag <head> gets skipped.
However, if we specify that a feature includes a string that begins with a < and ends with a > then that string will not be ignored. Note that since we require our input text to be XML, any "normal" use of < or > will be designated using the XML conventions &lt; and &gt;. On the other hand, sometimes we may explicitly want an XML tag in a feature. Thus, we may have a feature like "<head> art"... this feature gets a 1 above. Similarly, the feature "art during" also gets a 1... the "</head>" is "skipped" as before.

While the features used to create feature vectors can be decided "by hand", one could also use the Ngram Statistics Package [1]. See README.details.txt for more on how this may be done.

5.4. Important Note:
--------------------

Note that the regular expressions that nsp2regex.pl creates assume that the text on which they are going to be applied has already been tokenized, and that tokens have been separated by single spaces. Thus it is essential to run preprocess.pl on the text before running the regular expressions created by nsp2regex.pl on it!

5.5. Further Details in File README.details.txt:
------------------------------------------------

File README.details.txt continues with a detailed description of the regular expressions created in various situations, what they match and what they mean. It also gives pointers on how to use NSP to generate features.

6. Program xml2arff.pl:
-----------------------

This program converts Senseval-2 lexical files into a feature vector representation suitable for use with the Weka machine learning system. The input to xml2arff.pl is a set of features specified as regular expressions (created by nsp2regex.pl) and XML formatted Senseval-2 lexical file(s). xml2arff.pl converts the lexical file(s) to a vector representation using the regular expressions. The vector representation follows the arff format used by Weka, free and open source software written in Java that implements many machine learning algorithms.
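At its core, the conversion xml2arff.pl performs amounts to evaluating each feature regex against each instance's tokenized context. A simplified Python sketch (the real program reads its regexes from nsp2regex.pl's output file; the helper name here is ours):

```python
import re

def feature_vector(context, feature_regexes):
    # One binary attribute per feature regex: 1 if the regex matches
    # anywhere in the tokenized context of the instance, else 0.
    return [1 if re.search(rx, context) else 0 for rx in feature_regexes]

context = "every period of art during the last 350 years"
vector = feature_vector(context, [r"period of", r"of period"])   # [1, 0]
```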
See [3] for more information on Weka as well as on the .arff format that Weka uses; we present here a brief description of the .arff format.

6.1. Description of the arff format:
------------------------------------

Recall that an "instance" in the SENSEVAL-2 lexical source files is defined as all the data within the <instance> tag pair, and consists of one or more sentences containing one occurrence of the target word to be disambiguated, demarcated by the <context> tag pair. Further recall that if the input file is a training file, then each such instance has an <answer> tag specifying the correct senseid(s) of the target word for this instance. The arff format also consists of "instances"; we shall distinguish between the two kinds of instances by calling the ones in the input lexical file "lexical instances" and those in the arff format "arff instances".

The output arff file has three components: the relation name, the attributes and the data itself.

6.1.1. The relation name:
-------------------------

The relation name specifies a name for the dataset. It can be chosen from the command line using the option --relation. If not chosen, it defaults to the string "RELATION". Besides giving a name to the dataset, it is not used for anything else.

6.1.2. The attributes:
----------------------

The "attributes" correspond to the "features" discussed previously. If there are N regular expressions in the input regex file, there are N+1 attributes in the output arff-format file. Of these attributes, attributes 1 to N are derived from regex's 1 to N respectively, as contained in the input regex file. Each of these N attributes is binary: 0/1. Attribute N+1, the "class" attribute, corresponds to the senseid of the target word. This attribute has as many "values" as there are unique senseid's used in the <answer> tags of the xml source file. If two xml source files are used, this "class" attribute has as many values as there are unique senseid's in the <answer> tags across the two source files.

6.1.3. The data:
----------------

The data section consists of at least as many arff instances as there are lexical instances in the input SENSEVAL source file. Specifically, for a training lexical file, there is one arff instance for each answer tag of each lexical instance, while for a test lexical file, the numbers of lexical instances and arff instances are the same.

There are two formats for an arff instance, the "normal" format and the "sparse" format. We use the "sparse" format; in this format each arff instance consists of a set of comma separated ordered pairs of numbers. The first number of each such pair represents an attribute number (attributes are numbered from 0 upwards, in the same order as they occur in the @attribute section). The second number of the pair denotes the value that this attribute gets. Only those attributes that get a non-zero value are mentioned; those having a value of 0 are not shown in the output. For example, following is a typical arff instance:

{3 1, 5 art~1:06:00::} % art.n art.40001

This says that attribute 3 has a value of 1, attribute 5 (the class attribute) has a value of art~1:06:00:: and the other attributes have a value of 0 each. The "% art.n art.40001" is not a part of the arff instance as such... it's a comment that is used by WekaClassify for creating output that follows the SENSEVAL-2 answer key format.

Note that for the (N+1)th attribute, which corresponds to the senseid of the target word, the senseid in the answer tag of the lexical instance is reported. In case the input file does not contain answer tags (e.g. when it is a test file), this field is given a '?', which implies that it is a "missing" value. In case the input file contains more than one answer for a single lexical instance, then for each answer a separate arff instance is created. This is because the arff format has no mechanism to report multiple answers for the same arff instance.
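The construction of sparse arff instances, including the duplication performed for multiple answers, can be sketched as follows (a hypothetical Python helper; attribute numbering and comment layout follow the examples in this README, but the function itself is not part of SenseTools):

```python
def sparse_arff_lines(vector, senseids, lexelt, instance_id):
    # Only non-zero attributes are listed; the class attribute (number
    # N for an N-feature vector) comes last. One arff instance is
    # emitted per answer, since arff cannot hold multiple answers; an
    # empty answer list yields '?', the missing-value marker.
    pairs = ["%d 1" % i for i, v in enumerate(vector) if v]
    class_attr = len(vector)
    lines = []
    for sense in (senseids or ["?"]):
        fields = pairs + ["%d %s" % (class_attr, sense)]
        lines.append("{" + ", ".join(fields) + "} % " + lexelt + " " + instance_id)
    return lines

# Two answers for one lexical instance give two arff instances:
for line in sparse_arff_lines([0, 0, 0, 1, 0],
                              ["art~1:06:00::", "fine_art~1:06:00::"],
                              "art.n", "art.40001"):
    print(line)
```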
In all these multiple arff instances that correspond to different answers of the same lexical instance, every field except the last one is the same.

6.2. Running the Program:
-------------------------

There are two ways to run xml2arff.pl:

1) By passing a single regular expression file and a single lexical sample file in SENSEVAL-2 format.

2) By passing a single regular expression file and a pair of lexical sample files in SENSEVAL-2 format (using the --training and --test options).

In both cases, the input lexical sample files are converted into the arff format discussed above. In the second case, the senseid's used as the possible values of the (N+1)th attribute in the arff-format output of both the training and test files are collected from the answer tags of both input files. This is necessary because a model created from a training file can only be used to generate answers for a test file whose attributes are exactly the same as those of the training file. Although the test file does not have any answer tags and therefore no senseid's, its (N+1)th attribute must still look exactly like the (N+1)th attribute of the training file, which will have several senseid's. Thus when one has a training file and a test file, it is necessary to use the second method above to ensure that the two output arff files do indeed have the same attributes, each with the same range of possible values.

6.3. An Illustrative Example:
-----------------------------

Let us say that the following is our input training xml file. Let us call this file art.n.xml.
<corpus lang='english'>
<lexelt item="art.n">
<instance id="art.40001">
<answer instance="art.40001" senseid="art~1:06:00::"/>
<answer instance="art.40001" senseid="fine_art~1:06:00::"/>
<context>
art you can dance to from the creative group called halo
</context>
</instance>
<instance id="art.40002">
<answer instance="art.40002" senseid="art_gallery~1:06:00::"/>
<context>
There always one <1> to be heard somewhere during the summer in the piazza in front of the art gallery and town hall or in a park
</context>
</instance>
</lexelt>
</corpus>

Let us say we want to make regex's out of the following features, as discussed previously:

@count.WindowSize=3
one<>to<>
during<>summer<>
the<>art<>
art<>can<>
heard<>somewhere<>
<1><>to<>

Recall that the line "@count.WindowSize=3" implies that between the tokens we allow at most one word to occur. Additionally, of course, we would like the regex's to skip any XML tags and non-tokens that come in between and that are not explicitly mentioned in the feature.

Passing the above to nsp2regex.pl creates a regular expression file, each regex corresponding to one of the features above. Say this file is called regex.txt. At this point if we were to run the following command:

xml2arff.pl regex.txt art.n.xml

we would get art.n.xml.arff, which would look like so (the line numbers have been added here for explanation purposes):

Line 1: @relation 'RELATION'
Line 2: @attribute 'one<>to<>1' {0,1}
Line 3: @attribute 'during<>summer<>1' {0,1}
Line 4: @attribute 'the<>art<>1' {0,1}
Line 5: @attribute 'art<>can<>1' {0,1}
Line 6: @attribute 'heard<>somewhere<>1' {0,1}
Line 7: @attribute '<1><>to<>1' {0,1}
Line 8: @attribute 'class' {fine_art~1:06:00::, art_gallery~1:06:00::, art~1:06:00::}
Line 9: @data
Line 10: {3 1, 6 art~1:06:00::} % art.n art.40001
Line 11: {3 1, 6 fine_art~1:06:00::} % art.n art.40001
Line 12: {0 1, 1 1, 2 1, 4 1, 5 1, 6 art_gallery~1:06:00::} % art.n art.40002

Line 1 takes the default relation name RELATION. Lines 2 through 7 declare the attributes corresponding to the regex's from the regex file. Notice the names are taken from the "@name = foo" portion of the output from nsp2regex.pl. Line 8 lists the senseid's found in the input xml file; here we have three. Line 9 declares that the data starts here.
Lines 10, 11 and 12 are the feature vector representations of the two lexical instances. Note that this is the sparse format mentioned previously. Notice that attributes 0, 1, 2, 4 and 5 get a value of 0 for instance art.40001 but get 1 for instance art.40002. Similarly, attribute 3 gets 1 for art.40001 but 0 for art.40002. The last pair denotes the "class attribute" and has a value picked out of the senseid of the instance.

Since the first lexical instance, art.40001, has two answer tags, there are two arff instances corresponding to that one lexical instance. These two instances differ only in their senseid attribute, that is, the last attribute. The second lexical instance, art.40002, has only one answer tag, and hence has only one corresponding arff instance.

In lines 10, 11 and 12, the data after the % sign is treated as a comment in the arff format. This data specifies the lexelt and the instance id... these are used by program WekaClassify to generate answer files in SENSEVAL format (more on that below).

Note: On looking at the attributes "matched", one sees that the matching has indeed proceeded as expected. "one to" has matched the string "one <1> to" because the non-token sequence <1> is "skipped". "during summer" has matched the string "during the summer" because we are skipping at most one word. "art can" has matched since "you" is skipped. Also, "<1> to" has matched: recall that non-token sequences are skipped ONLY when they do not occur in the feature itself!

7. Program WekaClassify:
------------------------

This java program takes a Senseval test file in arff format (as output by xml2arff.pl) and classifies the instances in the file using the stored representation of a "model" learned by Weka from a set of training data.

7.1. The Model:
---------------

WekaClassify uses a previously learned (and stored) model to classify instances in the given test arff file.
This model has to be created by training some classifier (such as a
decision tree, a neural network, a naive Bayesian classifier, etc.) on
training data consisting of instances of the same target word as that in
the test arff file. Moreover, the model has to be created using Weka. One
can create such a model in Weka by using the -d switch (see Weka's [3]
help and documentation for more information). A model thus saved contains
information about the type of classifier used as well as other facts
"learned" from the training data.

WekaClassify uses this model to classify the instances in the test file.
The model file can be passed to WekaClassify using the -d option.
Classification is done by calling the Weka library routines. Hence Weka
should be available on the Java CLASSPATH for this program to run
properly.

7.2. Other Command-line Options:
--------------------------------

Other command-line options include the -t switch to specify the test file
and the -p switch to specify the level of precision required. By default,
values are output to 4 decimal places.

7.3. The Output of WekaClassify:
--------------------------------

For each instance in the test file, WekaClassify outputs a probability
distribution over all the possible 'class' values. As expected, these
values range from 0 (implying this particular instance has zero
probability of belonging to this particular class) to 1 (implying this
particular instance certainly belongs to this particular class). For
SENSEVAL test files, "class" values are the possible senseid values that
the target word in the test data may assume. This output is in the answer
file format required by SENSEVAL (and used by the scorer.python program).

Following is the format of the output of WekaClassify: ...
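Each answer line pairs a lexelt and an instance id with a series of
sense/score fields. As an illustration only (a hypothetical Python helper,
not part of SenseTools, which is written in Perl and Java), such a line
could be parsed like so:

```python
# Hypothetical sketch: parse one answer line of the form
# "lexelt instance-id sense/score sense/score ..." into its parts.
def parse_answer_line(line):
    fields = line.split()
    lexelt, instance_id = fields[0], fields[1]
    scores = {}
    for field in fields[2:]:
        # Split on the last '/' so the sense string survives intact.
        sense, score = field.rsplit("/", 1)
        scores[sense] = float(score)
    return lexelt, instance_id, scores

lexelt, iid, dist = parse_answer_line(
    "art.n art.40001 art_gallery~1:06:00::/0.0 "
    "fine_art~1:06:00::/0.0 art~1:06:00::/1.0")
```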
For example, suppose for our art.n.xml.arff file above we find the
following:

art.n art.40001 art_gallery~1:06:00::/0.0 fine_art~1:06:00::/0.0 art~1:06:00::/1.0
art.n art.40002 art_gallery~1:06:00::/0.25 fine_art~1:06:00::/0.25 art~1:06:00::/0.50

For art.40001, this tells us that the first two senses have zero
probability of occurring, while the third sense has 100 percent
probability of occurring.

8. Program xml2key.pl:
----------------------

This program takes one or more training files in the SENSEVAL format and
creates a "key" file that contains the correct sense assignments as taken
from the training files. This key file can be directly used with
scorer.python as used in SENSEVAL-2. The tags in the training files are
used to generate this output. For example, for the art.n.xml file above,
we would get the following output key file:

art.n art.40001 art~1:06:00:: fine_art~1:06:00::
art.n art.40002 art_gallery~1:06:00::

9. Program tilde.pl:
--------------------

We represent WordNet senseid's for the English lexical sample data of
Senseval-2 like "art~1:06:00::". However, the actual format in WordNet is
"art%1:06:00::". It is an unfortunate coincidence that the arff format of
Weka treats the % sign as a comment, so if we don't remove this from the
senseid's we will find our arff instances prematurely terminated by this
comment. Thus, we replace the "%" sign with the "~" sign early in the
process, usually immediately before or after running preprocess.pl. We
then replace the "~" sign with the "%" sign after WekaClassify.java has
run, so that our results have the same form as was standard in Senseval-2
and can be scored against the "official" Senseval-2 key files.

This program takes an input file and looks for strings of characters that
look like a senseid (either ~ or % followed by the sequence numbers, ':',
numbers, ':', numbers, ':'). If the first character is a ~ it is replaced
with a %, and vice versa. Thus this program goes both directions...
both to replace % with ~ and ~ with %. For example, if the key file above
is run through tilde.pl, we would get:

art.n art.40001 art%1:06:00:: fine_art%1:06:00::
art.n art.40002 art_gallery%1:06:00::

Similarly, if the training file is run through tilde.pl, the senseid's
should get suitably converted. Thus running example.xml through tilde.pl
gives:

art you can dance to from the creative group called halo There always one
<1> to be heard somewhere during the summer in the piazza in front of the
art gallery and town hall or in a park

10. Program ensembleByDist.pl:
------------------------------

This program creates an "ensemble" of results by adding together the
output from a number of different models produced by Weka. Each of these
models can be based on different learning algorithms and/or feature sets,
and the hope is that their combined performance will be better than that
of any of the individual models. Each model outputs a distribution of
probabilities or confidence measures for each possible sense of each
instance, and these are all summed together to arrive at a final score
for each sense of each instance. Normally the sense with the highest
score would be assigned to the instance, and this is carried out by
winnerTakeAll.pl.

This program expects a certain directory structure. We shall explain the
directory structure through an example. Assume that we have represented
the training data using two different feature sets, Bigrams and Trigrams.
Further assume that for each feature set, we have trained two
classifiers, say a NaiveBayesian classifier and a DecisionTree. Finally,
say we have two words to disambiguate, word-1 and word-2, representing
two different tasks.
Our directory structure should be as follows:

Bigrams
    word-1
        NaiveBayesian.answer.dist
        DecisionTree.answer.dist
    word-2
        NaiveBayesian.answer.dist
        DecisionTree.answer.dist
Trigrams
    word-1
        NaiveBayesian.answer.dist
        DecisionTree.answer.dist
    word-2
        NaiveBayesian.answer.dist
        DecisionTree.answer.dist

where Bigrams and Trigrams are our top-most directories. Each contains
one directory for each of the two tasks, word-1 and word-2. Each word
directory contains one file for each of the two classifiers, containing
the sense assignments as reported by WekaClassify.java. These filenames
must end with ".answer.dist". Thus in our example, both word-1 and word-2
have four answer files each: NaiveBayesian using Bigrams, NaiveBayesian
using Trigrams, DecisionTree using Bigrams and DecisionTree using
Trigrams.

Each of these ".answer.dist" files should be in the format expected by
scorer.python. (Program WekaClassify produces output in this format.)
Thus for each instance in task word-1, there are four probability
distributions over the possible classes in the four files. Program
ensembleByDist.pl will add these probability distributions and output the
total for each.

For each word, ensembleByDist.pl outputs a file consisting of a single
distribution based on the sum of all the input distributions, named
task.ensemble.answer.dist. Thus in the above example, it would create two
such output files: word-1.ensemble.answer.dist and
word-2.ensemble.answer.dist.
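The summing that ensembleByDist.pl performs can be sketched as follows
(a hypothetical Python helper for illustration only; the suite itself is
written in Perl). Each input is a sense-to-score mapping for one instance
as read from one ".answer.dist" file, and the scores are simply added
sense by sense:

```python
# Hypothetical sketch of the ensemble step: combine several answer-file
# distributions (sense -> score) for one instance by summing the scores.
def sum_distributions(dists):
    total = {}
    for dist in dists:
        for sense, score in dist.items():
            total[sense] = total.get(sense, 0.0) + score
    return total

# The four distributions for instance word-1.40001 in the example below
# sum to A/1.25 B/1.0 C/1.75:
combined = sum_distributions([
    {"A": 0.0,  "B": 0.0,  "C": 1.0},   # Bigrams/NaiveBayesian
    {"A": 1.0,  "B": 0.0,  "C": 0.0},   # Bigrams/DecisionTree
    {"A": 0.25, "B": 0.50, "C": 0.25},  # Trigrams/NaiveBayesian
    {"A": 0.0,  "B": 0.5,  "C": 0.5},   # Trigrams/DecisionTree
])
```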
For example, let the following be the contents of the files:

Bigrams/word-1/NaiveBayesian.answer.dist:
word-1 word-1.40001 A/0.0 B/0.0 C/1.0
word-1 word-1.40002 A/0.25 B/0.25 C/0.50

Bigrams/word-1/DecisionTree.answer.dist:
word-1 word-1.40001 A/1.0 B/0.0 C/0.0
word-1 word-1.40002 A/0.75 B/0.25 C/0.0

Trigrams/word-1/NaiveBayesian.answer.dist:
word-1 word-1.40001 A/0.25 B/0.50 C/0.25
word-1 word-1.40002 A/0.0 B/1.0 C/0.0

Trigrams/word-1/DecisionTree.answer.dist:
word-1 word-1.40001 A/0.0 B/0.5 C/0.5
word-1 word-1.40002 A/0.85 B/0.15 C/0.0

Bigrams/word-2/NaiveBayesian.answer.dist:
word-2 word-2.40001 A/1.0 B/0.0
word-2 word-2.40002 A/0.25 B/0.25

Bigrams/word-2/DecisionTree.answer.dist:
word-2 word-2.40001 A/0.1 B/0.9
word-2 word-2.40002 A/0.25 B/0.75

Trigrams/word-2/NaiveBayesian.answer.dist:
word-2 word-2.40001 A/0.4 B/0.6
word-2 word-2.40002 A/0.5 B/0.5

Trigrams/word-2/DecisionTree.answer.dist:
word-2 word-2.40001 A/0.21 B/0.79
word-2 word-2.40002 A/0.52 B/0.48

Then upon running ensembleByDist.pl like so:

ensembleByDist.pl Bigrams Trigrams

we get the following files:

word-1.ensemble.answer.dist:
word-1 word-1.40001 A/1.25 B/1 C/1.75
word-1 word-1.40002 A/1.85 B/1.65 C/0.5

word-2.ensemble.answer.dist:
word-2 word-2.40001 A/1.71 B/2.29
word-2 word-2.40002 A/1.52 B/1.98

Note that the final output will likely not be a probability distribution
over the possible classes in the strict sense, since no normalization is
done. However, if these values are probabilities then scorer.python will
do its own normalization. Also note that some of the Weka learning
methods do not output probabilities, but rather some type of confidence
measure. It is possible to add apples and oranges (probabilities with
confidence measures), so a user should be certain that the ensemble as
constructed is reasonable, and that the summing operation that underlies
it is appropriate.

11.
Program winnerTakeAll.pl:
-------------------------

The output of WekaClassify provides a distribution of probabilities or
confidence measures over all the possible values of the sense/class
feature for an instance. This program takes the output from
ensembleByDist.pl as input and outputs only the sense that has the
highest summed score for each instance. If there is a tie for the highest
value then all senses that share this value are output.

For example, given the following WekaClassify output:

art.n art.40001 art_gallery~1:06:00::/0.0 fine_art~1:06:00::/0.0 art~1:06:00::/1.0
art.n art.40002 art_gallery~1:06:00::/0.25 fine_art~1:06:00::/0.25 art~1:06:00::/0.50

winnerTakeAll.pl will output the following:

art.n art.40001 art~1:06:00::
art.n art.40002 art~1:06:00::

This default behavior can be changed by changing the "margin" used to
determine the winner. If such a margin 'm' is provided, an answer is
output if the difference between its probability and the probability of
the best answer is less than or equal to 'm'. The default margin is 0.
For example, if a margin of 0.6 were used in the example above, we would
get the following from winnerTakeAll.pl:

art.n art.40001 art~1:06:00::
art.n art.40002 art~1:06:00:: fine_art~1:06:00:: art_gallery~1:06:00::

The output is in the format required by scorer.python. A probability
value is not attached to the output answer(s), and therefore equal
probability is assumed for all answers that make it to the output.

12. Replicating Results from Senseval-2 for Duluth Systems
----------------------------------------------------------

The Duluth systems that participated in the Spanish and English lexical
sample tasks in Senseval-2 can be replicated through the use of
SenseTools (v0.1), the Bigram Statistics Package (v0.4), and Weka (v3.2).
There are a set of C shell scripts for Unix/Linux that tie these tools
together and replicate the Senseval-2 systems.
These shell scripts are distributed with SenseTools and can be found in
the Scripts subdirectory. SenseTools-0.3 is not compatible with the
original Duluth systems, but must rather be used with the more recent
versions of the DuluthShell (v0.3), NSP (v0.55), and Weka (v3.2.2). Some
of the Duluth systems that participated in Senseval-2 are described in
[2].

13. Copying:
------------

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.

14. Acknowledgments:
--------------------

This work has been partially supported by a National Science Foundation
Faculty Early CAREER Development award (#0092784) and by a Grant-in-Aid
of Research, Artistry and Scholarship from the Office of the Vice
President for Research and the Dean of the Graduate School of the
University of Minnesota.

15. References:
---------------

1. S. Banerjee and T. Pedersen. The Design, Implementation, and Use of
   the Ngram Statistics Package. Appears in the Proceedings of the Fourth
   International Conference on Intelligent Text Processing and
   Computational Linguistics, February 17-21, 2003, Mexico City. The code
   can be obtained from http://www.d.umn.edu/~tpederse/nsp.html

2. T. Pedersen. A baseline methodology for word sense disambiguation.
   Appears in the Proceedings of the Third International Conference on
   Intelligent Text Processing and Computational Linguistics (CICLING-02),
   February 17-22, 2002, Mexico City.

3. Weka 3 - Machine Learning Software in Java. World Wide Web site:
   http://www.cs.waikato.ac.nz/ml/weka/

4. SENSEVAL-2: Second International Workshop on Evaluating Word Sense
   Disambiguation Systems. World Wide Web site:
   http://www.sle.sharp.co.uk/senseval2/

5. C. Fellbaum (ed). 1998. WordNet: An Electronic Lexical Database. The
   MIT Press.

6. L. Wall, T. Christiansen and J. Orwant. 2000. Programming Perl.
   O'Reilly.