1. Various Issues of Tokenization in Program preprocess.pl:
-----------------------------------------------------------
1.1. Default Regular Expressions:
---------------------------------
Recall that program preprocess.pl uses regular expressions to tokenize
the text that appears between the <context> and </context> tags. Although
tokenization is best controlled via a user-specified tokenization file
designated via the --token option, there is also a default definition
of tokens that is used in the absence of a tokenization file, which
consists of the following:
/\w+/
/[\.,;:\?!]/
According to this definition, a token is either a single punctuation
mark from the specified class, or it is a string of alpha-numeric
characters. Note that this default definition is generally not a good
choice for XML data since it does not treat XML tags as tokens and
will result in them "breaking apart" during pre-processing. For
example, given this default definition, the string
<head>art</head>
will be represented by preprocess.pl as
<<> head <>> art <</> head <>>
which suggests that "<", ">", and "/" are non-tokens, while "art" and
"head" are. This is unlikely to provide useful information.
These defaults correspond to those in NSP [1], which is geared towards
plain text. These are provided as a convenience, but in general we
recommend against relying upon them when processing XML data.
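The breaking-apart behaviour described above can be sketched in Python.
This is a rough approximation of preprocess.pl's matching loop, not its
actual code: at each position the patterns are tried in order, and any
run of unmatched non-space characters is bracketed as a non-token.

```python
import re

def tokenize(text, patterns):
    """Sketch of preprocess.pl-style tokenization: at each position, try
    the patterns in order; unmatched non-space runs become <...>."""
    out, i, nontoken = [], 0, ""
    while i < len(text):
        for pat in patterns:
            m = re.match(pat, text[i:])
            if m and m.group(0):
                if nontoken:                     # flush pending non-token run
                    out.append("<" + nontoken + ">")
                    nontoken = ""
                out.append(m.group(0))
                i += m.end()
                break
        else:
            if not text[i].isspace():
                nontoken += text[i]              # accumulate a non-token run
            elif nontoken:
                out.append("<" + nontoken + ">")
                nontoken = ""
            i += 1
    if nontoken:
        out.append("<" + nontoken + ">")
    return " " + " ".join(out) + " "             # one space around each token

# The default token definitions applied to "<head>art</head>":
print(tokenize("<head>art</head>", [r"\w+", r"[.,;:?!]"]))
# -> " <<> head <>> art <</> head <>> "
```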
1.2. Regular Expression /\S+/:
------------------------------
Assume that the only regular expression in our token file token.txt is
/\S+/. This regular expression says that any sequence of
non-white-space characters is a token. Now, if we run the program like
so:
preprocess.pl example.xml --token token.txt
(where example.xml is the example XML file described in section 3 of
README.txt and token.txt is the file that contains the above regular
expression /\S+/).
We would get all four files: art.n.xml, art.n.count,
authority.n.xml and authority.n.count. From here on we shall show only
the "authority" files to save space; it is understood that the art
files are also created.
File authority.n.xml:
Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work, it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice.
File authority.n.count:
Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work, it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice.
Note that every character is a part of some sequence of
non-white-space characters, and is therefore part of some token. Hence
no character is put into <> brackets. Also, each
non-white-space-character-sequence, that is each token, is placed in
the output with exactly one space character to its left and right.
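The /\S+/ behaviour is easy to mimic in Python: every maximal run of
non-white-space characters is one token. A sketch of the effect, not of
preprocess.pl itself:

```python
import re

text = "certain health <head>authorities</head> to waste"
# Every maximal run of non-white-space characters is one token; the
# output places exactly one space on each side of every token, and
# nothing ends up in <> brackets.
tokens = re.findall(r"\S+", text)
print(" " + " ".join(tokens) + " ")
# -> " certain health <head>authorities</head> to waste "
```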
1.3. Regular Expression /\w+/:
------------------------------
On the other hand if our token file token.txt were to contain the
following regex which treats every sequence of alpha numeric
characters as a token:
/\w+/
... and we were to run the program like so:
preprocess.pl example.xml --token token.txt
... then our authority files would look like so:
File authority.n.xml:
Not only is it allowing certain health <<> head <>> authorities <</> head <>> to waste millions of pounds on computer systems that dont work <,> it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice <.>
File authority.n.count:
Not only is it allowing certain health <<> head <>> authorities <</> head <>> to waste millions of pounds on computer systems that dont work <,> it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice <.>
Note again that since the '<' and '>' of the <head> tags are not
alpha-numeric characters they are considered "non-token"
characters, and are put within the <> brackets. Further note that if
there is more than one such non-token character one after another, they
all get put into one pair of diamond brackets '<' and '>'. As mentioned
in section 1.1 above, the user should include regular expressions that
preserve the <head> tags. Thus for the above example, a regular
expression like /<head>\w+<\/head>/ would work admirably.
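The effect of such a tag-preserving regular expression can be seen with
Python's re module (an illustrative sketch; the '\/' spelling above is
Perl's escaped '/'):

```python
import re

# With the <head> pattern listed first, the whole tagged word survives
# as one token; /\w+/ alone would have broken the tag apart.
pattern = r"<head>\w+</head>|\w+"
print(re.findall(pattern, "health <head>authorities</head> to"))
# -> ['health', '<head>authorities</head>', 'to']
```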
1.4. Other Useful Regular Expressions in the Token File:
--------------------------------------------------------
Besides the regular expressions /<head>\w+<\/head>/ and /\w+/, we have
found the following regular expressions useful too.
/[\.,;:\?!]/ - This states that a single occurrence of one of the
punctuation marks in the list is a token. This helps
us specify that a punctuation mark is indeed a token
and should not be ignored! Further, this allows us to
create features consisting of punctuation marks using
NSP.
/&([^;]+;)+/ - The XML format forces us to replace certain meta
symbols in the text by their standard escape sequences. For
example, if the '<' symbol occurs in the text, it is
replaced with "&lt;". Similarly, '-' is replaced with
"&dash;". This regular expression recognizes these
constructs as tokens instead of breaking them up!
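A quick check of the entity pattern in Python (group(0) is the full
match; the sample string is made up for the example):

```python
import re

# The entity regex from above keeps an escape sequence such as "&dash;"
# together as a single token instead of splitting it at '&' and ';'.
m = re.search(r"&([^;]+;)+", "a &dash; b")
print(m.group(0))   # -> &dash;
```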
1.5. Order of Regular Expressions Is Important:
-----------------------------------------------
Recall that at every point of the "input string", the matching
mechanism marches down the regular expressions in the order they are
provided in the input regular expression file, and stops at the FIRST
regular expression that matches. Thus the order of the regular
expression makes a difference. For example, say our regular expression
file has the following regular expressions in this order:
/he/
/hear/
/\w+/
and our input text is "hear me"
Then our output text is " he ar me "
On the other hand, if we reverse the first two regular expressions
/hear/
/he/
/\w+/
we get as output " hear me "
Thus as expected, the order of the regular expressions defines how the
output will look.
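The first-match-wins behaviour can be sketched with Python's re.match,
which, like the mechanism described above, tries each pattern at the
current position (a sketch of the matching mechanism, not its code):

```python
import re

def first_match(text, patterns):
    # Try the patterns in order at the start of the text; the FIRST one
    # that matches wins, regardless of which would match a longer string.
    for pat in patterns:
        m = re.match(pat, text)
        if m:
            return m.group(0)
    return None

print(first_match("hear me", [r"he", r"hear", r"\w+"]))  # -> he
print(first_match("hear me", [r"hear", r"he", r"\w+"]))  # -> hear
```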
1.6. Redundant Regular Expressions:
-----------------------------------
Consider the following regular expressions:
/\S+/
/\w+/
As should be obvious, every token that matches the second regular
expression matches the first one too. We say that the first regular
expression "subsumes" the second one, and the second regular
expression is redundant. This is because the matching mechanism will
always stop at the first regular expression, and never get an
opportunity to exercise the second one. Note of course that this does
not adversely affect anything.
1.7. Ignoring Non-Tokens using --removeNotToken:
------------------------------------------------
Recall that characters in the input string that do not match any regular
expression defined in the token file are put into angular (<>) brackets.
You may, if you wish, remove these "non-tokens", that is, not have them
appear in the output XML and count files, by using the switch
--removeNotToken.
Thus, for the following text:
No, he has no <head>authority</head> on me!
and with regular expressions
/<head>\w+<\/head>/
/\w+/
and if we were to run the program with the switch --removeNotToken,
preprocess.pl would convert the text into:
No he has no <head>authority</head> on me
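For a single token regex /\w+/, the effect of --removeNotToken can be
mimicked by keeping only the token matches (a sketch of the effect on
plain text, not of preprocess.pl itself):

```python
import re

text = "No, he has no authority on me!"
# Only material matching the token regex survives; the non-tokens
# (',' and '!') are dropped rather than bracketed.
print(" ".join(re.findall(r"\w+", text)))
# -> No he has no authority on me
```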
1.8. Ignoring Non-Tokens using --nontoken:
------------------------------------------
The --nontoken option allows a user to specify a list of regular
expressions. Any strings in the input file that match this list
are removed from the file prior to tokenization.
It's important to note the order in which tokenization occurs.
First, the strings that match the regexes defined in the nontoken
file are removed. Then the strings that match the regexes defined in
the token file are matched. Finally, if --removeNotToken is given,
those strings that do not match the token regexes are removed. Thus,
the "order" of precedence during tokenization is:
--nontoken
--token
--removeNotToken
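The precedence above can be sketched as a three-step pipeline.
Illustrative only: the bracketed marker "[sic]" and the function name
are made up for the example.

```python
import re

def pipeline(text, nontoken_pats, token_pat, remove_not_token):
    # 1. --nontoken: strings matching these regexes are deleted first.
    for pat in nontoken_pats:
        text = re.sub(pat, "", text)
    # 2. --token: pick out the tokens.
    tokens = re.findall(token_pat, text)
    # 3. --removeNotToken: with it, only the tokens survive. (Without it,
    #    preprocess.pl would bracket the leftovers instead; that part is
    #    omitted from this sketch.)
    if remove_not_token:
        return " ".join(tokens)
    return text

print(pipeline("No, he has [sic] no authority!", [r"\[sic\]"], r"\w+", True))
# -> No he has no authority
```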
2. Information Insertion using Program preprocess.pl:
-----------------------------------------------------
2.1. Inserting lexelt and senseId Information:
----------------------------------------------
The lexelt information and the senseId information are outside the
<context> region. Program preprocess.pl gives you the
capability to bring these pieces of information inside the context.
Switch --useLexelt puts the <lexelt item="WORD"> tag within the
<context> tags, where WORD is the word in the immediately
preceding <lexelt> tag.
Switch --useSenseid puts the <sense id="XXXXX"/> tag within the
<context> tags, where XXXXX is the number in the immediately
preceding <answer> tag.
For example, running the program like so:
preprocess.pl example.xml --useLexelt --useSenseid --token token.txt
produces this for authority.n.xml:
<lexelt item="authority.n"> <sense id="XXXXX"/> Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work , it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice .
Note that the extra information is put inside the <context>
region. Hence the user has to provide a token file that will preserve
these <lexelt> and <sense> tags. For instance, as shown in the previous
section, if one were to rely on the default regex's, these tags would
not be preserved (the '<' and '>' would be considered non-token
symbols) and the lexelt and senseid information would not be included
within the <context> tags.
So for example, a regular expression file along the following lines is
adequate:
/<lexelt[^>]*>/
/<sense[^>]*>/
/<head>\w+<\/head>/
/\w+/
2.2. Inserting Sentence-Boundary Tags:
--------------------------------------
The English lexical sample data available from SENSEVAL-2 is such that
each sentence within the <context> tags is on a line of its
own. This human-detected sentence boundary information is usually lost
in preprocess.pl, but can be preserved using the switch
--putSentenceTags. This puts each line within <s> and </s>
tags. Assuming that each sentence was originally on a line of its own,
<s> then marks the start of a sentence and </s> marks its end. Note
that no sentence boundary detection is done: if the end of line
character (\n) does not match the end of a sentence, then the <s> and
</s> tags will not be indicative of a sentence boundary either.
For example, assume the following is our source xml file, source.xml:
This is the first line
This is the second line
This is the last line for <head>word</head>
Further assume our token file is this:
/<head>\w+<\/head>/
/<s>/
/<\/s>/
/\w+/
Running preprocess.pl like so:
preprocess.pl --token token.txt source.xml
Produces the following word.xml file:
This is the first line This is the second line This is the last line for <head>word</head>
and the following word.count file:
This is the first line This is the second line This is the last line for <head>word</head>
However, running preprocess.pl like so:
preprocess.pl --token token.txt --putSentenceTags source.xml
Produces the following word.xml file:
<s> This is the first line </s> <s> This is the second line </s> <s> This is the last line for <head>word</head> </s>
and the following word.count file:
<s> This is the first line </s> <s> This is the second line </s> <s> This is the last line for <head>word</head> </s>
Note that the <s> and </s> tags are placed into the data BEFORE the
tokenization process. Hence a token regular expression that preserves
these tags is required! The token file shown above is adequate for
this.
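The insertion step itself is simple: each line of the context is wrapped
before tokenization runs. A sketch of the idea (preprocess.pl does this
internally):

```python
lines = ["This is the first line",
         "This is the second line"]
# Each original line becomes one <s> ... </s> region; tokenization then
# runs over the result, which is why the token file must preserve the tags.
tagged = " ".join("<s> " + line + " </s>" for line in lines)
print(tagged)
# -> <s> This is the first line </s> <s> This is the second line </s>
```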
3. Splitting Input Lexical Files using preprocess.pl:
-----------------------------------------------------
Besides splitting the lexical elements into separate files,
preprocess.pl also allows you to split the instances of a single
lexical element into separate "training" and "test" files.
If one is attempting to replicate the results of the Senseval-2
systems, then it is appropriate to use preprocess.pl as described
in Overall.txt and to use the given test/evaluation files from
Senseval-2. However, if one has a corpus of sense-tagged text, it is
often desirable to divide that sense tagged text into training and
test portions in order to develop or tune a methodology. This is the
intention of the --split option.
The --split option of preprocess.pl allows you to specify an integer
N... the instances of each lexical element in the input XML SOURCE
file are split into two files approximately in the ratio N:(100-N).
If an output XML file "foo" is specified through the switch --xml then
two files, foo-training.xml and foo-test.xml are created.
If an output count file "foo" is specified through the switch --count
then two files, foo-training.count and foo-test.count are created.
Creation of XML and count output files can be suppressed by using the
--noxml and --nocount switches respectively.
If neither --noxml nor --xml switches are used, then files of the type
word-training.xml, word-test.xml are created.
If neither --nocount nor --count switches are used, then files of the
type word-training.count, word-test.count are created.
The instances are shuffled before being put into training and test
files. Perl automatically seeds the randomizing process... but you can
specify your own seed using the switch --seed.
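The N:(100-N) split with a seeded shuffle can be sketched as follows
(illustrative names and structure; not the script's code):

```python
import random

def split_instances(instances, n, seed=None):
    # Shuffle, then cut at roughly N% of the instances: the first part
    # plays the role of *-training, the remainder of *-test.
    rng = random.Random(seed)      # like --seed: makes the shuffle repeatable
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * n / 100)
    return shuffled[:cut], shuffled[cut:]

train, test = split_instances(range(10), 70, seed=1)
print(len(train), len(test))   # -> 7 3
```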
4. Using NSP to Create Features for nsp2regex.pl:
-------------------------------------------------
Recall that preprocess.pl can be used to generate *.count files that
contain only the tokenized text within the <context> tags of
all the instances. Program count.pl from the Ngram Statistics Package
(NSP) can be used to generate all n-word sequences that occur in these
*.count files. Additionally, count.pl can create n-word sequences that
span over more than n words... that is n-word sequences that had one
or more intervening words in the source that were skipped over. If
this is done then it is necessary to use the --extended switch of
count.pl so that the output of count.pl has the
"@count.WindowSize=..." directive in it. Recall that nsp2regex.pl uses
this directive to detect that skipping of tokens needs to be done. The
output of count.pl can be directly used as input to nsp2regex.pl.
Instead of using the output of count.pl directly, one could also run
program statistic.pl of NSP on the output of count.pl to select a
subset of the features found by count.pl. The output of statistic.pl
can be used directly with nsp2regex.pl.
Note that the output of both count.pl and statistic.pl consist of
tokens separated by <> signs. Further, the last <> is usually followed
by a sequence of numbers. Recall however that everything after the
last <> is ignored by nsp2regex.pl. Thus the output of count.pl and
statistic.pl can be directly used with nsp2regex.pl.
Similar to most programs in SenseTools, both count.pl and statistic.pl
can accept a token file to tokenize the input text. Usually in one
experiment, all these token files should be the same to keep the
tokenization consistent. If no token files were used with
preprocess.pl, then no token file need be used with NSP either... the
default tokens are the same in all programs in NSP as well as
SenseTools. However, as mentioned elsewhere, it is advisable to use a
token file with preprocess.pl, and to then use the same token file in
every program thereafter.
5. Creating Features by Hand for nsp2regex.pl:
----------------------------------------------
It is not necessary that one use NSP to create features for the
lexical files. In particular, one cannot produce features that include
non-tokens using NSP since programs in NSP (like count.pl and
statistic.pl) ignore non-tokens.
For example, assume the following lexical file:
Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work <,> it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice <.>
We may want to create a feature where the word "practice" is followed
immediately with a period. In this case our input to nsp2regex.pl
would be "practice<><.><>".
Similarly we may want a feature where the lexelt is authority.n. In
this case, our input to nsp2regex.pl would be
"<lexelt item="authority.n"><>".
6. Explanation of Regular Expressions Created Using nsp2regex.pl:
----------------------------------------------------------------
6.1. Default Regular Expression (without Skipping Intermediate Tokens):
-----------------------------------------------------------------------
Recall that by default, nsp2regex.pl creates regex's that match space
separated tokens. Recall that the regular expressions that
nsp2regex.pl creates are based on the assumption that the text on
which these regex's are going to be used has tokens separated by a
single space. Further the regular expressions thus created ignore XML
tags and non-tokens, as described in the examples above.
For example, the following line in the input to nsp2regex.pl:
a<>bigram<>
is converted to the following regex:
/\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram
In this output, everything from the first / to the last / constitutes
the regular expression. The portion "@name = a<>bigram" is used by
xml2arff for giving a name to the attribute corresponding to this
regular expression.
6.1.1. What This Regular Expression will Match:
-----------------------------------------------
This regular expression defines a feature that will match the tokens
"a" and "bigram" under the following conditions:
i> Tokens "a" and "bigram" have exactly one space to their left and
right. For example, this regex will match the sentence " this is a
bigram ". This regex will not match the sentence " i wanna bigram "
nor the sentence " i have a bigrams ". It will not even match " I
have a  bigram " (with two spaces between "a" and "bigram"). This is
because nsp2regex.pl creates regular expressions that assume that
there is exactly ONE space character between tokens!
ii> Tokens "a" and "bigram" are bounded by one or more XML tags or
non-tokens, that is, a sequence of characters that starts with '<'
and ends with '>'. E.g., this regex will match the sentence " this
is a<,> bigram ". This regex will also match " this is
<,>a bigram ".
iii> Tokens "a" and "bigram" are separated by one or more space
separated XML tags. E.g., this regex will match the sentence " this
is a <,> bigram ". It will also match " this is a <,> <.> bigram
" and " this is a <,> <,> bigram ".
iv> combinations of the above cases.
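These conditions can be checked directly; Perl and Python regex syntax
agree on this pattern, so here is the generated regex tried against the
example sentences (a verification sketch, not part of the toolkit):

```python
import re

# The regular expression generated for the feature a<>bigram<>.
feature = re.compile(
    r"\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*bigram(<[^>]*>)*\s")

assert feature.search(" this is a bigram ")        # one space each side
assert feature.search(" this is a <,> bigram ")    # skips the XML tag
assert not feature.search(" i wanna bigram ")      # no token 'a'
assert not feature.search(" i have a bigrams ")    # 'bigrams' != 'bigram'
assert not feature.search(" i have a  bigram ")    # TWO spaces: no match
print("all feature checks passed")
```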
6.1.2. Explanation of this Regular Expression:
----------------------------------------------
Following is an explanation of the various parts of the regular
expression:
/\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram
a> The portion between the first '/' and the last '/' constitutes the
regular expression.
b> The regular expression starts with requiring a single space
character, \s. This is consistent with the assumption that every
token has exactly one space to its left and one to its right.
c> The next chunk is (<[^>]*>)*a(<[^>]*>)*
Note that the portion (<[^>]*>) represents exactly our definition
of an XML tag, namely that it should start with a '<', have 0 or
more characters, except the '>' character, and then end with the
'>' character. The '*' outside the bracket denotes that we are
willing to match 0 or more such tags. After that, we wish to match
a single occurrence of the first token, 'a', again followed by 0 or
more tags. Note that the tags are "stuck" to the token 'a', in that
there is no space between the tag and the token 'a'. Of course if
in the text there is a space between an XML tag and 'a', then that
space would match the \s described above.
d> Having matched token 'a' with 0 or more tags "stuck" to its right
and left, we now wish to match exactly a single space character
through the \s. Again this corresponds to our assumption that
tokens in the text are separated by exactly one space character!
e> The next chunk (<[^>]*>\s)* is again our familiar XML tag. This
time we wish to "skip" over 0 or more occurrences of any XML tag
that lie between the first and the second token, i.e. between 'a' and
'bigram'. Since these are not "stuck" to the next token 'bigram',
they are space separated from each other and from 'bigram'. Hence,
for every token we match, we also match a space character!
f> The next chunk is (<[^>]*>)*bigram(<[^>]*>)* which is exactly like
the chunk for 'a' in point c> above.
g> Finally we wish to match a single space character \s.
h> The portion after the last '/' @name = a<>bigram creates a "name"
for this feature. This name is used by xml2arff while creating the
vector output of the input XML file. While this name is not
necessary, it makes the vector output more human-readable.
6.2. Regular Expression with Skipping of Intermediate Tokens:
-------------------------------------------------------------
Recall that nsp2regex.pl can create regular expressions that ignore
one or more tokens that occur between the tokens to be
matched. Further recall that this can be switched on by having the
directive "@count.WindowSize=..." in the input file to
nsp2regex.pl. We need to provide nsp2regex.pl with the same token file
we provide preprocess.pl... say following is the token file:
/<head>\w+<\/head>/
/\w+/
Let the input file to the nsp2regex.pl program be the following:
@count.WindowSize=3
a<>bigram<>
then, the output regular expression from nsp2regex.pl is:
/\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*((<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram<>1
6.2.1. What This Regular Expression will Match:
-----------------------------------------------
This regular expression will match the tokens "a" and "bigram"
separated by 0 or 1 occurrences of the white space separated token
((<head>\w+<\/head>)|(\w+)). These are the token definitions obtained
from the token.txt file above!
For example, this regular expression will match the following
sentences:
" this is a funny bigram "
" this is a bigram "
" this is a nice bigram "
" this is a <,> bigram "
" this is a <,> nice bigram "
This regular expression will not match:
" this is a really big bigram ",
" i wanna write bigram ".
" this is a , bigram ",
6.2.2. Explanation of this Regular Expression:
----------------------------------------------
Following is a description of various parts of the regular expression:
/\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*((<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram<>1
On careful observation one will notice that the above regular
expression differs from the previous regular expression (section 6.1.2)
in only one portion.
Specifically the portion \s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)* is the
same as above... recall that this matches a space, followed by 'a'
with XML tags or non-token characters (within <> brackets) stuck to
its left and right, followed by a single space, followed by 0 or more
XML tags and non-token characters, with a space after every such tag.
Further note that the portion (<[^>]*>)*bigram(<[^>]*>)*\s is again
the same as before... they match 'bigram' with XML tags and non-token
character tags stuck to its left and right, followed by a single
space.
Thus the only "new" portion in this regex is
((<[^>]*>)*((\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}
We call this the "separator" portion of the regex; this is the portion
that allows for the "ignoring" of up to one token between the tokens
'a' and 'bigram'. This token can be either a \w+ or a
\w+.
a> Observe that the entire section is within a pair of round brackets,
followed by a {0,1}. This says that this portion is allowed to
occur 0 or 1 times. This is consistent with the window size of
3... besides 'a' and 'bigram', we allow at most one other token to
come into the window. If our window size were to be 10 say, this
would be {0,8}.
b> The first part inside this bracketed portion is
(<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*. This says that we
are willing to match either a <head>\w+<\/head> or a \w+. Further
whatever we match can be preceded or followed by an XML tag or a
non-token character enclosed within the angular brackets <>.
c> Having matched either of the two options, we wish to match a single
space, \s, followed by zero or more XML tags or non-tokens, in
keeping with our desire to skip these tags!
e> And, as mentioned in point a> above, we would like to do this matching
at most once, that is there will be at most one such token between
'a' and 'bigram'.
f> The name of the feature has also changed to @name = a<>bigram<>1
implying that we are allowing at most one token to come in between
our two main tokens!
7. A Fine Point about nsp2regex.pl:
-----------------------------------
Fine Point 1: Certain characters, like '.', '*', '?' etc have special
meaning when used within a regular expression. If these characters
occur in the tokens that the regular expression is being built from,
they are "escaped" (by prepending them with a slash '\'). Follwing is
a list of characters that are so escaped: '\', '/', '|', '(', ')',
'[', ']', '{', '}', '^', '$', '*', '+', '?' and '.'
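Python's re.escape performs the same kind of escaping, which makes the
fine point easy to see (shown for comparison; not nsp2regex.pl's code):

```python
import re

# Metacharacters in a raw token are escaped before being embedded in a
# regular expression, so that they match literally.
print(re.escape("a+b?"))   # -> a\+b\?
```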
8. A Brief History of xml2arff.pl:
----------------------------------
Our original approach to xml2arff.pl was to keep the regular
expressions produced by nsp2regex.pl relatively simple, and then
create mechanisms in xml2arff.pl that would do the skipping over XML
tags, non-token characters and other tokens as and when
required. However, this produced extremely slow code, since we had to
make choices of what to skip for every instance and every regular
expression. By creating complicated regular expressions, we shifted
the responsibility of skipping etc to the regular expressions
themselves. This resulted in nearly incomprehensible regular
expressions, but improved speeds markedly since we no longer have to
make the expensive choices of what to skip etc, and of course the Perl
regex engine is very efficient!
9. A Fine Point about xml2arff.pl:
----------------------------------
The attributes in the output arff files are in exactly the same order
as the regular expressions in the input regex file. The order of
attributes is important to certain machine learning algorithms. For
example, while building a tree in the Decision Tree algorithm, if
there is a tie between two attributes, the one occurring earlier is
chosen. By keeping the order of attributes exactly the same as the
order of the input regular expressions, xml2arff.pl allows the user to
control this order. This is particularly useful when the input regex's
originate as bigrams found by the Ngram Statistics Package (NSP). Recall
that bigrams found by NSP are ordered on their frequency or on some
statistical measure of association, and so bigrams higher in the list are
possibly more indicative of information than bigrams lower down the list.
Given a choice therefore, Weka chooses attributes corresponding to higher
ranked bigrams rather than lower ranked ones.
10. Some Common Pitfalls to Avoid with SenseTools:
--------------------------------------------------
1. The input files should be formatted in the SENSEVAL-2 format. If,
for example, a required tag is missing, then the entire file will
be ignored, and there will be no output from preprocess.pl.
2. The token files used with preprocess.pl, NSP programs, nsp2regex.pl
etc should all be the same. If they are not, then the token
definitions of these programs will not coincide and the matching of
the regular expressions may no longer make any "sense". That is,
unexpected matches or mismatches are likely to occur.
3. The regular expressions of nsp2regex.pl must be used ONLY with text
that has already been passed through preprocess.pl. If this is not
true, then again we may end up with unexpected
matches/mismatches.
4. One must exercise caution when passing the same file into
preprocess.pl twice. This may result in unexpected output. For
example, say the text within the <context> region is the
following:
"this is just too cool, man"
And our token regular expressions are:
/<head>\w+<\/head>/
/\w+/
After passing the file through preprocess.pl once, we would get the
following changed text:
" this is just too cool <,> man "
On passing it the second time with the same token file we would get
the following changed text:
" this is just too cool <<,>> man "
If you do have to pass the same text through preprocess.pl twice,
one way would be to use the regular expression /\S+/ the first
time... this would consider everything as tokens and nothing would
be put in <> brackets as non-tokens. Then the second time around,
one can use the main token file. This situation occurs when one
wishes to run preprocess.pl first on the main train file, and then
run the program again on the individual lexelt files created after
the first run of the program.
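The double-bracketing pitfall is easy to reproduce with a crude sketch.
This ignores preprocess.pl's spacing rules; it only shows how a second
pass re-brackets an already bracketed non-token:

```python
import re

def wrap_nontokens(text):
    # Crude sketch: any run that is neither \w characters nor white space
    # is treated as a non-token run and wrapped in <> brackets.
    return re.sub(r"[^\w\s]+", lambda m: "<" + m.group(0) + ">", text)

once = wrap_nontokens("this is just too cool, man")
twice = wrap_nontokens(once)
print(once)    # -> this is just too cool<,> man
print(twice)   # -> this is just too cool<<,>> man
```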