1. Various Issues of Tokenization in Program preprocess.pl:
-----------------------------------------------------------
1.1. Default Regular Expressions:
---------------------------------
Recall that program preprocess.pl uses regular expressions to tokenize
the text that appears between the <context> and </context> tags. Although
tokenization is best controlled via a user-specified tokenization file
designated via the --token option, there is also a default definition
of tokens that is used in the absence of a tokenization file, which
consists of the following:
/\w+/
/[\.,;:\?!]/
According to this definition, a token is either a single punctuation
mark from the specified class, or it is a string of alpha-numeric
characters. Note that this default definition is generally not a good
choice for XML data since it does not treat XML tags as tokens and
will result in them "breaking apart" during pre-processing. For
example, given this default definition, the string
<head>art</head>
will be represented by preprocess.pl as
<<> head <>> art <</> head <>>
which suggests that "<", ">", and "/" are non-tokens, while "art" and
"head" are. This is unlikely to provide useful information.
These defaults correspond to those in NSP [1], which is geared towards
plain text. These are provided as a convenience, but in general we
recommend against relying upon them when processing XML data.
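The breaking-apart behaviour described above can be sketched in Python.
This is a rough approximation of preprocess.pl's matching loop, not its
actual code: at each position the patterns are tried in order, and any
run of unmatched non-space characters is bracketed as a non-token.

```python
import re

def tokenize(text, patterns):
    """Sketch of preprocess.pl-style tokenization: at each position, try
    the patterns in order; unmatched non-space runs become <...>."""
    out, i, nontoken = [], 0, ""
    while i < len(text):
        for pat in patterns:
            m = re.match(pat, text[i:])
            if m and m.group(0):
                if nontoken:                     # flush pending non-token run
                    out.append("<" + nontoken + ">")
                    nontoken = ""
                out.append(m.group(0))
                i += m.end()
                break
        else:
            if not text[i].isspace():
                nontoken += text[i]              # accumulate a non-token run
            elif nontoken:
                out.append("<" + nontoken + ">")
                nontoken = ""
            i += 1
    if nontoken:
        out.append("<" + nontoken + ">")
    return " " + " ".join(out) + " "             # one space around each token

# The default token definitions applied to "<head>art</head>":
print(tokenize("<head>art</head>", [r"\w+", r"[.,;:?!]"]))
# -> " <<> head <>> art <</> head <>> "
```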
1.2. Regular Expression /\S+/:
------------------------------
Assume that the only regular expression in our token file token.txt is
/\S+/. This regular expression says that any sequence of
non-white-space characters is a token. Now, if we run the program like
so:
preprocess.pl example.xml --token token.txt
(where example.xml is the example XML file described in section 3 of
README.txt and token.txt is the file that contains the above regular
expression /\S+/).
We would get all four files: art.n.xml, art.n.count,
authority.n.xml and authority.n.count. From here on we shall show only
the "authority" files to save space; it is understood that the art
files are also created.
File authority.n.xml:
Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work, it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice.
File authority.n.count:
Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work, it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice.
Note that every character is a part of some sequence of
non-white-space characters, and is therefore part of some token. Hence
no character is put into <> brackets. Also, each
non-white-space-character-sequence, that is each token, is placed in
the output with exactly one space character to its left and right.
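The /\S+/ behaviour is easy to mimic in Python: every maximal run of
non-white-space characters is one token. A sketch of the effect, not of
preprocess.pl itself:

```python
import re

text = "certain health <head>authorities</head> to waste"
# Every maximal run of non-white-space characters is one token; the
# output places exactly one space on each side of every token, and
# nothing ends up in <> brackets.
tokens = re.findall(r"\S+", text)
print(" " + " ".join(tokens) + " ")
# -> " certain health <head>authorities</head> to waste "
```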
1.3. Regular Expression /\w+/:
------------------------------
On the other hand if our token file token.txt were to contain the
following regex which treats every sequence of alpha numeric
characters as a token:
/\w+/
... and we were to run the program like so:
preprocess.pl example.xml --token token.txt
... then our authority files would look like so:
File authority.n.xml:
Not only is it allowing certain health <<> head <>> authorities <</> head <>> to waste millions of pounds on computer systems that dont work <,> it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice <.>
File authority.n.count:
Not only is it allowing certain health <<> head <>> authorities <</> head <>> to waste millions of pounds on computer systems that dont work <,> it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice <.>
Note again that since the '<' and '>' of the <head> tags are not
alpha-numeric characters they are considered "non-token"
characters, and are put within the <> brackets. Further note that if
there is more than one such non-token character one after another, they
all get put into one pair of diamond brackets '<' and '>'. As mentioned
in section 1.1 above, the user should include regular expressions that
preserve the <head> tags. Thus for the above example, a regular
expression like /<head>\w+<\/head>/ would work admirably.
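The effect of such a tag-preserving regular expression can be seen with
Python's re module (an illustrative sketch; the '\/' spelling above is
Perl's escaped '/'):

```python
import re

# With the <head> pattern listed first, the whole tagged word survives
# as one token; /\w+/ alone would have broken the tag apart.
pattern = r"<head>\w+</head>|\w+"
print(re.findall(pattern, "health <head>authorities</head> to"))
# -> ['health', '<head>authorities</head>', 'to']
```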
1.4. Other Useful Regular Expressions in the Token File:
--------------------------------------------------------
Besides the regular expressions /<head>\w+<\/head>/ and /\w+/, we have
found the following regular expressions useful too.
/[\.,;:\?!]/ - This states that a single occurrence of one of the
punctuation marks in the list is a token. This helps
us specify that a punctuation mark is indeed a token
and should not be ignored! Further, this allows us to
create features consisting of punctuation marks using
NSP.
/&([^;]+;)+/ - The XML format forces us to replace certain meta
symbols in the text by their standard escape sequences. For
example, if the '<' symbol occurs in the text, it is
replaced with "&lt;". Similarly, '-' is replaced with
"&dash;". This regular expression recognizes these
constructs as tokens instead of breaking them up!
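A quick check of the entity pattern in Python (group(0) is the full
match; the sample string is made up for the example):

```python
import re

# The entity regex from above keeps an escape sequence such as "&dash;"
# together as a single token instead of splitting it at '&' and ';'.
m = re.search(r"&([^;]+;)+", "a &dash; b")
print(m.group(0))   # -> &dash;
```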
1.5. Order of Regular Expressions Is Important:
-----------------------------------------------
Recall that at every point of the "input string", the matching
mechanism marches down the regular expressions in the order they are
provided in the input regular expression file, and stops at the FIRST
regular expression that matches. Thus the order of the regular
expression makes a difference. For example, say our regular expression
file has the following regular expressions in this order:
/he/
/hear/
/\w+/
and our input text is "hear me"
Then our output text is " he ar me "
On the other hand, if we reverse the first two regular expressions
/hear/
/he/
/\w+/
we get as output " hear me "
Thus as expected, the order of the regular expressions defines how the
output will look.
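The first-match-wins behaviour can be sketched with Python's re.match,
which, like the mechanism described above, tries each pattern at the
current position (a sketch of the matching mechanism, not its code):

```python
import re

def first_match(text, patterns):
    # Try the patterns in order at the start of the text; the FIRST one
    # that matches wins, regardless of which would match a longer string.
    for pat in patterns:
        m = re.match(pat, text)
        if m:
            return m.group(0)
    return None

print(first_match("hear me", [r"he", r"hear", r"\w+"]))  # -> he
print(first_match("hear me", [r"hear", r"he", r"\w+"]))  # -> hear
```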
1.6. Redundant Regular Expressions:
-----------------------------------
Consider the following regular expressions:
/\S+/
/\w+/
As should be obvious, every token that matches the second regular
expression matches the first one too. We say that the first regular
expression "subsumes" the second one, and the second regular
expression is redundant. This is because the matching mechanism will
always stop at the first regular expression, and never get an
opportunity to exercise the second one. Note of course that this does
not adversely affect anything.
1.7. Ignoring Non-Tokens using --removeNotToken:
------------------------------------------------
Recall that characters in the input string that do not match any regular
expression defined in the token file are put into angular (<>) brackets.
You may, if you wish, remove these "non-tokens", that is, not have them
appear in the output XML and count files, by using the switch
--removeNotToken.
Thus, for the following text:
No, he has no <head>authority</head> on me!
and with regular expressions
/<head>\w+<\/head>/
/\w+/
and if we were to run the program with the switch --removeNotToken,
preprocess.pl would convert the text into:
No he has no <head>authority</head> on me
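For a single token regex /\w+/, the effect of --removeNotToken can be
mimicked by keeping only the token matches (a sketch of the effect on
plain text, not of preprocess.pl itself):

```python
import re

text = "No, he has no authority on me!"
# Only material matching the token regex survives; the non-tokens
# (',' and '!') are dropped rather than bracketed.
print(" ".join(re.findall(r"\w+", text)))
# -> No he has no authority on me
```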
1.8. Ignoring Non-Tokens using --nontoken:
------------------------------------------
The --nontoken option allows a user to specify a list of regular
expressions. Any strings in the input file that match this list
are removed from the file prior to tokenization.
It's important to note the order in which tokenization occurs.
First, the strings that match the regexes defined in the nontoken
file are removed. Then the strings that match the regexes defined in
the token file are matched. Finally, if --removeNotToken is given,
those strings that do not match the token regexes are removed. Thus,
the "order" of precedence during tokenization is:
--nontoken
--token
--removeNotToken
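The precedence above can be sketched as a three-step pipeline.
Illustrative only: the bracketed marker "[sic]" and the function name
are made up for the example.

```python
import re

def pipeline(text, nontoken_pats, token_pat, remove_not_token):
    # 1. --nontoken: strings matching these regexes are deleted first.
    for pat in nontoken_pats:
        text = re.sub(pat, "", text)
    # 2. --token: pick out the tokens.
    tokens = re.findall(token_pat, text)
    # 3. --removeNotToken: with it, only the tokens survive. (Without it,
    #    preprocess.pl would bracket the leftovers instead; that part is
    #    omitted from this sketch.)
    if remove_not_token:
        return " ".join(tokens)
    return text

print(pipeline("No, he has [sic] no authority!", [r"\[sic\]"], r"\w+", True))
# -> No he has no authority
```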
2. Information Insertion using Program preprocess.pl:
-----------------------------------------------------
2.1. Inserting lexelt and senseId Information:
----------------------------------------------
The lexelt information and the senseId information are outside the
<context> region. Program preprocess.pl gives you the
capability to bring these pieces of information inside the context.
Switch --useLexelt puts the <lexelt item="WORD"> tag within the
<context> tags, where WORD is the word in the immediately
preceding <lexelt> tag.
Switch --useSenseid puts the <sense id="XXXXX"/> tag within the
<context> tags, where XXXXX is the number in the immediately
preceding <answer> tag.
For example, running the program like so:
preprocess.pl example.xml --useLexelt --useSenseid --token token.txt
produces this for authority.n.xml:
<lexelt item="authority.n"> <sense id="XXXXX"/> Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work , it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice .
Note that the extra information is put inside the <context>
region. Hence the user has to provide a token file that will preserve
these <lexelt> and <sense> tags. For instance, as shown in the previous
section, if one were to rely on the default regex's, these tags would
not be preserved (the '<' and '>' would be considered non-token
symbols) and the lexelt and senseid information would not be included
within the <context> tags.
So for example, a regular expression file along the following lines is
adequate:
/<lexelt[^>]*>/
/<sense[^>]*>/
/<head>\w+<\/head>/
/\w+/
2.2. Inserting Sentence-Boundary Tags:
--------------------------------------
The English lexical sample data available from SENSEVAL-2 is such that
each sentence within the <context> tags is on a line of its
own. This human-detected sentence boundary information is usually lost
in preprocess.pl, but can be preserved using the switch
--putSentenceTags. This puts each line within <s> and </s>
tags. Assuming that each sentence was originally on a line of its own,
<s> then marks the start of a sentence and </s> marks its end. Note
that no sentence boundary detection is done: if the end of line
character (\n) does not match the end of a sentence, then the <s> and
</s> tags will not be indicative of a sentence boundary either.
For example, assume the following is our source xml file, source.xml:
This is the first line
This is the second line
This is the last line for <head>word</head>
Further assume our token file is this:
/<head>\w+<\/head>/
/<s>/
/<\/s>/
/\w+/
Running preprocess.pl like so:
preprocess.pl --token token.txt source.xml
Produces the following word.xml file:
This is the first line This is the second line This is the last line for <head>word</head>
and the following word.count file:
This is the first line This is the second line This is the last line for <head>word</head>
However, running preprocess.pl like so:
preprocess.pl --token token.txt --putSentenceTags source.xml
Produces the following word.xml file:
<s> This is the first line </s> <s> This is the second line </s> <s> This is the last line for <head>word</head> </s>
and the following word.count file:
<s> This is the first line </s> <s> This is the second line </s> <s> This is the last line for <head>word</head> </s>
Note that the <s> and </s> tags are placed into the data BEFORE the
tokenization process. Hence a token regular expression that preserves
these tags is required! The token file shown above is adequate for
this.
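The insertion step itself is simple: each line of the context is wrapped
before tokenization runs. A sketch of the idea (preprocess.pl does this
internally):

```python
lines = ["This is the first line",
         "This is the second line"]
# Each original line becomes one <s> ... </s> region; tokenization then
# runs over the result, which is why the token file must preserve the tags.
tagged = " ".join("<s> " + line + " </s>" for line in lines)
print(tagged)
# -> <s> This is the first line </s> <s> This is the second line </s>
```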
3. Splitting Input Lexical Files using preprocess.pl:
-----------------------------------------------------
Besides splitting the lexical elements into separate files,
preprocess.pl also allows you to split the instances of a single
lexical element into separate "training" and "test" files.
If one is attempting to replicate the results of the Senseval-2
systems, then it is appropriate to use preprocess.pl as described
in Overall.txt and to use the given test/evaluation files from
Senseval-2. However, if one has a corpus of sense-tagged text, it is
often desirable to divide that sense tagged text into training and
test portions in order to develop or tune a methodology. This is the
intention of the --split option.
The --split option of preprocess.pl allows you to specify an integer
N... the instances of each lexical element in the input XML SOURCE
file are split into two files approximately in the ratio N:(100-N).
If an output XML file "foo" is specified through the switch --xml then
two files, foo-training.xml and foo-test.xml are created.
If an output count file "foo" is specified through the switch --count
then two files, foo-training.count and foo-test.count are created.
Creation of XML and count output files can be suppressed by using the
--noxml and --nocount switches respectively.
If neither --noxml nor --xml switches are used, then files of the type
word-training.xml, word-test.xml are created.
If neither --nocount nor --count switches are used, then files of the
type word-training.count, word-test.count are created.
The instances are shuffled before being put into training and test
files. Perl automatically seeds the randomizing process... but you can
specify your own seed using the switch --seed.
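The N:(100-N) split with a seeded shuffle can be sketched as follows
(illustrative names and structure; not the script's code):

```python
import random

def split_instances(instances, n, seed=None):
    # Shuffle, then cut at roughly N% of the instances: the first part
    # plays the role of *-training, the remainder of *-test.
    rng = random.Random(seed)      # like --seed: makes the shuffle repeatable
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * n / 100)
    return shuffled[:cut], shuffled[cut:]

train, test = split_instances(range(10), 70, seed=1)
print(len(train), len(test))   # -> 7 3
```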
4. Using NSP to Create Features for nsp2regex.pl:
-------------------------------------------------
Recall that preprocess.pl can be used to generate *.count files that
contain only the tokenized text within the <context> tags of
all the instances. Program count.pl from the Ngram Statistics Package
(NSP) can be used to generate all n-word sequences that occur in these
*.count files. Additionally, count.pl can create n-word sequences that
span over more than n words... that is n-word sequences that had one
or more intervening words in the source that were skipped over. If
this is done then it is necessary to use the --extended switch of
count.pl so that the output of count.pl has the
"@count.WindowSize=..." directive in it. Recall that nsp2regex.pl uses
this directive to detect that skipping of tokens needs to be done. The
output of count.pl can be directly used as input to nsp2regex.pl.
Instead of using the output of count.pl directly, one could also run
program statistic.pl of NSP on the output of count.pl to select a
subset of the features found by count.pl. The output of statistic.pl
can be used directly with nsp2regex.pl.
Note that the output of both count.pl and statistic.pl consist of
tokens separated by <> signs. Further, the last <> is usually followed
by a sequence of numbers. Recall however that everything after the
last <> is ignored by nsp2regex.pl. Thus the output of count.pl and
statistic.pl can be directly used with nsp2regex.pl.
Similar to most programs in SenseTools, both count.pl and statistic.pl
can accept a token file to tokenize the input text. Usually in one
experiment, all these token files should be the same to keep the
tokenization consistent. If no token files were used with
preprocess.pl, then no token file need be used with NSP either... the
default tokens are the same in all programs in NSP as well as
SenseTools. However, as mentioned elsewhere, it is advisable to use a
token file with preprocess.pl, and to then use the same token file in
every program thereafter.
5. Creating Features by Hand for nsp2regex.pl:
----------------------------------------------
It is not necessary that one use NSP to create features for the
lexical files. In particular, one cannot produce features that include
non-tokens using NSP since programs in NSP (like count.pl and
statistic.pl) ignore non-tokens.
For example, assume the following lexical file:
Not only is it allowing certain health <head>authorities</head> to waste millions of pounds on computer systems that dont work <,> it also allowed the London ambulance service to put lives at risk with a system that had not been fully proven in practice <.>
We may want to create a feature where the word "practice" is followed
immediately with a period. In this case our input to nsp2regex.pl
would be "practice<><.><>".
Similarly we may want a feature where the lexelt is authority.n. In
this case, our input to nsp2regex.pl would be
"<lexelt item="authority.n"><>".
6. Explanation of Regular Expressions Created Using nsp2regex.pl:
----------------------------------------------------------------
6.1. Default Regular Expression (without Skipping Intermediate Tokens):
-----------------------------------------------------------------------
Recall that by default, nsp2regex.pl creates regex's that match space
separated tokens. Recall that the regular expressions that
nsp2regex.pl creates are based on the assumption that the text on
which these regex's are going to be used has tokens separated by a
single space. Further the regular expressions thus created ignore XML
tags and non-tokens, as described in the examples above.
For example, the following line in the input to nsp2regex.pl:
a<>bigram<>
is converted to the following regex:
/\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram
In this output, everything from the first / to the last / constitutes
the regular expression. The portion "@name = a<>bigram" is used by
xml2arff for giving a name to the attribute corresponding to this
regular expression.
6.1.1. What This Regular Expression will Match:
-----------------------------------------------
This regular expression defines a feature that will match the tokens
"a" and "bigram" under the following conditions:
i> Tokens "a" and "bigram" have exactly one space to their left and
right. For example, this regex will match the sentence " this is a
bigram ". This regex will not match the sentence " i wanna bigram "
nor the sentence " i have a bigrams ". It will not even match " I
have a  bigram " (with two spaces between "a" and "bigram"). This is
because nsp2regex.pl creates regular expressions that assume that
there is exactly ONE space character between tokens!
ii> Tokens "a" and "bigram" are bounded by one or more XML tags or
non-tokens, that is, a sequence of characters that starts with '<'
and ends with '>'. E.g., this regex will match the sentence " this
is a<,> bigram ". This regex will also match " this is
<,>a bigram ".
iii> Tokens "a" and "bigram" are separated by one or more space
separated XML tags. E.g., this regex will match the sentence " this
is a <,> bigram ". It will also match " this is a <,> <.> bigram
" and " this is a <,> <,> bigram ".
iv> combinations of the above cases.
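These conditions can be checked directly; Perl and Python regex syntax
agree on this pattern, so here is the generated regex tried against the
example sentences (a verification sketch, not part of the toolkit):

```python
import re

# The regular expression generated for the feature a<>bigram<>.
feature = re.compile(
    r"\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*bigram(<[^>]*>)*\s")

assert feature.search(" this is a bigram ")        # one space each side
assert feature.search(" this is a <,> bigram ")    # skips the XML tag
assert not feature.search(" i wanna bigram ")      # no token 'a'
assert not feature.search(" i have a bigrams ")    # 'bigrams' != 'bigram'
assert not feature.search(" i have a  bigram ")    # TWO spaces: no match
print("all feature checks passed")
```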
6.1.2. Explanation of this Regular Expression:
----------------------------------------------
Following is an explanation of the various parts of the regular
expression:
/\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram
a> The portion between the first '/' and the last '/' constitutes the
regular expression.
b> The regular expression starts with requiring a single space
character, \s. This is consistent with the assumption that every
token has exactly one space to its left and one to its right.
c> The next chunk is (<[^>]*>)*a(<[^>]*>)*
Note that the portion (<[^>]*>) represents exactly our definition
of an XML tag, namely that it should start with a '<', have 0 or
more characters, except the '>' character, and then end with the
'>' character. The '*' outside the bracket denotes that we are
willing to match 0 or more such tags. After that, we wish to match
a single occurrence of the first token, 'a', again followed by 0 or
more tags. Note that the tags are "stuck" to the token 'a', in that
there is no space between the tag and the token 'a'. Of course if
in the text there is a space between an XML tag and 'a', then that
space would match the \s described above.
d> Having matched token 'a' with 0 or more tags "stuck" to its right
and left, we now wish to match exactly a single space character
through the \s. Again this corresponds to our assumption that
tokens in the text are separated by exactly one space character!
e> The next chunk (<[^>]*>\s)* is again our familiar XML tag. This
time we wish to "skip" over 0 or more occurrences of any XML tag
that lie between the first and the second token, i.e. between 'a' and
'bigram'. Since these are not "stuck" to the next token 'bigram',
they are space separated from each other and from 'bigram'. Hence,
for every token we match, we also match a space character!
f> The next chunk is (<[^>]*>)*bigram(<[^>]*>)* which is exactly like
the chunk for 'a' in point c> above.
g> Finally we wish to match a single space character \s.
h> The portion after the last '/' @name = a<>bigram creates a "name"
for this feature. This name is used by xml2arff while creating the
vector output of the input XML file. While this name is not
necessary, it makes the vector output more human-readable.
6.2. Regular Expression with Skipping of Intermediate Tokens:
-------------------------------------------------------------
Recall that nsp2regex.pl can create regular expressions that ignore
one or more tokens that occur between the tokens to be
matched. Further recall that this can be switched on by having the
directive "@count.WindowSize=..." in the input file to
nsp2regex.pl. We need to provide nsp2regex.pl with the same token file
we provide preprocess.pl... say following is the token file:
/<head>\w+<\/head>/
/\w+/
Let the input file to the nsp2regex.pl program be the following:
@count.WindowSize=3
a<>bigram<>
then, the output regular expression from nsp2regex.pl is:
/\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*((<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram<>1
6.2.1. What This Regular Expression will Match:
-----------------------------------------------
This regular expression will match the tokens "a" and "bigram"
separated by 0 or 1 occurrences of the white space separated token
((<head>\w+<\/head>)|(\w+)). These are the token definitions obtained
from the token.txt file above!
For example, this regular expression will match the following
sentences:
" this is a funny bigram "
" this is a bigram "
" this is a nice bigram "
" this is a <,> bigram "
" this is a <,> nice bigram "
This regular expression will not match:
" this is a really big bigram ",
" i wanna write bigram ".
" this is a , bigram ",
6.2.2. Explanation of this Regular Expression:
----------------------------------------------
Following is a description of various parts of the regular expression:
/\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*((<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram<>1
On careful observation one will notice that the above regular
expression differs from the previous regular expression (section 6.1.2)
in only one portion.
Specifically the portion \s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)* is the
same as above... recall that this matches a space, followed by 'a'
with XML tags or non-token characters (within <> brackets) stuck to
its left and right, followed by a single space, followed by 0 or more
XML tags and non-token characters, with a space after every such tag.
Further note that the portion (<[^>]*>)*bigram(<[^>]*>)*\s is again
the same as before... they match 'bigram' with XML tags and non-token
character tags stuck to its left and right, followed by a single
space.
Thus the only "new" portion in this regex is
((<[^>]*>)*((\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}
We call this the "separator" portion of the regex; this is the portion
that allows for the "ignoring" of up to one token between the tokens
'a' and 'bigram'. This token can be either a \w+ or a
\w+.
a> Observe that the entire section is within a pair of round brackets,
followed by a {0,1}. This says that this portion is allowed to
occur 0 or 1 times. This is consistent with the window size of
3... besides 'a' and 'bigram', we allow at most one other token to
come into the window. If our window size were to be 10 say, this
would be {0,8}.
b> The first part inside this bracketed portion is
(<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*. This says that we
are willing to match either a <head>\w+<\/head> or a \w+. Further
whatever we match can be preceded or followed by an XML tag or a
non-token character enclosed within the angular brackets <>.
c> Having matched either of the two options, we wish to match a single
space, \s, followed by zero or more XML tags or non-tokens, in
keeping with our desire to skip these tags!
e> And, as mentioned in point a> above, we would like to do this matching
at most once, that is there will be at most one such token between
'a' and 'bigram'.
f> The name of the feature has also changed to @name = a<>bigram<>1
implying that we are allowing at most one token to come in between
our two main tokens!
7. A Fine Point about nsp2regex.pl:
-----------------------------------
Fine Point 1: Certain characters, like '.', '*', '?' etc have special
meaning when used within a regular expression. If these characters
occur in the tokens that the regular expression is being built from,
they are "escaped" (by prepending them with a slash '\'). Follwing is
a list of characters that are so escaped: '\', '/', '|', '(', ')',
'[', ']', '{', '}', '^', '$', '*', '+', '?' and '.'
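Python's re.escape performs the same kind of escaping, which makes the
fine point easy to see (shown for comparison; not nsp2regex.pl's code):

```python
import re

# Metacharacters in a raw token are escaped before being embedded in a
# regular expression, so that they match literally.
print(re.escape("a+b?"))   # -> a\+b\?
```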
8. A Brief History of xml2arff.pl:
----------------------------------
Our original approach to xml2arff.pl was to keep the regular
expressions produced by nsp2regex.pl relatively simple, and then
create mechanisms in xml2arff.pl that would do the skipping over XML
tags, non-token characters and other tokens as and when
required. However, this produced extremely slow code, since we had to
make choices of what to skip for every instance and every regular
expression. By creating complicated regular expressions, we shifted
the responsibility of skipping etc to the regular expressions
themselves. This resulted in nearly incomprehensible regular
expressions, but improved speeds markedly since we no longer have to
make the expensive choices of what to skip etc, and of course the Perl
regex engine is very efficient!
9. A Fine Point about xml2arff.pl:
----------------------------------
The attributes in the output arff files are in exactly the same order
as the regular expressions in the input regex file. The order of
attributes is important to certain machine learning algorithms. For
example, while building a tree in the Decision Tree algorithm, if
there is a tie between two attributes, the one occurring earlier is
chosen. By keeping the order of attributes exactly the same as the
order of the input regular expressions, xml2arff.pl allows the user to
control this order. This is particularly useful when the input regex's
originate as bigrams found by the Ngram Statistics Package (NSP). Recall
that bigrams found by NSP are ordered on their frequency or on some
statistical measure of association, and so bigrams higher in the list are
possibly more indicative of information than bigrams lower down the list.
Given a choice therefore, Weka chooses attributes corresponding to higher
ranked bigrams rather than lower ranked ones.
10. Some Common Pitfalls to Avoid with SenseTools:
--------------------------------------------------
1. The input files should be formatted in the SENSEVAL-2 format. If,
for example, a required tag is missing, then the entire file will
be ignored, and there will be no output from preprocess.pl.
2. The token files used with preprocess.pl, NSP programs, nsp2regex.pl
etc should all be the same. If they are not, then the token
definitions of these programs will not coincide and the matching of
the regular expressions may no longer make any "sense". That is,
unexpected matches or mismatches are likely to occur.
3. The regular expressions of nsp2regex.pl must be used ONLY with text
that has already been passed through preprocess.pl. If this is not
true, then again we may end up with unexpected
matches/mismatches.
4. One must exercise caution when passing the same file into
preprocess.pl twice. This may result in unexpected output. For
example, say the text within the <context> region is the
following:
"this is just too cool, man"
And our token regular expressions are:
/<head>\w+<\/head>/
/\w+/
After passing the file through preprocess.pl once, we would get the
following changed text:
" this is just too cool <,> man "
On passing it the second time with the same token file we would get
the following changed text:
" this is just too cool <<,>> man "
If you do have to pass the same text through preprocess.pl twice,
one way would be to use the regular expression /\S+/ the first
time... this would consider everything as tokens and nothing would
be put in <> brackets as non-tokens. Then the second time around,
one can use the main token file. This situation occurs when one
wishes to run preprocess.pl first on the main train file, and then
run the program again on the individual lexelt files created after
the first run of the program.
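The double-bracketing pitfall is easy to reproduce with a crude sketch.
This ignores preprocess.pl's spacing rules; it only shows how a second
pass re-brackets an already bracketed non-token:

```python
import re

def wrap_nontokens(text):
    # Crude sketch: any run that is neither \w characters nor white space
    # is treated as a non-token run and wrapped in <> brackets.
    return re.sub(r"[^\w\s]+", lambda m: "<" + m.group(0) + ">", text)

once = wrap_nontokens("this is just too cool, man")
twice = wrap_nontokens(once)
print(once)    # -> this is just too cool<,> man
print(twice)   # -> this is just too cool<<,>> man
```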