README
------

posSenseval Version 0.1

Copyright (C) 2001-2002

Mohammad Saif, moha0149@d.umn.edu
Ted Pedersen, tpederse@d.umn.edu

University of Minnesota, Duluth

#####################  LAST UPDATED: May, 2003  ##########################

posSenseval is a software package to part of speech tag any text in the
Senseval-2 data format. The Brill Tagger [Brill92] [Brill94] is utilized
for the purpose.

###############  TO IMMEDIATELY PART OF SPEECH TAG DATA  ##################
#
#  1. Download and unpack:
#
#     a. posSenseval:
#        http://www.d.umn.edu/~moha0149/research.html
#
#        Set the environment variable and path as shown:
#
#          setenv POSHOME $POSPACKAGE/posSenseval
#          set path = ($POSHOME $path)
#
#        ...where $POSPACKAGE represents the directory where the package
#        has been unpacked.
#
#     b. SenseTools:
#        http://www.d.umn.edu/~tpederse/senseval2.html
#
#        Set the PATH as shown:
#
#          set path = ($SENSEHOME/SenseTools-0.1 $path)
#
#        ...where $SENSEHOME represents the directory where SenseTools is
#        unpacked, given that the version being used is 0.1.
#
#     c. Brill Tagger:
#        http://www.cs.jhu.edu/~brill
#
#        Compile the tagger.
#
#     d. BrillPatch:
#        http://www.d.umn.edu/~moha0149/research.html
#
#        (Optional but suggested. The patch is to be applied as per its
#        README.) The application of this patch guarantees pre-tagging
#        and applies lexicalized contextual rules even when the data is
#        pre-tagged.
#
#  2. Type:
#
#        pos BRILL OUTPUT SOURCE [tr]
#
#     BRILL is the directory where the Brill Tagger's executable (tagger)
#     is located.
#
#     OUTPUT is the name of the directory where all the output files are
#     to be created. The directory specified will be created by the
#     script and should not already exist.
#
#     SOURCE is the name of the Senseval-2 data format file which is to
#     be part of speech tagged. A complete or relative path to the file
#     may be specified.
#
#     If `tr' is specified as the fourth command line argument, the part
#     of speech tagged individual lexical element files are named as
#     training files. Otherwise, they are named as test files.
#
###########################################################################


################################  DETAILS  #################################


1. INTRODUCTION:
===============

Assigning the appropriate part of speech to the words in a text is known
as part of speech tagging. Consider the following sentence:

    Jack will chair the meeting

It may be part of speech tagged as shown below, given that NNP stands for
proper noun, MD for modal, VB for verb, DT for determiner and NN for noun.

    Jack/NNP will/MD chair/VB the/DT meeting/NN

Numerous part of speech taggers are available commercially and in the
public domain. Most of these tag all the tokens in the given input text.
They expect the text to be tokenized and the sentences to be separated by
a marker (usually a new-line character). This, however, is not always the
case, and is what motivates this package.

Part of speech tagging is a prerequisite for many Natural Language
Processing tasks. Most of the techniques which utilize data for word
sense disambiguation rely on the part of speech of the target word, its
surrounding words and sometimes syntactically and lexically related
words. Many chunkers and parsers which are used to obtain these syntactic
relations rely on the part of speech tags for their functioning.

The Senseval-2 exercise, held in 2001, brought together numerous word
sense disambiguation systems from all over the world on one platform.
For the first time, the results from all these systems could be compared,
as the systems were trained and evaluated on a common sense-tagged data
set. An offshoot of this was that all the systems involved were designed
to accept a common data format - the Senseval-2 data format.

Data in the Senseval-2 format has numerous xml and sgml tags which are
not a part of the contextual information. The text need not be tokenized
and may have parts of sentences separated by a new-line character (split
sentences). posSenseval is designed to efficiently part of speech tag any
data in the Senseval-2 data format using the Brill Tagger, which, like
any standard tagger, has the requirements mentioned above.

Creating manually annotated data like the Senseval-2 English Lexical
Sample Space is both expensive and time intensive. Conversion of data in
various data formats to the Senseval-2 data format can therefore be a
significant source of quality sense-tagged data for the numerous Natural
Language Processing tools which accept data in the Senseval-2 format. The
Sval1to2 package, which converts data in Senseval-1 format to the
Senseval-2 data format, is available at:

    http://www.d.umn.edu/~tpederse/data.html

The lineOneTwo, hardOneTwo, serveOneTwo and interestOneTwo packages,
which convert the "line", "hard", "serve" and "interest" data,
respectively, into the Senseval-1 data format and then use Sval1to2 to
convert them to the Senseval-2 data format, are available at the authors'
web page. Thus, there is a rich source of sense-tagged data available in
the Senseval-2 data format which may be used for many Natural Language
Processing tasks. posSenseval can be used to accurately assign parts of
speech to all of it.

2. Senseval-2 data format:
=========================

The English lexical sample space of Senseval-2 has 4328 instances of
evaluation data and 8611 instances of training data. Each instance has a
sense-tagged sentence along with a few neighboring sentences forming its
context. The data is partitioned into two sets - the training data and
the evaluation data. In the evaluation data, one word in the context is
marked (surrounded by <head> and </head> tags) as the target word, whose
sense as intended in the context is to be disambiguated. The data also
has other xml tags (demarcated by angular braces) and certain sgml tags
(placed in square braces). The training data has the same kind of tags
but additionally has the intended sense of the target word. Information
about the Senseval-2 data format is available at its web page:

    http://www.sle.sharp.co.uk/senseval2

3. SCRIPTS:
==========

A script, `pos', is provided with this package which takes you through
the complete process of efficiently part of speech tagging any data in
the Senseval-2 data format. Details are in Sections 6 and 7.

Data may first be processed by `refine' to improve the quality of the
tagging done by `pos'. `refine' is a software package, available at the
authors' web page, which may be used to pre-tag the head words with
appropriate part of speech tags and to place sentences on new lines.

Higher accuracy of tagging warrants the following:

1. Symbols such as `<', `>', `[', `]', `{' and `}', which have special
   meaning in the Senseval-1 or Senseval-2 data format, must be
   represented by the appropriate character references of HTML 4.0 (see
   the example after this list):

       http://www.cs.tut.fi/~jkorpela/html/guide/entities.html

2. There should not be new-line characters within angular braces.

3. Sentences must be separated from each other by new-line characters.

4. Sentences should not have new-line characters within them.

5. All data within square braces is assumed to be contextual if composed
   of multiple tokens and non-contextual otherwise (this is the case in
   the Senseval-2 data, which has its SGML tags in square braces).
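For example (the sentence here is made up), an angular brace which is
genuinely part of the context, as in

    the inequality a < b holds trivially

must appear in the data with the `<' written as its character reference:

    the inequality a &lt; b holds trivially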
The file `samplerun' shows how I have used pos to generate part of speech
tagged line, hard, serve, interest, Senseval-1 and Senseval-2 data.

4. LOCATION OF FILES:
====================

On unpacking the package, all files and data created are within the
`posSenseval' directory. This directory is created in the directory where
the package was unpacked. Let POSHOME represent the complete path to
posSenseval. The scripts and all the perl programs, as indicated by their
`.pl' extension, are located here.

The original Senseval-1, Senseval-2 and the fixed Senseval-1 data may be
downloaded from:

Senseval-2       : http://www.sle.sharp.co.uk/senseval2/Results/guidelines.htm#rawdata
Senseval-1       : http://www.itri.brighton.ac.uk/events/senseval/ARCHIVE/resources.html
Fixed Senseval-1 : http://www.d.umn.edu/~moha0149/research.html

More information about the Senseval data and the Senseval exercise may be
obtained from the Senseval web page:

    http://www.itri.brighton.ac.uk/events/senseval/

Token files used by `preprocess.pl' are at      : $POSHOME/data/TOKEN/
Input Senseval-2 format files may be placed at  : $POSHOME/data/user/
Output created by the package may be placed at  : $POSHOME/data/output/

Senseval-1 and Senseval-2 test and training files which have been
processed by the `refine' package are at:

    $POSHOME/data/samples/

These files have the head words pre-tagged and the sentences placed on
new lines. They have been created by running the script `proc' of
`refine' with the following command line options:

    proc 1 2 1 2 1 DESTINATION SOURCE
    proc 2 2 1 2 1 DESTINATION SOURCE
    proc 3 2 1 2 1 DESTINATION SOURCE
    proc 4 2 1 2 1 DESTINATION SOURCE

For further details see the README of `refine'. These files are described
in the sections that follow. Changing the location of the above stated
data files is possible but entails appropriate changes in the script.

All the perl programs assume the Perl software to be at `/usr/bin'. If
this is not the case, the requisite changes will need to be made to the
first line of the perl programs. An alternative to changing the code for
this purpose is to alias `/usr/bin/perl' to the appropriate perl
location. For example, if the Perl software is located at
`/usr/local/bin/perl', the following command aliases `/usr/bin/perl' to
`/usr/local/bin/perl':

    alias /usr/bin/perl /usr/local/bin/perl

5. NECESSARY SOFTWARE:
=====================

The following software must be downloaded and their locations placed in
the PATH, in order to successfully part of speech tag data.

1. posSenseval:

       http://www.d.umn.edu/~moha0149/research.html

   The complete path of the posSenseval directory, created after
   unpacking the package, needs to be set as an environment variable in
   your .cshrc file. This is how I set it...

       setenv POSHOME $POSPACKAGE/posSenseval

   ...where $POSPACKAGE represents the directory where the package has
   been unpacked. The $POSPACKAGE/posSenseval directory must be added to
   the PATH as well. This is how I did it...

       set path = ($POSHOME $path)

2. SenseTools (version 0.1 or above):

       http://www.d.umn.edu/~tpederse/senseval2.html

   The directory which contains `preprocess.pl' (a part of SenseTools)
   should be placed in the PATH. This is how I made the entry in my
   .cshrc file, assuming I have downloaded SenseTools in $SENSEHOME (say)
   and the version I am using is 0.1.

       set path = ($SENSEHOME/SenseTools-0.1 $path)
3. Brill Tagger:

       http://www.cs.jhu.edu/~brill

   `pos' copies the data files, after appropriate pre-processing, to the
   directory housing the Brill Tagger's executable. A necessary
   pre-condition of using the Brill Tagger is that the tagging has to be
   done in the directory containing the initial and final state taggers.

   Note: The Brill Tagger's home directory need not be added to the PATH
   explicitly; the script takes care of it.

   It is also suggested that the patch to the Brill Tagger, BrillPatch
   [Mohammad Pedersen 2002], be downloaded from...

       http://www.d.umn.edu/~moha0149/research.html

   The application of this patch guarantees pre-tagging and applies
   lexicalized contextual rules even when the data is pre-tagged. This
   ensures a higher precision of tagging when `pos' is used in
   combination with `refine', which pre-tags the data. Please see the
   paper [Mohammad Pedersen 2002] for further details.

6. GENERATING PART OF SPEECH TAGGED DATA:
========================================

Part of speech tagging of data in the Senseval-2 data format may be done
using the provided script. Type...

    pos BRILL OUTPUT SOURCE [tr]

where,

BRILL  is the path to the directory which houses the Brill Tagger's
       executable (tagger).

OUTPUT is the name of the directory where all the output files are to be
       created. The directory specified will be created by the script and
       should not already exist. The final part of speech tagged output
       file is named output.xml.

SOURCE is the name of the Senseval-2 data format file to be tagged. A
       complete or relative path to the file may be specified.

If `tr' is specified as the fourth command line argument, the individual
lexical element files are named as training files. In all other cases
they are named as test files.

If `refine' is used for pre-tagging and/or split sentence restoration,
its final output file (such as multisent.xml or replace.xml) should be
used as the SOURCE file.

Individual files generated for each lexical element can be identified by
their `.count' and `.xml' extensions. These files are created in the
$OUTPUT/LexSamples/ directory. A copy of the same is also placed in the
$POSHOME/data/WEKA directory. Thus, in case you run this script with both
training and evaluation data, the training and evaluation .xml and .count
files are stored together for each lexical element in this directory. For
example, $POSHOME/data/WEKA/art.n will have four files - the .xml and
.count training and evaluation files.

The data generated has all contextual alphabetic characters in lower
case. This is done so that strings with different capitalizations may be
treated as the same type by the systems which utilize this data.

The script `pos' first copies the SOURCE file into the OUTPUT directory
under the name `1.txt'. It then creates a number of intermediate files
before finally creating `output.xml', which is the part of speech tagged
SOURCE file. It then goes on to create the individual lexical element
files as mentioned above.

7. DETAILS OF GENERATING PART OF SPEECH TAGGED DATA:
===================================================

Following is a description of the various perl programs utilized by `pos'
to do the part of speech tagging. The tagging process may be broken into
three steps: Pre-Processing, the actual tagging and Post-Processing.
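For orientation, the chain of intermediate files created by `pos' in the
OUTPUT directory looks roughly as follows. The names are those quoted in
the subsections below; the name of the file passed from preprocess.pl to
filter3.pl (4.txt) is inferred from the filter3.pl command line.

    1.txt --(filter1.pl)--> 2.txt --(filter2.pl)--> 3.txt --(preprocess.pl)--> 4.txt
    4.txt --(filter3.pl)--> 4a.txt --(sentence.pl)--> 5.txt --(Brill tagger)--> 6.txt
    6.txt + 1.txt --(xml-back.pl)--> 7.txt --(xmlise.pl)--> output.xml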
7.1 PRE-PROCESSING:
==================

This is the stage in which the data is stripped of its xml and sgml tags
and appropriately processed to make it suitable to be tagged by the Brill
Tagger.

7.1.1 REMOVING XML TAGS WITHIN CONTEXT (filter1.pl):
---------------------------------------------------

filter1.pl removes the data within angular braces in the context of the
SOURCE file. This data is non-contextual and is thus eliminated before
the tagging phase, along with the angular braces. The output is placed in
the DESTINATION file. The data which is filtered out is placed in the
FILTER file.

Usage: filter1.pl [OPTIONS] FILTER DESTINATION SOURCE

OPTIONS:
    --help   Prints the help message.

The script runs the program with the following options:

    filter1.pl filter1.out 2.txt 1.txt

7.1.2 ELIMINATING THE REMAINING NON-CONTEXTUAL DATA (filter2.pl):
----------------------------------------------------------------

`filter2.pl' is a perl program that eliminates the non-contextual data
within square brackets. It makes the assumption that data within square
brackets is non-contextual; the exceptions to this assumption are first
converted to the SGML format (`[' to `&lsqb' and `]' to `&rsqb'). The
exceptions are identified by the presence of multiple tokens within the
square braces.

The program also eliminates data within curly braces. This data is
assumed to be non-contextual, as found in the Senseval-1 data. An
exception to this is typographical error information within curly braces.
This information is replaced by the correct spelling. As is the case with
square braces, if the curly braces are a part of the context, they must
be specified by their appropriate character references - `&lcub;' and
`&rcub;' represent the left and right curly braces, respectively.

The new-line character in the second line of the Senseval-1 test data,
which falls within angular braces, is eliminated by this program. The
program also non-tokenizes the `\/' symbol and the pre-tag slashes `//',
and it eliminates the `\226' and `\227' tokens, which are not part of the
contextual information.

Following is the usage of filter2.pl:

Usage: filter2.pl [OPTIONS] FILTER DESTINATION SOURCE [SOURCE]...

The SOURCE file is processed to form the DESTINATION file. All the data
eliminated is stored in the FILTER file for reference.

OPTIONS:
    --help   Prints the help message.

The script runs the program with the following options:

    filter2.pl filter2.out 3.txt 2.txt
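As a schematic illustration of the square-brace handling described above
(the bracketed material here is made up and does not come from the actual
data), a context fragment such as

    [p] the band [played very loudly] all night

would come out of filter2.pl as

    the band &lsqb;played very loudly&rsqb; all night

since `[p]' holds a single token and is therefore assumed to be
non-contextual and eliminated, while the multi-token span is assumed to
be contextual and only its braces are converted.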
7.1.3 ELIMINATING XML TAGS (preprocess.pl):
------------------------------------------

`preprocess.pl' is used to eliminate xml tags before tagging, and later
to generate the xml and count files for the individual lexical elements.
It is a part of SenseTools, which can be downloaded from...

    http://www.d.umn.edu/~tpederse/senseval2.html

I have used it with the following options to process the output file of
filter2.pl (3.txt):

    preprocess.pl 3.txt --token $POSHOME/data/TOKEN/token1.txt \
                  --count OUTPUT --noxml --putSentenceTags

The command produces a file (OUTPUT) which contains only the contextual
information - no tags other than <s> and </s>, which demarcate the
sentence boundaries. It also tokenizes the data. Note that hyphenated
words are not split up into multiple tokens, thus treating them as
compound words. The regular expressions in token1.txt identify the valid
tokens; all others are eliminated. The README of SenseTools may be
referred to for a better understanding of the token files.

7.1.4 DEALING WITH CHARACTER REFERENCES (filter3.pl):
----------------------------------------------------

filter3.pl removes the character references within the context of the
INPUT file. The character references which may be helpful in correct part
of speech tagging are converted to the appropriate symbols, while the
others are eliminated. The output is placed in the DESTINATION file. The
data which is filtered out is placed in the FILTER file.

Usage: filter3.pl [OPTIONS] FILTER DESTINATION SOURCE

OPTIONS:
    --help   Prints the help message.

The script runs the program with the following options:

    filter3.pl filter3.out 4a.txt 4.txt

7.1.5 NEW LINES (sentence.pl):
-----------------------------

`sentence.pl' eliminates the sentence delimiter tags in the SOURCE
file(s) and places the sentences on new lines; a form required by the
Brill Tagger. The sentence boundaries should be demarcated by <s> and
</s> in the INPUT file. It also non-tokenizes the pre-tagging forward
slashes `//' and tokenizes the apostrophe as required by the Brill
Tagger, i.e., `Einstein's' becomes `Einstein 's'. For example,

    Einstein's mass and energy / / NN formula was revolutionary .

is converted to...

    Einstein 's mass and energy//NN formula was revolutionary .

The formatted output is stored in the DESTINATION file.

Usage: sentence.pl [OPTIONS] DESTINATION SOURCE

In the script, `SOURCE' is the output file generated by filter3.pl and
`DESTINATION' is the file which can be used as input to the Brill Tagger.

The script runs the program with the following options:

    sentence.pl 5.txt 4a.txt

`xml.pl' is a script which can be used to check whether any
non-contextual data (xml tags) still exists in the SOURCE file. The
script prints all tokens within angular braces to the DESTINATION file.
Ideally, this file should be empty.

Usage: xml.pl DESTINATION SOURCE

The script runs the program with the following options:

    xml.pl xml.out 5.txt &

7.2 POS TAGGING (Brill Tagger):
==============================

The Brill Tagger is used to tag the words with the appropriate part of
speech. It can be obtained from...

    http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z

I have used it with the following options:

    tagger LEXICON INPUT BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE -i INTER > TAG

where,

LEXICON, LEXICALRULEFILE and CONTEXTUALRULEFILE
        - links to files extracted from Treebank and WSJ data. These
          files are provided with the tagger.

INPUT   - the file having the pre-processed data (contextual data only).

TAG     - the output file.

BIGRAMS - a file of bigrams used to assign parts of speech to unknown
          words. Presently empty.

INTER   - the intermediate part of speech tagged data file, as created by
          the initial state tagger.

Note: As mentioned earlier, the input to the tagger must be copied to the
directory housing the initial and final state taggers for tagging. The
script copies the pre-processed INPUT file there under the name 5.txt.
The output of the tagger is in the form of two files - inter.txt and
6.txt, corresponding to INTER and TAG. These files (5.txt, inter.txt and
6.txt) should not already exist in the Brill Tagger directory. inter.txt
is the output of the initial state tagger and 6.txt is the output of the
final state tagger.

The script runs the program with the following options:

    tagger LEXICON 5.txt BIGRAMS LEXICALRULEFILE CONTEXTUALRULEFILE -i inter.txt > 6.txt

Note: Make sure to compile the Brill code by using the make command, as
specified in the Brill README.
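Schematically, a line of 5.txt such as

    Jack will chair the meeting .

comes out of the final state tagger in 6.txt in the usual word/TAG form,
along the lines of

    Jack/NNP will/MD chair/VB the/DT meeting/NN ./.

(The sentence is the one used in the Introduction; the tags shown are
only illustrative.)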
7.3 POST PROCESSING:
===================

This stage primarily involves placing all the xml and sgml tags back into
the part of speech tagged data and formatting the part of speech tags as
xml tags.

7.3.1 LOOKING UP CHANGED PRE-TAGS (mistag.pl):
----------------------------------------------

At this point, the list of instances wherein the pre-tag of a head word
has been overridden by the contextual rules of the Brill Tagger can be
observed by using `mistag.pl'.

Usage: mistag.pl [OPTIONS] BRILL PRETAG DESTINATION1 DESTINATION2

OPTIONS:
    --help   Prints the help message.

BRILL        is the Brill tagged file name.
PRETAG       is the pre-tagged file name.
DESTINATION1 is the file listing all changes.
DESTINATION2 is the file listing radical changes in the part of speech
             tag. Each entry has the line number of the instance in the
             Brill tagged file and the line as it appears in `BRILL' and
             in `PRETAG'.

The script runs the program with the following options:

    mistag.pl 6.txt 1.txt all-mistag.out radical-mistag.out

Note: The term `radical change' has been used to refer to a change of the
form NNP to VBZ, say, where the general class of the tag has changed
(noun to verb here). A change of the form NNP to NNS will not figure in
this list.

all.pl and radical.pl use DESTINATION1 and DESTINATION2, as created by
mistag.pl, as SOURCE files to list the head words, each followed by the
tag assigned by the tagger and then the pre-tag. Information for each
head word is placed on a new line, and entries are created only for those
head words which have been mis-tagged by the tagger. Following are the
usages:

Usage: all.pl [OPTIONS] SOURCE

OPTIONS:
    --help   Prints the help message.

Usage: radical.pl [OPTIONS] SOURCE

OPTIONS:
    --help   Prints the help message.

The output may be re-directed into appropriate files. The script
redirects the outputs to all.out and radical.out, respectively. It may be
noted that radical.out lists just the broad part of speech (for example,
noun - N, verb - V, adjective - J). all.out, however, lists the complete
part of speech (plural noun - NNS, past tense verb - VBD, etc.). The
number of lines in all.out and radical.out equals the number of
mis-tagged head words and radically mis-tagged head words, respectively.

The script runs the programs with the following options:

    all.pl all-mistag.out > all.out
    radical.pl radical-mistag.out > radical.out
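The distinction between the two kinds of changes noted above can be
pictured with a small Perl sketch (hypothetical; it is not part of the
package). Comparing the leading letter of the two Penn Treebank tags
approximates a change in the general class:

    #!/usr/bin/perl
    # Hypothetical sketch, not shipped with posSenseval: approximate the
    # `radical change' test by comparing the broad class (first letter)
    # of the two Penn Treebank tags.
    use strict;
    use warnings;

    sub is_radical {
        my ($pretag, $brilltag) = @_;
        return substr($pretag, 0, 1) ne substr($brilltag, 0, 1);
    }

    print is_radical("NNP", "VBZ") ? "radical\n" : "not radical\n";  # noun to verb : radical
    print is_radical("NNP", "NNS") ? "radical\n" : "not radical\n";  # still a noun : not radical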
7.3.2 RETURN OF THE XML TAGS (xml-back.pl):
------------------------------------------

This program adds the xml tags from the SOURCE file back to the TAGGED
file and places the resultant lines in the DESTINATION file. The number
of non-xml tokens in the SOURCE is compared with the number of tokens in
the TAGGED file. Any mismatch causes the relevant lines of the two files
to be printed in the file FILTER. This file is empty if there is perfect
synchronization between the two files.

The contextual alphabetic characters are converted to lower case. Forward
slash symbols which are part of the contextual data (and not the
token-tag separator) are transformed to their character reference form
and placed in angular braces, `<&frasl;>'. This is to avoid
mis-identification as the token-tag separator, especially in the case of
pre-tagging.

Usage: xml-back.pl [OPTIONS] FILTER DESTINATION SOURCE1 SOURCE2

where,

FILTER      - file which has the relevant lines in case of a mismatch
SOURCE1     - XML tagged Senseval-2 data format file
SOURCE2     - file obtained after POS tagging
DESTINATION - resultant output file

The script runs the program with the following options:

    xml-back.pl xml-back.out 7.txt 1.txt 6.txt

7.3.3 FORMATTING THE POS TAGS (xmlise.pl):
-----------------------------------------

This program formats the part of speech tags into angular braces. The
formatted form of the SOURCE file is placed in the DESTINATION file.

Usage: xmlise.pl [OPTIONS] DESTINATION SOURCE

DESTINATION is the final part of speech tagged Senseval-2 data format
file.

The program also converts the `<' and `>' symbols back to the SGML format
and places them in angular braces, thus transforming them to `<&lt;>' and
`<&gt;>', respectively. `<' and `>' are symbols which have special
meaning in the Senseval-2 data format and thus may be mis-identified by
systems using this data; hence the conversion to the SGML format.

The script runs the program with the following options:

    xmlise.pl output.xml 7.txt

7.3.4 FILES FOR EACH LEXICAL ELEMENT:
------------------------------------

Individual xml and count files for each lexical element may be generated
by processing the part of speech tagged file generated above,
`output.xml', with preprocess.pl. Following is what I have used:

    preprocess.pl output.xml --token $POSHOME/data/TOKEN/token2.txt --putSentenceTags

Observe the use of a different token file here as compared to the earlier
used token1.txt. This is to accept the xmlised part of speech tags as
valid tokens. The count files have just the contextual information and
the xmlised part of speech tags.

The xml and count files are created in the directory corresponding to the
lexical element in the OUTPUT directory...

    $OUTPUT/LexSample

Another copy of the same data is also placed in the following directory:

    $POSHOME/data/WEKA

This is so that in case both training and test files are processed, both
sets of files will be together in this directory. It may be noted,
however, that the above mentioned directory's contents may have to be
deleted before starting afresh.

NOTE: Large sentences may cause a problem while opening the files in the
vi editor.

8. MISCELLANEOUS TEST SCRIPTS:
==============================

Apart from the debug and testing files mentioned above, the following
scripts can be used to check the validity and sanity of the part of
speech tagged data produced.

head.pl : This script can be used to list all the head instances in the
    initial source file and the final part of speech tagged files. The
    data should match both in number and in what they hold. This is how
    I have used it:

        head.pl SOURCE1 > DESTINATION1
        head.pl SOURCE2 > DESTINATION2

    where,
    SOURCE1 is the file before any pre-processing by the `pos' script.
    SOURCE2 is the output file of xmlise.pl.
    DESTINATION1&2 are the output files which list the head instances.

sat.pl : This script can be used to list all the `sat' instances in the
    initial source file and the final part of speech tagged files. The
    data should match both in number and in what they hold. This is how
    I have used it:

        sat.pl SOURCE1 > DESTINATION1
        sat.pl SOURCE2 > DESTINATION2

    where,
    SOURCE1 is the file before any pre-processing by the `pos' script.
    SOURCE2 is the output file of xmlise.pl.
    DESTINATION1&2 are the output files which list the sat instances.

The script runs the programs with the following options:

    head.pl 1.txt > head-in.out
    head.pl 7.txt > head-out.out
    sat.pl 1.txt > sat-in.out
    sat.pl 7.txt > sat-out.out
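As a quick additional check (not part of the package), the line counts of
the paired listings may be compared; they should agree if no head or sat
instance was lost during tagging:

    wc -l head-in.out head-out.out
    wc -l sat-in.out sat-out.out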
9. Exceptions to Data Format Norms:
===================================

The Senseval-1, Senseval-2 and other sense-tagged data have their own
characteristic data formats. Since the data is created manually, there is
always a possibility of certain instances not following one or more
norms. These exceptions may pose significant problems to systems which
rely on the specified data format for their execution. This section lists
some of the discrepancies we found in the Senseval-1 and Senseval-2 data.

9.1 Senseval-1 Data:
--------------------

a. There exist contextual tokens within square braces. That is, there are
   square braces which are part of the contextual information, albeit
   they have not been converted to their appropriate character reference
   form.

b. Some non-contextual tokens are placed in curly braces. Typographical
   error information is also placed in curly braces. This needs special
   handling to place the correct spelling in the context and ignore the
   non-contextual information.

c. Compound words have not been handled appropriately. There exist
   hyphenated head words such as `waist-band' and also non-hyphenated
   compound head words such as `waistband'.

d. There exist sentences with new-line characters within them.

e. Certain instances have mis-tagged head words. For example, `big' is
   tagged as the head word instead of `band'.

f. Certain instances have duplicates (details to be added).

9.2 Senseval-2 Data:
--------------------

a. There exist instances which have a forward slash, `/', as a token
   instead of its SGML representation.

b. The second and third lines of the Senseval-2 test data contain an xml
   tag that is split across them; we thus have a new-line character
   within angular braces.

c. There exist sentences with new-line characters within them.

d. There exist sentences which are not placed on new lines.

e. Certain instances have duplicates (details to be added).

10. Copying:
===========

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.

This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.

11. REFERENCES:
==============

1. [Brill92] E. Brill. A Simple Rule-Based Part of Speech Tagger. In
   Proceedings of the Third Conference on Applied Natural Language
   Processing, Trento, Italy, 1992.

2. [Brill94] E. Brill. Some Advances in Rule-Based Part of Speech
   Tagging. In Proceedings of the 12th National Conference on Artificial
   Intelligence (AAAI-94), Seattle, WA, 1994.

3. [Brill95] E. Brill. Transformation-based error-driven learning and
   natural language processing: A case study in part of speech tagging.
   Computational Linguistics, 21(4):543-565, 1995.

4. [Mohammad Pedersen 2002] S. Mohammad and T. Pedersen. Guaranteed
   Pre-Tagging for the Brill Tagger. In Proceedings of the Fourth
   International Conference on Intelligent Text Processing and
   Computational Linguistics (CICLing-2003), Mexico City, February 2003.

5. [RamshawM94] L. Ramshaw and M. Marcus. Exploring the statistical
   derivation of transformational rule sequences for part-of-speech
   tagging. In the ACL Balancing Act Workshop, pages 86-95, 1994.