Ted Pedersen - CS 8995 Corpus Based Natural Language Processing

CS 8995 Corpus Based Natural Language Processing

Final Project: Empirical Methods for Multilingual Text

Stage 1 - Gold Standard Data Due Fri, Mar 23, 4 pm
Everything else Due Mon, Mar 26, 4 pm

As of Mar 20 the evaluation requirements have been expanded! Please make sure to include the new information!

Objectives

To investigate the problem of sentence alignment of parallel corpora.

Specification

Your team will produce a written document that describes the general problem, previous approaches taken to solve this problem, your particular approach, and how you evaluated the effectiveness of your approach. Your team should implement your solution in Perl and evaluate it using gold standard data that your team either creates or finds.

Written Report
Your team should produce a formal written report that includes references. You should discuss the general problem of sentence alignment (what it is, why we should care about it) and then describe three "classical" methods of sentence alignment. You should also identify and discuss 2 "new" approaches that have been published since 1999.

You should then describe the approach that your team will take to sentence alignment - describe your algorithm and whatever factors led you to take this approach. If you are re-implementing a classical approach then you should provide a detailed description of the method, and also discuss why you selected it over other alternatives.

After the description of your algorithm, you should discuss your method of evaluation and the results that your algorithm achieved in processing your gold standard data. Your evaluation results should include a measure of some sort that could be used to compare your approach to others. You must also provide a program that performs this evaluation!

All of the writing in this report should be original to your team. In writing about the other approaches, you should read and discuss related material that you find in our textbook and other sources (all of which should be mentioned in your references). Your writing should be based on those discussions and reflect the unique character of your team. You should not, under any circumstances, "borrow" text from other sources to include in your report. Producing this report may be even more difficult than writing your Perl code, so make certain to allow sufficient time to produce a quality document.

You should submit your report as either a PostScript or PDF file.
Perl Implementation
Your implementation must be in Perl and it should run from the command line as follows:
```
teamname.pl file1.utx 
```
where file1.utx is a parallel corpus formatted as in assignment 4. (You can allow your program to accept multiple input files, although this is not required.) The input and output of your program should be in Unicode. You may assume that the input to the program is limited to one pair of languages. In other words, file1.utx is assumed to be made up of a single language pair (English and one other language). However, you should try to make your sentence alignment program language independent. One time you might run it with English-French, the next time with English-Spanish, etc.
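A minimal sketch of the command-line and Unicode handling described above may help as a starting point. It assumes the .utx files are UTF-8 encoded (the specification only says "unicode", so adjust the layer if your data uses another encoding), and `read_lines` is a hypothetical helper name, not something the assignment requires. The alignment step itself is left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: input and output are UTF-8. Change the layers below
# (e.g. to ':encoding(UTF-16)') if your data uses another Unicode encoding.
binmode(STDOUT, ':encoding(UTF-8)');

# read_lines is a hypothetical helper: it returns the decoded lines
# of one input file, without trailing newlines.
sub read_lines {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path)
        or die "Cannot open $path: $!\n";
    my @lines = <$fh>;
    close($fh);
    chomp @lines;
    return @lines;
}

# Accept one or more input files, as the specification allows.
foreach my $file (@ARGV) {
    my @lines = read_lines($file);
    # ... the sentence alignment step would go here ...
    print "$_\n" for @lines;
}
```

Decoding on input and encoding on output keeps string operations such as `length` working on characters rather than bytes, which matters if your approach compares sentence lengths across languages.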

The output of the sentence alignment program should be formatted like this. You must follow this format exactly. Please note that your program does not reorder the text in the input files. Rather, the alignment information is contained in the alignment tags as shown in the example.
Gold Standard Data and Evaluation Program
In order to evaluate your approach you will need at least 100 manually aligned sentences. To perform the evaluation you should run the sentence alignment program on the gold standard data (in its pre-aligned form) and compare your program's output with the manually created gold standard. How you make this comparison is up to you. However, make sure that your evaluation is quantitative (i.e., some measure of agreement or disagreement is produced) and make sure that you have a minimum of 100 manually aligned sentences in your gold standard.

You must implement your evaluation method in a Perl program. It should run from the command line as follows:
```
teamname-eval.pl gold.utx output.utx
```
where output.utx is the output created by your alignment program, and gold.utx is the manually aligned (gold standard) version of this same data. teamname-eval.pl should compare these two files and produce a score of some sort that measures the effectiveness of the alignment program. gold.utx should follow the same format as output.utx; the only difference is that gold.utx should be created manually, while output.utx is the aligned version of that data as created by your program.
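Since the choice of measure is left to each team, the following is only one possible sketch: precision, recall, and F1 over alignment links. It assumes the links have already been extracted from gold.utx and output.utx into strings such as "3-3" or "4,5-4" (source sentence ids, a dash, target sentence ids). That string form and the helper name `score_alignments` are assumptions for illustration, not the required .utx tag format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# score_alignments is a hypothetical helper. Each argument is a reference
# to a list of alignment links written as strings (assumed form: "4,5-4").
# A link counts as correct only if it matches a gold link exactly.
sub score_alignments {
    my ($gold, $output) = @_;
    my %in_gold = map { $_ => 1 } @$gold;
    my %in_out  = map { $_ => 1 } @$output;

    my $n_gold  = keys %in_gold;
    my $n_out   = keys %in_out;
    my $correct = grep { exists $in_gold{$_} } keys %in_out;

    my $precision = $n_out  ? $correct / $n_out  : 0;
    my $recall    = $n_gold ? $correct / $n_gold : 0;
    my $f1 = ($precision + $recall) > 0
        ? 2 * $precision * $recall / ($precision + $recall)
        : 0;
    return ($precision, $recall, $f1);
}

# Toy data: the program missed the 2-to-1 link "3,4-3".
my @gold   = ('1-1', '2-2', '3,4-3', '5-4');
my @output = ('1-1', '2-2', '3-3', '4-4', '5-4');
my ($p, $r, $f) = score_alignments(\@gold, \@output);
printf "precision=%.2f recall=%.2f f1=%.2f\n", $p, $r, $f;
# prints: precision=0.60 recall=0.75 f1=0.67
```

Exact matching of whole links is strict: a 2-to-1 alignment that the program splits into two 1-to-1 links earns no credit. A team could instead give partial credit per sentence, as long as the report states which choice was made.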

Other information

There are four items to submit.

the written report
Perl implementation of sentence alignment program
gold standard and other data you create/find
evaluation program

Please use turnin to submit all four items. Remember that you can only use turnin from hh33812. No email submission is necessary. The proper turnin commands are:

turnin -c cs8995 -p p1a teamname.pl (for sentence alignment program)
turnin -c cs8995 -p p1b teamname.utx (gold standard data)
turnin -c cs8995 -p p1c teamname.(pdf|ps) (for written report)
turnin -c cs8995 -p p1d teamname-eval.pl (evaluation program)

This is a team project. Please consult with and work with your team members closely. You may divide the work as you see fit, and you have considerable discretion in your approach to this problem.

Please note that the deadline will be enforced by automatic means. Any submissions after the deadline will not be received.

by: Ted Pedersen - tpederse@d.umn.edu
