CS 8995 Corpus Based Natural Language Processing

UNDER CONSTRUCTION. THIS MAY BE REVISED IN RESPONSE TO QUESTIONS. LAST UPDATE Tue APRIL 3, 10 am

Final Project: Empirical Methods for Multilingual Text

Stage 2 - Due Mon, Apr 16, 4 pm

Objectives

To refine the methods developed for sentence alignment of parallel corpora in the previous assignment.

Specification

Your team should further study and improve the performance of the alignment algorithm that you implemented in stage 1. Feel free to use whichever team member's stage 1 resources appear to be the most effective. You may assume that one of the languages to be aligned is English.

For all code submitted, please include a header comment that gives the name of your team and its members, as well as brief instructions on how to use the program. If you have used code from a previous team, please give full and accurate credit to the original author(s) and clearly explain how you have changed their code. You should also explicitly state that you have received the original authors' permission to use this code.

Other information

There are five required items to submit. If you have separated boundary detection from Gale-Church, you may also submit the boundary detection program; this sixth item is optional. Please use turnin to submit all five required items. Remember that you can only use turnin from hh33812. No email submission is necessary. The proper turnin commands are:
turnin -c cs8995 -p p2a teamname-gc.pl (Gale-Church sentence alignment)
turnin -c cs8995 -p p2b teamname-b.pl (baseline algorithm)
turnin -c cs8995 -p p2c teamname.(pdf|ps) (written analysis)
turnin -c cs8995 -p p2d teamname-eval.pl (evaluation program)
turnin -c cs8995 -p p2e teamname-results.dat (evaluation results)
turnin -c cs8995 -p p2f teamname-bound.pl (optional separate boundary detector)
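If you would rather not type each command by hand, the list above can be wrapped in a small shell script. The following is only a sketch under stated assumptions: the TEAM and DRY_RUN variables and the submit and submit_all functions are hypothetical names chosen for illustration, and the script assumes your files sit in the current directory under the names shown above. By default it only prints the turnin commands so that you can check them first.

```sh
#!/bin/sh
# Hypothetical helper, not part of the assignment: wraps the turnin
# commands listed above. TEAM and DRY_RUN are illustrative assumptions.
TEAM="${TEAM:-teamname}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run turnin

submit() {
    # $1 = turnin project (p2a..p2f), $2 = file to submit
    if [ "$DRY_RUN" = "1" ]; then
        echo "turnin -c cs8995 -p $1 $2"
    else
        turnin -c cs8995 -p "$1" "$2"
    fi
}

submit_all() {
    submit p2a "$TEAM-gc.pl"        # Gale-Church sentence alignment
    submit p2b "$TEAM-b.pl"         # baseline algorithm
    # Written analysis: use the .ps file only when no .pdf is present
    if [ -f "$TEAM.ps" ] && [ ! -f "$TEAM.pdf" ]; then
        submit p2c "$TEAM.ps"
    else
        submit p2c "$TEAM.pdf"
    fi
    submit p2d "$TEAM-eval.pl"      # evaluation program
    submit p2e "$TEAM-results.dat"  # evaluation results
    # Optional item: submitted only if the file exists
    if [ -f "$TEAM-bound.pl" ]; then
        submit p2f "$TEAM-bound.pl"
    fi
}

submit_all
```

Once the printed commands look right, setting DRY_RUN=0 in the environment (on hh33812) makes the same script call turnin for real.
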
This is a team project. Please consult with and work with your team members closely. You may divide the work as you see fit, and you have considerable discretion in your approach to this problem.

Please note that the deadline will be enforced by automatic means. Submissions made after the deadline will not be accepted.

by: Ted Pedersen - tpederse@d.umn.edu