CS 8995 Corpus Based Natural Language Processing

Final Project: Empirical Methods for Multilingual Text

Stage 1 - Gold Standard Data Due Fri, Mar 23, 4 pm
Everything else Due Mon, Mar 26, 4 pm

As of Mar 20 the evaluation requirements have been expanded! Please make sure to include the new information!


The objective of this project is to investigate the problem of sentence alignment of parallel corpora.


Your team will produce a written document that describes the general problem, previous approaches taken to solve this problem, your particular approach, and how you evaluated the effectiveness of your approach. Your team should implement your solution in Perl and evaluate it using gold standard data that your team either creates or finds.
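As a rough illustration of what evaluating against gold standard data can look like, the sketch below scores a set of proposed sentence alignments with precision, recall, and F1. It rests on several assumptions that are not part of this assignment: the function name evaluate_alignment is hypothetical, each alignment is represented as a (source sentence index, target sentence index) pair, the sample data is made up, and only exact matches count as correct. It is written in Python purely for illustration; the project itself must be implemented in Perl, and the evaluation requirements for this project take precedence over anything shown here.

```python
def evaluate_alignment(predicted, gold):
    """Score predicted alignment links against gold standard links.

    Both arguments are iterables of (source_index, target_index) pairs.
    This pair representation is an assumption made for this sketch,
    not a format required by the assignment.
    Returns a (precision, recall, f1) tuple.
    """
    predicted = set(predicted)
    gold = set(gold)

    # A predicted link is correct only if the identical link is in the gold data.
    correct = len(predicted & gold)

    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: four gold links, three predicted links, two of them correct.
gold_links = [(0, 0), (1, 1), (2, 2), (3, 3)]
predicted_links = [(0, 0), (1, 1), (2, 3)]

precision, recall, f1 = evaluate_alignment(predicted_links, gold_links)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Treating each link as an exact-match pair keeps the scoring simple. Alignments that merge or split sentences (one-to-many links) would need a richer representation, and how to score them is a design decision for your team to describe in the written report.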

There are four items to submit. Please use turnin to submit all four items. Remember that you can only use turnin from hh33812. No email submission is necessary. The proper turnin commands are:
turnin -c cs8995 -p p1a teamname.pl (sentence alignment program)
turnin -c cs8995 -p p1b teamname.utx (gold standard data)
turnin -c cs8995 -p p1c teamname.(pdf|ps) (written report)
turnin -c cs8995 -p p1d teamname-eval.pl (evaluation program)

This is a team project. Please consult with and work with your team members closely. You may divide the work as you see fit, and you have considerable discretion in your approach to this problem.

Please note that the deadline is enforced automatically. Submissions made after the deadline will not be accepted.

