Ted Pedersen - CS 8995 Corpus Based Natural Language Processing

Final Project: Empirical Methods for Multilingual Text

Stage 1 - Gold Standard Data as submitted by each team on 3/23

Download the gold standard data from each team and run your sentence alignment program on it. Name your sentence alignment program teamname.pl (where teamname is morelia, toluca, etc.) and your sentence alignment evaluation program teamname-eval.pl. Remember to remove the alignment tags from the gold standard data before you feed it to your sentence aligner; the tags should be used only by the evaluation program. For example, if you are team TOLUCA and you are running the MORELIA gold standard data, the steps you follow might look like this:

 
remove-align-tags.pl morelia.utx > morelia-notag.utx 
toluca.pl morelia-notag.utx > morelia-align.utx
toluca-eval.pl morelia.utx morelia-align.utx
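
If you have several teams' files to process, the three steps above can be wrapped in a small shell script. What follows is only a sketch under stated assumptions: the MY_TEAM variable, the derived_name and run_stage1 helpers, and the idea of passing gold standard files on the command line are illustrative and not part of the assignment. It also assumes the three .pl scripts are executable and on your PATH.

```shell
#!/bin/sh
# Sketch only: repeats the three Stage 1 steps for each gold standard file
# given on the command line. MY_TEAM, derived_name and run_stage1 are
# hypothetical names introduced for illustration.

MY_TEAM=${MY_TEAM:-toluca}

# Build an output name from a gold file and a suffix:
# morelia.utx + notag gives morelia-notag.utx
derived_name() {
    printf '%s-%s.utx\n' "${1%.utx}" "$2"
}

run_stage1() {
    gold=$1
    notag=$(derived_name "$gold" notag)
    align=$(derived_name "$gold" align)

    # Strip the alignment tags so the aligner never sees them.
    remove-align-tags.pl "$gold" > "$notag" || return 1
    # Run this team's aligner on the untagged text.
    "${MY_TEAM}.pl" "$notag" > "$align" || return 1
    # Compare the aligner output with the tagged gold standard.
    "${MY_TEAM}-eval.pl" "$gold" "$align"
}

# Nothing runs unless gold standard files are passed as arguments.
for gold in "$@"; do
    echo "== $gold =="
    run_stage1 "$gold" || echo "failed on $gold" >&2
done
```

Saved under a name of your choosing, for example run-stage1.sh, it could be run as: sh run-stage1.sh morelia.utx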

Please send me a summary of your results via email to tpederse. Just the table described in this evaluation note will be fine. Make sure to identify the team for which you are submitting results. I plan to present and discuss these in class on Monday 3/26, so it would help to have them as far ahead of class time as possible.

by: Ted Pedersen - tpederse@d.umn.edu