CS 8995 Corpus Based Natural Language Processing

Assignment 4 - Due Wed, Mar 7, 4 pm

Under Construction - Date of last update: Thu Mar 01, 11am

Objectives

To collect a parallel corpus of translated text for future projects.

Specification

Create a Unicode (UTF-8) file that contains parallel text from two languages. Your corpus should consist of 20 pairs of translated articles, documents, etc., where one of the languages is English and the other is a single language of your choice. Thus, your corpus should consist of 40 articles, 20 in English and 20 in the other language. Your corpus should consist of plain text (like the Project Gutenberg files). There should be no HTML, XML, or other forms of markup embedded in the text.
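One way to check an article before adding it to the corpus is sketched below. This is only an illustration and not part of the assignment: the helper names `is_valid_utf8` and `find_markup` are hypothetical, and the regular expressions are a rough heuristic for leftover tags and character entities, not a full HTML or XML parser. Apply it to the raw article text, before wrapping it in the required submission format.

```python
import re
from typing import List

# Heuristic patterns (assumptions for this sketch, not a complete markup grammar).
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")                       # e.g. <p>, </div>, <br/>
ENTITY_RE = re.compile(r"&(?:[A-Za-z]+|#\d+|#x[0-9A-Fa-f]+);")   # e.g. &amp;, &#233;


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_markup(text: str) -> List[str]:
    """Return tag-like and entity-like strings found in the text."""
    return TAG_RE.findall(text) + ENTITY_RE.findall(text)


if __name__ == "__main__":
    sample = "<p>Hello &amp; bye</p>"
    print(is_valid_utf8(sample.encode("utf-8")))
    print(find_markup(sample))
```

An empty list from `find_markup` suggests the text is free of obvious markup, but it is still worth reading through each article by eye.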

Each article in your corpus should be at least 500 words long (counted in English). However, if you can find longer articles, your results on subsequent projects will be much better. Do not subdivide longer articles, government documents, technical documentation, etc. into smaller pieces. Your objective should be to collect 20 distinct articles rather than simply collecting 10,000 words of text (20 articles * 500 word minimum) in each of the two languages.
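The length requirement can be checked with a short script such as the following sketch. It assumes the English articles are already loaded as a list of strings, and it counts words by splitting on whitespace, which is a reasonable approximation for English only. The names `word_count` and `check_english_articles` are hypothetical.

```python
MIN_WORDS = 500     # minimum English words per article (from the assignment)
NUM_ARTICLES = 20   # required number of article pairs (from the assignment)


def word_count(text):
    """Count whitespace-separated tokens; a rough word count for English."""
    return len(text.split())


def check_english_articles(articles):
    """Return a list of problems; an empty list means both counts are satisfied."""
    problems = []
    if len(articles) != NUM_ARTICLES:
        problems.append(f"expected {NUM_ARTICLES} articles, found {len(articles)}")
    for i, text in enumerate(articles, start=1):
        n = word_count(text)
        if n < MIN_WORDS:
            problems.append(f"article {i} has {n} words, fewer than {MIN_WORDS}")
    return problems


if __name__ == "__main__":
    # Toy input: 19 long articles and one that is too short.
    articles = ["word " * 600] * 19 + ["far too short"]
    for problem in check_english_articles(articles):
        print(problem)
```

Note that the check is per article: a corpus can exceed 10,000 English words in total and still fail if any single article falls below 500.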

Good choices for text include newspaper or magazine articles, technical documentation, and government documents. Do not use religious or literary works. Make sure that you are collecting articles that have been translated, as opposed to articles that are merely about the same topic but differ substantially in content. Do not, under any circumstances, use online translation aids to create a translation!

Submission format

It is crucial that you format your corpus in this format. Here is an example (using literary text, however!).

You should have twenty such entries in your corpus. Remember that your student-id-number is the 7-digit code that UMD knows you as (not your email address). The entire corpus that you submit should be in Unicode, so you must use either Perl or a Unicode text editor to create the markup scheme shown above.

Other information

Please use turnin to submit this assignment. Remember that you can only use turnin from hh33812. No email submission is necessary. The proper turnin command is:
turnin -c cs8995 -p a4 userid.utx
This is an individual assignment. Please work on your own. Given the huge volume of text available online, it is highly unlikely that you will coincidentally find the same articles as another student. However, if you are concerned that there is a common source of translated text that multiple students might access (a popular newspaper, etc.), please arrange with your colleagues to avoid collecting the same text!

Please note that the deadline will be enforced by automatic means. Any submissions after the deadline will not be received.

by: Ted Pedersen - tpederse@d.umn.edu