CS 8995 Corpus Based Natural Language Processing

Assignment 4 Bitexts

Below are the bitexts collected by the members of CS 8995. Each bitext pairs English with another language, among them French, Spanish, Chinese, and German.

I have reviewed these bitexts, and in general they seem like reasonable translations, although the data is sometimes rather messy. If you notice any serious problems in this data, please let me know and I will ask the creator to correct them.

I encourage you to use some of this data to experiment with your sentence boundary detection and alignment programs. While you don't have a gold standard to compare against, you can get an idea of how well your approach is working simply by inspecting the output.
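As one way to make that inspection quicker, here is a minimal sketch that flags aligned sentence pairs whose lengths look badly mismatched, so you know where to look first. It assumes your aligner already produces a list of (English sentence, other-language sentence) pairs; the function name flag_suspicious_pairs and the thresholds low and high are hypothetical, and it does not read the .utx files directly. Character-length ratios differ a lot by language (Chinese text is typically much shorter in characters than its English translation), so the thresholds would need adjusting per language pair.

```python
def flag_suspicious_pairs(pairs, low=0.5, high=2.0):
    """Return (index, ratio, source, target) for pairs worth a manual look.

    pairs: list of (source_sentence, target_sentence) strings.
    low, high: assumed bounds on the target/source character-length ratio;
    these are illustrative defaults, not tuned values.
    """
    flagged = []
    for i, (src, tgt) in enumerate(pairs):
        src_len = len(src.strip())
        tgt_len = len(tgt.strip())
        # An empty side usually means the aligner dropped or merged a sentence.
        if src_len == 0 or tgt_len == 0:
            flagged.append((i, None, src, tgt))
            continue
        ratio = tgt_len / src_len
        if ratio < low or ratio > high:
            flagged.append((i, ratio, src, tgt))
    return flagged


if __name__ == "__main__":
    # Made-up sample pairs, only to show the call.
    sample = [
        ("The cat sleeps.", "Le chat dort."),
        ("Hello.", "Bonjour, comment allez-vous aujourd'hui ?"),
        ("Good night.", ""),
    ]
    for index, ratio, src, tgt in flag_suspicious_pairs(sample):
        print(index, ratio, repr(src), repr(tgt))
```

A flagged pair is not necessarily wrong; it is just a candidate for closer reading. Pairs that pass the check can still be misaligned, so this only supplements reading the output yourself.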

ajoshi.utx
anagaraj.utx
apaluska.utx
bane0025.utx
dhru0002.utx
gchen.utx
hbommaga.utx
knakka.utx
kodu0002.utx
kotn0007.utx
kris0078.utx
kulk0015.utx
kvanhorn.utx
lath0027.utx
nath0030.utx
pars0093.utx
sholtz.utx
sing0174.utx
svikram.utx
vadr0001.utx
varm0003.utx
xliu.utx
zchen.utx

by: Ted Pedersen - tpederse@d.umn.edu