RESEARCH
My thesis research is in Natural Language Processing. It focuses on
developing methods that automatically organize email into topic folders,
and we will be experimenting with the Enron e-mail corpus. You might also be
interested in the work done by Ron Bekkerman, who has created a small subset
of the Enron corpus and run some experiments on it. More details
can be found here.
We are presently classifying a portion of the corpus into categories manually.
Tools that automatically classify data into categories will then be run on
the same data. Comparing the manual and automatic classifications should show
how closely the tools agree with human judgment.
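The comparison of manual and automatic classifications can be sketched as below. This is only a minimal illustration, not the tooling used in the thesis: the function name compare_labels, the message ids, and the folder names are all hypothetical, and it assumes each message carries exactly one category label.

```python
from collections import Counter


def compare_labels(manual, automatic):
    """Compare manual and automatic category labels for the same messages.

    Both arguments map a message id to a single category name.
    Returns (agreement, disagreements), where agreement is the fraction
    of shared messages given the same label, and disagreements counts
    each (manual_label, automatic_label) pair that differs.
    """
    # Only messages labelled by both sources can be compared.
    shared = manual.keys() & automatic.keys()
    if not shared:
        raise ValueError("no messages in common")

    agree = sum(1 for msg in shared if manual[msg] == automatic[msg])
    disagreements = Counter(
        (manual[msg], automatic[msg])
        for msg in shared
        if manual[msg] != automatic[msg]
    )
    return agree / len(shared), disagreements


# Hypothetical example data (not taken from the Enron corpus).
manual = {"msg1": "meetings", "msg2": "legal",
          "msg3": "personal", "msg4": "legal"}
automatic = {"msg1": "meetings", "msg2": "legal",
             "msg3": "meetings", "msg4": "personal"}

agreement, disagreements = compare_labels(manual, automatic)
print(f"agreement: {agreement:.2f}")
for (human, tool), count in sorted(disagreements.items()):
    print(f"manual={human} automatic={tool}: {count}")
```

The disagreement counts show which categories a tool tends to confuse, which is often more informative than the overall agreement figure alone.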
At present, I am reading papers and websites related to the above
that will aid my research.
Some other interesting links:
· UC Berkeley Enron Email Analysis
A collection of resources created by the
A package of Perl programs that discriminates among the senses of a word through the use of clustering techniques.
Home page of the first Conference on E-mail and Anti-Spam.
· Gmail
Data creation scripts:
Some annotated data used in experiments:
· Manually sorted .cd3 files in a new directory structure
Chapter 5 – Classification Methods