My thesis research is broadly based on Natural Language Processing. The thesis focuses on developing methods that automatically organize email into topic folders, and we are experimenting with the Enron e-mail corpus. You might also be interested in the work done by Ron Bekkerman, who has created a small subset of the Enron Corpus and run some experiments on it. More details can be found here.

We are presently classifying a portion of the corpus into categories manually. Tools that automatically classify data into categories will then be run on the same data. Comparing the manually classified and automatically classified data should show how closely the tools agree with the manual sorting.
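The comparison between manual and automatic classification described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `compare_labels`, the message ids, and the folder names are all hypothetical, and it is not the evaluation code used in the thesis. It assumes each message has exactly one manual folder and one automatic folder.

```python
from collections import Counter


def compare_labels(manual, automatic):
    """Compare manual and automatic folder assignments for the same messages.

    Both arguments map a message id to a folder name (hypothetical format).
    Returns the fraction of shared messages on which the two agree, and a
    Counter of (manual_folder, automatic_folder) pairs.
    """
    # Only messages present in both labelings can be compared.
    shared = sorted(set(manual) & set(automatic))
    if not shared:
        raise ValueError("no messages in common")
    confusion = Counter((manual[m], automatic[m]) for m in shared)
    agree = sum(n for (gold, pred), n in confusion.items() if gold == pred)
    return agree / len(shared), confusion


# Made-up toy data, for illustration only.
manual = {"msg1": "meetings", "msg2": "legal", "msg3": "legal", "msg4": "personal"}
automatic = {"msg1": "meetings", "msg2": "legal", "msg3": "meetings", "msg4": "personal"}

accuracy, confusion = compare_labels(manual, automatic)
print(f"agreement: {accuracy:.2f}")
for (gold, pred), n in sorted(confusion.items()):
    if gold != pred:
        print(f"manual={gold!r} automatic={pred!r}: {n}")
```

The pair counts make it easy to see which folders the automatic tool confuses with each other, which a single agreement figure would hide.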

At present, I am going through the following papers and websites, which relate to the above and will aid my research.





      Introducing the Enron Corpus

      The Enron Corpus: A New Dataset for Email Classification Research

      Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora

      eMailSift: Adapting Graph Mining Techniques for Email Classification

      Email Classification with Co-Training





Some other interesting links:


       UC Berkeley Enron Email Analysis

A collection of resources created by the University of California at Berkeley, making use of the Enron data.



A package of Perl programs that discriminates among the senses of a word through the use of clustering techniques.


      CEAS 2004 Conference Papers

Home page of the first Conference on E-mail and Anti-Spam.





Data creation scripts:



Some annotated data used in experiments:


      Original .cd3 files

       Manually sorted .cd3 files in a new directory structure



Data Results in Graphs

Chapter 5 Classification Methods