CS 8761 Natural Language Processing - Fall 2002 - MRD + Web => Corpus

Final Project - Stage II - Warmup due Mon Nov 25, noon

This may be revised in response to your questions. Last update Tue Nov 19 1:00 pm


The World Wide Web is a billion+ word corpus. Let's use it. :)


This is a simple warm-up exercise designed to give you some experience using the Web as a source of text data and using the LWP module in Perl. Stage II will be expanded once we are limber.

Write a module using LWP that will find and retrieve sentences in Web pages that use a given set of words. You should allow the user to specify one or more words to be found. Your module should collect as many sentences as it can that use these words.
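
To make the task concrete, here is a minimal sketch of the core step for a single page: fetch it with LWP::UserAgent, strip the markup, and keep the sentences that contain the requested words. The helper names (fetch_page, html_to_text, sentences_with_words), the user-agent string, and the decision to require ALL of the given words in a sentence are assumptions made for illustration, not requirements. The tag stripping and sentence splitting are deliberately crude.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: fetch one URL and return the page body as text,
# or undef if the request fails.
sub fetch_page {
    my ($url) = @_;
    require LWP::UserAgent;    # loaded on first use
    my $ua = LWP::UserAgent->new(
        timeout => 15,
        agent   => 'cs8761-warmup/0.1',    # assumed agent string
    );
    my $response = $ua->get($url);
    return $response->is_success ? $response->decoded_content : undef;
}

# Hypothetical helper: crude HTML-to-text conversion with regexes.
# A real module would more likely use a proper HTML parser.
sub html_to_text {
    my ($html) = @_;
    $html =~ s/<(script|style)\b.*?<\/\1\s*>/ /gis;    # drop scripts and styles
    $html =~ s/<[^>]*>/ /g;                            # drop remaining tags
    $html =~ s/&nbsp;/ /g;
    $html =~ s/\s+/ /g;                                # collapse whitespace
    return $html;
}

# Hypothetical helper: split text after ., ! or ? and keep the sentences
# that contain every requested word (case-insensitive, whole word).
sub sentences_with_words {
    my ($text, @words) = @_;
    my @hits;
    SENTENCE:
    for my $sentence (split /(?<=[.!?])\s+/, $text) {
        for my $word (@words) {
            next SENTENCE unless $sentence =~ /\b\Q$word\E\b/i;
        }
        push @hits, $sentence;
    }
    return @hits;
}

# Command-line use: perl sketch.pl URL word [word ...]
if (@ARGV >= 2) {
    my ($url, @words) = @ARGV;
    my $html = fetch_page($url);
    die "Could not fetch $url\n" unless defined $html;
    print "$_\n" for sentences_with_words(html_to_text($html), @words);
}
```

A single page will rarely yield many sentences, so something has to supply this step with a large number of URLs.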

You have a fundamental choice to make in creating this program. Do you query a search engine (via LWP) to find this information or do you create a spider (using LWP) of some sort that goes to selected neighborhoods in the Web and locates text? I leave this choice to you. However, it is important that your module gather as much text as it can. For example, if I request sentences with a very common word in it (like "baseball" or "love" or "car") I would expect your module to return quite a few sentences. We will need to collect large amounts of data for the final portion of this project, so please focus on maximizing the amount of text you locate.
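
If you go the spider route, the heart of it is a queue of URLs still to visit and a record of URLs already seen. The following is a minimal breadth-first sketch under some assumptions of my own: the function names (lwp_fetch, html_links, crawl), the option names (seeds, max_pages, fetch, links), and the default page limit are all hypothetical. The fetch and link-extraction steps are passed in as code references so the crawl logic can be tried without touching the network. A real spider should also respect robots.txt and pause between requests; LWP::RobotUA is one existing way to do that.

```perl
use strict;
use warnings;

# Default fetcher (assumed name): return HTML for a URL, or undef on failure.
sub lwp_fetch {
    my ($url) = @_;
    require LWP::UserAgent;    # loaded on first use
    my $ua  = LWP::UserAgent->new(timeout => 15, agent => 'cs8761-spider/0.1');
    my $res = $ua->get($url);
    return undef unless $res->is_success && $res->content_type eq 'text/html';
    return $res->decoded_content;
}

# Default link extractor (assumed name): absolute http(s) links from <a href>.
sub html_links {
    my ($html, $base) = @_;
    require HTML::LinkExtor;    # loaded on first use
    my $parser = HTML::LinkExtor->new(undef, $base);    # $base absolutizes links
    $parser->parse($html);
    $parser->eof;
    my @urls;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' && defined $attr{href};
        my $url = "$attr{href}";    # stringify the URI object
        $url =~ s/#.*//;            # ignore fragments
        push @urls, $url if $url =~ m{^https?://};
    }
    return @urls;
}

# Breadth-first crawl from the seed URLs; returns { url => html }.
sub crawl {
    my (%opt) = @_;
    my $fetch     = $opt{fetch}     || \&lwp_fetch;
    my $links     = $opt{links}     || \&html_links;
    my $max_pages = $opt{max_pages} || 20;

    my @queue = @{ $opt{seeds} };
    my %seen  = map { $_ => 1 } @queue;
    my %pages;

    while (@queue && scalar(keys %pages) < $max_pages) {
        my $url  = shift @queue;
        my $html = $fetch->($url);
        next unless defined $html;
        $pages{$url} = $html;
        for my $next ($links->($html, $url)) {
            push @queue, $next unless $seen{$next}++;    # enqueue each URL once
        }
    }
    return \%pages;
}
```

The search engine route replaces the queue with result pages from a query, but the fetching and text extraction that follow are the same.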

Your module should attempt to filter out non-sentences (very short fragments, captions, etc.) and display only complete sentences, but of course this will be imperfect.
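
One way to approach the filtering is with a handful of cheap heuristics. The sketch below is an illustration only: the function names (looks_like_sentence, filter_sentences), the five-word minimum, and the 60% letter threshold are assumed values, not tuned ones, and the capital-letter test assumes English text.

```perl
use strict;
use warnings;

# Return 1 if the string plausibly is a complete sentence, else 0.
# All thresholds here are assumptions chosen for illustration.
sub looks_like_sentence {
    my ($s, %opt) = @_;
    my $min_words = $opt{min_words} || 5;

    $s =~ s/^\s+|\s+$//g;                           # trim
    return 0 unless length $s;
    return 0 unless $s =~ /^["'(]?[A-Z]/;           # starts with a capital
    return 0 unless $s =~ /[.!?]["')]?$/;           # ends with . ! or ?

    my @words = split ' ', $s;
    return 0 if @words < $min_words;                # too short: caption, label

    my $letters = () = $s =~ /[A-Za-z]/g;           # count alphabetic characters
    return 0 if $letters / length($s) < 0.6;        # mostly digits or symbols

    return 1;
}

# Keep only the strings that pass the heuristic.
sub filter_sentences {
    return grep { looks_like_sentence($_) } @_;
}
```

Menu bars, photo credits, and score listings tend to fail at least one of these checks, while ordinary prose passes. Some fragments will still slip through and some real sentences will be rejected.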

In addition to your module, please provide an interface program that "shows off" the functionality of your module. Since we will be collecting large amounts of text, it will make sense to allow users to write the data collected to a file that they name. I will demo such an interface in class so that you have an idea of what this looks like.
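
Until the demo, the following rough sketch shows one shape such an interface could take: parse the words and an output file name from the command line, then write one sentence per line. Everything specific here is an assumption of mine: the option names (--out, --max), the function names, and especially collect_sentences, which is a placeholder standing in for whatever your module actually provides.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse "[--out FILE] [--max N] word [word ...]" into a hash reference.
sub parse_args {
    my (@argv) = @_;
    my %opt = (out => undef, max => 1000);    # assumed defaults
    GetOptionsFromArray(\@argv, 'out=s' => \$opt{out}, 'max=i' => \$opt{max})
        or die "usage: $0 [--out FILE] [--max N] word [word ...]\n";
    die "Please give at least one word to search for.\n" unless @argv;
    $opt{words} = [@argv];
    return \%opt;
}

# Write one sentence per line to the named file, or to STDOUT if no
# file name was given. Returns the number of sentences written.
sub write_sentences {
    my ($path, @sentences) = @_;
    my $fh;
    if (defined $path) {
        open $fh, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!\n";
    }
    else {
        $fh = \*STDOUT;
    }
    print {$fh} "$_\n" for @sentences;
    if (defined $path) {
        close $fh or die "Cannot close $path: $!\n";
    }
    return scalar @sentences;
}

# Placeholder only: a real interface would call the team's module here.
sub collect_sentences {
    my (@words) = @_;
    return ();
}

if (@ARGV) {
    my $opt       = parse_args(@ARGV);
    my @sentences = collect_sentences(@{ $opt->{words} });
    splice @sentences, $opt->{max} if @sentences > $opt->{max};
    my $count = write_sentences($opt->{out}, @sentences);
    print STDERR "Wrote $count sentences\n";
}
```

A run might look like "perl collect.pl --out baseball.txt baseball", where collect.pl is a made-up script name.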


Provide sufficient documentation (to be read via perldoc) such that I can install your module and start to use it without much difficulty.
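
For those who have not written POD before, here is a bare-bones skeleton of a module file with documentation that perldoc can display. The module name WebSentences, the constructor and its words argument, and the section contents are placeholders I made up; the section headings (NAME, SYNOPSIS, DESCRIPTION, and so on) follow common CPAN practice.

```perl
package WebSentences;    # hypothetical module name

use strict;
use warnings;

our $VERSION = '0.01';

# Constructor: remembers the words whose sentences we want to collect.
sub new {
    my ($class, %args) = @_;
    my $self = { words => $args{words} || [] };
    return bless $self, $class;
}

1;

=head1 NAME

WebSentences - collect sentences from the Web that use given words

=head1 SYNOPSIS

    use WebSentences;

    my $finder = WebSentences->new(words => ['baseball', 'love']);

=head1 DESCRIPTION

Say here what the module does, how it locates text (search engine or
spider), and what it treats as a sentence.

=head1 INSTALLATION

Describe the Makefile.PL based install and any modules that must be
present first.

=head1 AUTHORS

Team name, individual names, date.

=head1 COPYRIGHT AND LICENSE

State the copyright holders and the terms of distribution.

=cut
```

Running perldoc on the file shows the sections in the order they appear.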

Submission Guidelines

Submit your module and supporting files to the web drop as a single tar file that is named for your team. Things should be structured such that I can install your module using the standard "3 step" CPAN install. The three steps are as follows.
perl Makefile.PL
make test
make install
I should not have to do anything else to get your module installed. I should be able to include it in a program via the "use" directive. Please provide some example usages in your documentation, and of course your test files will show how to use the code as well.
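
The perl Makefile.PL step is driven by a short script built on ExtUtils::MakeMaker. The sketch below assumes a hypothetical module named WebSentences stored at lib/WebSentences.pm with test files under t/; the author string and the prerequisite list are placeholders to replace with your own.

```perl
# Makefile.PL (sketch): generates the Makefile used by "make test".
use strict;
use warnings;
use ExtUtils::MakeMaker;

WriteMakefile(
    NAME         => 'WebSentences',                  # hypothetical module name
    VERSION_FROM => 'lib/WebSentences.pm',           # reads $VERSION from the module
    ABSTRACT     => 'Collect sentences from the Web that use given words',
    AUTHOR       => 'Team Name <team@example.edu>',  # placeholder
    PREREQ_PM    => {
        'LWP::UserAgent' => 0,    # any version
        'Test::More'     => 0,    # used by the files in t/
    },
);
```

By default the generated Makefile runs every t/*.t file during make test, so that directory is the natural home for your test files.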

Make sure your team name, individual names, date, etc. are included in your source code. Your code may well end up being distributed via CPAN, so provide appropriate info about copyrights, distribution, etc.

This is a team assignment. You are strongly advised to divide up the work of the project into tasks that can be carried out in parallel by various team members. All team members should be acknowledged in the comments, etc. and all teammates will receive the same grade. Do not work with other teams. Each team should operate independently of all the other teams. Make your own decisions as a team and do not be influenced by the decisions of other teams if you happen to hear of them accidentally. You are free to work with your teammates as closely as is necessary.

by: Ted Pedersen - tpederse@umn.edu