CS 8761 Natural Language Processing - Fall 2002 - MRD + Web => Corpus
Final Project - Stage II - Warmup due Mon Nov 25, noon
This may be revised in response to your questions. Last update Tue
Nov 19 1:00 pm
The World Wide Web is a billion+ word corpus. Let's use it. :)
This is a simple warm-up exercise designed to give you some experience
with using the Web as a source of text data and with the LWP module in
Perl. Stage II will be expanded once we are limber.
Write a module using LWP that will find and retrieve sentences in Web
pages that use a given set of words. You should allow the user to specify
one or more words to be found. Your module should collect as many
sentences as it can that use these words.
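As a rough illustration of the core task, here is a minimal sketch that fetches one page with LWP::UserAgent, strips the markup, and keeps the sentences that mention the requested words. The helper names (fetch_page, strip_html, sentences_with_words) and the agent string are hypothetical, the sentence splitting is deliberately naive, and the sketch assumes that a sentence must contain every word the user gives; matching any one of the words would be an equally reasonable reading.

```perl
use strict;
use warnings;

# Hypothetical helper: fetch one URL and return its text, or undef on failure.
# LWP::UserAgent is loaded on first use so the text helpers below can be
# exercised without network access.
sub fetch_page {
    my ($url) = @_;
    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new(timeout => 10, agent => 'cs8761-warmup/0.1');
    my $response = $ua->get($url);
    return $response->is_success ? $response->decoded_content : undef;
}

# Crude tag stripper; a real module would more likely use an HTML parser.
sub strip_html {
    my ($html) = @_;
    $html =~ s/<(script|style)\b.*?<\/\1>/ /gis;   # drop script and style bodies
    $html =~ s/<[^>]*>/ /g;                        # drop remaining tags
    $html =~ s/\s+/ /g;                            # collapse whitespace
    return $html;
}

# Return the sentences in $text that contain every word in @words
# (case-insensitive, whole-word match).
sub sentences_with_words {
    my ($text, @words) = @_;
    my @sentences = split /(?<=[.!?])\s+/, $text;
    my @hits;
    SENTENCE: for my $s (@sentences) {
        for my $w (@words) {
            next SENTENCE unless $s =~ /\b\Q$w\E\b/i;
        }
        push @hits, $s;
    }
    return @hits;
}
```

A caller might combine these as sentences_with_words(strip_html(fetch_page($url)), @words) after checking that fetch_page returned a defined value.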
You have a fundamental choice to make in creating this program. Do you
query a search engine (via LWP) to find this information or do you create
a spider (using LWP) of some sort that goes to selected neighborhoods in
the Web and locates text? I leave this choice to you. However, it is
important that your module gather as much text as it can. For example, if
I request sentences with a very common word in them (like "baseball" or
"love" or "car") I would expect your module to return quite a few
sentences. We will need to collect large amounts of data for the
final portion of this project, so please focus on maximizing the
amount of text you locate.
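If you take the spider route, the traversal itself might look something like the sketch below. It is only one possible design: the function names are hypothetical, links are found with a simple regular expression, relative links are ignored, and the fetcher is passed in as a code reference so that the crawl logic can be tried without touching the network. In a real module that code reference would wrap an LWP::UserAgent request.

```perl
use strict;
use warnings;

# Pull absolute http(s) links out of an HTML string with a simple regex.
# Relative links are skipped in this sketch.
sub extract_links {
    my ($html) = @_;
    my @links = $html =~ /<a\s[^>]*?href\s*=\s*["'](https?:\/\/[^"'#\s]+)/gi;
    return @links;
}

# Breadth-first crawl from $start, keeping at most $max_pages pages.
# $fetch maps a URL to its HTML, or returns undef on failure.
sub crawl {
    my ($start, $max_pages, $fetch) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @pages;
    while (@queue && @pages < $max_pages) {
        my $url  = shift @queue;
        my $html = $fetch->($url);
        next unless defined $html;
        push @pages, { url => $url, html => $html };
        for my $link (extract_links($html)) {
            push @queue, $link unless $seen{$link}++;   # enqueue each URL once
        }
    }
    return @pages;
}
```

A spider that visits real sites should also respect robots.txt and pause between requests; the LWP::RobotUA class in libwww-perl is one way to get that behavior.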
Your module should attempt to filter out non-sentences (very short
fragments, captions, etc.) and display only complete sentences, but of
course this will be imperfect.
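One simple way to approach this filtering is a handful of surface heuristics, sketched below. The function name and every threshold here are illustrative assumptions rather than requirements, and the checks will both reject some real sentences and accept some junk.

```perl
use strict;
use warnings;

# Heuristic check: does $s look like a complete sentence?
sub looks_like_sentence {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;                     # trim surrounding whitespace
    my @tokens = split ' ', $s;
    return 0 if @tokens < 5;                  # very short fragment or caption
    return 0 unless $s =~ /^["'(]?[A-Z]/;     # starts with a capital letter
    return 0 unless $s =~ /[.!?]["')]?$/;     # ends with terminal punctuation
    return 0 unless $s =~ /[a-z]/;            # not an all-caps banner
    return 1;
}
```

Applied as grep { looks_like_sentence($_) } @candidates, this keeps only the candidates that pass all four checks.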
In addition to your module, please provide an interface program that
"shows off" the functionality of your module. Since we will be collecting
large amounts of text, it will make sense to allow users to write the data
collected to a file that they name. I will demo such an interface in class
so that you have an idea of what this looks like.
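For the file output, a small helper along these lines may be enough. This is only a sketch: the name save_sentences is hypothetical, and it assumes one sentence per line, UTF-8 output, and append mode so that repeated runs keep adding to the same file.

```perl
use strict;
use warnings;

# Append sentences to the file the user named, one per line.
# Returns the number of sentences written.
sub save_sentences {
    my ($path, @sentences) = @_;
    open my $out, '>>:encoding(UTF-8)', $path
        or die "Cannot open $path for writing: $!";
    print {$out} "$_\n" for @sentences;
    close $out or die "Cannot close $path: $!";
    return scalar @sentences;
}
```

An interface program could take the file name from the command line or a prompt and call save_sentences after each batch of pages, so that a long run leaves usable output even if it is interrupted.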
Provide sufficient documentation (to be read via perldoc) such that I can
install your module and start to use it without much difficulty.
Submit your module and supporting files to the web drop as a single tar
file that is named for your team. Things should be structured such that
I can install your module using the standard "3 step" CPAN install.
The three steps are as follows.
I should not have to do anything else to get your module installed. I
should be able to include it in a program via the "use" directive. Please
provide some example usages in your documentation, and of course your
test files will show how to use the code as well.
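Distributions built with ExtUtils::MakeMaker are conventionally installed by running perl Makefile.PL, then make, then make install, usually with make test before installing. A Makefile.PL for such a distribution might look roughly like the sketch below. The module name WebCorpus::Sentences, the file paths, and the author string are all hypothetical placeholders for your own.

```perl
# Makefile.PL (sketch; module name, paths, and author are placeholders)
use strict;
use warnings;
use ExtUtils::MakeMaker;

WriteMakefile(
    NAME          => 'WebCorpus::Sentences',
    VERSION_FROM  => 'lib/WebCorpus/Sentences.pm',   # reads $VERSION from the module
    ABSTRACT_FROM => 'lib/WebCorpus/Sentences.pm',   # reads the POD NAME section
    AUTHOR        => 'Team Name <team@example.edu>',
    PREREQ_PM     => {
        'LWP::UserAgent' => 0,                       # any installed version
    },
);
```

With a layout like this, test scripts placed in a t/ directory are run by make test, and after installation a program would load the module with use WebCorpus::Sentences;.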
Make sure your team name, individual names, date, etc. are included in your
source code. Your code may well end up being distributed via CPAN, so provide
appropriate info about copyrights, distribution, etc.
This is a team assignment. You are strongly advised to divide up the work
of the project into tasks that can be carried out in parallel by various
team members. All team members should be acknowledged in the comments,
etc. and all teammates will receive the same grade. Do not work with
other teams. Each team should operate independently of all the other teams.
Make your own decisions as a team and do not be influenced by the
decisions of other teams if you happen to hear of them accidentally. You
are free to work with your teammates as closely as is necessary.