RACING CLUB INFORMATION

MEMBERS

Ashraful Alam (alam0026@d.umn.edu)

Mahesh Joshi (joshi031@d.umn.edu)

Anagha Kulkarni (kulka020@d.umn.edu)

Jason Michelizzi (mich0212@d.umn.edu)

Sampanna Salunke (salu0005@d.umn.edu)

WEBSITE

http://seam.sourceforge.net

A demo version is available at

http://marimba.d.umn.edu/cgi-bin/RacingClub/welcome.cgi


INTRODUCTION TO AUTOMATED ESSAY GRADING

Nowadays, a number of competitive examinations include some form of writing test. For example, the GRE, GMAT, TOEFL, and SAT all include essay writing. Such tests primarily assess basic writing skills such as vocabulary, usage, organization, and discourse structure. A common technique used in these examinations is to require the student to write an essay in response to a given prompt. The essay prompt in most of these examinations is fairly specific. It generally requires the student to express his or her opinion regarding an issue, practice, or specific situation. The prompt also generally states some standpoint on the matter along with supporting examples. The student is expected to express his or her opinions with respect to this standpoint and to provide appropriate, relevant examples to support those opinions.

A good example of an essay prompt is:

    Automated essay scoring is unfair to students, since
    there are many different ways for a student to express
    ideas intelligently and coherently. A computer program
    can not be expected to anticipate all of these
    possibilities, and will therefore grade students more
    harshly than they deserve.
    Discuss whether you agree or disagree (partially or
    totally) with the view expressed providing reasons and
    examples.

(from http://www.d.umn.edu/~tpederse/Courses/CS8761-FALL04/Assign/assign4.html)

The expected structure of the essay in response to such a prompt is generally what is known as the ``Five Paragraph Essay''. The first paragraph should contain the ``thesis statement'', which clearly outlines the direction in which the essay is expected to develop; it may also contain introductory material. The next three paragraphs are devoted to supporting arguments, each in a separate paragraph, well illustrated with appropriate examples wherever relevant. Finally, the last paragraph summarizes the essay in a concise but distinct way without introducing any new ideas.

With an increasing number of students appearing for these competitive examinations every year, the grading of such essays has become a very time-consuming task. Efforts to automate the evaluation of such essays using computers date back several decades; however, due to technological limitations, early attempts were not very successful. With the many recent advances in corpus-based Natural Language Processing, automated essay grading has gained prominence. Automated essay grading was first introduced in a high-stakes test by the Educational Testing Service (ETS), which uses automated methods to grade essays for the GMAT.

For a computer to replace a human grader, a significant number of tasks have to be handled: grammar checking, spell checking, content checking, conceptual flow analysis, assessment of writing style, and so on. The computer is expected to consider all of these aspects of an essay and assign an overall score. Like any human grader, the technique should be capable of giving a ``holistic score'', which requires considering the meaning of the essay as a whole rather than simply adding up scores for individual components, since a simple sum may not give credit for the essence conveyed by the entire essay. This task quickly becomes difficult because of the complexity of natural language, and grading becomes even harder if metaphor and irony have to be handled or facts have to be checked. The other important expectation from an essay grader is useful and specific feedback, along with suggestions for improvement.

RELATED WORK

Researchers from Natural Language Processing, Machine Learning, Linguistics, Psychology, and many other fields have been working to train computers to perform tasks typically done by humans. Some of these research areas are essay grading, name disambiguation, text classification, text summarization, identifying the tone of a text, and creating headlines for a text. These activities were usually inspired by the need to reduce human work, which can also mean increased profit margins. Mechanizing tasks is also desirable because of the consistency that machines offer; for example, machines do not experience fatigue the way humans do. Thus there were, and are, various motivations behind these research activities.

Grading essays in particular has evolved over time. The large amount of essay grading that teachers faced motivated Ellis Page to develop a system for grading student essays in the 1960s [10]. The system used surface and indirect features such as essay length, average word length, and number of commas to grade essays. Though this method was the first step towards automated essay grading, it was not very well received. The reasons were evident: the system would not penalize essays that were irrelevant but long, and would credit essays that simply contained more commas, even if the commas were not in the right places. However, the more important issue raised against the system was that students could not get the kind of feedback that a human assessor would provide. This was significant because a human assessor gives not only an overall grade but also corrections and suggestions with specific pointers.

It was soon realized that, along with these surface features, direct features were needed: features that could assess the organization of the essay, its relevance to the prompt, the richness of its content, its syntactic variety, and so on, and then return feedback along with the overall score to the user. But extracting such features was a tremendous challenge and an almost impossible task until the 1990s, when advances in Natural Language Processing and related fields opened up new opportunities for researchers.

Researchers in various fields and at various organizations embarked upon this new path of extracting direct measures using various computational tools. ETS's E-rater [6] and Tom Landauer's Intelligent Essay Assessor (IEA) [1] are two of the well-known essay grading systems in this new breed of automated systems. ETS researchers attempt to find the above-mentioned direct features in student essays. The relevance of the student essay is computed using content vector analysis. E-rater also performs ``word salad detection''; a word salad is a list of words relevant to the essay prompt, tossed in with no attention to sentence formation or structure, which students may use to fool an automated grading system into assigning a higher grade. E-rater requires a set of sample essays to train on before it can make any judgements about student essays. IEA takes a different approach: instead of looking at the structure of a sentence, it attempts to find the relevance of a sentence using Latent Semantic Analysis. The emphasis in IEA is on the semantic meaning of the sentence, not its structure. This is different from E-rater, which looks at the syntax and the richness of the vocabulary used.

Another system was developed by Larkey [8] that used a combination of Bayesian classifiers and a k-nearest-neighbor classifier along with more traditional lexical analysis methods to analyze essays. A linear regression was used to combine the individual scores from each classifier into an overall score. The method correlated well with the judgments of human graders (0.86 <= r <= 0.88).

A system called Schema Extract Analyze and Report (SEAR) was developed by Christie [9]. It is intended to be more flexible than most systems because it analyzes both style and content. SEAR uses traditional methods to analyze style that need to be calibrated on a set of training data. Each method or metric is given a weight, and the weights are adjusted until adequate agreement between SEAR and human judgments is reached.

Only certain types of essays are candidates for content assessment; in particular, technical writing is the intended domain for this sort of analysis. The content scheme is a data structure that does not require training or calibration.

The automated essay grading systems developed to date have concentrated on syntactic correctness, word salad detection, relevance detection, essay organization, and the like. SEAM approaches the task of essay grading from a different direction. The only ``traditional'' features that are checked are relevance and gibberish. In addition to these, SEAM also attempts to identify statements of fact that students may add to the essay to emphasize a point, and an attempt is made to verify these facts for correctness.


OBJECTIVE

The objective of the SEAM project is to explore ideas towards implementing the following four components of an automated essay grading system.

GIBBERISH DETECTION

Gibberish detection (also known as Word Salad Detection, a phrase coined at ETS) attempts to identify statements in an essay which do not convey any meaning to the reader. Such sentences are typically written with the intention of fooling an automated grading system. A set of phrases related to the prompt, thrown together with no syntactic structure, can be identified as gibberish. For example, ``386 semaphore interrupt threading scheduler'' in an essay would qualify as gibberish. Another form of gibberish is a sentence with very poorly formed grammar, where the syntax is so mangled that it is difficult to comprehend the meaning the author is trying to convey. An example of such a sentence is ``I is under over home by''. This sentence is clearly gibberish as it fails to convey any meaningful message to its audience.

IRRELEVANCE MEASURE

Frequently, students may express facts or opinions in the essays which may be correct by themselves, but have little relevance to the essay prompt. For example, a student may write about how computers are slowly taking over human jobs in response to an essay prompt asking about the fairness of an automated essay grading system to the students.

While evaluating an essay for relevance, many sentences cannot be clearly delineated as relevant or irrelevant. Examples are partially relevant sentences such as ``I do not like automated essay grading because I do not like computers as I am really bad at typing.'' A sentence like that does convey the opinion of the student regarding automated essay grading and is certainly more relevant than something like ``The sky is blue and the grass is green.'' However, compared to the prompt, it does not indicate complete relevance.

One approach to relevance detection could be to have a set of good essays for each prompt that the system is going to present. However, in this case the system may not generalize well to essay responses which are not close to any of the ones in the database, but are still valid and perfectly relevant.

In our system, therefore, we intend to assign a similarity score to each sentence in the response with respect to the prompt as a whole. The approach we use is Latent Semantic Analysis (LSA). Using this we hope to generalize better on the relevance measure, since the LSA methodology at its core has the property of finding similarities that are deeper than the obvious ones. So, using our system we expect that, with respect to a prompt like the one presented in the introduction, the sentence ``I do not like automated essay grading because I do not like computers as I am really bad at typing.'' will get a higher similarity score than something like ``The sky is blue and the grass is green.'', even if these exact sentences or their look-alikes are not present at all in the corpus that we use to train LSA.

IDENTIFYING FACTS FROM OPINIONS

This part of the project attempts to discriminate facts from opinions, i.e. it tries to classify each sentence as either the student's opinion or a verifiable fact. For example, ``I think that death penalty is bad'' is an opinion but ``The death penalty is legal in most States'' is a fact that can be verified.

Not all facts, however, can be verified by an essay grader. For example, if a student writes, ``I ate cereal for breakfast this morning'', this may be a fact, but it cannot be verified. This system will only attempt to identify statements that are reasonably verifiable. Statements that are not classified as verifiable factual statements are treated the same way as opinions are treated by the system.

VERIFYING STATEMENTS OF FACT

Essays frequently contain examples or statements of fact made by students to emphasize some point. Once such statements have been identified by the Fact Identification module, the Fact Verification module attempts to check them for correctness. Since there is no database that contains every possible fact against which the module could compare a sentence, it looks to the Internet for information. The Internet is a vast and diverse source of information related to almost every possible topic. However, it is just as easy to get incorrect factual information from the Internet as it is to get correct information, and care must be taken to ensure that incorrect pieces of knowledge do not result in a wrong judgement. Short sentences which contain little factual information are usually easier to check, for example, 'Carbon monoxide is poisonous'. Statements containing statistical information like 'There are 53 states in the United States of America' are also relatively easy to check. Statements which contain more than one fact are more difficult cases: since the Internet is being used as the knowledge database, it is possible that the two or more facts in a sentence do not occur together very frequently. The number of pages the Fact Verification module can visit is extremely limited, and it becomes harder to make a judgement in such cases.


IMPLEMENTATION OF APPROACHES

GIBBERISH DETECTION

The gibberish detection modules are being developed and managed by Ashraful Alam. To detect gibberish, a hybrid approach has been taken. The first step of this approach utilizes the well-known Link Parser developed at Carnegie Mellon University by Davy Temperley, Daniel Sleator, and John Lafferty. The Link Parser is freely available for download at

http://www.link.cs.cmu.edu/link/ftp-site/link-grammar/link-4.1a/unix/

The Link Parser uses what is known as Link Grammar to differentiate correct and incorrect sentences. It is generally able to identify a grammatically correct sentence as correct and an incorrect sentence as wrong. The unmodified version of the parser also prints out a possible parse tree of the sentence. Since the parse tree is of little interest to us, we have modified the original code to exclude irrelevant output. The modified version of the Link Parser now outputs ``Found complete link'' for a correct sentence or ``No Complete Link found'' for an incorrect sentence.
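
As a rough sketch, the modified parser can be wrapped from Perl as shown below. The binary name ./parse and the assumption that it reads one sentence from standard input are made up for illustration only; the actual invocation inside SEAM may differ.

  use strict;
  use warnings;

  sub has_complete_link {
      my ($sentence) = @_;
      # pipe one sentence into the (hypothetical) modified parser binary
      my $output = qx(echo \Q$sentence\E | ./parse);
      return $output =~ /Found complete link/ ? 1 : 0;
  }

  print has_complete_link('The cat sat on the mat.')
      ? "syntactically acceptable\n"
      : "passed on to the gibberish filters\n";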

The Link Parser, however, is not 100% successful in detecting correct sentences. It fails to recognize some grammatically correct complex or compound sentences, and even some simple sentences. To remedy this failure, an in-house approach has been added to detect gibberish.

In this second step, the Brill Tagger [3] developed by Eric Brill at the University of Pennsylvania is used to part-of-speech (POS) tag the sentences that were flagged as incorrect by the Link Parser.

These POS-tagged sentences are then passed through a set of rules to detect patterns that indicate the presence of gibberish. Sentences that match a gibberish pattern are output as gibberish. Sentences that are flagged as potentially gibberish but are not caught by the ruleset are written to a file for further processing by a Naive Bayes classifier.

Before the Naive Bayes classifier is used, a few of the rules used by the Alpha version of SEAM were removed because they were very strict and could easily be fooled. The idea is that, once the classifier is trained, it should be able to detect such patterns in sentences and flag them as gibberish. One of the rules that has been removed is the one that required the presence of a noun-verb pair or a pronoun-verb pair in a sentence for it to be non-gibberish. Even though this rule worked fairly well in most cases, in some longer but grammatically correct sentences this pattern is not visible, and those sentences were wrongly detected as gibberish. Removing this particular rule solved the problem.

Currently the ruleset-based filter tries to find patterns of parts of speech that indicate the presence of gibberish. Here are the rules that are currently used in this filter:

1) A sentence not having a single noun or pronoun is gibberish.

A sentence like ``Go jump run walk'' will be labeled gibberish because it does not have any noun or pronoun in it. This is a controversial rule, because a sentence like ``Go home'' should be correct, yet our system finds it unacceptable. However, it is a good gibberish detection method because a lot of gibberish contains no noun or pronoun, and without those it is almost impossible to express any sense. If a sentence has no noun or pronoun but is still correct, it will be flagged as non-gibberish by the Link Parser and will not be processed further, as it is already flagged as non-gibberish.

A sentence that does not have a noun is most likely to be gibberish. If we analyze the sentences we use on a day-to-day basis, we see that most sentences that lack a surrounding context, i.e. topic starters, use a noun to express a complete thought. In written English it is very hard to express a complete sense without including a noun or a pronoun in a sentence. In most cases the subject of a sentence is formed by a noun or pronoun, and if the subject is not present, the sentence usually fails to express complete sense. Hence this rule was picked. During the testing phase this rule was very successful in identifying gibberish sentences. If a student tries to fool the system with a list of related adjectives, verbs, etc., the sentence is labeled gibberish, since it lacks an essential component for expressing complete sense.

2) A sentence not having any sort of verb is gibberish. This filter is flexible in the sense that it accepts gerunds as verbs.

So a sentence like ``He swimming'' will be accepted as non-gibberish; such a sentence does not completely fail to express sense, which is why this approach is taken. However, a sentence like

``Karl, Ron, Richard, John to movie'' will be labeled gibberish as it does not have a verb. The above sentence does express some sense, but different people can have different interpretations of it, so we consider that it fails to express any definite sense.

3) A sentence that has two or more consecutive conjunctions, prepositions, articles (determiners), ``to''s, or participle forms of verbs is a good indication of gibberish. Also, sentences having more than 3 back-to-back adjectives are most likely gibberish. These patterns are used to label the sentences in which they occur as gibberish.

[The numbered bullets above are the descriptions of the rules used. ]

The above rules were picked because the occurrence of these patterns is clearly indicative of the presence of gibberish. The rule-based gibberish detection is intended to catch only the most obvious gibberish; to detect more complicated cases, we intend to use a Naive Bayes classifier. Adding more rules would make the system overly harsh and more likely to fail.
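
The following Perl sketch illustrates how rules of this kind can be applied to a Brill-tagged sentence. It illustrates the rule types listed above rather than reproducing the exact SEAM ruleset; in particular, treating ``consecutive'' as two identical tags in a row is an approximation made here for illustration.

  use strict;
  use warnings;

  # Input is a Brill-tagged sentence such as "Go/VB jump/VB run/VB walk/VB".
  sub looks_like_gibberish {
      my ($tagged) = @_;
      my @tags = map { (split /\//)[-1] } split /\s+/, $tagged;
      my $seq  = join ' ', @tags;

      # Rule 1: no noun or pronoun at all
      return 1 unless grep { /^(NN|PRP)/ } @tags;

      # Rule 2: no verb of any form (gerunds such as VBG count as verbs)
      return 1 unless grep { /^VB/ } @tags;

      # Rule 3 (approximated as two identical tags in a row): consecutive
      # conjunctions, prepositions, determiners, TOs or participles ...
      return 1 if $seq =~ /\b(CC|IN|DT|TO|VBN|VBG) \1\b/;
      # ... or more than 3 adjectives back to back
      return 1 if $seq =~ /\b(?:JJ\S* ){3}JJ/;

      return 0;    # passed all rules; goes on to the Naive Bayes step
  }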

After the ruleset-based filter has written out the potentially gibberish sentences that it did not catch, the Naive Bayes classifier is employed on those sentences. For our purposes, we chose the Algorithm::NaiveBayes Perl package developed by Ken Williams, which is downloadable from CPAN (http://search.cpan.org).

The same module is used by the fact identification module.

The Naive Bayes classifier is trained with a corpus of non-gibberish and gibberish sentences so that it can classify new sentences as either gibberish or non-gibberish. The corpus of non-gibberish sentences is taken from the APW corpus, and the corpus of gibberish sentences was hand-picked by the developers. The hand-picked corpus contains sentence structures that we thought indicate the presence of gibberish; for example, the classifier is trained with sentences flagged as gibberish that do not have any verb, or that contain only nouns and pronouns. The hand-picked corpus of gibberish sentences is fairly small, but it has been quite successful in training the classifier to detect gibberish fairly consistently.

The Naive Bayes classifier is shipped pre-trained with the SEAM package as a file that can be loaded at runtime to classify the sentences entered by the user as an essay response to a prompt. For each sentence, the Naive Bayes classifier outputs a probability of it being gibberish and of it being non-gibberish. If the probability of being gibberish is higher for a sentence, that sentence is flagged as gibberish and reported to the user as such. Throughout this whole process, from the Link Parser to the Naive Bayes step, sentences that were not flagged as gibberish are written to a file that contains only non-gibberish sentences. These sentences are then further processed by the relevance measure, fact identification, and fact verification modules.
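
The sketch below shows one way such a classifier could be trained and saved for reloading at runtime with Algorithm::NaiveBayes. The file names, the word-count features, and the use of the module's save_state/restore_state methods are illustrative assumptions rather than the literal SEAM training script.

  use strict;
  use warnings;
  use Algorithm::NaiveBayes;

  my $nb = Algorithm::NaiveBayes->new;

  # one training sentence per line in each (hypothetical) file
  for my $set ([ 'gibberish.txt', 'gibberish' ],
               [ 'non_gibberish.txt', 'non-gibberish' ]) {
      my ($file, $label) = @$set;
      open my $fh, '<', $file or die "$file: $!";
      while (my $sentence = <$fh>) {
          my %features;
          $features{lc $_}++ for split /\s+/, $sentence;
          $nb->add_instance(attributes => \%features, label => $label);
      }
      close $fh;
  }

  $nb->train;
  $nb->save_state('gibberish_classifier.dat');
  # At runtime the model can be reloaded with
  # Algorithm::NaiveBayes->restore_state('gibberish_classifier.dat')
  # and predict() called on each remaining sentence.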

The corpus of gibberish is only 50 lines of text and it is as follows:

I/PRP is/VBZ under/IN over/IN store/NN down/IN the/DT house/NN Cow/NN goat/NN pig/NN cat/NN lion/NN Civil/NNP War/NNP confederate/NNP hero/NN Northern/NNP Army/NNP is/VBZ are./VB BMW/NNP Mercedes/NNP Chrysler/NNP Acura/NNP Lexus/NNP Infinity/NNP car./NN Go/VB shouting/VBG are/VBP eat/VB sleeping/VBG running/VBG jumping./VBG Northerners/NNS Civil/NNP War/NNP won/VBD southerner/NN hero/NN fighting/VBG eat/VB The/DT read/NN riding/VBG hood/NN eat/VB go/NN swim/VB late/JJ day/NN home/NN I/PRP is/VBZ go/VB cow/NN eat/VB speak/VB Cow/NNP speak/VB me/PRP to./TO He/PRP me/PRP I/NN you/PRP hero/NN coward/NN lazy/JJ Running/VBG walking/VBG talking/VBG swimming/VBG In/IN over/IN under/IN beside/IN table./NN He/PRP boy/NN girl/NN student/NN child./NN The/DT dog/NN cat/NN cow/NN chicken/NN fox./NN the/DT at/IN car/NN house/NN me./PRP He/PRP go/VBP no/DT home/NN school/NN come/VB too/JJ School/NNP good/JJ bad/JJ no/DT don't/VB see./VB Sky/NNP blue/JJ green/JJ run/NN fly./VB Bird/NNP run/VB over/IN is/NN sweet/JJ home./NN The/DT cow/NN run/NN home/NN drive/NN sports/NNS S/NNP car/NN always/CD Cow/NN goat/NN car/NN home/NN pig/NN horse./NN Cow/NNP good/JJ milk/NN Frog/NNP fast/JJ animal/NN Go/VB run/VB jump/VB sleep/VB eat/VB over/IN under/IN beside/IN before/IN beneath/IN into./IN beneath/IN into./IN Good/JJ nice./JJ Toshiba/NNP Satellite./NNP I/PRP is/VBZ under/IN over/IN home./NN I/PRP is/VB over/IN John/PRP is/VB under/IN Cow/NN goat/NN car/NN house/NN egg/NN meat/NN Duluth/NNP Computer/NN intel/NN amd/NN P/NN pc/NN harddrive/NN disc/NN processor./NN Car/NNP BMW/NNP Toyota/NNP corolla/NN camry/NNP X5/NNP Escalade/NNP CTS/NNP Yukon./NNP He/PRP she/PRP I/PRP we/PRP us/PRP they./PRP We/PRP John/NNP Sid/NNP Scott/NNP Reynold/NNP Ted/NNP Rick/NNP they/PRP us/PRP I./PRP He/PRP goat/NN He/PRP goat/NN He/PRP goat./NN He/PRP He/PRP He/PRP He/PRP He/PRP He/PRP He./PRP He/PRP go/VB jump/VB eat/VB out/IN home/NN John/NN go/VB sweet/JJ climb/VB river/NN snow/NN mount/VB is/VB sweet/JJ home/NN go/VB eat/VB beneath/IN River/NN flow/VB under/IN slow/RP Duluth/NN world/NN acting/VBG sleepy/JJ. Tree/NNP cold/JJ flow/NN dead/JJ white/JJ River/NNP slow/JJ run/NN flow/NN winter/NN cold/JJ I/PRP JOHN/NN I/PRP teacher/NN I/PRP teacher/NN John/NNP I/PRP Scott/NNP Car/NN he/PRP house/NN No/DT job/NN for/IN tearcher/NN waste/NN time/NN

This is a fairly small corpus used to train the Naive Bayes classifier to detect gibberish. The corpus of non-gibberish sentences used to train the classifier was also relatively small. Experiments were done with larger and smaller corpora; both have the disadvantage of overtraining or undertraining the classifier. The corpus of non-gibberish is a compilation of GRE AWE samples from http://www.wayabroad.com. Part of the corpus is as follows:

Do/VBP you/PRP agree/VBP or/CC disagree?/NN Use/VB specific/JJ reasons/NNS and/CC examples/NNS to/TO develop/VB your/PRP$ essay./CD ANSWER:/NNP This/DT is/VBZ very/RB discussable/JJ theme./JJ There/EX is/VBZ pro/FW and/CC con/JJ for/IN children/NNS being/VBG raised/VBN in/IN the/DT countryside/NN or/CC in/IN a/DT big/JJ city./JJ If/IN we/PRP would/MD look/VB from/IN history/NN point/NN of/IN view/NN we/PRP would/MD see/VB that/IN our/PRP$ community/NN had/VBD migrations/NNS for/IN countryside/NN to/TO big/JJ cities./JJ Years/NNS ago/RB people/NNS were/VBD living/VBG mostly/RB in/IN a/DT countryside/NN working/VBG in/IN fields/NNS with/IN cattle/NNS while/IN small/JJ number/NN of/IN people/NNS was/VBD start/VB living/VBG and/CC developing/VBG sites/NNS that/WDT today/NN become/VB big/JJ cities./JJ However/RB are/VBP those/DT sites/NNS developing/VBG and/CC start/VB having/VBG a/DT life/NN on/IN a/DT much/RB higher/JJR level./CD Also/RB jobs/NNS which/WDT they/PRP were/VBD doing/VBG some/DT were/VBD more/RBR sophisticated/JJ than/IN jobs/NNS which/WDT were/VBD available/JJ in/IN countryside/NN site./JJ People/NNS form/VBP countryside/NN realized/VBD that/IN and/CC the/DT started/VBN a/DT migrations/NNS into/IN big/JJ cities./JJ It/PRP was/VBD not/RB just/RB a/DT sort/NN of/IN jobs/NNS which/WDT you/PRP can/MD get./VB Living/NNP in/IN a/DT city/NN would/MD allowed/VB very/RB fast/RB finding/VBG a/DT job/NN than/IN in/IN countryside./JJ People/NNS on/IN countryside/NN are/VBP all/PDT the/DT time/NN in/IN somebody/NN else/JJ business/NN rather/RB than/IN theirs./JJ City/NN people/NNS mind/NN own/JJ business./CD Also/RB in/IN a/DT big/JJ city/NN culture/NN is/VBZ on/IN much/RB higher/JJR level/NN and/CC being/VBG around/IN cultural/JJ happenings/NNS forces/NNS you/PRP to/TO behave./CD With/IN countryside/NN is/VBZ not/RB that/DT case./CD For/IN rising/VBG a/DT kids/NNS in/IN countryside/NN is/VBZ until/IN some/DT age/NN good./JJ They/PRP can/MD have/VB and/CC make/VB a/DT connection/NN with/IN nature/NN and/CC keep/VB her/PRP$ when/WRB they/PRP move/VBP in/IN a/DT big/JJ city./JJ In/IN countryside/NN before/IN parents/NNS would/MD not/RB have/VB to/TO worry/VB about/IN criminals/NNS but/CC now/JJ days/NNS is/VBZ not/RB a/DT case/NN unfortunately./JJ Also/RB parents/NNS would/MD be/VB a/DT worried/VBN where/WRB they/PRP will/MD send/VB kids/NNS on/IN college/NN if/IN they/PRP were/VBD living/VBG in/IN countryside./JJ Today/NN we/PRP have/VBP very/RB well/RB a/DT universities/NNS and/CC colleges/NNS also/RB available/JJ in/IN a/DT countryside./JJ At/IN the/DT end/NN I/PRP will/MD give/VB my/PRP$ personal/JJ thought?/NN I/PRP would/MD like/VB to/TO live/VB in/IN countryside/NN which/WDT is/VBZ not/RB so/RB faraway/JJ from/IN a/DT big/JJ city./JJ Which/WDT I/PRP think/VBP now/RB day?s/VBZ people/NNS looking/VBG in/IN something/NN like/IN that/DT solution/NN or/CC I/PRP can/MD say/VB they/PRP start/VB going/VBG back/RB from/IN big/JJ cities/NNS and/CC living/VBG again/RB in/IN countryside./JJ Combination/NNS of/IN both/DT would/MD be/VB the/DT best/JJS and/CC for/IN kids/NNS and/CC for/IN parents./JJ QUESTION:/NNS In/IN general/JJ people/NNS are/VBP living/VBG longer/RB now./CD Discuss/NNP the/DT causes/NNS of/IN this/DT phenomenon./CD Use/VB specific/JJ reasons/NNS and/CC details/NNS to/TO develop/VB your/PRP$ essay./CD ANSWER:/NNP To/TO explain/VB the/DT fact/NN that/IN people/NNS are/VBP living/VBG longer/RB now/RB I/PRP will/MD use/VB 3/CD hypothesis./JJ The/DT first/JJ reason/NN that/WDT can/MD 
explained/VBD that/WDT is/VBZ the/DT lifestyle/NN of/IN people/NNS that/WDT is/VBZ less/RBR difficult/JJ now./CD An/DT another/DT reason/NN could/MD be/VB the/DT variety/NN and/CC the/DT quality/NN of/IN the/DT food/NN thats/NNS better/JJR today/NN and/CC a/DT third/JJ reason/NN could/MD be/VB the/DT improvement/NN of/IN the/DT medicine/NN science./CD Since/IN 100/CD years/NNS lifestyle/NN has/VBZ become/VBN easier/JJR for/IN people./CD Take/VB for/IN example/NN the/DT laundry./JJ Women/NNP took/VBD twelve/CD hours/NNS to/TO clean/VB clothes/NNS of/IN a/DT small/JJ family./CD Today/NN women/NNS or/CC men/NNS just/RB have/VBP to/TO put/VB clothes/NNS in/IN the/DT laudry/NN machine/NN and/CC it/PRP will/MD be/VB done/VBN in/IN one/CD hour./CD Today/NN we/PRP eat/VBP fruit/NN and/CC vegetables/NNS all/DT year/NN long./CD We/PRP eat/VBP fresh/JJ salmon/NN or/CC other/JJ fish/NN each/DT week./CD If/IN we/PRP want/VBP to/TO serve/VB a/DT big/JJ rostbeef/NN for/IN diner/NN we/PRP just/RB have/VBP to/TO go/VB to/TO the/DT grocery/NN store/NN and/CC buy/VB one./CD It/PRP was/VBD impossible/JJ to/TO eat/VB an/DT orange/JJ each/DT week/NN or/CC to/TO use/VB citrus/JJ in/IN a/DT receipt/NN in/IN january/JJ because/IN they/PRP was/VBD not/RB easily/RB available/JJ in/IN Canada/NNP in/IN winter./JJ Medicine/NNP was/VBD incredibly/RB improved/VBN in/IN comparaison/NN with/IN what/WP it/PRP was/VBD 100/CD years/NNS ago./RB Today/NN doctors/NNS have/VBP new/JJ drugs/NNS like/IN penicilin/NN to/TO cure/VB ill-people./CD They/PRP can/MD easily/RB cure/VB pneumony/NN or/CC other/JJ infections./JJ Futhermore/NN to/TO cure/VB very/RB sick/JJ people/NNS doctors/NNS can/MD for/IN example/NN transplante/VBG new/JJ hearts/NNS or/CC make/VB complicate/VB surgeons./CD In/IN conclusion/NN I/PRP think/VBP that/IN people/NNS are/VBP living/VBG longer/RB now/RB because/IN the/DT lifestyle/NN in/IN general/NN is/VBZ easier/JJR they/PRP have/VBP access/NN to/TO a/DT variety/NN of/IN foodstuff/NN and/CC doctors/NNS can/MD cure/VB illness/NN that/WDT killed/VBD people/NNS in/IN the/DT past./NN QUESTION:/NNP In/IN general/JJ people/NNS are/VBP living/VBG longer/RB now./CD Discuss/NNP the/DT causes/NNS of/IN this/DT phenomenon./CD Use/VB specific/JJ reasons/NNS and/CC details/NNS to/TO develop/VB your/PRP$ essay./CD ANSWER:/NNP Healthy/JJ eating/NN habits/NNS daily/JJ exersice/NN and/CC sleeping/VBG at/IN least/JJS eight/CD hours/NNS at/IN night/NN are/VBP some/DT of/IN the/DT reasons/NNS why/WRB people/NNS live/VBP longer/RB now.Consistency/VBG and/CC discipline/NN are/VBP needed/VBN to/TO achieve/VB all/DT these./JJ I/PRP believe/VBP that/IN people/NNS should/MD adopt/VB healthy/JJ eating/NN habits/NNS in/IN order/NN to/TO maintain/VB body/NN weight/NN and/CC a/DT healthy/JJ body.For/JJ instance/NN fiber/NN should/MD be/VB included/VBN in/IN every/DT meal./CD Fiber/NNP can/MD be/VB found/VBN in/IN fruits/NNS and/CC green/JJ vegetables.It/NN is/VBZ recommended/VBN to/TO eat/VB alot/RB of/IN fiber/NN in/IN order/NN to/TO have/VB lean/JJ muscles/NNS and/CC less/RBR fat.In/JJ addition/NN carbohydrates/VBZ and/CC sugars/NNS should/MD be/VB consumed/VBN in/IN a/DT less/JJR quantity/NN because/IN it/PRP can/MD be/VB converted/VBN into/IN body/NN fat.For/JJ example/NN if/IN people/NNS eat/VBP spaghetti/NNS for/IN lunch/NN they/PRP should/MD not/RB eat/VB it/PRP again/RB for/IN dinner.And/VBD all/DT sweets/NNS should/MD be/VB avoided/VBN from/IN a/DT healthy/JJ diet/NN in/IN order/NN to/TO reduce/VB calories/NNS which/WDT later/RB 
become/VB stored/VBN as/IN body/NN fat./JJ However/RB a/DT good/JJ daily/JJ exercise/NN routine/NN is/VBZ necessary/JJ to/TO achieve/VB longeviness.Running/CD jogging/NN or/CC walking/VBG for/IN a/DT lapse/NN time/NN of/IN 1hr./CD are/VBP good/JJ examples/NNS of/IN a/DT basic/JJ exercise/NN routine./JJ I/PRP myself/PRP prefer/VBP to/TO make/VB my/PRP$ daily/JJ exercise/NN routine/JJ at/IN a/DT local/JJ gym./CD For/IN instance/NN my/PRP$ routine/NN consists/VBZ of/IN weightlifting/VBG for/IN 1/CD hour/NN added/VBD to/TO a/DT 30/CD minute/NN session/NN of/IN aerobics./CD Other/JJ people/NNS prefer/VBP to/TO swim/VB and/CC play/VB tennis/NN or/CC squash/NN as/IN part/NN of/IN their/PRP$ daily/JJ exercises.On/CD the/DT other/JJ hand/NN people/NNS that/WDT dont/VBP have/VBP time/NN for/IN an/DT exercise/NN routine/JJ opt/VB for/IN been/VBN active/JJ during/IN the/DT day.For/JJ instance/NN walking/VBG to/TO places/NNS intead/VBP of/IN using/VBG the/DT elevator/NN can/MD be/VB a/DT good/JJ option./JJ Another/DT reason/NN that/WDT adds/VBZ up/IN to/TO a/DT long/JJ life/NN is/VBZ sleep./CD It/PRP has/VBZ been/VBN demonstrated/VBN that/IN adults/NNS need/VBP at/IN least/JJS 8/CD hours/NNS of/IN sleep/NN to/TO maintain/VB a/DT good/JJ health/NN thus/RB a/DT long/JJ life./CD Sleep/NN is/VBZ a/DT must/MD because/IN the/DT body/NN needs/VBZ to/TO rest/VB and/CC recover/VB from/IN the/DT daily/JJ routine.In/JJ fact/NN people/NNS that/WDT sleep/VBP less/JJR hours/NNS have/VBP a/DT tendency/NN to/TO become/VB ill/JJ and/CC tired/VBN most/RBS of/IN the/DT time/NN compared/VBN to/TO those/DT who/WP get/VBP to/TO sleep/VB 8/CD hours./CD Therefore/RB having/VBG good/JJ eating/NN habits/NNS a/DT daily/JJ exercise/NN routine/JJ and/CC a/DT good/JJ nights/NNS sleep/VBP are/VBP the/DT ingredients/NNS that/WDT lead/VBP to/TO a/DT long/JJ life.In/CD order/NN to/TO achieve/VB these/DT goals/NNS successfully/RB one/PRP must/MD be/VB consistent/JJ and/CC possess/VBP discipline./JJ QUESTION:/NNS In/IN general/JJ people/NNS are/VBP living/VBG longer/RB now./CD Discuss/NNP the/DT causes/NNS of/IN this/DT phenomenon./CD Use/VB specific/JJ reasons/NNS and/CC details/NNS to/TO develop/VB your/PRP$ essay./CD ANSWER:/NNP A/DT greater/JJR number/NN of/IN people/NNS are/VBP now/RB hitting/VBG the/DT eighty-year/JJ mark/NN than/IN ten/CD years/NNS ago./RB In/IN fact/NN the/DT life/NN expectancy/NN of/IN the/DT average/JJ human/NN has/VBZ gone/VBN up/IN considerably/RB and/CC is/VBZ rising/VBG still./JJ This/DT phenomenon/NN is/VBZ the/DT result/NN of/IN several/JJ reasons./CD For/IN one/CD continuing/VBG scientific/JJ and/CC medical/JJ innovation/NN ensure/VB that/IN more/JJR people/NNS receive/VBP the/DT treatment/NN they/PRP require./CD As/IN our/PRP$ knowledge/NN grows/VBZ regarding/VBG various/JJ diseases/NNS we/PRP become/VBP better/RBR equipped/VBN to/TO tackle/VB them./JJ Consequently/RB we/PRP have/VBP managed/VBN to/TO eradicate/VB some/DT diseases/NNS on/IN a/DT global/JJ scale/NN while/IN controlling/VBG the/DT other/JJ diseases/NNS so/RB that/IN the/DT rate/NN of/IN mortality/NN does/VBZ not/RB reach/VB an/DT alarming/JJ height/NN during/IN the/DT outbreak/NN of/IN a/DT disease./JJ The/DT plagues/VBZ of/IN yore/PRP$ as/RB well/RB as/IN the/DT more/RBR recent/JJ plague/NN outbreaks/NNS are/VBP becoming/VBG few/JJ and/CC far/RB in/IN between./JJ Such/JJ control/NN of/IN diseases/NNS means/VBZ that/IN the/DT general/JJ life/NN expectancy/NN has/VBZ gone/VBN up./RB Diseases/NNPS such/JJ as/IN cancer/NN which/WDT used/VBD 
to/TO result/VB almost/RB inevitably/RB in/IN death/NN are/VBP now/RB curable/RB provided/VBN they/PRP are/VBP diagnosed/VBN at/IN a/DT certain/JJ stage./JJ Diseases/NNPS such/JJ as/IN tuberculosis/NN and/CC cholera/NN now/JJ cause/NN fewer/JJR deaths/NNS than/IN they/PRP used/VBD to/TO only/RB a/DT few/JJ decades/NNS ago./RB Another/DT reason/NN for/IN the/DT greater/JJR life/NN expectancy/NN is/VBZ the/DT general/JJ betterment/NN of/IN the/DT quality/NN of/IN life./CD What/WP we/PRP call/VBP the/DT global/JJ village/NN is/VBZ fast/RB becoming/VBG a/DT city./JJ And/CC in/IN this/DT city/NN more/JJR and/CC more/JJR people/NNS are/VBP being/VBG provided/VBN a/DT better/JJR level/NN of/IN hygiene/NN than/IN ever/RB before./CD A/DT better/JJR and/CC improving/VBG system/NN of/IN communication/NN ensures/VBZ that/IN the/DT latest/JJS medical/JJ discoveries/NNS in/IN the/DT United/NNP States/NNPS and/CC Europe/NNP are/VBP known/VBN all/DT over/IN the/DT world/NN in/IN a/DT space/NN of/IN a/DT few/JJ days./NNS Therefore/RB more/JJR and/CC more/JJR peeople/NN have/VBP access/NN to/TO better/JJR health/NN care./NN Even/JJ people/NNS living/VBG in/IN relatively/RB remote/JJ areas/NNS have/VBP access/NN to/TO some/DT kind/NN of/IN medical/JJ facilities./JJ Though/IN these/DT facilities/NNS may/MD be/VB incapable/NN of/IN handling/VBG a/DT crisis/NN they/PRP may/MD help/VB prevent/VB death/NN in/IN cases/NNS that/WDT require/VBP antibiotics/NNS and/CC antivenin/NN thereby/RB preventing/VBG death/NN by/IN infection/NN or/CC poison./CD The/DT increase/NN in/IN awareness/NN also/RB means/VBZ that/IN people/NNS in/IN general/JJ are/VBP becoming/VBG more/RBR aware/JJ of/IN the/DT risks/NNS of/IN various/JJ diseases./JJ For/IN instance/NN more/JJR people/NNS now/RB than/IN two/CD decades/NNS ago/RB are/VBP aware/JJ of/IN the/DT scourge/NN of/IN cholesterol/NN and/CC the/DT havoc/NN it/PRP wreaks/VBZ on/IN an/DT individuals/NNS circulatory/JJ system./JJ Similarly/RB people/NNS are/VBP becoming/VBG aware/JJ that/DT prevention/NN is/VBZ after/IN all/DT better/JJR than/IN cure/NN and/CC are/VBP taking/VBG the/DT appropriate/JJ steps/NNS to/TO prevent/VB infection/NN from/IN diseases./JJ In/IN general/JJ though/IN the/DT increased/VBN life/NN expectancy/NN owes/VBZ much/JJ to/TO the/DT revolution/NN in/IN communication./JJ It/PRP may/MD be/VB mentioned/VBN that/IN even/RB two/CD hundred/CD years/NNS ago/RB inventions/NNS and/CC discoveries/NNS were/VBD being/VBG made./JJ However/RB they/PRP did/VBD few/JJ people/NNS any/DT good./JJ It/PRP simply/RB took/VBD too/RB long/JJ to/TO disseminate/VB the/DT information./JJ The/DT general/JJ level/NN of/IN awareness/NN regarding/VBG health/NN was/VBD also/RB low./CD However/RB it/PRP has/VBZ been/VBN noticed/VBN that/IN since/IN the/DT inception/NN of/IN communication/NN through/IN first/JJ print/NN and/CC then/JJ radio/NN and/CC television/NN the/DT level/NN of/IN awareness/NN regarding/VBG health/NN has/VBZ generally/RB risen./CD To/TO illustrate/VB print/NN facilitated/VBN the/DT spread/NN of/IN knowledge/NN through/IN books./CD Radio/NNP helped/VBD bring/VB that/DT knowledge/NN to/TO many/VB people./CD Television/NNP helped/VBD to/TO further/VB this/DT knowledge/NN and/CC disseminate/VB is/VBZ amongst/IN an/DT even/RB greater/JJR number/NN of/IN people./CD And/CC finally/RB the/DT Internet/NNP has/VBZ removed/VBN the/DT final/JJ barriers/NNS between/IN intercultural/JJ and/CC interracial/JJ communication./JJ In/IN the/DT coming/VBG years/NNS we/PRP may/MD hope/VB 
to/TO see/VB an/DT even/RB greater/JJR increase/NN in/IN life/NN expectancy/NN even/RB as/IN communication/NN techniques/NNS continue/VBP to/TO improve./CD QUESTION:/NNS In/IN general/JJ people/NNS are/VBP living/VBG longer/RB now./CD Discuss/NNP the/DT causes/NNS of/IN this/DT phenomenon./CD Use/VB specific/JJ reasons/NNS and/CC details/NNS to/TO develop/VB your/PRP$ essay./CD ANSWER:/NNP According/VBG to/TO the/DT latest/JJS statistics/NNS the/DT average/JJ life/NN span/NN of/IN human/JJ beings/NNS is/VBZ about/IN 75-79/CD years/NNS old./JJ Comparing/VBG to/TO that/IN ten/CD years/NNS ago/RB it/PRP has/VBZ been/VBN increased/VBN by/IN 10/CD ./. Advancement/NNP in/IN technology/NN better/JJR lifestyle/NN and/CC nutritional/JJ diets/NNS are/VBP the/DT main/JJ reasons/NNS for/IN prolonged/VBN lives./CD Advancement/NNP in/IN technology/NN helps/VBZ to/TO bring/VB about/IN new/JJ methods/NNS of/IN curing/VBG fatal/JJ diseases./JJ Cancer/NNP used/VBD to/TO have/VB no/DT cure/NN in/IN the/DT last/JJ decade/NN but/CC recently/RB new/JJ methods/NNS such/JJ as/IN X-ray/NN treatment/NN and/CC various/JJ types/NNS of/IN injections/NNS are/VBP introduced/VBN which/WDT can/MD greatly/RB extend/VB the/DT patients/NNS lives./CD Advancement/NNP in/IN technology/NN also/RB improves/VBZ the/DT quality/NN of/IN medical/JJ instuments./JJ This/DT lowers/VBZ the/DT risk/NN and/CC minimizes/VBZ the/DT possible/JJ side/NN effects/NNS of/IN various/JJ operations./JJ

These are the last 100 lines of the non-gibberish corpus. The corpus is included in the software package under the SEAM/utils directory after you have unpacked the tarball.

IRRELEVANCE MEASURE

The irrelevance measure and related modules are being developed and managed by Anagha and Mahesh. The approach is to use Latent Semantic Analysis (LSA) for finding similarity values between the provided essay prompt and each sentence in the essay response.

The following are its sub-modules:

Boundary detection

This module detects sentence boundaries in a given text file and returns the identified sentences. Sentence detection handles the basic cases of ., !, and ?; additionally, it handles quotes, URLs, and some abbreviations. The critical point is that it expects each sentence to begin with a capital letter. Also, it expects a space between the end of the previous sentence and the beginning of the next sentence. This same module is used throughout the system to perform sentence boundary detection.
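
A much simplified illustration of this style of boundary detection is shown below; it is not the actual SEAM module, which additionally handles quotes, URLs, and abbreviations.

  use strict;
  use warnings;

  sub split_sentences {
      my ($text) = @_;
      # break after ., ! or ? when followed by whitespace and a capital letter
      return split /(?<=[.!?])\s+(?=[A-Z])/, $text;
  }

  my @sentences = split_sentences(
      'I like essays. Do you? Yes! Grading them is hard.');
  print "$_\n" for @sentences;    # prints four sentences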

Co-occurrence matrix creation

This module takes as input multiple sentences and creates a word-by-sentence co-occurrence matrix. Smoothing of counts is done using the approach described by Landauer et al. in ``An Introduction to Latent Semantic Analysis'' (http://lsa.colorado.edu/papers/dp1.LSAintro.pdf). Stop words are excluded from the matrix since they would not add any significant semantic meaning and would in fact tend to skew the overall results. Additionally, words that occur in just one sentence or that appear with a frequency below a lower cut-off are also removed for the same reason.
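
The following sketch shows the basic bookkeeping involved (stop list, frequency cut-off, and removal of words that occur in only one sentence). The toy sentences are illustrative only, and the log-entropy smoothing described by Landauer et al. is omitted for brevity; raw counts are printed instead.

  use strict;
  use warnings;

  my @sentences = ('the cat sat on the mat', 'the cat ate', 'dogs bark');
  my %stop      = map { $_ => 1 } qw(the on a an of);
  my $min_freq  = 2;                     # lower cut-off for word frequency

  my (%count, %occurs_in);               # overall counts and per-sentence counts
  for my $i (0 .. $#sentences) {
      for my $w (split /\s+/, lc $sentences[$i]) {
          next if $stop{$w};
          $count{$w}++;
          $occurs_in{$w}{$i}++;
      }
  }

  # keep words that are frequent enough and occur in more than one sentence
  my @rows = grep { $count{$_} >= $min_freq && (keys %{ $occurs_in{$_} }) > 1 }
             sort keys %count;

  for my $w (@rows) {                    # print the matrix, one word per row
      print join(' ', $w, map { $occurs_in{$w}{$_} || 0 } 0 .. $#sentences), "\n";
  }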

Training on corpus

To represent world knowledge, the LSA approach makes use of a plain text corpus, which is converted into matrix form. This module initially calls the co-occurrence matrix creation module. The next step is to reduce the dimensionality of this matrix using the Singular Value Decomposition (SVD) of the co-occurrence matrix. We have used the SVDPACKC package for the SVD step. Helper modules for SVDPACKC - mat2harbo.pl and svdpackout.pl - from the SenseClusters package (http://senseclusters.sourceforge.net) have been redistributed with our package. For training, we have used the APW newswire data made available to us by Dr. Ted Pedersen. A file containing the matrix (lib/SEAM/apw_corpus_matrix) created from this data is provided with the distribution. It is a comparatively small corpus, with 9038 words represented in 27 dimensions (after SVD is applied), so the system needs to be trained on a larger corpus. The issue we faced was that SVDPACKC could not handle very large matrices, so we have provided a feature in the Relevance module to accept multiple matrix files as input.

This module is also used by the Fact Checking component at runtime to create a corpus from the data obtained from the web.

Similarity module

Once the matrix representation of the corpus is ready, similarity measures are taken between each sentence in the essay and the prompt. Our approach is to represent the prompt as one single vector in a high-dimensional semantic space (the sum of the word vectors for all words in the prompt). The word vectors are created from the input co-occurrence matrix. Similarly, a vector is created for each identified sentence in the essay response. The similarity between two vectors is then measured as the cosine between them. The higher the similarity between the prompt vector and a sentence vector, the greater the relevance of that sentence to the essay prompt.
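
A sketch of this computation is given below. The %word_vec table standing in for the rows of the SVD-reduced matrix is an assumption made for illustration; the real module reads those rows from the corpus matrix file.

  use strict;
  use warnings;

  # %word_vec maps each word to its reduced-dimensional vector (an array ref).
  sub text_vector {
      my ($text, $word_vec) = @_;
      my @v;
      for my $w (split /\s+/, lc $text) {
          next unless exists $word_vec->{$w};
          my $row = $word_vec->{$w};
          @v = (0) x @$row unless @v;         # initialize on the first known word
          $v[$_] += $row->[$_] for 0 .. $#$row;
      }
      return \@v;
  }

  sub cosine {
      my ($x, $y) = @_;
      my ($dot, $nx, $ny) = (0, 0, 0);
      my $last = $#$x > $#$y ? $#$x : $#$y;
      for my $i (0 .. $last) {
          my $xi = $x->[$i] || 0;
          my $yi = $y->[$i] || 0;
          $dot += $xi * $yi;
          $nx  += $xi * $xi;
          $ny  += $yi * $yi;
      }
      return ($nx && $ny) ? $dot / sqrt($nx * $ny) : 0;
  }

  # relevance of one response sentence to the whole prompt:
  # my $sim = cosine(text_vector($prompt,   \%word_vec),
  #                  text_vector($sentence, \%word_vec));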

TraceLog module

This module helps maintain a common log file for the entire system - per user session. It is also used by the other components.

Utilities

The above modules are used by 2 utilities (Perl scripts) - createcorpusmatrix.pl and relevance.pl.

createcorpusmatrix.pl creates a co-occurrence matrix from any input file specified on the command line and writes it to the specified output file. It also takes a file containing a list of stop words to exclude from the co-occurrence matrix. This allows users to pre-create a co-occurrence matrix from their own training data. On the command line, the user can specify the lower cut-off for word frequency; any word with a frequency below this cut-off will be ignored.

relevance.pl minimally requires 3 input files: the co-occurrence matrix of the corpus created by createcorpusmatrix.pl, a file containing the essay prompt, and a file containing the actual essay response. It then prints out the similarity value of each sentence in the response with respect to the prompt, in descending order of similarity. An optional 4th command line parameter specifies whether verbose output or simply user-level feedback is required.

The relevance module is capable of accepting multiple corpus matrix files so as to average out the similarity scores received for each essay sentence based on individual files.

A log file parameter is accepted by both modules, where the developer level trace is logged.

FACT IDENTIFICATION

Factual statement identification is being implemented by Jason Michelizzi. The approach being used is based on a Naive Bayesian classifier. The classifier used is the Algorithm::NaiveBayes Perl module, which is available on CPAN (http://search.cpan.org).

Bayes' theorem is:

  P(X=x|Y=y) = P(Y=y|X=x)P(X=x) / P(Y=y)

In our case we have two random variables. One represents the ngrams in a sentence, and the other represents the classes to which the sentences can belong (i.e., fact or opinion). So for each sentence, we want to find

  P(fact | ngrams) and P(opinion | ngrams)

That is, we want to find out whether it's more likely that a sentence is a fact or an opinion.

                                      P(ngrams | c) * P(c)
  Optimal_category =    Argmax        --------------------
                    c in Categories       P(ngrams)

where the set Categories = {fact, opinion}.

We can disregard P(ngrams) because it will be constant over the different categories.

  Optimal_c =    Argmax      P(ngrams | c) * P(c)
              c in Categories

Then we go one step further and consider the probabilities of the individual ngrams in the sentence.

  Optimal_c =   Argmax    P(ngram1|c)*P(ngram2|c)* ... *P(ngramN|c)*P(c)
             c in Categories

This is the ``naive'' part of Naive Bayes: we assume that the probabilities of the different ngrams in the sentence are independent (which they generally are not); however, Naive Bayes classifiers tend to work quite well anyway. Some spam filters, for example, use Naive Bayes to categorize e-mails as spam or not-spam.

The classifier is trained on a corpus of text where each sentence has been manually tagged as either a verifiable fact (VF) or not a verifiable fact. The features used by the classifier are the words in the sentence and a few punctuation marks (such as ! and ?). The trained classifier is stored to disk using the YAML Perl module and is loaded when SEAM is run. The corpus used to train the classifier is distributed in the 'samples' directory.

When an essay is being evaluated, each sentence is processed and classified as either a verifiable fact or not a verifiable fact. The sentences classified by Naive Bayes as verifiable facts are then passed on to the fact verification sub-system.
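
A minimal, self-contained sketch of this classification step is shown below. The toy training sentences, the label names 'fact' and 'opinion', and the exact feature extraction are illustrative assumptions; the real classifier is trained on the tagged corpus in the 'samples' directory and stored with the YAML module as described above.

  use strict;
  use warnings;
  use Algorithm::NaiveBayes;

  sub features {
      my ($sentence) = @_;
      my %f;
      $f{lc $_}++ for $sentence =~ /[A-Za-z']+/g;   # the words of the sentence
      $f{'!'}++ if $sentence =~ /!/;                # a few punctuation marks
      $f{'?'}++ if $sentence =~ /\?/;
      return \%f;
  }

  my $nb = Algorithm::NaiveBayes->new;
  $nb->add_instance(attributes => features('The death penalty is legal in most states.'),
                    label      => 'fact');
  $nb->add_instance(attributes => features('I think that the death penalty is bad.'),
                    label      => 'opinion');
  $nb->train;

  my $scores = $nb->predict(attributes => features('Carbon monoxide is poisonous.'));
  my $label  = ($scores->{fact} || 0) >= ($scores->{opinion} || 0) ? 'fact' : 'opinion';
  print "classified as: $label\n";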

FACT VERIFICATION

Fact Verification is being managed by Sampanna. The module is developed based on the following hypothesis: Given a small corpus, built from the Internet using a specific search string, the words in the fact to be checked would appear in similar contexts in the corpus. This would result in the words being tagged as semantically similar by LSA, which looks at the context that words appear in to make a judgement about semantic similarity.

The current approach is using LSA and Google to validate facts. The input to the system is a text file containing the different statements identified as facts, one per line. For every fact that needs to be verified, a corpus related to the fact is created. The sentence is initially tagged using the Brill tagger, and using the nouns in the fact a search query is sent to Google in the ``and'' mode. Using LWP to visit the top 10 URLs returned by Google and HTML::Parser to strip HTML tags, the corpus is created. Binary files such as Adobe PDFs and Microsoft Word documents are ignored. The decision taken during the beta version to keep just the first 10 URLs was maintained after some testing. The testing involved checking the final decision of the Fact Verification module for changes, to determine how much difference increasing or decreasing the number of URLs would make. Another factor that affected the decision was the time taken to visit each of the URLs, fetch information and parse and prepare it for analysis. Each URL fetch and parse operation takes time, often changing with the vagaries of the Internet.

After building the corpus, the next step is to train LSA on it. This is done using the SEAM::TrainOnCorpus module, which creates the co-occurrence matrix and uses SVDPACKC to return an SVD-reduced matrix. The reduction factor used is 10. Similarity values between the different words in the fact are then computed; only words which are not stopwords are considered. The similarity values thus obtained are averaged to give a single value: the higher the value, the more likely the fact is to be true. Simple facts are relatively easy to confirm; for example, ``Sampanna Salunke is in Duluth'' gets tagged as correct by the system. However, as the complexity and length of the sentence increases, it becomes more difficult to make judgements about the fact. The cut-offs decided after experimentation are: any sentence that gets a score greater than 0.4 is tagged as a true fact, between 0.2 and 0.4 is tagged as something that cannot be verified by the system, and anything below 0.2 is tagged as a false fact.
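
The decision step can be sketched as follows. The $similarity code reference stands in for the LSA cosine computed from the corpus built for this fact, and is an assumption made for illustration.

  use strict;
  use warnings;

  sub judge_fact {
      # $similarity is a code reference: ($word1, $word2) -> LSA cosine value
      my ($similarity, @words) = @_;
      my ($total, $pairs) = (0, 0);
      for my $i (0 .. $#words - 1) {
          for my $j ($i + 1 .. $#words) {
              $total += $similarity->($words[$i], $words[$j]);
              $pairs++;
          }
      }
      return 'cannot verify' unless $pairs;
      my $avg = $total / $pairs;
      return $avg > 0.4  ? 'true fact'
           : $avg >= 0.2 ? 'cannot verify'
           :               'false fact';
  }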

While the method does yield promising results, it faces several difficulties. One is the very diversity of the Internet that we hope to exploit. For example, while checking a simple fact like ``Pune is in Antarctica'' (which is false - Pune is in India), the system was fooled by reports about scientists from Pune doing research in Antarctica. Even a query like 'IBM makes caramel pudding' resulted in a large number of hits, because a person at IBM, with an IBM email address, had submitted a recipe for caramel pudding to a food-oriented website. Another major difficulty the system faces is the way information is structured on the Internet. People frequently do not end their statements with a period, usually relying on HTML formatting to perform the task of separating one sentence from another. This module relies on good boundary detection, and the boundary detection mechanism, which is based on normal English writing style, sometimes fails to detect correct boundaries. This can result in inflated contexts and lead to incorrect results, since a lot of different ideas may end up in the same context.


OVERALL PROPOSED SOLUTION

GIBBERISH DETECTION

For the final release of SEAM, we hope to upgrade gibberish detection by using the N-gram Statistics Package to find the likelihood of words occurring together in a large corpus. Sentences containing collocations that are highly unlikely to occur most likely contain gibberish. This approach will be an addition to the currently existing system to further strengthen its capabilities. The rule-based approach used previously is good only for basic gibberish detection and tends to fail very easily, so we intend to move away from it as much as we can; however, the rule-based approach will still act as the second step of gibberish detection. We intend to continue using the Link Parser to detect syntactically incorrect sentences that are gibberish. This ensures that grammatically correct sentences are not wrongly flagged as gibberish. It is fairly hard to train a Naive Bayes classifier to handle every possible type of sentence, which the Link Parser can already do with great accuracy. Therefore we would rather not spend many resources trying to perform the same task, and instead rely on the Link Parser to do it.

This, however, has a drawback: the Link Parser does not flag sentences that are syntactically correct but nonetheless gibberish, so such sentences pass through the rest of the system unmarked. To detect syntactically correct but gibberish sentences, we intend to use the N-gram Statistics Package along with the Link Parser to flag sentences as potentially gibberish; these will then be processed by the Brill tagger, the ruleset, and the Naive Bayes classifier. By following this approach we will be able to at least consider syntactically correct gibberish sentences.

The Gibberish module calculates the score with the following formula:

Score = 6 - (number of Gibberish Sentences / Total number of sentences) * 6

This score is penalized further: if more than 30% of the sentences are gibberish, 1 is subtracted, and if more than 40% are gibberish, the score is reduced by 2. The score is also prevented from going negative.
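
A sketch of this computation follows. It assumes the two penalties are alternatives (2 points when gibberish exceeds 40%, otherwise 1 point when it exceeds 30%) and that the score is floored at zero.

  use strict;
  use warnings;

  sub gibberish_score {
      my ($gibberish, $total) = @_;
      return 0 unless $total;
      my $ratio = $gibberish / $total;
      my $score = 6 - $ratio * 6;
      if    ($ratio > 0.40) { $score -= 2 }   # heavier penalty (assumed to replace the 1-point one)
      elsif ($ratio > 0.30) { $score -= 1 }
      return $score < 0 ? 0 : $score;         # never negative
  }

  printf "score: %.2f\n", gibberish_score(2, 10);    # 20% gibberish -> 4.80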

IRRELEVANCE MEASURE

The relevance scores that we obtain using the LSA approach range from -1 to +1 (since they are cosine values between the content vectors). A positive value indicates some similarity; the greater the value, the greater the similarity.

We conducted several experiments on our system using the UMD essay corpus and also ran similar experiments on the LSA website (http://lsa.colorado.edu). From those experiments, we have established 0.40 as our threshold for the relevance cutoff. In our opinion this is neither too stringent nor too loose. We lowered it to 0.40 rather than 0.50 because we check the similarity of a sentence against the essay prompt only; a sentence may be closely related to an idea that the student has presented, yet receive a slightly lower similarity score with respect to the prompt. We feel that lowering the cut-off may help in such cases.

For this beta release therefore, we simply combine the relevance scores of the individual sentences by averaging them and assign a relevance score out of 6 for the entire essay.

Alternatively, there could be 2 thresholds: a high threshold above which sentences are definitely relevant, and a low threshold below which sentences are definitely irrelevant. Sentences falling in the interval between the two would be partially relevant. For now, however, we have decided to stick with a single hard cut-off, though it seems a bit unintuitive.

A supplementary approach we could think of was determining significant gaps in the similarity scores of the sentences in the essay and tagging sentences below that gap as irrelevant.

Finally, the relevance measure may go down significantly due to spelling mistakes in a student's essay. We had thought of conducting experiments by running a spell checker on the essay, choosing the appropriate words, and then finding the similarity scores; this way, the relevance of an ill-formed but relevant sentence should go up. However, upon further discussion, we could not come up with an approach for selecting the appropriate spelling of a misspelled word. For proper nouns, for example, the suggestions of a spell checker could change the meaning of the sentence entirely. A more comprehensive approach would therefore be required to select the correct spelling; this is not part of the final release.

The relevance module assigns a score out of 6. The formula that it uses is:

Score = ( Number of relevant sentences / Total non-gibberish sentences received ) * 6
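
For illustration, the formula above can be applied as follows, treating a cosine value at or above the 0.40 threshold as relevant (whether a score exactly at the threshold counts as relevant is an assumption).

  use strict;
  use warnings;

  sub relevance_score {
      my @similarities = @_;          # one cosine value per non-gibberish sentence
      return 0 unless @similarities;
      my $relevant = grep { $_ >= 0.40 } @similarities;
      return ($relevant / @similarities) * 6;
  }

  printf "relevance: %.2f\n", relevance_score(0.62, 0.55, 0.31, 0.47);   # 3 of 4 -> 4.50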

FACT IDENTIFICATION

The current approach for fact identification considers all the words in the sentence to be features; however, it is likely that certain words or certain types of words contribute nothing to discriminating between facts and opinions. Nouns, for example, might not be helpful. Consider the sentences ``Lincoln was a good President'' and ``Lincoln was the President''. The same nouns occur in both sentences, but the latter is a fact while the former is not.

In future versions, we would like to try considering only certain parts of speech, for example verbs and adjectives. Additionally, the training corpus used in this release is probably too small, being on the order of one hundred sentences.
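
For illustration, here is a rough sketch (in Python) of the kind of feature extraction that produces the token lists shown in the evaluation traces later in this report: words are lower-cased, punctuation is stripped, and numerals are mapped to the placeholder T_NUM. The exact preprocessing in the released module may differ:

 import re

 def fact_features(sentence):
     # "Pentium 4 is a microprocessor." ->
     # {pentium, T_NUM, is, a, microprocessor}
     words = re.findall(r"[a-z]+|\d+", sentence.lower())
     return {"T_NUM" if w.isdigit() else w for w in words}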

FACT VERIFICATION

One way of improving results would be to restrict the search to certain websites. For example, for an essay about the current political situation in the world, it is reasonable to expect the student to quote facts from current news. In such a case, restricting the search to websites such as the BBC, CNN, and the New York Times improves accuracy, according to some preliminary tests conducted on the system. Another modification that would help is a better boundary detection algorithm, one that takes into account normal English writing style as well as HTML tags when deciding boundaries in the corpus.

This module also returns a score. The scoring mechanism is a simple ratio and does not compare the quality of facts when making a decision: if the student has typed in 4 facts and 2 are correct, the score assigned is (2/4) * 6.
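
As a concrete sketch of this rule (Python, for illustration only):

 def fact_verification_score(num_verified, num_facts):
     # Simple ratio of verified facts to facts attempted, scaled to 6;
     # e.g. 2 verified out of 4 facts gives (2/4) * 6 = 3.0.
     return (num_verified / num_facts) * 6 if num_facts else 0.0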


SCORING METHOD

The scoring method used by SEAM can be broken down into three cases:

  1. If the essay is determined to be entirely gibberish, then the essay is given the minimum possible score: zero.

  2. If there are no sentences that are determined to be statements of fact, then the overall score is the mean of the individual scores of the gibberish and relevance subsystems.

  3. If there are statements of fact, then the overall score is the weighted sum of the individual scores from the gibberish, relevance, and fact checking subsystems:
      score = .45 * gibberish + .45 * relevance + .10 * fact_checking
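
A minimal sketch of the three cases above (Python, for illustration; rounding and edge cases in the released system may differ):

 def overall_score(gibberish, relevance, fact_checking,
                   all_gibberish, has_facts):
     # Case 1: an essay that is entirely gibberish gets the minimum score.
     if all_gibberish:
         return 0.0
     # Case 2: no statements of fact -- average the gibberish and
     # relevance subsystem scores.
     if not has_facts:
         return (gibberish + relevance) / 2
     # Case 3: weighted sum of the three subsystem scores.
     return 0.45 * gibberish + 0.45 * relevance + 0.10 * fact_checking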

RELEVANCE MODULE

Since we have used the LSA approach for relevance detection, we decided to have the LSA site (http://lsa.colorado.edu) as our baseline for comparison of results.

We compared the relevance scores assigned to an essay by our system (on a scale of 6) with the scores assigned by the ``One to Many Comparison'' demo on the site.

We tried the following cases:

1. Relevant response to the prompt

2. Partially relevant response to the prompt

3. Totally irrelevant response to the prompt

We observed that the scores given by the LSA demo were comparable to our scores under all of the above categories. We noticed that even with totally irrelevant responses, the similarity values were sometimes very high on both systems, though this was seen to a lesser extent on our system, presumably because of the small size of our corpus.

Here, we also need to take into consideration that the training corpora for the two systems are entirely different.

Overall, the two systems compared fairly well, so we concluded that the module was working as expected, though it is certainly not precise.


EVALUATION

This section evaluates SEAM in the following steps:

1) Gibberish detection

2) Fact identification

3) Relevance

4) Fact verification

5) Overall essay scores

Gibberish Detection Evaluation

For this evaluation the prompt for the essay was:

Automated essay scoring is unfair to students, since there are many different ways for a student to express ideas intelligently and coherently. A computer program can not be expected to anticipate all of these possibilities, and will therefore grade students more harshly than they deserve.

Discuss whether you agree or disagree (partially or totally) with the view expressed providing reasons and examples.

The input essay was:

An armed gang stole more than $39 million from Belfast's Northern Bank.

Police believe the robbers may have had inside information, as well as paramilitary help, in view of the military precision employed in the raid on the Donegall Square West bank.

On Sunday night, gang members went to the homes of two senior Northern Bank officials in Dunmurry, near south Belfast, and Loughinisland, Co Down.

With their families held hostage, the men went to work on Monday morning with instructions to behave normally.

It was at the close of business on Monday that the operation to empty the cash vaults of the distribution center was launched.

At least one lorry nosed its way down a narrow side entrance on Wellington Street, just off Donegall Square West, opposite City Hall.

Thousands were on the streets at the time, and detectives are looking for anyone who might have spotted some sort of activity.

CCTV footage has also been studied in a bid to trace the movements of the getaway van, which turned left into Queen Street and then took a sharp right towards the M1 leading out of the city.

 Young people, old people.
 Young teach none old.
 Young think not old.
 Young people have fun.
 Young people teach fun things to old.
 Home is under over river.
 Young man play old man.
 Man learn anything.
 Home I under over at the cold.
 River slow soil under.
 Earth, Sun, Mars, Jupitar.
 I is under over at home.
 Cat red.
 Automate essay grade score now.
 ETS Jill New Jersey.
 Automated Score.
 AMD, INTEL, VIA, PowerPC.
 Canada neighbour USA.
 University of Minnesota home close.
 Duluth Lake Superior cold city.
 The greenish fast sports car was his.
 He is a freshmen at UMD.
 Sun Machines make good servers.
 University of Minnesota. 
 Duluth city new west. 
 Potato Mashed taste. 
 Green salad body for good.

------------------------Output from SEAM----------------------------------

The gibberish sentences are:

Home is under over river. Young man play old man. Home I under over at the cold. River slow soil under. Earth Sun Mars Jupitar. I is under over at home. Cat red. Automate essay grade score now. ETS Jill New Jersey. Automated Score. AMD INTEL VIA PowerPC. Canada neighbour USA. University of Minnesota home close. Duluth Lake Superior cold city. University of Minnesota. Duluth city new west. Potato Mashed taste. Green salad body for good..

Gibberish score: 0.6

Discussion: Most of the gibberish sentences were captured by our gibberish detector, as expected. The non-gibberish sentences are then sent to the rest of the modules. Also as expected, the gibberish module assigned a very low score to the essay, since a large portion of its content is gibberish.

Evaluation of the fact identification module

For this test we sent the following sentences to the system; the module's output follows the input:

 Pentium 4 is a microprocessor. 
 Ted Pedersen is an associate professor from the University of Minnesota Duluth. 
 Xeon is a microprocessor.  
 Ronaldinho won the 2004 FIFA player of the year award. 
 India got indepence in 1947.
 The factual sentences are:
 Pentium 4 is a microprocessor. 
 Ronaldindo won the 2004 FIFA player of the year award. 
 India got independence in 1947.

A sample trace output from SEAM is included below.

 Tokens: a, T_NUM, pentium, is, microprocessor
 Category scores:
   NF => 0.219476038872712
   VF => 0.97561789054975
 Pentium 4 is a microprocessor.

For this sentence, the Naive Bayes classifier successfully tagged the sentence as a fact. It is a temporal fact; nevertheless, it matches the pattern for a factual sentence and was therefore picked up as a fact.

 Tokens: a, xeon, is, microprocessor
 Category scores:
  NF => 0.990457589974651
  VF => 0.1378178597338

For the sentence ``Xeon is a microprocessor'', the Naive Bayes classifier could not conclude that it is a factual statement, although it should have. As mentioned earlier in the report, our system is not able to consistently identify factual statements; this is an example of that limitation.

 Tokens: the, fifa, T_NUM, of, player, award, ronaldinho, year, won
 Category scores:
  NF => 0.047248986953747
  VF => 0.998883142931066
 Ronaldinho won the 2004 FIFA player of the year award.

The above sentence is picked as a fact and it is indeed a fact.

 Tokens: professor, the, university, of, minnesota, duluth, is, associate, from, pedersen, ted, an
 Category scores:
  NF => 0.855439367978021
  VF => 0.51790297133089

This statement is a fact; however, our system was not able to detect it.

 Tokens: india, T_NUM, got, in, indepence
 Category scores:
  NF => 0.122395112692264
  VF => 0.992481453926998
 India got indepence in 1947. .

This sentence is identified as a fact, which is correct.

Overall, the fact identification module detected 3 of the 5 factual statements it was run on.

Evaluation of the Relevance module:

For this test we have used the following prompt:

Automated essay scoring is unfair to students, since there are many different ways for a student to express ideas intelligently and coherently. A computer program can not be expected to anticipate all of these possibilities, and will therefore grade students more harshly than they deserve.

Discuss whether you agree or disagree (partially or totally) with the view expressed providing reasons and examples.

The input essay was:

Automated essay gradiing is gradiing essays by computer. It is kinda funny. No teacher. Automated essay grading is fast. Teacher is a teacher.

No worry about friends making fun. Kinda nice. Not compullssory. Can write any time I like, i like freedom.

No job for tearcher, waste time. Students work hard. Kinda bad. Student become teacher. teacher become student. problem

Copinwg esay. Not need to write again and again and again.

Atomated essay is automeated. teacher i teacher. steduent is a student.

The temperature there was in minus.

Output from the Relevance module:

        Anything below a similarity score of 0.4 is tagged as irrelevant.
        automated essay grading is fast  :: 0.70008
        Relevant
        can write any time i like i like freedom  :: 0.61722
        Relevant
        
        students work hard  :: 0.60932
        Relevant
        
        automated essay gradiing is gradiing essays by computer  :: 0.60791
        Relevant
        
        no worry about friends making fun  :: 0.59341
        Relevant
        
        teacher is a teacher  :: 0.50714
        Relevant
        
        student become teacher teacher become student problem copinwg esay  :: 0.48793
        Relevant
        
        atomated essay is automeated teacher i teacher steduent is a student  :: 0.47771
        Relevant
        
        the temperature there was in minus  :: 0.00000
        Irrelevant
        
        it is kinda funny  :: 0.00000
        Irrelevant
        
        Total number of sentences: 10
        Number of relevant sentences = 8
        Number of irrelevant sentences = 2
        The overall relevance score (out of 6) for the essay is 4.8

As expected, the relevant sentences were tagged as relevant and the irrelevant ones were tagged as irrelevant. After experimentation, we found that a threshold of 0.4 works best for our system.

Evaluation of the fact verification module

For the evaluation of the fact verification module we tested our system with the following sentences:

 Ted Pedersen is an associate professor at the University of Minnesota Duluth. 
 Ted Pedersen is an associate professor at the University of Minnesota Crookston. .

The system identified both sentences as factual and then determined whether they were verifiable. The first sentence is a true statement and the second is a false statement. The system verified the first sentence as a true fact and returned the second as non-verifiable. Here is a sample trace of the system in student mode (to reduce clutter):

 The factual sentences are:
 Ted Pedersen is an associate professor at the University of Minnesota Duluth. 
 Ted Pedersen is an associate professor at the University of Minnesota Crookston. .
 The true facts are:
 ted pedersen is an associate professor at the university of minnesota duluth
 The false facts are:
 none
 The  non-verifiable facts are:
 ted pedersen is an associate professor at the university of minnesota crookston

Evaluation of overall essay score

Here are some sample runs of essays with matching prompts. Both prompts and their respective essays were taken from http://www.wayabroad.com

 Prompt: original score 4.5
 our score: 6

Decisions can be made quickly, or they can be made after careful thought. Do you agree or disagree with the following statement? The decisions that people make quickly are always wrong. Use reasons and specific examples to support your opinion.

Answer:

Making decision is one of the necessary skills in people's normal lives. As to whether making decisions quickly are always wrong, different people hold different views due to their distinctive backgrounds. Some agree with the argument, while others believe that making decision quickly is not always wrong. As a matter of fact, the issue is a complex and controversial one. On balance, however, I tend to agree that making decisions fast is not always incorrect. My argument could be fairly substantiated by the following discussions.

To begin with, Making decisions quickly often bring about the correct results. At the point, determining some simple decisions in the normal life may prove such an opinion. For example, people could determine very quickly and correctly as to going to shopping or taking a rest at home, eating a piece of bread or an egg for their breakfast. Moreover, even in some more complicated situations, people still can make decision quickly but correctly. For instance, a manager may determine quickly to take a correct action when he faces a problem in the company. Owing to his experience and knowledge, he has the ability to make good decision even quickly.

Next, making decisions fast is necessary ability in some cases. For instance, when people encounter a child in a fired house, they would be in a dilemma in such an emergent and dangerous situation, because if they go to save the child's life, their lives would be threatened, otherwise, the child may lost his young life. They, however, have to decide rapidly on whether they should save the child from the dangerous situation. In such emergency, people need to determine quickly and wisely.

Admittedly, some quickly-made decisions may sometimes inevitably incur unexpected and wrong results, and certainly a lot of facts may prove such argument. The argument is, however, not necessarily to put forward a conclusion that fast decisions are always wrong, because tremendous quickly-made decisions could be proved to be wised ones in our normal life.

In sum, although some quick decisions may sometimes be proved incorrect later, in the light of the above discussion, it can be safely said decisions could be made not only quickly, but also at the same correctly. In some cases, people could easily make wised decisions. While in other cases, they may be forced to make quick determination when encountering emergent situations. People should choose their ways flexibly to deal with different situations.

Discussion: This essay was graded as a 4.0 essay when it appeared on wayabroad.com; however, our system gave it a score of 6.0. This is expected, as our system does not look at the structure of the essay, spelling mistakes, flow of writing, and so on. Given the limitations of our system, a score of 6.0 simply implies that the essay was relevant to the topic and contained little or no gibberish.

 Prompt: original score : 4.0
 Seam score : 6.0

Some people say that computers have made life easier and more convenient. Other people say that computers have made life more complex and stressful. What is your opinion? Use specific reasons and examples to support your answer.

Answer:

Computer is a high technology that has many advantages in recent day. In my opinion, I think that computers have made my life easier and more convenient. The specific reasons and examples that support my opinion described in below.

Firstly, computers have several kinds of program, word processing, web browser and so on, that can help people work in their business because computers save their time from routine works such as document, photo, filing, printing, searching data and so on. Because of computer abilities, their businesses run fast and smoothly.

Secondly, not only in business but also in medical service, because of both proficiency and accuracy, have computers an important role in modern medical tools. For example, inventors use computer to controlling machine or laser that helps doctors to easier and more proficient operate on important organs such as brain, heart, eyes, and stomach.

Finally, in education, reaching knowledge is boundless because computer networks provide people to reach information that is contributed on Internet. For example, Students find information for their reports from libraries in other nations. People can receive an innovation that was invented in another side of the world.

In conclusion, in modern life style, computers have made life easier and more convenient. People use computer to improve their business, health, and education that is an important part to develop their country as well.

Discussion: Here is yet another example of an essay that received a score of 4.0. Our system graded it as a 6.0 essay. Again, the score of 6 simply indicates that the essay did not contain any gibberish and that the sentences were relevant to the prompt.

Prompt: Score = 1

Score from our system: 4

Automated essay scoring is unfair to students, since there are many different ways for a student to express ideas intelligently and coherently. A computer program can not be expected to anticipate all of these possibilities, and will therefore grade students more harshly than they deserve. Discuss whether you agree or disagree (partially or totally) with the view expressed providing reasons and examples.

Answer:

Automated essay gradiing is gradiing essays by computer. It is kinda funny. No teacher. Automated essay grading is fast. Teacher is a teacher.

No worry about friends making fun. Kinda nice. Not compullssory. Can write any time I like, i like freedom.

No job for tearcher, waste time. Students work hard. Kinda bad. Student become teacher. teacher become student. problem

Copinwg esay. Not need to write again and again and again.

Atomated essay is automeated. teacher i teacher. steduent is a student.

The temperature there was in minus.

Discussion: The above essay is an example of a bad essay. A trace run of the essay is included below for further inspection.

Output from SEAM for the above essay.

 The gibberish sentences are:
 No teacher. 
 Kinda nice. 
 Not compullssory. 
 No job for tearcher waste time. 
 Kinda bad. 
 Not need to write again and again and again. 
 Gibberish score: 2.8
 The relevant sentences are:
 automated essay grading is fast 
 can write any time i like i like freedom 
 students work hard 
 automated essay gradiing is gradiing essays by computer 
 no worry about friends making fun 
 teacher is a teacher 
 student become teacher teacher become student problem copinwg esay 
 atomated essay is automeated teacher i teacher steduent is a student
 The irrelevant sentences are:
 the temperature there was in minus 
 it is kinda funny
 Relevance score:
        The overall relevance score (out of 6) for the essay is 4.8
 The factual sentences are:
 Automated essay grading is fast.
 The true facts are:
 none
 The false facts are:
 none
 The non-verifiable facts are:
 automated essay grading is fast
 Overall score for your essay:  4

As seen above, SEAM gave the essay a score of 4. Most of its sentences were fairly related to essay grading, while a few non-relevant sentences received a score of 0.

Our system uses an LSA-based approach to determine the relevance of sentences to the essay prompt. Because the system was trained on the APW corpus, unrelated sentences often show up as relevant: their words are semantically related and co-occur in that corpus.


RELATION TO PREVIOUS WORK

GIBBERISH DETECTION

Jill Burstein et al. at the Educational Testing Service have worked on gibberish (a.k.a. word salad) detection and syntax checking. Our solution is similar in that it employs a hybrid approach: it first detects grammatically incorrect sentences using the CMU Link Parser and then uses a rule-based approach to detect gibberish. The rule-based approach uses the Brill tagger to part-of-speech tag the words in the previously found grammatically incorrect sentences.

We did not come across any literature describing the use of a Naive Bayes classifier to detect gibberish. However, we found from general discussions on the Internet that some spam filters use Naive Bayes classifiers to detect spam in email, and that spammers insert gibberish to fool those classifiers into thinking the emails are not spam. This does not directly tie into what we are doing, but it does give us food for thought.

IRRELEVANCE MEASURE

Thomas K. Landauer et al. have developed an automated essay grading system called the Intelligent Essay Assessor (IEA), which provides relevance checking with feedback using LSA.

http://lsa.colorado.edu/ provides a web interface for ``Sentence Comparison'' that gives similarity scores between two or more sentences in different semantic spaces. Our approach uses the same technique, with the exception that SVD is not currently included in our implementation.

The E-Rater developed at ETS also handles relevance analysis by way of content checking. Their approach uses content vector analysis for relevance check.

FACT IDENTIFICATION

Naive Bayes classifiers have often been used in text processing; spam filters, for example, use them to detect junk email. We use a Naive Bayes classifier to classify text as opinion or fact.

There has been some work done in discriminating opinions in text. For example, Wiebe [7] has done some work in identifying subjective sentences. However, identifying verifiable factual statements is more complicated than just filtering out the opinions because there are facts that are not verifiable, as was discussed above.

FACT CHECKING

Fact checking is a derivative of the method used by Tom Landauer's Intelligent Essay Assessor (IEA henceforth) [1]. IEA employs LSA to check if the student essay is similar to a stock of sample essays. In the IEA method, more importance is given to the content of the sentence than to the sentence structure or length. Our fact verification module will work in a similar manner, checking just the main idea in the sentence for validity. The main contributing words in a sentence are checked against internet sources to see if they often appear in a similar context. Sentence structure and length are ignored.


REFERENCES

1. T.K. Landauer, P.W. Foltz, and D. Laham. An introduction to Latent Semantic Analysis. Discourse Processes, 25:259-284, 1998.

2. M. A. Hearst. The debate on Automated Essay Grading. Intelligent Systems, IEEE, 15: 22-37, 2000.

3. E. Brill. The Brill Tagger. URL: http://research.microsoft.com/~brill/

4. D. Temperley, D. Sleator, J. Lafferty. The Link Parser. URL: http://www.link.cs.cmu.edu/link/

5. Latent Semantic Analysis at CU Boulder. URL: http://lsa.colorado.edu

6. M. Chodorow and J. Burstein. Beyond essay length: evaluating e-rater's performance on TOEFL essays. TOEFL Research Report 73, ETS RR-04-04, 2004.

7. T. Wilson, J. Wiebe, and R. Hwa. Just how mad are you? Finding strong and weak opinion clauses. In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-2004), 2004.

8. L. Larkey. Automatic essay grading using text categorization techniques. In Proceedings of the twenty-first annual International Conference on Research and Development in Information Retrieval (SIGIR 98), 1998.

9. J.R. Christie. Automated essay marking for both style and content. In Proceedings of the Third Annual Computer Assisted Assessment Conference, 1999.

10. E.B. Page. The Use of the Computer in Analyzing Student Essays. In International Review of Education, vol. 14, 1968.