=head1 E-VALUADOR FINAL VERSION REPORT


=head2 AUTHORS:

Hemal Lal

Pratheepan Raveendranathan

Ravindra Bharadia

Aditya Polumetla

Sameer Atar


=head2 INTRODUCTION TO ESSAY GRADING


Ever since humans gained self-awareness, they have tried to communicate with
one another. The early forms of communication were rudimentary and based on
signs and gestures. Human intelligence has come a long way since then, and
human language has evolved even more rapidly. Today humans communicate both
verbally and in writing, and both forms have rich, complex linguistic elements
that represent human thinking.

It is no surprise, then, that a person's writing is considered a measure of their intelligence, alongside other analytical and technical skills. That is why exams such as the Graduate Record Examination (GRE) and the GMAT ask students to write essays to assess their thinking and communication skills. Until recently, these essays were graded only by human beings, more than one in most cases. But as the number of students taking these exams increases every year (by thousands), the testing institutes face both financial and human-resource problems. Hence fast and economical approaches to grading essays are being sought.

This is where automated essay grading comes to the rescue. It aims to give computers the job of human graders, which will alleviate both the financial and the resource problems because of the speed of a computer.

However, since its introduction this approach has met with heavy criticism, because whether computers can be made to understand something as complex and varied as human language is still under debate.

But recent advances in the fields of Natural Language Processing and Information Retrieval, together with the vast amount of linguistic data available from the Internet, have made significant progress in this area possible, and such systems are already being deployed commercially (e.g., E-Rater).


=head2 History of automated essay grading

=head3 Major developments during 1966 - 1982 [1]

=head4 Project Essay Grade (PEG) [1] [15]

Research into the possibility of using computers for automated essay grading began in 1966 with Ellis Page. His system graded essays based on what he called 'proxies': indirect measures of essay quality such as average word length, total essay length in number of words, number of commas, etc. It did not take into account the actual content of the essay and hence was heavily criticized and never commercially deployed. It was also susceptible to trickery; students could, for example, simply write longer essays to get a higher score.

Although it was not commercially accepted at the time, PEG laid the foundation for all automated essay grading work done since.


=head4 Writer's Workbench tool (WWB) [1] [16]

This was the next step in the direction of automated essay grading. It was not an essay scoring system; instead it aimed at providing feedback to writers about spelling, word usage, readability, etc. It was one of the earliest spell checkers built. It gave an indication of the readability of the writing based on parameters such as extremely long sentences or the use of pretentious words. This was a step toward automated grading of writing quality.


=head3 Major developments during 1994 - 1997 [1]

By this time a lot of research had begun in Natural Language Processing and Information Retrieval, and researchers could use these tools for automated essay grading. CAEC and IEA were the main systems created during this period, and PEG was enhanced to make use of the recent developments.


=head4 Computer Analysis of Essay Content (CAEC) [1] [17]

As mentioned earlier, PEG used indirect measures, called 'proxies', to determine the quality of an essay without taking its actual content into account. This was the major drawback of the system. A team of researchers at ETS, headed by Jill Burstein, worked on finding direct measures of essay quality to overcome this drawback. One such direct measure was syntactic variety, based on quantifying the types of sentences and clauses found in essays, which could be measured using statistical NLP methods.

Another measure was the similarity between two documents, computed using vector space modeling, a technique commonly used in Information Retrieval.


=head4 Intelligent Essay Assessor (IEA) [1]

This was developed in the late nineties by Tom Landauer and his team and is based on Latent Semantic Analysis (LSA). Using a matrix transformation technique known as Singular Value Decomposition (SVD), deeper hidden relationships between words and documents are discovered, and these relationships are used to understand the relations between words and sentences. Notable features of IEA were that it gave customized feedback about the essay and had plagiarism detection built in.


=head3 Major developments during 1998 - 2000 [1]

This was the time when many automated grading systems, including PEG, IEA and CAEC, turned from research projects into fully functional, commercially deployed systems.


=head4 Text Categorization Technique (TCT) (1998) [12] [14]

Larkey was the first to come up with the idea of using text categorization for automated essay grading. He used TCT along with text complexity features and linear regression for grading essays. Text categorization refers to the problem of sorting given text into different categories. It is based on developing 'binary classifiers' that separate 'good' essays from 'bad' ones and give a score which can be used to assign a grade to the essay. This involves the use of multiple text categorization techniques. First, Bayesian classifiers are used to find the probability that a given document belongs to a particular category. Then some important keywords are extracted from the essay, and the k-nearest neighbor technique is used to find the k essays (from the set of sample human-graded essays available) that are closest to the student essay. Then, to assess the style of the essay, 11 text complexity indicator features are extracted from these k essays and used to calculate the style of the essay. Examples of complexity features are the number of words in the essay, the average sentence length, etc. Larkey tested the system on two sets of essays: one on social studies, which had a lot of content, and another on general public opinion, which had less content and more style. The system performed remarkably well on both sets.


=head4 E-Rater [1] [17]

CAEC was patented by ETS, renamed E-Rater, and commercially deployed for scoring GMAT essays in 1999. Also, beyond just reporting a score, a need was felt for providing feedback about the essays, and Criterion was developed. Criterion had additional evaluation features that provided students and teachers not only with final scores but also with categorized scores and comments about how the student performed in each of the subtasks.


=head3 Recent Development - 2002

=head4 Bayesian Essay Test Scoring sYstem (BETSY) (2002) [11] [13]

This is the most recent system under development; its first release was in 2002. The system is being developed by Lawrence M. Rudner at the University of Maryland. It is trained on training data and then classifies given essays into four categories: extensive, essential, partial, and unsatisfactory. This classification takes into account content-indicative features as well as style-indicative features of the essay, i.e. the system grades not only on the content but also on the writing style of the student. The system uses the Multivariate Bernoulli Model (MBM) and the Bernoulli Model (BM) for classification. MBM has a precalibrated set of features that the ideal essay should have; these features are calibrated using the training data. It then looks at the features present in the student essay and calculates the probability of each possible grade as the product of the probabilities of the features present in the essay. In BM, the conditional probability of a feature is found taking into account the number of essays in the data set that belong to that category and have that feature. Both MBM and BM are naive Bayesian models because they assume conditional independence between the features. According to the developers of BETSY, their system incorporates the good features of PEG, LSA and E-Rater. According to Rudner, the system has achieved an accuracy of 80% in the grading tests conducted so far.


=head2 OBJECTIVE:

Our aim is to create a web-based automated essay evaluator system through which students can evaluate and improve their essay-writing skills. The user will be required to log in to the system and write an essay on a topic of their interest from the available prompt choices. Once the essay has been submitted, the system will evaluate it based on several important criteria. Once this is accomplished, the system will provide a score within a range of 0-6 (or 0-100) along with constructive feedback. The 0 to 6 scoring standard is used by the Educational Testing Service (ETS) [4]. A score of 0 reflects that the essay is not at all satisfactory in any respect, while a score of 6 shows that the essay is very well structured, grammatically correct, factually accurate, and highly relevant to the given prompt.

The following are the four main criteria:

=head2 1. Gibberish detection:

We humans are intellectual beings, and hence we tend to manipulate things to serve our purpose or to gain maximum advantage from the issue at hand. The GRE, GMAT and SAT are exams, and naturally any student will want to exploit known loopholes in their grading mechanisms to obtain the maximum score.

The early versions of automated essay evaluator systems were easily tricked by students. One of the tricks students used to get higher scores was simply to list high-profile words related to the question or prompt. The system would look for words related to the prompt in the essay and hence fell prey to this kind of trickery. According to Marti A. Hearst [1], such a list of words is termed ``word salad'' or gibberish by scholars in the field.

It is very difficult to distinguish gibberish essays from grammatically ill-formed essays; there is no clear line of separation between the two. While a grammatically ill-formed essay can be considered a genuine attempt by a student whose written English is not very good, a gibberish essay, when considered as word salad, is a deliberate and conscious attempt by a student to cheat the system. Technically, a gibberish essay is not always the result of a student trying to trick the system; an essay written by a student whose English is poor can turn out to be gibberish too.

For a human reader it is quite easy to differentiate between gibberish and a grammatically poor essay. This is because we can extract the intended meaning from a phrase with poor grammar, but we cannot extract any intended meaning from a gibberish phrase. The same task is not straightforward for an automated essay grading system, since it does not derive its judgement from the semantics of the content. It has to follow predefined rules, which act differently in different contexts, contributing to erroneous results in several cases.


Example of a gibberish essay:

Question: Euthanasia is a very controversial topic. There are people who think that mercy killing should be allowed to relieve people from endless pain and there is a group of people who think that we should not play god. Express your opinion on the issue.

Answer: Mercy pong who ping mercy. life why god life. Murder Human king sick right people is kick endless. pain right die, We god are god. when Respect life how. Euthanasia tired bad is. All Murder murder murder murder murder. No right to kill kill kill. Pain god god god god god god god god god Mercy killing. life is god. Murder Human rights sick people endless. sick Right die, We are not god. Respect for life. Euthenesia is bad. Natural death. Murder murder murder murder murder. No right to kill kill kill. Pain god god god god god god god god god.

Although this essay is senseless, it might score very well compared to some genuinely well-written essays, because it contains many of the important words related to the prompt and the issue at hand. This will happen in a system that takes the number of words related to the issue as the main criterion for a better score. Ideally, these kinds of essays should score poorly, since they disregard almost all the rules of the language.

=head3 Our goal:

Our gibberish detection modules will detect a wide variety of gibberish content in the essay. They might miss very subtle cases, but they will detect all other explicit and basic cases.

1) The ratio-based gibberish detection module will detect essays with word salad in cases where there are too many or too few nouns. (More details in the implementation section)

2) The rule-based detection module will detect the portions of the essay that contradict the rules of grammar. If there are many sentences that contradict the rules of grammar, this module considers the essay to be gibberish. (More details in the implementation section) A linear combination of the scores from these modules is used to determine the final degree of gibberishness. (Analysis of each score and the combined score is presented in the implementation section)
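
A minimal sketch of how such a linear combination might look. The equal weights and the shared 0-100 scale are illustrative assumptions, not the project's calibrated values:

```python
# Minimal sketch: linearly combine the two gibberish-module scores.
# The equal weights are illustrative assumptions, not calibrated values.

def combined_gibberish_score(ratio_score, rule_score,
                             ratio_weight=0.5, rule_weight=0.5):
    """Return the overall degree of gibberishness, given the two
    per-module scores (assumed to share a common 0-100 scale)."""
    return ratio_weight * ratio_score + rule_weight * rule_score

print(combined_gibberish_score(80.0, 40.0))  # 60.0
```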


=head2 2. Irrelevance:

Irrelevance can be defined as content that is not related to the matter being discussed. Irrelevance detection deals with identifying essays that are not relevant to the subject and measures how closely the essay follows the topic.

Irrelevance also measures whether the student has understood what the essay topic means and what is expected as an answer. This gives a good insight into the capabilities of the student. The ideal answer should be prompt-specific and must discuss all the issues involved without deviating from the topic. The presence of irrelevant content in the essay reveals that the student is unaware of what is required. A drift from the main issue at hand, and the inclusion of issues not related to the question, suggests that the student is not well informed about the issue, or is not creative enough to relate the issue to something they have experienced and make an argument based on that. Irrelevance is the most common fault in essays and can be attributed to a lack of knowledge about the issue. An essay written by a student in perfectly grammatical form and presented in a very elegant manner should still be considered a poor essay if it lacks relevant content.

Example:

Suppose the same question about euthanasia, mentioned in the discussion of gibberish detection, is considered. If in the essay on that issue the student mentions that ``sponge bob square pants is horrible'', then this is considered irrelevant in the context of euthanasia. There is no relation whatsoever between sponge bob and euthanasia.

=head3 Our goal:

In our approach to detecting the amount of irrelevant data in a given essay, we will use the F-measure computed from precision and recall, as described on the Creighton University Health Sciences Library and Learning Resources Center website [5]. The method is discussed in the implementation section under the irrelevance measure.
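
The F-measure itself is simply the harmonic mean of precision and recall. A minimal sketch follows; the relevance-specific definitions of precision and recall in the comment are assumptions about how the module applies them, not taken from the source:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the balanced F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Assumed application for relevance scoring: precision = fraction of
# essay words that are prompt-relevant, recall = fraction of the
# expected prompt words that actually appear in the essay.
print(f_measure(0.8, 0.4))  # ~0.533
```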


=head2 3. Fact Finding:

Fact finding can be defined as the process of distinguishing facts from opinions. An opinionated essay is valuable since it requires a student to argue and display their writing skills, but the inclusion of facts in the essay shows that the student has knowledge of the issue. To achieve this, it is important to distinguish facts from opinions. An opinion has a degree of truth, while a fact by definition is like a proposition, i.e. true in all cases.


Example:

``George Bush won the 2004 election'' is a fact and ``George Bush is a great leader'' is an opinion.


Fact finding is quite a challenging task for a machine. Once again, for humans it is easy to extract facts from given content, since we have a rich database of facts collected over years of exposure to all sorts of information, and the ability to associate and reuse the gained information in different contexts.

Facts fall into two categories: those that are explicit and those that are very subtle. It is easy to pick out explicit facts like ``There are 206 bones in our body''; this is due to the presence of the number 206. On the other hand, it is very difficult to extract facts like ``Some trees grow tall'' or ``Yesterday it rained'', since no numbers or statistics are present in the sentence. Humans can easily identify these sentences as facts because they have actually seen trees that reach enormous heights and may have witnessed the rain too. For the system it is not the same.

Association cannot be used when finding facts with a computer program, due to the absence of a large reservoir of information on which such associations could be based. Even if such a reservoir existed, it would not be advisable to use it, since the computation time would not be favorable for working in real time.

In an alternative and rather naive approach, all sentences containing opinion words like nice, nicely, cute, feel, or think are removed from the essay. Once these sentences are removed, all remaining sentences are checked for features like dates, percentage symbols, numbers, and uncommon nouns. Sentences that contain these entities are considered factful sentences whose validity is not yet determined.
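
A sketch of this naive approach, assuming an illustrative opinion-word list and simple punctuation-based sentence splitting (the uncommon-noun check is omitted here, since it would require part-of-speech tags):

```python
import re

# Illustrative opinion-word list; the project's full list is not given.
OPINION_WORDS = {"nice", "nicely", "cute", "feel", "think"}

def candidate_fact_sentences(essay):
    """Drop sentences containing opinion words, then keep sentences that
    mention numbers or percentages as candidate factful sentences."""
    sentences = re.split(r'(?<=[.!?])\s+', essay.strip())
    candidates = []
    for s in sentences:
        words = {w.lower().strip('.,!?') for w in s.split()}
        if words & OPINION_WORDS:
            continue                       # opinionated sentence, skip it
        if re.search(r'\d', s) or '%' in s:
            candidates.append(s)           # contains a statistic
    return candidates

print(candidate_fact_sentences(
    "I think essays are nice. There are 206 bones in our body."))
# ['There are 206 bones in our body.']
```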

At first glance this method looks naive, but surprisingly it works very well with the student essays. According to Manning & Schutze [6], the notion of ``minimum effort'' introduced by George Zipf can be a factor in the success of this method. According to Zipf, a writer will try to use the least number of new words when producing an essay, so as to reduce the effort on their part. If this is true, then it is safe to assume that a large number of people will try to use the same opinion words to express opinions in their writing.

=head3 Our goal:

The goal of the fact finder module is to identify factual sentences in an essay. The current version of the fact finder can identify facts that contain proper nouns and statistics. For example, the fact ``Ferrari makes sports cars'' is identifiable; however, a fact such as ``A leaf is green'' is not.

=head2 4. Fact Checking:

The usage of facts in an essay speaks volumes about a student's knowledge of the issue at hand. This is generally true for essays that are statistics-driven. For example, in an essay on pollution in cities, a student can use facts to argue their points and make strong claims using statistics and numbers. On the other hand, if a student is to write an essay on whether elderly people should stay with their families, it is not easy to incorporate statistics; in this case the student argues based on personal opinions and those of others.

If the essay is fact-driven, one expects to encounter a couple of sentences that can be identified as potential factful sentences. The presence of these sentences in an essay does not guarantee that they are actual facts. Each factful sentence needs to be cross-checked against some database of facts so as to accept or refute it as a fact.

Example:

Suppose the student mentions in their essay that ``China is in Hawaii''. This is not true, and hence the student can be penalized for stating something untrue.

The web is arguably the largest database of facts. It is true that along with the desired facts it also contains a lot of noise that requires filtering. It is not impossible to use the web to check facts, although the task is not as simple as it sounds. Web sites such as google.com, altavista.com and yahoo.com can be used to check facts. They are desirable for this purpose due to their excellent response time to a query, their efficiency, and above all their organization of responses, which makes fact checking possible.

Some facts are difficult to verify, and in these cases the web will not help determine their correctness. For example, take the fact ``it rained the whole week'': whatever results are obtained by searching the web for this fact are not useful, because it might be true that it rained in Frankfurt for the entire week while this was not the case in Brazil. This particular fact is hard to verify since it is not true in all cases.

=head3 Our goal:

The goal of the fact checker module is to verify factual sentences. The current version can verify factual sentences that contain a small number of words, and it can also verify certain longer sentences that contain at least two proper nouns.
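
A sketch of how such web-based verification might be structured. The `hit_count` function is a stand-in for a real search-engine query (against a site such as google.com) and is stubbed here with canned values so the control flow can be illustrated; the hit threshold is an assumed parameter, not the project's actual value:

```python
# Stand-in for a real web index; canned values purely for illustration.
FAKE_INDEX = {
    '"China is in Asia"': 12000,
    '"China is in Hawaii"': 0,
}

def hit_count(query):
    """Placeholder for the number of web hits for an exact-phrase query."""
    return FAKE_INDEX.get(query, 0)

def verify_fact(sentence, threshold=10):
    """Accept a candidate factual sentence if an exact-phrase search
    returns at least `threshold` hits (threshold is an assumption)."""
    return hit_count('"%s"' % sentence) >= threshold

print(verify_fact("China is in Asia"))    # True
print(verify_fact("China is in Hawaii"))  # False
```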

=head2 5. Essay Complexity:

The content creativity and complexity of an essay are very hard to determine, given the nature of the English language and the multiple ways of explaining the same issue. But it is of great importance to find ways to closely approximate complexity and creativity. The method used in this module combines multiple features of the essay. From an essay, the following things can easily be determined:

1. Essay length (by characters, words, sentences, and paragraphs)

2. Word tokens, and Word types

3. Average sentence length

4. Average word length

5. Number of words of sizes 4,5,6,7,8,9, .. etc


Using the above information in various combinations, it is possible to derive a formula that approximately represents creativity and complexity. The formula used in this module is:

- Number of paragraphs (num)

  if(num == 5) then score = 1;
  else if(num == 4 || num == 6) then score = 0.6;
  else if(num == 3 || num == 7) then score = 0.3;
  else score = 0;

- Average Word size (ave)

  if(ave >= 5.5) then score = 1;
  else if(ave >= 5.25 && ave < 5.5) then score = 0.6;
  else if(ave >= 4.75 && ave < 5.25) then score = 0.3;
  else score = 0;
 
- Ratio of Word Tokens to Word Types (rat)

  if(rat >= 2) then score = 1;
  else if(rat >= 1.75 && rat < 2) then score = 0.6;
  else if(rat >= 1.5 && rat < 1.75) then score = 0.3;
  else score = 0;

The sum of the three scores is then divided by 3, and that is the essay complexity and creativity score. The above limits were empirically determined using the essays from the UMD corpus.

Example:

Automated essay scoring is being used these days to score essays written by students in tests such as GRE and TOEFL. It uses computers to grade essays. As computers are used it reduces load on humans. Automated essay scoring also works very fast and student can get results just after writing the essay.

First, computers are not that much efficient to grade human essays. They cannot understand the logic and sense that a student uses in each essay. Even if we make the computer learn some of the types of essays that students write there are many ways to write an essay and it is impossible to make the computer understand all this.

Secondly, it is very difficult to train the computers in all areas of human knowledge. Its is costly and time consuming. This can be used in training humans and will give much better results. Even if a computer is fully trained some new aspects tend to creep in every essay and this will raise problems.

Thirdly, as computers understand only a part of the essay they are able to grade only that part and all the rest is taken as irrelevant or misleading topics. Due to this the grade of a student is drastically reduced. It can even happen that a poorly written essay is given a much better score than a good one.

Finally, I feel that use of automated essay scoring is not good way of scoring essays and they are totally unfair to students. Use of computers should be reduced and more stress should be given to include humans in the process of scoring essays. This will result in a better grading system.

The score for the above essay on a 100-point scale is: 86.67

The following are the log results:

    Essay Complexity Results:
    Total Chars : 1501
    Total Words : 279
    Total Word Type : 129
    Total Number of Sentences : 17
    Average Word length : 5.38
    Average Sentence length : 88.29
    Number of Words size 5 : 40
    Number of Words size 6 : 23
    Number of Words size 7 : 24
    Number of Words size 8 : 9
    Number of Words size 9 : 13
    Number of Words size 10 : 6
    Number of paragraphs 5
    Para = 5 => score = 1
    Ave. Word < 5.5 && >= 5.25 => score = 0.6
    Ratio of Tokens/Types >=2 => score = 1
    Complexity Score = 86.67
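
The scoring procedure above can be sketched as a small function. The bucket boundaries are taken directly from the stated limits; rounding to two decimals is assumed from the logged output:

```python
def complexity_score(num_paragraphs, avg_word_len, token_type_ratio):
    """Combine the three sub-scores described above and return the
    essay complexity score on a 100-point scale."""
    # Paragraph-count sub-score: a five-paragraph essay scores best.
    if num_paragraphs == 5:
        para = 1.0
    elif num_paragraphs in (4, 6):
        para = 0.6
    elif num_paragraphs in (3, 7):
        para = 0.3
    else:
        para = 0.0

    # Average-word-length sub-score.
    if avg_word_len >= 5.5:
        word = 1.0
    elif avg_word_len >= 5.25:
        word = 0.6
    elif avg_word_len >= 4.75:
        word = 0.3
    else:
        word = 0.0

    # Token/type ratio sub-score.
    if token_type_ratio >= 2:
        ratio = 1.0
    elif token_type_ratio >= 1.75:
        ratio = 0.6
    elif token_type_ratio >= 1.5:
        ratio = 0.3
    else:
        ratio = 0.0

    return round((para + word + ratio) / 3 * 100, 2)

# The sample essay above: 5 paragraphs, avg word length 5.38,
# token/type ratio 279/129.
print(complexity_score(5, 5.38, 279 / 129))  # 86.67
```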

=head3 Our goal:

The goal is to maintain a level of complexity in the essay and also to ensure that the user writes a five-paragraph format essay.


=head2 IMPLEMENTATION:

=head3 Ratio-based Gibberish Detection module:

Introduction:

A gibberish essay produced by a student will adhere to a general format. The student will not actually enter random words; instead they will cleverly generate an essay that has a huge collection of non-functional words, dominated by nouns and verbs related to the issue at hand. While this is also true of a well-structured essay, there one would also see a sizeable number of functional words such as conjunctions and determiners. We performed a part-of-speech tag count on the set of 500 essays [9], calculated the score described below for each chunk of 50 words in order, and observed that in each chunk of 50 words, nouns were the dominant tags. The total number of nouns was twice the total number of verbs, and the total number of determiners was one-third the total number of nouns. The next most frequent tags were the conjunctions and prepositions. We devised a formula to give a score based on the gibberishness of each group of 50 words. The total gibberishness of the essay is established based on how many of the groups of 50 words crossed the gibberishness threshold. All of the above ratios were empirically determined.


Input: A part-of-speech-tagged essay generated using the Brill Tagger [3]

The scoring formula:

Score = # of nouns / (# of verbs + # of determiners + # of conjunctions)

The scores obtained from the above formula are divided into four count groups.

Group 1:

The first group consists of the chunks of 50 words with a score greater than 1.2 and less than 1.8. This score implies that the chunk of words is very well formulated. The total number of times this occurs is stored in a variable named counter1.

Group 2:

The second group consists of chunks with a score between 1.8 and 2.99, or between 0.81 and 1.2. A chunk with a score between 0.81 and 1.2 has fewer nouns than expected compared to the number of determiners, verbs, and conjunctions. A chunk with a score between 1.8 and 2.99 has more nouns than expected compared to the number of verbs, determiners and conjunctions. Yet this group is not considered gibberish, since it is just a bit off. The total number of times this occurs is stored in a variable named counter2.

Group 3:

The third group consists of chunks with a score between 3.0 and 3.99, or between 0.525 and 0.8. This is the borderline group: such chunks have either far more or far fewer nouns than expected compared to the verbs, determiners, and conjunctions, and their words lack proper structure. Arguably these are parts extracted from essays with human-assigned scores between 1 and 3. The total number of times this occurs is stored in a variable named counter3.

Group 4:

The fourth group consists of chunks whose score is 4.0 or greater, or 0.525 or below. This group is what we consider gibberish. The total number of times this occurs is stored in a variable named counter4.

Deciding the cutoff for the scoring formula:

The benchmarks and cutoffs for these scores were determined by running the scoring formula on the sample set of 500 essays available from the class website. All the essays were tagged and put into a single file. Groups of 50 words were taken in order of appearance, and the score was computed for each. The following results were obtained:


0.5429 0.5789 0.6222 0.6250 0.6316 0.6364 0.6429 0.6452 
0.6471 0.6562 0.6579 0.6579 0.6667 0.6970 0.6970 0.7059 
0.7073 0.7143 0.7188 0.7241 0.7241 0.7273 0.7273 0.7273 
0.7297 0.7333 0.7429 0.7500 0.7500 0.7667 0.7667 0.7667 
0.7692 0.7714 0.7714 0.7742 0.7742 0.7742 0.7778 0.7812 
0.7857 0.7879 0.7879 0.7895 0.7895 0.7931 0.7941 0.8000 
0.8000 0.8000 0.8056 0.8065 0.8065 0.8108 0.8125 0.8148 
0.8148 0.8148 0.8182 0.8182 0.8214 0.8235 0.8276 0.8276 
0.8276 0.8286 0.8333 0.8333 0.8333 0.8333 0.8333 0.8378 
0.8378 0.8387 0.8387 0.8387 0.8387 0.8387 0.8387 0.8400 
0.8400 0.8421 0.8421 0.8421 0.8438 0.8462 0.8462 0.8462 
0.8462 0.8478 0.8485 0.8485 0.8485 0.8485 0.8519 0.8519 
0.8529 0.8529 0.8529 0.8529 0.8529 0.8529 0.8571 0.8571 
0.8571 0.8571 0.8571 0.8611 0.8611 0.8611 0.8621 0.8621 
0.8621 0.8621 0.8621 0.8636 0.8649 0.8649 0.8649 0.8667 
0.8667 0.8667 0.8667 0.8667 0.8667 0.8684 0.8684 0.8710 
0.8710 0.8710 0.8710 0.8710 0.8718 0.8718 0.8718 0.8718 
0.8750 0.8750 0.8750 0.8750 0.8800 0.8800 0.8800 0.8824 
0.8846 0.8846 0.8846 0.8846 0.8846 0.8846 0.8846 0.8857 
0.8857 0.8857 0.8857 0.8889 0.8889 0.8889 0.8889 0.8889 
0.8889 0.8889 0.8889 0.8889 0.8889 0.8919 0.8919 0.8929 
0.8929 0.8929 0.8929 0.8929 0.8929 0.8929 0.8958 0.8958 
   .      .      .      .      .      .      .      .
   .      .      .      .      .      .      .      .
   .      .      .      .      .      .      .      . 
   .      .      .      .      .      .      .      .

2.1818 2.1818 2.1818 2.1875 2.1875 2.1875 2.1875 2.1875 2.1875 2.1875 2.1875 2.1875 2.1875 2.1905 2.1905 2.1905 2.2000 2.2000 2.2000 2.2000 2.2000 2.2000 2.2000 2.2000 2.2000 2.2083 2.2105 2.2105 2.2222 2.2222 2.2222 2.2222 2.2222 2.2222 2.2222 2.2273 2.2273 2.2308 2.2353 2.2353 2.2353 2.2353 2.2353 2.2353 2.2353 2.2353 2.2353 2.2381 2.2500 2.2500 2.2500 2.2500 2.2500 2.2500 2.2500 2.2500 2.2500 2.2632 2.2632 2.2632 2.2667 2.2667 2.2667 2.2778 2.2778 2.2857 2.2941 2.2941 2.2941 2.2941 2.2941 2.2941 2.3000 2.3125 2.3125 2.3125 2.3333 2.3333 2.3333 2.3333 2.3333 2.3333 2.3333 2.3333 2.3333 2.3333 2.3333 2.3333 2.3500 2.3529 2.3529 2.3529 2.3529 2.3529 2.3529 2.3529 2.3571 2.3636 2.3684 2.3684 2.3846 2.3846 2.3889 2.3889 2.4000 2.4000 2.4000 2.4000 2.4000 2.4000 2.4118 2.4118 2.4118 2.4211 2.4286 2.4286 2.4375 2.4375 2.4375 2.4375 2.4400 2.4444 2.4500 2.4500 2.4545 2.4545 2.4667 2.4706 2.4706 2.4762 2.5000 2.5000 2.5000 2.5000 2.5000 2.5000 2.5000 2.5333 2.5333 2.5333 2.5333 2.5333 2.5333 2.5385 2.5385 2.5556 2.5625 2.5625 2.5625 2.5714 2.5714 2.5882 2.6000 2.6000 2.6000 2.6111 2.6250 2.6250 2.6316 2.6471 2.6471 2.6667 2.6667 2.6923 2.7059 2.7143 2.7143 2.7143 2.7333 2.7692 2.7692 2.7857 2.7857 2.8125 2.8125 2.8333 2.8462 2.8462 2.8462 2.8462 2.8571 2.8750 2.9167 2.9286 2.9333 2.9444 3.0000 3.0000 3.0000 3.0000 3.0000 3.0000 3.0714 3.0769 3.0769 3.1429 3.1538 3.2143 3.2143 3.2308 3.2308 3.3077 3.3333 3.5000 3.5385 3.5833 3.6154 3.6364 3.6364 3.7000

    counter1: 2349 (between 1.2 and 1.8: very good formulation)
    counter2: 1779 (between 1.8 and 3.0, or between 0.8 and 1.2: ok formulation)
    counter3: 74   (between 3.0 and 4.0, or between 0.525 and 0.8: poor formulation)
    counter4: 0    (above 4.0 or below 0.525: gibberish)

gibberish percentage = (counter4 / (counter1 + counter2 + counter3)) * 100

The intermediate results represented by the period symbols are omitted, since they do not exhibit anything important. The cutoff for gibberish detection was decided from the above values. When gibberish essays were used in the experiments, the scores obtained were above 5, but we thought it would be safe to keep the cutoff at 4.

We ran some experiments on essays with scores of 6 and 5, and their resulting ratios belonged to Group 1 and Group 2.
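The grouping and the gibberish-percentage formula above can be sketched as follows (a minimal illustration; the function names are ours, not the system's):

```python
def classify_ratio(ratio):
    """Classify a noun-to-(determiners + conjunctions + verbs + adjectives)
    ratio into a formulation group, following the counter ranges above."""
    if 1.2 <= ratio <= 1.8:
        return "very good"
    if 0.8 <= ratio < 1.2 or 1.8 < ratio <= 3.0:
        return "ok"
    if 0.525 <= ratio < 0.8 or 3.0 < ratio <= 4.0:
        return "poor"
    return "gibberish"


def gibberish_percentage(ratios):
    """The report's formula: the gibberish count (counter4) over the sum of
    the three non-gibberish counters, times 100."""
    groups = [classify_ratio(r) for r in ratios]
    gib = groups.count("gibberish")
    non_gib = len(groups) - gib
    return 100.0 * gib / non_gib if non_gib else 100.0
```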

Drawbacks:

Since this method only takes ratios into consideration, it will fail to identify well-structured, grammatical gibberish phrases.

Rule based gibberish detection module:

In the rule based gibberish detection module, part-of-speech tagged files generated using the Brill Tagger are used to create a database of grammatical rules. The rules are based on a variable window size (we use sizes 3, 4, and 5).

Example:

Suppose we have the following dummy tags occurring in sentence form: AA BB CC DD AA

Then for window size 3, the following rules will be generated:

AA BB CC
BB CC DD
CC DD AA

For window size 4:

AA BB CC DD
BB CC DD AA

For window size 5: AA BB CC DD AA

These rules are stored in a rule database along with their frequency counts. Rules that occur two or fewer times are not included in the database, since they can be considered anomalies.
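The windowed rule extraction and frequency filtering just described can be sketched as follows (a minimal sketch; the function and variable names are ours):

```python
from collections import Counter


def extract_rules(tags, window_sizes=(3, 4, 5)):
    """Slide windows of each size over a tag sequence and count the
    resulting tag n-grams (the grammatical 'rules')."""
    counts = Counter()
    for n in window_sizes:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


def build_rule_db(tag_sentences, min_count=3):
    """Build the rule database from many tagged sentences, dropping
    rules seen 2 or fewer times (treated as anomalies)."""
    counts = Counter()
    for sent in tag_sentences:
        counts.update(extract_rules(sent))
    return {rule for rule, c in counts.items() if c >= min_count}
```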

When a student writes an essay and submits it through the web, the essay is tagged and then passed to the module for rule checking.

The system creates rule strings of sizes 3, 4, and 5 for the essay and compares them with the rules in the database. If a rule string is present in the database it is considered valid; if not, it is considered invalid.

Finally, the ratio of the number of invalid strings to the total number of strings is computed, and this result is the score from this module.
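The checking and scoring steps can be sketched as follows, assuming the rule database is held as a set of tag tuples (our representation, not necessarily the system's):

```python
def rule_score(essay_tags, rule_db, window_sizes=(3, 4, 5)):
    """Fraction of tag windows in the essay that are NOT found in the
    rule database (the module's invalid-string ratio)."""
    total = invalid = 0
    for n in window_sizes:
        for i in range(len(essay_tags) - n + 1):
            total += 1
            if tuple(essay_tags[i:i + n]) not in rule_db:
                invalid += 1
    return invalid / total if total else 0.0
```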

Drawbacks:

The rule database was generated from the tagged Associated Press Wire text using window sizes of 3, 4, and 5. The rule based system works fairly well, except when the user writes gibberish sentences in which all the words are nouns. In the rule database a sequence of 3, 4, or 5 noun tags is a valid rule, so when such a rule is matched against a sentence consisting solely of nouns, the system gives the sentence a good grade even though it is actually gibberish. We are working on another method to handle this special case.


=head2 Context words POS based Gibberish Detection module

Introduction

        This method uses the property that the keywords used in an essay generally occur in conjunction with their surrounding words in only certain grammatical patterns. In other words, a keyword will be preceded and followed by only specific parts of speech, i.e. you can't have just any word before and after a keyword, which is usually the case in gibberish.
        For example, the keyword 'president' is preceded by the word 'the' and followed by the word 'of' in the majority of instances ("the president of"). A gibberish sentence, being grammatically ill-formed, will probably not contain the phrase "the president of", but rather some gibberish words before and after "president". This method will thus detect such ill-formed occurrences of keywords and aid in finding gibberish.
        Although we could look at 'n' words before and after the keyword, test runs have shown that looking at more than 1 word on each side gives a lot of pattern mismatches, while looking at only 1 word on either side gives good performance. Hence we look at only 1 word on either side of the keyword.
        Also, looking for the exact phrase is expecting too much, so we match only the part of speech of the word on the left and right of the keyword instead of the words themselves. We build a patterns database from the tagged text corpus of the UMD essay database. The tagged corpus is searched for keywords, and each instance of a keyword is stored in a patterns database file along with the parts of speech of its surrounding words.
        A snapshot of the content of the patterns database is as follows:

XYZ JOINING/VBG DT
NNP DISCUSSED/VBD NNP
NNP DISCUSSED/VBD NNP
VB DISCUSSED/VBD VBP
JJ POISONOUS/NNP NNS
VB PREVENTED/VBD XYZ

Input to the system:

1) Name of the patterns database file

2) The part-of-speech tagged essay

Output of the system: the overall NON-gibberish content of the essay, expressed as a fraction between 0 and 1, where 0 means total gibberish and 1 means fully correct.

Example:

Input to the system: We discussed the plan today.

Tagged Input: (the tags are hypothetical) We/NNP DISCUSSED/VBD THE/VBP PLAN/IN TODAY/NNS

        On running the system, it will find DISCUSSED to be a keyword and check whether the POS on its left (NNP) occurs on the left in the database, and whether the POS on its right (VBP) occurs on the right in the database. We might restrict the left- and right-side matching to be within the same sentence in the database, if test results show improvement.

=head3 The scoring formula:
        Gibberish % = ((TotalKeyWrds UNMatched - NumGraceKeywrds)/ Total num of words ) * 100;
        Score returned = 1 - ((Gibberish %)/100);  #indicates non-gibberish content, 1 being totally correct

where,
TotalKeyWrds UNMatched = total number of keywords whose pattern did not match any pattern in the database
NumGraceKeywrds = the number of keywords allowed not to match; since we can't expect all patterns to match even in a well written essay, some percentage of keywords is not counted towards the gibberish score
Total num of words = total number of keywords in the essay, excluding the stop words

        By testing on a large sample of essays, we have arrived at a grace percentage of 5%.
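The scoring formula can be sketched as follows (a minimal sketch; we clamp the gibberish percentage at zero when fewer keywords go unmatched than the grace allowance):

```python
def gibberish_score(total_keywords, unmatched, grace_pct=5.0):
    """Apply the scoring formula above: unmatched keywords beyond the
    grace allowance count as gibberish; returns the non-gibberish
    fraction (1.0 = fully correct). grace_pct = 5 was found by testing."""
    if total_keywords == 0:
        return 1.0  # nothing to judge
    grace = total_keywords * grace_pct / 100.0
    gibberish_pct = max(0.0, unmatched - grace) / total_keywords * 100.0
    return 1.0 - gibberish_pct / 100.0
```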

Examples :

        Here are some sentences processed by the gibberish system which show all possible scenarios that occur.

A) Perfectly valid sentences being detected as non-gibberish

1) Secondly, it is very difficult to train the computers in all areas of human knowledge.

SECONDLY/NNP IT/PRP IS/VBZ VERY/NNP DIFFICULT/NNP TO/TO TRAIN/NNP THE/DT COMPUTERS/NNS IN/IN ALL/PDT AREAS/NNP OF/IN HUMAN/NNP KNOWLEDGE/NNP
Matched 7 : SECONDLY DIFFICULT TRAIN COMPUTERS AREAS HUMAN KNOWLEDGE
UNMatched 0 :
Gibberish Percentage : 0.00

2) This can be used in training humans and will give much better results.

THIS/NNP CAN/NNP BE/VB USED/VBD IN/IN TRAINING/NNP HUMANS/NNP AND/CC WILL/MD GIVE/VBP MUCH/NNP BETTER/NNP RESULTS/NNP
Matched 5 : BE TRAINING HUMANS BETTER RESULTS
UNMatched 0 :
Gibberish Percentage : 0.00

        Here are 2 valid sentences which result in matching of all keywords in the sentences, resulting in 
a gibberish percentage of 0 as expected. This shows that the gibberish method is capable of detecting valid sentences to be non-gibberish.

B) Totally gibberish sentences being detected as mainly gibberish

1) Higher Essay won Computer jumped online GRE automated States

HIGHER/JJR ESSAY/NNP WON/VBP COMPUTER/NN JUMPED/VBD ONLINE/NNP GRE/NNP AUTOMATED/VBD STATES/NNS
Matched 3 : HIGHER ESSAY STATES
UNMatched 5 : COMPUTER JUMPED ONLINE GRE AUTOMATED
Gibberish Percentage : 62.50

2) hit result mountain Wars Trek Yeti Jedi opinion unfair go students topics was grade not good XYZ

HIT/NNP RESULT/NNP MOUNTAIN/NNP WARS/NNP TREK/NNP YETI/NNP JEDI/NNP OPINION/NNP UNFAIR/NNP GO/VB STUDENTS/NNS TOPICS/NNP WAS/VBD GRADE/NNP NOT/NNP GOOD/JJ XYZ/NNP
Matched 5 : HIT RESULT WARS OPINION STUDENTS
UNMatched 8 : MOUNTAIN TREK YETI JEDI UNFAIR TOPICS GRADE XYZ
Gibberish Percentage : 61.54

        Here are two completely gibberish sentences. The gibberish detector detects most of the keywords as gibberish, since they don't occur in any sensible grammatical pattern. However, some keywords still escape as valid. This is largely because NNP keyword NNP is a valid sequence; this pattern allows many gibberish nouns to occur on the left and right of a given keyword and still be detected as valid. But by and large, gibberish sentences still result in more unmatched keywords than matched ones, which is what we want.

C) Some valid words detected as gibberish

Use of computers should be reduced and more stress should be given to include humans in the process of scoring essays

USE/NNS OF/IN COMPUTERS/NNS SHOULD/MD BE/VB REDUCED/VB AND/CC MORE/JJR STRESS/NNP SHOULD/MD BE/VB GIVEN/NNP TO/TO INCLUDE/NNP HUMANS/NNP IN/IN THE/DT PROCESS/NNP OF/IN SCORING/NNP ESSAYS/NNP
Matched 9 : COMPUTERS BE STRESS BE GIVEN HUMANS PROCESS SCORING ESSAYS
UNMatched 1 : REDUCED
Gibberish Percentage : 10.00

        As shown above, REDUCED is detected as gibberish, even though it's valid. A possible cause is the small size of the corpus used (the UMD essays corpus). Some unwanted mismatches of this kind are bound to happen. Hence we have the NumGraceKeywrds count, which denotes a percentage of total keywords that we allow to be flagged as gibberish. These are not counted towards the gibberish score. By testing, we have found that on average 5% of valid keywords are detected as gibberish, hence NumGraceKeywrds is set to 5%. This helps us fine-tune the system.

=head3 Conclusion:
        Although the system is not perfect, it performs remarkably well in detecting gibberish. By setting the excused keywords percentage, we are able to make up for erroneous mismatches made by the system. The method does a good job of detecting gibberish usage of keywords; some gibberish usages still escape through it, but this percentage is very small, and the method is very successful overall.

=head2 Combined Gibberish Detection:
        We have three gibberish detection methods based on three different approaches. Each method is good at detecting one kind of gibberish but fails to detect some other kind. Combined together, they perform very well.

We have combined the modules as follows:

1) The Rule based gibberish detection module gives two scores, based on different window sizes. Their average is taken as the single score of the Rule based module.

2) This leaves three scores, from the Ratio based, Rule based, and Context Keywords POS based modules.

3) The difference between the scores reported by each pair of modules is calculated.

4) The two modules whose scores are nearest (have the least difference) are selected.

5) The average of these two scores is the final score reported by the Gibberish Module as a whole.
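The combination steps above can be sketched as:

```python
from itertools import combinations


def combine_gibberish_scores(ratio_score, rule_scores, context_score):
    """Average the rule-based window scores into one, then average the
    pair of module scores that are closest to each other."""
    rule_score = sum(rule_scores) / len(rule_scores)
    scores = [ratio_score, rule_score, context_score]
    a, b = min(combinations(scores, 2), key=lambda p: abs(p[0] - p[1]))
    return (a + b) / 2.0
```

With the module scores of evaluation test case 1 below (0, 1, 1, 0.16), the closest pair is 0 and 0.16, giving 0.08, i.e. a 92% gibberish percentage.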

Irrelevance detection:

For detecting irrelevance, a sample set of good essays belonging to the same prompt are taken and each line in the student answer is compared to the sample set. This is done after the student essay has passed the gibberish test. Stop words are removed from the sample set and the student answer and unigrams are extracted from both. Then a search is conducted to find the number of unigrams in each line of the answer that are present in the sample set unigrams.

Precision and recall [18] are the measures that are used in evaluating the relevance between the sample set and the student answer. Recall is defined as the ratio of the number of unigram matches for a line found to the number of unigrams in the sample set. Precision is defined as the ratio of the number of unigram matches found to the number of unigrams used in a line in the student answer.

r = (#unigram_matches / #unigrams_sample) * 100
p = (#unigram_matches / #unigrams_sentence) * 100

where r is recall and p is precision. Both values are represented as percentages.

F-Measure is the harmonic mean of precision and recall. It is used to combine precision and recall into a single parameter. F- Measure is defined as

f = ((b^2 + 1) * p * r) / ((b^2 * p) + r)

where f is F-measure, p is precision, r is recall, and b is a constant that sets the relative weightage of p and r. When equal weightage is given to precision and recall in the calculation of F-measure, the value of b is taken as 1. Thus we get

f = 2 * p * r / (p + r)

A relevant essay (or a single sentence) will give a high F-measure value, because for a relevant sentence precision will be high, as there will be a greater number of unigram matches, and recall will also tend to go up.

After testing on various sample data for different prompts, it was found that relevant sentences had an F-measure greater than 0.70 and a precision greater than 50. Some long sentences could not satisfy the former condition even though they were relevant to the prompt; in such cases it was seen that when the number of unigrams in a sentence exceeded 10 and precision was greater than 2, the sentence was relevant. Sentences that pass these criteria are termed relevant, and the rest irrelevant. This system thus filters out relevant and irrelevant sentences. As the process takes place on a sentence by sentence basis, the structure and flow of the essay has no effect on it.
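Under the definitions above, the criterion can be sketched as follows (a minimal sketch; the functions take precomputed unigram counts, and the unigram extraction itself is omitted):

```python
def f_measure(matches, sample_unigrams, sentence_unigrams):
    """Precision and recall (as percentages) and balanced F-measure (b = 1)."""
    p = 100.0 * matches / sentence_unigrams
    r = 100.0 * matches / sample_unigrams
    f = 2.0 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def is_relevant(matches, sample_unigrams, sentence_unigrams):
    """The filtering criterion found by testing, including the
    long-sentence fallback as stated in the report."""
    p, r, f = f_measure(matches, sample_unigrams, sentence_unigrams)
    if f > 0.70 and p > 50:
        return True
    return sentence_unigrams > 10 and p > 2
```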

This stage determines whether a sentence is relevant; it does not look into the correctness of the essay. A relevant essay can be of two forms: one that answers all the needs of the prompt, and one that fails to do so. For example, an essay on benzene can be said to be relevant if the student answer deals with benzene and organic chemistry, and irrelevant if it deals with inorganic chemistry. The relevant essay may then contain correct or incorrect information about benzene. The irrelevance detection returns the relatedness of the essay to the main topic, together with the set of sentences in the student answer that are irrelevant to the prompt.

Sample essays set:

Sample essays were collected from various sources. Essays for a particular prompt written by students in the UMD essay corpus [9] with a score of 6 formed one sample essay set. Essays from www.wayabroad.com [19] were used for GRE sample essays and from www.testmagic.com [20, 21] for TOEFL essays. For more sample essays, Google was queried with the essay prompt and additional samples were taken from the web pages returned.

Out of all the essays retrieved, good ones were selected manually. Preference was given to essays with a 5 paragraph format (introductory statement, thesis statement, main ideas, supporting ideas and conclusion) and relevance to the prompt. Spelling and grammar corrections were made on the essays before adding them to the sample set. A sample set of related essays was thus generated for each prompt. The size of the sample set, i.e. the number of sample essays, differed between prompts, since more samples were found on the Internet for some prompts than for others.


System testing:

During the creation of the sample essay sets, some essays were kept aside to be used as test cases. An essay relevant to one prompt is irrelevant to another; in this way a large number of irrelevant test cases also came up. Some test essays were created by mixing sentences from relevant and irrelevant essays.

All three forms of test cases were run through irrelevance detection on various prompts to find a criterion for filtering relevant and irrelevant sentences. The filtering criterion was developed after running test cases on a few prompts; it was then tested on the other prompts, with positive results.

For more details read the irrelevance section in the evaluation report.

Finding Facts (fact extraction):

Essays are made up of opinions and facts. Opinions purely reflect the writer's thoughts, or those of others whom the writer is quoting. An opinion cannot be deemed wrong, since it is an individual's personal feeling towards the issue at hand. A fact, on the other hand, is like a proposition: it has to be true in all cases. A writer cannot be judged on his/her opinions, but can definitely be judged on the quality of the potential facts included in the essay.

For finding the facts in the essay, the first step is to remove the opinions. To do this, a list of opinion words was obtained from Janyce Wiebe's research [10]. A part of the list is given below:


=head3 Opinion Words
    accuse 
    all
    always 
    argue
    argues
    beautiful
    believe 
    castigate 
    chastise 
    clever
    comment 
    confirm 
    criticize 
    cute
    demonstrate 
    doubt 
    express 
    forget 
    frame 
    handsome
    idiot
    know 
    least
    like
    most
    never
    none
    opinion
    persuade 
    pledge 
    realize 
    reckon 
    reflect 
    reply 
    scream 
    should
    show 
    signal 
    slow
    stupid
    suggest 
    tend
    think
    understand 
    volunteer 
    will
    
Each sentence from the essay is checked for the presence of the opinion words. If the target
opinion words are present in the sentence then the sentence is no longer considered as a 
potential fact. In this manner all the expressional/opinion sentences are removed and the 
remaining sentences are then considered as potential facts.

These potential facts are then filtered further to keep only those sentences that contain some sort of statistics, numbers, dates, or uncommon nouns. The uncommon nouns are identified by a capitalized first letter. The sentences that survive this filtering are considered to be the facts.
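The two filtering passes can be sketched as follows (the opinion set here is a short excerpt of the list above, and the regular expressions for numbers and "uncommon nouns" are our approximations):

```python
import re

# short excerpt of the opinion word list above
OPINION_WORDS = {"accuse", "always", "argue", "believe", "doubt", "know",
                 "opinion", "should", "think"}


def is_potential_fact(sentence):
    """Drop sentences containing any opinion word."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return not (words & OPINION_WORDS)


def is_fact(sentence):
    """Keep potential facts containing a digit or an uncommon noun
    (approximated here as a capitalized word not at sentence start)."""
    if not is_potential_fact(sentence):
        return False
    has_number = bool(re.search(r"\d", sentence))
    has_uncommon_noun = bool(re.search(r"(?<=\s)[A-Z][A-Za-z]+", sentence))
    return has_number or has_uncommon_noun
```

Note that, as the drawback below describes, a sentence like "It rained yesterday" passes the opinion filter but fails the fact filter.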

Drawbacks:

This system will not be able to identify facts like ``It rained yesterday'' or ``it is sunny today'', since these sentences do not contain any statistics, numbers, dates, colors, or uncommon nouns. The inability of the module to recognize such facts is not necessarily a drawback: even if these sentences were identified as facts, their validity would be hard to measure, since no date or city name is provided against which the fact could be verified.


=head2 Fact Verification

Introduction

The fact checker uses Google as a source of information to verify facts. The facts that are verified can be broken into two types.

1) Facts of fewer than 10 words.

2) Facts of more than 10 words.

This 10 word boundary was required because Google restricts the search string to 10 words or less.

To verify a fact of fewer than 10 words, the program first queries Google through the Google API, with the fact as a quoted string.

Take for example the fact, ``Ferrari makes cars''.

Since this fact is less than 10 words, the program will query it as the quoted search string ``Ferrari makes cars''. When given a quoted search term, such as ``good day'', Google restricts its search to web pages that contain the string in exactly that syntactic structure and order. The number of hits returned by Google for this particular search string is 51, which tells us that 51 web pages contain the sentence ``Ferrari makes cars''. So it can be stated with a fair amount of confidence that Ferrari does indeed make cars.

However, this method is biased towards common sentence structures. For example, change the sentence ``Ferrari makes cars'' to ``Ferrari creates cars''. Both sentences have the same meaning; Ferrari is indeed an automobile manufacturer. However, when querying Google with the quoted sentence ``Ferrari creates cars'', the number of hits returned is 2, which is a very small number compared to the 51 hits returned by the query ``Ferrari makes cars''.

To overcome this problem, a hits cutoff was established. If the number of hits returned for a particular query is less than 50, then the same string is queried without quotes. Since the query ``Ferrari creates cars'' returned only 2 hits, Google is now queried with the unquoted string,

Ferrari creates cars

By doing this, the search is not restricted to the exact syntactic structure of the string. The results are only restricted to the individual words of the sentence, occurring together or as separate words.

Next, the snippets for each of the top 10 resulting web pages are retrieved. These snippets are parsed, the HTML tags are removed, and the snippets are broken into separate sentences. The program then checks how many of the resulting snippet sentences contain two or more words of the search string.

The total number of snippet sentences containing at least two words of the search string is counted, along with the total number of sentences in the snippets. In addition, since the count of web pages containing the exact search string is available, that count is added to the number of matching snippet sentences. The ratio of these two counts is then calculated.


That is,

Correctness Ratio = (Number of snippet sentences containing at least two search terms) / (Total number of sentences in the snippets)


If the resulting ratio is greater than 0.5, the fact is tagged as correct; otherwise it is tagged as incorrect.

The intuition behind the 0.5 ratio is that if at least half the sentences returned by Google contain some of the words in the fact, these words can be said to be related; in other words, they did not occur together by chance. The underlying conclusion is that if a large source of text, in this case the World Wide Web, contains the search terms together in a sentence in more than 50% of its top 10 results, the fact is likely to be true.
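Under the description above, the ratio and decision can be sketched as follows (the Google querying itself is omitted; the hit count and snippet sentences are passed in):

```python
def correctness_ratio(exact_hits, snippet_sentences, query_words):
    """Matching snippet sentences (sharing at least two words with the
    query) plus the exact-phrase page count, over total snippet sentences."""
    qwords = {w.lower() for w in query_words}
    matching = sum(
        1 for s in snippet_sentences
        if len(qwords & {w.lower() for w in s.split()}) >= 2
    )
    total = len(snippet_sentences)
    if total == 0:
        return 1.0 if exact_hits > 0 else 0.0
    return (matching + exact_hits) / total


def fact_is_correct(exact_hits, snippet_sentences, query_words, cutoff=0.5):
    # more than 50 exact hits: accept without inspecting snippets
    if exact_hits > 50:
        return True
    return correctness_ratio(exact_hits, snippet_sentences, query_words) > cutoff
```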

Next, if a sentence contains more than 10 words, Google will not accept the entire string. Hence, the program tries to identify the subjects and objects of the factual sentence. Take for example the following sentence:

``Automated essay scoring is being used these days to score essays written by students in tests such as GRE and TOEFL''.

The Brill POS tagger is run on the sentence to identify the POS tag of each word in the sentence. Next, bi-grams are created with a window size of 1 such as the following,

* Automated Essay

* Essay Scoring

* Scoring is etc.

The reason for choosing bi-grams as search strings instead of unigrams is simple: the search string should contain as many informative words as possible. A tri-gram model would be more specific, but the reason for splitting up the sentence is to reduce the size of the search string to fewer than 10 words. The bi-gram model strikes a nice balance between these two concerns.

Next, any bi-gram which contains only stop words is removed. Hence, in the example given above, bi-grams like ``on the'' and ``such as'' are removed from our list of bi-grams. Next, the bi-grams which contain keywords are identified. Keywords in our case are defined as being nouns, verbs and adjectives.

For the above fact, some of the POS tagged bi-grams are given below:

* Automated/NNP essay/NN

* essay/NN scoring/VBG

* as/IN GRE/NNP

* and/CC TOEFL/NNP

Next, the bi-grams which obey the following rules or properties are identified:

1) Noun/NN followed by a Proper Noun/NNP

2) Proper Noun/NNP followed by a Noun/NN

3) Noun/NN followed by a Verb/VB

4) Verb/VB followed by a Noun/NN

5) Adjective/JJ followed by a Noun/NN

6) Noun/NN followed by an Adjective/JJ

7) IN/DT/PREP followed by a Proper Noun/NNP

Rule 7 is a special case: only the proper noun is used in the search string.

Therefore, for the given fact above, the search string will be:

``Automated essay'' AND ``essay scoring'' AND ``GRE'' AND ``TOEFL''.

A Google search for this string returns 33 hits. Had the number of hits been greater than 50, the sentence would automatically be tagged as a correct fact. In this case, however, the number of hits returned is below the cutoff of 50, so the search terms are compared against the returned snippets and the correctness of the fact is determined as described earlier.
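The bigram selection and query construction for long facts can be sketched as follows (a sketch: the tag sets are our reading of the seven rules, and we additionally admit CC on the left of a proper noun so that the ``and TOEFL'' bigram in the walkthrough also yields its proper noun):

```python
NOUN = {"NN", "NNS"}
PROPER = {"NNP", "NNPS"}
VERB = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
ADJ = {"JJ", "JJR", "JJS"}
# rule 7 tags; CC added as an assumption so "and TOEFL" yields TOEFL
FUNC = {"IN", "DT", "PREP", "CC"}


def build_search_string(tagged):
    """Keep (word, tag) bigrams matching the rules above and join the
    quoted terms with AND."""
    terms = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if ((t1 in NOUN and t2 in PROPER) or (t1 in PROPER and t2 in NOUN)
                or (t1 in NOUN and t2 in VERB) or (t1 in VERB and t2 in NOUN)
                or (t1 in ADJ and t2 in NOUN) or (t1 in NOUN and t2 in ADJ)):
            terms.append('"%s %s"' % (w1, w2))
        elif t1 in FUNC and t2 in PROPER:
            terms.append('"%s"' % w2)  # rule 7: only the proper noun
    return " AND ".join(terms)
```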

Finally, it can also be said that by using Google for the queries, our system is indirectly performing an LSA-style dimensionality reduction step. If the World Wide Web is thought of as a large corpus of text, then when queries are issued, Google returns the web pages most relevant to the search term. Similarly, in LSA the dimensionality reduction step brings related concepts closer together. Therefore, when the system finally looks for co-occurrences of words in the snippets, it is indirectly looking at the most relevant concepts. This enables us to avoid complex mathematical steps such as LSA's dimensionality reduction.


OVERALL PROPOSED SOLUTION:

Gibberish Detection:

Rule based: Hemal Lal

Ratio based: Ravindra Bharadia

Context words POS based: Sameer Atar

Irrelevance:

Aditya Polumetla

Hemal Lal

Fact finding:

Pratheepan Raveendranathan

Fact Checking:

Pratheepan Raveendranathan

Web:

CGI scripting & building infrastructure: Pratheepan Raveendranathan, Hemal Lal

Designing & branding: Ravindra Bharadia


SCORING METHOD

The overall score is based on a scale of 0-6. Since there is more than one module for gibberish, the least difference method is used to score this module: the difference between each pair of sub-module scores is found, and the pair whose difference is least is chosen, meaning the scores from both sub-modules are close.

The Gibberish module and the Irrelevance module together represent 4 of the maximum 6 points, 2 points for each module.

Since there are not many facts in an essay, facts hold no more than 1 point, while the remaining 1 point is given to essay complexity. If there are no facts in the essay, the student gets that 1 point by default.

The sum of all these modules gives the final score for the essay.
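The point allocation can be sketched as (a minimal sketch, with each module score in [0, 1]):

```python
def overall_score(gibberish, irrelevance, facts, complexity, has_facts=True):
    """Combine module scores on the 0-6 scale: 2 points each for
    gibberish and irrelevance, 1 each for facts and essay complexity.
    With no facts present, the fact point is granted by default."""
    fact_pts = 1.0 if not has_facts else facts
    return 2.0 * gibberish + 2.0 * irrelevance + fact_pts + complexity
```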


EVALUATION

Gibberish Module

Test case 1 (Totally Gibberish):

Passage:

trophy tackler championship terry varsity. nfl team, halfback indisputably! windsor touchdowns football player. rozelle fans coach eluded coaches quarterback halfbacks. players teams bleachers scored league punted nevers sayers stadium wrigley! lineman leagues tubby fullback basketball brundage. steelers bleacher touchdown standlee shaughnessy nagurski luckman. kmetovic gallarneau countersmash bronko kicker athlete soccer kickoff rookie rimet fifa.

Essay Gibberish Percentage: 92%


Module 1 = 0
Module 2 = 1
Module 3 = 1
Module 4 = 0.16

Observation:

Since the passage is totally gibberish, the system correctly declares it so. Modules 1 and 4 contribute to the final score, since the difference between their scores is the least. The computed score is 92%, from 100% - ((0 + 0.16)/2) x 100%. The passage is scored gibberish since it totally deviates from and breaks all the rules of grammar. This is the kind of gibberish where a student simply lists words in the hope that the system will look for words relevant to the topic at hand.


Test case 2 (Partially Gibberish):

Passage:

elath suez zionist yemeni jerusalem hebrew. jewish 1187 moslems moslem aramaean amorites fight nebo saudia yarmuk ommiads meccan amman yom kippur canaan zab sarijar. samarra ramadi litani akcay ghab dokan refugee hussein! canaanites israelite turkey 1097 semitic assad islam syrian mohammed weismann technikon sejera histadrut saladin aviv kfar 830 ibn egypt abu muhammad

``Man is a social animal. It is very much true that that one's character, his attitude, his line of thought and his actions are influenced by others. But ''who influences his whole being`` is what we tend to discuss here. The above statement-''Many people believe that a few individuals or small groups (family, friends, teachers, celebrities, for example) have caused them to think and behave in the way they do. Yet it is always society as a whole that defines us and our attitudes, not a few individuals.`` - indicates an extremist attitude where it is indicated that it is ''always society`` as a whole which is the defining factor in shaping a person. And it is in this extremist nature of the statement that I choose to disagree.


Essay Gibberish Percentage: 49%
      
Module 1 = 0.6
Module 2 = 1
Module 3 = 0.8
Module 4 = 0.41

Observation:

This passage is semi-gibberish: the first portion is trivially gibberish, while the later portion is not gibberish at all. The score from the system is 49%, correctly indicating that the passage is semi-gibberish. In this case too, modules 1 and 4 contribute to the final score. Modules 2 and 3 are rule based and suffer in this case and the previous one, since most of the tag bi-grams and tri-grams in the gibberish portions are NN-NN-NN, which is actually a valid rule for trigrams such as President-George-Bush or Sir-Richard-Branson. Modules 2 and 3 should come into the picture when there are subtler cases of gibberishness.


Test case 3 (Not totally syntactically gibberish):

Passage:

Foo boo to moo and black sofa invented space cabbage cabbage in zoo moo poo hoo hoo. what is what and what is what is what? do you what is know meow hoo hoo. when cabbage eats cabbage then it eats what eats what? today is meow and tomoro is hohohohohoh. why is kobayashi kobayashiand meow is benzene. when cabbage went weekly to monthly to moon is when picnic went kuku na paka.

by the way cutting is cutting and utting is cutting. kantai can tai a tai and untie a tai why cant i tie a tai and untie a tai as kantai can tie a tai and untie a tie. he ran and ran and ran and ran till cabbage caught the baseball in the baseball. why are we cabbaging the cabbage with new and old bus in yahoo and yahoo? we should kick the kick with kick and a kick but kicks and cabbage is cabaage kick with cutting the cheese. When hary met sally and sally met kantai, but kantai can tie a tie and untie a tie but why can't i tie a tie and untie a tie as kantai can tie a tie and untie a tie. god knows why.

Essay Gibberish Percentage: 20%


Module 1 = 1
Module 2 = 0.6
Module 3 = 0.6
Module 4 = 0.14

Observation:

This essay is gibberish semantically rather than syntactically. Module 1 suffers badly, since the ratio of nouns to (determiners + conjunctions + verbs + adjectives) falls within 0.5 and 4.0; the number of nouns does not stand out from the rest of the tags as it did in tests 1 and 2. Modules 2 and 3 do account for the gibberish sentences, but module 4 provides the ideal result, namely that the essay is around 90% gibberish. Unfortunately the scoring mechanism does not consider the input from module 4, since the difference in scores between module 1 and the average of modules 2 & 3 is lower than the differences involving module 4.


Test case 4 (non gibberish):

Passage:

People may have two choices to eat, either they go out to fast food stands or restaurants, or they prepare food at home, whatever suitable to them. In my case I prefer to go out to eat, as it is easy to get, it saves my time, and I can try variety of interesting food of different countries. Being a working person, with all day long office work and driving long way, it becomes difficult to do all preparation for making food. For me easy way to get food is restaurant, where I can get prepared food at home or office by just ordering on phone, Along with that another comfort is, that when ever I have to eat together with my so many friends, I can always go to a restaurant, otherwise it's difficult to prepare food at home for so many people and don't get time to talk and having fun. So I always find it a easier way to eat out, apart from that It make my other outdoor activities possible because I don't have to bother about food wherever I go, to any fun place or theater or traveling, restaurants are always there throughout city and it becomes easy every time to get food whenever and whenever I need according to other activities.

Essay Gibberish Percentage: 5%


Module 1 = 1
Module 2 = 0.8
Module 3 = 1
Module 4 = 0.8

Observation:

The above passage is taken from www.testmagic.com and was given a score of 5. All the gibberish detection modules correctly identify this passage as non-gibberish, or only very remotely gibberish (5%). This shows that none of the modules incorrectly declare a passage gibberish when it is not. If the gibberish detection module fails to catch a gibberish sentence or passage, it will be caught by the irrelevance detection and the student will be penalized there. Giving students the benefit of the doubt, none of the modules will penalize them for something that is not gibberish.

Irrelevance

Evaluation of the Irrelevance Detection Module

For an essay written for the prompt -

``Automated essay scoring is unfair to students, since there are many different ways for a student to express ideas intelligently and coherently. A computer program can not be expected to anticipate all of these possibilities, and will therefore grade students more harshly than they deserve. Discuss whether you agree or disagree (partially or totally) with the view expressed providing reasons and examples.''

The following log was created (along with the various scores) for an essay that included all kinds of sentences. The sentences shown are boundary detected, with all punctuation removed, and converted to upper case.


IRRELEVANCE LOG

Number of unigrams in boundary detected sample file: 1301
Threshold value: 0.70


Line:IT USES COMPUTERS TO GRADE ESSAYS
Unigrams in boundary detected sentence: 4
Number of matches: 4
P = 100 R = 1.88679245283019 FM = 3.7037037037037
RELEVANT

Line:AS COMPUTERS ARE USED IT REDUCES LOAD ON HUMANS
Unigrams in boundary detected sentence: 4
Number of matches: 4
P = 100 R = 1.88679245283019 FM = 3.7037037037037
RELEVANT

Line:AUTOMATED ESSAY SCORING IS BEING USED THESE DAYS TO SCORE ESSAYS WRITTEN BY STUDENTS IN TESTS SUCH AS GRE AND TOEFL
Unigrams in boundary detected sentence: 12
Number of matches: 10
P = 83.3333333333333 R = 4.71698113207547 FM = 8.92857142857143
RELEVANT

Line:THIS IS NOT THE CASE WITH THE COMPUTER THAT WILL PURELY WORK OFF ALL THE AVAILABLE EVALUATING FACTORS IRRESPECTIVE OF THE TIME OR SITUATION
Unigrams in boundary detected sentence: 10
Number of matches: 8
P = 80 R = 0.614911606456572 FM = 1.22044241037376
RELEVANT


COMMENTS: The above four sentences were taken from an essay written for the mentioned prompt, which received a manual score of 6. They pass the criterion of precision (P) > 50 and f-measure (FM) > 0.7 and are thus marked as relevant. When precision equals 100 and the f-measure is very high, it can be said that the sentence is highly relevant and addresses the requirements of the prompt.
In some cases a sentence may contain relevant words that were not found in the sample set; this lowers the precision value a bit, but does not affect the outcome if the rest of the sentence matches the sample set.
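The per-sentence numbers in the log follow the standard precision/recall/f-measure definitions computed over unigrams. A sketch of the computation (the function and variable names are ours; the P > 50 and FM > 0.7 criterion is from the text):

```python
def sentence_relevance(sentence_unigrams, sample_unigrams,
                       p_cutoff=50.0, fm_cutoff=0.70):
    """Score one boundary-detected sentence against the unigram set
    collected from the sample essays, then apply the relevance criterion:
    precision above p_cutoff AND f-measure above fm_cutoff."""
    matches = sum(1 for w in sentence_unigrams if w in sample_unigrams)
    p = 100.0 * matches / len(sentence_unigrams)       # precision
    r = 100.0 * matches / len(sample_unigrams)         # recall
    fm = 2 * p * r / (p + r) if (p + r) > 0 else 0.0   # f-measure
    relevant = p > p_cutoff and fm > fm_cutoff
    return p, r, fm, relevant
```

With all 4 unigrams of a short sentence matching a sample set of a few hundred unigrams, this yields the characteristic pattern in the log: precision of 100, tiny recall, and an f-measure of a few points, clearing both thresholds.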

Line:WHAT IS SO IMPORTANT ABOUT BOOKS IS THAT YOU CAN LEARN FROM WHAT OTHERS HAVE ALREADY LEARNED
Unigrams in boundary detected sentence: 4
Number of matches: 3
P = 75 R = 0.230591852421214 FM = 0.459770114942529
IRRELEVANT

Line:SO THERE IS NO NEED TO REINVENT THE WHEEL
Unigrams in boundary detected sentence: 2
Number of matches: 0
P = 0 R = 0 FM = 0
IRRELEVANT

Line:THAT IS YOU CAN OBTAIN OTHER PEOPLES KNOWLEDGE RECORDED IN FORM OF BOOKS
Unigrams in boundary detected sentence: 6
Number of matches: 4
P = 66.6666666666667 R = 0.307455803228286 FM = 0.612088752869166
IRRELEVANT

Line:INFACT PROMINENT WRITER LALA HARDAYAL ADVISED TO READ ON EVERY TOPIC RANGING FROM GEOGRAPHY TO CHEMISTRY TO LOGIC TO MATHEMATICS SO THAT ONE CAN HAVE A WIDER KNOWLEDGE BASE
Unigrams in boundary detected sentence: 16
Number of matches: 6
P = 37.5 R = 0.461183704842429 FM = 0.911161731207289
IRRELEVANT

Line:BOOKS ARE AVAILABE ON EVERY CONCEIVABLE TOPIC
Unigrams in boundary detected sentence: 4
Number of matches: 4
P = 100 R = 0.307455803228286 FM = 0.613026819923372
IRRELEVANT

Line:THEY ARE THE OCEAN OF KNOWLEDGE
Unigrams in boundary detected sentence: 2
Number of matches: 1
P = 50 R = 0.0768639508070715 FM = 0.153491941673062
IRRELEVANT


COMMENTS: The above sentences were taken from an essay that was written for a totally different prompt and are thus irrelevant to the given prompt. It can be seen clearly that the f-measure values are low, and even in the cases where a sentence passes the set f-measure threshold of 0.7, its precision is still below 50. Thus the filtering criterion effectively removes the irrelevant sentences.

Line:RESEARCH IN INFORMATION TECHNOLOGY OR ELECTRONICS NOT ONLY HELPS PEOPLE FOR THEIR EMPLOYMENT PROBLEM SOLVE BUT ALSO HELP IN ADVANCEMENT OF THEIR LIFE STYLE DECREASES THE DISTANCES AND EASY TRANSPORTATION Unigrams in boundary detected sentence: 16 Number of matches: 7 P = 43.75 R = 1.60919540229885 FM = 3.10421286031042 RELEVANT

Line:BY REQUIRING COLLEGE STUDENTS TO TAKE CLASSES IN MULTICULTURAL STUDIES WE CAN MOVE TOWARDS PROVIDING OUR STUDENTS WITH A COMPLETE EDUCATION THAT PREPARES THEM Unigrams in boundary detected sentence: 11 Number of matches: 6 P = 54.5454545454545 R = 0.461183704842429 FM = 0.914634146341463 RELEVANT


COMMENTS: The above two sentences show the effect of very long sentences on irrelevance detection. Even though they were taken from an essay written for a different prompt and are clearly irrelevant to the given prompt, they filter out as relevant. This is because in a long irrelevant sentence some words may creep in that match the sample set and are somewhat related to it, which pushes the f-measure value above the threshold.

Fact Finding/Checking

Fact Identification and Verification

The following sentences were used to evaluate the Fact Finder and Fact Checker modules.

Fact 1: ``A human brain is used at only a fraction of its full potential.'' (EXPECTED TO BE CORRECT)

OBSERVED - Statement not identified as fact.

Discussion:

The fact finder module did not identify this statement as a fact. This is because the sentence contains an opinion word, ``potential'', and this is where the idea of using opinion words to filter out sentences fails. However, if this fact had been stated as

``A human brain uses only 3% of its full potential''

The fact finder module would have identified the above statement as a fact, due to the existence of a statistic.

        
Fact 2: "The world is two thirds water." (EXPECTED TO BE CORRECT)

OBSERVED - Statement not identified as fact.

Discussion:

This is a statement of fact; however, the fact finder module did not identify it. The sentence did pass the opinion word test, but since it contained neither a statistic nor a capitalized noun it was not identified as a fact.
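Putting the discussions of Facts 1 and 2 together, the fact finder's filter can be sketched as below. The opinion word list and the helper names are invented for illustration (the actual module's lists are not given in this report); the ordering, with a statistic qualifying a sentence outright, is inferred from the discussion of the rephrased Fact 1.

```python
import re

# Hypothetical sample of the module's opinion-word list.
OPINION_WORDS = {"potential", "think", "feel", "believe", "good", "bad"}

def is_fact_candidate(sentence):
    """Sketch of the fact finder: a statistic (any number) qualifies a
    sentence as a fact outright; otherwise the sentence must survive the
    opinion-word filter and contain a capitalized noun beyond the first
    word."""
    words = sentence.strip().split()
    if any(re.search(r"\d", w) for w in words):
        return True                                   # statistic found
    if any(w.lower().strip('.,;:"') in OPINION_WORDS for w in words):
        return False                                  # opinion word (Fact 1)
    return any(w[:1].isupper() for w in words[1:])    # capitalized noun, else reject (Fact 2)
```

This reproduces the observed behavior: Fact 1 is rejected via ``potential'', its rephrasing with ``3%'' is accepted via the statistic, Fact 2 is rejected for lacking both cues, and Fact 3 is accepted via the capitalized ``Nixon''.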


Fact 3: "Richard Nixon was involved in the Watergate scandal." (EXPECTED TO BE CORRECT)

OBSERVED - Richard Nixon was involved in the Watergate scandal: Correct Fact. Correctness to Sentences Ratio : 0.64 (Ratio greater than 0.5, tagging fact as Correct)

Discussion:

This statement was correctly identified as a factual sentence. The capitalized nouns ``Richard'',``Nixon'', and ``Watergate'' were matched.

Next, the factual statement was identified as a correct fact by the module.
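The verdicts in the log suggest a simple decision rule over the search evidence. A speculative sketch (only the 0.5 cutoff and the zero-hits behavior appear in the log; the function shape and the unverifiable case are assumptions):

```python
def classify_fact(hits, correctness_ratio=None, threshold=0.5):
    """Sketch of the fact checker's verdict: no supporting hits at all
    tags the fact as incorrect; otherwise the correctness-to-sentences
    ratio decides, with values above the threshold tagged correct and a
    missing ratio treated as unverifiable."""
    if hits == 0:
        return "Incorrect fact"
    if correctness_ratio is None:
        return "Unable to verify"
    return "Correct fact" if correctness_ratio > threshold else "Incorrect fact"
```

With a ratio of 0.64, as for Fact 3, the verdict is ``Correct fact''; with zero hits, as for Fact 4, it is ``Incorrect fact''.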


Fact 4: "Richard Nixon was not involved in the Watergate scandal." (EXPECTED TO BE INCORRECT)

OBSERVED - Richard Nixon was not involved in the Watergate scandal : Incorrect fact Number of hits:0

Discussion:

This statement was correctly identified as a factual sentence. The capitalized nouns ``Richard'',``Nixon'', and ``Watergate'' were matched.

Next, the factual statement was identified as an incorrect fact, which is correct. The negation was correctly identified.


Fact 5: "Malcom X was a civil rights leader." (EXPECTED TO BE CORRECT)

OBSERVED - Malcom X was a civil rights leader : Correct Correctness to Sentences Ratio : 0.66 (Ratio greater than 0.5, tagging fact as Correct)

Discussion:

This statement was correctly identified as a factual sentence; the capitalized proper noun ``Malcom'' was matched.

Next, the factual statement was identified as a correct fact by the module.


Fact 6: "The Declaration of Independence was written in the year JULY 4 1776." (EXPECTED TO BE CORRECT)

OBSERVATION - The Declaration of Independence was written in the year JULY 4 1776:Correct

Discussion:

This statement was correctly identified as a factual sentence. The statistic 1776 qualifies this statement as a factual statement.

Next, the factual statement was identified as a correct fact by the module.


Fact 7: "Tanganyika was ruled by Germans." (CORRECT)

OBSERVED - Tanganyika was ruled by Germans : Incorrect fact Number of hits:0

Discussion: This statement was correctly identified as a factual sentence. The capitalized proper noun matched.

Next, this statement was incorrectly tagged as an incorrect fact.


Fact 8:"Swahili is spoken in East Africa." (CORRECT)

OBSERVED - Swahili is spoken in East Africa: Correct Fact Correctness to Sentences Ratio : 1.14 (is greater than 0.5)

Discussion: This statement was correctly identified as a factual sentence. The capitalized proper noun matched.

Next, the factual statement was identified as a correct fact by the module.


Fact 9:"The bombing of Pearl Harbor by the Japanese to initiate United States entrance into the war." (CORRECT)

OBSERVED - The bombing of Pearl Harbor by the Japanese to initiate United States entrance into the war: Unable to verify

Discussion: This statement was correctly identified as a factual sentence. The capitalized proper noun matched.

Next, this fact was identified as not verifiable.


Fact 10:"Noam Chomsky was a rationalist." (CORRECT)

OBSERVED - Noam Chomsky was a rationalist :Incorrect fact Number of hits:0

Discussion:

This statement was correctly identified as a factual sentence. The capitalized proper noun matched.

Next, the factual statement was incorrectly identified as an incorrect fact by the module.


Fact 11:"Noam Chomsky was an empiricist." (INCORRECT)

OBSERVED -Noam Chomsky was a empiricist : Incorrect fact Number of hits:0

Discussion:

This statement was correctly identified as a factual sentence. The capitalized proper noun matched.

Next, the factual statement was correctly identified as an incorrect fact by the module.

In conclusion, the module seems to perform reasonably well: out of 11 statements of fact, the module identified 9 as factual. Of these 9 factual statements, 5 matched the expected value.


SYSTEM EVALUATION


Test 1: Essay with ideal score of 1 - 1.5

Essay Total SCORE: 4.00 pts

Essay Gibberish Percentage: 0%


Module 1 = 1
Module 2 = 0.6
Module 3 = 1
Module 4 = 1.00

Essay Irrelevance Percentage: 68.42%


Module 1 = 21 hits from possible of 76 hits
Module 2 = Average F measure is 1.92
Module 2 = F-Score is 31.58 relevance
Essay Facts Correctness: 100 %
Essay Complexity: 60.00 %

Question: Automated essay scoring is unfair to students, since there are many different ways for a student to express ideas intelligently and coherently. A computer program can not be expected to anticipate all of these possibilities, and will therefore grade students more harshly than they deserve. Discuss whether you agree or disagree (partially or totally) with the view expressed providing reasons and examples.

Answer: Automatic Esay scoring is unfair to sudents. computer hav advanced and people want computer to everything. Computers are smart but it cannot reach the speed of human mind. essay expreses many ideas and computer cannot apreciate all this ideas.

Imagaine if i write my essay in poetic form how will computer understand wat i want to tell. how is computer know all the poetry. it cannot tell aboutmy good way of writing.supos some other student wrote same essay in simple english but my essay is using good language and that essay is simple.

Now we have been brot up by learning from teacher. we know teacher likes what kind of essay. we rite the essya in a fashion and the teachar likes and now how do we know based on what automatic essay scorer scores. Did the maker tell us what it consider when marking essay. we are in blackout.

Human essay evaluator can be forgiving. he/she can be lenient. if the content of essay is good then the evaluator is bit of allowing spelling mistakes and gramer mistakes. When we rite essay we want to tell what we thinkin. then content is more important but atomatic essay evaluator will not look at thiscontent and pick on spelling mistakes and grade hard

In conclusion i want to say that computer are good but not in essay checking.in essay checking the grade sudent hard and they are not sharp like human.

Irrelevant Sentences:

1. automatic esay scoring is unfair to sudents (1.84331797235023)
2. computer hav advanced and people want computer to everything (0.921658986175115)
3. computers are smart but it cannot reach the speed of human mind (1.8348623853211)
4. imagaine if i write my essay in poetic form how will computer understand wat i want to tell (3.61990950226244)
5. supos some other student wrote same essay in simple english but my essay is using good language and that essay is simple (2.71493212669683)
6. now we have been brot up by learning from teacher (0)
7. we rite the essya in a fashion and the teachar likes and now how do we know based on what automatic essay scorer scores (1.8018018018018)
8. did the maker tell us what it consider when marking essay (0.921658986175115)
9. we are in blackout (0)
10. he she can be lenient (0)
11. if the content of essay is good then the evaluator is bit of allowing spelling mistakes and gramer mistakes (2.72727272727273)
12. when we rite essay we want to tell what we thinkin (0.925925925925926)
13. in essay checking the grade sudent hard and they are not sharp like human (2.73972602739726)

Discussion: The above essay should ideally be given a score of 1 - 1.5. Our system gives it a score of 4 since there are no gibberish sentences (2 points) and there are no facts (1 point by default, as the student cannot be penalized for not including them). If incorrect facts were included in the essay they would definitely contribute towards a loss of points. The additional 1 point is the combination of the relevance score (out of 2) and the complexity score (out of 1); on both of these final criteria the student scored quite low and only managed a total of 1 out of 3 points. Many sentences that show up as irrelevant are actually relevant, but they fell into this category because they contain many spelling and grammatical mistakes. These erroneous sentences cannot be considered gibberish, but they are ill formed; if the words relevant to the issue at hand are misspelled, the relevance module cannot cross-check them with the sample essays and simply considers them irrelevant.
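The point breakdown described in these discussions implies the following total out of 6. This is our reconstruction, not the system's actual code; it reproduces the system scores of Test 4 (5.63) and Test 5 (4.53) exactly, while Test 1's reported 4.00 suggests the real system also applies some rounding we do not attempt to model.

```python
def total_score(gibberish_pct, relevance_pct, facts_pct, complexity_pct):
    """Combine module percentages into the 6-point total: 2 points for
    being non-gibberish, 2 for relevance, 1 for fact correctness (full
    marks by default when no facts are present), and 1 for complexity."""
    return (2.0 * (100 - gibberish_pct) / 100
            + 2.0 * relevance_pct / 100
            + 1.0 * facts_pct / 100
            + 1.0 * complexity_pct / 100)
```

For Test 4 (0% gibberish, 0% irrelevance so 100% relevance, 100% facts, 63.33% complexity) this gives 2 + 2 + 1 + 0.633 = 5.63, matching the reported total.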

Test 2: Essay with ideal score of 1 (with gibberish paragraphs)

Essay Total SCORE: 3 pts

Essay Gibberish Percentage: 52%


Module 1 = 1
Module 2 = 0.4
Module 3 = 0.6
Module 4 = 0.47

Essay Irrelevance Percentage: 90.91%

Module 1 = 8 hits from possible of 88 hits
Module 2 = Average F measure is 0.54
Module 2 = F-Score is 9.09 relevance

Essay Facts Correctness: 100 %
Essay Complexity: 76.67 %

Question: Automated essay scoring is unfair to students, since there are many different ways for a student to express ideas intelligently and coherently. A computer program can not be expected to anticipate all of these possibilities, and will therefore grade students more harshly than they deserve. Discuss whether you agree or disagree (partially or totally) with the view expressed providing reasons and examples.

Answer: Computers are stupid. the shoulnt be allowed to grade peoples essays. i had this computers once that was so stupid it was stupider than my dog and tats pretty stupid. My dog was so dump that he once ate his own droppings dogs are suptid

My dog wood grade esays better than a computer even thou he once hate my homework. Teacher did not believe me when I told her that Ace had eated my homework so teachers are stupid too. Evan stupider than dogs are fish they just swim around in the bowl and don't do anything sometimes they don't even swim much and just sit there i hate fish except for eating them.

I hate computers almost as much as i hate fish. fish should not grade esays and computers shouldnt grade essaiys.

Foo boo to moo and black sofa invented space cabbage cabbage in zoo moo poo hoo hoo. what is what and what is what is what? do you what is know meow hoo hoo. when cabbage eats cabbage then it eats what eats what? today is meow and tomoro is hohohohohoh. why is kobayashi kobayashiand meow is benzene. when cabbage went weekly to monthly to moon is when picnic went kuku na paka.

by the way cutting is cutting and utting is cutting. kantai can tai a tai and untie a tai why cant i tie a tai and untie a tai as kantai can tie a tai and untie a tie. he ran and ran and ran and ran till cabbage caught the baseball in the baseball. why are we cabbaging the cabbage with new and old bus in yahoo and yahoo? we should kick the kick with kick and a kick but kicks and cabbage is cabaage kick with cutting the cheese. When hary met sally and sally met kantai, but kantai can tie a tie and untie a tie but why can't i tie a tie and untie a tie as kantai can tie a tie and untie a tie. god knows why.

Irrelevant Sentences:

1. the shoulnt be allowed to grade peoples essays (1.84331797235023)
2. i had this computers once that was so stupid it was stupider than my dog and tats pretty stupid (0.91743119266055)
3. teacher did not believe me when i told her that ace had eated my homework so teachers are stupid too (1.81818181818182)
4. evan stupider than dogs are fish they just swim around in the bowl and don t do anything sometimes they don t even swim much and just sit there i hate fish except for eating them (0)
5. i hate computers almost as much as i hate fish (0.925925925925926)
6. fish should not grade esays and computers shouldnt grade essaiys (1.8348623853211)
7. foo boo to moo and black sofa invented space cabbage cabbage in zoo moo poo hoo hoo (0)
8. what is what and what is what is what (0)
9. do you what is know meow hoo hoo (0)
10. when cabbage eats cabbage then it eats what eats what (0)
11. today is meow and tomoro is hohohohohoh (0)
12. why is kobayashi kobayashiand meow is benzene (0)
13. when cabbage went weekly to monthly to moon is when picnic went kuku na paka (0)
14. by the way cutting is cutting and utting is cutting (0.930232558139535)
15. kantai can tai a tai and untie a tai why cant i tie a tai and untie a tai as kantai can tie a tai and untie a tie (0)
16. he ran and ran and ran and ran till cabbage caught the baseball in the baseball (0)
17. why are we cabbaging the cabbage with new and old bus in yahoo and yahoo (0)
18. we should kick the kick with kick and a kick but kicks and cabbage is cabaage kick with cutting the cheese (0)
19. when hary met sally and sally met kantai but kantai can tie a tie and untie a tie but why can t i tie a tie and untie a tie as kantai can tie a tie and untie a tie (0)
20. god knows why (0)

Discussion: This essay should obtain a score of 1 or less. Our system gives it a grade of 3 for the following reasons. The system thinks the essay is 52% gibberish, and this is true, as the last two paragraphs are gibberish. 1 point is assigned to this essay with respect to its gibberishness (approximately 50%), since 2 points are assigned to a totally non-gibberish essay. According to the system the essay complexity is 76%, so 0.75 points are assigned for complexity. An additional point is provided by default for fact correctness since there are no facts, and 0.2 points are assigned for relevant content. An interesting thing to note in this case is that although module 1 completely fails to identify the gibberish sentences, modules 2, 3 and 4 correctly identify them. Module 1 is the module that displays the gibberish sentences to the user, and unfortunately in this case the gibberish sentences are not displayed since they are identified only by modules 2, 3 and 4. In our effort to continuously improve the system we will rectify this.

Test 3: Essay with ideal score of 0

Essay Total SCORE: 1 pt

Essay Gibberish Percentage: 100%

Module 1 = 0
Module 2 = 0
Module 3 = 0
Module 4 = 0.12

Essay Irrelevance Percentage: 100.00%

Module 1 = 0 hits from possible of 148 hits
Module 2 = Average F measure is 0.00
Module 2 = F-Score is 0.00 relevance

Essay Facts Correctness: 100 %
Essay Complexity: 43.33 %

Gibberish Essay: trophy tackler championship terry varsity. nfl team, halfback indisputably! windsor touchdowns football player. rozelle fans coach eluded coaches quarterback halfbacks. players teams bleachers scored league punted nevers sayers stadium wrigley! lineman leagues tubby fullback basketball brundage. steelers bleacher touchdown standlee shaughnessy nagurski luckman. kmetovic gallarneau countersmash bronko kicker athlete soccer kickoff rookie rimet fifa.

horst gainer baseball southside bando nba championships. hegersville voss rockn knute spicer champions athletic. teammates fielded teammate bulldogs. mets ridiron obson athletes winning goalposts olympics backfield sprint sidelined darold cepeda. bullpen batting fumble krystian tacklers gymnasium cinch. sports slugged batted game thorpe umpire spectators redskins superstars.

jordan syria israel lebanon palestinians sana qatar bahrain yemen emirates israelis. oman israeli palestinian arab sheikdoms arabia saudi bahrein palestine recongized. counterattacks iraq golan cyprus aden plo kuwait gaza lebanese nasser zionists arabian arabs withdrawl ullman israels iran arabic abdel sinai jordanian gamal haifa sycophants sixteenfold aqaba modernizer.


Gibberish Passages:
1. TROPHY TACKLER CHAMPIONSHIP TERRY VARSITY NFL TEAM HALFBACK INDISPUTABLY WINDSOR TOUCHDOWNS FOOTBALL PLAYER ROZELLE FANS COACH 
ELUDED COACHES QUARTERBACK HALFBACKS PLAYERS TEAMS BLEACHERS SCORED LEAGUE PUNTED NEVERS SAYERS STADIUM WRIGLEY LINEMAN LEAGUES 
TUBBY FULLBACK BASKETBALL BRUNDAGE STEELERS BLEACHER TOUCHDOWN STANDLEE SHAUGHNESSY NAGURSKI LUCKMAN KMETOVIC GALLARNEAU COUNTERSMASH 
BRONKO KICKER ATHLETE SOCCER KICKOFF ROOKIE RIMET FIFA 
2. HORST GAINER BASEBALL SOUTHSIDE BANDO NBA CHAMPIONSHIPS HEGERSVILLE VOSS ROCKN KNUTE SPICER CHAMPIONS ATHLETIC TEAMMATES FIELDED 
TEAMMATE BULLDOGS METS RIDIRON OBSON ATHLETES WINNING GOALPOSTS OLYMPICS BACKFIELD SPRINT SIDELINED DAROLD CEPEDA BULLPEN BATTING 
FUMBLE KRYSTIAN TACKLERS GYMNASIUM CINCH SPORTS SLUGGED BATTED GAME THORPE UMPIRE SPECTATORS REDSKINS SUPERSTARS JORDAN SYRIA ISRAEL 
LEBANON PALESTINIANS SANA QATAR BAHRAIN YEMEN EMIRATES ISRAELIS 
3. OMAN ISRAELI PALESTINIAN ARAB SHEIKDOMS ARABIA SAUDI BAHREIN PALESTINE RECONGIZED COUNTERATTACKS IRAQ GOLAN CYPRUS ADEN PLO KUWAIT 
GAZA LEBANESE NASSER ZIONISTS ARABIAN ARABS WITHDRAWL ULLMAN ISRAELS IRAN ARABIC ABDEL SINAI JORDANIAN GAMAL HAIFA SYCOPHANTS 
SIXTEENFOLD AQABA MODERNIZER

Irrelevant Sentences:

1. trophy tackler championship terry varsity (0)
2. nfl team halfback indisputably (0)
3. windsor touchdowns football player (0)
4. rozelle fans coach eluded coaches quarterback halfbacks (0)
5. players teams bleachers scored league punted nevers sayers stadium wrigley (0)
6. lineman leagues tubby fullback basketball brundage (0)
7. steelers bleacher touchdown standlee shaughnessy nagurski luckman (0)
8. kmetovic gallarneau countersmash bronko kicker athlete soccer kickoff rookie rimet fifa (0)
9. horst gainer baseball southside bando nba championships (0)
10. hegersville voss rockn knute spicer champions athletic (0)
11. teammates fielded teammate bulldogs (0)
12. mets ridiron obson athletes winning goalposts olympics backfield sprint sidelined darold cepeda (0)
13. bullpen batting fumble krystian tacklers gymnasium cinch (0)
14. sports slugged batted game thorpe umpire spectators redskins superstars (0)
15. jordan syria israel lebanon palestinians sana qatar bahrain yemen emirates israelis (0)
16. oman israeli palestinian arab sheikdoms arabia saudi bahrein palestine recongized (0)
17. counterattacks iraq golan cyprus aden plo kuwait gaza lebanese nasser zionists arabian arabs withdrawl ullman israels iran arabic abdel sinai jordanian gamal haifa sycophants sixteenfold aqaba modernizer (0)

Discussion: This essay should score 0. Our system gives it a score of 1, awarded by default since there are no facts. The essay is totally gibberish and irrelevant, and all our modules give correct scores for it. The gibberish passages are shown by module 1, and all the sentences are correctly classified as irrelevant, since they all are. There are no factual sentences, and by the rule that a student should not be penalized when there are no facts, the system gives 1 point by default; hence this essay scores 1.

Test 4: Essay with ideal score of 5 - 6

Essay Total SCORE: 5.63 pts

Essay Gibberish Percentage: 0 %
Essay Irrelevance Percentage: 0.00%
Essay Facts Correctness: 100.00 %
Essay Complexity: 63.33 %

Question: Automated essay scoring is unfair to students, since there are many different ways for a student to express ideas intelligently and coherently. A computer program can not be expected to anticipate all of these possibilities, and will therefore grade students more harshly than they deserve. Discuss whether you agree or disagree (partially or totally) with the view expressed providing reasons and examples.

Answer: Automated essay scoring is being used these days to score essays written by students in tests such as GRE and TOEFL. It uses computers to grade essays. As computers are used it reduces load on humans. Automated essay scoring also works very fast and student can get results just after writing the essay.

First, computers are not that much efficient to grade human essays. They cannot understand the logic and sense that a student uses in each essay. Even if we make the computer learn some of the types of essays that students write there are many ways to write an essay and it is impossible to make the computer understand all this.

Secondly, it is very difficult to train the computers in all areas of human knowledge. Its is costly and time consuming. This can be used in training humans and will give much better results. Even if a computer is fully trained some new aspects tend to creep in every essay and this will raise problems.

Thirdly, as computers understand only a part of the essay they are able to grade only that part and all the rest is taken as irrelevant or misleading topics. Due to this the grade of a student is drastically reduced. It can even happen that a poorly written essay is given a much better score than a good one.

Finally, I feel that use of automated essay scoring is not good way of scoring essays and they are totally unfair to students. Use of computers should be reduced and more stress should be given to include humans in the process of scoring essays. This will result in better grading system.

Facts: 1. Automated essay scoring is being used these days to score essays written by students in tests such as GRE and TOEFL : Correct


Discussion:

The above essay is well written and should be given a score of 6. The system gives it 5.63, very close to the ideal, which shows that the system awards the finer essays more points than the ones with problems.

Test 5: Essay with ideal score of 4

Score : By System = 4.53; Actual Score = 4;

   Essay Gibberish Percentage: 0%
      Module 1 = 1
      Module 2 = 0.6
      Module 3 = 0.8
      Module 4 = 1.00
   Essay Irrelevance Percentage: 50.00%
      Module 1 = 31 hits from possible of 84 hits
      Module 2 = Average F measure is 2.89
      Module 2 = F-Score is 50.00 relevance
   Essay Facts Correctness: 100 %
   Essay Complexity: 53.33 %

Essay :

Writing an essay is a creative and thoughtful thing to do. Today, I have to write this essay on Automated Essay grading and it is really hard to even be so creative, how can a machine be so understanding what I write. Even though essay grading using a machine would reduce the cost, and save time it is not going understand what I am trying to explain. Human graders are more fair to students then any automated system.

Firstly, just looking at the language english we know how complex it is and how many different ways you can explain the same thing. The computer based on sample answers can think like that but can it capture all the possible knowledge. I dont think so. Therefore computer grading cannot grade accurately and will not be fair to students.

Secondly, sometimes some essays dont follow the english rules but it does make sense when you read it. Computers follow some set of rules, if the rules are not met then it will say that the essay is garbage, while human grader will atleast understand the idea and give some points. This shows that humans are considerate and fair to student's effort while computers are not.

Lastly, computers cannot be biased like humans. In any situation human grading might be deterioted based on the environment and condition while computer grading will not. Situations like family issues, world news etc. Therefore computer grading can be useful in these kind of situation only if it satisfies the above two claims.

To conclude, we can say that computer grading technology might be useful and good but its still far away from reaching the point of human grading.

Details :

Gibberish Sentences

None


Irrelevant Sentences

level 1

1. i dont think so (100.00)
2. situations like family issues world news etc (100.00)
3. the computer based on sample answers can think like that but can it capture all the possible knowledge (88.89)
4. firstly just looking at the language english we know how complex it is and how many different ways you can explain the same thing (87.50)
5. secondly sometimes some essays dont follow the english rules but it does make sense when you read it (83.33)
6. computers follow some set of rules if the rules are not met then it will say that the essay is garbage while human grader will atleast understand the idea and give some points (81.82)
7. writing an essay is a creative and thoughtful thing to do (81.82)
8. this shows that humans are considerate and fair to student s effort while computers are not (81.25)
9. to conclude we can say that computer grading technology might be useful and good but its still far away from reaching the point of human grading (80.77)
10. today i have to write this essay on automated essay grading and it is really hard to even be so creative how can a machine be so understanding what i write (80.65)
11. in any situation human grading might be deterioted based on the environment and condition while computer grading will not (78.95)
12. even though essay grading using a machine would reduce the cost and save time it is not going understand what i am trying to explain (76.00)


level 2

1. writing an essay is a creative and thoughtful thing to do (1.84331797235023)
2. today i have to write this essay on automated essay grading and it is really hard to even be so creative how can a machine be so understanding what i write (3.61990950226244)
3. the computer based on sample answers can think like that but can it capture all the possible knowledge (1.81818181818182)
4. i dont think so (0)
5. secondly sometimes some essays dont follow the english rules but it does make sense when you read it (1.82648401826484)
6. this shows that humans are considerate and fair to student s effort while computers are not (2.73972602739726)
7. in any situation human grading might be deterioted based on the environment and condition while computer grading will not (2.72727272727273)
8. situations like family issues world news etc (0)

 
Factful Sentences

None


Discussion :

The above essay is well written gibberish-wise; it contains 5 paragraphs and shows pretty good complexity, so the scores from these modules are high. There are no facts in the essay, and the system doesn't find any either. The essay contains a lot of irrelevant sentences, most of which are caught by the system, except a few that are so general that they might have occurred in the sample dataset from which the system was built. The score given by the system differs from the manual score because when the essay was graded manually we did not take facts and complexity into consideration and thus gave a fairly low score. Since our system does consider these modules in scoring, it gives a slightly higher score, roughly an extra point.

test 6: Essay with ideal score of 5.5

Score : By System = 4.69;

   Essay Gibberish Percentage: 0%
      Module 1 = 1
      Module 2 = 0.6
      Module 3 = 0.6
      Module 4 = 1.00
   Essay Irrelevance Percentage: 41.94%
      Module 1 = 49 hits from possible of 161 hits
      Module 2 = Average F measure is 2.83
      Module 2 = F-Score is 58.06 relevance
   Essay Facts Correctness: 66.67 %
   Essay Complexity: 86.67 %

Essay :

Every person in his or her life has written an essay, be it a high school level or a college level. It is an art that we learn from day one, an art where you write your thoughts, argue, express your views and feelings. These features can only be understood if you can feel what is being said, now if computers are given these essays to grade can they have the same feeling, visions, as humans do? Its questionable and I do not think that in any way its possible. Therefore automated essay scoring is unfair to students.

First, lets argue about computers understanding what is being said. To express one issue, there are so many different ways in English you can write that, and thats called the creativity and freedom of expressiveness. Can computer figure out which way is good and which is not. Its very hard even for us human to try and guess which one is better having all the knowledge about the issue. The computer models are built using few sample essays (400-500 essays), and this is the only knowledge it has, how in this world can it capture everything about that topic? Its just impossible. Computers can't learn as we can and can't interpret as we can and therefore can not grade as we can. It is not fair to students who work really hard on it and then watch the computer give it some score which it can't even interpret well.

Secondly, lets look at the methods the computer models are built for the automatd essay graders. The two famous ones are Latent Semantic Analysis (LSA), and the E-Rater modl. LSA learns from large corpora the context knowledge, and forms the mutual relations between words and contexts. It does not consider word order nor the syntax of the sentences. It is not possible for it to capture any creative writing without knowing how can a good sentence be formed. Learning just by what you see is not a good way. The E-rater has a similar problem. It learns from students essays and tries to gather important facts, but no matter how much information you give to it, its still not possible to have complete knowledge of the context. One better thing that E-rater has is that it can check sentence structure, asses the quality of grammars, and feedback, and few others. This makes it a better system than LSA but not better than a Human. Of all these factors, being good or better, we still dont believe that it will be fair to students who write proper english sentences compared to those who write street language essays.

Lastly, there are couple of issues that has led to extra attention to automated grading. These are costs and fairness overall. By using the automated system we reduce the high cost incured due to human graders, the fees, the resources, and saves time, etc.The overall fairness comes into effect when the grader gets biased against few format of answers, the automated graders dont get biased. Even though these issues do influence the idea of automated graders, it is still not fair for students to have the thoughts of unfairness.

Overall, the idea of Automated essay grading is a good idea but it is still few years away. For it to be fair to students it needs to take into consideration every factor which human graders do.

Details :

Gibberish Sentences

None


Irrelevant Sentences

level 1

1. it does not consider word order nor the syntax of the sentences (100.00)
2. the e rater has a similar problem (100.00)
3. the two famous ones are latent semantic analysis lsa and the e rater modl (100.00)
4. one better thing that e rater has is that it can check sentence structure asses the quality of grammars and feedback and few others (95.83)
5. it is not possible for it to capture any creative writing without knowing how can a good sentence be formed (95.00)
6. its questionable and i do not think that in any way its possible (92.31)
7. learning just by what you see is not a good way (90.91)
8. every person in his or her life has written an essay be it a high school level or a college level (90.48)
9. first lets argue about computers understanding what is being said (90.00)
10. lsa learns from large corpora the context knowledge and forms the mutual relations between words and contexts (88.24)
11. lastly there are couple of issues that has led to extra attention to automated grading (86.67)
12. its very hard even for us human to try and guess which one is better having all the knowledge about the issue (86.36)
13. it is not fair to students who work really hard on it and then watch the computer give it some score which it can t even interpret well (85.71)
14. for it to be fair to students it needs to take into consideration every factor which human graders do (84.21)
15. to express one issue there are so many different ways in english you can write that and thats called the creativity and freedom of expressiveness (84.00)
16. can computer figure out which way is good and which is not (83.33)
17. these are costs and fairness overall (83.33)
18. it is an art that we learn from day one an art where you write your thoughts argue express your views and feelings (82.61)
19. computers can t learn as we can and can t interpret as we can and therefore can not grade as we can (81.82)
20. the overall fairness comes into effect when the grader gets biased against few format of answers the automated graders dont get biased (81.82)
21. it learns from students essays and tries to gather important facts but no matter how much information you give to it its still not possible to have complete knowledge of the context (81.25)
22. the computer models are built using few sample essays 400 500 essays and this is the only knowledge it has how in this world can it capture everything about that topic (80.65)


level 2

1. every person in his or her life has written an essay be it a high school level or a college level (1.81818181818182)
2. it is an art that we learn from day one an art where you write your thoughts argue express your views and feelings (3.61990950226244)
3. its questionable and i do not think that in any way its possible (0.925925925925926)
4. first lets argue about computers understanding what is being said (0.925925925925926)
5. its very hard even for us human to try and guess which one is better having all the knowledge about the issue (2.72727272727273)
6. the two famous ones are latent semantic analysis lsa and the e rater modl (0)
7. lsa learns from large corpora the context knowledge and forms the mutual relations between words and contexts (1.79372197309417)
8. it does not consider word order nor the syntax of the sentences (0)
9. it is not possible for it to capture any creative writing without knowing how can a good sentence be formed (0.91324200913242)
10. the e rater has a similar problem (0)
11. one better thing that e rater has is that it can check sentence structure asses the quality of grammars and feedback and few others (0.900900900900901)
12. lastly there are couple of issues that has led to extra attention to automated grading (1.81818181818182)
13. these are costs and fairness overall (0.925925925925926)

 
Factful Sentences

1. The computer models are built using few sample essays 400 500 essays and this is the only knowledge it has how in this world can it capture everything about that topic : Incorrect fact
2. Computers cant learn as we can and cant interpret as we can and therefore can not grade as we can : Correct fact
3. LSA learns from large corpora the context knowledge and forms the mutual relations between words and contexts : Correct fact

Discussion :

This essay is supposed to receive a high score, since its complexity is good and it contains no gibberish. The system, however, found a lot of irrelevant sentences as well as facts that are incorrect. Incorrect facts pull the overall score down: facts account for 1 point out of 6, and irrelevance for 2 points out of 6. Looking at the detailed score for this essay, we see that most of the points the student lost are due to fact incorrectness and irrelevance. This essay would have been expected to do well if the system had performed well. We have observed that the thresholds used for irrelevance sometimes become too harsh on the student, and sentences that are relevant get flagged as irrelevant.
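The 6-point breakdown mentioned above can be sketched as a small computation. The weights for facts (1 point) and irrelevance (2 points) are stated in the discussion; assigning the remaining 3 points as 2 for gibberish and 1 for complexity is our assumption, though it does reproduce the 4.69 reported for this essay.

```python
# Hypothetical sketch of the 6-point scoring combination.
# Facts = 1 pt and irrelevance = 2 pts are stated in the discussion;
# gibberish = 2 pts and complexity = 1 pt are ASSUMPTIONS chosen to
# reproduce the 4.69 reported for test 6.

def combine_score(gibberish_pct, irrelevance_pct, facts_pct, complexity_pct):
    """Combine the four module percentages into a 0-6 essay score."""
    score = 2.0 * (100.0 - gibberish_pct) / 100.0      # gibberish: 2 pts
    score += 2.0 * (100.0 - irrelevance_pct) / 100.0   # irrelevance: 2 pts
    score += 1.0 * facts_pct / 100.0                   # facts: 1 pt
    score += 1.0 * complexity_pct / 100.0              # complexity: 1 pt
    return round(score, 2)

# Test 6 percentages from the report:
print(combine_score(0.0, 41.94, 66.67, 86.67))  # 4.69
```

A flawless essay (no gibberish, no irrelevance, all facts correct, full complexity) would score the maximum 6.0 under this weighting.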


=cut


RELATION TO PREVIOUS WORK:

Our evaluator system is loosely based on the E-Rater and Criterion software from ETS [4]. In implementing our system we did not lift any implementation details from the ETS publications, but we are certainly following their general ideas.

The team members came up with the ideas for rule-based and ratio-based gibberish detection. We cannot claim that these are novel ideas, but we had no explicit source for them. The idea for irrelevance testing was an amalgamation of observations drawn from several sources.
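The report names rule-based and ratio-based gibberish detection without detailing the rules themselves. As an illustration only, a ratio-based check might flag sentences whose ratio of common English function words to total tokens falls below a threshold, since real prose normally contains a healthy share of such words while word salad often does not. The word list and threshold below are assumptions for the sketch, not the project's actual values.

```python
# Illustrative sketch of a ratio-based gibberish test (the actual rules
# and threshold used by the project are not given in the report).

FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "is", "are",
                  "and", "or", "it", "that", "this", "for", "on", "with"}

def looks_like_gibberish(sentence, threshold=0.15):
    """Flag a sentence if too few of its tokens are function words."""
    tokens = sentence.lower().split()
    if not tokens:
        return True
    ratio = sum(1 for t in tokens if t in FUNCTION_WORDS) / len(tokens)
    return ratio < threshold

print(looks_like_gibberish("the essay is graded by a computer"))    # False
print(looks_like_gibberish("blue fish seven walked grading loud"))  # True
```

A rule-based check could complement this by testing structural properties, for example whether the token sequence contains at least one plausible verb position, but the report does not specify which rules were used.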

For the fact-checking and fact-finding components we referred to the ideas used by ETS.


References:

[1] Marti A. Hearst, ``The debate on automated essay grading'' (http://www.knowledge-technologies.com/presskit/KAT_IEEEdebate.pdf)

[2] Salvatore Valenti, ``An Overview of Current Research on Automated Essay Grading'' Journal of Information Technology Education, Volume 2, 2003 (http://jite.org/documents/Vol2/v2p319-330-30.pdf)

[3] Brill Part of Speech Tagger (http://www.cs.jhu.edu/~brill)

[4] Burstein, Chodorow, Leacock. ``Criterion Online Essay Evaluation: An Application for Automated Evaluation of Student Essays'', Proceedings of the Innovative Applications of Artificial Intelligence Conference, Acapulco, Mexico, 2003

[5] Creighton University Health Sciences Library and Learning Resources Center (http://www.hsl.creighton.edu/hsl/Searching/Recall-Precision.html)

[6] Manning, C. and H. Schütze, ``Foundations of Statistical Natural Language Processing'', MIT Press, 1999.

[7] Higgins, D., Burstein, J., Marcu, D., & Gentile, C. (2004, May). Evaluating multiple aspects of coherence in student essays. In Proceedings of the Annual Meeting of HLT/NAACL, Boston, MA

[8] http://www.wayabroad.com/twe/

[9] UMD Essay Corpus (http://www.d.umn.edu/~tpederse/Courses/CS8761-FALL04/Assign/umd-essay-corpus.tar.gz)

[10] Wiebe, Janyce. Learning Subjective Adjectives from Corpora. Proc. 17th National Conference on Artificial Intelligence (AAAI-2000). Austin, Texas, July 2000 (http://www.cs.pitt.edu/~wiebe/pubs/aaai00/adjsMPQA)