April 9, 2004

Here are a few general comments on the proposals. Remember that you should submit a revised version of the proposal by 5pm on Thursday 4/15 via the web drop. A doc or pdf file is fine. This version of the proposal should take into account these general issues, plus any others that I might have raised on your individual proposals.

Background Section
==================

In your background section, you should summarize at least two published papers or articles that are directly relevant to your project, and you should make it clear what the relevance is. In other words, the papers you summarize should inform your project in some way, such that they give you an idea about something to try. You can have more than two references, of course, but a minimum of two should describe approaches to your problem that either give you ideas about what you could do, or perhaps describe ideas that did not work (which you can mention as a reason why you are doing something different).

Finally, make sure that you summarize each article in your own words, and that the proposal itself is written entirely in your own words (you should not copy text from other sources and include it in your writing).

References
==========

You should follow a standard style of referencing. This means that you should have a bibliography at the end, and when you reference an article you should have a citation in your proposal that looks something like this:

  (Smith and Jones, 1998) report on an experiment using a random number generator to decipher the Voynich Manuscript. The results they reported were not terribly promising, suggesting that the Voynich Manuscript is at least not random text. For that reason, I am going to etc. etc. etc.

There was some question about what constitutes a published reference. This means something that has appeared in a conference or journal, and is not just a page or paper that someone has put on a web site.
In particular, what you want are most likely papers that have been published elsewhere, and that are put on the web as a convenience for readers. See the guidelines for citing such material here:

  http://www.apastyle.org/elecsource.html#71

If you have "web only" material that informs your work, that is fine, but it will not count as one of the two published references I mention above. If you include such references, make sure that you give the complete url (such that a reader can find the resource you refer to immediately).

Evaluation
==========

You need to work out some of the details of your evaluation plan, in particular what data you will use to evaluate your approach. You should start to look more precisely at different sources of data. If you are dealing with the Voynich Manuscript, you should think about whether using only English text for comparison is sufficient. If you are using Google Sets, you should investigate the use of WordNet or other existing resources for sets of related words. I mentioned that there is a Perl interface to WordNet known as WordNet::QueryData, which can be found here:

  http://search.cpan.org/dist/WordNet-QueryData

The data that you compare your approach to is what I refer to as "gold standard" data. This means it is "as good as gold"; in other words, you know that it is correct. In the case of Google Sets, this means you have a set of words that you know to be related. In the case of the Voynich Manuscript, it might mean text that you know to be human language, or not.

The other issue you need to consider with respect to evaluation is how you will judge any differences between your program's output on the data you are testing (i.e., the Voynich Manuscript) and on gold standard data where you know the expected output (i.e., Robinson Crusoe is human language). If you have a difference of .5 in some score between the Voynich Manuscript and Robinson Crusoe, what can you conclude from that?
In isolation, that value of .5 doesn't mean much. But if the difference between Robinson Crusoe and the Bible is .00004, perhaps a difference of .5 shows that the Voynich Manuscript is very different for some reason.

Voynich Issues
==============

All of the Voynich proposals included the idea of using the profiler.pl program, and then using entropy of some kind. Those are fine, but you need to come up with a third approach that goes somewhat beyond those. I would expect that you can demo your profiler.pl performance on the Voynich Manuscript on Thursday, April 15. Also, with entropy, note that you need to decide whether you are finding the entropy of characters or words, or something else. And you should look at the Voynich Manuscript to see if ideas like characters or words even make sense before going too far with that.

Google Issues
=============

It seems that you can't get PageRank values from the Google API, which is too bad. Those could be useful in your approaches. I also mentioned that the paper by Brin et al. about PageRank is a published reference, so you should make sure to get the complete reference, and also discuss in your background section how PageRank will impact your approach. Finally, I also mentioned that some of the material on Information Retrieval in Chapter 17 of our text might be useful for the Google Sets projects, particularly measures like tf/idf.

Grading
=======

I am not going to be assigning grades for each stage of the project. What I expect is that you'll make a good faith effort to follow the general guidelines for the project that I'm giving, and also act upon any feedback I provide, either as a group or individually. If you do that throughout the course of the project, then you can assume that you will get full credit. I will warn you if you are doing (or not doing) anything that will negatively affect your grade.
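To make the character-entropy idea from the Voynich section concrete, here is a minimal sketch (in Python for illustration, though your projects use Perl; the sample strings are toy stand-ins for real texts such as a Voynich transcription or Robinson Crusoe):

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy (bits per character) of the character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy comparison: in a real run you would read the Voynich transcription
# and your gold-standard texts from files and compare the scores.
english = "the quick brown fox jumps over the lazy dog"
uniform = "abcdefghij" * 10  # ten characters, all equally likely

print(round(char_entropy(english), 3))
print(round(char_entropy(uniform), 3))
```

Note that the same function applied to whitespace-split tokens gives word entropy instead, which is one way to explore the character-versus-word question before committing to either.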
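For the Google Sets projects, the tf/idf measure mentioned above can be sketched as follows (a minimal illustration, not the textbook's exact formulation; the corpus contents are made up, and the log base is a free choice):

```python
import math

def tf_idf(term, doc, docs):
    """tf = raw count of the term in the document;
    idf = log(total documents / documents containing the term)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0

# Toy corpus: each document is a list of words (contents are made up).
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["banana", "cherry", "date"],
]

# "banana" appears in every document, so its idf (and score) is zero;
# "apple" is concentrated in one document, so it scores higher.
print(tf_idf("apple", docs[0], docs))
print(tf_idf("banana", docs[0], docs))
```

The intuition is that terms shared only by the members of a candidate set (high tf, low df) are better evidence of relatedness than terms that occur everywhere.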