April 9, 2004

Here are a few general comments on the proposals. Remember that you should submit a revised version of the proposal by 5pm on Thursday 4/15 via the web drop. A doc or pdf file is fine. This version of the proposal should take into account these general issues, plus any others that I might have raised on your individual proposals.

Background Section
==================

In your background section, you should summarize at least two published papers or articles that are directly relevant to your project, and you should make it clear what the relevance is. In other words, the papers you summarize should inform your project in some way, such that they give you an idea about something to try. You can have more than two references, of course, but a minimum of two should describe approaches to your problem that either give you ideas about what you could do, or perhaps describe ideas that did not work (which you can mention as a reason why you are doing something different).

Finally, make sure that you summarize each article in your own words, and that the proposal itself is written entirely in your own words (you should not copy text from other sources and include it in your writing).

References
==========

You should follow a standard style of referencing. This means that you should have a bibliography at the end, and when you reference an article you should have a citation in your proposal that looks something like this:

  (Smith and Jones, 1998) report on an experiment using a random number generator to decipher the Voynich Manuscript. The results they reported were not terribly promising, suggesting that the Voynich Manuscript is at least not random text. For that reason, I am going to etc. etc. etc.

There was some question about what constitutes a published reference. This means something that has appeared in a conference or journal, and is not just a page or paper that someone has put on a web site.
In particular, what you want are most likely papers that have been published elsewhere, and that are put on the web as a convenience for readers. See the guidelines for citing such material here:

  http://www.apastyle.org/elecsource.html#71

If you have "web only" material that informs your work, that is fine, but it will not count as one of the two published references I mention above. If you include such references, make sure that you give the complete url (such that a reader can find the resource you refer to immediately).

Evaluation
==========

You need to work out some of the details of your evaluation plan, in particular what data you will use to evaluate your approach. You should start to look more precisely at different sources of data. If you are dealing with the Voynich Manuscript, you should think about whether using only English text for comparison is sufficient. If you are using Google Sets, you should investigate the use of WordNet or other existing resources for sets of related words. I mentioned that there is a Perl interface to WordNet known as WordNet::QueryData, which can be found here:

  http://search.cpan.org/dist/WordNet-QueryData

The data that you compare your approach to is what I refer to as "gold standard" data. This means it is "as good as gold"; in other words, you know that it is correct. In the case of Google Sets, this means you have a set of words that you know to be related. In the case of the Voynich Manuscript, it might mean text that you know to be human language, or not.

The other issue you need to consider with respect to evaluation is how you will judge any differences between your program's output on the data you are testing (i.e., the Voynich Manuscript) and on gold standard data where you know the expected output (i.e., Robinson Crusoe is human language). If you have a difference of .5 in some score between the Voynich Manuscript and Robinson Crusoe, what can you conclude from that?
In isolation, that value of .5 doesn't mean much. But if the difference between Robinson Crusoe and the Bible is .00004, perhaps a difference of .5 shows that the Voynich Manuscript is very different for some reason.

Voynich Issues
==============

All of the Voynich proposals included the idea of using the profiler.pl program, and then using entropy of some kind. Those are fine, but you need to come up with a third approach that goes somewhat beyond those. I would expect that you can demo your profiler.pl performance on the Voynich Manuscript on Thursday, April 15. Also, with entropy, note that you need to decide whether you are finding the entropy of characters or words, or something else. And you should look at the Voynich Manuscript to see if ideas like characters or words even make sense before going too far with that.

Google Issues
=============

It seems that you can't get PageRank values from the Google API, which is too bad. Those could be useful in your approaches. I also mentioned that the paper by Brin et al. about PageRank is a published reference, so you should make sure to get the complete reference, and also discuss in your background section how PageRank will impact your approach. Finally, I also mentioned that some of the material on Information Retrieval in Chapter 17 of our text might be useful for the Google Sets projects, particularly measures like tf/idf.

Grading
=======

I am not going to be assigning grades for each stage of the project. What I expect is that you'll make a good faith effort to follow the general guidelines for the project that I'm giving, and also act upon any feedback I provide, either as a group or individually. If you do that throughout the course of the project, then you can assume that you will get full credit. I will warn you if you are doing (or not doing) anything that will negatively affect your grade.
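To make the character-entropy idea from the Voynich section concrete, here is a minimal sketch (in Python for illustration, though your projects use Perl; the sample strings are toy stand-ins for real texts such as a Voynich transcription or Robinson Crusoe):

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy (bits per character) of the character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Toy comparison: in a real run you would read the Voynich transcription
# and your gold-standard texts from files and compare the scores.
english = "the quick brown fox jumps over the lazy dog"
uniform = "abcdefghij" * 10  # ten characters, all equally likely

print(round(char_entropy(english), 3))
print(round(char_entropy(uniform), 3))
```

Note that the same function applied to whitespace-split tokens gives word entropy instead, which is one way to explore the character-versus-word question before committing to either.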
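For the Google Sets projects, the tf/idf measure mentioned above can be sketched as follows (a minimal illustration, not the textbook's exact formulation; the corpus contents are made up, and the log base is a free choice):

```python
import math

def tf_idf(term, doc, docs):
    """tf = raw count of the term in the document;
    idf = log(total documents / documents containing the term)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0

# Toy corpus: each document is a list of words (contents are made up).
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["banana", "cherry", "date"],
]

# "banana" appears in every document, so its idf (and score) is zero;
# "apple" is concentrated in one document, so it scores higher.
print(tf_idf("apple", docs[0], docs))
print(tf_idf("banana", docs[0], docs))
```

The intuition is that terms shared only by the members of a candidate set (high tf, low df) are better evidence of relatedness than terms that occur everywhere.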