Generated from weps2sval2.pl (v 0.02) by pod2text

NAME
    weps2sval2.pl

SYNOPSIS
    ./weps2sval2.pl --input --output

    OR

    ./weps2sval2.pl --input --output --include_discards

    OR

    ./weps2sval2.pl --input --output --verbose --include_discards

    OR

    ./weps2sval2.pl --input --output --verbose

    OR

    ./weps2sval2.pl --input --output --raw

    OR

    ./weps2sval2.pl --input --output --raw --include_discards

    OR

    ./weps2sval2.pl --input --output --raw --include_discards --verbose

DESCRIPTION
    This program converts Web People Search (WePS) data into Senseval-2
    format. The WePS data was released for the WePS shared task at
    SemEval-2007. The data consists of web search results for individuals.
    For more information on the task please refer to the SEE ALSO section
    of this perldoc.

    The WePS data can be obtained from L

    weps2sval2.pl expects 2 mandatory parameters, --input and --output.

    The WePS data found at the above link has the folder structure
    described below, and weps2sval2.pl relies on parts of this structure.

    After unpacking the tarball archive, the following folder structure can
    be found. The root folder 'weps2007_data_1.0' contains 3 sub folders:
    "scorer", "test" and "traininig". The "scorer" folder holds a scoring
    program provided with the corpus. We do not use this folder.

    The folders "test" and "traininig" are the ones of interest. They
    contain the search results for the ambiguous names. The "traininig"
    folder has 49 instances of ambiguous names and the "test" folder has 30
    instances of ambiguous names. For each search result of an ambiguous
    name, the data records the rank of the result among all results
    returned, the title of the page, the snippet text from the web page,
    and the URL of the web page.

    The folders "traininig" and "test" have a similar sub-folder structure.
    Each contains the sub folders "description_files", "truth_files" and
    "web_pages". The "description_files" folder has XML files.
    Every XML file describes the raw data, always with its rank and URL,
    and occasionally with title and snippet information. The "truth_files"
    folder has the XML files with the manual annotation of clusters of the
    search results. We read the XML files from each of these folders and
    use them to extract the rank, title, snippet and answer information per
    ambiguous name.

    The "web_pages" folder has the raw web pages downloaded from the search
    results and stored off-line. Each web page is stored under the "raw"
    sub folder, in a folder named after its rank, with the HTML file saved
    as index.html. We use this index.html file for extracting the raw text
    for our analysis.

    The "test" folder is similar to "traininig" except that its truth_files
    folder has 3 sub folders that hold the annotations. We use the XML
    files in the "official_annotation" folder for extracting answers for
    the ambiguous name instances in the test folder.

    Below is the tree representation of the described folder structure.

        + weps2007_data_1.0/
            + scorer/
            + test/
                + description_files/
                    Alvin_Cooper.xml
                    Arthur_Morgan.xml
                    ...
                    William_Dickson.xml
                + truth_files/
                    + annotation_2/
                    + annotation_1/
                    + official_annotation/
                        Alvin_Cooper.xml
                        ...
                        William_Dickson.xml
                + web_pages/
                    + Alvin_Cooper/
                        Alvin_Cooper.xml
                        + raw/
                            000/
                                index.html
                                *.jpeg
                                *.gif
                            001/
                                index.html
                            ...
                            099/
                                index.html
                    ...
                    + William_Dickson/
                        William_Dickson.xml
                        + raw/
                            000/
                                index.html
                            001/
                                index.html
                            ...
                            099/
                                index.html
            + traininig/
                + description_files/
                    Alvin_Cooper.xml
                    Arthur_Morgan.xml
                    ...
                    William_Dickson.xml
                + truth_files/
                    Alvin_Cooper.xml
                    ...
                    William_Dickson.xml
                + web_pages/
                    + Alvin_Cooper/
                        Alvin_Cooper.xml
                        + raw/
                            000/
                                index.html
                            001/
                                index.html
                            ...
                            099/
                                index.html
                    ...
                    + William_Dickson/
                        William_Dickson.xml
                        + raw/
                            000/
                                index.html
                            001/
                                index.html
                            ...
                            099/
                                index.html

    Please refer to the websites listed in the SEE ALSO section for more
    details on the WePS task and "WePS data".
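    The layout above can also be walked programmatically. The following is
    a minimal Python sketch, not part of weps2sval2.pl, that lists the
    index.html file for each rank of each ambiguous name under a "test" or
    "traininig" folder. The function name list_raw_pages is hypothetical,
    and the sketch assumes only the
    web_pages/<Name>/raw/<rank>/index.html layout shown in the tree.

```python
from pathlib import Path


def list_raw_pages(data_dir):
    """Map each ambiguous name to its (rank, index.html path) pairs.

    Assumed layout (taken from the tree above):
        <data_dir>/web_pages/<Name>/raw/<rank>/index.html
    """
    pages = {}
    web_pages = Path(data_dir) / "web_pages"
    for name_dir in sorted(p for p in web_pages.iterdir() if p.is_dir()):
        raw_dir = name_dir / "raw"
        if not raw_dir.is_dir():
            continue  # no downloaded pages for this name
        found = []
        for rank_dir in raw_dir.iterdir():
            index_file = rank_dir / "index.html"
            # Rank folders are numeric names such as 000 or 099.
            if rank_dir.name.isdigit() and index_file.is_file():
                found.append((int(rank_dir.name), index_file))
        # Order the pages by their numeric rank.
        pages[name_dir.name] = sorted(found, key=lambda pair: pair[0])
    return pages
```

    For example, list_raw_pages("weps2007_data_1.0/test") would return one
    entry per name folder under test/web_pages, each holding the ranks for
    which an index.html file exists.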
    weps2sval2.pl converts data from the WePS corpus, taken either from the
    snippets (title or snippet text) or from the web pages, to the headless
    Senseval-2 format. The program expects two required parameters. These
    parameters are

    --input
        This option takes as its parameter the directory in which the WePS
        data resides. This directory should have a structure like the test
        and traininig directories described above. It is a mandatory
        parameter, without which weps2sval2.pl will give the following
        error message.

            No input directory specified or mandatory parameter --input not specified.
            Try weps2sval2.pl --help for help.

        If the name of the input directory is test, then the input
        directory should have a "truth_files/official_annotation" folder
        containing the *.clust.xml files. This is required due to the
        directory structure of the WePS data. The program will work without
        any problems for any other directory names like "testdata",
        "train", etc.

        If the specified input directory does not exist then weps2sval2.pl
        will give the following error.

            Can not open input directory . Aborting...

    --output
        This option takes as its parameter the directory in which
        weps2sval2.pl should store the resulting XML files. This is a
        mandatory parameter, without which weps2sval2.pl will give the
        following error.

            Option output requires an argument
            No output directory specified or mandatory parameter --output not specified.
            Try weps2sval2.pl --help for help.

        If the specified directory already exists then the program halts
        with the following error.

            Output directory already exists, can not proceed.
            Try weps2sval2.pl --help for help.

        This is done in order to prevent weps2sval2.pl from overwriting any
        existing data.

        Example: a command with the above two options looks as below.

            weps2sval2.pl --input testdata/ --output A1out/

        The resulting XML is as below.

            TITLE Bryan Adams SNIPPET Official site for Bryan Adams with
            recent news, tour dates, a song archive, video clips, and a
            biography.
            Also find information on new CDs, DVDs, books, and
            memorabilia. URL www.bryanadams.com/
            . . .

    --include_discards
        This switch tells weps2sval2.pl to include in the output XML the
        instances of an ambiguous name that were marked as discards by the
        human annotators of the data. By default such discards are not
        included in the output Senseval-2 format XML.

        We have observed during our development and testing that some
        instances of the data do not have entity tags associated with their
        ranks in the respective *.clust.xml. We treat such instances as
        discards. The names and ranks for which no entities are found in
        the *.clust.xml are as follows.

            Name                 Rank    Folder
            Stephen_Clark        48      test
            Stephen_Clark        58      test
            Christine_Borgman    7       traininig
            Christine_Borgman    29      traininig
            Christine_Borgman    61      traininig

        More on the elements with missing entries in the XML can be found
        in the BUGS section of the documentation.

        Example: a command with the --include_discards switch along with
        the mandatory parameters looks as below.

            weps2sval2.pl --input testdata/ --output A2out/ --include_discards

        The resulting XML is as below.

            TITLE Bryan Adams SNIPPET Official site for Bryan Adams with
            recent news, tour dates, a song archive, video clips, and a
            biography. Also find information on new CDs, DVDs, books, and
            memorabilia. URL www.bryanadams.com/
            . . .
            TITLE Bryan Adams Blog SNIPPET Personal blog for an MIT
            graduate student interested in robots. He is not the singer,
            Bryan Adams. URL people.csail.mit.edu/bpadams/blog

    --verbose
        This switch tells weps2sval2.pl to display its progress while
        processing the data from the input folder. It is useful when
        processing large data sets, where the program may otherwise appear
        unresponsive. When the switch is specified the command looks as
        below.

            weps2sval2.pl --input testdata/ --output A3out --verbose

        It generates output like the following on STDOUT.
            now processing Violet_Howard.xml: ************************
            now processing Chris_Brockett.xml: ****************

    --raw
        This switch tells weps2sval2.pl to extract the text from the
        index.html files found in the sub folders of the web_pages/ folder
        in the directory structure. This switch can be used in conjunction
        with any of the above switches to combine their effects, as in the
        following command examples.

        Example: a command with the --raw and --include_discards switches
        used in conjunction. Raw text from the HTML files is used for the
        creation of the contexts, and the HTML files for the discarded
        instances are included as well.

            weps2sval2.pl --input testdata/ --output A2out/ --raw --include_discards

        The resulting XML is as below.

            . . .
            IMAGE BryanAdams Net IMAGE Home FreeMail FreeFun Mailinglist
            Download About IMAGE IMAGE IMAGE IMAGE IMAGE IMAGE IMAGE IMAGE
            IMAGE IMAGE IMAGE IMAGE Hello there Welcome to BryanAdams Net
            How are you doing on a day like today I d like to thank the
            Our visitors fans who visited this site the past three years
            It s been great to offer services to such great people This
            site is temporary unavailable but will be back as soon as
            possible Bye Remi remi bryanadams net IMAGE IMAGE IMAGE IMAGE
            IMAGE Copyright C 1996 2006 RemiStats
            . . .

        Example: a command with the --raw switch used alone. Raw text from
        the HTML files is used for the creation of the contexts.

            weps2sval2.pl --input testdata/ --output A2out/ --raw

        The resulting XML is as below.

            . . .
            IMAGE BryanAdams Net IMAGE Home FreeMail FreeFun Mailinglist
            Download About IMAGE IMAGE IMAGE IMAGE IMAGE IMAGE IMAGE IMAGE
            IMAGE IMAGE IMAGE IMAGE Hello there Welcome to BryanAdams Net
            How are you doing on a day like today I d like to thank the
            Our visitors fans who visited this site the past three years
            It s been great to offer services to such great people This
            site is temporary unavailable but will be back as soon as
            possible Bye Remi remi bryanadams net IMAGE IMAGE IMAGE IMAGE
            IMAGE Copyright C 1996 2006 RemiStats
            . . .

        Example: a command with the --raw, --verbose and --include_discards
        switches used in conjunction. This works exactly as in the example
        with the --raw and --include_discards options, except that the
        --verbose switch makes the program display its progress while
        processing.

            weps2sval2.pl --input testdata/ --output A2out/ --raw --include_discards --verbose

BUGS
    The WePS data has the following discrepancies.

    1. Some contexts have answers in the *.clust.xml but no tag in the
       corresponding *.xml file in the description_files/ folder. They are:

            Name                 Rank    Folder
            Stephen_Clark        33      test
            Violet_Howard        10      test
            Violet_Howard        18      test

    2. Some have a tag but no answers. As of now we assign "discarded" as
       the answer for such instances. The instances that do not have
       answers are:

            Name                 Rank    Folder
            Stephen_Clark        48      test
            Stephen_Clark        58      test
            Christine_Borgman    7       traininig
            Christine_Borgman    29      traininig
            Christine_Borgman    61      traininig

    Due to such anomalies in the data, the expected numbers of instances
    and answers vary for each folder. Below are the counts that we see in
    the weps2sval2.pl output XML files.

    1. Count of answers.

        With the --include_discards switch, expect the following answer
        counts:

            traininig - 3611
            test      - 3327

        Without the --include_discards switch, expect the following answer
        counts:

            traininig - 2340
            test      - 2875

    2. Number of instances.
        With the --include_discards switch, expect the following instance
        counts:

            traininig - 3477
            test      - 2968

        Without the --include_discards switch, expect the following
        instance counts:

            traininig - 2206
            test      - 2516

AUTHORS
    Atul Kulkarni, kulka053 at d.umn.edu, University of Minnesota, Duluth

    Ted Pedersen, tpederse at d.umn.edu, University of Minnesota, Duluth

    Last modified by : $Id: weps2sval2.pl,v 1.20 2008/05/23 13:07:03 tpederse Exp $

COPYRIGHT
    Copyright (C) 2007-2008, Atul Kulkarni and Ted Pedersen

    This program is free software; you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation; either version 2 of the License, or (at your
    option) any later version.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
    General Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.