We collected data from the World Wide Web for the following five ambiguous
person names:

1. Richard Alston
2. Sarah Connor
3. George Miller
4. Ted Pedersen
5. Michael Collins

For each of the above names we performed the following steps:

Data Collection:

- Collected data using the Google search engine and WebService-GoogleHack.
- Read the top 50 html/htm pages (first level). For each first-level page,
  also followed the links in it to other html/htm pages (second level),
  provided the link was in the same web-space as the first-level page.

Data Formatting and Cleaning:

- Stripped all HTML tags from the contents of the first- and second-level
  pages using HTML-Format-2.04.
- Divided the cleaned data into smaller chunks of text, called contexts,
  using a package called NameConflate. Each context contains one instance
  of the ambiguous name (the head word).
- A context might contain one of the specified variants of the ambiguous
  name instead of the names listed above. For example, for the name
  "Michael Collins", a context might contain "Collins, M.T." or
  "M. Collins".
- Manually removed all occurrences of the following strings from the
  contexts:
  1. [IMAGE]
  2. --+
  3. ==+
  4. **+

Contact tpederse@d.umn.edu for additional info or questions.

------------
Anagha Kulkarni, September 2006
(This data was collected and annotated in the summer of 2006.)

Ted Pedersen, January 2008
(The Ted Pedersen data was modified so that sense 2 was recognized as being
the same person as sense 1, so Ted Pedersen became a 3-way ambiguity rather
than a 4-way one.)
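
Note on the clean-up step: the strings listed under Data Formatting and Cleaning were removed by hand. For anyone who wants to apply a comparable clean-up to new contexts, the following is a minimal Python sketch, not the procedure used to build this data. It assumes that "--+", "==+" and "**+" stand for runs of two or more '-', '=' or '*' characters, which is an interpretation of the list and not something stated here. The function name clean_context is hypothetical, and collapsing the leftover whitespace is an extra convenience.

```python
import re

# Assumption: "--+", "==+" and "**+" denote runs of two or more '-', '='
# or '*' characters. "[IMAGE]" is matched literally.
_DEBRIS = re.compile(r"\[IMAGE\]|-{2,}|={2,}|\*{2,}")


def clean_context(text: str) -> str:
    """Remove [IMAGE] markers and separator runs from one context.

    Hypothetical helper: replaces each match with a space so neighbouring
    words do not fuse, then collapses runs of whitespace.
    """
    stripped = _DEBRIS.sub(" ", text)
    return " ".join(stripped.split())
```

Single hyphens, as in "well-known", are left alone because the pattern only matches runs of at least two characters.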