Name Discrimination Data
This page contains data where ambiguous entity names in text have been 
disambiguated. The data has either been manually disambiguated, or created 
by conflating multiple names into a single ambiguous pseudo-name.
 Kulkarni Name Corpus 
 
This data was manually disambiguated by Anagha Kulkarni as a part of her 
M.S. thesis. It has subsequently been used in our IJCAI-2007 workshop and 
CICLING-2007 papers. It consists of Google search results for person names 
that are known to be ambiguous. Please cite her thesis if you use the data: 
 
This data is described in the following README and is available for 
download here. 
 Name Conflate Data 
This is data where we have created artificial ambiguities by conflating 
together the occurrences of person or place names. We typically create 
this data from the English GigaWord corpus using the nameconflate.pl 
program. 
- (CICLING 2006) 
2-, 3-, and 4-way name conflations in Spanish, English, Bulgarian, and 
Romanian. The Spanish data includes Bill Clinton-Yasar Arafat, Boris 
Yeltsin-John Paul II, New York-Washington, New York-Brasil-Washington, 
OTAN-EZLN. The English data includes Bill Clinton-Tony Blair, Bill 
Clinton-Tony Blair-Ehud Barak, Mexico-Uganda, 
Mexico-India-California-Peru, Microsoft-IBM. The Bulgarian data includes 
Burgas-Varna, Franciya-Germaniya, Petar Stoyanov-Georgi Parvanov, 
Simeon Sakskoburggotski-Nikolay Vasilev, BSP-SDS. The Romanian data 
includes Traian Basescu-Adrian Nastase, Bucuresti-Brasov, PD-PSD, Ion 
Iliescu-Adrian Nastase-Traian Basescu. 
 [paper with experiments and results for this data: CICLING-2006]
 
 
- (IICAI 2005) 
20 Newsgroup data (email) by topic and 2- and 3-way name conflations, 
including Tony Blair-Vladimir Putin, Mexico-Uganda, Microsoft-Compaq, 
Serena Williams-Tiger Woods, Sonia Gandhi-Leonid Kuchma, Tony 
Blair-Vladimir Putin-Saddam Hussein, Mexico-Uganda-India, 
Microsoft-Compaq-Serena Williams. 
 [paper with experiments and results for this data: IICAI-2005]
 
 
- (ACL 2005 demo) 
20 Newsgroup data (email) by topic and 2- and 3-way name conflations, 
including American Airlines-Tom Cruise, Britney Spears-George Bush, Bill 
Gates-Tom Cruise, American Airlines-Hewlett Packard-BMW, 
American Airlines-Hewlett Packard-Britney Spears, Bill Gates-Tom 
Cruise-George Bush. 
 [paper with experiments and results for this data: ACL-2005 demo]
 
 
- (AAAI 2005 demo) 
2-way name conflations of the form Name-Name, Country-Noun, Name-Noun, 
Noun-Noun, Noun-Verb. 
 
 
- (ACL 2005 workshop) 
2-, 3-, and 4-way name conflations, including Bill Clinton-Tony Blair, 
Bill Clinton-Tony Blair-Ehud Barak, Mexico-India-California, John 
Grisham-Bayer-Bank of America, Bill Clinton-Tony Blair-Ehud 
Barak-Vladimir Putin. 
 [paper with experiments and results for this data: ACL-2005 student research workshop]
 
 
- (CICLING 2005) 
2-way name conflations, including David Beckham-Ronaldo, Japan-France, 
Microsoft-IBM, Peres-Milosevic, Rolf-Tajik, Egyptian-Jordan. 
 [paper with experiments and results for this data: CICLING-2005]
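
The idea of name conflation used for the data sets listed above can be 
illustrated with a minimal Python sketch. This is not the nameconflate.pl 
program, and it may differ from it in every detail: the function name 
conflate, the way the pseudo-name is formed (the names joined with spaces 
removed), and the returned answer key are all assumptions made for 
illustration only. 

```python
import re


def conflate(text, names, pseudo=None):
    """Replace every occurrence of each name in `names` with one pseudo-name.

    Returns the rewritten text and an answer key: the original name behind
    each replaced occurrence, in order of appearance.
    NOTE: hypothetical helper, not the behavior of nameconflate.pl.
    """
    if pseudo is None:
        # Assumed pseudo-name format: the names joined with spaces removed.
        pseudo = "".join(name.replace(" ", "") for name in names)

    # Try longer names first so a longer name wins over a shorter one it contains.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in ordered))

    key = []

    def swap(match):
        key.append(match.group(0))  # remember which name was hidden here
        return pseudo

    return pattern.sub(swap, text), key


text = "Bill Clinton met Tony Blair. Later Tony Blair spoke."
conflated, key = conflate(text, ["Bill Clinton", "Tony Blair"])
print(conflated)
print(key)
```

Each occurrence of the pseudo-name is ambiguous between the original 
names, and the answer key records which name it replaced, so a 
discrimination method can be scored without manual annotation. 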
 Acknowledgments 
All of the data on this page was created by Anagha Kulkarni, either 
using the nameconflate.pl program she wrote, or via manual annotation. 
  
 
