NAME

nameconflate.pl

SYNOPSIS

    nameconflate.pl --words "Brad Pitt","Russell Crowe" --inpformat giga INPUT

OR

    nameconflate.pl --regexs "/(President )?(George )?(W. )?Bush/","/(Prime Minister |PM )?(Tony )?Blair/" --inpformat giga INPUT

OR

    nameconflate.pl --regexsfile REGEXSFILE --inpformat giga INPUT

DESCRIPTION

nameconflate.pl is a data creation utility for word sense disambiguation/discrimination experiments. In particular, we have used it to test the SenseClusters package (http://senseclusters.sourceforge.net).

nameconflate can also be viewed as a document format converter, because it accepts input files in either English GigaWord format or plain text format and converts them to SENSEVAL2 format.

The distinguishing feature of nameconflate is that it takes the given words or patterns and masks each occurrence of those words, or of words matching those patterns, in the input data, thereby creating artificial ambiguity.

For example, one could create an artificial ambiguity between the names "Brad Pitt" and "Russell Crowe" with the following command:

    nameconflate.pl --words "Brad Pitt","Russell Crowe" --inpformat giga INPUT

The output in this case (written to STDOUT) might look like this:

    ... few years back. The new film's premiere party at Universal Studios drew a certain co-star from another Damon flick. His ``Ocean's Eleven'' partner in crime B_R (still super scruffy for an upcoming role) brought his wife, Jennifer Aniston, who gave Camryn Manheim a big hug and eagerly posed with a few

    was made. During preproduction, Howard went to Princeton, N.J., to meet his somewhat reticent subject on his home turf and to record background film for B_R who plays the gradually aging Nash from his student days through his winning of a Nobel Memorial Prize in economic science in 1994. All the ...

Instead of looking for exact occurrences of the given words in the input text, nameconflate can also search for patterns specified as Perl regular expressions.
So, to mask each of the following variations of the name George Bush:

    President George W. Bush
    President George Bush
    President Bush
    Bush

and each of the following variations of the name Tony Blair:

    Prime Minister Tony Blair
    PM Tony Blair
    Tony Blair
    Blair

we could use the following nameconflate command:

    nameconflate.pl --regexs "/(President )?(George )?(W. )?Bush/","/(Prime Minister |PM )?(Tony )?Blair/" --inpformat giga INPUT

OR

    nameconflate.pl --regexsfile REGEXSFILE --inpformat giga INPUT

NOTE: In the second command, the regular expressions are read from the file REGEXSFILE.

Contents of a sample REGEXSFILE:

    /(President )?(George )?(W. )?Bush/
    /(Prime Minister |PM )?(Tony )?Blair/

And a possible output (at STDOUT):

    major, independent player in counterterrorism investigations, however, making it more difficult to coordinate counter-terrorism efforts on a bureauwide basis. Meanwhile, officials said that after the XYZ administration came into office, top Justice Department officials did not initially see the urgent need to upgrade counterterrorism. In August, officials said, the bureau's acting

    it back into balance. But the question making its way through the Pakistani capital is whether the president, Pervez Musharraf, has as much power as XYZ implies. At the same time, many in India are wondering whether General Musharraf, who is known as a hawk, really wants to rein in Pakistan.

    for three days as a prank or, as the plan's mastermind puts it, ``an experiment in real life''? Think ``Lord of the Flies'' meets ``The XYZ Witch Project.'' Last year the reviewer King Kaufman called this a ``chilling'' little novel ``about that most elemental of human fears, being buried alive.''

    Visitors, said that 46 current and 165 former heads of foreign governments have engaged in the ``hand-shake diplomacy'' of U.S. exchange programs. Alumni include British XYZ and Hamid Karzai, head of Afghanistan's interim government.
    If the goal of bringing Muslim opinion leaders to the United States is persuading them to agree ...

When the --regexs or --regexsfile option is used instead of --words:

1. The answer tag contains an extra field, "org_word" ("original word"), whose value is the exact text that matched one of the given patterns.

2. Each regex pattern is internally assigned an identifier based on its position in the argument list. This identifier, combined with the string "sense", forms the unique senseid for that pattern, which is used in the answer tag.

For usage information, type 'nameconflate.pl --help' at the command prompt.

Some additional information:

The English GigaWord format looks like...

Chinese dissidents in US favor partial revoking of MFN
WASHINGTON, May 12 (AFP)

Chinese dissidents in the United States generally favor a partial withdrawal of Beijing's privileged trading status targeting state-owned firms, not complete revocation, dissident leaders said here Thursday.

There are differing views among the dissident community in the United States on the best way to advance human rights in China, but "an overwhelming majority do agree on the middle policy of targeted revocation," Zhao Haiching, the president of the National Council for Chinese Affairs (NCCA) told a press conference.
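
The masking behaviour described under DESCRIPTION can be illustrated with a small sketch. It is written in Python and is not the Perl code of nameconflate.pl; it only mirrors the documented behaviour under several assumptions. The function names conflate and words_to_regexs are hypothetical, the senseid format ("sense" followed by a 1-based position) is a guess at what the text describes, and the mask label is supplied by the caller here because the document does not say how labels such as B_R or XYZ are chosen. The sketch also returns the matched text for literal words, whereas the tool is described as reporting org_word only for --regexs/--regexsfile.

```python
import re
from typing import List, Tuple


def words_to_regexs(words: List[str]) -> List[str]:
    """Turn literal names into /.../ patterns that match them exactly."""
    return ["/" + re.escape(word) + "/" for word in words]


def conflate(text: str, regexs: List[str], label: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Replace every match of any pattern with `label`.

    Returns the masked text and one (senseid, org_word) pair per match,
    in order of occurrence.
    """
    named = []
    for position, regex in enumerate(regexs, start=1):
        body = regex.strip()
        # Drop the surrounding /.../ delimiters used on the command line.
        if len(body) >= 2 and body.startswith("/") and body.endswith("/"):
            body = body[1:-1]
        # Assumed senseid format: "sense" plus the 1-based argument position.
        named.append((f"sense{position}", body))

    if not named:
        return text, []

    # One alternation with a named group per pattern, so a single
    # left-to-right pass can tell which pattern produced each match.
    combined = re.compile("|".join(f"(?P<{sid}>{body})" for sid, body in named))
    answers: List[Tuple[str, str]] = []

    def mask(match: "re.Match[str]") -> str:
        for sid, _ in named:
            if match.group(sid) is not None:
                answers.append((sid, match.group(sid)))
                break
        return label

    return combined.sub(mask, text), answers


if __name__ == "__main__":
    patterns = [
        "/(President )?(George )?(W. )?Bush/",
        "/(Prime Minister |PM )?(Tony )?Blair/",
    ]
    masked, answers = conflate("President Bush met PM Tony Blair.", patterns, "XYZ")
    print(masked)
    print(answers)
```

Wrapping each pattern in its own named group keeps the pass single and leftmost-first, which is one simple way to record both the pattern that fired and the original text. It would not work unchanged for patterns that contain their own named groups or numbered backreferences.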

AUTHOR

Anagha Kulkarni, kulka020@d.umn.edu, University of Minnesota, Duluth

Ted Pedersen, tpederse@d.umn.edu, University of Minnesota, Duluth

BUGS

SEE ALSO

SenseClusters : http://sourceforge.net/projects/sensecluster

English GigaWord : http://wave.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05

SENSEVAL2 : http://www.sle.sharp.co.uk/senseval2/

COPYRIGHT

Copyright (C) 2005-2006, Ted Pedersen and Anagha Kulkarni

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.