Curriculum Vitae
-----------------

Anagha K Kulkarni
1220-1/2 E 1st Street Apt C
Duluth, MN - 55805
E-Mail: kulka020@d.umn.edu
Phone: (O) 218-726-6149 (R) 218-728-1635
Web-page: http://www.d.umn.edu/~kulka020/

Education
---------

Master of Science (M.S.)
Department of Computer Science, University of Minnesota, Duluth
Duration: Sept 2004 - present
Major: Computer Science
Minor: Statistics
GPA: 3.765

Bachelor of Computer Engineering (B.E.)
Government College of Engineering, University of Pune, India
Duration: July 1997 - June 2001
Major: Computer Engineering
Grade: First Class

Research
---------

Unsupervised Clustering of Similar Contexts
--------------------------------------------

Contexts (which can range from a few words to paragraphs to entire
documents) can be separated into groups by various clustering methods,
without using any pre-compiled knowledge resource. This methodology can be
used to address problems such as:

  Word Sense Disambiguation, which is the problem of assigning the correct
  meaning to a multi-sense word based on its surrounding context.

  Proper Name Discrimination, which is the problem of identifying the
  unique underlying entity for a name shared by multiple people, places or
  organizations, based on the context in which it is mentioned.

  Email Clustering, where emails are grouped into separate clusters based
  on their contents or topics.

  Word Clustering, which is the problem of grouping together related words,
  where "related" can mean various types of relations such as antonyms,
  synonyms, numeric characters, etc.

The SenseClusters package (http://senseclusters.sourceforge.net/)
implements all of the above and has been used in all of our experimental
work. This package was started by Dr. Ted Pedersen and Amruta Purandare in
Fall 2002 and has been continued by Dr. Pedersen and me since Fall 2004.
I have been responsible for versions 0.55 through the current version, 0.71.
I have adapted and extended the package to include functionality that
supports:

  Proper Name Discrimination
  Email Clustering
  Word Clustering
  Labeling the clusters - Each discovered cluster is assigned a list of
  descriptive labels (the most frequent word-pairs in the cluster) and
  discriminating labels (the most common word-pairs that are unique to the
  cluster). These labels help identify the entity that the cluster
  represents without having to manually examine the cluster contents.

I have also substantially modified the web interface to the package, which
can be found at http://marimba.d.umn.edu/cgi-bin/SC-cgi/index.cgi

For future versions we plan to include cluster stopping and improved
clustering.

Data Creation Utilities
------------------------

The SenseClusters package requires input in an XML format. We have two
data creation utilities that create experimental data for SenseClusters.

Name Conflate (http://www.d.umn.edu/~kulka020/kanaghaName.html)
  This utility consists of a Perl program that converts GigaWord English
  Corpus formatted text or plain text to XML format. Along with this format
  conversion, the utility conflates all occurrences of the words (target /
  head words) specified by the user. It also chops the input text into
  equal-sized contexts, each of which contains a single occurrence of a
  conflated (head) word.

Text2Headless (http://www.d.umn.edu/~kulka020/kanaghaText2Headless.html)
  This utility creates XML formatted data from plain text files. Each input
  plain text file becomes a context in the XML formatted output file. These
  contexts do not have any target / head word, hence the name 'Headless'.

Cluster Stopping problem
--------------------------

Cluster Stopping is the problem of deciding when to stop separating the
instances into further clusters or, in other words, the problem of
identifying the number of clusters that the dataset naturally falls into.
A typical plot of gain versus the number of clusters resembles a "knee" shape.
This shape indicates that increasing the number of clusters is beneficial
up to some point, after which separating the instances into more clusters
yields no further gain (represented by the plateau in the graph). Thus the
aim of many cluster stopping methods is to locate this "knee point", which
is the point at which the plateau starts.

I have studied this problem in two phases:

  During my summer 2005 internship at the Mayo Clinic, I explored and
  implemented three methods proposed in the clustering literature:

    Gap Statistic by Tibshirani, Walther and Hastie
    (detailed report available at
    http://www.d.umn.edu/~kulka020/GapStatistics.doc and implementation at
    http://search.cpan.org/dist/Statistics-Gap/)

    Calinski and Harabasz method
    (http://search.cpan.org/dist/Statistics-CalinskiHarabasz/)

    Hartigan method
    (http://search.cpan.org/dist/Statistics-Hartigan/)

  Using empirical methods of locating the "knee point" in the graph of
  gain vs. number of clusters: criterion measures such as the internal
  similarity of a cluster and the external similarity between clusters are
  used by clustering methods to make clustering decisions. These measures
  can also be used to locate the "knee point" because they usually
  represent the gain of the clustering.

Presentations
--------------

* Name Discrimination and Email Clustering using Unsupervised Clustering
  and Labeling of Similar Contexts. Presented at the Second Indian
  International Conference on Artificial Intelligence, Pune, India,
  Dec 20, 2005.

* Exploration of Three Cluster Stopping Rules for the task of Word Sense
  Discrimination. At the Division of Biomedical Informatics, Mayo Clinic,
  Rochester, MN, USA, Aug 25, 2005.

* Unsupervised Discrimination and Labeling of Ambiguous Names. At the
  Student Research Workshop of the 43rd Annual Meeting of the Association
  for Computational Linguistics, Ann Arbor, MI, USA, June 27, 2005.
* SenseClusters: Unsupervised Clustering and Labeling of Similar Contexts.
  At the Interactive Poster Session of the 43rd Annual Meeting of the
  Association for Computational Linguistics, Ann Arbor, MI, USA,
  June 26, 2005.

Publications
-------------

Pedersen T., Kulkarni A., Angheluta R., Kozareva Z. and Solorio T. (2006)
Improving Name Discrimination: A Language Salad Approach. To appear in the
Proceedings of the EACL 2006 Workshop on Cross-Language Knowledge
Induction, April 3, 2006, Trento, Italy.
http://www.d.umn.edu/~kulka020/eacl06-salad.pdf

Pedersen T. and Kulkarni A. (2006) Selecting the "Right" Number of Senses
Based on Clustering Criterion Functions. To appear in the Proceedings of
the Posters and Demo Program of the Eleventh Conference of the European
Chapter of the Association for Computational Linguistics, April 3-7, 2006,
Trento, Italy.
http://www.d.umn.edu/~kulka020/eacl06-demo.pdf

Pedersen T., Kulkarni A., Angheluta R., Kozareva Z. and Solorio T. (2006)
An Unsupervised Language Independent Method of Name Discrimination Using
Second Order Co-Occurrence Vectors. To appear in Proceedings of the
Seventh International Conference on Intelligent Text Processing and
Computational Linguistics, Volume TBD of Lecture Notes in Computer
Science, Springer, February 19-25, 2006, Mexico City, Mexico.
http://www.d.umn.edu/~kulka020/cicling2006.pdf

Kulkarni A. and Pedersen T. (2005) Name Discrimination and Email
Clustering using Unsupervised Clustering and Labeling of Similar Contexts.
To appear in Proceedings of the Second Indian International Conference on
Artificial Intelligence, December 20-22, 2005, Pune, India.
http://www.d.umn.edu/~kulka020/kulkarniP.pdf

Pedersen T. and Kulkarni A. (2005) Identifying Similar Words and Contexts
in Natural Language with SenseClusters. In Proceedings of the Twentieth
National Conference on Artificial Intelligence, July 2005, Pittsburgh, PA,
USA.
(Intelligent Systems Demonstration)
http://www.d.umn.edu/~kulka020/aaai2005.pdf

Kulkarni A. (2005) Unsupervised Discrimination and Labeling of Ambiguous
Names. In Proceedings of the Student Research Workshop of the 43rd Annual
Meeting of the Association for Computational Linguistics, June 25-30,
2005, Ann Arbor, MI, USA.
http://www.d.umn.edu/~kulka020/kulkarni.pdf

Kulkarni A. and Pedersen T. (2005) SenseClusters: Unsupervised Clustering
and Labeling of Similar Contexts. In Proceedings of the Demonstration and
Interactive Poster Session of the 43rd Annual Meeting of the Association
for Computational Linguistics, June 26, 2005, Ann Arbor, MI, USA.
http://www.d.umn.edu/~kulka020/acl2005-demo.pdf

Savova G., Pedersen T., Purandare A. and Kulkarni A. (2005) Resolving
Ambiguities in Biomedical Text with Unsupervised Clustering Approaches.
University of Minnesota Supercomputing Institute Research Report UMSI
2005/80 and CB Number 2005/21, May.
http://www.d.umn.edu/~kulka020/ResearchReport_2005-80.pdf

Pedersen T., Purandare A., and Kulkarni A. (2005) Name Discrimination by
Clustering Similar Contexts. In Proceedings of the Sixth International
Conference on Intelligent Text Processing and Computational Linguistics,
Volume 3406 of Lecture Notes in Computer Science, Springer, February
13-19, 2005, Mexico City, Mexico.
http://www.d.umn.edu/~kulka020/cicling2005.pdf

Academic Projects
------------------

SEAM: Simple Essay Analysis Mechanism
--------------------------------------

This was a class project for the Natural Language Processing graduate
course. The developed system (http://seam.sourceforge.net/) assigns any
given essay a holistic score based on four factors: relevance of the essay
to the essay topic, facts found in the essay, correct facts, and the
amount of gibberish found in the essay.
The web interface for the system can be accessed at
http://marimba.d.umn.edu/cgi-bin/RacingClub/welcome.cgi

Online Diagnosis of Diseases and Online Home Remedies
------------------------------------------------------

This was my senior-year undergraduate project. The project involved the
design and development of a portal using Selectica Configurator
Technology. The portal provided the user with the options of:

  Diagnosing a disease from the symptoms
  Finding an apt home remedy for a disease

The portal also provided facilities like:

  Information about the herbs used
  Description of general types of procedures (like decoction, straining
  etc.) used in the home remedies
  Mailing a query to a doctor

Work experience
----------------

Academic Work Experience
-------------------------

* Department of Computer Science, University of Minnesota, Duluth, MN, USA
  Duration: August 2004 - present
  Designation: Graduate Research Assistant to Dr. Ted Pedersen
  Responsibilities: Natural Language Processing (NLP) related research and
  development.

* Division of Biomedical Informatics, Mayo Clinic, Rochester, MN, USA
  Duration: June 2005 - August 2005
  Designation: Summer Intern
  Responsibilities: Exploring and implementing various Cluster-Stopping
  techniques.

Industry Work Experience
-------------------------

* Patni Computer Systems Private Limited, Pune, India
  Duration: May 2003 - June 2004
  Designation: Software Engineer
  Responsibilities: Software research and development

  Projects:
  ---------

  DBD Commission Disbursement
    Involved designing and developing a web-based intranet application for
    requisitioning and monitoring the commission disbursement to the
    agents hired by the client.

  Web Channel Integration
    Involved integrating external-facing web sites under a single
    comprehensive program and providing a solution to manage the site
    contents (Content Manager).
    The project consisted of:

      Creating authoring templates, viewing templates and publishing
      templates (according to the Digital Design Standard specified by
      the client) using WebSphere Application Developer (WSAD). The
      templates took care of reusing and repurposing the data. Sitemap
      generation and site navigation management were also handled, and a
      site search was provided.

      Implementing the Content Management resource.

      Implementing the necessary workflow for transaction tracking and
      workflow reporting.

      Porting the published pages to the production server.

* Werhardt Infosys Private Limited, Pune, India
  Duration: February 2002 - April 2003
  Designation: Software Engineer
  Responsibilities: Software research and development, testing and
  maintenance

  Projects:

  Fault Tree Maintenance
    Fault Tree Maintenance is software designed and developed for
    manufacturing industries. It provides facilities to log machine
    breakdowns, analyze the logs and report accordingly in various
    formats. It also monitors the work done for fault recovery and the
    time required. My work involved development of the software and its
    web interface.

  SOHAM (ERP Solution)
    SOHAM is an integrated set of software to operate business processes
    effectively. It is a customer-oriented manufacturing management system
    that integrates Sales, Planning, Procurement, Inventory, Distribution
    and Finance. I was responsible for the development of the Accounting
    module.

Technical skills
-----------------

Programming Languages: Perl, C, C++, Visual Basic
Technologies: CGI, ASP, JSP
Operating Systems: Linux, Windows
Databases: Oracle, MS Access, IBM DB2
Tools: MS Visual Studio, IBM WebSphere, WebSphere Portal Content Publishing
Version Control Systems: CVS, Visual SourceSafe
Markup Languages: XML, HTML, DHTML

References
------------------

Dr. Ted Pedersen
Associate Professor
Department of Computer Science, University of Minnesota, Duluth
E-Mail: tpederse@d.umn.edu

Dr. Ron Regal
Professor
Department of Mathematics & Statistics, University of Minnesota, Duluth
E-Mail: rregal@d.umn.edu

Dr. Serguei Pakhomov
Assistant Professor
Department of Medical Informatics, Mayo College of Medicine
E-Mail: Pakhomov.Serguei@mayo.edu