/////////////////////////////////////////////////////////////////////
Supervised Methods for Automatic Acronym Expansion in Medical Text

Summary report of the internship at the Mayo Clinic,
Division of Biomedical Informatics, Rochester.
/////////////////////////////////////////////////////////////////////

Student
-------
Mahesh Joshi
Graduate Student, Department of Computer Science
University of Minnesota, Duluth

Supervisor
----------
Dr. Serguei Pakhomov
Research Associate
Division of Biomedical Informatics, Mayo Clinic, Rochester

Abstract
--------
Acronyms and abbreviations are very common in medical text. As a result
of their widespread use, we observe high levels of ambiguity among them.
For example, AC could stand for 'anti-inflammatory corticosteroid',
'acid controller', 'antecubital' or any one of as many as 13 possible
expansions that we have encountered in a fairly small subset of clinical
notes.

The ability to automatically resolve the expansion of an acronym can be
very useful in medical information retrieval. For example, using
automatic expansion resolution, a search query for 'anti-inflammatory
corticosteroid' can focus only on those occurrences of AC where the
acronym appears in the sense that is being searched.

As a part of our work, we performed a preliminary study to identify the
right kind of data for training supervised methods. We then focused on
solving this problem using traditional word sense disambiguation
methods. The intuition behind this approach is that, just as the meaning
of an ambiguous word most likely depends on its surrounding context, the
expansion of an acronym is in most cases determined by the context in
which it appears.

We experimented with various features, such as unigrams and bigrams in
the context of the acronym, parts of speech of the surrounding words,
and global features such as the section identifiers, service codes and
gender codes associated with the clinical note containing the acronym.
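The context features mentioned in the abstract can be illustrated with a
small Python sketch. This is not the code written during the internship
(that work was built on GATE, NSP and Weka); the function name
context_features, the window size of 3, the "uni="/"bi=" feature naming
and the example sentence are all assumptions made for illustration only.

```python
from collections import Counter


def context_features(tokens, target_index, window=3):
    """Count unigram and bigram features in a window around an acronym.

    Hypothetical helper: the feature naming and the default window size
    are illustrative assumptions, not taken from the report.
    """
    # Tokens to the left and right of the acronym, excluding the acronym.
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]

    features = Counter()
    for side in (left, right):
        lowered = [tok.lower() for tok in side]
        # Unigram features: each word in the window.
        for tok in lowered:
            features["uni=" + tok] += 1
        # Bigram features: adjacent word pairs on the same side, so no
        # bigram spans the acronym itself.
        for first, second in zip(lowered, lowered[1:]):
            features["bi=" + first + "_" + second] += 1
    return features


if __name__ == "__main__":
    # Invented example sentence, not taken from any clinical note.
    note = "Patient was given AC injection in the left knee".split()
    for name, count in sorted(context_features(note, note.index("AC")).items()):
        print(name, count)
```

A feature dictionary of this kind, possibly extended with part-of-speech
tags and note-level attributes, is what a supervised classifier would
take as input for each acronym occurrence.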

Publications read and/or discussed in detail
--------------------------------------------
@inproceedings{Pakhomov02,
  author    = {Pakhomov, S.},
  title     = {Semi-Supervised Maximum Entropy Based Approach to Acronym and Abbreviation Normalization in Medical Texts},
  booktitle = {Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-02)},
  pages     = {160--167},
  year      = {2002},
  address   = {Philadelphia}}

@inproceedings{PakhomovPC05,
  author    = {Pakhomov, S. and Pedersen, T. and Chute, C.},
  title     = {Abbreviation and Acronym Disambiguation in Clinical Discourse},
  booktitle = {Proceedings of the American Medical Informatics Association (AMIA) Annual Symposium (AMIA-2005)},
  year      = {2005},
  address   = {Washington D.C.}}

@article{LiuTF04,
  author  = {Liu, H. and Teller, V. and Friedman, C.},
  title   = {A Multi-aspect Comparison Study of Supervised Word Sense Disambiguation},
  journal = {Journal of the American Medical Informatics Association},
  volume  = {11},
  number  = {4},
  pages   = {320--331},
  year    = {2004}}

Publications read and/or discussed in general
---------------------------------------------
@inproceedings{Pedersen00,
  author    = {Pedersen, T.},
  title     = {A Simple Approach to Building Ensembles of Naive Bayesian Classifiers for Word Sense Disambiguation},
  booktitle = {Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-00)},
  year      = {2000},
  address   = {Seattle}}

@article{Dietterich98,
  author  = {Dietterich, T.},
  title   = {Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms},
  journal = {Neural Computation},
  volume  = {10},
  number  = {7},
  pages   = {1895--1923},
  year    = {1998}}

@inproceedings{YeatesBW00,
  author    = {Stuart Yeates and David Bainbridge and Ian H. Witten},
  title     = {Using Compression to Identify Acronyms in Text},
  booktitle = {Data Compression Conference},
  pages     = {582},
  year      = {2000}}

Time breakup
------------
* Meetings & Discussions (weekly NLP Research Group meeting, meetings
  and discussions with Dr. Pakhomov and the medical indexing experts) : 20%
* Reading (research papers, GATE documentation and code) : 20%
* Data analysis and searching (generating data for manual annotation) : 20%
* Programming, testing and experimenting : 40%

Software developed or enhanced
------------------------------
* GATE plugins and wrappers to annotate, mark and extract textual
  features such as unigrams, bigrams, parts of speech and features
  specific to clinical notes
* Enhanced Weka wrapper for GATE: modified the existing Weka wrapper in
  GATE to add functionality for extracting features at variable
  positions
* Modified the Ngram Statistics Package (NSP) for portability to
  Microsoft Windows
* A generic word sense disambiguation program that integrates GATE, NSP
  and Weka
* Perl script to generate a summary report of a set of experiments
* Modified the Weka StringToNominal filter to accept multiple attributes
  for conversion from string to nominal format
* Modified the GATE UI code to make some attributes read-only while
  performing annotation
* Perl scripts to analyze the coverage of acronyms in the UMLS LRABR
  table of abbreviations and the Mayo master-sheet data
* Java program to extract the required data from UIMA CAS XML files and
  put it in a simple XML format, to be used by the medical indexing
  experts for annotation using GATE

Papers or reports written
-------------------------
* Presentation at the Division of Biomedical Informatics Seminar

Any other resources created
---------------------------
* Automatically pre-annotated test bed for manual annotation of acronym
  senses
* A list of ambiguous acronyms used by the Medical Index

Any other significant work product or deliverable not included above
--------------------------------------------------------------------
* Started development of an interface with the Semantic Relatedness
  package (in progress)