Overview

There are 875,255 concepts in UMLS (2003AA). Each unique concept identifier (CUI) can have entries in multiple languages. For example, the CUI C0010200 (cough, coughing) has entries in MRCON for many different languages. Here's a sample:

C0010200|BAQ|P|L1518678|PF|S1814587|EZTULA|0|
C0010200|ENG|P|L0010200|VS|S0028564|Cough|0|
C0010200|FRE|S|L1305261|PF|S1547207|Toux|0|
C0010200|ITA|P|L1305225|PF|S1547171|Tosse|0|
C0010200|SPA|S|L1305239|PF|S1547185|Tos|0|

There are actually thirteen different English entries for this concept. The (CUI, SUI) combination uniquely identifies an entry in MRCON. The paper by Burgun and Bodenreider equates these concepts to WordNet synsets.

The entry for C0010200 in MRDEF is:

C0010200|MSH|A sudden, audible expulsion of air from the lungs through a partially closed glottis, preceded by inhalation. It is a protective response that serves to clear the trachea, bronchi, and/or lungs of irritants and secretions, or to prevent aspiration of foreign materials into the lungs.

Of the 2,146,899 (CUI, SUI) pairs in MRCON:

      695 are Basque
      723 are Danish
   36,358 are Dutch
1,773,525 are English
   22,382 are Finnish
   36,489 are French
   71,316 are German
      485 are Hebrew
      718 are Hungarian
   24,992 are Italian
      722 are Norwegian
   63,581 are Portuguese
   44,906 are Russian
   69,284 are Spanish
      723 are Swedish

Hence, 82.6% of the content is English (1,773,525 of 2,146,899 entries). MRCON thus covers 15 different languages. There are also index files for each language (MRXW.BAQ, MRXW.DAN, etc.).

Again, all the files are strictly ASCII. The Russian and Hebrew words are transliterated into ASCII characters, and diacritical marks from other languages are not present.

Most concepts seem to have entries from more than one language. I think all the concepts have at least one English entry. All the other content (definitions, for example) is in English.
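The per-language tally above can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes that the second pipe-delimited field of each MRCON row is the language code and the sixth is the SUI (my reading of the sample rows above), and the function names are made up for illustration. It runs on the five sample rows and then re-checks the reported total and the English share from the counts listed above.

```python
from collections import Counter

# Sample rows copied from the MRCON excerpt above.
SAMPLE = """\
C0010200|BAQ|P|L1518678|PF|S1814587|EZTULA|0|
C0010200|ENG|P|L0010200|VS|S0028564|Cough|0|
C0010200|FRE|S|L1305261|PF|S1547207|Toux|0|
C0010200|ITA|P|L1305225|PF|S1547171|Tosse|0|
C0010200|SPA|S|L1305239|PF|S1547185|Tos|0|
"""

# Per-language counts as reported in the text above.
REPORTED = {
    "Basque": 695, "Danish": 723, "Dutch": 36358, "English": 1773525,
    "Finnish": 22382, "French": 36489, "German": 71316, "Hebrew": 485,
    "Hungarian": 718, "Italian": 24992, "Norwegian": 722,
    "Portuguese": 63581, "Russian": 44906, "Spanish": 69284, "Swedish": 723,
}


def parse_mrcon_line(line):
    """Return (cui, language, sui, string) for one MRCON row.

    Assumed field positions: 0 = CUI, 1 = language, 5 = SUI, 6 = string.
    """
    fields = line.rstrip("\n").split("|")
    return fields[0], fields[1], fields[5], fields[6]


def count_languages(lines):
    """Count MRCON rows per language code, skipping blank lines."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts[parse_mrcon_line(line)[1]] += 1
    return counts


def share(counts, key):
    """Percentage of all entries that belong to `key`."""
    total = sum(counts.values())
    return 100.0 * counts[key] / total if total else 0.0


sample_counts = count_languages(SAMPLE.splitlines())
reported_total = sum(REPORTED.values())
english_share = round(share(REPORTED, "English"), 1)
```

To run this over the real file, one would pass an open file handle for MRCON to `count_languages` instead of the sample lines; the path depends on the local UMLS installation.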
I've come to the conclusion that the relationships between concepts provided by UMLS are difficult to work with, and one would do well to avoid them. I don't really know how some of these relationships were derived. By my count there are 129,417 reflexive relationships in META/MRREL, that is, relationships that link a concept to itself (to be fair, some of these are 'RQ' relations, which means the source has asserted relatedness and possible synonymy). It's quite common for concepts to be their own parents (there are 13,725 such cases). I thought that the weirdness might be coming from a small number of vocabularies, but this doesn't appear to be the case. I did notice, however, that MeSH doesn't define any I'm-my-own-parent relationships.

I've been looking at the semantic network in UMLS some more. I had mentioned in a previous e-mail that I thought the semantic network might be useful for some type of edge-counting. After looking at the documentation a bit more, I've discovered some problems. One of the changes made to the semantic network in version 2003AB is the addition of the transitive closure of the relationships in the semantic network. What this means for us is that the shortest path between any semantic type and any of its subsumers, up to and including the least specific one, is at most 1. Likewise, the shortest path between two semantic types, assuming a path exists, would be at most 2 (one edge connecting type1 to the root and one edge connecting type2 to the root). We *could* try to find the _longest_ path instead, but in a general graph that is one of those dreaded NP-hard problems.
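To see why the transitive closure defeats edge-counting, here is a small sketch on a toy hierarchy. The node names and edges are invented for illustration and are not real semantic types, and the closure routine is a naive fixed-point loop rather than whatever procedure was used to build the 2003AB files. Before the closure, the two deepest nodes are five edges apart; afterwards every node is linked directly to the root, so no pair is more than two edges apart.

```python
from collections import deque

# Toy "isa" edges as (child, parent) pairs; the names are hypothetical.
# Two chains hang off the root A:  D -> C -> B -> A  and  F -> E -> A.
ISA = {("B", "A"), ("C", "B"), ("D", "C"), ("E", "A"), ("F", "E")}


def transitive_closure(edges):
    """Add (a, d) whenever (a, b) and (b, d) are present, until stable."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def shortest_path_length(edges, start, goal):
    """Breadth-first search over the edges, ignoring their direction.

    Returns the number of edges on a shortest path, or None if the two
    nodes are not connected.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


closed = transitive_closure(ISA)
before = shortest_path_length(ISA, "D", "F")
after = shortest_path_length(closed, "D", "F")
```

If edge-counting is still wanted, one option would be to work from the direct (non-inferred) isa links only, provided those can be separated from the closure in the distributed files.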