Challenges and insights from a Human Disease Ontology mapping project
Ontologies and vocabularies standardize domain knowledge, exemplified by a recent SciBite and University of Maryland project for the Human Disease Ontology Knowledgebase.
Overview
Ontologies and controlled vocabularies exist to define domain knowledge, harmonizing concepts across datasets and thereby enabling semantic alignment and data standardization. However, ontologies and vocabularies built for different purposes often overlap in domain. A study by Kamdar et al. found that, generally, well-established ontologies and controlled terminologies do not reuse terms from other ontologies [1]. Possible reasons include an inability to find the required term in another ontology because of differing labels, a slightly different use case for the term, or diverging definitions between otherwise equivalent terms. Additionally, groups may want to set up bespoke ontologies from scratch. To optimize the utility of ontologies for biomedical data integration, it is necessary to map or match concepts between different resources [2].
Here we describe a specific ontology mapping project that was recently undertaken by SciBite and researchers from the University of Maryland School of Medicine for the Human Disease Ontology Knowledgebase [3].
We illustrate some of the challenges and intricacies of this seemingly straightforward process, together with some considerations on how to improve the process. We found that both automated and manual approaches were necessary to complete the project, highlighting the importance of a combined approach to maximize the efficiency and accuracy of matching ontology terms.
Case Study: Mapping disease vocabularies
Ontology mapping is a tricky business for a person, let alone a computer. The ambiguity and nuances of the English language mean that simple word or phrase matching will not be sufficient for the subtler matches. Take, for instance, the word ‘trunk’. Even within the anatomical domain, without additional context, this could be referring to the main part of a body, the woody stem of a tree or a prehensile appendage of an elephant. Outside of anatomy it could be a large traveling case, a male swimming garment or – if you are in the United States – the boot of a car or a pair of underwear (Figure 1).
Rules-based and machine-learning approaches have been developed to address ontology mapping challenges [4], but despite this, Subject Matter Expert validation is essential to ensure the quality and accuracy of the suggested mappings [5].
This was highlighted in a recent project that SciBite undertook with the Human Disease Ontology (DO) Knowledgebase group at the University of Maryland School of Medicine. The DO group connected with SciBite requesting disease term mapping based on OMIM IDs (Online Mendelian Inheritance in Man) [6] between the Disease Ontology and UniProt Disease (UPDISEASE) entries [7]. The purpose of this effort, for the DO project, was to facilitate timely disease term mapping between UniProt and the DO. Cross-mapping between resources is a time-consuming and curatorially heavy effort. The automated, expert-reviewed mappings provided by SciBite greatly increased the speed with which this effort could be accomplished.
Combined mapping approach
The disease mapping project involved aligning the approximately 6,500 UniProt Disease classes to the DO classes, utilizing several of SciBite’s tools in addition to expert curators. Throughout this process, SciBite and DO curators noted that the combination of automated and manual approaches was key to a successful mapping project. We found that automated mapping tools alone were unlikely to create a complete and accurate mapping. Thus, it is necessary to employ manual approaches to refine the automated approach, verify the mappings created by the tool and find mappings that are missed by the tool.
The primary tool used was SciBite’s mapping tool Workbench [8]. Workbench uses SciBite vocabularies (VOCabs) and TERMite (SciBite’s Named Entity Recognition engine) to align data or ontologies to another data standard or ontology. Workbench is designed to assist the user in the time-consuming and error-prone process of matching terms between sources.
In addition to Workbench, SciBite’s ontology management platform CENtree was incorporated into the mapping process to allow the curators to visualize the DO terms, their synonyms, definitions and position within the hierarchy to assist with validation of the suggested mappings (Figure 2).
Leveraging SciBite vocabularies
Workbench is built to utilize SciBite VOCabs and TERMite to provide comprehensive synonym coverage of life sciences domains. The key VOCab for the UPDISEASE:DO mapping project was SciBite’s DOID VOCab. This VOCab is built from the DO but optimized for named entity recognition (NER) by expert curators. This process includes augmenting it with additional synonyms from literature review and rules-based synonym generation, plus adding context and disambiguation where appropriate to increase precision and recall in NER.
Workbench requires a TERMite-ready VOCab for each of the terminologies being mapped, which meant a rudimentary VOCab also had to be created for the UniProt Disease entries. Fortunately, a simple three-column file would suffice for the initial mapping procedure (Figure 3).
The results of the initial mapping in Workbench were promising (Figure 4).
Approximately 90% of UniProt disease terms were mapped to a DO term, either as an EXACT or BROADER match. An exact match is where at least one label or synonym matches exactly between the source term and target, whereas a broader match is one where the labels or synonyms are similar and the similarity is above the user-defined cut-off score.
Although a large proportion of the UPDISEASE classes were successfully mapped to DO, there were still 50% that were not exact and required curator review. Many of the broader matches could be verified by a quick eyeball of the corresponding terms, as in Figure 5a.
The UPDISEASE term “Dystonia 6, torsion” is clearly the same as the DO term “torsion dystonia 6”, so these are quick for the curator to review. Other reviews may take longer, as in the example in Figure 5b.
The UPDISEASE term “Corticosterone methyloxidase 1 deficiency” was reported as having NO mapping, but a curator checking the ontology would find that an EXACT mapping to DO actually does exist, differing only in the order of the terms within the disease name.
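As a rough illustration of the EXACT/BROADER distinction described above, the following Python sketch classifies a candidate pair of terms. It is not Workbench's algorithm, which is not described in this article: the `normalize`, `token_similarity` and `classify` helpers, the Jaccard token-overlap score and the 0.6 cut-off are all assumptions chosen for illustration.

```python
import re


def normalize(name):
    """Lowercase a name and reduce it to space-separated alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", name.lower()))


def token_similarity(a, b):
    """Jaccard overlap of word tokens (an assumed stand-in for a real score)."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def classify(source_names, target_names, cutoff=0.6):
    """Return (match_type, score) for one source term against one target term.

    source_names / target_names hold the label plus any synonyms.
    The cutoff plays the role of the user-defined cut-off score.
    """
    src = {normalize(n) for n in source_names}
    tgt = {normalize(n) for n in target_names}
    if src & tgt:
        # At least one label or synonym is identical after normalization.
        return "EXACT", 1.0
    best = max((token_similarity(s, t) for s in src for t in tgt), default=0.0)
    if best >= cutoff:
        return "BROADER", best
    return "NONE", best


print(classify(["Dystonia 6, torsion"], ["torsion dystonia 6"]))
```

Under these assumptions, “Dystonia 6, torsion” against “torsion dystonia 6” comes out as a BROADER match with a high score, because the same words appear in a different order. That is the kind of pair a curator can confirm at a glance.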
Semi-automated synonym creation
Curator review of the mapping output can also identify possible improvements to the mapping process. One of the major reasons for non-exact matches in this mapping project was the different word order between the labels of UPDISEASE terms and the DO, or the addition of ‘type’ in the DO label, resulting in no mapping being found (Figure 6a and 6b).
This could be resolved by augmenting the UPDISEASE VOCab using a semi-automated approach. This involved using regular expressions to create additional synonyms for the UPDISEASE terms that would match the word order in DO (Figure 7).
By taking this approach of creating synonyms with matching word orders, it was possible to map an additional 10-15% of UPDISEASE terms to DO, thus further reducing the manual effort required by the curators.
Unfortunately, there were too many variations of word order and word addition to capture with regular expressions. Figure 8 shows the initial broader mapping of the UPDISEASE term “Muscular dystrophy congenital LMNA-related” to the DO term “muscular dystrophy”. After manual review it was found the exact mapping was in fact to DO term “congenital muscular dystrophy due to LMNA mutation”. It would be extremely time-consuming to construct regular expressions to cover all the different types of synonym variation for little gain (generally only a handful of terms follow each particular pattern), so manual review of these was essential.
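To make the regular-expression approach to synonym creation more concrete, here is a minimal Python sketch. The two patterns are hypothetical and are not the expressions used in the project: one moves a trailing comma-separated modifier to the front, and the other inserts ‘type’ before a trailing number. The function name `generate_synonyms` is likewise an assumption.

```python
import re

# Hypothetical pattern: "<head>, <modifier>" -> "<modifier> <head>"
TRAILING_MODIFIER = re.compile(r"^(?P<head>[^,]+),\s*(?P<modifier>[^,]+)$")
# Hypothetical pattern: "<name> <number>" -> "<name> type <number>"
TRAILING_NUMBER = re.compile(r"^(?P<name>.*\S)\s+(?P<number>\d+[A-Za-z]?)$")


def generate_synonyms(label):
    """Return extra synonyms for a label; the original label is not included."""
    synonyms = set()
    m = TRAILING_MODIFIER.match(label)
    if m:
        # Reorder "Dystonia 6, torsion" into "torsion Dystonia 6".
        synonyms.add(f"{m['modifier'].strip()} {m['head'].strip()}")
    # Apply the 'type' rule to the label and to any reordered variant.
    for name in {label, *synonyms}:
        n = TRAILING_NUMBER.match(name)
        if n and not n["name"].lower().endswith(" type"):
            synonyms.add(f"{n['name']} type {n['number']}")
    synonyms.discard(label)
    return synonyms


print(sorted(generate_synonyms("Dystonia 6, torsion")))
```

A label such as “Muscular dystrophy congenital LMNA-related” matches neither pattern and yields no synonyms, which reflects the limitation noted above: each additional variation needs its own rule, often for only a few terms.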
At this point most of the low-hanging fruit had been mapped, but you may well remember that the UniProt Disease entities also had mnemonics or acronyms associated with them. It was decided to add the mnemonics to the UPDISEASE VOCab and perform the mapping process again using only the remaining unmapped UPDISEASE terms. Adding these mnemonics to the VOCab used in the initial mapping would likely have resulted in many false positives due to the short length of the synonyms. This would have meant the curators would have had more mappings to manually verify, increasing the time and effort.
Adding the mnemonic synonyms to the UniProt Disease Entities resulted in an additional ~300 terms being mapped.
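The two-pass strategy described above can be sketched as follows. This is an illustration only, not the Workbench workflow: the data structures, the helper names `map_exact` and `two_pass_mapping`, and the toy identifiers and names in the usage example are all made up, and exact lookup stands in for whatever matching the real tool performs.

```python
def normalize(name):
    """Lowercase a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def map_exact(terms, field, target_index):
    """Map each source term to a target ID using the names stored under `field`.

    terms: {source_id: {"names": [...], "mnemonics": [...]}}
    target_index: {normalized target name or synonym: target_id}
    """
    mapped = {}
    for source_id, term in terms.items():
        for name in term.get(field, []):
            target_id = target_index.get(normalize(name))
            if target_id is not None:
                mapped[source_id] = target_id
                break
    return mapped


def two_pass_mapping(source_terms, target_index):
    """First pass on full names; second pass on mnemonics for the leftovers only."""
    first = map_exact(source_terms, "names", target_index)
    remaining = {sid: t for sid, t in source_terms.items() if sid not in first}
    # Short mnemonics are tried only on unmapped terms to limit false positives.
    second = map_exact(remaining, "mnemonics", target_index)
    return first, second


# Toy data, invented purely for illustration.
source_terms = {
    "SRC:1": {"names": ["Example disease alpha"], "mnemonics": ["EDA"]},
    "SRC:2": {"names": ["Beta syndrome, example"], "mnemonics": ["EBS"]},
    "SRC:3": {"names": ["Gamma disorder"], "mnemonics": ["GD"]},
}
target_index = {"example disease alpha": "TGT:1", "ebs": "TGT:2"}

first_pass, second_pass = two_pass_mapping(source_terms, target_index)
print(first_pass, second_pass)
```

Restricting the second pass to the unmapped terms keeps the short, ambiguous mnemonics away from terms that already have a confident mapping, so curators only review mnemonic-based suggestions where nothing better was found.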
Getting into the weeds
As with all ontology mapping exercises, there are several hundred or (if you’re unlucky) thousands of terms that cannot be automatically mapped and the only option, until automated approaches improve, is for a curator to search for a matching term within the target source. This is where the curator’s expertise and fastidiousness come into their own (with a little help from an ontology viewer such as CENtree).
Take the example shown in Figure 9.
The UniProt Disease term “Crisponi/Cold-induced sweating syndrome 2” has a mnemonic synonym “CISS2”. This term does not exist as an exact match in DO, but the curator has found a similar term, “cold-induced sweating syndrome 2”. However, it has no synonyms to provide additional evidence of its meaning (Figure 9a). The question is, are these equivalent? By looking at DO in an ontology browser such as CENtree, the curator could examine the terms surrounding this potential target term. The parent term of the DO term is “cold-induced sweating syndrome”, and among its synonyms is “Crisponi syndrome” (Figure 9b), so the curator can infer from this that the UPDISEASE term is in fact equivalent to the DO term “cold-induced sweating syndrome 2”.
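The reasoning in this example, looking to the parent term for words missing from the candidate's own label, can be sketched in Python. This is an assumed helper for surfacing evidence to a curator, not a feature of CENtree or Workbench, and it does not decide equivalence by itself. The dictionary layout, the function name `ancestor_evidence` and the placeholder identifiers `CHILD` and `PARENT` are hypothetical; only the labels and the synonym come from the example above.

```python
import re


def tokens(text):
    """Split a name into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ancestor_evidence(source_label, target_id, ontology):
    """Find source tokens absent from the target's names but present in an ancestor.

    ontology: {term_id: {"label": str, "synonyms": [str], "parents": [term_id]}}
    Returns (missing_tokens, {token: [(ancestor_id, ancestor_name), ...]}).
    """
    target = ontology[target_id]
    own = set()
    for name in [target["label"], *target["synonyms"]]:
        own |= tokens(name)
    missing = tokens(source_label) - own

    evidence = {}
    stack, seen = list(target["parents"]), set()
    while stack:
        ancestor_id = stack.pop()
        if ancestor_id in seen:
            continue
        seen.add(ancestor_id)
        ancestor = ontology[ancestor_id]
        for name in [ancestor["label"], *ancestor["synonyms"]]:
            for tok in missing & tokens(name):
                evidence.setdefault(tok, []).append((ancestor_id, name))
        stack.extend(ancestor["parents"])
    return missing, evidence


# Placeholder IDs; labels and synonym taken from the example in the text.
ontology = {
    "PARENT": {"label": "cold-induced sweating syndrome",
               "synonyms": ["Crisponi syndrome"], "parents": []},
    "CHILD": {"label": "cold-induced sweating syndrome 2",
              "synonyms": [], "parents": ["PARENT"]},
}

missing, evidence = ancestor_evidence(
    "Crisponi/Cold-induced sweating syndrome 2", "CHILD", ontology)
print(missing, evidence)
```

Here the only word in the UniProt label that the candidate term does not account for is found in a synonym of its parent, which is the same evidence the curator used to accept the mapping.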
A final example of how manual review was necessary for this mapping exercise is shown in Figure 10. The UPDISEASE entity “Agammaglobulinemia 9, autosomal recessive” expresses not only the disease but its autosomal recessive inheritance. In some cases, the Disease Ontology also expresses the inheritance within the term label, such as “autosomal recessive hypercholesterolemia”. However, due to nomenclature variations, this is not consistent and there are terms where the inheritance is expressed only within the ontology hierarchy. Only by viewing the ontology is the curator able to combine this information and conclude that the two terms are equivalent.
Summary
The project successfully identified the majority of disease term mappings between UniProt and the DO, providing an ML-ready dataset of disease-to-disease mappings. The iterative mapping approach facilitated new solutions to tricky term mappings. For the DO project curators, this approach provided a targeted set of mappings to review, thus reducing the time burden to identify related data across resources.
Ontology mapping continues to be challenging due to the nuances of language and context-dependent interpretation of certain words and phrases. While some of the burden can be taken by automated approaches for the more straightforward connections, much of the work relies on subject matter experts and curators to verify the matches or to search for the appropriate mappings.
Automated approaches to mapping ontologies are improving all the time but, for now, if you want to distinguish between an elephant’s proboscis and a pair of underpants, don’t rely on an algorithm!
References
About the authors
Rachael Huntley, Lead Scientific Curator, SciBite
Rachael Huntley is Lead Scientific Curator at SciBite with over 20 years of biocuration experience. Dr. Huntley received her PhD in plant biochemistry from the University of Cambridge and completed post-doctoral research in both Cambridge, UK and Stanford, USA. During her time at EMBL-EBI and University College London she contributed to functional annotation of human proteins and microRNAs involved in human health and disease. Throughout her biocuration career, she has worked closely with the Gene Ontology Consortium and major pharmaceutical companies and has contributed to the development of ontologies, biocuration standards and curation tools.
Lynn Schriml, Associate Professor, University of Maryland
Lynn Schriml is Associate Professor at the University of Maryland School of Medicine in the Department of Epidemiology and Public Health and at the Institute for Genome Sciences (IGS) in Baltimore, Maryland. Dr. Schriml’s current research focuses on developing bioinformatic tools, metadata standards and ontologies to gain a broader understanding of the relationship between infectious pathogens, their genomic sequence and disease. Dr. Schriml is the primary developer of a suite of OBO Foundry biomedical ontologies including the Disease Ontology, Symptom Ontology, Transmission Method Ontology, Influenza Ontology, Environmental (EnvO) ontology and geographic locations gazetteer (GAZ) vocabulary.