Semantic similarity

Semantic similarity measures over biomedical ontologies sit at the core of several of our research lines, from disease-gene prioritization to ontology-aware function transfer and biodiversity knowledge-graph search. We develop, benchmark and apply measures that exploit the OWL axiomatic structure of an ontology rather than only its lexical or taxonomic skeleton, and we have repeatedly shown that this richer semantics translates into measurable improvements on downstream prediction tasks. We work within KAUST's Computer Science Program, and our distinctive contribution is to treat similarity not just as an information-theoretic quantity on a directed acyclic graph, but as a learnable function over the logical content of an ontology, expressed in vector spaces that machine-learning models can consume.

Foundations and benchmarks

A first strand systematically evaluates how the choice of measure, ontology and annotation set affects downstream tasks. The two studies titled "Evaluating the effect of annotation size on measures of semantic similarity" showed that measures behave very differently as annotation coverage grows, with practical consequences for protein-function and disease-gene applications. The review "Semantic similarity and machine learning with ontologies" structures the field by distinguishing measures that operate on graph structure, on logical entailment, and on learned embeddings, and discusses how each interacts with modern machine-learning pipelines. "Notions of similarity for systems biology models" extends the same conceptual analysis to the comparison of systems-biology models, where similarity must respect the structure of the mathematical models rather than only their ontology annotations.
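To make the dependence on the annotation set concrete, the following is a minimal sketch of a Resnik-style measure in which the information content of a class is estimated from an annotation corpus. The toy taxonomy, the entity names and the helper functions are all hypothetical, and the example illustrates only the general mechanism, not the experimental setup of the studies above.

```python
import math

# Hypothetical toy taxonomy: term -> set of direct parents (not a real ontology).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(annotations):
    """IC(t) = -log p(t), with p(t) the fraction of entities annotated
    with t or one of its descendants."""
    counts = {term: 0 for term in PARENTS}
    for terms in annotations.values():
        implied = set()
        for term in terms:
            implied |= ancestors(term)
        for term in implied:
            counts[term] += 1
    total = len(annotations)
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


def resnik(t1, t2, ic):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(ic.get(term, 0.0) for term in common)


# Two made-up corpora: the second adds three entities annotated with "B".
small = {"e1": {"A1"}, "e2": {"A2"}, "e3": {"B"}}
large = dict(small, e4={"B"}, e5={"B"}, e6={"B"})

sim_small = resnik("A1", "A2", information_content(small))
sim_large = resnik("A1", "A2", information_content(large))
print(sim_small, sim_large)
```

In this toy setting the similarity of A1 and A2 rises when unrelated annotations are added, because their shared ancestor A becomes relatively rarer in the corpus. The same pair of classes can therefore receive different scores depending on which annotation set is used.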

Phenotype-based disease and gene discovery

The largest single application area is phenotype-driven disease and gene discovery. "PhenomeNET: a whole-phenome approach to disease gene discovery" introduced cross-species phenotype comparison as a strategy for prioritizing disease genes, and "Similarity-based search of model organism, disease and drug effect phenotypes" generalised this into a search engine over model-organism, disease and drug-effect phenotypes. "Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases" applied the same toolkit to construct a phenotype-based diseasome covering Mendelian, common and infectious diseases. "Contribution of model organism phenotypes to the computational identification of human disease genes" later quantified, species by species, how much each model-organism resource contributes to human disease-gene discovery. "Semantic prioritization of novel causative genomic variants" showed that semantic similarity over phenotype ontologies can rank rare causative variants in individual patients. Supporting work on cross-species ontology integration, in the two studies titled "Quantitative comparison of mapping methods between Human and Mammalian Phenotype Ontology", established how human and mouse phenotype encodings can be aligned in a way that preserves similarity-based reasoning.
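The general idea of ranking candidate genes by phenotype similarity can be sketched as follows. This is a simplified illustration under stated assumptions, not the PhenomeNET implementation: the phenotype hierarchy, the gene names and the profiles are invented, human and model-organism phenotypes are assumed to have already been mapped into one shared hierarchy, and plain Jaccard similarity over ancestor-closed term sets stands in for whichever measure a real system would use.

```python
# Hypothetical toy phenotype hierarchy: term -> set of direct parents.
PARENTS = {
    "phenotype": set(),
    "heart": {"phenotype"},
    "heart_size": {"heart"},
    "enlarged_heart": {"heart_size"},
    "eye": {"phenotype"},
    "small_eye": {"eye"},
}


def closure(terms):
    """Expand a set of terms with all of their ancestors."""
    seen = set(terms)
    stack = list(terms)
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def jaccard(profile_a, profile_b):
    """Jaccard similarity between the ancestor closures of two profiles."""
    a, b = closure(profile_a), closure(profile_b)
    return len(a & b) / len(a | b)


def rank_genes(disease_profile, gene_profiles):
    """Order genes by decreasing similarity to the disease profile,
    breaking ties alphabetically by gene name."""
    scored = [(jaccard(disease_profile, p), g) for g, p in gene_profiles.items()]
    return [g for _, g in sorted(scored, key=lambda item: (-item[0], item[1]))]


# Made-up disease profile and model-organism gene profiles.
disease = {"enlarged_heart"}
genes = {
    "geneA": {"heart_size"},
    "geneB": {"small_eye"},
    "geneC": {"enlarged_heart", "small_eye"},
}

ranking = rank_genes(disease, genes)
print(ranking)
```

Comparing ancestor closures rather than the asserted terms alone lets a gene annotated with a more general phenotype (geneA) still score highly against a more specific disease phenotype, while an additional unrelated phenotype (as in geneC) lowers the score.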

From measures to representation learning

More recently, the focus has shifted from explicit similarity measures to learned representations that capture ontology semantics. "OPA2Vec: combining formal and informal content of biomedical ontologies to improve similarity-based prediction" jointly embeds an ontology's formal axioms and its informal annotation content, and outperforms classical Resnik-style similarity on multiple prediction tasks. "A Machine Learning Based Approach for Similarity Search on Biodiversity Knowledge Graphs" takes the same idea outside biomedicine, into similarity search over digitised natural-history collections. On the clinical side, "Evaluating semantic similarity methods for comparison of text-derived phenotype profiles" and "Effects of Negation and Uncertainty Stratification on Text-Derived Patient Profile Similarity" show how similarity behaves when phenotype profiles are extracted from clinical narrative text rather than curated by experts, and identify the conditions under which text-derived similarity supports differential diagnosis.
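Once entities are represented as vectors, similarity search reduces to comparing those vectors. The sketch below ranks candidates by cosine similarity to a query embedding. The three-dimensional vectors and the entity names are invented for illustration; in practice the embeddings would come from a trained model, and nothing here reproduces the training procedure of the methods above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, candidates, k=2):
    """Return the names of the k candidates closest to the query vector."""
    scored = sorted(
        candidates.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Hypothetical embeddings; real ones would be learned from ontology content.
gene_vector = [1.0, 0.0, 0.0]
disease_vectors = {
    "disease1": [0.9, 0.1, 0.0],
    "disease2": [0.0, 1.0, 0.0],
    "disease3": [0.5, 0.5, 0.0],
}

top = most_similar(gene_vector, disease_vectors, k=2)
print(top)
```

Because the comparison operates purely on vectors, the same ranking step applies whether the embedded entities are genes, diseases, patients or collection specimens.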

These methods underpin our software releases such as SmuDGE, Multi-Drug Embedding, DeepMOCCA and PathoPhenoDB, and they are the analytical backbone of ongoing CRG-funded programmes on explainable machine learning over biomedical ontologies, variant prioritization in complex disease, and smart analytics infrastructure for the life sciences. Together they form a coherent platform on which clinical, ecological and pharmacological similarity questions can all be answered through the same ontology-aware machinery.

Projects

Software

Publications (15)

(2022) Sarah Alghamdi, Paul N. Schofield, Robert Hoehndorf. Contribution of model organism phenotypes to the computational identification of human disease genes. Disease Models & Mechanisms.
(2022) Luke T. Slater, Sophie Russell, Silver Makepeace, Alexander Carberry et al. Evaluating semantic similarity methods for comparison of text-derived phenotype profiles. BMC Medical Informatics and Decision Making.
(2021) Luke T. Slater, Andreas Karwath, Robert Hoehndorf, Georgios V. Gkoutos. Effects of Negation and Uncertainty Stratification on Text-Derived Patient Profile Similarity. Frontiers in Digital Health.
(2020) Kulmanov, Smaili, Gao, Hoehndorf. Semantic similarity and machine learning with ontologies. Briefings in Bioinformatics.
(2019) Claus Weiland, Maxat Kulmanov, Marco Schmidt, Robert Hoehndorf. A Machine Learning Based Approach for Similarity Search on Biodiversity Knowledge Graphs. Biodiversity Information Science and Standards.
(2018) Ron Henkel, Robert Hoehndorf, Tim Kacprowski, Christian Knupfer et al. Notions of similarity for systems biology models. Briefings in Bioinformatics.
(2018) Smaili, Gao, Hoehndorf. OPA2Vec: combining formal and informal content of biomedical ontologies to improve similarity-based prediction. Bioinformatics.
(2017) Kulmanov, Hoehndorf. Evaluating the effect of annotation size on measures of semantic similarity. Journal of Biomedical Semantics.
(2017) Boudellioua, Mahamad Razali, Kulmanov, Hashish et al. Semantic prioritization of novel causative genomic variants. PLOS Computational Biology.
(2016) Maxat Kulmanov, Robert Hoehndorf. Evaluating the effect of annotation size on measures of semantic similarity. Proceedings of Bio-Ontologies SIG.
(2015) Hoehndorf, Gruenberger, Gkoutos, Schofield. Similarity-based search of model organism, disease and drug effect phenotypes. Journal of Biomedical Semantics.
(2015) Robert Hoehndorf, Paul N Schofield, Georgios V Gkoutos. Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases. Scientific Reports.
(2012) Oellrich, Gkoutos, Hoehndorf, Rebholz-Schuhmann. Quantitative comparison of mapping methods between Human and Mammalian Phenotype Ontology. Journal of Biomedical Semantics.
(2011) Anika Oellrich, Robert Hoehndorf, Georgios V. Gkoutos, Dietrich Rebholz-Schuhmann. Quantitative comparison of mapping methods between Human and Mammalian Phenotype Ontology. Proceedings of the 3rd Workshop for Ontologies in Biomedicine and Life sciences (OBML).
Hoehndorf, Schofield, Gkoutos. PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Research.

Related Highlights

Towards sound, complete, and explainable machine learning with biomedical ontologies (CRG11)

CompleX: Variant Prioritization in Complex Disease

Bio2Vec: Smart analytics infrastructure for the life sciences