Semantic similarity

Semantic similarity measures over biomedical ontologies sit at the core of several of our research lines, from disease-gene prioritization to ontology-aware function transfer and biodiversity knowledge-graph search. We develop, benchmark and apply measures that exploit the OWL axiomatic structure of an ontology rather than only its lexical or taxonomic skeleton, and we have repeatedly shown that this richer semantics translates into measurable improvements on downstream prediction tasks. We operate within KAUST's Computer Science Program, and our distinctive contribution is to treat similarity not just as an information-theoretic quantity on a directed acyclic graph, but as a learnable function over the logical content of an ontology, expressed in vector spaces that machine-learning models can consume.
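As a point of reference for the classical view of similarity as an information-theoretic quantity on a directed acyclic graph, the following is a minimal sketch of Resnik-style similarity: the similarity of two classes is the information content of their most informative common ancestor. The toy ontology, the annotation set and all function names are hypothetical and chosen only for illustration; they are not taken from any of our tools.

```python
import math

# Hypothetical toy ontology: each class maps to its direct superclasses.
PARENTS = {
    "phenotype": set(),
    "cardiac": {"phenotype"},
    "neural": {"phenotype"},
    "arrhythmia": {"cardiac"},
    "cardiomyopathy": {"cardiac"},
    "seizure": {"neural"},
}

# Hypothetical direct annotations: each entity maps to ontology classes.
ANNOTATIONS = {
    "gene_a": {"arrhythmia"},
    "gene_b": {"cardiomyopathy"},
    "gene_c": {"seizure"},
    "gene_d": {"arrhythmia", "seizure"},
}


def ancestors(cls):
    """Return cls together with all of its (transitive) superclasses."""
    seen = {cls}
    stack = [cls]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(cls):
    """IC(c) = -log p(c), where p(c) is the fraction of entities
    annotated with c or with any of its subclasses."""
    count = sum(
        1
        for classes in ANNOTATIONS.values()
        if any(cls in ancestors(c) for c in classes)
    )
    if count == 0:
        return math.inf  # class never observed in the annotation set
    return math.log(len(ANNOTATIONS) / count)


def resnik(a, b):
    """Information content of the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(c) for c in common)


if __name__ == "__main__":
    print(resnik("arrhythmia", "cardiomyopathy"))  # share the ancestor "cardiac"
    print(resnik("arrhythmia", "seizure"))         # share only the root
```

Because the information content is estimated from annotation frequencies, the same pair of classes can receive a different score when the annotation set changes, even if the ontology itself is untouched.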

Foundations and benchmarks

A first strand systematically evaluates how the choice of measure, ontology and annotation set affects downstream tasks. The pair of studies titled "Evaluating the effect of annotation size on measures of semantic similarity" showed that measures behave very differently as annotation coverage grows, with practical consequences for protein-function and disease-gene applications. The review "Semantic similarity and machine learning with ontologies" structures the field by distinguishing measures that operate on graph structure, on logical entailment, and on learned embeddings, and discusses how each interacts with modern machine-learning pipelines. "Notions of similarity for systems biology models" extends the same conceptual analysis to systems-biology model comparison, where similarity must respect the structure of mathematical models rather than only ontology annotations.

Phenotype-based disease and gene discovery

The largest single application area is phenotype-driven disease and gene discovery. "PhenomeNET: a whole-phenome approach to disease gene discovery" introduced cross-species phenotype comparison as a strategy for prioritizing disease genes, and "Similarity-based search of model organism, disease and drug effect phenotypes" generalized this into a search engine over model-organism, disease and drug-effect phenotypes. "Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases" applied the same toolkit at population scale to construct a phenotype-based diseasome covering Mendelian, common and infectious disease. "Contribution of model organism phenotypes to the computational identification of human disease genes" later quantified, species by species, how much each model-organism resource contributes to human disease-gene discovery. "Semantic prioritization of novel causative genomic variants" showed that semantic similarity over phenotype ontologies can rank rare causative variants in individual patients. Crucial supporting work on cross-species ontology integration, in the two studies titled "Quantitative comparison of mapping methods between Human and Mammalian Phenotype Ontology", established how human and mouse phenotype encodings can be aligned in a way that preserves similarity-based reasoning.

From measures to representation learning

More recently, the focus has shifted from explicit similarity measures to learned representations that capture ontology semantics. "OPA2Vec: combining formal and informal content of biomedical ontologies to improve similarity-based prediction" jointly embeds axiomatic and annotation content and outperforms classical Resnik-style similarity on multiple prediction tasks. "A Machine Learning Based Approach for Similarity Search on Biodiversity Knowledge Graphs" takes the same idea outside biomedicine, into similarity search over digitized natural-history collections. On the clinical side, "Evaluating semantic similarity methods for comparison of text-derived phenotype profiles" and "Effects of Negation and Uncertainty Stratification on Text-Derived Patient Profile Similarity" show how similarity behaves when phenotype profiles are extracted from clinical narrative text rather than curated by experts, and identify the conditions under which text-derived similarity supports differential diagnosis.
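Once entities are represented as vectors, similarity-based prediction typically reduces to comparing those vectors. The sketch below ranks candidates by cosine similarity to a query; it illustrates only this final comparison step, not how any of the methods above learn their embeddings. The three-dimensional vectors, the entity names and the rank_candidates helper are all made up for illustration.

```python
import math

# Hypothetical, hand-written embeddings; real ones would be learned
# from ontology axioms and annotations and have many more dimensions.
EMBEDDINGS = {
    "disease_x": [0.9, 0.1, 0.3],
    "gene_a": [0.8, 0.2, 0.4],
    "gene_b": [0.1, 0.9, 0.2],
    "gene_c": [0.5, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates):
    """Sort candidates by decreasing cosine similarity to the query."""
    q = EMBEDDINGS[query]
    scored = [(c, cosine(q, EMBEDDINGS[c])) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_candidates("disease_x", ["gene_a", "gene_b", "gene_c"]):
        print(f"{name}\t{score:.3f}")
```

Cosine similarity is used here as a common default; a trained model could equally replace it with a learned scoring function over the same vectors.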

These methods underpin our software releases such as SmuDGE, Multi-Drug Embedding, DeepMOCCA and PathoPhenoDB, and they are the analytical backbone of ongoing CRG-funded programmes on explainable machine learning over biomedical ontologies, variant prioritization in complex disease, and smart analytics infrastructure for the life sciences. Together they form a coherent platform on which clinical, ecological and pharmacological similarity questions can all be answered through the same ontology-aware machinery.
