Towards sound, complete, and explainable machine learning with biomedical ontologies (CRG11)

Sun, Jan 1 2023 - Thu, Dec 31 2026

Topics: Applied Ontology, Neuro-Symbolic AI, Ontology engineering, Semantic similarity

Overview

Deep learning has driven much of the recent progress in bioinformatics, but the resulting models are essentially black boxes: it is hard or impossible to ask why a prediction was made, to verify it against existing knowledge, or to detect biases that emerge from the interaction of dataset, architecture, and training objective. In clinical and biomedical settings these limitations matter, both because clinicians need to trust the systems they use and because formal guarantees of soundness and fairness can only be obtained when the inferential machinery can be inspected. The grand challenge addressed by this project, funded under the CRG11 call and running 2023-2026, is to combine the pattern-recognition power of deep learning with the soundness, completeness, and interpretability of logic-based knowledge representation, specifically for the rich biomedical ontologies formalized in the Web Ontology Language.

The technical advance is to design ontology embeddings that preserve the syntax-semantics relationship of Description Logics, not just the graph structure. The project pursues three complementary mathematical routes. The first is geometric: the model theory of Description Logic is mapped into vector spaces so that entailment becomes a geometric relation. This route extends the earlier work EL Embeddings: Geometric construction of models for the Description Logic EL++ (IJCAI 2019). It has since been strengthened in two papers, Enhancing Geometric Ontology Embeddings for EL++ with Negative Sampling and Lattice-Preserving ALC Ontology Embeddings (both Neural-Symbolic Learning and Reasoning, 2024), and culminates in Lattice-Based ALC Ontology Embeddings With Saturation (Neurosymbolic Artificial Intelligence, 2025), which extends soundness from EL++ to the substantially more expressive ALC. The second route uses category theory to relate the semantics of Description Logic to operations in real-valued vector spaces, so that conjunction and disjunction become genuine vector operations. The third uses differentiable fuzzy logics to encode axioms as soft constraints inside neural training. All methods are released through the mOWL library, described in mOWL: Python library for machine learning with biomedical ontologies (Bioinformatics, 2023). The design space they sit in is mapped by Ontology Embedding: A Survey of Methods, Applications and Resources (IEEE TKDE, 2025) and Neuro-Symbolic AI in Life Sciences (Handbook on Neurosymbolic AI, 2025).
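To make the geometric route concrete, here is a minimal sketch of how a subclass axiom can become a geometric relation. It assumes each class is embedded as an n-ball (a center and a radius) and that C SubClassOf D holds when the ball for C lies inside the ball for D, which is the general idea behind ball-based EL++ embeddings. The names Ball, subsumption_loss, and entails_subclass, and the example classes, are hypothetical and are not taken from mOWL or from the cited papers. The published methods add further loss terms, regularization, and gradient-based training that are omitted here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Ball:
    """Hypothetical embedding of one ontology class as an n-ball."""
    center: tuple
    radius: float


def distance(u, v):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def subsumption_loss(c, d, margin=0.0):
    """Hinge loss for the axiom 'C SubClassOf D'.

    With margin 0 the loss is zero exactly when ball c is contained in
    ball d, i.e. when dist(centers) + radius_c <= radius_d.
    """
    return max(0.0, distance(c.center, d.center) + c.radius - d.radius - margin)


def entails_subclass(c, d):
    """Read subsumption off the geometry: containment means entailment."""
    return subsumption_loss(c, d) == 0.0


# Illustrative (made-up) embeddings: the smaller ball sits inside the larger one.
specific_class = Ball(center=(0.0, 0.0), radius=1.0)
general_class = Ball(center=(0.5, 0.0), radius=2.0)

inside_loss = subsumption_loss(specific_class, general_class)
reverse_loss = subsumption_loss(general_class, specific_class)
```

In this sketch the loss for the satisfied axiom is zero, while the reversed axiom incurs a positive penalty that a training procedure could minimize. Because containment of balls is transitive, any chain of satisfied subclass axioms is automatically respected by the geometry, which is the sense in which entailment becomes a geometric relation.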

These embedding methods are validated on benchmarks that recast standard bioinformatics tasks as question-answering, multi-modal, and zero-shot inference problems. Results delivered so far include Protein function prediction as approximate semantic entailment (Nature Machine Intelligence, 2024), which formulates protein function prediction as a logical inference task and shows that approximate entailment outperforms the prior state of the art, and Predicting protein functions using positive-unlabeled ranking with ontology-based priors (Bioinformatics, 2024), which exploits ontology structure to compensate for the long-tailed positive-only labels typical of biological data. Knowledge integration for molecular property prediction is addressed in Large-Scale Knowledge Integration for Enhanced Molecular Property Prediction (NeSy, 2024), and variant prioritization in Prioritizing genomic variants through neuro-symbolic, knowledge-enhanced learning (Bioinformatics, 2024). From Axioms over Graphs to Vectors, and Back Again (ESWC 2023) shows what is preserved and what is lost at each step of the pipeline from axioms to graphs to vectors.

The driving application is non-opioid analgesia: more than fifty such drugs have been developed since the 1960s, but the mechanisms of most are unknown or uncertain, and opioid prescription has risen in Saudi Arabia in parallel with the global trend. By formulating drug repurposing as approximate question answering over a knowledge graph of genes, proteins, diseases, drugs, metabolites, phenotypes, and functions, the project aims to surface candidate mechanisms with attached symbolic explanations rather than opaque scores. The methods are also intended to support variant prioritization for rare genetic disease (a high burden in Saudi Arabia, where consanguinity rates are high) and low-resource learning, where formal knowledge compensates for limited training data.

Period: 2023–2026

Funding

KAUST Competitive Research Grant (CRG11) — Grant ID: URF/1/5041-01-01 (PI) — USD 232,191

Team

Robert Hoehndorf — PI (KAUST (Professor of Computer Science))
Paul N Schofield — CoI (University of Cambridge)
Fernando Zhapa-Camacho — PhD (alumnus) (KAUST)
Mahdi Bu Ali — MSc (alumnus)
Tengwei Song — Postdoc (former member)

Software

mOWL — Python library for machine learning with biomedical ontologies. Unifies projection-, axiom- and geometric-embedding methods (EL Embeddings, ELBE, BoxSquaredEL, OWL2Vec*, DL2Vec, OPA2Vec) behind one API, with first-class OWLAPI access and PyTorch integration. https://github.com/bio-ontology-research-group/mowl
Interpretable Learning — Generates interpretable symbolic rules from learned representations over biomedical knowledge bases. https://github.com/bio-ontology-research-group/interpretable-learning

Publications acknowledging this project (6)
