Towards sound, complete, and explainable machine learning with biomedical ontologies (CRG11)

Overview

Deep learning has driven much of the recent progress in bioinformatics, but the resulting models are essentially black boxes: it is hard or impossible to ask why a prediction was made, to verify it against existing knowledge, or to detect biases that emerge from the interaction of dataset, architecture, and training objective. In clinical and biomedical settings these limitations matter, both because clinicians need to trust the systems they use and because formal guarantees of soundness and fairness can only be obtained when the inferential machinery can be inspected. The grand challenge addressed by this project, funded under the CRG11 call and running 2023-2026, is to combine the pattern-recognition power of deep learning with the soundness, completeness, and interpretability of logic-based knowledge representation, specifically for the rich biomedical ontologies formalised in the Web Ontology Language.

The technical advance is to design ontology embeddings that preserve the syntax-semantics relationship of Description Logics, not just the graph structure. The project pursues three complementary mathematical routes. The first is geometric: the model theory of Description Logic is mapped into vector spaces so that entailment becomes a geometric relation. This route extends the earlier EL Embeddings: Geometric construction of models for the Description Logic EL++ (IJCAI 2019), is strengthened in Enhancing Geometric Ontology Embeddings for EL++ with Negative Sampling and Lattice-Preserving ALC Ontology Embeddings (both Neural-Symbolic Learning and Reasoning, 2024), and culminates in Lattice-Based ALC Ontology Embeddings With Saturation (Neurosymbolic Artificial Intelligence, 2025), which extends soundness from EL++ to the substantially more expressive ALC. The second route uses category theory to relate the semantics of Description Logic to operations in real-valued vector spaces, so that conjunction and disjunction become genuine vector operations. The third uses differentiable fuzzy logics to encode axioms as soft constraints inside neural training. All methods are released through the mOWL library, described in mOWL: Python library for machine learning with biomedical ontologies (Bioinformatics, 2023). The design space they occupy is mapped by Ontology Embedding: A Survey of Methods, Applications and Resources (IEEE TKDE, 2025) and Neuro-Symbolic AI in Life Sciences (Handbook on Neurosymbolic AI, 2025).
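
To make the geometric and fuzzy routes more concrete, the following is a minimal sketch and not the project's implementation. It assumes concepts are represented as balls (a centre and a radius), so that the axiom C ⊑ D holds geometrically when the ball for C lies inside the ball for D, and it assumes the Reichenbach implication as one possible differentiable fuzzy semantics for the same axiom. The function names, the margin parameter, and the toy values are all hypothetical.

```python
import math


def ball_subsumption_loss(center_c, radius_c, center_d, radius_d, margin=0.0):
    """Hinge loss for the axiom C ⊑ D under an assumed ball representation.

    The loss is zero when the ball of C is contained in the ball of D
    (up to the margin) and grows with the amount by which C sticks out.
    """
    distance = math.dist(center_c, center_d)
    return max(0.0, distance + radius_c - radius_d - margin)


def fuzzy_subsumption_penalty(membership_c, membership_d):
    """Soft-constraint penalty for C ⊑ D on a single individual.

    Uses the Reichenbach implication I(a, b) = 1 - a + a * b (an assumed
    choice of fuzzy operator); the penalty is 1 - I(a, b), so it vanishes
    when the implication is fully satisfied.
    """
    implication = 1.0 - membership_c + membership_c * membership_d
    return 1.0 - implication


# Hypothetical toy embeddings: a small ball sitting inside a larger one.
child = ((0.5, 0.0), 0.5)
parent = ((0.0, 0.0), 1.0)

# Child ⊑ Parent is satisfied geometrically, so the loss is zero.
print(ball_subsumption_loss(*child, *parent))
# Parent ⊑ Child is violated, so the loss is positive.
print(ball_subsumption_loss(*parent, *child))
# An individual that fully belongs to C but not at all to D violates C ⊑ D.
print(fuzzy_subsumption_penalty(1.0, 0.0))
```

In a training loop, losses of this kind would be summed over all axioms of the ontology and minimised by gradient descent. A zero geometric loss then corresponds to an embedding in which the axiom holds as a containment relation.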

These embedding methods are validated on benchmarks that recast standard bioinformatics tasks as question answering, multi-modality, and zero-shot inference. Results delivered so far include Protein function prediction as approximate semantic entailment (Nature Machine Intelligence, 2024), which formulates protein function prediction as a logical inference task and shows that approximate entailment outperforms the prior state of the art, and Predicting protein functions using positive-unlabeled ranking with ontology-based priors (Bioinformatics, 2024), which exploits ontology structure to compensate for the long-tailed, positive-only labels typical of biological data. Knowledge integration for molecular property prediction is addressed in Large-Scale Knowledge Integration for Enhanced Molecular Property Prediction (NeSy, 2024), and variant prioritisation in Prioritizing genomic variants through neuro-symbolic, knowledge-enhanced learning (Bioinformatics, 2024). From Axioms over Graphs to Vectors, and Back Again (ESWC 2023) shows what is preserved and what is lost in each step of the pipeline.
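
One simple way in which ontology structure can constrain function predictions is to make the scores consistent with the subclass hierarchy: if a protein is predicted to have a specific function, every more general function must be at least as likely. The sketch below illustrates this general idea only; it is not the method of the papers cited above. The function names, the toy hierarchy (a simplified fragment, not the actual Gene Ontology structure), and the scores are hypothetical.

```python
def ancestors(term, parents):
    """Return all transitive superclasses of a term in a subclass hierarchy."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate_scores(scores, parents):
    """Close prediction scores under the hierarchy.

    Each class ends up with at least the highest score of any of its
    subclasses, so the predictions never contradict a subclass axiom.
    """
    closed = dict(scores)
    for term, score in scores.items():
        for ancestor in ancestors(term, parents):
            closed[ancestor] = max(closed.get(ancestor, 0.0), score)
    return closed


# Hypothetical, simplified hierarchy: child class -> list of direct parents.
TOY_PARENTS = {
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
}

# Hypothetical raw model scores for one protein.
raw_scores = {"kinase activity": 0.9, "catalytic activity": 0.4}

consistent = propagate_scores(raw_scores, TOY_PARENTS)
for term, score in sorted(consistent.items()):
    print(term, score)
```

After propagation, the score for "catalytic activity" is raised to match its more specific subclass, and "molecular function" receives a score although the raw model did not predict it.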

The driving application is non-opioid analgesia: more than fifty such drugs have been developed since the 1960s, but the mechanisms of most are unknown or uncertain, and opioid prescription has risen in Saudi Arabia in parallel with the global trend. By formulating drug repurposing as approximate question answering over a knowledge graph of genes, proteins, diseases, drugs, metabolites, phenotypes, and functions, the project aims to surface candidate mechanisms with attached symbolic explanations rather than opaque scores. The methods are also intended to support variant prioritisation for rare genetic disease (a high burden in Saudi Arabia, where consanguinity is common) and low-resource learning, where formal knowledge compensates for limited training data.

Period: 2023–2026

Funding

  • KAUST Competitive Research Grant (CRG11) — Grant ID: URF/1/5041-01-01 (PI)

Publications acknowledging this project (17)

  • (2025) Lattice-based ALC ontology embeddings with saturation
  • (2024) Predicting protein functions using positive-unlabeled ranking with ontology-based priors
  • (2024) Neuro-symbolic AI in Life Sciences
  • (2023) DeepGOMeta: Functional Insights into Microbial Communities with Deep Learning-Based Protein Function Prediction
  • (2022) Exploring the Use of Ontology Components for Distantly-Supervised Disease and Phenotype Named Entity Recognition
  • (2022) Context-based protein function prediction in bacterial genomes
  • (2022) INDIGENA: inductive prediction of disease-gene associations using phenotype ontologies
  • (2022) Causal Knowledge Graphs: Leveraging Background Knowledge for Causal Inference at Scale
  • (2022) Causal Knowledge Graphs: Leveraging Background Knowledge for Causal Inference at Scale
  • (2022) Large-Scale Knowledge Integration for Enhanced Molecular Property Prediction
  • (2020) PAVS: A database of phenotype-associated variants in Saudi Arabia
  • (2018) Ontology Embedding: A Survey of Methods, Applications and Resources
  • (2015) The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients
  • (2015) The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients
  • (2012) Exploring the Use of Ontology Components for Distantly-Supervised Disease and Phenotype Named Entity Recognition
  • … and 2 more.

Topics: Applied Ontology, Neuro-symbolic AI, Ontology engineering, Semantic similarity