KAUST Center of Excellence for Generative AI (Health and Wellness, BCB theme)
Overview
The KAUST Center of Excellence for Generative AI has a Health and Wellness theme, and within that theme the Bioinformatics and Computational Biology (BCB) workstream is responsible for the bio-side problems that generic generative models cannot solve on their own: protein and antibody design, population-specific drug development, and clinical foundation models that can actually reason with structured biomedical knowledge. The BORG group leads the neuro-symbolic and ontology-based components of this work (Sub-WBS FCC/1/5940-07-02), in partnership with Xin Gao's group on AI/drug design and the wider CoE on clinical LLMs and imaging.
The scientific problem we own is the integration of formal biomedical knowledge — ontologies of anatomy, gene function, disease, phenotype — with large neural models. Foundation models for biomedicine routinely fail in ways that matter clinically: they hallucinate anatomical relations, conflate phenotypes, and cannot ground a recommendation in a verifiable chain of biological reasoning. Our approach is to make ontology axioms part of the model itself rather than an external retrieval step. This means embedding description-logic ontologies (ALC and richer fragments) into vector spaces in a way that preserves entailment, and combining those embeddings with sequence- and structure-based encoders for proteins, drugs, and patient data.
Methodological contributions
The line of work began with the realisation that ontology embeddings need to respect the logical structure of the source ontology, not just its graph form. "Lattice-based ALC ontology embeddings with saturation" (2025) constructs embeddings directly over the description-logic lattice and shows that saturating the ontology under entailment improves downstream prediction; this gives us a principled way to feed structured knowledge into neural models for drug, disease, and phenotype prediction. "Neuro-symbolic AI in Life Sciences" (2024) lays out the broader programme (where logic-based ML methods buy you reliability, and where they collapse) and frames the CoE's bet on neuro-symbolic clinical AI.
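To make the idea of saturation concrete, here is a minimal sketch that computes the closure of a set of atomic subclass axioms under transitivity. It is a deliberate simplification: saturating a full ALC ontology involves existential restrictions, negation, and a proper reasoner, none of which appear here. The function name `saturate` and the toy class names are illustrative assumptions, not part of the published method.

```python
from itertools import product


def saturate(subsumptions):
    """Close a set of (sub, super) pairs under transitivity.

    Only atomic subclass axioms are handled; this is a toy stand-in
    for saturation under entailment, not an ALC reasoner.
    """
    closure = set(subsumptions)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(tuple(closure), repeat=2):
            if b == c and a != d and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


# Hypothetical toy axioms: each pair reads "first is a subclass of second".
axioms = {
    ("Cardiomyopathy", "HeartDisease"),
    ("HeartDisease", "CardiovascularDisease"),
    ("CardiovascularDisease", "Disease"),
}
saturated = saturate(axioms)
```

After saturation, the entailed axiom `("Cardiomyopathy", "Disease")` is stated explicitly, so an embedding method trained on the saturated set sees it as a training signal rather than having to infer it.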
Downstream of the embedding methodology, the workstream has produced concrete prediction tools. "DeepGOMeta: functional insights into microbial communities with deep learning-based protein function prediction" (2023) extends our DeepGO family to metagenomic settings, predicting protein function for organisms that have never been cultured, a capability the CoE drug pipeline needs whenever the molecular target sits in a poorly characterised pathway. "INDIGENA: inductive prediction of disease–gene associations using phenotype ontologies" (2022) uses ontology-aware inductive learning to predict gene–disease links for genes that were not part of the training set, which is the realistic setting whenever a Saudi-specific variant has no prior literature. "Exploring the use of ontology components for distantly-supervised disease and phenotype named entity recognition" (2022) turns the ontologies into a supervision signal for clinical text mining, supporting the clinical-LLM side of the CoE.
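As an illustration of how ontology labels can act as a supervision signal, the sketch below tags text with BIO labels by matching a small lexicon of labels and synonyms. This is generic dictionary-based distant supervision under simplifying assumptions (whitespace-separated phrases, exact case-insensitive match); the function `distant_labels` and the toy lexicon are hypothetical and do not reproduce the pipeline used in the paper.

```python
import re


def distant_labels(text, lexicon):
    """Assign BIO tags to tokens by matching ontology labels and synonyms."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    # Longest phrases first, so a specific label wins over a nested general one.
    entries = sorted(lexicon.items(), key=lambda kv: -len(kv[0].split()))
    for phrase, entity_type in entries:
        words = phrase.lower().split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            span_free = all(t == "O" for t in tags[i:i + n])
            if lowered[i:i + n] == words and span_free:
                tags[i] = f"B-{entity_type}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{entity_type}"
    return list(zip(tokens, tags))


# Hypothetical lexicon standing in for ontology labels and synonyms.
lexicon = {
    "dilated cardiomyopathy": "DISEASE",
    "cardiomyopathy": "DISEASE",
    "short stature": "PHENOTYPE",
}
labelled = distant_labels(
    "The patient has dilated cardiomyopathy and short stature.", lexicon
)
```

The resulting token and tag pairs can serve as noisy training data for a sequence tagger, with no manual annotation required.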
Two later outputs target rare-disease diagnosis directly. "The application of large language models to the phenotype-based prioritization of causative genes in rare disease patients" tests whether LLM-based reasoning can match or extend phenotype-driven gene prioritization, providing a quantitative readout of where general-purpose LLMs actually help in clinical genomics and where structured knowledge remains essential. "Improving the classification of cardinality phenotypes using collections" sharpens phenotype representations for a class of clinically important but historically under-modelled features (counts of structures, repetitive elements), making them usable for downstream prediction.
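For readers unfamiliar with phenotype-driven gene prioritization, the following sketch ranks candidate genes by the Jaccard similarity between the patient's phenotypes and each gene's annotated phenotypes, after both sets are extended with their ontology ancestors. This is a classic semantic-similarity baseline, not the LLM-based approach evaluated in the paper; the term names, gene names, and toy hierarchy are all hypothetical.

```python
def ancestors(term, parents):
    """Return the term together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def closure(terms, parents):
    """Extend a set of terms with every ancestor of every term."""
    result = set()
    for term in terms:
        result |= ancestors(term, parents)
    return result


def rank_genes(patient_terms, gene_terms, parents):
    """Rank genes by Jaccard similarity of ancestor-closed phenotype sets."""
    patient = closure(patient_terms, parents)
    scores = {}
    for gene, terms in gene_terms.items():
        annotated = closure(terms, parents)
        union = patient | annotated
        scores[gene] = len(patient & annotated) / len(union) if union else 0.0
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy hierarchy: child term -> list of parent terms.
parents = {
    "polydactyly": ["abnormal_digit_number"],
    "oligodactyly": ["abnormal_digit_number"],
    "abnormal_digit_number": ["limb_abnormality"],
    "short_stature": ["growth_abnormality"],
}
gene_terms = {
    "GENE_A": ["polydactyly", "short_stature"],
    "GENE_B": ["oligodactyly"],
    "GENE_C": ["short_stature"],
}
ranking = rank_genes(["polydactyly"], gene_terms, parents)
```

In this toy example `GENE_B` still receives a non-zero score even though it shares no exact term with the patient, because both phenotypes are abnormalities of digit number. That shared-ancestor signal is what the ontology contributes over plain term matching.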
The aggregate is a stack of methods (ontology embedding, neuro-symbolic prediction, ontology-supervised NER, and LLM-based phenotype reasoning) that the rest of the CoE consumes for population-specific drug discovery, Arabic clinical LLMs, and AI-driven diagnostics. The project has run since 2024 and is staffed in our group by Mahdi Bu Ali (MSc) and several current PhD students.
Period: 2024–ongoing
Funding
- KAUST Center of Excellence for Generative AI (CoE), Grant ID: FCC/1/5940-07-02 (PI)
Team
- Robert Hoehndorf — PI (Professor of Computer Science, KAUST)
- Mahdi Bu Ali — MSc (alumnus)
Software
- GO-Agent — LLM-agent framework that decomposes protein-function prediction into tool-calling sub-tasks (sequence search, structure lookup, domain reasoning) and stitches the evidence into a final GO annotation. https://github.com/bio-ontology-research-group/go-agent
Publications acknowledging this project (7)
- (2025) Lattice-based ALC ontology embeddings with saturation
- (2024) Neuro-symbolic AI in Life Sciences
- (2023) DeepGOMeta: Functional Insights into Microbial Communities with Deep Learning-Based Protein Function Prediction
- (2022) Exploring the Use of Ontology Components for Distantly-Supervised Disease and Phenotype Named Entity Recognition
- (2022) INDIGENA: inductive prediction of disease–gene associations using phenotype ontologies
- (2015) The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients
- (2012) Improving the classification of cardinality phenotypes using collections
Topics: Drug mechanisms, Neuro-symbolic AI, Rare disease