Biomedical informatics
Our biomedical informatics work converts heterogeneous research-grade data into usable inputs for clinicians and computational biologists. Within KAUST's Computer Science Program, we build biomedical knowledge bases, mine text for structured biological assertions, standardise clinical phenotype encodings, and develop analytics over electronic health records and rare-disease cohorts. The distinctive feature of our approach is that almost every component is grounded in formal ontologies, so that text-mined facts, curated databases and clinical observations share a common semantic substrate and can be compared, reasoned over and embedded jointly.
Knowledge bases for infectious and rare disease
Several of our long-running contributions are integrated knowledge bases. PathoPhenoDB: linking human pathogens to their disease phenotypes in support of infectious disease research is a curated and text-mined resource that connects pathogens to the disease phenotypes they cause, distributed both as an OWL ontology and as an interactive web application, and accompanied by the methodological paper Ontology based mining of pathogen–disease associations from literature. The infrastructure work Aber-OWL: a framework for ontology-based data access in biology provides the reasoning backend that supports queries over these resources, while The role of ontologies in biological and biomedical research: a functional perspective and Datamining with Ontologies articulate the broader rationale and methodology for ontology-grounded data integration.
Text mining and clinical NLP
To populate and extend these knowledge bases, we develop biomedical text-mining methods that operate at the ontology level. Ontology based text mining of gene-phenotype associations: application to candidate gene prediction demonstrated that literature-mined gene-phenotype associations meaningfully complement curated databases for gene prioritization. Ontology-Based Concept Recognition by Using Word Embeddings and Combining lexical and context features for automatic ontology extension show how distributional semantics can be used to recognise concepts and extend ontologies semi-automatically. On the clinical side, Improved characterisation of clinical text through ontology-based vocabulary expansion and Effects of Negation and Uncertainty Stratification on Text-Derived Patient Profile Similarity address two specific obstacles in turning narrative clinical text into ontology-coded phenotype profiles suitable for similarity-based diagnosis. Multi-faceted semantic clustering with text-derived phenotypes extends this into patient stratification.
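As a rough illustration of how ontology-coded phenotype profiles can be compared, and of why negation handling matters, the sketch below expands each patient's terms with their ontology ancestors and computes a Jaccard similarity. It is a minimal toy under stated assumptions, not the method of the papers cited above: the term IDs, the `PARENTS` hierarchy, the patient profiles and all function names are hypothetical, and Jaccard overlap stands in for whichever similarity measure a real pipeline would use.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy ontology (child -> parents); the IDs are invented, not real HPO terms.
PARENTS: Dict[str, Set[str]] = {
    "T:root": set(),
    "T:cardio": {"T:root"},
    "T:neuro": {"T:root"},
    "T:arrhythmia": {"T:cardio"},
    "T:tachycardia": {"T:arrhythmia"},
    "T:seizure": {"T:neuro"},
}

# A text-derived mention: (ontology term, whether the mention was negated in the note).
Mention = Tuple[str, bool]


def ancestors(term: str) -> Set[str]:
    """Return the term together with all of its ancestors in the toy ontology."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def closure(mentions: List[Mention], keep_negated: bool = False) -> Set[str]:
    """Ancestor-closed term set for a profile; negated mentions are dropped by default."""
    terms: Set[str] = set()
    for term, negated in mentions:
        if negated and not keep_negated:
            continue
        terms |= ancestors(term)
    return terms


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two term sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical profiles: patient A's note says "tachycardia" and "no seizures".
patient_a: List[Mention] = [("T:tachycardia", False), ("T:seizure", True)]
patient_b: List[Mention] = [("T:arrhythmia", False)]

sim_affirmed = jaccard(closure(patient_a), closure(patient_b))
sim_all = jaccard(closure(patient_a, keep_negated=True),
                  closure(patient_b, keep_negated=True))
print(sim_affirmed, sim_all)  # 0.75 0.5
```

In this toy case, treating the negated seizure mention as if it were affirmed lowers the similarity between two patients who otherwise share a cardiac phenotype, which is the kind of distortion that negation stratification is meant to avoid.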
Clinical decision support and EHR analytics
These components are combined into decision-support tools that operate on patient data. Starvar: symptom-based tool for automatic ranking of variants using evidence from literature and genomes ranks variants by combining text-mined evidence with patient symptoms. The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients revisits gene prioritization with modern LLMs, comparing them against established semantic-similarity baselines on real rare-disease cohorts. Predicting candidate genes from phenotypes, functions and anatomical site of expression adds anatomical context to the prediction problem, and Ontology-based prediction of cancer driver genes shows that the same ontology-aware embedding strategy generalises to oncology. Causal relationships between diseases mined from the literature improve the use of polygenic risk scores demonstrates how literature-derived causal graphs sharpen polygenic risk modelling for downstream EHR studies. Foundational analyses such as Evaluation of research in biomedical ontologies, Ranking Adverse Drug Reactions With Crowdsourcing, Usage of cell nomenclature in biomedical literature, and the BioHackathon reports anchor this work in community standards and shared benchmarks.
These tools and resources are used in active programmes on rare-disease diagnostic support, infectious-disease surveillance, drug repurposing and cancer prognostics, and they support clinical collaborators through software such as PathoPhenoDB, SmuDGE, Multi-Drug Embedding and DeepMOCCA. We are now extending this stack toward operational decision support for genetic-medicine clinics, with explicit attention to populations and disease patterns that are common in the Middle East but under-represented in international resources.
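The semantic-similarity baselines mentioned in this section can be pictured with a small sketch of phenotype-based gene ranking. The example below assumes a Resnik-style similarity (information content of the most informative common ancestor) combined with a one-directional best-match average; both are common choices in the literature, but this is not claimed to be the exact configuration used in the cited studies. The ontology, gene names, annotations and function names are all hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

# Hypothetical toy phenotype ontology (child -> parents); IDs are invented.
PARENTS: Dict[str, Set[str]] = {
    "P:root": set(),
    "P:heart": {"P:root"},
    "P:brain": {"P:root"},
    "P:arrhythmia": {"P:heart"},
    "P:cardiomyopathy": {"P:heart"},
    "P:seizure": {"P:brain"},
}

# Hypothetical gene-to-phenotype annotations.
GENE_PHENOTYPES: Dict[str, Set[str]] = {
    "GENE_A": {"P:arrhythmia", "P:cardiomyopathy"},
    "GENE_B": {"P:seizure"},
    "GENE_C": {"P:cardiomyopathy", "P:seizure"},
}


def ancestors(term: str) -> Set[str]:
    """Return the term together with all of its ancestors."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def information_content() -> Dict[str, float]:
    """IC(t) = log(N / n_t), where n_t counts genes annotated with t or a descendant."""
    n_genes = len(GENE_PHENOTYPES)
    counts = {term: 0 for term in PARENTS}
    for terms in GENE_PHENOTYPES.values():
        closed = set().union(*(ancestors(t) for t in terms))
        for term in closed:
            counts[term] += 1
    return {t: math.log(n_genes / c) for t, c in counts.items() if c > 0}


IC = information_content()


def resnik(a: str, b: str) -> float:
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)  # always contains the root
    return max(IC.get(t, 0.0) for t in common)


def profile_score(patient: Set[str], gene_terms: Set[str]) -> float:
    """Best-match average: each patient term is matched to its closest gene term."""
    return sum(max(resnik(p, g) for g in gene_terms) for p in patient) / len(patient)


def rank_genes(patient: Set[str]) -> List[Tuple[str, float]]:
    """Rank all genes by decreasing similarity to the patient's phenotype profile."""
    scored = [(gene, profile_score(patient, terms))
              for gene, terms in GENE_PHENOTYPES.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_genes({"P:arrhythmia"})
for gene, score in ranking:
    print(f"{gene}\t{score:.3f}")
```

For a patient with only the toy arrhythmia term, the ranking is GENE_A, then GENE_C, then GENE_B: the exact match on a rare term scores highest, a shared cardiac ancestor gives partial credit, and a gene with only neurological annotations scores zero.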
Software
- SmuDGE — Semantic disease-gene embeddings; integrates phenotype, function and pathway ontologies into a unified vector space for downstream prediction.
- Multi-Drug Embedding — Drug repurposing method that learns joint embeddings of drugs, targets and diseases from biomedical knowledge graphs and the scientific literature.
- DeepMOCCA — Graph neural network for cancer survival analysis that integrates multi-omics (mutation, expression, methylation, CNV) with a curated cancer knowledge graph.
- PathoPhenoDB — Curated database of pathogens and the disease phenotypes they cause, distributed as an OWL ontology and an interactive web application.
- NanoDesigner — Iterative refinement framework for nanobody/CDR design that explicitly models the antigen–CDR interdependence; companion code to the NanoDesigner paper.
Publications (46)
- (2026) Guzman-Vega, Cardona-Londono, Gonzalez-Alvarez, Pena-Guerra et al. VarLand: A pipeline to map the structural landscape of missense variants at the proteome scale. Journal of Biological Chemistry.
- (2025) Alarawi, Altammami, Abutarboush, Kulmanov et al. Genomic diversity and antimicrobial resistance of Staphylococcus aureus in Saudi Arabia: a nationwide study using whole-genome sequencing. Microbial Genomics.
- (2025) Gomez, Al Mahri, Abdullah, Malik et al. Age-related differences in gene expression and pathway activation following heatstroke. Physiological Genomics.
- (2025) Kafkas, Abdelhakim, Althagafi, Toonsi et al. The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients. Scientific Reports.
- (2025) Alhattab, Barakeh, Khoja, Elhadi et al. Sa1216: Development of colorectal cancer and matched healthy organoids from Saudi patients: a case study. Gastroenterology.
- (2025) Schofield, Hoehndorf, Gkoutos, Smith. The informatics of developmental phenotypes. Kaufman’s Atlas of Mouse Development Supplement.
- (2024) Toonsi, Gauran, Ombao, Schofield et al. Causal relationships between diseases mined from the literature improve the use of polygenic risk scores. Bioinformatics.
- (2023) Mazen Hassanain, Yang Liu, Weam Hussain, Albandri Binowayn et al. Genomic landscape in Saudi patients with hepatocellular carcinoma using whole-genome sequencing: a pilot study. Frontiers in Gastroenterology.
- (2023) Senay Kafkas, Marwa Abdelhakim, Mahmut Uludag, Azza Althagafi et al. Starvar: symptom-based tool for automatic ranking of variants using evidence from literature and genomes. BMC Bioinformatics.
- (2021) Liu-Wei, Kafkas, Chen, Dimonaco et al. DeepViral: prediction of novel virus–host interactions from protein sequences and infectious disease phenotypes. Bioinformatics.
- (2021) Jun Chen, Azza Althagafi, Robert Hoehndorf. Predicting candidate genes from phenotypes, functions and anatomical site of expression. Bioinformatics.
- (2021) Luke T. Slater, John A. Williams, Andreas Karwath, Hilary Fanning et al. Multi-faceted semantic clustering with text-derived phenotypes. Computers in Biology and Medicine.
- (2021) Luke T. Slater, William Bradlow, Simon Ball, Robert Hoehndorf et al. Improved characterisation of clinical text through ontology-based vocabulary expansion. Journal of Biomedical Semantics.
- (2021) Luke T. Slater, Andreas Karwath, Robert Hoehndorf, Georgios V. Gkoutos. Effects of Negation and Uncertainty Stratification on Text-Derived Patient Profile Similarity. Frontiers in Digital Health.
- (2020) Sara Althubaiti, Senay Kafkas, Marwa Abdelhakim, Robert Hoehndorf. Combining lexical and context features for automatic ontology extension. Journal of Biomedical Semantics.
- (2020) Rutger A. Vos, Toshiaki Katayama, Hiroyuki Mishima, Shin Kawano et al. BioHackathon 2015: Semantics of data for life sciences and reproducible research. F1000Research.
- (2019) Kafkas, Abdelhakim, Hashish, Kulmanov et al. PathoPhenoDB: linking human pathogens to their disease phenotypes in support of infectious disease research. Scientific Data.
- (2019) Katayama, Kawashima, Micklem, Kawano et al. BioHackathon series in 2013 and 2014: improvements of semantic interoperability in life science data and services. F1000Research.
- (2019) Althubaiti, Karwath, Dallol, Noor et al. Ontology-based prediction of cancer driver genes. Scientific Reports.
- (2019) Timothy K. Cooper, Kathleen A. Silva, Victoria E. Kennedy, Sarah M. Alghamdi et al. Hyaline Arteriolosclerosis in 30 Strains of Aged Inbred Mice. Veterinary Pathology.
- … and 26 more.