Biomedical informatics

Our biomedical informatics work converts heterogeneous research-grade data into usable inputs for clinicians and computational biologists. Within KAUST's Computer Science Program, we build biomedical knowledge bases, mine text for structured biological assertions, standardize clinical phenotype encodings, and develop analytics over electronic health records and rare-disease cohorts. The distinctive feature of our approach is that almost every component is grounded in formal ontologies, so that text-mined facts, curated databases and clinical observations share a common semantic substrate and can be compared, reasoned over and embedded jointly.

Knowledge bases for infectious and rare disease

Several of our long-running contributions are integrated knowledge bases. PathoPhenoDB: linking human pathogens to their disease phenotypes in support of infectious disease research describes a curated and text-mined resource that connects pathogens to the disease phenotypes they cause. The resource is distributed both as an OWL ontology and as an interactive web application, and is accompanied by the methodological paper Ontology based mining of pathogen–disease associations from literature. The infrastructure work Aber-OWL: a framework for ontology-based data access in biology provides the reasoning backend that supports queries over these resources, while The role of ontologies in biological and biomedical research: a functional perspective and Datamining with Ontologies articulate the broader rationale and methodology for ontology-grounded data integration.
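The idea behind ontology-based data access can be illustrated with a minimal sketch: a query for a phenotype class should also return records annotated with any of its subclasses. The example below is a hand-rolled toy, not the Aber-OWL API. All identifiers (`Pheno:...`, `Pathogen:...`) and function names are hypothetical, and it assumes a single-parent hierarchy, whereas a real OWL ontology would need a reasoner to infer the class hierarchy.

```python
from typing import Dict, Optional, Set

# Hypothetical single-parent is-a hierarchy (child -> parent).
SUBCLASS_OF: Dict[str, str] = {
    "Pheno:respiratory_infection": "Pheno:infection",
    "Pheno:pneumonia": "Pheno:respiratory_infection",
    "Pheno:skin_infection": "Pheno:infection",
}

# Hypothetical pathogen-to-phenotype annotations.
ANNOTATIONS: Dict[str, Set[str]] = {
    "Pathogen:A": {"Pheno:pneumonia"},
    "Pathogen:B": {"Pheno:skin_infection"},
    "Pathogen:C": {"Pheno:respiratory_infection"},
}


def is_subclass_of(term: str, query: str) -> bool:
    """Return True if `term` equals `query` or lies below it in the hierarchy."""
    current: Optional[str] = term
    while current is not None:
        if current == query:
            return True
        current = SUBCLASS_OF.get(current)  # walk up to the parent, if any
    return False


def pathogens_for(query: str) -> Set[str]:
    """Return pathogens annotated with `query` or any of its subclasses."""
    return {
        pathogen
        for pathogen, phenotypes in ANNOTATIONS.items()
        if any(is_subclass_of(term, query) for term in phenotypes)
    }


if __name__ == "__main__":
    print(sorted(pathogens_for("Pheno:respiratory_infection")))
```

Querying the more general class `Pheno:infection` in this toy returns all three pathogens, which is the behavior that makes annotations at different levels of granularity comparable.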

Text mining and clinical NLP

To populate and extend these knowledge bases, we develop biomedical text-mining methods that operate at the ontology level. Ontology based text mining of gene-phenotype associations: application to candidate gene prediction demonstrated that literature-mined gene-phenotype associations meaningfully complement curated databases for gene prioritization. Two further papers, Ontology-Based Concept Recognition by Using Word Embeddings and Combining lexical and context features for automatic ontology extension, show how distributional semantics can be used to recognize concepts and extend ontologies semi-automatically. On the clinical side, Improved characterisation of clinical text through ontology-based vocabulary expansion and Effects of Negation and Uncertainty Stratification on Text-Derived Patient Profile Similarity address two specific obstacles in turning narrative clinical text into ontology-coded phenotype profiles suitable for similarity-based diagnosis. Multi-faceted semantic clustering with text-derived phenotypes extends this into patient stratification.
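To make the step from narrative text to an ontology-coded profile concrete, here is a deliberately simplified sketch. It is not the method of any of the papers above: it assumes a tiny hypothetical lexicon (`LEXICON`, with made-up `P:` identifiers), exact string matching, and a crude rule that treats a mention as negated if any cue word appears earlier in the same sentence.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical label -> term ID lexicon; real systems derive this from an ontology.
LEXICON: Dict[str, str] = {
    "seizure": "P:0001",
    "seizures": "P:0001",
    "fever": "P:0002",
    "muscle weakness": "P:0003",
}

# Hypothetical negation cues; a real cue list would be much longer.
NEGATION_CUES = ("no", "not", "denies", "without")


def annotate(text: str) -> List[Tuple[str, bool]]:
    """Return (term_id, negated) pairs for lexicon labels found in `text`."""
    annotations: List[Tuple[str, bool]] = []
    # Split on sentence-like boundaries so a cue only affects its own sentence.
    for sentence in re.split(r"[.;\n]+", text.lower()):
        for label, term_id in LEXICON.items():
            match = re.search(r"\b" + re.escape(label) + r"\b", sentence)
            if match is None:
                continue
            # Words that precede the mention within the same sentence.
            preceding = re.findall(r"[a-z]+", sentence[: match.start()])
            negated = any(cue in preceding for cue in NEGATION_CUES)
            annotations.append((term_id, negated))
    return annotations


def asserted_profile(text: str) -> Set[str]:
    """Keep only the terms that are mentioned without negation."""
    return {term_id for term_id, negated in annotate(text) if not negated}


if __name__ == "__main__":
    note = "Patient reports fever. No seizures; denies muscle weakness."
    print(annotate(note))
    print(asserted_profile(note))
```

The negation rule here over-triggers on sentences such as "no fever but has seizures", which is one reason dedicated negation and uncertainty handling matters before profiles are compared.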

Clinical decision support and EHR analytics

These components are combined into decision-support tools that operate on patient data. Starvar: symptom-based tool for automatic ranking of variants using evidence from literature and genomes ranks variants by combining text-mined evidence with patient symptoms. The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients revisits gene prioritization with modern LLMs, comparing them against established semantic-similarity baselines on real rare-disease cohorts. Predicting candidate genes from phenotypes, functions and anatomical site of expression adds anatomical context to the prediction problem, and Ontology-based prediction of cancer driver genes shows that the same ontology-aware embedding strategy generalizes to oncology. Causal relationships between diseases mined from the literature improve the use of polygenic risk scores demonstrates how literature-derived causal graphs sharpen polygenic risk modeling for downstream EHR studies. Foundational analyses such as Evaluation of research in biomedical ontologies, Ranking Adverse Drug Reactions With Crowdsourcing, Usage of cell nomenclature in biomedical literature, and the BioHackathon reports anchor this work in community standards and shared benchmarks.
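As an illustration of what a semantic-similarity baseline for phenotype-based gene prioritization can look like, the following sketch ranks genes by the Jaccard overlap between the ancestor-closed phenotype set of a patient and that of each gene. This is an assumption-laden toy, not the scoring used in the papers above: the hierarchy, the `P:` identifiers, the gene names and the function names are all hypothetical, and published measures typically weight terms by information content.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical is-a hierarchy (child -> set of parents).
PARENTS: Dict[str, Set[str]] = {
    "P:root": set(),
    "P:neuro": {"P:root"},
    "P:seizure": {"P:neuro"},
    "P:ataxia": {"P:neuro"},
    "P:cardiac": {"P:root"},
    "P:arrhythmia": {"P:cardiac"},
}


def ancestors(term: str) -> Set[str]:
    """Return the term together with all of its superclasses."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, ()))
    return seen


def closure(terms: Set[str]) -> Set[str]:
    """Expand a phenotype set with every superclass of its members."""
    expanded: Set[str] = set()
    for term in terms:
        expanded |= ancestors(term)
    return expanded


def similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard index over ancestor-closed phenotype sets (unweighted)."""
    closed_a, closed_b = closure(a), closure(b)
    union = closed_a | closed_b
    return len(closed_a & closed_b) / len(union) if union else 0.0


def rank_genes(
    patient: Set[str], gene_phenotypes: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    """Sort genes by decreasing similarity to the patient's phenotypes."""
    scored = [(gene, similarity(patient, phenos)) for gene, phenos in gene_phenotypes.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    genes = {
        "GENE_A": {"P:seizure", "P:ataxia"},
        "GENE_B": {"P:arrhythmia"},
    }
    for gene, score in rank_genes({"P:seizure"}, genes):
        print(gene, round(score, 2))
```

Expanding each set with its superclasses is what lets a patient annotated with a specific term still score a partial match against a gene annotated with a sibling or more general term.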

These tools and resources are used in active programs on rare-disease diagnostic support, infectious-disease surveillance, drug repurposing and cancer prognostics, and they support clinical collaborators through software such as PathoPhenoDB, SmuDGE, Multi-Drug Embedding and DeepMOCCA. We are now extending this stack toward operational decision support for genetic-medicine clinics, with explicit attention to populations and disease patterns that are common in the Middle East but under-represented in international resources.

Software

Publications (36)

(2026) Guzman-Vega, Cardona-Londono, Gonzalez-Alvarez, Pena-Guerra et al. VarLand: A pipeline to map the structural landscape of missense variants at the proteome scale. Journal of Biological Chemistry.
(2026) Zhapa-Camacho, Hoehndorf. INDIGENA: inductive prediction of disease–gene associations using phenotype ontologies. Bioinformatics.
(2025) Kafkas, Abdelhakim, Althagafi, Toonsi et al. The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients. Scientific Reports.
(2025) Alarawi, Altammami, Abutarboush, Kulmanov et al. Genomic diversity and antimicrobial resistance of Staphylococcus aureus in Saudi Arabia: a nationwide study using whole-genome sequencing. Microbial Genomics.
(2025) Gomez, Al Mahri, Abdullah, Malik et al. Age-related differences in gene expression and pathway activation following heatstroke. Physiological Genomics.
(2025) Alhattab, Barakeh, Khoja, Elhadi et al. Sa1216: Development of colorectal cancer and matched healthy organoids from Saudi patients: a case study. Gastroenterology.
(2025) Schofield, Hoehndorf, Gkoutos, Smith. The informatics of developmental phenotypes. Kaufman’s Atlas of Mouse Development Supplement.
(2024) Kulmanov, Tawfiq, Liu, Al Ali et al. A reference quality, fully annotated diploid genome from a Saudi individual. Scientific Data.
(2024) Toonsi, Gauran, Ombao, Schofield et al. Causal relationships between diseases mined from the literature improve the use of polygenic risk scores. Bioinformatics.
(2023) Senay Kafkas, Marwa Abdelhakim, Mahmut Uludag, Azza Althagafi et al. Starvar: symptom-based tool for automatic ranking of variants using evidence from literature and genomes. BMC Bioinformatics.
(2023) Mazen Hassanain, Yang Liu, Weam Hussain, Albandri Binowayn et al. Genomic landscape in Saudi patients with hepatocellular carcinoma using whole-genome sequencing: a pilot study. Frontiers in Gastroenterology.
(2021) Liu-Wei, Kafkas, Chen, Dimonaco et al. DeepViral: prediction of novel virus–host interactions from protein sequences and infectious disease phenotypes. Bioinformatics.
(2021) Jun Chen, Azza Althagafi, Robert Hoehndorf. Predicting candidate genes from phenotypes, functions and anatomical site of expression. Bioinformatics.
(2021) Luke T. Slater, John A. Williams, Andreas Karwath, Hilary Fanning et al. Multi-faceted semantic clustering with text-derived phenotypes. Computers in Biology and Medicine.
(2021) Luke T. Slater, Andreas Karwath, John A. Williams, Sophie Russell et al. Towards similarity-based differential diagnostics for common diseases. Computers in Biology and Medicine.
(2021) Luke T. Slater, William Bradlow, Dino FA. Motti, Robert Hoehndorf et al. A fast, accurate, and generalisable heuristic-based negation detection algorithm for clinical text. Computers in Biology and Medicine.
(2020) Sara Althubaiti, Senay Kafkas, Marwa Abdelhakim, Robert Hoehndorf. Combining lexical and context features for automatic ontology extension. Journal of Biomedical Semantics.
(2020) Rutger A. Vos, Toshiaki Katayama, Hiroyuki Mishima, Shin Kawano et al. BioHackathon 2015: Semantics of data for life sciences and reproducible research. F1000Research.
(2019) Althubaiti, Karwath, Dallol, Noor et al. Ontology-based prediction of cancer driver genes. Scientific Reports.
(2019) Senay Kafkas, Robert Hoehndorf. Ontology based mining of pathogen–disease associations from literature. Journal of Biomedical Semantics.
(2019) Kafkas, Abdelhakim, Hashish, Kulmanov et al. PathoPhenoDB: linking human pathogens to their disease phenotypes in support of infectious disease research. Scientific Data.
(2019) Kafkas, Hoehndorf. Ontology based text mining of gene-phenotype associations: application to candidate gene prediction. Database.
(2019) Katayama, Kawashima, Micklem, Kawano et al. BioHackathon series in 2013 and 2014: improvements of semantic interoperability in life science data and services. F1000Research.
(2019) Timothy K. Cooper, Kathleen A. Silva, Victoria E. Kennedy, Sarah M. Alghamdi et al. Hyaline Arteriolosclerosis in 30 Strains of Aged Inbred Mice. Veterinary Pathology.
(2018) Sohaib Younis, Claus Weiland, Robert Hoehndorf, Stefan Dressler et al. Taxon and trait recognition from digitized herbarium specimens using deep convolutional neural networks. Botany Letters.
(2018) Senay Kafkas, Robert Hoehndorf. Ontology based mining of pathogen-disease associations from literature. Bio-Ontologies COSI.
(2018) Sara Althubaiti, Senay Kafkas, Robert Hoehndorf. Ontology-Based Concept Recognition by Using Word Embeddings. Bio-Ontologies COSI.
(2017) Kafkas, Sarntivijai, Hoehndorf. Usage of cell nomenclature in biomedical literature. BMC Bioinformatics.
(2016) Boudellioua, Saidi, Hoehndorf, Martin et al. Prediction of Metabolic Pathway Involvement in Prokaryotic UniProtKB Data by Association Rule Mining. PLoS ONE.
(2016) Hoehndorf, Gkoutos, Schofield. Datamining with Ontologies. Data Mining Techniques for the Life Sciences.
(2015) Robert Hoehndorf, Luke Slater, Paul N Schofield, Georgios V Gkoutos. Aber-OWL: a framework for ontology-based data access in biology. BMC Bioinformatics.
(2015) Martin Hrabě de Angelis, George Nicholson, Mohammed Selloum, Jacqueline K White et al. Analysis of mammalian gene function through broad-based phenotypic screens across a consortium of mouse clinics. Nature Genetics.
(2015) Gottlieb, Hoehndorf, Dumontier, Altman. Ranking Adverse Drug Reactions With Crowdsourcing. Journal of Medical Internet Research.
(2015) Hoehndorf, Schofield, Gkoutos. The role of ontologies in biological and biomedical research: a functional perspective. Briefings in Bioinformatics.
(2014) Hoehndorf, Hancock, Hardy, Mallon et al. Analyzing gene expression data in mice with the Neuro Behavior Ontology. Mammalian Genome.
(2014) Rutger Vos, Jordan Biserkov, Bachir Balech, Niall Beard et al. Enriched biodiversity data as a resource and service. Biodiversity Data Journal.