Biomedical informatics

Our biomedical informatics work converts heterogeneous research-grade data into usable inputs for clinicians and computational biologists. Within KAUST's Computer Science Program, we build biomedical knowledge bases, mine text for structured biological assertions, standardise clinical phenotype encodings, and develop analytics over electronic health records and rare-disease cohorts. The distinctive feature of our approach is that almost every component is grounded in formal ontologies, so that text-mined facts, curated databases and clinical observations share a common semantic substrate and can be compared, reasoned over and embedded jointly.

Knowledge bases for infectious and rare disease

Several of our long-running contributions are integrated knowledge bases. PathoPhenoDB, introduced in "PathoPhenoDB: linking human pathogens to their disease phenotypes in support of infectious disease research", is a curated and text-mined resource that connects pathogens to the disease phenotypes they cause. It is distributed both as an OWL ontology and as an interactive web application, and is accompanied by the methodological paper "Ontology based mining of pathogen–disease associations from literature". The infrastructure paper "Aber-OWL: a framework for ontology-based data access in biology" describes the reasoning backend that supports queries over these resources, while "The role of ontologies in biological and biomedical research: a functional perspective" and "Datamining with Ontologies" articulate the broader rationale and methodology for ontology-grounded data integration.

Text mining and clinical NLP

To populate and extend these knowledge bases, we develop biomedical text-mining methods that operate at the ontology level. "Ontology based text mining of gene-phenotype associations: application to candidate gene prediction" demonstrated that literature-mined gene-phenotype associations meaningfully complement curated databases for gene prioritization. "Ontology-Based Concept Recognition by Using Word Embeddings" and "Combining lexical and context features for automatic ontology extension" show how distributional semantics can be used to recognise concepts and extend ontologies semi-automatically. On the clinical side, "Improved characterisation of clinical text through ontology-based vocabulary expansion" and "Effects of Negation and Uncertainty Stratification on Text-Derived Patient Profile Similarity" address two specific obstacles in turning narrative clinical text into ontology-coded phenotype profiles suitable for similarity-based diagnosis: limited vocabulary coverage, and the handling of negated or uncertain mentions. "Multi-faceted semantic clustering with text-derived phenotypes" extends this work into patient stratification.
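As a rough illustration of why negation matters for profile similarity, the sketch below compares ontology-coded patient profiles with and without negated mentions. It is a minimal example under stated assumptions, not the method of the papers above: the toy hierarchy, the term names, the patients and the choice of Jaccard similarity over ancestor-expanded term sets are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Toy is-a hierarchy (child -> parents). Hypothetical terms, not real ontology identifiers.
PARENTS: Dict[str, Set[str]] = {
    "phenotype": set(),
    "neuro": {"phenotype"},
    "seizure": {"neuro"},
    "focal_seizure": {"seizure"},
    "cardiac": {"phenotype"},
    "arrhythmia": {"cardiac"},
}

def closure(term: str) -> Set[str]:
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def profile(mentions: List[Tuple[str, bool]], drop_negated: bool = True) -> Set[str]:
    """Build an ancestor-expanded profile from (term, is_negated) mentions."""
    terms: Set[str] = set()
    for term, negated in mentions:
        if negated and drop_negated:
            continue  # "no arrhythmia" should not count as evidence of arrhythmia
        terms |= closure(term)
    return terms

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap between two profiles; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical text-derived mentions: (term, is_negated).
patient_a = [("focal_seizure", False), ("arrhythmia", True)]
patient_b = [("seizure", False)]
patient_c = [("arrhythmia", False)]

sim_ab = jaccard(profile(patient_a), profile(patient_b))
sim_ac = jaccard(profile(patient_a), profile(patient_c))
sim_ac_naive = jaccard(profile(patient_a, drop_negated=False), profile(patient_c))
print(sim_ab, sim_ac, sim_ac_naive)
```

In this toy setting, ignoring the negation makes patient A look considerably more similar to the arrhythmia patient than the affirmed findings justify, which is the kind of distortion that stratifying by negation is meant to avoid.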

Clinical decision support and EHR analytics

These components are combined into decision-support tools that operate on patient data. "STARVar: symptom-based tool for automatic ranking of variants using evidence from literature and genomes" ranks variants by combining text-mined evidence with patient symptoms. "The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients" revisits gene prioritization with modern LLMs, comparing them against established semantic-similarity baselines on real rare-disease cohorts. "Predicting candidate genes from phenotypes, functions and anatomical site of expression" adds anatomical context to the prediction problem, and "Ontology-based prediction of cancer driver genes" shows that the same ontology-aware embedding strategy generalises to oncology. "Causal relationships between diseases mined from the literature improve the use of polygenic risk scores" demonstrates how literature-derived causal graphs sharpen polygenic risk modelling for downstream EHR studies. Foundational analyses such as "Evaluation of research in biomedical ontologies", "Ranking Adverse Drug Reactions With Crowdsourcing" and "Usage of cell nomenclature in biomedical literature", together with the BioHackathon reports, anchor this work in community standards and shared benchmarks.
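A semantic-similarity baseline for phenotype-based gene prioritization can be sketched as follows. This is an illustrative example only, assuming Resnik similarity with a one-sided best-match average; the toy hierarchy, the gene names and their annotations are hypothetical and do not reproduce the pipelines of the papers above.

```python
import math
from typing import Dict, List, Set

# Toy is-a hierarchy (child -> parents). Hypothetical terms, not real ontology identifiers.
PARENTS: Dict[str, Set[str]] = {
    "phenotype": set(),
    "neuro": {"phenotype"},
    "seizure": {"neuro"},
    "focal_seizure": {"seizure"},
    "cardiac": {"phenotype"},
    "arrhythmia": {"cardiac"},
}

# Hypothetical gene-to-phenotype annotations.
GENE_PHENOTYPES: Dict[str, Set[str]] = {
    "GENE_A": {"focal_seizure"},
    "GENE_B": {"arrhythmia"},
    "GENE_C": {"seizure", "arrhythmia"},
}

def closure(term: str) -> Set[str]:
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content() -> Dict[str, float]:
    """IC(t) = -log p(t), with p(t) the fraction of genes annotated to t or a descendant."""
    counts = {term: 0 for term in PARENTS}
    for terms in GENE_PHENOTYPES.values():
        expanded: Set[str] = set()
        for term in terms:
            expanded |= closure(term)
        for term in expanded:
            counts[term] += 1
    total = len(GENE_PHENOTYPES)
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}

IC = information_content()

def resnik(t1: str, t2: str) -> float:
    """IC of the most informative common ancestor of two terms."""
    common = closure(t1) & closure(t2)
    return max((IC.get(t, 0.0) for t in common), default=0.0)

def score(patient: Set[str], gene_terms: Set[str]) -> float:
    """Average, over patient terms, of the best match among the gene's terms."""
    if not patient or not gene_terms:
        return 0.0
    return sum(max(resnik(p, g) for g in gene_terms) for p in patient) / len(patient)

def rank_genes(patient: Set[str]) -> List[str]:
    """Order genes from most to least similar to the patient's phenotypes."""
    return sorted(GENE_PHENOTYPES, key=lambda g: score(patient, GENE_PHENOTYPES[g]), reverse=True)

ranking = rank_genes({"focal_seizure"})
print(ranking)
```

The exact-match gene ranks first, the gene annotated with the more general seizure term second, and the gene sharing only the root term last, which is the behaviour expected of an information-content-based measure.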

These tools and resources are used in active programmes on rare-disease diagnostic support, infectious-disease surveillance, drug repurposing and cancer prognostics, and they support clinical collaborators through software such as PathoPhenoDB, SmuDGE, Multi-Drug Embedding and DeepMOCCA. We are now extending this stack toward operational decision-support for genetic-medicine clinics, with explicit attention to populations and disease patterns that are common in the Middle East but under-represented in international resources.

Software

Publications (46)