Rare disease

Exome and genome sequencing have transformed the diagnosis of rare and Mendelian disease, but interpretation remains the bottleneck: a typical patient genome contains tens of thousands of rare variants, only one or a few of which are causative. Effective diagnostic support requires integrating patient-specific molecular data with structured background knowledge about genes, phenotypes, and disease mechanisms. We develop methods, anchored on the PhenomeNET phenotype network and the PVP family of variant prioritisation tools, that combine automated reasoning over phenotype ontologies with machine learning to rank candidate variants by how well they explain the patient's clinical phenotype. The Human Phenotype Ontology (HPO) is at the centre of this approach, together with model-organism phenotype resources that allow inferences to bridge species.

From phenotype networks to variant prioritisation

The PhenomeNET line of work began with "PhenomeNET: a whole-phenome approach to disease gene discovery", which transformed phenotype ontologies into a formal representation that enables cross-species comparison of phenotypes. "Similarity-based search of model organism, disease and drug effect phenotypes" extended it to support real-time similarity queries over a large repository of model-organism, disease, and drug-effect phenotypes. "Integrating phenotype ontologies with PhenomeNET" put the integration of phenotype ontologies across species on a sound basis, and "Contribution of model organism phenotypes to the computational identification of human disease genes" assessed the role of model-organism phenotypes in human gene discovery. Building on this foundation, "Semantic prioritization of novel causative genomic variants" introduced the PVP framework, which combines pathogenicity prediction with semantic similarity between patient and disease phenotypes to rank candidate variants in whole-genome sequencing data.
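The idea of combining a pathogenicity score with phenotype similarity can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy ontology, the gene annotations, the use of Resnik similarity with a one-directional best-match average, and the simple product used to combine the two signals. It is not the PVP implementation, which learns the combination from data.

```python
import math

# Toy phenotype ontology: term -> direct parents (hypothetical labels, not real HPO IDs).
PARENTS = {
    "phenotype": [],
    "neuro": ["phenotype"],
    "cardiac": ["phenotype"],
    "seizure": ["neuro"],
    "ataxia": ["neuro"],
    "arrhythmia": ["cardiac"],
}

# Hypothetical gene -> phenotype annotations used as background knowledge.
GENE_PHENOTYPES = {
    "GENE_A": {"seizure", "ataxia"},
    "GENE_B": {"arrhythmia"},
    "GENE_C": {"seizure"},
    "GENE_D": {"arrhythmia", "ataxia"},
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(N / n_t), where n_t genes are annotated with t or a descendant."""
    n_genes = len(GENE_PHENOTYPES)
    counts = {term: 0 for term in PARENTS}
    for terms in GENE_PHENOTYPES.values():
        closure = set()
        for term in terms:
            closure |= ancestors(term)
        for term in closure:
            counts[term] += 1
    return {t: math.log(n_genes / c) for t, c in counts.items() if c > 0}


IC = information_content()


def resnik(a, b):
    """Resnik similarity: IC of the most informative common ancestor."""
    return max(IC.get(t, 0.0) for t in ancestors(a) & ancestors(b))


def phenotype_similarity(patient_terms, gene_terms):
    """Average, over patient terms, of the best match among the gene's terms."""
    best = [max(resnik(p, g) for g in gene_terms) for p in patient_terms]
    return sum(best) / len(best)


def rank_variants(variants, patient_terms):
    """Score = pathogenicity * phenotype similarity (an illustrative combination)."""
    scored = []
    for variant_id, gene, pathogenicity in variants:
        sim = phenotype_similarity(patient_terms, GENE_PHENOTYPES[gene])
        scored.append((variant_id, pathogenicity * sim))
    return sorted(scored, key=lambda item: item[1], reverse=True)


patient = ["seizure", "ataxia"]
variants = [
    ("var1", "GENE_B", 0.95),  # most damaging, but the gene's phenotype does not match
    ("var2", "GENE_A", 0.80),
    ("var3", "GENE_C", 0.90),
]
ranking = rank_variants(variants, patient)
print(ranking)
```

In this toy example the variant with the highest pathogenicity score drops to the bottom of the ranking because its gene shares no informative phenotype with the patient, which is the behaviour phenotype-aware prioritisation is designed to produce.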

Deep learning, structural variants, and oligogenic disease

Subsequent systems brought deep learning into variant prioritisation. "DeepPVP: phenotype-based prioritization of causative variants using deep learning" replaced hand-crafted scoring with a neural model trained jointly on variant features and phenotype similarity. "DeepSVP: integration of genotype and phenotype for structural variant prioritization using deep learning" extended the approach to structural and copy-number variants by learning gene-function similarity from biomedical ontologies, while "OligoPVP: Phenotype-driven analysis of individual genomic information to prioritize oligogenic disease variants" addressed oligogenic disorders, in which two or more interacting variants are jointly required to cause disease. "Semantic Disease Gene Embeddings (SmuDGE): phenotype-based disease gene prioritization without phenotypes" showed that even when a patient is incompletely phenotyped, embedding-based representations recover diagnostic signal from related sources. Most recently, "Prioritizing genomic variants through neuro-symbolic, knowledge-enhanced learning" introduced EmbedPVP, which fuses sequence-derived and ontology-derived representations in a neuro-symbolic framework, and "The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients" evaluated how LLMs can extend phenotype-based gene prioritisation beyond curated genotype-to-phenotype databases.
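For the oligogenic case mentioned above, one simple way to think about the problem is to score combinations of variants rather than single variants. The sketch below is an assumption-laden illustration, not the OligoPVP algorithm: it assumes each variant already has a prioritisation score, assumes a hypothetical set of interacting gene pairs as background knowledge, and scores a pair by the sum of its two variant scores.

```python
from itertools import combinations


def rank_variant_pairs(variant_scores, variant_gene, interactions):
    """Rank pairs of variants that fall in two different, interacting genes.

    variant_scores: variant id -> single-variant prioritisation score
    variant_gene:   variant id -> gene symbol
    interactions:   set of frozenset({gene1, gene2}) pairs (hypothetical network)
    """
    pairs = []
    for v1, v2 in combinations(sorted(variant_scores), 2):
        g1, g2 = variant_gene[v1], variant_gene[v2]
        if g1 != g2 and frozenset((g1, g2)) in interactions:
            # Summing the two scores is an illustrative choice only.
            pairs.append(((v1, v2), variant_scores[v1] + variant_scores[v2]))
    return sorted(pairs, key=lambda item: item[1], reverse=True)


# Hypothetical inputs.
variant_scores = {"var1": 0.9, "var2": 0.7, "var3": 0.6, "var4": 0.2}
variant_gene = {"var1": "GENE_A", "var2": "GENE_B", "var3": "GENE_C", "var4": "GENE_B"}
interactions = {
    frozenset(("GENE_A", "GENE_B")),
    frozenset(("GENE_B", "GENE_C")),
}

ranked_pairs = rank_variant_pairs(variant_scores, variant_gene, interactions)
for pair, score in ranked_pairs:
    print(pair, round(score, 2))
```

Restricting candidates to interacting genes keeps the number of combinations manageable; without such a filter the search space grows quadratically with the number of rare variants.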

Predicting the phenotypic consequences of a genetic variant requires gene-level phenotype models. "DeepPheno: Predicting single gene loss-of-function phenotypes using an ontology-aware hierarchical classifier" predicts HPO/MPO phenotypes from gene functions, complementing "Predicting candidate genes from phenotypes, functions and anatomical site of expression". Tools such as "Starvar: symptom-based tool for automatic ranking of variants using evidence from literature and genomes" add literature-derived evidence and patient symptoms to the ranking. Curated knowledge bases support the broader programme: "DDIEM: drug database for inborn errors of metabolism" catalogues treatment strategies for inborn errors of metabolism, and "PathoPhenoDB: linking human pathogens to their disease phenotypes in support of infectious disease research" links pathogens to clinical phenotypes for infectious-disease applications.
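An ontology-aware phenotype predictor, as mentioned above, has to produce scores that respect the ontology hierarchy: a gene predicted to cause a specific phenotype should be predicted at least as strongly for every more general ancestor. The sketch below shows one common post-processing step that enforces this, propagating each term's score upwards by taking the maximum. The toy ontology and raw scores are hypothetical, and this is not necessarily how DeepPheno itself enforces consistency.

```python
def ancestors_of(term, parents):
    """Return all strict ancestors of a term in a term -> parents mapping."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate_scores(raw_scores, parents):
    """Make scores hierarchy-consistent: each ancestor gets at least the
    score of its highest-scoring descendant."""
    consistent = dict(raw_scores)
    for term, score in raw_scores.items():
        for ancestor in ancestors_of(term, parents):
            consistent[ancestor] = max(consistent.get(ancestor, 0.0), score)
    return consistent


# Hypothetical ontology fragment and raw classifier outputs for one gene.
parents = {
    "phenotype": [],
    "neuro": ["phenotype"],
    "cardiac": ["phenotype"],
    "seizure": ["neuro"],
}
raw_scores = {"phenotype": 0.2, "neuro": 0.3, "seizure": 0.9, "cardiac": 0.1}

consistent_scores = propagate_scores(raw_scores, parents)
print(consistent_scores)
```

After propagation, the high score for the specific term is carried up to its ancestors, while unrelated branches keep their original scores.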

External assessments, including "CAGI6 ID panel challenge: assessment of phenotype and variant predictions in 415 children with neurodevelopmental disorders (NDDs)" and "Critical assessment of variant prioritization methods for rare disease diagnosis within the Rare Genomes Project", have benchmarked these tools on independent cohorts. The work feeds into active programmes, including a public Saudi pangenome for Middle-Eastern rare-disease interpretation, the KAUST Center of Excellence for Generative AI in Health and Wellness, and clinical decision-support tools such as GenomeLinter that aim to put phenotype-aware genome interpretation in the hands of non-specialist clinicians.

Projects

Software

Publications (32)