CompleX: Variant Prioritization in Complex Disease
Overview
The hardest cases in clinical genome sequencing are the ones where no single variant explains the disease. As Mendelian gene discovery slows and the diagnostic rate for whole-exome sequencing stalls below 50%, growing evidence points to oligogenic and polygenic origins: combinations of moderately rare or common alleles that, individually, look unremarkable. Population-level approaches lack the statistical power to find them, and traditional single-gene Mendelian reasoning ignores them. The CompleX project (2019–2021, with the Universities of Cambridge and Birmingham) set out to break this impasse by extending phenotype-driven, knowledge-based variant prioritization (the strategy behind our PVP and DeepPVP tools) to genotypes that involve two or more interacting genes.
The plan was to assemble, for the first time, a knowledge graph capturing genotype–phenotype associations for compound mutations across human (DIDA, ClinVar), mouse (MGI), and zebrafish (ZFIN), enriched with interaction evidence from STRING, BioGRID, and YeastNet; then to design machine-learning algorithms that could exploit this graph as background knowledge to rank candidate variant sets from individual genomes. Because the combinatorial explosion of variant pairs is unmanageable by exhaustive search, we reformulated the prioritization step as a Prize-Collecting Steiner Tree problem and combined it with neural-symbolic feature learning over the integrated phenotype, interaction, and ontology graph.
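To make the Prize-Collecting Steiner Tree reformulation concrete, the sketch below treats candidate genes as nodes carrying a prize (for instance, a phenotype-match score) and interactions as edges carrying a cost, then looks for the connected set of genes that minimises total edge cost plus the prizes of the genes left out. This is a toy illustration under stated assumptions, not the project's solver: the gene names, prizes, and costs are invented, and the exhaustive search over node subsets is only feasible for a handful of nodes, whereas realistic instances would need an approximation or heuristic solver.

```python
from itertools import combinations


def mst_cost(nodes, edges):
    """Kruskal's algorithm on the subgraph induced by `nodes`.

    Returns the spanning-tree cost, or None if the induced subgraph
    is disconnected (so no tree can span exactly these nodes).
    """
    nodes = set(nodes)
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, used = 0.0, 0
    for (u, v), cost in sorted(edges.items(), key=lambda kv: kv[1]):
        if u in nodes and v in nodes:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                total += cost
                used += 1
    return total if used == len(nodes) - 1 else None


def solve_pcst(prizes, edges):
    """Exhaustive PCST: minimise tree edge cost + prizes of excluded nodes."""
    all_nodes = list(prizes)
    # Baseline: select nothing and forfeit every prize.
    best = (sum(prizes.values()), frozenset())
    for k in range(1, len(all_nodes) + 1):
        for subset in combinations(all_nodes, k):
            cost = mst_cost(subset, edges)
            if cost is None:
                continue
            penalty = sum(p for n, p in prizes.items() if n not in subset)
            best = min(best, (cost + penalty, frozenset(subset)),
                       key=lambda t: t[0])
    return best


# Hypothetical candidate genes: prize = strength of phenotype evidence.
prizes = {"G1": 5.0, "G2": 4.0, "G3": 0.5, "G4": 0.25}
# Hypothetical interaction edges: cost = weakness of interaction evidence.
edges = {
    ("G1", "G3"): 1.0,
    ("G3", "G2"): 1.0,
    ("G1", "G4"): 3.0,
    ("G4", "G2"): 3.0,
}

objective, selected = solve_pcst(prizes, edges)
print(objective, sorted(selected))
```

In this toy instance the two high-prize genes are not directly connected, so the cheapest solution pulls in the low-prize gene G3 as a connector. This is the behaviour that makes the formulation attractive for oligogenic candidates: a gene that looks unremarkable on its own can be selected because it links two strong candidates.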
Concrete deliverables
The project produced the technical scaffolding that knowledge-graph-based variant prioritization now rests on. "DeepGOPlus: improved protein function prediction from sequence" (Bioinformatics, 2019) and its web companion "DeepGOWeb: fast and accurate protein function prediction on the (Semantic) Web" made deep-learning function predictions available at scale and via SPARQL, which is essential for inferring the functional context of variants in genes that lack experimental annotation. PathoPhenoDB (Scientific Data, 2019) extended phenotype-driven prioritization beyond host genetics by linking human pathogens to disease phenotypes, allowing the same machinery to address infectious-disease cases. "Semantic similarity and machine learning with biomedical ontologies" (Briefings in Bioinformatics, 2020) consolidated the methodological foundation that the prioritization algorithms depend on: how to combine ontology-based semantic similarity with embeddings learned over the knowledge graph. "How much do model organism phenotypes contribute to the computational identification of human disease genes?" (2021) quantified, for the first time at this scale, the contribution of mouse, zebrafish, fly, and yeast phenotypes to disease-gene prediction, settling a long-standing question about which model-organism evidence is worth the integration cost.
Alongside these, the project delivered Klarigi: characteristic explanations for semantic data, a tool that turns the prioritization output into human-readable explanations grounded in ontology terms — a step toward clinical usability — and PhenomeBrowser, an integrated query interface over phenotypes, their semantics, and phenotype-based machine-learning predictions. Students Sara Althubaiti and Sarah Alghamdi, identified at the start of the proposal as project leads, produced their doctoral work directly within this programme; Mona Alshahrani and Imane Boudellioua completed adjacent PhDs on the underlying representation-learning methods.
The combination is what mattered: by 2021 the group had a working knowledge graph for compound genotype–phenotype associations, a family of embedding-based prioritization methods (now consumed by downstream PVP variants and the DeepPheno line of work), and explanation tooling that made the predictions auditable. The intellectual property generated — updates to PVP, DeepPVP, and OligoPVP — became the basis for the group's continuing work on rare and complex disease diagnosis, including subsequent applications to Saudi cohorts.
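As an illustration of the ontology-based semantic similarity that phenotype-driven prioritization builds on, the sketch below computes Resnik similarity (the information content of the most informative common ancestor) over a toy phenotype hierarchy and ranks genes against a patient's phenotypes with a best-match average. It covers only the similarity half of the picture, not the learned embeddings, and everything in it is an assumption made for the example: the ontology terms, the gene names, and their annotations are invented, and the project's tools may use different similarity measures.

```python
import math

# Hypothetical toy phenotype ontology: term -> direct parents.
PARENTS = {
    "Phenotype": [],
    "Neuro": ["Phenotype"],
    "Cardiac": ["Phenotype"],
    "Seizure": ["Neuro"],
    "Ataxia": ["Neuro"],
    "Arrhythmia": ["Cardiac"],
}

# Hypothetical gene-to-phenotype annotations.
ANNOTATIONS = {
    "GENE_A": {"Seizure"},
    "GENE_B": {"Ataxia", "Seizure"},
    "GENE_C": {"Arrhythmia"},
    "GENE_D": {"Arrhythmia"},
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def information_content():
    """IC(t) = -log(fraction of genes annotated with t or a descendant)."""
    n = len(ANNOTATIONS)
    counts = {t: 0 for t in PARENTS}
    for terms in ANNOTATIONS.values():
        closure = set().union(*(ancestors(t) for t in terms))
        for t in closure:
            counts[t] += 1
    return {t: -math.log(c / n) for t, c in counts.items() if c > 0}


IC = information_content()


def resnik(t1, t2):
    """Information content of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(IC.get(t, 0.0) for t in common)


def best_match_average(query, target):
    """Average, over query terms, of the best Resnik match in target."""
    return sum(max(resnik(q, t) for t in target) for q in query) / len(query)


def rank_genes(patient_terms):
    """Order genes by decreasing similarity to the patient's phenotypes."""
    scores = {g: best_match_average(patient_terms, terms)
              for g, terms in ANNOTATIONS.items()}
    return sorted(scores, key=scores.get, reverse=True)


print(rank_genes({"Ataxia"}))
```

The gene annotated with the exact patient phenotype ranks first, followed by the gene that shares only the more general neurological ancestor. The genes with unrelated cardiac phenotypes score zero because their only common ancestor is the root.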
Period: 2019–2021
Funding
- KAUST Competitive Research Grant, Grant ID: URF/1/3790-01-01 (PI), USD 240,000
Team
- Robert Hoehndorf — PI (KAUST (Professor of Computer Science))
- Paul N Schofield — CoI (University of Cambridge)
- Georgios V Gkoutos — CoI (University of Birmingham)
- Imane Boudellioua — PhD (alumnus) (King Fahd University of Petroleum and Minerals (Assistant Professor))
- Mona Alshahrani — PhD (alumnus) (Jubail University College (Assistant Professor))
- Sarah Alghamdi — PhD (alumnus)
- Azza Althagafi — PhD (alumnus), MSc (alumnus)
- Abeer Almutairi — MSc (alumnus)
- Sara Althubaiti — MSc (alumnus)
- Hatoon Al Ali — MSc (alumnus)
- Safana Bakheet — MSc (alumnus)
- Ashraf Kibraya — Postdoc
Software
- OPA2Vec — Combines ontology axioms with associated annotation properties (labels, synonyms, definitions) into a single corpus, then trains Word2Vec to produce semantically rich vectors for ontology classes. https://github.com/bio-ontology-research-group/opa2vec
- DeepPheno — Predicts loss-of-function organism-level phenotypes (HPO/MPO) directly from a gene's annotated functions, using a hierarchical neural classifier over phenotype ontologies. https://github.com/bio-ontology-research-group/deeppheno
- PhenomeNET-VP — Phenotype-driven variant prioritization for whole-exome and whole-genome sequencing data; widely used implementation of the phenotype-aware variant ranking approach. https://github.com/bio-ontology-research-group/phenomenet-vp
- DeepSVP — Prioritizes structural and copy-number variants by combining patient phenotype with gene-function similarity learned from biomedical ontologies. https://github.com/bio-ontology-research-group/DeepSVP
- EmbedPVP — Embedding-based phenotype-aware variant predictor that ranks candidate causative variants using joint sequence- and phenotype-derived representations. https://github.com/bio-ontology-research-group/EmbedPVP
- STARVar — Symptom-based tool for automatic ranking of variants using evidence from the biomedical literature and population genomes; combines text mining with phenotype matching. https://github.com/bio-ontology-research-group/STARVar
- INDIGENA — Inductive prediction of disease–gene associations from phenotype ontologies; generalises to unseen diseases via ontology-aware embeddings. https://github.com/bio-ontology-research-group/indigena
- predCAN — Ontology-based prediction of cancer driver genes by integrating phenotype, pathway and function knowledge with somatic-variant features. https://github.com/bio-ontology-research-group/predCAN
- DeepViral — Predicts virus–host protein-protein interactions from sequence and infectious-disease phenotypes; trained jointly across coronaviruses, influenza, and other RNA viruses. https://github.com/bio-ontology-research-group/DeepViral
- SmuDGE — Semantic disease-gene embeddings; integrates phenotype, function and pathway ontologies into a unified vector space for downstream prediction. https://github.com/bio-ontology-research-group/SMUDGE
- PhenomeNet — Cross-species phenotype ontology and similarity network combining HPO, MPO, ZP and others; the substrate behind PhenomeNET-VP and DeepPheno. https://github.com/bio-ontology-research-group/phenomeblast
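The corpus-construction idea in the OPA2Vec entry above can be sketched as follows: logical axioms and annotation properties (labels, synonyms, definitions) are each turned into a token sequence, and the sequences are pooled into one corpus that a Word2Vec implementation could then be trained on. This is a minimal sketch under stated assumptions, not OPA2Vec's actual code: the class identifiers, axioms, and annotation texts are invented, the `tokenize` and `build_corpus` helpers are hypothetical, and the Word2Vec training step itself is left out.

```python
import re

# Hypothetical axioms as (subject, relation, object) triples.
AXIOMS = [
    ("EX:0002", "SubClassOf", "EX:0001"),
    ("EX:0003", "SubClassOf", "EX:0001"),
]

# Hypothetical annotation properties as (class, property, text) triples.
ANNOTATIONS = [
    ("EX:0001", "label", "Abnormal movement"),
    ("EX:0002", "label", "Tremor"),
    ("EX:0002", "synonym", "Involuntary shaking"),
    ("EX:0003", "definition", "Impaired coordination of voluntary movement."),
]


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_corpus(axioms, annotations):
    """Pool axioms and annotations into one list of token sequences.

    Each axiom becomes a three-token sentence; each annotation becomes a
    sentence that starts with the class identifier and property name, so
    the natural-language words co-occur with the class they describe.
    """
    sentences = [[s, r, o] for s, r, o in axioms]
    for cls, prop, text in annotations:
        sentences.append([cls, prop] + tokenize(text))
    return sentences


corpus = build_corpus(AXIOMS, ANNOTATIONS)
for sentence in corpus:
    print(" ".join(sentence))
```

Keeping the class identifier as the first token of every annotation sentence is the design choice that matters here: it lets a context-window model place an ontology class near both its logical neighbours and the words used to describe it.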
Publications acknowledging this project (17)
- (2021) How much do model organism phenotypes contribute to the computational identification of human disease genes?
- (2020) DeepGOWeb: Fast and accurate protein function prediction on the (Semantic) Web
- (2020) Semantic similarity and machine learning with biomedical ontologies
- (2019) DeepGOPlus: Improved protein function prediction from sequence
- (2019) PathoPhenoDB: linking human pathogens to their disease phenotypes in support of infectious disease research
- (2015) Ontology-based prediction of cancer driver genes
- (2015) Klarigi: Explanations for Semantic Groupings Supplementary Material
- (2012) DDIEM: Drug Database for Inborn Errors of Metabolism
- (2012) Linking common human diseases to their phenotypes; development of a resource for human phenomics
- (2012) Komenti: A Semantic Text-mining Framework
- (2012) Towards semantic interoperability: finding and repairing hidden contradictions in biomedical ontologies
- (2012) Improved characterisation of clinical text through ontology-based vocabulary expansion
- PhenomeBrowser: Integrating phenotypes, their semantics, and phenotype-based machine learning across domains, organisms, and applications
- Klarigi: Characteristic Explanations for Semantic Data
- … and 2 more.
Topics: Applied Ontology, Neuro-symbolic AI, Rare disease, Semantic similarity