CompleX: Variant Prioritization in Complex Disease

Tue, Jan 1 2019 - Fri, Dec 31 2021

Applied Ontology, Neuro-Symbolic AI, Rare disease, Semantic similarity

Overview

The hardest cases in clinical genome sequencing are those in which no single variant explains the disease. As Mendelian gene discovery slows and the diagnostic rate for whole-exome sequencing stalls below 50%, growing evidence points to oligogenic and polygenic origins: combinations of moderately rare or common alleles that look unremarkable individually. Population-level approaches lack the statistical power to find these combinations, and traditional single-gene Mendelian reasoning does not consider them. The CompleX project (2019–2021, with the Universities of Cambridge and Birmingham) set out to break this impasse by extending phenotype-driven, knowledge-based variant prioritization, the strategy behind our PVP and DeepPVP tools, to genotypes that involve two or more interacting genes.

The plan was to assemble, for the first time, a knowledge graph capturing genotype–phenotype associations for compound mutations across human (DIDA, ClinVar), mouse (MGI), and zebrafish (ZFIN), enriched with interaction evidence from STRING, BioGRID, and YeastNet. On top of this graph, we would design machine-learning algorithms that use it as background knowledge to rank candidate variant sets from individual genomes. Because the number of candidate variant combinations grows too quickly for exhaustive search, we reformulated the prioritization step as a Prize-Collecting Steiner Tree problem and combined it with neuro-symbolic feature learning over the integrated phenotype, interaction, and ontology graph.
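To make the Prize-Collecting Steiner Tree idea concrete, here is a minimal sketch under stated assumptions. Genes are nodes that carry a "prize" (for example, a per-gene variant or phenotype score), interactions are edges with a cost (for example, one minus an interaction confidence), and the goal is a connected gene set that maximizes total prize minus the cost of the tree connecting it. The gene names, prizes, and costs are hypothetical, and the toy solver enumerates small gene sets exhaustively, which is only feasible because the example is tiny; it is not the project's algorithm.

```python
from itertools import combinations


def mst_cost(nodes, edges):
    """Kruskal's algorithm on the subgraph induced by `nodes`.

    Returns the minimum spanning tree cost, or None if the induced
    subgraph is not connected.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, used = 0.0, 0
    for (u, v), cost in sorted(edges.items(), key=lambda kv: kv[1]):
        if u in parent and v in parent:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                total += cost
                used += 1
    return total if used == len(nodes) - 1 else None


def best_gene_set(prizes, edges, max_size=3):
    """Return (score, genes) maximizing sum(prizes) - tree cost.

    For a fixed node set, the cheapest connecting tree is the MST of
    the induced subgraph, so enumerating node sets is exact (but
    exponential, hence only usable on toy inputs).
    """
    best = (float("-inf"), ())
    for k in range(1, max_size + 1):
        for subset in combinations(sorted(prizes), k):
            cost = mst_cost(subset, edges)
            if cost is None:
                continue  # genes not connected by known interactions
            score = sum(prizes[g] for g in subset) - cost
            if score > best[0]:
                best = (score, subset)
    return best


# Hypothetical per-gene prizes and interaction costs.
prizes = {"GENE_A": 0.9, "GENE_B": 0.8, "GENE_C": 0.2, "GENE_D": 0.7}
edges = {
    ("GENE_A", "GENE_B"): 0.1,
    ("GENE_B", "GENE_C"): 0.5,
    ("GENE_C", "GENE_D"): 0.3,
    ("GENE_A", "GENE_D"): 0.9,
}

score, genes = best_gene_set(prizes, edges)
print(genes, round(score, 3))
```

With these toy numbers, the pair of two high-prize genes joined by a cheap interaction wins over any single gene and over larger sets of up to three genes. Realistic instances would need an approximate or specialized PCST solver rather than enumeration.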

Concrete deliverables

The project produced the technical scaffolding that knowledge-graph-based variant prioritization now rests on. "DeepGOPlus: improved protein function prediction from sequence" (Bioinformatics, 2019) and its web companion, "DeepGOWeb: fast and accurate protein function prediction on the (Semantic) Web", made deep-learning function predictions available at scale and via SPARQL, which is essential for inferring the functional context of variants in genes that lack experimental annotation. "PathoPhenoDB" (Scientific Data, 2019) extended phenotype-driven prioritization beyond host genetics by linking human pathogens to disease phenotypes, allowing the same machinery to address infectious-disease cases. "Semantic similarity and machine learning with biomedical ontologies" (Briefings in Bioinformatics, 2020) consolidated the methodological foundation that the prioritization algorithms depend on: how to combine ontology-based semantic similarity with embeddings learned over the knowledge graph. "How much do model organism phenotypes contribute to the computational identification of human disease genes?" (2021) quantified, for the first time at this scale, the contribution of mouse, zebrafish, fly, and yeast phenotypes to disease-gene prediction, settling a long-standing question about which model-organism evidence is worth the integration cost.
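As an illustration of the ontology-based half of that foundation, the following sketch ranks genes by the semantic similarity between a patient's phenotypes and each gene's phenotype annotations. It uses Resnik similarity (the information content of the most informative common ancestor) with a one-directional best-match average. The ontology, gene names, and annotations are hypothetical toy data, and the sketch does not show the embedding-based half or reproduce any of the project's tools.

```python
import math

# Hypothetical toy ontology: term -> list of parent terms.
PARENTS = {
    "Phenotype": [],
    "Nervous": ["Phenotype"],
    "Cardiac": ["Phenotype"],
    "Seizure": ["Nervous"],
    "Ataxia": ["Nervous"],
    "Arrhythmia": ["Cardiac"],
}

# Hypothetical gene -> phenotype annotations.
GENE_PHENOTYPES = {
    "GENE_A": {"Seizure"},
    "GENE_B": {"Ataxia"},
    "GENE_C": {"Arrhythmia"},
    "GENE_D": {"Seizure", "Ataxia"},
}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def information_content(annotations):
    """IC(t) = -log(fraction of entities whose annotations imply t)."""
    n = len(annotations)
    counts = {t: 0 for t in PARENTS}
    for terms in annotations.values():
        implied = set().union(*(ancestors(t) for t in terms))
        for t in implied:
            counts[t] += 1
    return {t: -math.log(c / n) for t, c in counts.items() if c > 0}


def resnik(a, b, ic):
    """IC of the most informative common ancestor of a and b."""
    common = ancestors(a) & ancestors(b)
    return max(ic.get(t, 0.0) for t in common)


def best_match_average(query, target, ic):
    """Average, over query terms, of the best Resnik match in target."""
    return sum(max(resnik(q, t, ic) for t in target) for q in query) / len(query)


ic = information_content(GENE_PHENOTYPES)
patient = {"Seizure"}
ranking = sorted(
    GENE_PHENOTYPES,
    key=lambda g: best_match_average(patient, GENE_PHENOTYPES[g], ic),
    reverse=True,
)
print(ranking)
```

In this toy setup the genes annotated with the patient's own phenotype rank first, a gene sharing only a nervous-system ancestor ranks next, and the cardiac gene ranks last. A symmetric variant would average the best-match scores in both directions.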

Alongside these, the project delivered "Klarigi: characteristic explanations for semantic data", a tool that turns the prioritization output into human-readable explanations grounded in ontology terms (a step toward clinical usability), and PhenomeBrowser, an integrated query interface over phenotypes, their semantics, and phenotype-based machine-learning predictions. Students Sara Althubaiti and Sarah Alghamdi, named in the proposal as project leads, produced their doctoral work directly within this program; Mona Alshahrani and Imane Boudellioua completed adjacent PhDs on the underlying representation-learning methods.

The combination is what mattered: by 2021 the group had a working knowledge graph for compound genotype–phenotype associations, a family of embedding-based prioritization methods (now used by downstream PVP variants and the DeepPheno line of work), and explanation tooling that made the predictions auditable. The intellectual property generated (updates to PVP, DeepPVP, and OligoPVP) became the basis for the group's continuing work on rare and complex disease diagnosis, including subsequent applications to Saudi cohorts.

Period: 2019–2021

Funding

KAUST Competitive Research Grant — Grant ID: URF/1/3790-01-01 (PI) — USD 240,000

Team

Robert Hoehndorf — PI (KAUST (Professor of Computer Science))
Paul N Schofield — CoI (University of Cambridge)
Georgios V Gkoutos — CoI (University of Birmingham)
Imane Boudellioua — PhD (alumnus) (King Fahd University of Petroleum and Minerals (Assistant Professor))
Mona Alshahrani — PhD (alumnus) (Jubail University College (Assistant Professor))
Sarah Alghamdi — PhD (alumnus)
Azza Althagafi — PhD (alumnus)
Abeer Almutairi — MSc (alumnus)
Sara Althubaiti — MSc (alumnus)
Hatoon Al Ali — MSc (alumnus)
Safana Bakheet — MSc (alumnus)
Ashraf Kibraya — Postdoc

Software

OPA2Vec — Combines ontology axioms with associated annotation properties (labels, synonyms, definitions) into a single corpus, then trains Word2Vec to produce semantically rich vectors for ontology classes. https://github.com/bio-ontology-research-group/opa2vec
DeepPheno — Predicts loss-of-function organism-level phenotypes (HPO/MPO) directly from a gene's annotated functions, using a hierarchical neural classifier over phenotype ontologies. https://github.com/bio-ontology-research-group/deeppheno
PhenomeNET-VP — Phenotype-driven variant prioritization for whole-exome and whole-genome sequencing data; widely used implementation of the phenotype-aware variant ranking approach. https://github.com/bio-ontology-research-group/phenomenet-vp
DeepSVP — Prioritizes structural and copy-number variants by combining patient phenotype with gene-function similarity learned from biomedical ontologies. https://github.com/bio-ontology-research-group/DeepSVP
EmbedPVP — Embedding-based phenotype-aware variant predictor that ranks candidate causative variants using joint sequence- and phenotype-derived representations. https://github.com/bio-ontology-research-group/EmbedPVP
STARVar — Symptom-based tool for automatic ranking of variants using evidence from the biomedical literature and population genomes; combines text mining with phenotype matching. https://github.com/bio-ontology-research-group/STARVar
INDIGENA — Inductive prediction of disease–gene associations from phenotype ontologies; generalizes to unseen diseases via ontology-aware embeddings. https://github.com/bio-ontology-research-group/indigena
predCAN — Ontology-based prediction of cancer driver genes by integrating phenotype, pathway and function knowledge with somatic-variant features. https://github.com/bio-ontology-research-group/predCAN
DeepViral — Predicts virus–host protein-protein interactions from sequence and infectious-disease phenotypes; trained jointly across coronaviruses, influenza and other RNA viruses. https://github.com/bio-ontology-research-group/DeepViral
SmuDGE — Semantic disease-gene embeddings; integrates phenotype, function and pathway ontologies into a unified vector space for downstream prediction. https://github.com/bio-ontology-research-group/SMUDGE
PhenomeNET — Cross-species phenotype ontology and similarity network combining HPO, MPO, ZP and others; the substrate behind PhenomeNET-VP and DeepPheno. https://github.com/bio-ontology-research-group/phenomeblast
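The corpus-building step described for OPA2Vec above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the tool's actual code: the class identifiers, axioms, and annotation texts are hypothetical, and the function `build_corpus` is invented for this example. Each axiom becomes one "sentence" of tokens, and each annotation property becomes a sentence that starts with the class identifier, so that a Word2Vec-style model trained on the result would see class identifiers in both logical and natural-language contexts.

```python
# Hypothetical ontology axioms as (subject, relation, object) triples.
AXIOMS = [
    ("EX:0002", "SubClassOf", "EX:0001"),
    ("EX:0003", "SubClassOf", "EX:0001"),
]

# Hypothetical annotation properties per class.
ANNOTATIONS = {
    "EX:0002": {"label": "Toy seizure", "synonym": "Toy fit"},
    "EX:0003": {"label": "Toy ataxia"},
}


def build_corpus(axioms, annotations):
    """Merge axioms and annotation text into one list of token lists."""
    # Each axiom is used verbatim as a three-token sentence.
    sentences = [list(axiom) for axiom in axioms]
    # Each annotation becomes: class id, property name, lowercased words.
    for cls, props in annotations.items():
        for prop, text in props.items():
            sentences.append([cls, prop] + text.lower().split())
    return sentences


corpus = build_corpus(AXIOMS, ANNOTATIONS)
for sentence in corpus:
    print(" ".join(sentence))
```

The resulting list of token lists is the kind of input that a Word2Vec implementation typically expects; the training step itself is omitted here.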