CompleX: Variant Prioritization in Complex Disease

Overview

The hardest cases in clinical genome sequencing are those in which no single variant explains the disease. As Mendelian gene discovery slows and the diagnostic rate for whole-exome sequencing stalls below 50%, growing evidence points to oligogenic and polygenic origins: combinations of moderately rare or common alleles that, individually, look unremarkable. Population-level approaches lack the statistical power to detect such combinations, and traditional single-gene Mendelian reasoning does not consider them. The CompleX project (2019–2021, with the Universities of Cambridge and Birmingham) set out to break this impasse by extending phenotype-driven, knowledge-based variant prioritization (the strategy behind our PVP and DeepPVP tools) to genotypes that involve two or more interacting genes.

The plan was to assemble, for the first time, a knowledge graph capturing genotype–phenotype associations for compound mutations across human (DIDA, ClinVar), mouse (MGI), and zebrafish (ZFIN), enriched with interaction evidence from STRING, BioGRID, and YeastNet; then to design machine-learning algorithms that could exploit this graph as background knowledge to rank candidate variant sets from individual genomes. Because the combinatorial explosion of variant pairs is unmanageable by exhaustive search, we reformulated the prioritization step as a Prize-Collecting Steiner Tree problem and combined it with neural-symbolic feature learning over the integrated phenotype, interaction, and ontology graph.
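To make the reformulation concrete: with n candidate variants there are n(n-1)/2 pairs, and the count grows faster still for larger sets. In a Prize-Collecting Steiner Tree, each node carries a prize, each edge carries a cost, and the goal is a connected subtree that maximises collected prizes minus edge costs. The sketch below is a minimal illustration, not the project's implementation. It assumes node prizes stand in for per-gene candidate scores and edge costs for the weakness of interaction evidence. The gene names, prizes, and costs are hypothetical, and the solver is a brute-force enumeration that is only feasible for toy graphs.

```python
from itertools import combinations

# Hypothetical toy instance: prizes are per-gene candidate scores,
# edge costs penalise weak interaction evidence between two genes.
PRIZES = {"GENE_A": 3.0, "GENE_B": 0.0, "GENE_C": 2.5, "GENE_D": 0.5}
EDGES = {
    ("GENE_A", "GENE_B"): 1.0,
    ("GENE_B", "GENE_C"): 1.0,
    ("GENE_A", "GENE_D"): 2.0,
}

def mst_cost(nodes, edges):
    """Cost of a minimum spanning tree of the subgraph induced by `nodes`,
    or None if that subgraph is disconnected (Kruskal with union-find)."""
    nodes = set(nodes)
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total, used = 0.0, 0
    induced = sorted((c, u, v) for (u, v), c in edges.items()
                     if u in nodes and v in nodes)
    for cost, u, v in induced:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += cost
            used += 1
    return total if used == len(nodes) - 1 else None

def best_pcst(prizes, edges):
    """Enumerate every connected node set and keep the one that
    maximises (sum of prizes) - (cost of its cheapest spanning tree)."""
    best_score, best_nodes = 0.0, frozenset()
    names = sorted(prizes)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            cost = mst_cost(subset, edges)
            if cost is None:
                continue  # not connected, so not a tree
            score = sum(prizes[n] for n in subset) - cost
            if score > best_score:
                best_score, best_nodes = score, frozenset(subset)
    return best_score, best_nodes

score, chosen = best_pcst(PRIZES, EDGES)
print(score, sorted(chosen))
```

In this toy instance the zero-prize GENE_B is selected because it connects the two high-prize genes cheaply, which is the behaviour that makes the formulation attractive for interacting gene sets. Realistic interaction networks would need an approximation algorithm or heuristic rather than enumeration.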

Concrete deliverables

The project produced the technical scaffolding on which knowledge-graph-based variant prioritization now rests. "DeepGOPlus: improved protein function prediction from sequence" (Bioinformatics, 2019) and its web companion, "DeepGOWeb: fast and accurate protein function prediction on the (Semantic) Web", made deep-learning function predictions available at scale and via SPARQL, which is essential for inferring the functional context of variants in genes that lack experimental annotation. PathoPhenoDB (Scientific Data, 2019) extended phenotype-driven prioritization beyond host genetics by linking human pathogens to disease phenotypes, allowing the same machinery to address infectious-disease cases. "Semantic similarity and machine learning with biomedical ontologies" (Briefings in Bioinformatics, 2020) consolidated the methodological foundation that the prioritization algorithms depend on: how to combine ontology-based semantic similarity with embeddings learned over the knowledge graph. "How much do model organism phenotypes contribute to the computational identification of human disease genes?" (2021) quantified, for the first time at this scale, the contribution of mouse, zebrafish, fly, and yeast phenotypes to disease-gene prediction, settling a long-standing question about which model-organism evidence is worth the integration cost.
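As an illustration of what ontology-based semantic similarity means in practice, the sketch below computes Resnik similarity, one widely used measure, on a toy ontology. It is not taken from the project's code. The ontology terms, gene names, and annotations are hypothetical, and the embedding side of the combination is not shown.

```python
import math

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "phenotype": [],
    "cardiac": ["phenotype"],
    "neural": ["phenotype"],
    "arrhythmia": ["cardiac"],
    "cardiomyopathy": ["cardiac"],
    "seizure": ["neural"],
}

# Hypothetical gene-to-phenotype annotations.
ANNOTATIONS = {
    "GENE_A": {"arrhythmia"},
    "GENE_B": {"cardiomyopathy"},
    "GENE_C": {"seizure"},
    "GENE_D": {"arrhythmia", "seizure"},
}

def ancestors(term):
    """Return `term` together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen

def information_content():
    """IC(t) = -log p(t), where p(t) is the fraction of genes
    annotated with t or with any descendant of t."""
    counts = {t: 0 for t in PARENTS}
    for terms in ANNOTATIONS.values():
        closure = set().union(*(ancestors(t) for t in terms))
        for t in closure:
            counts[t] += 1
    n = len(ANNOTATIONS)
    return {t: -math.log(c / n) for t, c in counts.items() if c > 0}

def resnik(t1, t2, ic):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(ic[t] for t in common)

ic = information_content()
print(resnik("arrhythmia", "cardiomyopathy", ic))  # share the ancestor "cardiac"
print(resnik("arrhythmia", "seizure", ic))         # share only the root term
```

Two terms that share only the root score zero because the root annotates every gene and so carries no information. Terms that share a more specific ancestor score higher.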

Alongside these, the project delivered two tools. Klarigi ("Klarigi: characteristic explanations for semantic data") turns the prioritization output into human-readable explanations grounded in ontology terms, a step toward clinical usability. PhenomeBrowser is an integrated query interface over phenotypes, their semantics, and phenotype-based machine-learning predictions. Students Sara Althubaiti and Sarah Alghamdi, identified at the start of the proposal as project leads, produced their doctoral work directly within this programme; Mona Alshahrani and Imane Boudellioua completed adjacent PhDs on the underlying representation-learning methods.

The combination is what mattered: by 2021 the group had a working knowledge graph for compound genotype–phenotype associations, a family of embedding-based prioritization methods (now consumed by downstream PVP variants and the DeepPheno line of work), and explanation tooling that made the predictions auditable. The intellectual property generated — updates to PVP, DeepPVP, and OligoPVP — became the basis for the group's continuing work on rare and complex disease diagnosis, including subsequent applications to Saudi cohorts.

Period: 2019–2021

Funding

  • KAUST Competitive Research Grant — Grant ID: URF/1/3790-01-01 (PI) — USD 240,000

Publications acknowledging this project (17)

  • (2021) How much do model organism phenotypes contribute to the computational identification of human disease genes?
  • (2020) DeepGOWeb: Fast and accurate protein function prediction on the (Semantic) Web
  • (2020) Semantic similarity and machine learning with biomedical ontologies
  • (2019) DeepGOPlus: Improved protein function prediction from sequence
  • (2019) PathoPhenoDB: linking human pathogens to their disease phenotypes in support of infectious disease research
  • (2015) Ontology-based prediction of cancer driver genes
  • (2015) Klarigi: Explanations for Semantic Groupings Supplementary Material
  • (2012) DDIEM: Drug Database for Inborn Errors of Metabolism
  • (2012) Linking common human diseases to their phenotypes; development of a resource for human phenomics
  • (2012) Komenti: A Semantic Text-mining Framework
  • (2012) Towards semantic interoperability: finding and repairing hidden contradictions in biomedical ontologies
  • (2012) Improved characterisation of clinical text through ontology-based vocabulary expansion
  • PhenomeBrowser: Integrating phenotypes, their semantics, and phenotype-based machine learning across domains, organisms, and applications
  • Klarigi: Characteristic Explanations for Semantic Data
  • Klarigi: Characteristic Explanations for Semantic Data
  • … and 2 more.

Topics: Applied Ontology, Neuro-symbolic AI, Rare disease, Semantic similarity