Bio2Vec: Smart analytics infrastructure for the life sciences

Overview

By the mid-2010s the life sciences had invested heavily in machine-readable knowledge: biomedical ontologies were used throughout biology to annotate data, and large RDF knowledge graphs such as Bio2RDF aggregated billions of statements from dozens of major databases. At the same time, large personal genomic datasets (the UK 100,000 Genomes Project, UK Biobank, and the Saudi Human Genome Program) were coming online, and translating these into clinical insight depended on integrating them with that existing background knowledge. Generic knowledge-graph machine learning methods, however, handled neither the size nor the rich Description Logic semantics of biomedical knowledge graphs, and computational biologists still spent most of their time integrating and cleaning data instead of analysing it. Bio2Vec, a CRG2017-funded collaboration with Stanford/Maastricht (Michel Dumontier), KAUST (Xin Gao), and Bonn (Jens Lehmann) that ran from 2018 to 2020, set out to build the missing semantic analytics infrastructure for the life sciences.

The technical legacy of the project is a family of ontology- and knowledge-graph embedding methods that have become standard tools. Onto2Vec: joint vector-based representation of biological entities and their ontology-based annotations (Bioinformatics, 2018) was the first method to learn entity representations that jointly capture an ontology's axioms and the data annotated with it, and it outperformed semantic-similarity baselines on protein-protein interaction and gene-disease association tasks. OPA2Vec: combining formal and informal content of biomedical ontologies to improve similarity-based prediction (Bioinformatics, 2018) extended this to use both formal axioms and the natural-language annotations (labels, definitions, synonyms) attached to ontology classes. EL Embeddings: Geometric construction of models for the Description Logic EL++ (IJCAI 2019) was, to the team's knowledge, the first method to construct embeddings of an ontology that are provably sound with respect to the model theory of the underlying logic, opening up a line of work on syntax-semantics-preserving embeddings that continues in subsequent projects. Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings (SWAT4LS 2018) showed how to combine logical retrieval with vector similarity in a single query language, and Semi-Supervised Entity Alignment via Knowledge Graph Embedding with Awareness of Degree Difference (WWW 2019) addressed cross-graph alignment, a perennial integration headache.
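The idea of a geometric model for an ontology can be illustrated with a small sketch. It assumes, for illustration only, that each class is represented as an n-dimensional ball (a centre and a radius) and that a subclass axiom C SubClassOf D is satisfied when the ball for C lies inside the ball for D. The function names, the margin parameter, and the toy values are hypothetical and are not taken from the project's code.

```python
import math


def subsumption_loss(center_c, radius_c, center_d, radius_d, margin=0.0):
    """Hinge-style penalty for the axiom C SubClassOf D.

    Assumption: classes are balls, and ball C is inside ball D exactly when
    dist(center_c, center_d) + radius_c <= radius_d. The penalty is zero in
    that case and grows with the amount by which C sticks out of D.
    """
    distance = math.dist(center_c, center_d)
    return max(0.0, distance + radius_c - radius_d - margin)


def is_satisfied(center_c, radius_c, center_d, radius_d):
    """True when the ball for C is geometrically contained in the ball for D."""
    return subsumption_loss(center_c, radius_c, center_d, radius_d) == 0.0


# Toy example with hypothetical values: a small ball inside a larger one.
small = ((0.0, 0.0), 1.0)
large = ((0.5, 0.0), 2.0)

print(is_satisfied(*small, *large))      # small is contained in large
print(subsumption_loss(*large, *small))  # large is not contained in small
```

Minimising a penalty of this kind over all axioms pushes the geometry towards a configuration in which every stated subclass relation holds as containment, which is the sense in which an embedding can act as a model of the ontology.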

The infrastructure was tied to concrete biomedical use cases. DeepGOPlus: improved protein function prediction from sequence (Bioinformatics, 2020) used learned ontology structure to deliver competitive function prediction from sequence alone, and was complemented by DeepPheno: Predicting single gene loss-of-function phenotypes using an ontology-aware hierarchical neural network (PLOS Computational Biology, 2020) for phenotype prediction. DeepPVP: phenotype-based prioritization of causative variants using deep learning (BMC Bioinformatics, 2019) and OligoPVP: Phenotype-driven analysis of individual genomic information to prioritize oligogenic disease variants (Scientific Reports, 2018) applied the framework to variant prioritisation. Ontology-based prediction of cancer driver genes (Scientific Reports, 2019) applied it to cancer genomics, and PathoPhenoDB: linking human pathogens to their disease phenotypes (Scientific Data, 2019) applied it to infectious disease research. The methodological synthesis is captured in Semantic similarity and machine learning with ontologies (Briefings in Bioinformatics, 2020), which has become a widely cited reference in the area.
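Phenotype-driven prioritisation rests on comparing sets of ontology annotations. The sketch below is a deliberately simple stand-in, not the method used in DeepPVP or OligoPVP: it ranks candidate genes by the Jaccard similarity between a patient's phenotypes and each gene's phenotypes after both sets are closed under superclasses. The toy hierarchy, the class names, and the gene names are all hypothetical.

```python
def ancestors(cls, parents):
    """Return cls together with all of its superclasses in an is-a hierarchy."""
    seen = {cls}
    stack = [cls]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def closure(annotations, parents):
    """Close a set of annotation classes under the superclass relation."""
    result = set()
    for cls in annotations:
        result |= ancestors(cls, parents)
    return result


def jaccard(a, b, parents):
    """Jaccard similarity of two annotation sets after superclass closure."""
    closed_a, closed_b = closure(a, parents), closure(b, parents)
    union = closed_a | closed_b
    return len(closed_a & closed_b) / len(union) if union else 0.0


# Hypothetical toy hierarchy: child class -> list of direct superclasses.
PARENTS = {
    "seizure": ["nervous_system_phenotype"],
    "ataxia": ["nervous_system_phenotype"],
    "cardiomyopathy": ["heart_phenotype"],
    "nervous_system_phenotype": ["phenotype"],
    "heart_phenotype": ["phenotype"],
}

patient = {"seizure"}
candidates = {"gene_a": {"ataxia"}, "gene_b": {"cardiomyopathy"}}

# Rank candidate genes by similarity to the patient's phenotypes.
ranking = sorted(
    candidates,
    key=lambda gene: jaccard(patient, candidates[gene], PARENTS),
    reverse=True,
)
print(ranking)
```

Closing the annotation sets under superclasses is what lets the ontology contribute: "seizure" and "ataxia" share no class directly, but they share a parent, so gene_a scores higher than gene_b.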

Bio2Vec delivered both an infrastructure (the embedding stack, the analytics methods built on SANSA, the integration with Bio2RDF and the AberOWL ontology repository) and a body of evidence that semantic background knowledge measurably improves prediction on real biomedical problems. The methods opened the way for everything that followed in the group's neuro-symbolic line, including IBNSINA-QI, the CRG11 explainable-ML project, and the function-prediction stack used in DeepGOMeta and DeepGOZero. For Saudi Arabia, the project produced reusable, knowledge-based analytics that fit the Vision 2030 transition to a knowledge-based economy and laid the technical groundwork for population-scale genomics applications that came online in subsequent projects.

Period: 2018–2020

Funding

  • KAUST Competitive Research Grant — Grant ID: URF/1/3454-01-01 (PI) — USD 113,250

Team

  • Robert Hoehndorf — PI (KAUST (Professor of Computer Science))
  • Xin Gao — CoI (KAUST (CBRC))
  • Michel Dumontier — CoI (Maastricht University)
  • Jens Lehmann — CoI (Amazon (formerly TU Dresden))
  • Mona Alshahrani — PhD (alumnus) (Jubail University College (Assistant Professor))
  • Maxat Kulmanov — PhD (alumnus) (KAUST (Research Scientist))
  • Sumyyah Toonsi — MSc (alumnus)

Publications acknowledging this project (20)

Topics: Applied Ontology, Neuro-symbolic AI, Semantic similarity