Data integration and ontologies for microbial cell factories

Overview

Industrial biotechnology depends on integrating data from sources that were never designed to be combined: enzyme and reaction databases, protein-family resources, pathway maps, strain collections, phenotypic screens, and the experimental literature itself. Each uses different identifiers, different vocabularies, and different levels of granularity, and naive integration produces brittle pipelines that break whenever a source is updated. The Computational Bioscience Research Center's Microbial Cell Factories initiative, led by Vladimir Bajic with Hoehndorf as CoI heading the data-integration work package, set out to develop ontology-based infrastructure for synthetic biology and metabolic engineering in which different data sources are reconciled through their underlying biology rather than through string-matched identifiers.

The data-integration package produced a set of methods that have since become foundational tools in BORG and beyond. "DeepGOPlus: Improved protein function prediction from sequence" combined a convolutional sequence model with a similarity-based component and a hierarchical post-processing step that enforces the structure of the Gene Ontology. It was followed by "DeepGOWeb: Fast and accurate protein function prediction on the (Semantic) Web", which exposed the method as a SPARQL- and HTTP-accessible service and integrated it with the OpenAPI ecosystem. "Predicting protein functions from sequence by a neuro-symbolic deep learning model" set out the wider framework in which ontology axioms are used as constraints during training of neural function predictors.
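To make the idea of hierarchy-enforcing post-processing concrete, the following is a minimal sketch of one common approach: propagating each predicted score upward so that no term scores lower than any of its descendants. It is an illustration under stated assumptions, not the DeepGOPlus implementation; the term names, the `PARENTS` map, and the functions `ancestors` and `propagate_scores` are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical toy hierarchy (child -> parents); these are NOT real GO identifiers.
PARENTS: Dict[str, List[str]] = {
    "T:root": [],
    "T:binding": ["T:root"],
    "T:catalysis": ["T:root"],
    "T:kinase": ["T:catalysis"],
    "T:atp_binding": ["T:binding"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return all transitive ancestors of a term (excluding the term itself)."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate_scores(
    scores: Dict[str, float], parents: Dict[str, List[str]]
) -> Dict[str, float]:
    """Raise each ancestor's score to at least the score of its descendants."""
    consistent = dict(scores)
    for term, score in scores.items():
        for anc in ancestors(term, parents):
            consistent[anc] = max(consistent.get(anc, 0.0), score)
    return consistent


# Assumed raw predictor output: the specific term outscores its parent,
# which is inconsistent with the hierarchy.
raw = {"T:kinase": 0.9, "T:catalysis": 0.4, "T:atp_binding": 0.3}
consistent = propagate_scores(raw, PARENTS)
```

After propagation, every term in this toy example scores at least as high as each of its descendants, so a prediction for a specific function always implies its more general ancestors.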

A parallel line of work addressed the semantics of the underlying ontologies. "Formal axioms in biomedical ontologies improve analysis and interpretation of associated data" showed empirically that adding logical axioms to ontologies, often regarded as overhead, materially improves the quality of downstream similarity, embedding, and enrichment analyses. "Semantic similarity and machine learning with biomedical ontologies" consolidated these results into a methodological review that has become a standard reference. "Quantitative evaluation of ontology design patterns for combining pathology and anatomy ontologies" provided concrete guidance for combining disjoint terminologies, and "Vec2SPARQL: integrating SPARQL queries and knowledge graph embeddings" demonstrated how knowledge-graph embeddings can be queried jointly with the symbolic graph, a pattern reused in later BORG work on the PhenomeBrowser.

The biomedical application layer demonstrated that this infrastructure pays off on concrete questions. "Ontology-based prediction of cancer driver genes" and "DeepPVP: phenotype-based prioritization of causative variants using deep learning" used phenotype ontologies to prioritize candidate variants and driver genes. "PathoPhenoDB: linking human pathogens to their disease phenotypes" and "Ontology based mining of pathogen-disease associations from literature" built a reusable resource linking pathogen species to the phenotypes they cause. "DDIEM: Drug Database for Inborn Errors of Metabolism" assembled curated treatment information for a class of diseases in which Saudi Arabia carries a particularly high burden. The 25 publications produced under the project span the technology stack from logical foundations to deployed databases, and many of them seeded subsequent BORG projects on neuro-symbolic learning, ontology embeddings, and variant prioritization.

Period: 2016–2018

Funding

  • KAUST Center Competitive Funding — Grant ID: FCC/1/1976-08-01 (WP-lead) — USD 115,691

Team

Software

Publications acknowledging this project (25)

Topics: Applied Ontology, Drug mechanisms, Microbial communities, Neuro-symbolic AI