Computational methods for functional metagenomics: from protein functions to multi-scale interactions

Overview

Metagenomic sequencing has made it routine to read the DNA of an entire microbial community, but most analysis pipelines stop at taxonomic composition or at the level of individual protein families. The most biologically informative questions remain largely out of reach computationally: which proteins do what, which proteins interact, which metabolic pathways can be reconstructed, and how the community as a whole interacts with its environment or host. Even empirically robust associations, for example between gut microbiome composition and colorectal cancer or inflammatory bowel disease (IBD), remain mechanistically obscure: the direction of causation, the role of host-microbe metabolite exchange, and the contribution of direct protein-protein contacts are all underdetermined by current pipelines. This project, funded under CRG2021 and running 2022–2024, set out to close that gap by treating metagenomes as multi-scale interactomes and developing the machine-learning methods to annotate them at that level.

The function-prediction core extended a family of deep-learning models developed by the group. "DeepGOPlus: improved protein function prediction from sequence" (Bioinformatics, 2020) combined convolutional sequence models with DIAMOND-based sequence similarity to improve performance on CAFA-style benchmarks using sequence alone. "DeepGOZero: improving protein function prediction from sequence and zero-shot learning based on ontology axioms" (Bioinformatics, 2022) added zero-shot prediction for previously unseen GO classes by exploiting the formal axioms of the Gene Ontology. "Protein function prediction as approximate semantic entailment" (Nature Machine Intelligence, 2024) recast the prediction problem itself as approximate inference and improved both accuracy and interpretability. The metagenome-specific extension was delivered in "DeepGOMeta for functional insights into microbial communities using deep learning-based protein function prediction" (Scientific Reports, 2024), which applies and extends DeepGO to assembled metagenomes and short-read inputs and demonstrates its utility in characterising real microbial communities. Web-scale deployment was provided by "DeepGOWeb: fast and accurate protein function prediction on the (Semantic) Web" (Nucleic Acids Research, 2021).
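As a rough illustration of the idea of combining a sequence model with similarity-based evidence, the following Python sketch merges two per-class score sets and then propagates scores up an ontology so that a parent class never scores lower than its child. The weighted sum with a tunable alpha, the max-based propagation, the function names, and the toy class identifiers are all assumptions made for this example; it is not the published implementation of any of the tools above.

```python
from typing import Dict, Set

# Toy is-a hierarchy (child -> parents). Identifiers are made up for illustration
# and are not real Gene Ontology classes.
TOY_PARENTS: Dict[str, Set[str]] = {
    "GO:C": {"GO:B"},
    "GO:B": {"GO:A"},
    "GO:A": set(),
}


def combine_scores(
    model: Dict[str, float],
    similarity: Dict[str, float],
    alpha: float = 0.5,
) -> Dict[str, float]:
    """Weighted sum of similarity-based and model-based scores per class.

    alpha is a hypothetical mixing weight; a class missing from one source
    contributes 0.0 from that source.
    """
    terms = set(model) | set(similarity)
    return {
        t: alpha * similarity.get(t, 0.0) + (1.0 - alpha) * model.get(t, 0.0)
        for t in terms
    }


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All classes reachable from term by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(scores: Dict[str, float], parents: Dict[str, Set[str]]) -> Dict[str, float]:
    """Give every ancestor at least the score of its best-scoring descendant."""
    out = dict(scores)
    for term, score in scores.items():
        for anc in ancestors(term, parents):
            out[anc] = max(out.get(anc, 0.0), score)
    return out


if __name__ == "__main__":
    combined = combine_scores({"GO:C": 0.8}, {"GO:C": 0.4, "GO:B": 0.2}, alpha=0.5)
    for term, score in sorted(propagate(combined, TOY_PARENTS).items()):
        print(term, round(score, 3))
```

Propagating the maximum upwards keeps predictions consistent with the ontology hierarchy: an annotation to a specific class implies annotation to all of its superclasses.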

Beyond function, the project tackled interactions and pathways. Cross-organism interactions were addressed by extending "DeepViral: prediction of novel virus-host interactions from protein sequences and infectious disease phenotypes" (Bioinformatics, 2021) from virus-host to bacteria-host pairs, allowing host-microbe contacts to be predicted directly from sequence. Metabolic pathway reconstruction was approached by combining predicted enzyme functions with answer-set-programming-style completion against KEGG and MetaCyc. Reproducible CWL workflows wrap these steps for both assembly-based and assembly-free modes. Benchmarks were drawn from four sources: UniProt-derived synthetic reads; the MBARC-26 mock community; Red Sea marine metagenomes (about 950 L of seawater sampled across late 2020 and early 2021 and sequenced on Illumina NovaSeq, PacBio, and Oxford Nanopore); and publicly available health-associated cohorts, including colorectal cancer and IBD microbiomes.
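The pathway-completion step can be pictured with a much simpler stand-in than answer-set programming. The Python sketch below uses plain set operations: for each pathway it compares the enzymes the pathway requires with the enzymes predicted in a sample, and reports pathways that are complete or can be completed by filling at most a small number of gaps. The pathway and enzyme labels, the max_gaps threshold, and all names in the code are hypothetical; real KEGG or MetaCyc definitions and a true ASP solver would replace them in practice.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class PathwayReport:
    pathway: str
    present: FrozenSet[str]   # required enzymes found among the predictions
    missing: FrozenSet[str]   # required enzymes that would need gap-filling

    @property
    def coverage(self) -> float:
        total = len(self.present) + len(self.missing)
        return len(self.present) / total if total else 0.0


def complete_pathways(
    pathways: Dict[str, Set[str]],
    predicted: Set[str],
    max_gaps: int = 1,
) -> List[PathwayReport]:
    """Report pathways with some support and at most max_gaps missing enzymes.

    This is a set-based approximation of completion, not an ASP encoding.
    """
    reports = []
    for name, required in pathways.items():
        present = required & predicted
        missing = required - predicted
        if present and len(missing) <= max_gaps:
            reports.append(PathwayReport(name, frozenset(present), frozenset(missing)))
    # Best-covered pathways first; ties broken by name for stable output.
    return sorted(reports, key=lambda r: (-r.coverage, r.pathway))


if __name__ == "__main__":
    # Hypothetical pathway definitions and predictions, for illustration only.
    toy_pathways = {"P1": {"E1", "E2", "E3"}, "P2": {"E4", "E5", "E6"}}
    toy_predicted = {"E1", "E2", "E4"}
    for report in complete_pathways(toy_pathways, toy_predicted, max_gaps=1):
        print(report.pathway, sorted(report.missing), round(report.coverage, 2))
```

The missing set for each reported pathway is the list of candidate functions to look for among low-confidence predictions, which is the role the completion step plays in the description above.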

In summary, the project delivered a stack that pushes metagenomic annotation past taxonomy and protein families into functional, interaction-level, and pathway-level structure, evaluated on both synthetic and real samples. Applications include marine bioprospecting and ecosystem monitoring in the Red Sea, aligned with the Saudi Green Initiative, and clinical metagenomics, where the same framework is used to surface candidate causal mechanisms linking the gut microbiome to disease state in cancer and IBD. The software, workflows, benchmarks, and trained models are released as Free Software in containerised, FAIR-compliant form, and feed directly into downstream projects on protein function, knowledge-graph methods, and population-scale genomics.

Period: 2022–2024

Funding

  • KAUST Competitive Research Grant — Grant ID: URF/1/4675-01-01 (PI) — USD 247,500

Team

Software

Publications acknowledging this project (16)

  • (2025) Lattice-based ALC ontology embeddings with saturation
  • (2024) Predicting protein functions using positive-unlabeled ranking with ontology-based priors Supplementary Material
  • (2024) Neuro-symbolic AI in Life Sciences
  • (2023) DeepGOMeta: Functional Insights into Microbial Communities with Deep Learning-Based Protein Function Prediction
  • (2022) Exploring the Use of Ontology Components for Distantly-Supervised Disease and Phenotype Named Entity Recognition
  • (2022) Context-based protein function prediction in bacterial genomes
  • (2022) INDIGENA: inductive prediction of disease-gene associations using phenotype ontologies Supplementary Material
  • (2022) mOWL: revision document
  • (2022) Large-Scale Knowledge Integration for Enhanced Molecular Property Prediction
  • (2018) Ontology Embedding: A Survey of Methods, Applications and Resources
  • (2015) The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients
  • (2012) Exploring the Use of Ontology Components for Distantly-Supervised Disease and Phenotype Named Entity Recognition
  • (2012) Improving the classification of cardinality phenotypes using collections
  • (2012) STARVar: Symptom-based Tool for Automatic Ranking of Variants using evidence from literature and genomes
  • … and 1 more.

Topics: Applied Ontology, Microbial communities, Neuro-symbolic AI, Protein function