Computational methods for functional metagenomics: from protein functions to multi-scale interactions
Overview
Metagenomic sequencing has made it routine to read the DNA of an entire microbial community, but most analysis pipelines stop at taxonomic composition or at the level of individual protein families. The most biologically informative questions remain largely out of reach computationally: which proteins do what, which proteins interact, which metabolic pathways can be reconstructed, and how the community as a whole interacts with its environment or host. Even empirically robust associations, for example between gut microbiome composition and colorectal cancer or inflammatory bowel disease (IBD), remain mechanistically obscure: the direction of causation, the role of host-microbe metabolite exchange, and the contribution of direct protein-protein contacts are all underdetermined by current pipelines. This project, funded under CRG2021 and running 2022-2024, set out to close that gap by treating metagenomes as multi-scale interactomes and developing the machine-learning methods needed to annotate them at that level.
The function-prediction core extended a family of deep-learning models developed by the group. "DeepGOPlus: improved protein function prediction from sequence" (Bioinformatics, 2020) combined convolutional sequence models with DIAMOND-based sequence similarity to reach CAFA-level performance using sequence alone. "DeepGOZero: improving protein function prediction from sequence and zero-shot learning based on ontology axioms" (Bioinformatics, 2022) added zero-shot prediction for previously unseen GO classes by exploiting the formal axioms of the Gene Ontology. "Protein function prediction as approximate semantic entailment" (Nature Machine Intelligence, 2024) recast the prediction problem itself as approximate inference and improved both accuracy and interpretability. The metagenome-specific extension was delivered in "DeepGOMeta for functional insights into microbial communities using deep learning-based protein function prediction" (Scientific Reports, 2024), which applies and extends DeepGO to assembled metagenomes and short-read inputs and demonstrates its utility in characterising real microbial communities. Web-scale deployment was provided by "DeepGOWeb: fast and accurate protein function prediction on the (Semantic) Web" (Nucleic Acids Research, 2021).
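The idea of pairing a sequence model with similarity search can be illustrated with a minimal sketch. It assumes a weighted average of the two score sources, followed by propagation of scores up the ontology so that a parent class never scores below its children. The toy hierarchy, protein identifiers, `alpha` value, and all function names are hypothetical and are not taken from the DeepGOPlus code base.

```python
from collections import defaultdict

# Toy ontology fragment: child -> parents (hypothetical IDs, not real GO classes).
PARENTS = {"GO:C": ["GO:B"], "GO:B": ["GO:A"], "GO:A": []}


def similarity_scores(hits, annotations):
    """Bitscore-weighted vote over similarity hits.

    hits: list of (protein_id, bitscore) pairs from a similarity search.
    annotations: protein_id -> set of ontology classes.
    """
    total = sum(bitscore for _, bitscore in hits)
    if total == 0:
        return {}
    scores = defaultdict(float)
    for protein_id, bitscore in hits:
        for term in annotations.get(protein_id, ()):
            scores[term] += bitscore / total
    return dict(scores)


def combine(model_scores, sim_scores, alpha=0.5):
    """Weighted average; alpha is the weight given to the similarity scores."""
    terms = set(model_scores) | set(sim_scores)
    return {
        t: alpha * sim_scores.get(t, 0.0) + (1 - alpha) * model_scores.get(t, 0.0)
        for t in terms
    }


def ancestors(term, parents):
    """All classes reachable by following parent links from term."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def propagate(scores, parents):
    """Raise each ancestor's score to at least the score of its descendants."""
    out = dict(scores)
    for term, value in scores.items():
        for anc in ancestors(term, parents):
            out[anc] = max(out.get(anc, 0.0), value)
    return out


if __name__ == "__main__":
    hits = [("P1", 100.0), ("P2", 300.0)]           # made-up search hits
    annotations = {"P1": {"GO:C"}, "P2": {"GO:B"}}  # made-up annotations
    model = {"GO:C": 0.9, "GO:A": 0.2}              # made-up model outputs
    sim = similarity_scores(hits, annotations)
    print(propagate(combine(model, sim, alpha=0.5), PARENTS))
```

In this sketch the propagation step is what keeps the output consistent with the ontology hierarchy; the weighting between the two sources would normally be tuned on held-out data rather than fixed at 0.5.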
Beyond function, the project tackled interactions and pathways. Cross-organism interactions were addressed by extending "DeepViral: prediction of novel virus-host interactions from protein sequences and infectious disease phenotypes" (Bioinformatics, 2021) from virus-host pairs to bacteria-host pairs, allowing host-microbe contacts to be predicted directly from sequence. Metabolic pathway reconstruction was approached by combining predicted enzyme functions with answer-set-programming-style completion against KEGG and MetaCyc. Reproducible CWL workflows wrap these steps for both assembly-based and assembly-free modes. Benchmarks were drawn from UniProt-derived synthetic reads, the MBARC-26 mock community, Red Sea marine metagenomes (about 950 L of seawater sampled across late 2020 and early 2021 and sequenced on Illumina NovaSeq, PacBio, and Oxford Nanopore), and public, health-associated cohorts including colorectal cancer and IBD microbiomes.
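As a rough illustration of the gap analysis that precedes pathway completion, the sketch below scores each pathway by the fraction of its required enzyme classes that were predicted and lists the missing steps. This is a plain set-coverage calculation, not answer set programming. The pathway names and step sets are invented for illustration and are not taken from KEGG or MetaCyc, and `pathway_gaps` is a hypothetical helper.

```python
def pathway_gaps(pathways, predicted_ecs):
    """Return {pathway: (completeness, sorted list of missing EC numbers)}.

    pathways: pathway name -> set of EC numbers the pathway requires.
    predicted_ecs: set of EC numbers predicted for the sample.
    """
    report = {}
    for name, required in pathways.items():
        required = set(required)
        missing = required - predicted_ecs
        if required:
            completeness = 1 - len(missing) / len(required)
        else:
            completeness = 0.0  # a pathway with no defined steps gets no credit
        report[name] = (completeness, sorted(missing))
    return report


if __name__ == "__main__":
    # Illustrative step sets only; real definitions would come from a pathway database.
    toy_pathways = {
        "toy_pathway_a": {"1.1.1.1", "2.7.1.1", "4.1.2.13"},
        "toy_pathway_b": {"1.1.1.1"},
    }
    predicted = {"1.1.1.1", "4.1.2.13"}
    for name, (score, gaps) in pathway_gaps(toy_pathways, predicted).items():
        print(f"{name}: completeness={score:.2f}, missing={gaps}")
```

A completion step would then look for candidate proteins that could plausibly fill the listed gaps; that search is where a constraint-based formulation becomes useful.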
In summary, the project delivered a software stack that pushes metagenomic annotation past taxonomy and protein families into functional, interaction-level, and pathway-level structure, evaluated on both synthetic and real samples. Applications include marine bioprospecting and ecosystem monitoring in the Red Sea, aligned with the Saudi Green Initiative, and clinical metagenomics, where the same framework is used to surface candidate causal mechanisms linking the gut microbiome to disease state in cancer and IBD. The software, workflows, benchmarks, and trained models are released as Free Software in containerised, FAIR-compliant form, and feed directly into downstream projects on protein function, knowledge-graph methods, and population-scale genomics.
Period: 2022–2024
Funding
- KAUST Competitive Research Grant (Grant ID: URF/1/4675-01-01, PI), USD 247,500
Team
- Robert Hoehndorf — PI (KAUST (Professor of Computer Science))
- Takashi Gojobori — CoI (KAUST (CBRC))
- Maxat Kulmanov — PhD (alumnus), Postdoc (KAUST (Research Scientist))
- Rund Tawfiq — PhD (alumnus) (Sano Centre Krakow (Postdoctoral researcher))
- Daulet Toibazar — MSc (alumnus)
- Amal Alhelal — MSc (alumnus)
- Md Nurul Muttakin — MSc (alumnus)
- Shahad Qatan — MSc (alumnus)
- Kexin Niu — MSc (alumnus)
- Asaad Mohammedsaleh — MSc (alumnus)
Software
- DeepGOPlus — CNN-ensemble protein-function predictor that augments sequence-based scoring with k-nearest-neighbour homology and GO axioms; one of the strongest CAFA-evaluated open models. https://github.com/bio-ontology-research-group/deepgoplus
- DeepGO — Original sequence-based, ontology-aware deep classifier for predicting Gene Ontology functional annotations; basis of the entire DeepGO family of tools. https://github.com/bio-ontology-research-group/deepgo
- DeepGO2 — Next-generation DeepGO model with transformer protein embeddings and improved hierarchical multi-label prediction. https://github.com/bio-ontology-research-group/deepgo2
- DeepGOZero — Zero-shot extension of DeepGO using model-theoretic ELEmbeddings to predict GO classes that have never been observed during training. https://github.com/bio-ontology-research-group/deepgozero
- PU-GO — Positive-unlabeled ranking of protein functions with ontology-based priors; directly addresses the partial-annotation problem in CAFA benchmarks. https://github.com/bio-ontology-research-group/PU-GO
- PhenoGoCon — Predicts gene–phenotype associations from predicted Gene Ontology functions; bridges GO function prediction and HPO/MPO phenotype prediction. https://github.com/bio-ontology-research-group/phenogocon
- Genomic Context — Bacterial protein-function prediction that exploits operon and genome-neighbourhood structure in addition to sequence and homology. https://github.com/bio-ontology-research-group/Genomic_context
- PathoPhenoDB — Curated database of pathogens and the disease phenotypes they cause, distributed as an OWL ontology and an interactive web application. https://github.com/bio-ontology-research-group/pathophenodb
Publications acknowledging this project (16)
- (2025) Lattice-based ALC ontology embeddings with saturation
- (2024) Predicting protein functions using positive-unlabeled ranking with ontology-based priors
- (2024) Neuro-symbolic AI in Life Sciences
- (2023) DeepGOMeta: Functional Insights into Microbial Communities with Deep Learning-Based Protein Function Prediction
- (2022) Exploring the Use of Ontology Components for Distantly-Supervised Disease and Phenotype Named Entity Recognition
- (2022) Context-based protein function prediction in bacterial genomes
- (2022) INDIGENA: inductive prediction of disease-gene associations using phenotype ontologies
- (2022) mOWL: revision document
- (2022) Large-Scale Knowledge Integration for Enhanced Molecular Property Prediction
- (2018) Ontology Embedding: A Survey of Methods, Applications and Resources
- (2015) The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients
- (2012) Exploring the Use of Ontology Components for Distantly-Supervised Disease and Phenotype Named Entity Recognition
- (2012) Improving the classification of cardinality phenotypes using collections
- (2012) STARVar: Symptom-based Tool for Automatic Ranking of Variants using evidence from literature and genomes
- … and 1 more.
Topics: Applied Ontology, Microbial communities, Neuro-symbolic AI, Protein function