Protein function
Determining what a protein does from its sequence alone is one of the foundational problems of computational molecular biology. Experimental characterisation cannot keep pace with sequencing throughput, and large-scale ontologies such as the Gene Ontology (GO) provide the structured background knowledge needed to make automated function assignment tractable. Our work centres on the DeepGO family of systems, which couple deep neural networks with the formal axioms of the Gene Ontology so that predicted annotations are not only accurate but also logically consistent with what is already known about biological processes, molecular functions, and cellular components. By treating GO as a structured theory rather than a flat label set, we use the ontology to constrain learning, support transfer between organisms, and reason about classes that never appear in the training data.
The DeepGO family
The line of work began with "DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier", which introduced a hierarchical neural classifier that respects the true-path rule of the Gene Ontology (a protein annotated to a class is also annotated to all of its superclasses) and combines sequence with protein-protein interaction features. Sequence-only prediction was substantially improved in "DeepGOPlus: improved protein function prediction from sequence", where a CNN ensemble was combined with k-nearest-neighbour homology scoring to produce one of the strongest models in the CAFA evaluation, as analysed in "The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens". To put these models in the hands of biologists, "DeepGOWeb: fast and accurate protein function prediction on the (Semantic) Web" exposed DeepGOPlus through a website, a REST API, and a SPARQL endpoint, with predicted functions kept consistent with the GO hierarchy.
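The two ideas in this paragraph, blending a neural score with a homology score and then enforcing the true-path rule, can be illustrated with a minimal sketch. It is not the DeepGOPlus implementation: the toy hierarchy, the term identifiers, the weight `alpha`, and the function names are all assumptions made for illustration.

```python
from typing import Dict, Set

# Toy is-a hierarchy (term -> direct parents). Identifiers are illustrative, not real GO IDs.
PARENTS: Dict[str, Set[str]] = {
    "GO:root": set(),
    "GO:binding": {"GO:root"},
    "GO:catalysis": {"GO:root"},
    "GO:dna_binding": {"GO:binding"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every proper ancestor of `term` in the hierarchy."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def combine_scores(model_scores: Dict[str, float],
                   homology_scores: Dict[str, float],
                   alpha: float = 0.5) -> Dict[str, float]:
    """Weighted blend of homology and neural scores (alpha is a hypothetical weight)."""
    terms = set(model_scores) | set(homology_scores)
    return {
        t: alpha * homology_scores.get(t, 0.0) + (1.0 - alpha) * model_scores.get(t, 0.0)
        for t in terms
    }


def propagate(scores: Dict[str, float], parents: Dict[str, Set[str]]) -> Dict[str, float]:
    """Enforce the true-path rule: an ancestor scores at least as high as any descendant."""
    out = dict(scores)
    for term, score in scores.items():
        for anc in ancestors(term, parents):
            out[anc] = max(out.get(anc, 0.0), score)
    return out


# Made-up scores for a single protein.
model = {"GO:dna_binding": 0.8, "GO:binding": 0.3}
homology = {"GO:dna_binding": 0.4, "GO:catalysis": 0.2}

combined = combine_scores(model, homology, alpha=0.5)
final = propagate(combined, PARENTS)
```

After propagation, no term scores higher than any of its ancestors, so thresholding the scores at any cut-off yields an annotation set that is closed under the hierarchy.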
Reasoning with ontology axioms
A central question is how to predict functions that are rare or entirely absent from training data. "DeepGOZero: improving protein function prediction from sequence and zero-shot learning based on ontology axioms" answered this through model-theoretic embeddings of the EL description-logic fragment of GO, allowing the model to predict GO classes by reasoning over their logical definitions rather than memorising labels. The partial-annotation problem inherent to CAFA-style benchmarks was tackled in "Predicting protein functions using positive-unlabeled ranking with ontology-based priors", which replaces the naive closed-world assumption with a positive-unlabeled formulation informed by ontological priors. We have also explored agentic approaches in "LLM Agent Based Protein Function Prediction", which decomposes annotation into tool-calling sub-tasks that combine sequence search, structural lookup, and domain reasoning before synthesising a final GO assignment.
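To give a feel for how geometric embeddings can encode ontology axioms, the sketch below represents each class as an n-ball and penalises violations of a subsumption axiom. It is a simplified illustration in the spirit of EL embeddings, not the DeepGOZero code: the `Ball` type, the function names, the sigmoid membership score, and the example coordinates are all assumptions.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Ball:
    """A class embedded as an n-ball: a centre point plus a radius."""
    center: Sequence[float]
    radius: float


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsumption_loss(sub: Ball, sup: Ball, margin: float = 0.0) -> float:
    """Hinge loss for the axiom `sub ⊑ sup`.

    The loss is zero when the ball of `sub` lies inside the ball of `sup`
    (up to `margin`), and grows with the amount by which it sticks out.
    """
    return max(0.0, distance(sub.center, sup.center) + sub.radius - sup.radius - margin)


def membership_score(point: Sequence[float], cls: Ball) -> float:
    """Score in (0, 1) for a protein embedding; above 0.5 means inside the ball."""
    return 1.0 / (1.0 + math.exp(-(cls.radius - distance(point, cls.center))))


# Hypothetical 2-D embeddings of two classes and one protein.
dna_binding = Ball(center=(1.0, 0.0), radius=0.5)
binding = Ball(center=(0.5, 0.0), radius=1.5)
protein = (1.1, 0.1)

satisfied = subsumption_loss(dna_binding, binding)   # ball is contained, so no penalty
violated = subsumption_loss(binding, dna_binding)    # reversed axiom is penalised
```

Because a class is positioned by its axioms rather than by labelled examples, a protein embedding can be scored against a class with no training annotations, which is the intuition behind zero-shot prediction here.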
Beyond model organisms, microbial and environmental proteomes pose distinctive challenges: most existing predictors are trained on eukaryotic data, and many environmental proteins have no homologues in reference databases. "DeepGOMeta for functional insights into microbial communities using deep learning-based protein function prediction" retrained the DeepGO architecture on microbial annotations, enabling functional characterisation of metagenomic samples and linking proteins to biogeochemical processes. Earlier foundations for this approach were laid in "Prediction of Metabolic Pathway Involvement in Prokaryotic UniProtKB Data by Association Rule Mining", and the practical use of these tools for genome annotation is described in "Annotating genomes with DeepGO protein function prediction tools".
These methods underpin a range of applied programmes, from coral genomics in the Red Sea to AI-tailored soil microbiomes for desert revegetation, and they feed directly into downstream phenotype prediction systems such as DeepPheno and PhenoGoCon. By combining deep representation learning with the formal semantics of biomedical ontologies, our protein function prediction work provides a route from raw sequence to mechanistically informed, machine-readable functional annotations at proteome scale.
Projects
- Enabling desert revegetation by AI-tailored soil microbiome fortification (2023–2025)
- Evolutionary potential of corals to adapt to climate warming (2022–2025)
- Computational methods for functional metagenomics: from protein functions to multi-scale interactions (2022–2024)
Software
- DeepGOPlus — CNN-ensemble protein-function predictor that augments sequence-based scoring with k-nearest-neighbour homology and GO axioms; one of the strongest CAFA-evaluated open models.
- DeepGO — Original sequence-based, ontology-aware deep classifier for predicting Gene Ontology functional annotations; basis of the entire DeepGO family of tools.
- DeepGO2 — Next-generation DeepGO model with transformer protein embeddings and improved hierarchical multi-label prediction.
- DeepGOZero — Zero-shot extension of DeepGO using model-theoretic ELEmbeddings to predict GO classes that have never been observed during training.
- DeepGOMeta — DeepGO trained specifically for metagenomic communities; predicts functional roles of proteins recovered from environmental samples and links them to biogeochemical processes.
- PU-GO — Positive-unlabeled ranking of protein functions with ontology-based priors; directly addresses the partial-annotation problem in CAFA benchmarks.
- DeepPheno — Predicts loss-of-function organism-level phenotypes (HPO/MPO) directly from a gene's annotated functions, using a hierarchical neural classifier over phenotype ontologies.
- GO-Agent — LLM-agent framework that decomposes protein-function prediction into tool-calling sub-tasks (sequence search, structure lookup, domain reasoning) and stitches the evidence into a final GO annotation.
- PhenoGoCon — Predicts gene–phenotype associations from predicted Gene Ontology functions; bridges GO function prediction and HPO/MPO phenotype prediction.
- Genomic Context — Bacterial protein-function prediction that exploits operon and genome-neighbourhood structure in addition to sequence and homology.
Publications (13)
- (2025) Zhapa-Camacho, Mashkova, Hoehndorf, Kulmanov. LLM Agent Based Protein Function Prediction. Biocomputing 2026.
- (2025) Kulmanov, Hoehndorf. Computational prediction of protein functional annotations. Protein Function Prediction.
- (2025) Tawfiq, Niu, Kulmanov, Hoehndorf. Annotating genomes with DeepGO protein function prediction tools. Protein Function Prediction.
- (2024) Zhapa-Camacho, Tang, Kulmanov, Hoehndorf. Predicting protein functions using positive-unlabeled ranking with ontology-based priors. Bioinformatics.
- (2024) Tawfiq, Niu, Hoehndorf, Kulmanov. DeepGOMeta for functional insights into microbial communities using deep learning-based protein function prediction. Scientific Reports.
- (2022) Kulmanov, Hoehndorf. DeepGOZero: improving protein function prediction from sequence and zero-shot learning based on ontology axioms. Bioinformatics.
- (2021) Kulmanov, Zhapa-Camacho, Hoehndorf. DeepGOWeb: fast and accurate protein function prediction on the (Semantic) Web. Nucleic Acids Research.
- (2020) Kulmanov, Hoehndorf. DeepGOPlus: improved protein function prediction from sequence. Bioinformatics.
- (2019) Zhou, Jiang, Bergquist, Lee et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biology.
- (2017) Kulmanov, Khan, Hoehndorf. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics.
- (2016) Boudellioua, Saidi, Hoehndorf, Martin et al. Prediction of Metabolic Pathway Involvement in Prokaryotic UniProtKB Data by Association Rule Mining. PLoS ONE.
- (2011) Jupp, Stevens, Hoehndorf. Exploring Gene Ontology Annotations with OWL. Proceedings of the 13th Bio-Ontology Meeting.
- Jupp, Stevens, Hoehndorf. Logical Gene Ontology Annotations (GOAL): exploring gene ontology annotations with OWL. Journal of Biomedical Semantics.