Protein function

Determining what a protein does from its sequence alone is one of the foundational problems of computational molecular biology. Experimental characterisation cannot keep pace with sequencing throughput, and large-scale ontologies such as the Gene Ontology (GO) provide the structured background knowledge needed to make automated function assignment tractable. Our work centres on the DeepGO family of systems, which couple deep neural networks with the formal axioms of GO so that predicted annotations are not only accurate but also logically consistent with what is already known about biological processes, molecular functions, and cellular components. By treating GO as a structured theory rather than a flat label set, we use the ontology to constrain learning, support transfer between organisms, and reason about classes that never appear in the training data.

The DeepGO family

The line of work began with "DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier", which introduced a hierarchical neural classifier that respects the true-path rule of the Gene Ontology and combines sequence with protein-protein interaction features. Sequence-only prediction was substantially improved in "DeepGOPlus: improved protein function prediction from sequence", where a CNN ensemble was combined with k-nearest-neighbour homology scoring to produce one of the strongest models in the CAFA evaluation, as analysed in "The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens". To put these models in the hands of biologists, "DeepGOWeb: fast and accurate protein function prediction on the (Semantic) Web" exposed DeepGOPlus through a website, a REST API, and a SPARQL endpoint, ensuring that predicted functions remain consistent with the GO hierarchy.
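
The true-path rule mentioned above says that a protein annotated with a GO class is implicitly annotated with every ancestor of that class, so a consistent predictor should never score a parent below its child. The following is a minimal sketch of one way to enforce this after prediction, by raising each ancestor's score to the maximum over its descendants. It is an illustration only and is not taken from the DeepGO code base: the toy hierarchy, the term names, and the functions `ancestors` and `propagate` are all assumptions made for this example.

```python
# Toy is-a hierarchy mapping each term to its direct parents.
# The term names are placeholders, not real Gene Ontology accessions.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalytic_activity": ["root"],
    "dna_binding": ["binding"],
    "kinase_activity": ["catalytic_activity"],
}


def ancestors(term, parents):
    """Return every strict ancestor of `term` in the is-a graph."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(scores, parents):
    """Raise each ancestor's score to at least that of its descendants."""
    # Terms without a raw prediction start at 0.0.
    consistent = {term: scores.get(term, 0.0) for term in parents}
    for term, score in scores.items():
        for ancestor in ancestors(term, parents):
            consistent[ancestor] = max(consistent[ancestor], score)
    return consistent


# Raw scores that violate the rule: the child "dna_binding" outranks "binding".
raw = {"dna_binding": 0.9, "binding": 0.4, "kinase_activity": 0.2}
fixed = propagate(raw, PARENTS)
```

After propagation, `fixed["binding"]` and `fixed["root"]` are both lifted to the score of `dna_binding`, while unrelated branches keep their own values. Taking the maximum is only one possible choice; a hierarchical classifier can also build the constraint into the network itself.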

Reasoning with ontology axioms

A central question is how to predict functions that are rare or entirely absent from training data. "DeepGOZero: improving protein function prediction from sequence and zero-shot learning based on ontology axioms" addressed this with model-theoretic embeddings of the EL description-logic fragment of GO, allowing the model to predict GO classes by reasoning over their logical definitions rather than memorising labels. The partial-annotation problem inherent to CAFA-style benchmarks was tackled in "Predicting protein functions using positive-unlabeled ranking with ontology-based priors", which replaces the naive closed-world assumption with a positive-unlabeled formulation informed by ontological priors. We have also explored agentic approaches in "LLM Agent Based Protein Function Prediction", which decomposes annotation into tool-calling sub-tasks that combine sequence search, structural lookup, and domain reasoning before synthesising a final GO assignment.
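
To give a feel for what a geometric embedding of ontology axioms can look like, the sketch below represents each class as a ball (a centre and a radius) and reads a subclass axiom as "the smaller ball lies inside the larger one". This is one common way to build geometric models of EL axioms and is offered as an assumption for illustration; the text above does not state that DeepGOZero uses exactly this form. The class names, coordinates, and the functions `subsumption_violation` and `membership_score` are hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def subsumption_violation(sub, sup):
    """How far ball `sub` sticks out of ball `sup`.

    Each ball is a (centre, radius) pair. A value of 0.0 means the
    axiom sub ⊑ sup holds geometrically; a positive value could be
    used as a training penalty.
    """
    (c_sub, r_sub), (c_sup, r_sup) = sub, sup
    return max(0.0, distance(c_sub, c_sup) + r_sub - r_sup)


def membership_score(point, ball):
    """Positive when `point` lies inside `ball`, negative when outside."""
    centre, radius = ball
    return radius - distance(point, centre)


# Hypothetical 2-D embeddings (real models use many more dimensions).
transport = ((0.0, 0.0), 2.0)
ion_transport = ((0.5, 0.0), 1.0)   # placed inside `transport`
misplaced = ((2.0, 0.0), 1.0)       # overlaps the border of `transport`
protein = (0.8, 0.0)                # embedding of one protein

ok = subsumption_violation(ion_transport, transport)
bad = subsumption_violation(misplaced, transport)
inside = membership_score(protein, ion_transport)
```

Because a class is positioned by the axioms that mention it rather than by labelled examples, a class with no training proteins can still receive a ball, and a protein embedding can be scored against it. That is the intuition behind zero-shot prediction from axioms, stated here only at the level of this toy geometry.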

Beyond model organisms, microbial and environmental proteomes pose distinctive challenges: most existing predictors are trained on eukaryotic data, and many environmental proteins have no homologues in reference databases. "DeepGOMeta for functional insights into microbial communities using deep learning-based protein function prediction" retrained the DeepGO architecture on microbial annotations, enabling functional characterisation of metagenomic samples and linking proteins to biogeochemical processes. Earlier foundations for this approach were laid in "Prediction of Metabolic Pathway Involvement in Prokaryotic UniProtKB Data by Association Rule Mining", and the practical use of these tools for genome annotation is described in "Annotating genomes with DeepGO protein function prediction tools".

These methods underpin a range of applied programmes, from coral genomics in the Red Sea to AI-tailored soil microbiomes for desert revegetation, and they feed directly into downstream phenotype prediction systems such as DeepPheno and PhenoGoCon. By combining deep representation learning with the formal semantics of biomedical ontologies, our protein function prediction work provides a route from raw sequence to mechanistically informed, machine-readable functional annotations at proteome scale.
