IBNSINA-QI: Integrating Biomedical Networks and Semantic Information for Neural network Analysis of Quantitative Information

Overview

Biological measurements are inherently high-dimensional and heterogeneous: omics platforms produce thousands to millions of features per individual, and they coexist with qualitative information such as diagnoses, phenotype calls, and prescriptions. Biomedical ontologies and knowledge graphs encode rich qualitative background knowledge, but they are largely disconnected from the quantitative measurements that gave rise to the categorical phenotypes in the first place. Conversely, graph neural networks and other methods that handle quantitative data on graphs do not yet exploit the formal semantics of axiomatised biomedical ontologies. IBNSINA-QI, funded under CRG2020 and running 2021-2023, set out to bridge these worlds: to build a knowledge-based framework in which qualitative and quantitative individual-level data can be analysed jointly, and in which machine learning is constrained by formal background knowledge.

The first technical advance was an OWL knowledge base that integrates phenotype, anatomy, chemical, function, and clinical-measurement ontologies (HPO, the Disease Ontology, CMO, Uberon, ChEBI, GO, PATO, MP, NBO, the Mouse Pathology Ontology). Classes from the Clinical Measurement Ontology (CMO) are logically decomposed into the anatomical entity, chemical, process, and quality being measured, and linked to phenotype ontologies through axiom patterns that are reasoned over with the ELK reasoner. The knowledge base is then enriched with annotations from GOA, UniProt, MGI, GTEx, and the Virtual Metabolic Human, and mapped to clinical terminologies (ICD, SNOMED CT) so that data from UK Biobank can be ingested. The framework for ontology-driven, ABox-aware semantics is presented in "mOWL: Python library for machine learning with biomedical ontologies" (Bioinformatics, 2023) and refined in "Lattice-Preserving ALC Ontology Embeddings" (NeSy 2024) and "Enhancing Geometric Ontology Embeddings for EL++ with Negative Sampling" (NeSy 2024). Cardinality phenotypes, a notoriously awkward case, were handled in "Improving the classification of cardinality phenotypes using collections" (Journal of Biomedical Semantics, 2023).
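
The decomposition pattern described above can be illustrated with a small sketch. It is not the project's code: the class name `MeasurementDecomposition`, the relation labels in `RELATIONS`, and the example labels are all hypothetical, chosen only to show how a measurement class might be split into its components and rendered as a Manchester-syntax-style class expression. The axiom patterns actually used in the knowledge base may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Relation labels are placeholders chosen for this sketch; they are NOT
# claimed to be the relations used in the project's axiom patterns.
RELATIONS = {
    "chemical": "'characteristic of'",
    "anatomical_entity": "'located in'",
    "process": "'occurs in'",
}


@dataclass(frozen=True)
class MeasurementDecomposition:
    """Hypothetical container for the components of one measurement class."""

    measurement: str                         # the measurement class being decomposed
    quality: str                             # the quality being measured
    chemical: Optional[str] = None           # chemical involved, if any
    anatomical_entity: Optional[str] = None  # anatomical location, if any
    process: Optional[str] = None            # biological process, if any

    def to_class_expression(self) -> str:
        """Render an equivalence axiom as a Manchester-syntax-style string."""
        conjuncts = [self.quality]
        for field_name, relation in RELATIONS.items():
            filler = getattr(self, field_name)
            if filler is not None:
                conjuncts.append(f"({relation} some {filler})")
        return f"{self.measurement} EquivalentTo: " + " and ".join(conjuncts)


# Illustrative labels only; not checked against any ontology release.
example = MeasurementDecomposition(
    measurement="'blood glucose level'",
    quality="concentration",
    chemical="glucose",
    anatomical_entity="blood",
)
print(example.to_class_expression())
```

Keeping the components as separate fields means the same record could, in principle, be reused to generate the corresponding phenotype-side axiom, which is the kind of link a reasoner needs in order to relate a measurement to a phenotype class.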

On the machine-learning side, the project produced a series of concrete results that demonstrate the value of combining quantitative omics with formal background knowledge. "Prioritizing genomic variants through neuro-symbolic, knowledge-enhanced learning" (Bioinformatics, 2024) is the headline application: it shows that variant prioritisation for rare disease improves when neural models are constrained by phenotype-gene axioms. "DeepSVP: integration of genotype and phenotype for structural variant prioritization using deep learning" (Bioinformatics, 2022) extends the approach from single-nucleotide variants to structural variants. "Causal relationships between diseases mined from the literature improve the use of polygenic risk scores" (Bioinformatics, 2024) addresses the second motivating use case of the proposal, knowledge-based polygenic risk, by combining literature-derived disease causation with PRS computation. "Contribution of model organism phenotypes to the computational identification of human disease genes" (Disease Models and Mechanisms, 2022) quantifies how much cross-species phenotype data adds to gene discovery. The methods generalise: "Critical assessment of variant prioritization methods for rare disease diagnosis" (Human Genomics, 2024) benchmarks the field, and the more recent "The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients" (Scientific Reports, 2025) shows how the same knowledge-enhanced architecture extends to LLM-based pipelines.
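
For readers unfamiliar with polygenic risk scores (PRS), the standard additive score is a weighted sum of allele dosages, with one effect size per variant. The sketch below computes that score and then, purely as an illustration, adds a down-weighted contribution from the scores of diseases asserted to be causally upstream of the target. The function names, the `upstream_weight` parameter, and the combination rule are assumptions made for this example; they are not taken from the cited paper, whose actual method may be quite different.

```python
from typing import List, Mapping


def polygenic_risk_score(
    dosages: Mapping[str, float],
    effect_sizes: Mapping[str, float],
) -> float:
    """Standard additive PRS: sum of effect size times allele dosage.

    Variants without a genotype are treated as dosage 0, a simplification
    made for this sketch.
    """
    return sum(beta * dosages.get(variant, 0.0) for variant, beta in effect_sizes.items())


def knowledge_adjusted_score(
    target: str,
    dosages: Mapping[str, float],
    effect_sizes_by_disease: Mapping[str, Mapping[str, float]],
    causes: Mapping[str, List[str]],
    upstream_weight: float = 0.5,
) -> float:
    """Hypothetical combination rule: target PRS plus weighted upstream PRS.

    `causes[d]` lists diseases asserted (e.g. from the literature) to cause d.
    """
    score = polygenic_risk_score(dosages, effect_sizes_by_disease[target])
    for upstream in causes.get(target, []):
        score += upstream_weight * polygenic_risk_score(
            dosages, effect_sizes_by_disease[upstream]
        )
    return score


# Toy data, invented for illustration only.
dosages = {"v1": 2.0, "v2": 1.0, "v3": 0.0}
effect_sizes_by_disease = {
    "disease_a": {"v1": 0.5, "v2": 0.25},
    "disease_b": {"v2": 1.0, "v3": 2.0},
}
causes = {"disease_a": ["disease_b"]}

plain = polygenic_risk_score(dosages, effect_sizes_by_disease["disease_a"])
adjusted = knowledge_adjusted_score("disease_a", dosages, effect_sizes_by_disease, causes)
print(plain, adjusted)
```

In this toy setting the causal link from `disease_b` to `disease_a` raises the score for `disease_a` by half of the individual's `disease_b` score; in practice the weight would have to be estimated from data rather than fixed by hand.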

Across 18 publications the project delivered a working framework, a software stack, and concrete biomedical applications in variant prioritisation, structural variant analysis, polygenic risk scoring, and rare disease diagnosis. The methods are particularly relevant for Saudi Arabia, where high consanguinity is associated with elevated rates of congenital disorders and inborn errors of metabolism, and where heavy investment in multi-omics and next-generation sequencing under Vision 2030 creates immediate demand for analytic frameworks that can fuse quantitative data with formalised biomedical knowledge. The outputs feed directly into the follow-on CRG11 project on sound, complete, and explainable machine learning with biomedical ontologies.

Period: 2021–2023

Funding

  • KAUST Competitive Research Grant — Grant ID: URF/1/4355-01-01 (PI) — USD 239,999

Publications acknowledging this project (18)

  • (2025) Lattice-based ALC ontology embeddings with saturation
  • (2024) Predicting protein functions using positive-unlabeled ranking with ontology-based priors
  • (2024) Neuro-symbolic AI in Life Sciences
  • (2023) DeepGOMeta: Functional Insights into Microbial Communities with Deep Learning-Based Protein Function Prediction
  • (2022) Exploring the Use of Ontology Components for Distantly-Supervised Disease and Phenotype Named Entity Recognition
  • (2022) mOWL: revision document
  • (2021) How much do model organism phenotypes contribute to the computational identification of human disease genes?
  • (2020) DeepGOWeb: Fast and accurate protein function prediction on the (Semantic) Web
  • (2025) The application of Large Language Models to the phenotype-based prioritization of causative genes in rare disease patients
  • (2012) Exploring the Use of Ontology Components for Distantly-Supervised Disease and Phenotype Named Entity Recognition
  • (2023) Improving the classification of cardinality phenotypes using collections
  • (2012) Linking common human diseases to their phenotypes; development of a resource for human phenomics
  • (2012) Komenti: A Semantic Text-mining Framework
  • (2012) STARVar: Symptom-based Tool for Automatic Ranking of Variants using evidence from literature and genomes
  • … and 3 more.

Topics: Applied Ontology, Neuro-symbolic AI, Rare disease