The Modelling and Analysis of Biological Network Activity (BioMANTA) Project encompasses the development of novel biological network analysis methods and of infrastructure for querying biological data in a semantically enabled format, and aims to create a semantic interactome model. Research within the BioMANTA project will focus on the computational modelling and analysis of large-scale protein-protein interaction and compound activity networks across a wide variety of species, primarily using Semantic Web technologies and Machine Learning methods. The resulting data model will include a range of information such as kinetic activity, tissue expression, subcellular localisation, and disease state attributes.
This project is a two-year scientific research collaboration between:
- The Computational Sciences Center of Emphasis, Pfizer Global Research and Development, Pfizer Inc., Cambridge, Massachusetts, USA,
- The Institute for Molecular Bioscience (IMB), and
- The School of Information Technology and Electronic Engineering (ITEE), The University of Queensland, Australia
Protein interactions are a fundamental component of biological processes. Many proteins are functional only in multimeric complexes, or require interaction partners to achieve their correct localisation or function. For this reason, the study of protein-protein interaction (PPI) networks has become an area of growing interest in computational biology.
Through the use of Semantic Web technologies such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL), interaction data is modelled to create a knowledge representation in which meaning is vested in the ontology rather than in instances of data. Stochastic and computational intelligence methods are applied to this data to infer high-coverage networks. Semantic inferencing is used to infer previously unknown and meaningful pathways.
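To make the RDF modelling idea concrete, the following minimal sketch represents one interaction as subject-predicate-object triples in plain Python. The `http://example.org/biomanta#` namespace, the class names `Interaction` and `Protein`, and the `hasParticipant` property are hypothetical placeholders chosen for illustration; they are not taken from the BioMANTA ontology.

```python
# Hypothetical namespace and terms, for illustration only.
EX = "http://example.org/biomanta#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(local_name):
    """Build a full IRI inside the illustrative namespace."""
    return EX + local_name


# One binary interaction between two proteins, as a set of triples.
triples = {
    (iri("interaction1"), RDF_TYPE, iri("Interaction")),
    (iri("interaction1"), iri("hasParticipant"), iri("proteinA")),
    (iri("interaction1"), iri("hasParticipant"), iri("proteinB")),
    (iri("proteinA"), RDF_TYPE, iri("Protein")),
    (iri("proteinB"), RDF_TYPE, iri("Protein")),
}


def partners(graph, protein):
    """Return the proteins that share at least one interaction with `protein`."""
    participates = iri("hasParticipant")
    interactions = {s for s, p, o in graph if p == participates and o == protein}
    return {
        o for s, p, o in graph
        if p == participates and s in interactions and o != protein
    }


def to_ntriples(graph):
    """Serialise IRI-only triples in N-Triples syntax, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in sorted(graph))


print(partners(triples, iri("proteinA")))
print(to_ntriples(triples))
```

Because the types and properties are themselves named resources, the meaning of each statement is carried by the vocabulary rather than by the layout of the data.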
Major project components
- The BioMANTA Ontology: An OWL DL ontology incorporating the PSI-MI Ontology, the NCBI Taxonomy, and elements of the BioPAX ontology and the Gene Ontology (describing subcellular localisation). Re-using existing ontologies reduces the overheads associated with knowledge acquisition in the ontology development process, and allows us to integrate existing public data annotated in these formats.
- Data conversion & semantic protein integration: A set of software components that convert protein-protein interaction databases (DIP, MPact, IntAct, etc.) from PSI-MI XML to RDF compliant with the BioMANTA ontology. These components make protein-protein interaction datasets (and, more generally, any PSI-MI XML data) semantically available for querying and inference within BioMANTA.
- An RDF triple store based on RDF molecules and the MapReduce architecture: A proof-of-concept RDF triple store using RDF molecules and the Hadoop scale-out architecture. Regular RDF graphs are deconstructed into RDF "molecules", which are distributed across compute nodes in the MapReduce architecture and subsequently recombined to form equivalent RDF graphs. This approach makes distributed SPARQL querying and reasoning over RDF triple stores possible.
- A quantitative framework to integrate networks extracted from independent data sources (gene expression, subcellular localisation, and ortholog mapping): The model is multi-layer. In the first layer, one decision tree is built on each dataset independently, with tree nodes split using Shannon's entropy (mutual information). The decisions of these independent trees are then integrated using logistic regression, with parameters optimised by maximum likelihood.
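The decomposition of an RDF graph into molecules, described for the triple store component, can be sketched as follows. This is a simplified illustration in plain Python and does not use Hadoop: it assumes that a triple with no blank nodes forms a molecule on its own, and that triples linked through shared blank nodes must stay together. The `ex:` terms, the two-node split, and the function names are hypothetical.

```python
def is_blank(term):
    """Blank nodes are written with the conventional '_:' prefix."""
    return term.startswith("_:")


def decompose(graph):
    """Split a set of triples into molecules.

    Assumption: ground triples are singleton molecules, and triples
    connected through shared blank nodes are grouped into one molecule.
    """
    molecules = [
        frozenset([t]) for t in graph
        if not (is_blank(t[0]) or is_blank(t[2]))
    ]
    pending = {t for t in graph if is_blank(t[0]) or is_blank(t[2])}
    while pending:
        seed = pending.pop()
        molecule = {seed}
        bnodes = {x for x in (seed[0], seed[2]) if is_blank(x)}
        grew = True
        while grew:
            # Pull in every remaining triple that touches a known blank node.
            linked = {t for t in pending if t[0] in bnodes or t[2] in bnodes}
            grew = bool(linked)
            pending -= linked
            molecule |= linked
            bnodes |= {x for t in linked for x in (t[0], t[2]) if is_blank(x)}
        molecules.append(frozenset(molecule))
    return molecules


def merge(molecules):
    """Recombine molecules into a single set of triples."""
    return set().union(*molecules)


graph = {
    ("ex:proteinA", "ex:interactsWith", "ex:proteinB"),
    ("ex:proteinA", "ex:foundIn", "ex:nucleus"),
    ("ex:interaction1", "ex:hasEvidence", "_:b1"),
    ("_:b1", "ex:method", "ex:twoHybrid"),
}

molecules = decompose(graph)
# Stand-in for distribution: deal the molecules out to two "compute nodes".
nodes = [molecules[i::2] for i in range(2)]
rebuilt = merge(m for node in nodes for m in node)
print(rebuilt == graph)
```

Keeping blank-node-linked triples in one molecule means no blank node is split across nodes, so recombining the molecules by set union yields the original graph in this sketch.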
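The two layers of the integration framework can be illustrated with a short sketch: an entropy-based score for a candidate tree split, and a logistic function that combines the outputs of the per-dataset trees. This is a minimal illustration under stated assumptions, not the project's implementation. The function names are hypothetical, and the weights and bias are passed in as fixed values, whereas in the framework described above they would be fitted by maximum likelihood.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total)
        for n in Counter(labels).values()
    )


def information_gain(labels, left, right):
    """Entropy reduction from splitting `labels` into `left` and `right`."""
    total = len(labels)
    weighted = (
        (len(left) / total) * entropy(left)
        + (len(right) / total) * entropy(right)
    )
    return entropy(labels) - weighted


def combine(tree_scores, weights, bias):
    """Logistic combination of per-dataset tree outputs.

    `weights` and `bias` are hypothetical here; they stand in for
    parameters that would be estimated by maximum likelihood.
    """
    z = bias + sum(w * s for w, s in zip(weights, tree_scores))
    return 1.0 / (1.0 + math.exp(-z))


# A split that separates interacting (1) from non-interacting (0) pairs.
labels = [1, 1, 0, 0]
print(information_gain(labels, [1, 1], [0, 0]))

# Votes from three trees (expression, localisation, orthology).
print(combine([1, 0, 1], weights=[1.2, 0.8, 0.5], bias=-1.0))
```

A split with higher information gain separates the classes more cleanly, and the logistic layer turns the individual tree outputs into a single probability that a candidate interaction is real.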