PIPA: A High-Throughput Pipeline for Protein Function Annotation

Abstract

We developed Pipeline for Protein Annotation (PIPA), a genome-wide protein function annotation pipeline that runs in a high performance computing environment. PIPA integrates different tools and employs the Gene Ontology (GO) to provide consistent annotation and resolve prediction conflicts. PIPA has three modules that allow for easy development of specialized databases and integration of various bioinformatics tools. The first module, the pipeline execution module, consists of programs that enable the user access to and control of the pipelinepsilas parallel execution of multiple jobs, each searching a particular database for a chunk of the input data. The execution module wraps the second module, the core pipeline module. The integrated resources, the program for terminology conversion to GO, and the consensus annotation program constitute the main components of the core module. The third module is the preprocessing module. This last module contains the program for customized generation of protein function databases and the GO-mapping generation program, which creates GO mappings for the terminology conversion program. The current implementation of PIPA annotates protein functions by combining the results of an inhouse-developed database for enzyme catalytic function prediction (CatFam) and the results of multiple integrated resources, such as the 11 member databases of InterPro and the Conserved Domains Database, into common GO terms. A Web-page-based graphical user interface is developed based on the User Interface Toolkit. The pipeline is deployed on two LINUX clusters, JVN at the Army Research Laboratory Major Shared Resource Center and JAWS at the Maui High Performance Computing Center. Currently, scientists at the Naval Medical Research Center are using PIPA to predict protein functions for newly sequenced bacterial pathogens and their near-neighbor strains. Validation tests show that, on average, the CatFam database yields predictions of enzyme catalytic functions with accuracy greater than 95%. Test results of the consensus GO annotation show an improvement in performance of up to 8% when compared with annotations in which consensus is not used.