Data Intensive Computing
Analysis of Proteomics from Complex Communities Using High Performance Computing
Challenge:
PNNL's Environmental Biomarkers Initiative (EBI) has been characterizing how complex microbial communities, modeled on those found in the Columbia River, respond to uranium exposure. One component of the analysis uses tandem mass spectrometry (MS)-based proteomics to characterize changes in protein composition in the presence of uranium. The central challenge is that peptide identification from MS spectra relies largely on matching spectra to known peptide sequences. For these communities, however, the species and their sequences are unknown. In fact, it is unclear which known species are present in these samples, and in what proportions.
Therefore, peptide identification methods that search against only a small number of protein sequences will fail to identify the vast majority of spectra (and thus peptides) from complex community samples.
Approach:
Our solution is to use Polygraph, an existing peptide identification application developed at PNNL by William Cannon, to match spectra with peptides from a large set of candidate sequences gathered from publicly available databases. Polygraph runs in parallel on a supercomputer cluster and is capable of handling arbitrarily large protein reference files. The protein reference file that we have assembled for this demonstration contains approximately 500,000 protein sequences and approximately 10 million tryptic peptides.
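To illustrate where the tryptic peptides in a reference file come from, the following is a minimal sketch of an in silico tryptic digest. It is not Polygraph's implementation. It assumes the commonly used trypsin rule (cleave after K or R unless the next residue is P) with no missed cleavages; the function name, the `min_length` parameter, and the example sequence are hypothetical.

```python
def tryptic_peptides(protein, min_length=1):
    """Split a protein sequence into tryptic peptides.

    Assumed rule: cleave after lysine (K) or arginine (R), except when
    the next residue is proline (P). Missed cleavages are not modeled.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    # Keep the C-terminal fragment if the sequence does not end in K or R.
    if start < len(protein):
        peptides.append(protein[start:])
    # Drop peptides shorter than the requested minimum length.
    return [p for p in peptides if len(p) >= min_length]


if __name__ == "__main__":
    # Made-up sequence for illustration only.
    print(tryptic_peptides("MAGKTLLRPEEKSAVR"))
```

Applying a function like this to every sequence in a reference file is what turns a few hundred thousand proteins into millions of candidate peptides, which is why the search benefits from parallel execution.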
Our approach identifies proteins that are present in the protein reference file and also extends the search to evolutionarily related proteins that contain identical peptides. We are implementing this workflow in the MeDICi framework because it can handle very large data files and is flexible enough to be reconfigured for novel analyses.
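The idea that one identified peptide can point to several evolutionarily related proteins can be sketched with a simple peptide-to-protein index. This is an illustrative sketch only, not the MeDICi workflow or Polygraph code; the function names and the protein identifiers and peptide strings in the example are hypothetical.

```python
from collections import defaultdict


def build_peptide_index(protein_peptides):
    """Map each peptide to the set of reference proteins that contain it.

    protein_peptides: dict of protein ID -> iterable of peptide strings.
    """
    index = defaultdict(set)
    for protein_id, peptides in protein_peptides.items():
        for peptide in peptides:
            index[peptide].add(protein_id)
    return index


def candidate_proteins(identified_peptides, index):
    """Return, for every reference protein, the identified peptides it contains."""
    hits = defaultdict(set)
    for peptide in identified_peptides:
        for protein_id in index.get(peptide, ()):
            hits[protein_id].add(peptide)
    return dict(hits)


if __name__ == "__main__":
    # Hypothetical homologs from two species sharing one identical peptide.
    reference = {
        "speciesA_protein1": ["TLLRPEEK", "SAVNDLK"],
        "speciesB_protein1": ["TLLRPEEK", "SGVNDLR"],
    }
    index = build_peptide_index(reference)
    print(candidate_proteins(["TLLRPEEK"], index))
```

A peptide shared by several proteins cannot single out one species on its own, but it still provides evidence for the protein family, which is what allows the search to reach organisms whose exact sequences are not in the reference file.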
Impact:
The workflow we have developed will be used by the EBI to characterize changes in the composition of complex communities in response to environmental perturbation. These changes can be assessed at the levels of functional groups (for example, photosynthetic proteins), species composition, or composition based on higher evolutionary groups (such as classes and phyla).
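As a rough sketch of how such changes might be summarized at different levels, the code below rolls per-protein counts up to a chosen annotation level and compares two samples. It assumes, purely for illustration, that abundance is approximated by the number of spectra matched to each protein and that each protein carries annotations such as `species`, `phylum`, or `function`; none of these names or the example values come from the EBI workflow.

```python
from collections import Counter


def composition(counts, annotation, level):
    """Fraction of matched spectra per group at one annotation level.

    counts: dict of protein ID -> number of matched spectra (assumed proxy
            for abundance).
    annotation: dict of protein ID -> dict of level name -> group name.
    level: annotation key to aggregate on, e.g. "phylum" or "function".
    """
    totals = Counter()
    for protein_id, n in counts.items():
        group = annotation.get(protein_id, {}).get(level, "unassigned")
        totals[group] += n
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {group: n / grand_total for group, n in totals.items()}


def composition_shift(control, treated, annotation, level):
    """Change in each group's fraction between control and treated samples."""
    before = composition(control, annotation, level)
    after = composition(treated, annotation, level)
    groups = set(before) | set(after)
    return {g: after.get(g, 0.0) - before.get(g, 0.0) for g in groups}


if __name__ == "__main__":
    # Hypothetical annotations and counts for illustration only.
    annotation = {
        "p1": {"phylum": "Cyanobacteria", "function": "photosynthesis"},
        "p2": {"phylum": "Proteobacteria", "function": "metal resistance"},
    }
    control = {"p1": 30, "p2": 10}
    treated = {"p1": 10, "p2": 30}
    print(composition_shift(control, treated, annotation, "phylum"))
```

Changing the `level` argument switches the same comparison between functional groups, species, and higher groups without touching the underlying identifications.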
Importantly, the workflow drastically reduces the time-to-solution for biologists and analysts examining these problems and allows them to optimize parameters specifically for their problem. The workflow we have implemented is directly applicable to problems in a number of other scientific areas including human health, climate change and bioremediation. For example, characterization of the human microbiome is the focus of a new National Institutes of Health initiative that will require tools like this workflow to analyze proteomics data from complex microbial communities.
