NEW YORK (GenomeWeb News) – Aiming to establish a framework for the analysis of organelle proteomics data, researchers at the Cambridge Centre for Proteomics have published a study detailing best practices for the field.
The study, published last week in Molecular & Cellular Proteomics, presents recommendations for steps including data processing and visualization, quality control, and protein localization prediction along with examples of their implementation within software packages developed by the CCP group.
Organelle proteomics has traditionally relied on biochemical purification and analysis of select cell compartments; however, in recent years scientists have put forth new techniques that allow for broader study of the subcellular proteome.
These newer methods involve fractionating cell contents by density, allowing for the quantification of proteins in the different density fractions. Proteins can then be localized to specific regions of the cell by comparing their density profiles both to each other and to those of proteins of known localization.
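The profile-comparison idea can be illustrated with a minimal sketch. It is not the CCP group's implementation: the function names, the Euclidean distance measure, and the intensity values below are all assumptions chosen for illustration. Each protein's abundances across the density fractions are scaled to sum to one, and an unlabeled protein is assigned to the organelle whose marker profile lies closest.

```python
from math import sqrt

def normalise(profile):
    """Scale a protein's fraction intensities so they sum to 1."""
    total = sum(profile)
    return [x / total for x in profile]

def distance(a, b):
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_marker(profile, markers):
    """Return the organelle whose marker profile is closest to `profile`."""
    p = normalise(profile)
    return min(markers, key=lambda org: distance(p, normalise(markers[org])))

# Hypothetical intensities across four density fractions (light to dense).
markers = {
    "ER": [8.0, 4.0, 2.0, 1.0],
    "mitochondrion": [1.0, 2.0, 4.0, 8.0],
}
unknown = [14.0, 9.0, 5.0, 2.0]

# The unknown protein peaks in the light fractions, like the ER marker.
assignment = nearest_marker(unknown, markers)
print(assignment)
```

In practice each organelle would be represented by many marker proteins rather than one, which is why the quality of the marker set matters so much to the result.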
Such approaches allow researchers to avoid the difficulties inherent in biochemical purifications of organelles while also taking into account phenomena like proteins that appear in multiple locations throughout the cell, said Kathryn Lilley, director of the CCP and an author on the MCP paper.
However, Lilley noted, these newer techniques produce very rich subcellular proteomic datasets, and, she told ProteoMonitor, "there have been few [informatics tools] out there capable of mining them robustly."
Additionally, the authors noted, the developers of some tools that have been used for this purpose have not provided access to the underlying code, so others cannot repeat the analyses.
Aiming to fill this gap, Lilley and colleagues including Laurent Gatto, director of the CCP's Computational Proteomics Unit and a co-author on the MCP study, have developed several bioinformatic tools useful in the analysis of such data, including the software packages MSnbase – intended for data visualization and processing of quantitative proteomics data – and pRoloc – which enables identification of protein groupings using unsupervised and supervised machine learning.
The MCP paper seeks to provide a framework for the use of these and similar tools for organelle proteomics, Gatto told ProteoMonitor, noting that beyond actually building programs, a key part of software development and implementation is educating users on how and when to apply a package's tools.
Researchers need "to be able to use the software in an informed way, as opposed to just using a piece of software and trusting the output," he said. "Because sometimes software has been designed to do one specific thing, and there is no guarantee that the software will apply to a [given] dataset."
Such concerns are not unique to organelle proteomics. Indeed, in a January interview with ProteoMonitor, Olga Vitek, a researcher at Purdue University and an expert in the statistics of mass spec-based proteomics, noted that proteomics in general suffers from a lack of researcher understanding regarding the informatics tools underpinning their work.
"What I see a lot is people are asking for [integrated] pipelines where you put your samples in and you get p-values and your IDs at the end, and they are hoping that by pushing a button the whole thing will run," Vitek said. "But that is not possible because even one small change in the workflow will require different statistics."
Certain issues are more specific to organelle proteomics, however. One is the challenge of identifying suitable proteins to use as organelle-specific markers in an experiment. Such markers provide anchors of sorts to which proteins of unknown localization can be linked.
Searching for suitably validated markers, though, can be a "somewhat discouraging experience," Lilley said. "If you go into the databases and look at the ontologies … you type in your favorite protein, and someone has seen it in the cytoplasm, someone has seen it in the [endoplasmic reticulum], someone has seen it in the Golgi."
"It's fair to say that proteins are highly dynamic and in different situations may be in different places," she added. "But also the methods people have used to work out where a protein is have led to false discoveries, which means that some of the data populating the databases is just wrong. And unless you are very sure about your marker set, you can end up analyzing your data in a way where what comes out the other end is going to be severely skewed."
As part of the pRoloc software, the CCP team has included curated marker sets for Arabidopsis thaliana, Drosophila melanogaster, Saccharomyces cerevisiae, Gallus gallus, mouse, and human.
The pRoloc program, Gatto said, allows researchers to use such marker data not only to link proteins of unknown localization to organelles represented by a marker protein, but also to identify new clusters and organelles for which no marker data was available – an important feature, he noted, given the lack of good markers for many structures.
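One simple way to picture this combination of marker-based assignment and discovery is sketched below. This is not pRoloc's algorithm; the function `classify_or_flag`, the distance cutoff `max_dist`, and the profiles are hypothetical. A protein is assigned to a marker's organelle only if its profile is close enough to that marker, and everything else is set aside as a candidate for a cluster that no marker represents.

```python
from math import sqrt

def distance(a, b):
    """Euclidean distance between two equal-length profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_or_flag(profiles, markers, max_dist):
    """Assign each protein to its nearest marker organelle, or flag it
    as unassigned when no marker lies within `max_dist`."""
    assigned, unassigned = {}, []
    for name, profile in profiles.items():
        organelle = min(markers, key=lambda org: distance(profile, markers[org]))
        if distance(profile, markers[organelle]) <= max_dist:
            assigned[name] = organelle
        else:
            unassigned.append(name)  # candidate for a new, unlabeled cluster
    return assigned, unassigned

# Hypothetical normalised profiles across three density fractions.
markers = {"ER": [0.6, 0.3, 0.1], "Golgi": [0.1, 0.3, 0.6]}
profiles = {
    "protA": [0.55, 0.35, 0.10],
    "protB": [0.15, 0.25, 0.60],
    "protC": [0.10, 0.80, 0.10],  # peaks where no marker does
}

assigned, unassigned = classify_or_flag(profiles, markers, max_dist=0.2)
print(assigned)
print(unassigned)
```

Proteins left unassigned could then be clustered among themselves to see whether they form a coherent group, which is the kind of structure that might correspond to an organelle lacking markers.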
Gatto noted that using their tools to reanalyze a Drosophila melanogaster dataset from 2009, the CCP researchers were able to identify seven organelles not identified in the original analysis, as well as distinguish between the ER and Golgi, which the original analysis combined.
Lilley noted that while the density-based fractionation methods have made inroads into organelle proteomics research, these analyses have, by and large, been restricted to targeted portions of the cell due to limitations in mass spec multiplexing as well as accuracy and sensitivity.
She offered as an example her group's work on plant secretory pathways, in which they began by removing the nucleus, which doesn't contain any of the pathway components of interest.
More recently, though, Lilley said, technology improvements have allowed researchers to move towards mapping the full complement of a cell's subcellular components. Among the key advances, she said, has been the increased multiplexing capability of isobaric tagging reagents, which enables quantitative analysis of a larger number of subcellular fractions. For instance, proteomics firm Proteome Sciences last year introduced 8-plex and 10-plex versions of its TMT isobaric tagging reagents, and has announced plans this year to launch 20-plex and 30-plex reagents.
The increasing accuracy of mass spec instruments is also key, Lilley said. "You want to be able to recapitulate very accurately the distribution of your proteins through your separation process, and if your measurements have technical inaccuracy then that is going to impact the data you get out the other end."
She added that she and her colleagues are in the process of publishing a large-scale study that will examine the organelle proteome across the full breadth of the cell.