NEW YORK (GenomeWeb) – Progress has been made over the past few years on improving the statistical design of proteomics experiments, but the area remains a sticking point for the field, according to several leading researchers.
Issues like small sample cohorts and the use of inappropriate statistical analyses call into question the robustness of some proteomic studies, they said, which can lead to overconfidence in reported results and poor reproducibility down the road.
Olga Vitek, an associate professor at Northeastern University and expert in the statistics of clinical proteomics, suggested that at the root of the problem is a failure by proteomics researchers to thoroughly consider the design of their experiments from a statistical perspective.
"The mistake that people make is, they don't really think through this [part], not only in terms of how many replicates they need, but also in terms of how they translate their experimental objective into a statistical or analytical objective," she said. "And because they miss that initial step, they often miss the next step," the number of replicates required. For instance, she said, an experiment aiming to do unsupervised discovery of similarities across replicates is different from one looking to compare average conditions across replicates, which is also different from an experiment looking for predictive biomarkers.
"Different goals require different [statistical methods], and different methods require different samples sizes," Vitek said.
She added that while many proteomics studies use appropriate statistical designs, poorly designed studies are not uncommon.
"It is something that is still underappreciated," she said. "For example, I still see papers where people do [principal component analysis] and they say that they found biomarkers, which is ridiculous. Or they [use a] t-test and they have proteins with small p-values and they claim that they found biomarkers, and that is completely wrong."
And without an understanding of the statistical tools best suited for a given experiment, it's impossible to determine what size of a cohort will be necessary to sufficiently power a study, Vitek said.
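To make that concrete, consider a rough power calculation of the kind Vitek describes. The sketch below is purely illustrative and built on assumed placeholder numbers (a fairly large effect size, conventional 80 percent power, and a crude Bonferroni-style adjustment for testing roughly 1,000 proteins); it is not drawn from any study discussed here.

```python
# A minimal power-analysis sketch with assumed inputs, not a recommendation.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.8,     # assumed, fairly large effect (Cohen's d)
    power=0.8,           # conventional 80% power
    alpha=0.05 / 1000,   # crude Bonferroni-style adjustment for ~1,000 proteins
)
print(f"Samples needed per group: {n_per_group:.0f}")  # well over 70 with these inputs
```

Even with a generously large assumed effect, the adjusted significance threshold pushes the required cohort far beyond the handful of samples per group that many discovery studies analyze.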
Often, she noted, researchers will try to correct for this lack of planning on the back end.
"People will say, 'OK, let me just collect some data and then I will find a person who will analyze it for me," she said. "But by that time, it is way too late."
In terms of biomarker discovery work, in particular, Vitek said she believes that "the majority of studies out there are not sufficiently powered."
This problem stems from the basic fact that while mass spec-based proteomics experiments are able to measure many thousands of proteins, they remain relatively low throughput, with many studies looking at on the order of dozens, rather than hundreds or thousands, of samples. An additional challenge, particularly in clinical research, is putting together large cohorts of what can be rare or difficult-to-obtain samples.
"The more things you end up measuring, the more things will end up changing just by random chance, and so you need to set a higher and higher threshold for the measure of significance for any one peptide, and many people don't think about that," said Michael MacCoss, associate professor at the University of Washington.
Additionally, MacCoss said, the use of binary case-control type designs to identify proteins involved in a particular perturbation also presents certain challenges.
"When you start to do the math you see that … you often will need to do many [different] perturbations to show whether the change you are seeing is specific to the perturbation you are interested in," he said. "This idea of just looking at kind of binary conditions, I'm not saying you can't get useful data out of it, but you often end up with more questions than you started with."
"This is something that genomics has known for a while," he added. "But we are kind of relearning it in proteomics."
"It is certainly true in proteomics that many, perhaps most, studies have been underpowered," said Ruedi Aebersold, a professor at the Swiss Federal Institute of Technology (ETH) Zurich.
This "is one of the difficult traps" of proteomics research, he said. "You are likely to overfit [in underpowered experiments] based on some confounding factor or random fluctuation."
He noted that he and his colleagues are currently looking at the impact of batch effects on proteomics experiments and are finding that some underpowered experiments are basically reflecting differences in who prepared a sample and how or from which facility it came.
"If you don't have enough numbers and you don't factor in the possibility of batch effects and other confounding factors in your study, you might perfectly separate two groups, like a case and control, but it simply is a reflection of the confounding factors," Aebersold said.
Given the difficulty of proteomics research and science more generally, Aebersold said he was hesitant to say that researchers running underpowered experiments were "wasting their time," but "if you have a cohort of 10 cases and 10 controls and you measure thousands of proteins and you want to find a biomarker, that basically is not going to work."
The community is increasingly aware of this fact, Aebersold said, but research practices can take time to change, even after the ineffectiveness of a method has been established.
He cited the example of 2D gel workflows commonly used in the early days of proteomics. Researchers would run samples on gels and then cut out spots of interest for analysis by mass spec.
A study led by Harvard University researcher Steven Gygi when he was a postdoc in Aebersold's lab demonstrated that this approach was inherently limited to around two orders of magnitude of dynamic range, meaning, Aebersold said, that "you were really only skimming the top level [of the proteome]."
"And this was very revealing, because it meant that if you want to look for something more than two orders of magnitude down, you would never find it," he said.
Nonetheless, researchers continued to use this approach for years after the publication of this study, Aebersold said. "It was amazing to see for how many years, if you went to HUPO or ASMS or any proteomic conference, people were still doing 2D gels with plasma to find biomarkers for disease. This simply wasn't going to work because everyone always found the same 50 to 80 proteins, and you knew that in these 50 to 80 proteins, it was extremely unlikely there was a marker for cancer. But enormous numbers of these studies were done, poster after poster."
"It may be generally true that sometimes one does stuff that simply, in hindsight, doesn't work, but if you already know that it is not going to work, then one shouldn't do it," he said.
One problem is that statistics make up a limited part of the typical training of a proteomics researcher, and there is still not enough communication and collaboration in the field with statistics experts, Vitek said.
"People who do machine learning or statistics, they wouldn't know much about medicinal chemistry," she said. "And people who know about medicinal chemistry, they never really spend much time thinking about [machine learning and statistics]. So, what is needed is either training in these areas, or, to create interdisciplinary teams who work together designing the experiment."
"I think that in pretty much all situations, it boils down to applying the state-of-the-art for the task at hand," said Laurent Gatto, head of the Computational Proteomics Unit at the Cambridge Systems Biology Centre at the University of Cambridge. "It the scientist cannot identify what that method is and why it is, then they have a big problem. Coming up with a way to analyze data without the appropriate background is doomed to fail."
He added that biostatisticians and computational scientists working in proteomics have developed a variety of quality methods and software packages for proteomics research, "and that's what the vast majority of [researchers] should use."
Appropriate use of these developed methods and software programs still requires a certain level of expertise on the part of the research group, though. Gatto cited the example of the Association of Biomolecular Resource Facilities' Proteome Informatics Research Group 2015 Study that found that a group's performance was not dependent on what method or software they used but rather on "their understanding and application of sound statistical data analysis."
Gatto said he believed the field's use of statistics "is improving, albeit slowly." He suggested that one way to drive further improvement would be continued emphasis on open data sharing.
"People tend to be much more careful with what they do and how they do it when they have to be open about it," he said.
Funding agencies and journals are other pressure points that could be leveraged to encourage better statistical design of experiments. To an extent, these institutions are already playing that role, Aebersold suggested.
"If you go to [the National Institutes of Health] or the EU and propose a biomarker project with 10 controls and 10 disease samples, you will never get that funded," he said. "You will not get it published in a really high-end journal, either."
Al Burlingame, professor at the University of California, San Francisco and editor-in-chief of the journal Molecular & Cellular Proteomics, noted that while MCP does not publish many of the clinical biomarker-type experiments where sample cohort size is a major issue, the journal does have requirements around the number of technical and biological replicates in an experiment.
He said that one area where the appropriate use of statistics stands out as a concern in his experience is the growing number of non-proteomics researchers now using proteomic tools.
"We have a growing number of submissions to the journal that are from labs that are maybe more biological and who don't have the track record of dealing with these issues that the experts do," he said. "So those are more problematic usually, and many times when we carry out the reviews, people have to go back and do more experiments or do more replicates or comply with the statistical issues that they might be missing in their first attempt."
That said, Burlingame noted he has been surprised at the quality of work coming out of some of these labs that are new to proteomics. "The new folks are in a couple of categories: One, those who pay really serious attention to what they are doing, and then the others who try to get a fast publication out, and, of course, we reject those."
Many journals, though, lack staff statisticians who can make sure experiments are up to standard, Vitek said. And while these journals will typically reach out to statisticians as part of the review process, "at the moment, there are not enough experts who can comment on the [experimental] designs," she said. "So it is still happening sometimes that poor-quality manuscripts get through the reviews just for lack of experts who can do the review."
The situation is somewhat better in industries like pharma, Vitek said, where there is strong incentive not to waste money on research that is unlikely to be reproducible.
"But in academic institutions, it's somewhat different," she said. "People say, 'Well, I don't have the samples, I don't have the money, everything is expensive. I understand that you want me to do replicates, but I have my special situation.'"
Vitek acknowledged that researchers operating under financial constraints or looking at rare biological samples may not be able to analyze as many replicates as they would like.
"I understand that, right," she said. "And that doesn't mean that you should stop and not do anything, but it's about managing the expectations of what you can reasonably obtain from this data. It's still okay to do three replicates, and do some, you know, pilot study that generates the hypothesis. But be aware that this is a hypothesis, right? It becomes about managing expectations."