By Tony Fong
Name: Steven Skates
Position: Associate professor of medicine (biostatistics), Harvard Medical School
Background: PhD, statistics, University of Chicago
Last week the Association for Mass Spectrometry: Applications to the Clinical Laboratory held its first conference in San Diego. There, Steven Skates, associate professor of medicine at Harvard Medical School, presided over a session on the computational, statistical, and epidemiological considerations that proteomics researchers should incorporate into their study designs in order to get the best data possible [See PM 02/12/10].
Skates focused his talk on the development of mass spec-based assays for use in protein biomarker verification. In particular, he focused on using selective-reaction monitoring to develop assays that can be used in longitudinal biomarker studies, and the importance of achieving single-digit coefficients of variation in developing mass spec-based assays.
Skates is involved in a number of longitudinal studies homing in on early detection of ovarian cancer. In proteomics, such studies are still uncommon. The three best-known studies that have incorporated proteomics are the Framingham Heart Study, the Busselton Health Study, and the UK Collaborative Trial of Ovarian Cancer Screening, of which Skates is a member.
ProteoMonitor spoke with Skates during the conference about his work. Below is an edited version of the conversation.
Your talk focused on clinical assay precision for longitudinal studies. How is that different from studies that aren't longitudinal?
It's less clear what the assay precision needs to be in those situations. For example, the typical use of CA125 is to differentiate people [with CA125 serum levels] above 35 [units per milliliter] from people [with levels] below 35 [units per milliliter]. So what's the problem of having a 25 percent assay [coefficient of variation] down around 10 [units per milliliter] or up around 100 [units per milliliter]? There's really no clinically relevant problem there.
Of course, everyone wants a measurement process to be as precise as possible, just so that they're clear that number will be in a certain small bandwidth and you don't have to, hopefully, worry about that bandwidth. What you need to worry about is the complexity of the clinical situation, and using that precise number to understand it more, to provide guidance in some sort of clinical decision-making.
Conceivably, except around the clinical decision point of 35, you could have a fairly big assay variation without having much impact on the clinical decision. There aren't many assays where you're going to have a situation where you've got a broad, big variation at the high end, a big variation at the low end, and [precision] in the middle where the clinical decision point is going to be.
But the value of the biological variation that you see longitudinally is that it provides a very clear precision goal for each target assay. You want to get that assay precision down near the lowest level of the biological variation that you see in your population.
If your biological variation ranges from 50 to 100 percent in CV, there's no point in worrying about reducing the assay precision from 20 percent down to 5 percent … because the biological variation simply overwhelms it. If there's going to be a biomarker there, there's going to be a huge signal, and assay precision of 20 percent or 5 percent doesn't make any difference in that situation.
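This point can be made concrete with a quick calculation. The sketch below is not from the interview; it assumes the standard rule that independent sources of variation combine roughly in quadrature, so the total CV is approximately sqrt(CV_bio² + CV_assay²):

```python
import math

def total_cv(cv_bio, cv_assay):
    """Combine independent biological and assay CVs in quadrature."""
    return math.sqrt(cv_bio**2 + cv_assay**2)

# High biological variation (50%): assay CV of 20% vs. 5% barely matters.
print(round(total_cv(0.50, 0.20), 3))  # 0.539
print(round(total_cv(0.50, 0.05), 3))  # 0.502

# Low biological variation (10%): assay CV now dominates the total.
print(round(total_cv(0.10, 0.20), 3))  # 0.224
print(round(total_cv(0.10, 0.05), 3))  # 0.112
```

With 50 percent biological variation, cutting assay CV from 20 to 5 percent changes the total by less than four points; with 10 percent biological variation, the same improvement halves it.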
But take the example I gave yesterday, which is CA125, [where the] 95 percent population range is around 7 percent to 40 percent. So there's a significant fraction that's under 10 to 15 percent biological variation.
And so if you're looking for biomarkers and the biological variation is of that order of magnitude or is down around that low end, you don't want your assay variation to contribute significantly to masking any signal that's there. You want your assay variation to really get under the radar of the biological variation, and I think that's useful for saying, 'OK for this assay, do I need to improve it any more than what I've got?'
Often … you see MRM variations around 20 and 25 percent. Should we work on those variations and get that down around 10 or 5 percent? Well, you should if the biological variation for that particular molecule or that particular protein is low, down around the 10 or 15 percent mark, or there's at least a significant fraction that's down at that level.
Therefore, I think the biological variation of these proteins, plasma proteins, or whatever background matrix you're measuring in, is a useful quantity with which to say, 'OK, how accurate do I need to be, and for which candidate markers do I need to work on improving the accuracy and come up with ways to optimize that particular target assay, whether it's mass spec or immunoassay?'
For regulatory purposes, is it important to have these assays be as precise as possible for non-longitudinal studies, or is it not clear what the FDA would want at this point?
I'm not an expert on regulatory issues, so this would be only speculation, but I'm guessing the FDA wants clinically approved assays to be under a certain level, or … I would imagine they would want you to explain why it's still justified if it's over a certain level. And the level I've heard is around a CV of 15 percent for an assay to be used reliably in the clinic.
That's what is accepted. I don't know how much of that is based on detailed, quantitative reasoning, or if that has just arisen out of clinicians' experience with assays above that range and below that range and what they felt is required in the range of clinical settings to provide reliable information on which to base clinical decisions.
That 15 percent is for protein-based assays or for any assay?
It really comes from my discussions with clinical biochemists and I think it's certainly for protein assays. Whether it applies to other assays, I'm not sure.
Why did you choose to focus your talk on longitudinal studies?
Mainly because of my interest in early detection, and I think that for early detection, a longitudinal approach for each person, essentially making the screening process personalized to each individual undergoing that screening [is more effective].
So if we take CA125 as an example, instead of having one reference level — 35, and if you're above that, it's a positive test, and below that, it's a negative test — the idea for the early detection application, at least one that's being pursued in clinical trials at the moment, is to establish each woman's baseline CA125 and the fluctuation about that baseline, and then to treat significant fluctuations above that baseline as a much more sensitive and much more specific approach to inferring whether or not she has ovarian cancer.
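A toy illustration of the personal-baseline idea (invented numbers and a deliberately simple threshold; the actual trial uses a full risk model, not a standard-deviation rule): flag a woman when a new CA125 result rises well above her own history, even while it remains below the global 35 U/mL cut-point.

```python
import statistics

def flag_rise(history, new_value, n_sds=3.0):
    """Flag a significant rise above a woman's own CA125 baseline.

    history: her previous CA125 values (U/mL); new_value: latest result.
    A value more than n_sds standard deviations above her personal mean
    is flagged, even if it sits below the global 35 U/mL cut-point.
    """
    baseline = statistics.mean(history)
    spread = statistics.stdev(history)
    return new_value > baseline + n_sds * spread

# A woman stable around 8 U/mL: a jump to 20 is flagged despite being < 35.
print(flag_rise([7.5, 8.0, 8.5, 8.0], 20.0))   # True
# The same 20 U/mL in a woman whose baseline hovers near 18 is not flagged.
print(flag_rise([17.0, 18.5, 19.0, 18.0], 20.0))  # False
```

The point of the sketch is only that the same absolute value carries very different information depending on the individual's baseline, which is what makes the longitudinal approach more sensitive and more specific than a fixed cut-point.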
Whether or not that's going to work, we'll know in another five years at the completion of these clinical trials. But in retrospective analyses, it certainly looks much more sensitive and specific to use a personalized longitudinal baseline approach and look for deviations from that baseline than to just have a fixed global reference cut-point.
I haven't come across many longitudinal studies in proteomics. Is this something that's starting to happen?
Yes, it is. The reason there aren't many of them is that they need to be very large and run over many years to prospectively identify enough cases for you to be able to say anything that's meaningful clinically, at least in the sense of [being] statistically sound, and [being able to] differentiate true signal from noise.
For example, in a trial that I'm part of the team conducting, the UK Collaborative Trial of Ovarian Cancer Screening, there are 200,000 post-menopausal women in the trial. They've been randomized, half to control, half to screening. And the endpoint of the trial is the number of deaths from ovarian cancer. What one hopes, if there is a positive result, is that there will be statistically significantly fewer deaths in the screened arm than in the control arm.
Just to clarify the screening: the 100,000 women in the screened arm are actually split into two groups. Half get the personalized CA125 approach, and the other 50,000 get annual ultrasound as the first-line test, so there are two modes, or two screening approaches, being judged.
But for the 50,000 who are undergoing the annual CA125 test, the way that is set up is to do the personalized longitudinal screening approach, and there will be a bank of samples from that … many [from women] who will be followed through up to 10 years of screening.
It's a trial that's been going on since 2001, and it will give us not only a definitive answer as to whether this personalized CA125 approach works, but it will also have built a valuable longitudinal biorepository for other markers of ovarian cancer that can be tested, and markers for other diseases as well.
Any insights that you can share from what you've seen so far from that study?
I can only provide anecdotal cases, but there have been examples where low levels of CA125 rose to intermediate levels, which are below the usual cut-off of 35, but the rise was sufficient to prompt ultrasound, and they found a mass, intervened promptly with surgery, and found early-stage cancer.
Preliminarily, there is evidence that this approach finds cancer at an early stage. Whether that translates to mortality reduction, we certainly don't know at this point. And whether that can be extrapolated to all other cases, or many other cases, is information I don't have at this stage, and only a few individuals on the trial team have access to.
Can you describe your role in that study, and what you're doing?
My main role was developing this longitudinal CA125 algorithm, getting it implemented, and monitoring it to make sure that it's working as we go along. I'm on the trial management committee that meets once a year; we review any issues that come up, try to troubleshoot them, and keep the trial on track.
Can you briefly describe that algorithm? What is it tracking?
It's mainly tracking CA125. You look at the longitudinal pattern in each subject. We had data from some previous studies, before this trial got started, where you had longitudinal CA125 measurements and you had the outcomes as to whether the woman had ovarian cancer or not. … That gives us a pattern of what CA125 looks like over time prior to the detection of ovarian cancer.
And we look at the pattern in women who don't have ovarian cancer and on the whole, they're stable and vary about the baseline that's individual to each woman. And within the screening trial, when we have CA125 levels, what the algorithm does is essentially measure how close they are to a pattern that comes from women who had ovarian cancer in previous screening trials versus the women who didn't have ovarian cancer from previous screening trials.
We can measure that and then triage women based on that measure, instead of just an absolute CA125 measurement. The top 2 percent go on to ultrasound, and the top 10 to 15 percent go on to a more rapid CA125 test: instead of waiting a year, they get another CA125 test in three months. Then the risk calculation, as to whether they look more like the pattern of women who had ovarian cancer in previous studies, gets updated with the new CA125 at three months, and they get re-triaged.
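The triage scheme described here can be sketched as a simple routing rule. This is only an illustration of the percentile thresholds from the interview; the risk score itself (how closely a woman's serial CA125 pattern matches prior cancer cases versus controls) is not modeled, and the 85th-percentile boundary for the repeat-test group is an assumption within the stated 10-to-15-percent range:

```python
def triage(risk_percentile):
    """Route a woman to the next screening step by risk-score percentile.

    Top 2 percent -> ultrasound; next tier (top 10-15 percent, here
    approximated as the 85th percentile and above) -> repeat CA125 in
    three months and re-triage; everyone else -> annual screening.
    """
    if risk_percentile >= 98.0:
        return "ultrasound"
    elif risk_percentile >= 85.0:
        return "repeat CA125 in 3 months"
    else:
        return "annual screening"

print(triage(99.1))  # ultrasound
print(triage(90.0))  # repeat CA125 in 3 months
print(triage(50.0))  # annual screening
```

A woman in the intermediate tier is re-scored when her three-month result arrives, so a steadily rising pattern is funneled toward ultrasound within months rather than waiting for the next annual screen.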
In that way, those who are rising over time … and steadily providing more information that they look more like the women who had ovarian cancer previously get funneled very rapidly into the ultrasound. And once you get to the ultrasound as a follow-up test to first-line CA125, it can then, from our previous studies, essentially cut the numbers down to a tenth of what they were, so 90 percent specificity. Added on top of the 98 percent specificity that you get with your first-line longitudinal CA125 test, [you get] a specificity in excess of 99.6 percent for the overall study.
And that gives you a positive predictive value that is at minimum above one in 10, but it's more like one in five or one in three. So the positive predictive value is looking very good for longitudinal CA125.
In fact, there's a publication that came out on the prevalence screen where we show that the longitudinal CA125 has a positive predictive value over 30 percent.
Can you describe the in-house work that you're doing at Massachusetts General Hospital and how that ties in with the UK longitudinal study you're involved in?
At MGH we did a much smaller trial, a pilot trial of screening high-risk women. Instead of post-menopausal women, in whom most ovarian cancer occurs, we're focusing on those who have a family history of multiple breast or ovarian cancers.
It's a much more specific, higher-risk population, and therefore we're going to find more ovarian cancers in a small population than we would have found had we screened a normal-risk population.
It's a study that's just finishing up. It's a pilot study, so it's only one arm, but it was a multi-center study. It was conducted under the auspices of the Cancer Genetics Network, and instead of screening every year like the normal-risk study is doing, we screened every three months and still applied the longitudinal CA125 algorithm.
This approach is also being used in a number of other studies. There's a high-risk study in the UK, there is another high-risk study through the Gynecologic Oncology Group, and there's a normal risk pilot study that's being conducted out of MD Anderson at about five different centers.
And I'm part of all of those, or the algorithm is being tested out in all of those studies. The main one is the big UK CTOCS.
Do the results from the Mass General study confirm what you've seen in the UK CTOCS?
It also started at about the same time. It's using the findings that we've had previously from other screening trials that were conducted in the 80s and 90s where we went back and used the data from those trials, which were CA125-based at multiple time points to develop this longitudinal approach.
It's really parallel with the UK CTOCS rather than taking findings from the UK CTOCS and then applying them at MGH. And MGH was essentially a coordinating center for a trial of over 20 sites throughout the US through the Cancer Genetics Network.
There have been examples in the study where we've found cancer early and this longitudinal CA125 has given patterns exactly the same as we've found in previous trials. And we've found it using this pattern-matching algorithm.
So, preliminary findings from both studies certainly confirm one another.
Are there plans to incorporate other biomarkers into future studies?
We're certainly looking for markers that complement CA125. If we found one that added significantly and complemented CA125 significantly, we'd like to mount a subsequent trial to the pilot trial.
There are other people who are looking at other markers. Exactly how well they complement CA125 is still up for debate.
We've found some that have some promise, but they haven't proven to have a huge amount of complementarity in our estimation. We know that CA125 is not expressed by at least 20 percent of ovarian cancers. Immunohistochemically, at least 20 percent of cancers don't stain positive for CA125. One would hope to find markers that would add to that 80 percent sensitivity, if one could call it that, for CA125.
It is likely that these markers add only about 5 percent … so then the question becomes 'Should we wait until we've found more markers that add 10, 15, or more percent to the sensitivity before we go ahead with another trial in ovarian cancer screening?'
So my efforts now are focused on finding complementary markers to CA125 that improve above and beyond that 5 percent improvement that we've got at the moment.
So the decision right now is to wait for that extra 10 percent?
That's my decision. It's tentative. There are good reasons for continuing these trials at least on a pilot level: one, to offer patients some hope, because we believe the longitudinal process, at least in high-risk women, is worthwhile and some improvement on current approaches, although we don't have definitive evidence of that yet.
But we feel that we've got retrospective evidence that it's at least reasonable to offer it in a trial where women have informed consent.
For that reason, it's certainly reasonable to think that offering a trial, even if you have only a small amount of additional sensitivity, might still be worthwhile because that would then offer the women this longitudinal screening process and build up a biorepository that could be used to validate any new candidate markers that come along.
We heard from Keith Baggerly and David Ransohoff [at the MSACL conference] about their views on the state of clinical proteomics assays. What's your view of development of these assays?
Both Keith and David pointed out issues in study design that had major flaws that were found only in retrospect that potentially explained why the so-called findings were not repeatable by other people in these studies.
And one of David's points is that it happened in a 2004 publication in The New York Times [about Lance Liotta and Emanuel Petricoin's work in which they claimed to have found ovarian cancer protein biomarkers using the SELDI instrument] and again in 2008 [with the OvaSure test, jointly developed by LabCorp and researchers at Yale].
The question is: 'Have we learned from that?' There is certainly, I think, a growing awareness that just having samples from what you think are cases and controls isn't sufficient scrutiny to then devote a massive amount of time and research effort into finding biomarkers that essentially contrast between the cases and controls.
One needs to understand how those cases and controls were collected, and not only collected, [but] what patients they came from. David's example [from a prostate cancer trial] of having controls the majority of whom are women, [and] cases all of whom are men and who are significantly older, is a clear example of investing a lot in sample analysis before understanding the biases that exist in those samples.
So having people like David and Keith getting that message out to the biomarker research community, I think, [puts us] on track for reducing the frequency with which we get false positive results. We certainly want to eliminate them, but I think those sorts of discussions and sessions go a long way and serve a very useful purpose in making not only the technology robust but also the clinical and experimental study design … making those robust, and therefore having greater assurance that the results that come out of these studies are going to have direct clinical relevance.