
Should Gene Expression-Based Prognostics Require More Patient Training Samples?


Prognostic tests such as Genomic Health's Oncotype Dx and Agendia's MammaPrint should be constructed using many more patient samples if they are to reach expected levels of reproducibility and test-to-test comparability, according to a recent study.

Moreover, one of the study authors said he believes regulatory bodies should develop requirements for the size of patient-sample populations for prognostic tests based on gene expression.

Because the fluctuation of each gene's expression across different samples is much larger than the difference in the expression of any two genes ranked near each other by expression, the relatively small number of samples used to develop these tests raises the question of whether their success rate "is too optimistic or not," Eytan Domany, a professor in the Department of Physics of Complex Systems at the Weizmann Institute of Science in Rehovot, Israel, told Pharmacogenomics Reporter last week. "My feeling is that it may be too optimistic."



In a study appearing in last week's Proceedings of the National Academy of Sciences, Domany and colleagues discussed a statistical measure to gauge the reproducibility of the methods used to assemble these tests. Their research also indicates that prognostic gene-expression tests are not as successful when used in the general population as they are in the set of patient samples originally used in their construction.

This discrepancy stems from the methods used to establish which genes a prognostic test will interrogate. As long as these methods are in use, researchers will need to examine thousands of samples, rather than the 100 to a few hundred typically used today, to identify a set of prognostic genes that is reproducible, meaning that, on average, half of the genes would reappear if the test were trained on a different sample set.

Gene lists and measurement parameters for these predictive tests are usually constructed by analyzing a small number of samples for genes having the largest expression changes in patients with poor prognoses, a process known as training. In the case of the test behind Agendia's MammaPrint, for example, researchers randomly selected 77 patients for training, and narrowed a list of 232 predictive genes down to a 70-gene list, which they tested on 19 remaining patients.

Domany and colleagues examined the Agendia study, which used Rosetta microarrays on a sample of 96 patients, as well as a study by Veridex, which used Affymetrix arrays to identify 76 prognostic genes from a sample of 296 patients. "There were only three genes that appeared on both lists," Domany said. "So it's slightly better than random, but not much better."

In order to test training reproducibility, Domany's team repeated the analysis conducted by different groups of researchers. In the course of analyzing Agendia's patient sample, Domany's team randomly selected a different training set of 77 patient samples from among the same pool of 96. Using identical methods, "even that suffices to generate a list of 70 genes [that] … was very different from the first" because of the individual patients' gene-expression variations, he said.

The fluctuation of each gene's expression between different patients is much larger than the difference in expression between any two genes selected for the correlation between their high expression and poor prognosis, said Domany. That is, just because two genes tend to show similarly high expression in some poor-prognosis samples, there is no guarantee they will behave similarly in another patient. Compared against a large number of other genes, each gene's "rank" changes markedly from one patient sample to the next.
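
This resampling effect can be illustrated with a small simulation. The sketch below is hypothetical, using invented data rather than the study's actual method or expression values: draw two random 77-patient training sets from the same 96-patient pool, rank genes by correlation with outcome, and compare the resulting top-70 lists.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_patients, list_size = 5000, 96, 70

# Synthetic data: outcome labels plus expression values in which 300 genes
# carry the same weak prognostic signal and the rest are pure noise.
outcome = rng.integers(0, 2, n_patients)          # 0 = good, 1 = poor prognosis
expr = rng.normal(size=(n_genes, n_patients))
expr[:300] += 0.5 * outcome                       # many near-equivalent markers

def top_genes(sample_idx):
    """Rank genes by |correlation| with outcome on a training subset."""
    x = expr[:, sample_idx]
    y = outcome[sample_idx]
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    corr = (xc @ yc) / (np.linalg.norm(xc, axis=1) * np.linalg.norm(yc) + 1e-12)
    return set(np.argsort(-np.abs(corr))[:list_size])

# Two different random training sets of 77 patients from the same pool of 96
train_a = rng.choice(n_patients, 77, replace=False)
train_b = rng.choice(n_patients, 77, replace=False)
overlap = len(top_genes(train_a) & top_genes(train_b))
print(f"genes shared by the two 70-gene lists: {overlap} of 70")
```

Because the informative genes here carry interchangeable signal, which 70 make the cut depends largely on which patients happen to land in the training set.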

Asked whether regulatory bodies should develop requirements for the size of patient-sample populations for prognostic tests, Domany said, "It's difficult to answer, but for this particular problem, [which involves gene ranking,] I think the answer is yes."

Some members of the diagnostics community agree. "I don't know how many patients it takes, but definitely our experience has been, 'The bigger your training set, the more reproducible it is over time,'" Rob Seitz, CEO of Applied Genomics, said in an interview with Pharmacogenomics Reporter this week. His company develops antibody-based prognostic tests whose protein targets were originally identified through gene-expression studies.



Although he agrees that it is difficult to make predictive gene lists reproducible and comparable to each other, Seitz said it's not insurmountable. "If I go to one dataset and come up with an interesting gene list — however you want to define that — the issue isn't whether I'd come up with that same gene list on a second cohort," he said. "For our way of thinking, it's simply, 'Does your hypothesis generated from one set hold true on a second set?'"

Asked whether the amount of gene-list overlap is important for clinical use of prognostic tests, Agendia CSO René Bernards said, "I think, to be honest, it's not important." Bernards' research group at the Netherlands Cancer Institute produced one of the two gene-list projects examined in Domany's paper.

When a tumor activates a Ras gene, which often happens in cancer, at least 25 genes can exhibit differential expression, said Bernards. "Whether you then measure the activity of gene 1 or gene 25 is more or less irrelevant, because that gene reports the activity of the Ras pathway in that tumor cell." Any of these genes can report activity in the pathway, he said, even if they are not directly involved in it.

Many prognostic tests are looking at the same pathway through different reporter genes, said Bernards.

Prognostic tests are most useful in identifying cancer patients who may be able to skip expensive and risky chemotherapy, but they are tuned to err on the side of too much chemotherapy, rather than too little. The genes that these tests interrogate are chosen such that the chance of mischaracterizing a patient with a poor outcome is less than 10 percent, said Domany. It is in identifying patients who would have a favorable outcome that all prognostic methods, including expression tests, are more likely to make a mistake.

"I don't know these products exactly, and I don't know exactly what genes [they] are based on, and I'm not sure at what stage of testing they are — I really don't know much about it — for me the whole thing is a scientific problem," said Domany. But based on the current published results and the available technologies, "we cannot say that we have nailed the problem down, that we know which are the predictive genes — I don't think we do," he said.

More Research Details

Domany and colleagues were primarily interested in answering the following question: Using identical methods and different collections of patient samples, how many samples are needed in order to attain 50 percent overlap between two lists? The overlap is, in essence, a measure of the reproducibility of gene ranking.

The group defined the overlap as a random variable and calculated its distribution, "and what we found is that, in order to get a stable list of 70 genes, these groups will need something like 2,400 patients in one case and 3,400 in the other case," Domany said.
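The dependence of list overlap on training-set size can be sketched with a simulation. The setup below is entirely illustrative and not the paper's calculation: genes are given graded, assumed effect sizes, two independent cohorts are drawn at each size, and the overlap of their top-70 lists is compared.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, informative, list_size = 2000, 200, 70

# Illustrative assumption: 200 genes with graded effects (0.8 down to 0.1
# standard deviations in poor-prognosis patients), the rest pure noise.
effects = np.zeros(n_genes)
effects[:informative] = np.linspace(0.8, 0.1, informative)

def top_list(n_patients):
    """Simulate one cohort and return its top-70 gene list by |correlation|."""
    y = rng.integers(0, 2, n_patients)
    x = rng.normal(size=(n_genes, n_patients)) + np.outer(effects, y)
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    corr = (xc @ yc) / (np.linalg.norm(xc, axis=1) * np.linalg.norm(yc) + 1e-12)
    return set(np.argsort(-np.abs(corr))[:list_size])

# Overlap between two independently trained 70-gene lists grows with cohort size
for n in (50, 200, 2000):
    overlap = len(top_list(n) & top_list(n))
    print(f"n={n:4d} patients per cohort: {overlap} of 70 genes shared")
```

With small cohorts, noise genes crowd into the top 70 and the two lists barely agree; only with cohorts in the thousands does the ranking stabilize, consistent with the paper's estimate of 2,400 to 3,400 patients.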

Training samples of around 100 patients "are much, much too small to expect stable gene lists," he said. "If you insist on prioritizing your genes, one by one, ranking them, and picking the top 70 — this way of producing gene lists will be stable if you have several thousand patients to measure your correlations on."

There may be other clinical implications as well. A heterogeneous disease, such as breast cancer, may have more genetic "types" than there are patients in a typical training sample. The risk is that some disease types aren't represented in the sample. "It is heterogeneous enough [that] you really have to use very large samples to have a faithful representation of the disease," said Domany.

The logistics of establishing reproducible gene lists, however, are far from trivial. "You have to freeze these tumors and have five-year clinical follow-up, and you cannot find thousands of such cases — of frozen tumor and clinical five-year follow-up — in a few months," said Domany.

Eventually the field will probably use more biological knowledge, larger patient samples, and different methods to settle on a set of classifying genes, Domany said. "It is not a problem in machine learning, which is what people try to make it — it's a problem in biology," he said.

— Chris Womack ([email protected])
