Projects supported by the US National Institutes of Health will have produced 68,000 total human genomes — around 18,000 of those whole human genomes — through the end of this year, National Human Genome Research Institute estimates indicate. And in his book, The Creative Destruction of Medicine, the Scripps Research Institute's Eric Topol projects that 1 million human genomes will have been sequenced by 2013 and 5 million by 2014.
"There's a lot of inventory out there, and these things are being generated at a fiendish rate," says Daniel MacArthur, a group leader in Massachusetts General Hospital's Analytic and Translational Genetics Unit. "From a capacity perspective ... millions of genomes are not that far off. If you look at the rate that we're scaling, we can certainly achieve that."
The prospect of so many genomes has brought clinical interpretation into focus — and for good reason. Save for regulatory hurdles, it seems to be the single greatest barrier to the broad implementation of genomic medicine.
But there is an important distinction to be made between the interpretation of an apparently healthy person's genome and that of an individual who is already affected by a disease, whether known or unknown.
In an April Science Translational Medicine paper, Johns Hopkins University School of Medicine's Nicholas Roberts and his colleagues reported that personal genome sequences for healthy monozygotic twin pairs are not predictive of significant risk for 24 different diseases in those individuals. The researchers then concluded that whole-genome sequencing was not likely to be clinically useful for that purpose. (See sidebar, story end.)
"The Roberts paper was really about the value of omniscient interpretation of whole-genome sequences in asymptomatic individuals and what were the likely theoretical limits," says Isaac Kohane, chair of the informatics program at Children's Hospital Boston. "That was certainly an important study, and it was important to establish what those limits of knowledge are in asymptomatic populations. But, in fact, the major and most important use cases [for whole-genome sequencing] may be in cases of disease."
Still, targeted clinical interpretations are not cut and dried. "Even in cases of disease, it's not clear that we know now how to look across multiple genes and figure out which are relevant, which are not," Kohane adds.
While substantial progress has been made — in particular, for genetic diseases, including certain cancers — ambiguities have clouded even the most targeted interpretation efforts to date. Technological challenges, meager sample sizes, and a need for increased, fail-safe automation all have hampered researchers' attempts to reliably interpret the clinical significance of genomic variation. But perhaps the greatest problem, experts say, is a lack of community-wide standards for the task.
Genes to genomes
When scientists analyzed James Watson's genome — his was among the first personal genome sequences, completed in 2007 and published in Nature in 2008 — they were surprised to find that he harbored two putative homozygous SNPs matching Human Gene Mutation Database entries that, were they truly homozygous, would have produced severe clinical phenotypes.
But Watson was not sick.
As researchers search more and more genomes, such inconsistencies are increasingly common.
"My take on what has happened is that the people who were doing the interpretation of the raw sequence largely were coming from a SNPs world, where they were thinking about sequence variants that have been observed before, or that have an appreciable frequency, and weren't thinking very much about the single-ton sequence variants," says Sean Tavtigian, associate professor of oncology at the University of Utah.
"There is a qualitative difference between looking at whole-genome sequences and looking at single genes or, even more typically, small numbers of variants that have been previously implicated in a disease," Boston's Kohane adds.
"Previously, because of the cost and time limitations around sequencing and genotyping, we only looked at variants in genes for which we had a clinical indication. Now, since we can essentially see that in the near future we will be able to do a full genome sequence for essentially the same cost as just a focused set-of-variants test, all of the sudden we have to ask ourselves: What is the meaning of variants that fall outside where we would have ordinarily looked for a given disease or, in fact, if there is no disease at all?"
Mass General's MacArthur says it has been difficult to pinpoint causal variants because they are enriched for both sequencing and annotation errors. "In the genome era, we can generate those false positives at an amazing rate, and we need to work hard to filter them back out," he says.
"Clinical geneticists have been working on rare diseases for a long time, and have identified many genes, and are used to working in a world where there is sequence data available only from, say, one gene with a strong biological hypothesis. Suddenly, they're in this world where they have data from patients on all 20,000 genes," MacArthur adds. "There's a fundamental mind-shift there, in shifting from one gene through to every gene. My impression is that the community as a whole hasn't really internalized that shift; people still have a sense in their head that if you see a strongly damaging variant that segregates with the disease, and maybe there's some sort of biological plausibility around it as well, that that's probably the causal variant."
[pagebreak]
Studies have shown that that's not necessarily so. Because of this, "I do worry that in the next year or so we'll see increasing numbers of mutations published that later prove to just be benign polymorphisms," MacArthur adds.
"The meaning of whole-genome -sequence I think is very much front-and-center of where genomics is going to go. What is the true, clinical meaning? What is the interpretation? And, there's really a double-edged sword," Kohane says. On one hand, "if you only focus on the genes that you believe are relevant to the condition you're studying, then you might miss some important findings," he says. Conversely, "if you look at every-thing, the likelihood of a false positive becomes very, very high. Because, if you look at enough things, invariably you will find something abnormal," he adds.
False positives are but one of several challenges facing scientists who are working to analyze genomes in a clinical context.
Technical difficulties
That advances in sequencing technologies are far outstripping researchers' abilities to analyze the data they produce has become a truism of the field. But current sequencing platforms are still far from perfect, making most analyses complicated and nuanced. Among other things, improvements in both read length and quality are needed to enable accurate and reproducible interpretations.
"The most promising thing is the rate at which the cost-per-base-pair of massively parallel sequencing has dropped," Utah's Tavtigian says. Still, the cost of clinical sequencing is not inconsequential. "The $1,000, $2,000, $3,000 whole-genome sequences that you can do right now do not come anywhere close to 99 percent probability to identify a singleton sequence variant, especially a biologically severe singleton sequence variant," he says. "Right now, the real price of just the laboratory sequencing to reach that quality is at least $5,000, if not $10,000."
However, Tavtigian adds, "techniques for multiplexing many samples into a channel for sequencing have come along. They're not perfect yet, but they're going to improve over the next year or so."
Using next-generation sequencing platforms, researchers have uncovered a variety of SNPs, copy-number variants, and small indels. But to MacArthur's mind, current read lengths are not up to par when it comes to clinical-grade sequencing, and they have made supernumerary quality-control measures necessary.
"There's no question that we're already seeing huge improvements. ... And as we add in to that changes in technology — for instance much, much longer sequencing reads, more accurate reads, possibly combining different platforms — I think these sorts of [quality-control] issues will begin to go away over the next couple of years," MacArthur says. "But at this stage, there is still a substantial quality-control component in any sort of interpretation process. We don't have perfect genomes."
In a 2011 Nature Biotechnology paper, Stanford University's Michael Snyder and his colleagues sought to examine the accuracy and completeness of single-nucleotide variant and indel calls from both the Illumina and Complete Genomics platforms by sequencing the genome of one individual using both technologies. Though the researchers found that more than 88 percent of the unique single-nucleotide variants they detected were concordant between the two platforms, only around one-quarter of the indel calls they generated matched up. Overall, the authors reported having found tens of thousands of platform-specific variant calls, around 60 percent of which they later validated by genotyping array.
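The cross-platform comparison the Snyder group performed boils down to intersecting two variant call sets. The sketch below shows the basic bookkeeping under simplifying assumptions; the variants listed are invented, and a real comparison would start from VCF files and also have to reconcile how the two platforms represent variants, especially indels.

```python
# Hypothetical sketch of a cross-platform concordance check: each call is
# keyed by (chromosome, position, ref allele, alt allele). Real analyses
# start from VCFs and must normalize variant representation before comparing.

def concordance(calls_a, calls_b):
    shared = calls_a & calls_b
    union = calls_a | calls_b
    return {
        "concordant": len(shared),
        "a_only": len(calls_a - calls_b),
        "b_only": len(calls_b - calls_a),
        "concordance_rate": len(shared) / len(union) if union else 0.0,
    }

# Invented example calls from two platforms on the same individual
platform_a = {("1", 1014143, "C", "T"), ("2", 2039421, "G", "A"), ("7", 55242465, "GGA", "G")}
platform_b = {("1", 1014143, "C", "T"), ("2", 2039421, "G", "A"), ("9", 135782205, "C", "T")}

print(concordance(platform_a, platform_b))
```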
For clinical sequencing to ever become widespread, "we're going to have to be able to show the same reproducibility and test characteristic modification as we have for, let's say, an LDL cholesterol level," Boston's Kohane says. "And if you measure it in one place, it should not be too different from another place. ... Even before we can get to the clinical meaning of the genomes, we're going to have to get some industry-wide standards around quality of sequencing."
Scripps' Topol adds that when it comes to detecting rare variants, "there still needs to be a big upgrade in accuracy."
Analytical issues
Beyond sequencing, technological advances must also be made on the analysis end. "The next thing, of course, is once you have better accuracy ... being able to do all of the analytical work," Topol says. "We're getting better at the exome, but everything outside of protein-coding elements, there's still a tremendous challenge."
Indeed, that challenge has inspired another — a friendly competition among bioinformaticians working to analyze pediatric genomes in a pedigree study.
With enrollment closed and all sequencing completed, participants in the Children's Hospital Boston-sponsored CLARITY Challenge have rolled up their shirtsleeves and begun to dig into the data — de-identified clinical summaries and exome or whole-genome sequences generated by Complete Genomics and Life Technologies for three children affected by rare diseases of unknown genetic basis, and their parents. According to its organizers, the competition aims to help set standards for genomic analysis and interpretation in a clinical setting, and for returning actionable results to clinicians and patients.
"A bunch of teams have signed up to provide clinical-grade reports that will be checked by a blue-ribbon panel of judges later this year to compare and contrast the different forms of clinical reporting at the genome-wide level," Kohane says. The winning team will be announced this fall and will receive a $25,000 prize, he adds.
While the competition covers all aspects of clinical sequencing — from readout to reporting — it is important to recognize that, more generally, there may not be one right answer and that the challenges are far-reaching, affecting even the most basic aspects of analysis.
[pagebreak]
"There is a lot of algorithm investment still to be made in order to get very good at identifying the very rare or singleton sequence variants from the massively parallel sequencing reads efficiently, accurately, [and with] sensitivity," Utah's Tavtigian says.
Picking up a variant that has been seen before is one thing, but detecting a potentially causal, though as-yet-unclassified variant is a beast of another nature.
"Novel mutations usually need extensive knowledge but also validation. That's one of the challenges," says Zhongming Zhao, associate professor of biomedical informatics at Vanderbilt University. "Validation in terms of a disease study is most challenging right now, because it is very time-consuming, and usually you need to find a good number of samples with similar disease to show this is not by chance."
Search for significance
Much as sequencing a human genome was far more laborious in the early- to mid-2000s than it is now, genome interpretation, too, has become increasingly automated.
Beyond standard quality-control checks, the process of moving from raw data to calling variants is now semiautomatic. "There's essentially no manual intervention required there, apart from running our eyes over [the calls], making sure nothing has gone horribly wrong," says Mass General's MacArthur. "The step that requires manual intervention now is all about taking that list of variants that comes out of that and looking at all the available biological data that exists on the Web, [coming] up with a short-list of genes, and then all of us basically have a look at all sorts of online resources to see if any of them have some kind of intuitive biological profile that fits with the disease we're thinking about."
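As a rough illustration of that step — going from a full variant list to a short-list worth manual review — the sketch below applies the kind of frequency and consequence filters commonly used, then flags hits in phenotype-relevant genes. The thresholds, field names, and gene list are assumptions made up for the example, not MacArthur's pipeline.

```python
# Illustrative filtering pass from a variant list to a short-list of candidates.
# Thresholds, consequence categories, and the candidate-gene set are invented
# for the example.

RARE_FREQ_CUTOFF = 0.001   # assumed population-frequency cutoff for "rare"
DAMAGING = {"stop_gained", "frameshift", "missense", "splice_site"}
CANDIDATE_GENES = {"MYH7", "MYBPC3", "TNNT2"}   # hypothetical phenotype-driven list

def shortlist(variants):
    """Yield rare, protein-altering variants, noting candidate-gene hits."""
    for v in variants:
        if v["pop_freq"] <= RARE_FREQ_CUTOFF and v["consequence"] in DAMAGING:
            yield {**v, "in_candidate_gene": v["gene"] in CANDIDATE_GENES}

variants = [
    {"gene": "MYBPC3", "consequence": "frameshift", "pop_freq": 0.0},
    {"gene": "TTN",    "consequence": "missense",   "pop_freq": 0.004},
    {"gene": "BRCA2",  "consequence": "missense",   "pop_freq": 0.0002},
]

for hit in shortlist(variants):
    print(hit)
```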
Of course, intuitive leads are not foolproof, nor are current mutation databases. (See sidebar, story end.) And so, MacArthur says, "we need to start replacing the sort of intuitive biological approach with a much more data-informed approach."
Developing such an approach hinges in part on having more genomes. "If we get thousands — tens of thousands — of people sequenced with various different phenotypes that have been crisply identified, that's going to be so important because it's the coupling of the processing of the data with having rare variants, structural variants, all the other genomic variations to understand the relationship of whole-genome sequence of any particular phenotype and a sequence variant," Scripps' Topol says.
Vanderbilt's Zhao says that sample size is still an issue. "Right now, the number of samples in each whole-genome sequencing-based publication is still very limited," he says. At the same time, he adds, "when I read peers' grant applications, they are proposing more and more whole-genome sequencing."
When it comes to disease studies, sequencing a whole swath of apparently healthy people is not likely to ever be worthwhile. According to Utah's Tavtigian, "the place where it is cost-effective is when you test cases and then, if something is found in the case, go on and test all of the first-degree relatives of the case — reflex testing for the first-degree relatives," he says. "If there is something that's pathogenic for heart disease or colon cancer or whatever is found in an index case, then there is a roughly 50 percent chance that the first-degree relatives are going to carry the same thing, whereas if you go and apply that same test to someone in the general population, the probability that they carry something of interest is a lot lower."
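Tavtigian's arithmetic is easy to check with a back-of-the-envelope calculation. The sketch below assumes an autosomal dominant variant carried heterozygously by the index case and a made-up allele frequency; it is meant only to show why the yield in first-degree relatives dwarfs that of population screening.

```python
# Back-of-the-envelope illustration of reflex testing in first-degree relatives.
# Assumes an autosomal dominant pathogenic variant; the population allele
# frequency is an invented example value.

p_first_degree = 0.5                 # parent, sibling, or child of a heterozygous carrier
allele_freq = 1e-4                   # assumed allele frequency of the pathogenic variant
p_population = 2 * allele_freq       # approximate carrier probability under Hardy-Weinberg

print(f"first-degree relative carries it: {p_first_degree:.2f}")
print(f"unrelated person carries it:      {p_population:.5f}")
print(f"enrichment from reflex testing:   {p_first_degree / p_population:,.0f}x")
```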
But more genomes, even familial ones, are not the only missing elements. To fill in the functional blanks, researchers require multiple data types.
"We've been pretty much sequence-centric in our thinking for many years now because that was where are the attention [was]," Topol says. "But that leaves the other 'omes out there."
From the transcriptome to the proteome, the metabolome, the microbiome, and beyond — Topol says that because all the 'omes contribute to human health, they all merit review.
"The ability to integrate information about the other 'omics will probably be a critical direction to understand the underpinnings of disease," he says. "I call it the 'panoromic' view — that is really going to become a critical future direction once we can do those other 'omics readily. We're quite a ways off from that right now."
Mass General's MacArthur envisages "rolling in data from protein-protein interaction networks and tissue expression data — pulling all of these together into a model that predicts, given the phenotype, given the systems that appear to be disrupted by this variant, what are the most likely set of genes to be involved," he says. From there, whittling that set down to putative causal variants would be simpler.
"And at the end of that, I think we'll end up with a relatively small number of variants, each of which has a probability score associated with it, along with a whole host of additional information that a clinician can just drill down into in an intuitive way in making a diagnosis in that individual," he adds.
According to MacArthur, "we're already moving in this direction — in five years I think we will have made substantial progress toward that." He adds, "I certainly think within five years we will be diagnosing the majority of severe genetic disease patients; the vast majority of those we'll be able to assign a likely causal variant using this type of approach."
Tavtigian, however, highlights a potential pitfall. While he says that "integration of those [multivariate] data helps a lot with assessing unclassified variants," it is not enough to help clinicians ascertain causality. Functional assays, which can be both inconclusive and costly, will be needed for some unclassified variant hits, particularly those that are thought to be clinically meaningful.
"I don't see how you're going to do a functional assay for less than like $1,000," he says. "That means that unless the cost of the sequencing test also includes a whole bunch of money for assessing the unclassified variants, a sequencing test is going to create more of a mess than it cleans up."
[pagebreak]
Rare, common
Despite the challenges, there have been plenty of clinical sequencing success stories. Already, Scripps' Topol says there have been "two big fronts in 2012: One is the unknown diseases [and] the other one, of course, is cancer." But scientists say whole-genome sequencing might also become clinically useful for asymptomatic individuals in the future.
Down the line, scientists have their sights set on sequencing asymptomatic individuals to predict disease risk. "The long-term goal is to have any person walk off the street, be able to take a look at their genome and, without even looking at them clinically, say: 'This is a person who will almost certainly have phenotype X,'" MacArthur says. "That is a long way away. And, of course, there are many phenotypes that can't be predicted from genetic data alone."
Nearer term, Boston's Kohane imagines that newborns might have their genomes screened for a number of neonatal or pediatric conditions.
Overall, he says, it's tough to say exactly where all of the chips might fall. "It's going to be an interesting few years where the sequencing companies will be aligning themselves with laboratory testing companies and with genome interpretation companies," Kohane says.
Even if clinical sequencing does not show utility for cases other than genetic diseases, it could still become common practice.
"Worldwide, there are certainly millions of people with severe diseases that would benefit from whole--genome sequencing, so the demand is certainly there," MacArthur says. "It's just a question of whether we can develop the infrastructure that is required to turn the research-grade genomes that we're generating at the moment into clinical-grade genomes. Given the demand and the practical benefit of having this information ... I don't think there is any question that we will continue to drive, pretty aggressively, towards large-scale -genome sequencing."
Kohane adds that "although rare diseases are rare, in aggregate they're actually not — 5 percent of the population, or 1 in 20, is beginning to look common."
Despite conflicting reports as to its clinical value, given the rapid declines in cost, Kohane says it's possible that a whole-genome sequence could be less expensive than a CT scan in the next five years. Confident that many of the interpretation issues will be worked out by then, he adds, "this soon-to-be-very-inexpensive test will actually have a lot of clinical value in a variety of situations. I think it will become part of the decision procedure of most doctors."
[Sidebar] 'Predictive Capacity' Challenged
In Science Translational Medicine in April, Johns Hopkins University School of Medicine's Nicholas Roberts and his colleagues showed that personal genome sequences for healthy monozygotic twin pairs are not predictive of significant risk for 24 different diseases in those individuals and concluded that whole-genome sequencing was unlikely to be useful for that purpose.
As the Scripps Research Institute's Eric Topol says, that Roberts and his colleagues examined the predictive capacity of personal genome sequencing "without any genome sequences" was but one flaw of their interpretation.
In a comment appearing in the same journal in May, Topol elaborated on this criticism, and noted that the Roberts et al. study essentially showed nothing new. "We cannot know the predictive capacity of whole-genome sequencing until we have sequenced a large number of individuals with like conditions," Topol wrote.
Elsewhere in the journal, Tel Aviv University's David Golan and Saharon Rosset noted that slightly tweaking the gene-environment parameters of the mathematical model used by Roberts et al. showed that the "predictive capacity of genomes may be higher than their maximal estimates."
Colin Begg and Malcolm Pike from Memorial Sloan-Kettering Cancer Center also commented on the study in Science Translational Medicine, reporting their alternative calculation of the predictive capacity of personal sequencing and their analysis of cancer occurrence in the second breast of breast cancer patients, both of which, they wrote, "offer a more optimistic view of the predictive value of genetic data."
In response to those comments, Bert Vogelstein — who co-authored the Roberts et al. study — and his colleagues wrote in Science Translational Medicine that their "group was the first to show that unbiased genome-wide sequencing could illuminate the basis for a hereditary disease," adding that they are "acutely aware of its immense power to elucidate disease pathogenesis." However, Vogelstein and his colleagues also said that recognizing the potential limitations of personal genome sequencing is important to "minimize false expectations and foster the most fruitful investigations."
[Sidebar] 'The Single Biggest Problem'
That there is currently no comprehensive, accurate, and openly accessible database of human disease-causing mutations "is the single greatest failure of modern human genetics," Massachusetts General Hospital's Daniel MacArthur says.
"We've invested so much effort and so much money in researching these Mendelian diseases, and yet we have never managed as a community to centralize all of those mutations in a single resource that's actually useful," MacArthur says. While he notes that several groups have produced enormously helpful resources and that others are developing more, currently "none covers anywhere close to the whole of the literature with the degree of detail that is required to make an accurate interpretation."
Because of this, he adds, researchers are pouring time and resources into rehashing one another's efforts and chasing down false leads.
"As anyone at the moment who is sequencing genomes can tell you, when you look at a person's genome and you compare it to any of these databases, you find things that just shouldn't be there — homozygous mutations that are predicted to be severe, recessive, disease-causing variants and dominant mutations all over the place, maybe a dozen or more, that they've seen in every genome," MacArthur says. "Those things are clearly not what they claim to be, in the sense that a person isn't sick." Most often, he adds, the researchers who reported that variant as disease-causing were mistaken. Less commonly, the database moderators are at fault.
"The single biggest problem is that the literature contains a lot of noise. There are things that have been reported to be mutations that just aren't. And, of course, a lot of the databases are missing a lot of mutations as well," MacArthur adds. "Until we have a complete database of severe disease mutations that we can trust, genome interpretation will always be far more complicated than it should be."