By Julia Karow
As exome and whole-genome sequencing are finding their way into the clinic for the diagnosis of inherited diseases, doctors need to be aware of pitfalls associated with using the new tools and analyzing the data, according to Rick Dewey, a cardiology fellow at the Stanford Center for Inherited Cardiovascular Disease.
Data management and storage remain major bottlenecks in the use of next-generation sequencing in the clinic, he said, as does the need to confirm suspicious variants by an independent method. In addition, the human reference genome, disease mutation databases, control data to filter variants, and algorithms predicting the effects of novel variants all have their limitations and can lead clinicians to wrong conclusions.
To illustrate his point, Dewey presented a case where whole-genome sequencing of an individual initially pointed to a plausible causative mutation that was even supported by functional experiments in an animal model system. But further analysis ruled out the mutation as the true cause, and revealed problems with both the human reference sequence and commonly used control data.
Dewey, who spoke during an Illumina-sponsored web seminar last week, also stressed the need for more comprehensive, searchable, and publicly available variant databases for Mendelian disorders, and for better genetics training for physicians, including bioinformatic approaches for data analysis.
The case that initially led him and his colleagues down the wrong track was that of a 19-year-old who died suddenly in his sleep. He had no prior medical history or cardiovascular phenotype, and there was no history of sudden death in his family. No alcohol or drugs were involved in his death, and an autopsy showed that his heart and blood vessels appeared normal. Genetic testing for variants in long QT syndrome genes revealed no pathogenic mutations, and none of his first-degree relatives showed signs of cardiomyopathy.
To find a molecular cause for the patient's sudden death, the researchers performed whole-genome sequencing on his DNA, using the Helicos platform, and obtained 2.8 million variants. Assuming that variants associated with sudden death would be rare, they filtered out common variants and annotated the remaining ones. Using functional prediction algorithms, they scored non-synonymous variants and further annotated those predicted to be most damaging by comparing them to a set of about 250 genes that had previously been associated with familial cardiomyopathy or arrhythmias, or that encode proteins important for heart muscle cells.
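The filtering steps described above can be sketched in a few lines of Python. This is only an illustration of the general approach, not the Stanford group's pipeline: the `Variant` fields, the 1 percent frequency cutoff, the 0.8 score threshold, and the gene names are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    gene: str
    population_freq: float  # allele frequency in control data (0.0 if never seen)
    nonsynonymous: bool
    damage_score: float     # hypothetical predictor score: 0 (benign) to 1 (damaging)

def prioritize(variants, candidate_genes, max_freq=0.01, min_score=0.8):
    """Keep rare, predicted-damaging, non-synonymous variants in candidate genes."""
    # Step 1: drop common variants (the cutoff is an assumed value).
    rare = [v for v in variants if v.population_freq <= max_freq]
    # Step 2: keep non-synonymous variants the predictor scores as damaging.
    damaging = [v for v in rare if v.nonsynonymous and v.damage_score >= min_score]
    # Step 3: restrict to the candidate gene list (about 250 genes in the case above).
    hits = [v for v in damaging if v.gene in candidate_genes]
    return sorted(hits, key=lambda v: v.damage_score, reverse=True)

if __name__ == "__main__":
    calls = [
        Variant("GENE_A", 0.0, True, 0.95),
        Variant("GENE_A", 0.20, True, 0.99),   # too common
        Variant("GENE_B", 0.0, True, 0.90),    # not a candidate gene
    ]
    for v in prioritize(calls, {"GENE_A"}):
        print(v)
```

As the case shows, a variant that survives every one of these filters can still be a false lead if the control data or the reference sequence is incomplete.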
The analysis identified two variants in the same gene, which encodes a potassium channel that contributes to action potential repolarization. Several lines of evidence made them plausible: mutations in that gene could lead to an arrhythmic phenotype, the two variants would be consistent with an autosomal recessive mode of inheritance, and both positions where the variants occurred were evolutionarily conserved. In addition, neither variant was present in control samples, including more than 600 samples from the 1000 Genomes Project and 60 samples from Complete Genomics. Further evidence came from the expression of the mutant channels in the Xenopus frog, which resulted in shortening of the QT interval, an arrhythmic phenotype.
But although the researchers thought they had solved the riddle, their analysis turned out to be wrong. Further genotyping showed that the variants were present in four out of six additional healthy control samples, and one of the variants occurred in 27 out of 30 other unaffected controls that were used for a separate study.
It also turned out that both variants fell into a pseudogene region that had only recently been annotated in the human reference genome and was not contained in a previous version of the reference that the researchers had used for their analysis.
The lesson from this study, Dewey said, is that researchers need to be aware of pseudogenic regions in the genome, and of the limitations of control data, which could omit harmless variants or contain samples with Mendelian phenotypes. But it also illustrates that experimental data that connect variants to a molecular function "may not be the ultimate evidence for causality," he said.
There are other challenges that clinicians need to be aware of. One is the sheer amount of data, and the computing time involved in alignment and variant calling, he said. Raw, uncompressed reads take up between 8 and 40 gigabytes for an exome, and between 1 and 4 terabytes for a genome, depending on read depth, so a single genome takes up the entire hard drive of a desktop computer.
BAM files, which he said are the files typically stored after alignment, still take up between 50 and 150 gigabytes for a genome, and even a variant file for one genome requires a gigabyte of storage.
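To get a feel for how these figures add up across a cohort, the arithmetic can be written out as below. The per-sample ranges are the ones Dewey quoted; the dictionary keys, the function name, and the use of decimal units (1 terabyte = 1,000 gigabytes) are assumptions made for this sketch.

```python
# Per-sample storage ranges in gigabytes, as quoted in the talk.
SIZES_GB = {
    "raw_exome": (8, 40),
    "raw_genome": (1000, 4000),   # 1 to 4 terabytes, assuming 1 TB = 1000 GB
    "bam_genome": (50, 150),
    "variants_genome": (1, 1),
}

def cohort_storage_tb(kind, n_samples):
    """Return the (low, high) storage estimate in terabytes for n_samples."""
    low_gb, high_gb = SIZES_GB[kind]
    return (low_gb * n_samples / 1000, high_gb * n_samples / 1000)

if __name__ == "__main__":
    for kind in SIZES_GB:
        low, high = cohort_storage_tb(kind, 100)
        print(f"{kind}: {low} to {high} TB for 100 samples")
```

By this estimate, keeping only the aligned BAM files for 100 genomes would still require between 5 and 15 terabytes.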
In addition, he said, while the VCF file format that has emerged as the dominant format for variants is easy to read for a computer, it is "very difficult" for human interpretation.
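A small example shows why. A VCF data line is a row of tab-separated columns in which the genotype is encoded as indices into the reference and alternate alleles. The sketch below turns one single-sample line into a plain description; the function name and the example coordinates are made up for illustration, and the code ignores most of what a real VCF parser has to handle.

```python
def describe_vcf_line(line):
    """Render one single-sample VCF data line as a short readable string."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # Column 9 names the per-sample keys; column 10 holds the sample's values.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    genotype = sample["GT"].replace("|", "/")   # treat phased and unphased alike
    alleles = [ref] + alt.split(",")            # index 0 is the reference allele
    called = [alleles[int(i)] if i != "." else "?" for i in genotype.split("/")]
    if "?" in called:
        zygosity = "no-call"
    elif len(set(called)) == 1:
        zygosity = "homozygous"
    else:
        zygosity = "heterozygous"
    return f"{chrom}:{pos} {ref}>{alt} genotype {'/'.join(called)} ({zygosity})"

if __name__ == "__main__":
    # Hypothetical record; the position is not a real variant.
    print(describe_vcf_line("7\t150000000\t.\tG\tA\t50\tPASS\t.\tGT:DP\t0/1:30"))
```

For the hypothetical record shown, the raw `0/1` in the genotype column becomes "genotype G/A (heterozygous)".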
Errors made by next-generation sequencing platforms are also "an issue that is of great concern," according to Dewey, in particular with regard to clinical use of the data. While deep sequencing, as well as sequencing families and comparing their genomes, can reduce the number of errors somewhat, "almost all" clinically actionable variants still need to be confirmed by Sanger sequencing, array genotyping, or targeted next-gen sequencing, he said, a "major challenge and bottleneck in the application of the technology to clinical care."
One way to filter out errors computationally is to identify error-rich regions of the genome from analyzing family data, regions that "represent areas difficult to align or properly genotype using next-gen sequencing technologies," Dewey said.
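Dewey did not spell out how such regions are identified. One common way to use family data for this purpose, assumed here purely for illustration, is to look for clusters of genotypes in a parent-child trio that violate Mendelian inheritance. The window size and error threshold below are arbitrary values chosen for the sketch.

```python
from collections import Counter
from itertools import product

def is_mendelian_consistent(child, mother, father):
    """Genotypes are pairs of alleles, e.g. ("A", "G").

    The child must carry one allele from the mother and one from the father.
    """
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible

def error_rich_windows(trio_calls, window=1000, min_errors=2):
    """Return start positions of windows with at least min_errors violations.

    trio_calls maps position -> (child, mother, father) genotypes.
    """
    errors = Counter()
    for pos, (child, mother, father) in trio_calls.items():
        if not is_mendelian_consistent(child, mother, father):
            errors[pos // window] += 1
    return sorted(w * window for w, n in errors.items() if n >= min_errors)

if __name__ == "__main__":
    calls = {
        100: (("A", "A"), ("G", "G"), ("G", "G")),   # violation
        250: (("C", "T"), ("C", "C"), ("C", "C")),   # violation
        5000: (("A", "G"), ("A", "A"), ("G", "G")),  # consistent
    }
    print(error_rich_windows(calls))
```

A single violation could be a genuine de novo mutation, which is why the sketch flags only windows where several cluster together.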
In the future, it might be possible to get away without secondary confirmation in certain areas of the genome where alignment and variant identification can be done with confidence, he said, "but I don't think we are quite there."
Another concern is that the initial variant file obtained from a whole-genome sequencing experiment may be incomplete, because most variant-calling algorithms do not distinguish between a no-call and a call that is homozygous reference.
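The distinction can be made explicit if read depth is tracked alongside the variant calls. The sketch below is a minimal illustration under assumed inputs: a dictionary of called variants, a dictionary of per-position read depth, and a depth threshold of 10 reads, none of which come from the talk.

```python
def classify_site(pos, variant_calls, depth, min_depth=10):
    """Classify a position as a variant, homozygous reference, or no-call.

    variant_calls maps position -> genotype string for called variants.
    depth maps position -> number of reads covering that position.
    min_depth is an assumed threshold for trusting a reference call.
    """
    if pos in variant_calls:
        return variant_calls[pos]
    if depth.get(pos, 0) >= min_depth:
        return "hom-ref"    # enough reads, and all of them match the reference
    return "no-call"        # too little data to say anything about this site

if __name__ == "__main__":
    variants = {1200: "het"}
    coverage = {1200: 35, 1300: 28, 1400: 2}
    for site in (1200, 1300, 1400, 1500):
        print(site, classify_site(site, variants, coverage))
```

In a variant file that lists only non-reference calls, sites 1300, 1400, and 1500 in this example would all look the same, although only the first is known to match the reference.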
Also, the human reference sequence carries the minor allele, relative to the three major HapMap populations, at hundreds of thousands of positions, and a call that is homozygous for the reference allele at those positions is not recorded in the variant file. Those positions include thousands that have been linked to disease in genome-wide association studies, and hundreds of rare variants associated with disease. To remedy this problem, Dewey and his colleagues have built an augmented reference genome, in which they inserted the major allele at every position where the reference carried the minor allele (IS 9/21/2011).
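The idea behind such a reference can be illustrated with a toy sequence. This sketch is not the group's implementation: it handles only single-base substitutions, uses 0-based positions, and takes a single table of allele frequencies, whereas the work described compared the reference against three HapMap populations.

```python
def build_major_allele_reference(ref_seq, allele_freqs):
    """Return (new_sequence, swapped_positions).

    ref_seq is the reference as a string of bases.
    allele_freqs maps a 0-based position -> {allele: population frequency}.
    Wherever the reference base is not the most frequent allele, it is replaced.
    """
    seq = list(ref_seq)
    swapped = []
    for pos, freqs in sorted(allele_freqs.items()):
        major = max(freqs, key=freqs.get)
        if seq[pos] != major:
            seq[pos] = major
            swapped.append(pos)
    return "".join(seq), swapped

if __name__ == "__main__":
    reference = "ACGTAC"
    frequencies = {
        1: {"C": 0.2, "T": 0.8},   # reference carries the minor allele C
        3: {"T": 0.9, "G": 0.1},   # reference already carries the major allele
    }
    print(build_major_allele_reference(reference, frequencies))
```

Against the swapped reference, an individual who carries the minor allele at position 1 shows up in the variant file as a variant call.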
Another "major challenge" is finding a causative variant among hundreds of novel variants that could result in a loss of function of the gene.
Prediction algorithms can help analyze novel variants that have not been associated with disease, but all of these algorithms have shortcomings, he said. The same is true for databases that contain clinical associations of rare variants — such as HGMD and OMIM — which contain many misannotations.
Sequencing families can be particularly helpful in nailing down a causative mutation, since it permits researchers to see if a mutation co-segregates with disease, and helps them weed out sequencing errors. "The situations where [NGS] analysis has been most fruitful have been those in which we have large families … in which the phenotype is relatively clear," Dewey said.
As an example, he cited a recent case of a family with hypertrophic cardiomyopathy, where a 17-gene testing panel had revealed no mutation.
He and his colleagues performed exome sequencing of the family on the Illumina HiSeq platform and annotated rare variants by their co-segregation with the phenotype. They also applied an inheritance algorithm to affected individuals in order to identify genomic regions inherited from a recent common ancestor in whom the mutation arose. They came up with a region on chromosome 19 with six novel mutations in potentially relevant genes, including one strong candidate gene.
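The co-segregation step lends itself to a short sketch. The version below assumes the simplest possible model, a fully penetrant dominant mutation, so that every affected relative carries the variant and no unaffected relative does. The function name, variant labels, and family members are hypothetical, and the inheritance algorithm mentioned above is not reproduced here.

```python
def cosegregating_variants(carriers_by_variant, affected, unaffected):
    """Return variants carried by all affected and no unaffected relatives.

    carriers_by_variant maps a variant label -> set of family members carrying it.
    affected and unaffected are sets of family member identifiers.
    """
    hits = []
    for variant, carriers in carriers_by_variant.items():
        in_all_affected = affected <= carriers
        in_no_unaffected = not (unaffected & carriers)
        if in_all_affected and in_no_unaffected:
            hits.append(variant)
    return sorted(hits)

if __name__ == "__main__":
    carriers = {
        "var1": {"father", "daughter", "son"},
        "var2": {"father", "daughter", "son", "mother"},  # also in an unaffected relative
        "var3": {"father", "son"},                        # missing from an affected relative
    }
    print(cosegregating_variants(
        carriers,
        affected={"father", "daughter", "son"},
        unaffected={"mother"},
    ))
```

The larger the family, the more candidate variants a filter like this removes, which fits Dewey's observation that large families with a clear phenotype have been the most fruitful cases.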
"Whole-genome and exome sequencing certainly is here, and it's starting to make its appearance in clinical care," Dewey said.
However, scaling the infrastructure required to store and analyze the data routinely will be "a major challenge moving forward," particularly with regard to the "fragmented electronic medical record system" that exists in the US today.
Also, collaborations between centers with patients that have similar phenotypes are "going to be extremely important," as is the sharing of clinically associated or putatively pathogenic variants.
Have topics you'd like to see covered in Clinical Sequencing News? Contact the editor at jkarow [at] genomeweb [.] com.