By Julia Karow
Researchers plan to sequence thousands of vertebrate genomes de novo over the next several years for efforts such as the Genome 10K project and human cancer studies, but are current short-read, high-throughput sequencing methods — so successful at resequencing human genomes at low cost — able to generate genome assemblies of sufficient quality?
While some scientists doubt that short-read technologies alone can ever produce assemblies good enough to conduct meaningful analyses, others believe that with improvements in both the sequencing platforms and computational methods, they can.
Researchers at the Broad Institute, for example, recently published a new assembly program, ALLPATHS-LG, which enables them to generate high-quality draft assemblies of mammalian genomes from large amounts of short-read data. And at least two projects are currently assessing algorithms for assembling complex genomes from short reads.
The first vertebrate genomes assembled entirely from short reads, the panda genome and two human genomes, were published about a year ago by researchers at China's BGI, who used the SOAPdenovo algorithm developed there (IS 12/8/2009 and IS 12/15/2009).
In the current issue of Nature Methods, however, researchers led by Evan Eichler at the University of Washington School of Medicine published a withering analysis of the two de novo human genome assemblies, one of a Yoruban HapMap sample and one of a Han Chinese individual, YH. Comparing the two assemblies to the human reference genome, they found that the assemblies are about 16 percent shorter than the reference.
For the YH assembly, they also found that about 420 megabases of common repeat sequences — mostly LINE1 and Alu repeats — are missing. Also, 72 megabases of common duplications are almost entirely absent, and those duplications reported in the YH genome are likely false, they reported. Further, almost 2,400 protein-coding exons are entirely missing, and for 83 genes, all exons are almost completely absent. About 30 percent of genes are fragmented, meaning they are in more than one scaffold. In addition, some of the sequences reported as novel human DNA likely represent contamination from other species.
A Crisis in Comparative Genomics?
"This is a watershed moment in genomics," the authors wrote, cautioning that "without complementary efforts to fully sequence complex genomes, the field of comparative genomics may face a crisis."
They noted that besides the problem of contaminating DNA, segmental duplications and larger common repeats are "the most noticeable casualties of a de novo [next-generation sequencing] assembly." This is due to the short read lengths and insert sizes and the high error rates — compared to capillary sequencing — of platforms like the Illumina technology, with 75- to 100-base reads and libraries with 200 to 500 base pair inserts.
The problem cannot be solved with better assembly methods, they claimed. "We believe that the limitations we present in this work are due to the properties of the data and whole-genome shotgun sequencing approach in general, rather than algorithmic inefficiency."
Rather, they said, "in our opinion, it is critical to develop new hybrid sequencing approaches, such as multiplatform strategies including the third-generation long-read technologies, high-quality finished long-insert clones, and new assembly algorithms that can accommodate these heterogeneous datasets."
High-Quality Draft Assemblies from Short-Read Data
In the meantime, researchers from the Broad Institute published a new de novo assembly algorithm, called ALLPATHS-LG, which enabled them to generate draft assemblies of the human and mouse genomes from Illumina sequence data that "have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome."
"What our article shows is that with Illumina sequence alone (and the right algorithms), one can get pretty close to the 'gold standard' for assembly quality, which was using Sanger-chemistry capillary sequencing (at huge expense)," said David Jaffe, the paper's senior author, in an e-mail message.
For their paper, published in PNAS last week, the Broad researchers used Illumina technology to sequence a human sample from the 1000 Genomes pilot project as well as the mouse reference genome strain. They used four library types, including 150-base fragment libraries, 2.5-kilobase short-jump libraries, 7.5-kilobase long-jump libraries, and 35.3-kilobase fosmid-jump libraries, and read lengths between 26 and 101 bases. They targeted a 100-fold sequence coverage, noting that "despite using higher coverage, the proposed model is dramatically cheaper because the per-base cost of massively parallel sequencing is about 10,000-fold lower than the current cost of capillary sequencing." They then assembled the data using ALLPATHS-LG, an improved version of their previously published ALLPATHS program, and compared the resulting genomes to the human and mouse reference genomes.
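To give a rough sense of the arithmetic behind the coverage and cost figures quoted above, here is a minimal sketch. It is not taken from the paper. The 3-gigabase genome size and the 8-fold capillary coverage are illustrative assumptions, and the function names are hypothetical. Only the 100-fold coverage, the 101-base read length, and the 10,000-fold per-base cost ratio come from the article.

```python
import math


def reads_needed(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Smallest read count with reads * read_length / genome_size >= coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


def relative_cost(short_read_coverage: float,
                  capillary_coverage: float,
                  per_base_cost_ratio: float) -> float:
    """Cost of a short-read project relative to a capillary project on the
    same genome, counting sequenced bases only (no labor, no compute)."""
    return (short_read_coverage / capillary_coverage) / per_base_cost_ratio


# ASSUMPTION: round figure for a human-sized genome, not a value from the article.
GENOME_SIZE_BP = 3_000_000_000
# ASSUMPTION: hypothetical capillary coverage, chosen only for illustration.
CAPILLARY_COVERAGE = 8

if __name__ == "__main__":
    n_reads = reads_needed(GENOME_SIZE_BP, coverage=100, read_length_bp=101)
    ratio = relative_cost(100, CAPILLARY_COVERAGE, per_base_cost_ratio=10_000)
    print(f"reads needed at 100-fold coverage: {n_reads:,}")
    print(f"short-read cost relative to capillary: {ratio:.5f}")
```

Under these assumptions, sequencing more than ten times as many bases still costs a small fraction of the capillary project, which is the point of the quoted remark. The real ratio depends on the capillary coverage actually used and on costs this sketch ignores.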
To compare the effect of the algorithm on the results, they also assembled the same datasets using the SOAPdenovo algorithm, with "extensive input" from the BGI developers.
They found that the two ALLPATHS-LG assemblies covered about 90 percent of the genomes, and about 95 percent to 97 percent of the exons. The base accuracy was at least 99.95 percent, with scaffold lengths of 11.5 megabases for human and 7.2 megabases for mouse. Compared to SOAPdenovo, ALLPATHS-LG assemblies "had much greater long-range connectivity and significantly higher short-range accuracy," though the SOAPdenovo assembly took only three days, instead of three weeks for ALLPATHS-LG.
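The article does not say which statistic the scaffold lengths refer to. Assembly papers commonly summarize contiguity as N50: the length such that scaffolds at least that long contain half of all assembled bases. Assuming that is the statistic meant here, a minimal sketch of the calculation looks like this (the function name and the toy lengths are illustrative, not data from either assembly):

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches half of the
    summed length of all scaffolds.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no scaffold lengths given")


if __name__ == "__main__":
    # Toy scaffold lengths in arbitrary units; not real assembly data.
    toy_scaffolds = [80, 70, 50, 40, 30, 20]
    print(n50(toy_scaffolds))
```

Because N50 rewards a few long scaffolds and says nothing about whether they are joined correctly, it is usually read alongside accuracy and coverage figures such as those reported above.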
ALLPATHS-LG still missed about 60 percent of segmental duplications, the authors noted, but capillary-based assemblies missed about 40 percent of them. "Clearly, additional work is needed to represent these biologically important regions," they wrote.
Overall, they concluded, improvements in the de novo assembly of genomes from next-gen sequencing data will enable scientists to apply the approach to efforts like the Genome 10K project or to the analysis of rearranged human tumor genomes.
According to Steven Salzberg, a professor of computer science at the University of Maryland, the ALLPATHS-LG results for the human and mouse genomes "are quite impressive" and the assemblies "are sufficiently high quality that they should be useful for many research goals (though not all).
"I hope the community gets the message, though, that 100-fold coverage in multiple, high-quality paired-end libraries is absolutely essential to get a decent assembly, and even then it will take an expert team such as David Jaffe's group to get the best results," he told In Sequence in an e-mail message.
According to Jaffe, the Broad Institute is already applying ALLPATHS-LG in various research projects. Over the last few months, they have used it to assemble about a dozen vertebrate genomes, and they are getting ready to assemble "tons of bacterial genomes" with it. The vertebrate assemblies "are all 'pushbutton' assemblies — we did not vary the arguments to ALLPATHS-LG," he said.
Jaffe and his team are also working on improving ALLPATHS-LG further. By comparing the human and mouse assemblies to the reference genomes, they are able to "know where we're making mistakes and then try to understand why," Jaffe explained, adding that they have already been able to "substantially improve accuracy." These improvements — which will also tackle gaps and missing sequence — will be applicable to all genomes, not only the human and mouse genomes, he added.
The Broad researchers are also comparing the vertebrate assemblies to each other and have found that many of the differences, for example in scaffold length, are due to differences in data quality. "We think our assembly algorithm should be analyzing this and providing feedback to the users, such as, 'It looks like something might be wrong with the data from a particular library,'" Jaffe said.
Finally, the scientists plan to make the ALLPATHS-LG algorithm "as easy to use as possible" for other users, and to "package up" the library construction protocols.
Other projects aiming to assemble large genomes de novo are taking note of these and other new developments in assembly algorithms.
The Genome 10K project, for example — which plans to sequence 10,000 vertebrate species by 2015, and 101 in the next two years in collaboration with BGI — is organizing a "Genome Assembly Workshop" in March at which it will discuss the results of an "Assemblathon," an ongoing project organized by the University of California, Santa Cruz, and UC Davis to compare methods for assembling complex genomes from Illumina reads. A similar effort, called the "De Novo Genome Assessment Project," is being organized in Europe. Both projects will start with simulated data and synthetic genomes and plan to proceed to real data in a second phase.
"From this, we hope to determine just how well the latest methods can assemble," said David Haussler, a professor of biomolecular engineering at UCSC and one of the leaders of the Genome 10K project.
Jaffe said he likes the project's "focus on rigorous assessment and comparison of assemblies," although his team has "considerably greater interest" in assembling real data rather than synthetic data.
"Using simulated data is not that informative and likely misleading as to how real data will perform, and using a real data set for which we don’t know the actual answer is a big missed opportunity," said Chad Nusbaum, co-director of the genome sequence and analysis program at the Broad. "The right way to do this is to use a real data set from a genome where the answer is known. There's no other way to validate the answer and without being able to do that, the value is greatly diminished." He added that the Broad would be happy to contribute its human and mouse Illumina datasets to the effort.
The current plan of the Genome 10K project is to sequence the first 101 species at BGI using Illumina sequencing technology alone, and the Assemblathon will help them choose the best assembly program. However, "you should not read this as a sole endorsement of the Illumina technology," Haussler said, adding that he and his colleagues are also interested in complementary technologies, for example Pacific Biosciences' long-read technology, for additional species to be sequenced.
The quality goal of the Genome 10K project is to generate genome assemblies as good as the draft dog genome, according to Haussler, which has "a level of contiguity and completeness that really facilitates fundamental comparative genome analysis." Whether or not assemblies from short reads alone will be able to achieve that goal remains to be seen.