Scientists have developed new approaches to obtain better genome assemblies from the short sequence reads yielded by systems like Illumina’s Genome Analyzer and Applied Biosystems’ SOLiD platform.
At the Biology of Genomes meeting at Cold Spring Harbor Laboratory two weeks ago, several research groups showed how they have improved the quality of short-read assemblies of bacterial, yeast, insect, and plant genomes by using strategies such as gene-boosted assembly, reduced representation libraries, and paired-end sequencing in combination with new short-read assembly software.
Yet these advances were presented amid a continuing debate over whether short-read assemblies provide a cost advantage over alternative approaches, such as 454 sequencing.
One new strategy for assembling genomes from short unpaired reads was presented by Steve Salzberg from the Center for Bioinformatics and Computational Biology at the University of Maryland, whose approach is designed for organisms in which a closely related species has already been sequenced.
The strategy combines comparative assembly, de novo assembly, and so-called “gene-boosted assembly,” which predicts genes that span gaps in the assembly and uses these genes to find unassembled reads that can fill the gaps.
The new approach enabled Salzberg’s team of four scientists to generate, for “a couple of thousand dollars,” a draft bacterial genome that is comparable in quality to the microbial assemblies researchers produced in the late 1990s, he said.
Such a project is currently about four times cheaper than sequencing and assembling a genome from 454 data, which has longer reads, even though the latter approach would require only about half the sequence coverage, Salzberg told In Sequence last week.
The new gene-boosted assembly strategy could be used in projects where a genome from a close relative is available, he said — for example the reference strains that are currently being generated as part of the Human Microbiome Project (see Transcript, in this issue).
However, for sequencing projects where no such related genome is available, “Solexa reads are just too short to produce a good assembly,” he said. “In those cases, length matters — a lot.”
As an alternative in such cases, paired-end reads from Illumina would be “a big help,” he said.
Salzberg and his colleagues applied their new assembly strategy to a novel strain of Pseudomonas aeruginosa, a pathogen with a 6.5-megabase genome with high GC content. Illumina’s sequencing service generated 8.6 million unpaired 33-base-pair reads on a quarter of an Illumina Genome Analyzer run for Salzberg’s team, or about 280 megabases of sequence data, covering the genome more than 40-fold.
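The coverage figures quoted above follow from simple arithmetic: total sequenced bases divided by genome length. A minimal sketch in Python, using only the numbers reported for the P. aeruginosa run (the function name is illustrative and not part of any tool mentioned here):

```python
def fold_coverage(num_reads, read_length, genome_size):
    """Average sequencing depth: total sequenced bases divided by genome length."""
    return num_reads * read_length / genome_size


# Figures reported for the P. aeruginosa run
num_reads = 8_600_000      # unpaired reads
read_length = 33           # bases per read
genome_size = 6_500_000    # 6.5-megabase genome

total_bases = num_reads * read_length
depth = fold_coverage(num_reads, read_length, genome_size)

print(f"{total_bases / 1e6:.1f} Mb of sequence, about {depth:.1f}-fold coverage")
# prints "283.8 Mb of sequence, about 43.7-fold coverage"
```

This is an average; actual depth varies across the genome, which is one reason gaps remain even at 40-fold coverage.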
To start with, the scientists assembled the reads by comparing them to two closely related P. aeruginosa strains, using the AMOScmp comparative-assembly algorithm that is part of their AMOS, or A Modular Open Source, assembly software system. After merging the two resulting assemblies, they obtained an assembly that contained 1,850 contigs up to 232 kilobases in length.
Next, they searched for protein-coding genes that were not completely covered by the contigs, looked for the missing protein-coding sequence of these genes in unassembled reads, and stitched these together using ABBA, a new assembler that is also part of the AMOS system. After adding the new contigs to the comparative assembly, they reduced the number of contigs to 120, the largest being 500 kilobases in length.
To further improve this assembly, they merged it with a de novo assembly of the short-read data that they had independently generated using the Velvet short-read assembly software, which was developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute. The result was an assembly with 76 contigs in one large scaffold, the largest 500 kilobases as before, with no genes contained in the gaps between the contigs. In addition, the researchers retained several hundred unplaced short contigs from the Velvet assembly, covering about 7 percent of the genome.
Based on quality-control tests, the researchers estimated that their assembly is 99.97 percent accurate, allowing them to call polymorphisms in the P. aeruginosa strain.
Right now, Salzberg’s team is applying the gene-boosted assembly strategy to the Drosophila simulans genome. However, that project, which aims to improve that fly’s genome assembly in the gene-containing regions, uses long Sanger reads from the National Center for Biotechnology Information’s Trace Archive.
Allpaths Lead to Broad
Researchers from the Broad Institute have used their own short-read assembler, Allpaths, to generate de novo assemblies of several bacterial genomes and one yeast genome sequenced with paired reads on Illumina’s Genome Analyzer.
According to Chad Nusbaum, co-director of the Broad Institute’s genome-sequencing and -analysis program, the results, compared to previous microbial sequencing projects the institute has conducted using 454’s sequencing platform, show that “the cost and quality of the assemblies of 454 data and Illumina data are similar,” and he expects the cost of data assemblies from both platforms to “decline significantly” over the next few months.
Because new sequencing platforms are in “stiff competition” with each other and are continually improving, it is too early to say which platform will eventually be cheaper or better, he told In Sequence.
Short-read assemblies will be “potentially useful” for large-scale genome sequencing efforts such as the Human Microbiome Project, and Broad researchers are currently modeling “what it would mean to sequence a significant number of genomes this way,” he said.
In his conference presentation, Nusbaum said that to generate the assemblies, the scientists used two different libraries, a 200-base insert library yielding 36-base paired-end reads and a 4-kilobase insert library generating 26-base paired-end reads.
As an example, he showed results for the 4.6-megabase E. coli genome, which was sequenced at 150-fold coverage. Using the Allpaths algorithm, which Broad scientists published earlier this year (see In Sequence 3/18/2008), the researchers were able to generate an assembly with 24 contigs and an N50 of 360 kilobases, meaning that half of the assembled bases lie in contigs of at least that size.
Ninety-nine percent of the genome was covered by the assembly, Nusbaum pointed out, and about half the genome was covered by the eight largest contigs. The results produced fewer than 10 errors per megabase of genome, or a quality value better than Q50.
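For readers unfamiliar with the two metrics, N50 is conventionally the contig length at which contigs of that size or larger hold at least half of the assembled bases, and a Phred-style quality value is Q = -10 × log10(error rate). The sketch below computes both; the contig lengths are a made-up toy set, not the Broad data, and the function names are illustrative.

```python
import math


def n50(contig_lengths):
    """Length L such that contigs of length >= L contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


def phred_quality(errors, bases):
    """Phred-scaled quality: Q = -10 * log10(error rate)."""
    return -10 * math.log10(errors / bases)


# Hypothetical contig lengths, chosen only to illustrate the calculation
toy_contigs = [400, 300, 200, 100]
print("N50:", n50(toy_contigs))

# 10 errors per megabase, the threshold quoted in the talk
print(f"Q{phred_quality(10, 1_000_000):.0f}")
```

Ten errors per megabase is an error rate of one in 100,000, which corresponds to Q50, so fewer errors than that means a quality value above Q50.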
The researchers have also assembled the genomes of the bacteria Staphylococcus aureus and Rhodobacter sphaeroides, and of the fission yeast Schizosaccharomyces pombe, which is 12.5 megabases in length, and obtained assemblies of similar quality. The 18 contigs in the S. aureus assembly have an N50 of 620 kilobases, the 40 R. sphaeroides contigs have an N50 of 170 kilobases, and the S. pombe assembly consists of 86 contigs with an N50 of 300 kilobases.
Next in line is the 43-megabase genome of the mold Neurospora crassa, which the researchers are currently working on, Nusbaum told In Sequence.
Despite these results, short-read sequencing tools such as Illumina’s and ABI’s are generally still limited in their utility for de novo bacterial assemblies where no reference sequence is available, according to George Weinstock, associate director of the genome center at Washington University. Though it is possible to obtain long contigs or scaffolds with short reads, “you have to go to much higher sequence coverage and it loses any cost advantage,” he told In Sequence last week.
However, this might change once the read length from these platforms increases to 50 or even 100 bases, “which is on the radar screen of the companies,” he said. “Then you have long enough reads to do good de novo assemblies, as we used to do with the first 454 instrument,” which produced 100-base reads.
“The development of new software tools now is going to be paid off when the reads are longer and the applications increase,” he added.
Cut Here: Reduced Representation
But short-read assemblies need not stop at bacterial genomes. Elliott Margulies from the National Human Genome Research Institute presented a new approach at the Cold Spring Harbor Lab meeting for assembling mid-sized, and potentially large, genomes from short sequence reads: dividing the genome up into smaller regions, sequencing and assembling these regions individually, and combining the results.
A strategy like this is necessary for assembling larger genomes, such as those from insects or mammals, Margulies explained, because short unpaired reads cannot be mapped uniquely to large portions of these genomes, and because “prohibitively large memory” would be required for the entire assembly on the computational side.
In order to avoid the cost and time of generating clone libraries — the traditional way of dividing up a genome — his lab chose to partition the genome using so-called reduced representation libraries. These libraries are constructed by cutting the genome with a restriction enzyme, running the DNA fragments out on a gel, and isolating pools of fragments of different sizes from the gel. Each of these libraries represents a fraction of the genome up to 20 megabases in size. By using several restriction enzymes that generate overlapping fragments, the scientists can obtain overlapping contigs for the meta-assembly step at the end.
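The idea behind a reduced representation library can be illustrated with an in-silico digest followed by size selection. The following is a toy sketch only, not the NHGRI group's pipeline: it assumes a single strand, a made-up 20-base sequence, and EcoRI's recognition site (GAATTC, cut after the G), and the function names are hypothetical.

```python
def digest(sequence, site, cut_offset):
    """Split a sequence at every occurrence of a recognition site.

    cut_offset is the position within the site where the enzyme cuts.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    # Drop empty fragments (for example, a cut at the very start of the sequence)
    return [f for f in fragments if f]


def size_fraction(fragments, min_len, max_len):
    """Mimic cutting one band out of a gel: keep fragments with min_len <= length < max_len."""
    return [f for f in fragments if min_len <= len(f) < max_len]


# Made-up toy sequence containing two EcoRI sites
genome = "AAGAATTCTTTTGAATTCCC"
fragments = digest(genome, "GAATTC", cut_offset=1)
print([len(f) for f in fragments])       # prints [3, 10, 7]
print(size_fraction(fragments, 5, 10))   # prints ['AATTCCC']
```

Each size fraction contains only a subset of the fragments, so it can be sequenced and assembled on its own. Repeating the digest with a different enzyme shifts the cut positions, which is what yields the overlapping contigs needed for the final meta-assembly.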
Margulies and his colleagues have already applied their strategy to sequence a 120-megabase Drosophila genome, using unpaired reads from their Illumina Genome Analyzer. They assembled reads from each reduced representation library using the Velvet assembly software.
Eighty-eight percent of the fly’s euchromatic genome was represented in the contigs, Margulies reported, but the contigs were small. He said he expects their size will improve with paired-end Illumina reads.
This week, Margulies told In Sequence that the contigs have already grown longer as a result of modifying parameters in the Velvet software.
His team is now working on making the reduced representation libraries more specific for the subset of sequences in the target fraction size and on improving the meta-assembly, and is considering whether to “spike in” longer reads from Sanger or 454. The latter approach could be used, for example, as a way to improve genomes already sequenced at 2X coverage, Margulies told In Sequence.
But he and his colleagues have also used their strategy with data generated from the human genome. According to the abstract, “these studies indicate that future mammalian genome-sequencing efforts can effectively utilize our approach, providing a projected 50-fold cost reduction without significantly compromising data quality or completeness.”
Daniel Zerbino, a researcher at the EBI and one of the developers of Velvet, pointed to another approach to assemble portions of a genome de novo. At this month’s conference, he reported that researchers led by Detlef Weigel at the Max Planck Institute for Developmental Biology in Tübingen, Germany, used Velvet to generate targeted de novo assemblies of genomic regions of several Arabidopsis thaliana strains that differ from the reference genome. These researchers preselected reads surrounding divergent regions of the Arabidopsis genome before assembling them with reads that did not match the reference.
In another attempt to scale up short-read assemblies beyond microbial genomes, Zerbino and his colleagues used Velvet to assemble 548 million paired-end Illumina reads from the 150-megabase human X-chromosome, generated by researchers at the Wellcome Trust Sanger Institute. Due to the large number of repeats in the X-chromosome, the resulting assembly consisted of almost 900,000 very short contigs, but the project served as a proof of principle that Velvet can handle a large amount of data like this, he said.