An international research consortium that plans to start sequencing the Atlantic salmon genome early next year expects to conduct the first phase of the project with Sanger sequencing, after a pilot evaluation of Roche/454 Life Sciences' Genome Sequencer FLX technology last year generated an assembly of insufficient quality.
The researchers used the initial version of the GS FLX technology to sequence and de novo assemble a number of bacterial artificial chromosomes from a region of the genome, and first presented their results at the Plant and Animal Genome conference in San Diego in January. Last month, they published their results in BMC Genomics.
Based on their data assembly, which contained numerous gaps, they recommend that the first phase of the genome project be conducted by Sanger sequencing technology.
Atlantic salmon is an important aquaculture fish whose genome differs substantially from those of the five fish species that have been sequenced so far. And with about 3 billion base pairs, its genome is about the same size as the human genome.
However, it became pseudo-tetraploid following a genome duplication approximately 50 million years ago and harbors a large number of complex and repetitive regions. The genome could serve as a reference genome for other salmonid species, which include other salmon species, trout, and Arctic char.
The consortium pursuing the Atlantic salmon genome consists of researchers and funding agencies from Canada, Chile, and Norway, according to Willie Davidson, one of the organizers and a professor of molecular biology and biochemistry at Simon Fraser University in Burnaby, BC. It is funded with an undisclosed amount from Genome British Columbia, the Norwegian Research Council, a Norwegian industry partner, and the Chilean Research Council.
Over the last few years, researchers have already worked on developing genomic resources for the Atlantic salmon genome under the Genomic Research on Atlantic Salmon Project, called GRASP, and the Consortium for Genomic Research on All Salmonids Program, or cGRASP. These resources include BAC libraries, physical and linkage maps, ESTs, a gene expression microarray, and a genotyping SNP chip.
More than a year ago, the consortium decided to look into alternatives to the established Sanger technology for the salmon sequencing project in order to speed up the project and keep its cost down. Because of its relatively long reads of 250 base pairs, the 454 GS FLX technology seemed the only viable alternative, according to Davidson.
The researchers contacted Roche about testing its technology. The firm was receptive to the idea and offered to sequence for free a megabase of salmon DNA — eight partially overlapping BACs. Davidson said Roche saw the project as a challenge; up until then, the company had only sequenced microbial genomes de novo, and was resequencing a human genome.
However, "The idea that you can resequence a human genome is very different from the de novo assembly of a [complex] genome," Davidson said.
Roche's 454 Life Sciences subsidiary generated the unpaired shotgun sequence data of the eight BACs about a year ago, providing approximately 30-fold total coverage. The researchers used 454's Newbler program to assemble the reads into 803 contigs with an N50 size of 11.5 kilobases. Adding 126 Sanger-generated BAC-end sequences improved the assembly only slightly by bringing the N50 contig size up to 13.5 kilobases, they said.
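Because the comparisons in this story rest on N50 values, a minimal sketch of how that statistic is computed may help. It assumes the common definition: sort the contig (or scaffold) lengths from longest to shortest, and the N50 is the length at which the running total first reaches half of the assembly's total length. The function name and the example lengths are illustrative only; they are not taken from the consortium's data or from the Newbler software.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumes the common definition: the length of the sequence at which
    the running total, taken longest first, reaches half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Made-up contig lengths in kilobases, for illustration only.
example_contigs_kb = [40, 25, 15, 10, 5, 5]
print(n50(example_contigs_kb))
```

A higher N50 means that more of the assembly sits in long pieces, which is why the article reports it alongside the contig and scaffold counts.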
Thus, the contigs were small and their correct order could not be determined from the 454 shotgun sequence data, so the result "was not really very good," according to Davidson.
However, the scientists were able to identify and annotate genes and to predict the order of the contigs by comparing the sequence with other fish genomes.
Roche then offered to generate another 26-fold coverage of the same eight BACs with paired-end reads, a relatively recent capability of the platform at the time. The paired-end sequencing "worked rather nicely," according to Davidson, and allowed the researchers to order the contigs into scaffolds.
Combining the GS FLX shotgun and paired-end data, they obtained 289 large contigs, 106 of which were assembled into three large scaffolds with an N50 scaffold size of 362 kilobases. Including the BAC-end sequences in the assembly resulted in a total of 286 contigs, 175 of which were assembled into four large scaffolds with an N50 and largest scaffold size of 539 kilobases.
"We were able to get good scaffolds, but the problem was, there were huge gaps," Davidson said, "and we felt that for a reference genome, this would not be sufficient."
For comparison, they also sequenced a ninth BAC from the same region by Sanger sequencing, which resulted in 20 contigs with an N50 size of 33 kilobases and two scaffolds with an N50 size of 138 kilobases.
Based on the results of the project, the consortium's scientific committee recommended that the first phase of the genome project, which is scheduled to start in early 2009, should be conducted with Sanger sequencing. This phase is supposed to generate a "good foundation" on the order of 4- to 5-fold coverage of the salmon genome, although this coverage could still change.
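As a rough guide to what fold coverage implies in raw sequence, the arithmetic can be sketched as follows. The genome size of about 3 billion base pairs and the 4- to 5-fold range come from the article; the 800-base-pair Sanger read length is purely an assumption for illustration, and the function names are hypothetical.

```python
import math


def total_bases(genome_size_bp, fold_coverage):
    """Raw bases needed to cover a genome at the given average depth."""
    return genome_size_bp * fold_coverage


def reads_needed(genome_size_bp, fold_coverage, mean_read_length_bp):
    """Approximate read count, ignoring failed reads and uneven coverage."""
    return math.ceil(total_bases(genome_size_bp, fold_coverage) / mean_read_length_bp)


SALMON_GENOME_BP = 3_000_000_000  # "about 3 billion base pairs" per the article
ASSUMED_SANGER_READ_BP = 800      # assumption, not stated in the article

for fold in (4, 5):
    bases = total_bases(SALMON_GENOME_BP, fold)
    reads = reads_needed(SALMON_GENOME_BP, fold, ASSUMED_SANGER_READ_BP)
    print(f"{fold}-fold: {bases:,} bases, about {reads:,} reads")
```

The same formula applies to the pilot: 30-fold coverage of roughly one megabase with reads of about 250 base pairs works out to on the order of 120,000 reads.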
In the meantime, Roche is sequencing another set of approximately 20 salmon BACs, totaling 3.5 megabases, using its Titanium upgrade for the GS FLX, which provides average read lengths of 400 to 500 base pairs. Along the way, they are also testing other assembly methods, according to Davidson. Results from this project are expected by the end of the year.
The main reason that the original GS FLX did not perform well enough to become the workhorse for the project is its average read length of only about 250 base pairs, Davidson said. Simulations have shown that a minimum read length of between 350 and 400 base pairs is required to assemble the Atlantic salmon genome, he said, mainly because of the length of its repeat elements.
According to Ulrich Schwoerer, head of global marketing for 454 sequencing at Roche, the longer read length of the Titanium upgrade should bring the platform within the desired range for salmon and other large genomes.
Schwoerer told In Sequence via e-mail that compared to its first runs for the project, “we not only doubled the read length but also have 20 kb tag paired runs, which improves dramatically the performance for de novo sequencing of complex genomes.”
He added that the “read length alone” for the GS FLX Titanium “is capable of assembling more complex genomes,” and that combined with other improvements in the system “we can even assemble genomes with higher structural complexity,” such as those with a high proportion of repeats.
Schwoerer said that the company is currently working on “numerous” de novo sequencing projects for “complex diploid organisms and polyploid plants” and expects to publish its results “soon.”
For the second phase of the salmon project, the scientists recommended giving new sequencing technologies another look. "It may not be Titanium — there may be a brand-new sequencing technology that comes on the street. If that's the case, then we will use that," Davidson said. "But certainly, at the moment, I think Titanium is the leader of the pack" for the purpose of de novo sequencing, he added.
The consortium has not yet considered any of the other existing second-generation technologies, which provide even shorter read lengths than the GS FLX, because "our sense is that if it's a function of read length, [and] if we were not going to get it with 250 base pairs, we are not going to get it with 36."
The consortium has not yet chosen a sequencing center for the first phase of the project, but will issue a request for proposals shortly, Davidson said. It plans to award the project to a single sequencing center, candidates for which include large academic sequencing centers in the US and in Europe. The aim is to complete the project within two to two-and-a-half years.
Davidson declined to say how much funding is available for the project but said that a mammalian genome of comparable size and complexity, sequenced with 6-fold coverage by the Sanger technology, including assembly, annotation "and some other bits," would currently cost on the order of $10 million.
The researchers did not perform a cost analysis of the 454 technology, so it is unclear whether the project would be cheaper on that platform. Sequencing by 454 "ought to be much faster" than Sanger sequencing, Davidson said, but the assembly and analysis could take more time, so the total project time might not be shorter.
To be sure, other researchers have apparently obtained good results using the 454 platform for de novo sequence assemblies of smaller, less complex eukaryotic genomes. Stephen Richards, for example, an assistant professor at the Human Genome Sequencing Center at Baylor College of Medicine, reported at the Biology of Genomes meeting at Cold Spring Harbor Laboratory in May that he had sequenced several strains of Drosophila melanogaster, which has a 123-megabase genome, using the GS FLX with both 250-base-pair reads and the 400-base-pair-read Titanium chemistry. He assembled the data de novo using the Newbler 2.0 assembler.
"Our experience with Drosophila has been extremely positive," with the assembly resulting in N50 contig sizes of 30 to 50 kilobases from 12-fold sequence coverage of the genomes, he told In Sequence by e-mail this week.
"Comparison against the reference D. melanogaster genome has been generally good, and certainly no worse than draft 8X Sanger sequences," he said. The sequence data are available to researchers who want to repeat his assembly.
Also, Joe Ecker, a researcher at the Salk Institute, reported at a symposium at Yale University last week that his group, in collaboration with Mike Snyder's lab at Yale and Dan Rokhsar at the Department of Energy's Joint Genome Institute, had sequenced and de novo assembled the 115-megabase Arabidopsis thaliana genome. The researchers are currently analyzing and annotating the assembly.