By Julia Karow
This article, originally published Oct. 29, has been updated with additional information and comments.
A Norwegian consortium of researchers, in collaboration with Roche's 454 Life Sciences, has completed a high-quality annotated draft genome of the Atlantic cod, Gadus morhua, using a shotgun sequencing strategy that relied exclusively on sequence data generated on 454's GS FLX Titanium platform.
The cod is among the first vertebrates to have its genome assembled entirely from next-generation shotgun sequencing data. In a second project phase, which may also involve short-read technologies, the researchers plan to fill in gaps in the genome, improve the annotation, and sequence the transcriptomes and genomes of additional cod samples from different geographic areas in order to identify SNPs that can be linked with certain traits.
According to Michael Egholm, 454's chief technology officer and vice president of R&D, researchers at the Max Planck Institute for Evolutionary Anthropology in Leipzig recently used a similar 454 shotgun sequencing and de novo assembly strategy to sequence the genome of the bonobo. "Both feats are impressive and there are several other large genome projects underway that will soon be made public," he told In Sequence last week.
The project started last year and received 10 million Norwegian kroner (about $1.8 million) in funding from the Research Council of Norway. Consortium participants presented their results at a meeting in Oslo last month and plan to submit their findings for publication in the near future.
According to Kjetill Jakobsen, the consortium's leader and a professor of biology at the Centre for Ecological and Evolutionary Synthesis at the University of Oslo, he and his colleagues initially discussed several strategies to sequence the cod, which has a repeat-rich genome of at least 800 megabases, and ended up using a mixture of unpaired and paired-end reads generated on the 454 platform.
Using this approach, sequencing for the project cost on the order of 2 to 3 million Norwegian kroner ($350,000 to $530,000) in consumables, Jakobsen said.
This strategy differs from the one chosen by an international consortium for the 3-gigabase Atlantic salmon genome. That consortium decided this summer to sequence BAC and fosmid clones using Sanger sequencing or a technology with equivalent read length for the first part of the project (see In Sequence 6/23/2009). A spokesperson for the project told In Sequence last week that the consortium will shortly announce details about its sequencing strategy, which she said will involve Sanger sequencing.
In their first proposal, Jakobsen recalled, the cod researchers suggested sequencing an ordered BAC library using a mixture of 454 or Sanger and Illumina sequencing. That approach did not win approval from three anonymous international peer reviewers, who said it would be too expensive and suggested a shotgun sequencing approach instead. According to a rough estimate, the BAC sequencing approach would have cost about 10 times more, Jakobsen said: between 20 and 30 million Norwegian kroner ($3.4 to $5.2 million), including at least 20 million kroner for Sanger sequencing with 5-fold coverage and 500,000 kroner ($90,000) for BAC preps and Illumina sequencing.
After adjusting their strategy accordingly, the researchers opted to collaborate with 454 on the project and to use 454's GS FLX Titanium platform exclusively, "due to the [long] read length and because we thought that we could do this within the budget," he said.
Besides using unpaired shotgun data, they decided to add paired-end data with 400-base-pair reads on each end, using 2-kilobase, 3-kilobase, 8-kilobase, and 20-kilobase insert libraries. These libraries were not yet commercially available from 454 at the time, Jakobsen said.
In total, the researchers generated approximately 25 gigabases of data for their assembly, corresponding to almost 27-fold coverage of the genome. The data came from 71 runs on the 454 platform: 44 shotgun runs and 27 paired-end runs, namely ten with the 3-kb library, nine with the 8-kb library, six with the 2-kb library, and two with the 20-kb library.
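As a rough check on these figures, fold coverage is simply the total number of sequenced bases divided by the genome size. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes that the "almost 27-fold" figure was computed against the earlier 930-megabase estimate of the cod genome, which the arithmetic appears to support.

```python
# Illustrative sketch; names are hypothetical, figures are those quoted in the article.

def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average number of times each base of the genome is sequenced."""
    return total_bases / genome_size

TOTAL_BASES = 25e9  # approximately 25 gigabases of 454 data

# Run counts as reported: shotgun runs plus four paired-end insert libraries.
RUNS = {"shotgun": 44, "3-kb": 10, "8-kb": 9, "2-kb": 6, "20-kb": 2}

if __name__ == "__main__":
    print("total runs:", sum(RUNS.values()))
    # Compare the assembly-based size range with the earlier 930-Mb estimate.
    for size_mb in (700, 800, 930):
        cov = fold_coverage(TOTAL_BASES, size_mb * 1e6)
        print(f"{size_mb} Mb genome: {cov:.1f}-fold coverage")
```

Against a 930-megabase genome this works out to about 26.9-fold; if the genome is closer to 700 to 800 megabases, the same data would represent roughly 31- to 36-fold coverage.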
More than 80 percent of the runs were performed on two 454 instruments at the University of Oslo, most within a three-month period at the beginning of the year, while the remainder were performed at 454.
In addition, collaborators at the Max Planck Institute for Molecular Genetics sequenced the ends of 30,000 BACs using Sanger technology. Those data were not used to generate the current assembly, Jakobsen noted, but only to confirm it. "We will, however, [use them] to make super-scaffolds," he said.
The data assembly proved challenging and required improvements to both 454's Newbler assembly software and the computing architecture at the University of Oslo. The researchers initially tried to assemble the shotgun and paired-read data together, "but we had great problems with getting the programs to work," Jakobsen said. After managing to assemble the shotgun data alone, "we saw that we did not get as much out of the shotguns as we had expected, so we had to add the pairs," he said.
Bioinformaticians at 454 then made several changes to the Newbler de novo assembler software, in particular to how the paired reads were mapped onto the shotgun reads. In parallel, the university upgraded its computing hardware to a high-performance computing cluster with 24 CPUs and a number of "huge-mem" machines with 128 gigabytes of memory each, allowing the researchers to now assemble the entire data set within a week or so.
According to 454's Egholm, with the new version of the assembler, which was also used to assemble the bonobo genome, "we have solved key issues with previous limitations on genome size and also deployed a long read overlapper." The updated assembler, which 454 plans to release to its customers before the end of this year, "appears to be comfortably handling human-sized genomes and we expect to push the limits further in the not-too-distant future," he said.
The resulting cod genome assembly is a high-quality draft with a scaffold length that "is just as good as that" of other, published fish genomes, according to Jakobsen. Between 80 percent and 90 percent of the genome is assembled in scaffolds, "including all kinds of repeats," he said, although some repeat sequences are probably not represented. The N50 scaffold size is 571 kilobases, the average scaffold size is 43 kilobases, and there are more than 14,000 scaffolds in total, covering 618 megabases. The largest scaffold is almost 6 megabases long. "Of course, there is some way left until we are at the level of the human genome or the cattle," Jakobsen said, noting that the fish species sequenced so far are less closely related to one another than mammals.
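For readers unfamiliar with the statistic, N50 is the scaffold length at which scaffolds of that size or larger contain at least half of the assembled bases, which is why it can far exceed the average scaffold size. The following is a minimal sketch using toy lengths, not the cod data; the function name is hypothetical.

```python
# Minimal N50 sketch on toy data; not the consortium's pipeline.

def n50(lengths):
    """Return the length L such that scaffolds of length >= L
    together contain at least half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no scaffold lengths given")

# Toy assembly: a few long scaffolds and several short ones (arbitrary units).
toy_lengths = [100, 80, 60, 20, 10, 10, 10, 5, 5]

if __name__ == "__main__":
    mean = sum(toy_lengths) / len(toy_lengths)
    print("N50:", n50(toy_lengths))
    print(f"mean: {mean:.1f}")
```

In this toy set the N50 is 80 while the mean is about 33, which mirrors how a 571-kilobase N50 can coexist with a 43-kilobase average when an assembly contains many small scaffolds.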
Paired-end reads were crucial to the success of the project. "If you use paired ends, then you get very good scaffolds," Jakobsen said. "These scaffolds contain holes — the contigs will not necessarily be extremely long. But it's a completely useful genome, it can be annotated, it has full biological meaning, and you can, of course, improve the contigs by doing additional sequencing or by adding BAC information or other things."
Interestingly, based on their assembly, the researchers estimated that the cod genome is between 700 and 800 megabases long, shorter than the previous 930-megabase estimate from other methods. "We are not completely certain what is really the right number because some of these repeat sequences are extremely hard to get into the alignment," Jakobsen said, but he believes that 930 megabases is an overestimate.
The researchers have already sent their assembly to the Ensembl pipeline at the European Bioinformatics Institute and the Wellcome Trust Sanger Institute for annotation. "They are really pleased with it, and they get very good gene predictions and annotations out of this," Jakobsen said, and he expects the annotation to become publicly available before the end of the year. Based on computational predictions and recently completed cDNA sequence data, the cod genome possesses between 25,000 and 27,000 genes.
During the next phase of the project, which will likely require additional funding, the cod consortium plans to fill in sequence gaps, improve the annotation manually, and generate more basic information about the genome, for example by determining the number of chromosomes through karyotyping.
In addition, the scientists plan to continue to sequence cDNAs, both from the current sample, to improve the genome annotation, and from other cod samples from various geographic areas, to identify SNPs. "We are building up a biobank of cod material, and we would like to sequence cDNA and sequence several individuals completely; generate more SNPs and look at the variation," Jakobsen said. The ultimate aim is to relate various traits — particularly those relevant to fishery and aquaculture — to genotypes.
For resequencing additional cod genomes, the consortium might use short-read sequencing technologies, and Jakobsen said he has already established a collaboration with a lab in Oslo that has Illumina's Genome Analyzer platform. He said he is also interested in using 454's 1-kilobase reads, which the company has said it plans to make available to early-access customers soon.
Short-read technologies might also prove useful in future de novo genome sequencing projects, he said. "It's probable that we might want to include some short pairs from Illumina in a future project; that could well be the case," Jakobsen said, adding that such data could improve contig length.
Given sufficient funding, he and his colleagues would like to sequence other cod-like species de novo, for example Arctic cod and burbot.