This article has been updated from a previous version to include comments from a Roche/454 official and other parties.
The International Collaboration to Sequence the Atlantic Salmon Genome seeks to award a multi-million dollar contract to a single large-scale sequencing center, or to a consortium of several, to sequence, assemble, and annotate the Atlantic salmon genome using Sanger technology or the "equivalent," according to a request for proposals published last week.
The goal of the ICSASG, a collaboration between researchers, funding agencies, and industry partners from Canada, Norway, and other countries, is to produce a genome sequence of the Atlantic salmon, Salmo salar, that identifies and maps all of its genes and that can serve as a reference for other salmonid species and more distantly related fish species.
Funding for the project comes from several ICSASG members, including collaboration coordinator Genome British Columbia, the Research Council of Norway, and the Norwegian Fishery and Aquaculture Industry Research Fund. The consortium does not disclose the exact amount of funding for the project, but Pierre Meulien, chair of the international steering committee for the ICSASG, told In Sequence last week that the budget for the contract is several million dollars.
Notably, in the RFP, the ICSASG recommends that Sanger sequencing or a technology of equivalent read length, rather than one of the current next-generation sequencing technologies, be used for the first phase of the project. The second phase — which will not be part of the initial contract — could involve novel sequencing technologies "to complete the sequence, identify SNPs, and finish selected regions of interest."
"We decided that we would need to ensure that the first phase of the sequencing project gave us a very good, solid scaffold on which to base our whole-genome analysis," Meulien said. "We know that we will be able to assemble properly using Sanger or equivalent reads of 750 base pairs."
Since there is no reference genome of a closely related fish species available, and because the approximately 3-gigabase Atlantic salmon genome is known to be complex and repetitive, "sequencing and assembly will be extremely challenging," according to the RFP.
Specifically, the genome underwent a duplication about 25 million to 100 million years ago and is considered to be pseudotetraploid, though the individual fish to be sequenced, a female named "Sally," is a doubled haploid. About 30 percent to 35 percent of the salmon genome consists of repetitive DNA.
Last year, members of the ICSASG tested the feasibility of using shotgun and paired-end reads from the 454 GS FLX platform for the project (see In Sequence 9/23/2008). According to the RFP, they concluded "that in its present form (average read length of 250 bp) the GS FLX technology is limited to gene mining and establishing a set of ordered sequence contigs with many gaps."
In the meantime, Roche's 454 has increased the read length of its platform to approximately 400 to 500 bases with its Titanium upgrade (see In Sequence 9/30/2008), and is working on even longer reads. "We are continuously pushing our read lengths beyond the current 500 bp and predict we will shortly intercept Sanger sequencing read lengths," 454 CSO and Vice President of R&D Michael Egholm told In Sequence by e-mail this week.
Also, "a number of assembler improvements have recently been made which enable the assembly of complex genomes into fewer, larger scaffolds and contigs," he added. "We predict that routine high-quality draft assembly of mammalian-size genomes will be possible shortly without the requirement for the very large expensive clusters used today."
The largest published genome to date that has been sequenced and assembled de novo from 454 data alone is that of baker's yeast. However, 454 said recently that it has sequenced and assembled the 1.7-gigabase oil palm genome, which has a repeat content of 60 percent. That project, which has not been published yet, generated a combination of BAC-pool and shotgun sequencing data with 250-base GS-FLX and 500-base Titanium reads (see In Sequence 5/19/2009). The genome was assembled by Synamatix, a Malaysian bioinformatics firm.
The Cod Genome Project
The ICSASG will likely also pay close attention to the cod genome project, a collaboration between several Norwegian research groups and international partners that was established last year to sequence the cod genome de novo, using largely 454 data.
According to its website, the project "will be carried out in close collaboration" with 454, and "will be one of the first of its kind where [454's] new technology is used to de novo sequence a complete and large vertebrate genome at a low cost."
The plan is to generate 25-fold coverage of the 0.9-gigabase cod genome by 454 shotgun sequencing, and to assemble these data into contigs, according to the website. Paired-end libraries with 20-kilobase, 8-kilobase, and 3-kilobase inserts will then be used to order and orient the contigs into scaffolds. Sanger sequencing of a subset of the BAC library will generate 125-kilobase pair sequences to link the scaffolds together, and individual BACs will be screened to resolve difficult regions and to close gaps.
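To give a sense of scale, the shotgun target can be turned into a rough sequencing volume. The sketch below is illustrative only: the 500-base read length is an assumption taken from the upper end of the Titanium range mentioned earlier, not a figure from the cod project, and the function names are hypothetical.

```python
def bases_needed(genome_size_bp: int, fold_coverage: int) -> int:
    """Total bases of raw sequence required for a given fold coverage."""
    return genome_size_bp * fold_coverage


def reads_needed(genome_size_bp: int, fold_coverage: int, read_length_bp: int) -> int:
    """Number of reads required, rounded up to the next whole read."""
    total = bases_needed(genome_size_bp, fold_coverage)
    return -(-total // read_length_bp)  # ceiling division on integers


COD_GENOME_BP = 900_000_000   # 0.9 gigabases, as stated by the project
TARGET_COVERAGE = 25          # 25-fold shotgun coverage
ASSUMED_READ_BP = 500         # assumption: average Titanium-class read length

total_bases = bases_needed(COD_GENOME_BP, TARGET_COVERAGE)
total_reads = reads_needed(COD_GENOME_BP, TARGET_COVERAGE, ASSUMED_READ_BP)

print(f"{total_bases:,} bases")   # 22,500,000,000 bases
print(f"{total_reads:,} reads")   # 45,000,000 reads
```

Under that assumed read length, 25-fold coverage works out to about 22.5 gigabases, or roughly 45 million reads. Shorter average reads would raise the read count proportionally.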
The assembly is expected to be "the major challenge," according to the website, and the project plans to use 454's Newbler assembler on a high-performance computing cluster, in addition to assembly programs for heterozygous genomes. The researchers also plan to assess whether programs written for Sanger sequence assembly can handle long 454 reads on the order of 500 bases.
According to Unni Grimholt, a researcher at the Centre for Ecological and Evolutionary Synthesis at the University of Oslo who is involved in the cod genome project, the collaborators have so far sequenced "a lot of" shotgun libraries and "some" 3-kilobase and 20-kilobase libraries. "Our current problem is getting the software to assemble all reads," which takes weeks or even months, she told In Sequence in an e-mail message.
Grimholt added that 454 is "working on improving the assembly software, and we just received a new version, which may solve the problem." She said the project will likely be completed this fall "if the assembly behaves nicely."
Because its genome is smaller and has not undergone a duplication, cod is "a much easier organism to work with" than the Atlantic salmon, according to Grimholt.
Sanger "or Equivalent"
Despite the recent improvements to the 454 technology, "we still see an advantage in using Sanger or equivalent," the salmon project's Meulien said. "Those longer reads will help in the assembly." He added that the ICSASG reached the decision after consulting with "very eminent advisors."
According to the RFP, "Currently, to obtain a genome sequence that can act as a reference for other salmonids, it appears that a substantial portion of the sequencing of the Atlantic salmon genome should be carried out using Sanger technology or equivalent."
Meulien explained that "if somebody can come up with an equivalent to Sanger that does the same thing as Sanger, then of course we will consider it."
In particular, the repetitive nature of the Atlantic salmon genome, and the length of its most common repeat — approximately 1,500 base pairs — "make it necessary to have long paired-end reads for assembling the sequence of this species' genomes," according to the RFP.
The solicitation applies only to the first phase of the project, which aims to generate five-fold coverage of the genome with Phred 20-quality sequence and assemble and initially annotate the genome.
The project wants to achieve this by end-sequencing 100,000 BAC clones and 100,000 fosmid clones and by generating 20 million paired-end reads, in both cases with a target minimum Phred 20 read length of 750 base pairs.
None of the next-generation sequencing technologies currently available to customers provides this read length at the moment.
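The phase 1 targets are consistent with the five-fold coverage goal, as a back-of-the-envelope check shows. This is a minimal sketch under stated assumptions: it treats the "20 million paired-end reads" as 20 million individual reads, uses the approximate 3-gigabase genome size, takes the 750-base minimum as the read length, and assumes each BAC or fosmid clone yields two end reads. The function name is hypothetical.

```python
def fold_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average fold coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length_bp / genome_size_bp


SALMON_GENOME_BP = 3_000_000_000  # approximately 3 gigabases
READ_LENGTH_BP = 750              # target minimum Phred 20 read length

# Assumption: 20 million counts individual reads, not read pairs.
shotgun_reads = 20_000_000
shotgun_cov = fold_coverage(shotgun_reads, READ_LENGTH_BP, SALMON_GENOME_BP)

# Assumption: two end reads per clone, for 100,000 BACs plus 100,000 fosmids.
clone_end_reads = 2 * (100_000 + 100_000)
clone_end_cov = fold_coverage(clone_end_reads, READ_LENGTH_BP, SALMON_GENOME_BP)

print(f"paired-end reads: {shotgun_cov:.1f}x")   # paired-end reads: 5.0x
print(f"clone end reads: {clone_end_cov:.1f}x")  # clone end reads: 0.1x
```

On these assumptions, the 20 million reads alone account for the five-fold target, while the clone end sequences add little raw coverage. Their value lies in long-range linking information rather than depth.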
The project plans to generate a first assembly after three-fold coverage has been achieved, and a second assembly with five-fold coverage.
These will serve as a "solid foundation" for the second phase of the project, which "could use novel sequencing technologies that are deemed to be appropriate to complete the sequence, identify SNPs, and finish selected regions of interest," according to the RFP. A third assembly will be performed at the end of the second phase.
"These technologies are moving so quickly, we are trying to take a balanced approach here," Meulien said. "We don't want to make a mistake, because it will be an expensive mistake if we find that we can't assemble using an inferior read length technology. And at the same time, we want to embrace the new technologies that are out there. That's why we divided [the project] into phase 1 and phase 2."
All assemblies will be annotated during three ICSASG-organized workshops, to be held in Canada, Norway, and other participating countries.
Project data will be publicly released immediately, though "rights to use the outcome of the results will be guaranteed via agreements between the project partners," according to the RFP.
Centers intending to apply are asked to respond by July 17. Proposals are due August 17, and ICSASG plans to select a candidate by October 16. The group will then negotiate the terms of the contract with this applicant and plans to launch phase 1 of the project before the end of the year, with a goal to complete the sequencing within 12 months. After a review of the first phase, the contractor may also be asked to conduct sequencing for the second phase of the project. The RFP is available here.
Applicants need to have a "strong track record in sequencing complete complex vertebrate genomes," and have access to Sanger sequencing as well as new sequencing technologies. They also need to have sufficient capacity to take on the project and to complete it within the scheduled timeframe.