BGI-Led Project Uses Mix of Illumina and Sanger Sequencing to Assemble Cucumber Genome


By Julia Karow

Using a combination of Sanger capillary and Illumina sequencing technologies, researchers led by the Beijing Genomics Institute in Shenzhen and the Chinese Academy of Agricultural Sciences in Beijing have sequenced and assembled the genome of the cucumber at a sequencing cost of approximately $3.3 million.

The hybrid approach, which combined a small amount of Sanger data with a large amount of Illumina data, improved contig and scaffold length, as well as the fraction of the genome assembled, compared to either technology alone, according to the researchers. It might be useful for other genomes, they suggested, though that will depend on the nature of the genome under study.

In a paper describing the cucumber project in Nature Genetics this week, the authors wrote that "in combination with traditional Sanger sequencing, next-generation DNA sequencing technologies can be used effectively for de novo sequencing of plant genomes, making it possible to carry out rapid and low-cost sequencing for other important plant species."

But according to Jun Wang, executive director at BGI and an author of the study, different genomes will require different approaches. "Repeat structures and genome size will affect the choices of read length and insert sizes," he told In Sequence by e-mail. "Different platforms have their own advantages of read length and insert sizes. We should first analyze the features of the genome before we make a decision of which approach to take."

According to the paper, the researchers sequenced an inbred line of the domestic cucumber, Cucumis sativus, which has an estimated genome size of 367 megabases.

In total, they generated 26.5 gigabases of high-quality sequence data, equivalent to more than 72-fold genome coverage. Sanger sequencing contributed almost 4-fold coverage, or about 5 percent of the data, whereas paired-end reads from Illumina's Genome Analyzer provided more than 68-fold coverage, or almost 95 percent of the data.
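The coverage figures follow from dividing sequenced bases by the estimated genome size. A minimal sketch of that arithmetic, using the 26.5 gigabases and 367 megabases reported here; the function name is illustrative and does not come from the paper:

```python
def fold_coverage(sequenced_bases, genome_size_bases):
    """Average number of times each base of the genome is covered."""
    return sequenced_bases / genome_size_bases


GENOME_SIZE = 367e6   # estimated cucumber genome size, in bases
TOTAL_DATA = 26.5e9   # high-quality sequence data, in bases

total_fold = fold_coverage(TOTAL_DATA, GENOME_SIZE)
print(f"Total coverage: {total_fold:.1f}-fold")  # Total coverage: 72.2-fold

# Share of the data from each platform, taking the article's rounded
# figures (almost 4-fold Sanger, more than 68-fold Illumina) at face value.
sanger_share = 4 / (4 + 68)
illumina_share = 68 / (4 + 68)
print(f"Sanger share: {sanger_share:.1%}, Illumina share: {illumina_share:.1%}")
```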

However, the cost of generating the data using the two technologies was reversed: According to Wang, the team spent approximately $3 million on generating the Sanger data but only about $300,000 in direct costs for the Illumina sequencing.
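To put the cost reversal in perspective, the spending can be expressed per gigabase of data. This is a rough sketch only: the base counts are assumptions derived from the article's rounded coverage figures (about 4-fold and 68-fold of a 367-megabase genome), and the resulting per-gigabase numbers are not figures quoted by Wang or the paper.

```python
def cost_per_gigabase(cost_usd, bases):
    """Approximate sequencing cost in US dollars per gigabase of data."""
    return cost_usd / (bases / 1e9)


GENOME_SIZE = 367e6                 # bases
sanger_bases = 4 * GENOME_SIZE      # assumption: ~4-fold coverage
illumina_bases = 68 * GENOME_SIZE   # assumption: ~68-fold coverage

sanger_rate = cost_per_gigabase(3_000_000, sanger_bases)
illumina_rate = cost_per_gigabase(300_000, illumina_bases)

print(f"Sanger:   ~${sanger_rate:,.0f} per gigabase")
print(f"Illumina: ~${illumina_rate:,.0f} per gigabase")
print(f"Ratio:    ~{sanger_rate / illumina_rate:.0f}x")
```

Under these assumptions the Sanger data cost on the order of $2 million per gigabase, against roughly $12,000 per gigabase for Illumina.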

The Sanger reads were generated in 2007, he said, while the Illumina data was produced in 2007 and early 2008 and required about 25 runs, since the Genome Analyzer only generated about one gigabase of data per run at that time. The analysis was completed at the end of last year.

When the data was generated, he said, Sanger capillary sequencing and Illumina sequencing were "the most mature" technologies at BGI. "That's the reason we used them" in this project, rather than a different mix of sequencing platforms. According to the institute's website, BGI currently has 28 Illumina GAs, two ABI SOLiD systems, and one Roche 454 GS FLX, in addition to more than 110 Sanger sequencers.

Most of the Illumina data came from 200-base insert libraries with 42-base paired reads, while a portion was generated with 400-base inserts and 44-base paired reads, and the remainder from 2-kilobase libraries with 53-base paired reads.

The researchers then generated three assemblies, one using only the Sanger data, one using only the Illumina data and one combining the two datasets.

The hybrid approach improved the N50 length of both contigs and scaffolds, meaning that contigs and scaffolds of at least that length accounted for half the bases in the assembly. The contig N50 rose to 19.8 kilobases with the combined approach, compared with 2.6 kilobases for Sanger data alone and 12.5 kilobases for Illumina data alone. The scaffold N50 rose to 1,140 kilobases, compared with 19 kilobases for Sanger data alone and 172 kilobases for Illumina data alone.
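As an illustration of the N50 statistic described above, here is a minimal sketch of how it can be computed from a list of contig or scaffold lengths. The function and the toy lengths are hypothetical examples, not the researchers' code or data.

```python
def n50(lengths):
    """Return the N50 of a set of sequence lengths.

    Sequences are taken from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of
    all bases in the assembly.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy assembly of eight contigs totalling 32 kilobases (made-up values).
toy_contigs_kb = [8, 8, 4, 3, 3, 2, 2, 2]
print(n50(toy_contigs_kb))  # 8
```

In the toy example, the two 8-kilobase contigs together hold 16 of the 32 kilobases, so sequences of 8 kilobases or longer account for half the assembly.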

In addition, the amount of the genome assembled in scaffolds increased to 243.5 megabases, compared with 238 megabases for Sanger data alone and 200 megabases for Illumina data alone. The remaining genome regions, approximately 30 percent, remained unassembled and are "likely to be heterochromatic satellite or rRNA sequences," according to the paper.

The researchers further confirmed their assembly by comparing it with existing EST, fosmid, and BAC sequences, including 350,000 ESTs sequenced with Roche's 454 technology.

Wang declined to say whether BGI has used a similar sequencing strategy in other projects to assemble a genome de novo. However, at a conference this summer, he mentioned that BGI has developed a short-read de novo assembly algorithm and has used it to assemble the genome of the panda from Illumina sequence data only (see GenomeWeb Daily News 6/12/2009).

Other recent large-scale genome sequencing efforts have also eliminated Sanger sequencing. Last week, for example, a consortium of researchers in Norway said that in collaboration with 454 Life Sciences, it has sequenced the 800-megabase genome of the Atlantic cod using solely 454's sequencing technology (see other article in this issue).

And in May, 454 and two Malaysian companies sequenced the 1.7-gigabase oil palm genome with a combination of shotgun and BAC-pool sequencing in a project that they said was the first de novo assembly of a "large and highly complex" plant to be completed without the use of Sanger sequencing data (see In Sequence 5/19/2009).