Rapid developments in the field of next-generation sequencing continued to pose a wealth of bioinformatics challenges in 2009, but there was ample evidence that the informatics community is more than able to keep pace with new sequencing technologies.
A prime example of the progress made in bioinformatics over the last year was the publication this week in Nature of the panda genome. It was sequenced and assembled de novo by the Beijing Genomics Institute in Shenzhen using only the Illumina Genome Analyzer platform and the SOAPdenovo algorithm for assembly.
That paper followed another BGI study published in Nature Biotechnology earlier this month that used SOAPdenovo to assemble two human genomes using reads from the Illumina GA.
To put that in perspective, when the first few short-read de novo assembly algorithms were published in 2008, they could only tackle bacterial genomes, and some in the community doubted that de novo assembly would ever be possible with short-read platforms like the Illumina GA and the SOLiD from Life Technologies' Applied Biosystems group [BioInform 3-21-2008].
In a paper published this week in Genome Research describing the assembly of the two human genomes with SOAPdenovo, the BGI team explains that it used eight quad-core 2.3 GHz AMD CPUs with 512 GB of memory to assemble one genome in 48 hours and the other in 40 hours. This included more than 20 hours of pre-assembly error correction on both genomes, they added.
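For readers unfamiliar with how short-read assemblers such as SOAPdenovo stitch millions of reads into contigs, the core idea is a de Bruijn graph: reads are broken into overlapping k-mers, the k-mers become edges in a graph, and contigs are read off as walks through it. The following is a minimal, purely illustrative sketch of that idea — the tiny "reads," the k value, and the function name are invented for this example and bear no relation to SOAPdenovo's actual implementation, which adds error correction, scaffolding with paired ends, and far more compact data structures.

```python
from collections import defaultdict

def debruijn_assemble(reads, k=4):
    """Toy de Bruijn assembly: build a k-mer graph, then walk it.

    Assumes the reads cover one sequence with a single unambiguous
    path through the graph -- real genomes are nowhere near this tidy.
    """
    # Break every read into overlapping k-mers; each k-mer is an edge
    # from its (k-1)-base prefix to its (k-1)-base suffix.
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    # A start node is one that no edge points into.
    targets = {t for outs in graph.values() for t in outs}
    start = next(n for n in graph if n not in targets)
    # Greedily extend the contig one base at a time along the path.
    contig, node = start, start
    while graph[node]:
        node = graph[node][0]
        contig += node[-1]
    return contig

# Two invented 8-bp "reads" covering the sequence ATGGCGTGCA
print(debruijn_assemble(["ATGGCGTG", "GGCGTGCA"], k=4))  # -> ATGGCGTGCA
```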
In another sign that bioinformatics tools may be improving as quickly as the sequencers they are aiming to support, CLC Bio this week released a new version of its de novo assembler that can purportedly assemble a 37-fold coverage human genome on a 32 GB RAM computer in around seven hours. That would be a 10-hour improvement over the version of the assembler that was available in June [BioInform 6-5-2009].
At the "short-SIG" special interest group session held as part of the Intelligent Systems for Molecular Biology conference in July, Michael Brudno, a researcher at the University of Toronto and a meeting organizer, told BioInform that topics that were "completely hot" in 2008, such as alignment and assembly, are now "mostly solved." In the case of short-read mapping, he said, "in a year's time, we went from having almost no tools out there to having 12 or 13, of which almost all are good or very good." [BioInform 7-17-2009].
One reason that these tools have improved so rapidly is that so-called short-read sequencers are able to generate much longer read lengths than they were a year ago. While the Illumina GA was generating reads of around 30 base pairs in 2008, the GA IIx is now able to generate paired-end reads of 75 to 100 bases.
"Short reads are no longer as short as they used to be," said Jens Stoye from the University of Bielefeld, another short-SIG organizer.
Michele Clamp, senior computational biologist at the Broad Institute, also noted that bioinformatics has made "tremendous" progress since last year. At the Genome Informatics conference held at Cold Spring Harbor Laboratory in October, Clamp noted that while "everything was in flux" in 2008, technologies are now "maturing," sequence read lengths are increasing, and bioinformatics teams have refined and validated algorithms and workflows.
"There is a sense of relief, because we can cope and address really interesting biological questions," Clamp said [BioInform 11-6-2009].
Moving and Storing Data
Just as bioinformatics technologies have gotten better at handling the analytical challenges posed by next-generation sequencing data, some of the thornier issues related to storing and managing large volumes of sequence data seemed to sort themselves out during 2009.
When it comes to storage, many labs with next-gen sequencers are finding that large-scale storage systems are a necessity. For example, the National Heart, Lung, and Blood Institute recently added 500 terabytes of new storage capacity — a 10-fold increase over its previous capacity — to help store data from next-gen sequencers [BioInform 11-13-2009].
NHLBI currently has two Illumina Genome Analyzer IIs and plans to add one sequencer per year for the next two to three years.
In addition, the DNA Sequencing Core Facility at the Oklahoma Medical Research Foundation recently expanded its storage capacity to 78 terabytes [BioInform 11-19-2009], and Montreal's Pharmacogenomics Centre added 38 terabytes of storage [BioInform 10-1-2009]. Both groups have a single Illumina GA installed.
The Oklahoma and Montreal groups both installed clustered storage systems from Isilon Systems, which has been eager to meet the growing demand for storage in the life science market. Chris Blessington, Isilon's senior director of marketing and communications, told BioInform at the Bio-IT World conference in May that life sciences accounted for less than 2 percent of Isilon's revenue going into 2008, but by the end of the year, that portion was "greater than 12 percent." [BioInform 5-8-2009]
Researchers are also finding solutions for data-management headaches that used to plague next-gen sequencing. For example, the 1,000 Genomes Project is working with a data-transfer software company called Aspera to help move large data files from location to location.
Holly Zheng Bradley, a member of the 1,000 Genomes Project Data Coordination Center at the European Bioinformatics Institute, told BioInform at the Genome Informatics meeting this fall that "FTP simply won't work for terabytes of data," so the DCC turned to Aspera, which uses a protocol called fasp (fast and secure access protocol) that is able to transfer files more quickly than the transmission control protocol. While TCP has a maximum throughput of around 50 Mbps, fasp's throughput is in the neighborhood of 500 Mbps.
The National Center for Biotechnology Information has also adopted fasp to help users submit raw sequence data to the Sequence Read Archive.
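To see what that tenfold throughput difference means in practice, a quick back-of-envelope calculation converts the figures above into transfer times for a single terabyte. The example file size and the helper function below are illustrative only; actual transfer times depend on link conditions and overhead.

```python
def transfer_hours(size_tb, throughput_mbps):
    """Hours to move size_tb (decimal) terabytes at a sustained rate in Mbit/s."""
    bits = size_tb * 1e12 * 8              # terabytes -> bits
    return bits / (throughput_mbps * 1e6) / 3600

# 1 TB over plain TCP (~50 Mbps) vs fasp (~500 Mbps), per the figures above
print(f"TCP:  {transfer_hours(1, 50):.1f} h")   # roughly 44 hours
print(f"fasp: {transfer_hours(1, 500):.1f} h")  # roughly 4.4 hours
```

At 50 Mbps, a terabyte takes nearly two days; at 500 Mbps, it fits comfortably inside a working day — which is why the DCC considered plain FTP over TCP a non-starter.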
But despite these advances, some groups still find that their sequence files are too large to transfer over most networks and often end up saving the data on disks and shipping it via FedEx or UPS.
For example, Bruce Martin, vice president of software at Complete Genomics, said at Bio-IT World that the human genome sequencing services firm delivers data to customers by packing boxes full of USB drives and shipping them off.
"There are really only two mechanisms we've identified" for delivering data, he said. "There's nothing brilliant here, there's electronic delivery via the net and burning physical media and using FedEx or UPS."
Delivering via FedEx or UPS is "incredibly good" in terms of cost and throughput, he said. Electronic delivery will be "the ultimate winner" as networking costs drop, but for now it is not practical unless two computers in the same data center can "talk to each other over a very big pipe."
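Martin's point is easy to check with arithmetic: shipping disks has surprisingly high effective bandwidth. The numbers below — a box of USB drives holding 2 TB, delivered overnight in 24 hours — are hypothetical figures chosen for illustration, not Complete Genomics' actual shipment sizes.

```python
def effective_mbps(size_tb, hours):
    """Effective throughput, in Mbit/s, of moving size_tb (decimal) TB in the given time."""
    return size_tb * 1e12 * 8 / (hours * 3600) / 1e6

# Hypothetical: 2 TB of USB drives delivered overnight (24 hours)
print(f"Courier 'bandwidth': {effective_mbps(2, 24):.0f} Mbps")  # ~185 Mbps
```

Even under these modest assumptions, the courier's effective ~185 Mbps handily beats the ~50 Mbps a plain TCP connection sustains — which is why "sneakernet" remains competitive until network costs fall.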