As second-generation sequencing advanced in 2008, the technology took center stage in bioinformatics as well. Labs buying into this high-throughput technology encountered a range of informatics challenges, from upgrading their storage to finding software to organize and interpret the whirlwind of data, and a number of academic and commercial efforts took steps to meet this demand.
One driving force for second-generation sequencing in 2008 was the launch of several large-scale projects, such as the 1000 Genomes Project. The project, which kicked off in January, had surpassed 600 gigabases by May and 3.8 terabases by November. The project organizers plan to begin releasing data early next year and expect to finish sequencing 1,200 human genomes by around the end of 2009.
Research groups working on these projects are finding that the ability to generate large volumes of sequence data has quickly outpaced the ability to analyze it. As Tim Hubbard, head of the informatics team of the human genome-analysis group at the Wellcome Trust Sanger Institute told BioInform last January, it is “going to become cost effective to sequence anything in biology, but not necessarily cost effective to annotate it.”
That shift toward high-throughput sequencing has entailed a “major investment” in IT infrastructure to support the sequencing workflow at large genome centers. Hubbard said that the Sanger Institute installed 340 terabytes of disk cache just to handle the temporary processing of data coming off the machines.
Other genome centers are also dealing with these issues. Rick Wilson, director of the Genome Sequencing Center at Washington University School of Medicine, told BioInform in November that his team wrestled with a number of IT questions, particularly storage, in its effort to sequence a female patient’s acute myeloid leukemia genome [BioInform 11-07-08].
“You had to figure out exactly what data that came off the next-gen platforms you needed to save and which you could afford to toss,” Wilson said. “We are still learning that, I think.”
Some large-scale sequencing ventures, such as the International Cancer Genome Consortium, are following a federated rather than a centralized approach to data management, a design that is likely to see further exploration in the coming year.
Within the ICGC, each data-generation center will handle its own information management, Lincoln Stein, director of informatics and biocomputing at the Ontario Institute for Cancer Research in Toronto, which is serving as the data-coordination center for the ICGC, told BioInform in May. Under this model, each of the participating centers will host a local “franchise” database that will share a common data model and structure. The DCC will provide the schema and software for these franchise databases, but each data producer will manage its own local database.
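Stein's franchise model, a common schema and software distributed by the DCC with data managed locally at each center, can be illustrated with a deliberately simplified sketch. The table names and fields below are hypothetical, invented for illustration, and not the ICGC's actual data model:

```python
import sqlite3

# Hypothetical, simplified "franchise" schema: every participating center
# instantiates the same tables locally from a shared definition supplied
# by the data-coordination center.
SHARED_SCHEMA = """
CREATE TABLE donor   (donor_id TEXT PRIMARY KEY, cancer_type TEXT);
CREATE TABLE sample  (sample_id TEXT PRIMARY KEY,
                      donor_id  TEXT REFERENCES donor(donor_id));
CREATE TABLE variant (sample_id TEXT REFERENCES sample(sample_id),
                      chrom TEXT, pos INTEGER, ref TEXT, alt TEXT);
"""

def create_franchise_db(path):
    """Instantiate the common data model in a center's local database."""
    conn = sqlite3.connect(path)
    conn.executescript(SHARED_SCHEMA)
    return conn

# Two centers host structurally identical databases holding their own data.
toronto = create_franchise_db(":memory:")
tokyo = create_franchise_db(":memory:")
```

Because every franchise database shares one structure, the DCC can aggregate or query across centers without per-site translation layers.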
John McPherson, platform leader in cancer genomics and high-throughput screening at the OICR, told BioInform in November that keeping up with the informatics for the center’s five Illumina Genome Analyzers and five Applied Biosystems SOLiDs “is a real challenge.”
OICR has 600 cores at its disposal, but “we can saturate those with 10 instruments pretty easily,” McPherson said.
Meanwhile, several third-generation sequencing technologies on the horizon promise to generate even more data than currently available systems. For example, Complete Genomics, a firm launched this year, plans to use its proprietary technology to offer customers a human genome sequencing service in 2009. Bruce Martin, the firm’s vice president of software, told BioInform that the service’s $5,000 price tag includes sequencing to 20-fold coverage and “the first few steps of analysis.”
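For context, "20-fold coverage" means sequencing enough reads that each genome base is covered 20 times on average, so the required read count follows from simple arithmetic. The read length below is an illustrative value, not a figure from Complete Genomics:

```python
def reads_needed(genome_size, read_length, fold_coverage):
    """Reads required so total sequenced bases = fold_coverage * genome_size."""
    return fold_coverage * genome_size // read_length

# Illustrative only: covering a 3 Gb human genome 20-fold with 35 bp
# short reads (a typical length for the period) takes ~1.7 billion reads.
n = reads_needed(3_000_000_000, 35, 20)
```

Numbers on this scale are why a single sequencing service drives the multi-petabyte storage plans described below.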
The company has already built a data center with 400 terabytes of disk storage and 600 processors. Next year, it plans to scale up to 5 petabytes of disk storage and 10,000 processors, and by 2010, it wants to ramp up capacity another six-fold, to 30 petabytes of storage and 60,000 processors [BioInform 10-10-08].
Hardware adaptations are gaining in importance for high-throughput sequencing. Sequencers have some capacity for so-called “on-rig” analysis, but as Matt Trunnell, group leader in the Broad Institute’s Application and Production Support Group, described at this year’s Bio-IT World conference, he and his colleagues have found that alignment of large genomes can sometimes be “too big of a job for the on-machine system.”
Some vendors are eyeing reconfigurable hardware, such as field-programmable gate arrays, as an option for easing the computational burden of next-gen sequencing.
Earlier this year, Intel announced that it plans to develop an FPGA-based “appliance” that would run side-by-side with a sequencer, but the company has not provided further details on its plans in that area [BioInform 05-02-08].
And in July, Invitrogen, now Life Technologies, announced that it was working with Active Motif’s TimeLogic biocomputing unit to explore the use of FPGA technology to speed the analysis of next-generation sequencing data [BioInform 07-25-08]. TimeLogic had already signaled its interest in the next-gen sequencing market with the launch in March of its SeqCruncher FPGA accelerator that it said was “designed to handle the explosion of data from next-generation sequencing endeavors.”
Storage is another challenge facing laboratories large and small, which must decide how to gear up their own hardware and software solutions.
Storage vendors such as Isilon, BlueArc, Network Appliance, and EMC are finding a new market for their systems in this area. Complete Genomics and the Broad Institute are both using Isilon storage systems, while Geospiza signed an agreement with the firm earlier this year to integrate Isilon’s clustered storage technology with its FinchLab laboratory information management system for managing data from next-generation sequencers [BioInform 20-13-08].
Other labs are taking alternative approaches to storage. Scientists at the Friedrich Miescher Institute in Basel, for example, are using Sun Microsystems’ Storage Archive Manager File System, or SAM-FS, in combination with high-density disk-based storage from Copan Systems, in order to serve as both the primary storage and the backup system for two new Illumina Genome Analyzers [BioInform 06-20-08].
As sequence databases expand, software development efforts have intensified, both in the academic community and in the commercial market.
Bioinformatics software firms such as Geospiza, GenomeQuest, DNAStar, SoftGenetics, and CLC Bio all launched products targeted specifically at next-generation sequencing this year, while sequencing vendors such as ABI (now Life Technologies) and Illumina launched new software tools to help their customers analyze data from their instruments.
Illumina launched an updated version of its analysis software for the Genome Analyzer, called GenomeStudio, which replaces the company’s BeadStudio data-analysis software. In addition to existing tools for analyzing data from both the BeadArray and sequencing platforms, it contains two new modules for sequencing-based applications, along with algorithms for copy-number-variation detection and SNP calling and new data-visualization features.
Illumina also extended its Illumina Connect program during the year to include third-party commercial and academic bioinformatics providers working on software and hardware tools to manage Genome Analyzer data.
ABI, meantime, launched a website to support third-party software development for the SOLiD sequencing platform and to offer its own tools for the system, and has continued to add new tools and data to the site over the year [BioInform 10-24-08].
ABI is also partnering with commercial firms to develop new software for the SOLiD. In February it announced agreements with Geospiza and GenomeQuest to develop a suite of data-analysis tools for the SOLiD sequencer and to allow both firms to tune their software for compatibility with the instrument [BioInform 02-08-08].
Geospiza, meantime, took steps to build out its software portfolio for next-gen sequencing by acquiring the assets of microarray analysis firm VizX Labs in November [BioInform 11-21-08]. Geospiza plans to apply VizX’s GeneSifter software to digital gene-expression analysis on second-generation sequencing platforms.
But the academic community appears to be ahead of the curve when it comes to software development for second-generation sequencing, because the technology is opening up new lines of previously inaccessible scientific inquiry “and the software is the key,” Andrew Fire, Nobel Laureate and professor of pathology and genetics at Stanford University School of Medicine, told BioInform. His lab is part of Stanford's High Throughput Sequencing Initiative, which is using new sequencing technologies to study cancer and other diseases.
For now, Fire said, "there is no killer app" for second-generation sequencing, which means that most labs must still develop new software tools on the fly for particular projects.
Even as researchers are getting a handle on the informatics challenges of the current crop of high-throughput sequencers, manufacturers are rapidly increasing the read length and throughput of these systems, which could pose future hurdles.
“As the reads get longer, I am concerned that performance is going to go down,” OICR’s McPherson said to BioInform.
At this year’s Intelligent Systems for Molecular Biology conference in Toronto, Jason Stajich, president of the Open Bioinformatics Foundation, told BioInform that second-generation sequencing has put sequence analysis “back in vogue.” Pipeline software as well as sequence analysis fall into that category of renewed interest.
Stajich noted that development of de novo assemblers will likely be important for the foreseeable future, while researchers are still trying to figure out how to “go after a novel genome” using short-read sequencing.
Indeed, 2008 saw the release of a wealth of new algorithms for de novo assembly of short reads, adding to several such algorithms that were published in 2007 [BioInform 03-21-08].
In a study published in Genome Research in March, researchers compared the performance of four short-read assemblers: the Velvet algorithm from the European Bioinformatics Institute; Edena, for Exact De Novo Assembler, from Geneva University Hospitals; SSAKE from the Genome Sciences Center of the British Columbia Cancer Center; and SHARCGS from the Max Planck Institute for Molecular Genetics in Berlin. The authors found that Edena and Velvet, which differ in their graph approach, showed better performance “in terms of assembly quality and required computer resources,” and the scientists suggested that the two algorithms might be better able to cope with base errors in the reads than the other methods.
Extending graph approaches in assembly algorithms for paired-end reads, researchers at the Broad Institute tested their algorithm Allpaths on simulated Illumina data and found it is possible to produce “very high-quality assemblies based entirely on paired microreads at high coverage,” but they also said that it remains to be seen if this approach can extend to assemble genomes larger than those of bacteria.
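The de Bruijn graph strategy used by assemblers such as Velvet can be made concrete with a toy sketch: reads are decomposed into overlapping k-mers, and contigs are recovered by walking unambiguous paths through the graph. This is a deliberately minimal illustration, not any published assembler:

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Toy de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix node -> suffix node
    return graph

def walk(graph, start):
    """Greedily follow unambiguous edges to reconstruct one contig."""
    contig, node = start, start
    while len(graph.get(node, ())) == 1:
        node = next(iter(graph[node]))
        contig += node[-1]
        if node == start:  # guard against looping forever on a cycle
            break
    return contig

reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
g = de_bruijn_graph(reads, 4)
# walk(g, "ATG") stitches the three overlapping reads into "ATGGCGTGCA"
```

Real assemblers must additionally resolve repeats, sequencing errors, and ambiguous branches, which is precisely where the published algorithms differ.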
Other new short-read analysis packages published or improved this year include Maq, or Mapping and Assembly with Qualities, from Richard Durbin and his colleagues at the Sanger Institute; SHRiMP, or Short Read Mapping Package, developed by the University of Toronto’s Michael Brudno, which is adapted mainly for the SOLiD system; and ABySS, or Assembly By Short Sequences, developed at the Genome Sciences Center in Vancouver.
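Mappers like Maq and SHRiMP all solve the same underlying problem: locating each short read within a reference sequence. A naive exact-match sketch makes the task concrete; real mappers index the reference and tolerate mismatches and base-quality scores, all of which this toy version ignores:

```python
def map_reads(reference, reads):
    """Naive exact-match mapper: report every position at which each
    read occurs in the reference, by repeated linear scanning."""
    hits = {}
    for read in reads:
        positions, start = [], reference.find(read)
        while start != -1:
            positions.append(start)
            start = reference.find(read, start + 1)
        hits[read] = positions
    return hits

ref = "ACGTACGTTTACGT"
hits = map_reads(ref, ["ACGT", "TTT"])
# "ACGT" occurs at positions 0, 4, and 10; "TTT" at position 7
```

Scanning the reference once per read is hopeless at the scale of billions of reads, which is why production mappers precompute an index of the reference instead.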
Other projects are looking beyond assembly and other analytical steps to develop entire processing workflows for next-gen sequencing data. At the Genome Informatics conference in September, post-doctoral fellow Brian O’Connor of the University of California at Los Angeles presented an open source tool kit called SeqWare, which includes a LIMS to track sequencing as well as a pipeline framework to organize analysis of high-throughput sequencing data [BioInform 09-12-08].
A team of UK bioinformaticists is developing another sequence analysis pipeline, called Swift. The open-source software package is devoted to primary analysis, such as image-data processing and base-call extraction, for second-generation sequencing instruments [BioInform 10-10-08].
Second-generation sequencing is driving rapid development across the full range of bioinformatics tools that make up sequence-analysis pipelines.
As high-throughput sequencing gains popularity in the genomics community, efforts are underway to create guidelines and standards that allow scientists to compare results.
A workshop hosted in the spring by the Microarray and Gene Expression Data Society led to a checklist called MINSEQE, for Minimum Information about a high-throughput Nucleotide Sequencing Experiment [BioInform 04-11-08].
Among the MINSEQE guidelines are such elements as a description of the biological system and the particular states studied; the sequence read data for each assay; the “final” processed (or summary) data for the set of assays in the study; the experiment design, including sample-data relationships; general information about the experiment; and essential experimental and data-processing protocols. The next MGED meeting is in October 2009.
In addition, Martin Shumway, an NCBI staff scientist, told BioInform that a cross-platform data format is under development by the 1000 Genomes Project. The generic short-read alignment format, called SAM, for Sequence Alignment/Map, is a tab-delimited format for sequence read and mapping data.
The SAM standard is currently being applied to pilot data from 180 individuals in the 1000 Genomes Project, and Shumway said that the project organizers expect to use it to align the sequences of 1,200 individuals to the human genome reference assembly next year [BioInform 12-05-08].
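For readers unfamiliar with the format, each SAM alignment line carries 11 mandatory tab-separated fields, per the published specification; a minimal parser sketch follows, with an example record invented for illustration:

```python
# The 11 mandatory columns defined by the SAM specification.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus any
    optional TAG:TYPE:VALUE fields; numeric columns become ints."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["OPTIONAL"] = cols[11:]
    return record

# A made-up record: read "read1" aligned to chr1 at position 100.
line = "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
rec = parse_sam_line(line)
```

Because the format is plain tab-delimited text, a record produced by any sequencing platform's aligner can be consumed by any downstream tool, which is the cross-platform point of the standard.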