As second-generation sequencing advanced in 2008, the technology took center stage in bioinformatics as well. Labs buying into this high-throughput technology encountered a number of informatics challenges, including upgrading their storage and finding software to help organize and interpret the deluge of data, and a number of academic and commercial efforts stepped in to help meet this demand.
One driving force for second-generation sequencing in 2008 was the launch of several large-scale projects, such as the 1000 Genomes Project. The project, which kicked off last January, had generated more than 600 GB of data by May and 3.8 TB by November. The project organizers plan to begin releasing data early next year and expect to finish sequencing 1,200 human genomes by around the end of this year.
Research groups working on these projects are finding that the ability to generate large volumes of sequence data has quickly outpaced the ability to analyze it. As Wellcome Trust Sanger Institute's Tim Hubbard said last year, it is "going to become cost effective to sequence anything in biology, but not necessarily cost effective to annotate it."
That shift toward high-throughput sequencing has entailed a "major investment" in IT infrastructure to support the sequencing workflow at large genome centers. Hubbard says that the Sanger center installed 340 terabytes of disk cache just to handle the temporary processing of data coming off of the machines.
Other genome centers are also dealing with these issues. Rick Wilson, director of the Genome Sequencing Center at Washington University School of Medicine, said in November that his team wrestled with a number of IT questions, particularly storage, in their effort to sequence a female patient's acute myeloid leukemia genome.
"You had to figure out exactly what data that came off the next-gen platforms you needed to save and which you could afford to toss," Wilson says. "We are still learning that, I think."
— Vivien Marx
The National Cancer Institute's Cancer Biomedical Informatics Grid program is seeking proposals for a series of in silico research centers that will offer bioinformatics support to cancer researchers. According to a solicitation issued in mid-December, the centers will perform "in silico, hypothesis-driven research using data analysis, aggregation and mining focused on discovery using caBIG and other publicly available cancer-related data sources."
The Institute for Research in Biomedicine in Barcelona will use €5 million in funding from the European Commission to coordinate scientists who are studying malaria and diabetes using 'omics and bioinformatics tools. The projects are part of the EU's Seventh Framework Program.
Amount Insilicos received from NIGMS to develop its "Ensemble Learning" statistical modeling technology.
Software for Homology Modeling of Ribosomes
Grantee: Norm Watkins, DNA Software
Began: Sep. 1, 2008; Ends: Feb. 28, 2009
With this grant, DNA Software plans to extend the functionality of an existing 3D homology structure prediction platform. The company hopes that researchers will then be able to study the structural mechanisms of the whole ribosome. The company says the grant will have long-term effects on the scientific community because "once the functionality of RNA-123 is extended ... it can be easily adapted to model biopolymers, DNA, and carbohydrates."
Algorithmically-Tuned Protein Families, Rule Base and Characterized Proteins
Grantee: Daniel Haft, J. Craig Venter Institute
Began: Sep. 26, 2008; Ends: Jul. 31, 2011
Haft will develop algorithms and annotation methods to use metagenomic DNA sequence data to discern which species do what inside the human gut. Using a three-tiered approach, he plans to develop algorithms that build protein families, devise a new way to apply annotation rules, and complete a systematic compilation of the right starting points for those annotations.