If 2007 was the year of the next-gen sequencer, 2008 could turn out to be the year when bioinformaticists are forced to learn how to analyze and annotate all the data those machines are spitting out. And this will likely drive further trends in the field, such as increased hiring and a renewed focus on economical computing systems.
Ryan Koehler, staff scientist with Applied Biosystems, told BioInform that “the biggest thing” in bioinformatics in the coming year “will be the giant data sets … [from] next-gen sequencing [machines].”
Indeed, ABI has identified bioinformatics as a potential bottleneck for prospective adopters of its newly launched SOLiD sequencer, and last fall expanded its Software Community Program to encourage development of third-party software tools for the platform [BioInform 09-07-07].
Michael Hadjisavas, director of commercial development at ABI, told BioInform at the time that the company decided to “reach out to the community and invite a dialogue in the area of software because we as a company cannot address all of the software requirements for the data interpretation of the myriad readouts that could be deployed” in next-gen sequence analysis.
For companies such as ABI, the lack of good supporting software for high-throughput sequence analysis could potentially slow adoption of the technology. “The amount of data that a researcher would have … could be very substantial and overwhelming, and unless there are companion software elements in place, the ability of customers to really enjoy the value these instruments can offer can be somewhat challenged, and that’s a problem,” Hadjisavas said.
Next-gen sequencing vendors 454 Life Sciences and Illumina as well as a number of academic efforts are also tapping into next-gen sequence analysis.
Several groups are developing methods for assembling very short reads from high-throughput instruments, including the European Bioinformatics Institute; the Broad Institute; the British Columbia Cancer Agency’s Genome Sciences Centre; Stony Brook University; the University of North Carolina, Chapel Hill; and the Max Planck Institute for Molecular Genetics [BioInform 11-16-07].
In addition, several standardization efforts are underway to manage data from these new systems. One such project, a collaboration between next-generation sequencing vendors, genome centers, and other organizations, has developed a DNA sequence data format called SSR, for short sequence reads. Another project, led by a group called the Genomic Standards Consortium, has created a checklist for sequencing experiments called MIGS, or minimum information about a genome sequence [BioInform 03-23-07].
Other researchers are ramping up their IT infrastructures to brace for an onslaught of data from next-gen sequencers. In November, the Genome Sequencing Center at Washington University in St. Louis announced plans to build an $11 million, 16,000-square-foot facility specifically designed for a growing fleet of next-generation sequencing systems [BioInform 11-09-07].
More recently, the University of Maryland’s Center for Bioinformatics and Computational Biology published an approach that could interest research groups who don’t have the IT budget of a major genome center. The method, which uses 3D graphics hardware to accelerate the MUMmer alignment algorithm, is particularly suited for short-read data from new sequencing platforms [BioInform 12-21-07].
Demand for New Talent
The renewed interest in sequence analysis is having a broad impact on bioinformatics. Recently, Roderic Guigó, coordinator of the Centre de Regulació Genòmica in Barcelona, Spain, told BioInform that the steady rise of sequence data is driving the need for more developers.
He compared the current sequence-analysis environment to that of microarray analysis a decade ago. “There was an explosion of people moving to work with microarray data. Now, there is an explosion of people working on algorithms to deal with high-throughput sequencing data because this data is going to be used for many different applications,” he said [BioInform 11-30-07].
As a result, bioinformatics developers are in high demand. A spate of conferences this year, from Intelligent Systems for Molecular Biology to Genome Informatics, served as cattle calls for software developers.
Chinnappa Dilip Kodira, director of the genome annotation department at the Broad Institute, last month told BioInform that he is doubling his team to 14 largely as a result of large-scale projects using next-generation sequencing technology [BioInform 12-07-07].
In addition, Jason Swedlow, a senior research fellow at the University of Dundee, told BioInform recently that there is demand for experienced software developers who know “how you write an algorithm down and how you choose to implement that algorithm.”
Expertise is an issue, he said. “Quite frankly, we’re not talking about Word documents; we are talking about multi-gigabyte data systems, so how you move those from one database to another … how you calculate and mine on that kind of data [is what the industry needs].”
It's Not Easy Going Green
Swedlow said that the growth of data in bioinformatics is likely to drive another bioinformatics trend in 2008: green computing.
While the popular notion is that “going green” is good for the environment, 2007 provided evidence that it can also lower overall computing costs by consuming less energy.
“Cooling is substantial in this business,” Swedlow said. His group runs 200 CPUs, and “my colleagues are sitting on a lot of storage … [which contributes substantially to] the cost of power and energy,” he added.
Tim Hubbard, informatics head at the Wellcome Trust Sanger Institute, recently told BioInform that Sanger’s IT group is exploring greener options because the institute’s 11,000-square-foot data center has “an expensive electricity bill.”
“There’s a view that we will have to do more in the short term and medium term to accommodate the growth in data, particularly sequence data because of the new technologies,” Hubbard said. “The growth of this data is now such that it’s impossible to back up, so we’re thinking about off-site hosting in order to replicate some of that, to protect ourselves. But we are also looking specifically at whether there are greener options for that, too.”
Some vendors have identified bioinformatics as an early-adopter market for cooler-running systems. For example, when SGI launched its Altix ICE 8200 “energy-smart” blade platform in June, it targeted the life science sector as a key potential user base for the system [BioInform 06-29-07].
Deepak Thakkar, SGI’s bioscience segment marketing manager, said at the time that the company expected the system to appeal to cost-conscious life science users. “Most life science customers are showing that almost 40 percent of their HPC budget is associated with power cost,” he said.
It appears that at least one bioinformatics group is among the front-runners in green computing technology. In November, a supercomputer installed at Stanford University’s Biomedical Computational Facility placed in the top 10 in the first-ever “Green500” list of energy-efficient supercomputers.
Stanford’s 15.6-teraflop Dell "Bio-X2" machine ranked No. 6 in the Green500 roster, which is intended to complement the twice-annual Top500 supercomputer ranking. Unlike the Top500 list, however, the Green500 emphasizes performance metrics other than speed, such as performance per watt and energy efficiency.
Following a tumultuous 2006 in which Lion Bioscience and Tripos were sold after lengthy, highly public deliberations, M&A activity in 2007 proceeded much more smoothly, though no less actively.
The year’s acquisitions were marked by a high degree of complementarity and very little overlap: In May, information services firm Thomson acquired Unleashed Informatics for an undisclosed amount in a bid to expand its Thomson Pharma database of sequences, chemicals, drug targets, intellectual property, drugs, compounds, and company information [BioInform 03-23-07].
Then in June, Tibco, a provider of business process management software, acquired Spotfire for $195 million in cash in order to expand its business-intelligence offering [BioInform 05-04-07].
That was followed in August by Entelos’ purchase of Iconix Biosciences in an all-share transaction valued at $8.3 million. Entelos said at the time that it expects Iconix’s DrugMatrix database of gene-expression patterns for more than 350 drug compounds to complement its in silico disease modeling work [BioInform 09-07-07].
In all three cases, customers and employees for both firms were left relatively unscathed. One exception to this trend was Symyx’s $123 million acquisition of MDL Information Systems, which led to a restructuring effort in October that resulted in the layoff of 124 employees — 18 percent of its staff — in an effort to “reduce overlap and streamline operations.” [BioInform 10-26-07]
Another acquisition left the bioinformatics field with one fewer microarray analysis package at the end of 2007. When Agilent Technologies acquired Stratagene in June for $246 million, it gave the firm two microarray analysis packages that were very well established in the research community: Agilent’s GeneSpring suite of tools, which it picked up in its 2004 acquisition of Silicon Genetics, and Stratagene’s ArrayAssist software package.
In August, Agilent said it would merge features of ArrayAssist into its GeneSpring array analysis platform and phase out ArrayAssist [BioInform 08-17-07].
Marketing Amid Belt-Tightening
Some life science software vendors noted that shrinking R&D budgets among pharmaceutical firms and academic research labs have led to an increased demand to prove that their tools can create real value for their customers.
Kristen Zanella, marketing manager for biotech, pharma, and medical operations at The MathWorks, told BioInform that 2007 was “certainly a challenging year for the pharma industry and we see the challenges.”
Within pharma, there appears to be an effort to figure out “which organization and which approach to the workflow works best for them in terms of [return on investment],” she said.
She added that The MathWorks views its wide array of software tools as an “on-ramp” for users in the pharma environment, where expert programmers typically interact closely with biologists and other researchers who are less skilled in this regard. The tools, she indicated, are “user-friendly” enough to be of value to both the scientist and the bioinformatician or computer expert.
Nancy Latimer, product manager for the Gene Expression, R-Statistics, and Text Analysis collections for Accelrys’s SciTegic Pipeline Pilot product, also noted the current environment of “extremely tight budgets, where the bioinformatics groups are really having to justify their existence and their past expenditures.”
As a result, she said, “they really have to re-use that internal code that they have developed.”
Accelrys considers Pipeline Pilot to be of interest to such users because it allows them to break internally developed code up into “small chunks” and deploy them through the Pipeline Pilot interface. “That allows them to really salvage pieces of their internally developed software and IP that can be shared by other groups within their organization,” Latimer said.
More broadly, Latimer said she envisions “a changing user base” for bioinformatics in the year ahead, “where there are more bench scientists and [fewer] specialists in the bioinformatics area.”
Other software vendors cited different areas they are keeping an eye on in the year ahead. For Claudio Schmidt, head of the Expressionist product line at Genedata, it is “clearly toxicology … that will have the biggest impact in life science, and we want to expand on that [in the year ahead].”
Ronald Ranauro, CEO of GenomeQuest, told BioInform that one of the trends his company sees is “an interest in getting information not one gene at a time or one sequence at a time, but looking at information in aggregate and looking at kind of [a] high-content view of information.”
Ranauro added that “bioinformatics, clearly, is anticipating the downstream use case of genetic sequence data in the preclinical and clinical phases” of drug discovery research.