As its name suggests, TeraGenomics was founded to manage terabytes of genomic data — specifically, gene expression data from thousands of Affymetrix chips. But the life science business unit of IT consulting firm IMC has recently expanded its scope to embrace a new type of biological data and a new technology platform.
Last week, the company learned that it had received a Phase I Small Business Innovation Research grant for an undisclosed amount from the National Institute on Drug Abuse to extend its data warehouse platform to handle clinical information as well as gene expression data. In the first phase of the study, the company will build a data-management system to house microarray experiments and clinical trial data for 850 patients undergoing treatment for drug abuse.
In addition, this month, the company expects to complete a port of its data warehouse platform from Teradata to Oracle as part of a bid to attract a broader customer base in the life science market.
TeraGenomics got its start in 2002 as a collaboration between IMC, NCR's data warehouse subsidiary Teradata, and a team of neurogenetics researchers at the Salk Institute for Biological Studies. The goal was to build a data-management and -analysis system for thousands of Affymetrix microarray experiments [BioInform 01-27-03]. That database — TeraGenomics' flagship project — now includes 7,000 arrays, 130 million rows of analyzed data, and more than 22,000 probe-level chip-to-chip comparison files, according to Eva Mitter, TeraGenomics' development and operation manager.
At roughly 100 megabytes per array, including raw and processed data files, the 7,000 arrays in the database so far use about 700 gigabytes, Mitter said, but the system is growing. Last January, it contained about 1 billion rows, and then doubled over the course of the year to 2 billion rows. It doubled again between January and March to 4 billion rows.
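The storage and growth figures can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the per-array size is the rough estimate quoted above, and a gigabyte is assumed to be 1,000 megabytes. Straight multiplication of the per-array size by the array count gives on the order of 700 gigabytes, and two doublings from 1 billion rows give 4 billion.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# All names here are illustrative; none come from TeraGenomics' system.

MB_PER_ARRAY = 100          # rough estimate, raw plus processed files
ARRAY_COUNT = 7_000         # arrays reported in the Salk database
START_ROWS = 1_000_000_000  # rows reported for the earlier January


def storage_gb(arrays, mb_per_array=MB_PER_ARRAY):
    """Approximate storage in gigabytes, assuming 1 GB = 1,000 MB."""
    return arrays * mb_per_array / 1_000


def rows_after_doublings(start_rows, doublings):
    """Row count after the table has doubled `doublings` times."""
    return start_rows * 2 ** doublings


total_gb = storage_gb(ARRAY_COUNT)
rows_now = rows_after_doublings(START_ROWS, 2)

print(f"Approximate storage: {total_gb:.0f} GB")
print(f"Rows after two doublings: {rows_now:,}")
```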
Mitter said that the fruits of the project are just now starting to appear, and that the Salk research team has a number of papers in press discussing several "surprising" discoveries mined from the extensive database, which is hosted at IMC's headquarters in Reston, Va.
But with the Salk project well underway — and an anticipated jump in interest for its services once the lab's papers are published — TeraGenomics has recently set its sights on addressing a larger customer base.
One way that TeraGenomics is planning to expand its user base is by migrating the platform to Oracle, which Mitter said is "well established in the life science domain." The company will still support the Teradata platform, but Mitter said that new features in Oracle 10g — combined with its dominance in the market — made a move to the second platform necessary.
Mitter described the migration process at Oracle's Life Science User Group meeting, held concurrently with the Bio-IT World conference in Boston, May 15-16. The company began the switch in March, and is nearly finished, she told BioInform in an interview after the meeting. "There's just a few loose ends to finish — just a few bug fixes here and there," she said. The full port is scheduled for release on June 1.
The most time-consuming step in the migration, she said, was "building the Oracle data model."
When TeraGenomics first began to build the data warehouse for the Salk Institute, "the original idea was that Oracle wouldn't be able to handle the amount of data" from the project, Mitter said during the user group meeting, making Teradata the obvious choice at the time. But new features in 10g, combined with improvements in the algorithms that TeraGenomics was developing for the system, removed those doubts, she said.
Teradata's data warehouse system is common in retail and banking, but has not gained much acceptance in the life science market. In fact, it appears that some early life science adopters are changing their minds. Earlier this month, BioInform reported that the Windber Research Institute is also migrating its Teradata data warehouse to Oracle [BioInform 05-09-05].
Price is also a consideration. Mitter estimated that a customer could store the same number of arrays in a $70,000-$80,000 Oracle-based system as in a Teradata system that would cost more than $1 million.
In the end, Mitter said, the database system isn't as important as "the mathematics, algorithms, and appropriate database design." One key to the TeraGenomics system, she said, was scaling up the RMA normalization algorithm from the open-source Bioconductor project to run on more than 500 experiments at once, from a previous limit of 100. TeraGenomics is also "pretty close" to finishing a scale-up of Affymetrix's PLIER algorithm in a similar manner, she said.
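For readers unfamiliar with RMA, its normalization step is quantile normalization, which forces every chip to share the same intensity distribution. The sketch below is a minimal pure-Python illustration of that step only. It is not TeraGenomics' implementation or the Bioconductor code, and the article does not describe how the company scaled the algorithm. The function name and sample values are hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of chips.

    `arrays` is a list of equal-length lists of probe intensities,
    one list per chip. Returns new lists in which every chip has the
    same set of values, assigned according to each probe's rank.
    Ties are broken by probe position (a simplification).
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Sort each chip's intensities independently.
    sorted_chips = [sorted(a) for a in arrays]

    # Reference distribution: the mean across chips at each rank.
    reference = [
        sum(chip[rank] for chip in sorted_chips) / len(arrays)
        for rank in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # order[rank] is the index of the probe with that rank on this chip.
        order = sorted(range(n_probes), key=a.__getitem__)
        out = [0.0] * n_probes
        for rank, probe_idx in enumerate(order):
            out[probe_idx] = reference[rank]
        normalized.append(out)
    return normalized


chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
result = quantile_normalize(chips)
print(result)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

Because the reference distribution depends on every chip, no chip can be normalized until all of them have been read. That is one plausible reason the number of experiments processed at once matters for this kind of algorithm.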
Its use of the Teradata platform hasn't hurt TeraGenomics' business so far, however. Mitter said that the business unit, composed of around a dozen employees who "flow" between IMC and TeraGenomics, counts Merck Research Laboratories, the AT (ataxia-telangiectasia) Children's Project, and several non-profit and academic research centers among its customers.
TeraGenomics expects demand for its system to grow as microarray labs building in-house data-management systems around desktop gene-expression analysis platforms hit the wall, Mitter said. These labs "say that [their system is] going to scale, but it's not," Mitter said.
The company could face competition from publicly available microarray database systems like caArray from the National Cancer Institute's Cancer Biomedical Informatics Grid project, but Mitter said that caArray lacks the analytical features that TeraGenomics has integrated into its data warehouse.
The caArray system enables storage, "but there are no analytical tools," she said. "Researchers are still faced with the issue of pulling data out of the database, analyzing it, and putting it back in — that's time-consuming."
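The difference Mitter describes can be illustrated with a toy example. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, since neither caArray's nor TeraGenomics' schema appears in the article. The table names, column names, and values are all hypothetical. It computes the same per-probe mean twice: once by pulling rows out, analyzing them in application code, and writing the results back, and once with a single statement that runs inside the database.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy schema; not the caArray or TeraGenomics data model.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expression (array_id INTEGER, probe_id TEXT, signal REAL)"
)
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?)",
    [
        (1, "probe_a", 10.0),
        (2, "probe_a", 14.0),
        (1, "probe_b", 7.0),
        (2, "probe_b", 9.0),
    ],
)

# Round trip: pull the data out, analyze it, and put the results back.
by_probe = defaultdict(list)
for probe_id, signal in conn.execute("SELECT probe_id, signal FROM expression"):
    by_probe[probe_id].append(signal)
round_trip = {p: sum(v) / len(v) for p, v in by_probe.items()}
conn.execute("CREATE TABLE probe_mean_roundtrip (probe_id TEXT, mean_signal REAL)")
conn.executemany(
    "INSERT INTO probe_mean_roundtrip VALUES (?, ?)", round_trip.items()
)

# In-database: the analysis runs where the data lives.
conn.execute(
    "CREATE TABLE probe_mean_indb AS "
    "SELECT probe_id, AVG(signal) AS mean_signal "
    "FROM expression GROUP BY probe_id"
)
in_db = dict(conn.execute("SELECT probe_id, mean_signal FROM probe_mean_indb"))

print(round_trip == in_db)
```

Both paths give the same answer here. At billions of rows, the round trip also pays for moving every row across the database boundary twice.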
— Bernadette Toner