The International Genomics Consortium has taken a key step toward its goal of building a publicly available database of over 10,000 gene expression experiments. The Phoenix, Ariz.-based non-profit organization recently selected genetic banking firm First Genetic Trust to create the IT infrastructure for its Expression Project in Oncology (expO).
FGT will put its enTrust technology to work in the enrollment and consent of 10,000 patients from 19 participating academic cancer centers, build and maintain the resulting database, and support distributed access to the data and samples for partners in the project.
Financial terms of the agreement were not disclosed.
Andy Baxevanis, director of computational genomics at the NHGRI, who is overseeing the computational aspects of the IGC’s activities, said the project’s security requirements made outsourcing a necessity. Similar consortium-led efforts, like the Human Genome Project or the SNP Consortium, used tools developed in the public domain to gather and store data, but did not have the patient confidentiality issues that expO faces.
“We wanted to make sure this was what we’ve been calling the military-grade secure environment — that it would be difficult if not impossible to crack into. The encryption would have to be on every single piece of data rather than on the database as a whole,” said Baxevanis.
“We turned to FGT because of their experience in the patient confidentiality side, and on the enrollment side it would allow us to handle these 10,000-plus samples using pieces they had already developed for other projects.”
Perhaps the selection of FGT wasn’t such a surprise. The company’s chairman and CEO, Arthur Holden, already served as a senior corporate relations advisor to the IGC. But Baxevanis said the consortium evaluated other available genetic banking technologies and “no one was even close. Other firms had potential but not on the time frame we needed.”
One Down, But Another Big Decision Remains
The IGC has yet to determine what microarray technology it will use to conduct the study. It has completed a pilot project using several platforms, including Affymetrix, Agilent, and Amersham, “but it was with the understanding that there was no guarantee of the chips being used in the actual study itself.” The final chip decision will be in part “budgetarily driven,” Baxevanis said.
The IGC estimates that the total cost for the project will come in at around $42 million and hopes to secure around $35 million in funding from a number of pharmaceutical partners by the end of the year. IGC partners will not receive early access to the expO data, which will be made publicly available upon completion of the project and updated regularly.
The IGC expects pharmaceutical firms to pledge their support because of the opportunity the project will provide to shape standardization efforts and guide its direction.
In addition to the publicly available resource that the IGC will support — expOdb — Baxevanis said the expO data would also be submitted to the EBI’s ArrayExpress gene expression repository. The IGC has been working with Alvis Brazma at the EBI to ensure that the expO data sets are compliant with emerging microarray standards, Baxevanis said. “We wanted to do that to really bolster these standards like MIAME and MAGE and really put some beef behind them, to say here’s a large block of data that conforms to these standards to help solidify the use of those standards in the scientific community.”
While the two sites will at first mirror the same data, they are expected to diverge eventually, as the IGC plans to have the participating academic medical centers provide follow-up information and survival data about the patients, which would be added to the expOdb records.
Further plans call for a Hewlett-Packard-based Linux cluster to be housed at the IGC’s Phoenix headquarters, Baxevanis said. Longer term, a supercomputer facility is planned that all IGC partners would have access to. The consortium is currently in discussions with IBM about this center.
According to Baxevanis, the biggest computational challenges will come far downstream of the expO project — a product of the sheer volume of data it will generate. “There’s never been a set of data this big. Never,” Baxevanis said. Indeed, each of the 10,000 patient-sample records is expected to generate at least a gigabyte of data. By comparison, the current release of GenBank takes up 75 gigabytes. “The problem now becomes how do you analyze the data? It’s going to present a big supercomputing challenge, but it’s exciting at the same time,” he said. “It gives people a very nice playground.”
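The scale Baxevanis describes is easy to put in rough numbers. The sketch below is a back-of-the-envelope calculation using only the figures quoted above — the per-record size is the article's "at least a gigabyte" estimate, and the GenBank figure is the 75-gigabyte release cited for comparison:

```python
# Back-of-the-envelope estimate of expO's projected data volume,
# using the figures quoted in the article (not measured values).
patients = 10_000
gb_per_record = 1    # "at least a gigabyte" per patient-sample record
genbank_gb = 75      # cited size of the then-current GenBank release

total_gb = patients * gb_per_record   # 10,000 GB, i.e. roughly 10 TB
ratio = total_gb / genbank_gb         # how many GenBanks that represents

print(f"expO projected total: {total_gb:,} GB (~{total_gb / 1000:.0f} TB)")
print(f"~{ratio:.0f}x the cited GenBank release")
```

At a minimum, then, the project would produce on the order of 10 terabytes — more than a hundred times the comparison GenBank release — which is the gap the planned Linux cluster and supercomputer facility would have to close.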