Bristol-Myers Squibb explains why full-scale genomic data integration will have to wait
By Aaron J. Sender
Bristol-Myers Squibb's current philosophy on genomic data management is a little like Nathan Siemers' take on music theory. "Four people, no more than that, is about the limit for harmonious music," says the 37-year-old leader of the BMS bioinformatics group, whose lanky six-foot-two-inch frame dwarfs the cello he keeps in his office. "Once you get to five that [cohesion] is pretty much gone."
Strolling around the 443-acre BMS campus in rural Hopewell, NJ, Siemers says of the terabytes of sequence and expression data under his management, "The more things you try to integrate, the worse your output might be, the less coherent. It's important to choose the scope really carefully."
In other words, if complete data integration is a symphonic orchestra, then, at least for now, BMS and Siemers prefer a string quartet.
Virtually every major pharmaceutical company in the world today has a team dealing with the same data management challenges that Siemers and his BMS cohorts face. The genome sequence deluge has left them all with a wealth of information that, being disparate, disconnected, and not entirely reliable, may be more overwhelming than useful.
But Siemers maintains that, until the data are more stable, the problem remains insurmountable. Genomic data are still too raw and the software standards too many to attempt a comprehensive solution. "It would be very easy to spend a lot of money on large-scale data integration tasks in areas that may be not mature enough yet," he says. "It's like trying to build a house of cards on a bed of marbles. The ground underneath your feet is changing dramatically and it's going to continue to change."
Until something more stable rolls along, perhaps in the form of a complete reference sequence, BMS is focusing its efforts on projects that directly influence the drug development pipeline. "We're a conservative group," says Siemers, who sports a long blond ponytail. "This company has changed from being slightly ivory-tower in terms of research to being very much product oriented," he says. "It's a very organized thing even very early on in the pipeline now. People are in the loop. The researchers are in the loop with marketing. The marketing people are in the loop with the researchers. Expectations are communicated."
BMS received six major regulatory approvals in 2000 and boasts an R&D budget of close to $2 billion. But the company would rather spend that money on science than on infrastructure. "Only about 30 percent of our resources are spent on infrastructure," Siemers says, referring to the bioinformatics budget. "And that's all infrastructure: keeping machines running, data integration, any non-science activity."
Instead of investing in long-term projects to integrate genomic data from various databases, the bioinformatics group is creating ways to simply enable researchers to know what data are available and where to find them. "We are not going to make the perfect database," Siemers says. "We are trying to give people quick solutions. That's our philosophy. If it takes you two years to build something, by the time it's built the questions will have changed."
One current project is a portal, dubbed Gene Tracker, to help researchers explore BMS's cumulative knowledge of selected genes. It will contain fields for researchers to enter what they've discovered about a particular gene. The bioinformatics group already uses Gene Tracker internally and expects to release it to bench scientists this summer.
Here's how it works: A researcher enters a particular gene, by name or sequence, and sees a description of the gene and a list of its synonyms (the associated protein's name in SwissProt or Entrez, for example). The next section contains a summary of what's known about the gene's function, physical map data, mutant phenotype, introns, exons, and complexes with which the gene is associated. Another section contains information such as literature references, gel mobility, and genetic interactions. "Essentially it's a summary of everything that's known about the gene," says Dan Davison, associate director of bioinformatics.
Although Gene Tracker is a high-priority project, "it'll be fairly lightweight," says Siemers. "And this is one of the challenges in data integration: to limit the scope of the project so that it is doable." A consultant works two days a week on the project.
Gene Tracker is neither a data warehouse nor a multi-tier, middleware-linked project. It doesn't even attempt to pull all the relevant data onto a single Web page. It simply centralizes access to the disparate databases.
"The communications across databases are really quite trivial," Siemers says. Hyperlinks cross-reference genes to relevant data. "There won't be a lot of automated tools shoving data into this thing. It will be entirely managed by people, not programs."
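The link-out design described above can be pictured with a minimal sketch. This is not BMS's implementation: the `GeneEntry` class, its field names, and the intranet-style URLs are all hypothetical, chosen only to show a hand-curated record that points at other databases instead of copying their data. ERBB2 is used as a widely known synonym for HER2.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GeneEntry:
    """One hand-curated gene page (hypothetical structure, not Gene Tracker's schema)."""
    name: str
    synonyms: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)        # free-text findings typed in by researchers
    links: Dict[str, str] = field(default_factory=dict)   # label -> URL of the database holding the data


def render(entry: GeneEntry) -> str:
    """Build a plain-text summary page: the data stays in the source databases, only links are shown."""
    synonyms = ", ".join(entry.synonyms) or "none recorded"
    lines = [f"{entry.name} (synonyms: {synonyms})"]
    lines += [f"  note: {note}" for note in entry.notes]
    lines += [f"  {label}: {url}" for label, url in sorted(entry.links.items())]
    return "\n".join(lines)


if __name__ == "__main__":
    entry = GeneEntry("HER2", synonyms=["ERBB2"])
    entry.notes.append("Entered by hand by a bench scientist.")
    # Placeholder URLs: the real systems and addresses are not given in the article.
    entry.links["sequence"] = "http://seqstore.example/gene/HER2"
    entry.links["expression"] = "http://chips.example/gene/HER2"
    print(render(entry))
```

Because the record stores only labels and links, nothing has to be resynchronized when an underlying database changes; the cost is that a person must keep each entry current.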
Also in the works is a meta-database called the Harmonizer. "What we are trying to harmonize is nucleotide sequence, protein sequence, gene expression, and proteomic data," Davison says.
Harmonizer gives the user a summary of information available in internal and third-party databases, along with the contact information for the researcher who did the experiment. "If they want the raw data they have to go into those databases," Davison says. Eventually the meta-database would also link to Gene Tracker and cheminformatics data.
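A meta-database of this kind can be sketched as an index of summaries that records where the raw data live and whom to ask about them. The sketch below is an illustration under assumptions, not the Harmonizer's design: the `Summary` and `MetaIndex` names, the field layout, and the sample values are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set


class Summary(NamedTuple):
    """A pointer to data held elsewhere (hypothetical fields)."""
    gene: str
    data_type: str   # e.g. "nucleotide", "protein", "expression", "proteomic"
    source_db: str   # name of the internal or third-party database holding the raw data
    contact: str     # researcher who ran the experiment
    blurb: str       # one-line description of what is available


class MetaIndex:
    """Holds summaries only; raw data must be fetched from the source database."""

    def __init__(self) -> None:
        self._by_gene: Dict[str, List[Summary]] = defaultdict(list)

    def register(self, summary: Summary) -> None:
        self._by_gene[summary.gene].append(summary)

    def lookup(self, gene: str) -> List[Summary]:
        """Return every summary for a gene, ordered by data type."""
        return sorted(self._by_gene.get(gene, []), key=lambda s: s.data_type)

    def coverage(self, gene: str) -> Set[str]:
        """Report which kinds of data exist for a gene."""
        return {s.data_type for s in self._by_gene.get(gene, [])}


if __name__ == "__main__":
    index = MetaIndex()
    # Sample rows are made up for illustration.
    index.register(Summary("HER2", "expression", "chip-db", "researcher A", "array experiment"))
    index.register(Summary("HER2", "nucleotide", "SeqStore", "researcher B", "mRNA sequences"))
    for row in index.lookup("HER2"):
        print(row.data_type, "->", row.source_db, "(contact:", row.contact + ")")
```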
Siemers first brought the Harmonizer idea to the bioinformatics group's attention early this year. If there were a bioinformatics steering committee it would consist of the group's executive director Wes Cosand, Davison, and Siemers. "But we're a flat group," Davison says. "We pretty much all talk everything out."
For example, the group decided to go forward with Harmonizer on an ad hoc basis. "So it's not a formal program within Bristol-Myers Squibb," says Davison. That's how most bioinformatics projects get their start at BMS. The Harmonizer is expected to be up and running for bioinformaticists in the beginning of 2002 and later that year for everyone else.
OUT OF TUNE
Integrating all available EST transcripts into clustered gene contigs is not on BMS's agenda. It's not an impossible feat, to be sure, especially given that BMS employs Davison, arguably the world's expert in EST clustering. Davison is best known for developing the d-squared algorithm that his students John Burke and Win Hide further developed into the D2 cluster tool, now fundamental to DoubleTwist's CAT tools and the South African National Bioinformatics Institute's Stack-Pack. "Clustering all the data and turning a bunch of sequences into a bunch of genes is not that hard," says Siemers. "Maintaining that over the growth of all the data is what's difficult."
"It's just not cost effective," he explains. "Even if you do a good job, the databases have been changing so much in the last two years that it's like trying to track a moving target."
What to do?
"Just wait for GenBank to be thrown out, to be honest," says Siemers. He argues that GenBank is a hodgepodge of sequences a few thousand bases long and of limited use to researchers looking for novel gene targets.
Siemers hopes that NCBI's work to combine the human genome scaffold and curate a database of transcripts under the reference sequence collection and LocusLink will ultimately replace GenBank.
But it's not that simple. BMS relies heavily on Incyte, which is compiling its own reference sequence collection. "It's a very interesting issue of how that will develop and what we have to do to integrate that with LocusLink from NCBI," Siemers says. Invoking the music metaphor, Siemers wonders, "When things show up in the public domain will Incyte change their key to the reference sequence collection key?"
For now, though, BMS combines the Incyte data with GenBank, SwissProt, and internally generated sequence, along with relevant patent information, into an in-house relational Oracle database using GCG's SeqStore tools, and waits for things to sort themselves out. Gene Tracker and the Harmonizer link to SeqStore for sequence data.
BMS was an early adopter of SeqStore and was instrumental in convincing GCG to move away from flat-file databases. "It is the reference from which we work," says Siemers. GenBank and protein databases are updated every night and patent data once every two weeks.
Of course, true genomic data integration is the ultimate goal. "Ten years from now this department won't even exist," Siemers predicts. But until then it's up to the 15-member bioinformatics group, split between BMS's Hopewell and Wallingford, Conn., sites, to make the data manageable for the company's 2,000 research scientists.
"About half of the people have direct liaison responsibilities, which is one of our most important jobs," says Siemers. The bioinformaticists are assigned to support a particular therapeutic area. Bench scientists at BMS send their requests to liaisons in the bioinformatics group, who dig through the databases and compile the relevant information in useful scientific format. "Almost everyone here is a PhD biologist," says Siemers. "If you don't have the perfect database, one of your largest assets is a highly trained bioinformatics group."
Siemers, however, was not trained as a biologist. "I didn't touch anything alive until I did a postdoc," he says. As a PhD candidate in synthetic organic chemistry at Cornell University he made cockroach sex pheromones. "They're beautiful structures," he says.
In 1993 Siemers began applying his chemical expertise to work on cancer as a postdoc at a BMS research site in Seattle. "I spent three years doing molecular biology, recombinant protein design, protein expression, and animal models for cancer therapy," he says.
But it was only a matter of time before the other postdocs had Siemers running their GenBank searches for them. "My bachelor training was at MIT. It's almost impossible to get out of there and not have some level of proficiency, even though I was a chemist," Siemers says.
He'd also been hacking Unix systems in his spare time. Siemers' superiors took notice and he was soon transferred to NJ, where he became one of BMS's first bioinformaticists.
So-called data integration solutions for the industry abound. Bioinformatics software and hardware companies alike are fawning over pharmaceutical companies' genome data dilemma, promising to federate, warehouse, or otherwise link up all their data sources.
Surely among the plethora of integration technology for sale there's a solution to this problem? "There isn't a perfect solution," says Siemers. BMS has looked at half a dozen middleware vendors, including Lion, IBM, and NetGenics. "None of them solves the key harmonization issue," says Siemers. "It's not really because any of these vendors are doing it wrong. It's just a really hard problem."
Data must first be transposed to the appropriate keys, the way a saxophonist attempting to play a piece written for the cello must first transpose the musical data or notes. "The problem in genomics is that we're like the Tower of Babel here," Siemers says.
A B-flat on one cello is a B-flat on another cello regardless of which orchestra it plays in. "But in genomics, what the chip databases call a gene is different than what the chip vendors call a gene, what the proteomics databases call a gene, what a clinician calls a gene, and what a researcher calls a gene," Siemers says.
For instance, a gene that Siemers calls HER2 can be represented in microarray data as 33218_at, 1901_s_at, S57296_at, M12036_at, 198689, 44634, 44686, or 62860. "Even on a single chip, the same gene comes more than once and with a different name," says Siemers. NCBI's UniGene database calls that same gene Hs. 3231910. And GenBank lists six mRNA sequences (XM_008603, M11730, X03363, AX023363, NM_004448, and AF177761) and 204 ESTs representing the gene, each time with a different name.
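The "transposing keys" problem can be made concrete with a small alias table. The sketch below is illustrative only and is not a tool described in the article: the function names are invented, and the table holds just the microarray and GenBank identifiers quoted above. It maps each name back to one canonical key and refuses to build the index if two genes claim the same identifier, which is exactly the kind of conflict that makes large-scale harmonization hard.

```python
from typing import Dict, List, Optional

# Identifiers quoted in the text for the gene Siemers calls HER2.
ALIASES: Dict[str, List[str]] = {
    "HER2": [
        # microarray identifiers
        "33218_at", "1901_s_at", "S57296_at", "M12036_at",
        "198689", "44634", "44686", "62860",
        # GenBank mRNA accessions
        "XM_008603", "M11730", "X03363", "AX023363", "NM_004448", "AF177761",
    ],
}


def _normalize(identifier: str) -> str:
    """Ignore surrounding whitespace and letter case when comparing identifiers."""
    return identifier.strip().upper()


def build_reverse_index(aliases: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every known identifier (and the canonical name itself) to its canonical key."""
    index: Dict[str, str] = {}
    for canonical, identifiers in aliases.items():
        for raw in [canonical, *identifiers]:
            owner = index.setdefault(_normalize(raw), canonical)
            if owner != canonical:
                raise ValueError(f"{raw!r} is claimed by both {owner!r} and {canonical!r}")
    return index


def resolve(identifier: str, index: Dict[str, str]) -> Optional[str]:
    """Return the canonical gene key, or None if the identifier is unknown."""
    return index.get(_normalize(identifier))


if __name__ == "__main__":
    index = build_reverse_index(ALIASES)
    for name in ["1901_s_at", "nm_004448", "not-a-gene"]:
        print(name, "->", resolve(name, index))
```

A lookup table like this is trivial for one gene; the difficulty Siemers describes lies in curating it for every gene while the source databases keep renaming and re-clustering their entries.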
"If you took a computer scientist and showed him all of this and asked him to build a data warehouse, he'd die," Siemers says.
"But don't get me wrong, if someone comes up and says, 'Look, I've got the perfect solution,' and we go and check out their stuff and they've got the perfect solution, I'll buy it," he says. "But I don't see that happening real soon."