Data Dissonance

Bristol-Myers Squibb explains why full-scale genomic data integration will have to wait

By Aaron J. Sender

Bristol-Myers Squibb’s current philosophy on genomic data management is a little like Nathan Siemers’ take on music theory. “Four people – no more than that – is about the limit” for harmonious music, says the 37-year-old leader of the BMS bioinformatics group, whose lanky six-foot-two-inch frame dwarfs the cello he keeps in his office. “Once you get to five that [cohesion] is pretty much gone.”

Strolling around the 443-acre BMS campus in rural Hopewell, NJ, Siemers says of the terabytes of sequence and expression data under his management, “The more things you try to integrate, the worse your output might be, the less coherent. It’s important to choose the scope really carefully.”

In other words, if complete data integration is a symphonic orchestra, then, at least for now, BMS and Siemers prefer a string quartet.

Virtually every major pharmaceutical company in the world today has a team dealing with the same data management challenges that Siemers and his BMS cohorts face. The genome sequence deluge has left them all with a wealth of information that, being disparate, disconnected, and not entirely reliable, may be more overwhelming than useful.

But Siemers maintains that, until the data are more stable, the problem remains insurmountable. Genomic data are still too raw and the software standards too many to attempt a comprehensive solution. “It would be very easy to spend a lot of money on large-scale data integration tasks in areas that may be not mature enough yet,” he says. “It’s like trying to build a house of cards on a bed of marbles. The ground underneath your feet is changing dramatically and it’s going to continue to change.”

FINE TUNING

Until something more stable rolls along, perhaps in the form of a complete reference sequence, BMS is focusing its efforts on projects that directly influence the drug development pipeline. “We’re a conservative group,” says Siemers, who sports a long blond ponytail. “This company has changed from being slightly ivory-tower in terms of research to being very much product oriented,” he says. “It’s a very organized thing even very early on in the pipeline now. People are in the loop. The researchers are in the loop with marketing. The marketing people are in the loop with the researchers. Expectations are communicated.”

BMS received six major regulatory approvals in 2000 and boasts an R&D budget of close to $2 billion. But the company would rather spend that money on science than on infrastructure. “Only about 30 percent of our resources are spent on infrastructure,” Siemers says, referring to the bioinformatics budget. “And that’s all infrastructure: keeping machines running, data integration — any non-science activity.”

Instead of investing in long-term projects to integrate genomic data from various databases, the bioinformatics group is creating ways to simply enable researchers to know what data are available and where to find them. “We are not going to make the perfect database,” Siemers says. “We are trying to give people quick solutions. That’s our philosophy. If it takes you two years to build something, by the time it’s built the questions will have changed.”

One current project is a portal, dubbed Gene Tracker, to help researchers explore BMS’s cumulative knowledge of selected genes. It will contain fields for researchers to enter what they’ve discovered about a particular gene. The bioinformatics group already uses Gene Tracker internally and expects to release it to bench scientists this summer.

Here’s how it works: A researcher enters a particular gene, by name or sequence, and sees a description of the gene and a list of its synonyms — the associated protein’s name in SwissProt or Entrez, for example. The next section contains a summary of what’s known about the gene’s function, physical map data, mutant phenotype, introns, exons, and complexes with which the gene is associated. Another section contains information such as literature references, gel mobility, and genetic interactions. “Essentially it’s a summary of everything that’s known about the gene,” says Dan Davison, associate director of bioinformatics.

Although Gene Tracker is a high-priority project, “it’ll be fairly lightweight,” says Siemers. “And this is one of the challenges in data integration: to limit the scope of the project so that it is doable.” A consultant works two days a week on the project.

Gene Tracker is neither a data warehouse nor a multi-tier, middleware-linked project. It doesn’t even attempt to pull all the relevant data onto a single Web page. It simply centralizes access to the disparate databases.

“The communications across databases are really quite trivial,” Siemers says. Hyperlinks cross-reference genes to relevant data. “There won’t be a lot of automated tools shoving data into this thing.” It will be entirely managed by people, not programs.
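As a rough illustration of how lightweight such a portal can be, the sketch below models a gene entry as a name, a list of synonyms, a set of hyperlinks to outside databases, and notes typed in by researchers. It is not BMS’s implementation: the class names, field names, and the example URL are all hypothetical, and lookup by sequence, which the article also mentions, is left out.

```python
from dataclasses import dataclass, field


@dataclass
class GeneEntry:
    """One hand-curated portal page (all field names are hypothetical)."""
    name: str
    description: str
    synonyms: list = field(default_factory=list)
    links: dict = field(default_factory=dict)   # database label -> URL
    notes: list = field(default_factory=list)   # free text entered by researchers


class GenePortal:
    """Central access point: stores links to other databases, not their data."""

    def __init__(self):
        self._entries = {}    # canonical name -> GeneEntry
        self._by_alias = {}   # lower-cased name or synonym -> canonical name

    def add(self, entry):
        self._entries[entry.name] = entry
        for alias in [entry.name, *entry.synonyms]:
            self._by_alias[alias.lower()] = entry.name

    def find(self, query):
        """Return the entry for a name or synonym, or None if unknown."""
        name = self._by_alias.get(query.strip().lower())
        return self._entries.get(name) if name else None


portal = GenePortal()
portal.add(GeneEntry(
    name="HER2",
    description="Placeholder description",
    synonyms=["ERBB2"],
    # Hypothetical URL; a real portal would point at the in-house sequence store.
    links={"sequence": "https://example.invalid/seqstore/HER2"},
))
entry = portal.find("erbb2")
```

Because the portal holds only links and human-entered notes, nothing has to be re-synchronized when an underlying database changes, which is the trade-off Siemers describes.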

Also in the works is a meta-database called the Harmonizer. “What we are trying to harmonize is nucleotide sequence, protein sequence, gene expression, and proteomic data,” Davison says.

Harmonizer gives the user a summary of information available in internal and third-party databases, along with the contact information for the researcher who did the experiment. “If they want the raw data they have to go into those databases,” Davison says. Eventually the meta-database would also link to Gene Tracker and cheminformatics data.

Siemers first brought the Harmonizer idea to the bioinformatics group’s attention early this year. If there were a bioinformatics steering committee, it would consist of the group’s executive director Wes Cosand, Davison, and Siemers. “But we’re a flat group,” Davison says. “We pretty much all talk everything out.”

For example, the group decided to go forward with Harmonizer “on an ad hoc basis. So it’s not a formal program within Bristol-Myers Squibb,” says Davison. That’s how most bioinformatics projects get their start at BMS. The Harmonizer is expected to be up and running for bioinformaticists in the beginning of 2002 and later that year for everyone else.

OUT OF TUNE

Integrating all available EST transcripts into clustered gene contigs is not on BMS’s agenda. It’s not an impossible feat, to be sure, especially given that BMS employs Davison, arguably the world’s expert in EST clustering. Davison is best known for developing the d-squared algorithm that his students John Burke and Win Hide further developed into the D2 cluster tool, now fundamental to DoubleTwist’s CAT tools and the South African National Bioinformatics Institute’s Stack-Pack. “Clustering all the data and turning a bunch of sequences into a bunch of genes is not that hard,” says Siemers. “Maintaining that over the growth of all the data is what’s difficult.”

“It’s just not cost effective,” he explains. “Even if you do a good job, the databases have been changing so much in the last two years that it’s like trying to track a moving target.”

What to do?

“Just wait for GenBank to be thrown out, to be honest,” says Siemers. He argues that GenBank is a hodgepodge of sequences a few thousand bases long and of limited use to researchers looking for novel gene targets.

Siemers hopes that NCBI’s work to combine the human genome scaffold and curate a database of transcripts under the reference sequence collection and LocusLink will ultimately replace GenBank.

But it’s not that simple. BMS relies heavily on Incyte, which is compiling its own reference sequence collection. “It’s a very interesting issue of how that will develop and what we have to do to integrate that with LocusLink from NCBI,” Siemers says. Invoking the music metaphor, Siemers wonders, “When things show up in the public domain will Incyte change their key to the reference sequence collection key?”

For now, though, BMS combines the Incyte data with GenBank, SwissProt, and internally generated sequence, along with relevant patent information, into an in-house relational Oracle database using GCG’s SeqStore tools, and waits for things to sort themselves out. Gene Tracker and the Harmonizer link to SeqStore for sequence data.

BMS was an early adopter of SeqStore and was instrumental in convincing GCG to move away from flat-file databases. “It is the reference from which we work,” says Siemers. GenBank and protein databases are updated every night and patent data once every two weeks.

Of course, true genomic data integration is the ultimate goal. “Ten years from now this department won’t even exist,” Siemers predicts. But until then it’s up to the 15-member bioinformatics group, split between BMS’s Hopewell and Wallingford, Conn. sites, to make the data manageable for the company’s 2,000 research scientists.

“About half of the people have direct liaison responsibilities, which is one of our most important jobs,” says Siemers. The bioinformaticists are assigned to support a particular therapeutic area. Bench scientists at BMS send their requests to liaisons in the bioinformatics group, who dig through the databases and compile the relevant information in useful scientific format. “Almost everyone here is a PhD biologist,” says Siemers. “If you don’t have the perfect database, one of your largest assets is a highly trained bioinformatics group.”

Siemers, however, was not trained as a biologist. “I didn’t touch anything alive until I did a postdoc,” he says. As a PhD candidate in synthetic organic chemistry at Cornell University he made cockroach sex pheromones. “They’re beautiful structures,” he says.

In 1993 Siemers began applying his chemical expertise to work on cancer as a postdoc at a BMS research site in Seattle. “I spent three years doing molecular biology, recombinant protein design, protein expression, and animal models for cancer therapy,” he says.

But it was only a matter of time before the other postdocs had Siemers running their GenBank searches for them. “My bachelor training was at MIT. It’s almost impossible to get out of there and not have some level of proficiency, even though I was a chemist,” Siemers says.

He’d also been hacking Unix systems in his spare time. Siemers’ superiors took notice and he was soon transferred to NJ where he became one of BMS’s first bioinformaticists.

VENDOR BENDER

So-called data integration solutions for the industry abound. Bioinformatics software and hardware companies alike are fawning over pharmaceutical companies’ genome data dilemma, promising to federate, warehouse, or otherwise link up all their data sources.

Surely among the plethora of integration technology for sale there’s a solution to this problem? “There isn’t a perfect solution,” says Siemers. BMS has looked at half a dozen middleware vendors, including Lion, IBM, and NetGenics. “None of them solves the key harmonization issue,” says Siemers. “It’s not really because any of these vendors are doing it wrong. It’s just a really hard problem.”

Data must first be transposed to the appropriate keys, the way a saxophonist attempting to play a piece written for the cello must first transpose the musical data or notes. “The problem in genomics is that we’re like the tower of Babel here,” Siemers says.

A B-flat on one cello is a B-flat on another cello regardless of which orchestra it plays in. But in genomics, “what the chip databases call a gene is different than what the chip vendors call a gene, what the proteomics databases call a gene, what a clinician calls a gene, and what a researcher calls a gene,” Siemers says.

For instance, a gene that Siemers calls HER2 can be represented in microarray data as 33218_at, 1901_s_at, S57296_at, M12036_at, 198689, 44634, 44686, or 62860. “Even on a single chip, the same gene comes more than once and with a different name,” says Siemers. NCBI’s UniGene database calls that same gene Hs. 3231910. And GenBank lists six mRNA sequences – XM_008603, M11730, X03363, AX023363, NM_004448, and AF177761 – and 204 ESTs representing the gene, each time with a different name.
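The “transposing to the appropriate key” step can be pictured as an alias index: every identifier, qualified by the source it came from, maps to one canonical gene name. The sketch below uses only the microarray and GenBank mRNA identifiers quoted above. The source labels, function names, and sample records are illustrative assumptions, not anything BMS or a vendor is described as using.

```python
from collections import defaultdict

# Identifiers quoted in the article for the gene Siemers calls HER2.
# The source labels ("microarray", "genbank_mrna") are illustrative.
HER2_ALIASES = {
    "microarray": ["33218_at", "1901_s_at", "S57296_at", "M12036_at",
                   "198689", "44634", "44686", "62860"],
    "genbank_mrna": ["XM_008603", "M11730", "X03363", "AX023363",
                     "NM_004448", "AF177761"],
}


def build_alias_index(canonical_to_aliases):
    """Invert {gene: {source: [ids]}} into {(source, id): gene}."""
    index = {}
    for canonical, by_source in canonical_to_aliases.items():
        for source, ids in by_source.items():
            for identifier in ids:
                index[(source, identifier)] = canonical
    return index


def harmonize(records, index):
    """Group (source, id, payload) records by canonical gene.

    Records whose (source, id) pair is not in the index are returned
    separately rather than guessed at.
    """
    grouped = defaultdict(list)
    unresolved = []
    for source, identifier, payload in records:
        canonical = index.get((source, identifier))
        if canonical is None:
            unresolved.append((source, identifier, payload))
        else:
            grouped[canonical].append((source, identifier, payload))
    return dict(grouped), unresolved


index = build_alias_index({"HER2": HER2_ALIASES})
records = [
    ("microarray", "33218_at", 1.0),          # made-up payloads
    ("genbank_mrna", "NM_004448", "mRNA"),
    ("microarray", "NM_004448", "wrong source label"),
]
grouped, unresolved = harmonize(records, index)
```

Keying on the (source, identifier) pair rather than the bare identifier reflects the point of the passage: a bare number such as 44634 means nothing outside the database that issued it. Building the lookup is the easy part; as Siemers notes of clustering, the cost lies in keeping such a table current as the underlying databases change.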

“If you took a computer scientist and showed him all of this and asked him to build a data warehouse, he’d die,” Siemers says.

“But don’t get me wrong, if someone comes up and says, ‘Look I’ve got the perfect solution,’ and we go and check out their stuff and they’ve got the perfect solution, I’ll buy it,” he says. “But I don’t see that happening real soon.”
