A new research project at the Fred Hutchinson Cancer Research Center to aid the early detection of cancer could be Microsoft’s springboard into the bioinformatics market.
The five-year project, dubbed the Early Detection Initiative and funded with $4.4 million from private donors so far, will use proteomics and bioinformatics to study human serum protein profiles and distinguish individuals with early-stage cancer from those who are healthy.
The heart of the project will be a large-scale human serum proteome database to capture, store, and analyze data from the initiative. Researchers from the Hutchinson Center, Microsoft, and the Institute for Systems Biology will co-develop the database. Eventually, it will serve as a GenBank-like public resource for proteomics data, according to Martin McIntosh, a biostatistician at the Hutch who is working on the project.
But first, they need to build it, which is where Microsoft comes in. Jim Gray, a relational database pioneer and senior researcher in Microsoft’s Bay Area Research Center — fresh from similar stints in other scientific domains — will lend his skills to the project. Gray helped build the 3.3-terabyte TerraServer database of aerial images for the US Geological Survey, and more recently pitched in on the SkyServer database project, which aims to map every star in the sky by 2007.
Now, Gray will turn his attention to optimizing Microsoft’s SQL Server database technology for the nuances of biological data, which is “much more complex than any other kind of data that I’ve had to deal with,” he told BioInform. Unlike business data and even data in the physical sciences, “[bioinformatics] data is very dirty — it’s got lots of errors in it and it’s imprecise,” Gray said. “People in bioinformatics think they have a lot of data and, frankly, GenBank is 50 gigabytes — that’s part of a disk. So it’s not that there’s a lot of data — it’s that the complexity of the data is orders of magnitude higher than the complexity I’ve seen in other disciplines.”
The early detection project should put Gray’s database skills to the test. As with microarray gene expression data, the correlation of proteomics data across multiple experiments at different locations is a bioinformatics nightmare. However, proteomics experiments present an additional obstacle over gene expression experiments, according to McIntosh: “With SAGE and cDNA arrays, you’re trying to measure something you’ve already identified — you know there’s a gene, and you’re trying to measure its expression. What we’re doing now is trying to identify what’s there, and then we can talk about combining measurements that quantify it…So the challenge is combining databases that do both discovery and quantification at the same time.”
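McIntosh’s distinction between discovery and quantification can be made concrete with a toy relational schema. All of the table names, columns, and sample values below are invented for illustration (the article does not describe the project’s actual design), but they show why identifications and abundance measurements need to be linked per experiment before results from multiple sites can be combined:

```python
import sqlite3

# Hypothetical sketch: one table records what was identified in a run
# (discovery), another records how abundant it was (quantification),
# and both hang off a per-site experiment record.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE experiment (
    exp_id   INTEGER PRIMARY KEY,
    site     TEXT,              -- lab where the run was performed
    specimen TEXT               -- e.g. 'serum, early-stage cancer'
);
CREATE TABLE identification (   -- discovery: what is there
    exp_id     INTEGER REFERENCES experiment(exp_id),
    peptide    TEXT,
    confidence REAL             -- identification score, not abundance
);
CREATE TABLE measurement (      -- quantification: how much is there
    exp_id    INTEGER REFERENCES experiment(exp_id),
    peptide   TEXT,
    abundance REAL
);
""")
con.execute("INSERT INTO experiment VALUES (1, 'Hutch', 'serum, cancer')")
con.execute("INSERT INTO identification VALUES (1, 'PEPTIDEA', 0.95)")
con.execute("INSERT INTO measurement VALUES (1, 'PEPTIDEA', 1200.0)")

# Combining both views: identified peptides joined to their abundances.
rows = con.execute("""
    SELECT e.site, i.peptide, m.abundance
    FROM identification i
    JOIN measurement m ON m.exp_id = i.exp_id AND m.peptide = i.peptide
    JOIN experiment  e ON e.exp_id = i.exp_id
""").fetchall()
print(rows)  # [('Hutch', 'PEPTIDEA', 1200.0)]
```

The join is the crux: a system built only for quantification can assume the peptide list is fixed up front, while a system doing both must let each new experiment extend the identification tables that every abundance comparison depends on.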
Hitting the Books
Gray, a professed “complete newbie” in bioinformatics whose involvement in the effort is partly due to Microsoft CTO Craig Mundie’s position on the Hutch board of trustees, said he’s immersing himself in biochemistry textbooks and the scientific literature in order to “at least hold up my end of the conversation” with his new collaborators. Gray added that he also spent a week at NCBI to get a first-hand look at a bioinformatics data center in action, where he was “a bit surprised” to find that the venerable home of GenBank doesn’t use Oracle or even MySQL or DB2, but a combination of Sybase and Microsoft SQL Server.
SQL Server’s presence at NCBI highlights one aspect of Gray’s involvement in the project, which he self-mockingly described as the “crass financial motive of Microsoft.” The company sees a real opportunity in optimizing its database technology for the bioinformatics market. “I don’t expect to see BioInfo 1.0 as a product from Microsoft any time soon,” Gray quipped, “but certainly applications like this have unusual needs and to the extent we can see ways of meeting those needs we can improve our products.”
Gray noted that recent improvements in SQL Server, including the addition of data mining tools and spatial search features, were derived from his experiences with the TerraServer and SkyServer database projects.
“The idea is that if you can see a general pattern that occurs again and again and again, you try and put that into your product,” he said. “I think the Microsoft products will benefit from this collaboration [with the Hutch]…we’ll learn new algorithms and we’ll also learn places where our products don’t work very well and we need to fix them.”
While acknowledging that the bioinformatics database market is already dominated by Oracle — and is growing even more crowded with increasing competition from IBM and the open-source MySQL — Gray said that the lower price of SQL Server compared to other commercial products should keep it in the running.
Jim Ostell, chief of the engineering branch at NCBI, told BioInform via e-mail that price was a consideration in switching some of its database tools from Sybase to SQL Server. “With our increasing move to commodity hardware, NCBI, like many other groups, has re-evaluated our RDBMS in terms of features and price/performance and has decided to deploy SQL Server for a number of our new database applications,” he said.
But while the suits at Microsoft may view the Hutch project as a foot in the door of the bioinformatics market, Gray said that his primary interest in the initiative is a cerebral one. “The notion of understanding biological systems stands as one of the great intellectual challenges, and it has the prospect of extending human life and improving the environment, so there are many social aspects of working in bioinformatics that you don’t find in banking or insurance,” he said.
The Early Detection Initiative itself officially kicked off only last week, with the $4.4 million coming from the Paul G. Allen Foundation for Medical Research of Seattle, the W.M. Keck Foundation of Los Angeles, and Donald Listwin, a businessman from Woodside, Calif. The Hutch plans to put an additional $3.3 million toward the five-year effort.
The database project, McIntosh said, is still in its early stages, and the developers are still evaluating their current options. It’s likely that ISB’s in-house proteomics database will serve as an initial model, but scaling up the system to include data from other labs will be a challenge. Noted McIntosh, “informatics tools that help you evaluate your own experiments are not necessarily the same tools that will organize multiple experiments from multiple sites and different specimens.”
McIntosh said the database developers would start small. “The first step is going to be getting together to figure out how to organize the data from single sets of experiments, and then figure out how we would compare that to a next set of experiments at the same place, and onward.”
Ultimately, McIntosh said, the database will serve as an automatic “curator” for serum proteome information organized along the lines of the Cancer Genome Anatomy Project, where researchers can retrieve gene expression information for any given tissue type. In practice, he said, “We can just go into the serum and say, ‘What’s there in the cancer that’s not there in the control?’ or we can take a specific target that we’ve already identified with cDNA arrays and see if it’s there in the blood. And we can do that without first developing an ELISA assay, which is costly and time consuming.”
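The two query patterns McIntosh describes, differential discovery against a control and lookup of a target already flagged by cDNA arrays, reduce in spirit to simple set operations over identified serum proteins. A minimal sketch, with invented protein names standing in for real identifications:

```python
# Hypothetical data: proteins identified in pooled cancer serum vs.
# healthy control serum. The names are placeholders for illustration.
cancer_serum  = {"CA125", "PSA", "ALBUMIN", "TRANSFERRIN"}
control_serum = {"ALBUMIN", "TRANSFERRIN"}

# "What's there in the cancer that's not there in the control?"
candidate_markers = cancer_serum - control_serum
print(sorted(candidate_markers))  # ['CA125', 'PSA']

# Or: take a specific target already identified with cDNA arrays
# and ask whether it also shows up in the blood.
target = "CA125"
print(target in cancer_serum)     # True
```

In practice the comparison runs over noisy, confidence-scored identifications rather than clean sets, which is exactly the “dirty data” problem Gray describes, but the set difference captures the shape of the question the database is meant to answer without first building an ELISA.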
A pilot version of the database may be available for “local investigators” within the next year, McIntosh said.