BOSTON--By focusing on data warehousing, the founders of Mosaic Bioinformatics hope to carve themselves a unique niche in the genomic-data-integration arena. "No one else is out here applying data warehousing to life science data," contended Jeffrey Resnick, CEO, who cofounded the company in November with Ken Griffiths.
During careers developing software for pharmaceutical companies, the two came to believe that data integration is the industry's number one problem and that warehousing is the technology to solve it, Resnick said.
Their five-person firm, which they consider more a consultancy than a software vendor, is now developing three new products: a sequence datamart, a gene expression datamart, and a literature datamart. Resnick described datamarts as single subsections of a warehouse, each storing information for a particular domain, that can be integrated with one another.
He explained, "Within a sequence datamart, we will have integrated numerous sequence databases, for example GenBank and Swiss-Prot. Within a gene-expression datamart we will have integrated oligo array expression data and cDNA array expression data. Likewise, for literature we'll integrate multiple literature databases. These and other datamarts to be developed can be brought together over what's called a data warehouse bus architecture."
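As an illustration only, the following Python sketch shows one common reading of a bus architecture: separate datamarts that share a single gene dimension table, so that a query can span them. The schema, table names, gene symbols, and values are hypothetical and are not drawn from Mosaic's products.

```python
import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """Create an in-memory warehouse with two datamarts sharing one dimension."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Shared dimension: one row per gene, referenced by every datamart.
        CREATE TABLE gene_dim (gene_key INTEGER PRIMARY KEY, symbol TEXT UNIQUE);
        -- Sequence datamart (hypothetical schema).
        CREATE TABLE sequence_fact (gene_key INTEGER, source TEXT, length INTEGER);
        -- Gene expression datamart (hypothetical schema).
        CREATE TABLE expression_fact (gene_key INTEGER, platform TEXT, level REAL);
        """
    )
    # Made-up sample rows, for illustration only.
    conn.executemany("INSERT INTO gene_dim VALUES (?, ?)",
                     [(1, "GENE_A"), (2, "GENE_B")])
    conn.executemany("INSERT INTO sequence_fact VALUES (?, ?, ?)",
                     [(1, "GenBank", 1200), (2, "GenBank", 900)])
    conn.executemany("INSERT INTO expression_fact VALUES (?, ?, ?)",
                     [(1, "oligo", 2.5), (1, "cDNA", 2.1), (2, "oligo", 0.4)])
    return conn


def cross_mart_query(conn: sqlite3.Connection) -> list:
    """Join both datamarts through the shared gene dimension."""
    return conn.execute(
        """
        SELECT g.symbol, s.source, s.length, e.platform, e.level
        FROM gene_dim AS g
        JOIN sequence_fact AS s ON s.gene_key = g.gene_key
        JOIN expression_fact AS e ON e.gene_key = g.gene_key
        ORDER BY g.symbol, e.platform
        """
    ).fetchall()


if __name__ == "__main__":
    for row in cross_mart_query(build_warehouse()):
        print(row)
```

Because both datamarts refer to genes through the same key, a further datamart (literature, for example) could be joined in the same way without changing the existing tables.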
Resnick contrasted Mosaic's warehousing approach with application integration technology. The latter is based on a so-called "middle layer approach" that works with the data separate from a higher layer of software. "As soon as you put middle layers in between the database and application, you're going to slow it down," he argued.
That's particularly problematic for pharmaceutical companies, Resnick said. "You try to run a query against an integrated database and it takes four weeks. That's not going to work," he said.
Another drawback to the application integration approach, Resnick argued, is a simple naming problem. "What happens if you have a gene that is called two different things in two different databases?" he asked. Application integration can't handle it, he said.
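The naming problem can be made concrete with a small Python sketch. It assumes a hand-maintained synonym table that maps each alias to one canonical symbol; the table, the function names, and the source labels are hypothetical, and the article does not say how Mosaic resolves names.

```python
# Hypothetical synonym table: lowercase alias -> canonical gene symbol.
SYNONYMS = {
    "p53": "TP53",
    "tp53": "TP53",
    "her2": "ERBB2",
    "erbb2": "ERBB2",
}


def canonical_symbol(name: str, synonyms: dict = SYNONYMS) -> str:
    """Return the canonical symbol for a gene name, or the cleaned name itself."""
    cleaned = name.strip()
    return synonyms.get(cleaned.lower(), cleaned.upper())


def merge_by_gene(sources: dict) -> dict:
    """Merge (gene name, value) records from several sources under one key.

    `sources` maps a source label to a list of (gene_name, value) pairs.
    The result maps each canonical symbol to {source label: value}.
    """
    merged: dict = {}
    for label, records in sources.items():
        for name, value in records:
            merged.setdefault(canonical_symbol(name), {})[label] = value
    return merged


if __name__ == "__main__":
    print(merge_by_gene({
        "db_a": [("p53", 1.8)],   # first database uses one name
        "db_b": [("TP53", 2.0)],  # second database uses another
    }))
```

Without the canonical step, the two records would sit under different keys and a join between the two sources would silently miss the match.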
Furthermore, Mosaic's white paper contended, "application level integration doesn't integrate data at the lowest level, so query optimizers for even the simplest things, such as cross products of two tables in different applications, can't be utilized."
Resnick argued that genetic data must be integrated at the data level. "You have to worry about how you're cleaning the data up. The boring detail stuff, somebody's got to do that," he said.
The technology's strongest point, Resnick argued, is that it cleans the data. "There's an actual staging process built into warehousing, where transformation occurs on all the data that is being integrated, so that when it finally ends up in the warehouse there are some rigid quality assurances you have about the data," he explained.
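A staging step of the kind Resnick describes might look roughly like the Python sketch below, in which every incoming record is normalized and checked before it is allowed into the warehouse. The field names and the three validation rules are assumptions chosen for illustration, not Mosaic's actual process.

```python
import re

# Assumed rule: a DNA sequence may contain only these letters.
VALID_DNA = re.compile(r"^[ACGTN]+$")


def stage(raw_rows: list) -> tuple:
    """Clean raw sequence records; return (clean_rows, rejected_rows).

    Each rejected entry is a (original_row, reason) pair so that bad
    records can be reviewed instead of silently dropped.
    """
    clean, rejected = [], []
    seen = set()
    for row in raw_rows:
        # Transformation: trim and upper-case identifiers, strip whitespace
        # from sequences.
        accession = (row.get("accession") or "").strip().upper()
        sequence = re.sub(r"\s+", "", row.get("sequence") or "").upper()

        # Quality checks applied before anything reaches the warehouse.
        if not accession:
            rejected.append((row, "missing accession"))
        elif not VALID_DNA.match(sequence):
            rejected.append((row, "invalid sequence"))
        elif accession in seen:
            rejected.append((row, "duplicate accession"))
        else:
            seen.add(accession)
            clean.append({"accession": accession, "sequence": sequence})
    return clean, rejected


if __name__ == "__main__":
    clean, rejected = stage([
        {"accession": " ab001 ", "sequence": "acg t\n"},
        {"accession": "AB001", "sequence": "ACGT"},
        {"accession": "X1", "sequence": "ACGU"},
    ])
    print(clean)
    print([reason for _, reason in rejected])
```

The point of the design is that the warehouse only ever sees rows that passed every check, which is what makes guarantees about the loaded data possible.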
Mosaic has two life science company clients--Wyeth-Ayerst and Paradigm Genetics--as well as a partnership with NetGenics, which sells its own bioinformatics integration platform, called Synergy. Resnick said Mosaic and NetGenics are cooperating to improve their products' performance, and he added that Mosaic could conceivably work with other bioinformatics companies too. For instance, he observed that Lion Bioscience's SRS package "does a reasonable job at the text file level but doesn't address the issues of having dirty data that you're integrating."
Resnick summed up Mosaic's position in the market like this: "Anywhere there is a need for database experts who understand how to make databases go quickly, a requirement that those experts have an understanding of the semantic domain they're working in, and integration of data as a key task, that's the place where we fit in."