The Windber Research Institute, a two-year-old biomedical research center in Windber, Pa., “was designed around biomedical informatics,” according to Richard Somiari, its COO and CSO. The institute is fully outfitted with a state-of-the-art informatics infrastructure — including a newly installed data warehouse from Teradata that will collect up to 50 terabytes of data every nine months — but the secret to its informatics foundation may be a simple matter of old-fashioned communication rather than technological prowess. According to Somiari, a strategy of open dialogue with the research community, industry partners, and vendors has been a key building block in WRI’s bioinformatics groundwork.
In August, WRI hosted a conference that brought together 240 biologists, computer scientists, mathematicians, doctors, and other professionals in order “to see what the experts in the field think about biomedical informatics, and also help us determine if we are heading in the right direction,” Somiari said. WRI conducted a 17-question survey during the conference to gauge the community’s informatics needs, as well as its outlook for the future (see sidebar, p. 8). Not only did the survey results show that WRI is “headed in the right direction” when it comes to its informatics framework, but they also showed that the biomedical research community is bullish about the promise of bioinformatics “in how drugs are developed and how diseases are managed,” Somiari said.
In addition, Somiari has put his communication skills to work in persuading WRI’s software suppliers to open up their systems so that they can be integrated into a seamless, automated framework. Last November, Somiari hosted a meeting for 15 different software vendors at the Windber campus to discuss ways to integrate their software. “Each company is trying to protect its intellectual property,” he said, “but I told them that it’s not going to help healthcare in any way if we can’t integrate all these processes.” The vendors, including InforMax, Spotfire, Amersham Biosciences, Cimarron, SPSS, SAS, and others, have willingly opened up their systems, Somiari said. “We spend a lot of money with them, so they have to be cooperative,” he added.
An Informatics Foundation
Somiari said that biomedical informatics is the “foundation” upon which WRI is building a high-throughput clinical genomics infrastructure. “We started by saying, ‘We are going to drive everything that we’re doing with biomedical informatics.’ But we know that we can’t rely on the quality and quantity of data available in existing databases, so we have to be able to generate raw data ourselves,” he said.
With this philosophy of “backward integration from biomedical informatics,” WRI set up a large-scale tissue acquisition and banking system as well as an analysis pipeline to capture molecular-level information about tissue specimens provided by a network of medical institutions. WRI annotates each tissue specimen with 450 data fields comprising 166 megabytes of information, including demographic data, pathology information, radiology data, medical histories, and imaging data, as well as DNA, RNA, and protein data. The WRI facility is designed to handle around 5,000 specimens per year, Somiari said, and currently holds around 6,000 specimens. The tissue bank is expected to hold up to 240,000 samples in a few years.
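For a rough sense of scale, the specimen figures quoted above can be turned into a back-of-envelope storage estimate. The sketch below assumes that the 166 megabytes applies to each specimen and that one terabyte equals one million megabytes; both are assumptions, and the function name is illustrative only.

```python
# Back-of-envelope estimate using only the figures quoted in the article.
MB_PER_SPECIMEN = 166        # assumed to be the annotation size per specimen
SPECIMENS_PER_YEAR = 5_000   # stated design throughput
CURRENT_SPECIMENS = 6_000    # stated current holdings
TARGET_SPECIMENS = 240_000   # stated eventual capacity of the tissue bank
MB_PER_TB = 1_000_000        # decimal units (assumption)


def annotation_terabytes(specimens: int) -> float:
    """Return the annotation volume in terabytes for a number of specimens."""
    return specimens * MB_PER_SPECIMEN / MB_PER_TB


if __name__ == "__main__":
    for label, count in [
        ("per year at design throughput", SPECIMENS_PER_YEAR),
        ("current holdings", CURRENT_SPECIMENS),
        ("full tissue bank", TARGET_SPECIMENS),
    ]:
        print(f"{label}: {annotation_terabytes(count):.2f} TB")
```

Under these assumptions, annotation data would grow by a little under one terabyte per year at the design throughput and reach roughly 40 terabytes for a full bank of 240,000 samples.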
To handle this vast and complex data output, WRI collaborated closely with a team of informatics vendors to build a system that seamlessly tracks biological specimens, captures information, and offers data mining and analysis capabilities.
The first level of the infrastructure is built upon five parallel Scierra laboratory workflow systems co-developed with Amersham and Cimarron, Somiari said. Four of the Scierra systems — for sequencing, microarrays, proteomics, and genotyping — were ready more or less as soon as they came out of the box, while the fifth is a proprietary system for clinical information that Somiari said is the “first of its kind.” The workflow systems guide how each sample is processed and also capture information on the specific tests that are done. Integration at this stage of the process is “critical,” Somiari said, “because for you to have a robust and functional bioinformatics infrastructure, you have to standardize some upstream processes.”
The five workflow systems feed into a central Oracle database, “and because they are all on a common platform, you can easily associate DNA information with RNA information, and also link that to clinical parameters,” Somiari said.
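The kind of linkage Somiari describes can be illustrated with a small sketch. It is not WRI's schema: the table names, column names, and sample rows are invented placeholders, and Python's built-in sqlite3 module stands in for the Oracle database. The only idea taken from the article is that records from separate workflows share a common specimen identifier and can therefore be joined.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical tables and toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE clinical (specimen_id TEXT PRIMARY KEY, diagnosis TEXT);
        CREATE TABLE genotype (specimen_id TEXT, marker TEXT, allele TEXT);
        CREATE TABLE expression (specimen_id TEXT, gene TEXT, level REAL);
        """
    )
    # Placeholder values for illustration only, not real data.
    conn.executemany(
        "INSERT INTO clinical VALUES (?, ?)",
        [("S001", "benign"), ("S002", "malignant")],
    )
    conn.executemany(
        "INSERT INTO genotype VALUES (?, ?, ?)",
        [("S001", "M1", "A"), ("S002", "M1", "G")],
    )
    conn.executemany(
        "INSERT INTO expression VALUES (?, ?, ?)",
        [("S001", "GENE_X", 1.2), ("S002", "GENE_X", 4.7)],
    )
    conn.commit()
    return conn


def linked_records(conn: sqlite3.Connection) -> list:
    """Join DNA, RNA, and clinical rows on the shared specimen identifier."""
    query = """
        SELECT c.specimen_id, c.diagnosis, g.marker, g.allele, e.gene, e.level
        FROM clinical AS c
        JOIN genotype AS g ON g.specimen_id = c.specimen_id
        JOIN expression AS e ON e.specimen_id = c.specimen_id
        ORDER BY c.specimen_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in linked_records(build_demo_db()):
        print(row)
```

Because every table carries the same specimen identifier, a single query returns the clinical, genotype, and expression values for each specimen side by side.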
Last month, WRI installed the final piece of its informatics infrastructure: a Teradata data warehouse that collects “sanitized” data from the Oracle database as well as from other public and proprietary sources [BioInform 09-20-03]. “This gives us tremendous data mining capabilities,” Somiari said.
Making It Work
Somiari said that WRI is in the final stages of implementing the research infrastructure. “I did not want to start generating massive amounts of data before we completed the architectural design of our data warehouse,” he said. His group is now wrapping up tests of several software tools that it is fitting into the Teradata system to ensure compatibility before the system runs at its full capacity.
WRI plans to employ around 100 researchers, and around 15 of those will be in the informatics group, Somiari said. For now, however, four full-time employees in addition to Somiari are building the informatics foundation, in collaboration with several staffers from WRI’s software partners stationed on the Windber site. While Somiari said his team does develop its own software tools when necessary, “my strategy is to get the state-of-the-art tools currently available. … So any company that has a solution that would be useful to us, we establish a relationship with that company, and we begin to tap into the benefits of using that tool.”
The massive database repository of clinical and molecular data will eventually be made publicly available via “multiple levels of access,” Somiari said. “Basic information” contained within the database will be made freely available, he said, while “refined information” that results from industry collaborations will be available under restricted terms.
The ultimate goal of the data-collection and storage process, according to Somiari, is disease modeling. While “everyone is talking about systems-level research,” he said, the first step in attaining that goal is “good information at the basic level.” By capturing information of interest to the research community — and checking in periodically with users and vendors to make sure it’s still on the right track — WRI plans to “use the best tools available today to answer the questions of today, but position ourselves to be able to respond to the needs of tomorrow,” Somiari said.