Rutgers University has enlisted Washington University and the Information Sciences Institute at the University of Southern California to help it build out the computational infrastructure for its Cell and DNA Repository, which is quickly outgrowing its current informatics underpinnings.
The university has also tapped StarLIMS to replace its in-house developed laboratory information-management system.
As part of a five-year $42.4 million grant that the National Institute of Mental Health recently awarded Rutgers to establish the Center for Genomic Studies on Mental Disorders at RUCDR, the university will award subcontracts to the Wash U School of Medicine and ISI to develop new computational technologies to support the repository.
Financial details for the subcontracts and the agreement with StarLIMS were not provided.
RUCDR’s scientific director Jay Tischfield told BioInform via e-mail that the 18,000-square-foot cell and DNA repository has outgrown its computational backbone.
“Our databases are adequate to merely store and correlate the data,” he said. “However, we are in dire need of new ways to view, parse and analyze the data.”
RUCDR was established in 1998 and has grown rapidly. In its founding year, it received 2,923 samples. By 2006, that number had grown 10-fold to 29,900. As of June 2008, the repository had processed a total of 181,523 samples and housed 128,310 cell lines.
RUCDR is the primary repository for NIMH’s Center for Genetic Studies, and also fills several other roles: It is the Genetics Repository for the National Institute of Diabetes and Digestive and Kidney Diseases; it maintains cell lines and DNA for the National Institute on Alcohol Abuse and Alcoholism Collaborative on the Genetics of Alcohol Abuse and Alcoholism; and it stores materials for organizations such as Cure Autism Now/Autism Genetic Research Exchange.
Tischfield said that his team collects biological samples, usually blood, which are processed to provide cell lines, DNA, and RNA for scientists doing genome-wide association studies and other research geared toward understanding complex diseases.
“We have huge databases of biological material, for example DNA, that we dispense to researchers throughout the world, and databases of clinical, phenotypic, traits on all of our subjects that we also share,” he said. The center has also begun collecting large amounts of genotypic data, “as many as a million data points per individual.” To date, over 300 research groups have submitted samples to the repository, he said.
Rutgers will work with Wash U biostatistician John Rice on new data-analysis methods for collecting, curating, and analyzing phenotypic, clinical, and genetic data, Tischfield said.
ISI, meanwhile, will be involved in several aspects of the project, including visualization methods, data integration, and computational infrastructure.
Beyond Home-Grown
One of the first pieces of the new informatics foundation will be a new LIMS. RUCDR originally developed its own LIMS, built around an SQL database, to track samples and the analysis done on those samples.
Over the course of the next three to six months the repository will switch from its home-grown system to StarLIMS version 10 to manage operations, Tischfield said.
“Clearly, we now need the advanced capabilities of StarLIMS as we have outgrown our current system with its more limited capabilities,” he said. “Aside from being directly connected to all of our analytical instruments, it will allow our clients to have a real-time, read-only view of our daily acquisitions to their collections, [and] provide delivery notification, etc.”
The project is “very exciting for us,” Ed Krasovec, director of clinical operations for StarLIMS, told BioInform. The firm is currently deploying the software platform at Rutgers “with meetings, on-site visits that define how we need to configure our system,” he said. The team is tuning the off-the-shelf software platform to meet the requirements of local researchers and administrators, and adapting it to on-site instruments and specific types of lab procedures.
Krasovec said he expects the deployment to run through mid-2009.
He said that the plan is not to replace the current system in all of the repository’s functional areas. “In some cases we’re replacing the capability of their existing homegrown system; in other cases they were using spreadsheets and paper forms, [so] we’re replacing that; and in some cases we are integrating with existing infrastructure that they have.”
While StarLIMS can run on an SQL server database, he said, it has not yet been decided whether that database will remain in place.
Once the StarLIMS system is established, it will capture data on specimens, along with instructions about preparation and processing intended for the material. The LIMS will also indicate the quality checks performed on the material, and store its exact location in the repository.
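To make that description concrete, here is a minimal Python sketch of the kind of record such a system might keep for each specimen: what the material is, how it should be processed, which quality checks were run, and where it is stored. It is an illustration only, not the StarLIMS data model; every class name, field name, and value below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class StorageLocation:
    """Physical position of a specimen (all field names are hypothetical)."""
    freezer: str
    rack: int
    box: int
    position: str


@dataclass(frozen=True)
class QualityCheck:
    """One quality check performed on a specimen."""
    name: str
    passed: bool


@dataclass
class SpecimenRecord:
    """Illustrative specimen record; not the actual StarLIMS schema."""
    specimen_id: str
    material: str                       # e.g. "blood", "DNA", "cell line"
    processing_instructions: List[str]  # preparation steps intended for the material
    location: StorageLocation
    quality_checks: List[QualityCheck] = field(default_factory=list)

    def record_check(self, name: str, passed: bool) -> None:
        """Append the outcome of a quality check to the record."""
        self.quality_checks.append(QualityCheck(name, passed))

    def passed_all_checks(self) -> bool:
        """True only if at least one check was recorded and none failed."""
        return bool(self.quality_checks) and all(
            check.passed for check in self.quality_checks
        )


# Made-up example values, for illustration only.
example = SpecimenRecord(
    specimen_id="EXAMPLE-0001",
    material="blood",
    processing_instructions=["extract DNA", "establish cell line"],
    location=StorageLocation(freezer="F-01", rack=3, box=12, position="B4"),
)
example.record_check("DNA concentration measured", passed=True)
```

Keeping the location and the quality checks on the same record is what would let operations staff answer "where is it, and is it fit to ship?" in a single lookup.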
“Internal to Rutgers on the operations side, [the system provides] visibility,” he said.
Early Genomic Forays
ISI will be involved in many facets of creating RUCDR’s new computational infrastructure. A unit of the university’s Viterbi School of Engineering, ISI is based in Marina del Rey, Calif., and Arlington, Va. It delivers computer services in many areas such as developing novel processor architectures, intelligent systems in robotics, knowledge representation, data-mining, and distributed systems for military applications such as cyber-security.
ISI was part of a previous Rutgers grant for a genome-wide association study data coordinating center. “Between these two awards, it is for us at ISI our first foray into genomics,” Yigal Arens, director of ISI’s Intelligent Systems Division, told BioInform in an e-mail. In addition, he said, a separate group at ISI’s new Center for Health Informatics has begun taking on genomics projects.
Arens’s division has 140 researchers devoted to various areas such as large-scale information gathering and integration. For RUCDR, he said, his team “will take advantage of our experience working with other scientific communities, including astronomers and earth scientists, to provide users of RUCDR with a set of tools that will enable more natural and transparent access to the center’s and other researchers’ data and resources.”
As the magnitude and heterogeneity of data collections grow, he said, “it now appears that without the application of the most sophisticated computational techniques we are facing the danger that future integrated repositories will overwhelm scientists that need to use them with their size and complexity.”
José-Luis Ambite, senior research scientist in ISI’s Information Integration Research Group, explained to BioInform in an e-mail that he wants to integrate RUCDR’s heterogeneous data and find ways to enhance access to the data.
“Our plan is to create a unified view of all [the] center’s data and selected external resources and also provide an expressive query interface over these data,” he said.
The details of the collaboration are still being sorted out, but Ambite said he expects ISI’s 15 years of experience in information integration will come into play. The institute’s research interests in this area include ontology-based integration of heterogeneous databases and web sources; machine-learning techniques to extract data from semi-structured web sources; integration of different types of data including statistical, biological, and geospatial data; and record- and entity-linkage techniques, which recognize when objects from different sources represent the same entity and integrate that information.
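As a rough illustration of the last item, record linkage, the Python sketch below matches sample identifiers that are formatted differently in two sources and merges the matching records. It is not ISI's method: the normalization rule, the 0.9 similarity threshold, the field names, and the example records are all assumptions made for this sketch, and production linkage systems typically rely on far richer evidence than one identifier.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List

Record = Dict[str, str]


def normalize(identifier: str) -> str:
    """Lowercase the identifier and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", identifier.lower())


def same_entity(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two identifiers as one entity if their normalized forms are
    at least `threshold` similar (the threshold is an arbitrary assumption)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def link_records(source_a: List[Record], source_b: List[Record],
                 key: str = "sample_id") -> List[Record]:
    """Merge each record in source_a with its first match in source_b.
    On conflicting fields, the value from source_a wins."""
    merged = []
    for rec_a in source_a:
        match = next(
            (rec_b for rec_b in source_b if same_entity(rec_a[key], rec_b[key])),
            None,
        )
        if match is not None:
            merged.append({**match, **rec_a})
    return merged


# Made-up records standing in for two differently formatted sources.
repository = [{"sample_id": "Sample 00123-A", "material": "DNA"}]
clinical = [
    {"sample_id": "sample 00999-B", "phenotype": "other-trait"},
    {"sample_id": "sample_00123a", "phenotype": "example-trait"},
]
linked = link_records(repository, clinical)
```

Here the two spellings of sample 00123-A normalize to the same string and are merged, while the unrelated 00999-B record falls below the threshold and is left out.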
“We will evaluate and use open-source and commercial tools in the center,” he said. Those tools include OGSA-DAI (Open Grid Services Architecture Data Access and Integration), a middleware platform that allows data resources, for example an XML database, to be accessed through a web service; OGSA-DQP, the OGSA distributed query system; and data federation systems such as IBM’s Information Server.
See It, Grid It
ISI systems programmer Marcus Thiebaux will be tackling RUCDR’s visualization challenges. In his view, genomics and the study of biological pathways “present a special set of otherwise well-understood challenges for visual organization and interaction with complex, structured data.”
Thiebaux said he sees a “tremendous need for specialized visual tools,” as well as toolkits for building and adaptively enhancing tools, to manage and explore this kind of data complexity. Currently, he said, “there is no generic modus operandi to apply visual methods to such a problem space.”
“Our initial pilot efforts [at RUCDR] will inform the development of a more complete extensible toolset in following years,” he said.
A “visualization evaluation” at RUCDR will set out to identify where and what kind of new tools are needed. “Ultimately we will work toward a unified system that will cross-reference data across domains in genomics, proteomics, neuroanatomy, and molecular and cellular neuroscience using a single platform,” he said.
“Precisely what will be built will depend on the results of our requirements queries and the types of heuristic rules and principles that are discovered,” he said.
Thiebaux said it’s likely that currently available visualization tools will not be up to the task. “While there are widely available and established open source tools for interactive dataset visualization, such as Kitware's Visualization ToolKit, these are simply low-level tools with which to build specific applications,” he said.
On the other hand, “existing higher-level tools are likely too narrow in scope, and less flexible for adapting to the dynamic needs of genomic research support, as we expect will be required.”
The plan, he said, is to interact directly with genomics researchers to determine what best serves their needs and place that in the context of the data grid and data integration.
Ewa Deelman is ISI’s project leader in the advanced systems division and will be leading the effort to apply grid technologies to RUCDR. “We will rely on the computational resources existing today at Rutgers, Wash U, and USC and tie them together into a larger virtual resource,” she told BioInform in an e-mail.
Computation for RUCDR will be enabled at each of these sites, and access to computational resources across the network will be provided through grid software such as Condor, developed at the University of Wisconsin, and tools from the public-private grid-development community Globus, she said.
Each of the institutions involved in the project has computational clusters in place, she said. “If we need more resources to support computation … we will apply for computational cycles on the TeraGrid.”
Data storage is not as much of an issue for RUCDR as “the ability to find, interpret, and analyze the data,” she said. “In this project, my team will provide capabilities for researchers to conduct complex analyses on the data maintained at the center.”
“Initially, these will be analyses developed at the center, but as the project progresses we hope to engage the community into contributing the analyses they want to perform,” Deelman said. The analyses, she explained, will be supported by a workflow management system called Pegasus, which was developed at ISI in collaboration with research groups around the world.
“Pegasus allows scientists to describe their application in terms of individual computational steps, the input data they take and the output data they generate. The order of the steps in the workflow can also be specified,” Deelman said.
The platform honors original codes, she said, so users do not need to re-code steps in another language. “Pegasus can help individual scientists by automating the computational tasks and by keeping track of what computation[s] have been done and what data was used,” she said.
Pegasus has already been deployed in large-scale research projects, for example helping geo-scientists create “shake maps” to estimate possible seismic shaking at a given location and develop building codes based on these calculations. “A single such workflow-based simulation for an individual point on the map is composed of approximately 800,000 computational tasks and processes as many data files,” she said.
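The idea Deelman describes, declaring steps by their inputs and outputs and letting the system work out the order and keep a record of what ran, can be sketched in a few lines of Python. This is not the Pegasus API; the `Step` class, the function names, and the file names are hypothetical, and the sketch only plans an order and records provenance without executing anything. It assumes Python 3.9 or later for `graphlib`.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    """One computational step, described only by the files it reads and writes."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def execution_order(steps: List[Step]) -> List[str]:
    """Derive a valid run order: a step depends on whichever step produces
    one of its inputs. Inputs that no step produces are treated as raw data."""
    producer: Dict[str, str] = {
        out: step.name for step in steps for out in step.outputs
    }
    graph = {
        step.name: {producer[i] for i in step.inputs if i in producer}
        for step in steps
    }
    return list(TopologicalSorter(graph).static_order())


def provenance(steps: List[Step]) -> List[Tuple[str, Tuple[str, ...], Tuple[str, ...]]]:
    """List (step, inputs, outputs) in run order, i.e. what was computed from what."""
    by_name = {step.name: step for step in steps}
    return [
        (name, by_name[name].inputs, by_name[name].outputs)
        for name in execution_order(steps)
    ]


# Hypothetical three-step analysis, deliberately listed out of order.
workflow = [
    Step("association_test", ("clean.tsv", "phenotypes.tsv"), ("results.tsv",)),
    Step("call_genotypes", ("raw_intensities.csv",), ("genotypes.tsv",)),
    Step("qc_filter", ("genotypes.tsv",), ("clean.tsv",)),
]
order = execution_order(workflow)
log = provenance(workflow)
```

Because each step is described only by the files it reads and writes, the program behind a step can stay in whatever language it was originally written in, which is the property Deelman highlights.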
In genomics, Pegasus is being applied to epigenomics projects at USC where Ben Berman and his colleagues at the Keck School of Medicine are mapping epigenetic states on a genome-wide scale.
It has not yet been established to what degree Pegasus and StarLIMS “will need to interact,” said StarLIMS’s Krasovec.
In terms of other software, Deelman said that for both the grid and workflow technologies the plan is to use open source tools. “We will also use open source collaborative tools such as wikis … to enable sharing of ideas, information, data, and computational methods.”
Even after all the technology is developed and deployed, the computational structure will need to grow in many ways. “It is likely that the repository will always be a work in progress,” RUCDR’s Tischfield said. As the scientists bring in new materials to expand existing collections and extend their work to additional diseases, he said, it is “unlikely that any of these disorders has a single cause,” adding that tools to analyze data in the context of multi-gene involvement, novel mutations, and environmental influences will be necessary.
“We are also expanding our studies to include clinical pharmacogenetics,” he said. “We anticipate that this will continue and that the center itself will play an expanding role in the ongoing research.”