The National Cancer Institute's Early Detection Research Network, established in 2000 to identify and validate cancer biomarkers for early cancer detection, has enlisted an unlikely partner to help its investigators share data across many research institutes: NASA.
The space agency's Jet Propulsion Lab originally developed a software framework called the Object Oriented Data Technology (OODT) to help distribute planetary data across geographically distributed heterogeneous repositories. Recognizing that the ultimate goal of EDRN's integration effort was very similar, NCI named JPL's Daniel Crichton a principal investigator in 2001 and placed him in charge of the network's informatics infrastructure.
BioInform caught up with Crichton by phone this week to learn about the OODT framework and how the JPL team is applying it to cancer research.
How does the OODT technology work? It seems to be similar to some bioinformatics integration systems that people are building using web services and semantic web technology.
What we've done is develop it as an architectural framework for connecting systems together. So it's not so much competing with the space of semantic web technologies or web services or things like that, but rather providing the architectural framework for integrating information that sits on top of those kinds of frameworks. So it's actually able to leverage semantic web technologies, things like RDF, for being able to describe information and being able to plug those into semantic search engines.
It's able to use the web service standards for interfaces, but what we've done is tried to decompose data-oriented or science-oriented systems into a conceptual data architecture that allows us to define data in a certain way so we can find data, access it, and make it sharable with the community.
Can you go over what this entails?
We deploy a very simple set of services at a site that allows us to plug into their existing systems to make data accessible. Measures are taken for security purposes, so we do certainly address that, but we are able to extend the core software framework to plug into a heterogeneous environment. So rather than having to go through and recode or rebuild or reimplement existing systems, we're able to actually link in existing systems.
The software comes as a core set of software with services that can then be extended and tailored for a particular deployment.
In planetary science we have a similar model to what's going on in EDRN. The deployment just tends to look different in planetary science versus cancer research.
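The plug-in approach described above, in which a small service at each site wraps an existing system instead of replacing it, might be sketched roughly as follows. This is only an illustration and not OODT's actual API: the site data, field names, and function names are all hypothetical.

```python
from typing import Callable, Dict, List

# Each site keeps its own existing store and its own field names (hypothetical data).
SITE_A_ROWS = [{"id": "A-1", "organ": "lung"}, {"id": "A-2", "organ": "colon"}]
SITE_B_ROWS = [{"SpecimenNo": "B-7", "Site": "lung"}]

# A site adapter maps a common query onto the local system and
# returns results in a shared format.
Adapter = Callable[[str], List[Dict[str, str]]]


def site_a_adapter(organ: str) -> List[Dict[str, str]]:
    """Wrap site A's existing records without changing them."""
    return [{"specimen_id": r["id"], "organ": r["organ"]}
            for r in SITE_A_ROWS if r["organ"] == organ]


def site_b_adapter(organ: str) -> List[Dict[str, str]]:
    """Wrap site B's records, which use different field names."""
    return [{"specimen_id": r["SpecimenNo"], "organ": r["Site"]}
            for r in SITE_B_ROWS if r["Site"] == organ]


def federated_search(organ: str, adapters: List[Adapter]) -> List[Dict[str, str]]:
    """Ask every linked site in turn and merge the results."""
    results: List[Dict[str, str]] = []
    for adapter in adapters:
        results.extend(adapter(organ))
    return results
```

Calling `federated_search("lung", [site_a_adapter, site_b_adapter])` would return one matching specimen from each site in the shared format, while neither site's underlying records are touched.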
So how did a system that was developed for planetary science get brought into the EDRN in the first place?
If you look at it, the kinds of things that we do for [planetary] science are fairly similar. We want to be able to take an observation of something, we want to be able to acquire that observation, we want to be able to do processing on it, analyze it, and distribute it to the community. And that's almost an identical pattern in planetary science as in cancer research. We acquire and have to manage samples on both sides as well. We have soil samples versus human tissue.
And then in both of these kinds of environments, the entire definition of information is based on defining a core set of metadata for describing the kinds of data that we're managing. So the software is able to leverage the metadata from each domain and dynamically configure itself to be able to describe the information and build the necessary services to share the data.
So the power of the architecture is in defining the metadata for the particular domain. So within EDRN, we've defined a set of data elements, metadata elements, for describing the kind of data that we are managing.
Is that the NCI's Common Data Elements?
EDRN started developing these several years ago, about five or six years ago, and has shared them and is participating with the NCI as well. So EDRN has a specific set that they are using for describing things like biospecimens. But it's also been participating and sharing those with the larger caBIG program.
How did this collaboration come about? Did the EDRN folks approach NASA?
In some ways they did. We were presenting a paper at the National Academy [of Sciences] back in 2000 on doing this same kind of thing for NASA for planetary science research, and we were approached by someone at the [NCI] Office of the Director who was interested in establishing a collaboration with NASA. So we did a pilot collaboration with the Office of the Director, which then led to a larger collaboration with NCI.
What is the current status of the EDRN implementation?
It's up and running. The first specific project we focused on was sharing biospecimens, and we have nine sites that are linked right now sharing biospecimens, and we are on our way to connect 15 total within the next year. In addition, we're starting to work on the capture and management of the science data that's being produced through validation studies that are looking for biomarkers.
The focus for EDRN is on trying to identify and classify biomarkers as potential markers for disease. So they are looking at various validation studies and the application of various technologies for researching, discovering, and validating those markers.
As part of that, they are trying to establish informatics tools and infrastructure for supporting that collaboration and capturing information related to those markers, and then providing what we call a knowledge infrastructure for searching for all the information that's been acquired as part of that biomarker study.
So we expect to have data distributed and available from several different sites that would provide information about various biomarker studies and all the associated information: the data, the protocol information, the specimen information, the biomarkers themselves, and so forth.
What's the timeline for that?
We are looking to deliver a first prototype that demonstrates the integration of all this information this summer.
In your experience so far, both in the biospecimen area and the biomarker area, how much have you had to change or modify the OODT technology to accommodate cancer research data?
At the core software level, we haven't made any changes at all. The work that we do tends to be more in defining a data model for life sciences, following practices that we've done, certainly with planetary science, but optimizing and defining a specific model that is appropriate for biomarker research. So we are defining things like what does a specimen look like, and what are its attributes, and how does a biospecimen relate to a biomarker, and how do biomarkers relate to a protocol, and those kinds of things.
It seems like a lot of that would still be unknown.
That's right, but from a perspective on supporting the science research, we can define sufficient attributes to capture the information, even though a lot of the discovery process is still going on.
It's an ongoing dynamic methodology, because we've got to come back and update the model as we discover new attributes of things that are critical to capture. One of the fundamental points of the software was to separate the technology itself, the software, from the data model because we knew the data model would continue to evolve at a very different pace from the software. So that was a fundamental principle that we put into the software from the very beginning. And one of the ways we were actually able to take it directly into NCI was that we had thought about that up front, and said, 'Hmm, if we separate the data model from the software, we can use this for earth science as we do for planetary science, or for biomedical research or research in other areas.'
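The separation of data model from software described here can be illustrated with a minimal sketch, assuming a model is just a list of metadata element definitions handed to generic code. The element names and the `describe` function are hypothetical and are not taken from OODT or from EDRN's actual data elements.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Element:
    """One metadata element in a domain data model (hypothetical shape)."""
    name: str
    required: bool = False


# Domain models are plain data; the element names are illustrative only.
BIOSPECIMEN_MODEL = [
    Element("specimen_id", required=True),
    Element("tissue_type", required=True),
    Element("protocol_id"),
]
PLANETARY_MODEL = [
    Element("observation_id", required=True),
    Element("instrument", required=True),
]


def describe(record: Dict[str, str], model: List[Element]) -> Dict[str, str]:
    """Generic code: keep only the fields the model defines and
    raise if a required element is missing."""
    missing = [e.name for e in model if e.required and e.name not in record]
    if missing:
        raise ValueError(f"missing required elements: {missing}")
    return {e.name: record[e.name] for e in model if e.name in record}
```

In this sketch the same `describe` function serves both domains, and adding a newly discovered attribute means appending an `Element` to the model, not changing the code.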
Do you have interest or have you seen interest from external parties in extending OODT to other life science research projects?
Lots of different folks have an interest in looking at the software. It's been released as open source, and we have a number of projects working with the medical science community, trying, for example, to connect and share data for critically ill children for pediatric care.
So if someone were to download this software, would they be able to easily install it and run it, or does it require a lot of customization?
They can download the software (it's available), but it requires some informatics expertise to really understand how to build a system from it. What we provide for EDRN, for example, and for some other projects, is they're able to download the software and install it, but they need to be able to connect it into an existing system or community with which they can share data. And that's really the role that JPL has played, is building the architecture for EDRN to get all this to work together.
How many people at JPL are working on this project?
There are about four to five people working on this, and several of them are working on other projects as well. It fits nicely into us being able to work and apply the same principles across projects, so we're doing a lot of leveraging.
Can you discuss any other future development plans for EDRN besides the biomarker validation piece?
The other key plan is to take all of this and fit it into what's known as the EDRN Knowledge Environment, and that is really going to provide a one-stop shop for researchers to come in and get to all the information that EDRN is producing.
The goal for the long term is to provide that as much as we can as a public resource.
So it would be a repository for all the data that EDRN generates?
That's right, but the plan is that a lot of that data will be distributed, still held at individual institutions, so we'll need to be able to tie that all together.
Are you coming up against privacy or security issues in this project that you might not encounter when dealing with planetary data?
We are, and one of them is biospecimens. We've had to go through a process with each of the institutions' Institutional Review Boards and get approval for sharing the data, and that's certainly an extra hurdle for us to get over. And that can delay some of the deployment progress that we've had. We've often run into the fact that we could move faster, but that is something that definitely has to be done.
So we've done that, and then as well, we've ensured that the interfaces between sharing data are all encrypted.
So users have to log in to access the data?
Right now there is definitely a login, so one of the things we're looking at in the longer term is how do we better open this up as more of a public resource for researchers without compromising the security of the information? So we're working on an update to our IRB protocol right now to make it more widely available and less of a resource for just EDRN researchers.
Is this something that will become part of the broader caBIG framework?
We see it as something that can be interoperable with the caBIG framework, so being able to plug in and share data and so forth. We've already done an initial pilot project with caBIG where we've been able to take one of their products called caTissue, which is a tissue tracking system, and demonstrate that it can be plugged into ERNE [EDRN Resource Network Exchange], the distributed specimen system. So we already know we can share data and we can get these systems to interoperate.