Data warehouse architect, bioinformatics group
University of North Carolina's Lineberger Comprehensive Cancer Center
Across the country, designated comprehensive cancer centers are integrating research data with clinical-care records and studies. One of them is the University of North Carolina’s Lineberger Comprehensive Cancer Center.
As a public center, the Lineberger doesn’t have large resources to accomplish data warehouse tasks needed to enhance cross-silo research. So to set up a data warehouse that can integrate a wide variety of data and data types generated by researchers, David Jordan, a data warehouse architect in the bioinformatics group at the Lineberger, created an open-source tool.
Jordan has been working on the data warehouse project with between one and three people and said he believes the first data-integration tasks can be accomplished over the summer giving the researchers a first look at their emerging warehouse.
Jordan has been designing database schemata for 25 years, and was formerly at Bell Labs, where he was appointed a distinguished member of the technical staff for his work in object modeling and database design.
BioInform spoke with Jordan this week about the challenges of integrating and warehousing data on a shoestring budget. Below is an edited transcript of that interview.
What kind of data are you integrating?
Today, breast or lung cancer data go into separate databases, tissue samples, and blood work on patients go yet into other databases. There are lots of silos that are not linked but researchers need to look across those databases. … There is also a data warehouse project at the hospital. Probably some time next year, when both our warehouse and theirs are completed, we will be implementing patient privacy and HIPAA [Health Insurance Portability and Accountability Act] regulations to allow both sides to access information from the other.
We are still in development. One of the databases researchers are anxious to have access to is the tissue banking database. They want, for example, to ask questions about breast cancer patients, look at characteristics of their tissue, see how those might relate to race or stage of cancer … the goal is to find patterns and correlations.
At first we are integrating data from the breast cancer and a lung cancer database, importing a tumor registry, which [holds] all the different tumor types. Once this first release is done, we will add other cancer types.
[We are also working with data] from microarray studies on patient tissues. We won’t be storing sequence information but rather [recording] which experiments were run, the slides for that patient, any analysis and results from that microarray research.
How did you select the software tool you needed?
There are three or four commercial vendors who are the leading players, and a local company. [Jordan declined to name the company’s names]. … I looked at the technical literature of the vendors and started discussing the cost with them. Basically you can’t buy a system for less than $200,000. Management had not originally budgeted for the tools, figuring it would be a relatively minor cost. That was much more expensive than they anticipated.
Did you consider evaluations?
Some of [the vendors] wouldn’t even let us do an [evaluation]. One of them we brought in and had a number of problems with [it]. While I was trying to get it to work and searching for an alternative, that is when I found Pentaho.
Then there was a third vendor who said they could give us a 30-day evaluation [period], saying it would cost $15,000. That is the nice thing about Pentaho’s open source implementation – you can just download it off the web and try it with no upfront commitment.
How about support?
They need to beef up their documentation in places. That’s why we found support helpful. In some cases there’s no documentation, there are placeholders, but they haven’t written it yet. In other cases they have material but it is also lacking and they need to add to it. That is where the paid support helps. … It is fairly new technology, they started it in 2004, whereas most of the data warehouse products were out before 2000.
We’ve hit enough bugs, had enough issues, without the support we would have had troubles. People have said to me that they’ve used some of the commercial vendor implementation and their comments were “you pay a lot more for their products, but theirs aren’t bug-free either.”
What do you think of Pentaho?
I like that it’s modular, you can just use the products you need. Pentaho took tools that were developed fairly independent of Pentaho and decided to integrate them, they have been building that up over the last few years.
What query language does it use?
You can use either SQL (Structured Query Language) or there’s a standard query language called MDX … most vendors are supporting MDX.
Some tools, like the ETL tool that you use for transferring data from one system to another, is not working off a dimensional model, so you do not see dimensional modeling concepts in the tool. But when you are designing your data warehouse schema, that relational schema needs to be able to represent the dimensional model. So it is kind of implied but it is not very explicit that it’s a dimensional model.
[Pentaho] has been incorporating third party tools and most tools do support SQL because that has been a standard for several decades. MDX is fairly new.
Some tools that were being used outside the context of data warehouse technology wouldn’t have by default supported MDX.
As Pentaho acquires a tool, they do some integration and then they release their product. As each release goes by, where they need to provide better integration, for example if MDX is valuable for a particular tool, they add it.
Mondrian is Pentaho’s Java server program that implements the MDX query language. There’s a file where you set up a definition of the dimensional data model and [define] how that maps to the underlying relational database schema. The MDX query is based on a dimensional model. In memory it evaluates MDX queries, takes MDX input, translates it to SQL, and then runs SQL against the database but then returns the data back to the application in the dimensional form.
We are not using [Mondrian] yet [because] there was a bug such that you couldn’t share a dimension definition across what are called cubes in data warehouse terminology. You had to essentially replicate the definition of these dimensions, and that would have been a real maintenance nightmare for us.
We found that bug in September, they fixed in March, but to use Mondrian and set up the dimensional view of the data, you only do that once you have data in your data warehouse and when you want to look at it from a dimensional perspective.
We are going to be busy through the end of July with writing the ETL logic to transfer the data from the lung, breast, and tissue databases into the data warehouse. So we won’t even be ready to start using Mondrian until August, September at the earliest.
How well do the vocabularies from your different databases fit with each other?
A vocabulary is a list of data values, typically strings that are used to describe some aspect of the cancer, an enumerated list of strings with different values that someone who isn’t even familiar with cancer can understand, for example different races: Caucasian, African American, etc. … When you migrate from multiple source systems into the data warehouse, [one issue is] these vocabularies have not in the past been centrally managed and coordinated so you end up [with a situation in which] one system uses values like White and Black, whereas in another they use Caucasian and African-American. So when you move that data into the data warehouse … you need to take the different vocabularies from the different systems and map them to one consistent set in the data warehouse.
We developed a tool that defines the mapping. I was able to figure them out by myself, but with one of the next data sources I will need to work with an oncologist.
Where does data mining come in?
That’s [a module called] Weka, it was developed completely independent of Pentaho, it’s existed since the early ‘90s. First it was written in C++ and then they re-wrote it in Java and now, like the other tools, they are trying to integrate it into the platform better.
It’s formally called a machine-learning toolkit, where the software analyzes the data and looks for patterns, comes up with ways of organizing the data to show different perspectives. … I haven’t used it yet, it does forms of data analysis that the cancer researchers have never used before. I think it’s got potential to help them discover patterns in their data that that they didn’t realize was there.
How will your data warehouse interact with the cancer Biomedical Informatics Grid, caBIG?
We have to have a server on our side that allows people to connect into our data and lets us to go out to the caBIG universe on the Web. [As for] data that we would want to make available on the caBIG network, we will have to map that information from our models, which are different from caBIG’s and map them onto caBIG representation.
Are there any tools lacking in Pentaho?
We developed a tool to map vocabularies. I think they should support something like that. The ETL tool is very robust.
How easy will it be to generate reports using Pentaho?
A reporting tool they released this spring is driven off metadata, we have only started using it. As opposed to a developer building up a report that has a bunch of logic in it which a developer would need to write, [Pentho’s reporting] allows the end user, who wants to look at data in the data warehouse, see a description of the data from a high level view of the model. They can point and click with the mouse and say: “I want this piece of data, this piece, and this piece over here.” It allows them, using a graphical user interface, to quickly select what pieces of data they care about.
The metadata that the developer sets up ahead of time defines how you associate and join together these different pieces of information. All the user has to do is say “I want this, this and this,” and then it on the fly creates that report.
So overall you are happy with your choice?
Yes. The nice thing is that you can try it with no upfront cost. You have the choice of paying for support or not. You can examine the source code if you want to. If there is a problem that has got you completely at a standstill, you can go in, look at the source and figure it out. My policy is that they should fix it, so we have never gone in and looked at their source. … With commercial vendors you could never [do that]. They don’t share the source code for their systems.
There is no cost of entry to start using it….It’s mainly a time investment … a lot of time is taken up with setting up the schema that is independent of using the tool. We could switch to anther vendor but we would lose some work. With commercial vendors, not only do you have to pay a lot upfront but their ongoing support and maintenance costs are much higher than what we are paying for Pentaho’s support.
[Pentaho is available here.]