Martin Leach is executive director of basic research and biomarker information technology for Merck Research Labs, where he leads IT, informatics, and data-mining support for the company's basic research division, which encompasses drug discovery through Phase I research.
He works to integrate data capture, integration, and analysis across basic research, and to build out the IT and informatics infrastructure to enable external basic research and translational research.
Before joining Merck in 2007, Leach was a consultant with Booz Allen Hamilton with a focus on the pharmaceutical, biotechnology, and life-science fields. He has also served as vice president of informatics at CuraGen, and has consulted for IT, informatics, and research organizations.
Leach holds a PhD in pharmacology from Boston University School of Medicine and a BS in cellular and molecular sciences from Anglia Polytechnic University in the UK.
BioInform spoke with Leach about his thoughts on second-generation DNA sequencing, cloud computing, and pharma-IT standards. Below is an edited version of that conversation.
What in bioinformatics has changed in the last few years?
Everyone says data is growing exponentially. It is. At Merck we've moved from terabytes to petabytes; we're living in the petabyte realm. We haven't gotten into the brontobyte [era] yet, which is the [next-next-next] next tier. I am not sure when we will get there, but brontobytes are around the corner.
Across Merck we have about 1.5 to 2 petabytes. Merck Research Labs is over 1 petabyte. [Driving this demand is] disruptive technology such as the next-generation sequencing boxes, high-throughput screening platforms for chemistry and biologic assays, an increased use of clinical imaging, [and] high-content screens. Every time you do a high-content screen you have a couple of megabytes of TIFF images. It is exploding, and for us it is growing at least by 30 to 40 percent per year.
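As a rough illustration of that growth rate, a quick compounding calculation (starting from the roughly 2 petabytes cited above; the five-year horizon is arbitrary) shows how fast 30 to 40 percent per year adds up:

```python
# Back-of-envelope projection of storage growth compounding annually.
# The 2 PB starting point is the figure from the interview; the
# growth rates bracket the 30-40 percent range mentioned.
def project_storage(start_pb, annual_growth, years):
    """Return projected storage (in PB) for each year, compounding annually."""
    sizes = [start_pb]
    for _ in range(years):
        sizes.append(sizes[-1] * (1 + annual_growth))
    return sizes

low = project_storage(2.0, 0.30, 5)
high = project_storage(2.0, 0.40, 5)
print(f"In 5 years: {low[-1]:.1f} PB at 30%, {high[-1]:.1f} PB at 40%")
```

At those rates, 2 petabytes becomes roughly 7 to 11 petabytes within five years.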
Scientists get their first R01 [grant], they get their money, they go and get a next-generation sequencing box, and they've now got a couple of hundred terabytes of data to deal with. You can't just go to your local consumer electronics retailer and buy these things. You can't use commodity IT solutions for this realm we live in today. On top of that you can't use commodity networking.
Let's say you have an Illumina [Genome Analyzer]. It dumps out 4 terabytes. How do you move that around rapidly on your network or across sites? How do you back it up? Do you back it up? A lot of work has to focus on policies and management of how to do this before you have generated a single byte of data. Because once you pull the trigger it is a flood of information, a tsunami; [think of] that now-classic photograph of the Japanese tsunami.
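To put the 4-terabyte figure in perspective, here is a back-of-envelope sketch of ideal transfer times over common network links (assuming sustained line-rate throughput, which real networks rarely deliver):

```python
# Rough transfer-time estimate for a 4 TB instrument run.
# Assumes ideal sustained throughput with no protocol overhead;
# real-world transfers are meaningfully slower.
def transfer_hours(size_tb, link_gbps):
    """Hours to move size_tb (decimal terabytes) over a link_gbps link at line rate."""
    bits = size_tb * 1e12 * 8            # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9)   # gigabits per second -> bits per second
    return seconds / 3600

for gbps in (1, 10):
    print(f"4 TB over {gbps} Gb/s: {transfer_hours(4, gbps):.1f} hours")
```

Even at a perfect 1 Gb/s, a single 4 TB run takes the better part of a working day to move, which is why the policy questions have to come before the first byte.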
Do scientists need to include thoughts about data management as they think about new experiments and instruments?
Everyone tries to do annual planning processes. They work to a limited degree. [However,] the technologies nowadays [are] disruptive and they can very easily disrupt plans.
What I think people are now beginning to see are flexible computing environments, whether it's HP's flexible computing environment, where you say, 'I need a couple of terabytes and a couple hundred CPUs to do "a, b, c, d, e,"' and you get an e-mail the next day with a link to a server that [contains what you requested]. It might be Amazon and cloud computing, or IBM, or Microsoft. To manage the peaks, we really need to leverage flexible computing and flexible application environments.
You very rapidly get into hard-core iron, big boxes of IT. Labs shouldn't be housing those. Server rooms run very warm, so you need hard-core HVAC [and other technologies] to manage the environment and support the data flow.
Then issues such as cooling and electricity bills come to the forefront?
And green initiatives. You have to do it as green as possible. Do we use water-cooled approaches or do we use HVAC? There are so many different ways.
With cloud computing, are you concerned at all about security?
Yes. There are a number of challenges. One challenge is around security. We will not put the three-dimensional coordinates of a chemical structure out on the cloud and then run some compute on the cloud. If anyone saw those chemical structures, they would be seeing Merck's proprietary information.
If [we] are going to leverage the cloud in many more ways, we have to figure out as a community how to obfuscate or transform that information, or encrypt and decrypt to do these computations where you could never deconvolve it without the key.
If you take [the] mathematical matrix of a chemical, how do you apply a molecular-dynamics calculation that's based on standard xyz coordinates? We need a change in the algorithms to support some of these obfuscations of information if we are going to throw it out [to the cloud].
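As a toy sketch of the idea, and emphatically not real cryptography: a key-derived rigid-body transform changes the coordinates an outside party sees while preserving the interatomic distances that distance-based calculations depend on. The coordinates and key below are invented for illustration:

```python
import math
import random

def rigid_transform(coords, key):
    """Apply a key-derived rotation about z plus a translation to xyz coords.
    Toy illustration only: a rigid transform hides absolute positions but
    NOT the distance geometry, so it is no real protection for a structure."""
    rng = random.Random(key)
    theta = rng.uniform(0, 2 * math.pi)
    tx, ty, tz = (rng.uniform(-100, 100) for _ in range(3))
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz)
            for x, y, z in coords]

def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical three-atom fragment (not a real structure)
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.8)]
masked = rigid_transform(atoms, key=42)

# Pairwise distances survive the transform, so distance-based
# calculations could still run on the masked coordinates.
print(dist(atoms[0], atoms[1]), dist(masked[0], masked[1]))
```

The point Leach raises is exactly that this is not enough: genuinely protecting structures on someone else's hardware needs algorithms redesigned to run on encrypted or irreversibly transformed inputs.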
Another thing that has been a challenge with this sort of flexible computing is movement of data. If I want to do some large compute against a database, I have got to move the database onto the cloud.
Amazon is addressing that, by putting key databases on their cloud environment. That's at least a step in the right direction. But there is [also] the networking issue and getting something onto the cloud. I think they are addressing that somewhat.
The analogy was that cloud computing would be like the electrical outlet you plug into, like getting electricity in a building. But at the moment it is more like having to deliver my problem to the generating station.

If I can just plug into that electrical outlet, without sending my machine or my problem off to be computed at the generating station, then cloud computing is delivering on its promise.
How would you prioritize standardization, data formats, and metadata capture? Are these just big pharma issues, or do they affect biotech and academia as well?
It's certainly an issue in large organizations in pharma or research. The general consensus is to try to standardize the platform you are using. Why have 50 different HPLC platforms when you could have one or two? Then you decrease the problem. Try to standardize on the vendors that you are using. These are some of the approaches that large organizations take.
Overall, there are too many standards. When you have too many it is almost [like having] none. There are a lot of different standards, of which a number overlap. Where do you place your bet? Some vendors latched onto standards, others have not, and some still have their closed proprietary format. This is going to be one of those problems we are going to have to chip away at over the years.
There [are] formats and [there are] standards. There may not be a standard for doing something. But [vendors could provide] common formats, such as having instruments that spit out XML formats. Everyone can take and parse XML and push that into whatever system they need to. The first step is working to common formats; then once we have agreements on common formats, we can move to common standards.
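As a minimal sketch of why a common format lowers the barrier: the element and attribute names below are invented, but any downstream system could consume instrument output like this with stock XML tools:

```python
# Parsing a hypothetical instrument readout in XML. The element and
# attribute names are invented for illustration; the point is that a
# common XML format can be consumed anywhere with standard-library tools.
import xml.etree.ElementTree as ET

raw = """
<run instrument="hplc-01" date="2009-03-02">
  <sample id="S-100"><peak rt="2.31" area="10432"/></sample>
  <sample id="S-101"><peak rt="2.29" area="9871"/></sample>
</run>
"""

root = ET.fromstring(raw)
for sample in root.iter("sample"):
    peak = sample.find("peak")
    print(sample.get("id"), peak.get("rt"), peak.get("area"))
```

A closed proprietary binary format, by contrast, forces every consumer to license or reverse-engineer a vendor parser before the data can move at all.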
One thing that is important from [the perspective of] a large pharmaceutical organization is, 'How do we go about connecting to the external research environment?' An important part of Merck's strategy is external research and integrating external research into how we do internal research. How do you enable that?
We've shown a lot of success with the Moffitt Cancer Center collaboration [an agreement signed in 2006 to develop personalized cancer therapies], but that was a specific project. Thinking big picture, how can we create that connectivity in a more standardized way?
We really need to work on that space and it really is a joint [endeavor] for pharma and academics. … We need some focus around that, standardizing the connectivity space so we can collaborate more seamlessly with outside research.
There are people coming together to try and do that. … The aerospace industry has already shown a lot of success. They may be working with 100 different companies just to make one airplane. There are technologies and methods they have applied, which I think we need to look to as it applies to doing life-science research.
Would one answer be the semantic web?
Show me where that's been applied to make this happen. It's a great concept. But show me the companies that sell some hard[ware] and software applications that really enable that and have a track record of doing that. I don't think there is enough track record of that being done at the moment beyond innovation experiments that have been performed in that space.
What would you reply if you received a call from the White House asking about your view of general drug-discovery IT challenges?
One of the challenges we face is how to get better flow and exchange of information from the medical providers we have. That comes down to better standards and movement of information in the healthcare space, which can be applied to drug discovery if we can access those in the right way.
What is aligned with where [President Obama's pick for Secretary of the Department of Health and Human Services Tom] Daschle was going before he stepped out of the race [is] the utilization of standards in the electronic medical-record and electronic health-record space. Better definition and utilization of those standards and technologies would facilitate getting the key clinical and medical information that enables translational research, which gets down to target validation and biomarker research.
I have spoken with medical record providers such as Cerner and GE. There is still [a need] to ensure that the right information is captured that would enable translational research down the road.
If you can capture the right stuff and get standards around that and then have standards around the technology platform, then there are ways we can talk to that and extract that information.
[For example,] we've been connected directly to the Moffitt Cancer Center, [and] we have an information pipeline flowing, a standard data warehouse repository that is there. They have affiliate hospitals that are funneling information to this central point and then funneling it to Merck. That has been enabled for oncology translational research because we had a standardized way of doing that, [a way that also] protects patient privacy.
You developed the method to put that in place?
We leverage industry standards and apply them. For example we use the [Clinical Data Interchange Standards Consortium Study Data Tabulation Model] standard for the transfer of information and we have internally the IBM Janus [clinical trials data warehouse] that is the model for our clinical data repository. That has enabled translational research.
If we talked to lots of different hospitals and they used similar standards around the [electronic medical-records] space we could more readily get the right [de-identified] information. … That would enable translational research, which enables drug discovery.
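As a simplified sketch of one step in getting de-identified information out of a hospital: replace the patient identifier with a salted one-way hash before the data leaves the site. The field names and salt here are invented, and real de-identification covers far more fields and is governed by regulation:

```python
# Toy sketch of pseudonymizing a patient identifier with a salted
# one-way hash. Field names and salt value are invented; real
# de-identification is far broader and regulation-driven (e.g. HIPAA).
import hashlib

SITE_SALT = b"per-site-secret-salt"   # hypothetical; would stay at the hospital

def pseudonymize(patient_id):
    """Derive a stable pseudonym so records for one patient still link up."""
    return hashlib.sha256(SITE_SALT + patient_id.encode()).hexdigest()[:16]

record = {"patient_id": "MRN-0012345", "diagnosis": "C50.9", "age": 57}
shared = {**record, "patient_id": pseudonymize(record["patient_id"])}
print(shared)
```

Because the hash is stable per site, a downstream partner can still correlate longitudinal records for the same patient without ever seeing the original identifier.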
In bioinformatics it appears that software-as-a-service models are emerging. What do you think of that approach?
Much like I say [it's important to have] a flexible computing environment, having a flexible application environment is also key. Flexible [meaning] that you can do many things with it, but also [describing] how you interact with it.
Internally we have used various virtual servers and various virtual desktops to deliver applications, servers, and services to scientists without them having anything physical under their desk or in their building.
You manage the software?
All I have to do is manage that connection and update at one physical site. The management of that becomes easier. Whether that is Microsoft Live or something else, I have not experimented with those myself.
I think the way IT provisions software applications has to change. It's very tied to some physical thing in your room. I think that flexible applications and flexible computing environments are going to change the way scientists interact with the capabilities they need to do research.
Down the road, in this very information-rich world with the explosion of data, they are going to need to work with more information-science and computational-science tools, [which will] have to be delivered to them in a very different way than today.
That might connect with being able to mine more data from samples?
You can generate one set of data, which is probably 100 terabytes worth of data, and you've probably got the equivalent of 10 PhDs worth of research in there. The focus is now going toward leveraging the value from the data and information. From my point of view you need to enable that way of doing research.
How good are the software tools? Does it involve a hard trade-off between speed, cost, and reliability?
There needs to be more flexibility, better performance, speed, and [they should be] modular. You've got this toolbox in your application environment. I sort of describe it as a kind of canvas that you want to work on. You bring these tools to the canvas and you paint your picture of research based on how you use these tools. These sandboxes, these environments, need to be flexible and modular so they can be used in many different ways. That is what I would like to see.
There are some things out there but they focus on one space. I don't want to call out any specific vendors. … There is some underlying core piece that might be shared and then there is something unique. The calculation of physical properties on some chemical structure is very different from the calculation of IC50 from a biological assay. There are different calculations, different methods, but a graph is a graph.
If the application environments we have are more modular and flexible with high performance, that is what I would like to see, if I could wave my magic wand.