As vice president of research informatics at Cambridge, Mass.-based Biogen, Rainer Fuchs has redirected the company’s informatics capabilities away from early discovery and toward the more complex problems of target validation. One part of this strategy has been the development of a LIMS-like Animal Information Management System to provide access to all in vivo animal data generated in the validation process. Fuchs recently spoke to BioInform about how his company is keeping up with the rapidly expanding role of informatics in drug discovery and development.
How have you seen research informatics evolve over the last 5, 10, or 15 years? What are the key changes you’ve seen take place at Biogen recently?
A few years ago the problem for everybody was that we were predominantly data-limited. So a lot of the work in bioinformatics was focused on intelligent ways of squeezing as much knowledge as possible out of data, and that has definitely changed now. The human genome is here, so most of the challenge is now the opposite problem, which is information overflow. Algorithms are certainly no longer as important as they used to be, and now we're more challenged by the problem of finding meaning in these data sets.
On the other hand, because the toolkits available for biologists have changed so rapidly, we’re now seeing the increasing relevance of what I call more biological data. Our scientists here are no longer asking us to find more genes, to find more candidate targets, but to help them interpret the biology of those candidates they have already.
We have to store … the information about the genetic background of these systems, but then we have people using these animal models to test drugs, and all that information has to be put together somehow. So just over the last few years, we decided to focus a lot of our informatics activities in that particular area — in animal information management. Part of that is genetics and genomics information that feeds into it, but it also adds complexity that we just never had before: How can we detect that the response of an animal model to treatment with two different drugs is actually very similar? It's easy if you have hard measures like, say, blood glucose levels. But if it's more behavioral reactions of that model, it's much more difficult to describe those observations in such a way that you can computationally identify patterns. So there's a real problem here in data representation that you don't have with sequence.
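The representation problem Fuchs describes — making behavioral observations comparable across treatments — can be sketched roughly as follows. This is a minimal illustration, not Biogen's system: the vocabulary terms, severity scores, and similarity measure are all assumptions made for the example. The idea is that once observations are encoded against a shared controlled vocabulary, two drug responses become vectors that can be compared numerically.

```python
# Hypothetical sketch: encode behavioral observations against a shared
# controlled vocabulary so responses to two treatments can be compared.
# All term names and severity scores here are invented for illustration.

TERMS = ["locomotor_activity", "grooming", "food_intake", "startle_response"]

def profile(observations):
    """Map a {term: severity score} dict onto a fixed-length vector over TERMS."""
    return [observations.get(t, 0.0) for t in TERMS]

def similarity(a, b):
    """Cosine similarity between two observation profiles (0 = unrelated, 1 = identical direction)."""
    norm = lambda v: sum(x * x for x in v) ** 0.5
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b)) if norm(a) and norm(b) else 0.0

drug_a = profile({"locomotor_activity": 0.8, "grooming": 0.4})
drug_b = profile({"locomotor_activity": 0.7, "grooming": 0.5})
print(similarity(drug_a, drug_b))  # high similarity: same terms affected in the same direction
```

The hard part, as the interview makes clear, is not the vector arithmetic but getting scientists to record observations against common terms in the first place.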
How have you approached that problem?
I wish we had an answer to it. One key issue here is to sit down with scientists and agree on common terminologies that can apply across different projects. That’s why work that’s going on in groups like the Gene Ontology Consortium is very important. But it’s important to take it to the next level. It’s important to define gene function, but it’s very different from what I would call biological function. And I think the existing ontologies are still relatively weak in those areas. The emphasis has been very much on the molecular level, but I think we need to take it to the next level.
Have you developed an in-house ontology?
We have started to put systems together that allow us in very specific areas to map different objects on to each other, words and terms onto each other. But we’re not trying to do that in any comprehensive way.
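The kind of narrow, non-comprehensive term mapping Fuchs describes could look something like the sketch below. The mapping table and term names are hypothetical; the point is only that each project keeps its own vocabulary, and a small synonym layer resolves project-specific terms to shared canonical forms where a mapping happens to exist.

```python
# Minimal sketch of project-level term mapping. The synonym table and
# canonical term names are made up for illustration.

SYNONYMS = {
    "hyperactivity": "locomotor_activity_increased",
    "increased locomotion": "locomotor_activity_increased",
    "reduced feeding": "food_intake_decreased",
}

def canonical(term):
    """Resolve a project-specific term to its shared canonical form.
    Unmapped terms pass through unchanged (normalized to lowercase)."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)

# Two projects using different words for the same observation now agree:
print(canonical("Hyperactivity") == canonical("increased locomotion"))
```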
How has the user base for research informatics tools changed at Biogen?
Traditionally our user community used to be the genomics people, people who are very interested in target identification, and molecular biologists, but now it's a much broader range of cell biologists, experts in particular disease areas like oncologists and neurobiologists, people who are not trained in molecular technologies but more in cell biology. It was really a big surprise for me how engaged they became in interactions with research informatics.
They were overlooked before?
Exactly. Sometimes it's almost trivial because you start at such a low level of computational infrastructure that you can actually very quickly have some impressive results. For example, the way they capture experimental data in animal models traditionally has not been very computer-supported, so we worked to come up with better ways of capturing that information in our databases and basically we created time savings on the order of 35-40 percent, which allowed them to dramatically ramp up the throughput.
Do you think that genomics moved so rapidly that everybody got to this validation stage before the technology was ready?
Yes, I would argue that the real experimental problem today is in fact high-throughput biological validation — you have a list of 50 genes that pop out of a gene expression experiment or a proteomics experiment. That’s nice, but what does it really mean? Once you have a gene, you go to the database and maybe you get a good sense of its molecular function, but that doesn’t really tell you much about the biological relevance. And here’s the big disconnect — experimentally we just don’t have the methodologies to take 50 or 100 genes and run them through animal models to do large-scale phenotypic descriptions. You really don’t know what to look for, so that’s really where the main bottleneck is.
I can’t argue that informatics can fill that gap, but it can take the experimental approaches to the next level by supporting higher-throughput, more efficient data manipulation and data analysis. And maybe by creating some of the knowledge infrastructure, it may allow you to identify patterns in your study of mouse models that you couldn’t without it.
So you’re going back to wet lab work and LIMS.
We’re really beginning to put more emphasis on traditional informatics. We’re going back to the basics of putting LIMS in place, working on process improvement. So in addition to the notion that informatics can be a qualitative enabler, we are going back to the notion of informatics as a tool to create quantitative improvements.
If you can just do the biology faster and cheaper, that’s a huge benefit. It may not be as intellectually exciting as coming up with a great new method of predicting a protein structure, but from a return-on-investment viewpoint, it’s actually much easier to justify.
When you put together the AIMS at Biogen, how much did you draw from your existing LIMS?
We basically had to build everything from scratch. We had to put much more of the basic data collection and data management infrastructure in place than we anticipated. We’re pretty good now at data collection, data management, and data representation, but the next problem is high-level pattern identification. You have all the data in your database, you have all the information from the animal models available to your scientists to allow them to go in and browse those databases and confirm or reject their hypotheses. That’s a nice capability to have, but what they’re really interested in is finding unexpected patterns — the answers to questions we haven’t even asked yet. I don’t know what the answer is to that yet.
Because you’re dealing with more heterogeneous types of data?
That’s exactly the problem. I guess the best we can hope for is to come up with ways to spot possible correlations without being as precise. It’s like brainstorming — to just throw out ideas and you can look at them and say, ‘No, that really looks stupid.’ But at the same time it may come up with something that really looks interesting.
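The "brainstorming" approach Fuchs describes — surfacing loose, possibly stupid-looking correlations for a scientist to accept or reject — can be sketched as a permissive pairwise scan across a study's measures. The measures, data values, and threshold below are invented for illustration; a real system would face far messier, heterogeneous data than this.

```python
# Loose-correlation "brainstorming" sketch: scan all pairs of measures in a
# study and flag any pair whose correlation exceeds a permissive threshold,
# leaving the scientist to discard the uninteresting ones.
# Measure names and data are made up for illustration.

from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

study = {
    "blood_glucose": [5.1, 6.0, 7.2, 8.1],  # one value per animal
    "body_weight":   [22, 24, 27, 29],
    "startle_score": [3, 1, 4, 2],
}

candidates = [
    (a, b, round(pearson(study[a], study[b]), 2))
    for a, b in combinations(study, 2)
    if abs(pearson(study[a], study[b])) > 0.7
]
print(candidates)  # only the glucose/weight pair survives the threshold
```

The threshold is deliberately loose: the goal is recall over precision, with a human doing the final filtering — exactly the brainstorming dynamic described above.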
There's also the shift to new opportunities in the informatics community for technologies like knowledge mining approaches. It's interesting; the ISMB meetings started out originally as a forum for the AI community, but over the years turned into very run-of-the-mill bioinformatics meetings. Just over the last couple of meetings they have begun returning to their roots, talking about intelligence again. So I think we're seeing a real shift in focus back to more intelligent ways of interpreting data.
What types of technologies have you put in place for interpreting this data?
I don't think we've seen anything yet, and we certainly haven't had the ability to develop anything that goes beyond tools that support scientists in evaluating their hypotheses. I conceptually differentiate here between technologies that help scientists support a particular hypothesis — so you have an idea and you go into a database and browse it using visualization tools and search engines. As opposed to that, you have technologies that are more independent — the database analyzes itself to come up with ideas. I think maybe it's still a pipe dream, but we haven't seen much that's useful in that respect.
Do you find ideas like this more in the bioinformatics community or the AI community?
I suppose more in the AI community. There are some companies popping up trying to work in that direction, but nobody I can point to yet beyond the vaporware stage.
Isn’t there the argument that companies working in this area don’t understand the problems in biology?
That’s true in a sense. Companies in that area have focused more on helping the CIA identify spies or terrorists, but some of these technologies could really be applied to the analysis of biological networks. Basically, there are particular interactions going on, but you can’t observe the interactions directly. You can only see the effects of those interactions. Of course, people in the intelligence community have been trying to solve that problem for a long time, so I would expect that over the next few years, as we put more resources into finding terrorists, there might be some benefit for the biological community as well.