When Beyond Genomics was formed in late 2000, the company’s focus on systems biology was considered to be a bit of a gamble. But four years later, systems biology is quickly becoming the norm rather than the exception, and Beyond Genomics has established itself as a leader in the field. The Waltham, Mass.-based company now has research collaborations with GlaxoSmithKline, AstraZeneca, DiaDexus, and Elan Pharmaceuticals, as well as with the Boston University School of Medicine and the University of Leiden in the Netherlands. More recently, BG began applying its blend of experimental and computational science to its own in-house discovery programs.
BioInform recently spoke to the company’s newly appointed vice president of computational sciences, Aram Adourian, to get a better idea of how bioinformatics fits within the company’s systems biology approach.
What role do computational sciences play in systems biology at Beyond Genomics?
What we do within the computational sciences group is data analysis and data interpretation: bioinformatics falls under the data interpretation side, and data analysis relates more to the statistical treatment of the data.
There are a number of systems biology entities out there, but what distinguishes Beyond Genomics is that we perform both the wet biochemistry sample analysis, as well as the in silico data analysis, integration, and interpretation, in house, with a variety of proprietary approaches on both sides. ... and it’s a very iterative process in the way we operate. An important part of this is that if we see something very interesting in the data analysis or data interpretation, we can go back and, using the same samples, target various hypotheses that may have arisen, or various observations that may have been revealed that may be interesting for one reason or another.
Has the ratio between wet and dry work changed at Beyond Genomics since the company was launched?
It’s been fairly even. The ratio stays around 1:1, although the bioanalytical platforms do require a fair amount of resources, which scale with the size of the project. If you have 200 samples versus 2,000 samples, that’s going to require more resources on the wet lab side, whereas on the computational side, whether it be statistics or bioinformatics, it often does not matter from a research standpoint whether you have 200 samples or 2,000 samples.
How large is the company now, and how many employees are in the computational sciences group?
We’re based in Waltham [Mass.], and we also have facilities in the Netherlands at TNO in Zeist, so taken together, we’re at around 40 or so, just on the science side, and it’s split fairly evenly — around a 15/25 kind of split. That’s counting IT and infrastructure.
That seems like a small staff to have that many research projects going on.
I think what we’ve succeeded in doing — and that’s one of the benefits of starting a company from the ground up — is that we’ve really made an effort to pipeline a lot of procedures, to streamline a lot of processes, both on the data acquisition and LIMS side, as well as on the data analysis and interpretation side. If we see a particular MS-MS spectrum that we’ve seen before and we have it in our database, we try to avoid reinventing the wheel and reinterpreting that spectrum. [For] the various statistical approaches, we have [some that are] appropriate for more than one project, and we apply them as such; we have procedures that are specific to specific projects, depending on the objective; and on the bioinformatics front we have a variety of tools that we’ve built in house that we use from project to project. So while the interpretation and the biological context of observations are not necessarily automatable, a lot of things we’ve pipelined.
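The spectrum-reuse idea Adourian describes can be illustrated with a minimal lookup cache. This is only a sketch: the peak-rounding key and the `interpret_fn` stub are hypothetical stand-ins, not BG's actual pipeline.

```python
# Illustrative sketch: skip reinterpreting an MS-MS spectrum already seen.
# The keying scheme (rounded m/z values) is an invented simplification.

def spectrum_key(peaks, tol=0.01):
    """Round peak m/z values so near-identical spectra share a key."""
    return tuple(round(mz / tol) for mz, intensity in peaks)

class SpectrumCache:
    def __init__(self):
        self._seen = {}  # key -> stored interpretation

    def interpret(self, peaks, interpret_fn):
        key = spectrum_key(peaks)
        if key not in self._seen:          # only novel spectra get interpreted
            self._seen[key] = interpret_fn(peaks)
        return self._seen[key]

cache = SpectrumCache()
calls = []

def slow_interpret(peaks):
    """Placeholder for an expensive spectrum interpretation step."""
    calls.append(1)
    return "peptide X"  # invented identification, for illustration only

spec = [(100.02, 5.0), (200.11, 9.0)]   # (m/z, intensity) pairs
first = cache.interpret(spec, slow_interpret)
second = cache.interpret(spec, slow_interpret)  # cache hit: no re-run
```

The second call returns the stored result without invoking the interpreter again, which is the "avoid reinventing the wheel" behavior described above.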
What do you consider to be unique about BG’s approach to systems biology, and what specific tools have you developed to suit this approach?
In my view, the challenge in systems biology includes integrating as well as interacting with these large data sets — these multi-dimensional data sets that are heterogeneous and diverse and are information-rich in the sense that they are reflective of complex biological or pharmacological processes. So the data integration is something that we have put a lot of resources and thought into, and the interacting part is also an aspect on which we have focused our in-house solutions, because we haven’t seen anything that is available that would address that.
So, what do I mean by interacting? I mean the visualization of certain results within a context. We may see certain genes or certain proteins or other analytes change, going from a disease state to a treated state, or from a healthy state to a disease state, whatever the context is. And we may be able to look at some statistical structure within that data set, but then the next question is, ‘Why do I see what I see?’ And that really involves the data interpretation side, which is, number one, ‘Let me visualize and interact with what my measurements tell me.’ Number two is, ‘Does this visualization agree with what’s known in the community, or does it disagree with it, or has it not been addressed?’ And for that, we’ve unified or brought under one common framework a number of common biological databases — we’re up to 20 or 30 or so — so that we have a tool in house that we can use to query those results across a number of different databases, and then visualize and interact with the results.
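Querying one result across many databases under a common framework can be sketched as a set of adapters sharing a query interface. The adapter names, records, and annotations below are invented for illustration; BG's actual framework is proprietary.

```python
# Sketch of a federated query over several source databases behind one
# common interface. All database names and records here are hypothetical.

class DatabaseAdapter:
    """Wraps one source database behind a uniform query method."""
    def __init__(self, name, records):
        self.name = name
        self.records = records  # analyte id -> annotation

    def query(self, analyte_id):
        hit = self.records.get(analyte_id)
        return {"source": self.name, "annotation": hit} if hit else None

def federated_query(analyte_id, adapters):
    """Collect hits for one analyte across all registered databases."""
    return [r for r in (a.query(analyte_id) for a in adapters) if r]

adapters = [
    DatabaseAdapter("pathway_db", {"APOE": "lipid transport"}),
    DatabaseAdapter("interaction_db", {"APOE": "binds LDL receptor"}),
    DatabaseAdapter("disease_db", {"TP53": "tumor suppressor"}),
]

hits = federated_query("APOE", adapters)  # two of three sources know APOE
```

A single call fans out across every registered source and gathers whatever each one knows about the analyte, which is the pattern the unified framework implies.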
It’s a very iterative process. You may hypothesize that a certain pathway is being modulated or perturbed by a disease or by some form of pharmacological intervention. And if that’s the case, you may want to propose a subsequent experiment. So the set of tools that we’ve developed in house that would allow one to integrate and interact with those integrated data sets, and annotate, and visualize in an active manner, has been key for us.
In some ways, that’s the same problem that any in-house bioinformatics group might have. Have you found that any of your collaborators and partners are developing similar approaches?
There are some very nice tools out there, and I wouldn’t claim that we have solved this problem, because I think the problem is constantly changing, and is constantly being defined for us. … [B]ut for our purposes, we had specific design points, which were that we did want to be able to integrate the data, interact with the data, visualize the data, annotate, generate hypotheses, maintain those, and then, because we do have this duality of the wet chemistry and the in silico work, follow up on those initial interactions and visualizations with the results of a second round of experimentation, and maybe even a third or a fourth if there are a variety of hypotheses that we want to test. So it’s a variety of tools that — depending on our partner, and depending on our internal programs — we use to varying degrees.
Can you tell me about the internal programs that you have underway now?
These are relatively new. We recently published work on an internal program on a transgenic mouse model of atherosclerosis [Clish C, et al. Integrative Biological Analysis of the APOE*3-Leiden Transgenic Mouse. OMICS — A Journal of Integrative Biology 2004, 8, 3-13], and that was interesting for us because it was in some sense a proof-of-concept study. For us, it allowed us to test out our tools — both our wet chemistry and the statistics and the data interpretation, and the various infrastructures we have in-house to map the results onto pathways and onto mechanisms of disease. And indeed, that did generate a couple of very interesting hypotheses that we would like to follow up on — hypotheses dealing with lipid metabolism, in this context, that had hitherto been unknown or unappreciated. But they do need to be followed up with some additional experimentation.
In the past, BG has mostly worked with gene expression, proteomics, and metabolomics data. Is this still your core set of experimental data, or are you adding in new types of information?
We are integrating things like clinical data, things like interpretations or measurements of imaging data. That’s quite nascent, but we have done that recently. Certainly for some of the systems that we work with that are genetic perturbations in nature, such as transgenic animals and so forth, there is genotype data. Even in human samples, we sometimes have genotype data available to us as well. So there are a lot of types of information that make this actually very exciting, because many patients in a lot of these studies may be on concomitant medications, they may be on other types of therapies that will require you to integrate those data in with the molecular data that we acquire in order to draw realistic hypotheses about what’s going on in the system.
Does the analytical and statistical infrastructure that you’ve developed scale to these different types of data, or do you have to continue developing new tools as you add new data types?
We have sort of a two-track approach, where we have a set of validated tools that we apply whenever appropriate, alongside ongoing methods-development efforts. In statistics, for example, there are a number of challenges, as the community knows. We’re pretty much always dealing in the realm of having fewer samples than the number of molecular or other measurements that we can make, so the sample size is relatively small, and that requires a fair amount of innovation on the statistical side that we and others in the community are addressing. But we have a variety of tools that do scale to that.
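The "fewer samples than measurements" setting can be made concrete with a small synthetic example. With p measurements and n samples where p > n, ordinary least squares is ill-posed, and a regularized fit such as ridge regression is one standard way forward. The data, dimensions, and regularization strength below are all invented for illustration; this is not a claim about BG's actual statistical methods.

```python
import numpy as np

# Synthetic p >> n scenario: 20 samples, 500 molecular measurements.
rng = np.random.default_rng(0)
n, p = 20, 500
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0                     # only a few analytes truly matter
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# With p > n, X.T @ X is rank-deficient, so OLS has no unique solution.
rank = np.linalg.matrix_rank(X.T @ X)   # equals n, far less than p

# Ridge regression restores a unique, stable solution:
#   beta_hat = (X^T X + lam I)^{-1} X^T y
lam = 1.0                               # regularization strength (assumed)
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

The rank check shows why plain inversion fails, and the ridge penalty `lam * np.eye(p)` is what makes the linear system solvable despite the small sample size.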
On the informatics front — on the traversing various databases front — there, it’s interesting. We do find ourselves solving problems such as, ‘What is a metabolite, and how do you query for a metabolite in ten different databases and identify it in different pathways?’ There are issues of synonyms, there are issues of structure — even with proteins and polymorphisms, because what we measure in the laboratory are really what we like to refer to as protein instances or metabolite instances. It could be a non-synonymous SNP that you’ve measured, it could be a polymorphism that you’ve measured — and to map those things onto constructs such as pathways requires at least one level of abstraction: abstracting an observation out to a protein class, or a class of metabolites. And there’s always that back-and-forth between the granularity with which we can make our measurements [and] the relative coarseness with which the various databases out there are populated. So it’s actually a very interesting set of interactions and traversals that we’ve come up with to get around that, and to glean information. You don’t want to lose that granular information. That’s where actually a lot of the interesting biology is happening.
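The instance-to-class abstraction described above can be sketched as a small resolver: a measured variant is mapped up to its canonical protein class before being looked up in a pathway database, while the granular instance label is kept. All names, synonym tables, and pathway assignments here are invented for illustration.

```python
# Sketch of resolving a measured "protein instance" (e.g., a variant) up
# to a protein class for pathway lookup. All tables are hypothetical.

INSTANCE_TO_CLASS = {
    "APOE_E3_Leiden": "APOE",     # a specific variant measured in the lab
    "APOE_E4": "APOE",
    "ALB_glycated": "ALB",
}

SYNONYMS = {
    "apolipoprotein E": "APOE",   # databases may use differing names
    "ApoE": "APOE",
    "serum albumin": "ALB",
}

PATHWAYS = {"APOE": ["lipid metabolism"], "ALB": ["transport"]}

def to_class(name):
    """Resolve an instance or synonym to its canonical protein class."""
    return INSTANCE_TO_CLASS.get(name) or SYNONYMS.get(name) or name

def pathways_for(measured_name):
    """Map a granular measurement onto class-level pathway annotations."""
    return PATHWAYS.get(to_class(measured_name), [])
```

A variant-level measurement still reaches the class-level pathway annotation, and because the original instance name is never overwritten, the granular information — where, as Adourian notes, the interesting biology happens — is preserved alongside the coarser database view.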
You’ve been with BG since it was launched in 2000, so I’m curious to hear your thoughts on how systems biology has evolved in that period, and what the company is doing to stay ahead of the curve.
The company really was founded on the concept that ... the integration of all of these data sets, or many of these data sets, in an intelligent and coherent way would lead to a much more rich and nuanced picture of what was happening in a given system. I think where we’ve really tried to stay ahead is in the intimate coupling of the informatics and statistics — or the in silico work, if you will — with the wet biochemistry, because that really allows us to take full advantage of the iterative nature of systems biology and data integration.
There are many definitions out there for systems biology, and I don’t think any one is the definitive one. I think there are various commonalities throughout systems biology that involve integration of disparate data of one sort or another — it can be transcriptomics and proteomics, it could be clinical and proteomics, or some subset of those or superset of those. Other commonalities are reconstructing biological networks, which can be pathways or regulatory mechanisms. There are groups that focus on modeling and prediction solely, and that’s not something we do in isolation. We are much more applied to medicine, and again, we’re very much coupling the experimental and computational approaches and focusing on not just integrating, but interacting with these data sets and putting them in context.