Mark Boguski cut his teeth at NCBI at the height of the genome sequencing era before moving on to lead R&D at Rosetta Inpharmatics just as the field of microarray analysis was taking off; in other words, his career has mirrored major shifts in the field of bioinformatics itself. And, judging from Boguski’s latest appointment as director of the newly launched Allen Institute for Brain Science, it looks like the discipline may be poised for another shakeup. The institute is embarking on an ambitious project to map gene expression patterns for the entire mouse brain at the cellular level. With $100 million in seed money from Microsoft co-founder Paul Allen and a planned staff of around 100 researchers at his disposal, Boguski may have landed the best gig in bioinformatics. BioInform spoke to Boguski last week to discuss his new position and his plans for the Allen Brain Atlas project.
You moved from the linear sequence data at NCBI to multidimensional data at Rosetta, and now into the complexities of neuroscience at the molecular level. How did you get involved with this project and what new informatics and computational challenges do you anticipate?
I’ve always looked for interesting challenges in my career, and after the human genome project was wrapped up I was kind of asking myself, ‘What’s next?’ One hears a lot about systems biology, and what’s more complicated than the central nervous system? So I was inclined to go there. Those interests happened to intersect with a meeting that I and a number of other ad hoc advisors had with Paul Allen about two years ago, in July 2001. Paul had been following the progress of the genome project, and for a long time had been interested in how the mind works, and brought together a group of about a dozen scientists who flew to Seattle for a day to brainstorm with him about the brain. The ’90s were declared by Congress to be the Decade of the Brain, and the ’90s were also the decade in which most of the human genome project was completed, so these trains were running on parallel tracks and it was time to have them intersect.
Paul wanted to do something that was lasting in terms of, for instance, a research institute, but it takes a while to build something like that and he wanted an impact as quickly as possible, being a real results-oriented person. And at about that time, it seemed there was an opportunity to utilize existing technology, but at a scale not attempted before, to do an inaugural project for the institute, and that became this mouse brain atlas. There was really a lot of interest and activity around this topic starting at NIH in 1998 or 1999. NIH supported development of some basic technologies that could be applied to this, but a lot of people around the table at the time thought that the Allen Institute could have a real impact by taking it to scale — not just a few hundred genes, but many thousands — resourcing it adequately, and managing it in an optimal way.
So then the difference between this project and other mouse gene expression brain atlas projects is one of scale?
Certainly, it’s unprecedented in scale. Some of the techniques we’re going to be using have been around for a long time, but they’ve been applied by scientists who are focused on a particular gene or a particular phenomenon involving a small number of genes. So although anatomically connected expression data has been published in countless journal papers over the years, it’s never been done in a completely consistent way, so it’s really hard to compare one study to another.
One of the lessons learned from the genome project was that we used to sequence individual genes, too, back in the 70s and 80s, but rather than duplicate that activity over and over and over again with a series of R01 grants, the genome project said, ‘Hey, why don’t we just do this once and for all and raise everyone to the next level?’ So philosophically and operationally, I think the Allen Brain Atlas learns a lot of lessons from the genome project, as well as builds on the data infrastructure that the genome project created.
Can you discuss some of the technology that you’ll be using to bring all this genomic and anatomical data together into a single view?
One of the things Paul and his advisors were looking for in an inaugural project was one that didn’t require any sort of revolutionary technology development. The idea here was to identify best-of-breed technologies that have been developed already, and combine them in a synergistic way and take that to scale. So at least in the first two years of the project, we’re pretty much going to be using off-the-shelf components, both in the lab and in terms of our information technology, to get the job done. That doesn’t mean that necessity will not be the mother of invention here. We may have to develop some new approaches when we find out what the bottlenecks in the process are going to be.
You’ve already been working on this for two years as a pilot project. Where were you doing that work?
With the interest of starting to produce some data — not only to advance the project, but to educate ourselves about what some of the challenges of scale-up are going to be — we have a relationship with one of our scientific advisors, Gregor Eichele, who has developed quite a bit of technology at the Max Planck Institute in Hannover, Germany, and he also has an appointment at Baylor College of Medicine, where he’s been producing some of this data. So we’ve been working with Gregor to get the project off the ground while we built our own facility in Seattle.
What kind of data and software tools do you have on hand right now?
We actually have quite a bit. One of the reasons we kept the project confidential for so long is, basically, talk is cheap and you can say you’re going to do anything, but we didn’t really want to go public with the effort until we had a substantial amount of work completed — not so much in terms of data production, but in terms of all the things we’ll need to scale it up over the next couple of years. So we already have a database and software applications to browse and mine the data, populated with pilot-project data that we’ve been collecting over the last year, and we’re going to accumulate more data and make sure it all works before the first public launch, which will hopefully happen early next year.
What’s the timeline for getting the Seattle facility up and running?
We signed the lease on Sept. 1 and we’ve got all the major equipment ordered, so it’s just a matter of waiting for the equipment to be delivered and staffing up. There are a number of open positions on our website. We have a core group of about two dozen people already on the project, and they’ve been working on it pretty much for the last year.
What kind of IT infrastructure will you require to accomplish this work? Have you decided on a vendor yet?
We’ve mapped it out, and it’s a pretty staggering amount of data when you get to it. In addition to talking to organizations like the government and pharmaceutical companies about getting involved in this, we’ve also been talking to IT companies as well. I think there are some interesting opportunities for collaborations there.
How much data are you talking about?
We’re guesstimating that about two-thirds of the genome is involved in the structure and operation of the brain — so 20,000 of the 30,000 genes — and we’re aiming for an atlas with cellular resolution, and there are 1 trillion neurons in the brain. So if you multiply 20,000 genes times 1 trillion neurons, you get the order of magnitude that you’re talking about. Of course, that ignores the fact that most of the brain is composed of glial cells that support the neurons, so it even gets more outrageous when you think of mapping 20,000 genes at cellular resolution in something as large and complex as the brain.
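The back-of-envelope multiplication he describes can be sketched out as follows (this is our own illustration of the order-of-magnitude estimate, using the figures quoted above, not numbers from the institute):

```python
# Rough data-volume estimate for the atlas, using the figures
# quoted in the interview (his guesstimates, not measured values).
genes = 20_000                     # ~two-thirds of an assumed 30,000-gene genome
cells = 1_000_000_000_000          # the "1 trillion neurons" figure quoted above

data_points = genes * cells        # one expression measurement per gene per cell
print(f"{data_points:.0e} gene-by-cell measurements")  # prints "2e+16 gene-by-cell measurements"
```

Even at a single byte per measurement, that lower bound is on the order of tens of petabytes, which is why he calls the amount of data "staggering" in the next answer.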
What are the primary challenges for scaling this up to where you’d like it to be?
Well, when you look at the genome project, we’ve got essentially one-dimensional data and a parts list, and the challenge here, and the thing that intrigued Mr. Allen quite a bit, was how does that finite and actually rather small parts list of only 30,000 pieces create something as complex as a brain? So the real challenge here is to take one-dimensional information from the genome and turn it into a three-dimensional functional form, and even four-dimensional if you consider it operating in both space and over time.
You’re looking to hire 75 more researchers. What kinds of informatics expertise do you think will be required to address the complexity of this project?
In terms of where we go next, we certainly need people with informatics experience, both on the management side and in areas such as imaging, which will be crucial in much of our work as we try to bring together one-dimensional genomic information with two- and three-dimensional information in putting this atlas together.
What is the advantage of conducting this research in a nonprofit group instead of within an industrial setting?
It was Mr. Allen’s decision, and I don’t think there was any doubt from the beginning that he wanted to do this as a philanthropic endeavor. The kind of data that we’ll be generating is very much like the genome project in terms of it being something that will have its maximal impact the more people who are able to look at it and use it. One other thing is that Mr. Allen describes his investment as seed money, and we will work toward attracting some outside support for this project, both from government and industry, because we would like to have it viewed as a real public/private partnership in order to do for neuroscience what the human genome project did for molecular biology and genetics.