The Allen Institute for Brain Science — which launched just over a year ago with $100 million in seed money from Microsoft co-founder Paul Allen — has reached its first major milestone in an effort to create a molecular-resolution 3D map of the mouse brain.
Last week, the institute released gene-expression data for 2,000 genes together with a suite of data-mining and visualization tools through its website, http://www.brain-map.org. This follows a “sample” data release in the first quarter of 2004 that did not include the browsable application.
Allan Jones, senior director of atlas operations at the institute — who replaced founding director and fellow Rosetta Inpharmatics alum Mark Boguski earlier this year — called the release the “first chapter in the story of the Atlas project,” and said that the initiative is on track to reach its ultimate target of 20,000 genes by the middle of 2006.
Michael Hawrylycz, director of informatics for the Allen Brain Atlas, said that the current release comprises approximately 17 terabytes of raw data, which has been reduced for presentation on the website by a factor of about 10 to 1.
Hawrylycz said that, when complete, the Allen Brain Atlas could contain about a petabyte of uncompressed data, with a “conservative” estimate of 300 terabytes of compressed data. By comparison, the latest version of GenBank is only 166 gigabytes.
Hawrylycz said that the project is generating such a massive amount of data because it is using in situ hybridization to study gene expression, an imaging-based method that relies on labeled probes that are specific to certain genes. Slices of stained brain tissue are photographed, and then the images are cleaned up, analyzed, compressed, and uploaded to the Atlas. Hawrylycz estimated that each uncompressed image, one per tissue section, is around 250 megabytes.
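As a rough back-of-the-envelope check, the figures Hawrylycz quoted can be combined in a few lines of Python. This is only a sketch: it assumes decimal units (1 terabyte = 1,000,000 megabytes) and treats the approximate numbers as exact, and the variable names are illustrative. The last value is a simple linear extrapolation from 2,000 to 20,000 genes, not a figure from the institute; it comes out well below the petabyte estimate quoted above, and the article does not say what accounts for the difference.

```python
# Figures quoted in the article; decimal units are an assumption.
RAW_RELEASE_TB = 17       # raw data in the current release
IMAGE_MB = 250            # approximate uncompressed size of one section image
GENES_RELEASED = 2_000    # genes in the current release
GENES_TARGET = 20_000     # target for mid-2006
REDUCTION_FACTOR = 10     # "about 10 to 1" reduction for the website

MB_PER_TB = 1_000_000

# Approximate number of section images behind the current release.
images = RAW_RELEASE_TB * MB_PER_TB / IMAGE_MB

# Average number of section images per gene implied by those figures.
sections_per_gene = images / GENES_RELEASED

# Approximate volume actually presented on the website.
presented_tb = RAW_RELEASE_TB / REDUCTION_FACTOR

# Naive linear extrapolation of raw volume to the full gene target.
linear_raw_target_tb = RAW_RELEASE_TB * GENES_TARGET / GENES_RELEASED

print(f"images in release:      {images:,.0f}")
print(f"sections per gene:      {sections_per_gene:.0f}")
print(f"presented on website:   {presented_tb:.1f} TB")
print(f"linear raw at target:   {linear_raw_target_tb:.0f} TB")
```

Under these assumptions, the release corresponds to roughly 68,000 section images, or about 34 per gene, with around 1.7 terabytes presented on the website.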
Eventually, Hawrylycz told BioInform, it’s expected that these raw images will be available for researchers, but the current release offers only the compressed images, which are processed with an application called Zoomify.
The current release allows researchers to query the atlas by gene name, and to examine the expression patterns of each gene, slice by slice, across the entire brain using an “expression filter” feature. It also includes an annotated reference atlas that allows researchers to view gene expression within a neuroanatomic context.
Hawrylycz said that the AIBS informatics team is planning to improve the algorithm by which the ISH expression data is mapped to the reference atlas “to present more of a perfectly aligned view.” Future plans also call for structure-based queries, which will allow researchers to view genes that are only expressed in certain brain regions. “The main goal that we’re striving for is to get the data in a common anatomic framework,” he said.
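To make the idea of a structure-based query concrete, here is a minimal Python sketch. It does not reflect the institute's software or data model, which the article does not describe. The gene names, structure names, and the function `genes_restricted_to` are all hypothetical placeholders, and the sketch assumes each gene has already been annotated with the set of structures in which expression was detected.

```python
from typing import Dict, List, Set

# Hypothetical annotations: gene name -> structures with detected expression.
# These are placeholders for illustration, not data from the atlas.
EXPRESSION: Dict[str, Set[str]] = {
    "gene_a": {"hippocampus"},
    "gene_b": {"hippocampus", "cortex"},
    "gene_c": {"cerebellum"},
    "gene_d": set(),  # no expression detected anywhere
}


def genes_restricted_to(regions: Set[str],
                        expression: Dict[str, Set[str]]) -> List[str]:
    """Return genes expressed somewhere, and only within `regions`."""
    return sorted(
        gene
        for gene, structures in expression.items()
        # Non-empty set that is a subset of the requested regions.
        if structures and structures <= regions
    )


print(genes_restricted_to({"hippocampus"}, EXPRESSION))
print(genes_restricted_to({"hippocampus", "cortex"}, EXPRESSION))
```

The subset test is what distinguishes "expressed only in these regions" from the looser "expressed in these regions": `gene_b` is excluded from the first query because it is also annotated in the cortex.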
“As the project evolves, gene expression data will also be annotated by anatomical structures so that we can build a database that will allow a complex set of searches combining gene function and anatomical location,” said Ed Lein, director of neuroscience at AIBS, in a conference call announcing the release of the data.
The ultimate goal, Hawrylycz said, will be for researchers studying multiple genes to be able to view them in the same anatomical context. “That’s where we’re going,” he said. “Our current solution is a step in that direction, but it’s not the final solution yet. Most of our work moving forward will be on achieving this true 3D rendering.”
There are a number of steps necessary to reach that goal, he said. The AIBS informatics team has developed several new methods to process the data it has generated to date, but more new methods will be required to scale up to 20,000 genes. Both the expression filter, which “highlights cells that are of increased probability of being expressed,” and the algorithm that maps the expression sample to the reference atlas are among the new tools developed for the current version of the atlas, Hawrylycz said.
“If we stayed with exactly the same technology solutions that we have in place now, we could complete the job by just reiterating what we’ve done nine more times,” he said, “but that would not be providing the data in the true anatomic framework that we want and need to do. So it’s that next step that’s going to be exciting.”
One challenge associated with the scale of the project involves ensuring the accuracy of the data. Due to the automation of the process, and the millions of images involved, “certain aspects that ideally would allow for human intervention are just not feasible,” he said. “To my feeling, that’s one of the scariest aspects of the project. … Despite the existence of really top-notch methods and algorithms, you still at the end of day really want some kind of reality check of expert annotation, and with data sets of this size, it becomes harder and harder to get that.” This challenge has guided the informatics team’s strategy of developing tools “that enable us simultaneously to scale, and to get real usage out of our machines and algorithms, but also to enable an expert eye to get a look at the data,” he said.
The informatics team is also looking into ways to integrate the ISH gene expression data in the Atlas with other types of bioinformatics data, such as microarray gene-expression information.
Another area under consideration is the team’s IT infrastructure. The group currently uses a 28-node IBM blade cluster for its post-processing, and relies on a “conventional” storage architecture, Hawrylycz said. “In this first release, there have been lessons learned [in the hardware area], too, and we’re taking a look at what we need to do to scale this for the long haul,” he said. One likely change, he said, will be a switch to a storage area network configuration. As for the cluster, he said, “This is expected to be expanded quite a bit over the next year as we move into the true 3D world.”
Not surprisingly, considering the IT background of its principal benefactor, informatics was originally envisioned as the “primary component” of the Brain Atlas project, Hawrylycz said. He added, however, that “we realize now that the data production and several of the other issues are massive undertakings themselves.”
Nevertheless, the AIBS team still considers its efforts to be a successor to the Human Genome Project. “One of the philosophies we’ve tried to live by in a way is that we’re trying to apply the genomic model to a neuroscience problem,” Hawrylycz said, “which means large scale, high throughput, the heavy exploitation of analytics, numerics, and informatics.”