The Allen Institute for Brain Science is releasing new software ahead of the fall launch of the Allen Brain Atlas, a genome-wide image database of gene expression in the mouse brain that it began building three years ago.
This week, the institute's informatics team launched 3D Brain Explorer (available at http://www.brain-map.org/), a desktop client for viewing data in the Allen Brain Atlas, which currently includes gene expression information for around 20,000 genes and is expected to include data on around 21,000 when it is completed in the fall.
3D Brain Explorer, like the data in the Allen Brain Atlas, is free for all users and enables them to view gene expression data in the mouse brain in three dimensions at 100-micron resolution. Users can also view expression data from multiple genes superimposed on each other in 3D, and search by coverage, intensity, or pattern of expression.
Michael Hawrylycz, director of informatics for the Allen Brain Atlas, told BioInform that developing the 3D search and visualization tools was a "major part of the [Allen Brain Atlas] effort informatically."
While many of the image-processing methods involved in creating the database and visualization tools are commonly used, "what's novel is that this is being done on a genome-wide scale," he said.
The scale is evident in the amount of data generated by the project. Hawrylycz said that the effort generates about a terabyte of data per day and that each gene is responsible for around a gigabyte of data.
The level of throughput was too high for human analysis, so the ABA team developed an automated process — called the Informatics Data Pipeline — for image preprocessing and QA/QC. With more than 600,000 high-resolution in situ hybridization images to process, "we realized that it just wasn't possible, based on our resources, to have somebody look at every single image, so an automated pipeline had to be created," he said.
The ABA informatics group runs a 148-CPU Linux cluster of HP and IBM blades, and has adopted a Hitachi Thunder 9585V storage area network, which Hawrylycz characterized as an "indispensable" piece of the informatics infrastructure. "Imaging data is very large and very cumbersome and we needed access to that quickly, and we found that very high-performance file servers were really necessary," he said. "Early on, we didn't have the SAN and it took us some time to get going."
Hawrylycz said that the most "painful" part of the project was processing the data for the first 2,000 genes, released in December 2004 [BioInform 12-20-04]. "It taught us the lessons of what infrastructure we needed to get to where we are today," he said. "Going from 2,000 to 16,000 was easier than from zero to 2,000."
From 2D to 3D
In order to recreate the mouse brain in three dimensions, complete with expression patterns for 20,000 genes, the ABA informatics team had to assemble it computationally from 400 painstakingly annotated 2D ISH images that are referred to as the Allen Reference Atlas. These 400 images were all carefully aligned on a common coordinate system in order to account for variability in brain size and shape and serve as the "de facto standard brain that everything will be compared to," Hawrylycz said.
Working with the original Nissl-stained images, Hawrylycz and colleagues used image-processing methods to reassemble these 2D images into a 3D model and then mapped the annotations onto the 3D model. The next step involved mapping these annotations to the ISH data for each gene. Finally, the team quantified expression levels using segmentation and adaptive methods so that researchers can easily visualize the gene expression in each anatomical region of the brain.
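As a rough illustration of the steps described above, the following Python sketch stacks aligned 2D sections into a 3D volume, applies a simple per-section threshold, and summarizes expression for each annotated region. It is not the Allen Institute's pipeline: the function names (stack_sections, adaptive_mask, region_stats), the mean-plus-k-standard-deviations threshold, and the toy data are all assumptions made for this example, standing in for the "segmentation and adaptive methods" that the article does not detail.

```python
from collections import defaultdict
from statistics import mean, pstdev


def stack_sections(sections):
    """Stack aligned 2D sections (lists of rows) into a volume indexed [z][y][x]."""
    shape = (len(sections[0]), len(sections[0][0]))
    for section in sections:
        if (len(section), len(section[0])) != shape:
            raise ValueError("sections must share one shape after alignment")
    return [[list(row) for row in section] for section in sections]


def adaptive_mask(section, k=0.5):
    """Mark pixels above mean + k * std of their own section.

    Hypothetical stand-in for an adaptive segmentation step.
    """
    pixels = [p for row in section for p in row]
    cutoff = mean(pixels) + k * pstdev(pixels)
    return [[p > cutoff for p in row] for row in section]


def region_stats(ish_volume, label_volume, k=0.5):
    """Return {region: (coverage, mean_intensity)}.

    coverage is the fraction of a region's voxels marked as expressing;
    mean_intensity is the average signal over those expressing voxels.
    """
    counts = defaultdict(int)
    hits = defaultdict(int)
    totals = defaultdict(float)
    for section, labels in zip(ish_volume, label_volume):
        mask = adaptive_mask(section, k)
        for row, label_row, mask_row in zip(section, labels, mask):
            for pixel, region, expressed in zip(row, label_row, mask_row):
                counts[region] += 1
                if expressed:
                    hits[region] += 1
                    totals[region] += pixel
    return {
        region: (
            hits[region] / counts[region],
            totals[region] / hits[region] if hits[region] else 0.0,
        )
        for region in counts
    }


# Toy data: two 2x2 sections with invented region labels.
sections = [[[0, 0], [10, 10]], [[0, 0], [10, 10]]]
labels = [[["ctx", "ctx"], ["hip", "hip"]], [["ctx", "ctx"], ["hip", "hip"]]]
stats = region_stats(stack_sections(sections), labels, k=0.5)
```

Per-region coverage and intensity values of this kind are the sort of summary that would support the searches by coverage and intensity that 3D Brain Explorer offers, though how the tool computes them is not stated in the article.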
3D Brain Explorer provides an interface to explore this data and query specific genes or gene expression clusters.
Over the next few months, Hawrylycz's team will wrap up some other software tools and will also process the remaining genes, which are likely to be challenging. "They're the ones that for some reason may have been difficult to get through the pipeline in the first place, or they're genes that are obscure, or for which probes have to be made that are not readily obtainable," he said. In addition, there is still uncertainty as to how many genes are in the mouse genome, "so there is the issue of what is finished, too."
The ABA team is already making post-Atlas plans. Hawrylycz said that a "big Atlas-mining related effort to determine the content and the correlation and significance in the data" is in the works. He said that initial analysis indicates that there are very few genes that are "absolute markers for one specific micronucleus. They're the exception and not the rule, whereas there are many, many genes that have interesting expression patterns that need to be teased out."
Elaine Jones, chief operating officer for the Allen Institute for Brain Science, said that the institute will likely turn its attention to studying the mouse neocortex once the Atlas is complete, and Hawrylycz said that many of the tools developed for the Atlas should be applicable to that work as well. "We've even provided for the neuroscientists a list of genes that are good markers for relevant structures and cell types in the cortex, and as a starting point we already know a sublist of viable genes that are worth looking at further," he said.