Few research institutes have provided the scientific community with as many robust and user-friendly Web interfaces as the Allen Institute for Brain Science in Seattle. Founded in 2003, the institute has made several online, multi-dimensional atlases incorporating genomic and anatomic information publicly available, so that researchers can pinpoint exactly where certain genes are expressed in the brain. So far, it has produced the Allen Mouse Brain Atlas, the Allen Spinal Cord Atlas, the Allen Developing Mouse Brain Atlas, and, in late May, the Allen Human Brain Atlas. The most detailed resource of gene expression data in the human brain to date, this latest atlas is a three-dimensional map of gene activity. It incorporates data from magnetic resonance imaging, diffusion tensor imaging, histology, in situ hybridization, and microarrays for more than 700 distinct areas of the brain, with information from more than 62,000 gene probes.
The IT supporting the development and maintenance of these atlases must be scalable and flexible, a challenge of both technology and creativity for the Allen Institute IT staff. Chinh Dang, the institute's senior director of technology, says that scalability and management strategies for these large image files are two issues she spends a lot of time thinking about. "The IT architecture that we had here when we first started out with the mouse data (this was seven years ago), even back then we knew that our storage had to be able to scale," Dang says. "We do in situ hybridization, various kinds of histological staining in a high-throughput environment, which generates millions and millions of images. Depending on whether the images are from mouse or human samples, they can range from 1 or 2 gigabytes to 6 gigabytes. When you generate about a terabyte of data every day, it doesn't take too long to get to the petabyte level."
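Dang's back-of-the-envelope scaling claim checks out: at roughly a terabyte of new image data per day, the archive crosses the petabyte mark in under three years. A quick sketch, using only the round figures quoted in the interview (these are not Allen Institute measurements):

```python
# Rough growth arithmetic using the round figures quoted above:
# ~1 TB of new image data per day, 1 PB = 1024 TB.
TB_PER_DAY = 1.0
PB_IN_TB = 1024.0

days_to_petabyte = PB_IN_TB / TB_PER_DAY
years_to_petabyte = days_to_petabyte / 365.25

print(f"{days_to_petabyte:.0f} days (~{years_to_petabyte:.1f} years) to reach 1 PB")
# At 1 TB/day, a petabyte accumulates in roughly 1024 days, about 2.8 years.
```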
The institute is currently dealing with more than a petabyte of data, and on any given day it is processing more than a terabyte of brain tissue image data from multiple scanning platforms. The IT pipeline uses a combination of commercially available storage and server hardware housed in a 2,200-square-foot facility, connected by a high-speed gigabit network to other servers and desk-side computers across the institute so that investigators can access the compute resources whenever they like. The institute currently uses a processing cluster composed of Sun Microsystems and Hewlett-Packard blades running Linux, BlueArc and Hitachi Data Systems storage, and a Spectra Logic tape archival system.
Dang's strategy for dealing with massive raw image files of brain tissue is to remove them from the disk drives as soon as possible. Letting raw images take up valuable disk space is simply not an effective use of storage, not to mention the time and energy cost of moving and processing those images on the compute cluster. Once analysis is complete, the images are cleaned up through cropping and other "beautifying" methods and compressed into the JPEG 2000 image format, which allows a 16-to-1 compression ratio. The raw images are moved onto tape and are only brought back to the hard drive cluster in the event of disk failure, something Dang says is not unheard of in earthquake-prone areas like theirs.
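The storage payoff of that pipeline is easy to quantify. A sketch of the per-image savings, assuming the 16-to-1 JPEG 2000 ratio and the rough image sizes quoted in the article (the function itself is illustrative, not Allen Institute code):

```python
# Illustrative: on-disk footprint of the JPEG 2000 copy vs. the raw image,
# assuming the 16:1 compression ratio mentioned in the article.
JPEG2000_RATIO = 16

def compressed_size_gb(raw_gb: float, ratio: int = JPEG2000_RATIO) -> float:
    """Size of the on-disk JPEG 2000 copy; the raw file itself goes to tape."""
    return raw_gb / ratio

for raw in (2.0, 6.0):  # small mouse image vs. large human image
    print(f"{raw:.0f} GB raw -> {compressed_size_gb(raw):.3f} GB on disk")
```

A 6 GB raw human image, for example, shrinks to well under half a gigabyte on disk, which is what makes keeping only the compressed copies online viable.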
"All of these atlas projects have overlapping data, so we hang onto some of the older data for analysis and comparison with the newer data sets that we generate. We don't need that old data to be as high-performance as the new data that's coming through, because we're not generating terabytes of new data each day on the mouse data," she says. "What we do is start to phase the older project out into lower-tier storage while we continue to build out the high-performance storage pools on our newer project. That's how we deal with the scaling and management of all this data."
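The phase-out Dang describes amounts to a tiering policy keyed on project activity. A minimal sketch of such a rule; the tier names and the decision inputs here are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical tiering rule in the spirit of the strategy described above:
# active projects stay on high-performance disk, older overlapping data
# moves to lower-tier disk, and raw originals live on tape.

def storage_tier(project_active: bool, is_raw_original: bool) -> str:
    if is_raw_original:
        return "tape"              # raw images are archived once analysis is done
    if project_active:
        return "high-performance"  # data still being generated daily
    return "lower-tier"            # older projects kept for cross-atlas comparison

print(storage_tier(project_active=False, is_raw_original=False))  # an older atlas
```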
A huge component of the Human Brain Atlas is microarray data, a considerably larger volume than the mouse atlases required. Each brain processed can generate roughly 50 million gene expression data points, depending on sampling density and how many probes have been profiled. The institute plans to have data for 10 human brains by the end of 2012, which will result in more than 500 million data points. Managing this volume of data required a change in the structure of the database server. Unlike the mouse data, which lives in a relational database, the brain expression data is stored as binary data on the file system while the rest of the metadata continues to use the relational system. This hybrid method of storing data means a need for more processing power, which has required Dang to make adjustments to the computational cluster. "With the mouse atlases, we had a much smaller number of jobs on our cluster, but each job required lots and lots of memory; the human atlas requires long hours and days of computation," she says. "We ended up having to build out a new processing cluster. Even though it's all the same cluster, the back-end hardware is actually quite different." The computational servers, which originally had four-core processors, have been upgraded with 24-core processors, and Dang also increased the RAM from 16 to 32 gigabytes so that the entire brain database can be loaded into memory for analysis.
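The hybrid layout described above, expression values as flat binary on the file system with metadata in the relational database, can be sketched with the standard library. Everything here (the table layout, the file format, the gene names) is an assumption for illustration; the article does not describe the institute's actual schema.

```python
# Illustrative hybrid store: float expression values in a flat binary file,
# probe/sample metadata in a relational table. Schema and format are
# hypothetical; the article does not describe the institute's actual layout.
import array
import os
import sqlite3
import tempfile

tmpdir = tempfile.mkdtemp()
bin_path = os.path.join(tmpdir, "expression.f32")

# Binary side: expression values written as packed 32-bit floats.
values = array.array("f", [0.12, 3.4, 7.8, 0.0])
with open(bin_path, "wb") as f:
    values.tofile(f)

# Relational side: metadata rows point into the binary file by offset.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE probe (id INTEGER PRIMARY KEY,
                                  gene TEXT, offset INTEGER)""")
db.executemany("INSERT INTO probe VALUES (?, ?, ?)",
               [(1, "DRD2", 0), (2, "BDNF", 2)])

def expression_for(gene: str, n: int = 2) -> list:
    """Look up the offset in SQL, then read n floats from the binary file."""
    (offset,) = db.execute("SELECT offset FROM probe WHERE gene = ?",
                           (gene,)).fetchone()
    out = array.array("f")
    with open(bin_path, "rb") as f:
        f.seek(offset * out.itemsize)
        out.fromfile(f, n)
    return [round(v, 2) for v in out]

print(expression_for("BDNF"))  # reads the values stored at offset 2
```

The design choice mirrors the article's rationale: SQL stays fast for metadata queries, while the bulk numeric matrix is read sequentially from flat files (or loaded wholesale into RAM) without relational overhead.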
As important as scalability is for the internal IT infrastructure, equally crucial is the institute's ability to keep its online atlases operational with no downtime. To this end, Dang has created "silos": hardware clusters that contain a series of database servers and Web servers. These silos allow the website to run seamlessly while maintenance, including upgrades and application installations, is conducted.
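A silo scheme like this reduces to a routing rule: send traffic only to silos not currently under maintenance, so one silo can be upgraded while the others serve the site. A toy sketch (the silo names and the maintenance flag are hypothetical):

```python
# Toy router over "silos": requests go only to silos that are not flagged
# for maintenance, so upgrades never take the site down.
silos = {"silo-a": {"in_maintenance": False},
         "silo-b": {"in_maintenance": True},   # being upgraded right now
         "silo-c": {"in_maintenance": False}}

def route(request_id: int) -> str:
    available = [name for name, s in silos.items() if not s["in_maintenance"]]
    if not available:
        raise RuntimeError("no silo available")
    return available[request_id % len(available)]

print([route(i) for i in range(4)])  # silo-b is skipped while it is upgraded
```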
The mouse data sets are also mirrored at a collaborator's site in Sweden. When European traffic hits the Allen Institute website, it is first routed to the Swedish servers; if those servers can't handle the load, the traffic is directed to the servers in Seattle. Maintaining this reliability is one of the reasons Dang and her team pre-compute all of the data before it goes out to the public, so that computation is not taxing the site.
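The mirror's routing logic boils down to: prefer the nearby Swedish servers, fall back to Seattle when they are loaded. A hedged sketch of that decision; the server names, load metric, and threshold are invented for the example:

```python
# Illustrative geo-failover rule: European requests prefer the Swedish
# mirror of the mouse data, spilling over to Seattle when the mirror is
# at capacity. The load numbers and threshold are made up for the example.

def pick_server(region: str, mirror_load: float,
                mirror_capacity: float = 1.0) -> str:
    if region == "EU" and mirror_load < mirror_capacity:
        return "sweden-mirror"
    return "seattle"

print(pick_server("EU", mirror_load=0.4))  # mirror has headroom
print(pick_server("EU", mirror_load=1.2))  # overloaded: fall back to Seattle
print(pick_server("US", mirror_load=0.0))  # non-EU traffic goes to Seattle
```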
The atlases — which are excellent examples of Web 2.0 meeting biological data — are built with the scientist in mind as a customer. "Think of us as a mini bioinformatics software company that exists within a larger institute. We have a specific cycle for software development, including testing, deployment, and going live — it's a very structured environment," she says. "We're very much product-driven here, so we talked a lot with users from the genomics side as well as from the anatomy side, and all of these things help us put out a product that is polished and meets everyone's needs from a performance standpoint."