A project to complete a computational infrastructure and workflow for second-generation sequencing, analysis, and data management is one of four recent grants worth a total of $7 million awarded by the Minnesota Partnership for Biotechnology and Medical Genomics to link up scientists at Mayo Clinic and the University of Minnesota.
The state-funded partnership, launched in 2003, is an economic-development initiative designed to leverage the scientific base of the two institutions to do work that neither institution could do on its own.
In its newest round of grants, $1.2 million is slated for Jean-Pierre Kocher at the Mayo Clinic and Sushmita Singh at the University of Minnesota for a project called HAITI, or high-throughput sequence analysis infrastructure technology investigation, which aims to create an informatics workflow that will allow researchers to run jobs on second-generation sequencing instruments at either institute and easily share and analyze the resulting data.
Kocher, who chairs the Division of Biomedical Statistics and Informatics and also directs the Bioinformatics Core at the Mayo Clinic, told BioInform that the Minnesota Partnership has helped foster collaboration between the institute and the university by enabling the sharing and validation of instruments.
“The idea was to stop investing into redundant technologies on both sides,” he said.
Kocher noted that it is often more difficult to get NIH funding for infrastructure projects than for research projects, so the partnership funding is a welcome boost. “For us as recipients of these funds, it shows that we have found a way to develop an infrastructure that makes sense,” he said.
Singh, Kocher’s collaborator in the HAITI project, is a research associate at the University of Minnesota’s Biomedical Genomics Center, the university’s microarray, genotyping, DNA sequencing, and analysis core facility. She said that the partnership funding is “essential” for her facility, particularly given the growing informatics challenges of high-throughput sequencing. “We are not a Broad, we are not a Sanger,” she said.
Singh’s core facility used partnership funds to buy a 454 Genome Sequencer in 2006. She said that this was a large-scale event for the small center because it gave sequencing capacity to researchers for whom sequence analysis had been prohibitively expensive — from medical researchers working on clinical problems, to entomologists and plant scientists. “We were one of the first small core facilities to get this [instrument],” she said.
“When you bring [services for a typical sequencing experiment] down from $200,000 to $20,000, that is fantastic,” she said. Some researchers might use commercial sequencing services in the future, but as core facilities, she said, “what we need to do is adapt … [and] provide quality service for a reasonable amount of money, so it is accessible to an average researcher.”
The partnership’s funds also make it such that “together we are stronger, we have greater leverage points than individually,” she said. Validating the analysis pipeline will help her assist researchers in collecting their data and also ensure that all data-collecting parameters are “optimal,” she said.
The new award builds on three previous grants that the Mayo Clinic and the university received from the Minnesota Partnership related to bioinformatics and second-generation sequencing. Under an initial grant, the organizations developed a common bioinformatics workflow. With a second grant, they established a grid-based analysis environment, and this was followed by a grant to acquire second-generation sequencers — two Illumina Genome Analyzers at the Mayo and the 454 in Singh’s facility.
Are You True?
In the first phase of the latest chapter in their collaboration, Kocher and Singh will validate the sequencers and their associated analytical tools. “When you get a new piece of equipment an investigator runs behind you and says ‘Hey, run my sample; I want to discover something interesting,’” Kocher said. “But that is a challenge if you do not know what the device is doing and it is very hard to know if it is a true answer or not.”
Grants for validation work are also hard to come by, he said. Individual researchers have grants to cover running a sample at a core facility, but that does not cover controlled experiments that the facility must run to ensure its instruments are running properly.
Kocher said he is looking forward to running controlled experiments with samples for which the genes and their concentrations are known. A DNA fragment is amplified at different concentrations and run through both the Genome Analyzer and the Roche 454 to see if the equipment can help scientists detect rare mutations, he said.
“What is the rarity level to which you can go — 0.1 percent? 1 percent? Are you still sure the information is a mutation or is it a sequencing error?” he said. “These experiments are used to both validate the instrument and the analytics that go with them.”
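The question Kocher raises can be framed as a simple statistical check. The sketch below is illustrative only and is not the HAITI team's method: it assumes a hypothetical per-read error rate and read depth, treats sequencing errors at a site as independent, and asks how likely it is that errors alone would produce at least the observed number of variant reads. The function names and all numbers are assumptions chosen for the example.

```python
from math import exp, lgamma, log, log1p


def log_binomial_pmf(n, k, p):
    """Log of the binomial probability of exactly k successes in n trials."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log1p(-p)


def error_tail_probability(depth, variant_reads, error_rate):
    """Probability that sequencing error alone yields >= variant_reads
    variant calls among `depth` reads (independent errors assumed)."""
    total = sum(
        exp(log_binomial_pmf(depth, i, error_rate))
        for i in range(variant_reads, depth + 1)
    )
    return min(1.0, total)


if __name__ == "__main__":
    depth = 10_000       # hypothetical number of reads covering the site
    error_rate = 0.001   # hypothetical chance a read shows the variant by error
    for frequency in (0.001, 0.01):  # the 0.1 percent and 1 percent cases
        reads = round(depth * frequency)
        p = error_tail_probability(depth, reads, error_rate)
        print(f"{frequency:.1%} variant ({reads} reads): P(error alone) = {p:.3g}")
```

Under these assumed numbers, a variant seen at the same rate as the error rate cannot be told apart from noise, while one seen at ten times the error rate can. A controlled experiment with known concentrations is what would supply a realistic error rate for each instrument.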
The goal for the next phase, Kocher and Singh explained, will be to create a work environment in which scientists gain access to instruments and data analysis tools at both institutions, and can run samples, get the data, and analyze the results.
Singh said that HAITI should help scientists manage data from second-generation sequencing projects. Data analysis is less of a problem in large centers, she said, but scientists working on more narrow scientific topics with smaller labs may not have the bioinformatics expertise to handle the data.
“HAITI will allow us to handle the immense amount of data we will be generating on two different platforms available to us,” she said.
The team has not yet decided whether it will use commercial software as part of its analytical pipeline.
“Possibly,” said Brian Wilson, a senior analyst and programmer at the Mayo Clinic’s Division of Research Education Systems Support, in an e-mail. “[It] really depends on whether our evaluation identifies gaps that our existing software portfolio or open source solutions cannot address.”
Wilson said the Minnesota partners will evaluate the MAQ toolset, and then branch out “to evaluate other options such as BFAST, BOWTIE, SLIDER, SHRiMP, to name but a few, as well as offerings from the [next-generation sequencing] platform vendors.”
“One of the biggest challenges is the need for effective visualization tools,” Wilson added.
One idea is to create “user-friendly GUIs, so investigators can get in there and accomplish what they need to,” Singh said. In the past, she has consulted with individual researchers on their data but in the future, with high-throughput experiments, that will no longer be possible in every instance, she said.
Getting with the Flow
Wilson said that HAITI will involve software integration “big time,” which will create a need for an integration environment such as a scientific workflow builder.
“We are looking to exploit our existing analytic workflow environment to provide a framework for the rapid evaluation of new software algorithms, as well as provide the structure for routine analysis,” he said. The team may need to figure out additional annotation functionality to help supplement the second-generation sequence analysis pipeline, he said.
During the first half of the project, the design and completion of controlled validation experiments will run in parallel to the development of the analysis infrastructure, a phase that might last around three to six months, Kocher said.
“The latter stages will focus on formalizing the results into a set of standard operating procedures and [integrating them into] the larger Mayo/UMN IT infrastructure,” Wilson said.
Previously, the Minnesota Partnership funded development of the Collaborative Workflow Environment for Bioinformatics, or CWEB, to create a library of bioinformatics workflows for scientists to tap as web services.
CWEB was a collaboration between InforSense and the University of Minnesota’s Supercomputing Institute. Some of the workflows created can be found on the partners’ joint development server.
InforSense provided toolsets, consultation, and training as needed, said Wilson. The researchers are exploring “the possibility” of using the InforSense platform to help with second-generation sequencing in HAITI but this collaboration is “still at a very early stage,” he said.
“Within Mayo, we have focused on the development of semi-automatic processing of genotype data from various platforms,” said Wilson. “The most recent is a flow to handle Affy SNP 6.0 GWAS data, [and we are] currently looking into Illumina GWAS.”
A partnership grant after CWEB supported a project called The Research Optimizer for Project Information eXchange, or Tropix, which relies on caGrid and other tools developed as part of the National Cancer Institute’s Cancer Biomedical Informatics Grid.
Wilson said that Tropix leverages caBIG technology “to allow researchers to request laboratory and analysis services from either institution.”
“We are building on Tropix” for HAITI, said Kocher. Ultimately, with HAITI, “the grid system can become a network of collaborators, with, of course, restricted access,” he said.
The HAITI project will require computer hardware upgrades at both institutions, particularly for storage, Kocher said.
“Both institutions have pretty heavy IT [infrastructure facilities],” he said. “What is more of a challenge usually is disk space, because you can generate a lot of data and to some extent you have to store it, even if only temporarily. You need 40, 50 TB of disk space, that is something we don’t always have,” he said.
Wilson said that the grant will enable the purchase of around 20 TB to store data for initial sequence analysis. “This will be distributed and installed within each institution’s current infrastructure,” he said. “It is our hope to augment rather than change the existing infrastructure, allowing the grant to focus more on the scientific validation.”
Computationally, said Wilson, the teams hope to leverage existing infrastructure as much as possible. At Mayo, this includes a computational cluster of more than 100 nodes, with a maximum of eight processors per node and memory ranging from 8 GB to 128 GB, he said.
At the University of Minnesota, Singh and her colleagues at the Biomedical Genomics Center have access to the university’s Supercomputing Institute. “So far, storage has not been a problem for us and that is thanks to the Supercomputing Institute,” she said.
The Supercomputing Institute supports a range of other disciplines, but “now people are realizing that bioinformatics is a very aggressive virus that is basically taking over,” she said.
The HAITI partnership gives Singh’s lab access to Kocher’s group of 25 bioinformaticians, with much-needed “expertise,” she said.
When HAITI is completed, said Kocher, if an investigator at Mayo wants to have a sample run on a 454, he or she will be able to submit the sample and run it at the University of Minnesota. “When the data is available, Tropix will be used to transfer the data from the University of Minnesota back to Mayo and vice versa if someone from University of Minnesota wants to run a sample on the Illumina [platform].”
“When the platforms are validated, we will also understand how the analytics perform on the data, because we will be validating analytical software, too,” he said.