Researchers from the University of Chicago and other institutions have created a computational workflow for sequence alignment, processing, and variant calling that runs on supercomputers, exploiting the parallelism these systems offer to shorten the time it takes to analyze whole genomes.
Working with researchers from the medical schools at the University of Michigan, University of Pennsylvania, and Washington University in St. Louis, the U of Chicago scientists put together a workflow called MegaSeq — which is made up of well-known open source tools — that runs on a Cray XE6 supercomputer named Beagle that is housed at Argonne National Laboratory (ANL) and managed by the Computation Institute, a joint initiative between ANL and U of Chicago. The institute purchased the 150-teraflop supercomputer in 2010, naming it after the ship that carried Charles Darwin on a scientific voyage beginning in 1831.
The researchers wrote in an article published earlier this month in Bioinformatics that by pairing MegaSeq with Beagle, they sought to "improve the timeline required to process whole-genome sequencing by optimizing [the] extraction, alignment, processing, and variant calling" steps, reasoning that "supercomputing capacity was better suited to parallelize [whole-genome analysis] and allow for the rapid simultaneous analysis of multiple genomes."
Using the infrastructure, they were able to analyze data from 61 human genomes, sequenced by Illumina, in just over 50 hours using between 300 and 400 of Beagle's roughly 726 nodes, each of which has 24 cores and 32 gigabytes of memory. They predict that using all of the available nodes, Beagle would be able to work through 240 genomes in the same time frame. However, those numbers are "fluid," since some jobs finish faster than others, freeing up capacity that can be used by more time-intensive jobs, Megan Puckelwartz, a postdoctoral fellow in Elizabeth McNally's laboratory and first author on the study, told BioInform. McNally is a professor of medicine and human genetics and a co-author on the Bioinformatics paper. For comparison, it would take nearly 12 years for a single core and just over seven months for a three-node system to work through 61 genomes.
A system like this not only offers improvements in speed and accuracy but also "reduces the price per genome," bringing the cost of analyzing the entire genome down to "less than the cost of looking at just a fraction of the genome," McNally said in a statement.
McNally, who is also director of U of Chicago's cardiovascular genetics clinic, further noted that the lower cost and shorter time to results has implications for medical applications of whole-genome sequencing. For example, a system like this could be used to quickly analyze data from multiple individuals in a family in order to identify mutations that are associated with an increased risk of cardiovascular disease, she said.
In many cases, genomic data is analyzed in parallel on many smaller machines in a compute cluster, but the trouble with this sort of infrastructure is that the available "memory and computational core number limit the capacity for simultaneous computation," the researchers wrote in their paper. Also, clusters that lack a shared file system must transfer data from node to node, a time-intensive bottleneck that slows down analysis. McNally told BioInform that these were some of the limitations her lab faced when it initially tried to use its local cluster for whole-genome analysis, and that led it to work with U of Chicago's Computation Institute on the supercomputing solution, which took about a year and a half to develop.
Supercomputers like Beagle are at an advantage because they have both "a parallel computational environment and external disk storage based on a parallel file system [that] ensures that each node (and core) is able to access all data at any time without waiting for transfer of data across nodes," the authors wrote.
Simply put, "generally in a cluster, the nodes are almost separate components, whereas on Beagle with the Lustre [parallel file] system, they … all work together at the same time [so] you don’t have to wait for one node to give data to another node to continue on," Puckelwartz explained to BioInform. "That transfer time on a cluster system" and on cloud computing infrastructure as well "is really intensive."
The authors also noted that the MegaSeq workflow incorporates familiar open source tools — the Burrows-Wheeler Aligner for alignment, along with Picard and SamTools for sequence processing — and uses components from the Genome Analysis Toolkit, such as the IndelRealigner and the HaplotypeCaller, for the alignment cleaning and variant calling steps, which "demonstrates that the publicly available systems software currently in use for genome analysis is amenable to the supercomputing environment."
The paper describes the alignment, cleaning, and variant calling steps of the MegaSeq workflow in detail, but basically, the first step, after the BAM files have been uploaded to the system, is to extract the raw reads from the compressed files by readgroup using the SamToFastq tool from the Picard suite, Puckelwartz explained — this extraction step is not needed if the FastQ sequence data are directly available, the paper notes. The next step is to align the reads using BWA, convert the alignments to SAM and then to compressed BAM files, and sort them using SamTools.
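To make that concrete, the extraction and alignment stage could look roughly like the Bash sketch below. This is an illustration only, not MegaSeq's actual code: the file names, reference path, thread count, and choice of BWA mode are assumptions, and the versions and options used by the published workflow may differ.

```bash
#!/bin/bash
# Illustrative sketch of the extraction, alignment, and sorting steps described
# above. Paths and sample names are placeholders, not taken from MegaSeq itself.
set -euo pipefail

REF=ref/human_reference.fasta     # hypothetical reference FASTA
BAM=input/sample.rg1.bam          # one readgroup's worth of compressed input data

# Extract raw paired-end reads from the BAM with Picard's SamToFastq
java -jar picard.jar SamToFastq \
    INPUT=$BAM \
    FASTQ=sample.rg1_R1.fastq \
    SECOND_END_FASTQ=sample.rg1_R2.fastq

# Align with BWA (bwa mem shown here; the workflow's BWA mode may differ),
# convert the SAM output to BAM, then coordinate-sort and index with SamTools
# (samtools 1.x syntax)
bwa mem -t 24 $REF sample.rg1_R1.fastq sample.rg1_R2.fastq > sample.rg1.sam
samtools view -bS sample.rg1.sam > sample.rg1.unsorted.bam
samtools sort -o sample.rg1.sorted.bam sample.rg1.unsorted.bam
samtools index sample.rg1.sorted.bam
```

In the MegaSeq setup, many such per-readgroup jobs would run simultaneously on separate Beagle nodes before being merged downstream.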
"Then we go through and merge all the readgroups and then we split them by chromosome [which allows] you to spread [the genomes] out across many nodes [and] use the size of the supercomputer much more efficiently," she said. After that, "we can do the cleaning up of the sequence and so that’s when we mark all of our duplicates" using Picard and then "we realign the sequences around the indels [and] recalibrate the [quality] scores using GATK." The final step is to re-call all the variants using the GATK's HaplotypeCaller, generate and merge the VCF files, and then run some final data cleaning procedures, she said.
Comparison tests between MegaSeq and a pipeline composed of the ELAND and CASAVA software — both developed by Illumina — highlighted the accuracy and efficiency of the former over the latter. According to results described in the paper, MegaSeq's alignments had higher mean coverage for each genome than those generated by the ELAND/CASAVA pipeline. MegaSeq also called more high-confidence single nucleotide variants, in terms of both quality score and depth, than the competing pipeline, according to the researchers.
MegaSeq is currently running on Beagle, and those interested in trying it out on their projects or installing it on their supercomputers can contact the development team. It can also be installed on other Cray systems, and possibly on supercomputers from other companies as well, although the latter possibility hasn't been tested yet. It would depend on the supercomputer, but "we do think that with some minor tweaking, it could be used on a number of systems," Puckelwartz said.
According to the Bioinformatics paper, the MegaSeq workflow is written in Bash and the submission scripts are based on Portable Batch System commands, but it can be adapted to work with other batch systems such as the Sun Grid Engine. Using it on other systems may require "optimization because of differences in machine design, which may affect bottlenecks and stability," the researchers wrote. Meanwhile, Cray is currently testing the workflow on the Cray XC30, the company's newest supercomputer model, so that life science customers of that system will be able to download and use the software on their systems if they want to, Puckelwartz said.
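As a rough illustration of how such a PBS-based submission might look, the sketch below shows a minimal job script for one hypothetical per-chromosome step. The job name, resource requests, and the helper script it launches are assumptions for illustration; the actual directives used on Beagle, or required on another machine, would differ by site.

```bash
#!/bin/bash
# Hypothetical PBS job script for one per-chromosome MegaSeq-style task.
# Resource syntax varies by site; these directives are illustrative only.
#PBS -N megaseq_chr1
#PBS -l walltime=24:00:00
#PBS -l nodes=1:ppn=24
#PBS -j oe

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# On a Cray XE6 such as Beagle, work is typically launched onto a compute node
# with aprun; on a generic PBS cluster the command would run directly.
# megaseq_chromosome_step.sh is a placeholder name, not a real MegaSeq script.
aprun -n 1 -d 24 ./megaseq_chromosome_step.sh chr1
```

Submitting one such script per chromosome and per genome (for example, with `qsub megaseq_chr1.pbs`) is what lets the scheduler spread hundreds of independent tasks across Beagle's nodes at once.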
There's also room to improve on the current system, McNally told BioInform. For example, uploading data to the system prior to beginning the analysis is still a major time sink, a problem that her team and other groups are actively working on, she said.