Blue Waters Supercomputer Supporting Development of Genotyping Chip for African Populations


NEW YORK (GenomeWeb) – Computational scientists at the University of Illinois at Urbana-Champaign have worked with researchers at the University of Cape Town and elsewhere to use supercomputing infrastructure housed at UIUC to analyze genomic data as part of a broader effort to develop an inexpensive genotyping chip to test for variants found specifically in people of African descent.

Specifically, researchers at UIUC's National Center for Supercomputing Applications (NCSA) used the Blue Waters supercomputer to identify genomic variants in over 300 deeply sequenced human samples. The project is a collaboration between the Genome Analysis Working Group of the Consortium for Human Heredity and Health in Africa (H3Africa), H3ABioNet, and the Wellcome Trust Sanger Institute. It aims to come up with a list of genomic variants for a bead-array-based genotyping chip specific to people of African descent that will be built by Illumina.

Victor Jongeneel, director of UIUC's High-Performance Biological Computing (HPCBio) group, told GenomeWeb that the project researchers tapped Blue Waters to help with the analysis because the quantity of data that the project generated overwhelmed the local compute infrastructure and storage resources. For the project, the researchers gathered 348 samples from both urban and rural parts of the continent, which were deeply sequenced by researchers at Baylor College of Medicine to between 30X and 50X coverage. In addition to the locally sourced samples, the researchers also used publicly available sequence data from the 1000 Genomes Project and over 2,000 low-depth whole-genome sequences from the African Genome Variation Project.

The researchers had initially hoped to do the analysis on high-performance computing infrastructure at the University of Cape Town but after conducting some early pilot projects on smaller datasets, they realized that they would not be able to complete the variant calling in a timely fashion. "If you added up the number of machines and cores that they had available, they could run maybe two or three genomes at a time. They just didn't have the physical capacity to run the full workflow on this number of samples," Jongeneel said. "So they contacted us and asked if we would be willing to do the variant calling."

The much larger Blue Waters system, which officially went into operation in 2012, features nearly 400,000 compute cores, more than 1.5 petabytes of memory, over 25 petabytes of disk storage, and more than 500 petabytes of tape storage. It includes over 22,000 CPU nodes with 64 gigabytes of memory per node and over 4,200 GPU nodes with 32 gigabytes of memory per node. The researchers were allocated about 500,000 node-hours on Blue Waters for the African genotyping chip project, Jongeneel said. With that allotment, they were able to complete the variant extraction in around 250,000 node-hours, with a total disk footprint of 600 terabytes, according to UIUC researchers.
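A quick back-of-envelope from the figures above puts the cost in per-sample terms (our arithmetic, not a figure reported by the project):

```python
# Illustrative arithmetic only, based on the numbers reported above.
node_hours_used = 250_000  # node-hours for the variant extraction
samples = 348              # deeply sequenced samples from the project
disk_tb = 600              # total disk footprint, in terabytes

per_sample_hours = node_hours_used / samples
per_sample_disk_gb = disk_tb * 1024 / samples

print(round(per_sample_hours))    # roughly 718 node-hours per genome
print(round(per_sample_disk_gb))  # roughly 1,766 GB of disk per genome
```

These averages fold in all pipeline stages and QC steps, so they are a rough scale estimate rather than the cost of any single tool.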

Besides limitations in physical hardware, the project researchers did not have a ready-to-use variant calling pipeline that would work for the project, Jongeneel said. In contrast, UIUC had a pipeline that it developed in partnership with Mayo Clinic as part of an ongoing collaboration that started back in 2010. The pipeline includes tools for mapping and calling variants, checking the quality of alignments, and removing inconsistent reads. 

"We had to do a little bit of work to modify it so that it would actually match the requirements of the project," Jongeneel said. Specifically, the researchers wanted the platform to closely mimic the one used by the Sanger Institute for the African Genome Variation project so that the variant calling protocols would be consistent across projects. For example, they wanted to use the haplotype caller from the Broad Institute’s Genome Analysis Toolkit; in the Mayo pipeline, the researchers were using the Unified Genotyper which is also from the GATK, he said. Also the U of I researchers had simplified some steps in the Mayo workflow so that the pipeline could run faster but the African genotyping project researchers did not want these modifications, he said. 

One option would have been to try to implement the Sanger pipeline on Blue Waters, however it was simpler to try to adapt the Mayo pipeline for the project. "The vast majority of existing pipelines are not easily portable, and the Blue Waters architecture and job submission policies are a bit off the beaten path because of the massive size of the machine," Jongeneel explained. "So it was easier to take a pipeline for which we fully understood the code and that we had tested extensively, and to then incorporate the idiosyncrasies of the Sanger pipeline."

There were also a number of quality control steps that needed to be performed as part of the analysis to "ensure that the results would be appropriate for subsequent population variant calling," Liudmila Mainzer, a senior research scientist with NCSA and the HPCBio group, said in a statement. "[So] we had to do a bit of work to optimize the workflow for the architecture of Blue Waters."

As part of the project, the NCSA and HPCBio teams also tested new methods of transferring data more efficiently across participating sites. Specifically, "we engaged in debugging the issues of data transfers over the network to South Africa, and found better configuration settings that could be applied to their system," Jim Glasgow, a senior system engineer with the storage enabling technology group, explained in a statement.

They also evaluated a number of existing data transfer tools, including one called bbftp, to see if they could improve on the method currently used with Blue Waters. "We were able to test and verify functionality and performance of these tools and we now have alternatives to our existing high-performance transfer method … should the need arise," Galen Arnold, a system engineer with NCSA and a member of the Blue Waters application team, said in a statement.

The African genotyping chip project also offered opportunities for the UIUC researchers to test some specific capabilities of Blue Waters that could help them improve its performance in future projects. For example, "we learned how to balance the genomics workload for optimized file system performance without impacting other users," Sharif Islam, senior systems engineer with the Blue Waters system, noted in a statement. "Blue Waters has the capability to isolate certain network and file system traffic, to ensure high throughput. Working on this project helped us test and properly utilize this capability."

The project also provided a testbed for some new tools that the researchers have developed in the course of running Blue Waters on projects both within and outside the life sciences, and that they hope to make more broadly available. One of these is a piece of software called Parfu, which is designed to help researchers working with large collections of small files make more effective use of supercomputing storage infrastructure, which is typically designed to work with much larger data files.

Parfu was developed by Craig Steffen, a senior research scientist in the Blue Waters team. He believes that the tool will be of particular benefit to bioinformatics researchers because it will enable them to more efficiently use large-scale storage and file systems. In bioinformatics, most computational workflows are developed to run on much smaller machines and work with multiple small files, he told GenomeWeb. But datasets are growing and researchers now require much larger systems for their projects.

However, "these very large HPC systems which are used to dealing with … very large and carefully constructed files, do not perform well when you stick a directory with 5 million [small] files in it [for example]," he explained. Systems like Blue Waters can handle the computation but moving the data to and from storage infrastructure on these systems is problematic because the files are spread out all over the tape. Existing methods like TAR (Tape Archive) and ZIP files gather multiple smaller files into one larger file and transport them that way, but these methods are not designed to work with parallel systems.

Parfu essentially does the same job that TAR does but in a parallel, efficient, and high-performance manner, Steffen said. "It takes big directories of teeny tiny files and it puts them into one container file or archive file, and then that file is easily pushed to tape [storage] and then pulled back."
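A minimal sketch of the packing step Steffen describes can clarify what the tool is replacing. The Python below is illustrative only, not Parfu itself (Parfu runs the equivalent operation in parallel on an HPC system, and the file names here are invented): it walks a directory of many tiny files and bundles them into a single archive that tape and parallel file systems handle far more gracefully.

```python
# Serial illustration of tar-style packing: many small files -> one
# container file. Parfu performs this same job in parallel; this sketch
# shows only the underlying operation.
import os
import tarfile
import tempfile

def pack_directory(src_dir: str, archive_path: str) -> int:
    """Bundle every file under src_dir into a single archive; return the count."""
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                # Store paths relative to src_dir, as tar conventionally does.
                tar.add(path, arcname=os.path.relpath(path, src_dir))
                count += 1
    return count

# Tiny demo: 1,000 small files become one archive file.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for i in range(1000):
        with open(os.path.join(src, f"read_{i}.txt"), "w") as f:
            f.write("ACGT" * 8)
    archive = os.path.join(dst, "bundle.tar")
    print(pack_directory(src, archive))  # 1000
```

The pain point Steffen describes is that the loop above touches the metadata server once per file; done serially over millions of files on a parallel file system, that dominates the runtime, which is why a parallelized packer matters at Blue Waters scale.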

While Parfu was not used for the African genotyping project, it was tested on some small datasets from the project, which helped highlight some issues with the tool that Steffen is now working to correct. He is also adding features to Parfu to make it more TAR-like so that it's more familiar to researchers. "TAR is such a widely used tool that there are certain ways that people expect archive tools to work," he said. "So I have been adding features to Parfu to make the controls work and the way it uses directories more like TAR does."

He told GenomeWeb that he hopes to release the software by the end of July. The target user for Parfu would be scientists who are using large parallel file systems, like Lustre and others, that have at least a few thousand processors. It could work with smaller systems, but in those cases it might make more sense to use the TAR format because it's simpler and better established, according to Steffen.

Besides the current partnership, Blue Waters has supported a number of other bioinformatics projects. One such project, done as part of the partnership between UIUC and Mayo Clinic, looked at epistatic interactions in Alzheimer's disease and other neurological disorders. For that study, the researchers used Blue Waters to analyze genotyping and gene expression data collected from two different regions of biopsied brains from patients, Jongeneel told GenomeWeb. "It is hard enough to do an expression GWAS on data that rich [but] it was pretty much impossible to do epistatic interactions [on standard infrastructure]," he said. "We managed to run that on Blue Waters."

Other life science research projects that Blue Waters has been used for include one at the University of California, San Diego, where researchers used the supercomputer to construct an atomic-resolution model of the influenza viral coat. Separately, researchers at UIUC have used the supercomputer to run simulations that explore how human proteins that aid HIV infection bind to the HIV capsid.
