Researchers from the computer science, biochemistry and molecular genetics, and public health departments at the University of Virginia have put together an open source informatics pipeline for characterizing and prioritizing human genetic variation in clinical contexts that they claim is able to return analysis results in about 24 hours.
The pipeline, called SpeedSeq, comprises freely available open source software, including well-known programs such as BWA for read alignment. It also includes applications such as Sambamba, a multithreaded alternative to SAMtools that runs operations on BAM files faster, and SAMBLASTER, a program for marking duplicates that works faster than existing applications such as Picard, according to the developers.
In addition, SpeedSeq includes programs for calling various types of variants. That list includes FreeBayes, a program for detecting small polymorphisms such as SNPs, and LUMPY, a structural variant detection algorithm developed by the UVA researchers that they claim is more sensitive and faster than existing algorithms such as Pindel. Also included in the pipeline are GEMINI (short for Genome Mining), which, according to a PLOS Computational Biology paper published last year, offers a flexible framework for exploring genetic variation in disease and population genetic studies; SnpEff, a tool that annotates and predicts the effects of variants on genes; and vcflib, a C++ library for parsing and manipulating data in VCF files.
Ryan Layer, a postdoctoral fellow at UVA and one of the developers of the SpeedSeq pipeline, told BioInform that the team began working on the pipeline after updates to BWA-MEM resulted in much faster alignment of paired-end and split-read data. With the updated aligner now able to return results in just over 17 hours, compared with the approximately 36 hours required by the previous version of the software, it made sense to try to speed up the downstream analysis as well to better support clinical use of genomic data, he said.
Furthermore, the team wanted to develop a pipeline that was easy to use and could be run on commodity hardware. “If you had a compute cluster you could use probably any set of tools and just distribute the workflow over a thousand nodes and you would be able to do something very fast,” Layer said. With SpeedSeq, “if you had just a single multicore machine … with 128 gigabytes of RAM, you’d be able to do all of this analysis. The pipeline is preconfigured such that a small lab or a clinic could just download all of our tools.”
The key to the pipeline’s speed is in how these tools interface with each other, as well as steps that the developers took to ensure that the variant callers in the pipeline only receive the data that they need to make their calls, Layer explained. For example, “SAMBlaster outputs a subset of the files that Lumpy can analyze much faster than [it] would have done on the complete files.”
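The idea of handing a structural variant caller only the reads it needs can be illustrated with a minimal sketch. This is not SAMBLASTER's actual implementation. It assumes that the relevant subset consists of discordant read pairs and split reads, and it identifies them using the standard flag bits and the `SA` tag defined in the SAM format. The function names `is_discordant`, `is_split`, `route_sam_lines`, and `make_record` are hypothetical, as are the example records.

```python
from typing import Iterable, NamedTuple

# Standard SAM flag bits (defined in the SAM format specification).
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SUPPLEMENTARY = 0x800


class Routed(NamedTuple):
    all_reads: list   # every alignment record, for the small-variant caller
    discordant: list  # subset: pairs that did not align as a proper pair
    split: list       # subset: reads aligned in more than one piece


def is_discordant(flag: int) -> bool:
    """Paired, both ends mapped, but not flagged as a proper pair."""
    return bool(flag & PAIRED) and not flag & (PROPER_PAIR | UNMAPPED | MATE_UNMAPPED)


def is_split(flag: int, optional_fields: list) -> bool:
    """Supplementary alignment, or a record carrying an SA tag."""
    return bool(flag & SUPPLEMENTARY) or any(
        field.startswith("SA:Z:") for field in optional_fields
    )


def route_sam_lines(lines: Iterable[str]) -> Routed:
    """Make one pass over SAM text records and collect the two subsets."""
    routed = Routed([], [], [])
    for line in lines:
        if line.startswith("@"):  # header lines are skipped in this sketch
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        routed.all_reads.append(line)
        if is_discordant(flag):
            routed.discordant.append(line)
        if is_split(flag, fields[11:]):  # optional fields follow the 11 mandatory ones
            routed.split.append(line)
    return routed


def make_record(name: str, flag: int, extra: tuple = ()) -> str:
    """Build a made-up SAM record with the 11 mandatory fields (illustration only)."""
    core = [name, str(flag), "chr1", "100", "60", "50M", "=", "300", "250", "*", "*"]
    return "\t".join(core + list(extra))


records = [
    "@HD\tVN:1.6",
    make_record("proper", 0x1 | 0x2 | 0x40),
    make_record("discordant", 0x1 | 0x40),
    make_record("split", 0x1 | 0x2 | 0x40, ("SA:Z:chr2,500,+,25S25M,60,0;",)),
]
result = route_sam_lines(records)
print(len(result.all_reads), len(result.discordant), len(result.split))
```

Because the subsets are collected in the same pass that reads the alignments, a downstream caller can work from the much smaller lists instead of rescanning the complete file, which is the kind of saving Layer describes.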
The pipeline’s variant calls are also accurate, according to the developers. In one validation study where the researchers used the pipeline to analyze data from a family pedigree, “we are able to phase the variants from grandparents to grandchildren, and in 97 percent of cases the structural variants that Lumpy called in the grandchildren were phased properly on the proper haplotype from the grandparents,” Colby Chiang, a graduate student in genetics at UVA and one of the developers of the pipeline, explained to BioInform. In another study, this time focused on calling variants with low allele frequency in tumor-normal pairs, the pipeline was able to call variants with an allele frequency of about 18 percent with about 90 percent sensitivity, Chiang said.
The software has been tested on Illumina short-read data, but the developers believe it will also be able to handle reads from other sequencing instruments. Moving forward, they plan to make the SpeedSeq pipeline available on Amazon cloud infrastructure and also to get their pipeline installed and running at genome centers and hospitals and used in projects such as analyzing genomic data from newborns. Layer said that the team has begun working with some groups interested in using the pipeline but declined to name them.
The team is also making improvements to the pipeline, including developing methods to more accurately call copy number variants, and improving its ability to call and interpret structural variants, Chiang said.