This article has been updated to correct errors in the previously reported modifications made to BreakSeq, one of the components of the HugeSeq pipeline.
Researchers from Stanford and Yale Universities have published the details of a computational pipeline that provides an automated approach for detecting and annotating genetic variations in high-throughput sequencing data.
The infrastructure, dubbed HugeSeq, incorporates several well-known open source algorithms developed by researchers in both universities, and is part of the computational pipeline used by human genome interpretation startup Personalis. The platform was developed to identify a broad range of variants, including SNPs, indels, and large structural variants.
Hugo Lam, a computational biology project manager at Personalis and one of the authors on the paper, told BioInform that the publicly available version of HugeSeq — which is meant for academic use only — and the commercial version offered by the company have similar capabilities; however, the latter will include improvements in the pipeline's variant-calling sensitivity and accuracy, such as filters to sift through lower-quality calls and select true positives.
Personalis is one of several firms hoping to tap into the nascent market for clinical genome interpretation services (BI 12/22/2011), and it is not the first to offer a free version of its software for academic users.
Also attempting to cater to academics, Cypher Genomics, a Scripps Health spinout, obtained a grant from the National Human Genome Research Institute earlier this year to develop a publicly available version of its commercial variant annotation platform (BI 2/17/2012).
While many bioinformatics tools purport to detect and annotate sequence variants, most of them accept input and present results in different formats and require specific parameters in order to run the analysis, Lam said.
These are factors that make extracting genetic information from sequence data problematic, Lam and colleagues note in a research paper published in Nature Biotechnology that describes the pipeline.
Michael Snyder, director of Stanford's Center for Genomic and Personalized Medicine, a co-author on the paper, and a Personalis founder, added in a conversation with BioInform that although platforms like Galaxy have many of the individual tools that comprise HugeSeq, they aren't set up to handle large data files.
HugeSeq was an attempt to "align standards" and "integrate tools" into an automated pipeline that wouldn't require researchers to account for the unique requirements of each platform individually, Lam explained.
It is also unique in the way it handles structural variants, which are "particularly difficult" to detect: it uses four separate programs to call these alterations, he said.
Snyder added that HugeSeq's modular nature makes it easy to swap programs in and out as improved versions are developed.
HugeSeq also uses a MapReduce approach, which allows it to run jobs on multiple computers in a cluster in order to accelerate the detection and annotation process, though Lam noted that it's also possible to run the pipeline on a single computer.
For cluster deployments, the team developed a software program called Simple Job Management, or SJM, to simplify batch scheduling of jobs. SJM was developed for use with Sun Grid Engine (now called Oracle Grid Engine), but Lam said that his team plans to eventually release it as an open source tool so that users can develop adaptors for other job-scheduling engines.
For example, he said that Scripps Research Institute is interested in using HugeSeq and has a portable batch-scheduling system that it uses for its cluster, so it is working with SJM's developers to build an adaptor.
Tracking and Describing Genomic Variants
As described in the Nature Biotechnology paper, HugeSeq comprises a mapping phase that prepares and aligns the sequence reads; a sorting phase that combines and sorts alignments by mapped chromosomes for parallel variant detections; and a reduction phase that detects and annotates the variants and provides outputs in the variant call and general feature formats.
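The three phases can be illustrated with a small Python sketch. This is not HugeSeq's code: the function names, the tuple layout for reads and alignments, and the toy aligner and caller are all hypothetical stand-ins chosen only to show how work is split, regrouped by chromosome, and then processed independently.

```python
from collections import defaultdict

def map_phase(reads, chunk_size, align):
    """Split reads into chunks and align each chunk independently."""
    alignments = []
    for start in range(0, len(reads), chunk_size):
        chunk = reads[start:start + chunk_size]
        # On a cluster, each chunk would be a separate job.
        alignments.extend(align(read) for read in chunk)
    return alignments

def sort_phase(alignments):
    """Group alignments by chromosome and sort each group by position."""
    by_chrom = defaultdict(list)
    for chrom, pos, name in alignments:
        by_chrom[chrom].append((pos, name))
    return {chrom: sorted(items) for chrom, items in by_chrom.items()}

def reduce_phase(by_chrom, call_variants):
    """Run a caller once per chromosome; the calls do not depend on each other."""
    return {chrom: call_variants(chrom, items) for chrom, items in by_chrom.items()}

# Hypothetical stand-ins for a real aligner and variant caller.
def toy_align(read):
    name, chrom, pos = read
    return (chrom, pos, name)

def toy_call(chrom, items):
    return len(items)  # a real caller would emit variant records

reads = [("r1", "chr2", 500), ("r2", "chr1", 300),
         ("r3", "chr1", 100), ("r4", "chr2", 50)]
aligned = map_phase(reads, chunk_size=2, align=toy_align)
grouped = sort_phase(aligned)
summary = reduce_phase(grouped, toy_call)
```

Because each chunk in the mapping phase and each chromosome in the reduction phase is handled on its own, those units could be dispatched to different machines.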
The paper also notes that HugeSeq "covers a more complete spectrum of variant types" compared with other platforms for genomic data analysis that "typically analyze SNPs or a limited set of variants." In addition to SNPs, HugeSeq identifies short insertions and deletions as well as larger structural variants, the authors note.
HugeSeq's framework incorporates well-known algorithms such as the Burrows-Wheeler Aligner, the Genome Analysis Toolkit, SAMtools, BreakDancer, and Pindel.
Most of the tools didn't require additional development, but some did require a few enhancements to ease the integration process, Lam said.
For example, BreakSeq, which Lam and colleagues developed for structural variant analysis, was modified to include support for BAM file formats and gapped alignments where previously it only supported FASTQ formats and non-gapped alignments, he said.
HugeSeq begins by dividing single or paired-end sequences into smaller subsets to enable parallel alignment and then it uses BWA to run a gapped alignment against a reference genome. Using the SAMtools program, the mapped reads are converted into the BAM format and then sorted according to their aligned chromosomal positions.
Next, HugeSeq performs a cleaning step prior to making variant calls, which includes running a local realignment around indels and SNP clusters using GATK's realigner, the paper explains.
HugeSeq then uses GATK and SAMtools to identify SNPs and small indels and then passes them through a filtering tool.
For structural and copy number variants, which are more challenging to detect, HugeSeq uses four approaches: BreakDancer for paired-end mapping; Pindel for split-read analysis; read-depth analysis using CNVnator; and BreakSeq for junction mapping.
The variants generated by the process are annotated using Annovar, which describes gene intersections, exonic variations, repeat elements, and mutation information, the researchers explained.
In a test run on a single human genome sequenced with an Illumina HiSeq instrument, the team reported that HugeSeq called more than 3.3 million concordant SNPs — those that were called by both GATK and SAMtools — and more than 4 million concordant indels between the two packages. The pipeline called more than 21,000 structural variants in total with about 1,600 reported by two or more algorithms.
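The idea of a concordant call, one reported by two or more tools, can be sketched as a simple counting exercise. The snippet below is an illustration under stated assumptions, not the pipeline's method: the function name and the (chromosome, position, reference, alternate) key are hypothetical, and exact-match comparison is a simplification, since structural variant calls from different tools would in practice need to be matched by overlapping coordinates.

```python
from collections import Counter

def supported_calls(call_sets, min_callers=2):
    """Return calls reported by at least `min_callers` of the given callers."""
    counts = Counter()
    for calls in call_sets.values():
        counts.update(set(calls))  # count each caller at most once per call
    return {call for call, n in counts.items() if n >= min_callers}

# Toy data; keys are (chromosome, position, reference, alternate).
gatk_calls = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
samtools_calls = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")}

concordant = supported_calls({"gatk": gatk_calls, "samtools": samtools_calls})
```

With two callers and a threshold of two, this reduces to a set intersection; with four callers, it keeps anything that at least two of them agree on.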
HugeSeq took about 25 hours to complete its parallelized analysis on a 48-CPU compute cluster at 30x coverage, compared to 250 hours when running each individual step on a single system. Most processes required at most 6 gigabytes of physical memory, the investigators said.
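As a rough worked example, the reported runtimes imply the following speedup. The efficiency figure assumes that the 250-hour run used a single CPU, which the article does not state, so it should be read as an estimate only; the function names are illustrative.

```python
def speedup(serial_hours, parallel_hours):
    """Ratio of serial runtime to parallel runtime."""
    return serial_hours / parallel_hours

def parallel_efficiency(serial_hours, parallel_hours, cpus):
    """Speedup per CPU; 1.0 would mean perfect scaling."""
    return speedup(serial_hours, parallel_hours) / cpus

s = speedup(250, 25)                  # runtimes reported in the article
e = parallel_efficiency(250, 25, 48)  # assumes a single-CPU serial baseline
print(f"speedup: {s:.1f}x, efficiency: {e:.0%}")
```

Under that assumption, the cluster run is about 10 times faster, or roughly a fifth of ideal linear scaling across 48 CPUs.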
The researchers also compared the sensitivity of HugeSeq calls with data from Illumina's Human Omni1Quad genotyping array, which detects about one million markers.
In one test, comparing more than 260,000 calls from the array with the 3.3 million variants reported by both SAMtools and GATK, HugeSeq reported 254,700 concordant calls, which corresponded to a sensitivity of about 99.4 percent, the researchers reported.
In another test, focused this time on structural variants, the researchers evaluated 482 deletions from the array against 1,594 high-confidence deletions — those that were reported by two or more of HugeSeq's algorithms. They found 383 concordant calls, which correspond to a sensitivity of 79.5 percent for the pipeline.
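The deletion figure follows directly from the counts given: sensitivity here is the share of array deletions that the pipeline also reported. A minimal check, with an illustrative function name:

```python
def sensitivity(concordant_calls, reference_calls):
    """Fraction of reference (array) calls that the pipeline also reported."""
    if reference_calls <= 0:
        raise ValueError("reference_calls must be positive")
    return concordant_calls / reference_calls

# Counts reported in the article: 383 concordant out of 482 array deletions.
deletion_sensitivity = sensitivity(383, 482)
print(f"{deletion_sensitivity:.1%}")
```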
The team also compared HugeSeq to SOAP, GATK, Galaxy, GenePattern, GenomeQuest, and DNAnexus and reported varying results for each platform in terms of parameters such as read alignment; SNP, indel, and SV calling; and functional annotation.
Their findings indicated that SOAP was good for most activities except providing functional annotation of variants; Galaxy lacked tools to call structural variants; and GenePattern only provided capabilities for functional annotations.
Meanwhile, commercial platforms offered by GenomeQuest and DNAnexus both lacked tools to call structural variants, the authors said.