CHICAGO – This month, Imec introduced elPrep 5, the latest iteration of its DNA analytics platform, a release that the Belgian nanoelectronics company claims can process a whole genome sequence eight to 16 times faster than the Broad Institute's benchmark Genome Analysis Toolkit (GATK).
The new version of elPrep takes advantage of parallel computing technology and improved algorithms to perform a complete analysis in a single pass, generally in just a few hours. Even with the gains in speed, elPrep 5 analysis produces identical results to GATK, SAMtools, and the Broad's Picard toolkit.
It also adds variant calling, a piece that had been missing from elPrep 4 and earlier releases.
The variant calling in elPrep 5 is based on the haplotype caller algorithm from GATK4, but Imec built its own implementation of the algorithm so that the software can run on a parallel computing framework.
"We [had] optimized half of the pipeline, but then there was a large block of execution that was still the same as before, so the idea was if we add that step and we merge it into the computation like the other steps, we will gain a significant speed-up," Charlotte Herzeel, a researcher in the ExaScience Life Lab, told GenomeWeb. "We now have that [acceleration] on the full pipeline," Herzeel said.
"We specifically focused on having an optimized runtime. We tried to look at executing this pipeline as fast and efficiently as possible, given certain hardware resources," Herzeel said.
In an online press conference, Roel Wuyts, team lead of the ExaScience Life Lab, which Imec hosts on its Leuven, Belgium, campus, said that elPrep 5 advances Imec's goal of building "smart societies," which he said involves embracing digital technology and connected devices to improve people's lives.
"In a smart society … you would want personalized treatments where based on the genome of patients, you can do a personalized drug selection," he said.
Getting to that point remains an arduous task, but Imec also develops microfluidics and PCR chips to shrink the size of nanopore sequencers and thus make sequencing accessible to more people, Wuyts noted.
"You can only do personalized treatments if you also do population-level analysis, because you will need to be able to compare the genome and the treatments of a patient with many other patients in order to give useful recommendations," he said.
Wuyts also noted that sequencing has started to make its way into clinical practice as well as research, particularly in cancer and rare diseases. With hospitals investing in sequencers for clinical use, it makes economic sense for them to use the pricey instruments as often as possible, but that is creating a data glut.
Herzeel explained that there are three basic phases to sequencing computation: mapping of raw reads to a reference genome; the BAM processing phase that readies the data for statistical analysis; and variant calling. Imec's elPrep covers BAM processing and now, with the release of version 5, variant calling.
Laboratories typically run these steps through different tools that are connected by a series of scripts. "This has a number of drawbacks in terms of performance," Herzeel said.
In particular, there is a lot of file input/output (I/O) activity as each tool reads in data and then produces a file for the next phase. This requires considerable computing resources as well as time.
"Each tool has to wait for the previous to finish before it can start processing," Herzeel said. "On top of that, there is also a lot of repeated iteration over the same data as each step needs to perform its set of operations on that same data."
In theory, one sequencer can process nearly 9,000 samples at 30X coverage in a year, generating massive amounts of raw data for analysis. The analytics stage takes time: GATK needs at least an hour and a half to "preprocess" a single sample, not counting variant calling.
Wuyts noted that it takes a year and a half of compute time to complete the preprocessing of all 9,000 annual samples from a sequencer using GATK. He said elPrep 5 can perform this task in a total of about two months using similar hardware.
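The figures above can be checked with back-of-the-envelope arithmetic. The short Python sketch below multiplies the quoted sample count by the quoted per-sample preprocessing time for GATK, then divides by the eight- to 16-fold speed-up Imec claims. It assumes samples are processed one after another on a single server, which is a simplification for illustration and not something Imec stated; the function names are hypothetical.

```python
# Rough check of the compute-time figures quoted in the article.
# Assumption (ours, not Imec's): samples run back to back on one server.

HOURS_PER_YEAR = 24 * 365
HOURS_PER_MONTH = HOURS_PER_YEAR / 12

SAMPLES_PER_YEAR = 9000        # "nearly 9,000 samples" per sequencer per year
GATK_HOURS_PER_SAMPLE = 1.5    # "at least an hour and a half" of preprocessing


def total_compute_hours(samples, hours_per_sample):
    """Total sequential compute time for all samples, in hours."""
    return samples * hours_per_sample


def months_at_speedup(hours, speedup):
    """Compute time in months after dividing by a speed-up factor."""
    return hours / speedup / HOURS_PER_MONTH


gatk_hours = total_compute_hours(SAMPLES_PER_YEAR, GATK_HOURS_PER_SAMPLE)
gatk_years = gatk_hours / HOURS_PER_YEAR

print(f"GATK preprocessing: {gatk_hours:.0f} hours, about {gatk_years:.2f} years")
for speedup in (8, 16):
    months = months_at_speedup(gatk_hours, speedup)
    print(f"At {speedup}x faster: about {months:.1f} months")
```

Under these assumptions the GATK total comes to roughly a year and a half, and an eightfold speed-up lands close to the two months Wuyts cited.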
As for the complete analysis, it takes elPrep 5 less than six hours to run a whole-genome sequence at 50X coverage or about eight minutes for a 50X whole exome on a standard Intel Xeon server, according to Imec.
"ElPrep is a tool for very fast processing of aligned sequencing data and it provides a speed-up of between eight and 16 times," Herzeel explained. Imec is positioning the software as a "drop-in replacement" for tools such as SAMtools, Picard, and GATK4 that produces virtually identical BAM and VCF outputs.
Available for free download from GitHub, elPrep 5 is written in Go, an open-source programming language from Google, which Herzeel said makes the software modular and extensible, while still following GATK best practices. Others can participate in the open-source community and develop new applications within the framework, according to Herzeel.
Herzeel said that elPrep grew out of a project at Janssen Pharmaceutica, where researchers wanted to run their pipelines faster than the software available to them at the time allowed. "Researchers were waiting for results and they couldn't progress," Herzeel said.
Imec began developing elPrep in 2013 and first publicly released the software in 2015.
Herzeel said that ExaScience Life Lab aims to develop software that helps users scale up their operations. That can come through new algorithms, acceleration, or parallelization.
"With elPrep, we looked at the whole picture and we tried to figure out how to optimize the whole thing. If you optimize only one part, then that will be fast, but the rest of the pipeline may not be changed," Herzeel told GenomeWeb.
"The goal of elPrep is to focus on community-defined standards such as the Sequence Alignment/Map file format and GATK best practices for analyzing sequencing data," Herzeel said in the virtual press conference.
The elPrep software uses multi-threading computing technology to "parallelize" the computing of sequencing pipelines, thus avoiding the time-consuming step of having to output one type of data to a hard drive while another type is being processed, according to Herzeel.
"We aim to produce identical results as the reference implementations of those pipelines," she said. "Our software architecture is quite successful at optimizing the execution of pipelines and it improves the performance by up to a factor of 16 by using [fewer] or similar compute resources."
In a follow-up interview with GenomeWeb, Herzeel said that it is hard to compare the performance improvement in elPrep 5 over elPrep 4 because the new version includes the extra step of variant calling.
"Our software is able to merge the different steps [of sequencing computation]," Herzeel said. "It takes advantage of multi-threading that is available on modern servers."
The new elPrep 5 architecture is designed so that the single piece of software only needs a single pass through the data to process the whole pipeline. "The tool [is] responsible for ordering the steps, parallelizing the steps, and merging their execution," Herzeel said.
This, she said, cuts literally hundreds of hours of compute time for whole-genome and whole-exome datasets.
"The fact that our architecture is able to combine and merge different steps, it improves the performance more than optimizing an individual step," Herzeel said.
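The difference between chaining separate tools and merging steps into one pass can be sketched in a few lines of Python. This is a toy illustration of the general idea only, not elPrep's actual code (which is written in Go): the read records, the three steps, and every function name here are hypothetical. The first function walks the whole dataset once per step and builds an intermediate list each time, standing in for the intermediate files of a scripted pipeline. The second composes all steps per read and hands chunks of reads to a thread pool, so the data is traversed once.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for aligned reads; real pipelines work on SAM/BAM records.
READS = [
    {"name": f"read{i}", "mapped": i % 5 != 0, "pos": (i * 7) % 100}
    for i in range(1000)
]


# Hypothetical per-read steps: each returns a read, or None to drop it.
def drop_unmapped(read):
    return read if read["mapped"] else None


def tag_read_group(read):
    return {**read, "rg": "sample1"}


def clip_position(read):
    return {**read, "pos": min(read["pos"], 50)}


STEPS = [drop_unmapped, tag_read_group, clip_position]


def multi_pass(reads, steps):
    """One full pass per step, with an intermediate list between steps."""
    passes = 0
    for step in steps:
        reads = [out for out in map(step, reads) if out is not None]
        passes += 1
    return reads, passes


def apply_all(read, steps):
    """Run every step on a single read, stopping early if it is dropped."""
    for step in steps:
        read = step(read)
        if read is None:
            return None
    return read


def process_chunk(chunk, steps):
    results = (apply_all(read, steps) for read in chunk)
    return [out for out in results if out is not None]


def single_pass(reads, steps, workers=4, chunk_size=100):
    """Traverse the data once, applying the merged steps to chunks in parallel."""
    chunks = [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        done = list(pool.map(lambda chunk: process_chunk(chunk, steps), chunks))
    return [read for chunk in done for read in chunk]


merged = single_pass(READS, STEPS)
staged, passes = multi_pass(READS, STEPS)
print(f"same output: {merged == staged}, passes in staged version: {passes}")
```

Two caveats apply. Python threads do not speed up pure-Python work because of the interpreter's global lock, so the sketch shows the structure rather than the performance gain. And steps that need a global view of the data, such as sorting, cannot be expressed as a simple per-read composition like this one.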
elPrep runs in two modes, one that keeps the data in RAM and another that works from disk. In either mode, elPrep runs 7.5 to 15 times faster than the original GATK algorithm while producing the same output as GATK. Herzeel also said that Imec saw an acceleration of 6 to 11 times with the parallelized version of the GATK haplotype caller algorithm, while using substantially fewer computing resources.
Imec tested elPrep 5 on both in-house servers and on the Amazon Web Services (AWS) cloud to demonstrate how the software can scale.
"Renting a node with more hardware resources tends to cost more on Amazon, but what this experiment actually shows is that using a bigger node to run your software doesn't necessarily have to cost more, provided that your software scales," Herzeel said.
Imec said that Janssen Pharmaceutica and Seven Bridges Genomics have validated elPrep 5. Other regular users of elPrep include Dutch bioinformatics company BlueBee — recently acquired by Illumina — and several Belgian hospitals.
Because elPrep scales up to large jobs better than GATK, it saves money on cloud implementations: the cost of running on AWS remains more or less stable regardless of file size, while large GATK runs could cost four times more, Herzeel said.
She also noted that this type of comparison does not account for costs associated with waiting for jobs to finish. "If you have your researchers or doctors waiting for a result, that may actually cost more than running a job," Herzeel said.
With whole genomes, benchmarking the Illumina Platinum Genomes dataset at 50X coverage against the hg38 reference genome, elPrep used only 70 percent as much peak RAM or peak disk space as GATK, while running eight to 16 times faster.
Imec is not alone in building bioinformatics tools that outperform GATK benchmarks. For example, San Jose, California-based Sentieon won a PrecisionFDA challenge in August with its DNAseq software, which offers a 20- to 50-fold increase in processing speed from BAM to variant call files over the standard BWA-GATK pipeline, without a corresponding increase in hardware requirements.
However, Herzeel said that the Imec approach is slightly different in that it tries to merge the computational functions and then parallelize each step in the pipeline, while also striving for the same results as the reference tool, in this case, GATK.
The target user is anyone who wants to analyze sequences. While this group primarily includes the pharmaceutical industry and academic researchers, Herzeel noted that elPrep has been implemented in hospitals, including for diagnostic purposes.