Bioinformatics startup Arpeggi has launched a free online resource called the Genome Comparison and Analytic Testing, or GCAT, tool that provides data and metrics for use in comparing the performance of alignment and variant calling tools.
Separately, the company, which officially opened its doors last October, is preparing to launch and market later this year a yet-to-be-named proprietary variant caller, internally referred to as the Arpeggi engine, that will be able to align reads and call variants directly from FASTQ files in a single step. The software is currently being beta tested in a pilot program with about a dozen unnamed customers.
Arpeggi co-founder and CEO Nir Leibovich told BioInform that the company is also mulling a few commercialization options including partnering with a cloud-based bioinformatics company to offer a combined web-based product as well as working with a hardware vendor to provide compute infrastructure along with the software.
Austin, Texas-based Arpeggi launched and demoed GCAT — which runs on Amazon Web Services — last week at the Bio-IT World Conference in Boston.
GCAT tracks the performance of alignment or variant calling pipelines, the metrics used to assess performance at different stages of these processes, and the sequencing application in question. This information can then be used to select the optimal approach for getting useful data out of NGS projects.
In an interview with BioInform this week, David Mittelman, an associate professor at Virginia Bioinformatics Institute and Arpeggi's chief scientific advisor, said the company developed GCAT to use internally to benchmark and test the Arpeggi engine during its development.
Seeing a need for a more standardized approach to comparing NGS tools, the company decided to make a simpler, less bulky version of the system available for general use, excluding features such as the databases used to support data slicing. The aim is to give the community an unbiased way to narrow down the list of existing tools and metrics available for their analysis using consistent datasets, and to compare the performance of newly developed analysis methods and metrics.
They also wanted the community's input on the best metrics for measuring tools' performance, and to establish a framework for standardizing sequence data analysis more generally, which they believe will help increase adoption rates in academia and industry.
GCAT currently has 20 datasets: 16 simulated using a tool called dwgsim and four real exome datasets generated on Illumina and Life Technologies' Ion Torrent sequencers. Users can download the data, run their internal alignment or variant calling pipelines, and upload the results to GCAT, which then analyzes them using a series of metrics such as transition/transversion ratios, coverage depth, and correct versus incorrect read mapping.
It then generates a report that describes the pipeline's performance, which users can compare with results from other pipelines used to analyze the same dataset. There's also a community page where users can further discuss their findings and provide feedback and suggestions to improve GCAT.
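To make one of the metrics mentioned above concrete, the following is a minimal sketch of how a transition/transversion (Ti/Tv) ratio could be computed from a list of single-nucleotide variants. It is not GCAT's implementation, which the company has not described here; the function names and the input format (pairs of reference and alternate bases) are assumptions made purely for illustration.

```python
# Illustrative sketch only: not GCAT's code. Function names and the
# (ref, alt) input format are assumptions made for this example.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """Return True if ref->alt stays within purines or within pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs):
    """Compute the Ti/Tv ratio for an iterable of (ref, alt) base pairs.

    Records that are not simple single-base substitutions are skipped.
    Returns None when there are no transversions, since the ratio is
    undefined in that case.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, ambiguous bases, and non-variants
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return None
    return transitions / transversions


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("A", "C")]
    print(ti_tv_ratio(calls))
```

In practice the (ref, alt) pairs would be read from the REF and ALT columns of a pipeline's VCF output. A ratio that departs sharply from the value expected for the sequenced region is commonly treated as a sign of excess false-positive calls.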
Since last week's launch, about 25 people have downloaded GCAT data, run either alignment or variant calling pipelines, and submitted their results, Leibovich said. Additionally, several thousand people have either begun to run pipelines, participated in community discussions, or explored reports, he said.
Arpeggi is tracking the metrics and methods being used in GCAT but, since the website just launched, it’s a bit too soon to see any trends, Leibovich said. However, the company plans to share its findings when the project has had time to mature, he said.
They also plan to upload additional tests for the community to use, starting with a test for Mendelian consistency, and will add other tests based on requests from GCAT users.
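For readers unfamiliar with the idea, a Mendelian consistency test checks whether a child's called genotype could have been inherited from the parents' called genotypes. The sketch below shows the basic logic under simplifying assumptions: autosomal diploid sites, no missing genotypes, and no allowance for de novo mutations. The function names and the genotype representation (a pair of allele indices, as in a VCF GT field) are hypothetical, and Arpeggi has not said how its planned test will work.

```python
# Illustrative sketch only: the planned GCAT test is not described in
# detail. Genotypes are assumed to be pairs of allele indices,
# e.g. (0, 1) for a heterozygous call at a diploid autosomal site.


def is_mendelian_consistent(child, mother, father):
    """Return True if the child could have received one allele from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def mendelian_error_rate(trio_sites):
    """Fraction of sites whose child genotype is inconsistent with the parents.

    trio_sites is an iterable of (child, mother, father) genotype tuples.
    Returns None if no sites are supplied.
    """
    checked = 0
    errors = 0
    for child, mother, father in trio_sites:
        checked += 1
        if not is_mendelian_consistent(child, mother, father):
            errors += 1
    if checked == 0:
        return None
    return errors / checked


if __name__ == "__main__":
    sites = [
        ((0, 1), (0, 0), (1, 1)),  # consistent: 0 from mother, 1 from father
        ((1, 1), (0, 0), (0, 1)),  # inconsistent: mother has no 1 allele
    ]
    print(mendelian_error_rate(sites))
```

A low error rate across a family trio suggests that a pipeline's calls are internally coherent, which is why this kind of check is useful on real data where no ground truth is available.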
The company has assembled a board of advisors to help it sort through the feedback it receives from users and "to help us prioritize … what we should continue to build on," Leibovich said.
Meanwhile, the company continues to prepare for the launch of its variant caller product later this year. The tool operates in two related steps. The first is a genome reconstruction step, during which it maps and aligns reads and performs some local assembly. In the second step, the program calls variants in the data.
According to the company, combining the genome reconstruction process into a single step speeds up the tool's performance because it minimizes data transfer bottlenecks and increases scalability.
"Furthermore, because the Arpeggi engine approach is integrated, it allows data from earlier steps to be evaluated later in the analysis process and findings to be reassessed if the available evidence warrants such a revision," Mittelman said.
The tool also lets users generate BAM files or VCF files at any point in the process and then use their own pipelines to call the variants if they prefer not to use Arpeggi's variant caller or if they want to call variants that the engine can't detect.
Arpeggi is running two pilots with its software: one focuses on using the system to support sequencing centers' activities, and the second tests the system's ability to support genetic testing products. It isn't disclosing much detail about these pilots because it has signed non-disclosure agreements with participants.
Separately, the company is one of a group of 13 firms tapped to participate in GE's Entrepreneurship program.
During the three-year program, Arpeggi, along with GE executives and the GE Healthymagination fund, will use its tool and expertise in genomic data analysis and interpretation to develop personalized medicine products.
The company isn't disclosing details about that partnership at this time.