Bioinformatics Startup Arpeggi Launches GCAT, Preps for Summer Release of Variant Calling Software


Bioinformatics startup Arpeggi has launched a free online resource called the Genome Comparison and Analytic Testing, or GCAT, tool that provides data and metrics for use in comparing the performance of alignment and variant calling tools.

Separately, the company, which officially opened its doors last October, is preparing to launch and market later this year a yet-to-be-named proprietary variant caller, internally referred to as the Arpeggi engine, that will be able to align reads and call variants directly from FASTQ files in a single step. The software is currently in beta testing in a pilot program with about a dozen unnamed customers.

Arpeggi co-founder and CEO Nir Leibovich told BioInform that the company is also mulling a few commercialization options, including partnering with a cloud-based bioinformatics company to offer a combined web-based product and working with a hardware vendor to provide compute infrastructure along with the software.

Austin, Texas-based Arpeggi launched and demoed GCAT — which runs on Amazon Web Services — last week at the Bio-IT World Conference in Boston.

GCAT tracks the performance of alignment or variant calling pipelines, the metrics used to assess performance at different stages of these processes, and the sequencing application in question. This information can then be used to select the optimal approach for getting useful data out of NGS projects.

In an interview with BioInform this week, David Mittelman, an associate professor at Virginia Bioinformatics Institute and Arpeggi's chief scientific advisor, said the company developed GCAT to use internally to benchmark and test the Arpeggi engine during its development.

Seeing a need for a more standardized approach for comparing NGS tools, the company decided to make a simpler, less bulky version of the system available for general use, excluding features such as the databases used to support data slicing. The goal is to give the community an unbiased way to narrow down the list of existing tools and metrics available for their analysis using consistent datasets, and to compare the performance of newly developed analysis methods and metrics.

They also wanted the community's input on the best metrics for measuring tools' performance, and to establish a framework for standardizing sequence data analysis more generally, which they believe will help increase adoption rates in academia and industry.

GCAT currently has 20 datasets: 16 simulated using a tool called dwgsim and four real exome datasets generated on Illumina and Life Technologies' Ion Torrent sequencers. Users can download the data, run their internal alignment or variant calling pipelines, and upload the results to GCAT, which then analyzes them using a series of metrics such as transition/transversion ratios, coverage depth, and correct versus incorrect read mapping.
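As an illustration of one of these metrics, the transition/transversion (Ti/Tv) ratio counts single-nucleotide changes within a base class (A↔G or C↔T) against changes across classes. The following is a minimal sketch of that calculation and is not GCAT's implementation; the function names and the input format, a list of (reference, alternate) base pairs, are assumptions made for the example.

```python
# Minimal sketch of a Ti/Tv calculation. Not GCAT's code: the function
# names and the (ref, alt) input format are assumptions for illustration.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True if both bases are purines or both are pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) pairs.

    Entries that are not single-base substitutions between A, C, G, and T
    are skipped. Returns None when there are no transversions, because the
    ratio is undefined in that case.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, ambiguous bases, and non-variants
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        return None
    return transitions / transversions


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A"), ("AT", "A")]
    print(titv_ratio(calls))
```

In this toy input, three calls are transitions, one is a transversion, and the deletion is ignored.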

It then generates a report describing the pipeline's performance, which users can compare with results from other pipelines used to analyze the same dataset. There's also a community page where users can further discuss their findings and provide feedback and suggestions to improve GCAT.

Since last week's launch, about 25 people have downloaded GCAT data, run either alignment or variant calling pipelines, and submitted their results, Leibovich said. Additionally, several thousand people have either begun to run pipelines, participated in community discussions, or explored reports, he said.

Arpeggi is tracking the metrics and methods being used in GCAT but, since the website just launched, it’s a bit too soon to see any trends, Leibovich said. However, the company plans to share its findings when the project has had time to mature, he said.

They also plan to upload additional tests for the community to use, starting with a test for Mendelian consistency, and will add other tests based on requests from GCAT users.
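The company has not described how its Mendelian consistency test will work. In general, such a check asks whether a child's genotype at a site could have been inherited from the parents' genotypes. The sketch below shows the basic idea for a single autosomal, diploid site; the function names and the genotype representation, a pair of allele indices as in a VCF GT field, are assumptions made for the example.

```python
# Minimal sketch of a trio Mendelian consistency check. This is a generic
# illustration, not Arpeggi's planned test. Genotypes are assumed to be
# diploid, autosomal, and given as pairs of allele indices, e.g. (0, 1).


def is_mendelian_consistent(child, mother, father):
    """True if one child allele can come from each parent."""
    first, second = child
    return (first in mother and second in father) or (
        second in mother and first in father
    )


def mendelian_error_rate(trio_sites):
    """Fraction of (child, mother, father) genotype triples that are
    inconsistent. Returns None for an empty input."""
    sites = list(trio_sites)
    if not sites:
        return None
    errors = sum(
        1 for child, mother, father in sites
        if not is_mendelian_consistent(child, mother, father)
    )
    return errors / len(sites)


if __name__ == "__main__":
    sites = [
        ((0, 1), (0, 0), (1, 1)),  # consistent: 0 from mother, 1 from father
        ((1, 1), (0, 0), (0, 1)),  # inconsistent: mother cannot pass allele 1
    ]
    print(mendelian_error_rate(sites))
```

A real test would also need to handle missing genotypes, sex chromosomes, and de novo mutations, which this sketch ignores.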

The company has assembled a board of advisors to help it sort through the feedback it receives from users and "to help us prioritize … what we should continue to build on," Leibovich said.

Meanwhile, the company continues to prepare for the launch of its variant caller product later this year. The tool operates in two related steps. The first is a genome reconstruction step, during which it maps and aligns reads and performs some local assembly. In the second step, the program calls variants in the data.

According to the company, combining the genome reconstruction process into a single step speeds up the tool's performance because it minimizes data transfer bottlenecks and increases scalability.

"Furthermore, because the Arpeggi engine approach is integrated, it allows data from earlier steps to be evaluated later in the analysis process and findings to be reassessed if the available evidence warrants such a revision," Mittelman said.

The tool also lets users generate BAM files or VCF files at any point in the process and then use their own pipelines to call the variants if they prefer not to use Arpeggi's variant caller or if they want to call variants that the engine can't detect.

Arpeggi is running two pilots with its software: one focuses on using the system to support sequencing centers' activities, and the second tests the system's ability to support genetic testing products. It isn't disclosing much detail about these pilots because it has signed non-disclosure agreements with participants.

Separately, the company is one of a group of 13 firms tapped to participate in GE's Entrepreneurship program.

During the three-year program, Arpeggi, along with GE executives and the GE Healthymagination fund, will use its tool and expertise in genomic data analysis and interpretation to develop personalized medicine products.

The company isn't disclosing details about that partnership at this time.
