Sanger Researcher Wins Pistoia's Sequence Data-Compression Algorithm Contest

BOSTON — The Pistoia Alliance has tapped a method developed at the Wellcome Trust Sanger Institute as the winner of its "Sequence Squeeze" competition, an effort to identify the best algorithm for compressing next-generation sequencing data.

Pistoia announced the winner — James Bonfield, a member of the sequencing informatics team at the Sanger Institute — earlier this week at its annual meeting in Boston.

At the meeting, the group also presented the fruits of the second phase of its sequence services program, which aims to develop a fully functional platform for next-generation sequence data analysis and storage.

Bonfield was deemed the Sequence Squeeze winner — and awarded $15,000 in prize money — for a cluster of algorithms that all delivered high performance in the top three judgment criteria, the alliance said. Bonfield's entry was selected out of more than 100 submissions for the contest.

According to the group, judges evaluated the submitted entries’ compression ratio, which is a measure of how much the algorithm squeezes the data; compress and decompress time; and compress and decompress memory.

Out of these five criteria, the judges weighted compression ratio and compress and decompress time higher than the other two criteria because of their “pivotal role in real-world usage,” the alliance said.

Compression ratio and compress time impact how quickly data can be packaged for analysis and how easily it can be stored long term, while decompress time affects how readily scientists can extract value from NGS data sets. In addition, compress and decompress time were deemed important because of the role they play in expediting data transfer between proprietary data centers and cloud-based systems for genomics storage and analysis, the alliance said.
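As a rough illustration of how the ratio and timing criteria can be measured, the sketch below times Python's standard gzip module, the same algorithm family the contest used as its baseline. The function name `measure`, the sample record, and the definition of ratio as original size divided by compressed size are assumptions made for this example, not the alliance's scoring code. Memory use is left out because it is harder to measure reliably from within Python.

```python
import gzip
import time


def measure(data: bytes, level: int = 6) -> dict:
    """Return compression ratio and wall-clock times for one gzip round trip."""
    t0 = time.perf_counter()
    packed = gzip.compress(data, compresslevel=level)
    t1 = time.perf_counter()
    restored = gzip.decompress(packed)
    t2 = time.perf_counter()

    # A compressor is only useful if the round trip is lossless.
    if restored != data:
        raise ValueError("round trip did not reproduce the input")

    return {
        # Assumed definition: original bytes per compressed byte.
        "ratio": len(data) / len(packed),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }


# A made-up FASTQ record repeated many times, purely as test input.
record = b"@read1\nACGTACGTAC\n+\nIIIIIIIIII\n"
stats = measure(record * 1000)
print(stats)
```

Highly repetitive input like this compresses far better than real reads would, so the printed ratio says nothing about real-world performance.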

Bonfield’s approach considered the importance of preserving alignment data in addition to raw FASTQ output and used two programs: fqzcomp, which compresses raw FASTQ files, and sam_comp for the SAM/BAM output.

Explaining his approach in a conversation with BioInform, Bonfield said he began by breaking up FASTQ file data into three categories: the identifier, which is usually the machine name and additional information about the sequence; the sequence itself, which was about 100 base pairs long; and the confidence value of each base.
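The three-way split can be sketched in a few lines of Python. This is a minimal illustration that assumes the common four-line FASTQ layout (identifier, sequence, separator, qualities); the function name `split_fastq` and the sample reads are invented for the example and are not taken from fqzcomp.

```python
def split_fastq(text: str):
    """Split four-line FASTQ records into identifier, sequence, and quality streams."""
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")

    names, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        names.append(header[1:])  # drop the leading '@'
        seqs.append(seq)
        quals.append(qual)
    return names, seqs, quals


# Invented sample data for illustration only.
sample = (
    "@MACHINE1:7:1101:1000\nACGTAC\n+\nIIIHH#\n"
    "@MACHINE1:7:1101:1001\nTTGACA\n+\nIIIII#\n"
)
names, seqs, quals = split_fastq(sample)
```

Keeping each kind of data in its own stream lets a different model be applied to each, which is the step the article describes next.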

The next step, he explained, was to run specific algorithms on the data in each category.

To compress the quality values, Bonfield used quality scores from the previous bases to suggest the next likely quality value. A similar approach was applied to compress the identifiers, he said.
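To see why earlier quality scores help, one can compare how many bits per value an ideal coder would need with and without knowledge of the preceding value. The sketch below computes empirical entropies over a quality string. It is a simplified stand-in with a one-symbol context, not the model fqzcomp actually uses, and the function name `bits_per_symbol` is hypothetical.

```python
import math
from collections import Counter


def bits_per_symbol(quals: str):
    """Return (order-0, order-1) empirical entropy in bits per quality value.

    Order-0 ignores context; order-1 conditions each value on the one before it.
    Both are measured over the same positions (every value that has a predecessor).
    """
    pairs = list(zip(quals, quals[1:]))
    n = len(pairs)
    cur_counts = Counter(cur for _, cur in pairs)
    prev_counts = Counter(prev for prev, _ in pairs)
    pair_counts = Counter(pairs)

    order0 = -sum(c / n * math.log2(c / n) for c in cur_counts.values())
    order1 = -sum(
        c / n * math.log2(c / prev_counts[prev])
        for (prev, _), c in pair_counts.items()
    )
    return order0, order1


# Invented quality string: long runs of similar scores, as is typical of reads.
order0, order1 = bits_per_symbol("IIIIIIIIHHHHHHHH######")
print(f"no context: {order0:.3f} bits, previous value as context: {order1:.3f} bits")
```

When neighbouring scores are correlated, the conditional figure is lower, which is the saving a context-based predictor can exploit.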

“The [machine] names are very similar,” he explained. “One name is very much like the next name and very often they have the same prefix or start off with the same text and then differ by a few digits at the end, so … you could use a previous name [to suggest] what the next one would be.”
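The idea in the quote, using the previous name to predict the next, can be illustrated with simple prefix sharing: store only how many leading characters match the previous identifier, plus the differing tail. This is a minimal sketch under that assumption; `encode_names` and `decode_names` are hypothetical helpers, and the real program may model identifiers differently.

```python
import os


def encode_names(names):
    """Encode each name as (shared prefix length with previous name, remaining suffix)."""
    encoded, prev = [], ""
    for name in names:
        # os.path.commonprefix compares strings character by character.
        shared = len(os.path.commonprefix([prev, name]))
        encoded.append((shared, name[shared:]))
        prev = name
    return encoded


def decode_names(encoded):
    """Rebuild the original names from (shared length, suffix) pairs."""
    names, prev = [], ""
    for shared, suffix in encoded:
        name = prev[:shared] + suffix
        names.append(name)
        prev = name
    return names


# Invented identifiers that differ only in their final digits.
names = [
    "MACHINE1:7:1101:1000",
    "MACHINE1:7:1101:1001",
    "MACHINE1:7:1101:1012",
]
encoded = encode_names(names)
print(encoded)
```

After the first name, each entry shrinks to a small number and a one- or two-character suffix, and `decode_names(encoded)` restores the full list.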

Bonfield explained that the best approach he found for compressing the sequences was to use a sequence aligner, in his case Bowtie2, to align them to a reference and output the data in the SAM/BAM format, and then use the sam_comp program to capture and store information about each sequence’s location in the genome. In place of Bowtie2, one could use a different aligner or even a de novo assembler like Velvet, Bonfield said.
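The saving from alignment comes from replacing the bases of a read with its position on the reference plus any differences. The sketch below shows that idea for ungapped alignments only. It is an illustration under stated assumptions, not sam_comp's format: the helper names, the toy reference, and the choice to ignore insertions, deletions, and strand are all simplifications made for the example.

```python
def encode_read(reference: str, read: str, pos: int):
    """Represent a read as (position, length, list of (offset, base) mismatches)."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs past the end of the reference")
    diffs = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return pos, len(read), diffs


def decode_read(reference: str, pos: int, length: int, diffs):
    """Rebuild the read from the reference and its stored mismatches."""
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Toy reference and a read that matches at position 5 with one mismatch.
reference = "ACGTACGTTAGCCGATAGGCTA"
read = "CGTTAGCAGAT"
encoded = encode_read(reference, read, 5)
print(encoded)
```

A read that matches the reference exactly needs only a position and a length, so the closer the sample is to the reference, the less there is to store.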

The alliance launched the Sequence Squeeze competition last October and established the $15,000 prize with the intent of finding the best algorithm for compressing and decompressing NGS data stored in the FASTQ file format (BI 10/28/2011).

The entries were judged by representatives from BGI, the Broad Institute, the Sanger Institute, and the Pistoia Alliance.

“The competition exposed two important elements: First, that the gzip algorithm that served as the competition baseline is actually quite sufficient for run-of-the-mill compression, and second, that it’s extraordinarily difficult to make huge improvements in all three of the judged dimensions,” said Nick Lynch of AstraZeneca, external liaison of the Pistoia Alliance and chair of the judging panel, in a statement.

Lynch also noted that for each of the entries, “tradeoffs were made, which means that ultimately a compression toolkit might be the best approach to handle specific workflows.”

Lynch also praised the quality of entries received and the conversations between the participants.

“During the competition itself, entrants discussed ideas openly on a variety of forums, and many entrants are already talking about merging the best parts of their algorithms together to address particular sequencing workflows,” he said.

The competition also included an open leaderboard for tracking submissions.

In a statement, Bonfield said his entries “benefited” from the open nature of the contest.

“In a closed competition with a score table only visible after the submission deadline, I might have sat back and waited for the results. Instead, seeing an entry beaten spurred me to improve my submissions,” he said.

Bonfield plans to donate a portion of his prize to the Wellcome Trust Sanger Institute and the remainder to the British Heart Foundation.

Sequence Services Update

Also during the meeting, three teams selected for the second phase of the alliance’s sequence-services program presented their platforms to delegates.

In February, Pistoia announced that it had selected proposals from the teams — Constellation and GeneStack, Eagle Genomics and Cycle Computing, and Hewlett-Packard — as the winning entries for round two of the project (BI 2/10/2012).

Pistoia issued a request for proposals last July that outlined the group’s requirements for a hosted platform for sequence storage and analysis.

Among other capabilities, the platforms are expected to enable researchers to align proprietary sequences to publicly available data in tools like Ensembl; provide a gene alias search that uses public aliases and lists that are unique to each company; and offer an RNA-seq pipeline with tools to align short reads to a reference.

Systems were also expected to provide access to several well-known bioinformatics tools, including EMBOSS, Clustal-W, SAMtools, Bowtie, and Tophat, among others (BI 7/29/2011).
