New Sequence Simulator Could Simplify Alignment Software Selection, Case Western Researchers Say

By Uduak Grace Thomas

Scientific investigators who aren’t sure what alignment algorithm is best for mapping reads from their latest sequencing project may get some clarity from a new simulation program developed by researchers at Case Western Reserve University that evaluates the accuracy and speed of some well-known software packages.

The open source software tool, dubbed the Simulation and Evaluation suite, or SEAL, simulates next-generation sequencing runs under different settings for factors such as error rate, read length, insertion and deletion frequency, and coverage. Users can adjust these parameters to evaluate the effect of each factor on different mapping algorithms.

The developers describe SEAL in an advance-access article in Bioinformatics and demonstrate its use in assessing the alignment skills of several algorithms — Bowtie, BWA, mr- and mrsFAST, Novoalign, SHRiMP, and SOAP2 — in terms of each tool's accuracy and runtime.

The tool is meant to help scientists choose the most suitable software for their next-gen sequencing experiments, the authors state in the paper, adding that their findings also highlight "factors that should be considered to use alignment results effectively."

Rather than simply testing each package's performance against a publicly available dataset, the team chose to create the software to provide a controlled environment for the comparison, Matthew Ruffalo, a doctoral student in the department of electrical engineering and computer science at Case Western and one of SEAL's developers, told BioInform.

"If we [use] real reads from the genome then there is really no notion of correctness," he said, explaining that it would be difficult to know whether a read was mapped back to its original location since there would be no way to know where it was located in the genome.

Written in Java, SEAL can read a reference genome from one or more FASTA files or generate an artificial reference genome using input parameters such as length and repeat count. It then simulates reads from random locations in the genome, using read length, coverage, sequencing error rate, and indel rate as its parameters. Finally, it runs the alignment tools on the reference genome and the simulated reads and evaluates the resulting alignments for runtime and accuracy.
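The read-simulation step described above can be illustrated with a minimal Python sketch. This is not SEAL's code (SEAL is written in Java); the function name `simulate_reads`, its signature, and the way indels and substitutions are applied per base are all assumptions made for illustration.

```python
import random

def simulate_reads(reference, read_length, coverage, error_rate, indel_rate, seed=0):
    """Sample reads uniformly at random from `reference` and perturb them.

    Hypothetical helper, not part of SEAL. Returns a list of
    (true_start, read_sequence) tuples so that an aligner's output can
    later be checked against the known origin of every read.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    # Number of reads needed to reach the requested average coverage.
    n_reads = int(coverage * len(reference) / read_length)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        out = []
        for base in reference[start:start + read_length]:
            r = rng.random()
            if r < indel_rate / 2:
                continue  # deletion: drop this base (assumed 50/50 split)
            if r < indel_rate:
                out.append(rng.choice(bases))  # insertion before this base
            if rng.random() < error_rate:
                # substitution: replace with a different base
                base = rng.choice([b for b in bases if b != base])
            out.append(base)
        reads.append((start, "".join(out)))
    return reads
```

Because each read carries its true start position, a mapping reported by an aligner can be scored as correct or incorrect, which is the notion of correctness that real reads lack.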

In terms of accuracy, the program accounts for differences among the algorithms that are likely to influence their results, such as whether they return a single alignment or multiple alignments for each read.

Ruffalo developed the software as part of his doctoral research project with co-authors Thomas LaFramboise and Mehmet Koyuturk, both professors at Case Western.

He said the team selected the six packages that are evaluated in the paper in order to cover the spectrum of the different types of alignment programs available. For now, SEAL only includes code for running comparisons between the software used in the paper. Ruffalo said it is possible to add other tools, although a researcher who wanted to do that would need to write additional Java code telling SEAL how to run the program.

For the comparison described in the paper, the team generated simulated data from two genomes — an artificially generated genome and human genome release 19 — by "choosing uniformly distributed locations at random" and then created reads from "fragments of normally distributed sizes."

To obtain accurate mappings, the researchers set a threshold quality score as a standard and only considered reads with scores greater than or equal to that value.
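The thresholding step can be sketched in a few lines of Python. The function name `accuracy_at_threshold` and the input layout, a list of (quality score, is-correct) pairs, are assumptions for illustration and do not come from SEAL or the paper.

```python
def accuracy_at_threshold(alignments, threshold):
    """Fraction of correct mappings among those with quality >= threshold.

    `alignments` is a list of (quality, is_correct) pairs (assumed layout).
    Returns None when no alignment passes the threshold.
    """
    kept = [correct for quality, correct in alignments if quality >= threshold]
    if not kept:
        return None
    return sum(kept) / len(kept)
```

Sweeping `threshold` over a range of values shows how much accuracy a tool gains by discarding low-confidence mappings, which is the kind of comparison the threshold makes possible.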

For the sake of tools such as mrFAST and mrsFAST, which map reads to multiple matching locations in the genome — a capability that is useful for ChIP-seq experiments, Ruffalo noted — the team used two alternate definitions of incorrect mapping, "strict" and "relaxed" measures.

To illustrate, "if a read is mapped to four locations in the reference genome and one of those mappings is correct, the other three alignments are not counted as incorrect mappings in the relaxed sense," the researchers explain in the paper. Conversely, in the strict sense, the other three alignments would be considered incorrect.
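The difference between the two measures can be made concrete with a short Python sketch for a single read. The function name `count_incorrect`, the `tolerance` parameter, and the treatment of a read with no correct mapping (all of its mappings count as incorrect under both measures) are assumptions; the article only specifies the case in which one of the mappings is correct.

```python
def count_incorrect(true_pos, reported_positions, tolerance=0):
    """Return (strict, relaxed) incorrect-mapping counts for one read.

    Hypothetical helper. A reported position is treated as correct when it
    lies within `tolerance` bases of the read's true origin.
    """
    wrong = sum(1 for p in reported_positions if abs(p - true_pos) > tolerance)
    has_correct = wrong < len(reported_positions)
    strict = wrong                           # every wrong location counts
    relaxed = 0 if has_correct else wrong    # forgiven if any location is right
    return strict, relaxed
```

For the paper's example of a read mapped to four locations, one of them correct, `count_incorrect(100, [100, 5000, 9000, 12000])` gives three incorrect mappings in the strict sense and none in the relaxed sense.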

Ultimately, the method for selecting the winning algorithm depends on the user's research needs, the developers said. For example, if a researcher is studying structural variants, the relaxed accuracy approach may be more useful while a study on genotyping single nucleotide polymorphisms may require strict accuracy.

Among other findings, the team found that mrFAST and mrsFAST might be the best tools for genome projects studying indels as they "appear to be more robust for increasing frequency of indels" while the performance of all the other algorithms declined.

Furthermore, as the average indel size approached 10 base pairs, SOAP failed to align any reads, suggesting that it is better suited for identifying single-nucleotide variations, they said.

Meanwhile, BWA wasn’t quite so accurate when it didn't have a threshold value with which to eliminate unreliable reads, even at low error rates, while SOAP had a "consistently high accuracy" whether or not it had a threshold.

"Based on these observations, we can conclude that BWA is specifically designed not to miss any potential mappings at the cost of reporting many incorrect mappings," the authors wrote.

In terms of speed, the researchers report that Bowtie, BWA, and SOAP align reads quickly but take a long time to build an index of the genome. Novoalign, on the other hand, doesn't slack when it comes to setting up indexes, but "shows more of a dependence on the number of reads," they said.

Ultimately, the authors note that the results demonstrate that "alignment tools are designed with different approaches to trading off speed and accuracy to optimize detection of different types of variations in donor genomes." The aim of SEAL, therefore, is to take those trade-offs into account in order to help users determine which algorithm is best for their specific research needs.

For Ruffalo, this project was a good way to "figure out what the different alignment tools do, how they differ, and what shortcomings they might share." A key take-home message, he noted, is that selecting an alignment tool depends on what the research needs are and not necessarily which program performs best overall.

"If you are looking for single nucleotide differences, SOAP does very well; if you are looking for structural variations, not so much, for example," he said.


Have topics you'd like to see covered in BioInform? Contact the editor at uthomas [at] genomeweb [.] com
