UK Developers Prep Open-Source Alternative To Illumina GA’s Primary Analysis Software

A team of UK bioinformaticists is finalizing an open-source software package devoted to primary analysis, such as image-data processing and base-call extraction, for second-generation sequencing instruments.
 
The software, dubbed Swift, was developed as “an open-source alternative to the Illumina ‘pipeline’ software,” and covers the pipeline from raw images to scored basecalls, Nava Whiteford, one of Swift’s developers, told BioInform.
 
Whiteford began working on the software while at the Wellcome Trust Sanger Institute along with Clive Brown, Tony Cox, and Tom Skelly. Brown recently left Sanger to join Oxford Nanopore Technologies, where he is director of bioinformatics and IT.
 
Whiteford has also recently joined Oxford Nanopore as senior computational scientist, but he told BioInform in an e-mail that he will continue to develop the software and integrate feedback from the community and from his team’s experiences.
 
“Swift now has the fundamental functionality required to process and run end to end, and is the only open-source tool currently available that can do this,” said Whiteford.
 
“Though Swift is at an early stage, we've decided to make it available to the public so people can start giving us feedback, and, we hope, contributing.” The tool is written in C/C++ and available under the GNU Lesser General Public License v3.
 
“At the moment our tools are aimed at Illumina [Genome Analyzer] data, but will eventually process [Applied Biosystems’] SOLiD images too,” he said.
   
Some of the advantages of the software over the Illumina tools, according to Whiteford, include an architecture that creates fewer files and overall “freedom and control, improved transparency, and the ability to develop new applications and control how the data are analyzed.”
 
“It is also faster and has a smaller compute overhead, which translates to lower costs,” he said.
 
Skelly is currently validating the tool at the Sanger Institute. “The Illumina pipeline is at present the ‘gold standard,’ representing the best that is currently achievable, so we of course compare our results to that,” he told BioInform in an e-mail.
 
To accomplish the validation, he and his colleagues are running “a large amount of data” through the two processing paths in parallel. Beyond that, especially where new algorithms are involved, “we test in situations where we can verify that Swift has got the ‘right answer.’ For example, resequencing an organism for which a good reference genome exists allows us to determine our base-calling error rates.”
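 
The reference-based check Skelly describes can be illustrated with a minimal Python sketch. It is not taken from Swift or from the Sanger validation; the function name, the ungapped-alignment input format, and the choice to skip "N" no-calls are all assumptions made for illustration. In real resequencing data, some mismatches are genuine variants rather than base-calling errors, so a figure computed this way is only an upper-bound estimate.

```python
def base_call_error_rate(aligned_reads, reference):
    """Return the fraction of called bases that disagree with the reference.

    aligned_reads: iterable of (start, sequence) pairs, where start is the
    0-based reference position of the read's first base. Alignments are
    assumed to be ungapped (hypothetical input format for this sketch).
    """
    mismatches = 0
    total = 0
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            if base == "N":  # no-calls are skipped, not counted as errors
                continue
            total += 1
            if base != reference[start + offset]:
                mismatches += 1
    return mismatches / total if total else 0.0


# Toy data: three 4-base reads, one of which carries a single mismatch.
reference = "ACGTACGTAC"
reads = [(0, "ACGT"), (4, "ACCT"), (6, "GTAC")]
rate = base_call_error_rate(reads, reference)  # 1 mismatch in 12 bases
```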
 
Skelly said Swift is not complicated for users. “Although its internal processing is significantly different, Swift uses the same input and output formats as the Illumina pipeline. Anyone currently running the Illumina software as part of a larger workflow pipeline should have no difficulty incorporating Swift,” he said.
 
Getting Through the Primaries
 
As Whiteford explained, there is a “significant” primary data analysis problem in second-generation sequencing. Scientists need to take the images from the instrument, extract intensities of the spots or beads, and tie these together between imaging cycles. “You then have to perform a series of post-image analysis corrections in order to correct for signal artifacts and base call,” he said. “How you perform this analysis has a huge effect on the resulting sequence data.”
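 
As a rough illustration of the last step in that chain, the Python sketch below calls one base per cycle from four-channel intensities by picking the brightest channel. It is a deliberately naive example, not Swift's algorithm: image registration, cross-talk correction, and phasing correction are all omitted, and the channel order, function name, and tie-handling rule are assumptions.

```python
CHANNELS = "ACGT"  # assumed channel order for this sketch


def call_bases(intensities):
    """Call one base per cycle for a single cluster.

    intensities: list of per-cycle [A, C, G, T] signal values that are
    assumed to be already extracted and matched across imaging cycles.
    """
    calls = []
    for cycle in intensities:
        ranked = sorted(cycle, reverse=True)
        # Emit a no-call when there is no signal or the top two channels tie.
        if ranked[0] <= 0 or ranked[0] == ranked[1]:
            calls.append("N")
        else:
            calls.append(CHANNELS[cycle.index(ranked[0])])
    return "".join(calls)


# Toy intensities for one cluster over four cycles.
cluster = [
    [910.0, 40.0, 55.0, 30.0],
    [35.0, 60.0, 880.0, 45.0],
    [50.0, 50.0, 20.0, 10.0],  # tie between A and C
    [20.0, 15.0, 30.0, 700.0],
]
read = call_bases(cluster)
```

A production pipeline would apply the signal corrections Whiteford mentions before this step, which is where the choice of algorithm starts to affect the resulting sequence.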
 
Artifacts that crop up may be due to batch-to-batch variability of the systems or reagents and many other facets of the technology. Staying abreast of those confounding factors — particularly when running large projects that last over extended periods of time — means that researchers require “the ability to understand the effects and inter-relationships of these variables and how they affect your end data — and thus applications,” he said.
 

Whiteford believes the team mastered that task at the Sanger Institute and created a “very efficient, high-quality and consistent next-gen data pipeline.” In the course of that work, they found that analyzing the primary data and experimenting with new algorithms was “extremely important in terms of optimizing the production process,” he said.
 
Ultimately, he said, it was a process that enabled Sanger to make “the largest and highest quality contribution” to the 1000 Genomes Project.
 
“Looking at, and diagnosing images helped us generate hundreds of gigabases of good quality sequence data for the 1000 Genomes project. However, working with a closed source pipeline makes this harder,” he said.
 
One key driver for developing an alternative to the Illumina tools was the realization of the need to recalibrate base scores, Whiteford said. The team also found that there are better algorithms for phasing correction. Both improvements could best be deployed under a modular framework such as the one that Swift provides, Skelly said.
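 
The article does not describe how Swift recalibrates base scores. One generic approach, sketched below in Python under stated assumptions, is empirical recalibration: group bases by the quality score the caller reported, measure the observed error rate in each group against a trusted reference, and convert that rate back to a Phred-scaled score, Q = -10 log10(p). The function name, the input format, and the add-one smoothing are illustrative choices, not Swift's method.

```python
import math
from collections import defaultdict


def recalibrate(observations):
    """Map each reported quality score to an empirically observed one.

    observations: iterable of (reported_quality, is_error) pairs, where
    is_error is assumed to come from comparing each base call with a
    trusted reference (hypothetical input format for this sketch).
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [errors, total]
    for quality, is_error in observations:
        counts[quality][0] += int(is_error)
        counts[quality][1] += 1

    table = {}
    for quality, (errors, total) in counts.items():
        # Add-one smoothing keeps the score finite when no errors are seen.
        p_error = (errors + 1) / (total + 2)
        table[quality] = round(-10 * math.log10(p_error))
    return table


# Toy data: bases reported as Q30 err about 1% of the time,
# and bases reported as Q10 err about 10% of the time.
obs = (
    [(30, True)] * 9 + [(30, False)] * 989
    + [(10, True)] * 9 + [(10, False)] * 89
)
table = recalibrate(obs)
```

In this toy data the Q30 bin maps to roughly Q20, meaning the reported scores were overconfident, while the Q10 bin stays at Q10.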
 
Skelly also said that the tool gives users results quicker than the existing Illumina software. “Swift runs as a single executable, rather than a chain of processing steps,” he said. “This minimizes the [input/output] overhead. And [it] can be run in parallel on hundreds of processors in a cluster, which minimizes total processing time.”
 
Opening the Closed Door
 
The experience of building the Sanger pipeline with Illumina software also revealed to the team the value of developing a fully open-source version for their work and for use by the wider scientific community. “We wanted an open-source product of high code quality, capable of supporting a community of collaborators, and understandable enough that authors of papers based on Illumina data can do more than treat it as a ‘black box,’” Skelly said.
 
“It’s important to have control over raw data. It also gives you some leverage with vendors to stop them doing things that may be against your longer term interests,” Whiteford said.
 
As Whiteford explained, images from next-generation sequencers are currently processed with closed-source proprietary tools provided by the manufacturers. “That's really unfortunate because the data is being used to draw scientific conclusions,” he said. “It's difficult to trust your data and understand the artifacts in it if the data analysis algorithms are not open to peer review.”
 
Being able to try out new methods, change them, and then see how well they perform helps to optimize the analysis process. The community has a track record of “doing well when it comes to pushing the analysis of the data,” he said.
 
“The genome centers in particular have the desire and resources to put behind developing new tools and methods and I think it's important that vendors work with them. The open-source model is ideal for this [because] it allows scientists to take vendor-developed tools and extend them, giving value back both to the vendor and the scientific community in general.”
 
Skelly believes that software used in science should be open, and supported by a community of users and contributors. “I think the instrument builders should view themselves as members of that community,” he said.
 
“Our relationship with the developers of the Illumina pipeline has been extremely cordial and productive, with regard to both Swift and support to their own product,” he said. As the growing community of Illumina instrument users explores new ways of exploiting GA data, he said he hopes the company will continue to be supportive of those efforts.
 
“In the scientific arena where Swift lives, the benefits of openness are even more valuable,” he said.
 
When publishing results in peer-reviewed journals, scientists need to document their methods. “If those methods include the use of closed, proprietary software processing, the ability of reviewers and readers to assess the merits of the findings is compromised,” he said.
 
When Reality Brings Loads of Data
 
Whiteford first began working on sequencing approaches at the University of Southampton, where he was trying to tackle sequencing-by-hybridization technology.
 
“The technology itself didn't pan out,” he said, but noted that the work helped to lay the foundations of short-read sequencing and its computational feasibility.
 
Moving to the Sanger Institute meant a sudden encounter with plenty of real data where a typical day included “hacking away at Swift and developing the analysis algorithms [with] quite a lot of time spent gathered round Clive [Brown’s] or Tom [Skelly’s] computer staring at images trying to understand what went wrong with a run.”
 
Brian O’Connor, a bioinformatician in Stanley Nelson’s laboratory at the University of California, Los Angeles, is also addressing the data analysis challenges in second-generation sequencing. He admitted in an e-mail to BioInform that he is not yet familiar with Swift, but added, “I think it's a really good idea.”
 
O’Connor is developing an open-source toolkit for the Illumina Genome Analyzer called SeqWare, which includes a LIMS, an analysis pipeline, and other support tools [BioInform 09-12-08].
 
The difference between Swift and SeqWare is that Swift is focused on the image analysis and base-calling step of the pipeline, while SeqWare “is more focused on the whole analysis pipeline and is tool agnostic,” he said.
 
O’Connor and his colleagues are using Illumina’s software for primary analysis, but “as the project matures I'd like to see support for other image analysis software components,” he said. That might include Swift along with a tool to support the output from ABI’s SOLiD image analysis software, he said.
 
The idea behind SeqWare, he said, is to provide a pipeline environment to wrap image analysis, base calling, and then downstream analysis including alignment and reporting for various experimental designs such as genomic sequencing, cDNA sequencing, or ChIP-Seq.
 
“I'm trying to make each step along the way as plug-ins so that if a great new alignment tool, SNP caller, etc. becomes available I can write a wrapper for it and support it in SeqWare,” he said.
 
Oxford Nanopore’s Brown told BioInform via e-mail that “obvious characteristics” of second-generation sequencers include rapid change and improvement curves. During his tenure at Solexa before it was acquired by Illumina, he said, those traits were “designed into the system and the software.”
 
While open access to data files and source code contributed to the initial rapid uptake of the Illumina system, Brown said that the company later abandoned this open-source strategy.
 
Illumina officials could not be reached for comment in time for this publication.
 
“I think we can claim to be the first vendor to officially promulgate the open-access, open-source model as part of its early-adoption and launch strategy,” Brown said.
 
“The only downside to this model is that it can be a frustration to non-computer and data literate users who expect a push-button turnkey system with fixed and predictable use-cases,” he said. For now, these users are in the minority because the bulk of the expansion in second-generation sequencing is driven by new applications and experiments that could not be done on Sanger-based systems.
 
“Companies can only do so much, and once you have launched something it becomes very difficult and costly to support and control it. In many ways you get more control over something like a next-gen sequencing system by actually letting go of things like data processing, sample prep, and applications — and handing it over to a community of developers,” Brown said.
 
The current version of Swift is available here and a discussion forum is here.
