Researchers from the Wellcome Trust Sanger Institute have been making improvements to SAMTools, a commonly used program for processing next-generation sequence read alignments that are stored in the Sequence Alignment/Map, or SAM, format.
The Sanger team took over SAMTools development from its principal developer Heng Li, a research scientist at the Broad Institute. Li developed the software, which was used in the 1000 Genomes Project, during a postdoctoral fellowship at WTSI in Richard Durbin's group. Earlier this year, he handed primary development of the software to a team led by John Marshall, senior software developer at WTSI, and Petr Daněček, a senior bioinformatician also at WTSI. Li will continue to contribute code to the open-source program, which has repositories on both SourceForge and GitHub.
Besides fixing bugs and incorporating software patches sent in by community users, the team has so far incorporated the High-Throughput Sequencing library, HTSlib. Developed by Li, HTSlib is a C library that provides an application programming interface for accessing and working with common NGS file formats, including SAM; the Binary Alignment/Map format, or BAM; and the Variant Call Format, or VCF.
The library also supports two file formats newly added to the toolkit, one for storing sequence alignments and one for variant calls. The first is Cram, a new toolkit and file format for compressing and storing NGS data more efficiently that was developed by researchers at the European Bioinformatics Institute (BI 11/30/2012). The second is the Binary VCF version 2, or BCF2, format, which, according to its website, is a "binary, compressed equivalent of VCF."
They've also added new functionality to the existing models SAMTools uses for calling variants. For example, "it's been updated to better handle multi-allelic sites," Daněček told BioInform.
Marshall presented a poster describing these changes during the Genome Informatics conference held at Cold Spring Harbor Laboratory earlier this month. He told BioInform after the meeting that the team hopes to formally launch the restructured software later this year or early next year. However, all the updates they've made to SAMTools are already available on GitHub.
"There is a bit of polishing up to do before I'd recommend anyone use it for production purposes, but it's getting pretty close to that," he said. "At the moment … we are really interested in people using this experimentally. If you want to download and run it over some of your data, we'd be really interested in how it goes."
Already, there are "many people contributing to the software by contributing code, fixing bugs, and testing" but "we would like to make it more open so that more people can … influence the direction where SAMTools [goes]," Daněček added.
Besides bringing in new features, the developers are also focused on bringing the current components in the toolkit up to date and making sure that they work the way they should. "During the years when [Heng Li] had moved on to other things," development on SAMtools "had kind of fragmented a bit," Marshall said. "People would send a patch perhaps but there was nobody picking it up." In the future, he said, "we want to have the github repository as a sort of clearing house for that … and try and integrate the things that people put there."
Other efforts involve ensuring that the toolkit's components are adaptable enough to fit into various pipelines. For example, one team member is working to improve the flexibility of the SAMTools Merge application, which is used to merge BAM files, Marshall said.
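To give a rough sense of what merging alignment files involves, the Python sketch below combines two coordinate-sorted streams of records into a single sorted stream. It is an illustration only, not the SAMTools Merge implementation: the `Alignment` record and the `merge_sorted` helper are hypothetical names invented for this example, and real BAM records carry many more fields.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical, simplified stand-in for one aligned read."""
    ref_index: int  # index of the reference sequence (e.g. a chromosome)
    pos: int        # leftmost mapping position on that reference
    name: str       # read name


def merge_sorted(*streams: Iterable[Alignment]) -> Iterator[Alignment]:
    """Lazily merge coordinate-sorted streams into one sorted stream.

    Each input must already be sorted by (ref_index, pos); heapq.merge
    then only needs to hold one pending record per input at a time.
    """
    return heapq.merge(*streams, key=lambda a: (a.ref_index, a.pos))


# Two small, already-sorted inputs standing in for two BAM files.
lane1 = [Alignment(0, 100, "r1"), Alignment(0, 250, "r3"), Alignment(1, 40, "r5")]
lane2 = [Alignment(0, 120, "r2"), Alignment(1, 10, "r4")]

merged = list(merge_sorted(lane1, lane2))
print([a.name for a in merged])  # ['r1', 'r2', 'r3', 'r4', 'r5']
```

Because the merge is lazy, a pipeline could in principle stream the result straight into a downstream step without holding every record in memory.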
Later developments would, among other things, focus on providing more support for the Cram file format, which the developers believe could become more widely used because its reference-based compression approach promises significant space savings. "If Cram does take off … I think there is going to be a bit of work to make sure that’s … as convenient as it needs to be," Marshall said.
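The idea behind reference-based compression can be shown with a toy Python sketch: instead of storing every base of a read, store where the read aligns on the reference plus only the bases that differ. This is not the Cram format itself. The `encode` and `decode` helpers are hypothetical, handle substitutions only, and ignore insertions, deletions, and quality scores.

```python
from typing import List, Tuple

# (offset within the read, base that differs from the reference)
Mismatch = Tuple[int, str]
# (0-based start on the reference, read length, list of mismatches)
Encoded = Tuple[int, int, List[Mismatch]]


def encode(read: str, start: int, reference: str) -> Encoded:
    """Describe a read as its position plus its differences from the reference."""
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [(i, b) for i, (b, r) in enumerate(zip(read, window)) if b != r]
    return start, len(read), mismatches


def decode(encoded: Encoded, reference: str) -> str:
    """Rebuild the original read from the reference and the stored differences."""
    start, length, mismatches = encoded
    bases = list(reference[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGC"
read = "CGTTAGACGA"  # aligns at 0-based position 5 with one substitution

encoded = encode(read, 5, reference)
print(encoded)                     # (5, 10, [(6, 'A')])
print(decode(encoded, reference))  # CGTTAGACGA
```

A read that matches the reference exactly reduces to a position and a length, which is where the space savings come from when most reads differ from the reference at only a few bases.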