Last week, Curagen subsidiary 454 Life Sciences installed the first commercial version of its new high-throughput sequencers at the Broad Institute, but the arrangement is more of a collaboration than a customer/vendor relationship, at least on the informatics front.
James Knight, director of bioinformatics at 454 Life Sciences, told BioInform that his company is working closely with researchers at the Broad to refine its assembly algorithms for the 454 Genome Sequencing System, which generates thousands to millions of short — 100-base-pair — reads that must be either mapped to a known sequence or assembled into contigs to form a complete genome.
“We’re taking a combination approach of developing our own assembler, but also, as part of the initial instrument sales, there will be research collaborations,” Knight said. “So we’re targeting the major genome centers like Broad, as well as the people who have developed the major assembly algorithms. We would love to work with any of them to help them incorporate our reads … into their assemblers.”
Knight said that 454 has had mixed results applying existing assembly algorithms to its sequence data. “A lot of them are geared towards being able to handle 700- to 1,000-base reads, with paired-end information, and they require that for their assumptions, and they work in varying degrees of quality with our 100-base reads,” he said. “So the shorter reads, and the many more shorter reads, are the general issue.”
The 454 bioinformatics team has already begun working with the developers of the Arachne assembly algorithm at the Broad Institute, Knight said. “We’ve had some initial conversations about the properties and changes to make to Arachne to make it be able to handle assembling our reads,” he said. “That’s going to continue and we’re going to be working closely with them over the rest of this year.”
454’s own assembler is designed for the unique properties of the company’s microfluidics-based sequencing technology, which measures the amount of light that corresponds to the number of single nucleotides that occur in a row as they flow through the instrument. These signals, called “flowgrams,” are used to count how many of each base occur sequentially, and include a floating-point remainder “based on the fact that the normalization, the whole process, contains some errors,” Knight said. “So it could be 1.08 as your real point call for a single-base incorporation. But then we’re able to use the difference between a 1.0 incorporation and a 1.08 versus a 1.95 that will allow us to distinguish a single base from two bases.”
Using this approach, the instrument produces a series of numbers (say 3.2 As, 0.1 Ts, and 1.0 Cs) that would correspond to a sequence read of AAAC.
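As a rough illustration of that rounding step, the sketch below turns a list of per-base signal values into a nucleotide string. The function name `flowgram_to_bases`, the `(base, signal)` input layout, and the round-to-nearest rule are assumptions made for this example; the article does not describe 454's actual base-calling code.

```python
import math


def flowgram_to_bases(flows):
    """Convert (base, signal) pairs into a nucleotide string.

    Assumption: each signal is rounded to the nearest whole number
    to get the length of the run of that base.
    """
    read = []
    for base, signal in flows:
        # Round half up and never go below zero; Python's built-in round()
        # uses banker's rounding, so floor(x + 0.5) is used instead.
        count = max(0, math.floor(signal + 0.5))
        read.append(base * count)
    return "".join(read)


# The example from the text: 3.2 As, 0.1 Ts, and 1.0 Cs.
print(flowgram_to_bases([("A", 3.2), ("T", 0.1), ("C", 1.0)]))  # AAAC
```

Under the same rule, a signal of 1.08 is called as one base and 1.95 as two, which matches the distinction Knight describes.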
But the accuracy of this signal-based approach falls by the wayside when the sequence reads are used as the basis for assembly. “We found that when we first convert [the flowgram] into nucleotides — effectively rounding off those floating points — and then assemble those nucleotides, if we do some sort of majority voting or any other consensus calling that starts with the nucleotides, that’s not as accurate as if we do our consensus calling using the actual original signals,” Knight said.
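A toy example can show how the two orders of operation can disagree. The sketch below assumes that several reads have already been aligned so that their signals for the same flow can be compared, and it uses a plain average as the signal-space consensus. Both the averaging rule and the sample values are hypothetical, chosen only to illustrate the idea, and are not taken from 454's software.

```python
from collections import Counter
from statistics import mean


def call(signal):
    """Round one signal to a run length (round half up, never negative)."""
    return max(0, int(signal + 0.5))


def consensus_from_calls(signals):
    """Nucleotide-space consensus: round each read first, then majority vote."""
    calls = [call(s) for s in signals]
    return Counter(calls).most_common(1)[0][0]


def consensus_from_signals(signals):
    """Signal-space consensus (assumed here to be a mean): average, then round once."""
    return call(mean(signals))


# Hypothetical signals from three reads covering the same flow.
signals = [1.45, 1.45, 1.90]
print(consensus_from_calls(signals))    # 1
print(consensus_from_signals(signals))  # 2
```

The example shows only that rounding early discards information that can change the result; it does not show which answer is correct for real data.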
Ultimately, Knight said, 454 envisions a combined approach in which the initial assembly is performed “in the signal space,” and this information is then fed into a traditional assembly algorithm for finishing. “What we would do is have our assembler almost as a pre-processor to their assembler, where we’ll do initial contiging and consensus calling to get that more accurate consensus. And that consensus is then read in and treated either as a read or as an initial contig into their assembler.”
In the case of resequencing, 454 has developed mapping software to align the signal-based data to a reference genome, Knight said, “so if you have a very closely related strain, you can use the mapping, and then from the mapping pull out very high-confidence mutations that are found in the strain that you’re actually sequencing.”
Next-generation sequencing technology is producing additional informatics challenges, Knight said. “What we’re finding is that treating each read individually is just going to blow up disk space and computational space, and in some sense, users’ ability to handle the data space.” 454’s solution is to “pack” multiple reads into single files “and basically use those [consensus sequences] as opposed to the individual reads as the core piece of information that you as a user manipulate.”
This approach enables 454 to store the 200,000 reads that it gets from a run in about 500 megabytes.
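For a sense of scale, the figures quoted in this article imply the following storage budget per read. The calculation assumes decimal megabytes and the roughly 100-base read length mentioned earlier; the variable names are for illustration only.

```python
READS_PER_RUN = 200_000        # reads per run, as quoted in the article
RUN_BYTES = 500 * 10**6        # "about 500 megabytes"; decimal MB is an assumption
READ_LENGTH = 100              # approximate bases per read, as quoted in the article

bytes_per_read = RUN_BYTES / READS_PER_RUN
bytes_per_base = bytes_per_read / READ_LENGTH

print(bytes_per_read)  # 2500.0
print(bytes_per_base)  # 25.0
```

That is about 2,500 bytes per read, or about 25 bytes per base. This would leave room for per-flow signal values in addition to the called bases, though the article does not describe the file layout.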
Knight said that the instrument comes with a two-processor computer that satisfies the demands of a lab performing several sequencing runs a week. The Broad, however, intends to run the instrument around the clock, so 454 is “still in the process of determining” the compute and storage capacity that it will require for the system it has installed.