By Monica Heger
Complete Genomics has published its analysis method, detailing how it calls SNPs, short substitutions, and indels, and said that it is working on modifying the pipeline to enable haplotyping.
Described last week in the Journal of Computational Biology, the method was designed to address the unique characteristics of Complete Genomics reads, in which each arm of a mated read contains multiple fixed-length read segments separated by gaps of variable size.
Because of this structure, a given read may map to multiple regions of the genome, which rules out assembly methods that rely only on aligning reads to a reference.
In light of this, Complete Genomics has developed a method that first takes a constrained approach to mapping reads from its DNA nanoballs (DNBs) to the reference genome and then uses local de novo assembly to ensure an optimized alignment that is independent of any bias toward the reference genome. Variants are called based on the local reassembly rather than on read alignment, an approach that differs from so-called map-consensus assembly approaches.
According to Aaron Halpern, a senior staff scientist in bioinformatics at Complete Genomics, the local de novo assembly step is particularly important given the company's unique sequencing method.
Unlike other sequencing platforms, Complete Genomics does not produce reads that are one continuous sequence. Instead, it generates two non-contiguous 35-base mate-pair reads separated by several hundred bases. Each 35-base arm is composed of three 10-base segments and one five-base segment. Adjacent segments may be separated by a gap of several bases, or they may overlap.
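The read layout described above can be sketched as a small data structure. This is a minimal illustration, not Complete Genomics' own representation: the class name `Arm`, the helpers `segment_offsets` and `arm_from_reference`, the ordering of the segments (three 10-base segments followed by the five-base one), and the convention that a negative gap means an overlap are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Segment lengths from the article; the ORDER is an assumption for illustration.
SEGMENT_LENGTHS = (10, 10, 10, 5)


def segment_offsets(gaps: Sequence[int]) -> List[int]:
    """Start of each segment relative to the first one.

    gaps[i] is the distance between segment i and segment i + 1;
    a negative value means the two segments overlap (assumed convention).
    """
    offsets = [0]
    for length, gap in zip(SEGMENT_LENGTHS, gaps):
        offsets.append(offsets[-1] + length + gap)
    return offsets


@dataclass(frozen=True)
class Arm:
    """One 35-base arm of a mated read (hypothetical representation)."""
    segments: Tuple[str, ...]
    gaps: Tuple[int, ...]

    def __post_init__(self) -> None:
        if tuple(len(s) for s in self.segments) != SEGMENT_LENGTHS:
            raise ValueError("expected segments of length 10, 10, 10 and 5")
        if len(self.gaps) != len(SEGMENT_LENGTHS) - 1:
            raise ValueError("expected one gap between each pair of segments")

    def reference_span(self) -> int:
        """Number of reference bases the arm covers under these gaps."""
        return sum(SEGMENT_LENGTHS) + sum(self.gaps)

    def matches(self, reference: str, start: int) -> bool:
        """True if every segment agrees exactly with the reference at `start`."""
        if start < 0:
            return False
        return all(
            reference[start + off:start + off + len(seg)] == seg
            for off, seg in zip(segment_offsets(self.gaps), self.segments)
        )


def arm_from_reference(reference: str, start: int, gaps: Sequence[int]) -> Arm:
    """Cut an error-free arm out of a reference string (for demonstration)."""
    segments = tuple(
        reference[start + off:start + off + length]
        for off, length in zip(segment_offsets(gaps), SEGMENT_LENGTHS)
    )
    return Arm(segments, tuple(gaps))
```

With gaps of (-2, 0, 6), for instance, the four segments start at offsets 0, 8, 18, and 34, and the arm spans 39 reference bases: 35 read bases plus a net gap of 4. Because the gaps are not known in advance, the same arm can be consistent with several placements, which is the ambiguity the article describes.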
Because of this method of generating reads, doing a local assembly "allows us to entertain alternative ways the genome might be, decide which of them is better supported and by how much, and provide robust scores that allow us to arrive at quite an accurate result," Halpern said.
Other methods of assembly and alignment would not work, he said, because they typically begin by aligning reads to a reference. But, because of the nature of Complete's reads, there is not necessarily a unique location on the genome for a given read, so a local assembly is a necessary step.
The method also gives Complete other advantages, said Paolo Carnevali, principal software engineer at Complete. In mapping-first approaches, which make variant calls after aligning reads, the user has "already made a fixed decision on what the mapping of each read is, and if you made a mistake on that, that mistake is unfixable."
By contrast, Complete's approach doesn't "make an assumption that a read goes in a certain place," said Carnevali. It could go in several places, and "that all gets taken into account."
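One way to picture a read that "could go in several places" is to give each candidate location a weight instead of picking a single winner. The sketch below is not the company's published model; it assumes a flat per-base error rate, a uniform prior over candidate locations, and a hypothetical function name `mapping_weights`. The default length of 70 bases corresponds to two 35-base arms.

```python
import math
from typing import Dict


def mapping_weights(mismatches: Dict[str, int],
                    read_length: int = 70,
                    error_rate: float = 0.01) -> Dict[str, float]:
    """Relative weight of each candidate location for one read.

    Assumes independent base errors at a flat `error_rate`, an erroneous
    base equally likely to be any of the three other bases, and a uniform
    prior over the candidate locations (all simplifying assumptions).
    """
    log_like = {
        loc: m * math.log(error_rate / 3)
        + (read_length - m) * math.log(1 - error_rate)
        for loc, m in mismatches.items()
    }
    # Normalize with the log-sum-exp trick to avoid numerical underflow.
    top = max(log_like.values())
    total = sum(math.exp(v - top) for v in log_like.values())
    return {loc: math.exp(v - top) / total for loc, v in log_like.items()}


# Hypothetical candidates: location label -> number of mismatching bases.
weights = mapping_weights({"chr1:1000": 0, "chr5:2000": 0, "chr9:3000": 2})
```

In this example the two perfect matches each receive roughly half of the weight and the two-mismatch location receives almost none, so neither perfect match is discarded in favor of the other.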
This is particularly helpful when looking at complex variations, said Krishna Pant, Complete's director of bioinformatics applications.
For example, if there are two variations that are close together, or an insertion that is near a SNP, "the way we do things allows for a more comprehensive exploration of possibilities for the resolved genome from any given region of interest," he said.
While the approach is necessary given Complete's read structure, it also allows the company to "resolve ambiguous mapping scenarios and make sure that we do not make calls that are the result of mapping ambiguities rather than true variation," added Pant.
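The idea of weighing alternative versions of a region against each other can be illustrated with a toy scorer. This is a simplified sketch under stated assumptions, not the published algorithm: reads are treated as short contiguous strings (ignoring the gapped structure), every placement is tried exhaustively, errors follow a flat rate, and the function names are hypothetical. At least two hypotheses are required.

```python
import math
from typing import Dict, List, Tuple


def read_log_likelihood(read: str, hypothesis: str,
                        error_rate: float = 0.01) -> float:
    """Best log-likelihood of `read` over all ungapped placements."""
    best = -math.inf
    for start in range(len(hypothesis) - len(read) + 1):
        window = hypothesis[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        ll = (mismatches * math.log(error_rate / 3)
              + (len(read) - mismatches) * math.log(1 - error_rate))
        best = max(best, ll)
    return best


def score_hypotheses(reads: List[str], hypotheses: Dict[str, str],
                     error_rate: float = 0.01) -> Dict[str, float]:
    """Total log-likelihood of all reads under each candidate sequence."""
    return {
        name: sum(read_log_likelihood(r, seq, error_rate) for r in reads)
        for name, seq in hypotheses.items()
    }


def best_hypothesis(scores: Dict[str, float]) -> Tuple[str, float]:
    """Best-supported hypothesis and its margin over the runner-up."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (name, top), (_, second) = ranked[0], ranked[1]
    return name, top - second


# Hypothetical region: the reference versus a version with one SNP.
hypotheses = {"reference": "ACGTACGTTTGACCA", "snp": "ACGTACGATTGACCA"}
reads = ["TACGATTG", "CGATTGAC", "ACGATTGA"]
winner, margin = best_hypothesis(score_hypotheses(reads, hypotheses))
```

Here the reads support the `snp` hypothesis, and the margin between the two totals plays the role of a score for how much better supported it is. Adding further entries to `hypotheses`, such as a SNP combined with a nearby insertion, would extend the comparison to more complex alternatives.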
Since developing this initial analysis pipeline, the company has made several improvements, including a cancer pipeline that compares matched tumor/normal genomes. That work required developing analysis methods that appropriately make calls from heterogeneous tumor samples and detect structural variations, said Pant.
Additionally, the company is working on modifying its analysis method to be compatible with the long-fragment read technology it is developing for whole-genome haplotyping.
In the long-fragment read sequencing approach, 100,000-base-pair fragments are distributed across a 384-well plate. Each well contains around 10 percent of the genome, reducing the chance that DNA from both the maternal and paternal chromosomes is in the same well.
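A back-of-the-envelope calculation shows why spreading the DNA over many wells keeps the two parental copies apart. The sketch below rests on assumptions the article does not spell out: the 10 percent figure is read as the share of a haploid genome present in a well, the DNA on the plate is split evenly between the two parental haplotypes, and fragments land in wells independently and uniformly. The function name is hypothetical.

```python
def same_well_probability(wells: int = 384,
                          genome_fraction_per_well: float = 0.10) -> float:
    """Chance that a fragment shares its well with a fragment covering the
    same locus from the other parental haplotype (simplified model)."""
    # Haploid genome equivalents on the whole plate, halved per parent
    # (assumption: equal contribution from each haplotype).
    copies_per_haplotype = wells * genome_fraction_per_well / 2
    # Probability that none of the other parent's copies of this locus
    # falls into the one well we are looking at.
    p_none = (1 - 1 / wells) ** copies_per_haplotype
    return 1 - p_none


p = same_well_probability()
```

Under these assumptions the probability comes out at roughly 5 percent, so in most wells the reads covering a given locus come from a single parental chromosome.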
The company plans to start testing this technology in pilot projects later this year. Carnevali said that researchers at Complete are currently modifying the analysis pipeline so that it works on the long-fragment read technology.
"We are in the process of extending that model to include the concept of multiple wells, and how fragments end up in different wells," he said. The current model is relatively simple, describing how reads are generated probabilistically, so the algorithm for the long-read technology will be "a bit more complex."
It will need to "be able to compute what is the probability of generating a certain read in a certain well," he said. While this will enable whole-genome haplotype phasing, it may ultimately also improve variant calling because it will provide additional information to increase the signal, he said.