Complete Genomics has published details of its variant-calling pipeline in an effort to help customers interpret the results they receive from the company's sequencing service.
The paper, which was published in the Journal of Computational Biology, "lifts the hood" on the company's variant calling capabilities so that customers can "better understand the nuances of how these calls are being made and ... why they work," Krishna Pant, Complete Genomics' director of bioinformatics applications, told BioInform.
The mathematical framework described in the paper forms the core of Complete Genomics' informatics infrastructure, which is optimized to handle the unique mated gapped structure of its sequence reads, Pant said.
The paper aims to "describe the foundational methods that lead from our reads, our data, to the variations and the scores and quality, and the final results that we deliver to our customers for analysis," he said.
Complete Genomics performs whole human genome sequencing using proprietary biochemistry based on DNA nanoball arrays and combinatorial probe-anchor ligation sequencing.
These methods produce reads that have unique characteristics and, as a result, the company has developed a number of its own methods for calling SNPs, short substitutions, and insertions/deletions.
The company's approach employs a local de novo assembly process, which uses a combination of Bayesian analysis and graph-based techniques, to identify variants in genomic data.
Paolo Carnevali, the company's principal software engineer and one of the paper's authors, explained that existing alignment and assembly tools failed to work for Complete Genomics reads because most of these algorithms require sequences to be long enough to be "sufficiently unique" in the human genome — usually on the order of "at least 25 bases or more."
Complete Genomics' reads comprise two 35-base arms, but "they are not contiguous," he explained, since each arm consists of three 10-base segments and one 5-base segment separated by gaps of variable length. As a result, any given read could potentially map to multiple regions of the genome.
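To make the read structure concrete, the sketch below lays out one gapped arm on a reference sequence. Only the segment lengths (three 10-base segments and one 5-base segment, 35 bases in total) come from the article; the segment order, the gap values, and the function names are illustrative assumptions, not the company's actual read format.

```python
# Segment lengths per arm as described in the article: 3 x 10 bases + 1 x 5 bases.
# The ORDER of the segments is an assumption made for illustration only.
SEGMENT_LENGTHS = (10, 10, 10, 5)


def arm_footprint(start, gaps, segment_lengths=SEGMENT_LENGTHS):
    """Return half-open (begin, end) reference intervals, one per segment.

    `gaps` holds the number of unread reference bases between consecutive
    segments, so it must have one entry fewer than there are segments.
    """
    if len(gaps) != len(segment_lengths) - 1:
        raise ValueError("need exactly one gap between each pair of segments")
    intervals = []
    pos = start
    for i, length in enumerate(segment_lengths):
        intervals.append((pos, pos + length))
        pos += length
        if i < len(gaps):
            pos += gaps[i]  # skip the unread bases before the next segment
    return intervals


def read_bases(reference, intervals):
    """Concatenate the reference bases covered by the segments."""
    return "".join(reference[begin:end] for begin, end in intervals)


if __name__ == "__main__":
    # Hypothetical gap sizes; real gap lengths vary from read to read.
    intervals = arm_footprint(start=0, gaps=(2, 0, 6))
    bases_read = sum(end - begin for begin, end in intervals)
    span = intervals[-1][1] - intervals[0][0]
    print(intervals, bases_read, span)
```

Under these assumed gaps the arm still reads 35 bases, but it spans 43 bases of reference, and a different set of gaps would give a different span. That variability is one reason a tool that expects contiguous reads cannot place such an arm directly.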
In the Complete Genomics method, variants are called based on an initial mapping step followed by a local reassembly process, rather than using read alignment.
Besides meeting the requirements of its read structure, Pant said the company's variant-calling mechanism improves on so-called "map-consensus" approaches, which first map reads to a reference and then determine the "consensus" variant call based on the number of reads mapped to each region.
While map-consensus approaches can "yield good results," the paper states, "they are at the mercy of the abilities of the aligner. Areas of dense variation with respect to the reference genome may yield no mappings and therefore no calls, while indels or dense variation when not detected by the mapper and accounted for in the caller may give rise to spurious calls or false negatives."
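For readers unfamiliar with the map-consensus idea the paper contrasts itself with, here is a deliberately naive sketch of it: reads are placed on the reference, bases are piled up per position, and the majority base is called. It is a generic toy, not any particular tool and not Complete Genomics' method; the function name, the ungapped `(position, sequence)` read format, and the `min_depth` threshold are all assumptions.

```python
from collections import Counter


def consensus_calls(reference, mapped_reads, min_depth=2):
    """Toy map-consensus caller.

    `mapped_reads` is a list of (position, sequence) pairs, i.e. ungapped
    placements already chosen by some aligner. Returns a dict mapping each
    variant position to (reference_base, called_base, depth).
    """
    pileup = [Counter() for _ in reference]
    for pos, seq in mapped_reads:
        for offset, base in enumerate(seq):
            if 0 <= pos + offset < len(reference):
                pileup[pos + offset][base] += 1

    calls = {}
    for i, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few mapped reads here: no call is made at all
        base, _ = counts.most_common(1)[0]
        if base != reference[i]:
            calls[i] = (reference[i], base, depth)
    return calls


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    # Three reads that all support a T>A change at position 3.
    reads = [(0, "ACGAA"), (1, "CGAAC"), (2, "GAACG")]
    print(consensus_calls(reference, reads))
```

The caller only ever sees what the aligner hands it. Positions 7 through 9 in this example receive no reads and therefore no call, which mirrors the paper's point that regions the mapper cannot place reads in simply go uncalled.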
Unlike such methods, Complete Genomics' approach doesn’t "make an assumption that a read goes in a certain place" but instead assumes that it could align in several possible locations, Carnevali said.
The approach also includes a correlation analysis step to resolve "ambiguous mapping scenarios" and does not call variants that are based on these uncertain alignments, Pant noted.
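One generic way to express the idea that a read may belong in more than one place is to give each candidate location a posterior probability rather than picking a single winner. The sketch below does that with a very simple model. It is not the error model or the correlation analysis described in the paper; the uniform prior over locations, the mismatch-only likelihood, the 1 percent `error_rate`, and the function name are all assumptions.

```python
import math


def placement_posteriors(mismatch_counts, read_length=35, error_rate=0.01):
    """Posterior probability of each candidate location for one read.

    `mismatch_counts` maps a location label to the number of mismatches the
    read shows there. Assumes a uniform prior over the listed locations and
    independent base errors, each wrong base being equally likely.
    """
    log_likelihoods = {
        loc: m * math.log(error_rate / 3)
        + (read_length - m) * math.log(1 - error_rate)
        for loc, m in mismatch_counts.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_likelihoods.values())
    weights = {loc: math.exp(ll - peak) for loc, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {loc: w / total for loc, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical candidate locations for a single 35-base arm.
    print(placement_posteriors({"chr1:100": 0, "chr7:500": 0}))
    print(placement_posteriors({"chr1:100": 0, "chr7:500": 3}))
```

Two equally good placements each receive half of the probability, which is the kind of ambiguous mapping a caller might decline to base a variant call on, while a placement with three extra mismatches is down-weighted to nearly zero.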
This method allows the company to call both alleles at a position independently, which enables it to make calls in cases where both alleles differ from the reference. The approach can also detect variants that are located close to each other as well as previously unknown indels, the company said.
Once its sequencing and analysis services are completed, the company provides customers with reports that could include information on copy number variations, structural variations, transposable element insertions, and a comparison of tumor and normal samples.
Complete Genomics' focus is on "trying to solve, and pipeline, and standardize all those parts of the analysis that are shared across all or a large swath of customers," including calling SNPs, small indels, and substitutions, as well as some functional annotation, Pant said.
Customers can then go on to run analysis algorithms that are specific to their research objectives or study designs, he added.
Complete Genomics intends to extend its infrastructure to analyze data from a haplotype-based long fragment read sequencing approach that it is currently developing. The company plans to pilot the haplotype phasing offering later this year ahead of a full launch at a later date.
Moving forward, the company intends to make several improvements to its informatics infrastructure, Pant said.
"We have a very good foundational framework from which to extend and improve capabilities," he said. "For example, though [our] core methods ... have been around for a while, we found it relatively straightforward to improve the error models that we use, to add other capabilities, to reuse some of the capabilities we have in other contexts like cancer or structural variations."
Among other things, the company will work on improving the quality of its calls for both clinical and research applications, Pant said.
The company is also considering a number of methods from members of the life science community that it could incorporate into its analysis pipelines, he said. For example, it is looking at a method for detecting viral sequences in human genomes.
Last November, Complete Genomics launched a competition offering commercial and academic groups the chance to have eight genomes sequenced at no cost if they could develop software, scripts, methods, workflows, or other resources used for downstream analysis of its datasets (BI 11/4/11).
Officials told BioInform earlier this year that the company isn't disclosing the winning entries.
The company also inked a deal with DNAnexus last March that allows its clients to store and visualize their human genome sequencing data on the latter's cloud-based platform (BI 3/18/11).