This article was originally published July 13.
In a study published online in Nature last week, Complete Genomics demonstrated its long fragment read technology, which enables whole-genome sequencing and haplotyping from 10 to 20 cells with an error rate of one in 10 million bases.
The company plans to introduce the technology into its services business next year, and said that it will enable clinical-grade genomes (see related story in Clinical Sequencing News).
Complete Genomics' chief scientific officer Rade Drmanac told In Sequence that the method could be applied to any sequencing technology. He added that the company is continuing to optimize the method and is considering improvements that would enable better sequencing of repetitive regions and better indel calling.
The process adds about 24 hours to the upfront processing steps, but Brock Peters, director of research at Complete Genomics and lead author of the study, said that "there is a very simple path to get that down to maybe 12 hours, without having to do anything dramatic."
The method is based on the premise of diluting a small DNA sample — from between 10 and 20 cells — into a 384-well plate. Each well contains around 10 percent to 20 percent of a haploid genome, fragmented into pieces of roughly 100,000 bases.
Diluting the DNA across the wells reduces the chance that any single well contains fragments of the same region from both the paternal and the maternal chromosome.
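For intuition only, here is a minimal simulation with made-up parameters (the 384-well count comes from the paper, but the assumed input of 15 copies of each parental chromosome and the simple random model are not from the study): it estimates how often a maternal fragment of a given region would end up sharing a well with a paternal fragment of the same region.

```python
import random

WELLS = 384
COPIES = 15       # illustrative assumption: ~10-20 cells contribute ~15 copies of each parental chromosome
TRIALS = 10_000

mixed, total = 0, 0
for _ in range(TRIALS):
    # For one genomic region, each cell contributes a maternal and a paternal
    # fragment, and every fragment lands in a randomly chosen well.
    maternal_wells = [random.randrange(WELLS) for _ in range(COPIES)]
    paternal_wells = {random.randrange(WELLS) for _ in range(COPIES)}
    for well in maternal_wells:
        total += 1
        if well in paternal_wells:
            mixed += 1  # this well holds both parental copies of the region

print(f"maternal fragments that share a well with a paternal copy: {mixed / total:.1%}")
```

Under these assumptions only a few percent of wells see both parental copies of a region, so most wells carry unambiguous haplotype information.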
Next, the DNA in each well is amplified using multiple displacement amplification, a technique also used in single-cell sequencing, and then fragmented, after which barcode adapters are added. Each well has its own unique barcode.
The DNA is then pooled and sequenced using Complete's standard 35-base read sequencing technology.
Drmanac said the short reads actually offer an advantage in this case "because short reads are accurate and you can produce billions of them."
After sequencing, the reads can be grouped by their well barcode so that the long fragments originally present in each well can be reconstructed.
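In practical terms, this de-pooling step amounts to bucketing reads by barcode and then clustering them by mapping position. The sketch below is a hypothetical illustration, not Complete Genomics' actual pipeline; the read layout, barcode strings, and 150-kilobase clustering threshold are all assumptions.

```python
from collections import defaultdict

# Hypothetical read records: (well_barcode, chromosome, mapping_position, sequence).
reads = [
    ("AACGT", "chr1", 1_000_050, "ACGT..."),
    ("AACGT", "chr1", 1_042_300, "TTGA..."),
    ("AACGT", "chr1", 5_900_000, "GGCA..."),  # same well, different fragment
    ("GGTCA", "chr1", 1_000_900, "CAGT..."),
]

# Reads from one well that map within this distance are assumed to come from
# the same ~100,000-base starting fragment (illustrative threshold).
FRAGMENT_GAP = 150_000

def reconstruct_fragments(reads):
    # Step 1: de-pool the reads by the barcode identifying their well.
    wells = defaultdict(list)
    for barcode, chrom, pos, seq in reads:
        wells[barcode].append((chrom, pos, seq))

    # Step 2: within each well, cluster reads by mapping position to recover
    # the long fragments that the well originally contained.
    fragments = defaultdict(list)
    for barcode, well_reads in wells.items():
        current = []
        for chrom, pos, seq in sorted(well_reads):
            if current and (chrom != current[-1][0] or pos - current[-1][1] > FRAGMENT_GAP):
                fragments[barcode].append(current)
                current = []
            current.append((chrom, pos, seq))
        if current:
            fragments[barcode].append(current)
    return fragments

for barcode, frags in reconstruct_fragments(reads).items():
    print(barcode, [(f[0][1], f[-1][1]) for f in frags])
```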
Complete uses an assembly approach that combines alignment and local de novo assembly of its reads to call variants, details of which it described earlier this year (IS 1/24/2012). Variants are called based on the local assembly, not on the read alignment.
For its long fragment read technology, the company had to develop a new algorithm that could use the information from all 384 wells at once, Drmanac said.
Proof of Principle
In the paper, the researchers demonstrated the technique on three different samples, creating 10 long-fragment read libraries — six from a European HapMap sample, three from a Yoruban female HapMap sample, and a single library from a sample from the Personal Genome Project.
For the Yoruban sample, one library was constructed using 10 cells, or around 100 picograms of DNA, while all other libraries for the Yoruban sample and the other two samples were constructed using between 15 and 20 cells, or between 100 and 130 picograms of DNA.
On average, each sequencing library generated over 250 gigabases of mapped data, corresponding to around 80x coverage.
The algorithm placed around 92 percent of the phasable heterozygous SNPs into contigs, with N50s of around one megabase for the Yoruban sample and around 500 kilobases for the other two samples. The authors attributed the lower N50 for those samples to a greater number of regions of low heterozygosity in their genomes.
They found that doubling the number of reads to around 160-fold coverage, or combining replicate samples, increased the phasing rate to about 96 percent.
A comparison of the haplotype data from two replicate libraries found the results to be highly concordant, with only 64 differences per library among the roughly 2.2 million heterozygous SNPs phased by both libraries.
Additionally, comparing the phased sequence data for the European HapMap sample with parental phasing information generated previously by other methods revealed 60 discrepancies between the two approaches among 1.57 million comparable loci.
Finally, the team looked at 35 de novo mutations in the European HapMap sample. Thirty-four of those mutations were called in either the previously sequenced genome or one of the long fragment read libraries. Thirty-two of the mutations were phased in at least one of the two replicate long fragment read libraries. The two non-phased variants were located in regions of low heterozygosity.
Aside from yielding a haplotyped genome sequence, the method also reduces the error rate 10-fold, from one in one million bases to one in 10 million bases.
One reason it is able to do this is that the phase information adds an extra layer of checking that makes false positives easier to identify. Not only must there be enough reads supporting a variant to call it as real, but the variant also cannot contradict the phasing. If two different variants are identified at the same position and both appear to come from the same parental chromosome, one of them must be a false positive.
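A minimal sketch of that extra check, using assumed data structures rather than the company's actual caller, might look like this: a candidate call needs enough supporting reads, and two different alleles placed on the same parental chromosome at the same position flag a likely false positive.

```python
# Hypothetical candidate calls at each position: (allele, supporting_reads, parental_haplotype),
# with the haplotype inferred from the well-based phasing.
candidates = {
    1_204_337: [("A", 14, "maternal"), ("G", 11, "paternal")],  # ordinary heterozygous SNP
    2_991_002: [("T", 9, "maternal"), ("C", 8, "maternal")],    # two alleles on one chromosome
}

MIN_READS = 5  # illustrative read-support threshold

def filter_by_phase(candidates):
    kept, flagged = [], []
    for pos, calls in candidates.items():
        supported = [c for c in calls if c[1] >= MIN_READS]
        haplotypes = [hap for _, _, hap in supported]
        # Two different alleles assigned to the same parental chromosome at one
        # position cannot both be real, so at least one is a false positive.
        if len(haplotypes) != len(set(haplotypes)):
            flagged.append(pos)
        else:
            kept.extend((pos, allele) for allele, _, _ in supported)
    return kept, flagged

kept, flagged = filter_by_phase(candidates)
print("kept:", kept)
print("flagged as likely false positives:", flagged)
```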
In the current paper, the researchers first called variants, then used the tags to piece together longer stretches of DNA from the same parental chromosome, and from those built contigs, Drmanac explained.
Next Steps
Despite the dramatic reduction in error rate, Drmanac said the main source of error is the multiple displacement amplification step. Each library contains about 15 times more errors than Complete's standard library, and the vast majority of those errors are introduced by MDA.
Drmanac said that researchers are currently working to optimize this process to reduce the error rate even further and to increase coverage uniformity. In the current iteration, the DNA fragments in each well are amplified around 10,000-fold with MDA. "We'd like to reduce that in future," Drmanac said, which would help reduce errors.
The company is also working to revise its algorithm so that instead of calling variants first and then phasing them, it will phase first and then call variants.
To do this, said Drmanac, some heterozygous SNPs would need to be called first in order to generate enough information to phase the fragments. Then the phase information could be used to call all the bases.
The advantage of this is that it can help resolve instances in which there is not enough information to call variants. For instance, "if we have say five reads and we don't know whether they all come from mom or dad," Drmanac said. By phasing first and then calling variants, "suddenly, we can see three are coming from mom, two are coming from dad, so both of them have that same base."
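A toy version of that phase-first idea, with assumed data structures and made-up positions, could look like the following: reads are first assigned to a parental chromosome using a few anchor heterozygous SNPs, and the base at an ambiguous position is then called separately for each chromosome.

```python
from collections import Counter, defaultdict

# Hypothetical anchors: confidently called heterozygous SNPs,
# position -> (maternal allele, paternal allele).
anchors = {1_000_500: ("A", "G"), 1_003_200: ("C", "T")}

# Five reads covering an ambiguous position at 1,001,000; each read is a
# mapping of covered positions to observed bases.
reads = [
    {1_000_500: "A", 1_001_000: "T"},  # anchors say maternal
    {1_000_500: "A", 1_001_000: "T"},  # maternal
    {1_003_200: "C", 1_001_000: "T"},  # maternal
    {1_000_500: "G", 1_001_000: "T"},  # paternal
    {1_003_200: "T", 1_001_000: "T"},  # paternal
]

def assign_haplotype(read):
    """Vote a read onto a parental chromosome using the anchor SNPs it covers."""
    votes = Counter()
    for pos, (mat, pat) in anchors.items():
        base = read.get(pos)
        if base == mat:
            votes["maternal"] += 1
        elif base == pat:
            votes["paternal"] += 1
    return votes.most_common(1)[0][0] if votes else None

def call_by_haplotype(reads, position):
    """Call the base at a position separately for each parental chromosome."""
    piles = defaultdict(Counter)
    for read in reads:
        hap = assign_haplotype(read)
        if hap and position in read:
            piles[hap][read[position]] += 1
    return {hap: counts.most_common(1)[0][0] for hap, counts in piles.items()}

# Three reads come from mom and two from dad, and both chromosomes carry the same base.
print(call_by_haplotype(reads, 1_001_000))  # {'maternal': 'T', 'paternal': 'T'}
```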
Reversing the order of phasing and variant calling will also help call short indels, he said. In many genetic diseases, short deletions affect an important element of a gene, but those deletions are located in repeats, where they are very difficult to detect with traditional sequencing methods.
"You cannot easily measure that there [are fewer] reads for such a short piece of DNA," he said.
But by phasing first and then calling variants, such a deletion becomes obvious: after phasing, there will be no coverage in that region from the parental chromosome that carries the deletion.
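As an illustration of how that would show up in haplotype-resolved data (with invented depth numbers and an arbitrary cutoff), a deletion appears as a stretch of near-zero coverage on just one parental chromosome, even where the pooled depth would only look reduced.

```python
# Hypothetical per-haplotype read depth over consecutive 1 kb windows of a repeat-rich gene.
maternal_depth = [40, 38, 41, 39, 40, 42]
paternal_depth = [41, 40, 2, 1, 0, 39]  # near-zero stretch on one chromosome only

NO_COVERAGE = 5  # illustrative cutoff for "no coverage" from one parental chromosome

def deletion_windows(depths, cutoff=NO_COVERAGE):
    return [i for i, depth in enumerate(depths) if depth <= cutoff]

print("maternal deletion candidates:", deletion_windows(maternal_depth))  # []
print("paternal deletion candidates:", deletion_windows(paternal_depth))  # [2, 3, 4]
```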
Additionally, he said researchers are also working on techniques to increase the size of the fragments in each well. Longer fragments would help distinguish between functional genes and non-functional pseudogenes, he said.
Other modifications include increasing the number of wells and diluting the DNA even further, which he said would help resolve repetitive regions and enable accurate sequencing through regions such as the telomeres.
To do this, he said, the company is looking into the possibility of increasing the plate size to 10,000 wells and using a nanodroplet technique to pipette tiny amounts of DNA into each well.
With 10,000 wells, "there will be a single fragment in an aliquot," said Drmanac, which will help resolve "some of the different strange repeats and the telomeres." Since only one telomere would be present in each well, "we could measure the length of the telomere," Drmanac said.
Additionally, added Peters, "by putting it into a droplet process, the whole process would take hours instead of a day."