By Julia Karow
This article was originally published Nov. 5.
Nine months after presenting human genome sequencing data generated on its proprietary sequencing platform for the first time at a conference, Complete Genomics published a detailed description of its technology in Science last week, applying it to sequence three human genomes at consumables costs between $1,700 and $8,000 per genome.
The paper, published online, is short on scientific insights from an analysis of the genetic variants found in the three samples — two HapMap samples and DNA from Harvard Medical School Professor George Church — but serves to showcase Complete Genomics' technology, and where it stands compared with other short-read approaches when it comes to sequencing full human genomes.
"People have been waiting for a long time for us to publish a detailed description of the technology in a refereed journal, and also publish the validation of that technology," said Cliff Reid, chairman, president, and CEO of Complete Genomics. "We are pretty excited and interested in getting feedback from the scientific community about these genomes and about this technology, which I'm sure will allow us to improve it in the future."
Researchers agreed. "This is a highly anticipated paper," said Stephen Kingsmore, president and CEO of the National Center for Genome Resources in Santa Fe, NM, given that Complete Genomics has been talking about its capabilities for almost a year now. NCGR has been involved in sequencing several human genomes this year, using Illumina's Genome Analyzer platform.
In the study, the company used its combinatorial probe anchor ligation chemistry and patterned DNA nanoarrays to sequence two HapMap samples — NA07022, a Caucasian man and the same sample that Complete Genomics presented data from at a meeting in February (see In Sequence 2/10/2009), and NA19240, a Yoruban woman, whose genome was also sequenced at high depth as part of the pilot phase for the 1000 Genomes Project — and PGP1, also called NA20431, which is Church's sample from Harvard's Personal Genome Project.
The technology produced 2x35-base non-contiguous mate-paired reads, generated from a single 400- to 500-base-pair library. A company spokesperson told In Sequence that about 2 percent of the human genome cannot be analyzed with this library, but that the firm is working on two other library types with longer insert sizes.
"We are in the process of evaluating for which applications the other two library types would benefit, such as 100-kilobase fragments for full chromosome haplotyping," she said.
The researchers generated the greatest amount of sequence data for sample NA07022. To analyze this genome, they aligned 241 gigabases — equivalent to 87-fold average coverage — to the human genome reference, using a custom alignment algorithm. Consumables costs for this genome totaled approximately $8,000.
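The link between mapped bases and fold coverage quoted above is simple division: average coverage is the number of mapped bases divided by the size of the reference being covered. The following is a minimal sketch in Python that works backward from the two reported figures. The function name is illustrative, and the implied reference size is an inference from those figures, not a number stated by the company or in the paper.

```python
def average_coverage(mapped_gigabases: float, reference_gigabases: float) -> float:
    """Average fold coverage = mapped bases / reference size (same units)."""
    if reference_gigabases <= 0:
        raise ValueError("reference size must be positive")
    return mapped_gigabases / reference_gigabases


# Figures reported in the article for sample NA07022.
MAPPED_GB = 241.0
REPORTED_FOLD = 87.0

# Reference size implied by the two reported numbers (an inference,
# not a value given in the article).
implied_reference_gb = MAPPED_GB / REPORTED_FOLD

if __name__ == "__main__":
    print(f"implied reference size: {implied_reference_gb:.2f} Gb")
    print(f"coverage check: {average_coverage(MAPPED_GB, implied_reference_gb):.0f}x")
```

The implied denominator comes out at a little under 2.8 gigabases, which is in line with the commonly cited size of the sequenced portion of the human reference.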
After assembling the mapped reads into a "best-fit" diploid sequence, using custom software, the researchers were able to make diploid calls of SNPs and indels up to 50 bases in length in approximately 91 percent of the genome. In total, they called about 3 million SNPs, 10 percent of which are novel, and about 340,000 indels. Coverage bias was quite high for this genome, they wrote, much of it due to local GC-content.
This bias was "significantly reduced" for sample NA19240 by improving adapter ligation and PCR conditions, allowing them to make diploid calls of SNPs and indels in 95 percent of that genome, even with a reduced average coverage of 63-fold, or 178 gigabases of mapped data. According to Reid, reducing the representational bias "has been a major technological advancement" that has allowed the company to scale up its operations.
Consumables costs for this genome were reduced to about $3,500, and the researchers called about 4 million SNPs, approximately 20 percent of them novel, as well as almost 500,000 indels.
Church's genome was covered at just 45-fold depth, with 124 gigabases of mapped data, allowing the researchers to call variants in only 86 percent of the genome.
Consumables for this genome cost $1,700, and the scientists called about 2.9 million SNPs — 10 percent of them novel — and 270,000 indels.
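Putting the three genomes side by side shows that consumables costs dropped more steeply than sequencing depth did. The following rough sketch in Python uses only the mapped-data, coverage, and cost figures reported above. The dictionary layout and function name are illustrative, and the costs are the approximate figures quoted in the article, so the results are ballpark values only.

```python
# Reported figures per sample:
# (mapped gigabases, average fold coverage, consumables cost in USD).
GENOMES = {
    "NA07022": (241.0, 87.0, 8000.0),
    "NA19240": (178.0, 63.0, 3500.0),
    "PGP1": (124.0, 45.0, 1700.0),
}


def cost_summary(genomes):
    """Return {sample: (USD per mapped gigabase, USD per fold of coverage)}."""
    return {
        name: (cost / gigabases, cost / fold)
        for name, (gigabases, fold, cost) in genomes.items()
    }


if __name__ == "__main__":
    for name, (per_gb, per_fold) in cost_summary(GENOMES).items():
        print(f"{name}: ${per_gb:.2f} per Gb, ${per_fold:.2f} per fold of coverage")
```

On these numbers, the consumables cost per mapped gigabase falls from roughly $33 for NA07022 to roughly $14 for PGP1.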
For genome NA07022, they also found more than 2,000 anomalous mate pairs that may indicate the presence of structural variants and rearrangements. They verified one of these — a 1,500-base deletion — by PCR.
In addition, they annotated this genome with Trait-o-Matic software developed in the Church lab, which yielded 14 variants with potential disease implications. However, the company presumably cannot relate these variants to phenotypic information, which is not available for HapMap samples.
In order to estimate their error rate, the researchers tested 291 random novel non-synonymous variants by targeted sequencing in sample NA07022. Based on the results, they calculated an error rate of about one in 100,000 bases, which the company claims "exceeds the accuracy rate achieved in other published complete genome sequences."
In addition, they compared previously reported SNP genotypes for all three genomes with their own sequence data and found concordance rates of more than 99 percent for each genome.
Genome researchers — including early-access customers of Complete Genomics' sequencing service — offered different views of the results presented in the paper.
"Compared to Illumina, you find SNPs at roughly the same rate and with roughly the same accuracy" using a comparable fold coverage, said Chad Nusbaum, co-director of the genome sequencing and analysis program at the Broad Institute, which recently purchased an additional 30 Illumina Genome Analyzer systems for a total of 89 (see other article in this issue).
In addition, he said, the coverage of the genomes is "pretty good" and approaches the theoretical limit that can be reached by aligning the reads uniquely to the reference, a limit that he said cannot be overcome by other short-read sequencing platforms, either.
But according to NCGR's Kingsmore, the "high percentage of insufficient coverage will limit the utility of this technology to that of an adjunct to high-quality but more expensive methods." He added that he does not believe "that the quality is sufficient to displace Illumina or 454 or SOLiD" and that "independent unbiased validation is very clearly needed to legitimize the conclusions" of the paper.
According to David Cox, senior vice president and chief scientific officer of applied quantitative genotherapeutics at Pfizer in South San Francisco, "what's impressive is that they really have quite good coverage for outstanding quality, and that is what will allow this technology to catch people's attention."
Pfizer is an early-access customer of Complete Genomics and has already received data for a small number of genomes from the company. These results "are consistent with what they report" in the Science paper, said Cox, who is the former CSO of genotyping company Perlegen and a former co-director of the Stanford Genome Center.
He also said the technology has the potential to improve, as evidenced by the reported decrease in representational bias. "[The company researchers] can take empiric data and learn how to tweak their system to make it better," he said.
Cox and his colleagues are currently deciding whether to proceed with larger studies with Complete Genomics. "Based on the quality, I think it looks fine," he said. "We are definitely in talks with them to do larger things."
Nusbaum said he is wondering whether the technology has any systematic biases that were not reported in the paper, for example how good it is at sequencing very GC-rich regions, which he said are poorly sequenced by both the Illumina and the SOLiD platforms. Although those regions only make up a small fraction of the genome, they are biologically important, he said, since they include many promoters as well as exons with CpG islands.
A Complete Genomics spokesperson told In Sequence that the technology "does acceptably well until GC is in low-to-mid 70 percent," pointing out that "at this point, sequence complexity in mapping is also an issue."
In terms of being able to call structural variants, including copy number changes, Nusbaum said that "I expect they can do it, [but] they haven't addressed that issue yet."
Nusbaum and his colleagues are currently doing their own validation of Complete Genomics' data, after receiving their first batch of data from the company less than two weeks ago, for a HapMap sample and several cancer samples they submitted to the firm earlier this year.
Broad researchers are validating these results by comparing them with Illumina sequencing data as well as genotyping data for the same samples. "That's going to be a very interesting comparison," Nusbaum said, adding that the company's paper and comments from the Institute for Systems Biology, which already presented results that were based on data from the company (see In Sequence 9/29/2009), indicate that "it's going to be pretty good."
Putting Pressure on Competitors
Researchers agree that the price at which Complete Genomics plans to offer human genome sequences — it currently charges $20,000 per genome for multi-genome pilot projects and has said that it will offer genomes for as little as $5,000 in the future, based on volume — makes it unique and attractive.
According to Mike Snyder, who joined Stanford University this summer as chair of the department of genetics, and who will lead a new center for genomics and personalized medicine at Stanford, "it is easily the least expensive genome sequencing on the market and has dropped genome sequencing costs by at least five-fold." Because of the low cost, he said, "it is likely to be the best technology for large-scale studies."
Snyder added that, provided the accuracy is good, he is "very eager to use their services for projects we are pursuing," although he said that the company's technology, like other short-read sequencing platforms, is likely "still deficient in accurate calls of structural variations."
According to Kingsmore, Complete Genomics' method "appears to be five- to 10-fold less expensive than very well-tested methods."
"They are putting pressure on Illumina, certainly, [and] Illumina is going to respond to that pressure," said the Broad's Nusbaum. "A genome, realistically, costs $20,000 with [Complete Genomics], and we certainly don't come close to that with sequencing a genome by Illumina in our own hands," he said. However, he added, "I can see a path to getting there" with Illumina's technology, probably sometime next year.
Regarding Complete Genomics, "if the genomes are cheaper … and they can do things that we want to do — like find all the mutations in a genome — we at Broad will obviously be thinking very hard about how we can take advantage of this, and it's likely that we will do some more stuff with these guys," he said.
Cancer Genome Sequencing Next
In the meantime, Complete Genomics is working on scaling up its operations in order to be able to sequence 10,000 genomes in 2010. "To date, our big challenge has been to scale up, to provide the kind of high throughput that our customers are going to need, and we are confident we will be providing that next year," Reid said.
This will include changes to the sequencing instruments, in particular increasing the density of the patterned DNA nanoarrays, as well as workflow improvements that allow the genome center to operate efficiently and at low cost while still generating high-quality data.
Finally, the company plans to further develop its algorithms and software, in particular for analyzing cancer genomes. "Cancer genomes are no different from a sequencing point of view … but they are very different from an assembly point of view," Reid said, because they often contain major structural variations. Improving software to analyze these will be "one of the major product development efforts for 2010."