By Julia Karow
Complete Genomics recently described a number of planned technical improvements to its proprietary large-scale sequencing platform that will allow it to sequence 100,000 human genomes per year on 20 sequencing instruments in a single sequencing facility. Over the next few years, the company plans both to implement these improvements and to build 10 such centers around the world, with the goal of sequencing a million human genomes within five years.
At the Advances in Genome Biology and Technology conference two weeks ago, Complete Genomics' chief scientific officer, Rade Drmanac, also presented a number of early customer projects, including rare disease studies of a family and an individual, and the firm's first analysis of a primary tumor.
In addition, the company recently provided details about its full commercial service — slated to start officially in April with the completion of its genome center, although the firm already has more than 30 customers — including deliverables, DNA sample requirements, and turnaround time.
Complete Genomics is currently outfitting its Mountain View, Calif.-based facility with an initial 16 high-throughput sequencers, along with sample-prep instrumentation and computational equipment. The company's goal is to be able to sequence and analyze 500 samples per month on these machines, according to Drmanac.
Last year, the company delivered to early-access customers 50 human genomes that were sequenced on research-grade instruments that have an order of magnitude lower throughput than the commercial systems.
The company is also planning to increase its number of compute cores to 6,500 this year, from 1,500 last year; and to beef up its storage to 1.5 petabytes from a current 1.2 petabytes. The reason why storage will increase much less than compute power is that "we already have what we need for our initial launch," Bruce Martin, Complete Genomics’ vice president of software, told In Sequence. Another reason is that the company uses a portion of the storage capacity for data that does not increase with genome volume, such as quality control data. In addition, he said, "we also benefit from some new software efficiencies."
The new production sequencers will generate almost 2 terabases of data per 11-day run, and process up to 18 ordered nanoarrays in parallel. The goal is to sequence a genome on a single array with at least 40-fold coverage, Drmanac said. Each array currently has about 3 billion 250-nanometer spots, or 2 million spots per square millimeter, and the company's combinatorial probe-anchor ligation chemistry produces 2x35-base gapped paired-end reads.
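As a rough back-of-the-envelope check on these array figures, the sketch below works out the array area implied by the spot count and density, the raw yield per array if every spot produced one full 2x35-base read, and the share of that yield needed for 40-fold coverage. The 3.1-gigabase genome size and the one-read-per-spot upper bound are assumptions made for illustration, not figures from the company.

```python
# Back-of-the-envelope check of the nanoarray figures quoted in the article.
SPOTS_PER_ARRAY = 3e9        # from the article
SPOTS_PER_MM2 = 2e6          # from the article
BASES_PER_SPOT = 2 * 35      # one 2x35-base paired-end read per spot
TARGET_COVERAGE = 40         # from the article
GENOME_SIZE_BASES = 3.1e9    # ASSUMPTION: approximate human genome size


def array_area_mm2(spots=SPOTS_PER_ARRAY, density=SPOTS_PER_MM2):
    """Patterned area implied by the spot count and spot density."""
    return spots / density


def max_raw_bases_per_array(spots=SPOTS_PER_ARRAY, bases=BASES_PER_SPOT):
    """Upper bound on yield: assumes every spot gives one full-length read."""
    return spots * bases


def bases_needed(genome=GENOME_SIZE_BASES, coverage=TARGET_COVERAGE):
    """Sequenced bases required to reach the target fold coverage."""
    return genome * coverage


def required_yield_fraction():
    """Share of the theoretical array yield needed for one genome."""
    return bases_needed() / max_raw_bases_per_array()


if __name__ == "__main__":
    print(f"Array area: {array_area_mm2():.0f} mm^2")
    print(f"Max raw yield: {max_raw_bases_per_array() / 1e9:.0f} gigabases")
    print(f"Needed for 40x: {bases_needed() / 1e9:.0f} gigabases")
    print(f"Required yield fraction: {required_yield_fraction():.2f}")
```

Under these assumptions, a single array would need a bit under 60 percent of its spots to yield usable full-length reads in order to cover one genome 40-fold.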
A number of technical improvements over the next few years are expected to increase the throughput further, to 15 terabases, or 120 genomes, per three- to five-day instrument run. Drmanac said that these improvements can be achieved "just with precision engineering" and will require "no new inventions and no new physics."
They will include an increase in spot density from 3 billion to 24 billion per array, along with smaller DNA nanoballs; a decrease in the number of pixels required to image each spot from 2 to 1; faster cameras that can acquire more than 100 frames per second instead of the current 30; an increase in the number of megapixels per CCD camera from 1 to between 2 and 4; a reduction in the number of dye labels from 4 to 2; and the acquisition of more than one base per cycle.
On such an instrument, 8 to 12 human genomes could be sequenced on the same nanoarray using just 100 microliters of reagents per genome. Reagents per genome will be "less expensive than FedEx for sending the samples," Drmanac said, and the instrument-amortization cost per genome will reach about $20.
"Using 20 of these instruments, we can sequence about 100,000 genomes per year in one small facility," he said, and 10 such facilities with 200 instruments in total would be sufficient to sequence a million human genomes per year.
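The facility-level target can be sanity-checked against the per-run figures with a short calculation. The sketch below is illustrative only: it assumes 120 genomes per run, back-to-back runs, and a 365-day operating year, none of which the company stated as operating parameters.

```python
# Rough consistency check: 20 instruments, 120 genomes per run,
# three- to five-day runs, 100,000 genomes per year (figures from the article).
GENOMES_PER_RUN = 120
INSTRUMENTS = 20
TARGET_GENOMES_PER_YEAR = 100_000
DAYS_PER_YEAR = 365  # ASSUMPTION: instruments could run year-round


def runs_needed_per_instrument(target=TARGET_GENOMES_PER_YEAR,
                               instruments=INSTRUMENTS,
                               genomes_per_run=GENOMES_PER_RUN):
    """Runs each instrument must complete per year to hit the target."""
    return target / (instruments * genomes_per_run)


def utilization(run_days):
    """Fraction of the year each instrument must spend running."""
    return runs_needed_per_instrument() * run_days / DAYS_PER_YEAR


if __name__ == "__main__":
    print(f"Runs per instrument per year: {runs_needed_per_instrument():.1f}")
    for days in (3, 5):
        print(f"{days}-day runs: {utilization(days):.0%} utilization")
```

On these assumptions, each instrument would need about 42 runs per year, which corresponds to roughly 34 percent utilization with three-day runs and 57 percent with five-day runs, so the stated target leaves room for downtime.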
Sample preparation will need to keep up with this throughput, and the company expects to be able to produce 100,000 sequencing libraries in an automated fashion "with just a couple of instruments," he added.
Complete Genomics' sequencing technology has two limitations, according to Drmanac: each run will always take several days, so a genome cannot be sequenced in an hour; and the read length will be limited to an estimated 2x100 bases. However, "none of these are limitations for many of the genome applications, except for special diagnostics," he said.
In addition to improving sequencing instrumentation and processes, the company has also started to explore a new method to sequence both haplotypes of a genome individually by physically separating the chromosomes prior to sequencing, an application Drmanac called "really important for medical genetics and to improve accuracy."
Customer Projects: Disease Samples
To illustrate possible research applications of its service, Drmanac mentioned several early customer projects where Complete Genomics sequenced a small number of disease samples.
In its first analysis of a primary tumor, the company sequenced the genome of a non-small cell lung tumor and its matched control for customer Genentech. The analysis revealed more than 50,000 somatic single nucleotide variants, along with a number of structural variants and copy number variations. "Obviously, the real biology will come when we sequence 100 tumor-normal pairs for the same tumor type," Drmanac said.
During a workshop at AGBT organized by Complete Genomics, Genentech researcher Zemin Zhang provided further details about the project, and mentioned that Genentech "is in the process of sequencing a lot more tumor genomes with Complete Genomics."
For the Institute for Systems Biology, Complete Genomics last year sequenced the genomes of two healthy parents and their two children, who both suffer from Miller syndrome, a rare craniofacial genetic disorder, as well as lung disease. ISB researchers presented early results from this project last fall (see In Sequence 9/29/2009). The analysis identified mutations in a single gene that cause Miller syndrome in both children, as well as mutations in three other candidate genes. ISB has since ordered 100 genomes from Complete Genomics for a study of Huntington's disease (see In Sequence 11/3/2009).
In another project, Drmanac reported that the company sequenced the genome of an infant with extremely high cholesterol levels for a researcher at the University of Texas Southwestern Medical Center. The analysis showed that the child has mutations in a transporter gene, and a treatment for this condition is available (see In Sequence 12/8/2009).
Last year, Complete Genomics published a detailed description of its technology, together with the genome sequences of two HapMap samples and another human genome (see In Sequence 11/10/2009). Since completing that project, it has also sequenced a HapMap trio for Pfizer, a project that enabled the company to estimate its false discovery rate. The three HapMap samples, of European descent, are also being sequenced as part of the 1000 Genomes Project, according to Drmanac.
The company generated about 180 gigabases for each of the three genomes and was able to call bases across 96 to 97 percent of each genome. For each genome, the researchers called about 3.3 million SNPs, more than 200,000 insertions, and more than 200,000 deletions.
As part of their analysis, the researchers looked for novel SNPs in the child's genome where both parents are homozygous for the reference genome and found more than 5,200. Since the number of actual de novo mutations in the child's genome is expected to be much smaller, the scientists were able to use this number to estimate the false discovery rate to be about 2 false variants per megabase.
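The arithmetic behind this estimate can be sketched in a few lines. The calculation below treats essentially all of the apparent de novo SNPs as false calls and divides by the number of called bases; the 2.9-gigabase callable genome size and the `expected_de_novo` parameter are assumptions introduced here for illustration, not values reported by the company.

```python
# Sketch of a trio-based false discovery estimate.
APPARENT_DE_NOVO_SNPS = 5200   # "more than 5,200", from the article
CALLABLE_BASES = 2.9e9         # ASSUMPTION: called portion of the genome


def false_variants_per_megabase(apparent_de_novo=APPARENT_DE_NOVO_SNPS,
                                callable_bases=CALLABLE_BASES,
                                expected_de_novo=0):
    """Estimate false variants per megabase from trio inconsistencies.

    expected_de_novo is a hypothetical correction for true de novo
    mutations; it defaults to 0, i.e. every inconsistency counts as false.
    """
    false_calls = apparent_de_novo - expected_de_novo
    return false_calls / (callable_bases / 1e6)


if __name__ == "__main__":
    rate = false_variants_per_megabase()
    print(f"Estimated false variants per megabase: {rate:.2f}")
```

With these inputs the estimate comes to roughly 1.8 false variants per megabase, in line with the figure of about 2 quoted above.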
Elaine Mardis, co-director of the Genome Center at Washington University, questioned Complete Genomics' validation method for insertions and deletions — Sanger sequencing — saying that these variants are "hard to detect" using Sanger. A company spokesperson told In Sequence that the firm believes "using Sanger sequencing for verifying the accuracy of our indel calls is sufficient" and pointed out that the company uses "a procedure that handles properly 'mixed base calls' in Sanger reads."
As Complete Genomics is preparing to make its commercial sequencing service widely available, it recently released details of its offering for potential customers.
According to a service specification sheet distributed at a conference this month, the company promises "highly accurate" sequence variant detection (SNPs and small indels) on both alleles for more than 90 percent of the genome.
Besides a list of SNPs and small indels, it also delivers copy number variants that are based on depth of sequence coverage, a recently added feature.
The company provides both data summary reports and supporting data, such as raw sequence reads, per-base quality scores, and maps of paired-end reads against the human reference genome, currently NCBI Build 36.
Every base of the reference is reported in the results and is called as a variant, as identical to the reference, or as absent. Variants are annotated, including their presence in dbSNP and whether they have potential effects on protein function.
The company requires 15 micrograms of unamplified, high molecular weight genomic DNA for the analysis, submitted by the customer in barcoded 96-well plates supplied by Complete Genomics.
Customers can expect to receive data within three to four months after the company confirms the quality of their sample, which takes about two weeks from sample receipt. However, the company says that "actual timelines for data delivery can vary based on the number of genomes in a project."
The company also said that it provides data in "convenient text-based formats," and that the total data set for each genome is around 400 gigabytes.
After the data is sent to customers, it will be stored on Complete Genomics' "secure servers" for up to 30 days. Any remaining genomic DNA will also be destroyed after 30 days, and "data and samples are the property of the customer."
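To put the retention policy in context, the sketch below estimates how much delivered data would sit on the company's servers at any one time. It assumes a constant 500 genomes per month, a 30-day month, that each genome occupies the quoted 400 gigabytes for the full 30 days, and decimal units; how the company actually allocates its storage was not disclosed beyond what is reported here.

```python
# Steady-state estimate of customer data held under a 30-day retention policy.
GENOMES_PER_MONTH = 500          # production goal, from the article
DATASET_GB_PER_GENOME = 400      # delivered data set size, from the article
RETENTION_DAYS = 30              # from the article
PLANNED_STORAGE_PB = 1.5         # from the article
DAYS_PER_MONTH = 30              # ASSUMPTION: simplifies the rate calculation
GB_PER_PB = 1_000_000            # ASSUMPTION: decimal units


def genomes_retained(rate_per_month=GENOMES_PER_MONTH,
                     retention_days=RETENTION_DAYS):
    """Genomes on the servers at steady state (arrival rate x retention)."""
    return rate_per_month / DAYS_PER_MONTH * retention_days


def retained_storage_gb():
    """Total delivered data held at steady state, in gigabytes."""
    return genomes_retained() * DATASET_GB_PER_GENOME


def share_of_planned_storage():
    """Retained delivered data as a fraction of planned capacity."""
    return retained_storage_gb() / (PLANNED_STORAGE_PB * GB_PER_PB)


if __name__ == "__main__":
    print(f"Retained data: {retained_storage_gb() / 1000:.0f} terabytes")
    print(f"Share of planned storage: {share_of_planned_storage():.0%}")
```

Under these assumptions, delivered data awaiting deletion would amount to about 200 terabytes, or roughly 13 percent of the planned 1.5 petabytes.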
It appears that the firm is looking to expand its presence into the European market. Among the 12 positions currently posted on the employment page of its website, the company is seeking a sales manager in the UK, as well as another for either France or Germany.