As Complete Genomics prepares to launch a sequencing service next year that it claims will give customers a human genome for $5,000, the company is still evaluating certain aspects of the IT infrastructure it will need to support that service, as well as the extent of the informatics analysis that it will provide to its customers.
The company has not yet announced the exact terms of its service, but Bruce Martin, vice president of software, told BioInform that the $5,000 price tag includes sequencing to 20-fold coverage and “the first few steps of analysis” such as alignment, assembly, and some automated annotation.
Martin said that the company has developed its own suite of base callers, mappers, assembly tools, and analysis software for the service. “The founders of the company realized that they had to make software one of the cornerstones of the company’s development,” he said.
One option under discussion at Complete Genomics is whether in some cases it is more cost-effective for the firm to host the sequence data for customers or to let customers manage the data on their own.
The computational and storage requirements for the service are considerable. The company, which operates in a 32,000-square-foot facility in Mountain View, Calif., has already built a data center with 400 terabytes of disk storage and 600 processors. Next year, it plans to scale up to 5 petabytes of disk storage and 10,000 processors, and by 2010, it wants to ramp up capacity another six-fold, to 30 petabytes of storage and 60,000 processors.
Martin explained to BioInform that the company is “planning for scale” and has pulled together a team of bioinformaticists, computational biologists, image-processing experts, data-center planners, IT specialists, and software engineers specializing in indexing and searching.
The company said it has a total of 100 employees, 30 of whom are IT and computing staff.
“Our primary strategy has been to put together a multidisciplinary team” in order to solve a range of bioinformatics challenges such as algorithms for assembly and alignment, he said. Beyond being able to set up a pipeline, Martin said he sought people who could “deploy the pipeline at extreme scale.”
If the company’s business proceeds according to plan, scale will certainly be an issue. By the end of the year, Complete Genomics expects to have 16 sequencers in place, and intends to increase that to 192 sequencers by 2010.
The company expects to sequence 1,000 complete human genomes next year and 20,000 genomes in 2010. By the end of 2010, it expects to be able to sequence 200 genomes a day.
“As of a few weeks ago we had around 600 processors and something like 400 terabytes online and in active use,” Martin said, adding that these numbers will “be up dramatically” by the end of 2008, though he did not elaborate.
By 2010, the data center is expected to hold around 60,000 processors with 30 petabytes of storage. The data center will at that point be at two locations, a smaller one in Mountain View and a larger one at an undisclosed remote site.
Martin said Complete Genomics is currently using commodity computing technology such as large-scale clustered file systems, and its initial development system is based on an Isilon clustered storage system. “We like it a lot, it is very scalable, it performs well, it is easy to manage, but we are still evaluating what our generation one production systems are going to look like,” he said.
“The secret is knowing which [systems] to choose and putting software engineers together with your operations folks so you can design systems all the way from the algorithms to where you plug it into the wall and being able to operate it,” he said. “That’s a bit of the secret sauce we are bringing to the table here.”
Another issue the company is addressing is data security — particularly for potential customers in the pharmaceutical industry. Martin, who joined Complete Genomics from PSS Systems, a retention-management software firm, said that his experience in this area, along with available best practices, provides a good foundation for addressing concerns about segregating data, keeping metadata separate, and other policies and procedures needed for a secure information environment.
“I have a very deep background in telecom and enterprise software,” he said. “You don’t hear a lot about banks talking about how hard it is to do it, because it is a solved problem.”
Keep it for Me
As the company explores the sequencing-services business model, one key question it is mulling is how long to retain a customer’s data.
“Our strategy is fairly simple here,” Martin said. “There will be the active tier for the parts of the data set the pipeline is actively computing upon,” and then there will be a “delivery tier,” which is a queue for data for which the assembly, alignment, or annotation may be complete and is pending delivery to a customer.
“Our expectation is that we are going to have a lot of dialogue with our customer base about exactly what their requirements are,” he said. The delivery time period “will be baked into our financial models and our pricing.”
Even if the company only keeps the data for 90 days — and “that is a hypothetical,” Martin said — “we will have the capacity to go even further than that.”
In terms of its IT options, Complete Genomics is still in “the evaluation stage” for its production system. “When you build a system of this scale you evolve it over time,” he said, much as Google and Yahoo have evolved generations of subsystems unified through their software architectures.
This arrangement lets those firms switch between vendors and change architectures without too many challenges, he said. “We are taking a similar approach.”
As for data analysis, the company has developed its own suite of base callers, mappers, assembly tools, and analysis software for its proprietary sequencing approach, which is based on short-read sequencing-by-probe-ligation technology.
The technology generates 35-base-pair reads, but unlike the short reads from the Illumina or ABI SOLiD systems, these reads are gapped: 10 bases, a gap, 20 bases, another gap, and a final 5 bases. As a result, the company has developed its own mapping software for these reads, which can map a 50x-coverage human data set against a human reference in under 24 hours, Martin said.
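The article does not describe how Complete Genomics’ mapper works internally, but the gapped-read layout it mentions — three segments separated by gaps of variable size — can be illustrated with a toy matcher that anchors on the first segment and tries a small range of gap sizes between segments. The gap-size ranges, sequences, and function name below are all hypothetical, chosen only to show the idea.

```python
# Toy illustration of matching a gapped 35-bp read (10 + gap + 20 + gap + 5)
# against a reference string. The allowed gap sizes (0-3) are hypothetical;
# the article does not specify them.

def find_gapped_matches(ref, seg10, seg20, seg5,
                        gap1_range=range(0, 4), gap2_range=range(0, 4)):
    """Return reference offsets where all three read segments match,
    allowing a small, variable gap between consecutive segments."""
    hits = []
    start = ref.find(seg10)
    while start != -1:
        for g1 in gap1_range:
            mid = start + len(seg10) + g1
            if ref[mid:mid + len(seg20)] == seg20:
                for g2 in gap2_range:
                    tail = mid + len(seg20) + g2
                    if ref[tail:tail + len(seg5)] == seg5:
                        hits.append(start)
        start = ref.find(seg10, start + 1)
    return hits

# A made-up reference containing the three segments with gaps of 2 and 1.
ref = "TT" + "ACGTACGTAC" + "GG" + "AAAAACCCCCGGGGGTTTTT" + "C" + "ACGTA" + "TT"
print(find_gapped_matches(ref, "ACGTACGTAC", "AAAAACCCCCGGGGGTTTTT", "ACGTA"))
# prints [2]
```

A production mapper would of course index the reference rather than scan it linearly, but the segment-plus-gap search pattern is the same.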
Martin said that Complete Genomics has hired several bioinformatics specialists, such as an individual who developed the company’s base caller and an expert in assembly algorithms.
The assembly software, which is designed to generate a diploid assembly, scales linearly with the number of CPUs, Martin said. The assembler uses a combination of Bayesian reasoning and de Bruijn graph-based algorithms. “The statistically driven models allow you to integrate even partial or incomplete information sets,” he said.
The software is also able to identify structural variants such as deletions or translocations as well as copy number variations, the company said.
“We are focused on defining software and algorithms that can make this largely an automated process,” Martin said, adding that the company is building an automated pipeline and “extremely fast software” with the goal of minimizing human intervention and lowering costs.
Given the scale at which the company operates, it expects to get “very, very good pricing” on the elements required for its IT architecture, Martin said.
Considerable compute power is expected to address the “wall clock time” element of the service business. While analyzing hundreds of genome-scale data sets is a complex challenge, “I can just throw more computers at the problem and fix that,” Martin said.
“When you throw enough computing at the assembly problem, there is a lot of extra information that a base caller produces, such as quality scores,” he said. Academics have been discussing the need for better algorithms that take advantage of all the information in the read set.
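One standard way such quality scores enter a statistical model is Phred-style weighting: a score Q corresponds to an error probability of 10^(-Q/10), and each call votes for a base by log-likelihood. The sketch below is a generic illustration of that idea, not Complete Genomics’ method; the function name and example calls are hypothetical.

```python
# Generic Phred-weighted consensus call at one position of a pileup
# (an illustration of using quality scores, not Complete Genomics' code).
import math

def consensus_base(calls):
    """calls: list of (base, phred_quality). Returns the base with the
    highest log-likelihood, weighting each call by its quality score."""
    scores = {b: 0.0 for b in "ACGT"}
    for base, q in calls:
        p_err = 10 ** (-q / 10)               # Phred: Q = -10 * log10(p_err)
        for b in "ACGT":
            if b == base:
                scores[b] += math.log(1 - p_err)
            else:
                scores[b] += math.log(p_err / 3)  # error spread over 3 bases
    return max(scores, key=scores.get)

# Two confident A calls outvote one low-quality T call.
print(consensus_base([("A", 30), ("A", 20), ("T", 5)]))  # prints A
```

Ignoring the scores and taking a simple majority would weight the Q5 call as heavily as the Q30 call — throwing away exactly the information Martin says their algorithms exploit.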
“We have spent a number of years designing new algorithms to do just that,” said Martin.
“Our focus is high throughput as opposed to low latency,” he said. “We are not really worried about doing something in a day or a week. We are worried about doing a lot of genomes for a customer in 90 days … about getting a statistically significant data set in a reasonable amount of time.”
Martin said that the company “hasn’t made a final decision” on whether to release its software under an open source license, as some other next-generation sequencing vendors have, though he added that “we understand that we need to gain the trust and credibility for our data set and our methodology.”