This story was originally published April 4.
ORLANDO – A year and a half into its production phase, the Cancer Genome Atlas is currently producing more than 20 terabytes of data per month as it moves toward its goal of sequencing 3,000 tumor/normal pairs by late 2011.
And, as sequencing throughput increases and costs continue to fall, project officials anticipate sequencing 10,000 tumor/normal pairs by 2014.
At a session outlining the current status of the project at the annual meeting of the American Association for Cancer Research held here this week, Brad Ozenberger, TCGA program director for the National Human Genome Research Institute, said that the project is currently sequencing an average of 250 cases per week — an increase from 150 cases per week in 2010.
Each case comprises a tumor/normal pair, both of which undergo exome sequencing. In addition, TCGA is conducting whole-genome sequencing for 10 percent of all cases, or approximately 50 whole genomes per month.
Ozenberger said that the 500 exomes plus 50 whole genomes equates to more than 20 terabytes of data per month — a substantial increase over an estimated 8 terabytes per month that the project was generating last year.
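As a rough sanity check, the stated rates can be reconciled with a back-of-envelope calculation. The per-sample sizes below are illustrative assumptions, not figures from the project:

```python
# Back-of-envelope check of TCGA's stated monthly data volume.
# GB_PER_EXOME and GB_PER_GENOME are assumed values for illustration;
# the article does not report per-sample file sizes.
EXOMES_PER_MONTH = 500
GENOMES_PER_MONTH = 50
GB_PER_EXOME = 10      # assumed raw output for one exome
GB_PER_GENOME = 300    # assumed raw output for one whole genome

total_tb = (EXOMES_PER_MONTH * GB_PER_EXOME +
            GENOMES_PER_MONTH * GB_PER_GENOME) / 1000
print(f"Estimated monthly output: {total_tb:.0f} TB")
```

With these assumed sizes, the estimate lands at roughly 20 TB per month, consistent with the figure Ozenberger cited.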
The goal of the TCGA is to completely characterize 200 samples from each of 10 tumor types over the next two years and 20 tumor types within five years, though Ozenberger said that ideally the project would like to characterize 500 samples per tumor type. In addition to exome and whole-genome sequencing, the project is conducting gene expression analysis, methylation analysis, microRNA analysis, and copy number variation analysis for each sample.
Ozenberger said that the project has so far collected more than 2,500 samples and aims to have 3,000 cases sequenced and analyzed by the end of the year.
To date, however, TCGA has sequenced just under 1,000 cases — a pace that Ozenberger acknowledged is a bit behind where it should be to reach the 3,000 mark by the end of the year.
"That's partly due to the challenges of doing genomic sequencing at this scale, but that is a challenge that will be met readily," he said.
A key issue, he noted, is analyzing and managing the data, which he described as the "biggest challenge" that the project currently faces. "A lot of effort is going into understanding how to manage data at this scale," he said.
As an example, Ozenberger said that it recently took nine hours to transfer data from a single tumor/normal pair between NHGRI and one of the TCGA sequencing centers using standard Internet protocols.
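The nine-hour figure implies a modest effective network throughput. The dataset size below is an assumption for illustration; the article does not say how large the tumor/normal pair was:

```python
# Effective throughput implied by a 9-hour transfer of one
# tumor/normal dataset. PAIR_SIZE_GB is an assumed value; only the
# 9-hour transfer time is reported in the article.
PAIR_SIZE_GB = 300     # assumed size of one tumor/normal dataset
TRANSFER_HOURS = 9     # reported transfer time

bits = PAIR_SIZE_GB * 1e9 * 8
mbps = bits / (TRANSFER_HOURS * 3600) / 1e6
print(f"Implied throughput: {mbps:.0f} Mbit/s")
```

Under that assumption, the transfer averaged on the order of 74 Mbit/s, which helps explain why moving data at TCGA's scale over standard Internet protocols was a bottleneck.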
Paul Spellman, who leads the TCGA Data Analysis Center at Lawrence Berkeley National Laboratory, noted that one of the most "dramatic" changes in the project since moving from the pilot phase to production phase in late 2009 was the adoption of RNA-seq in place of gene expression arrays.
"We're now nearing the ability to have roughly 10 gigabases of raw sequence for every TCGA sample from RNA sequencing," Spellman said.
Spellman forecast that advances in sequencing technology will enable the project to sequence 6,500 cases by the end of 2012, or the equivalent of more than 500 terabases of raw data.
By 2014, he said it's likely that the project will eliminate exome sequencing in favor of whole-genome sequencing, and he estimated that by October of that year, TCGA will have sequenced 10,000 cases and generated nearly 2 petabytes of raw data.
Have topics you'd like to see covered by Clinical Sequencing News? Contact the editor at btoner [at] genomeweb [.] com.