COLD SPRING HARBOR, NY (GenomeWeb News) – Since this spring, the 1000 Genomes Project has almost tripled the amount of sequence data produced during its pilot phase, to 2.8 terabases, or approximately 100 terabytes.
The project has also added another production center, the Max Planck Institute for Molecular Genetics in Berlin, which has recently begun generating data for the effort.
Paul Flicek, head of the vertebrate genomics group at the European Bioinformatics Institute in Hinxton, UK, and co-leader of the 1000 Genomes Project's data flow group, gave an update on the project's progress at the Personal Genomes meeting at Cold Spring Harbor Laboratory last week.
Raw sequence data generated by the production centers is amassed at the EBI, where researchers, in collaboration with colleagues from the Wellcome Trust Sanger Institute, recalibrate it to obtain accurate and uniform quality scores, allowing data from different centers and sequencing platforms to be compared.
It is then uploaded to both the EBI’s and the National Center for Biotechnology Information's FTP sites for public access. Long term, the data will be stored in the NCBI’s Short Read Archive and the EBI’s European Read Archive.
The next batch of data — resulting from a data freeze in August — will be ready for download early this week, according to Flicek. As a result of the increased data production, data transfer between the production centers and the data storage centers is becoming increasingly difficult, he added.
The next data freeze, which is planned for Oct. 24, is expected to complete data production for the first two of the three pilot projects.
Under the first pilot project, researchers are sequencing 60 HapMap samples from three different populations at low coverage. The second pilot involves sequencing two trios – parents and child – of European and African descent at high coverage. The third pilot project aims to sequence 1,000 genes in 1,000 individuals at high coverage.
Later this year, following a meeting in November, the scientists are planning to release a first genetic variation map, according to Flicek.
Following the pilot phase, the entire project, he said, will probably generate about 20 terabases of sequence data. Sequencing production worldwide, he estimated, will soon be just an order of magnitude smaller than data generation by the Large Hadron Collider that recently opened in Geneva, which is expected to produce 15 petabytes of data per year.
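The figures quoted above can be sanity-checked with some back-of-envelope arithmetic, under one assumption not stated in the article: that storage grows linearly with bases sequenced, so the pilot-phase ratio of roughly 100 terabytes per 2.8 terabases also holds for the full project.

```python
# Rough check of the storage figures quoted in the article.
# Assumption (ours, not the article's): bytes scale linearly with bases.

PILOT_BASES = 2.8e12        # 2.8 terabases sequenced in the pilot phase
PILOT_BYTES = 100e12        # ~100 terabytes of storage for that data

# Implied storage cost per base (sequence plus quality scores and metadata)
bytes_per_base = PILOT_BYTES / PILOT_BASES   # ~36 bytes per base

FULL_PROJECT_BASES = 20e12  # ~20 terabases expected for the whole project
full_project_bytes = FULL_PROJECT_BASES * bytes_per_base

LHC_BYTES_PER_YEAR = 15e15  # 15 petabytes per year expected from the LHC

ratio = LHC_BYTES_PER_YEAR / full_project_bytes
print(f"~{full_project_bytes / 1e15:.2f} PB for the full project")
print(f"LHC output is ~{ratio:.0f}x larger")
```

On these assumptions the full project comes to roughly 0.7 petabytes, about a twentieth of the LHC's annual output, consistent with Flicek's "order of magnitude smaller" estimate.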
The 1000 Genomes Project, a three-year effort launched in January, aims to produce a detailed catalog of genetic variants in the human genome.
In May, the project's organizers announced they had generated 300 gigabases of sequence data, more than the amount of data stored in GenBank.
The following month, Illumina, Roche/454, and Applied Biosystems joined the project as data producers, alongside the original production centers: the Sanger Institute, BGI Shenzhen, the Broad Institute of MIT and Harvard, Washington University School of Medicine's Genome Center, and Baylor College of Medicine's Human Genome Sequencing Center.
The MPI in Berlin is the latest production center to join the effort, Flicek told GenomeWeb Daily News.