COLD SPRING HARBOR, NY (GenomeWeb News) – More than 300 gigabases of data have already been generated for 1000 Genomes pilot projects — more data than is currently housed in all of GenBank — and organizers plan to have a whopping 2 terabases of genetic information by the end of this year.
An international consortium launched the 1000 Genomes Project this February. Speaking at the Biology of Genomes meeting at Cold Spring Harbor Laboratory last night, Steering Committee Co-chair Richard Durbin, a principal investigator at the Wellcome Trust Sanger Institute, summed up the progress made on the project so far.
The first stage of the project involves three pilot projects, Durbin explained. The first is a low-coverage analysis of 60 samples from three different populations, including individuals of European, African, and East Asian descent. The second will involve families or trios of individuals of European and African descent analyzed at higher coverage, and the third will involve sequencing 1,000 genes in 1,000 people at high coverage. Several experiments suggest 20- to 30-fold coverage will be necessary, Durbin noted.
So far, the team has mainly generated data for the first two pilot projects. The total now exceeds 300 gigabases, Durbin said. By comparison, a single human genome comprises about 3 gigabases. The consortium has completed low-coverage sequencing on roughly ten people from the first pilot project and has higher-coverage data on some trios, mainly European, for the second pilot project.
Preliminary analyses have also been done on most of this data. The team did a data freeze a couple of weeks ago, when it had about 240 gigabases of data: 32 gigabases on low-coverage samples, 185 gigabases from European trios, and 20 gigabases from African trios.
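For readers who want to check the arithmetic behind these figures, the short Python sketch below adds up the data-freeze breakdown and converts the reported totals into human-genome equivalents. It assumes the round 3-gigabase genome size quoted above and takes 1 terabase as 1,000 gigabases. The variable and function names are illustrative only and are not part of any 1000 Genomes tooling.

```python
# Back-of-the-envelope check of the figures reported in this article.
# Assumption: a human genome is treated as 3 gigabases, the round
# number quoted in the article.
GB_PER_GENOME = 3

# Data-freeze breakdown as reported by Durbin, in gigabases.
freeze = {
    "low_coverage_samples": 32,
    "european_trios": 185,
    "african_trios": 20,
}


def genome_equivalents(gigabases, genome_size_gb=GB_PER_GENOME):
    """Return how many human genomes' worth of sequence a total represents."""
    return gigabases / genome_size_gb


# The three categories sum to 237, consistent with "about 240 gigabases".
freeze_total = sum(freeze.values())

current_total_gb = 300        # "more than 300 gigabases"
year_end_target_gb = 2 * 1000  # 2 terabases, expressed in gigabases

print("Freeze total (Gb):", freeze_total)
print("Genome equivalents so far:", genome_equivalents(current_total_gb))
print("Genome equivalents at year-end target:",
      round(genome_equivalents(year_end_target_gb)))
```

These genome equivalents describe the pooled total across all samples. They are not the per-sample coverage, which depends on how many individuals the reads are spread across.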
After these pilot projects, Durbin said, the consortium will generate high-coverage information from multiple populations for all 1,000 genomes of the 1000 Genomes Project. The exact design for that two-year main project has yet to be finalized. For example, the main stage of the project will likely involve collecting additional samples, and the team must still determine how and where these will be collected.
The goal right now, Durbin said, is to get sufficient coverage to call variants with frequencies as low as 5 percent in the three populations and as low as 1 percent in the 1,000 genes. Ultimately, the consortium hopes to call variants with frequencies as low as 1 percent in all 1,000 genomes.
The team plans to have 2 terabases of data by the end of this year.
The 1000 Genomes data collected so far has been submitted to the NCBI’s Short Read Archive.