1,000 Genomes Project Approaches 3 TB of Sequence Data



Since this spring, the 1,000 Genomes Project has almost tripled the amount of sequence data produced during its pilot phase, to 2.8 terabases, or approximately 100 terabytes.

The project has also added another production center, the Max Planck Institute for Molecular Genetics in Berlin, which recently began generating data for the effort.

Paul Flicek, head of the vertebrate genomics group at the European Bioinformatics Institute in Hinxton, UK, and co-leader of the 1,000 Genomes Project's data flow group, gave an update on the project's progress at the Personal Genomes meeting at Cold Spring Harbor Laboratory last month.

Raw sequence data generated by the production centers is collected at EBI, where researchers, in collaboration with colleagues from the Wellcome Trust Sanger Institute, recalibrate it to obtain accurate and uniform quality scores that allow data from different centers and sequencing platforms to be compared.

It is then uploaded to both EBI's and the National Center for Biotechnology Information's FTP sites for public access. In the long term, data will be stored in NCBI's Short Read Archive and EBI's European Read Archive.

The next batch of data, resulting from a data freeze in August, was expected to be ready for download last month, according to Flicek. He added that rising data production is making transfers between the production centers and the data storage centers increasingly difficult.

The next data freeze, which was planned for the end of October, is expected to complete data production for two of the three 1,000 Genomes pilot projects.

Under the first pilot project, researchers are sequencing 60 HapMap samples from three different populations at low coverage. The second pilot involves high-coverage sequencing of two trios — parents and child — of European and African descent. The third pilot project, which is still underway, aims to sequence 1,000 genes in 1,000 individuals at high coverage.

Following a meeting this month, scientists plan to release a first genetic variation map by next year, according to Flicek.

Following the pilot phase, the entire project, he said, will probably generate about 20 terabases of sequence data. Sequencing production worldwide, he estimated, will soon be just an order of magnitude smaller than data generation by the Large Hadron Collider that recently opened in Geneva, which is expected to produce 15 petabytes of data per year.
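For a rough sense of scale, the figures quoted above can be combined in a few lines of arithmetic. The sketch below is illustrative only: it assumes the pilot's ratio of storage to sequence (about 100 terabytes for 2.8 terabases) would hold for the full project, which the article does not state, and all variable and function names are made up for this example.

```python
# Figures quoted in the article (decimal units, so "tera" cancels out).
PILOT_BASES_TB = 2.8          # terabases sequenced in the pilot so far
PILOT_STORAGE_TB = 100.0      # approximate terabytes of data for those bases
FULL_PROJECT_BASES_TB = 20.0  # projected terabases for the entire project
LHC_PB_PER_YEAR = 15.0        # expected LHC output, petabytes per year


def bytes_per_base(storage_tb: float, bases_tb: float) -> float:
    """Average bytes stored per sequenced base."""
    return storage_tb / bases_tb


def projected_storage_tb(bases_tb: float, ratio: float) -> float:
    """Storage estimate, ASSUMING the bytes-per-base ratio stays constant."""
    return bases_tb * ratio


ratio = bytes_per_base(PILOT_STORAGE_TB, PILOT_BASES_TB)
full_tb = projected_storage_tb(FULL_PROJECT_BASES_TB, ratio)
lhc_tenth_pb = LHC_PB_PER_YEAR / 10  # "an order of magnitude smaller"

print(f"Pilot ratio: about {ratio:.1f} bytes per base")
print(f"Full project at that ratio: about {full_tb:.0f} TB")
print(f"One tenth of LHC output: {lhc_tenth_pb} PB per year")
```

Under that assumption, the full project would come to several hundred terabytes, while a tenth of the LHC's expected output is 1.5 petabytes per year. That second figure refers to sequencing production worldwide, not to the 1,000 Genomes Project alone.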
