1,000 Genomes Project Approaches 3 TB of Sequence Data



Since this spring, the 1,000 Genomes Project has almost tripled the amount of sequence data produced during its pilot phase, to 2.8 terabases, or approximately 100 terabytes.

The project has also added another production center, the Max Planck Institute for Molecular Genetics in Berlin, that has recently begun to generate data for the effort.

Paul Flicek, head of the vertebrate genomics group at the European Bioinformatics Institute in Hinxton, UK, and co-leader of the 1,000 Genomes Project's data flow group, gave an update on the project's progress at the Personal Genomes meeting at Cold Spring Harbor Laboratory last month.

Raw sequence data generated by the production centers is amassed at EBI, where researchers, in collaboration with colleagues from the Wellcome Trust Sanger Institute, recalibrate it to obtain accurate and uniform quality scores that allow data from different centers and sequencing platforms to be compared.

It is then uploaded to both EBI's and the National Center for Biotechnology Information's FTP sites for public access. In the long term, data will be stored in NCBI's Short Read Archive and EBI's European Read Archive.

The next batch of data — resulting from a data freeze in August — was expected to be ready for download last month, according to Flicek. As a result of the increased data production, data transfer between the production centers and the data storage centers is becoming increasingly difficult, he added.

The next data freeze, which was planned for the end of October, is expected to complete data production for two of the three 1,000 Genomes pilot projects.

Under the first pilot project, researchers are sequencing 60 HapMap samples from three different populations at low coverage. The second pilot involves high-coverage sequencing of two trios — parents and child — of European and African descent. The third pilot project, which is still underway, aims to sequence 1,000 genes in 1,000 individuals at high coverage.

Following a meeting this month, scientists plan to release a first genetic variation map by next year, according to Flicek.

Following the pilot phase, the entire project, he said, will probably generate about 20 terabases of sequence data. Sequencing production worldwide, he estimated, will soon be just an order of magnitude smaller than data generation by the Large Hadron Collider that recently opened in Geneva, which is expected to produce 15 petabytes of data per year.
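To put the figures quoted in this article side by side, the short sketch below works through the arithmetic. It assumes, purely for illustration, that the storage cost per base seen in the pilot phase (100 terabytes for 2.8 terabases) would hold for the full project; the article does not state this, and the real ratio depends on what is stored alongside the base calls. All variable and function names are hypothetical.

```python
# Figures quoted in the article.
PILOT_TERABASES = 2.8            # sequence produced so far in the pilot phase
PILOT_TERABYTES = 100.0          # approximate storage for that sequence
FULL_PROJECT_TERABASES = 20.0    # Flicek's estimate for the entire project
LHC_PETABYTES_PER_YEAR = 15.0    # expected Large Hadron Collider output


def bytes_per_base(terabases: float, terabytes: float) -> float:
    """Average storage per sequenced base; the 'tera' prefixes cancel."""
    return terabytes / terabases


def projected_terabytes(terabases: float, ratio: float) -> float:
    """Storage needed for a given amount of sequence at a fixed bytes-per-base ratio."""
    return terabases * ratio


# Roughly 36 bytes are stored per base in the pilot data.
ratio = bytes_per_base(PILOT_TERABASES, PILOT_TERABYTES)

# ASSUMPTION (not from the article): the pilot ratio also applies to the full project.
full_project_tb = projected_terabytes(FULL_PROJECT_TERABASES, ratio)

# "An order of magnitude smaller" than the LHC's yearly output.
tenth_of_lhc_pb = LHC_PETABYTES_PER_YEAR / 10

print(f"Pilot storage ratio: {ratio:.1f} bytes per base")
print(f"Full project at that ratio: about {full_project_tb:.0f} TB")
print(f"One tenth of LHC output: {tenth_of_lhc_pb:.1f} PB per year")
```

Under that assumption, the full project would occupy on the order of 700 terabytes, while one tenth of the collider's expected yearly output is 1.5 petabytes.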
