By Matthew Dublin
Given the current state of the United States Post Office, it is not heartening that the largest sequencing data sets are still primarily transferred by mail. In fact, according to Li Yingrui of the Beijing Genomics Institute, “this is absurd.”
In an International Science Grid This Week article that looks at informatics challenges facing the 1000 Genomes Project, Yingrui and other 1000 Genomes Project participants, including David Altshuler of the Broad Institute and Phil Butcher, head of IT at the Sanger Institute, describe their frustrations with the current data bottleneck. After finishing the first 1,000 genomes in mid-2010, they are now aiming to sequence 2,500 genomes, and because of the current limitations of Internet bandwidth, the cloud won’t do.
“The main issue for us is that our data sizes are so large, that the cost and difficulty of moving the data to the cloud stops it being cost effective for many jobs. We do use the cloud for the Ensembl genomes database, but only to provide [data] mirrors that are closer to users,” says Butcher.
The IT leaders of the 1000 Genomes Project describe how they must “distressingly often resort to shipping hard disks around to transfer data between centers, rather than use the internet, or even via Aspera which is faster than ftp [file transfer protocol].”
The issue is so dire that BGI has established an open-access journal, GigaScience, to deal with the problem of data dissemination and organization.