The next few years are going to see some changes in human genetic studies, writes Dan Koboldt at MassGenomics. The adoption of the Illumina HiSeq X Ten sequencing system by a handful of centers will lead to the cheap sequencing of some 18,000 genomes a year. But that, he adds, serves up its own set of challenges.
There is, of course, the sheer size of the datasets to contend with. A BAM file for a 30x whole genome runs to around 80 to 90 gigabytes, he estimates, adding that the BAM files for a sample of a thousand people could take up some 80 terabytes of storage space. As storing that data will be expensive, Koboldt anticipates that researchers will have to make hard choices about what data to keep and what data to delete.
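The storage figure follows from simple multiplication, which a minimal Python sketch can make explicit. The function name `cohort_storage_tb` is hypothetical, and the decimal convention (1 terabyte = 1,000 gigabytes) is an assumption, since the post does not say which convention Koboldt uses.

```python
# Assumption: decimal units, 1 TB = 1000 GB.
GB_PER_TB = 1000


def cohort_storage_tb(n_samples, gb_per_bam):
    """Total BAM storage in terabytes for n_samples whole genomes."""
    return n_samples * gb_per_bam / GB_PER_TB


# Koboldt's per-genome estimate for a 30x BAM file is 80 to 90 GB.
low = cohort_storage_tb(1000, 80)
high = cohort_storage_tb(1000, 90)
print(f"1,000 genomes: {low:.0f} to {high:.0f} TB")
```

At the low end of the per-file estimate this reproduces the 80 terabytes quoted for a thousand people; at the high end the same cohort would need about 90 terabytes.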
And, he notes, moving data from the centers that can afford the HiSeq X Ten price tag to the researchers whose samples they run won't be easy. "Have you tried to download an 80 gigabyte file lately?" he says. "The regular internet is just not going to work for this."
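To put the download question in rough numbers, the sketch below estimates transfer time for one 80 gigabyte file. The link speeds are illustrative assumptions, not figures from Koboldt's post, and the helper `transfer_hours` is hypothetical. It assumes the link sustains its full rated speed, which real connections rarely do.

```python
def transfer_hours(size_gb, link_mbps):
    """Hours to move size_gb gigabytes over a link sustaining link_mbps megabits/s."""
    # Decimal units: 1 GB = 1000 MB, and 1 byte = 8 bits.
    size_megabits = size_gb * 1000 * 8
    return size_megabits / link_mbps / 3600


# Assumed sustained link speeds in megabits per second (illustrative only).
for mbps in (10, 100, 1000):
    print(f"{mbps:>5} Mbit/s: {transfer_hours(80, mbps):.1f} hours per 80 GB file")
```

Under these assumptions a single file takes a little under two hours at a sustained 100 megabits per second, and a thousand such files would take roughly 74 days over the same link.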
Still, Koboldt says that whole-genome sequencing studies will enable researchers to get a look at various kinds of sequence variations. "The wonderful thing about WGS is that it both enables and forces us to look beyond the obvious (e.g. the nonsynonymous variants in known protein-coding genes)," Koboldt writes. "We're headed into the unknown, the dark matter of the genome, whether we like it or not. And that is a good thing."