CHICAGO (GenomeWeb) – A little more than a year after launching a genome-sequencing effort, a high-performance computing facility at a Scottish university has analyzed more than 6,000 human genomes, and has been processing more than 600 a month of late.
Eventually, the Edinburgh Parallel Computing Centre at the University of Edinburgh expects to process 8,000 to 10,000 human genomes annually, according to Director Mark Parsons.
"Effectively, this is a genome factory. We move around 400 terabytes of data around every week, and, if we keep all of the genomes we process, we will store several petabytes of data every year," Parsons said.
The EPCC processes genomic data on behalf of Edinburgh Genomics, a University of Edinburgh-affiliated organization that is the largest university-based sequencing facility in the UK. The computing center also handles data processing for the Scottish Genomes Partnership, a collaboration announced in January 2015 between the University of Edinburgh, the University of Glasgow, and the National Health Service Scotland. The partnership has since grown to include the University of Dundee and the University of Aberdeen as well.
Processing the volume of data the partnership generates would not be possible without the high-performance computing center and massive data storage to back it up. "People are trying to do this using [commercial] cloud-type technology and they have not had good results," said Parsons.
It takes 40 hours to analyze a complete genome at the EPCC. But running a cloud server all-out for that long would burn out the processors fast, according to Parsons.
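As a rough illustration of the sustained load these figures imply, the sketch below combines the 40-hour analysis time with the monthly and annual genome counts quoted in this article. It assumes each analysis is an independent job, that jobs are spread evenly over time, and that a year has 8,760 hours; the function and constant names are hypothetical, and none of this describes how the EPCC actually schedules its work.

```python
# Back-of-envelope estimate: how many genome analyses must be in flight
# at once to sustain a given throughput, if each takes 40 hours.

HOURS_PER_GENOME = 40      # analysis time quoted in the article
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, 8,760 hours


def concurrent_analyses(genomes_per_year, hours_per_genome=HOURS_PER_GENOME):
    """Average number of analyses running simultaneously (evenly spread load)."""
    return genomes_per_year * hours_per_genome / HOURS_PER_YEAR


# Recent rate from the article: more than 600 genomes a month.
current = concurrent_analyses(600 * 12)

# Target range from the article: 8,000 to 10,000 genomes a year.
target_low = concurrent_analyses(8_000)
target_high = concurrent_analyses(10_000)

print(f"current rate: about {current:.1f} analyses in flight")
print(f"target range: about {target_low:.1f} to {target_high:.1f} analyses in flight")
```

Under these assumptions the workload amounts to a few dozen 40-hour jobs running continuously, which is consistent with Parsons' concern about running servers flat out for long stretches.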
The EPCC has its own Silicon Graphics International ICE XA supercomputer with 13,000 cores to handle genomics and other high-performance computing applications, including simulation for design of the Koenigsegg One:1, a Swedish vehicle said to be the world's first production "megacar."
For the Scottish Genomes Partnership, the two founding universities a year and a half ago jointly purchased 10 Illumina HiSeq X Ten sequencers; half went to each school. Edinburgh Genomics was the first site to deploy Illumina's SeqLab workflow software, chosen for ease of deployment.
The Edinburgh Parallel Computing Centre is handling data analysis and storage of gene sequences, leaving the scientific operations to the scientists on each campus. Parsons and his team are supported by an array of three ES7K Lustre parallel file system appliances from DataDirect Networks, providing nearly 3 petabytes of storage.
The Illumina HiSeq X machines produce raw genome sequence data, which is then copied over the university network to EPCC's supercomputer and stored on the DDN file systems, Parsons explained. The computing center then creates the variant call files.
Edinburgh Genomics' customers, including NHS Scotland, can retrieve and download to their own systems the end files for research or diagnostic purposes, Parsons said.
The EPCC already had 23 petabytes of DDN storage for other purposes and started the genomics processing a year ago with a single ES7K unit. Two more have come online as production has ramped up.
This decision to use DDN storage goes back to Parsons' view of the current state of the cloud. "We’ve seen projects falter when trying to support these complicated processing pipelines with traditional storage," Parsons said in a white paper released by DDN. "Properly spec’d compute and storage is vital to supporting highly complex workflows that generate vast amounts of data. Our genomics colleagues are pushing the boundaries of HPC and data management today," he added.
The £15 million ($19.2 million) Scottish Genomes Partnership is currently focused on diagnosing cancer, childhood illnesses, and genetic disorders of the central nervous system, as well as population studies.
"We would like to achieve a database for all genomic data for the whole of Scotland," Parsons said. At the moment, his center does not have the money to do so, but the potential is there.
Edinburgh Genomics was set up with a capital investment by the University of Edinburgh. Ongoing funding is contingent on the program breaking even within three years, meaning it must bring in enough money from genomic sequencing and processing fees to pay operating costs and to start paying off the original capital investment. That has already happened, said Rich Mansfield, a DDN system engineer stationed at the University of Edinburgh to support the Scottish Genomes Partnership.
"From Day 1, they were under pressure to deliver a return on investment, and they did so in the first year," Mansfield explained.
One of the early tests of sequencing technology at Edinburgh Genomics involved the genomes of 1,374 people in the Lothian Birth Cohorts, a long-term study of cognitive decline in aging. "We were able to handle vast amounts of data generated by the sequencers effectively," Edinburgh Genomics facility manager Javier Santoyo-Lopez said in the DDN white paper. "These capabilities were backed by world-class HPC hardware and our DDN storage, enabling us to support initial analysis, hold processed data, and then transfer final results to our users."
Though the EPCC is already producing large amounts of data, Parsons and his team are preparing for demand to accelerate if and when researchers start asking for full genomes for their work rather than just the small parts of genomes that are relevant to the immediate questions they are trying to answer. Parsons called this situation "exciting" in addition to being "challenging."