CHICAGO – Bioinformaticians from Australia's Commonwealth Scientific and Industrial Research Organisation (CSIRO) have released a computing method capable of analyzing complex, polygenic phenotypes that may involve epistatic interactions using massive sets of whole-genome data.
They described their method, called VariantSpark, in an article published last month in GigaScience.
VariantSpark is a cloud-based, distributed, machine-learning computational framework that achieves scale through multilayer parallelization. It is built on top of the Apache Spark distributed-computing analytics platform, referred to simply as Spark.
Spark coordinates computation from a central driver node, while the dataset is partitioned across the memory of multiple machines, called compute nodes. In other machine-learning setups, the dataset is partitioned "horizontally," with each compute node holding the data for all features on a subset of samples, but processing genomic data requires "vertical" partitioning because the number of features (in this case, variants) is far larger than the number of samples, according to the researchers.
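The difference between the two partitioning schemes can be sketched in a few lines. The function names and the toy genotype matrix below are illustrative only and are not part of VariantSpark's actual API; the sketch simply shows that a horizontal split hands each node a subset of samples with every variant, while a vertical split hands each node a subset of variants with every sample.

```python
# Toy sketch of horizontal vs. vertical partitioning of a genotype
# matrix (rows = samples, columns = variants). Illustrative names,
# not VariantSpark's API.

def horizontal_partitions(matrix, n_nodes):
    """Split by rows: each node holds all variants for a subset of samples."""
    rows_per_node = (len(matrix) + n_nodes - 1) // n_nodes
    return [matrix[i:i + rows_per_node]
            for i in range(0, len(matrix), rows_per_node)]

def vertical_partitions(matrix, n_nodes):
    """Split by columns: each node holds all samples for a subset of variants."""
    n_variants = len(matrix[0])
    cols_per_node = (n_variants + n_nodes - 1) // n_nodes
    return [[row[j:j + cols_per_node] for row in matrix]
            for j in range(0, n_variants, cols_per_node)]

# 4 samples x 6 variants, genotypes coded 0/1/2 (copies of the alt allele)
genotypes = [
    [0, 1, 2, 0, 1, 0],
    [1, 0, 0, 2, 1, 1],
    [2, 2, 1, 0, 0, 1],
    [0, 1, 0, 1, 2, 2],
]

h = horizontal_partitions(genotypes, 2)  # each node: 2 samples, all 6 variants
v = vertical_partitions(genotypes, 2)    # each node: all 4 samples, 3 variants
print(len(h[0]), len(h[0][0]))  # 2 6
print(len(v[0]), len(v[0][0]))  # 4 3
```

With 100 million variants and only thousands of samples, the vertical scheme keeps each partition at a manageable width, which is the property the researchers describe as essential for genomic data.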
CSIRO bioinformatics group leader Denis Bauer said that all public cloud providers and many high-performance computing clusters can run Spark. "The algorithm we implemented in Spark is sort of this universal framework for distributed computing," she explained.
Spark users often pick algorithms from a machine-learning library called MLlib. "We quickly noticed that just using MLlib and those standard algorithms is not going to cut it for the high-dimensional genomic space," Bauer said. "Therefore, we had to implement our own version of random forest, which uses a vertical partitioning instead of the usual horizontal partitioning to divvy up the data and distribute it to the different CPUs. That is our core innovation."
For their research, the CSIRO team created a "synthetic" dataset of 100,000 people, then performed population-scale analysis with VariantSpark's machine learning.
They processed one trillion data points at once, comprising more than 100 million variants on 10,000 samples, in a period of 15 hours. Bauer said that no other genomics technology platform has been able to accomplish that to date, and the next fastest systems would need 100,000 years to crunch that much data.
Her team examined "random forest," a classification concept popular in machine learning that considers multiple decision trees simultaneously, in its study. Random forest (RF) assigns a single "importance score" to indicate association power among multiple variables, making it suitable for analyzing complex phenotypes. "Even though RF is not a deterministic algorithm, it is an accurate approximation with a manageable computational requirement," the researchers wrote.
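The intuition behind an importance score can be shown with a toy calculation. A decision tree scores a candidate split on a variant by how much it reduces the impurity of a case/control label; a random forest averages such decreases over many trees and random feature subsets. The sketch below, which is illustrative and not VariantSpark's code, computes the Gini impurity decrease for a single split on a single variant.

```python
# Toy illustration of the idea behind a random-forest "importance
# score": how much splitting samples on one variant reduces the Gini
# impurity of a binary case/control label. A real forest averages
# such decreases across many trees; this shows just one split.

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def importance_of_split(genotypes, labels, threshold=1):
    """Impurity decrease from splitting samples on one variant's genotype."""
    left = [l for g, l in zip(genotypes, labels) if g < threshold]
    right = [l for g, l in zip(genotypes, labels) if g >= threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# A variant that perfectly separates cases (1) from controls (0)...
print(importance_of_split([0, 0, 2, 2], [0, 0, 1, 1]))  # 0.5
# ...versus an uninformative variant
print(importance_of_split([0, 2, 0, 2], [0, 0, 1, 1]))  # 0.0
```

Averaged over a forest, scores like these rank variants by how strongly they help predict the phenotype, including through interactions, which is what makes the approach attractive for complex traits.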
The Australian e-Health Research Centre, CSIRO's digital health research branch, created VariantSpark to conduct association studies of complex phenotypes with datasets on the scale of whole human genomes, following RF principles. It is integrated with the Broad Institute's Hail open-source genomic analysis software and fully compatible with VCF files.
The researchers used the tool with datasets of different sizes to assess the capabilities of several machine-learning applications, starting with real genotypes from the 1000 Genomes Project, as well as with a simulated phenotype created by open-source software called the Polygenic EpiStatic Phenotype Simulation (PEPS).
Even with smaller datasets, they demonstrated that VariantSpark is 3.6 times faster than the fastest alternative the researchers were aware of, a tool called ReForeSt that was presented by Italian researchers at the 26th International Conference on Artificial Neural Networks in 2017. They also found VariantSpark to be as much as three times faster than MLlib and the "only method able to scale to ultra-high-dimensional genomic data in a manageable time," they wrote.
VariantSpark and ReForeSt showed similar performance at 1.6 million variants. "After this point, the runtime of ReForeSt increases exponentially while VariantSpark increases sub-linearly," the CSIRO authors wrote.
According to the researchers, polygenic risk scores usually only look at additive effects of individual genes, ignoring epistatic and other interactions. "While evidence of epistatic interactions is found in small datasets, large datasets have not been processed yet owing to the high computational complexity of the search for epistatic interactions," they wrote.
They surmised that there have not been studies on polygenic-epistatic phenotypes to date because GWAS methods are not powerful enough to manage such complex computations. "VariantSpark is the first methodology to perform complex association analyses on whole-genome sequencing experiments and outperforms other state-of-the-art implementations," they wrote.
However, VariantSpark is meant to complement rather than replace genome-wide association studies, they added. "The results of traditional GWAS … and VariantSpark should be considered together to gain insights into the full influence of the genome on disease and other phenotypes. Similarly, VariantSpark's output may be usable to prioritize variants in [polygenic risk scores] to reduce noise levels," they wrote.
The runtime analysis of VariantSpark started with 1,000 samples and 10,000 variants, and proceeded to test sets of 100,000 samples and 10 million variants and 10,000 samples and 100 million variants.
A dataset of 100 million variants and 10,000 samples processed in typical ways would require 1 terabyte of computer memory, a quantity likely not available on a standard HPC system. This kind of processing calls for a computing cluster, such as those found in large cloud environments, according to the authors.
For their study, the researchers ran tests on AWS's Elastic Compute Cloud (EC2) for HPC computing and Elastic MapReduce (EMR) for cluster computing.
CSIRO found that VariantSpark outperformed standard logistic regression for epistatic phenotypes, in which interactions between variants are involved, though logistic regression had the edge on datasets with variants directly associated with the variables used to simulate the phenotypes. "This gain over VariantSpark is likely due to the need to tune hyper-parameter choices for each dataset, which has resulted in non-optimal performance in these instances," according to the paper.
However, VariantSpark proved to be far faster than both ReForeSt and another distributed-computing implementation called Ranger, based on runtimes of all three on synthetic datasets with 10,000 samples and an increasing number of variants — 100, 6.5 million, and 10 million. Only VariantSpark and ReForeSt were even able to process the two largest datasets.
The set with 10 million variants and 10,000 samples includes 100 billion genotypes, which the authors said can be loaded into 100 GB of memory. VariantSpark was able to process that dataset with a peak memory usage of 120 GB.
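These memory figures follow from a back-of-envelope calculation, assuming one byte per genotype call (the values 0, 1, 2, and a missing code fit comfortably in a byte). The sketch below reproduces both numbers cited in the article.

```python
# Back-of-envelope check of the memory figures cited in the paper,
# assuming one byte per genotype call.

GB = 10**9
TB = 10**12

def genotype_bytes(n_variants, n_samples):
    """Raw size of a genotype matrix at one byte per call."""
    return n_variants * n_samples

# 100 million variants x 10,000 samples -> ~1 TB
print(genotype_bytes(100_000_000, 10_000) / TB)  # 1.0

# 10 million variants x 10,000 samples -> ~100 GB of raw genotypes,
# against VariantSpark's reported peak usage of 120 GB
print(genotype_bytes(10_000_000, 10_000) / GB)   # 100.0
```

The 20 percent gap between the 100 GB of raw genotypes and VariantSpark's 120 GB peak reflects the working overhead of the computation itself.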
According to Bauer and colleagues, variant interactions also "remain invisible to traditional GWAS and subsequently to PRS methods." Also, until very recently, there have not been suitable algorithms or computing power to analyze whole-genome data for epistatic interactions.
Bauer said that VariantSpark has gone through several iterations since it was created about five years ago, and has evolved as datasets have grown and matured. CSIRO now is trying to anticipate the types of datasets that will exist in the future.
"These will be UK Biobank-style datasets where consortia could have hundreds of thousands of individuals with whole-genome sequencing [data]," Bauer said. VariantSpark needs to scale to that size while also being able to manage the complexity associated with disease genes.
In that regard, VariantSpark is not alone. Bauer noted that the Hail library for scalable genomic research has been built on the Spark framework with future datasets in mind. However, she said, Hail follows standard GWAS logistic regression, which may not be able to look at multiple mutations simultaneously in the exploration of a disease, and thus is not ideal for polygenic risk scoring.
According to CSIRO, technology like VariantSpark is absolutely necessary to advancing the field of genomic medicine. "Given that there is statistical proof for the existence of both polygenic phenotype and epistatic phenotype, there is a likelihood of a complex phenotype to exist — a phenotype that is driven by several variants individually as well as several sets of interactive variants," the authors wrote.
VariantSpark has so far been applied to research into genetic causes of cardiovascular disease, dementia, Alzheimer's disease, and amyotrophic lateral sclerosis (ALS). Bauer said she is looking for international collaborations with others who have large datasets and suspect that there may be a complex polygenic effect in play.
She said the notion that machine learning can supplement statistical methods that have been built up over close to 30 years is "a message that takes a while to be accepted." She anticipated that acceptance and trust will come in "baby steps." Few organizations have datasets large enough for VariantSpark to be a feasible option, she added.
"We've been struggling in finding partners that have those kind of dataset sizes today," she said. "Tomorrow, probably everyone will have that." She called the UK Biobank "our dream dataset," containing hundreds of thousands of people, "but at this stage, we are happy with thousands."
Bauer said that VariantSpark might also be useful in COVID-19 research, perhaps to identify variants in the viral or human genome that could predict disease outcomes. "We can scale to pandemic scale," she said. "If the world really comes together and puts [all their COVID-19 research data] into one pot, VariantSpark could conceivably analyze it. It will be expensive, but it is feasible."
However, because COVID-19 is caused by a novel coronavirus, the data is not mature. While the D614G mutation in the viral genome has been shown to make it more virulent than other forms of SARS-CoV-2, for example, Bauer said that there needs to be more research into the clinical behavior of the virus. Additionally, the viral genome has just 30,000 base pairs, compared to 3 billion in the human genome, but the research sample pool is small, still with just 100,000 samples or so. "Going forward, I can easily see this going to millions. There will be millions of samples times 30,000 bases in the genome to look at," Bauer said. "It will be a large dataset and VariantSpark will be good for that."