CHICAGO – Computer scientists at the University of Washington have created software that imputes missing epigenomic data by arranging the data along three dimensions, representing cell types, assay types, and genomic loci, and then decomposing the resulting tensor.
Called Avocado, the open-source software leverages the Encyclopedia of DNA Elements (ENCODE) consortium's database, which collects information about the human genome, epigenome, and transcriptome. ENCODE members perform analyses in many different types of cells and tissues, producing genome-wide measurements.
Avocado advances earlier work from the UW laboratory of William Stafford Noble, a computational model called Parallel Epigenomics Data Imputation With Cloud-based Tensor Decomposition (PREDICTD), published in Nature Communications in 2018. That software generalized matrix decomposition from two dimensions to three, using tensor decomposition to impute multiple experiments at the same time.
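In outline, tensor decomposition of this kind models each measured value as a combination of latent factors, one set per axis, fit only to the entries that were actually observed. The following NumPy sketch shows a CP-style factorization trained by stochastic gradient descent; the factor sizes, learning rate, and training loop are illustrative assumptions, not PREDICTD's actual configuration.

```python
import numpy as np

# Toy dimensions: cell types x assays x genomic positions,
# with K latent factors per axis (all sizes illustrative)
C, A, G, K = 10, 5, 1000, 8

rng = np.random.default_rng(0)
cell = rng.normal(scale=0.1, size=(C, K))    # latent cell-type factors
assay = rng.normal(scale=0.1, size=(A, K))   # latent assay factors
pos = rng.normal(scale=0.1, size=(G, K))     # latent genomic-position factors

def sgd_step(observed, lr=0.01):
    """observed: iterable of (cell, assay, position, signal) tuples,
    i.e. only the entries of the tensor that were actually measured."""
    for c, a, g, y in observed:
        u, v, w = cell[c].copy(), assay[a].copy(), pos[g].copy()
        err = np.sum(u * v * w) - y          # CP model: sum_k u_k * v_k * w_k
        cell[c] -= lr * err * v * w
        assay[a] -= lr * err * u * w
        pos[g] -= lr * err * u * v

def impute(c, a, g):
    # once trained, any missing (cell, assay, position) entry can be predicted
    return float(np.sum(cell[c] * assay[a] * pos[g]))
```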
PREDICTD itself, according to Noble, was a step forward from the only previous epigenomic imputation method, called ChromImpute. The latter, described in a 2015 paper in Nature Biotechnology, produced genome-wide predictions of epigenomic signal tracks across 16 ENCODE and 111 National Institutes of Health Roadmap Epigenomics reference epigenomes.
"We show that PREDICTD data captures enhancer activity at noncoding human accelerated regions. PREDICTD provides reference imputed data and open-source software for investigating new cell types, and demonstrates the utility of tensor decomposition and cloud computing, both promising technologies for bioinformatics," Noble and colleagues wrote in their 2018 Nature Communications article.
Noble, a senior data science fellow at the UW eScience Institute and a professor of genome sciences, discussed Avocado during a keynote address at the combined Intelligent Systems for Molecular Biology and European Conference on Computational Biology (ISMB/ECCB) conference in Basel, Switzerland, put on late last month by the International Society for Computational Biology.
"One of the challenges that we face is that doing all of these experiments is simply not feasible in terms of the compute time, in terms of the expense, in terms of the sequencing, and also just the human effort involved," Noble said in his keynote. There are many missing data points.
"You have components in the latent space corresponding to genomic positions, a separate latent space for the assay factors, and a latent space for cell-type factors. Those help make predictions for what would happen when running a particular assay on a particular biosample at a particular genomic location," Joseph Schreiber, a graduate student in computer science in Noble's lab and the chief programmer of Avocado, explained in an interview following Noble's keynote.
"I think this is a fairly straightforward application of neural networks. It's mostly that other people hadn't done it before," Schreiber said.
ChromImpute models each experiment individually. "But if you're trying to do it with everything individually, it's going to be difficult to model everything," Schreiber said.
PREDICTD models everything at the same time; Avocado keeps the tensor factorization but adds a neural network on top of it.
"The use of deep-tensor factorization in general isn't a novel one, but the idea of applying it to this type of data is," Schreiber said.
The Avocado name comes from Schreiber's preference for naming software after fruit, and Noble said that his lab was actually motivated by a Netflix challenge, in which the streaming video service challenged software developers to solve an imputation problem using matrix methods and data it made available about viewers and their viewing preferences. The idea was to predict how much a given viewer would like a particular movie or TV show.
"Methods that worked really well in the Netflix challenge were based on matrix decomposition, where you decompose each show or movie into a link representation and each user also has their own link representation," Noble explained. An algorithm would attempt to fill in "latent spaces" based on the answers to various questions, such as whether the viewer likes to think or whether the show has a particular actor.
Similarly, Schreiber built Avocado to capture nonlinear relationships among different factors. A forthcoming paper, posted as a preprint on bioRxiv, seeks to validate Avocado predictions made using data from the Roadmap Epigenomics Consortium.
"We were particularly interested in this paper in understanding which of these predictions are easy and which ones are hard to make," Noble said in an interview in Basel shortly after he accepted the 2019 ISCB Innovator Award.
Noble called Avocado "the state of the art" in terms of making accurate predictions of transcription factor binding for use in research, thanks to the deep tensor factorization built into the software. "That has the benefit that it can capture nonlinear relationships," he said.
For all 47 cell types studied, Avocado outperformed approaches that use the ENCODE data directly, according to the prepublication research.
The developers also benchmarked Avocado on the task of predicting promoter-enhancer interactions, using data from previous research and fixing problems in earlier benchmarks. Avocado performed better the majority of the time, the research showed.
Noble said that his team also cares about whether the latent representations themselves are useful. About 10,000 different measurements go into training the Avocado-ENCODE model, Noble explained. "Maybe the latent space that's been learned there can be used for other tasks," he said.
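One way such reuse could look, sketched here with synthetic stand-ins for both the exported factors and the downstream annotation task: treat each genomic position's latent vector as a ready-made feature vector for a separate classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for latent genomic-position factors exported from a trained
# model: one row per genomic position (random here, purely for shape).
position_factors = rng.normal(size=(5000, 32))
# Hypothetical binary annotation per position, e.g. promoter vs. not.
labels = rng.integers(0, 2, size=5000)

# The latent vectors serve as off-the-shelf features for a new task.
clf = LogisticRegression(max_iter=1000).fit(position_factors, labels)
scores = clf.predict_proba(position_factors)[:, 1]
```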
Schreiber has since expanded Avocado to include the entire ENCODE dataset; the original paper used the same dataset as the ChromImpute and PREDICTD papers. The expanded dataset includes 400 different cell types rather than 127, and 84 assays rather than 24.
"In particular, we have expanded to include transcription factor binding, which the other models weren't attempting to do. It's a much more challenging task in general," Noble said.
The technology does have limitations, though.
An audience member asked Noble whether Avocado could predict perturbations related to the effect of treatments. He said that Avocado is filling in a tensor, so the software cannot be used to predict an entirely new layer of the tensor. However, if Avocado had some measures in the perturbed state, it could predict other measures.
Answering another question, Noble said that it took 15 hours to run an analysis that fills in the gaps by searching for "nearest neighbors" along the three axes; the UW team's refined algorithm looks for the 1,000 nearest neighbors. "Clearly, if we scale up to a much larger compendium, we are not going to do 1,000. We might have to do 10," Noble said.
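One plausible reading of that baseline, sketched below purely for illustration: measure similarity between experiments over the positions both have observed, then fill each gap with the average of the k most similar experiments. The distance metric and averaging scheme here are assumptions, not the UW team's actual algorithm.

```python
import numpy as np

def knn_impute(tracks, target_mask, k=1000):
    """tracks: array (n_experiments, n_positions) with NaN where unobserved.
    Fills the positions flagged in target_mask for experiment 0 by averaging
    the k most similar observed experiments."""
    target = tracks[0]
    dists = []
    for other in tracks[1:]:
        # similarity computed only over positions both experiments observed
        shared = ~np.isnan(target) & ~np.isnan(other)
        d = np.mean((target[shared] - other[shared]) ** 2) if shared.any() else np.inf
        dists.append(d)
    order = np.argsort(dists)[:k] + 1        # indices of the k nearest experiments
    neighbors = tracks[order]
    # impute each missing position as the mean of neighbors observed there
    filled = target.copy()
    filled[target_mask] = np.nanmean(neighbors[:, target_mask], axis=0)
    return filled
```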
The researchers now are uploading more than 30,000 experiments from Avocado predictions to encodeproject.org, and the data soon will be freely available to researchers. That page currently has 3,048 imputed experiments, according to Noble.
Schreiber said that Avocado would not have been possible without the University of Washington's cluster of high-performance graphics processing units. Although the Seattle school is in the backyard of cloud heavyweights Microsoft and Amazon, Avocado is using in-house computing resources.
As the full name suggests, PREDICTD previously ran entirely in the cloud, thanks to a series of credits UW had for both the Microsoft Azure and Amazon Web Services clouds. Jumping back and forth between cloud platforms, however, proved too expensive and complicated to manage.
"We use the cloud very successfully for things like a web server, where you could just put it in the cloud then it can scale gracefully if more users come. For stuff like what Jacob's doing, it's really hard to predict usage for this kind of exploratory analysis," Noble said, so his lab switched to in-house high-performance computing resources for Avocado.
The current release of Avocado is for bulk data only, analyzing whole-genome datasets and some cancer cell lines from ENCODE. However, Schreiber is working on a single-cell variant of the software.
Schreiber noted that while it takes a lot of data to train the Avocado algorithms, once they are trained, the models can be widely used without others needing to build and train their own from the source code.
"We are going to be talking with ENCODE about getting it set up with them so that as more data comes in, then it would scale," Schreiber said. "While the code is available for anyone to use if they'd like to, we don't really anticipate that its largest use will be people training their own, but rather taking these trained models which condense this information into a much more reasonable subset."