This story has been updated from a previous version with additional comments addressing critiques of the technology. The name of the conference has been corrected as well.
CHICAGO – Computational biologists and data scientists at the University of Washington-Tacoma have developed a methodology using "serverless" cloud computing that accelerates alignment of human RNA sequencing data by more than a thousandfold. This technology, they said, slashes the alignment time for a 640-million-read dataset from 19 hours to just one minute.
These researchers described their work in a preprint posted to bioRxiv and presented key findings at the virtual Bio-IT World conference in October.
Computer scientist Ka Yee Yeung, who presented the research at Bio-IT World, said that the acceleration came from a combination of techniques, including a new algorithm, a parallel computing structure, and, most importantly, the "serverless" cloud configuration through a strategy known as function-as-a-service computing (FaaS).
"In this case, serverless [means] that we don't set up the servers ourselves," Yeung explained. "The cloud providers would configure and set up the servers for us. All we do is we upload the code."
This technique had not been used for RNA-seq analysis before, according to Yeung, a professor in UW-Tacoma's School of Engineering and Technology with an adjunct appointment in microbiology. She is on sabbatical this academic year as she works to commercialize the technology through a new company called Biodepot.
Her academic laboratory designs computational workflows for next-generation sequencing data, with a particular focus on RNA-seq data. "And there are a lot of RNA-seq data out there, so we believe the impact will be substantial," Yeung said.
Biodepot has received a National Cancer Institute Small Business Innovation Research Phase I contract worth $246,378 to help with data generation for the Cancer Research Data Commons.
"There was nothing wrong with previous methods," Yeung said, but the FaaS paradigm she and her UW-Tacoma colleagues adopted is exponentially faster.
The approach provides on-demand access to single-purpose applications, eliminating the need to configure virtual servers on cloud platforms. However, according to the preprint, serverless functions are limited in memory, disk space, execution time, and network bandwidth. The UW-Tacoma researchers overcame those limits by developing an algorithm that creates an on-demand virtual supercomputer for RNA-seq processing.
"Unlike approaches designed for processing large collections of samples, our strategy supports accelerating the alignment of a single dataset on Amazon Web Services (AWS) or the Google Cloud Platform," the paper said.
Yeung and her team simply had to optimize their method to address the constraints of the technology by creating what they called parallelized sequence alignment.
"The idea of our framework is that we increase concurrency," she said.
That means that the methodology takes large sequencing files and breaks them into smaller pieces to accommodate the memory limitations of each so-called "serverless instance." Each instance then aligns a small chunk of the RNA-seq data, which the algorithm collects and pieces back together into a full sequence.
In this case, Biodepot breaks each FASTQ file into 1,752 "data shards" of about 60 megabytes each, a process that Yeung labeled "split and multiplexing." Following alignment of the small chunks is the "merge" step, where intermediate results are collected and reassembled.
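The split, align, and merge pattern described above can be illustrated with a small Python sketch. It is not the Biodepot implementation: the reference sequences, function names, and shard size are hypothetical. A thread pool stands in for concurrent serverless invocations, and an exact substring match stands in for a real aligner such as BWA.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy reference: transcript name -> sequence.
REFERENCE = {
    "tx_A": "ACGTACGTGG",
    "tx_B": "TTGACCAGTA",
}


def split_into_shards(reads, shard_size):
    """Split a list of reads into consecutive shards of at most shard_size reads."""
    return [reads[i:i + shard_size] for i in range(0, len(reads), shard_size)]


def align_shard(shard):
    """Stand-in for one serverless invocation: count reads per transcript.

    A real worker would run an aligner; here a read "aligns" to the first
    transcript that contains it as an exact substring.
    """
    counts = Counter()
    for read in shard:
        for name, sequence in REFERENCE.items():
            if read in sequence:
                counts[name] += 1
                break
    return counts


def merge_counts(partial_counts):
    """Merge step: sum the per-shard transcript counts into one result."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def count_transcripts(reads, shard_size=2, max_workers=4):
    """Split the reads, process the shards concurrently, then merge."""
    shards = split_into_shards(reads, shard_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_counts = list(pool.map(align_shard, shards))
    return merge_counts(partial_counts)


if __name__ == "__main__":
    reads = ["ACGT", "GACC", "CGTGG", "AGTA", "NNNN"]
    print(count_transcripts(reads))
```

Because each shard is processed independently, the merged counts do not depend on the shard size. That independence is what allows the work to be spread across many short-lived, memory-limited workers.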
In the preprint, after optimizing their processes, the researchers obtained benchmark transcript counts in just six minutes at a cost of $3.85 to rent AWS cloud space. It took less than two minutes to repeat the workflow process because they did not have to generate and transfer shards in subsequent runs, they wrote.
Still, the 1,100-fold increase in speed for aligning RNA-seq reads to the human transcriptome via the Burrows-Wheeler Aligner (BWA) did not include time spent on steps before and after the alignment, which opens Biodepot to some criticism.
Oliver Hampton, VP of informatics and biostatistics at oncology-focused bioinformatics company M2Gen, said that the preprint manuscript did not adequately calculate computation time to split or reassemble reads, nor did it account for some computational jobs being placed in a queue when being submitted to the cloud. "The authors' claim is based on theoretical or ideal/assumed conditions," he said via email.
Yeung responded to this by saying that the six minutes to obtain benchmark transcript counts includes the time to split and merge files and also accounts for cloud queues. The actual alignment step needs just one minute of runtime in the serverless configuration, she said.
Hampton also suggested that the UW-Tacoma idea is not so groundbreaking. "We at M2Gen routinely split FASTQ into 'shards' to increase computation time, so unfortunately this is not such a novel idea," he said.
Yeung said that the key innovation on this front in Biodepot is "how we circumvented the technical limitations of serverless functions to enable efficient processing of big data." Breaking up FASTQ files into shards is one way of optimizing processing of RNA-seq data by reducing the overall file sizes of the reference and data. Yeung and colleagues detailed other optimization methods in their preprint manuscript.
According to Hampton, the only "items of interest" in the manuscript were the fact that the UW-Tacoma team used unique molecular identifier barcodes during the RNA-seq process and that the researchers converted the cloud input and output to a 64-bit hash to decrease the size of files for transfer.
Hampton said that it is likely that the down-conversion would result in the loss of some information, but Yeung said that is not the case with the UW-Tacoma method.
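The disagreement over information loss can be made concrete with a generic sketch. The article does not say which hash function the team uses or exactly what is hashed, so everything here is an assumption: BLAKE2b with an 8-byte digest serves as the 64-bit hash, and the identifiers are made up. The sketch shows how a long string can be replaced by a fixed 8-byte value, and how a lookup table kept alongside the hashes lets the original strings be recovered and any collision detected.

```python
import hashlib


def hash64(key: str) -> int:
    """Map a string to a 64-bit unsigned integer (BLAKE2b, 8-byte digest)."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_index(keys):
    """Return {hash: key}, raising if two distinct keys share a hash.

    The hash alone is lossy; the index is what makes the mapping reversible.
    """
    index = {}
    for key in keys:
        h = hash64(key)
        if h in index and index[h] != key:
            raise ValueError(f"64-bit collision between {index[h]!r} and {key!r}")
        index[h] = key
    return index


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    names = ["transcript_0001|gene_X", "transcript_0002|gene_Y", "transcript_0003|gene_Z"]
    index = build_index(names)
    for name in names:
        h = hash64(name)
        print(f"{name} -> {h:016x} -> {index[h]}")
```

In this sketch, nothing is lost as long as the index is kept and no two keys collide. Without the index, the original strings cannot be reconstructed from the hashes alone.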
Yu Qiu, a senior principal scientist specializing in next-generation sequencing bioinformatics at GeneWiz, a genomics service provider owned by Brooks Automation, noted that the Biodepot processes still rely on the Burrows-Wheeler Aligner and that the modifications made to reduce the size of reference and output files fit within the framework of AWS "but may not be applicable" to other cloud platforms.
"It is an interesting setup to make things to run faster, but [the] underlying tools are still the same and the setup has its own limitations," Qiu said in an email.
Yeung said that the method has been tested on Google Cloud using her lab's interactive pipeline on GitHub, but the process was slower; it took about 21 minutes to get benchmark transcript counts.
Yeung called Biodepot a work in progress that builds on several previously published papers from her team, including one on parallelized sequence alignment, presented at the 2019 International Workshop on Parallel and Cloud-based Bioinformatics and Biomedicine.
"That was more like a proof of concept for a smaller dataset and easier application," Yeung said. "At some point we realized that we could use this serverless technology for bigger datasets, for higher-impact applications."
During her Bio-IT World presentation, Yeung emphasized the need for a user-friendly graphical interface on the front end. This was an outgrowth of another 2019 paper, this one in Cell Systems, about creating workflows in a piece of software that her lab developed called Biodepot-Workflow-Builder (BWB).
"This is the accessible front end that we developed that allows users to create, edit, and reproducibly deploy bioinformatics workflows," Yeung explained. "Now that we have this amazing technology, we want to make it more easily accessible, more usable, so we can immediately make a biological and clinical impact."
She said that Biodepot-Workflow-Builder is aimed at biomedical researchers who do not know how to write software code.
Both BWB and the RNA-seq analysis tool follow the National Institutes of Health's "FAIR" principles of research data being findable, accessible, interoperable, and reusable. For example, the front-end interface can be extended to new applications, which is exactly what Yeung's lab did with the serverless technology.
However, Yeung said that several technical limitations need to be overcome before the technology can be extended to larger datasets such as whole exomes and whole genomes. "We are really hoping that our technology can transform the way that RNA sequencing data analysis is being done right now," she said.
Yeung said that one of the goals of Biodepot LLC is to encourage better collaboration between computational and biomedical scientists. "Biomedical scientists, the people who generate the data, should have more insight to actually change the analysis, maybe reset the parameters, and generate new visualizations to … interpret the data," she said.
That is a major reason why a user-friendly interface is so important, according to Yeung.
Yeung and colleagues also are working on supplementary documentation, including videos, before submitting their research for peer review. "Our immediate next step is to try to get the paper published," she said.