CHICAGO – Paradigm4 is continuing its move into multiomic translational informatics with the recent release of Reveal: SingleCell, an application to help biopharmaceutical companies manage and analyze large sets of single-cell data.
Built on top of the company's SciDB database management platform, Reveal: SingleCell supports multiomic analysis and association. Waltham, Massachusetts-based Paradigm4 said that users can choose cells across multiple studies to evaluate tissue distribution, look for variance in response to treatment, and compare normal to diseased cells.
Reveal: SingleCell is an extension of Reveal, a translational informatics platform released in 2018. Reveal works with the company's core SciDB database management system for life sciences, which powered the browser for the National Institutes of Health's 1000 Genomes Project.
Reveal also takes advantage of work done by Stanford biomedical data scientist Manuel Rivas, who built a public browser called the Global Biobank Engine on SciDB to help researchers conduct genome-wide association studies using the UK Biobank database.
Zachary Pitluk, head of business development for life sciences and healthcare at Paradigm4, said that the technology is, at its core, "fundamentally different" from other database software.
Pitluk described Reveal as a stack, with application layers on top. Those apps feature application programming interfaces (APIs) and graphical user interfaces (GUIs) so, as Pitluk put it, "non-cognoscente can query data." Underneath the app is SciDB, which runs in the cloud.
Researchers interact with the application layer.
"We don't want them to have to worry about how big the compute is for the question that they're asking," said Paradigm4 Cofounder and CEO Marilyn Matz. "We're helping them focus on their questions at the Reveal app layer and hiding all of the complexity of the compute and the data storage and the data wrangling so that they can get their questions answered."
She said that some researchers Paradigm4 has worked with had to resort to sampling data because the old software they were working with could not scale to the size of their datasets. Reveal is able to scale not only because it runs on the Amazon Web Services cloud, but because of the structure of SciDB.
"We store data in the logical state," Pitluk said. "We organize it into matrices, store it as matrices," which is what scientists are accustomed to looking at. Rather than having to reassemble a matrix view each time a database is queried, SciDB stores information that way, allowing for highly efficient searches.
SciDB started as an organizational tool, but with input from customers including Roche, Paradigm4 developed the Reveal software on top of it to perform analysis within the structured databases.
"You can do linear algebra. You can do things like normalizations and looking at … the genes that have the highest variance across the datasets," Pitluk said.
SciDB also has evolved to include machine learning.
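The normalization and variance ranking that Pitluk mentioned can be illustrated with a minimal sketch. This is plain Python on a made-up cells-by-genes matrix, not the SciDB or Reveal API; the gene names, the counts, and the functions `normalize_cells` and `top_variable_genes` are all assumptions introduced for illustration.

```python
from statistics import pvariance


def normalize_cells(counts, target_sum=10.0):
    """Scale each cell (row) so that its counts sum to target_sum."""
    normalized = []
    for row in counts:
        total = sum(row)
        # Leave all-zero cells as zeros rather than dividing by zero.
        normalized.append([v * target_sum / total if total else 0.0 for v in row])
    return normalized


def top_variable_genes(matrix, genes, k):
    """Return the k gene names whose columns vary most across cells."""
    columns = list(zip(*matrix))  # transpose: one tuple per gene
    variances = {gene: pvariance(col) for gene, col in zip(genes, columns)}
    return sorted(genes, key=lambda g: variances[g], reverse=True)[:k]


# Hypothetical toy data: rows are cells, columns are genes.
genes = ["GENE_A", "GENE_B", "GENE_C"]
counts = [
    [0, 2, 8],
    [5, 0, 5],
    [1, 1, 8],
    [10, 0, 10],
]

normalized = normalize_cells(counts)
print(top_variable_genes(normalized, genes, k=2))
```

Because the data is already a matrix, both steps are simple row and column operations, which is the kind of work an array-oriented store is designed to run in place.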
Pitluk said that the idea for Reveal: SingleCell came from another pharmaceutical giant, namely Bristol Myers Squibb.
"We are providing a solution that addresses a challenge that everybody in single cell has, which is: How do I compare? How do I search across studies?" according to Pitluk, who holds a PhD in biochemistry and was biochemistry research faculty at Yale University before joining Paradigm4.
He called this fundamental to the ability to scale compute power to handle the amount of information represented in single-cell sequences.
"Instead of just having this massive collection of files … you want the data to be cleanly organized," he said.
The approach also is much faster than others. "You're taking things that would take normally hours to maybe days and boiling it down to something that happens in 15 to 30 seconds," he said.
This speed helps data analysts focus on their essential functions rather than spending time organizing data, according to Pitluk.
Paradigm4 scientists, working with collaborators from BMS, prepared an article, now posted to preprint server bioRxiv, describing their use of Reveal: SingleCell to return results within 60 seconds of queries of SARS-CoV-2 RNA sequences from a database of 2.2 million cells. "We highlighted [that] cells expressing COVID-19 associated genes are expressed on multiple tissue types, thus in part [explaining] the multi-organ involvement in infected patients observed worldwide during the on-going COVID-19 pandemic," they wrote.
By aligning the single-cell RNA sequences to the GRCh38 reference genome with data from 32 projects in the Human Cell Atlas, the Census of Immune Cells, and most of the COVID-19 Cell Atlas, Paradigm4 and the pharmaceutical company were able to keep the entire dataset smaller than 1 terabyte.
"There are distinct benefits to having [single-cell RNA-seq] data organized as arrays in a database, such as allowing cross-study selection of cells by gene expression thresholds or metadata tags and analysis by multiple users, while ensuring the consistency from a shared version of QA'd data and workflows," according to the preprint.
Paradigm4 said that SciDB gives Reveal: SingleCell "future-readiness," which the company defined as the ability to build a data commons with genomic, proteomic, imaging, and metabolomic information. The commons approach helps remove limitations of data silos, which the researchers said promotes cross-study analysis and allows the scaling of computational power.
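The cross-study selection described in the preprint, choosing cells by a gene expression threshold or by metadata tags, can be sketched in a few lines. This is a toy illustration in plain Python, not Reveal: SingleCell's interface; the `Cell` record, the `select_cells` function, the study and gene names, and the values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    cell_id: str
    study: str
    tissue: str
    expression: dict = field(default_factory=dict)  # gene name -> expression value


def select_cells(cells, gene=None, min_expression=0.0, **tags):
    """Select cells by metadata tags and, optionally, an expression threshold."""
    selected = []
    for cell in cells:
        # Every requested metadata tag must match exactly.
        if any(getattr(cell, name) != value for name, value in tags.items()):
            continue
        # Genes absent from a cell's record are treated as zero expression.
        if gene is not None and cell.expression.get(gene, 0.0) < min_expression:
            continue
        selected.append(cell)
    return selected


# Hypothetical cells pooled from two studies into one collection.
cells = [
    Cell("c1", "study_1", "lung", {"GENE_A": 3.2}),
    Cell("c2", "study_1", "pancreas", {"GENE_A": 0.1}),
    Cell("c3", "study_2", "lung", {"GENE_A": 1.5, "GENE_B": 4.0}),
    Cell("c4", "study_2", "pancreas", {"GENE_B": 2.0}),
]

high_gene_a = select_cells(cells, gene="GENE_A", min_expression=1.0)
pancreas = select_cells(cells, tissue="pancreas")
print([c.cell_id for c in high_gene_a])
print([c.cell_id for c in pancreas])
```

The point of the sketch is that one query spans both studies because the cells live in a single shared collection rather than in separate per-study files.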
Pitluk said that the paper has been submitted to BMC Genomics.
The preprint was a proof of concept, meant to demonstrate that Reveal: SingleCell could organize data in the COVID-19 Cell Atlas and other related SARS-CoV-2 genome data for rapid searching.
"It was really just to show that instead of opening 33 datasets individually and trying to pull out this piece of information, when it's organized into SciDB, then the Reveal app will allow you to look across patients and across samples and come up with that information quickly," Pitluk said.
The earlier version of Reveal that mined the UK Biobank has a pandemic-related application as well, added Matz, since the biobank is being updated every two weeks with new COVID-19 morbidity and testing data.
"Some of our customers are using the COVID add-on data to the UK Biobank [with Reveal] to try to tease out the phenotypic correlations that you've been reading about, like people that have certain sets of underlying conditions," Matz said.
She said that one customer is working on an actual COVID-19 treatment rather than a vaccine, using the COVID-19 data in the UK Biobank as one research source.
Matz said that Paradigm4's technology stands out for three reasons: the multidimensional-array database; an integrated compute engine that allows for scaling up to tasks such as GWAS; and the availability of apps rather than less-developed toolkits.
Matz likened SciDB to an operating system and apps like Reveal to smartphone apps. "If you're a scientist, you want to just load the app and use it," she said. Researchers would rather not be handed a software development kit and be told to build their own tools.
"We work with customers to identify a set of repeatable apps. We know that the single-cell app is a challenge that many, many companies are facing now as the number of experiments and the number of patients in each experiment grows dramatically," Matz said.
Pitluk talked of a dichotomy in bioinformatics. Genomics requires datasets involving tens or hundreds of thousands of patients to extract insights about the genetic drivers of diseases, but single-cell research currently relies on very small cohorts; the largest he has seen has data on just 17 people.
He predicted that demand for single-cell data is going to explode in the near future, and that will require sufficient computing power and scalability.
"You're going to be in a situation in a year or two where studies will have hundreds of thousands of patients in them, and then all of a sudden the data demand is going to be that much greater and much more challenging for people just to look across the studies," Pitluk said. "For instance, tell me about the pancreas. I don't care about opening 500 files to figure out what I need to know."
Matz said that current popular bioinformatics tools will not be able to process data on that scale that quickly.
"[BMS] came to us because they were stuck. They needed some way that wasn't going to require them to hire an army of informaticians in order to do simple things," Pitluk said.
Pitluk noted that informaticians like to talk about "ETL," the challenge of extracting, transforming, and loading data when processing queries.
"Every time you open a file, you extract it, you transform the data in it, and you load it into something like R or Python," Pitluk said. Paradigm4 saves the transformed and loaded data into SciDB, so analysts have a structured dataset to work from.
Reveal and derivatives including Reveal: SingleCell are meant to give bioinformaticians and other scientists a scalable platform for organizing multitudes of omics, behavioral, clinical, health outcomes, and environmental data, including data from wearable devices.
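Pitluk's contrast between repeating ETL on every file and querying data that has already been loaded can be sketched as follows. Python's built-in `sqlite3` module stands in here for a persistent structured store; it is an assumption for illustration only and says nothing about how SciDB stores or queries data. The file contents, table layout, and function names are likewise hypothetical.

```python
import csv
import io
import sqlite3

# Made-up per-study files, held as strings in place of files on disk.
STUDY_FILES = {
    "study_1": "cell_id,gene,value\nc1,GENE_A,5\nc1,GENE_B,0\nc2,GENE_A,1\n",
    "study_2": "cell_id,gene,value\nc3,GENE_A,7\nc3,GENE_B,2\n",
}


def etl_once(conn, study_files):
    """Extract, transform, and load every study a single time."""
    conn.execute(
        "CREATE TABLE expression (study TEXT, cell_id TEXT, gene TEXT, value INTEGER)"
    )
    for study, text in study_files.items():
        reader = csv.DictReader(io.StringIO(text))  # extract
        rows = [
            (study, r["cell_id"], r["gene"], int(r["value"]))  # transform
            for r in reader
        ]
        conn.executemany("INSERT INTO expression VALUES (?, ?, ?, ?)", rows)  # load
    conn.commit()


def cells_expressing(conn, gene, min_value):
    """Query across all loaded studies without reopening any file."""
    cursor = conn.execute(
        "SELECT study, cell_id FROM expression "
        "WHERE gene = ? AND value >= ? ORDER BY study, cell_id",
        (gene, min_value),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
etl_once(conn, STUDY_FILES)
hits = cells_expressing(conn, "GENE_A", 5)
print(hits)
```

After the one-time load, each new question is a query against the stored table rather than another pass through the source files.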
Paradigm4's SciDB grew out of the laboratory of Turing Award winner Michael Stonebraker at the Massachusetts Institute of Technology. Matz started the company in 2010 with Stonebraker.
"It was pretty raw academic technology," Matz recalled. That meant that Paradigm4 had to convert it to a "robust commercial product" and build a series of applications to address specific challenges that life sciences researchers were having. Reveal and Reveal: SingleCell are among those apps.
Pitluk said that Stonebraker had a "fundamental breakthrough" when he decided to store data in a different manner than earlier genomic analytics software.
Going forward, Paradigm4 expects to continue to develop new apps for the Reveal platform. One area that the company is looking at now is image processing because UK Biobank contains MRI data on tens of thousands of participants.
"We can upscale image processing for [users], make it happen very efficiently and quickly so they can take advantage of the MRIs that are in the UK Biobank, for instance," Pitluk said. "With single-cell, we're just populating more and more datasets."
"While a lot of people might be focused on the number of individual cells, we're focused on patients because we know that a patient might come in groups of 10,000 cells and we want 1,000 patients, so it's immediately going to be millions and millions of cells," Pitluk said.
Paradigm4 is also looking at normalizing data in earlier parts of research pipelines so that information can be added to queries. The company also remains bullish on wearable devices as a means to build more accurate patient phenotypes.
Matz told GenomeWeb in 2018 that the wearables component of a multimodal approach can help researchers and clinicians alike understand physical symptoms and disease progression.
"To develop an understanding of disease and to build models for precision medicine, you need this rich multimodal data. The genomics data isn't enough in and of itself," she said at the time.
That still holds true now, Matz said.
"We are able to bring together biomedical imaging, genomic data, environmental data, hospital data, and clinicians' records, so it's this ability to bring together all of this complex, diverse, multidimensional data in order to build a better model of disease and health," Matz said.