NEW YORK (GenomeWeb) – The growing popularity of sequencing applications has created a data analysis bottleneck for researchers using older tools based on the R programming language, the favored approach for conventional biostatistics.
To address the computational challenges presented by single-cell sequencing in particular, scientists at Helmholtz Zentrum München have developed a new software package called Scanpy that they hope will support major analytical efforts, such as the Human Cell Atlas.
They described the tool in a new paper in Genome Biology.
"I think we are undergoing a shift in the scale of data that is produced in transcriptomics," said Alex Wolf, lead author on the paper, and an investigator at the center's Institute of Computational Biology (ICB) based in Neuherberg, north of Munich. "The original biotechnologies produced a limited number of data points, albeit across a high number of genes," he said. "But within a few years of single-cell sequencing, we have generated more data than all the data generated using conventional technologies in the prior 20 or 30 years and this is continuing."
Wolf noted that "within just one single-cell dataset of a million cells, as published last year, more samples than in all previously existing bulk data, about 400,000 samples for humans, have been generated."
However, Wolf entered this picture with a different perspective. Given his background in machine learning and high-performance computing, he was less familiar with the conventional R-based statistical approaches that were being overwhelmed by the computational challenges associated with single-cell sequencing.
"This infrastructure, based on the R programming language and conventional biostatistics approaches, did not scale to these amounts of data," noted Wolf. "In the worlds of biostatistics and classical statistics, R is what you do and what was done in the past 15 to 20 years, but Python is the language of machine learning and high-performance computing environments."
Wolf's familiarity with Python and high-performance computing led him to create Scanpy, with a particular project, the Human Cell Atlas, in mind. That international effort, which will seek to create a catalog of roughly 100 million cells using single-cell RNA sequencing, aims to produce a first draft by 2022.
"The Human Cell Atlas wants to build an atlas for the whole human body," said Wolf. "This will require measuring millions of cells that we want to analyze, but isn't possible with the existing infrastructure of computational tools."
The tertiary analysis committee of the Human Cell Atlas has invited Scanpy for review in a few weeks. If primary analysis is the experiment itself and secondary analysis is the alignment of the reads, then tertiary analysis, where Scanpy operates, identifies structure in the large amounts of data generated.
"This will include focusing on new cell types, classes of subjects, or developmental trajectories, and then finding the genes that associate with those phenomena in the structure of the data," Wolf said. "This is what Scanpy does."
As described in the paper, Scanpy provides methods for "preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing, and simulation of gene regulatory networks." It also integrates concepts from popular R-based approaches, including graph drawing, diffusion maps, and clustering.
According to Wolf, graph-based algorithms play a core role in Scanpy. Rather than placing cells within a coordinate system inside a gene-expression space, Scanpy's algorithms rely instead on a graph-like coordinate system that characterizes cells by identifying their closest neighbors. He likened the approach to connections in social networks, and said that to identify cell types, Scanpy relies on the same algorithms as Facebook for tagging communities.
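To make the neighbor-graph idea concrete, here is a minimal sketch in plain Python. It is not Scanpy's implementation: the toy expression values, the function names `knn_graph` and `components`, and the choice of Euclidean distance are all assumptions made for illustration, and simple connected components stand in for the far more capable community-detection algorithms Wolf refers to.

```python
import math

def knn_graph(cells, k):
    """Map each cell index to the indices of its k nearest cells.

    Hypothetical helper: brute-force Euclidean distance, fine for a toy
    example but not how a tool would handle millions of cells.
    """
    graph = {}
    for i, a in enumerate(cells):
        dists = sorted(
            (math.dist(a, b), j) for j, b in enumerate(cells) if j != i
        )
        graph[i] = [j for _, j in dists[:k]]
    return graph

def components(graph):
    """Group cells that are linked through neighbor edges.

    A deliberately simple stand-in for community detection: edges are
    treated as undirected and connected components are returned.
    """
    undirected = {i: set(nbrs) for i, nbrs in graph.items()}
    for i, nbrs in graph.items():
        for j in nbrs:
            undirected[j].add(i)

    seen, groups = set(), []
    for start in undirected:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(undirected[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# Made-up expression profiles: 6 cells measured on 2 genes.
cells = [
    (1.0, 0.1), (1.1, 0.0), (0.9, 0.2),   # cells high in gene A
    (0.0, 1.0), (0.1, 1.1), (0.2, 0.9),   # cells high in gene B
]

graph = knn_graph(cells, k=2)
groups = components(graph)
print(groups)
```

In this toy case each cell's two nearest neighbors share its expression pattern, so the graph splits into two groups that play the role of cell types. Once the graph is built, the cells are described by who their neighbors are rather than by their raw coordinates in gene-expression space.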
The tool, described in Genome Biology, is currently available for anyone to use, and early adopters include researchers at the Broad Institute and at the Massachusetts Institute of Technology. Some commercial platforms for single-cell genomics, such as FastGenomics, which is offered by Comma Soft and the Life & Medical Sciences Institute at Bonn University, have decided to integrate some features from Scanpy into their offerings, Wolf noted.
While other Python-based approaches have been created for single-cell genomics, such as the European Molecular Biology Laboratory's scLVM tool for factor analysis, or the University of California, Berkeley's FastProject for data visualization, Wolf said that Scanpy is the first tool to offer all of these approaches within a single package. He also said that it should find application outside of the Human Cell Atlas.
"Not everyone will use this, but with [the increasing popularity of] 10x genomics sequencing technologies and machines, even smaller groups are setting up experiments where they produce datasets with 30,000 to 50,000 cells," said Wolf. "That's already in the range where things start to become very tedious if you do it with conventional tools."
He noted that the Human Cell Atlas's tertiary analysis working group is currently settling on a standard pipeline for processing data for the atlas. The degree to which the pipeline might rely on Scanpy and additional elements will be decided in the next few months, Wolf said.