CHICAGO (GenomeWeb) – While the Connectivity Map, a collection of gene expression data from perturbed cells, has been a useful resource for drug discovery and genome-wide association studies since its beginnings in 2006, it had not fully kept up with the times.
"The way we showed in the original Connectivity Map paper [from September 2006] is to measure mRNA gene expression profiles," noted Todd Golub, chief scientific officer and director of the cancer program at the Broad Institute. "It's very information-rich, but it's very high-cost, so therefore, it doesn't really scale to a genome scale kind of effort, even 11 years later. It's cost-prohibitive," Golub said. "What you need is a way to do it cheaper."
An update to CMap, led by Golub on behalf of the National Institutes of Health's Library of Integrated Network-Based Cellular Signatures (LINCS) Consortium, addresses this issue. Golub, a pediatrician who serves as the Charles A. Dana Investigator in Human Cancer Genetics at Dana-Farber Cancer Institute in Boston, noted that it usually is unnecessary to identify every RNA transcript in a given cell.
"Maybe we could identify a subset of the transcripts. If we measured that subset, maybe we could computationally infer the expression levels of all the transcripts we didn't actually measure. If that worked, then all we would have to do is figure out a method to develop a subset of the transcriptome and be able to do that at low cost and we would be all set," Golub said.
To test whether such "dimensionality reduction" was feasible, that is, whether most of the information in a transcriptome could be retained by measuring only a small subset of it, Broad researchers ran computational analyses on data from the publicly accessible Gene Expression Omnibus repository.
A paper published in the journal Cell last month showed that it was possible. By analyzing 1,000 "landmark" transcripts from the approximately 20,000 in the full RNA transcriptome, Golub and his research team were able to recover 82 percent of the information contained in the complete data set.
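The underlying idea, measuring a small landmark subset and inferring the rest, can be illustrated with a toy example. The sketch below is not the Broad's pipeline: it uses synthetic data, assumes that expression is driven by a handful of shared latent factors, and uses ordinary least squares as a stand-in inference model. All function names, gene counts, and noise levels are hypothetical choices for illustration. The real figures work out the same way in principle: 1,000 of roughly 20,000 transcripts means measuring about 5 percent of the transcriptome.

```python
import numpy as np

def simulate_expression(n_samples, n_genes, n_factors, noise, rng):
    """Synthetic profiles driven by a few shared latent factors (an assumption)."""
    factors = rng.normal(size=(n_samples, n_factors))
    loadings = rng.normal(size=(n_factors, n_genes))
    return factors @ loadings + noise * rng.normal(size=(n_samples, n_genes))

def fit_inference_model(train, landmark_idx, target_idx):
    """Least-squares weights mapping landmark genes (plus intercept) to target genes."""
    X = np.column_stack([np.ones(len(train)), train[:, landmark_idx]])
    weights, *_ = np.linalg.lstsq(X, train[:, target_idx], rcond=None)
    return weights

def infer_targets(profiles, landmark_idx, weights):
    """Predict unmeasured genes from the measured landmark genes."""
    X = np.column_stack([np.ones(len(profiles)), profiles[:, landmark_idx]])
    return X @ weights

rng = np.random.default_rng(0)
n_genes, n_landmarks = 200, 20  # toy sizes, not the real 20,000 and 1,000
data = simulate_expression(600, n_genes, n_factors=8, noise=0.3, rng=rng)
train, test = data[:500], data[500:]

landmark_idx = np.arange(n_landmarks)           # genes we "measure"
target_idx = np.arange(n_landmarks, n_genes)    # genes we infer

weights = fit_inference_model(train, landmark_idx, target_idx)
predicted = infer_targets(test, landmark_idx, weights)
actual = test[:, target_idx]

# Per-gene Pearson correlation between inferred and held-out true values.
corrs = np.array([
    np.corrcoef(actual[:, j], predicted[:, j])[0, 1]
    for j in range(actual.shape[1])
])
measured_fraction = n_landmarks / n_genes
mean_corr = float(corrs.mean())

print(f"measured fraction of genes: {measured_fraction:.0%}")
print(f"mean correlation on inferred genes: {mean_corr:.3f}")
```

In this toy setting the inference works well because the synthetic genes share only a few underlying factors. How much information real landmark transcripts retain is an empirical question, which is what the Cell paper set out to measure.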
"Throw away 95 percent of the data but retain 80 percent of the information content?" Golub said. "That's a pretty good trade-off."
For the study, the Broad researchers generated data with a new laboratory platform called L1000, which uses Luminex beads for low-cost measurement. The platform is so named because it helped the Cambridge, Massachusetts-based institute expand CMap more than 1,000-fold. In all, L1000 helped the Broad create more than 1.3 million profiles.
"Now we have a very low-cost, very high-throughput method to measure 1,000 RNAs in a single well of a 384-well plate, from which we can computationally say, 'Well, we only measured 1,000, but can we computationally predict the expression of all the transcripts we didn't actually measure would actually be?' It turns out, that works pretty well — not perfectly, but pretty well," Golub said.
This method, which had been in the works for close to seven years, afforded the Broad the opportunity to ramp up its data generation with far more gene perturbations and cell types than had previously been possible. "That's what this next generation of the Connectivity Map is all about," Golub explained.
According to Golub, L1000 "makes it possible to generate gene expression profiles at a scale and cost that make large-scale data collection feasible in a way that wasn't imaginable when we launched the CMap concept." It is designed to be compatible with RNA-seq rather than to serve as a replacement or alternative.
"I think if you had a patient sample that you cared very deeply about, then spending a few hundred dollars for a really high-quality, very deep RNA sequencing analysis makes a lot of sense. If you wanted to profile 100,000 things, that becomes a big experiment. That's where L1000 can be useful," Golub said.
"Importantly, we've also shown that one can integrate RNA-seq profiles with L1000 profiles and relate them to each other. That's an important concept to Connectivity Map."
LINCS has publicly released the L1000 blueprint, companion computing code, and data sets generated for CMap to the open-source community.
"[We used] the data to understand, for example, the mechanism of action of drugs and chemical compounds based on the gene expression changes," Golub said. "We do a decent job of figuring that out using our algorithms, but by making the data available we hope that others will do even better, and that will be better for the whole field as well."
As the community digests the findings and the technology, Golub is pushing ahead with new experiments.
"Our initial computational approaches use reasonable but standard approaches to pattern recognition," he noted. "The obvious next step is to apply emerging machine-learning approaches to the data." That is exactly what his team at the Broad has started doing.
Golub has wider aspirations for the application of CMap as well.
"Most of the data that we've generated has been in cancer cell lines, and we're interested in expanding that to non-cancer cell types," he said. Plus, he added, the genetic perturbations the Broad studied mostly relied on RNA interference. "We plan to increase the amount of data that is generated from genetic perturbations using CRISPR," Golub said.
"We're committed to, in particular, developing tools that would support biologists' use of the data in their everyday research, which means investing in data and result visualization approaches," Golub continued. "We want biologists to be able to follow their nose as they see initial results and anticipate what they are going to want to ask next of the data and to have software that allows them to do that without being computationally sophisticated themselves."