Through a systematic analysis of messenger RNA and small RNA sequence data generated at several centers as part of a larger project, a large European consortium has confirmed the feasibility of distributing such RNA sequencing across multiple labs.
In a study appearing online last week in Nature Biotechnology, members of the group presented findings from an analysis of more than 450 cell lines subjected to mRNA and sRNA sequencing at one or more sequencing centers in Europe, following standardized Illumina sequencing and sample preparation protocols.
The study's authors saw relatively modest, though not negligible, variation between data for the samples sequenced at multiple labs. By delving into the nature of that variability, they highlighted potential sources of mRNA and sRNA sequencing differences under a distributed sequencing scheme, ultimately proposing a set of quality control guidelines for RNA sequence data.
"You do need to keep an eye on the quality of your data and magnitude of lab effects a little bit more carefully if you have multi-center data," senior author Tuuli Lappalainen told In Sequence.
Nevertheless, she and her co-authors concluded that "distributing RNA sequencing among different laboratories is feasible, given proper standardization and randomization procedures."
That is expected to hold true when researchers stick with standardized RNA sequencing and sample prep approaches such as the Illumina protocols described in the study, noted Lappalainen, who was a post-doctoral researcher working with Manolis Dermitzakis at the University of Geneva at the time the study was performed.
She is currently a visiting instructor in Carlos Bustamante's genetics lab at Stanford University.
Lappalainen cautioned that the type of variability found in RNA sequence data generated at different labs may need to be assessed again if sequencing centers settle on distinct sample-prep protocols or sequencing platforms.
"What we show here is that the Illumina protocol seems to be quite robust and works well — and the amount of technical variation is not that large," she said. "But I wouldn't necessarily go as far as to draw the same conclusion for other technologies."
To explore the sorts of technical variation that can occur in RNA sequence data — and the comparability of RNA sequence data generated at multiple sites — she and her colleagues began with data for 465 lymphoblastoid cell lines being assessed at seven large European sequencing centers as part of a broader study of functional variation in the human genome.
In a study published in Nature last week, for example, Lappalainen, Dermitzakis, and other members of the team reported on regulatory variants affecting gene expression that were detected using some of the same RNA sequence data included in the current technical analysis.
At each of seven participating sequencing centers in Europe, study collaborators did mRNA and small RNA sequencing on between 48 and 113 randomly selected lymphoblastoid cell line samples using the same sample preparation and sequencing protocols.
Five of the samples were assessed at all seven labs, with each of the labs performing duplicate mRNA and sRNA sequencing experiments on those selected samples.
In addition, the University of Geneva selected 168 of the samples that had been sequenced elsewhere and did lower coverage mRNA and sRNA re-sequencing on those to get what Lappalainen called "a very large dataset of replicate samples."
For the mRNA sequencing experiments, each center prepared samples with the Illumina TruSeq RNA kit and sequenced them using 75-base paired-end reads on Illumina's HiSeq 2000 instrument.
Similarly, an Illumina TruSeq kit specific to small RNAs was used to prepare the small RNA libraries from each sample set before researchers sequenced those molecules with single-end 36-base or 50-base HiSeq reads.
But while each center aimed for a minimum number of reads per sample — around 20 million in the case of the mRNA sequencing experiments — the number of reads per sample varied from lab to lab, since the sequencing protocol allowed for flexibility in terms of the number of samples pooled in each HiSeq lane.
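As a rough illustration of that flexibility, a small back-of-envelope calculation shows how the pooling choice trades off against the per-sample read target. The lane yield used here is an assumed round number for illustration, not a figure from the study.

```python
# Back-of-envelope pooling arithmetic: how many samples can share one
# HiSeq lane while still meeting a per-sample read target. The lane
# yield below is an assumed round number, not a figure from the study.
def max_samples_per_lane(lane_reads, target_per_sample):
    """Largest pool size that still guarantees the per-sample target."""
    return lane_reads // target_per_sample

# e.g., an assumed 150 million read pairs per lane, 20 million per sample
pool_size = max_samples_per_lane(150_000_000, 20_000_000)  # 7 samples
```

A lab with a higher-yielding flow cell could pool more samples per lane, which is exactly why per-sample read counts varied from center to center.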
"We chose not to 'over-standardize,'" Lappalainen said, noting that the team did not want to interfere too much with the pipelines already in place at labs producing good quality sequence data.
Once the sample prep and sequencing steps were complete, members of the team brought together data from the various sequencing centers to perform quality control and other types of analysis at a few centralized locales.
When they examined the extent to which mRNA and small RNA sequence data differed when the same samples were sequenced using identical protocols at different centers, the researchers saw relatively little lab-to-lab variation in the RNA sequence data for overlapping samples — at least compared to the level of variability between different biological samples.
When discrepancies did arise in mRNA sequencing experiments done in different labs, they most often involved differences in insert sizes and in the average guanine and cytosine nucleotide content of the sequences generated. The latter form of variation may stem from differences in the thermocyclers used during sample preparation, both within and between different labs, Lappalainen noted.
Data generated in the study also point to a source of technical bias that can arise even before the sample preparation and sequencing steps of a distributed study: the quality of the RNA extracted from the original sample itself.
Variable RNA quality does not necessarily lead to biased results, depending on how the samples are subsequently prepared and handled, Lappalainen explained, though findings from the study indicate that it is something researchers need to keep in mind.
Consequently, Lappalainen noted, it may be beneficial to randomize not only the sample preparation and sequencing steps, but also the RNA extraction step, if possible, when doing multi-lab RNA sequencing studies.
The quality and nature of the extracted RNA also contributed to the variation detected in sRNA sequence data generated at different labs, the researchers reported, since the extraction step influenced how well small RNAs were represented in each sample's RNA pool.
"There are very easily differences in how much small RNA you end up having in your sample," Lappalainen said. "And this, then, affects the end result of how much small RNA you're able to capture in sequencing."
In particular, the RNA extraction step had an apparent effect on the microRNA content that could be picked up by RNA sequencing, with ribosomal RNA sequences swamping out miRNA signal in some cases.
"Most of this is driven by the fact that you sometimes end up having a very low representation of small RNAs in your actual RNA sample," Lappalainen said. "Library prep in small RNA sequencing is still a little bit difficult to do," she explained, noting that "capturing only the small RNAs is not that easy."
Overall, most of those mRNA and miRNA sequence glitches proved relatively easy to correct once detected. But they did prompt the study's authors to develop guidelines for detecting quality control shortcomings and dealing with such variation when it appears, so that data can be accurately standardized.
Those guidelines include quality check considerations that apply to both mRNA and miRNA sequence data, such as base quality score distribution and metrics related to GC content, along with measures that are more specific to mRNA or sRNA sequencing experiments.
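Two of the shared metrics the guidelines mention — the base quality score distribution and GC content — can be computed directly from raw reads. The sketch below is an illustrative implementation assuming an uncompressed FASTQ file with standard Phred+33 quality encoding; it is not code from the study.

```python
# Illustrative sketch of two shared QC metrics: per-position base
# quality and per-read GC content, computed from a FASTQ file.
# Assumes Phred+33 quality encoding; not code from the study itself.
from collections import defaultdict

def fastq_records(path):
    """Yield (sequence, quality) tuples from an uncompressed FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            seq = fh.readline().strip()
            fh.readline()              # '+' separator line
            qual = fh.readline().strip()
            yield seq, qual

def qc_metrics(path):
    pos_quals = defaultdict(list)      # read position -> Phred scores
    gc_fracs = []                      # per-read GC fraction
    for seq, qual in fastq_records(path):
        for i, q in enumerate(qual):
            pos_quals[i].append(ord(q) - 33)   # Phred+33 decoding
        if seq:
            gc = sum(base in "GCgc" for base in seq)
            gc_fracs.append(gc / len(seq))
    mean_qual = {i: sum(v) / len(v) for i, v in pos_quals.items()}
    mean_gc = sum(gc_fracs) / len(gc_fracs) if gc_fracs else 0.0
    return mean_qual, mean_gc
```

Comparing these distributions across centers for the same samples is one simple way to flag the GC-content shifts the study attributed to differences in sample preparation equipment.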
With respect to projects involving distributed RNA sequencing, meanwhile, Lappalainen emphasized the importance of standardizing sample processing protocols and recommended randomizing sample allocation as much as possible.
Such randomization is important, Lappalainen explained, because "while we show that the laboratory differences that we get are not huge, they are not non-existent."
"If we had, for example, analyzed one population in one place and another population in another place, it would have created a huge difference between batches," she said.
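The randomization the authors recommend can be sketched as a simple shuffle-and-deal of samples across centers, so that no biological group ends up concentrated at a single lab. The center names and sample labels below are hypothetical (only Geneva appears in the study), and this is a minimal illustration of the idea rather than the consortium's actual allocation procedure.

```python
# Minimal sketch of randomized sample-to-lab allocation, so that
# population (or any biological group) is not confounded with the
# sequencing center. Lab names and sample labels are hypothetical.
import random

def randomize_allocation(samples, centers, seed=0):
    """Shuffle samples, then deal them round-robin across centers."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    allocation = {c: [] for c in centers}
    for i, sample in enumerate(shuffled):
        allocation[centers[i % len(centers)]].append(sample)
    return allocation

# Usage: 12 samples from two populations, dealt across three labs,
# so each lab ends up with a mix of both populations.
samples = [f"pop{p}_s{i}" for p in ("A", "B") for i in range(6)]
labs = ["Geneva", "Barcelona", "Nijmegen"]
batches = randomize_allocation(samples, labs)
```

Without the shuffle — sending population A to one lab and population B to another — any lab effect would be indistinguishable from a real biological difference, which is exactly the batch confounding Lappalainen warns against.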