Long-Read Sequencing Reveals Tens of Thousands of Novel Transcripts Across Different Human Tissues

NEW YORK – Human tissues contain thousands of previously unknown gene transcripts, revealed by analysis with long-read sequencing, according to a new study published Wednesday in Nature.

A team of researchers from the New York Genome Center and the Broad Institute used Oxford Nanopore Technologies' platform to conduct bulk RNA sequencing on libraries generated from 88 samples from the Genotype-Tissue Expression (GTEx) biobank. They identified 71,735 novel transcripts for annotated genes and validated the proteins for 2,575 of them using mass spectrometry. They also developed an open-source toolkit for dealing with long-read transcriptomics data.

"Annotating human genes and transcripts is a long process," said Tuuli Lappalainen, a researcher at the NYGC and a senior author of the study. "Our data demonstrates it's still not complete … there are still new things to be discovered there."

The study paints a picture of the human transcriptome in detailed pixels, rather than "broad brushstrokes," said Thomas Gingeras, an RNA researcher at Cold Spring Harbor Laboratory who was not involved in the study. "You now have much better definition; these are detailed analyses of each transcript."

The authors also included analyses to untangle the impact of allele-specific expression from allele-specific splicing and their relationships to genetic studies, said Ewan Birney, joint director of the European Molecular Biology Laboratory – European Bioinformatics Institute.

"The full-length transcripts generated have allowed better understanding of expression and splicing relationships," he said. "This sort of in-depth data generation and analysis will continue to provide insight into the human genome, and for disease-causing genes — in particular in Mendelian disorders — every detail counts." Birney disclosed that he is a paid consultant for Oxford Nanopore, as well as a shareholder.

The study is the result of a longstanding collaboration between Lappalainen, who had been looking into splicing in GTEx samples, and Daniel MacArthur, director of the Centre for Population Genomics at Australia's Garvan Institute of Medical Research. MacArthur, together with Beryl Cummings, now an associate at Third Rock Ventures, had been doing cross-tissue long-read sequencing while at the Broad. "We decided to just join forces and combine the two datasets," Lappalainen said.

As a leader of the GTEx project, she was "familiar with the samples and the opportunities that they provide," she said.

The choice of Oxford Nanopore sequencing was driven largely by cost and by the convenience of being in the same building as the firm's New York-based team. "We wanted to not just look at [transcripts] qualitatively … we wanted to quantify them to look at transcript expression levels and allele-specific expression," Lappalainen said. Nanopore sequencing "provided long reads and sufficient accuracy and ability to create enough reads" at the price point the researchers were looking for.

Oxford Nanopore performed sequencing for some of the cell lines used in the study, and "there was quite a bit of exchange" between several company scientists — who are listed as coauthors — and the other researchers.

While the Oxford Nanopore platform can do direct RNA sequencing, it has low yield and requires high sample inputs, so the researchers sequenced cDNA libraries, including many that were amplified by PCR. What biases this might have introduced into the data remains to be seen. "It's important because we know reverse transcriptase has a bias, we know PCR has biases," Gingeras said. "How much does that affect the overall picture that we see?"

The median number of aligned reads per sample was nearly 5 million, and the median length of aligned reads was 789 bp. "All samples had reads longer than that with read length distributed across different sizes," Dafni Glinos, now a research fellow at Vertex Pharmaceuticals and co-first author of the paper, said in an email. Two samples were directly sequenced from cDNA without PCR, providing aligned reads of approximately 1 kb or more.

The resulting dataset was large and allowed for a number of analyses. "As far as I know, this is by far the largest long-read transcriptome dataset, in terms of a single study," Lappalainen said. The authors compared their findings with those of the CHESS project, which identified 116,156 novel transcripts using short-read RNA-seq from multiple tissues. The two sets of novel transcripts overlapped by approximately 33 percent, the authors noted.

Lappalainen added that her study's novel transcripts "tend to be more tissue- or cell type-specific than the annotated ones."

The study helps refine understanding of how expression and splicing work on specific alleles. Multiple studies have shown that variants affecting expression and splicing "tend to be distinct," Lappalainen said. But the study data suggested that "a surprising amount of these things were co-occurring."

The data suggests that some variants affecting both expression and transcript structure lie in the 5' untranslated region of transcripts. "That's where the promoter is; if you mess with that structure, you mess with expression," she said.

In addition to the data, the study authors presented LORALS (long-read allelic analysis), a toolkit for allele-specific analysis that addresses the higher error rate seen in some long-read data. None of the tools for allele-specific expression analysis that work on Illumina data worked on long reads, Lappalainen said, so they built new ones. "There's a lot of opportunity in scaling up these new tools and making them more accurate," she said.
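For readers unfamiliar with allele-specific expression analysis, the basic idea can be sketched as a test of whether reads covering a heterozygous site split evenly between the two alleles. The Python sketch below is illustrative only: it is not LORALS's method or interface, the function name `allelic_imbalance_p` and its parameters are hypothetical, and it ignores the read-error and alignment issues that the toolkit is designed to handle.

```python
from math import exp, lgamma, log


def allelic_imbalance_p(ref_reads: int, alt_reads: int, expected_ref: float = 0.5) -> float:
    """Two-sided exact binomial p-value for allelic imbalance at one site.

    Hypothetical helper for illustration; assumes 0 < expected_ref < 1.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance

    def pmf(k: int) -> float:
        # Binomial probability of k reference reads, computed in log space
        # so that deep coverage does not overflow.
        log_p = (
            lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(expected_ref) + (n - k) * log(1.0 - expected_ref)
        )
        return exp(log_p)

    observed = pmf(ref_reads)
    # Sum the probabilities of all outcomes no more likely than the observed
    # one; the small tolerance absorbs floating-point ties.
    total = sum(p for p in (pmf(k) for k in range(n + 1)) if p <= observed * (1 + 1e-7))
    return min(1.0, total)


if __name__ == "__main__":
    # Made-up read counts for a single heterozygous site.
    print(allelic_imbalance_p(30, 10))
```

A small p-value here would only flag a site for follow-up. In practice, sequencing errors and reference-mapping bias can shift the expected ratio away from 0.5, which is one reason tools built for short reads may not carry over directly to long-read data.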

Overall, the study is important for providing a "picture of how much we don't know," Gingeras said. "This is just a window into a world that I think is still going to need additional kinds of analysis." In addition to direct RNA sequencing, he suggested that single-cell sequencing could also prove beneficial. 

The bottom line is that "long reads are available and affordable," Lappalainen said. "There is tremendous potential if you think about other species, as well."

From cataloging the basic units of genetics to discovering rare variants and showing their role in disease, "all of that relies on accurate transcript annotation," she said.