NEW YORK (GenomeWeb) – Stanford University researchers reported in the Proceedings of the National Academy of Sciences that they have generated a personal transcriptome using a long-read-based approach.
Senior author Michael Snyder and his colleagues sequenced the lymphoblastoid transcriptomes of three family members using the Pacific Biosciences system, and compared the resulting long reads to shorter reads from the Illumina platform. From those transcriptomes, they developed an allele-specific full-length transcriptome for one of the family members. They were able to distinguish the two alleles even for complicated genes such as HLA genes.
"Here, we generate the deepest and longest single-molecule long-read dataset to date, to our [knowledge], for a trio of human cell lines," Snyder and his colleagues wrote in their paper. They further "show[ed] that we can determine SNVs de novo and that using a [principal components] approach, molecules from genes with multiple heterozygous SNVs can be attributed to the two alleles."
Such personal transcriptomes, Snyder and his colleagues added, are expected to become important in the understanding of individual biology and disease.
He and his colleagues used the PacBio platform to sequence some 711,000 circular consensus read molecules from the GM12878 cell line. They generated longer sub-reads for this study — an average 1,188 basepairs — than they did for the human organ panel dataset — an average 999.9 basepairs — that they presented last year in Nature Biotechnology.
They additionally noted that though both datasets equally represented shorter molecules between 0.8 kilobases and 1.3 kilobases in length, the present dataset better represented molecules longer than 1.7 kilobases.
The Stanford team also sequenced 100 million 101-basepair paired-end reads on the Illumina platform that they then analyzed using Cufflinks.
Both technologies, they reported, uncovered some 99,000 annotated exon-exon junctions. Illumina reads covered an additional 92,000 or so annotated junctions, while the PacBio reads covered a further 992 junctions. Additionally, of the 22,600 spliced genes classified by Gencode as either protein-coding genes or lincRNAs, 9,200 were identified by both long-read single-molecule sequencing and 101-basepair paired-end sequencing. Forty genes were found solely through long reads, 6,400 genes solely by 101-basepair paired-end sequencing, and 7,000 genes weren't found using either approach.
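The gene-level comparison above amounts to splitting a catalog into four groups by which technology detected each gene. A minimal sketch of that bookkeeping follows; the function name and the toy gene identifiers are hypothetical, and only the rounded counts in `reported` come from the article.

```python
def partition_detected(all_genes, long_read_hits, short_read_hits):
    """Split a gene catalog by which technology detected each gene."""
    all_genes = set(all_genes)
    long_hits = set(long_read_hits) & all_genes
    short_hits = set(short_read_hits) & all_genes
    return {
        "both": long_hits & short_hits,
        "long_only": long_hits - short_hits,
        "short_only": short_hits - long_hits,
        "neither": all_genes - long_hits - short_hits,
    }


# Toy identifiers for illustration only, not real data.
catalog = {"geneA", "geneB", "geneC", "geneD", "geneE"}
parts = partition_detected(
    catalog,
    long_read_hits={"geneA", "geneB"},
    short_read_hits={"geneB", "geneC", "geneD"},
)
counts = {name: len(genes) for name, genes in parts.items()}

# Rounded counts as reported in the article. Because each figure is rounded,
# their sum only approximately matches the stated total of about 22,600.
reported = {"both": 9200, "long_only": 40, "short_only": 6400, "neither": 7000}
reported_total = sum(reported.values())
```

The four groups are mutually exclusive, so their sizes should add up to the size of the catalog, which is a quick sanity check on any such comparison.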
The researchers had hypothesized that, because circular consensus read generation needs read lengths to be at least twice as long as the cDNA length, consensus split-mapped molecules (CSMMs) wouldn't include a large number of longer genes.
However, they found that genes with and without a CSMM had similar lengths, though genes with a CSMM were less likely to be smaller than one kilobase, which the researchers said was likely due to the magnetic beads in the loading procedure preferring longer fragments.
Both expression and mature gene length, Snyder and his colleagues added, are important factors in whether or not a gene received a full-length consensus split-mapped molecule.
Such long reads, the researchers said, could include a number of novel exon-intron structures. To eliminate potential artifacts, the researchers focused on 12,000 full-length novel isoforms that could be attributed to a known gene and for which the exon-intron junction was annotated or otherwise supported by short-read sequencing.
Of these, 55 percent were novel combinations of known splice sites; 34 percent had a single novel donor or acceptor; and 11 percent had two or more novel splice sites.
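The three categories above can be read as a count of how many splice sites in an isoform are absent from the annotation. The sketch below illustrates that reading; it is not the authors' pipeline, the function name and coordinates are hypothetical, and it assumes the isoform has already been established as novel.

```python
def classify_novel_isoform(junctions, known_donors, known_acceptors):
    """Classify a novel isoform by its number of unannotated splice sites.

    junctions: list of (donor, acceptor) coordinate pairs for one isoform.
    Assumes the isoform is already known to be novel as a whole.
    """
    novel_sites = set()
    for donor, acceptor in junctions:
        if donor not in known_donors:
            novel_sites.add(("donor", donor))
        if acceptor not in known_acceptors:
            novel_sites.add(("acceptor", acceptor))
    if len(novel_sites) >= 2:
        return "two_or_more_novel_sites"
    if len(novel_sites) == 1:
        return "single_novel_site"
    return "novel_combination_of_known_sites"


# Toy annotation and isoforms with made-up coordinates.
known_donors = {100, 300}
known_acceptors = {200, 400}

skip_exon = classify_novel_isoform([(100, 400)], known_donors, known_acceptors)
shifted_donor = classify_novel_isoform(
    [(100, 200), (310, 400)], known_donors, known_acceptors
)
both_shifted = classify_novel_isoform([(105, 205)], known_donors, known_acceptors)
```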
Again comparing this work to their previous human organ panel (HOP) dataset, Snyder and his colleagues found that some 2,100 genes had a novel isoform in the HOP sample, 4,300 in the current sample, and 600 in both.
A goal of transcriptomic research, the researchers said, is to be able to assign RNA molecules to the allele from which they are expressed. Long-read sequencing, they added, should be able to determine each SNV affecting a single RNA molecule.
To trace the origin of these alleles found in the GM12878 daughter cell line, they folded in data from the parental GM12891 and GM12892 lines, and examined that parental data for the presence or absence of SNVs present in the daughter.
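Checking parental data for the presence or absence of a daughter's SNV is, at its simplest, a two-way lookup. The sketch below shows one way to express it; the function name, the labels, and the toy SNV table are hypothetical, and a real analysis would also have to handle genotype quality and coverage.

```python
def parental_origin(in_father, in_mother):
    """Guess which parent transmitted a variant seen in the daughter."""
    if in_father and not in_mother:
        return "paternal"
    if in_mother and not in_father:
        return "maternal"
    if in_father and in_mother:
        # Both parents carry the variant, so presence alone cannot decide.
        return "ambiguous"
    # Seen in neither parent: possibly de novo, or a calling error.
    return "unexplained"


# Toy table: SNV id -> (variant seen in father line, variant seen in mother line).
daughter_snvs = {
    "snv1": (True, False),
    "snv2": (False, True),
    "snv3": (True, True),
    "snv4": (False, False),
}
origins = {
    snv: parental_origin(father, mother)
    for snv, (father, mother) in daughter_snvs.items()
}
```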
Through a principal components analysis, they could separate out the two alleles based on the eigenvectors. For 166 genes with at least two annotated heterozygous SNVs, the researchers found that 158 of them had two or more SNVs, two genes had one SNV, and six genes did not appear to be heterozygous.
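To illustrate how a principal components analysis can split molecules into two alleles, here is a minimal sketch on toy data. The article does not describe the authors' encoding or implementation, so everything here is an assumption: each molecule is a row, each heterozygous SNV is a column, and entries are +1 (alternate base), -1 (reference base), or 0 (position not covered). The first principal component is found by plain power iteration, and its sign is used as the allele label.

```python
import math


def first_pc_scores(rows, iterations=200):
    """Project each row onto the first principal component."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]
    # Sample covariance matrix of the SNV columns (m x m).
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(m)]
        for a in range(m)
    ]
    # Deterministic, non-uniform start vector for the power iteration.
    vec = [1.0 + 0.1 * j for j in range(m)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return [sum(c[j] * vec[j] for j in range(m)) for c in centered]


def assign_alleles(rows):
    """Label molecules by the sign of their first-component score."""
    return ["allele_1" if s >= 0 else "allele_2" for s in first_pc_scores(rows)]


# Toy molecules over three heterozygous SNVs; 0 marks an uncovered position.
reads = [
    [1, -1, 1],
    [1, -1, 1],
    [1, 0, 1],
    [-1, 1, -1],
    [-1, 1, -1],
    [-1, 1, 0],
]
labels = assign_alleles(reads)
```

The labels "allele_1" and "allele_2" are arbitrary, since the sign of a principal component is not meaningful on its own. Tying each group to a parent would require additional information such as the trio data.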
A few genes — particularly HLA genes — contained a number of SNVs, and for these, too, the researchers were mostly able to determine phasing.
"Even for complicated genes (e.g., HLA genes, whose sequences may differ considerably from the reference sequence) the two alleles are usually clearly distinguishable," Snyder and his colleagues wrote.
They noted, though, that deeper sequencing would be necessary to determine whether one allele behaves differently from the other for different genes.