What if researchers had to reinterpret all their human gene expression data — or worse, if microarrays needed to be redesigned altogether? These scenarios are possible due to extensive size variations in alternative transcripts, according to a new paper by Victor Jongeneel from the Ludwig Institute for Cancer Research and his colleagues.
The researchers estimated in their paper, which was published online ahead of print in Genome Research this month, that about half of all human genes make several transcripts that vary significantly in length.
The size variation the scientists found resides mostly in the 3’ untranslated region of the transcript; it is not a result of alternative splicing and has no effect on the size of the protein that gets made during translation. However, it might make a huge difference in a gene expression experiment, as some of these transcripts might not hybridize to a microarray probe on a chip. If the chip only carries probes near the previously known polyadenylation site, only the shorter transcript may be recognized. This is because target RNA is normally labeled starting from the poly-A tail at the 3’ end, using oligo-dT as a primer, and the labeled product extends only one or two kilobases from there, according to Jongeneel.
“If you have a probe only for the first, upstream polyadenylation site on your microarray, chances are that you will not detect all the other messenger RNA species [that] are derived from the same gene, so you will detect only a subpopulation,” Jongeneel said.
Since the relative amounts of the various transcripts might differ between samples, this variation in the ability of an array to detect different transcripts might lead microarray experiments to over- or underreport changes in expression levels. “If you want to detect all the different isoforms, you really need probes on your microarray that correspond to each of the polyadenylation sites,” said Jongeneel.
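The effect can be made concrete with a small numeric sketch. Everything in it is hypothetical and chosen only for illustration: the probe and polyadenylation-site positions, the transcript abundances, and the fixed 2,000-base labeling reach (the article gives only "one or two kilobases"). It is a toy model, not a description of how any array platform computes signal.

```python
# Toy model: a probe detects an isoform only if the labeled target,
# which starts at that isoform's poly-A site and runs upstream for a
# limited distance, reaches the probe position.

LABEL_REACH = 2000  # assumed labeling distance from the poly-A tail, in bases


def is_detected(probe_pos, polya_pos, reach=LABEL_REACH):
    """Return True if the probe lies within `reach` bases upstream of the poly-A site."""
    distance = polya_pos - probe_pos
    return 0 <= distance <= reach


def measured_signal(probe_pos, isoforms, reach=LABEL_REACH):
    """Sum the abundances of the isoforms this probe can detect.

    `isoforms` is a list of (polya_pos, abundance) pairs.
    """
    return sum(
        abundance
        for polya_pos, abundance in isoforms
        if is_detected(probe_pos, polya_pos, reach)
    )


# Hypothetical gene with an upstream and a downstream polyadenylation site.
PROBE = 1800                         # probe designed just upstream of the first site
SHORT_SITE, LONG_SITE = 2000, 8000   # positions along the transcript, in bases

# Invented abundances: the long isoform rises while the short one falls.
sample_a = [(SHORT_SITE, 100), (LONG_SITE, 100)]
sample_b = [(SHORT_SITE, 50), (LONG_SITE, 250)]

true_fold = sum(a for _, a in sample_b) / sum(a for _, a in sample_a)
measured_fold = measured_signal(PROBE, sample_b) / measured_signal(PROBE, sample_a)

print(f"true fold change:     {true_fold}")      # 300 / 200 = 1.5
print(f"measured fold change: {measured_fold}")  # 50 / 100 = 0.5
```

With these invented numbers, total expression of the gene rises 1.5-fold, but a probe that sees only the short isoform would report a halving.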
Affy Chips and Alternative Transcripts
Affymetrix has claimed that this problem of alternative polyadenylation sites is old hat, saying that it has already taken it into account in the design of its U133 array. But since the company is using a different approach with a higher threshold to identify the new sites, it might have missed quite a few of them. Only further experiments will show if the new findings call into question results from current microarray or SAGE experiments, said Jongeneel.
Alan Williams, Affymetrix’ bioinformatics manager, said his company had already identified a significant number of these sites: 4,500 of the roughly 45,000 probe sets on the company’s U133 array are directed against alternative polyadenylation sites. And more of them, he suggested, are probably already present on the array, but have been annotated as belonging to a different UniGene cluster than their sister transcripts. “The data is already there, the probe set is there, customers will have the data in their database, so it’s not a matter of actually having to go back and redesign the whole microarray,” Williams said. Instead, re-annotating the probe sets might suffice, something that he said Affymetrix already does on a routine basis as the annotation for the human genome gets more refined.
However, Affymetrix has been using a different and more conservative approach than that used by Jongeneel’s team to identify the alternative polyadenylation sites, and might have missed a significant number of them. “It becomes an issue of what is your threshold for evidence of a polyadenylation site,” said Williams. “Obviously one could lower or increase that threshold and come up with a different number of [them].” For the U133 design, Affymetrix looked for clusters of at least eight 3’ EST sequences indicating a polyadenylation site, and picked a probe set upstream of that region to go on the chip.
Jongeneel and his team, on the other hand, created a filtered set of trusted 3’ tags selected from ESTs with poly-A tails and mapped them directly onto the genome. Using a manual approach, they examined the distribution of these tags in a region of 52 annotated genes on chromosome 21 and found that half of these genes had multiple polyadenylation sites, spread over up to 15 kilobases. Extrapolating these results to the entire genome, the researchers estimated that about half of all human genes probably have such sites. This number is higher than estimates from EST clustering, and in many cases the researchers were also able to connect the sites to the 3’ UTRs of annotated genes.
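Williams’s point about thresholds can be illustrated with a minimal sketch of support-based site calling. It represents neither Affymetrix’s design pipeline nor the method of Jongeneel’s team; the function name, the 30-base grouping window, and the EST end positions are all assumptions invented for the example. Only the minimum support of eight ESTs comes from the article.

```python
def call_polya_sites(end_positions, min_support, window=30):
    """Group sorted 3' end positions into clusters and keep well-supported ones.

    Consecutive positions no more than `window` bases apart join the same
    cluster. A cluster is reported as (first_pos, last_pos, count) when it
    contains at least `min_support` positions.
    """
    sites = []
    cluster = []
    for pos in sorted(end_positions):
        # A gap larger than the window closes the current cluster.
        if cluster and pos - cluster[-1] > window:
            if len(cluster) >= min_support:
                sites.append((cluster[0], cluster[-1], len(cluster)))
            cluster = []
        cluster.append(pos)
    # Flush the final cluster.
    if len(cluster) >= min_support:
        sites.append((cluster[0], cluster[-1], len(cluster)))
    return sites


# Invented 3' EST end positions for one hypothetical gene:
# nine ends near 1 kb, three near 5 kb, two near 12 kb.
est_ends = [
    1000, 1002, 1003, 1005, 1005, 1007, 1008, 1010, 1011,
    5000, 5004, 5009,
    12000, 12006,
]

strict = call_polya_sites(est_ends, min_support=8)   # threshold cited for the U133 design
lenient = call_polya_sites(est_ends, min_support=2)  # a hypothetical lower threshold

print(len(strict), strict)    # 1 site
print(len(lenient), lenient)  # 3 sites
```

On this invented data, requiring eight supporting ESTs reports a single site, while a lower threshold reports three. Whether the extra sites are real or noise is exactly the question that a threshold alone cannot settle.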
In order to confirm that their calculation applies to the whole genome, the scientists are currently working on an automated method to associate 3’ tags with individual genes, Jongeneel said. But ultimately, what will prove the significance of their findings is showing that they make a difference to gene expression results. “We need to show that using probes that hit these alternative polyadenylation sites actually improves the quality of the result you get out of microarray experiments,” Jongeneel said. A possible partner for such a validation project is Zeptosens, a microarray company based in Basel, Switzerland. CSO Gerhard Kresbach confirmed that the company had already talked to Jongeneel about his results but declined to comment further.
Apart from affecting microarrays, the novel polyadenylation sites and their assignment to annotated genes might also help solve a riddle of SAGE-type experiments, which depend on measuring small tags close to these sites: “One of the big problems of SAGE and MPSS [a SAGE-like method developed by Lynx] has been that you find many more tags than there are genes,” said Jongeneel. “Once we have documented which gene each polyadenylation site belongs to, then you will be able … for each of the tags to say which gene it came from … It would help them tremendously to interpret their results.” Jongeneel mentioned that the Ludwig Institute in New York has been engaged in talks with Lynx to perform and reinterpret MPSS experiments.