Rosetta Inpharmatics’ $650 million acquisition by Merck last year hasn’t stopped the company from conducting bleeding-edge microarray research. At the recent Microarray Gene Expression Database working group’s annual meeting in Boston, Rosetta’s Dan Shoemaker discussed the latest developments in the Kirkland, Wash., subsidiary’s ongoing project to use oligonucleotide microarrays to discover new genes in the draft of the human genome and to monitor alternative splicing on a genome-wide level.
In the project, which Rosetta previously described in a February 2001 Nature paper, Shoemaker and other Rosetta researchers are using microarrays to define more precisely the full set of transcripts in the genome by experimentally validating low-confidence predictions generated by ab initio gene-finding programs. To do this, the researchers generated a comprehensive list of 116,000 transcripts by running multiple gene-finding algorithms on the draft of the human genome. The algorithms were run at very low stringency to capture as many new genes as possible. Unfortunately, casting this type of broad net also results in a large number of false positives, which is where the arrays come in, Shoemaker explained.
To determine which of the low-confidence predictions represented “real” transcripts, the researchers designed four different 60-mer oligo probes for each of the 116,000 transcripts, and synthesized the resulting 500,000 or so probes on microarrays. They produced the arrays using inkjet technology that Rosetta has licensed to Agilent. The group then hybridized the set of arrays covering all 116,000 transcripts with RNA from 60 different tissues to eliminate low-confidence transcripts with no expression activity. Based on the microarray experiments, they were able to narrow the list down to 68,000 transcripts. Shoemaker emphasized that this number is clearly an overestimate: the group has already found cases where multiple predictions that sit next to each other in the genome can be collapsed into a single gene based on the co-regulation data from the 60 conditions.
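The two filtering steps described above, dropping predictions with no expression and merging adjacent predictions that are co-regulated, can be illustrated with a small sketch. It is not Rosetta's method: the expression threshold, the correlation cutoff, the use of Pearson correlation, and the toy profiles are all assumptions chosen only to make the idea concrete.

```python
import math

def is_expressed(profile, threshold=2.0, min_tissues=1):
    """True if the signal reaches `threshold` in at least `min_tissues` tissues.
    Both cutoffs are hypothetical, not values from the article."""
    return sum(1 for v in profile if v >= threshold) >= min_tissues

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def collapse_neighbors(ordered, min_r=0.9):
    """Group consecutive predictions (sorted by genomic position) whose
    profiles correlate at or above `min_r` (an assumed cutoff)."""
    groups = []
    for name, profile in ordered:
        if groups and pearson(groups[-1][-1][1], profile) >= min_r:
            groups[-1].append((name, profile))
        else:
            groups.append([(name, profile)])
    return [[name for name, _ in group] for group in groups]

# Toy data: four adjacent predictions measured in four tissues (made up).
predictions = [
    ("pred_A", [0.1, 0.2, 0.1, 0.3]),   # never expressed
    ("pred_B", [1.0, 5.0, 2.0, 8.0]),
    ("pred_C", [1.1, 5.2, 2.1, 7.9]),   # tracks pred_B closely
    ("pred_D", [6.0, 1.0, 4.0, 0.5]),   # different pattern
]

expressed = [(n, p) for n, p in predictions if is_expressed(p)]
genes = collapse_neighbors(expressed)
print(genes)
```

In this toy case the unexpressed prediction is discarded and the two neighbors with near-identical profiles are merged, which mirrors why the 68,000 figure is expected to shrink further.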
One limitation of this approach, however, is that the gene-finding algorithms have been trained on known genes and might not be able to detect truly novel classes of transcripts. To identify these elusive transcripts, the group has tiled through 25 percent of the genome using 60-mers placed in 30-base-pair steps. They have hybridized the tiling arrays with RNA from six tissues, and the data is being used to experimentally determine the position of the exons in the genomic sequence.
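As a rough illustration of what tiling with 60-mers in 30-base steps means, the sketch below cuts a sequence into overlapping probes. The function name and the toy sequence are hypothetical, and real probe design would also have to handle both strands, repeats, and probe quality, none of which is modeled here.

```python
def tile_probes(sequence, probe_len=60, step=30):
    """Return (start, probe) pairs for probes of `probe_len` bases
    placed every `step` bases along `sequence`."""
    return [
        (start, sequence[start:start + probe_len])
        for start in range(0, len(sequence) - probe_len + 1, step)
    ]

# Toy 150-base sequence (made up, not real genomic data).
toy = "ACGT" * 37 + "AC"
probes = tile_probes(toy)

for start, probe in probes:
    print(start, len(probe))
```

With a 60-base probe and a 30-base step, each probe overlaps its neighbor by 30 bases, so every interior base is covered by two probes.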
In addition to refining the structure of known genes and discovering new genes, the Rosetta team is also using the tiling data to develop the next generation of gene-finding algorithms. Shoemaker said that their initial look at the data indicates that “there is a lot of transcription going on outside of annotated genes.” The challenging task that lies ahead will be to determine what fraction of this transcriptional activity represents new genes vs. transcriptional noise.
The Nature paper covered findings for Chromosome 22, and the researchers are now looking to publish their findings on all of the transcripts they have found on Chromosome 20, which represents about two percent of the human genome, in a scientific journal, Shoemaker said.
Shoemaker went on to describe how the Rosetta group is using microarrays to monitor gene structure, specifically alternative splicing, on a genome-wide level. So far, they have mapped sequences of about 11,000 known RefSeq genes to the draft of the human genome to determine the position of the exons in each of the genes. This information was used to generate a set of “junction arrays,” which were hybridized with RNA from 50 different tissues. Preliminary analysis suggests that approximately 30 percent of the genes had detectable levels of alternative splicing, some of which had not been previously reported in the literature. This type of finding has shown Rosetta researchers that “we need to monitor gene structure in disease-relevant tissues, and use co-regulation to infer biological function.”
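The article does not say how the junction probes were designed. One plausible reading, shown here purely as a sketch, is that each probe spans an exon-exon boundary, taking the end of one exon and the start of the next, so that it hybridizes well only when those two exons are spliced together. The function name, the half-length, and the toy exon sequences are assumptions.

```python
def junction_probes(exons, half=18):
    """Build one probe per consecutive exon pair by joining the last
    `half` bases of the upstream exon to the first `half` bases of the
    downstream exon. `half` is a hypothetical value."""
    probes = {}
    for i in range(len(exons) - 1):
        upstream, downstream = exons[i], exons[i + 1]
        probes[(i, i + 1)] = upstream[-half:] + downstream[:half]
    return probes

# Toy exon sequences (made up), listed in transcript order.
exons = ["AAAAACCCCC", "GGGGGTTTTT", "ACACACACAC"]
probes = junction_probes(exons, half=4)

for pair, probe in probes.items():
    print(pair, probe)
```

Under this reading, a weak signal from the probe for one junction in a given tissue, alongside strong signals from the other junctions of the same gene, would hint that the corresponding exon is being skipped there.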
What will Merck do once the whole ‘transcriptome’ has been mapped out? The company isn’t sure where to put the project, Shoemaker said, and he did not rule out the possibility that all of the results would be publicly released.