Many large-scale biology projects make use of a complete set of genes for one's favorite species. Working on such projects for a number of years, we're often reminded that the seemingly straightforward task of assembling a complete list of genes is not so easy, even for our favorite species — human. This article discusses some sources of gene sets, along with the more complex task of assembling complete and correct transcript variants. Despite some great resources, and even with a well-annotated eukaryotic genome, a complete gene set is a constantly evolving theoretical and practical challenge.
Why might we want a gene collection (or, more correctly, a set of transcripts grouped into genes) in the first place? Perhaps we're trying to compile expression data, design a special type of microarray, or design a screen for genes that influence some interesting phenotype. Using high-throughput sequencing data, perhaps we want to link short expressed reads from RNA-seq to known transcripts, or link binding sites identified from large-scale chromatin immunoprecipitation (ChIP-seq) of our favorite transcription factor to nearby genes that the factor may be regulating. In all of these cases, we'd like to have as complete a gene set as possible so we don't miss any important players in our system. As the definition of a gene evolves, we'll want to be sure to include the sorts of genes (like microRNAs, novel noncoding RNAs, and pseudogenes) that may be important to our research and leave out the others. Also, we'd like to use gene identifiers that are useful to both bioinformatics people and lab biologists.
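As a rough illustration of linking binding sites to nearby genes, here is a minimal sketch that finds the gene whose transcription start site (TSS) is closest to a given position. The gene names, coordinates, and function names are all hypothetical, and a real analysis would also need to consider strand, gene bodies, and a distance cutoff.

```python
from bisect import bisect_left

# Hypothetical toy annotation: (gene symbol, chromosome, TSS position)
GENES = [
    ("GENE_A", "chr1", 1_000),
    ("GENE_B", "chr1", 5_000),
    ("GENE_C", "chr1", 12_000),
    ("GENE_D", "chr2", 3_000),
]

def build_index(genes):
    """Group (TSS, symbol) pairs by chromosome, sorted by TSS."""
    index = {}
    for symbol, chrom, tss in genes:
        index.setdefault(chrom, []).append((tss, symbol))
    for entries in index.values():
        entries.sort()
    return index

def nearest_gene(index, chrom, position):
    """Return (symbol, distance) for the closest TSS, or None if the
    chromosome has no annotated genes."""
    entries = index.get(chrom)
    if not entries:
        return None
    # Locate the first TSS at or after the position, then compare it
    # with its left-hand neighbor.
    i = bisect_left(entries, (position,))
    candidates = entries[max(i - 1, 0): i + 1]
    tss, symbol = min(candidates, key=lambda e: abs(e[0] - position))
    return symbol, abs(tss - position)

index = build_index(GENES)
print(nearest_gene(index, "chr1", 4_000))
```

Keeping the TSS list sorted per chromosome lets each lookup use a binary search, which matters once there are thousands of binding sites to annotate.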
A quick look at the "Genes and Gene Prediction Tracks" section of the UCSC Genome Browser shows many gene sets to choose from. Our traditional favorites are genes from NCBI RefSeq, Ensembl, Vega, and the Mammalian Gene Collection (MGC). The NCBI Reference Sequence database contains curated gene collections that are well linked to Entrez Gene and other NCBI resources. Ensembl gene sets are derived from multiple sources and explicitly aim to be as complete as possible. The emphasis of the Vega set is manual curation, so it is smaller than the RefSeq and Ensembl sets. The MGC sequences are a special set in that every sequence is matched with a full-length protein-coding cDNA clone that researchers can purchase from the IMAGE Consortium distributors.
The MGC clones illustrate one way to obtain gene sequences: from expression libraries that contain cDNA sequences. RefSeq gene sets originate mainly from this method, so each curated RefSeq sequence is linked to one or more nucleotide sequences in GenBank (identified in RefSeq's GenBank-format file). As a result, the curated RefSeq transcripts (with accession numbers like NM_*) are not always linked to a genome location, and those that have an obvious genome location may not match the corresponding genome sequence. RefSeq model transcripts (predictions with accessions like XM_*), however, are the opposite; they are associated with a genomic sequence but don't necessarily have a GenBank counterpart. The Ensembl and Vega transcript sequences are like this latter RefSeq group in that all are derived from the latest genome assembly, but they are also supported by cDNA and other expression evidence, together with gene prediction and homology information.
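The curated-versus-model distinction can be read straight from the accession prefix. Here is a small sketch of how we might sort a list of RefSeq transcript accessions. NM_ and XM_ come from the discussion above; NR_ and XR_ are the analogous RefSeq prefixes for noncoding RNAs. The function name and the example accession in the usage line are made up for illustration.

```python
import re

# Prefix -> description of the transcript record type.
REFSEQ_PREFIXES = {
    "NM": "curated mRNA",
    "NR": "curated noncoding RNA",
    "XM": "model (predicted) mRNA",
    "XR": "model (predicted) noncoding RNA",
}

# Two-letter prefix, underscore, digits, and an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")

def classify_refseq(accession):
    """Return a dict describing a RefSeq transcript accession, or None
    if the string doesn't look like one of the prefixes listed above."""
    match = ACCESSION_RE.match(accession.strip())
    if not match:
        return None
    prefix, _number, version = match.groups()
    kind = REFSEQ_PREFIXES.get(prefix)
    if kind is None:
        return None
    return {
        "prefix": prefix,
        "kind": kind,
        "version": int(version) if version else None,
    }

# Hypothetical accession, used only to show the call.
print(classify_refseq("XM_000000.2"))
```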
What we call a gene may reveal the biggest differences between the needs of a bioinformatician and a lab researcher. As bioinformatics people, we choose identifiers like an NCBI GeneID or an Ensembl gene ID that are database friendly, guaranteed to be unique, and (hopefully) stable. For someone who would like to visually inspect a gene set, however, these are far from informative, so we make sure to also include standard gene symbols. Fortunately for all of us, gene symbols have become more standardized. Human and most model species have some authority that is responsible for the assignment of approved gene names and symbols. For human genes, this is the Human Genome Organisation's Gene Nomenclature Committee (genenames.org). Other examples: yeast nomenclature is headed by the Saccharomyces Genome Database (yeastgenome.org), and mouse and rat by Mouse Genome Informatics (informatics.jax.org).
Making gene names systematic and informative is a big job, as is trying to sort out alternate names and homology issues. We happen to like the Saccharomyces method where, in addition to a standard symbol, genes have a systematic name (like YCL004C) that indicates genome location. Novel genes are still a big nomenclature challenge, even in species like human. RefSeq includes lots of genes with symbols of the form LOCx, where x is the GeneID, just as Ensembl uses ACy.z to indicate the zth gene on a segment of genomic DNA represented by clone ACy. It's impossible to create a useful symbol or name if the gene is of totally unknown function. Even where genes are well characterized, different researchers often started out calling the same gene by different names, and these habits can be hard to break.
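To show why we like systematic names, here is a sketch that decodes one. It assumes the usual layout for yeast nuclear ORF names: "Y", a chromosome letter (A for chromosome I through P for XVI), the arm (L or R), a three-digit ORF number, the strand (W for Watson, C for Crick), and an optional suffix such as "-A". A second helper pulls the GeneID out of a LOCx placeholder symbol. Both function names are ours, and the LOC symbol in the usage line is made up.

```python
import re

# Y + chromosome letter + arm + 3-digit ORF number + strand + optional suffix
SYSTEMATIC_RE = re.compile(r"^Y([A-P])([LR])(\d{3})([WC])(-[A-Z])?$")
LOC_RE = re.compile(r"^LOC(\d+)$")

def parse_yeast_systematic_name(name):
    """Decode a yeast systematic ORF name into its location fields,
    or return None if the name doesn't fit the assumed pattern."""
    match = SYSTEMATIC_RE.match(name.upper())
    if not match:
        return None
    chrom_letter, arm, number, strand, _suffix = match.groups()
    return {
        "chromosome": ord(chrom_letter) - ord("A") + 1,
        "arm": "left" if arm == "L" else "right",
        "orf_number": int(number),
        "strand": "Watson" if strand == "W" else "Crick",
    }

def gene_id_from_loc_symbol(symbol):
    """Return the numeric GeneID embedded in a LOCx symbol, else None."""
    match = LOC_RE.match(symbol)
    return int(match.group(1)) if match else None

print(parse_yeast_systematic_name("YCL004C"))
print(gene_id_from_loc_symbol("LOC123456"))  # hypothetical symbol
```

A LOCx symbol carries an identifier but no location or function, which is one reason such placeholders are less satisfying than the yeast scheme.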
Some gene symbols or names end up being problematic for a variety of reasons. Confusion quickly arises in data mining when a symbol can refer to multiple genes; according to Entrez Gene, for example, the symbols p40 and HOX1 can each refer to about 10 human genes. Conversely, FOXG1, which plays a role in brain development, has been called by 20 different names. Greek letters can be another challenge. Fly symbols use Greek letters (as in βTub85D), but Greek characters aren't permitted in human gene symbols (as in the orthologous tubulin beta-2C, where the letter is spelled out).
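One way to catch ambiguous symbols before they cause trouble is to index every official symbol and alias by the genes that use it, then flag any name with more than one hit. The sketch below uses made-up records and function names; in practice the records would come from a gene information table downloaded from one of the resources above.

```python
from collections import defaultdict

# Hypothetical records: (GeneID, official symbol, aliases)
RECORDS = [
    (1, "GENE1", ["p40", "ALIAS_A"]),
    (2, "GENE2", ["p40"]),
    (3, "GENE3", ["ALIAS_B"]),
]

def build_symbol_index(records):
    """Map each symbol or alias (uppercased) to the set of GeneIDs using it."""
    index = defaultdict(set)
    for gene_id, symbol, aliases in records:
        for name in [symbol, *aliases]:
            index[name.upper()].add(gene_id)
    return index

def ambiguous_symbols(index):
    """Return only the names that point to more than one gene."""
    return {name: sorted(ids) for name, ids in index.items() if len(ids) > 1}

print(ambiguous_symbols(build_symbol_index(RECORDS)))
```

Uppercasing the names is a choice that suits human symbols; for species where case carries meaning, the comparison should stay case sensitive.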
On a totally different note, spreadsheet applications don't play very nicely with symbols that look like dates; we need to jump through some hoops to keep genes like SEPT1 and MARCH5 from automatically getting converted into dates. Lastly, we've always enjoyed funny gene names, like fly's "Ken and Barbie," named after the observation that mutants lack external genitalia. To fully appreciate the magnitude of this creative nomenclature (and see a human side of biology), we recommend the "Clever Gene Names" website. On the other hand, physicians can feel uncomfortable explaining to a patient that a medical problem is due to a mutation in a gene like "sonic hedgehog homolog." As a result, in some cases a nomenclature committee has decided to change the official name to something more staid while keeping the colorful gene symbol.
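Before a gene list goes anywhere near a spreadsheet, it can help to flag the symbols at risk. This sketch uses a simple heuristic of our own (a month name or abbreviation followed by one or two digits); it isn't a model of what any particular spreadsheet application actually converts.

```python
import re

# Heuristic: month name or abbreviation, optional hyphen, then 1-2 digits.
DATE_LIKE_RE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)-?\d{1,2}$",
    re.IGNORECASE,
)

def date_like_symbols(symbols):
    """Return the symbols that a spreadsheet might mistake for dates."""
    return [s for s in symbols if DATE_LIKE_RE.match(s)]

print(date_like_symbols(["SEPT1", "MARCH5", "BMP4", "FOXG1"]))
```

Flagged symbols can then be handled however the lab prefers, for example by importing that column explicitly as text.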
Once we have our set of gene names, we'll want to add sequence (either actual sequence or genome coordinates) to represent the one or more transcripts that each gene encodes. As described above, our primary data may be cDNA sequence or genome regions, but there's a good chance we want both, and both are easy to download from Ensembl, Vega, NCBI, or UCSC Genome Bioinformatics. Agreement between the genes in different gene sets is far from perfect, and a quick look at one gene in a genome browser shows that there's probably even less agreement between the transcriptional variants included in each gene set. A quick example: for BMP4, RefSeq has three transcripts compared to Vega's one, but for BMP7, it's one in RefSeq and seven in Vega. Some high-throughput RNA sequencing projects that look at exons and splice junctions are contributing to a more complete understanding of splicing variation, but it may be a while until this is applied to all of our favorite cells and developmental conditions.
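A simple first check on how well two gene sets agree is to count transcripts per gene in each. In this sketch the transcript identifiers are placeholders we made up; only the counts for BMP4 and BMP7 follow the example above.

```python
# Placeholder transcript IDs; only the per-gene counts mirror the text.
REFSEQ = {
    "BMP4": {"refseq_tx1", "refseq_tx2", "refseq_tx3"},
    "BMP7": {"refseq_tx1"},
}
VEGA = {
    "BMP4": {"vega_tx1"},
    "BMP7": {f"vega_tx{i}" for i in range(1, 8)},
}

def transcript_counts(set_a, set_b):
    """Return {gene: (count in set_a, count in set_b)} for every gene
    present in either gene set."""
    genes = sorted(set(set_a) | set(set_b))
    return {g: (len(set_a.get(g, ())), len(set_b.get(g, ()))) for g in genes}

def discordant_genes(set_a, set_b):
    """List the genes whose transcript counts differ between the sets."""
    counts = transcript_counts(set_a, set_b)
    return [g for g, (a, b) in counts.items() if a != b]

print(transcript_counts(REFSEQ, VEGA))
```

Equal counts don't mean the transcripts themselves match, so this is only a coarse screen before comparing exon structures.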
How about genetic variation? And what if a RefSeq cDNA sequence can't be mapped perfectly to the reference genome — which is right? It's possible that both are correct, but we probably want to look in more detail. Human genetic variation is a huge area of current research, but we can get a quick look at the magnitude of this variation by turning on some SNP tracks on the UCSC Genome Browser or checking out the HapMap Genome Browser (hapmap.org). It's easy for computational biologists to represent each transcript as a consensus sequence, but if we really want to be accurate, more of us may have to start representing each SNP within the transcript as a choice of observed alleles or, even better, as a matrix of probabilities. Building a new generation of tools to analyze these SNP-aware transcripts will take some time and a lot of programming.
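To make the idea of a probability matrix concrete, here is a minimal sketch in which each transcript position becomes a row of probabilities over A, C, G, and T. The sequence, allele frequencies, and function names are invented for illustration, and the sketch treats positions as independent, which ignores linkage between nearby SNPs.

```python
BASES = "ACGT"

def snp_aware_matrix(consensus, snps):
    """Build one row of base probabilities per transcript position.

    snps maps a 0-based position to {allele: frequency}; positions
    without an entry get probability 1.0 for the consensus base.
    """
    matrix = []
    for pos, base in enumerate(consensus):
        freqs = snps.get(pos, {base: 1.0})
        total = sum(freqs.values())
        matrix.append([freqs.get(b, 0.0) / total for b in BASES])
    return matrix

def sequence_probability(matrix, sequence):
    """Probability of observing a sequence, assuming positions are
    independent of one another."""
    if len(sequence) != len(matrix):
        raise ValueError("sequence length must match the matrix")
    prob = 1.0
    for row, base in zip(matrix, sequence):
        prob *= row[BASES.index(base)]
    return prob

# Hypothetical 4-base transcript with one C/T SNP at position 1.
matrix = snp_aware_matrix("ACGT", {1: {"C": 0.75, "T": 0.25}})
print(sequence_probability(matrix, "ATGT"))
```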
Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a senior bioinformatics scientist in Fran's group.