ST. LOUIS--The team of 42 researchers that generated, annotated, and deposited more than 352,000 expressed sequence tags (ESTs) into a public mouse genome database introduced in January relied on a set of unique, custom-designed bioinformatics tools to complete the task. The tools, now publicly available through Washington University, are already being employed to find gene fragments in other organisms.
Marco Marra at Washington University's Genome Sequencing Center here led the mouse EST project, which began two-and-a-half years ago with funding from the Howard Hughes Medical Institute. Fragments of genes expressed in a broad array of cells in various developmental stages were sequenced to get the largest possible sample of genes.
A similar EST cataloguing effort for humans, initiated here before the mouse project began, is ongoing. "The bioinformatics tools that were developed for the human EST project provided us a model upon which to add organism-specific modifications," explained Marra. The mouse project was completed so swiftly in part because that infrastructure was already in place.
The first two steps in generating the mouse ESTs were essentially molecular biology tasks that did not require any bioinformatics tools. First, researchers prepared mRNA from tissue samples in order to collect representatives of a cell's expressed genes. The second step involved converting the mRNA into cDNA libraries that could be used for sequencing.
But once sequencing of the cDNA libraries began, special bioinformatics capabilities were required. The team needed to conduct computer checks to ensure that the sequence generated really was mouse DNA and not some artifact, such as plasmid sequence. Software with such capability "was mostly written by our bioinformatics specialist, LaDeana Hillier," said Marra. "There are no commercially available software packages that provide the ability to identify and screen out contaminating DNA sequences from mouse sequence."
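The article does not describe how Hillier's screening software worked internally. As a rough illustration of the general idea only, the Python sketch below flags a read as likely contamination when a large share of its short subsequences (k-mers) also occur in a known contaminant such as a plasmid vector. The function names, the k-mer length, the threshold, and the sequences are all hypothetical and were chosen for this example. A practical screen would also check the reverse complement and tolerate sequencing errors, which this sketch does not.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def contaminant_fraction(read, contaminant_kmers, k):
    """Fraction of the read's distinct k-mers that also occur in the contaminant."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        # Read is shorter than k, so there is nothing to compare.
        return 0.0
    return len(read_kmers & contaminant_kmers) / len(read_kmers)


def screen_reads(reads, contaminant, k=8, threshold=0.5):
    """Split read names into (kept, rejected) by shared k-mer content.

    k and threshold are arbitrary illustrative defaults, not values
    taken from the mouse EST project.
    """
    contaminant_kmers = kmers(contaminant, k)
    kept, rejected = [], []
    for name, seq in reads:
        if contaminant_fraction(seq, contaminant_kmers, k) >= threshold:
            rejected.append(name)
        else:
            kept.append(name)
    return kept, rejected


# Made-up sequences for illustration only; not real vector or mouse data.
VECTOR = "GATTACAGGCTTAACCGGTTAGCATCGATCGGATCCAAGCTT"
READS = [
    ("read_vector", "GGCTTAACCGGTTAGCATCGATCGG"),    # copied from inside VECTOR
    ("read_insert", "ATGTCCTGAAGTTCACGGTAATCCTGA"),  # unrelated to VECTOR
]

kept, rejected = screen_reads(READS, VECTOR)
print("kept:", kept)
print("rejected:", rejected)
```

Here the read copied from the toy vector is rejected and the unrelated read is kept. Comparing sets of k-mers keeps the check cheap enough to run on every read, which matters at the scale of hundreds of thousands of ESTs.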
The next step, submitting the large volumes of verified sequence to the public database of ESTs known as dbEST, also required special bioinformatics capabilities. Marra said the team wrote programs in C, Perl, and shell script that could handle far more data than popular commercially available sequence analysis packages such as GCG, Sequencher, and Staden, which are tailored to small amounts of sequence.
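Marra did not detail how the team's programs coped with the data volume. One common approach, sketched here in Python rather than the C, Perl, and shell the team used, is to stream records one at a time instead of loading an entire file into memory. The FASTA-style input format, the function names, and the sample records are assumptions made for this example and do not come from the project.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record once the stream is exhausted.
    if header is not None:
        yield header, "".join(chunks)


def summarize(handle):
    """Count records and total bases while holding only one record in memory."""
    n_records = 0
    total_bases = 0
    for _header, seq in read_fasta(handle):
        n_records += 1
        total_bases += len(seq)
    return n_records, total_bases


# Hypothetical two-record input; a real run would pass an open file instead.
EXAMPLE = io.StringIO(">est_1\nACGTAC\nGT\n>est_2\nTTGACA\n")
n_records, total_bases = summarize(EXAMPLE)
print(n_records, "records,", total_bases, "bases")
```

Because the generator holds a single record at a time, memory use stays flat whether the input contains two sequences or several hundred thousand.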
Marra explained that by accessing dbEST on the internet, researchers can use Blast, the most heavily used gene-finding tool in the academic community. Entering a sequence once in dbEST prompts the associated Blast software to report a wide range of information, including the topology, extent, and significance of the sequence matches obtained.
While Blast and the commercially available sequence analysis programs might be sufficient for the moment, many scientists feel that new tools will soon be necessary. Shirley Tilghman, a mouse geneticist at Princeton University who is one of the leaders of the recently initiated mouse genome sequencing project, said that "today's tools have not yet been challenged with the large amount of information that will be generated now that the mouse EST database is available."
She added, "The one area of particular concern will be comparative sequence analysis, i.e., finding the common features of mouse and human and representing them in a way that will be most useful to the community."