CAPE TOWN, South Africa--On March 12, the South African National Bioinformatics Institute here, together with the National Center for Genome Resources in Santa Fe, NM, launched the Sequence Tag Alignment and Consensus Knowledgebase (STACK). The organization of expressed human gene sequences contained in the new public database will be critical to researchers seeking a unified view of genes being discovered by the human genome project: it will allow them to process gene fragments, detect errors, and create carefully joined sets of consensus sequences for each gene sequence, the centers announced.
Winston Hide, director of the South African institute, told BioInform that a team of bioinformatics experts here and in Durban, South Africa, developed STACK to "provide an independent resource for the analysis of disease gene candidates, alignments, and consensus sequences."
Hide said the system is already in use at Harvard, Cambridge, Yale, Oxford, Baylor College of Medicine, the Pasteur Institute, and a number of other research facilities. Now, database and biology experts in Santa Fe are making the database--along with custom computer tools for analyzing the information--available to the public via the center's Genome Sequence Database.
Based on a novel method that Hide and research associate Robert Miller created for processing a database of publicly available human expressed sequence tags, Hide said STACK can be easily integrated to answer questions about gene expression, gene hunting, and polymorphisms. He and other scientists at the South African institute devised portable tools and used an in-house system running on a Silicon Graphics Origin2000 multiprocessor server to cluster sequences and to build alignments and consensus sequences from the individual sequences. Algorithms used to generate the database include efficient error-compensation methods that can create longer, more accurate consensus sequences.
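To illustrate the general idea of deriving a consensus from aligned fragments, the following is a minimal Python sketch. It is not STACK's algorithm, which the article does not describe in detail. It assumes the reads in a cluster have already been aligned to equal length with "-" marking gaps, and it uses a simple per-column majority vote; the function name `consensus` and the sample reads are hypothetical.

```python
from collections import Counter


def consensus(aligned_reads, gap="-", ambiguous="N"):
    """Return a majority-vote consensus for pre-aligned, equal-length reads.

    Illustrative assumption only: real EST pipelines use more elaborate
    error compensation than a per-column vote.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")

    result = []
    # zip(*reads) walks the alignment one column at a time.
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        if not counts:
            continue  # every read has a gap here, so emit nothing
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            result.append(ambiguous)  # tie between the top two bases
        else:
            result.append(ranked[0][0])
    return "".join(result)


# Hypothetical cluster of three overlapping fragments.
reads = [
    "ACGTAC-T",
    "ACGAACGT",
    "-CGTACGT",
]
print(consensus(reads))
```

In this toy cluster, the single mismatch in the second read is outvoted by the other two, which is the intuition behind using more sequences per cluster to obtain a more reliable consensus.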
Quality and usefulness
STACK relies partly on the South African institute's SANIGENE database, which Hide said contains "clean contributed sequences." In fact, the expressed gene sequences in STACK have a high degree of qualification, Hide said. "The problem with most databases is, you have no idea of the quality of the data," he said.
Sequence quantity also makes the new database valuable, Hide claimed. "What makes STACK better than any other clustered database is that it has more sequences in each cluster so that the virtual genes that result are more accurate reflections of the composition of each tissue." Consequently, users will be better able to understand genes' relationships to disease, he said.
The database can be used to generate reports on number of sequences, specific error types, substitutions, consensus generation statistics, and more, Hide said.
Carol Harger, manager of the National Center for Genome Resources' Genome Sequence Database, said STACK is an example of how data generated by multiple laboratories can be analyzed and organized to produce a higher-utility data resource. "We see the generated consensus sequences and other sequence relationships, and the associated data quality information in the STACK database as a very valuable resource not only for scientists studying a particular disease or biological process, but for many other groups as well," she said.
For instance, Harger said, "STACK will be useful for pharmaceutical companies hoping to target drugs more quickly and with less effort."
The new dataset also enriches the context in which researchers can compare newly discovered sequences, she said. "STACK can be used to differentiate between different members of the same gene family or between alternate products of one gene."
Organization and accessibility
STACK's bank of expressed gene sequences is uniquely organized according to tissue and disease expression, giving researchers interested in particular tissues and diseases access to a presorted and specialized dataset, Hide explained. For example, Hide said a University of Texas researcher studying retinitis pigmentosa--genetic blindness--required a dedicated database of genes expressed only in the eye. He uses STACK to screen the genes he discovers for eye-disease candidates.
STACK is also easily accessible, Hide said. As a public database, it satisfies a growing need to make the increasingly large volume of gene fragment data easier and more efficient to use in the analysis of human genes.
"The database pulls together a lot of separate pieces into a more complete whole, and that serves the entire research community," Harger said.
The National Center for Genome Resources' Genome Sequence Database can be found at http://www.ncgr.org/gsdb. The sequences also may be searched at the institute's web site, http://www.sanbi.ac.za/stack.