Note: In addition to the below listing, papers for Nucleic Acids Research’s annual database issue, which will be published in January, are available under advance access here.
Baker EJ, Lin GN, Liu H, Kosuri R. NFU-Enabled FASTA: moving bioinformatics applications onto wide area networks. [Source Code Biol Med. 2007 Nov 26;2(1):8]: Describes a software program that enables data storage and computation as a shared network resource. Specifically, the software uses the network function unit-enabled Internet Backplane Protocol to distribute the FASTA algorithm and appropriate data sets within the framework of a wide area network. According to the authors, for large datasets, “computation-enabled logistical networks provide a significant reduction in FASTA algorithm running time over local and non-distributed logistical networking frameworks.”
Barbano PE, Spivak M, Flajolet M, Nairn AC, Greengard P, Greengard L. A mathematical tool for exploring the dynamics of biological networks. [Proc Natl Acad Sci USA. 2007 Nov 21 (e-pub ahead of print)]: Describes an approach for studying dynamical biological networks that is based on combining large-scale numerical simulation with nonlinear “dimensionality reduction” methods, according to the paper’s abstract. The approach allowed the authors to “detect robust features of the system in the presence of noise,” the abstract states. In particular, they found that the entire topology of a network is necessary to impart stability to one portion of the network at the expense of the rest. “This could have significant implications for systems biology, in that large, complex pathways may have properties that are not easily replicated with simple modules,” the authors note.
Bare JC, Shannon PT, Schmid AK, Baliga NS. The Firegoose: two-way integration of diverse data from different bioinformatics web resources with desktop applications. [BMC Bioinformatics 2007, 8:456]: Introduces the Firegoose, a Mozilla Firefox extension for integrating bioinformatics data from diverse sources. Firegoose is able to exchange data with Cytoscape, the R statistical package, Multiexperiment Viewer, and several other desktop software tools, according to the authors. Firegoose also enables researchers to use local data to query KEGG, EMBL String, DAVID, and other bioinformatics web sites.
Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sánchez Alvarado A, Yandell M. MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. [Genome Res. 2007 Nov 19 (e-pub ahead of print)]: Introduces a configurable genome annotation pipeline called MAKER, which identifies repeats, aligns ESTs and proteins to a genome, produces ab initio gene predictions, and automatically synthesizes data into gene annotations with evidence-based quality indices. As proof of principle, the authors used MAKER to annotate the genome of the planarian Schmidtea mediterranea and to create a new genome database, SmedGD.
Esteban DJ, Syed A, Upton C. Organizing and Updating Whole Genome BLAST Searches with ReHAB. [Methods Mol Biol. 2007;395:187-94]: Describes ReHAB (Recent Hits Acquired from Blast), a tool for tracking new protein hits in repeated PSI-Blast searches. According to the authors, ReHAB is designed to “simplify the analysis of large numbers of database matches and is therefore especially suited to comparative genomics.”
Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG. Clustal W and Clustal X version 2.0. [Bioinformatics 2007 23(21):2947-2948]: Describes an update of the Clustal W and Clustal X multiple sequence alignment programs, which have been completely rewritten in C++. “This will facilitate the further development of the alignment algorithms in the future and has allowed proper porting of the programs to the latest versions of Linux, Macintosh and Windows operating systems,” according to the paper’s abstract. Availability: http://www.ebi.ac.uk/tools/clustalw2.
Smith A, Cheung K, Krauthammer M, Schultz M, Gerstein M. Leveraging the structure of the Semantic Web to enhance information retrieval for proteomics. [Bioinformatics 2007 23(22):3073-3079]: Describes a project to use semantic web technologies to retrieve proteomics information from the web and the biomedical literature. The approach uses an RDF (resource description framework) graph that inter-relates documents through their associated biological identifiers, such as a protein ID. “A search begins with a simple query term (UniProt identifier), which is expanded with terms extracted from documents in the RDF graph surrounding the query,” which is called the “subgraph,” the authors note. The method also uses inverse document frequency (IDF) to rescale local word frequencies in the subgraph relative to those in other subgraphs. “Using a subgraph containing family relationships (from PFAM) results in a significant improvement in accuracy (as compared to not considering the subgraph in the search) when assessed against known relationships in the yeast literature,” the authors note.
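As a rough illustration of the IDF rescaling mentioned in the Smith et al. summary, the sketch below weights each word in one subgraph by its local count times the log of how rare the word is across all subgraphs. It is not the authors' implementation: the function name, the toy word lists, and the identifiers used as keys are made up for this example, and the paper may use a different IDF variant.

```python
import math
from collections import Counter


def idf_weighted_terms(subgraphs, target):
    """Return {word: tf * idf} for the words of one subgraph.

    subgraphs: dict mapping a (hypothetical) query ID to the list of
               words extracted from the documents in its subgraph.
    target:    key of the subgraph whose words should be scored.
    """
    n = len(subgraphs)
    # Document frequency: in how many subgraphs does each word occur?
    df = Counter()
    for words in subgraphs.values():
        df.update(set(words))
    # Local term frequency within the target subgraph.
    tf = Counter(subgraphs[target])
    # Words found in every subgraph get weight 0; rare words are boosted.
    return {w: tf[w] * math.log(n / df[w]) for w in tf}


# Toy data, invented for illustration only.
subgraphs = {
    "P12345": ["kinase", "binding", "kinase", "protein"],
    "Q00001": ["transport", "protein", "membrane"],
    "Q00002": ["protein", "binding", "dna"],
}
scores = idf_weighted_terms(subgraphs, "P12345")
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{score:.3f}")
```

In this toy case "protein" appears in every subgraph and so contributes nothing to query expansion, while "kinase" is specific to the target subgraph and ranks first.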
Liu Q, Dinu I, Adewale AJ, Potter JD, Yasui Y. Comparative evaluation of gene-set analysis methods. [BMC Bioinformatics 2007, 8:431]: Describes the comparison of three methods for evaluating gene-expression levels in specific biological pathways: Global Test, ANCOVA Global Test, and SAM-GS. The authors compared the methods “based on a simulation experiment and analyses of three real-world microarray datasets,” according to the paper’s abstract. The study found that “appropriate standardization makes the performance of all three methods similar, given the use of permutation-based inference.” SAM-GS had “slightly higher power in the lower alpha-level region (i.e. gene sets that are of the greatest interest).” The Global Test and ANCOVA Global Test, however, were better able to analyze continuous and survival phenotypes and to adjust for covariates.
Livny J. Efficient Annotation of Bacterial Genomes for Small, Noncoding RNAs Using the Integrative Computational Tool sRNAPredict2. [Methods Mol Biol. 2007;395:475-88]: Introduces sRNAPredict2, a program for predicting putative sRNA-encoding genes in the intergenic regions of bacterial genomes. According to the paper, while “several bioinformatic approaches have proven effective in identifying bacterial sRNAs, implementing these approaches presents significant computational challenges that have limited their use.” sRNAPredict2 identifies putative sRNAs by integrating genome-wide predictions of genetic features that are commonly associated with sRNA-encoding genes and identifying instances in which these features are colocalized in intergenic regions of the genome.
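The colocalization idea in the sRNAPredict2 summary can be sketched as an interval search: keep pairs of predicted features that fall inside the same intergenic region and lie close to each other. This is only a minimal sketch under stated assumptions, not the program's actual algorithm. The two feature sets, the `max_gap` threshold, and all coordinates are hypothetical, since the summary does not say which features or distances the tool uses.

```python
def colocalized(intergenic, feats_a, feats_b, max_gap=20):
    """Find candidate loci where two feature types co-occur.

    intergenic: list of (start, end) intergenic regions.
    feats_a, feats_b: lists of (start, end) predicted features of two
                      hypothetical types.
    max_gap: assumed maximum distance allowed between the two features.
    Returns a list of (start, end) spans covering each qualifying pair.
    """
    hits = []
    for ig_start, ig_end in intergenic:
        # Keep only features lying entirely within this intergenic region.
        a_in = [f for f in feats_a if ig_start <= f[0] and f[1] <= ig_end]
        b_in = [f for f in feats_b if ig_start <= f[0] and f[1] <= ig_end]
        for a in a_in:
            for b in b_in:
                # Distance between the intervals (negative if they overlap).
                gap = max(a[0], b[0]) - min(a[1], b[1])
                if gap <= max_gap:
                    hits.append((min(a[0], b[0]), max(a[1], b[1])))
    return hits


# Invented coordinates, for illustration only.
intergenic = [(100, 500), (900, 1200)]
feats_a = [(150, 250), (950, 1000)]
feats_b = [(260, 280), (1150, 1180)]
candidates = colocalized(intergenic, feats_a, feats_b)
print(candidates)
```

Here only the first intergenic region yields a candidate, because the two features in the second region are farther apart than the assumed threshold.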
Post LJ, Roos M, Marshall MS, van Driel R, Breit TM. A semantic web approach applied to integrative bioinformatics experimentation: a biological use case with genomics data. [Bioinformatics 2007 23(22):3080-3087]: Describes a semantic web-enabled data integration, or SWEDI, approach to integrating biological data. The approach “aims to formalize biological domains by capturing the knowledge in semantic models using ontologies as controlled vocabularies,” according to the paper’s abstract. The approach builds a collection of relatively small but specific knowledge and data models, which together form a “personal semantic framework” that can be linked to external large, general knowledge and data models. Availability: http://www.integrativebioinformatics.nl/swedi/index.html.
Siepel A, Diekhans M, Brejová B, Langton L, Stevens M, Comstock CL, Davis C, Ewing B, Oommen S, Lau C, Yu HC, Li J, Roe BA, Green P, Gerhard DS, Temple G, Haussler D, Brent MR. Targeted discovery of novel human exons by comparative genomics. [Genome Res. 2007 Nov 7 (e-pub ahead of print)]: Describes a study carried out as part of the Mammalian Gene Collection project to identify human genes not yet in the publicly available gene catalogs. The authors developed a method to predict genes using algorithms that rely on comparative sequence data but do not require direct cDNA evidence. They tested predicted novel genes by RT-PCR. Using this approach, the authors identified 734 novel gene fragments containing 2,188 exons with “weak prior cDNA support,” according to the abstract. Of the novel gene fragments, 563 were deemed “distinct genes,” of which around 160 are “completely absent from the major gene catalogs, while hundreds of others represent significant extensions of known genes.”