Amigo J, Salas A, Phillips C, Carracedo A. SPSmart: adapting population based SNP genotype databases for fast and comprehensive web access. 18847484 [BMC Bioinformatics. 2008 Oct 10;9:428]: Introduces SPSmart (SNPs for Population Studies), a tool for accessing and combining large-scale SNP databases for human population genetics. SPSmart creates a data mart from the most commonly accessed databases of genotypes containing population information. The data is mined, summarized into statistical reference indices, and stored in a relational database that currently handles up to 4 billion genotypes. Available here.
Daub J, Gardner PP, Tate J, Ramsköld D, Manske M, Scott WG, Weinberg Z, Griffiths-Jones S, Bateman A. The RNA WikiProject: Community annotation of RNA families. 18945806 [RNA. 2008 Oct 22. (e-pub ahead of print)]: Describes the RNA WikiProject, part of the larger Molecular and Cellular Biology WikiProject that includes more than 600 Wikipedia articles describing families of noncoding RNAs based on the Rfam database. Since Rfam currently redistributes the Wikipedia content as the primary textual annotation of its RNA families, users can now directly edit the content of the database, the authors note in the abstract, adding that the Wikipedia/Rfam link “acts as a functioning model for incorporating community annotation into molecular biology databases.” Available here.
Delcher AL, Koren S, Miller JR, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G. Aggressive assembly of pyrosequencing reads with mates. 18952627 [Bioinformatics. 2008 Oct 28. (e-pub ahead of print)]: Describes a modified version of the Celera Assembler that handles combinations of ABI 3730 and 454 FLX reads. The revised pipeline, called CABOG (Celera Assembler with the Best Overlap Graph), “is robust to homopolymer run length uncertainty, high read coverage, and heterogeneous read lengths,” according to the paper’s abstract. In tests on four genomes, CABOG generated the longest contigs among all assemblers tested. Available here.
Descorps-Declere S, Barba M, Labedan B. Matching curated genome databases: a non trivial task. 18950477 [BMC Genomics. 2008 Oct 24;9(1):501]: Introduces CorBank, a program that provides cross-referencing protein identifiers for the RefSeq and Genome Reviews curated databases. These databases were designed independently “to cope with non-standard annotation” in the sequenced genome, the authors note in the paper’s abstract, adding that this “uncoordinated effort” has had two unwanted consequences: it is difficult to map the protein identifiers of the same sequence in both databases, and the two reannotated versions of the same genome differ at the level of their structural annotation. CorBank was designed to address this problem and “allows easy search of cross-references between RefSeq, Genome Reviews, and UniProt, for either a single CDS or a whole replicon.” Available here.
Gilchrist MJ, Christensen MB, Harland R, Pollet N, Smith JC, Ueno N, Papalopulu N. Evading the annotation bottleneck: using sequence similarity to search non-sequence gene data. 18928517 [BMC Bioinformatics. 2008 Oct 17;9(1):442]: Discusses a method for retrieving non-sequence gene data, such as images and the literature, based on sequence similarity, “which removes dependence on annotation and text searches,” according to the paper’s abstract. Non-sequence gene data is found in many public databases, but access to this data currently depends on gene names. “However, gene annotation is neither complete, nor fully systematic between organisms, and is also not generally stable over time,” the authors note. The sequence similarity-based approach “facilitates cross-species comparisons, and enables the handling of novel or otherwise un-annotated genes.”
Herrgård MJ, Swainston N, Dobson P, et al. A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. 18846089 [Nat Biotechnol. 2008 Oct;26(10):1155-60]: Discusses the creation of a consensus metabolic network reconstruction for Saccharomyces cerevisiae. Several genome-scale network reconstructions describe S. cerevisiae metabolism, but “they differ in scope and content, and use different terminologies to describe the same chemical entities,” the authors note in the paper’s abstract. “This makes comparisons between them difficult and underscores the desirability of a consolidated metabolic network that collects and formalizes the 'community knowledge' of yeast metabolism.” In drafting the consensus network, the authors “placed special emphasis on referencing molecules to persistent databases or using database-independent forms, such as SMILES or InChI strings, as this permits their chemical structure to be represented unambiguously and in a manner that permits automated reasoning.” Available here.
Hon G, Ren B, Wang W. ChromaSig: A probabilistic approach to finding common chromatin signatures in the human genome. 18927605 [PLoS Comput Biol. 2008 Oct;4(10):e1000201]: Introduces an unsupervised learning method called ChromaSig that finds commonly occurring chromatin signatures in both tiling microarray and sequencing data. By applying the algorithm to nine chromatin marks across a 1 percent sampling of the human genome in HeLa cells, the authors report that they recovered eight clusters of distinct chromatin signatures, five of which correspond to known patterns associated with transcriptional promoters and enhancers, and three clusters of novel chromatin signatures that contain evolutionarily conserved sequences and potential cis-regulatory elements. Available here.
Kessner D, Chambers M, Burke R, Agus D, Mallick P. ProteoWizard: open source software for rapid proteomics tools development. 18606607 [Bioinformatics 2008 24(21):2534-2536]: Describes ProteoWizard, a modular and extensible set of open-source, cross-platform tools and libraries for proteomics data analyses. The libraries support rapid tool creation by providing a development framework that “simplifies and unifies data file access, and performs standard proteomics and LC-MS dataset computations,” according to the paper’s abstract. Available here.
Li W, Wooley JC, Godzik A. Probing metagenomics by rapid cluster analysis of very large datasets. 18846219 [PLoS ONE. 2008;3(10):e3375]: Describes an approach for rapidly analyzing the sequence diversity and the internal structure of very large metagenomics datasets. The method relies on a modified version of the CD-HIT clustering algorithm. Using data from the Sorcerer II Global Ocean Sampling study, the new clustering method “took about two orders of magnitude less computational effort than the similar protein family analysis of original GOS study.” Available here.
Ondov B, Varadarajan A, Passalacqua KD, Bergman NH. Efficient mapping of Applied Biosystems SOLiD sequence data to a reference genome for functional genomic applications. 18842598 [Bioinformatics. 2008 Oct 7. (e-pub ahead of print)]: Describes SOCS (Short Oligonucleotide Color Space), a program designed for mapping Applied Biosystems SOLiD sequence data onto a reference genome. Available here.
Pencheva T, Lagorce D, Pajeva I, Villoutreix BO, Miteva MA. AMMOS: Automated molecular mechanics optimization tool for in silico screening. 18925937 [BMC Bioinformatics. 2008 Oct 16;9(1):438]: Describes AMMOS, a tool for refining the 3D structures of small molecules present in chemical libraries as well as predicted receptor-ligand complexes. The method allows partial to full atom flexibility through molecular mechanics optimization in order to overcome several challenges in virtual screening, such as structural optimization of compounds in a screening library, receptor flexibility/induced-fit, and accurate prediction of protein-ligand interactions, according to the paper’s abstract. Available here.
Price MN, Dehal PS, Arkin AP. FastBLAST: Homology relationships for millions of proteins. 18974889 [PLoS ONE. 2008;3(10):e3589]: Introduces FastBLAST, a heuristic replacement for all-versus-all BLAST that relies on alignments of proteins to known families, obtained from tools such as PSI-BLAST and HMMer. FastBLAST “avoids most of the work of all-versus-all BLAST by taking advantage of these alignments and by clustering similar sequences,” the paper’s abstract states. FastBLAST runs in two stages: the first stage identifies additional families and aligns them, and the second stage identifies the homologs of a query sequence, based on the alignments of the families, before generating pairwise alignments. On 6.53 million proteins from the non-redundant GenBank database, FastBLAST identified new families 25 times faster than all-versus-all BLAST, according to the authors. Once the first stage is completed, FastBLAST identifies homologs for the average query in less than 5 seconds, which is 8.6 times faster than BLAST, with “nearly identical” results. Available here.
Rougemont J, Amzallag A, Iseli C, Farinelli L, Xenarios I, Naef F. Probabilistic base calling of Solexa sequencing data. 18851737 [BMC Bioinformatics. 2008 Oct 13;9:431]: Proposes a base-calling algorithm for the Illumina Genome Analyzer that uses model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. The authors note in the paper’s abstract that when compared with Illumina’s data processing pipeline, the method improves genome coverage and the number of usable tags by an average of 15 percent.
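To illustrate what coding an ambiguous base with an IUPAC symbol can look like, here is a minimal Python sketch. It is not the authors' algorithm: the function name `call_base`, the 0.9 threshold, and the rule of taking the smallest set of bases whose combined probability reaches that threshold are all assumptions made for illustration. Only the IUPAC nucleotide codes themselves are standard.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def call_base(probs, threshold=0.9):
    """Return an IUPAC symbol for one sequencing position.

    probs maps each of A, C, G, T to its estimated probability.
    Hypothetical rule: add bases from most to least probable until
    their cumulative probability reaches the threshold.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    chosen, total = set(), 0.0
    for base, p in ranked:
        chosen.add(base)
        total += p
        if total >= threshold:
            break
    return IUPAC[frozenset(chosen)]


if __name__ == "__main__":
    confident = {"A": 0.95, "C": 0.02, "G": 0.02, "T": 0.01}
    ambiguous = {"A": 0.50, "G": 0.45, "C": 0.03, "T": 0.02}
    print(call_base(confident), call_base(ambiguous))
```

Under this rule a confidently called position keeps its single-letter base, while a position split between two bases is reported with the two-base code instead of being forced to one call or discarded.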
Schulz-Trieglaff O, Pfeifer N, Gröpl C, Kohlbacher O, Reinert K. LC-MSsim — a simulation software for liquid chromatography mass spectrometry data. 18842122 [BMC Bioinformatics. 2008 Oct 8;9:423]: Describes LC-MSsim, a simulation software for LC-MS experiments that is intended to help software developers compare algorithms for analyzing LC-MS data. “So far, curated benchmark data exists only for peptide identification algorithms but no data that represents a ground truth for the evaluation of feature detection, alignment, and filtering algorithms,” the authors note in the paper’s abstract. LC-MSsim reads a list of proteins from a FASTA file, digests the protein mixture using a user-defined enzyme, and creates an LC-MS data set using a predictor for the retention time of the peptides and a model for peak shapes and elution profiles of the mass spectral peaks. Available here.
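The first two steps described above (reading proteins from a FASTA file and digesting them with a chosen enzyme) can be sketched in a few lines of Python. This is not LC-MSsim code: the function names `read_fasta` and `digest` are hypothetical, and the default rule (cleave after K or R unless the next residue is P) is the commonly used trypsin convention, assumed here for illustration. The retention-time predictor and peak-shape model are not covered.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the record name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def digest(sequence, cleave_after="KR", blocked_by="P"):
    """Split a protein sequence into peptides.

    Cleaves after any residue in cleave_after unless the following
    residue is in blocked_by (assumed trypsin-like rule by default).
    """
    cleave_set, blocked_set = set(cleave_after), set(blocked_by)
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1 : i + 2]  # empty string at the end
        if residue in cleave_set and next_residue not in blocked_set:
            peptides.append(sequence[start : i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


if __name__ == "__main__":
    fasta = ">prot1 example protein\nAKRPGK\nDE\n"
    for name, seq in read_fasta(fasta).items():
        print(name, digest(seq))
```

Passing different `cleave_after` and `blocked_by` values stands in for the user-defined enzyme; a simulator would then assign each resulting peptide a retention time and a peak shape.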
Zhang Z, Cheung KH, Townsend JP. Bringing Web 2.0 to bioinformatics. 18842678 [Brief Bioinform. 2008 Oct 8. (e-pub ahead of print)]: Proposes the “Web 2.0-based Scientific Social Community” model, which would support the use of Web 2.0 technologies to enhance bioinformatics research. “By establishing a social, collective, and collaborative platform for data creation, sharing, and integration, we promote a web services-based pipeline featuring web services for computer-to-computer data exchange as users add value,” the authors write in the paper’s abstract.
Zhu Y, Davis S, Stephens R, Meltzer PS, Chen Y. GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus (GEO). 18842599 [Bioinformatics. 2008 Oct 17. (e-pub ahead of print)]: Introduces GEOmetadb, a search engine that was developed to make querying metadata in the Gene Expression Omnibus “both easier and more powerful,” according to the paper’s abstract. GEOmetadb stores all GEO metadata records as well as the relationships between them in a local MySQL database and offers a web search interface with utilities that provide query capabilities not available via NCBI tools, according to the authors. Available here.