Bioinformatics Tool-Related Papers of Note, April 2010
Fröhler S, Dieterich C. ACCUSA — Accurate SNP calling on draft genomes. [Bioinformatics. 2010 Apr 1. (e-pub ahead of print)]: Describes SNP-calling software for draft genomes that considers both the quality of the reads as well as the quality of the reference genome in order to call SNPs. According to the authors, current SNP callers are designed for high-quality assemblies of model organisms, and therefore do not need to consider the quality of the reference genome, but these packages are inadequate for draft genomes. Available here.
Hoffman MM, Buske OJ, Noble WS. The Genomedata format for storing large-scale functional genomics data. [Bioinformatics. 2010 Apr 29. (e-pub ahead of print)]: Introduces a format for "efficient storage of multiple tracks of numeric data anchored to a genome," according to the paper's abstract. The format allows fast random access to hundreds of gigabytes of data, while retaining a small disk space footprint, the abstract states. The authors demonstrate that retrieving data from Genomedata format is "more than 2,900 times faster than a naive approach using wiggle files." Available here.
Huang H, Liu CC, Zhou XJ. Bayesian approach to transforming public gene expression repositories into disease diagnosis databases. [Proc Natl Acad Sci USA. 2010 Apr 13;107(15):6823-8]: Describes a framework, based on a two-stage Bayesian learning approach, for diagnosing one or more diseases based on a query expression profile along a hierarchical disease taxonomy. The method allows users to analyze both sources of information "in a unified probabilistic system" and provides "a high level of overall diagnostic accuracy," according to the paper's abstract.
Johannes F, Wardenaar R, Colomé-Tatché M, Mousson F, de Graaf P, Mokry M, Guryev V, Timmers HT, Cuppen E, Jansen RC. Comparing genome-wide chromatin profiles using ChIP-chip or ChIP-seq. [Bioinformatics. 2010 Apr 15;26(8):1000-6]: Introduces a method for comparing multiple chromatin immunoprecipitation data sets on a global or locus-specific scale. The authors present a parametric classification approach for simultaneously analyzing two or more ChIP samples that demonstrates "efficient scalability and application to three very diverse ChIP-chip and ChIP-seq experiments," according to the paper's abstract. Available here.
Kraus JM, Kestler HA. A highly efficient multi-core algorithm for clustering extremely large datasets. [BMC Bioinformatics. 2010 Apr 6;11(1):169]: Describes a multi-core parallelization of the k-means and k-modes clustering algorithms. According to the paper's abstract, computational speed on large data sets increased by a factor of 10 compared with single-core implementations and a recently published network-based parallelization. Available here.
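As a rough illustration of how a k-means assignment step can be split across workers, the following Python sketch divides the data points into chunks and labels each chunk concurrently. It is not the authors' implementation: the function names (`nearest`, `assign_chunk`, `parallel_kmeans`) are hypothetical, and a thread pool is used only to keep the example self-contained. In CPython, a process pool or native code would be needed for a real multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def assign_chunk(chunk, centroids):
    """Label every point in one chunk; chunks are independent of each other."""
    return [nearest(point, centroids) for point in chunk]


def parallel_kmeans(points, centroids, workers=4, iterations=10):
    """Hypothetical sketch: k-means with the assignment step split by chunk."""
    centroids = [tuple(c) for c in centroids]
    size = max(1, -(-len(points) // workers))  # ceiling division
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    labels = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iterations):
            # Assignment step: each worker labels its own chunk of points.
            parts = pool.map(assign_chunk, chunks, [centroids] * len(chunks))
            labels = [label for part in parts for label in part]
            # Update step: recompute each centroid as the mean of its members.
            updated = []
            for k, old in enumerate(centroids):
                members = [p for p, lab in zip(points, labels) if lab == k]
                if members:
                    updated.append(
                        tuple(sum(dim) / len(members) for dim in zip(*members))
                    )
                else:
                    updated.append(old)  # keep an empty cluster's centroid
            if updated == centroids:
                break  # converged
            centroids = updated
    return centroids, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = parallel_kmeans(points, [(0, 0), (10, 10)], workers=2)
```

The assignment step parallelizes cleanly because each point's label depends only on the current centroids, not on other points.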
Matos S, Arrais JP, Maia-Rodrigues J, Oliveira JL. Concept-based query expansion for retrieving gene related publications from MEDLINE. [BMC Bioinformatics. 2010 Apr 28;11(1):212]: Discusses QuExT, a PubMed-based document-retrieval and prioritization tool that can take a list of genes and then search for the most relevant results from the literature. QuExT follows a "concept-oriented query expansion methodology to find documents containing concepts related to the genes in the user input, such as protein and pathway names," according to the paper's abstract. The retrieved documents are ranked according to user-definable weights assigned to each concept class. Users can change these weights to modify the ranking of the results in order to focus on documents dealing with a specific concept. Available here.
Matsuoka Y, Ghosh S, Kikuchi N, Kitano H. Payao: A Community Platform for SBML Pathway Model Curation. [Bioinformatics. 2010 Apr 5. (e-pub ahead of print)]: Presents Payao, a collaborative web service platform for gene-regulatory and biochemical pathway model curation. Payao reads models in Systems Biology Markup Language format and displays them with the CellDesigner process diagram editor, which complies with the Systems Biology Graphical Notation and provides an interface for model enrichment, such as adding tags and comments to the models. Available here.
Miller AK, Marsh J, Reeve A, Garny A, Britten R, Halstead M, Cooper J, Nickerson DP, Nielsen PF. An overview of the CellML API and its implementation. [BMC Bioinformatics. 2010 Apr 8;11(1):178]: Introduces an application programming interface for CellML, an XML-based language for representing mathematical models. "Due to some of the more complex features present in CellML models, such as imports, developing code ab initio to correctly process models can be an onerous task," according to the paper's abstract. "For this reason, there is a clear and pressing need" for an API. Available here.
Orvis J, Crabtree J, Galens K, Gussman A, Inman JM, Lee E, Nampally S, Riley D, Sundaram JP, Felix V, Whitty B, Mahurkar A, Wortman J, White O, Angiuoli SV. Ergatis: A web interface and scalable software system for bioinformatics workflows. [Bioinformatics. 2010 Apr 22. (e-pub ahead of print)]: Discusses a workflow-management system called Ergatis that enables users to build, execute, and monitor pipelines for computational analysis of genomics data. Ergatis contains pre-configured components and template pipelines for common bioinformatics tasks such as prokaryotic genome annotation and genome comparisons. Available here.
Prlic A, Martinez MA, Dimitropoulos D, Beran B, Yukich BT, Rose PW, Bourne PE, Fink JL. Integration of open access literature into the RCSB Protein Data Bank Using BioLit. [BMC Bioinformatics. 2010 Apr 29;11(1):220]: Describes a project called BioLit, which aims to "exploit" the fact that the distinction between online databases and online literature is "blurring," according to the paper's abstract. BioLit provides an "enhanced view" of articles with markup of semantic data and links to biological databases, based on the content of the article. Words that match biological ontologies are highlighted, and database identifiers are linked to their database of origin. BioLit also identifies PDB IDs that are mentioned in the open access literature by parsing the full text of all research articles in PubMed Central and providing the results as XML web services. Available here.
Rocha I, Maia P, Evangelista P, Vilaça P, Soares S, Pinto JP, Nielsen J, Patil KR, Ferreira EC, Rocha M. OptFlux: an open-source software platform for in silico metabolic engineering. [BMC Syst Biol. 2010 Apr 19;4:45]: Presents a user-friendly computational tool for metabolic engineering applications. The tool, called OptFlux, provides metabolic models for phenotype simulation of wild-type and mutant organisms; metabolic flux analysis; and pathway analysis through the calculation of elementary flux modes. Available here.
Sangket U, Mahasirimongkol S, Chantratita W, Tandayya P, Aulchenko YS. ParallABEL: an R library for generalized parallelization of genome-wide association studies. [BMC Bioinformatics. 2010 Apr 29;11(1):217]: Introduces an R library for parallelizing genome-wide association analysis. According to the paper's abstract, most components of GWA analysis can be divided into four groups based on the types of input data and statistical outputs: the first group contains statistics computed for a particular SNP or trait; the second group includes statistics characterizing an individual in a study; the third is pair-wise statistics from analyses between each pair of individuals in the study; and the fourth is pair-wise statistics derived for pairs of SNPs. The library, called ParallABEL, parallelizes all four types of computations. In one example, computing time was reduced linearly from around eight hours to one hour when ParallABEL ran on eight processors. Available here.
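To make the grouping described above more concrete, here is a minimal Python sketch of two of the four groups: a per-SNP statistic and a pair-wise statistic between individuals. It does not use or reproduce ParallABEL's R interface. The helper names (`allele_frequency`, `ibs_similarity`, `parallel_map`), the toy genotype matrix, and the choice of statistics are all assumptions made for illustration, and the thread pool stands in for whatever parallel back end a real analysis would use.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def allele_frequency(snp):
    """Per-SNP statistic: frequency of the counted allele (genotypes coded 0/1/2)."""
    return sum(snp) / (2 * len(snp))


def ibs_similarity(pair):
    """Pair-wise statistic for two individuals: mean identity-by-state sharing."""
    a, b = pair
    return sum(2 - abs(x - y) for x, y in zip(a, b)) / (2 * len(a))


def parallel_map(func, items, workers=4):
    """Hypothetical helper: apply func to independent work items concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


# Toy data (invented): rows are individuals, columns are SNPs.
genotypes = [
    [0, 1, 2],
    [0, 1, 1],
    [2, 1, 0],
]

# Group one: each work item is one SNP column.
snps = list(zip(*genotypes))
snp_freqs = parallel_map(allele_frequency, snps)

# Group three: each work item is one pair of individuals.
pairs = list(combinations(genotypes, 2))
pair_sims = parallel_map(ibs_similarity, pairs)
```

In both cases the work items are independent, so the same mapping pattern applies; only the unit of work changes (a SNP, an individual, a pair of individuals, or a pair of SNPs).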
Sharma A, Zhao J, Podolsky R, McIndoe RA. ParaSAM: A parallelized version of the significance analysis of microarrays algorithm. [Bioinformatics. 2010 Apr 15. (e-pub ahead of print)]: Discusses a parallelized version of the Significance Analysis of Microarrays algorithm. The method, called ParaSAM, was developed to "overcome the memory limitations" of SAM, according to the paper's abstract. The authors state that ParaSAM "is not only faster than the serial version, but can analyze extremely large datasets that cannot be performed using existing implementations." Available here.
Zhao D, Wang Y, Luo D, Shi X, Wang L, Xu D, Yu J, Liang Y. PMirP: A pre-microRNA prediction method based on structure-sequence hybrid features. [Artif Intell Med. 2010 Apr 14. (e-pub ahead of print)]: Presents a web server for predicting pre-microRNAs that takes into account structure-sequence features and free energy of secondary structures, as well as the double helix structure with free nucleotides and base-pairing features. According to the paper's abstract, the prediction specificity and sensitivity for real and pseudo human pre-microRNAs are as high as 98.4 percent and 94.9 percent, respectively. Available here.
Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metagenomic sequences. [Nucleic Acids Res. 2010 Apr 19. (e-pub ahead of print)]: Discusses an algorithm for identifying genes in shotgun sequence data of microbial communities. The authors describe a refinement of a gene-prediction method that was originally proposed in 1999. "With the advent of new prokaryotic genomes en masse it became possible to enhance the original approach by using direct polynomial and logistic approximations of oligonucleotide frequencies, as well as by separating models for bacteria and archaea," according to the abstract. "These advances have increased the accuracy of model reconstruction and, subsequently, gene prediction." Available here.