Bioinformatics Tool-Related Papers of Note, June 2010
Note: In addition to the listing below, papers for Nucleic Acids Research's annual web server issue are available here.
Bansal V. A statistical method for the detection of variants from next-generation resequencing of DNA pools. [Bioinformatics. 2010 Jun 15;26(12):i318-i324]: Describes a method called CRISP (Comprehensive Read analysis for Identification of SNPs from Pooled sequencing) that is able to identify both rare and common variants in pooled sequencing data. CRISP is based on two approaches: comparing the distribution of allele counts across multiple pools using contingency tables; and evaluating the probability of observing multiple non-reference base calls due to sequencing errors alone. In a validation study on two separate pooled sequencing datasets generated using the Illumina Genome Analyzer, CRISP was able to detect between 80 percent and 85 percent of SNPs identified using individual sequencing while achieving a false discovery rate of under 5 percent. Available here.
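The second of the two approaches mentioned above, asking how likely it is that several non-reference base calls arise from sequencing errors alone, can be illustrated with a simple binomial tail probability. This is only a minimal sketch and not CRISP's actual implementation: the function name, the read counts, and the 1 percent per-base error rate are all assumptions chosen for illustration.

```python
from math import comb


def prob_at_least_k_errors(n_reads: int, k_alt: int, error_rate: float) -> float:
    """Return P(X >= k_alt) for X ~ Binomial(n_reads, error_rate).

    Interpreted here as the chance of seeing k_alt or more non-reference
    base calls at a site if every such call were a sequencing error.
    """
    if k_alt <= 0:
        return 1.0
    # Sum the probabilities of seeing fewer than k_alt errors, then take the complement.
    below = sum(
        comb(n_reads, i) * error_rate**i * (1 - error_rate) ** (n_reads - i)
        for i in range(k_alt)
    )
    return max(0.0, 1.0 - below)


# Hypothetical pool: 200 reads cover a site, 8 show a non-reference base,
# and the per-base error rate is assumed to be 1 percent.
p_value = prob_at_least_k_errors(200, 8, 0.01)
print(f"P(>= 8 non-reference calls from errors alone) = {p_value:.2e}")
```

A small probability suggests that errors alone are an unlikely explanation for the observed non-reference calls. A real caller would also need to account for base qualities, strand, and the comparison across pools.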
Bonfield JK, Whitwham A. Gap5 — editing the billion fragment sequence assembly. [Bioinformatics. 2010 May 30]: Presents Gap5, a sequence assembly editor designed to scale to the large volumes of data produced by the current generation of DNA sequencers. Gap5 is part of the Staden Package. Available here.
Bozdag S, Li A, Wuchty S, Fine HA. FastMEDUSA: A parallelized tool to infer gene regulatory networks. [Bioinformatics. 2010 May 30]: Introduces FastMEDUSA, a parallelized version of the regulatory network-modeling tool MEDUSA that was designed to construct gene regulatory networks of higher organisms from gene expression and promoter sequence data. FastMEDUSA distributes expression and sequence data among a user-defined number of processors on a single multi-core machine or cluster. The authors demonstrate in the paper that FastMEDUSA can reconstruct a regulatory network of brain tumor in H. sapiens in six hours on 100 processors, as compared to 12 days with MEDUSA. Available here.
Brohée S, Barriot R, Moreau Y. Biological knowledge bases using Wikis: combining the flexibility of Wikis with the structure of databases. [Bioinformatics. 2010 Jun 30]: Describes WikiOpener, an extension to the MediaWiki engine that allows on-the-fly querying and formatting for resources external to the Wiki. "Those resources may provide data extracted from databases or DAS tracks, or even results returned by local or remote bioinformatics analysis tools," according to the paper's abstract. The authors add that the resource "combines the structure of biological databases with the flexibility of collaborative Wikis." Available here.
Dayarian A, Michael TP, Sengupta AM. SOPRA: Scaffolding algorithm for paired reads via statistical optimization. [BMC Bioinformatics. 2010 Jun 24;11(1):345]: Presents SOPRA, a tool that uses mate-pair and paired-end information to improve the assembly of short reads. "The main focus of the algorithm is selecting a sufficiently large subset of simultaneously satisfiable mate pair constraints to achieve a balance between the size and the quality of the output scaffolds," according to the paper's abstract. Available here.
Fiume M, Williams V, Brudno M. Savant: genome browser for high throughput sequencing data. [Bioinformatics. 2010 Jun 20]: Introduces Savant, a desktop visualization and analysis browser for genomic data that was developed specifically for visualizing and analyzing high-throughput sequencing data, "with special care taken to enable dynamic visualization in the presence of gigabases of genomic reads and references the size of the human genome," according to the paper's abstract. Available here.
Forer L, Schoenherr S, Weissensteiner H, Haider F, Kluckner T, Gieger C, Wichmann HE, Specht G, Kronenberg F, Kloss-Brandstaetter A. CONAN: copy number variation analysis software for genome-wide association studies. [BMC Bioinformatics. 2010 Jun 14;11(1):318]: According to the authors, "while several software packages support the determination of [copy number variants] from SNP chip data, the downstream statistical inference of CNV-phenotype associations is still subject to complicated and inefficient in-house solutions, thus strongly limiting the performance of GWAS based on CNVs." In response, they have developed CONAN, a client-server software system that provides a graphical user interface for categorizing, analyzing, and associating CNVs with phenotypes. "CONAN assists the evaluation process by visualizing detected associations via Manhattan plots in order to enable a rapid identification of genome-wide significant CNV regions," the abstract states. Available here.
Guerrero D, Bautista R, Villalobos DP, Canton FR, Claros MG. AlignMiner: a web-based tool for detection of divergent regions in multiple sequence alignments of conserved sequences. [Algorithms Mol Biol. 2010 Jun 2;5(1):24]: Describes AlignMiner, a web-based application for detecting conserved and divergent regions in alignments of conserved sequences, with a particular focus on divergence. The software accepts protein or nucleic acid alignments "obtained using any of a variety of algorithms, which does not appear to have a significant impact on the final results," according to the authors. Available here.
He D, Choi A, Pipatsrisawat K, Darwiche A, Eskin E. Optimal algorithms for haplotype assembly from whole-genome sequence data. [Bioinformatics. 2010 Jun 15;26(12):i183-i190]: Presents a dynamic programming algorithm that addresses the challenge of haplotype assembly by combining sequence fragments from high-throughput sequencing technologies. The authors claim their method "can reduce the haplotype assembly problem into the maximum satisfiability problem that can often be solved optimally even when [the number of reads] is large."
Misawa K, Kamatani N. ParaHaplo 2.0: a program package for haplotype-estimation and haplotype-based whole-genome association study using parallel computing. [Source Code Biol Med. 2010 Jun 4;5(1):5]: Describes ParaHaplo, a set of computer programs for the parallel computation of accurate P values in haplotype-based genome-wide association studies. The program is designed for workstation clusters using the Intel Message Passing Interface. Available here.
Ondov BD, Cochran C, Landers M, Meredith GD, Dudas M, Bergman NH. An alignment algorithm for bisulfite sequencing using the Applied Biosystems SOLiD System. [Bioinformatics. 2010 Jun 18]: According to the authors, the Applied Biosystems SOLiD sequencer's di-base encoding scheme increases confidence in the detection of nucleotide substitutions, making it "a potentially advantageous" platform for bisulfite sequencing. "However, the di-base encoding also makes reads with many nucleotide substitutions difficult to align to a reference sequence with existing tools, preventing the platform's potential utility for bisulfite sequencing from being realized." In response, they have developed SOCS-B, an un-gapped alignment algorithm for the SOLiD that is tolerant of both bisulfite-induced nucleotide substitutions and a parametric number of sequencing errors. Available here.
Ostrovnaya I, Nanjangud G, Olshen AB. A classification model for distinguishing copy number variants from cancer-related alterations. [BMC Bioinformatics. 2010 Jun 2;11(1):297]: The authors describe a prediction model that is able to distinguish somatic copy number alterations from germline copy number variants based on data in the Database of Genomic Variants and other variables, including segment length, height, closeness to a telomere or centromere, and occurrence in other patients.
Shterev ID, Jung SH, George SL, Owzar K. permGPU: Using graphics processing units in RNA microarray association studies. [BMC Bioinformatics. 2010 Jun 16;11(1):329]: Describes permGPU, which uses graphics processing units for microarray association studies. "An extensive simulation study demonstrates a dramatic increase in performance when using permGPU on an NVIDIA GTX 280 card compared to an optimized C solution running on a conventional Linux server," according to the paper's abstract. Available here.
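Association testing of this kind is commonly done by permutation resampling, which is the sort of repetitive workload that benefits from parallel hardware. The sketch below is a plain CPU version for a single probe, offered only to show the idea. It is not permGPU's code, and the function name, the mean-difference statistic, and the toy data are assumptions.

```python
import random
from statistics import mean


def permutation_p_value(expression, labels, n_permutations=2000, seed=0):
    """Estimate a two-sided permutation p-value for one probe.

    expression: expression values, one per sample
    labels: 0/1 group label for each sample
    """
    rng = random.Random(seed)

    def abs_mean_diff(current_labels):
        group1 = [x for x, g in zip(expression, current_labels) if g == 1]
        group0 = [x for x, g in zip(expression, current_labels) if g == 0]
        return abs(mean(group1) - mean(group0))

    observed = abs_mean_diff(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling the labels breaks any real link between group and expression.
        rng.shuffle(shuffled)
        if abs_mean_diff(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Toy data (hypothetical): four samples per group with clearly separated values.
expression = [5.1, 5.3, 4.9, 5.2, 7.8, 8.1, 7.9, 8.3]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(permutation_p_value(expression, labels))
```

In a genome-wide study the same loop runs independently for every probe, so the work can be spread across many threads.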
Unal EB, Gursoy A, Erman B. VitAL: Viterbi algorithm for de novo peptide design. [PLoS One. 2010 Jun 2;5(6):e10926]: Describes a de novo peptide design approach that generates the peptide by docking its residues pair-by-pair along a chosen path on a protein. The best fitting peptide is constructed by generating all possible peptide pairs at each point along the path and determining the binding energies between these pairs and the specific location on the protein using AutoDock.
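The Viterbi step named in the title can be sketched as a dynamic program that picks one residue per point along the path so that the summed energy is lowest. The code below is a generic illustration under stated assumptions, not the authors' method. The energies are invented numbers rather than AutoDock output, and `site_energy`, `pair_energy`, and the two-letter residue alphabet are hypothetical.

```python
def lowest_energy_sequence(site_energy, pair_energy):
    """Viterbi-style search for the lowest-energy residue sequence.

    site_energy: list of dicts, one per path position, mapping residue -> energy
    pair_energy: function (previous_residue, current_residue) -> energy
    Returns (sequence, total_energy).
    """
    # best[r] is the lowest total energy of any partial sequence ending in residue r.
    best = dict(site_energy[0])
    back_pointers = []
    for energies in site_energy[1:]:
        new_best, pointers = {}, {}
        for cur, e_cur in energies.items():
            prev = min(best, key=lambda p: best[p] + pair_energy(p, cur))
            new_best[cur] = best[prev] + pair_energy(prev, cur) + e_cur
            pointers[cur] = prev
        back_pointers.append(pointers)
        best = new_best
    # Trace back from the best final residue to recover the full sequence.
    last = min(best, key=best.get)
    total = best[last]
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return "".join(path), total


# Hypothetical energies for a three-position path and a two-residue alphabet.
site_energy = [
    {"A": -1.0, "G": -0.5},
    {"A": -0.2, "G": -1.5},
    {"A": -1.0, "G": -0.1},
]


def pair_energy(prev, cur):
    # Assumed penalty for repeating the same residue at adjacent positions.
    return 0.3 if prev == cur else 0.0


sequence, energy = lowest_energy_sequence(site_energy, pair_energy)
print(sequence, energy)
```

Because each position only looks back one step, the cost grows linearly with path length instead of exponentially with the number of candidate sequences.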
Vijaya Satya R, Kumar K, Zavaljevski N, Reifman J. A high-throughput pipeline for the design of real-time PCR signatures. [BMC Bioinformatics. 2010 Jun 23;11(1):340]: Describes the Tool for PCR Signature Identification, or TOPSI, a high-performance computing pipeline for designing PCR-based pathogen diagnostic assays. TOPSI designs PCR signatures that are common to multiple bacterial genomes by obtaining the shared regions through pairwise alignments between the input genomes. Available here.
Wu S, Wang J, Zhao W, Pounds S, Cheng C. ChIP-PaM: an algorithm to identify protein-DNA interaction using ChIP-Seq data. [Theor Biol Med Model. 2010 Jun 3;7(1):18]: Presents ChIP-PaM, an algorithm for identifying transcription factor target regions in ChIP-seq datasets. The algorithm relies on three lines of evidence: tag count modeling at the peak position; pattern matching of a specific tag count distribution; and motif searching along the genome.
Zhang P, Dreher K, Karthikeyan A, Chi A, Pujar A, Caspi R, Karp P, Kirkup V, Latendresse M, Lee C, Mueller LA, Muller R, Rhee SY. Creation of a genome-wide metabolic pathway database for Populus trichocarpa using a new approach for reconstruction and curation of metabolic pathways for plants. [Plant Physiol. 2010 Jun 3]: Describes a general approach for reconstructing metabolic pathway complements of plant genomes. As part of the project, the authors developed two reference databases: a comprehensive, all-plant reference pathway database, PlantCyc; and a reference enzyme sequence database, RESD, for annotating metabolic functions of protein sequences. The authors used these databases, along with the MetaCyc database and the pathway prediction software Pathway Tools, to reconstruct a metabolic pathway database, PoplarCyc, from the recently sequenced genome of Populus trichocarpa. Available here.