Bioinformatics Tool-Related Papers of Note, May 2010
Note: In addition to the listing below, papers from Nucleic Acids Research's annual web server issue are available here.
Baumgartner C, Lewis GD, Netzer M, Pfeifer B, Gerszten RE. A new data mining approach for profiling and categorizing kinetic patterns of metabolic biomarkers after myocardial injury. [Bioinformatics. 2010 May 18. (e-pub ahead of print)]: Discusses a new feature-selection method to identify metabolites of high predictive value in MS/MS data. The method categorizes metabolic signatures into three classes of weak, moderate, and strong predictors that can be applied to both paired and unpaired samples, according to the abstract. The approach "outperformed standard null-hypothesis significance testing and other popular methods for feature selection in terms of the area under the ROC curve and the product of sensitivity and specificity," the authors state. Available here.
Bourgon R, Gentleman R, Huber W. Independent filtering increases detection power for high-throughput experiments. [Proc Natl Acad Sci USA. 2010 May 25;107(21):9546-51]: According to the paper's abstract, variable-by-variable statistical testing, which is often used to select variables whose behavior differs across conditions in high-dimensional data sets, "requires adjustment for multiple testing, which can result in low statistical power." In response, the authors developed a two-stage approach that first filters variables by a criterion independent of the test statistic, and then tests only the variables that pass the filter. In particular, they introduce pairs of filter and test statistics that do not lead to a loss of type I error control. In an application to microarray data, the authors found that their approach increased the number of discoveries by 50 percent.
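The two-stage idea of filtering first and testing second can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes an overall-variance filter that ignores group labels, a large-sample z-approximation in place of a proper t-test (to stay within the standard library), and Benjamini-Hochberg adjustment. The function names and the `keep_fraction` and `alpha` parameters are hypothetical.

```python
from statistics import NormalDist, mean, variance

def z_test_pvalue(group_a, group_b):
    """Two-sided p-value from a large-sample z-approximation (an assumption
    made for brevity; a t-test would be the more usual choice)."""
    se = (variance(group_a) / len(group_a) + variance(group_b) / len(group_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (mean(group_a) - mean(group_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def filter_then_test(rows_a, rows_b, keep_fraction=0.5, alpha=0.1):
    """Stage 1: keep the most variable rows, ignoring group labels.
    Stage 2: test only the kept rows, so fewer hypotheses are adjusted for."""
    n = len(rows_a)
    overall_var = [variance(list(rows_a[i]) + list(rows_b[i])) for i in range(n)]
    n_keep = max(1, int(n * keep_fraction))
    kept = sorted(range(n), key=lambda i: overall_var[i], reverse=True)[:n_keep]
    pvals = [z_test_pvalue(rows_a[i], rows_b[i]) for i in kept]
    adjusted = benjamini_hochberg(pvals)
    return sorted(i for i, q in zip(kept, adjusted) if q <= alpha)
```

Because the filter never looks at which group a sample belongs to, it shrinks the number of hypotheses passed to the multiple-testing adjustment without peeking at the comparison being tested. Whether a given filter and test pair actually preserves type I error control is the question the paper addresses; this sketch does not verify it.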
Dellinger AE, Saw SM, Goh LK, Seielstad M, Young TL, Li YJ. Comparative analyses of seven algorithms for copy number variant identification from single nucleotide polymorphism arrays. [Nucleic Acids Res. 2010 May;38(9):e105]: Describes the evaluation of seven methods for identifying copy number variants from SNP arrays: circular binary segmentation, CNVFinder, cnvPartition, gain and loss of DNA, Nexus algorithms, PennCNV, and QuantiSNP. The authors found that QuantiSNP outperformed other methods in most of the datasets they evaluated. Nexus Rank and SNPRank "have low specificity and high power," while PennCNV "detects one of the fewest numbers of CNVs."
Floratos A, Smith K, Ji Z, Watkinson J, Califano A. geWorkbench: an open source platform for integrative genomics. [Bioinformatics. 2010 May 28. (e-pub ahead of print)]: Presents geWorkbench, an open source Java desktop application that provides access to an integrated suite of tools for analyzing and visualizing gene expression, sequence, protein structure, and other omics data. The workbench includes more than 70 plug-in modules for "classical analyses," such as clustering, classification, and homology detection, as well as for the reverse engineering of regulatory networks, protein structure prediction, and other applications. Available here.
Fung DC, Hong SH, Wilkins MR, Hart D. Using the clustered circular layout as an informative method for visualizing protein-protein interaction networks. [Proteomics. 2010 May 17. (e-pub ahead of print)]: The authors explain that while the force-directed layout is commonly used in computer-generated visualizations of protein-protein interaction networks, it has poor reproducibility and cannot explicitly display complementary biological information. The paper describes an alternative layout called the clustered circular layout.
Hendrix D, Levine M, Shi W. miRTRAP, a computational method for the systematic identification of miRNAs from high throughput sequencing data. [Genome Biol. 2010 Apr 6;11(4):R39. (e-pub ahead of print)]: Describes a computational strategy for whole-genome identification of microRNAs from high-throughput sequencing information. The method incorporates the mechanisms of miRNA biogenesis "and includes additional criteria regarding the prevalence and quality of small RNAs arising from the antisense strand and neighboring loci," according to the paper's abstract. Available here.
Ivakhno S, Tavaré S. CNAnova: a new approach for finding recurrent copy number abnormalities in cancer SNP microarray data. [Bioinformatics. 2010 Jun 1;26(11):1395-402]: Describes an approach for finding regions of recurrent copy number aberrations in Affymetrix SNP 6.0 array data. The method uses a control dataset of normal samples "and, in contrast to previous methods, does not require segmentation and permutation steps," according to the paper's abstract. Available here.
Magi R, Morris AP. GWAMA: software for genome-wide association meta-analysis. [BMC Bioinformatics. 2010 May 28;11(1):288]: Discusses an approach for performing meta-analysis for genome-wide association study data in order to increase the effective sample size and detect further novel loci. "Although statistical software analysis packages incorporate routines for meta-analysis, they are ill equipped to meet the challenges of the scale and complexity of data generated in genome-wide association studies," the abstract states. To address this problem, the authors have developed open-source software that "incorporates a variety of error trapping facilities, and provides a range of meta-analysis summary statistics." Available here.
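To illustrate why pooling studies increases the effective sample size, here is a minimal sketch of the standard inverse-variance fixed-effect meta-analysis of one variant. It is a textbook calculation, not GWAMA's code, and the summary does not say which statistics the tool computes; the function name and inputs (per-study effect estimates and standard errors) are assumptions for illustration.

```python
from statistics import NormalDist

def fixed_effect_meta(betas, standard_errors):
    """Combine per-study effect estimates with inverse-variance weights.

    Returns (pooled effect, pooled standard error, z-score, two-sided p-value).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = (1.0 / total_weight) ** 0.5
    z = pooled_beta / pooled_se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return pooled_beta, pooled_se, z, p_value
```

For two hypothetical studies with effects 0.1 and 0.3 and a standard error of 0.1 each, the pooled effect is 0.2 and the pooled standard error is about 0.071, smaller than either study's own.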
Paszkowski-Rogacz M, Slabicki M, Pisabarro MT, Buchholz F. PhenoFam — gene set enrichment analysis through protein structural information. [BMC Bioinformatics. 2010 May 17;11(1):254]: Introduces PhenoFam, a software tool that performs gene set enrichment analysis by using structural and functional information on families of protein domains as annotation terms. The tool can analyze data from quantitative high-throughput studies without prior pre-filtering or hit-selection steps, according to the authors. Available here.
Pati A, Ivanova NN, Mikhailova N, Ovchinnikova G, Hooper SD, Lykidis A, Kyrpides NC. GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes. [Nat Methods. 2010 May 2. (e-pub ahead of print)]: Describes the Gene Prediction Improvement Pipeline, or GenePRIMP, a computational process that performs evidence-based evaluation of gene models in prokaryotic genomes and reports anomalies including inconsistent start sites, missed genes and split genes. Available here.
Rohde C, Zhang Y, Reinhardt R, Jeltsch A. BISMA — fast and accurate bisulfite sequencing data analysis of individual clones from unique and repetitive sequences. [BMC Bioinformatics. 2010 May 6;11:230]: BISMA (Bisulfite Sequencing DNA Methylation Analysis) analyzes bisulfite sequencing data. According to the paper's abstract, "it uses an improved strategy for detection of clonal molecules and accurate CpG site detection and it supports for the first time analysis of repetitive sequences." Available here.
Severin J, Beal K, Vilella AJ, Fitzgerald S, Schuster M, Gordon L, Ureta-Vidal A, Flicek P, Herrero J. eHive: An Artificial Intelligence workflow system for genomic analysis. [BMC Bioinformatics. 2010 May 11;11(1):240. (e-pub ahead of print)]: Describes eHive, a fault-tolerant distributed processing system developed by the Ensembl team to support comparative genomic analysis. The system is based on a MySQL database that serves as a "central blackboard." A Perl script queries the system and runs jobs as required. "The system allows us to define dataflow and branching rules to suit all our production pipelines," the authors state.
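The "central blackboard" pattern can be sketched as follows. This is a loose illustration rather than eHive itself: it uses Python and an in-memory SQLite database in place of Perl and MySQL, a single worker with no locking, and a hypothetical `job` table whose column names and status values are invented for the example.

```python
import sqlite3

def create_blackboard():
    """Create an in-memory database holding the shared job table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE job (id INTEGER PRIMARY KEY, analysis TEXT, "
        "input TEXT, status TEXT DEFAULT 'READY')"
    )
    return db

def claim_job(db):
    """Pick the oldest READY job and mark it CLAIMED; return None if none is left."""
    row = db.execute(
        "SELECT id, analysis, input FROM job WHERE status = 'READY' "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE job SET status = 'CLAIMED' WHERE id = ?", (row[0],))
    db.commit()
    return row

def run_worker(db, runnables, dataflow):
    """Run jobs until the blackboard is empty.

    runnables maps an analysis name to a function of the job input;
    dataflow maps an analysis name to the analyses that receive its output.
    """
    completed = []
    while True:
        job = claim_job(db)
        if job is None:
            break
        job_id, analysis, job_input = job
        try:
            output = runnables[analysis](job_input)
        except Exception:
            # A failed job is recorded and the worker moves on.
            db.execute("UPDATE job SET status = 'FAILED' WHERE id = ?", (job_id,))
            db.commit()
            continue
        db.execute("UPDATE job SET status = 'DONE' WHERE id = ?", (job_id,))
        for next_analysis in dataflow.get(analysis, []):
            db.execute(
                "INSERT INTO job (analysis, input) VALUES (?, ?)",
                (next_analysis, output),
            )
        db.commit()
        completed.append((analysis, output))
    return completed
```

Because all state lives in the database, a worker that dies leaves its job visible in the table, and any number of workers can be started or stopped without a central scheduler. A real multi-worker system would also need an atomic claim step, which this sketch omits.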
Teichert F, Minning J, Bastolla U, Porto M. High quality protein sequence alignment by combining structural profile prediction and profile alignment using SABERTOOTH. [BMC Bioinformatics. 2010 May 14;11(1):251. (e-pub ahead of print)]: Describes a sequence-alignment method that "combines the prediction of a structural profile based on the protein's sequence with the alignment of that profile." The method predicts the contact vector of protein structures using an artificial neural network based on position-specific scoring matrices generated by PSI-BLAST and then aligns these predicted contact vectors. The resulting sequence alignments are assessed first by measuring the derived structural similarity for cases in which structures are available, and then by quantifying the ability of the alignments' significance score to recognize structural and evolutionary relationships. Available here.
Ubaida Mohien C, Hartler J, Breitwieser F, Rix U, Remsing Rix L, Winter GE, Thallinger GG, Bennett KL, Superti-Furga G, Trajanoski Z, Colinge J. MASPECTRAS 2: An Integration and Analysis Platform for Proteomic Data. [Proteomics. 2010 May 7. (e-pub ahead of print)]: Describes MASPECTRAS 2, a platform for integrating MS protein identifications with information from bioinformatics databases. Available here.
Wall DP, Kudtarkar P, Fusaro VA, Pivovarov R, Patil P, Tonellato PJ. Cloud computing for comparative genomics. [BMC Bioinformatics. 2010 May 18;11(1):259]: Describes the redesign of the reciprocal smallest distance algorithm to run on Amazon's Elastic Compute Cloud. The authors ran more than 300,000 RSD-cloud processes within EC2 using 100 high-capacity compute nodes. The total computation took just under 70 hours and cost $6,302.
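The core reciprocity check behind the reciprocal smallest distance idea can be sketched as follows. This simplified illustration is not the authors' pipeline: it assumes the evolutionary distances between genes of two genomes have already been computed and are supplied as a hypothetical lookup table, whereas the real algorithm derives them from sequence searches and alignments. The function name and data layout are invented for the example.

```python
def reciprocal_smallest_distance(distances):
    """Return gene pairs that are each other's closest match.

    distances maps (gene_in_genome_a, gene_in_genome_b) to a precomputed
    evolutionary distance (an assumption of this sketch).
    """
    best_for_a = {}  # gene in A -> (closest gene in B, distance)
    best_for_b = {}  # gene in B -> (closest gene in A, distance)
    for (gene_a, gene_b), d in distances.items():
        if gene_a not in best_for_a or d < best_for_a[gene_a][1]:
            best_for_a[gene_a] = (gene_b, d)
        if gene_b not in best_for_b or d < best_for_b[gene_b][1]:
            best_for_b[gene_b] = (gene_a, d)
    # Keep a pair only when the choice is mutual.
    return sorted(
        (gene_a, gene_b)
        for gene_a, (gene_b, _) in best_for_a.items()
        if best_for_b[gene_b][0] == gene_a
    )
```

Each pair of genomes can be processed independently of every other pair, which is what makes this kind of workload straightforward to spread across many compute nodes.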