Bioinformatics Tool-Related Papers of Note, June 2009
Note: In addition to the listing below, papers for Nucleic Acids Research's annual web server issue are available here.
Ay F, Kahveci T, de Crécy-Lagard V. A fast and accurate algorithm for comparative analysis of metabolic pathways. [J Bioinform Comput Biol. 2009 Jun;7(3):389-428]: Describes an algorithm for pairwise alignment of metabolic pathways. The method aligns different types of entities, such as enzymes, reactions, and compounds, and is "free of any abstraction" in modeling the pathways, according to the paper's abstract. The algorithm accounts for both the pairwise similarities of entities and the organization of their interactions by formulating an eigenvalue problem for both homology and topology. Available here.
Bode M, Khor S, Ye H, Li MH, Ying JY. TmPrime: fast, flexible oligonucleotide design software for gene synthesis. [Nucleic Acids Res. 2009 Jun 10. (e-pub ahead of print)]: Introduces TmPrime, a program to design oligonucleotide sets for gene assembly by both ligase chain reaction and polymerase chain reaction. The program "divides the long input DNA sequence based on the input desired melting temperature, and dynamically optimizes the length of oligonucleotides to achieve homologous melting temperatures," according to the paper's abstract. Available here.
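The idea of dividing a long sequence at a desired melting temperature can be illustrated with a minimal sketch. This is not TmPrime's method: it assumes the simple Wallace rule (Tm = 2(A+T) + 4(G+C), in degrees Celsius, a rough estimate intended for short oligonucleotides) and a greedy left-to-right split, and the function names `wallace_tm` and `split_by_tm` are hypothetical.

```python
def wallace_tm(seq: str) -> int:
    """Rough melting temperature (deg C) by the Wallace rule: 2(A+T) + 4(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def split_by_tm(sequence: str, target_tm: int) -> list[str]:
    """Greedily cut the sequence each time a segment reaches target_tm.

    Illustrative assumption only: a real design tool would use a
    thermodynamic model and optimize segment boundaries jointly.
    """
    segments = []
    start = 0
    for end in range(1, len(sequence) + 1):
        if wallace_tm(sequence[start:end]) >= target_tm:
            segments.append(sequence[start:end])
            start = end
    # Keep any trailing bases that never reached the target.
    if start < len(sequence):
        segments.append(sequence[start:])
    return segments


if __name__ == "__main__":
    for segment in split_by_tm("ATATGCGCATAT", target_tm=12):
        print(segment, wallace_tm(segment))
```

Segment lengths vary with base composition in this sketch: GC-rich stretches reach the target temperature in fewer bases than AT-rich ones.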
Koboldt DC, Chen K, Wylie T, Larson DE, McLellan MD, Mardis ER, Weinstock GM, Wilson RK, Ding L. VarScan: Variant detection in massively parallel sequencing of individual and pooled samples. [Bioinformatics. 2009 Jun 19. (e-pub ahead of print)]: Presents VarScan, a software tool for detecting variants in next-generation sequencing data "that is compatible with several short read aligners," according to the paper's abstract. The paper demonstrates VarScan's ability to detect SNPs and indels with high sensitivity and specificity in both Roche/454 sequencing of individuals and deep Illumina sequencing of pooled samples. Available here.
Krzywinski MI, Schein JE, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA. Circos: An information aesthetic for comparative genomics. [Genome Res. 2009 Jun 18]: Describes a visualization tool, called Circos, for genome comparison, which "uses a circular ideogram layout to facilitate the display of relationships between pairs of positions by the use of ribbons, which encode the position, size, and orientation of related genomic elements," according to the paper's abstract. Circos is capable of displaying data as scatter, line and histogram plots, heat maps, tiles, connectors, and text. Bitmap or vector images can be created from GFF-style data inputs and hierarchical configuration files, which can be easily generated by automated tools, making Circos suitable for rapid deployment in data analysis and reporting pipelines. Available here.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genomes Project Data Processing Subgroup. The Sequence Alignment/Map (SAM) format and SAMtools. [Bioinformatics. 2009 Jun 8. (e-pub ahead of print)]: Outlines the Sequence Alignment/Map format, a generic format for storing read alignments against reference sequences that supports both short and long reads. "It is flexible in style, compact in size, efficient in random access, and is the format in which alignments from the 1000 Genomes Project are released," the paper's abstract states. The paper also describes SAMtools, which implements various utilities for postprocessing alignments in the SAM format. Available here.
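As a rough illustration of the format, a SAM alignment line is tab-delimited and begins with 11 mandatory fields. The sketch below parses those fields in plain Python; the field names follow the current SAM specification (the 2009 paper uses older names for some of them), and `SamRecord`, `parse_sam_line`, and the example read are hypothetical, not part of SAMtools.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory fields of a SAM alignment line."""
    qname: str   # query (read) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII-encoded)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not an '@' header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10],
    )


def is_reverse_strand(record: SamRecord) -> bool:
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


if __name__ == "__main__":
    # Made-up example read, for illustration only.
    example = "read1\t16\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII"
    record = parse_sam_line(example)
    print(record.rname, record.pos, is_reverse_strand(record))
```

Optional tag fields that may follow the 11th column are ignored in this sketch.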
Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J. SOAP2: an improved ultrafast tool for short read alignment. [Bioinformatics. 2009 Jun 3. (e-pub ahead of print)]: Describes SOAP2, a "significantly improved version of the short oligonucleotide alignment program that both reduces computer memory usage and increases alignment speed at an unprecedented rate," the paper's abstract states. The authors tested SOAP2 on the whole human genome and found that it reduced memory usage from 14.7 GB to 5.4 GB and improved alignment speed by 20 to 30 times. Available here.
Mitra S, Klar B, Huson DH. Visual and statistical comparison of metagenomes. [Bioinformatics. 2009 Jun 10. (e-pub ahead of print)]: Describes two techniques for comparing multiple metagenomic datasets: a visualization technique for multiple datasets and a new statistical method for highlighting the differences in a pairwise comparison. Implementations of both methods are available in the metagenome analysis tool MEGAN. Available here.
Qu W, Hashimoto SI, Morishita S. Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing. [Genome Res. 2009 Jun 4. (e-pub ahead of print)]: Describes a frequency-based, de novo short-read clustering method that "organizes erroneous short sequences originating in a single abundant sequence into a tree structure, [in which] each 'child' sequence is considered to be stochastically derived from its more abundant 'parent' sequence with one mutation through sequencing errors," according to the paper's abstract. The root node "is the most frequently observed sequence that represents all erroneous reads in the entire tree, allowing the alignment of the reliable representative read to the genome without the risk of mapping erroneous reads to false-positive positions."
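The parent-child structure described in the abstract can be sketched in a few lines. The code below is a simplified illustration, not the authors' algorithm: it assumes equal-length reads, treats "one mutation" as a single substitution, and attaches each sequence to the most abundant strictly more frequent sequence at Hamming distance 1. The names `build_parent_map` and `root_of` are hypothetical.

```python
from collections import Counter
from typing import Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_parent_map(reads: list[str]) -> dict[str, Optional[str]]:
    """Map each distinct sequence to its parent, or None if it is a root."""
    counts = Counter(reads)
    # Most abundant first; ties broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    parent: dict[str, Optional[str]] = {}
    for i, seq in enumerate(ordered):
        parent[seq] = None
        for candidate in ordered[:i]:
            if (counts[candidate] > counts[seq]
                    and len(candidate) == len(seq)
                    and hamming(candidate, seq) == 1):
                parent[seq] = candidate
                break
    return parent


def root_of(seq: str, parent: dict[str, Optional[str]]) -> str:
    """Follow parent links up to the representative (root) sequence."""
    while parent[seq] is not None:
        seq = parent[seq]
    return seq


if __name__ == "__main__":
    # Made-up reads, for illustration only.
    reads = ["ACGT"] * 10 + ["ACGA"] * 3 + ["TCGA"] + ["GGGG"] * 2
    tree = build_parent_map(reads)
    for seq in tree:
        print(seq, "->", root_of(seq, tree))
```

Only the root of each tree would then be aligned to the genome, standing in for all of the reads beneath it.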
Saito TL, Yoshimura J, Sasaki S, Ahsan B, Sasaki A, Kuroshu R, Morishita S. UTGB Toolkit for Personalized Genome Browsers. [Bioinformatics. 2009 Jun 3. (e-pub ahead of print)]: Introduces the University of Tokyo Genome Browser toolkit, which allows researchers to develop a personalized genome browser for analyzing large amounts of locally stored genomics data. The UTGB toolkit is designed to meet "three major requirements" for personalization of genome browsers, according to the paper's abstract: easy installation, browsing locally stored data, and rapid interactive design of web interfaces tailored to individual needs. Available here.
Schmidt B, Sinha R, Beresford-Smith B, Puglisi SJ. A fast hybrid short read fragment assembly algorithm. [Bioinformatics. 2009 Jun 17. (e-pub ahead of print)]: Describes Taipan, an algorithm for short-read assembly that is a hybrid of the two main approaches to this challenge: greedy extension-based methods and graph-based methods. Taipan "uses greedy extensions for contig construction but at each step realizes enough of the corresponding read graph to make better decisions as to how assembly should continue," according to the paper's abstract. The authors claim that the method offers an assembly quality "at least as good as the graph-based approaches used in the popular Edena and Velvet assembly tools using a moderate amount of computing resources." Available here.
Singer GA, Hajibabaei M. iBarcode.org: web-based molecular biodiversity analysis. [BMC Bioinformatics. 2009 Jun 16;10 Suppl 6:S14]: Introduces a web-based suite of tools called iBarcode to help researchers analyze datasets to create DNA barcodes, which are used as a global standard for species identification and biodiversity studies. The suite allows users to manage their barcode datasets, cull out non-unique sequences, identify haplotypes within a species, and examine the within- to between-species divergences. Available here.
Whiteford N, Skelly T, Curtis C, Ritchie ME, Löhr A, Zaranek AW, Abnizova I, Brown C. Swift: Primary data analysis for the Illumina Solexa sequencing platform. [Bioinformatics. 2009 Jun 23. (e-pub ahead of print)]: Discusses Swift, a tool for performing primary data analysis on the Illumina Genome Analyzer. According to the paper's abstract, Swift "is the first tool, outside of the vendor's own software, which completes the full analysis process, from raw images through to base-calls." The authors claim that Swift is able to increase the sequencing yield by 13.8 percent, "at comparable error rate." Available here.
Yang L, Xu L, He L. A CitationRank algorithm inheriting Google technology designed to highlight genes responsible for serious adverse drug reaction. [Bioinformatics. 2009 Jun 15. (e-pub ahead of print)]: Describes the SADR [Serious Adverse Drug Reaction]-Gengle database, which is made up of gene-SADR relationships extracted from PubMed and covers six major SADRs: cholestasis, deafness, muscle toxicity, QT prolongation, Stevens-Johnson syndrome, and torsades de pointes. The database was constructed with the CitationRank algorithm, "which inherits the principle of the Google PageRank algorithm that a gene should be highly ranked when biologically related to other highly ranked genes," according to the paper's abstract. Available here.
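The PageRank principle that the abstract refers to can be sketched with a standard power iteration. This is generic PageRank, not the paper's CitationRank algorithm; the gene names, the link structure, and the damping factor of 0.85 are assumptions chosen for illustration.

```python
def pagerank(links: dict[str, list[str]],
             damping: float = 0.85,
             iterations: int = 100) -> dict[str, float]:
    """Standard PageRank by power iteration on a small directed graph.

    links maps each node to the nodes it points to. Nodes without
    outgoing links spread their rank evenly over all nodes.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical gene-gene links, for illustration only.
    links = {"geneA": ["geneC"], "geneB": ["geneC"], "geneC": ["geneA"]}
    for gene, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
        print(f"{gene}\t{score:.3f}")
```

In this toy graph, geneC ranks highest because two genes point to it, and geneA ranks above geneB because it is linked from the highly ranked geneC.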