
Tools for Sequence Assembly, Protein Analysis, and More Highlighted at ISMB 2012


This year's Intelligent Systems for Molecular Biology conference, held last week in Long Beach, Calif., featured an array of talks, papers, and posters describing software for sequence assembly, analyzing proteins and RNA, and other activities in the bioinformatics space.

The following is a roundup of some of the tools discussed at the conference.

NGS Assembly

Among the sequence assembly tools presented at the meeting was the Generic Assembly Scaffolder, or GRASS, which uses a series of optimization approaches to scaffold next-generation sequencing assemblies. The software was developed and presented by a team from the Delft University of Technology during the High-Throughput Sequencing, or HiTSeq, special interest group meeting held prior to the official start of ISMB 2012.

Another presentation at HiTSeq described a tool developed by researchers from the University of Hong Kong, dubbed the iterative de Bruijn graph de novo assembler for short read sequencing data with highly uneven sequencing depth, or IDBA-UD.

IDBA-UD builds on the team's previously developed IDBA algorithm for assembling next-generation sequence reads. The developers explained during their presentation that IDBA-UD is meant to address sequence data where uneven sequencing depths result in a large number of errors.

In a Bioinformatics paper that describes IDBA-UD in detail, the researchers explain that their approach deals with sequencing depth issues by using “multiple depth relative thresholds” to remove erroneous k-mers. It then applies a local assembly approach to address low-depth regions and an error correction step to deal with high-depth regions.
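As a rough illustration of why a depth-relative threshold can help with uneven coverage, the toy Python sketch below counts k-mers and drops a k-mer only when it is rare compared with the competing k-mers that share its (k-1)-base prefix. This is not the IDBA-UD algorithm; the function names, the `ratio` parameter, and the example reads are assumptions made for illustration.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer that occurs in a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def filter_by_relative_depth(counts, ratio=0.2):
    """Keep a k-mer unless it is rare relative to its local competitors.

    Competitors are k-mers sharing the same (k-1)-base prefix, i.e. the
    alternative branches leaving one node of a de Bruijn graph.
    Illustrative simplification only, not the published method.
    """
    best = {}
    for kmer, c in counts.items():
        prefix = kmer[:-1]
        best[prefix] = max(best.get(prefix, 0), c)
    return {kmer: c for kmer, c in counts.items()
            if c >= ratio * best[kmer[:-1]]}

# Hypothetical reads: a deep region, one erroneous read, a shallow region.
reads = ["ACGTAC"] * 10 + ["ACGTTC"] + ["TTGGCC"] * 2
counts = count_kmers(reads, k=4)
kept = filter_by_relative_depth(counts, ratio=0.2)
```

In this toy data, the erroneous branch `CGTT` is removed because it is rare next to `CGTA`, while the k-mers of the shallow region survive even though a single global cutoff of, say, three would have discarded them.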

A third tool presented at the meeting was Oases, a short-read de novo transcriptome assembler. It was developed by a team of researchers from the Max Planck Institute for Molecular Genetics, the European Bioinformatics Institute, and the University of California, Santa Cruz.

In a Bioinformatics paper published last year that describes the software, the developers explained that Oases relies on an “array of hash lengths, a dynamic filtering of noise, a robust resolution of alternative splicing events, and the efficient merging of multiple assemblies.”

They also noted that the software performed better than existing de novo transcriptome assemblers like transABySS and Trinity when tested on human and mouse RNA-seq data.

Meanwhile, other talks discussed efforts to make sequence assemblies more accurate, such as SEQuel, developed by researchers from the University of California, San Diego, and Wayne State University, and based on a positional de Bruijn graph algorithm.

The tool corrects errors such as insertions, deletions, and substitutions that are still present in contigs assembled using software like BGI’s SOAPdenovo and the Broad Institute’s ALLPATHS. According to a Bioinformatics paper describing it, SEQuel “reduced the number of small insertions and deletions in the assemblies of standard multi-cell Escherichia coli data by almost half, and corrected between 30 percent and 94 percent of the substitution errors.”

Protein Analysis

Protein analysis was another area of focus for bioinformatics developers at ISMB. For example, researchers at Seattle Children’s Hospital presented two freely available tools for proteomics analysis: the Model Organism Protein Expression Database (MOPED) and the Systematic Proteomics Investigative Research Environment (SPIRE).

MOPED lets users compare data from their proteomics studies with public datasets. Users can search for information in the database by organism, tissue, condition, and localization as well as upload their own data into the system. The database contains information from more than 44,000 proteins and over 15 million spectra.

SPIRE provides a web-based pipeline for analyzing mass spectrometry data in order to identify proteins and peptides, as well as to conduct label-free expression and relative expression analyses. The pipeline includes software such as the Open Mass Spectrometry Search algorithm, or OMSSA, and X! Tandem, as well as tools for visualizing the results of the analysis. Users also have access to information stored in resources like GeneCards, UniProt, and Reactome.

Another protein-related presentation described the Molecular Recognition Features Predictor, or MoRFpred, which is a web-based system for predicting particular binding regions in protein chains that are initially disordered but become structured once they bind. These regions are associated with protein signaling and regulation activities.

According to a conference abstract, MoRFpred works by fusing annotations generated by sequence alignment with predictions generated by a support vector machine, which uses a custom-designed set of sequence-derived features that provide information about things like evolutionary profiles, selected physicochemical properties of amino acids, and solvent accessibility. Further details are available in a paper published in Bioinformatics.

Meanwhile, researchers from the University of Munich’s Gene Center presented HMM-HMM-based lightning-fast iterative sequence search, or HHblits, which is used to search large protein databases such as UniProt by representing both query and database sequences with hidden Markov models. Further details of the software are available in a Nature Methods paper that was published last year.

Software Environments

Talks at the Bioinformatics Open Source Conference special interest group meeting, meantime, focused on software environments like the Broad’s GenomeSpace, which lets users move their data between multiple genomics analysis tools such as GenePattern and the UCSC Genome Browser (BI 5/4/2012).

Another BOSC presentation focused on PyPedia, which is an effort to host the Python programming environment in a Wikipedia-like environment. Articles in PyPedia describe specific functions and classes in the Python programming language and include the documentation, source code, and unit tests. Users can make edits to the source code as well as download and execute it locally.

Another BOSC talk highlighted updates to the Galaxy platform, including an overview of its application programming interface, automatic parallelization, and the Galaxy toolshed. The toolshed contains tools for sequence assembly and analysis, computational chemistry, studying metagenomes, and manipulating ontologies, which users can select, install, and run on their local Galaxy instances.

ChIP-seq, Drug Effects, and More

Other presentations at the meeting focused on efforts to understand drug effects, methods of analyzing ChIP-seq datasets, toolkits for bioinformatics analysis, and software for determining gene and protein function.

For example, one presentation, discussed in detail in a Bioinformatics paper, described a statistical approach for comparing the similarities between individual ChIP-seq datasets that relies on “efficient computation of exact p-values.”
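To give a sense of what an exact p-value for dataset similarity can look like, the Python sketch below uses a deliberately simpler stand-in than the method in the paper: a hypergeometric tail probability for the overlap between two sets of peak-containing genomic bins. The function name, the binning of the genome, and the example numbers are all assumptions for illustration.

```python
from math import comb

def overlap_pvalue(total_bins, bins_a, bins_b, shared):
    """Exact probability of seeing at least `shared` common bins by chance.

    Assumes the genome is split into `total_bins` equal bins, dataset A has
    peaks in `bins_a` of them, dataset B in `bins_b`, and B's bins are drawn
    uniformly at random (hypergeometric model). Illustrative stand-in only.
    """
    upper = min(bins_a, bins_b)
    tail = sum(
        comb(bins_a, x) * comb(total_bins - bins_a, bins_b - x)
        for x in range(shared, upper + 1)
    )
    return tail / comb(total_bins, bins_b)

# Hypothetical numbers: 10,000 bins, 300 and 250 peak bins, 40 shared.
p = overlap_pvalue(10_000, 300, 250, 40)
```

Because the sum runs over exact integer binomial coefficients, the result involves no sampling or permutation, which is the sense in which such a p-value is "exact" under the stated model.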

Another speaker described the Malaria Genome Explorations Tool, or MaGnET, which lets users visualize and browse functional genomic data from Plasmodium falciparum, the malaria parasite, and related species.

Still on the subject of gene and protein function, another presentation highlighted BioGPS, a web-based tool developed by a team from the Scripps Research Institute that aggregates online gene annotation resources. According to a conference abstract, users can select the resources they need and organize them into a customizable gene report page. They can also contribute to the platform by submitting new resources as plugins, the developers said.

On the drug development front, researchers from Stanford University discussed what they described as a data-driven approach for predicting drug effects and interactions from adverse event data that accounts for missing information that could have an effect on drug response.

The team used a propensity score matching algorithm to correct for factors such as concomitant medications, patient demographics, and medical histories that could be involved in patients’ response to drugs. They used their approach to develop two new resources: a database of drug effects, OFFSIDES, and a database of drug-drug interaction side effects called TWOSIDES. Details about the study were published in Science Translational Medicine in March.
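The general idea behind propensity score matching can be sketched in a few lines of Python. The example below is not the Stanford team's implementation: it assumes each patient already has a propensity score (typically the output of a model such as logistic regression on covariates like co-medications and demographics) and performs greedy one-to-one nearest-neighbor matching within a caliper. The function name, the `caliper` parameter, and the patient IDs are hypothetical.

```python
def match_by_propensity(exposed, unexposed, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on precomputed propensity scores.

    `exposed` and `unexposed` map patient IDs to scores in [0, 1]. Each
    exposed patient is paired with the closest still-unused unexposed
    patient, provided the scores differ by at most `caliper`.
    Illustrative sketch under the assumptions stated above.
    """
    available = dict(unexposed)
    pairs = []
    for pid, score in sorted(exposed.items(), key=lambda kv: kv[1]):
        if not available:
            break
        best = min(available, key=lambda cid: abs(available[cid] - score))
        if abs(available[best] - score) <= caliper:
            pairs.append((pid, best))
            del available[best]  # each control is used at most once
    return pairs

# Hypothetical scores for patients who did and did not take a drug.
exposed = {"e1": 0.30, "e2": 0.80}
unexposed = {"u1": 0.32, "u2": 0.50, "u3": 0.79}
pairs = match_by_propensity(exposed, unexposed)
```

Adverse event rates would then be compared only within the matched pairs, so that differences are less likely to reflect who tends to be prescribed the drug rather than the drug itself.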
