IBM Research Adds to Life Science Repertoire with Genomic Annotation Language


Computational biology at IBM Research is gathering steam. In addition to last week’s sale of a BlueGene/L system to Japan’s AIST for protein-folding simulation, IBM researchers have published a paper outlining a new genome annotation language.

The Genomic Messaging System Language, or GMSL, embeds information related to stretches of the genome within the A/C/G/T string of the DNA itself. The approach is described in a report in the Oct. 11 print edition of the Journal of Proteome Research, and was published online July 22.

Barry Robson, an engineer at IBM’s Computational Biology Center and lead author on the paper, said that the method was developed with an eye toward clinical genomics. It offers a means of storing and transmitting whole sequences of patient DNA with embedded privacy and consent information, as well as other functional annotations and medical information related to the sequence.

Robson told BioInform that other methods “annotate around the DNA … but this is much more condensed.” GMSL was designed to be “larger” than XML annotation, he said. That is, it is broader in scope: XML or other input documents, including image data from MRIs and X-rays, are “disassembled” into GMSL to create a transmitted stream. A GMSL parser receives the stream, reconstitutes the original annotations, and delivers them in their original format as output.

According to the paper, “data from genomic databases is brought into GMS via files which contain the DNA raw sequences and optionally, but importantly, allow annotation by an expert. … In the current implementation, the expert annotates the DNA files directly with a text editor, and the modified DNA files are then automatically converted into GMS syntax. The syntax of the DNA files prior to conversion is quite flexible and supports XML tags for annotation plus special GMS commands for process control.”
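
As a rough illustration of the disassemble-and-reconstitute idea described above, the following Python sketch splits an annotated DNA string into a stream of records and then rebuilds the original text from that stream. It is a toy under stated assumptions, not the GMSL encoding: the record layout, the function names (`disassemble`, `reconstitute`, `raw_sequence`), and the `consent` tag are all hypothetical, and the actual GMS syntax and its process-control commands are not reproduced here.

```python
import re

# Hypothetical toy format, NOT the real GMSL syntax: an annotated DNA file is
# assumed to be runs of A/C/G/T interleaved with XML-style tags.
TOKEN = re.compile(r"(<[^>]+>)|([ACGT]+)")


def disassemble(annotated: str) -> list[tuple[str, str]]:
    """Split annotated DNA text into an ordered stream of (kind, text) records."""
    records = []
    pos = 0
    for match in TOKEN.finditer(annotated):
        if match.start() != pos:
            # Something between tokens was neither a tag nor a base run.
            raise ValueError(f"unexpected text at offset {pos}")
        kind = "tag" if match.group(1) else "dna"
        records.append((kind, match.group(0)))
        pos = match.end()
    if pos != len(annotated):
        raise ValueError(f"unexpected text at offset {pos}")
    return records


def reconstitute(records: list[tuple[str, str]]) -> str:
    """Rebuild the annotated text exactly as it was before disassembly."""
    return "".join(text for _, text in records)


def raw_sequence(records: list[tuple[str, str]]) -> str:
    """Recover only the DNA bases, dropping the embedded annotations."""
    return "".join(text for kind, text in records if kind == "dna")


# Hypothetical example input: a consent annotation wrapped around four bases.
annotated = 'ACGT<consent scope="research">GGCA</consent>TT'
stream = disassemble(annotated)
restored = reconstitute(stream)
bases = raw_sequence(stream)
```

Keeping the records in their original order is what lets the receiving side return both the annotations in their original form and the bare sequence from a single stream.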

Robson said the approach follows on an earlier method the team developed to embed the properties of amino acids within protein sequence strings.

GMSL is still in the early stages of development, but in an initial study it successfully modeled SNPs in proteins from a patient record. The language is “available as research code,” Robson said, though he added, “I wouldn’t want any surgery done on me based on it yet.”

Robson said the team has submitted a second paper outlining further details of the GMSL specification that will allow users to “implement it on the bit level.” This first paper, he said, was an opportunity “to make the general ideas available to the community.”

— BT
