Computational biology at IBM Research is gathering steam. In addition to last week’s sale of a BlueGene/L system to Japan’s AIST for protein-folding simulation, IBM researchers have published a paper outlining a new genome annotation language.
The Genomic Messaging System Language, or GMSL, embeds information related to stretches of the genome within the A/C/G/T string of the DNA itself. The approach is described in a paper that was published online July 22 and appears in the Oct. 11 print edition of the Journal of Proteome Research.
Barry Robson, an engineer at IBM’s Computational Biology Center and lead author on the paper, said that the method was developed with an eye toward clinical genomics. It offers a means of storing and transmitting whole sequences of patient DNA with embedded privacy and consent information, as well as other functional annotations and medical information related to the sequence.
Robson told BioInform that other methods “annotate around the DNA … but this is much more condensed.” GMSL was designed to be “larger” than XML annotation, he said. In other words, XML or other input documents — including image data from MRIs and X-rays — are “disassembled” into GMSL to create a transmitted stream. A GMSL parser receives the data, reconstitutes the original annotations, and delivers them in their original format as output.
According to the paper, “data from genomic databases is brought into GMS via files which contain the DNA raw sequences and optionally, but importantly, allow annotation by an expert. … In the current implementation, the expert annotates the DNA files directly with a text editor, and the modified DNA files are then automatically converted into GMS syntax. The syntax of the DNA files prior to conversion is quite flexible and supports XML tags for annotation plus special GMS commands for process control.”
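The round trip described above, in which an annotated DNA file is disassembled into a stream and later reconstituted, can be illustrated with a toy sketch in Python. This is not GMSL syntax, which the article does not spell out. The token kinds (`dna`, `open`, `close`, `note`), the function names, and the `<consent>` tag are all hypothetical, chosen only to show XML-style tags sitting inline in a sequence and surviving conversion intact.

```python
import re

# Matches one XML-style tag such as <consent> or </consent>.
# Assumption: annotations are simple, attribute-free, well-nested tags.
TAG = re.compile(r"(<[^<>]+>)")


def disassemble(annotated):
    """Split an annotated DNA string into a flat list of (kind, value) tokens.

    The token kinds are invented for this sketch and do not correspond
    to real GMSL commands.
    """
    tokens = []
    depth = 0  # greater than 0 while inside an annotation element
    for piece in TAG.split(annotated):
        if not piece:
            continue  # re.split yields empty strings next to adjacent tags
        if piece.startswith("</"):
            depth -= 1
            tokens.append(("close", piece[2:-1]))
        elif piece.startswith("<"):
            depth += 1
            tokens.append(("open", piece[1:-1]))
        elif depth > 0:
            tokens.append(("note", piece))  # free text inside an annotation
        else:
            if set(piece) - set("ACGT"):
                raise ValueError("unexpected character in sequence: %r" % piece)
            tokens.append(("dna", piece))
    return tokens


def reassemble(tokens):
    """Rebuild the original annotated text from the token stream."""
    parts = []
    for kind, value in tokens:
        if kind == "open":
            parts.append("<%s>" % value)
        elif kind == "close":
            parts.append("</%s>" % value)
        else:
            parts.append(value)
    return "".join(parts)


def bare_sequence(tokens):
    """Return only the DNA letters, with every annotation stripped out."""
    return "".join(value for kind, value in tokens if kind == "dna")


annotated = "ACGT<consent>research only</consent>GGTA"
stream = disassemble(annotated)
restored = reassemble(stream)
sequence = bare_sequence(stream)
```

In this sketch the receiver can recover either the annotated file exactly as the expert wrote it (`restored`) or the plain sequence (`sequence`) from the same stream, which is the general shape of the disassemble-and-reconstitute design Robson describes.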
Robson said the approach follows on an earlier method the team developed to embed the properties of amino acids within protein sequence strings.
GMSL is still in the early stages of development, but in an initial study it successfully modeled SNPs in proteins from a patient record. It’s “available as research code,” Robson said, but, he added, “I wouldn’t want any surgery done on me based on it yet.”
Robson said the team has submitted a second paper outlining further details of the GMSL specification that will allow users to “implement it on the bit level.” This first paper, he said, was an opportunity “to make the general ideas available to the community.”