MIT Scientists Develop New Gene Finding Software, Uses Sequence from Two Genomes


Massachusetts Institute of Technology scientists, working under Eric Lander’s direction, have developed new gene finding software that uses unannotated sequence data from two genomes rather than one.

The principal developers were Lior Pachter, now a visiting assistant professor in the department of mathematics at the University of California, Berkeley, and Serafim Batzoglou, research scientist at MIT’s Whitehead Institute Genome Center.

Eric Lander, MIT professor of biology and director of the university’s center for genome research, coached Pachter and Batzoglou in developing software and algorithms to enable the use of sequences from two organisms.

“The key point is that the best way to distinguish signal from noise in genomes is to ask what evolution has chosen to preserve. The paper demonstrated nicely that coding exons are preserved so well between human and mouse that whereas de novo gene-finding with one species is tough, de novo gene-finding with two species is not,” said Lander, referring to a paper in the July issue of the journal Genome Research, which detailed the work.

Batzoglou and Pachter developed two software programs, Glass and Rosetta, while they were doctoral students at MIT. Rosetta finds genes in human and mouse sequence simultaneously, whereas most gene finding programs take sequence from a single organism, typically human, as input and then find genes in and annotate that sequence.

Rosetta allows researchers to find genes not just in human but also in unannotated mouse sequence. “You can use both sequences to enhance detection in both genomes,” said Pachter. Until very recently, most gene finding programs focused on a single organism, although some did incorporate homology or expressed sequence tag information to improve their predictions, he said.

“But the concept of just using two unannotated sequences simultaneously, that was I think a novelty in our paper,” said Pachter.

To use Rosetta to understand which regions are conserved, a researcher has to have the sequences aligned. Pachter and Batzoglou couldn’t find a tool to accurately align large genomic regions, so they developed Glass.

“We needed to have an alignment for every base pair in the human and the corresponding mouse sequences. And we were looking at regions that were roughly 200,000 or 300,000 base pairs long,” said Pachter.
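To make the idea of a base-by-base alignment concrete, here is a minimal sketch of textbook global alignment (Needleman-Wunsch) in Python. It is not Glass's algorithm, which the article does not describe. The function name `global_align` and the scoring values are illustrative assumptions. The approach fills a table whose size is the product of the two sequence lengths, so it would be impractical as written for regions of 200,000 to 300,000 base pairs.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Textbook Needleman-Wunsch global alignment (illustrative only).

    Returns (aligned_a, aligned_b, score); '-' marks a gap.
    The scoring values are arbitrary assumptions, not Glass's parameters.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two bases
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


if __name__ == "__main__":
    top, bottom, total = global_align("ACGT", "AGT")
    print(top)
    print(bottom)
    print(total)
```

Every position in each input appears in exactly one column of the result, either paired with a base from the other sequence or with a gap.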

Pachter is currently using the software in his research at Berkeley on finding regulatory elements in sequences. He is also collaborating with the Lawrence Berkeley National Laboratory to develop a program called Vista, designed to visualize parts of the alignments produced by Glass.

In the Genome Research paper, Batzoglou and Pachter reported that their software predicted gene sequences with better than 90 percent accuracy. They tested its accuracy on 117 known genes from human and mouse.

The software can be accessed on the MIT website, where anyone can submit two sequences and get back an answer, said Batzoglou. He plans to make the software freely available for downloading in the coming weeks. There have been several hundred visits to the website, he added.

Batzoglou said the approach has been designed for comparing two similar genomes. By looking at where the genomes are similar, the software helps researchers discover which parts of the genomes are biologically important. The important parts are preserved because of evolutionary pressure to preserve function, while the less important parts, or junk DNA, drift over the course of evolution, he explained.
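The conservation idea can be illustrated with a short Python sketch that scans an existing pairwise alignment for windows of high identity. This is not Rosetta's method, which also models gene structure. The function name `conserved_regions`, the window size and the identity threshold are all assumptions chosen for illustration.

```python
def conserved_regions(aln_a, aln_b, window=10, min_identity=0.9):
    """Return merged (start, end) spans of high identity in a pairwise alignment.

    aln_a and aln_b are equal-length aligned strings in which '-' is a gap.
    Coordinates are alignment columns, with end exclusive.
    The window size and threshold are illustrative assumptions.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")

    # 1 where both columns hold the same base; gaps never count as matches.
    matches = [1 if x == y and x != "-" else 0 for x, y in zip(aln_a, aln_b)]
    n = len(matches)
    regions = []
    if n < window:
        return regions

    count = sum(matches[:window])  # matches in the first window
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            count += matches[start + window - 1] - matches[start - 1]
        if count >= min_identity * window:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous span, so extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


if __name__ == "__main__":
    human = "ACGTACGTAC" + "TTTTTTTTTT" + "GGCCGGCCGG"
    mouse = "ACGTACGTAC" + "AAAAAAAAAA" + "GGCCGGCCGG"
    print(conserved_regions(human, mouse, window=10, min_identity=1.0))
```

The toy sequences named `human` and `mouse` are made up. In practice, the spans such a scan reports would only be candidates for functional sequence, not gene predictions.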

This approach does not assume that any gene in either of the two organisms is known or that any similar gene is in a known database.

Batzoglou said that programs like Glass and Rosetta will be more important as the mouse genome gets sequenced because such software will enable the mouse and human sequences to be more easily compared, which should result in a more accurate annotation of the human genome.

Bioinformatics at the genome analysis level is going to become more of a comparative science as more related organisms are sequenced, said Batzoglou.

Chris Burge, a research scientist in MIT’s biology department, said Glass and Rosetta are important early efforts to obtain comparative information. “It’s the first stab at this problem and my guess is that within several months, there will be several other programs that work on that precise problem,” said Burge, who developed another gene finding program called GenScan.

He cautioned that Glass and Rosetta are not very practical yet because aligning two genomic sequences is a computationally intensive process. But the demonstrated results have been good, he added.

—Matthew Dougherty
