By Ganapati Mudur
Indian physicist Ram Ramaswamy had spent a decade looking for order in complexity, seeking patterns in fractals, sand piles, and languages, when biologist Alok Bhattacharya coaxed him to train his sights on genomes. The result of efforts by the researchers at Jawaharlal Nehru University in New Delhi is software that can pick out genes in any genome without prior information about the organism’s genomic structure.
Most existing in silico gene prediction programs need training and often use genome-specific signals to find genes. “But this algorithm looks for a universal signature that shows up in almost all genes of all organisms,” says Ramaswamy, professor and dean at JNU’s physical sciences school. He and Bhattacharya teamed up nearly four years ago to search for the genes of Entamoeba histolytica, a parasite that sometimes colonizes the human gut.
They observed that the bases A, T, C, and G tend to recur at regular intervals along protein-coding stretches of the genome. When the sequence data are digitized, the so-called three-base periodicity can be detected through a computational technique known as Fourier analysis. The periodicity produces a strong spectral peak along genes but is absent in non-coding regions.
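The idea can be illustrated with a minimal Python sketch. It is not GeneScan's code: the function name `period3_score`, the one-indicator-track-per-base digitization, and the peak-to-average normalization are all assumptions chosen for illustration. The sketch turns a DNA string into four 0/1 tracks, takes the Fourier transform of each, and compares the power at a period of three bases with the average power at all other non-zero frequencies.

```python
import numpy as np


def period3_score(seq: str) -> float:
    """Power at frequency 1/3 divided by the mean power at non-zero frequencies.

    Hypothetical helper for illustration only; not part of GeneScan.
    """
    seq = seq.upper()
    # Trim to a multiple of 3 so that frequency 1/3 falls exactly on a DFT bin.
    n = len(seq) - len(seq) % 3
    if n < 6:
        raise ValueError("sequence too short for a period-3 estimate")
    seq = seq[:n]

    # Digitize: one 0/1 indicator track per base, then sum the power spectra.
    total_power = np.zeros(n // 2 + 1)
    for base in "ACGT":
        indicator = np.array([1.0 if ch == base else 0.0 for ch in seq])
        total_power += np.abs(np.fft.rfft(indicator)) ** 2

    peak = total_power[n // 3]           # bin for a period of three bases
    background = total_power[1:].mean()  # skip bin 0 (overall base composition)

    # A sequence with no variation (e.g. a homopolymer) has no signal at all.
    if background <= 1e-12 * n:
        return 0.0
    return float(peak / background)


if __name__ == "__main__":
    print(period3_score("GCA" * 100))  # perfectly periodic toy "gene": large score
    print(period3_score("AC" * 150))   # period-2 repeat: essentially zero
```

Real coding regions are far noisier than the toy repeat above, so in practice a score like this would be compared against a threshold estimated from non-coding sequence rather than read as a yes-or-no answer.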
No training required
The pattern is believed to emerge directly from the codons that make up gene sequences. “Not only is this one of the best signatures of genes, but it is also one of the simplest,” says Wentian Li of the statistical genetics lab at New York’s Rockefeller University. “The rule is so basic that the gene identification program needs no training.”
To establish the universality of the three-base periodicity, JNU researchers used their program GeneScan to screen several thousand genes from more than 40 organisms, including yeast, bacteria, fruit fly, parasites, and humans. “It turns up in 98 percent of all genes,” says Bhattacharya.
They have also used GeneScan to predict previously undetected genes in several microbes including Mycobacterium tuberculosis, Haemophilus influenzae, and Plasmodium falciparum. GeneScan has predicted at least five new genes in Leishmania major chromosome 1.
“Previous annotation techniques may have missed them because they don’t have the typical codon patterns associated with Leishmania,” says Bhattacharya.
But the first version of GeneScan couldn’t pinpoint the ends of genes, and it failed to identify coding regions shorter than 100 bases. Also, the three-base periodicity rule doesn’t hold true for two percent of genes. The researchers hope that algorithms that take into account start and stop codons, splice sites, or promoters will eliminate these drawbacks.
To add value to the basic version and turn it into a complete annotation tool, the JNU team is trying to link GeneScan to BLAST so newly discovered genes could be instantly compared with gene databases.
However, the pace of work has been slow. Bioinformaticists are hard to find. Only four Indian universities offer programs in bioinformatics and the best students head into industry.
But GeneScan has attracted an Indian software company. Unitech Infosolutions, a company near New Delhi, has developed a Linux version of the program and a user-friendly graphical interface in an attempt to market it.
The researchers, though, say they are not in it for the money. “We’d be as happy if the version on our website spurs bioinformatics research here,” says Bhattacharya.
Meanwhile, “verifying genes predicted in silico through wet lab work will keep us busy a long time,” says Ramaswamy. “I guess those sand piles can wait.”