Wang Liang of SOSO.com, the third-largest search engine in China, says that searchable bioinformatic databases could learn a lot from search engines that index Chinese-language Web sites, according to the Physics arXiv Blog. As a first step, Liang suggests that bioinformatic search engines adopt an inverted index, which maps each "word" to the places where it occurs. "That dramatically simplifies things but there are various complexities that make the indexing process tricky. For example, in English, the spaces between words show clearly where each word starts and finishes. That isn't the case in genetic data. So one important question is what constitutes a word," the Physics arXiv Blog reports, adding that "Liang says that an important clue comes from the way search engines index languages like Chinese where there are no spaces between words either." One approach to indexing a Chinese document involves breaking up the text into n-grams, "words that are n-letters long." Liang has applied Zipf's law to the Arabidopsis, Aspergillus, fruit fly, and mouse genomes to estimate the average "length" of DNA words: about 12 letters. Genome data, then, can be indexed using 12-grams, and the advantage of this method is that it doesn't require new technology. "Perhaps there's even a decent business model in such a plan, for example by serving ads targeted at the kind of people who do bioinformatics search," according to the Physics arXiv Blog. "The only question is who will lead the way in this area."
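To make the idea concrete, here is a minimal Python sketch of an inverted index built from overlapping 12-grams of DNA sequences. It is an illustration under stated assumptions, not Liang's implementation: the function names (`ngrams`, `build_index`, `search`), the choice of overlapping windows, and the toy sequences are all hypothetical.

```python
from collections import defaultdict

def ngrams(sequence, n=12):
    """Yield (position, n-gram) for every overlapping window of length n."""
    for i in range(len(sequence) - n + 1):
        yield i, sequence[i:i + n]

def build_index(sequences, n=12):
    """Map each n-gram to the (sequence_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos, gram in ngrams(seq.upper(), n):
            index[gram].append((seq_id, pos))
    return index

def search(index, query, n=12):
    """Return IDs of sequences that contain every n-gram of the query.

    These are candidate matches only: this sketch does not check that the
    n-grams appear contiguously and in order.
    """
    grams = [gram for _, gram in ngrams(query.upper(), n)]
    if not grams:  # query shorter than n letters
        return set()
    hits = [{seq_id for seq_id, _ in index.get(gram, [])} for gram in grams]
    return set.intersection(*hits)

# Toy data, invented for illustration (not real genome records).
sequences = {
    "seq1": "ACGTACGTACGTTTGA",
    "seq2": "TTGACCATGCAAGGCTAAC",
}
index = build_index(sequences)
result = search(index, "ACGTACGTACGT")
```

With the toy data above, `result` contains only `"seq1"`. Because the lookup is a plain dictionary access per 12-gram, the same structure a text search engine uses for words can be reused for sequence data, which is the sense in which the approach needs no new technology.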
A Bioinformatic Lesson from Chinese Search Engines
Jun 30, 2010