A Bioinformatic Lesson from Chinese Search Engines

Wang Liang of SOSO.com, the third-largest search engine in China, says that searchable bioinformatic databases could learn a lot from search engines that index Chinese-language Web sites, according to the Physics arXiv Blog. As a first step, Liang suggests that bioinformatic search engines adopt an inverted index approach. "That dramatically simplifies things but there are various complexities that make the indexing process tricky. For example, in English, the spaces between words show clearly where each word starts and finishes. That isn't the case in genetic data. So one important question is what constitutes a word," the Physics arXiv Blog reports, adding that "Liang says that an important clue comes from the way search engines index languages like Chinese where there are no spaces between words either."

One approach to indexing a Chinese document involves breaking up the text into n-grams, "words that are n-letters long." Liang has applied Zipf's law to the Arabidopsis, Aspergillus, fruit fly, and mouse genomes in order to determine the average "length" of DNA words, which comes to about 12 letters. Genome data, then, can be indexed using 12-grams, and the advantage of this method is that it doesn't require new technology.

"Perhaps there's even a decent business model in such a plan, for example by serving ads targeted at the kind of people who do bioinformatics search," according to the Physics arXiv Blog. "The only question is who will lead the way in this area."
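To make the idea concrete, here is a minimal sketch of an inverted index built from overlapping 12-grams of DNA sequences. It is an illustration only, not Liang's implementation: the function names (`ngrams`, `build_index`, `search`), the use of overlapping windows, and the exact-match lookup are all assumptions made for this example.

```python
from collections import defaultdict


def ngrams(sequence, n=12):
    """Yield (offset, n-gram) for every overlapping window of length n."""
    for i in range(len(sequence) - n + 1):
        yield i, sequence[i:i + n]


def build_index(sequences, n=12):
    """Map each n-gram to the list of (sequence id, offset) where it occurs.

    `sequences` is a hypothetical dict of {sequence id: DNA string}.
    """
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for offset, gram in ngrams(seq.upper(), n):
            index[gram].append((seq_id, offset))
    return index


def search(index, query, n=12):
    """Return sorted (sequence id, offset) pairs where `query` occurs exactly.

    The query must be at least n letters long. Candidates come from the
    postings of its first n-gram; each later n-gram must then appear at
    the matching shifted offset in the same sequence.
    """
    query = query.upper()
    if len(query) < n:
        raise ValueError(f"query must be at least {n} letters long")
    candidates = set(index.get(query[:n], ()))
    for shift in range(1, len(query) - n + 1):
        postings = set(index.get(query[shift:shift + n], ()))
        candidates = {(s, o) for (s, o) in candidates if (s, o + shift) in postings}
    return sorted(candidates)
```

With two toy sequences, `search(build_index({"seqA": "ACGTACGTACGTTTGACC", "seqB": "TTACGTACGTACGTAA"}), "ACGTACGTACGT")` returns `[("seqA", 0), ("seqB", 2)]`. A lookup touches only the postings lists for the query's own 12-grams rather than scanning every sequence, which is the property that makes inverted indexes attractive for large collections.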
