A Bioinformatic Lesson from Chinese Search Engines

Wang Liang of SOSO.com, the third-largest search engine in China, says that searchable bioinformatic databases could learn a lot from search engines that index Chinese-language Web sites, according to the Physics arXiv Blog. As a first step, Liang suggests that bioinformatic search engines adopt an inverted index approach. "That dramatically simplifies things but there are various complexities that make the indexing process tricky. For example, in English, the spaces between words show clearly where each word starts and finishes. That isn't the case in genetic data. So one important question is what constitutes a word," the Physics arXiv Blog reports, adding that "Liang says that an important clue comes from the way search engines index languages like Chinese where there are no spaces between words either." One approach to indexing a Chinese document involves breaking up the text into n-grams, "words that are n-letters long." Liang has applied Zipf's law to the Arabidopsis, Aspergillus, fruit fly, and mouse genomes to determine the average "length" of DNA words, which comes to about 12 letters. Genome data, then, can be indexed using 12-grams, and the advantage of this method is that it doesn't require new technology. "Perhaps there's even a decent business model in such a plan, for example by serving ads targeted at the kind of people who do bioinformatics search," according to the Physics arXiv Blog. "The only question is who will lead the way in this area."
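To make the idea concrete, here is a minimal sketch of an inverted index built from overlapping 12-grams of DNA text. It is an illustration only, not Liang's implementation: the function names build_ngram_index and search, the toy sequences, and the choice to index every overlapping window are all assumptions made for this example.

```python
from collections import defaultdict


def build_ngram_index(sequences, n=12):
    """Map each overlapping n-gram to the (sequence id, offset) pairs where it occurs.

    `sequences` is a dict of {sequence id: DNA string}. Hypothetical helper,
    written for illustration only.
    """
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        seq = seq.upper()
        # Slide a window of width n across the sequence, one base at a time.
        for offset in range(len(seq) - n + 1):
            index[seq[offset:offset + n]].append((seq_id, offset))
    return index


def search(index, query, n=12):
    """Return sorted (sequence id, offset) pairs where `query` occurs exactly.

    The query must be at least n letters long. Each n-gram of the query is
    looked up in the index, and a candidate start position survives only if
    every later n-gram is found at the correspondingly shifted offset.
    """
    query = query.upper()
    if len(query) < n:
        raise ValueError(f"query must be at least {n} letters long")
    candidates = set(index.get(query[:n], []))
    for shift in range(1, len(query) - n + 1):
        hits = set(index.get(query[shift:shift + n], []))
        candidates = {(s, p) for (s, p) in candidates if (s, p + shift) in hits}
        if not candidates:
            break
    return sorted(candidates)


# Toy data, invented for this example.
sequences = {
    "seqA": "ACGTACGTACGTTTGACCA",
    "seqB": "TTACGTACGTACGTAA",
}
index = build_ngram_index(sequences)

print(search(index, "ACGTACGTACGT"))    # [('seqA', 0), ('seqB', 2)]
print(search(index, "ACGTACGTACGTTT"))  # [('seqA', 0)]
```

As with n-gram indexing of Chinese text, no word boundaries are needed: every window of 12 letters is treated as a term, and longer queries are answered by intersecting the postings of their constituent 12-grams.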
