Microsoft Modifies Its Spam-Filtering Software for HIV Vaccine Research


Turns out that Microsoft isn’t only concerned with viruses of the electronic variety. Last week, the company announced that machine-learning software developed within its research labs may help design more effective vaccines to fight HIV.

In a collaboration with the University of Washington and Australia’s Royal Perth Hospital, computer scientists at Microsoft Research are adapting software originally designed for computer vision and spam blocking to sift through large genetic data sets in order to identify features that may lead to improved vaccines.

The program the researchers are using, called Epitome, was designed by Microsoft’s Nebojsa Jojic to condense information by identifying areas of “self similarity” in large data sets. This capability was originally created to enable Microsoft’s e-mail software to differentiate spam from legitimate messages. If the researchers can adapt it to HIV research, the software could eventually address the extreme variability of the virus, which has been one of the persistent challenges of AIDS research.

HIV evolves a million times faster than eukaryotes, and there may be billions of different variants within one infected individual as the virus mutates to outpace the immune system of its host, according to Jim Mullins, chair of the department of microbiology at the University of Washington. “From a practical perspective, it hasn’t been possible to immunize with more than one or two or a few versions of the proteins that you’d want to use in your vaccine,” he said.

Mullins said that his team had already used bioinformatics methods to identify the “ancestral” state of HIV proteins, which turned out to be “unusually rich in the immune recognition sites that we thought were necessary for a good vaccine.” However, he said, the “Achilles’ heel” in that approach was that such a vaccine still wouldn’t encompass the genetic diversity of the virus.

Following a fortuitous meeting of Jojic and a postdoc in Mullins’ lab, the two groups put their heads together and decided that Epitome’s ability to do just what it was named for — that is, identify subsets of information that “epitomize” a larger data set — could help tackle the diversity problem in HIV vaccine research.

The Epitome software “really matches the biology of the problem,” Jojic said. “Essentially we just chop up all the strains of the viruses that we have into patches and assemble them back into epitomes.”

Mullins said that this allowed his group to compress the information in the database they were searching against, “and then they add additional information onto that to describe the variations that exist within the database.”
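To make the "chop into patches" idea concrete, here is a minimal Python sketch. It is not the Epitome software, which the article does not describe in detail. It is a toy stand-in that simply counts which short windows recur across sequences. The function names, the window length `k`, and the three short strings are all made up for illustration and are not real HIV sequences.

```python
from collections import Counter

def patches(seq, k):
    """Return every length-k window (patch) of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def patch_coverage(strains, k):
    """Count how many strains contain each patch at least once."""
    counts = Counter()
    for s in strains:
        counts.update(set(patches(s, k)))  # set(): count each strain once
    return counts

def greedy_summary(strains, k, n):
    """Pick the n patches shared by the most strains (ties: alphabetical)."""
    counts = patch_coverage(strains, k)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [p for p, _ in ranked[:n]]

# Hypothetical toy "strains", invented for this sketch.
strains = ["MGARASVLSG", "MGARASILSG", "MGSRASVLSG"]
counts = patch_coverage(strains, 3)
top = greedy_summary(strains, 3, 2)
print(top)
```

In this toy data, only two three-letter patches occur in all three strings, so they are the ones the summary keeps. A real system would presumably need to weigh patch overlap and immunological relevance, which this sketch ignores.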

The upshot is a 10-fold speedup in the ability to filter patient data, according to Simon Mallal, executive director of the Royal Perth Hospital’s Center for Clinical Immunology and Biomedical Statistics. Mallal said that his lab has amassed genetic data for 25 AIDS patients, which he described as “the largest set of HIV samples mapped to specific immune types ever collected.”

Furthermore, Mullins said, the researchers are confident that the method will lead to more effective vaccines. “Let’s say you have 100 different genes, and they’re all HIV proteins, but they’re all subtly different from one another. One approach to the vaccine would be to make all 100 of those genes — express those proteins, and immunize somebody with 100 different proteins,” he said. “The Microsoft approach would be to immunize with the one basic structure that is common to all of those, plus little bits of protein that describe the variations. So instead of having 100 genes, you have the equivalent of maybe three or four genes.”
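The "one basic structure plus little bits" description can be illustrated with a small Python sketch. It assumes the sequences are already aligned and of equal length, and it uses a plain column-wise majority vote as the "basic structure." That is a simplification chosen for this example, not the method the researchers used. All names and sequences below are hypothetical.

```python
from collections import Counter

def consensus(seqs):
    """Most common residue in each column of equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def variations(seq, base):
    """List (position, residue) pairs where seq differs from base."""
    return [(i, r) for i, (r, b) in enumerate(zip(seq, base)) if r != b]

def reconstruct(base, diffs):
    """Rebuild a sequence from the base plus its recorded differences."""
    chars = list(base)
    for i, r in diffs:
        chars[i] = r
    return "".join(chars)

# Hypothetical toy sequences, invented for this sketch.
seqs = ["MGARASVLSG", "MGARASILSG", "MGSRASVLSG"]
base = consensus(seqs)
diffs = [variations(s, base) for s in seqs]
print(base, diffs)
```

Here three ten-letter sequences reduce to one base sequence plus two single-residue differences, and `reconstruct` recovers each original exactly. The savings Mullins describes, from 100 genes to the equivalent of three or four, would depend on how similar the real sequences are.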

But will it work? “All we know is that to date, nothing has worked,” he said. “But this is the first attempt to deal with this variation problem.”

Several vaccine models developed using the approach are currently undergoing wet-lab validation. Jojic said he expects to complete the first phase of lab testing in around six months, but added that preliminary tests “have verified some of our assumptions.”

As for Microsoft’s future plans in this area, Jojic said that the initial progress of the project has engendered the support of his supervisors. “Really, the main interest for this work has been general machine learning,” and the HIV work was expected to be a “side project” as the team looked to extend its methods into new data types, he said. But “when it turned out that we could make some impact,” Jojic said his group received approval to devote more resources to the task.

Jojic and his colleague David Heckerman, manager of Microsoft Research’s machine learning and applied statistics group, currently work on the project along with two other Microsoft researchers and two postdocs. Jojic said that the group plans to hire two more postdocs over the summer.

“It’s rewarding to see our techniques being used for such a great purpose,” Jojic said. “I would have never thought that technology for spam filtering might help solve the AIDS problem.”

Information on Epitome is available on Jojic’s website, and he said that the team “will probably in the future think about putting up the software itself.” Any hesitation in making the package available would not be due to IP issues, Jojic pointed out, but would instead arise from the complexity of creating and supporting a publicly available software program.

“I understand that it’s customary in bioinformatics to make your software available on the web,” Jojic said, but he noted that this is a “different angle” than he’s accustomed to. In the computer vision community, he said, “you just publish the idea in a paper.”

— BT
