Machine Learning Tool Uncovers Novel Endogenous Viruses in Human Genome

NEW YORK – Researchers have used a new approach to uncover additional endogenous viruses hidden within the human genome.

Endogenous viruses are thought to be the remnants of ancient viruses that caused infections and hitched a ride within host genomes. Teasing out these genetic leftovers has typically relied on identifying sequences within a host genome that resemble known viruses, but researchers from Kyoto University and their colleagues developed a machine learning approach to detect other ancient viruses left behind in the human genome.

As they reported on Monday in the Proceedings of the National Academy of Sciences, Kyoto's Keizo Tomonaga and his colleagues used their classifier to detect known and novel endogenous viruses, including ones that do not share similarities with known viruses, shedding light on viral diversity.

"Our goal is to detect endogenous viral sequences that are not homologous to sequences of previously identified viruses, i.e., that might have not been identified yet or already have been extinct, and that are not detected by conventional homology analysis," Tomonaga, a molecular virologist, said in an email.

He and his colleagues trained a support vector machine on known non-retroviral endogenous RNA virus elements, particularly from bornaviruses and filoviruses, to distinguish between those sequence patterns and those of the human genome. They noted that k-mers of three or longer were sufficient to distinguish between viral and human sequences.
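As a rough illustration of the kind of k-mer features such a classifier could work from, the sketch below turns a DNA sequence into a vector of normalized 3-mer frequencies. It is not the researchers' code: the function name `kmer_frequencies`, the normalization, and the decision to skip windows containing non-ACGT characters are all assumptions made for the example. Vectors like these, computed for viral elements and for human background sequence, are what a support vector machine would be trained to separate.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return normalized k-mer frequencies in a fixed ACGT order.

    Hypothetical helper for illustration; windows containing
    characters other than A, C, G, or T are ignored.
    """
    seq = seq.upper()
    # All 4**k possible k-mers, in lexicographic order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


vec = kmer_frequencies("ACGTACGT", k=3)
print(len(vec))  # one feature per possible 3-mer
```

With k set to 3, each sequence becomes a 64-dimensional vector regardless of its length, which is what makes sequences of different sizes comparable.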

By applying this classifier to the human reference genome, the researchers sought to detect non-retroviral endogenous RNA virus elements. After a number of steps to reduce false positives, such as searching for poly-A tracts and target site duplications or removing cellular pseudogenes, they homed in on about 100 non-retroviral endogenous RNA virus element-like sequences within the human reference genome.
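The two sequence hallmarks mentioned above can be checked with very simple string logic. The sketch below is only an illustration under stated assumptions: the function names, the minimum run of 10 adenines, the 30-base window, and the 6-to-20-base duplication length are hypothetical choices, not values reported by the researchers. It also requires exact matches, whereas real target site duplications can be imperfect.

```python
import re


def has_poly_a_tail(seq, min_run=10, window=30):
    """True if the last `window` bases contain a run of >= `min_run` A's.

    Thresholds are illustrative assumptions.
    """
    return re.search("A{%d,}" % min_run, seq[-window:].upper()) is not None


def find_tsd(upstream, downstream, min_len=6, max_len=20):
    """Return the longest exact repeat that ends the upstream flank and
    starts the downstream flank, or None if none is at least `min_len` long.
    """
    upstream, downstream = upstream.upper(), downstream.upper()
    longest = min(max_len, len(upstream), len(downstream))
    # Try the longest candidate first so the first hit is the longest match.
    for n in range(longest, min_len - 1, -1):
        if upstream[-n:] == downstream[:n]:
            return upstream[-n:]
    return None


candidate = "ACGT" * 5 + "A" * 12
print(has_poly_a_tail(candidate))
print(find_tsd("GGGCCATTGACCT", "ATTGACCTTTTGG"))
```

A candidate that shows both features looks more like a sequence inserted by retrotransposition machinery than like a chance match, which is why filters of this kind help reduce false positives.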

This set included five of the eight known bornaviral non-retroviral endogenous RNA virus elements, suggesting the researchers' approach could identify most known endogenous viral sequences. The researchers suspected that their classifier missed the other three bornaviral elements because it is designed to capture sequences with the typical features of non-retroviral endogenous RNA virus elements, and those three deviated from that pattern.

In addition, two of the sequences their classifier detected fell below the detection threshold of a typical Blast search, suggesting to the researchers that these sequences would otherwise have gone unnoticed. These sequences, dubbed hsEBLN-8 and hsEBLN-9, had weak similarities with orthobornaviruses and with recently discovered bornaviruses belonging to the genus Carbovirus.

At the same time, the researchers identified one predicted viral insertion that could belong to an unknown virus. This sequence is about 600 nucleotides long and is marked by both a poly-A tail and target site duplications. A similar insertion site is present in chimpanzees and marmosets, but not in tarsiers, suggesting the insertion occurred at least 43 million years ago.

As he and his colleagues trained their classifier using bornaviruses and filoviruses and looked in particular for sequences with poly-A tracts and target site duplications, Tomonaga noted there could be additional endogenous viruses that they have yet to discover in the human genome that do not follow those particular patterns.

Studying the ancient viruses within human and other animal genomes could give further insight into the diversity of ancient viruses and modern viruses. "Searching the unknown virus-like sequences in the genomes of other animal species — for example bat species that are thought to be vectors for many pathogenic viruses — will not only give rise to our knowledge about the diversity of past and present virosphere, but also prepare us for future pandemics," Tomonaga said.
