Machine Learning Tool Uncovers Novel Endogenous Viruses in Human Genome

NEW YORK – Researchers have used a new approach to uncover additional endogenous viruses hidden within the human genome.

Endogenous viruses are thought to be the remnants of ancient viruses that caused infections and hitched a ride within host genomes. Teasing out these genetic leftovers has typically relied on identifying sequences within a host genome that resemble known viruses, but researchers from Kyoto University and their colleagues developed a machine learning approach to detect other ancient viruses left behind in the human genome.

As they reported on Monday in the Proceedings of the National Academy of Sciences, Kyoto's Keizo Tomonaga and his colleagues used their classifier to detect known and novel endogenous viruses, including ones that do not share similarities with known viruses, shedding light on viral diversity.

"Our goal is to detect endogenous viral sequences that are not homologous to sequences of previously identified viruses, i.e., that might have not been identified yet or already have been extinct, and that are not detected by conventional homology analysis," Tomonaga, a molecular virologist, said in an email.

He and his colleagues trained a support vector machine on known non-retroviral endogenous RNA virus elements, particularly from bornaviruses and filoviruses, to distinguish between those sequence patterns and those of the human genome. They noted that k-mers of three or longer were sufficient to distinguish between viral and human sequences.
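To make the idea of k-mer-based features concrete, the sketch below turns a DNA sequence into a fixed-length vector of normalized k-mer frequencies. It is an illustration only: the function name `kmer_frequencies`, the normalization, and the example sequence are assumptions, and the article does not describe the exact features the researchers used. Vectors like these could then be passed to a support vector machine implementation such as scikit-learn's `SVC`.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    The vector has 4**k entries in lexicographic k-mer order, so every
    sequence maps to the same feature layout regardless of its length.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing other characters (for example N) are ignored.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


# Hypothetical input sequence, used only to show the call.
features = kmer_frequencies("ATGCGATACGCTTGA", k=3)
print(len(features))
```

With k = 3 each sequence becomes a 64-dimensional vector, and a classifier would be trained on such vectors labeled as viral-derived or human.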

By applying this classifier to the human reference genome, the researchers sought to detect non-retroviral endogenous RNA virus elements. After a number of steps to reduce false positives — such as searching for poly-A tracts and target site duplications, or removing cellular pseudogenes — they homed in on about 100 non-retroviral endogenous RNA virus element-like sequences within the human reference genome. 
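Two of the hallmarks mentioned above, poly-A tracts and target site duplications, can be checked with simple string operations. The following is a rough sketch under stated assumptions rather than the researchers' pipeline: the function names, the length thresholds, and the requirement of an exact match between flanks are all hypothetical, and real duplications may contain mismatches.

```python
import re


def longest_poly_a(seq):
    """Length of the longest uninterrupted run of A in the sequence."""
    runs = re.findall(r"A+", seq.upper())
    return max((len(r) for r in runs), default=0)


def target_site_duplication(upstream, downstream, min_len=5, max_len=20):
    """Return the longest exact repeat that ends the upstream flank and
    starts the downstream flank, or None if none reaches min_len.

    Thresholds are illustrative assumptions, not values from the study.
    """
    upstream, downstream = upstream.upper(), downstream.upper()
    longest = min(max_len, len(upstream), len(downstream))
    for size in range(longest, min_len - 1, -1):
        if upstream[-size:] == downstream[:size]:
            return upstream[-size:]
    return None


def has_insertion_hallmarks(insert, upstream, downstream, min_poly_a=10):
    """True if the candidate has both a poly-A tract and flanking repeats."""
    return (longest_poly_a(insert) >= min_poly_a
            and target_site_duplication(upstream, downstream) is not None)


# Hypothetical candidate: the flanks share the 7-base repeat TAGACCT.
print(target_site_duplication("GGCTTAGACCT", "TAGACCTGGA"))
```

Candidates that pass both checks would still need the other filters described above, such as removing cellular pseudogenes.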

This set included five of the eight known bornaviral non-retroviral endogenous RNA virus elements, suggesting the researchers' approach could identify most known endogenous viral sequences. The researchers suspected that their classifier missed the other three bornaviral elements because it is designed to capture sequences with patterns typical of non-retroviral endogenous RNA virus elements, and these three deviated from those norms.

In addition, two of the sequences their classifier detected fell below the detection threshold of a typical BLAST search, suggesting to the researchers that these sequences would otherwise have gone unnoticed. These sequences, dubbed hsEBLN-8 and hsEBLN-9, had weak similarities with orthobornaviruses and recently discovered bornaviruses belonging to the genus Carbovirus.

At the same time, the researchers identified one predicted viral insertion that could belong to an unknown virus. This sequence is about 600 nucleotides long and is marked by both a poly-A tail and target site duplications. A similar insertion site is present in chimpanzees and marmosets, but not in tarsiers, suggesting the insertion occurred at least 43 million years ago.

Because the classifier was trained on bornaviruses and filoviruses, and the team looked in particular for sequences with poly-A tracts and target site duplications, Tomonaga noted that the human genome could harbor additional endogenous viruses that do not follow those patterns and have yet to be discovered.

Studying the ancient viruses within human and other animal genomes could give further insight into the diversity of ancient viruses and modern viruses. "Searching the unknown virus-like sequences in the genomes of other animal species — for example bat species that are thought to be vectors for many pathogenic viruses — will not only give rise to our knowledge about the diversity of past and present virosphere, but also prepare us for future pandemics," Tomonaga said.