Machine Learning Approach Improves Detection of Small Structural Variants from Paired-End Data

Structural variation is still notoriously difficult to analyze comprehensively, especially in repetitive areas of the genome. Researchers in Italy have now developed a method for detecting small and medium-sized structural variations that complements existing approaches.

The software, called SVM2 for "structural variation mapping using support vector machines," employs supervised learning to predict structural variants from paired-end reads, taking into account deviations from the expected library insert size and information from local patterns of read mapping.

A description of the new tool, which the researchers plan to use to analyze plant and fungal genomes, appeared online in Nucleic Acids Research last month.

According to David Horner, an assistant professor in the Department of Biosciences at the University of Milan and the senior author of the paper, "while we may have found a large proportion of the larger medium-sized [structural variation] events, which are prominent and common in the human population and occur in relatively unique regions of the human genome, there remains a very large amount of small events where many of the available methods struggle, particularly in repetitive regions of the genome."

Jan Korbel, a researcher at the European Molecular Biology Laboratory in Heidelberg and an expert in structural variation analysis, called the new software "an interesting new tool for structural variant detection" that "appears to work well for deletions and insertions, including small events."

Four classes of methods exist for determining structural variations from next-generation sequencing data. One approach is to assemble sequence data de novo into large contigs that can then be mapped and compared to the reference genome. This method also works in repetitive regions, as long as the repeats are smaller than the contigs. "This would probably be the single best approach if it was possible to reliably reassemble short reads to make large contigs," Horner said.

Another approach, which is similar in principle to array comparative genomic hybridization, maps sequence reads onto the reference genome and looks for areas where the read coverage is higher or lower than expected, pointing to changes in copy number. However, despite recent improvements in technology, sequencing reads are not uniformly distributed across the genome, Horner said, and it is difficult to detect small copy number gains or losses in highly repetitive regions. Also, this method does not reveal exactly where a structural variation has occurred.

A third method, called split mapping, aligns sequence reads to the reference genome but allows for insertions or deletions in the read. With this approach, which is best suited to detect small structural variations, it is possible to determine exactly where an event has occurred, provided that the read maps unambiguously. It is necessary, however, to anchor part of the read on either side of the event, which is sometimes difficult, Horner said.

SVM2 is an example of a fourth type of method, which maps paired-end reads to the reference genome and looks for deviations from the expected insert size. The smaller the insert size, the more easily and reliably small events can be detected.
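The insert-size idea can be illustrated with a minimal Python sketch. This is not SVM2's code: the function names, the z-score test, the cutoff of 3, and the example library parameters are all assumptions chosen for illustration. It scores how far the mean mapped distance of the pairs spanning a locus sits from the library's expected insert size.

```python
import math
from statistics import mean

def insert_size_z(observed, lib_mean, lib_sd):
    """Z-score of the mean mapped distance of read pairs spanning one locus.

    observed: mapped distances (bp) of pairs spanning the locus
    lib_mean, lib_sd: expected insert size and its standard deviation
    """
    n = len(observed)
    if n == 0:
        raise ValueError("no spanning pairs at this locus")
    return (mean(observed) - lib_mean) / (lib_sd / math.sqrt(n))

def classify(z, cutoff=3.0):
    # Pairs that map farther apart than expected suggest sequence missing
    # from the sample (a deletion); pairs that map closer together suggest
    # extra sequence in the sample (an insertion). The cutoff is arbitrary.
    if z >= cutoff:
        return "deletion"
    if z <= -cutoff:
        return "insertion"
    return "none"

# Hypothetical data: ten pairs spanning a 20-base deletion, 200-base library
spans = [222, 218, 225, 215, 221, 219, 224, 216, 220, 220]

tight = classify(insert_size_z(spans, lib_mean=200, lib_sd=10))
loose = classify(insert_size_z(spans, lib_mean=200, lib_sd=40))
print(tight, loose)
```

With these made-up numbers, the same 20-base shift is called when the library's spread is 10 bases but is lost in the noise when the spread is 40 bases, which is one way to read the point that tighter libraries make small events easier to detect.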

Definitions vary, Horner said, but small events are often considered to be between 5 and 10 bases long, and medium-sized events between 10 and 50 bases long. Most structural variation in the human genome consists of small events, though there are also thousands of medium-sized ones, he said.

"While the split mapping methods have been traditionally seen as perhaps the best methods for searching for very small events, the insert-size perturbation [methods] have classically been seen as maybe the best methods for looking for medium-size events," he explained.

SVM2 differs from other insert-size approaches in that it is also capable of detecting small structural variants from paired-end sequence data, which could come from any sequencing platform. Using machine learning, the method can be trained to recognize certain read mapping patterns that are expected around specific types of structural rearrangements.

Existing methods that use paired-end sequence data usually only look at the observed insert size between reads at a given site. "We reasoned that, in fact, there are some other sources of information you should be able to extract from the mapping pattern of reads," Horner said. For example, whenever there is a junction associated with an insertion or a deletion, there should be a small interval that is not covered well with reads. Also, when only one read of a pair can be mapped, the researchers look at whether that is consistent with perturbations of the insert size.
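As a rough illustration of the two extra signals Horner describes, the Python sketch below computes, for one window of the reference, the minimum read depth (a dip hints at a junction) and the fraction of reads whose mate did not map. The `Read` record, the `window_features` function, and the example coordinates are hypothetical and are not taken from SVM2.

```python
from typing import List, NamedTuple

class Read(NamedTuple):
    start: int         # 0-based, inclusive
    end: int           # exclusive
    mate_mapped: bool  # False for a "hanging" read whose mate did not map

def window_features(reads: List[Read], win_start: int, win_end: int) -> dict:
    """Summarize read mapping patterns inside [win_start, win_end)."""
    if win_end <= win_start:
        raise ValueError("window must contain at least one position")
    depth = [0] * (win_end - win_start)
    overlapping = 0
    hanging = 0
    for r in reads:
        lo, hi = max(r.start, win_start), min(r.end, win_end)
        if lo >= hi:
            continue  # read lies entirely outside the window
        overlapping += 1
        if not r.mate_mapped:
            hanging += 1
        for pos in range(lo, hi):
            depth[pos - win_start] += 1
    return {
        "min_depth": min(depth),
        "mean_depth": sum(depth) / len(depth),
        "hanging_fraction": hanging / overlapping if overlapping else 0.0,
    }

# Hypothetical 10-base reads with an uncovered gap at positions 12-13
reads = [
    Read(0, 10, True),
    Read(2, 12, True),
    Read(14, 24, False),
    Read(15, 25, True),
]
features = window_features(reads, 0, 25)
print(features)
```

Values like these could serve as inputs to a classifier alongside insert-size statistics; how SVM2 actually encodes its features is not described here.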

The team has also introduced more sophisticated statistics for detecting differences in the insert size distribution. "Rather than using one statistical test with a very high significance cutoff, our method looks for positions which maybe only give marginal significance, but from a whole series of tests, and taken together, suggest an event," Horner said. "This is probably why we perform reasonably well in detecting very small events."
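The general principle, that several individually marginal results can add up to convincing evidence, can be shown with Fisher's method for combining p-values. This is a stand-in chosen for illustration, not the procedure SVM2 uses, and it assumes the tests are independent, which overlapping windows along a genome generally are not. The function name and example values are hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom, whose survival function has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!  for even degrees of freedom.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i  # builds (X/2)^i / i! incrementally
        total += term
    return math.exp(-half_x) * total

# Four hypothetical tests, none of which passes a 0.01 cutoff alone
marginal = [0.04, 0.04, 0.04, 0.04]
combined = fisher_combined_p(marginal)
print(combined < 0.01)
```

In this made-up case, no single test clears a 0.01 threshold, yet the combined p-value does.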

To assess the performance of their method, the researchers used an existing dataset of paired-end Illumina reads to compare it with two other tools: BreakDancer, a paired-end-based method developed by the Genome Center at Washington University School of Medicine; and PinDel, a split-mapping method developed by the European Bioinformatics Institute.

Performance assessments are difficult, Horner said, because not many good benchmarking datasets for human genomes exist. For example, it is dangerous to evaluate a method based on structural variants found using only short-read next-gen sequencing data, but there are few human genomes available for which Sanger data also exists. And benchmarking methods based on their ability to call heterozygous or somatic structural variations is particularly difficult because that requires high-depth Sanger data, he said.

Which method works best will also likely change as sequencing technology improves, for example as read length increases.

In their comparison, BreakDancer performed well on medium-sized structural variations, showing high specificity, but not so well on very small events, compared to SVM2. "Not wanting to offend anyone, but we feel that at least with the data that we have considered, we outperformed BreakDancer," Horner said.

PinDel, on the other hand, found a large percentage of very small events with high specificity and "reasonable" sensitivity, he said, many of which overlapped with those detected by SVM2. However, PinDel and SVM2 each detected large sets of small events that the other tool missed, so combining both methods might be a good approach. "For groups that may not want to use 10 or 15 different predictors, a good compromise might be to use a method like PinDel and a method like ours to maximize sensitivity without having to do too much bioinformatics work," he said.
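Combining two call sets in this way can be sketched in a few lines of Python. The tuple format, the `slop` tolerance, and the function names below are assumptions made for illustration; neither tool's actual output format is modeled.

```python
from typing import List, Tuple

Call = Tuple[str, int, int]  # (chromosome, start, end), end exclusive

def overlaps(a: Call, b: Call, slop: int = 0) -> bool:
    """True if two calls are on the same chromosome and lie within slop bases."""
    return a[0] == b[0] and a[1] < b[2] + slop and b[1] < a[2] + slop

def compare_calls(calls_a: List[Call], calls_b: List[Call], slop: int = 10):
    """Split calls into those both tools support and those unique to each."""
    shared = [a for a in calls_a
              if any(overlaps(a, b, slop) for b in calls_b)]
    only_a = [a for a in calls_a
              if not any(overlaps(a, b, slop) for b in calls_b)]
    only_b = [b for b in calls_b
              if not any(overlaps(a, b, slop) for a in calls_a)]
    return shared, only_a, only_b

# Hypothetical calls from two different predictors
tool_a = [("chr1", 100, 120), ("chr1", 500, 508), ("chr2", 40, 60)]
tool_b = [("chr1", 105, 118), ("chr2", 300, 306)]

shared, only_a, only_b = compare_calls(tool_a, tool_b)
union = shared + only_a + only_b  # maximizes sensitivity
print(len(shared), len(only_a), len(only_b), len(union))
```

Taking the union favors sensitivity, while restricting to the shared calls would favor specificity; which trade-off is appropriate depends on the study.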

A large number of the events detected by SVM2 had previously been observed with methods based on Sanger sequencing and fell into slightly repetitive regions of the genome, Horner noted. This suggests that "taking more complex evaluations of the mapping patterns, as we tried to do, holds the potential to allow access to what's been known as the inaccessible fraction of the human genome, which essentially is the highly repetitive fraction," he said.

SVM2 could be further improved by combining the concepts of split-mapping methods and insert-size based methods. "It's on our list of things to do," Horner said. "We think that we should be able to effectively introduce some additional mapping features into our machine learning method and some additional post-processing steps in utilizing our split-mapping data to increase the specificity of our method further."

EMBL's Korbel said that "it will be interesting to see whether in the future similar rationales will be applicable to identify more challenging forms of structural variation, such as inversions and SVs in highly duplicated areas of the genome."

In their own work, Horner and colleagues plan to apply SVM2 to several fungal genomes that were recently assembled, which will help them further validate the tool's performance. In addition, they will likely apply it to the genomes of fruit-bearing plants that have been, or are currently being, sequenced by consortia that involve Italian researchers, such as the genomes of grapevine, strawberry, apple, peach, and orange.

The software and manual are available from the researchers' FTP site and will be updated along with the method.
