By Monica Heger
Researchers at BGI have devised a method to map structural variants using short-read next-gen sequencing data that relies on de novo assembly of the genome.
Publishing their results this week in Nature Biotechnology, the team said that unlike other methods for calling structural variants in short-read data, the BGI approach "can resolve complex rearrangements."
Although structural variation accounts for a greater fraction of the diversity between individuals than SNPs, calling these variants from short sequencing reads has been difficult. Short reads are hard to map unambiguously to the genome, which makes complex rearrangements tricky to find. Additionally, current methods often favor structural variations of certain lengths or specific types, and may be unable to resolve breakpoints at single-nucleotide resolution.
Many of these problems arise because short-read sequencing data is first aligned to a reference genome. So, in the Nature Biotechnology paper, the BGI team first did a de novo assembly, and then designed an algorithm for calling structural variants.
"The fact that they're able to find these rearrangements at all and validate a good proportion of them is exciting," said Erich Jarvis, an associate professor in neurobiology at Duke University who has used 454 technology to do de novo assembly of song bird genomes. "It's really two methods in one — a de novo assembly and structural variation identification," he added.
The BGI team used the genomes of an Asian individual and an African individual that had previously been sequenced on the Illumina Genome Analyzer. Some of the data came from reads of between 35 and 44 base pairs, while other data came from reads of 75 base pairs, Yingrui Li, head of BGI's bioinformatics team and a first author of the paper, told In Sequence in an e-mail. The de novo assemblies had contig N50 lengths of 7.4 kilobases and 5.9 kilobases, respectively.
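For readers unfamiliar with the N50 statistic cited above, it is the contig length at which contigs of that length or longer account for at least half of all assembled bases. The following is a minimal sketch of that standard definition, not code from the BGI paper; the function name `n50` and the toy contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in base pairs)."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    # Walk from the longest contig down, accumulating bases until
    # at least half of the total assembly length is covered.
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly, NOT data from the paper.
toy_contigs = [9000, 7400, 5900, 3000, 1200, 500]
print(n50(toy_contigs))  # 7400
```

In this toy case the two longest contigs together cover 16,400 of 27,000 bases, so the N50 is the length of the second contig. A higher N50 generally indicates a more contiguous assembly.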
Li added that, going forward, the team plans to use 100-base-pair reads, which will improve the assembly and thus its ability to call structural variations.
For de novo assembly, the team used its in-house SOAPdenovo algorithm. To detect structural variation, the team used LASTZ, an algorithm that aligns de novo whole-genome assemblies, to align the Asian and African assemblies to human reference build 36.
The team analyzed only structural variants smaller than 50 kilobases, since larger variations can be detected with other approaches. After filtering out false positives, they detected 80,719 and 87,457 insertions, 51,711 and 56,074 deletions, 26 and 23 inversions, and 717 and 516 complex rearrangements in the Asian and African genomes, respectively.
The team also analyzed structural variation in the genomes of 106 people in the 1000 Genomes Project. They found an "excess of low-frequency structural variations," suggesting that "structural variations are more specific to individuals than are SNPs in humans," the researchers wrote.
They also found that structural variants were more likely to be found in non-coding regions of the genome, such as within telomeres and intronic regions, meaning that they probably affect gene and protein regulation, said Jarvis. For instance, the length of telomeres is related to aging, so structural variants within telomeric regions may be involved in determining a person's lifespan, he said.
Comparing the method to the algorithms BreakDancer and pIndel, which also detect structural variants in short-read data, the BGI researchers showed that their method had a lower false-positive rate. Also, most of the variants detected by the BGI method were not detected by BreakDancer or pIndel.
While pIndel had the highest overlap with structural variants detected in previous studies and detected the highest total number of structural variants, the algorithm is optimized for short indels, so it calls only deletions of up to 10 kilobase pairs and insertions of up to 16 base pairs. The BGI method, meanwhile, calls insertions and deletions of up to 50 kilobase pairs. BreakDancer, on the other hand, detects only insertions and deletions greater than 10 base pairs.
While the BGI team did not compare its method to approaches that call structural variants from longer-read data, such as 454 or Pacific Biosciences data, Jarvis said that those platforms would likely be as good, if not better, at calling structural variants, particularly in repetitive regions.
The BGI team was not able to assemble repetitive regions very well, so it's likely that the approach would not be able to call variants from those areas of the genome, Jarvis said.
"This paper did better than expected on the shorter-read technology, but long-read technology will ultimately be better," he added.