BGI Devises Method to Map Structural Variants via De Novo Assembly of Short-Read Data


By Monica Heger

Researchers at BGI have devised a method to map structural variants using short-read next-gen sequencing data that relies on de novo assembly of the genome.

Publishing their results this week in Nature Biotechnology, the team said that unlike other methods for calling structural variants in short-read data, the BGI approach "can resolve complex rearrangements."

Although structural variation accounts for a greater fraction of the diversity between individuals than SNPs, calling these variants from short sequencing reads has been difficult. Shorter reads can make it tricky to find complex rearrangements due to difficulties of mapping reads to the genome. Additionally, current methods often favor structural variations of certain lengths, or specific types of structural variations, and also may be unable to resolve breakpoints at single nucleotide resolution.

Many of these problems arise because short-read sequencing data is first aligned to a reference genome. So, in the Nature Biotechnology paper, the BGI team first did a de novo assembly, and then designed an algorithm for calling structural variants.

"The fact that they're able to find these rearrangements at all and validate a good proportion of them is exciting," said Erich Jarvis, an associate professor in neurobiology at Duke University who has used 454 technology to do de novo assembly of song bird genomes. "It's really two methods in one — a de novo assembly and structural variation identification," he added.

The BGI team used the genomes of an Asian individual and an African individual that had previously been sequenced on the Illumina Genome Analyzer. Some of the data came from reads of 35 to 44 base pairs and the rest from reads of 75 base pairs, Yingrui Li, head of BGI's bioinformatics team and a first author of the paper, told In Sequence in an e-mail. The de novo assemblies had contig N50s of 7.4 kilobases and 5.9 kilobases, respectively.
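For readers unfamiliar with the N50 statistic quoted for the assemblies, the following is a minimal sketch of how it is conventionally computed: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases. The function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in base pairs).

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches at least half
    of the total assembly size.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare doubled running total to avoid fractional halves.
        if running * 2 >= total:
            return length


# Toy example with made-up contig lengths (not real assembly data).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```

A higher contig N50 generally means the assembly is less fragmented, which is why longer reads would be expected to help.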

Li added that going forward the team plans to use 100-base-pair reads, which will improve the assembly and thus their ability to call structural variations.

For de novo assembly, the team used its in-house designed SOAPdenovo algorithm. To detect structural variation, they used an algorithm called LASTZ, which aligns de novo whole-genome assemblies, and they aligned the Asian and African genomes to the human reference build 36.

The team analyzed only structural variants smaller than 50 kilobases since larger variations can be detected with other approaches. After filtering out false positives, they were able to detect in the Asian and African genomes 80,719 and 87,457 insertions, 51,711 and 56,074 deletions, 26 and 23 inversions, and 717 and 516 complex rearrangements, respectively.

The team also analyzed structural variation in the genomes of 106 people in the 1,000 Genomes Project. They found an "excess of low-frequency structural variations," suggesting that "structural variations are more specific to individuals than are SNPs in humans," the researchers wrote.

They also found that structural variants were more likely to be found in non-coding regions of the genome, such as within telomeres and intronic regions, meaning that they probably affect gene and protein regulation, said Jarvis. For instance, the length of telomeres is related to aging, so structural variants within telomeric regions may be involved in determining a person's lifespan, he said.

Comparing the method to the algorithms BreakDancer and pIndel, which also detect structural variants in short-read data, the BGI researchers showed that their method had a lower false-positive rate. Also, most of the variants detected by the BGI method were not detected by BreakDancer or pIndel.

While pIndel had the highest overlap with structural variants detected in previous studies and detected the highest total number of structural variants, the algorithm is optimized for short indels, so it only calls deletions of up to 10 kilobase pairs and insertions of up to 16 base pairs. The BGI method, meanwhile, calls insertions and deletions of up to 50 kilobase pairs. BreakDancer, on the other hand, only detects insertions and deletions greater than 10 base pairs.
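The size limits described above can be laid out as a small lookup, which makes it easier to see which method would be expected to report an indel of a given length. This is only a sketch built from the limits stated in this article: the table `SIZE_LIMITS_BP`, the function `tools_covering`, and the treatment of "up to" as inclusive and "greater than" as exclusive are assumptions, and limits the article does not state are left as `None`.

```python
KB = 1000  # base pairs per kilobase

# (exclusive lower bound, inclusive upper bound) in base pairs.
# None means the article does not state a limit on that side.
SIZE_LIMITS_BP = {
    "BGI assembly method": {"insertion": (None, 50 * KB), "deletion": (None, 50 * KB)},
    "pIndel": {"insertion": (None, 16), "deletion": (None, 10 * KB)},
    "BreakDancer": {"insertion": (10, None), "deletion": (10, None)},
}


def tools_covering(kind, size_bp):
    """List the methods whose stated size range includes this indel."""
    covering = []
    for tool, limits in SIZE_LIMITS_BP.items():
        lower, upper = limits[kind]
        above_lower = lower is None or size_bp > lower
        within_upper = upper is None or size_bp <= upper
        if above_lower and within_upper:
            covering.append(tool)
    return covering


# A 500-base-pair insertion exceeds pIndel's stated 16-base-pair limit.
print(tools_covering("insertion", 500))
# A 5-base-pair deletion falls below BreakDancer's stated threshold.
print(tools_covering("deletion", 5))
```

The lookup only reflects reported size ranges; it says nothing about sensitivity or false-positive rates within those ranges.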

While the BGI team did not compare its method to approaches that call structural variants from longer-read platforms, such as from 454 or Pacific Biosciences data, Jarvis said that those platforms would likely be as good, if not better, at calling structural variants — particularly in repetitive regions.

The BGI team was not able to assemble repetitive regions very well, so it's likely that the approach would not be able to call variants from those areas of the genome, Jarvis said.

"This paper did better than expected on the shorter-read technology, but long-read technology will ultimately be better," he added.

Have topics you'd like to see covered by In Sequence? Contact the editor at mheger [at] genomeweb [.] com.
