NEW YORK – The Genome in a Bottle Consortium (GIAB) has publicly released a benchmarking data set for calling germline structural variants (SVs), specifically insertions and deletions larger than 50 bp.
The benchmark set contains 7,281 sequence-resolved insertions and 5,464 deletions that can be used to identify either false positive or false negative SV calls in normal, nontumor samples. The data also include the genomic regions where these calls are likely to be found, which are critical for identifying false positive variant calls.
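Identifying false positive and false negative calls comes down to matching each call from a method against the benchmark. The Python sketch below is a minimal illustration of that idea, not the consortium's comparison method. The `SV` record, the greedy matching, and the thresholds (`max_dist` of 1,000 bp and `min_size_ratio` of 0.7) are assumptions chosen for illustration, with some tolerance built in because breakpoints and sizes reported by different methods rarely agree exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    """Hypothetical minimal SV record used only for this sketch."""
    chrom: str
    pos: int      # 1-based start position
    svtype: str   # "INS" or "DEL"
    size: int     # variant length in bp


def sv_match(a, b, max_dist=1000, min_size_ratio=0.7):
    """Return True if two SVs agree within the assumed tolerances."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    small, large = sorted((a.size, b.size))
    return large > 0 and small / large >= min_size_ratio


def compare(benchmark, calls, **tolerances):
    """Greedily match calls to benchmark SVs; return (TP, FP, FN) counts."""
    used = set()  # indices of benchmark SVs that already have a match
    tp = fp = 0
    for call in calls:
        hit = next(
            (i for i, b in enumerate(benchmark)
             if i not in used and sv_match(b, call, **tolerances)),
            None,
        )
        if hit is None:
            fp += 1          # call has no benchmark counterpart
        else:
            used.add(hit)
            tp += 1
    fn = len(benchmark) - len(used)  # benchmark SVs no call recovered
    return tp, fp, fn


benchmark = [SV("1", 10000, "DEL", 300), SV("1", 50000, "INS", 120)]
calls = [SV("1", 10020, "DEL", 290), SV("2", 700, "DEL", 80)]
tp, fp, fn = compare(benchmark, calls)
```

In this toy input, the first call sits 20 bp from a benchmark deletion of nearly the same size and counts as a match, the second call has no counterpart, and the benchmark insertion goes undetected. A real comparison would also need to consider sequence content and genotype, which this sketch ignores.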
The team used 19 different variant calling methods from several sequencing and genome mapping technologies to build the benchmark set. In a paper published Monday in Nature Biotechnology, the researchers demonstrated the ability to identify variant calling errors in both short- and long-read next-generation sequencing data as well as in optical mapping data. Researchers from Pacific Biosciences, Roche, Bionano Genomics, 10x Genomics, Nabsys, Google, and Spiral Genetics contributed to the paper.
"We have tried to combine info from as many different technologies as we could to form this call set," said Justin Zook, a researcher at the National Institute of Standards and Technology and a GIAB coleader. "One challenge is that often a method will not get the call exactly right, it might be off by a little bit. Part of the work that we did was to develop methods that compare these structural variants in a robust way." The group relied heavily on de novo assembly methods to detect particular types of SVs, he added.
Earlier iterations of these data have already been available to GIAB members. "We use the GIAB structural variant data all the time," Benedict Paten, a computational biologist at the University of California, Santa Cruz, Genomics Institute, wrote in an email. UCSC is part of the GIAB consortium, but Paten was not an author on the Nature Biotechnology paper. "It has proven extremely useful for developing and testing new methods." Paten has used the data as his lab helps build a graph representation of the human pangenome.
GIAB is a worldwide public-private consortium led by NIST and aimed at characterizing human genomes. So far, the group has focused on seven genomes: a pilot genome and two mother-father-son trios consented from the Personal Genome Project for commercial redistribution. In addition to reference materials, GIAB has been working on benchmarking data sets for smaller indels, which it released in April 2019, also in Nature Biotechnology. The consortium has now created an updated small variant benchmark set, Zook said, and plans to make that public soon.
Work on the structural variants began in 2016, when the community started collecting SV call sets from individual methods and comparing them to each other. Zook noted that the benchmarking process was iterative. "We release drafts for the community to try to use. If we release an initial draft it will not call false positives and negatives, initially." GIAB went through at least four different versions prior to this published one, Zook said.
The full list of technologies used to create the benchmark set includes short-read sequencing from Illumina and Complete Genomics; 10x's now-discontinued Linked Reads method; PacBio's long-read sequencing; optical genome mapping from Bionano Genomics; and electronic mapping from Nabsys. Mapping technologies were used for SV size estimates, the authors noted. Zook added that Oxford Nanopore Technologies' platform was not included because it was not yet available to GIAB when the analysis began.
The final large indel call set is available as two files: a VCF file containing the calls and a BED file showing the regions in which the calls are likely to be found.
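As a rough illustration of how a VCF and a BED file work together, the Python sketch below keeps only the VCF records whose start position falls inside a BED region. It is a minimal sketch under stated assumptions, not GIAB's tooling: the function names are hypothetical, the BED intervals are assumed not to overlap, and only the start position of each variant is checked rather than its full span. It relies on the standard conventions that BED intervals are 0-based and half-open while VCF positions are 1-based.

```python
from bisect import bisect_right
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)} intervals."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    for chrom in regions:
        regions[chrom].sort()
    return regions


def in_regions(regions, chrom, pos):
    """Check whether a 1-based VCF position lies in a 0-based, half-open BED interval.

    Assumes the intervals on each chromosome do not overlap.
    """
    intervals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= the query position.
    i = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]


def filter_vcf(lines, regions):
    """Return VCF data lines whose POS falls inside the given regions."""
    kept = []
    for line in lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if in_regions(regions, fields[0], int(fields[1])):
            kept.append(line.rstrip("\n"))
    return kept


# Toy inputs invented for this example, not taken from the GIAB files.
bed_lines = ["1\t100\t200", "1\t500\t900"]
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t150\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-80",
    "1\t200\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-60",
    "1\t201\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-75",
    "2\t150\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-90",
]
kept = filter_vcf(vcf_lines, load_bed(bed_lines))
```

With these toy inputs, the records at positions 150 and 200 on chromosome 1 are kept, while position 201 falls just past the end of the first interval and the chromosome 2 record has no regions at all.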
Zook and his coauthors admitted that the data were limited to a small part of the structural variant universe. "Most notably, we chose to exclude complex SVs and SVs for which we could not determine a consensus sequence," the authors wrote. "Limiting our set to isolated insertions and deletions removed approximately half of SVs for which there was strong support that some SV occurred." Repeats and segmental duplications, therefore, were not included. Zook noted that some complex SVs, including genomic rearrangements such as kataegis and chromothripsis, are usually associated with cancer and outside the scope of this project.
While the community expects the benchmarking to be helpful, everyone agrees there's still plenty of work to be done.
"In the future, we'd love to see closer integration between the SV sets and the sets of benchmark single nucleotide variants and small indels," Paten said.
GIAB is collecting new large indel call sets for the human reference genomes GRCh37 and GRCh38 from Oxford Nanopore's platform, as well as PacBio's newer HiFi reads and Strand-seq.
And the iterations on this SV benchmark will continue: "We don't expect this to be final," Zook said.