NEW YORK (GenomeWeb) – The Genome in a Bottle Consortium has been developing new reference materials for genome sequencing and is working on a first set of high-quality structural variants for human genomes. In the meantime, laboratories have begun to adopt its first pilot genome as a standard for developing new sequencing technologies and assays.
Last month, the private-public consortium, which is spearheaded by the National Institute of Standards and Technology, released four new DNA reference materials, adding to the pilot sample it made available last year. The consortium counts about 20 members on the sample development side and an equal number on the data analysis side, including clinical laboratories, sequencing technology companies, professional organizations, and research groups at academic institutions and government agencies. While the majority of these are US-based, the group also has members from Australia, Asia, and Europe.
All reference materials have been extensively characterized by several sequencing and DNA mapping technologies. The consortium has also published a set of high-confidence variant calls for the original pilot genome, against which researchers can benchmark their own results.
Researchers and developers can purchase all five DNA samples from the NIST Standard Reference Material program. The new reference materials are NIST RM 8392, a parent-son trio of Eastern European Ashkenazi Jewish ancestry; NIST RM 8391, the son of the same trio; NIST RM 8393, a male of Chinese ancestry; and NIST RM 8375, bacterial DNA from Salmonella typhimurium LT2, Staphylococcus aureus, Pseudomonas aeruginosa, and Clostridium sporogenes. The first sample, NIST RM 8398, released in May 2015, comes from a female HapMap sample of European ancestry from Utah, also known as NA12878.
All human genomes contained in the new reference materials derive from participants in Harvard University's Personal Genome Project (PGP), who have provided very broad consent for use of their samples, which have been turned into cell lines that are stored at the Coriell Institute for Medical Research.
At a workshop held at NIST last month that was attended by about 100 participants from academic institutions, government agencies, and companies, consortium members discussed what other samples GIAB should develop as reference materials.
According to Marc Salit, leader of the Genome-Scale Measurements Group at NIST and the Joint Initiative for Metrology in Biology (JIMB), a collaboration between NIST and Stanford University, the plan is to develop reference materials for additional ancestries, as well as to establish reference materials from cancer samples. Because of their broad consent, all new samples will likely also come from PGP participants.
For new ancestries, the consortium will consider individuals with African, Asian (other than Chinese), Hispanic, and mixed heritage. The researchers also plan to recruit a large family, which will require additional clearance since the PGP is currently not IRB-approved to recruit new participants, Salit said.
Instead of relying on self-reported ancestry alone, the scientists plan to scan PGP samples to see what ancestries are available to them, using either existing genotype data or new data from SNP arrays.
Generating cancer reference materials will be more of a long-term project, Salit said, that comes with technical challenges. The consortium currently plans to turn material from a single tumor into several different cell lines, each representing an evolutionary lineage with a different mutational profile, and to turn a blood sample from the same patient into a normal control cell line. The idea is to select tumors with interesting mutational diversity among those cell lines and to develop those into reference samples, he said, which is "a little bit of a science project."
Working with clinical partners, the consortium plans to recruit cancer patients to the PGP who could contribute samples. In parallel, it hopes to identify cancer patients in the existing PGP cohort.
Initially, the group plans to focus on one or two solid tumor types. Sarcomas in particular appear to be of interest to many GIAB participants, Salit said, because they are physically large and probably genetically heterogeneous.
Besides determining what new samples to develop into reference materials, the GIAB steering committee also decided that the consortium should continue to focus on "building long-term, high-value samples that form the basis of the best-characterized genomes," Salit said, and not on providing samples with specific mutations for clinical assays. "This question of how we validate genetic assays in the clinic, where you need some true positives, is not going to be addressed by Genome in a Bottle," he said.
Already, a variety of labs and researchers have purchased the original pilot genome reference material and downloaded the variant calls for that sample. In total, NIST has sold several hundred units of the sample, Salit said.
According to Justin Zook, a research scientist at NIST and one of the leaders of the GIAB consortium, almost 40 customers of the pilot reference sample are part of health systems that do clinical genomics, and almost 30 are sequencing technology developers. The other 70-odd customers include academic institutions, pharmaceutical companies, commercial clinical service providers, federal agencies, and international agencies.
The consortium is also hosting about 100 terabases of data characterizing the genomes contained in its reference materials. The site has had about 1,000 unique users per month and about 50,000 file downloads in total so far.
"Clinical labs are using our reference materials as part of their process of understanding the performance of their assays," Zook said, in particular for detecting SNPs. However, he added, the data tend to contain few difficult variants, such as large indels or complex changes, even though these might be important for a clinical lab.
For example, an analysis of the pilot genome that NIST and Stanford researchers published last year found that many medically relevant areas of the genome were not included in the high-confidence calls.
To address this, he said, clinical labs have started to spike DNA with specific mutations of clinical interest into a background of genomic DNA, often using the NIST reference material for this. The Partners HealthCare Center for Personalized Genetic Medicine, for example, in collaboration with SeraCare Life Sciences, recently published a paper in which they described spiking DNA fragments with pathogenic variants for hypertrophic cardiomyopathy into genomic DNA.
Several companies have already used the NIST reference materials — which had been identified more than a year prior to their release last month — to create commercial spike-in controls, Salit said. They include the AcroMetrix Oncology Hotspot Control from Thermo Fisher Scientific and a sample with DNA spike-ins from SeraCare.
In addition, Horizon Diagnostics is selling several Genome In A Bottle HDx Reference Standards, including cells from a GIAB sample that have been formalin-fixed and paraffin-embedded. Longer term, companies are also looking to introduce mutations into the GIAB genome, Salit said, which is something the PGP consent allows for. "One can imagine an interesting space for making commercial controls whose basis are these well-consented, super-well characterized genomes," he said.
Making variant calls with high confidence
Last month's workshop also discussed how the GIAB consortium can further characterize the reference samples, calling high-confidence variants in difficult regions of the genome as well as more complicated types of changes, such as structural variants.
The group has already made progress in this area. When it released the pilot reference material last year, the set of high-confidence variant calls covered 77 percent of the genome. Since then, the consortium has developed a new integration method for identifying high-confidence variants, and as a result, the latest call set covers about 90 percent of the genome, Zook said. It also has 19 percent more variant calls overall than the first version. The researchers are pushing further now to make even more variant calls in regions of the genome that are difficult to map, he added.
The consortium has also been working on calling structural variants. So far, the researchers have collected about 20 different structural variant call sets for the Ashkenazi Jewish trio, producing data from a handful of sequencing and mapping technologies that include Illumina short-read data from different libraries, Pacific Biosciences data, 10X Genomics data, BioNano Genomics optical mapping data, and Nabsys electronic mapping data. The group also has some Oxford Nanopore data available, but it currently only covers a small percentage of the genome at very low coverage, which is not sufficient to be used in the integration process, Zook said, though this might change over time.
A lot of the discussions at the workshop centered around how the structural variant call sets can be integrated to form benchmark calls, Zook said. One way to do that is to look at the exact DNA sequences of the variants, such as the breakpoint sequence of deletions and the sequence of insertions, and to compare how well calls from multiple sequencing technologies match, he said.
Another approach is to take candidate structural variant calls from each method and see whether there is evidence for them in data from other technologies. "That will help us to know whether a structural variant call that was made by only one technology initially is reported by other technologies," Zook said, and the approach does not rely on knowing exact sequences.
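The cross-technology check described above can be illustrated with a minimal Python sketch. It is a simplification, not the consortium's method: it compares finished call sets by reciprocal overlap instead of looking for evidence in the underlying reads or maps, and the `SVCall` record, the technology labels, the coordinates, and the 50 percent overlap threshold are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    svtype: str  # e.g. "DEL"
    tech: str    # hypothetical technology label


def reciprocal_overlap(a: SVCall, b: SVCall) -> float:
    """Smaller of the two overlap fractions, or 0.0 if the calls cannot match."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def supporting_techs(candidate: SVCall, calls, min_overlap: float = 0.5) -> set:
    """Technologies with at least one call that matches the candidate."""
    return {c.tech for c in calls if reciprocal_overlap(candidate, c) >= min_overlap}


# Made-up calls, for illustration only.
calls = [
    SVCall("chr1", 1000, 2000, "DEL", "short_read"),
    SVCall("chr1", 1050, 2010, "DEL", "long_read"),
    SVCall("chr1", 5000, 5300, "DEL", "optical_map"),
    SVCall("chr2", 1000, 2000, "DEL", "long_read"),
]

for call in calls:
    print(call.chrom, call.start, call.end, sorted(supporting_techs(call, calls)))
```

In this toy data, the first deletion is supported by two technologies, while the optical-map call stands alone. A count like this is one possible input to tiers of confidence, though the consortium had not finalized its process at the time of writing.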
"We still haven't finalized the process for what exactly will define our benchmark calls, and there probably will be multiple tiers of confidence," Zook said. The consortium is still open to adding call sets from additional technologies, he added.
The current plan is to have an initial call set of high-confidence structural variants available early next year and to discuss users' experience with these at a planned structural variant science workshop at Stanford next spring.
GIAB has also been working with the Global Alliance for Genomics and Health to develop standardized tools that researchers can use to benchmark their results against the reference genomes, as well as to define a set of best practices for using these tools. One of the challenges, Zook said, has been to define exactly what a false-positive or false-negative call is, though the group has now come up with standardized definitions. In addition, he said, complex variants can be represented in different ways in a variant call file, which needs to be accounted for to avoid incorrect calls, and several groups have been working on tools to be able to do that. "Just recently, we developed an integrated pipeline where you can choose between two of these sophisticated comparison tools, and can get out a set of standardized performance metrics in the end," he said.
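To make the representation problem concrete, here is a small Python sketch under stated assumptions. It is not one of the comparison tools Zook mentioned, and its definitions of true positive, false positive, and false negative are simple exact-match counts, not the group's standardized definitions. The function names `trim` and `compare` and all variants are hypothetical. The sketch only trims bases shared by the reference and alternate alleles; it does not left-align indels or compare haplotypes, which more sophisticated tools do.

```python
def trim(pos, ref, alt):
    """Remove bases shared by both alleles so equivalent records compare equal."""
    # Strip shared trailing bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Strip shared leading bases, moving the position forward each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def compare(truth, query):
    """Count exact matches between two call sets after trimming."""
    t = {(chrom, *trim(pos, ref, alt)) for chrom, pos, ref, alt in truth}
    q = {(chrom, *trim(pos, ref, alt)) for chrom, pos, ref, alt in query}
    tp, fp, fn = len(t & q), len(q - t), len(t - q)
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up variants: (chromosome, position, reference allele, alternate allele).
truth = [("chr1", 100, "A", "G"), ("chr1", 200, "CTT", "CT"), ("chr1", 300, "G", "GA")]
query = [("chr1", 100, "A", "G"), ("chr1", 200, "CT", "C"), ("chr1", 400, "T", "C")]

result = compare(truth, query)
print(result)
```

Without trimming, the deletion at position 200 would be counted once as a false negative and once as a false positive, even though both records describe the same change.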
The team is also working on strategies to report a user's performance in different categories, for example, calling SNPs versus indels, calling different lengths of insertions and deletions, or detecting variants in easy versus difficult regions of the genome. The goal is to allow researchers to "more precisely understand where they're doing well, where they could improve, and which types of variants are in which specific regions of the genome," Zook said.
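A rough Python sketch of what reporting by category could look like follows. The category names, the size buckets, and the example variants are assumptions made for illustration and are not the consortium's stratifications. Variants are matched by exact equality, and stratification by genome region is left out.

```python
from collections import Counter


def category(ref, alt):
    """Assign a variant to a hypothetical type-and-size bucket."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    size = abs(len(alt) - len(ref))
    if size == 0:
        return "complex"  # same-length substitution of more than one base
    kind = "INS" if len(alt) > len(ref) else "DEL"
    if size <= 5:
        bucket = "1-5bp"
    elif size <= 50:
        bucket = "6-50bp"
    else:
        bucket = ">50bp"
    return f"{kind}_{bucket}"


def stratified_recall(truth, query):
    """Fraction of truth variants found in the query set, per category."""
    found = set(query)
    total, hit = Counter(), Counter()
    for variant in truth:
        cat = category(variant[2], variant[3])
        total[cat] += 1
        if variant in found:
            hit[cat] += 1
    return {cat: hit[cat] / total[cat] for cat in total}


# Made-up variants: (chromosome, position, reference allele, alternate allele).
truth = [
    ("chr1", 100, "A", "G"),
    ("chr1", 150, "C", "T"),
    ("chr1", 200, "CT", "C"),
    ("chr1", 300, "G", "GAAAAAAA"),
]
query = [("chr1", 100, "A", "G"), ("chr1", 200, "CT", "C")]

recall_by_category = stratified_recall(truth, query)
print(recall_by_category)
```

Splitting the numbers this way shows at a glance that the toy query set finds the short deletion but misses the longer insertion, which a single overall recall figure would hide.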
The group has not published their benchmarking tools and standards yet, but they are available on GitHub. The tools for call set comparisons have also already been integrated into the precisionFDA platform, a portal to help researchers test and validate bioinformatic approaches for processing next-gen sequencing data that is part of the Precision Medicine Initiative, where anyone can use them, he said.
Salit said the collaborative approach NIST took with the Genome In a Bottle Consortium will be a model for other initiatives planned as part of the Joint Initiative for Metrology in Biology at Stanford, which is currently hiring principal investigators. "This idea of working in partnership is really important for NIST long term," he said.