The Genome in a Bottle Consortium, a group that has been working on reference materials for human genome sequencing, plans to release its first pilot reference material, a highly characterized HapMap sample, along with a set of highly confident genotype calls early next year.
The group recently released a preliminary set of highly confident calls for the sample's genome, based on an analysis of publicly available sequence data, and is in the process of characterizing the sequence in greater detail.
Spearheaded by the National Institute of Standards and Technology, the consortium, which convened for the first time last year and will have another workshop next month, aims to develop the "meter stick of the genome" (CSN 9/5/2012), thus filling the need for a reliable standard that labs can use to check the performance of their own sequencing operations.
"At present, there are no widely accepted genomic standards or quantitative performance metrics for confidence in variant calling," according to a recent conference abstract by the consortium. "These are needed to achieve the confidence in measurement results expected for sound, reproducible research and regulated applications in the clinic."
The project is organized into four working groups: one in charge of selecting and designing the reference material; another focusing on bioinformatics, data integration, and data representation; a third dealing with measurements to characterize the reference material; and a fourth concentrating on performance metrics.
After overcoming initial concerns last year about informed consent for HapMap samples, the consortium has decided to use sample NA12878, originating from a woman from the Utah CEPH pedigree, as its pilot reference material. That sample has already been extensively characterized in several published sequencing studies. "It's a great pilot genome because we know what we're getting into," said Marc Salit, one of the consortium organizers and a group leader at NIST.
NIST has approved its release as a reference material, and "we will continue to monitor any new developments in the informed consent world," said Justin Zook, another consortium leader and a biomedical engineer at NIST.
Coriell Cell Repositories recently grew a large quantity of NA12878 cells for the consortium and extracted the DNA, a total of about 8,300 vials of 10 micrograms each, which laboratories will be able to request next year. According to Salit, this is "to our knowledge, the largest batch of genomic DNA ever prepared for genomic studies."
NIST scientists are about to characterize several vials of the DNA, using a combination of deep and shallow sequencing, in order to find out how homogeneous the material is, since mutations can arise during cell culture.
In addition, NIST plans to send the DNA to consortium members interested in helping to characterize the reference sample in greater depth. So far, nine groups have said they would like to participate, and more are expected to join the effort. "The idea is this will be characterized on multiple platforms and in multiple laboratories," Zook said.
Researchers at NIST plan to use Life Technologies' SOLiD and Ion Torrent platforms, as well as Illumina's sequencing platforms, to study NA12878. In addition, they expect to obtain sequence data from Complete Genomics, including data from its long fragment read technology, as well as from vendors of sequencing platforms. "Every major platform vendor has expressed that they would characterize reference materials as they came out," Salit said.
Pacific Biosciences has already been sequencing NA12878 in collaboration with academic groups at Weill Cornell Medical College and elsewhere, and the collaborators plan to contribute their data to the consortium, Zook noted, though they have been using a different batch of the DNA. "We don’t expect there will be large differences between different batches of DNA, but we want to confirm what differences there might be," he said. That group has already generated about 20x to 30x coverage, he said, and plans to increase it to 50x to 60x.
The consortium aims to release NA12878 DNA as official NIST reference material early next year, together with a set of highly confident SNPs and insertions and deletions. These variants will be updated over time and extended to structural variants and complex regions of the genome.
In the meantime, NIST scientists have developed methods to identify highly confident genotype calls – including homozygous reference calls – from multiple datasets that are already available for the sample. They are in the process of publishing those methods in a journal.
So far, they have identified a preliminary set of highly confident SNPs and indels from a total of 12 datasets for NA12878, and have posted the variant files to a new FTP site that was set up in partnership with the National Center for Biotechnology Information last month.
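The article does not describe how NIST integrates the datasets, so the following is only a toy sketch of the general idea of consensus calling, not the consortium's method. It assumes each dataset has been reduced to a mapping from a site to a genotype string, and it keeps a site only when enough datasets call it and none disagree. The function name, the `min_datasets` threshold, and the sample sites are all hypothetical.

```python
from collections import Counter

def consensus_calls(datasets, min_datasets=3):
    """Keep sites where at least `min_datasets` datasets report a genotype
    and every reporting dataset agrees on it (hypothetical rule)."""
    votes = {}
    for calls in datasets:
        for site, genotype in calls.items():
            votes.setdefault(site, Counter())[genotype] += 1

    confident = {}
    for site, counter in votes.items():
        if len(counter) == 1:  # no disagreement between datasets
            genotype, support = next(iter(counter.items()))
            if support >= min_datasets:
                confident[site] = genotype
    return confident

# Toy input: site = (chromosome, position); "0/0" is a homozygous reference call.
datasets = [
    {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr1", 300): "0/0"},
    {("chr1", 100): "0/1", ("chr1", 200): "0/1", ("chr1", 300): "0/0"},
    {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr1", 300): "0/0"},
]
result = consensus_calls(datasets, min_datasets=3)
print(result)
```

In this toy input the site at position 200 is dropped because one dataset disagrees, while the homozygous reference site at position 300 is retained. A real integration would also have to weigh platform-specific biases and call quality, which this sketch ignores.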
The FTP site, which Zook said is "under active development," also includes the datasets themselves, which were generated on Illumina, SOLiD, Complete Genomics, Ion Torrent, 454, PacBio, and Sanger sequencing platforms.
Eventually, the site will host raw data files, alignment files, and variant call files for all reference genomes the consortium plans to develop.
NIST has already distributed its preliminary set of confident variants to at least a dozen groups, including academic and government labs as well as clinical laboratories, which are helping to refine the calls. "Because a lot of people are already sequencing NA12878 pretty routinely in their process, they found these calls useful for looking at how accurate their methods are, and particularly for refining them to understand where they should be confident and where maybe they cannot be as confident in their variant calls," Zook said.
The consortium is also working with the Genome Comparison and Analytic Testing resource, GCAT (BI 4/19/2013), to host its highly confident genotype calls, so that users can compare different bioinformatics methods and see what effects they have on variant calls. "We think this is going to be a really valuable resource for the community," Zook said.
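As a rough illustration of how a lab might score its own variant calls against a high-confidence set, the sketch below counts matches and mismatches and derives sensitivity and precision. It is not GCAT's or the consortium's procedure. The function name and the site-to-genotype mapping are assumptions, and the sketch presumes both call sets are already restricted to the same confidently characterized regions.

```python
def compare_to_benchmark(test_calls, benchmark_calls):
    """Score non-reference calls against a benchmark set (hypothetical helper).

    A test call counts as a true positive only if the benchmark has the same
    genotype at the same site. A genotype mismatch counts as both a false
    positive and a false negative.
    """
    tp = sum(1 for site, gt in test_calls.items()
             if benchmark_calls.get(site) == gt)
    fp = len(test_calls) - tp
    fn = sum(1 for site, gt in benchmark_calls.items()
             if test_calls.get(site) != gt)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}

# Toy variant calls keyed by (chromosome, position).
benchmark = {("chr1", 100): "0/1", ("chr1", 200): "1/1",
             ("chr1", 300): "0/1", ("chr1", 400): "1/1"}
lab_calls = {("chr1", 100): "0/1", ("chr1", 200): "1/1",
             ("chr1", 300): "1/1", ("chr1", 500): "0/1"}
stats = compare_to_benchmark(lab_calls, benchmark)
print(stats)
```

Here the lab matches two of four benchmark variants, miscalls one genotype, misses one variant, and reports one extra call, so both sensitivity and precision come out at 0.5.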
In addition, the consortium is closely collaborating with the Centers for Disease Control and Prevention's Genetic Testing Reference Materials Coordination Program, GeT-RM, to have its calls hosted on a browser recently released by GeT-RM. "There will be a lot of overlap," Zook said, noting that both projects will use each other's data.
Besides developing NA12878 as its pilot reference material, the consortium is also looking into other samples with broader and more recent informed consent than the HapMap samples that could be made into reference materials.
Specifically, it recently received samples from a parent-son trio of Chinese origin that is participating in Harvard Medical School's Personal Genome Project, and just submitted an order to Coriell for growing large batches of cells from these samples, which it hopes to have available by the end of this year. "Our plan is, over the next year or two, to have a set of around eight families, at least trios of mother-father-child, from the PGP, from diverse ancestry groups," Zook said.
In parallel to human genomes, the consortium is developing synthetic DNA sequences that could be used as spike-in controls in sequencing experiments, for example to determine the detection limit for mutations, which is particularly important in cancer sequencing, where somatic mutations might be present at low frequencies.
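To make the detection-limit idea concrete, here is a minimal sketch under a simple binomial sampling model: each read covering a site carries the variant with probability equal to the allele fraction, and a variant counts as detected when at least a chosen number of variant reads is seen. The model, the function names, and the thresholds are illustrative assumptions and ignore sequencing errors. This is not a method described by the consortium.

```python
from math import comb

def detection_probability(depth, allele_fraction, min_alt_reads):
    """P(at least `min_alt_reads` variant reads) under Binomial(depth, allele_fraction)."""
    p_fewer = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction)**(depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer

def min_depth_for_detection(allele_fraction, min_alt_reads,
                            target=0.95, max_depth=5000):
    """Smallest depth whose detection probability reaches `target`, or None."""
    for depth in range(min_alt_reads, max_depth + 1):
        if detection_probability(depth, allele_fraction, min_alt_reads) >= target:
            return depth
    return None

# Example: a spike-in present at a 1 percent allele fraction, requiring
# at least 5 supporting reads for a call (both values are hypothetical).
for depth in (100, 500, 1000):
    print(depth, round(detection_probability(depth, 0.01, 5), 3))
print(min_depth_for_detection(0.01, 5))
```

Spiking in a synthetic sequence at a known fraction lets a lab compare observed detection rates with this kind of expectation and see where its pipeline starts to miss low-frequency mutations.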
So far, the consortium has explored synthetic DNA standards about 1 kilobase in size, but larger constructs up to 10 kilobases in size might also be possible. It has not been decided yet, though, whether NIST or another provider, for example a company, might distribute these spike-in standards. According to Salit, several firms, including Life Technologies and Horizon Diagnostics, have expressed an interest in potentially developing commercial kits of synthetic reference materials. "Not every reference material to come out of the Genome in a Bottle Consortium needs to be a NIST reference material," he said.
Moreover, the consortium has started discussions about developing tumor/normal pairs of cell lines into reference materials and is currently exploring various options for that, including an inter-laboratory study of tumor/normal cell lines that are already available.
At its upcoming workshop at NIST Aug. 15 and 16, the consortium plans to make further decisions on what projects to go ahead with. "The Genome in a Bottle Consortium is being hosted and developed by NIST, but certainly it's going to have a life of its own," Salit said. "I'm hoping this workshop establishes some decision-making processes. There are things we need to move forward as a consortium that we have to make choices amongst."
Participation in the workshop, and the consortium, is open to anyone, "in the spirit of 'you get out of it what you put into it,'" Salit said.