BALTIMORE – Researchers at the University of Washington have launched a collaborative initiative to systematically reanalyze all of the approximately 3,300 samples in the 1000 Genomes Project with nanopore long-read sequencing.
Led by UW researchers Danny Miller and Evan Eichler, the project, named the 1000G ONT Sequencing Consortium and first announced in May, aims to augment the existing 1000 Genomes Project dataset with nanopore long-read data to better understand patterns of structural variation in the human genome, identify variation in difficult-to-map genomic regions, and study genomic methylation patterns.
"The genesis [of this initiative] was that we needed help with filtering a lot of the variants that we were finding," said Miller, a physician-scientist who focuses on using nanopore sequencing to help solve challenging clinical cases. According to him, the lack of a comprehensive reference database for genetic variations obtained from long-read sequencing has made it challenging for researchers to discern the pathogenic genetic variants from the normal ones.
To bridge this gap, Miller and his team turned to samples from the 1000 Genomes Project, an international collaboration that took place between 2008 and 2015 to identify common genetic variants with frequencies of at least 1 percent in the study populations. The project's final dataset contained data for more than 2,500 self-reported healthy individuals from 26 populations across five continents, generating one of the largest public catalogs of global human variation and genotype data.
There are a few benefits of using the 1000 Genomes Project samples, Miller noted. For one, he said, its entire study population is well curated and has been analyzed with short-read whole-genome sequencing, generating rich orthogonal data that can help with phased assemblies for this initiative.
Besides, the 1000 Genomes Project contains trio samples, which Miller considers to be "really informative" when it comes to tracing structural variant changes over a small number of generations. In addition, the project includes other datasets such as RNA-seq and ChIP-seq. "There are signals in those datasets that are interesting, and they're not explained by the polymorphisms," Miller pointed out.
According to Miller, the 1000G ONT Sequencing Consortium will be carried out in two phases. For the current pilot phase, 500 "diverse" samples will be selected from the 1000 Genomes Project and sequenced using the Oxford Nanopore platform. Miller said the consortium is being funded by his and Eichler's labs as well as Oxford Nanopore, which will provide sequencing materials such as reagents and flow cells.
As they begin to illustrate the utility of long-read sequencing for the 1000 Genomes Project pilot samples, Miller said the goal is to secure enough funding to initiate the second phase, which is to sequence the rest of the samples within the dataset.
"As we know through the Telomere-to-Telomere consortium there's additional information missing [from the 1000 Genomes Project dataset] and Oxford Nanopore is excited to help to identify that additional information," an Oxford Nanopore spokesperson wrote in an email.
The spokesperson added that Oxford Nanopore is "providing support in kind and guidance where needed" to this consortium.
For the pilot samples, Miller said the sequencing will be executed in four to five production labs, including his and Eichler's, that have established infrastructure for nanopore sequencing. While the exact workflow is still being finalized, he said the samples will be obtained from Coriell Institute for Medical Research, a nonprofit biomedical institute whose biobanks house the cell lines from the 1000 Genomes Project samples.
However, given that the project aims to achieve an average read length of 50 kb and 30X to 40X coverage for each sample, Miller pointed out that one challenge will be to obtain enough high-quality DNA for the samples. To that end, Miller said the team is still deciding whether to extract DNA from the samples in-house or order high molecular weight DNA from the Coriell Institute directly.
Next, the DNA samples will be sequenced with one flow cell per sample using the Oxford Nanopore PromethION P24 or P48 devices, which Miller considers the only viable nanopore sequencing platforms for this consortium since they can generate large-scale data at a reasonable cost. The goal, Miller said, is to complete the sequencing of the 500 pilot samples by the first quarter of next year.
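To put the stated targets in perspective, the arithmetic behind 30X to 40X coverage at a 50 kb average read length can be sketched as follows. This is a rough back-of-the-envelope illustration, not a figure from the consortium: the genome size of about 3.1 Gb is an assumption, and the function names are hypothetical.

```python
# Rough estimate of per-sample sequencing yield for a coverage target.
# ASSUMPTION: a haploid human genome of roughly 3.1 billion bases; the
# article itself does not state a genome size.
GENOME_SIZE_BP = 3_100_000_000


def required_yield_gb(coverage, genome_size_bp=GENOME_SIZE_BP):
    """Total bases (in gigabases) needed to reach the given mean coverage."""
    return coverage * genome_size_bp / 1e9


def approx_read_count(coverage, mean_read_len_bp, genome_size_bp=GENOME_SIZE_BP):
    """Approximate number of reads needed at a given mean read length."""
    return round(coverage * genome_size_bp / mean_read_len_bp)


if __name__ == "__main__":
    for cov in (30, 40):  # coverage range quoted in the article
        gb = required_yield_gb(cov)
        reads = approx_read_count(cov, 50_000)  # 50 kb average read length
        print(f"{cov}X: about {gb:.0f} Gb, about {reads:,} reads")
```

Under these assumptions, each sample would need on the order of 93 Gb to 124 Gb of sequence from its single flow cell, which gives a sense of why the amount of high-quality input DNA matters.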
Once a sample is sequenced, Miller said the plan is to make the data publicly available online "almost immediately" after basecalling and standard QC. At that point, anyone in the world can download the FASTQ files and look at the data, he added.
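As an illustration of what looking at the data might involve, the sketch below summarizes read lengths from a FASTQ file. It is a minimal example, not the consortium's QC pipeline: it assumes standard four-line FASTQ records without line wrapping, and the function names and the file name in the usage note are hypothetical.

```python
import gzip
from statistics import mean


def read_lengths(path):
    """Yield the length of each read in a FASTQ file (plain or gzipped).

    ASSUMPTION: every record spans exactly four lines, so the sequence
    is always the second line of each group of four.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:
                yield len(line.strip())


def n50(lengths):
    """Return the read length at which half of all bases are in reads
    of that length or longer; returns 0 for an empty input."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


def summarize(path):
    """Return read count, mean read length, and N50 for a FASTQ file."""
    lengths = list(read_lengths(path))
    return {
        "reads": len(lengths),
        "mean_length": mean(lengths) if lengths else 0,
        "n50": n50(lengths),
    }
```

A call such as `summarize("sample.fastq.gz")`, with a placeholder file name, would return the read count, mean read length, and N50, which could then be compared against the 50 kb average read length target.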
In addition to sequencing, the consortium will also carry out data analysis on the samples. Specifically, Miller said the team plans to align the sequencing data to both the GRCh38 reference genome and the T2T-CHM13 reference genome from the Telomere-to-Telomere (T2T) consortium to benchmark the quality of the structural variant calls. In addition, he said the group will perform phased assemblies for the genomes.
"For a lot of regions of the genome, this will be the first time we can call variants," said Miller, referring to the genomic regions with low complexity or segmental duplications, which short-read sequencing can't efficiently tackle. "There are medically relevant genes in those regions."
Miller said the consortium will also leverage nanopore sequencing's ability to sequence native DNA to harness methylation profiles for these samples.
In the end, he envisions the consortium producing a flagship paper accompanied by several smaller analysis papers, similar to the output of the T2T consortium.
Toward that goal, the 1000G ONT Sequencing Consortium is recruiting researchers worldwide into the collaboration. According to Miller, about 100 researchers joined the consortium's kickoff meeting at the end of last month, and about 130 scientists, including a mix of principal investigators, postdocs, and grad students from six continents, have joined the consortium's Slack group so far.
Miller said the consortium's structure will likely be similar to that of T2T, where smaller analysis groups are formed based on scientists' research interests and expertise. While the consortium has not yet settled on the final analysis groups, Miller envisions they will focus on topics such as structural variants, methylation, and genome assembly.
More importantly, Miller hopes the consortium can serve as a platform for people to garner experience with long-read nanopore sequencing. "I think the number one thing is inviting people to learn how to work with long-read sequencing data, [especially those] who may not have the opportunity," he said.
There are also obstacles for the consortium to overcome. For instance, Miller said carrying out phased assemblies, which requires a lot of CPU power, can be "really challenging," especially when working with samples at a large scale. In addition, he said the group needs to explore the best ways to aggregate the structural variant data and curate it in an easily digestible and meaningful way for users.
Because this initiative focuses solely on nanopore sequencing, Miller acknowledges there will be some trade-offs in data quality for omitting other long-read sequencing modalities, such as Pacific Biosciences HiFi sequencing.
"We can't do a T-to-T assembly with nanopore sequencing alone; you need HiFi reads to do that," said Miller, adding that there will also be "the obvious challenges" of calling indels or homopolymers in segmental duplications due to nanopore sequencing's "inherent error rate." However, he said for the consortium's goal to determine population-level structural variants and methylation, nanopore data is "the right type of data to use for that."
Despite the challenges, Miller, who became a PI and established his own lab earlier this month, considers the opportunity to lead this consortium "very humbling."
"I feel lucky that I have the opportunity to do this," he said. "I am really excited to see how I can use this data to help some of my really challenging unsolved families."