This article has been updated to provide additional information from Soren Germer.
NEW YORK – The National Institutes of Health is funding five Genome Characterization Centers at institutions across the US to systematically study and catalog human somatic variations as part of its five-year, $140 million Somatic Mosaicism across Human Tissues (SMaHT) Network.
By devoting a big slice of the initiative’s total budget, $61 million, to these centers, NIH expects them to produce high-quality genomic data using state-of-the-art sequencing technologies, laying a data foundation for the SMaHT Network.
"The genesis of the Genome Characterization Centers is to come up with a more comprehensive catalog of somatic variants," said Richard Conroy, an officer at the NIH Office of Strategic Coordination who is leading the SMaHT Network. "I think the best way of describing these centers is that they are the core data generators within the [SMaHT] consortium."
The five institutions that received Genome Characterization Center funding are the Broad Institute, Seattle Children's Hospital, New York Genome Center, Baylor College of Medicine, and Washington University.
These centers, which will split the $61 million in funding, are expected to comprehensively sequence and catalog the somatic variants in a set of 15 healthy tissue samples from each of more than 150 adult donors, totaling at least 2,250 samples that will serve as the basis for the SMaHT variant catalog.
While the NIH does not expect the centers to develop new technologies themselves, they will work with the Tool Development Projects within the network to "improve the ultimate value of the catalog," the agency noted.
According to Conroy, each Genome Characterization Center has demonstrated "a strong track record" of performing short-read, long-read, and RNA sequencing, which are the core technological components of the data generation effort. Beyond that, he said, the centers are encouraged to bring their own innovative flair to the table.
"We didn't specify which particular vendors or technologies they use," Conroy said. "But we wanted to make sure that it was consistent data being generated across each of the Genome Characterization Centers."
The New York Genome Center (NYGC), for instance, is planning to deploy the new Ultima Genomics platform to achieve duplex error-corrected short-read sequencing using machine learning, a framework that NYGC researchers, led by Dan Landau, had successfully implemented in cancer blood samples for high-sensitivity circulating tumor DNA detection.
"We thought it would be nice to try [this approach] in the context of somatic mutation detection," said Soren Germer, NYGC’s senior VP of genome technologies who is co-leading the institute's Genome Characterization Center efforts. As the team is adapting the Ultima workflow for somatic variant detection, Germer said, they are also hoping to "fine-tune" the machine learning algorithm to boost the detection signals while minimizing technical errors.
In addition to the core sequencing modalities, Germer said his group is also planning to incorporate a novel amplification-free single-cell whole-genome sequencing assay that promises to detect, with high sensitivity, CNVs from individual cells and SNVs from clones of multiple cells.
Termed Direct Library Preparation Plus, the workflow uses the CellenOne single-cell isolation and dispensing platform developed by Cellenion to separate cells into nanowells, followed by modified Illumina Nextera library prep and shallow WGS of the individual cells. "Even though you perform shallow sequencing for each cell, if you have multiple cells, you can start to have enough data to call the mutations," Germer explained.
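Germer's point about pooling shallow data across cells can be illustrated with a back-of-the-envelope sketch. It is not NYGC's pipeline: the Poisson read-count model is a common simplification, and the per-cell depth, allele fraction, and read threshold below are hypothetical values chosen only for illustration.

```python
import math


def prob_at_least(k: int, mean: float) -> float:
    """P(X >= k) for X ~ Poisson(mean), a simplified model of read counts at one site."""
    below = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def expected_alt_reads(per_cell_depth: float, n_cells: int,
                       allele_fraction: float = 0.5) -> float:
    """Expected variant-supporting reads at a site after pooling cells from one clone.

    allele_fraction=0.5 assumes a heterozygous variant shared by every cell in the clone.
    """
    return per_cell_depth * n_cells * allele_fraction


PER_CELL_DEPTH = 0.1  # hypothetical shallow coverage per cell
MIN_ALT_READS = 3     # hypothetical evidence threshold for calling a variant

for n_cells in (1, 10, 50, 200):
    mean_alt = expected_alt_reads(PER_CELL_DEPTH, n_cells)
    p = prob_at_least(MIN_ALT_READS, mean_alt)
    print(f"{n_cells:>4} cells: expected alt reads {mean_alt:5.2f}, "
          f"P(>= {MIN_ALT_READS} alt reads) = {p:.3f}")
```

Under these assumptions, a single cell almost never yields enough supporting reads at a given site, while a clone of many cells does, which is the intuition behind calling SNVs at the clone level rather than per cell.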
The Genome Characterization Center at Seattle Children's Hospital, meanwhile, hopes to elucidate somatic mosaicism at a telomere-to-telomere level. To achieve that, the researchers plan to perform ultra-high-depth short-read sequencing on the Illumina NovaSeq sequencer while taking advantage of emerging long-read sequencing technologies like the Oxford Nanopore PromethION and Pacific Biosciences Revio platforms.
"If we want to build a catalog of normal somatic variation, we have to do this across the entire genome and not just pick and choose which parts of the genome that we want to focus on," said Andrew Stergachis, who is running Seattle Children's Genome Characterization Center with James Bennett and Evan Eichler.
"The advantage of this approach is that it is going to shine a light on every part of the genome," Bennett said, adding that, along the way, the team also aims to generate donor-specific reference genomes to enable more accurate somatic variant calling.
"Everyone is unique; we have known this in genetics forever," Stergachis said. "If you want to study genetic variation, you need to use that own individual's genome as its own reference."
Furthermore, the Seattle Children’s researchers plan to carry out Fiber-seq, a method originally developed by Stergachis and his collaborators that can profile open chromatin regions using PacBio sequencing, to further explore the potential regulatory impact of the somatic variants.
With this technique, "we are not just measuring somatic mutations, but also their functional significance in one shot," Bennett pointed out.
While the five Genome Characterization Centers will take various approaches, Conroy said there will be "strong involvement" of NIH staff to steer the groups toward working together, since the funding, unlike regular NIH grants, is carried out as a "cooperative agreement."
"We expect a lot of communication between each of the awardees," Conroy said. "For us, that is the whole point of funding this as a network rather than individual projects."
Conroy said a big reason to promote cross-talk among the centers is to ensure the data is generated consistently and accurately across the board. This is particularly important for studying somatic mosaicism, he pointed out, because some variants may occur at such low frequencies that they could be mistaken for sequencing errors or technical artifacts.
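Why low-frequency variants are hard to tell apart from errors can be seen in a minimal sketch. It assumes each read at a site miscalls the variant base independently at a fixed error rate (a binomial model), and the depth, read count, and error rates are hypothetical, not figures from the SMaHT centers.

```python
import math


def error_only_tail(depth: int, alt_reads: int, error_rate: float) -> float:
    """P(X >= alt_reads) for X ~ Binomial(depth, error_rate).

    This is the chance that sequencing errors alone produce at least `alt_reads`
    variant-supporting reads at a site covered by `depth` reads.
    Requires 0 < error_rate < 1. Terms are computed in log space for stability.
    """
    log_e = math.log(error_rate)
    log_not_e = math.log1p(-error_rate)
    total = 0.0
    for i in range(alt_reads, depth + 1):
        log_comb = (math.lgamma(depth + 1) - math.lgamma(i + 1)
                    - math.lgamma(depth - i + 1))
        total += math.exp(log_comb + i * log_e + (depth - i) * log_not_e)
    return min(1.0, total)


DEPTH = 300    # hypothetical coverage at one site
ALT_READS = 3  # hypothetical number of reads supporting the variant (1% of reads)

# Two hypothetical per-base error rates: a conventional one and an error-corrected one.
for label, rate in (("standard", 1e-3), ("error-corrected", 1e-6)):
    p = error_only_tail(DEPTH, ALT_READS, rate)
    print(f"{label:>15} (error rate {rate:g}): P(errors alone explain it) = {p:.2e}")
```

Across the billions of sites in a genome, even a small per-site probability of error-only support adds up to many false candidates, which is one reason consistent error rates and shared benchmarks across centers matter.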
In that regard, Conroy said the centers have formed a working group that is currently designing benchmarking studies to establish a performance baseline and cross-validate their results.
"What we need to do is to put out all the information about how data is generated, as well as to show what quality control measures we have taken," Conroy said. "There is not much point in us generating a lot of data unless the wider research community finds it believable."
As the Genome Characterization Centers produce their sequencing data, they are also expected to analyze their results, which are then passed on to the network's Data Analysis Center to be harmonized into a variant catalog, Conroy noted.
"Part of the idea with these programs is that as the data is generated, we make it rapidly available," he said. "The commitment is that the data gets published within six months of being generated."
Given that the NGS field is evolving rapidly, he also noted that the Genome Characterization Centers are encouraged to embrace new technologies as they become available "with a lot of careful versioning."
While the current funding for the SMaHT Network lasts five years, Conroy said there is also an opportunity to extend the initiative to as long as 10 years, depending on how it progresses.
"Hopefully, over the next five years, we will gain a much deeper understanding of where and when somatic mosaicism occurs plus the different types of somatic mosaicism," he said. "Part of what we propose for the second five years is really to do a deeper dive into understanding the functional consequences of the somatic mosaicism."