NEW YORK (GenomeWeb) – The UK Biobank is overseeing the whole-genome sequencing of an initial 50,000-sample cohort, a process that should conclude by the end of this year.
Yet the £30 million ($40 million) pilot effort, called the Vanguard Project, is merely laying the foundation for a more extensive main phase that will see the remainder of the biobank's repository of 500,000 samples sequenced.
The project dovetails with a separate effort to sequence the whole exomes of all half million biobank samples, the first 50,000 of which became available to researchers this month. UK Biobank also carried out the whole-genome genotyping of the entire repository using Affymetrix arrays earlier in the decade, data that has been made publicly available for years.
According to Rory Collins, a professor of medicine and epidemiology at the University of Oxford and the UK Biobank's principal investigator, the effort to whole-exome sequence the repository was initiated by the pharmaceutical companies Regeneron Pharmaceuticals and GSK, which agreed to make the data available to other researchers after a limited period of exclusive access.
The biobank is "available for others to come along and do certain assays in order to do research on the data, and then the quid pro quo is that those data will become available to other researchers," said Collins.
The push to sequence the whole genomes of all 500,000 participants, however, came from the UK government as part of its Life Sciences Industrial Strategy, a 75-page report published last year. "They said it would be attractive to get whole-genome sequence data as part of the strategy, to create an environment in the UK that would encourage research in the UK from both academic and industrial researchers," Collins said.
The UK Medical Research Council, the organization responsible for coordinating and funding medical research in the UK, made funding available for the pilot phase of the effort and, after a tender process, the Wellcome Sanger Institute in Hinxton was tapped to sequence the first 50,000 samples, which, Collins noted, are more or less identical to the samples whole-exome sequenced by Regeneron and GSK.
Cordelia Langford, director of scientific operations at the Sanger, said the nonprofit institute stood to gain from sequencing the initial 50,000-sample cohort because its researchers would eventually use the dataset in their studies. "For that reason, we wanted to be involved in generating the data at a really high quality," said Langford. "There is an investment there in terms of making sure that it meets and exceeds our expectations."
For the Sanger, there was also the challenge of building a pipeline to manage the sequencing of 50,000 whole genomes. Langford noted that the institute added capacity because of the "sheer scale" of the project. The institute had been using the Illumina HiSeq X Ten, a system the San Diego vendor launched five years ago that consists of 10 instruments capable of sequencing 18,000 genomes per year. For the Vanguard Project, the Sanger moved to the newer NovaSeq 6000, which can sequence up to 48 whole genomes per run. Illumina launched NovaSeq in 2017.
Langford said that the Vanguard Project presented the Sanger with opportunities to learn and implement this new technology.
"There were certain … assumptions that were put forward in the project proposal to deliver Project Vanguard," said Langford. "One was the assumption that we would use the current most up-to-date short-read sequencing," she said. "We spent time developing our processes just to hone them so that we could hit the ground running when the samples started flowing."
That included developing a quality control process to assess data quality at intervals during the sequencing process. "There is an infrastructure required that allows us to automatically QC all of the data that is coming off that massive scale from the production side," said Langford.
Langford said that the Sanger is producing compressed sequence files, in CRAM format, that will be passed on to another organization. That organization will call variants from each file on a sample-by-sample basis and will also carry out joint-cohort calling, in which large numbers of samples are analyzed together. It has not yet been decided which organization will take on the variant-calling informatics, she noted.
The Sanger received its first batch of Vanguard samples last summer and is on track to complete the whole-genome sequencing of all 50,000 samples by the end of 2019.
While the intention is to make the dataset available to researchers, Collins said it would take "quite a bit of work to get the data into a state where it could be made available."
Another looming question concerns the so-called main phase. Collins said that discussions are underway involving the Wellcome Trust, government, industry, and others to assemble the funds to whole-genome sequence the remaining 450,000 samples in the biobank. He estimated that the sequencing would cost about £200 million and could be completed within the next two to three years.
"That is the ambition," said Collins. "We have made genotyping data available, we have started to make exome sequencing data available and will make more over the next year or two, and during this period we will make whole-genome sequencing data available," he said. All of the data, he noted, has been accompanied by other datasets, such as health outcome data and imaging data. UK Biobank is currently conducting MRI scans of the vital organs of about 100,000 people.
"If you start from the end point, the data set that will be generated from Vanguard alone is made more significant by the fact that each sample that has come from an individual participant of UKBB has got an incredibly rich data set already, just in terms of genotyping, whole exome sequencing, but also additional imaging and chemical measurements that have already taken place," noted the Sanger's Langford. "So it's not just about having genetic data, it's layered in with electronic records and other data that will enable such studies that will bring to life massive cohort scale analyses and interpretation," she said.
It has not yet been decided who will undertake the whole-genome sequencing of the remaining 450,000 samples in the UK Biobank. However, Langford said the Sanger is in a position to show best practices and share information about how the main phase might be sequenced. "If we were involved, we would be able to determine the number of machines to use and provide strategies around how libraries might be loaded into flow cells in order to deliver the scale and throughput that would be required to sequence at that scale," she said.
Exome sequencing data
While the Sanger continues to sequence the whole genomes of 50,000 people as part of the Vanguard Project, the whole-exome sequencing of the entire repository continues via the biopharma effort led by Regeneron and GSK.
After a nine-month exclusive access period, the data from the first tranche of 50,000 participants was made available to the public earlier this month. Researchers from the companies also released a preprint on bioRxiv describing some initial findings from the dataset.
A larger consortium of companies, including AbbVie, Alnylam, AstraZeneca, Bristol-Myers Squibb, Biogen, Pfizer, and Takeda, is supporting an effort to exome sequence the remaining 450,000 UK Biobank participants by next year.
Sequencing to date has been carried out by the Regeneron Genetics Center at its headquarters in Tarrytown, New York.
UK Biobank's Collins has pledged to make all of the data from the exome sequencing effort available eventually. "The exome sequence data provides you with much more detail about the genetic structure within the protein-producing part of the genome, which goes beyond what genotyping does," he noted. "Exome sequencing enables one to look at rarer variants that are associated more strongly with particular conditions."
Regeneron spokesperson Alexandra Bowie said that the biopharmaceutical company "sees a lot of value" in both the exome sequencing and the whole-genome sequencing efforts.
"We don't think of genome versus exome sequencing in a polarized way — one does not preclude the other, and we believe right now the exome is most informative for drug development efforts, which is our main focus at Regeneron," Bowie said.
She said that Regeneron expects to have all 500,000 exomes sequenced by 2020, and has already sequenced the exomes of 100,000 participants in addition to the 50,000 it sequenced initially. The UK Biobank will continue to release the exome sequencing data in tranches, each after Regeneron and its partners have had a nine-month period of exclusive access. Bowie said that exome sequencing data on the next 100,000 participants will be released in early 2020, and all data will be available by 2021.
Collins acknowledged that the impending availability of reams of whole-genome and whole-exome data is bound to flummox even the most seasoned bioinformaticians. Still, he said that any analytical challenges the data present will be surmountable.
"Our job is to creatively create problems that others solve," Collins noted. "No one imagined that we could image on the level that we proposed imaging, but we were able to, but if we create the problem, make it available widely, then people will solve it," he said. "The same goes for the sequencing data."