NEW YORK — Genomics England plans to build a new cancer research platform that will provide researchers with access to a wealth of genomic, phenotypic, and imaging data that could fuel future diagnostics and therapeutics.
At a research summit last week, the company, which is funded by the UK government, provided additional details about the planned platform, which it hopes to bring online later this year.
In March, Genomics England engaged Insitro, a South San Francisco, California-based drug discovery and development firm, to help make the data easier to query. During the summit, held both virtually and in London, Insitro CEO Daphne Koller and Genomics England Chief Commercial Officer Parker Moss discussed the effort, which Moss said would support new research in the oncology space. The platform will start to become available by December of this year.
Koller called the effort a "really exciting opportunity to let biology speak its own language," noting that Insitro is using machine learning tools to better interpret and group histopathology images.
"What we are building together with Genomics England is a multimodal search capability that allows researchers and clinicians to interrogate the Genomics England dataset in a way that leverages that language of images that we learned, and over time, other modalities as you acquire those as well," said Koller.
"Hopefully, the really large datasets that are being created alongside these machine learning methods will help point us into the right direction without being subject to hypothesis-driven and sometimes intuitive approaches that have led us astray," she said.
Moss said the resource would enable researchers to identify similar patients in the database not only based on phenotyping features, but "based on sophisticated imaging features that will be embedded within our dataset."
In a follow-up email, Moss confirmed the first tranche of imaging data, for which there is already matching tumor and normal whole-genome sequencing data available, plus a range of other omics data, should be queryable by the end of the year. Prabhu Arumugam, director of clinical data and imaging for Genomics England, confirmed the end-of-year release date, and said in an email that the company's partners at the National Pathology and Imaging Cooperative aimed to complete processing about 300,000 images from 16,000 participants in the 100,000 Genomes Project by mid-2023.
Founded in 2019, the NPIC is a program involving the UK's National Health Service as well as academic and industry partners. The effort is funded via the UK's Industrial Strategy Challenge Fund. Darren Treanor, a consultant pathologist at the University of Leeds and director of NPIC, called the work with Genomics England a "fairly ambitious project" that includes processing up to 300,000 slides, which produces around 300 terabytes of data.
"All [data] will go into Genomics England's trusted research environment for researchers," Treanor said during a separate talk at the summit. "Our work is going to benefit 100,000 Genomes participants by having better analysis of the tissue and data samples created."
Dal Bansal, NPIC's operations director, noted in her talk at the summit that NPIC is the largest of five UK Research and Innovation centers of excellence in digital imaging and AI that were established in 2019, and that it has received about £33 million ($40.7 million) in public funding and nearly £12 million in industry support to date. NPIC has a range of industry partners, she noted, including Japan's Futamura and UK-based FFEI, which support image processing; Leica and Roche, which have provided scanners to NPIC; and mTuitive, a US firm that offers software for data capture and reporting. According to Bansal, NPIC is working to develop a vendor-neutral archive, meaning it can work with any AI, any scanner, or any software.
All of that imaging data is now being combined with Genomics England's genomics data to create its cancer research platform. According to Genomics England CEO Chris Wigley, its resources currently include 60 petabytes of data covering 150,000 genomes and all associated clinical data points, to which the high-definition cancer images will be added. "The basic concept is a trusted research environment, where the researchers come to the data," Wigley said at the summit.
Such a multimodal representation of clinical data, covering histopathology images as well as genomic and other data, will allow users to search for images, biopsies, and cases based on "semantic similarity rather than visual similarity," according to a company statement released to coincide with the summit. The firm said the platform aligns with its Cancer 2.0 initiative to employ multimodal data, as well as long-read and methylation sequencing data, to better identify features of cancer, with the goal of improving diagnostics, prognostics, and treatments.
A Genomics England spokesperson, in an email, called the work with Insitro a "central piece of the multimodal project" and said that Insitro's machine learning-based tools will provide "consistent structure" for the pathology images. Later, the firm's algorithms will be available to Genomics England's research partners for interrogating the data, the spokesperson added. He noted that the creation of the resource is being funded by the UK government through the Department of Health and Social Care.
As Genomics England noted in its statement, genomics, pathology, and radiology data are often held in different formats and within different health disciplines. Via its new platform, the company will make these diverse data sources available at population scale and searchable using AI.
"This is the multimodal part, the digitized pathology and radiology images that will now be possible to interrogate in tandem with the other part of the platform, the whole-genome sequencing, clinical, and other data," the spokesperson said. "That's what we need [machine learning] for, as it's beyond statistical or ordinary bioinformatics."
He added that the planned cancer research platform includes Genomics England's expertise in analyzing data and answering research questions, as well as its patient community, which supports and drives its research.
"That's what makes us much more than a database," he said.
Genomics England, in its statement, acknowledged The Cancer Genome Atlas (TCGA) as a resource that has served as a model for its new platform. TCGA, hosted by the US National Cancer Institute, provides access to data from 11,000 patients. Genomics England's resource will include data on about 16,000 cancer patients. Genomics England said it expects the first publication related to the use of its platform to focus on brain cancer and to be published in 2023.
Established as a company in 2013, Genomics England led the 100,000 Genomes Project in partnership with the NHS, and the two have always been intrinsically linked. But Genomics England is now expanding its reach, aiming not only to work with the NHS to improve genomic testing in the UK and make health data more accessible to domestic researchers, but also to improve its research library and make it more useful for international research endeavors, according to Chief Scientific Officer Matt Brown.
Brown said in a separate talk during the summit that Genomics England continued to be defined in some ways by when, and for what purpose, it was established. When it commenced operations, he noted, whole-genome sequencing was the technology of choice, and cancer and rare diseases were the priority research areas. While they will continue to be its main focus, Brown said Genomics England is also investing in bioinformatics development, functional genomics, and new sequencing methodologies.
Central to its activities is improving and diversifying its research library and environment by adding multimodal and multiomic data, "so that people can step between the proteome, the genome, and the transcriptome," Brown said.
Brown also acknowledged that since Genomics England was set up to study the British population, its dataset is weighted toward white participants, who make up about 85 percent of the overall population according to UK census data, while it underrepresents South Asian, East Asian, and African ancestries, which comprise sizable minorities in the UK.
Genomics England is engaged in programs to make its dataset more diverse, though, he said, and will sequence about 25,000 people from underrepresented ancestries. The resulting dataset, he said, would also support research into common diseases.
Also, according to Brown, 94 percent of researchers who use the Genomics England research library and environment are based in the UK, a situation he believes needs to change. About 4,000 researchers are registered to use the resource worldwide.
"There is a large global research community that we are currently not partnering with," he said. A more diverse dataset that contains more multimodal and multiomic data and is easier to access could make the resource more useful for researchers worldwide. In Brown's words, all of these activities align with Genomics England's intent to become a "much more holistic and international personalized medicine organization."