CHICAGO (GenomeWeb) – In keeping with its mission of providing free care for any patient in need, St. Jude Children's Research Hospital in Memphis, Tennessee, has made its entire set of genomics data on pediatric cancer available to the global research community.
The public repository, dubbed St. Jude Cloud, contains 5,000 anonymized whole-genome sequences, 5,000 whole-exome sequences, and 1,200 RNA sequences from young cancer patients. The plan is to have 10,000 whole-genome sequences by next year and to encourage further growth by inviting outside researchers to share their own data.
"It's all about building the community to generate greater research, and St. Jude is going to be the foundational anchor of that global research project," said Richard Daly, CEO of DNAnexus, which, along with Microsoft, is providing the technology behind St. Jude Cloud.
"It may seem like a no-brainer, but it is very difficult to find anyone who will freely make their data available in this way, and it's because of their mission and the way in which they finance themselves," Daly added. "The biggest problem in bioinformatics today is [that] people are protecting, shepherding, [and] siloing their data, and St. Jude has gone in totally the opposite direction."
Microsoft serves as the cloud host for St. Jude Cloud, while DNAnexus provides the platform's front end, which includes freely available analytics, visualization, and collaboration tools.
St. Jude Cloud officially launched Sunday at the American Association for Cancer Research annual meeting in Chicago.
There are two groups of intended end users: genomic cancer researchers (with or without informatics training) and computational scientists who want to test novel algorithms with the St. Jude data sets, according to Jinghui Zhang, chair of computational biology at the Memphis hospital.
"These computational scientists can bring their tools to the cloud and can use our datasets for testing or for performance enhancement or for making discoveries by using their novel tools," Zhang said. "Making novel discoveries also applies to research scientists who have no computational background."
Opening the tools to scientists without computational training also helps researchers cope with the well-known global shortage of bioinformatics professionals.
St. Jude Cloud lets users set up private, password-protected research areas where they can upload their own data for processing on the platform.
"They can have their own private project and they can put whatever they like in that and run our tools on their data," said Scott Newman, group lead for clinical bioinformatics analysis at St. Jude. "They can access our data and analyze it side-by-side with theirs."
Tools currently available on the St. Jude Cloud include Rapid RNA-seq, PeCan PIE (the Pediatric Cancer Variant Pathogenicity Information Exchange), a test for neoepitope prediction, and ProteinPaint, a tool for visualizing somatic mutations.
The platform also includes the standard suite of DNAnexus technologies. Mountain View, California-based DNAnexus has experience building platforms for research communities, including PrecisionFDA and Mosaic; the latter is a Janssen Human Microbiome Institute-sponsored platform for analysis of microbiome data.
The St. Jude Cloud has roots in the 2010 founding of the St. Jude-Washington University Pediatric Cancer Genome Project. (Elaine Mardis, who was co-director of Washington University's Genome Sequencing Center then, today serves on the DNAnexus scientific advisory board.)
"This is going to be the most significant set of data that we could imagine in terms of understanding childhood cancers," National Institutes of Health Director Francis Collins said at the time.
That three-year project collected whole-genome sequencing data from 700 matched tumor/normal pairs as well as 2,000 exomes from 23 different types of pediatric cancers. It led to more than 20 published, peer-reviewed articles, according to Zhang, and St. Jude made each dataset publicly available.
In 2014, St. Jude began developing a clinical sequencing pipeline, including whole-genome sequencing, exome sequencing, and transcriptome sequencing of every new pediatric cancer patient. The cloud dataset also includes sequences from the Genomes for Kids project and the St. Jude Lifetime Cohort study.
Zhang said that about 300 labs worldwide have requested downloads from the Pediatric Cancer Genome Project to date. "But it has been a very torturous experience to download data," she said.
Newman described how, at a previous job, it took him nine months to complete a download of about 100 whole-genome sequences from St. Jude when researching high-grade glioma. St. Jude gave him an access token but no technical guidance.
"The dataset was big, the connection was slow, and also, for reasons I don't fully understand, the downloads kept failing, so I kept having to restart again and again," he said.
"That's not an uncommon experience," Zhang admitted. A download often appears to have completed successfully even though some of the BAM files are missing data.
The St. Jude Cloud promises to eliminate this pain point and accelerate research by moving computation to the cloud.
"We tried to avoid downloads. If you can bring your tools to the cloud, you don't need to spend all this effort to download the data," Zhang said. "You can focus on getting analysis done."
With St. Jude Cloud now in public release, the creators are looking to expand its capabilities.
In the future, Zhang hopes that the cloud will allow researchers to integrate their own data with the St. Jude data for visualization. "Right now, you have to take an alternate route to do this, but we would like to make it seamless on the cloud. That will be our ultimate goal," she said.
Users also would be able to make their own genomic data publicly available, but St. Jude and its technology partners still have to sort through access-control mechanisms, according to Zhang.
Other plans call for integrating clinical records with the genome sequences. "We'd like to make the portal eventually become a place where we can upload our clinical data and clinical sequencing data … and share it with the rest of the community," Zhang said.
Machine learning and artificial intelligence also are in the works.
"It's likely that the next phase of tools to appear on the St. Jude Cloud will be machine learning-deep learning," said Daly, the DNAnexus chief. Though DNAnexus is currently running a pilot to test the integration of Google's DeepVariant variant-calling tool into its core genome informatics platform, it will be Microsoft bringing its machine learning technology to the table in the St. Jude initiative.
In a subsequent phase, there likely will be research challenges, similar to the PrecisionFDA project, Daly said, though he was not ready to offer details.