NEW YORK – Chan Zuckerberg ID (CZ ID), a free, cloud-based data analysis platform for identifying pathogens and other microbes from sequencing data, has been making inroads with infectious disease researchers around the world.
The platform, developed by the Chan Zuckerberg Initiative and first released in 2019, recently added the capability to process long-read data from Oxford Nanopore Technologies.
According to a CZI spokesperson, the platform has garnered more than 2,200 users from 116 different countries, including 76 low- or middle-income countries (LMICs). Approximately 65 percent of CZ ID users self-report as being from LMICs, and around 60 percent hail from academic labs.
"You're able to see what kind of [pathogens] are present within your samples with just a few clicks," said Mariama Kujabi, a scientific officer at the London School of Hygiene & Tropical Medicine who is based in the Gambia and has been using the platform since 2020. "You don't need any programming background to be able to upload your data, analyze it, and get the results out there."
Supported by a grant from the Bill & Melinda Gates Foundation and CZI, Kujabi’s team is currently working on a research project to identify pathogens from babies with sepsis.
While Kujabi has been using CZ ID with Illumina data, she said the ability to use nanopore data on CZ ID is "very exciting," especially since her lab plans to run a large number of samples on Oxford Nanopore sequencers to help detect viral pathogens.
"The challenge of having a huge dataset is that you have to analyze it," she noted. "With CZ ID, the new pipeline will allow us to analyze the metagenomic dataset from nanopore [sequencing]."
Initially, the platform was only available for use with short-read Illumina sequencing data. By combining CZ ID’s plug-and-play features with nanopore sequencing’s advantages in portability and low upfront cost, the developers hope to enable researchers around the world, particularly those in low-resource settings, to analyze pathogens rapidly in their metagenomic samples.
"There's a lot of tools out there that are command-line only," said Sam Scovanner, director of product management for the infectious disease technology team at CZI. "I think our particular product is really unique because it is web-based, and it is something that is designed to be really easy to use."
Overall, Scovanner said CZ ID’s nanopore data analysis workflow is largely analogous to its Illumina counterpart, despite a few differences in the initial data processing step.
In both cases, researchers can upload FASTQ sequence files and sample metadata through the platform's web interface, and the software carries out automated data processing, including quality control and the removal of low-quality and host reads. The processed data are then compared against National Center for Biotechnology Information (NCBI) databases, generating a sample report that names the pathogens identified in the samples.
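For readers unfamiliar with this kind of preprocessing, the read-filtering step described above can be illustrated with a minimal Python sketch. It is not CZ ID's code: the function names, the mean-quality threshold of 20, the Phred+33 encoding, and the idea that host reads arrive as a precomputed set of read names (for example, from an upstream alignment against a host genome) are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple, Set


class Read(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[Read]:
    """Yield reads from FASTQ text laid out as four lines per record."""
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        # Drop the leading '@' and any description after the read name.
        yield Read(header[1:].split()[0], sequence, quality)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average base quality, assuming Phred+33 encoding (an assumption)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(
    reads: Iterable[Read],
    host_read_names: Set[str],
    min_mean_quality: float = 20.0,  # illustrative threshold, not CZ ID's
) -> List[Read]:
    """Keep reads that are not flagged as host and pass the quality cutoff."""
    return [
        read
        for read in reads
        if read.name not in host_read_names
        and mean_phred(read.quality) >= min_mean_quality
    ]
```

In a real pipeline, the surviving reads would then be aligned against reference databases such as those maintained by NCBI; that step is omitted here.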
In the end, users can store the data on the platform or download the results for offline analysis. They can also choose to share their data as well as explore and compare their results against other NGS datasets that are public on CZ ID.
While the Illumina analysis pipeline allows users to analyze and visualize results on the platform, for example by generating heatmaps and consensus genomes for viruses of interest, these features are not yet available for nanopore users. In addition, Scovanner said, CZ ID is developed for research use only.
Compared with command-line-based metagenomic analysis software, Scovanner said, one advantage of CZ ID is that it requires no coding skills from researchers. In addition, the software has integrated pipelines for host contamination removal, read quality control, read mapping, and sequence assembly, sparing researchers from having to transfer or reformat data between multiple NGS analysis software tools.
Furthermore, the platform saves researchers from having to set up individual cloud-based data storage solutions. According to Scovanner, CZ ID operates in a cloud environment supported by Amazon Web Services, and the software currently does not limit how much data users can upload.
In terms of data privacy and security, Scovanner said CZ ID does not contain any personally identifiable information, and human data is filtered out during data pre-processing. In addition, researchers always own and control their data, which they can choose to share or delete from the platform at any time.
The goal for CZ ID is to provide a convenient analysis tool for researchers with limited bioinformatic resources — particularly those in LMICs — to identify and analyze pathogens using metagenomics, Scovanner said.
"What we wanted to do is to build technology that enables researchers in LMICs to do analysis," she said. "LMICs face really high burden of infectious diseases, yet researchers in those regions often lack access to the tools and technology that they need to detect and track emerging infectious diseases."
Meanwhile, Kujabi pointed out some current limitations of CZ ID. For one, although the software enables pathogen analysis in individual samples, it is still challenging to use the platform to generate a phylogenetic tree across different isolates, which can be important for tracing the source of an infection or outbreak. In addition, she said the software is currently not compatible with data from amplicon sequencing, which her group often uses for microbiology analysis.
Moving forward, Scovanner said the team will continue to expand the nanopore pipeline to include data visualization features. In addition, she said the team is hoping to release an antimicrobial resistance (AMR) analysis feature later this year that would automatically detect AMR genes within the sample metagenomic data.