NEW YORK (GenomeWeb) – An international team of investigators from various academic institutions, public health agencies, and nongovernmental organizations is developing a cloud-based repository of global tuberculosis data to support improved diagnostic development and clinical decision making.
The database is expected to include genotype, phenotype, and clinical information along with associated metadata such as geographic location, testing methodology, and phenotypic drug susceptibility testing results collected from TB patients around the world.
The so-called Rapid Drug Susceptibility Testing (RDST) Consortium comprises investigators from several organizations including the Foundation for Innovative New Diagnostics (FIND), the Critical Path Institute (CPATH), the US Centers for Disease Control and Prevention (CDC), and the World Health Organization.
Among other activities, the investigators are developing the tuberculosis relational sequencing platform (ReSeqTB), a system that will offer access to data and tools for identifying molecular mutations in TB samples and exploring correlations between these variations and drug susceptibility testing results.
The researchers explained in a paper published recently in Clinical Infectious Diseases that the planned repository addresses a communal need for a resource capable of handling continuous collation, management, and validation of both retrospective and prospective data on Mycobacterium tuberculosis drug resistance. Moving information that is currently accumulating in siloed repositories into a shared pool accessible to researchers would help expand current knowledge of the genetic basis for resistance, they wrote in CID. It could help expose important information on geographic variations associated with major mutations, lineage-specific polymorphisms, and new mutations that arise as a result of practices such as using standardized treatment regimens, they said.
Access to this kind of information would bolster efforts to develop more effective diagnostic tools for rapidly detecting drug resistance, the researchers wrote. Earlier this year, researchers associated with Médecins Sans Frontières/Doctors Without Borders reported the results of a study in which they discovered that approximately 30 percent of MDR TB strains collected during a 2009 outbreak in Swaziland contained a mutation that could not be detected by most molecular tests of drug resistance, including Cepheid's widely adopted GeneXpert MTB/RIF test. Access to data on the geographical occurrence of resistance mutations could help developers design more tailored tests moving forward.
The repository could also boost efforts to develop more potent treatments for recalcitrant TB cases. According to statistics reported in the paper, of the 9 million new TB cases and 1.5 million TB-related deaths that occurred in 2013, MDR TB — iterations of the disease that resist two of the most effective first-line TB drugs — accounted for an estimated 480,000 cases and 210,000 deaths that year. Besides better diagnostics and therapies, these datasets could also improve clinical decision making and even inform national policy decisions for diagnosing and treating TB.
Marco Schito, associate scientific director for CPATH's Critical Path to TB Drug Regimens initiative and one of the authors of the CID paper, told GenomeWeb that the consortium plans to make the first iteration of ReSeqTB available on Amazon Web Services in the US with the possibility of creating mirror sites at other locations around the world later on.
Initially, the database will be open to consortium members only, starting at the end of the month, for early-access testing and to gather feedback on ways to improve the system. Current non-members who are interested in early access to the database are encouraged to contact CPATH for details on how to join the consortium. The consortium's current plan is to make ReSeqTB more broadly available in October 2016.
ReSeqTB will build on the efforts of existing repositories in the TB community, such as the Tuberculosis Drug Resistance Mutation database. In fact, RDST is actively partnering with developers of some of these existing databases to incorporate their data into ReSeqTB, Schito said. The consortium plans to obtain the raw sequences that these groups have collected and run them through an internally developed computational pipeline that annotates the relevant genes and re-calls variants.
This process will be repeated for all samples that the consortium collects for ReSeqTB. This way, the consortium controls the quality of the data that feeds into the platform, which will help ensure consistent, reproducible results across studies, Schito said. For contributors who aren't comfortable with all of their research data being made widely available, the consortium will have mechanisms in place for their datasets to be accessed in aggregate, he added.
Part of ReSeqTB's development process involved assembling two expert panels to provide guidance on how to build the actual database architecture and to come up with criteria for defining drug resistance variants, according to Timothy Rodwell, FIND's senior scientific officer. Although he is a member of the consortium, Rodwell is not one of the authors on the CID paper. The first of these panels, the so-called input group, was composed of researchers with expertise in building whole-genome sequencing analysis pipelines specifically for TB data. Their task, Rodwell told GenomeWeb, was to design a standardized computational analysis pipeline, which would be used to analyze raw sequence data from patient isolates. The pipeline is currently housed on a CDC server but will eventually be co-located alongside the data stored on the cloud.
The proposed guidelines needed to include specifics on analysis parameters for tasks such as variant filtering as well as particulars on input file specifications, SNP definitions, and ways of reaching consensus in unclear cases such as when multiple variant callers report different calls for a given position, he said. These guidelines were then turned over to a team at the CDC, under the supervision of James Posey, leader of the CDC's Applied Research team, which was tasked with developing and validating the pipeline.
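To make the consensus problem concrete, here is a minimal sketch of one possible rule, a simple majority vote across callers. It is not the consortium's pipeline, whose parameters had not been published at the time of writing; the function name `consensus_call`, the caller labels, and the `min_agreement` threshold are all assumptions made for illustration.

```python
from collections import Counter
from typing import Mapping, Optional


def consensus_call(
    calls: Mapping[str, Optional[str]], min_agreement: int = 2
) -> Optional[str]:
    """Return the allele most callers agree on, or None if the position is unresolved.

    `calls` maps a (hypothetical) caller name to the allele it reported at one
    position, with None meaning the caller made no call there.
    """
    # Ignore callers that made no call at this position.
    counts = Counter(allele for allele in calls.values() if allele is not None)
    if not counts:
        return None

    ranked = counts.most_common()
    top_allele, top_count = ranked[0]

    # A tie for first place, or too few agreeing callers, leaves the
    # position unresolved rather than picking an allele arbitrarily.
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if tied or top_count < min_agreement:
        return None
    return top_allele


# Toy example: two of three hypothetical callers agree on "T".
position_calls = {"caller_a": "T", "caller_b": "T", "caller_c": "C"}
print(consensus_call(position_calls))  # prints T
```

Returning None for ties and low-agreement positions is a deliberate choice in this sketch: an unresolved position can be flagged for review instead of being silently assigned a call.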
A second panel has been tasked with defining appropriate criteria for determining the relationship between variants and drug resistance. Members of the output group, as it's called, are expected to come up with a validated group of important drug resistance-related mutations that will serve as a standard for testing the efficacy of diagnostic assays, Rodwell said. Another task that the output group will tackle is establishing criteria for determining the clinical relevance of TB mutations, he said.
When it's completed, ReSeqTB will offer tiered access to data depending on who is trying to use it. In addition to assay developers, the list of potential users includes researchers, clinicians, ministries of health, and national tuberculosis programs, all of whom will be able to tailor the system to return the kind of information that's most useful to them. So clinicians, for example, would be able to search for information on potential treatment options for patients based on the specific mutations found in test samples, while diagnostic developers who might be more interested in which mutations are associated with geography-specific drug resistance could search for those specific bits of information.
For now, the consortium's primary focus will be on providing data to researchers and diagnostics developers for this first phase of ReSeqTB's development, Schito said. Their efforts here include designing a user-friendly way of reporting results to diagnostic developers, Rodwell said. They are also exploring mechanisms for making the raw sequence data easily accessible to researchers who may want to apply their own algorithms and software to the ReSeqTB data rather than use the consortium's pipeline. One possible option, Schito said, is to make FastQ files from ReSeqTB available in one of the National Center for Biotechnology Information databases, where high-volume users can easily download them.
If the initial deployment to test developers and researchers goes as planned, the consortium will then look into expanding access to other user groups such as national healthcare systems, clinicians' practices, and even patients and advocacy groups, Schito said. With an eye towards expanding access, the consortium has begun reaching out to some of these parties to figure out what sorts of questions they might want to address and gain a better sense of how the database could be of benefit, he said.
The researchers are also continuing to gather data to populate ReSeqTB. Currently, they have gathered information on about 5,000 isolates and these are the first datasets that will be hosted in the repository. Moving forward, they will accept data from academic, governmental, and nonprofit researchers as well as from clinical laboratories, clinical trial sponsors, and countries performing drug resistance surveys, according to the CID paper. When the repository goes live, researchers will have access to the data under specific use agreements and contributors will always be able to access and own datasets that they submit.
In terms of specific contributions, the consortium is primarily interested, at least for now, in datasets that include good phenotype data in addition to genotype information. The reason for this, as Schito explained, is to help clear up discrepancies between drug resistance phenotypes and associated genotype data. Currently, "we have all this phenotypic data so we know what's resistant and what's susceptible but when we compare it with genotypic data, we have all these discordances," Schito explained. Access to good phenotype information could help researchers figure out why these discordances occur, he said.
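As an illustration of what such a comparison involves, the following sketch tallies concordant and discordant isolates from paired phenotype and genotype records. The record layout, field names, and variant labels are hypothetical and the records are made-up toy data; nothing here reflects ReSeqTB's actual schema or results.

```python
from typing import Dict, Iterable, List, TypedDict


class Isolate(TypedDict):
    isolate_id: str
    phenotype: str  # "R" (resistant) or "S" (susceptible) from phenotypic DST
    resistance_variants: List[str]  # known resistance-associated variants found


def discordance_summary(isolates: Iterable[Isolate]) -> Dict[str, int]:
    """Count isolates whose phenotype and genotype agree or disagree."""
    counts = {
        "concordant": 0,
        "resistant_without_variant": 0,
        "susceptible_with_variant": 0,
    }
    for isolate in isolates:
        is_resistant = isolate["phenotype"] == "R"
        has_variant = bool(isolate["resistance_variants"])
        if is_resistant == has_variant:
            counts["concordant"] += 1
        elif is_resistant:
            # Phenotypically resistant, but no known variant explains it.
            counts["resistant_without_variant"] += 1
        else:
            # Carries a known variant, yet tests susceptible.
            counts["susceptible_with_variant"] += 1
    return counts


# Made-up toy records, for illustration only.
toy_isolates: List[Isolate] = [
    {"isolate_id": "iso1", "phenotype": "R", "resistance_variants": ["hypothetical_variant_1"]},
    {"isolate_id": "iso2", "phenotype": "S", "resistance_variants": []},
    {"isolate_id": "iso3", "phenotype": "R", "resistance_variants": []},
    {"isolate_id": "iso4", "phenotype": "S", "resistance_variants": ["hypothetical_variant_2"]},
]
print(discordance_summary(toy_isolates))
```

Separating the two kinds of discordance matters because they point to different explanations: a resistant isolate with no known variant suggests an uncatalogued mechanism, while a susceptible isolate carrying a variant may indicate a testing problem or a variant that does not confer resistance.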
The consortium also hopes to capture information on patient outcomes in ReSeqTB, a task that is difficult to do outside of the context of clinical trials. Because of the disease's long course and equally lengthy treatment regimens, patients sometimes fail to complete their therapy or drop one treatment protocol in favor of another, making it difficult to track treatment response. Schito told GenomeWeb that the consortium is reaching out to some groups that are attempting to track TB patient outcomes and will work with them to include this information in future releases.
The Gates Foundation provided the initial funding for the ReSeqTB project — the exact amount is not being disclosed — with CPATH and FIND as the main grantees. Part of the consortium's mandate will be to figure out how best to sustain the database in the long term, Schito told GenomeWeb. He said that consortium members are mulling options such as charging commercial testing labs in the US and other high-income countries a small fee for access to the data. They also hope that global non-profit organizations like the WHO and Gates Foundation will help subsidize the cost of analyzing test results in lower-income, high-disease-burden countries, he said.
Moving forward, the developers will also publish additional details of ReSeqTB and their activities. Rodwell told GenomeWeb that in addition to the current CID paper, the consortium plans to publish a white paper by the end of this year that will describe its computational pipeline including details of its development and construction as well as running parameters. "The whole point of this entire process is to be completely transparent, make it available publicly, and also get it peer-reviewed," he said.