NEW YORK (GenomeWeb) – A team from the US and UK has released a free online computational resource for interpreting RNA sequencing data and chromatin immunoprecipitation sequencing data in an infectious disease research context.
As they reported online in Bioinformatics last month, researchers from Virginia Tech's Virginia Bioinformatics Institute and elsewhere put together a suite of bioinformatics tools under the umbrella of an online service called RNA-Rocket.
"We wanted to create an RNA-seq processing server that could be updated without high overhead because RNA-seq is changing — the tools for processing data are still changing — all the time," first author Andrew Warren, a software engineer and PhD candidate at the Virginia Bioinformatics Institute, told GenomeWeb.
"There are lots of updates and lots of new tools, so we wanted to use a system that would minimize the overhead but still make it easy for users to approach RNA-seq for the first time if they weren't bioinformaticians," Warren added.
The site, which is available through the National Institute of Allergy and Infectious Diseases (NIAID)-funded Pathogen Portal, combines several existing clearinghouses for genome sequence and annotation data with open-source software for analyzing RNA-seq or ChIP-seq data.
Those involved in developing RNA-Rocket hope the tool will find favor with researchers involved in a wide range of research applications. The site also includes several host genomes, Warren noted, to aid those interested in investigating host-pathogen interactions.
"RNA-Rocket has allowed us to do projects on the scale that we do them, without having to search for a bioinformatics person," Melissa Caimano, a microbiologist at UConn Health, told GenomeWeb.
"I really appreciate that a group of fellow scientists took the time to develop a free resource like this," said Caimano, a user who was not involved with the RNA-Rocket paper. "Not every lab has access to a team of bioinformatics or IT people."
Caimano has been using the tool on and off for roughly a year and a half to look at gene expression profiles in mutant and wild-type strains of the Lyme disease-causing organism Borrelia burgdorferi grown in different conditions in vitro and in vivo.
She and her team initially contracted out their RNA sequencing library prep and data analysis steps, but decided to start doing their own RNA sequencing and data analysis in house.
"We're not a big lab, but I really believe in researchers having that start-to-finish experience," Caimano explained. "If you … get bioinformatics information back from a company, you've lost a lot of valuable information by being once or twice removed from your data."
Nevertheless, the data analysis arm of these experiments initially proved challenging, she noted, since Galaxy, a commonly used open-source analysis site, primarily contained information related to human, mouse, and other large eukaryotic genomes.
"It seemed like [Galaxy] was such a great platform, but I couldn't use it," Caimano said. "So many of the big Galaxy sites would have a lot of reference genomes for higher eukaryotes but very limited bacterial genomes."
Not long after, though, she discovered RNA-Rocket, which "has almost all of the same resources for basic data analysis in the same pull-down menu as the big Galaxy site and interfaces look remarkably similar, but it's tailored for bacterial genomes."
Such similarities are no accident: Warren and his colleagues built up from the existing Galaxy framework and interfaces when they designed RNA-Rocket, tweaking the system so that it would be compatible with RNA sequence software working in tandem with several pathogen-specific databases.
RNA sequencing has gained popularity as a transcriptome profiling method as the cost of sequencing has dipped, the team noted. But despite the sensitivity and relative affordability of this approach, there are still several different ways to produce and parse RNA-seq data.
For their part, Warren and his co-authors wanted to come up with a comprehensive tool for analyzing RNA-seq data in the infectious disease research context — from analyses focused on expression patterns and transcript structure in prokaryotic or eukaryotic pathogens to studies of RNA profiles in their host or vector organisms.
To that end, the RNA-Rocket service is designed to take on quality control, alignment, annotation, and analysis of RNA sequence data generated from pathogen, vector, or host samples, by combining open-source software tools with genomic, transcriptomic, sequence typing, protein structure, and interaction data contained in Bioinformatics Resource Centers (BRCs).
The BRCs currently paired with RNA-Rocket include Pathosystems Resource Integration Center (PATRIC), which focuses on bacterial species, the eukaryotic pathogen-centered site EuPathDB, and VectorBase, a database that houses bioinformatics tools to study invertebrate vectors of infectious disease such as mosquitoes, ticks, and tsetse flies.
As part of a Driving Biological Projects mini grant project, the researchers worked closely with individuals generating RNA-seq data.
After speaking with experts in the field about the issues they were seeing with their reads, base call evaluations, contracts with core labs, desired applications, and so on, they settled on a set of software tools that seemed most appropriate for evaluating the data and transforming it into information on gene expression, transcript structure, and the like.
To use the site, researchers upload FASTQ data files from RNA-seq experiments. RNA-Rocket then uses methods such as Bowtie2 or TopHat2 to align the sequences to appropriate reference genomes in these databases.
From there, sequences can be assembled, annotated, and analyzed with other open-source software, offering a look at transcript structure, gene expression patterns, differential expression profiles between various samples, and so on.
"Some people might be interested in novel feature discovery, in terms of the genome, or looking at isoforms or differential expression profiles under different conditions," Warren said.
"You're not bound to using all of the programs in the flow diagram," Caimano noted. "But what's nice is that you know that after each step, the [data] is in the appropriate format for the next step."
In parallel with the analytical aspects of RNA-Rocket, the site also provides users with feedback regarding read features related to the quality of their RNA sequence data — from mapping coverage to potential signs of PCR bias — using software designed to deal with read quality, read trimming, and read mapping quality.
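To make the idea of read-level quality feedback concrete, here is a minimal Python sketch of one such check: parsing a FASTQ file and counting how many reads clear a mean base-quality threshold. It is an illustration only, not the software RNA-Rocket runs. The function names, the Phred+33 quality encoding, and the cutoff of 20 are all assumptions made for the example.

```python
from io import StringIO
from statistics import mean


def parse_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding (an assumption)."""
    if not qual:
        return 0.0
    return mean(ord(char) - offset for char in qual)


def summarize(handle, min_mean_quality=20):
    """Count reads and how many meet the (hypothetical) mean-quality cutoff."""
    total = passing = 0
    for _read_id, _seq, qual in parse_fastq(handle):
        total += 1
        if mean_phred(qual) >= min_mean_quality:
            passing += 1
    return {"reads": total, "passing": passing}


if __name__ == "__main__":
    # Two toy reads: 'I' encodes Phred 40, '!' encodes Phred 0.
    example = StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
    print(summarize(example))  # {'reads': 2, 'passing': 1}
```

A summary like this only flags low-quality reads; trimming, mapping coverage, and PCR-bias checks of the kind the site reports would need dedicated tools.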
"We did try to do things to make users aware of the issues with RNA-seq," Warren said. "It would be doing them a disservice if we just took their input and ran everything automatically and just gave them an output."
So far, the approach has been applied to high-throughput sequencing data generated using SOLiD, Illumina, and Roche 454 technologies.
The group has not yet dealt with long-read data such as that generated using the PacBio RS instrument, Warren noted, though such reads should be compatible with the pipeline, provided appropriate software is included to address the particular error profile of a given long read system.
"Even during the life of the project, the read technology has changed, even within the short-read technologies," he said. "We have seen changes in the entire pipeline."
RNA-Rocket is also designed to deal with data from ChIP-seq experiments on infectious disease-related samples, helping users find and quantify spikes of signal in the genome that correspond to the protein of interest.
In either case, the service returns alignment, annotation, and analytical information to users through the appropriate BRC, Warren and his colleagues explained.
"After results have been computed at the RNA-Rocket site they can be streamed back to the respective BRC depending on the reference organism selected for analysis," they wrote. "This provides users with the ability to process and analyze their RNA-seq data remotely without [having] to download potentially large files to their own computer."
Daily updates to RNA-Rocket incorporate new genomic data that's been added to BRC sites such as PATRIC, EuPathDB, and VectorBase.
Each BRC has its own data release schedule, reflecting the speed of projects being done by members of the research community, Warren noted. For instance, the bacterial database tends to get updated more frequently than the vector database, owing to the increased complexity associated with sequencing, assembling, and annotating the genomes of insects and other vector organisms.
At the moment, VectorBase contains data on a few dozen vector genomes and EuPathDB is home to information on more than 250 eukaryotes with proposed pathogenic roles. Some 30,000 bacterial strains or species are represented in PATRIC.
Warren noted that there is a possibility of expanding RNA-Rocket to interact with viral databases, though he and his colleagues have not yet considered strategies for doing so. "There's definitely potential, but we don't have a plan in place right now," he said.
For her part, Caimano plans to continue using RNA-Rocket not only in conjunction with PATRIC but also with the VectorBase BRC as her team looks more closely at B. burgdorferi pathogen interactions with its tick vector.
"[RNA-Rocket] is serving an underappreciated niche in infectious diseases by having all of these smaller or less widely studied genomes available to an end user," she said.