NEW YORK (GenomeWeb) – Researchers at the Los Alamos National Laboratory and the Naval Medical Research Center have developed a web-based bioinformatics platform called Empowering the Development of Genomics Expertise (EDGE) that is designed to help users with limited or no bioinformatics expertise use existing tools to analyze and interpret microbial genomic sequence data.
EDGE Bioinformatics integrates hundreds of publicly available open-source software packages and internally developed tools that are designed primarily to process raw Illumina reads. Available pipelines allow users to assemble, annotate, and compare genomes, as well as characterize complex clinical or environmental samples, including data from bacterial, archaeal, and viral isolates or shotgun metagenome samples. There are also methods for visualizing the output of taxonomy classification tools for easy comparison, as well as links to the output directories where data from each pipeline are stored.
According to a paper published recently in Nucleic Acids Research, the tools available in EDGE were selected for the quality of results that they provide across sample types, their speed, and the computational resources that are required to run them. EDGE packages publicly available open-source software into six modules that can be run individually or in combination. "We've done a robust comparison between a number of different tools and we've assembled together some basic workflows where we are aiming to get 80 to 90 percent of the questions answered for 80 to 90 percent of the problems that the user might have," Patrick Chain, leader of the bioinformatics and analytics team and the metagenomics program within LANL's Biosecurity and Public Health group, and lead for the EDGE development team, said in an interview.
The list includes well-known tools such as BLAST, Bowtie, Burrows-Wheeler Aligner, Kraken, MetaPhlAn, IDBA, and SAMtools. Full details of the tools included in the platform are provided in accompanying documents on the EDGE website. These tools have been assembled into ready-to-run pipelines for sample pre-processing, de novo assembly and annotation, comparing samples to reference genomes, taxonomic classification, phylogenetics, and PCR primer analysis. The platform also includes pre-processing and reference-based analysis functions for eukaryotic genomes. Users can tweak the default settings of the pipelines as well as activate or deactivate some steps depending on their needs. They can also view the results of their analysis at the genus, species, or strain level.
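As a rough illustration of what activating or deactivating pipeline steps can look like, the Python sketch below models the six analysis areas named above as simple on/off toggles. The module names and the PipelineConfig class are hypothetical and chosen only for this example; they are not EDGE's actual configuration format or API.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers for the six pipeline areas described in the article.
# These are illustrative names, not EDGE's real module names.
MODULES = (
    "pre_processing",
    "assembly_annotation",
    "reference_comparison",
    "taxonomy_classification",
    "phylogenetics",
    "pcr_primer_analysis",
)


@dataclass
class PipelineConfig:
    """Tracks which modules are switched on; everything is enabled by default."""

    enabled: dict = field(default_factory=lambda: {m: True for m in MODULES})

    def disable(self, module: str) -> None:
        """Switch one module off, rejecting names that are not in MODULES."""
        if module not in self.enabled:
            raise KeyError(f"unknown module: {module}")
        self.enabled[module] = False

    def plan(self) -> list:
        """Return the enabled modules in their fixed run order."""
        return [m for m in MODULES if self.enabled[m]]


# Example: run everything except the PCR primer analysis step.
config = PipelineConfig()
config.disable("pcr_primer_analysis")
print(config.plan())
```

Keeping the run order in one fixed tuple means that switching a module off never reorders the remaining steps, which is one simple way to offer preset workflows while still letting users skip parts of them.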
Compared to available alternative environments for NGS data analysis such as Galaxy, "EDGE is the only open-source platform that can be used locally and that integrates both the processing of individual samples and the presentation of results in a seamless web-based interface." It's also unique because it provides pre-selected algorithms and parameters for users rather than letting them choose and combine tools into workflows themselves, which can be daunting for novice bioinformaticians. "You have to know what tools you want to pick for your particular analysis [and] that's not always intuitive," Chain said.
Furthermore, compared to EDGE, Galaxy doesn't provide much visualization. "You can create workflows and you can use that workflow to run your data through [but] then you have to run around for another program to feed your outputs [into for visualization]," he noted. In contrast, EDGE provides users with quality control graphics, assembly summary charts, heat maps, and phylogenetic trees. It also links to third-party visualization tools such as the JBrowse genome browser.
EDGE is also a cheaper alternative to commercial packages, which can be "inflexible" and can affect the interpretation of results if users don't know the details of the proprietary algorithms that the packages use, the developers wrote.
The paper also describes the results of a few analysis experiments performed to demonstrate the efficacy of the EDGE platform. One of these focused on two sequence datasets from separate isolate genome sequencing projects involving Bacillus anthracis and Yersinia pestis strains. According to the paper, results from EDGE's assembly and annotation module were consistent with known genomic elements from the microbes, including known insertion sequences and rRNA operons. The assembled sequences were also consistent with the known genome size and number of genes found in the microbes. The researchers were also able to confirm the expected identities of the sequenced organisms using taxonomy classification tools available in EDGE.
The researchers also used EDGE to successfully characterize pathogenic sequences in a number of clinical samples including one from the recent Ebola outbreak and one from a fecal sample collected from a patient infected with Escherichia coli.
Currently, EDGE is used by research groups around the world as well as in several government laboratories in the United States. "There are some collaborators that are using this to teach individuals how some tools work," Chain said. For example, "there are a number of funded programs to teach graduate students to collect various organisms and analyze them."
Turner Conrad, a research microbiologist in the diagnostics systems division of the United States Army Medical Research Institute for Infectious Diseases (USAMRIID) and one of the platform's beta testers, highlighted EDGE's ease of use compared to some existing workflow environments. "When you look at something like Galaxy or any of these other workflow managers or workflow software ... they are so broad and open that you have to figure out how to make your own pipelines," he said. "The advantage that EDGE puts forth is ... they've still made it general use enough while offering a more universal type of workflow where you just pick and select what you want out of that whole thing to do [but] it's still all one big workflow."
EDGE's source code is available from GitHub. Researchers can also access the code in Docker containers and virtual machine images for local installation. The developers have provided a publicly accessible webserver that can be run with publicly available data from repositories such as the National Center for Biotechnology Information's Sequence Read Archive and the European Molecular Biology Laboratory's European Nucleotide Archive. For security reasons, the webserver does not support uploads of personal datasets. EDGE's modular design and open-source license allow other researchers to expand its capabilities beyond the initial implementation, according to the developers. They can also integrate the platform into their existing workflows.
The developers recommend that researchers running EDGE use computers that have at least 16GB of memory and eight central processing units available to run pipelines. Using more CPUs will reduce run times. For their next steps, the developers hope to add more tools to the EDGE platform, including RNA and 16S sequence data analysis pipelines, Chain said. They are also currently testing an amplicon sequence analysis pipeline that they hope to integrate into the platform. Also, some current users have requested new visualization tools, he said. They will also work on creating definitions and methods that will allow third-party developers to contribute best-practice tools and workflows to the platform.
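For readers planning a local installation, the short Python sketch below checks a machine against the recommended minimum of 16GB of memory and eight CPUs mentioned above. It is an illustrative helper only, not part of EDGE; the function names are made up for this example, and the memory lookup assumes a Unix-like system where os.sysconf exposes the physical page count.

```python
import os

# Thresholds taken from the developers' stated recommendation.
MIN_CPUS = 8
MIN_MEMORY_GB = 16


def meets_recommendation(cpus: int, memory_gb: float) -> bool:
    """Return True when both the CPU count and memory meet the minimums."""
    return cpus >= MIN_CPUS and memory_gb >= MIN_MEMORY_GB


def detect_resources():
    """Return (cpu_count, memory_in_GiB); memory is None if it cannot be read."""
    cpus = os.cpu_count() or 1
    try:
        # Assumption: a Unix-like system. os.sysconf is missing on Windows,
        # and these names may be unsupported on some platforms.
        page_size = os.sysconf("SC_PAGE_SIZE")
        pages = os.sysconf("SC_PHYS_PAGES")
        memory_gb = page_size * pages / 1024**3
    except (AttributeError, ValueError, OSError):
        memory_gb = None
    return cpus, memory_gb


cpus, memory_gb = detect_resources()
if memory_gb is None:
    print(f"{cpus} CPUs detected; memory size could not be determined")
else:
    verdict = "meets" if meets_recommendation(cpus, memory_gb) else "is below"
    print(f"{cpus} CPUs, {memory_gb:.1f} GiB memory: {verdict} the recommended minimum")
```

Operating systems usually report slightly less physical memory than the nominal installed amount, so a machine sold with 16GB may land just under the threshold here; a near miss is best read as borderline rather than as a hard failure.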