To help biology teachers introduce their students to basic bioinformatics analyses, researchers in the iPlant Collaborative, a research consortium that builds cyberinfrastructure for the plant sciences, developed an easy-to-use web application called DNA Subway. It guides users through four separate sequence analysis workflows using a colorful interface modeled after rail transit systems.
Developed with an eye towards educating undergraduates and high school students, DNA Subway aims to make DNA analysis accessible to faculty and students by providing simplified workflows for annotation and comparative genomics. Pipelines in this freely available platform combine open-source software such as Blast, TopHat, and Cufflinks with programs developed by iPlant participants into color-coded "tracks" for research activities like gene annotation, phylogenetic tree building, and transcriptome analysis.
"Our aim was to bring [the analysis steps] all into one place where the workflow is outlined for you, the tools are basically all built in and the data processing is being handled behind the scenes," Mohammed Khalfan, a senior bioinformatics developer at Cold Spring Harbor Laboratory and one of DNA Subway's developers, told BioInform.
When development began in 2010, DNA Subway offered tools for plant-based analyses only, but it has since grown to include support for animal genomes in some tracks as well. Across the four lines (red, yellow, blue, and green), DNA Subway can analyze data from species such as Arabidopsis thaliana, corn, soybean, zebrafish, fruit fly, and humans.
Stops along each of the colored routes mark the different parts of the analysis process, with each stop detailing steps and providing the tools needed to complete the phase in question. Riding the red line, for instance, takes users through an annotation workflow with tools for finding repeats, predicting genes, building gene models, and comparing input sequences to reference genome annotations.
The first of the three other tracks in this virtual subway system is a yellow line that’s used to identify homologous genes. The workflow used here integrates Blast searches, multiple sequence alignments, and tree-drawing capabilities that display the relationships between matching sequences.
Meanwhile, the subway's blue line is used to generate phylogenetic trees and analyze barcode regions — sections of DNA that are used to identify the particular species to which an organism belongs.
Programs used here include Merger, a program from the EMBOSS software package that uses global alignment to merge overlapping sequences; Multiple Sequence Comparison by Log-Expectation, or MUSCLE, used for multiple sequence alignments; and an electropherogram viewer for exploring sequence trace file data. The track also includes tools for building consensus sequences, trimming sequences, building gene models, and running Blast searches. In addition, this track runs the Phylogeny Inference Package, or PHYLIP, a package of programs that, as the name implies, is used to infer phylogenetic trees.
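The consensus-building step on the blue line can be pictured with a short Python sketch. It is illustrative only: it joins two reads on an exact suffix-prefix overlap, which is a much simpler stand-in for the global alignment that Merger performs, and the function names and example reads are hypothetical rather than anything taken from DNA Subway.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def merge_overlapping(left: str, right: str, min_overlap: int = 5) -> str:
    """Join two reads where a suffix of `left` exactly matches a prefix of `right`.

    The longest candidate overlap is tried first. This is an exact-match
    simplification, not the global alignment used by EMBOSS Merger.
    """
    left, right = left.upper(), right.upper()
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    raise ValueError("no exact overlap of the required length was found")


# Hypothetical reads: the reverse read comes off the opposite strand, so it is
# reverse-complemented before being merged with the forward read.
forward_read = "ATGGCCATTGTAATGG"
reverse_read = "TCAGCGGCCCATTAC"
consensus = merge_overlapping(forward_read, reverse_complement(reverse_read))
print(consensus)
```

Real trace data contains base-calling errors, which is why alignment-based merging that tolerates mismatches is used in practice instead of exact matching.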
Finally, traveling the green line will take the user through an RNA sequencing data analysis pipeline. This track, which is available in beta for now, has tools for assessing data quality, aligning and assembling sequences, and quantifying RNA. Its applications include TopHat, which aligns the reads to a reference genome; Cufflinks, which then assembles the aligned reads into transcripts; and CuffDiff, which looks at differential expression in genes and transcripts.
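For readers curious about what the green line hides behind its interface, the following Python sketch builds the kind of command lines such a pipeline chains together: TopHat to align reads, Cufflinks to assemble the alignments into transcripts, and Cuffdiff to compare expression between samples. It is a sketch under stated assumptions, not DNA Subway's actual job configuration. The file names, directory layout, and the use of a reference annotation file for Cuffdiff are all hypothetical, and nothing is executed.

```python
def rnaseq_commands(index_prefix, annotation_gtf, reads_by_sample, out_dir):
    """Build argument lists for a TopHat -> Cufflinks -> Cuffdiff run.

    reads_by_sample maps a sample name to a FASTQ file path. The lists are
    only constructed here; each one could be handed to subprocess.run on a
    machine where the tools are installed.
    """
    commands = []
    bam_files = []
    for sample, fastq in reads_by_sample.items():
        align_dir = f"{out_dir}/{sample}/tophat"
        assembly_dir = f"{out_dir}/{sample}/cufflinks"
        # TopHat aligns the reads to the indexed reference genome.
        commands.append(["tophat", "-o", align_dir, index_prefix, fastq])
        # TopHat writes its alignments to accepted_hits.bam in its output dir.
        bam = f"{align_dir}/accepted_hits.bam"
        # Cufflinks assembles the aligned reads into transcripts.
        commands.append(["cufflinks", "-o", assembly_dir, bam])
        bam_files.append(bam)
    # Cuffdiff tests for differential expression across the samples.
    # Assumption: a reference annotation GTF is supplied by the caller.
    commands.append(
        ["cuffdiff", "-o", f"{out_dir}/cuffdiff", annotation_gtf] + bam_files
    )
    return commands


# Hypothetical inputs for two samples.
commands = rnaseq_commands(
    "genome_index",
    "annotation.gtf",
    {"control": "control.fastq", "treated": "treated.fastq"},
    "results",
)
for command in commands:
    print(" ".join(command))
```

Building the commands as lists rather than shell strings keeps file names with spaces intact if the lists are later passed to a process runner.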
Of the four lines in the system, this one will likely be used by more experienced biologists who are looking for a user-friendly method of analyzing transcriptome data that doesn’t require command line expertise, said Khalfan.
This part of the subway will also handle the largest data uploads of all four lines. The blue and red lines have upper limits of 150 kilobases while the yellow line has a 10 kilobase limit, but the green line is designed to accept whole transcriptomes. As a result, its jobs run on iPlant's allocated space on XSEDE, a National Science Foundation-funded supercomputing ecosystem made up of 16 supercomputers and visualization and data analysis resources hosted at academic institutions across the country.
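The size limits quoted above can be summarized in a small Python lookup. This is only an illustration of the figures in the article: it assumes the limits apply to the length of a single input sequence measured in bases, that one kilobase is 1,000 bases, and that the green line has no fixed cap, none of which is spelled out by the developers. The names are hypothetical.

```python
from typing import Dict, Optional

# Upper limits in bases, from the figures quoted in the article.
# None marks the green line, for which no fixed limit is stated (assumption).
LINE_LIMITS: Dict[str, Optional[int]] = {
    "red": 150_000,
    "blue": 150_000,
    "yellow": 10_000,
    "green": None,
}


def fits_line(line: str, sequence_length: int) -> bool:
    """Return True if a sequence of this length is within the line's limit."""
    limit = LINE_LIMITS[line]
    return limit is None or sequence_length <= limit


# A hypothetical 50-kilobase sequence fits the red line but not the yellow one.
print(fits_line("red", 50_000), fits_line("yellow", 50_000))
```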
Khalfan presented a poster describing DNA Subway at this year's Genome Informatics conference, which was held at CSHL earlier this month. The system is being developed under the auspices of iPlant's education, outreach, and training division by researchers from CSHL's Dolan DNA Learning Center and the University of Arizona.
He told BioInform that recent estimates put the total number of registered DNA Subway users at 5,850. Combined, these users are running about 15,000 projects each on the red and blue lines, and about 5,000 projects on the yellow line. Some of these projects, specifically those done as part of the International Barcode of Life project, led to the discovery of several fungal, plant, and animal sequences that were not included in existing public repositories, Khalfan said, and 58 of these have since been submitted to GenBank.