Software Surfeit


While people tend to look for the most cutting-edge software, when it comes to informatics tools for comparative genomics, "cutting edge" is a relative term. Everything is a moving target: experimental tools are constantly evolving, which means data quality and types change, which in turn affects the types of questions researchers can ask, which then leads to new demands on software and analytics pipelines. There are also no large commercial entities rolling out new editions of the same handful of tools year after year as genomics matures. Instead, software developers in the community are building new programs from scratch without many existing code bases to build on.

"Overall, the genomics field is not as mature as other fields — for example, molecular evolution or population genetics — where relatively few high-quality programs dominate specific analysis types," says Michael Cummings, associate professor at the University of Maryland's Center for Bioinformatics and Computational Biology. "A mature field from an analysis software standpoint has a productive balance between competition among programs, ongoing entrance of new ideas and new competitors, and continued development of existing state-of-the-art implementations."

Cummings co-directs the Workshop on Comparative Genomics meetings hosted by the Society of Systematic Biologists, which are aimed at teaching attendees how to install and run the myriad comparative genomics software tools. The workshop includes tutorials on software packages like ABySS and Galaxy, used for de novo assembly and read alignment, as well as the Ensembl project, headed up by EMBL-EBI and the Wellcome Trust Sanger Institute, which aims to develop a software system that produces automatic annotation on eukaryotic genomes.

"The 'NIH' (not invented here) problem, where people almost always build a new program from scratch rather than taking an existing code base and improving it" is one of the reasons why comparative genomics software development is still in its adolescence, Cummings says. "But we are slowly seeing some progress in overcoming these issues, for example, the expanding use of Bowtie as a core analysis component involving other programs." Bowtie is an ultrafast, memory-efficient short-read aligner that aligns short DNA sequences to the human genome at a rate of more than 25 million 35-base-pair reads per hour. It uses a Burrows-Wheeler index to keep its memory footprint small, typically about 2.2 gigabytes for the human genome.
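To give a feel for what a Burrows-Wheeler index does, here is a minimal Python sketch of the transform and of the backward search that counts exact matches of a short read. It is a toy illustration, not Bowtie's implementation: the function names are invented for this example, and the naive rotation sort and rank counting would never scale to a genome.

```python
def bwt(text, terminator="$"):
    """Return the Burrows-Wheeler transform of text (naive version).

    Assumes the terminator sorts before every character in text,
    which holds for "$" and the DNA alphabet in ASCII order.
    """
    if terminator in text:
        raise ValueError("terminator must not occur in text")
    s = text + terminator
    # Sort every rotation of s, then keep the last character of each.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    # first[c] is the first row of the sorted rotations that starts with c.
    first = {}
    for row, char in enumerate(sorted(bwt_string)):
        first.setdefault(char, row)

    def rank(char, k):
        # Number of times char occurs in bwt_string[:k] (naive scan).
        return bwt_string[:k].count(char)

    lo, hi = 0, len(bwt_string)  # half-open range of matching rows
    for char in reversed(pattern):
        if char not in first:
            return 0
        lo = first[char] + rank(char, lo)
        hi = first[char] + rank(char, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("banana")
    print(index)                            # annb$aa
    print(count_occurrences(index, "ana"))  # 2
```

The search only ever consults the transformed string and a small table of character counts, which is why an index of this kind can stay compact relative to the reference it represents.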

According to Inna Dubchak, staff scientist at Lawrence Berkeley National Laboratory and the US Department of Energy's Joint Genome Institute, the two biggest challenges to developing tools for comparative genomics are having a reliable alignment platform to obtain comparative data, and developing effective data visualization features for the software.

Dubchak speaks from experience: She developed the VISTA toolkit, a suite of programs and databases for comparative analysis of genomic sequences. It's one of the most popular and mature comparative genomics analysis toolkits available today. The suite contains more than 28 searchable genomes, including bacteria, fungi, algae, plants, and vertebrates.

"VISTA started about 10 years ago. It became very popular — we have about 2,000 citations in Google Scholar … and we have about 30,000 unique IPs of our users' access, so we have a pretty high level of usage considering the narrower focus of the tools," Dubchak says. "The visualization of comparative data is becoming a real area of research, and VISTA is one of the early successes. It moves from a static image for comparative data showing level of conservation of peak and color according to annotation, then moving on to different types of browsing capabilities. This keeps VISTA alive as a signature tool."

Most recently, she and her colleagues rolled out the VISTA Region Viewer, an interactive online tool for comparing and prioritizing genomic intervals. "A lot of new information requires improving tools," Dubchak says. "We're working on the tool constantly, adding new features, and expanding to different parts of genomics and more medical areas."

Investigators studying microbial communities can take advantage of robust informatics pipelines like Quantitative Insights Into Microbial Ecology (QIIME), which is capable of analyzing large sequencing data sets from different types of sequencing platforms. QIIME combines third-party tools implemented in different laboratories with third-party databases, like VAMPS and MG-RAST, to allow users to pass data between QIIME and those resources more seamlessly. As new tools are developed for microbial community analyses, the QIIME team works with those tools' developers to get them wrapped and integrated into the QIIME pipeline, while at the same time leaving in legacy software for benchmarking and to satisfy niche users.

According to Antonio González Peña and his team at the University of Colorado, Boulder, there are several differences between QIIME and other analysis pipelines. "Every function in QIIME is tested using test-driven development and instead of re-implementing third-party tools — which can lead to incorrect results and extra development time — we wrap the original tools," he says. "We also provide workflow scripts, which reduce the number of steps an individual needs to run to get results, and we are always trying to find and integrate new analysis and visualization features which lead to more compelling results."
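The "wrap, don't re-implement" approach González Peña describes can be sketched in a few lines of Python. This is not QIIME code: the helper names are hypothetical, and a Python one-liner stands in for the third-party program so the example runs anywhere. The point is only that the wrapper passes data to the original executable and returns its output unchanged, so results come from the tool's own code.

```python
import subprocess
import sys


def run_wrapped_tool(command, input_text):
    """Run an external command, feed it input_text on stdin, return its stdout."""
    result = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# Stand-in for a third-party tool (an assumption for this sketch):
# a Python one-liner that uppercases whatever it reads on stdin.
STAND_IN_TOOL = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]


def uppercase_sequences(fasta_text):
    """Hypothetical wrapper: delegate the work to the external tool."""
    return run_wrapped_tool(STAND_IN_TOOL, fasta_text)


if __name__ == "__main__":
    print(uppercase_sequences(">s1\nacgt\n"), end="")
```

A wrapper like this is also easy to cover with a unit test, since the test only needs to check that inputs and outputs are passed through correctly.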

The analysis of genome synteny (the set of conserved genomic features on a set of homologous chromosomes) is a routine practice in comparative genomic studies. While there are plenty of Web-based synteny visualization tools for investigators to use, only the Genome Synteny Viewer (GSV), developed at the University of North Texas, allows researchers to work with their own data. GSV users upload two data files for synteny visualization, which are then presented as two selected genomes in an integrated view. Users can then browse and filter for genomic regions of interest, change the color or shape of each annotation track, and hide, re-order, or show the tracks dynamically.

"Most tools use local data. They have data provided in their database, so if I want to use a genomic browser I can't put my data there. But with the GSV, any user can generate his or her own data and visualize it, and they can customize the visualization, so they don't have to depend on the back end of the other tools," says Kashi Revanna, research analyst at North Texas, who co-developed GSV.

Revanna and Cummings echo Dubchak's emphasis on the importance of developing and integrating visualization tools into comparative genomics software, and all are keenly interested to see what role trends like cloud computing will play in their research.

Many other comparative genomics tools have been shown to work well in the cloud, including QIIME and Galaxy. Last December, a team from Harvard Medical School published a paper in BMC Bioinformatics describing its redesign of the reciprocal smallest distance (RSD) algorithm, a common comparative genomics algorithm, to run on Amazon's Elastic Compute Cloud. The team employed the RSD-cloud for more than 300,000 ortholog calculations across a wide selection of fully sequenced genomes, and the entire job took just under 70 hours with a cost of $6,302.
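The core idea behind reciprocal smallest distance can be sketched briefly in Python. The sketch assumes the pairwise evolutionary distances have already been computed and are supplied as a dictionary; in practice, estimating those distances from alignments is the expensive part of the job, and it is not shown here. The function name and gene labels are made up for illustration, and this is not the Harvard team's code.

```python
def reciprocal_smallest_distance(distances):
    """Pair genes across two genomes by reciprocal smallest distance.

    distances maps (gene_in_genome_a, gene_in_genome_b) to a precomputed
    evolutionary distance. A pair is reported only when each gene is the
    other's closest partner. Ties go to the first pair encountered.
    """
    best_for_a = {}  # gene in A -> (closest gene in B, distance)
    best_for_b = {}  # gene in B -> (closest gene in A, distance)
    for (gene_a, gene_b), dist in distances.items():
        if gene_a not in best_for_a or dist < best_for_a[gene_a][1]:
            best_for_a[gene_a] = (gene_b, dist)
        if gene_b not in best_for_b or dist < best_for_b[gene_b][1]:
            best_for_b[gene_b] = (gene_a, dist)
    # Keep only the pairs where the best hit is mutual.
    return sorted(
        (gene_a, gene_b)
        for gene_a, (gene_b, _) in best_for_a.items()
        if best_for_b[gene_b][0] == gene_a
    )


if __name__ == "__main__":
    toy_distances = {
        ("a1", "b1"): 0.1,
        ("a1", "b2"): 0.5,
        ("a2", "b1"): 0.3,
        ("a2", "b2"): 0.2,
        ("a3", "b2"): 0.4,
    }
    print(reciprocal_smallest_distance(toy_distances))
    # [('a1', 'b1'), ('a2', 'b2')]
```

Because each genome pair can be processed independently of every other pair, a workload like this splits naturally across many cloud machines.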

"I suspect with the rate of data generation, vis-à-vis rate of improvements in network bandwidth, that moving the computation to the data rather than the other way around will ultimately be most effective," Cummings says.

The software

Here are just some of the essential tools for your comparative genomics toolbox:
Scripture: A Java-based method for ab initio transcriptome reconstruction that relies solely on RNA-seq reads and an assembled genome
ABySS (Assembly By Short Sequences): A de novo, parallel, paired-end sequence assembler
Galaxy: A Web portal equipped with existing genome annotation databases, enabling users to search remote resources, combine data from independent queries, and visualize the results
TopHat: Aligns RNA-seq reads to a genome in order to identify exon-exon splice junctions. It is built on the short read mapping program Bowtie.
Cufflinks: Assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-seq samples
Velvet: A set of algorithms for manipulating de Bruijn graphs for genomic sequence assembly
PyCogent: A software library and collection of rigorously validated tools for the analysis of genome biology data sets
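For readers unfamiliar with the de Bruijn graphs mentioned in the Velvet entry, the following Python sketch shows how such a graph can be built from reads. It is a bare-bones illustration under simple assumptions (error-free reads, a single strand, no graph simplification) and does not reproduce Velvet's algorithms; the function name is invented for this example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph from reads.

    Each k-mer in a read becomes an edge from its first k-1 bases
    (prefix) to its last k-1 bases (suffix). Repeated k-mers yield
    repeated edges, which hints at coverage.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(list)
    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    print(de_bruijn_graph(["ACGT", "CGTA"], k=3))
    # {'AC': ['CG'], 'CG': ['GT', 'GT'], 'GT': ['TA']}
```

An assembler then looks for paths through a graph like this one to reconstruct longer sequences from overlapping short reads.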
