Anticipating increased use of cloud computing as sequencing datasets continue to grow, scientists from Yale and Stanford Universities and Weill Cornell Medical College have published a cloud-enabled version of the software used for the functional annotation of variants identified in the first phase of the 1000 Genomes Project.
The software is available as a virtual machine option via Amazon's cloud platform, but users will also be able to download the software and run it locally or use a web-based version of the annotation tool to process their data, Mark Gerstein, a professor of biomedical informatics, molecular biophysics, biochemistry, and computer science at Yale University, told BioInform.
Gerstein, who is a co-author on the paper, explained that the cloud-based version of the software “reflects our sense of where the 1000 Genomes [Project] and a lot of big data computing is going,” noting that there is “a lot of interest in putting all the datasets on the cloud” and running analysis pipelines in that environment.
A recent Bioinformatics paper that described the software, dubbed the Variant Annotation Tool, or VAT, noted that because cloud computing provides “immense storage capacity and scalable compute resources, as well as the ability to share data and perform collaborative analyses,” it is likely that in the future sequencing data will be stored on platforms offered by vendors such as Amazon.
As a result, “the importance of software residing in the same space as the data on which it operates requires that the analysis pipelines processing these reads migrate to the cloud as well,” the paper explains. “As VAT will constitute an integral part of such pipelines, having it reside on the cloud will be necessary.”
The VAT virtual machine is tailored for use on Amazon’s EC2 and S3 infrastructure as well as on the free Eucalyptus cloud, but “the actual source code of the internal executable … you could easily compile on basically any type of platform,” Gerstein said.
In addition to the 1000 Genomes Project, VAT is being used in projects at the Yale Center for Mendelian Genomics — one of several National Institutes of Health-funded centers that aim to apply next-generation sequencing and computational tools to discover genes and variants that underlie inherited diseases.
It has also been used to analyze several exome datasets from the NIH’s National Heart, Lung, and Blood Institute and the Broad Institute, Gerstein said.
Gerstein added that he sees “no reason” why VAT couldn’t also be used to analyze data from individuals whose genomes are sequenced for clinical use, although “we haven’t really tried to roll it out for that.”
In general, “we want VAT to work within these personal genomics analyses workflows and that’s what we are hoping happens,” he said. “We don’t see it as an encompassing tool that does everything, but we imagine that what’s going to happen in the future is that people are going to do tons of these personal genomes and they are going to run lots of tools on them and we hope that VAT will be able to fit into someone’s workflow very readily.”
Dealing with Complexity
VAT contains, among other applications, snpMapper and indelMapper, which determine the effects of SNPs and indels, respectively, on the coding potential of gene transcripts; svMapper, which determines if structural variants overlap with different gene transcript isoforms; and genericMapper, which checks whether variants overlap with entries in specified annotation sets.
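The kind of check attributed to genericMapper above, whether a variant overlaps entries in an annotation set, comes down to an interval-overlap test. The following is a minimal sketch of that idea and is not VAT's own code: the `Interval` type, the `overlapping_annotations` function, the annotation names, and the 1-based inclusive coordinate convention are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    """A hypothetical annotation entry; coordinates are 1-based and inclusive."""
    chrom: str
    start: int
    end: int
    name: str


def overlapping_annotations(chrom: str, start: int, end: int,
                            annotations: List[Interval]) -> List[str]:
    """Return the names of annotation entries that overlap the variant span."""
    return [
        a.name
        for a in annotations
        # Two closed intervals overlap when each starts before the other ends.
        if a.chrom == chrom and a.start <= end and start <= a.end
    ]


# Made-up annotation set, for illustration only.
annotations = [
    Interval("chr1", 100, 200, "enhancer_A"),
    Interval("chr1", 150, 300, "promoter_B"),
    Interval("chr2", 100, 200, "enhancer_C"),
]

# A single-base variant at chr1:180 falls inside the first two entries.
hits = overlapping_annotations("chr1", 180, 180, annotations)
print(hits)
```

A linear scan like this is fine for a small annotation set; a tool working at genome scale would more plausibly use a sorted or indexed structure, but the overlap condition itself would be the same.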
“VAT is very targeted toward … identifying loss-of-function variants in genes” and it addresses many of the complexities that crop up in this sort of analysis, Gerstein explained to BioInform.
For example, while the average protein-coding gene has many transcripts, a given mutation may affect only one of them, or perhaps half, he explained.
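To make the transcript point concrete, the sketch below counts how many isoforms of a gene contain a given variant position in one of their coding exons. It is an illustration under stated assumptions, not a description of how VAT works: the transcript names, exon coordinates (1-based, inclusive), and the `affected_transcripts` helper are hypothetical.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; an assumed convention


def affected_transcripts(variant_pos: int,
                         transcripts: Dict[str, List[Exon]]) -> List[str]:
    """Return the sorted names of transcripts with an exon covering variant_pos."""
    return sorted(
        name
        for name, exons in transcripts.items()
        if any(start <= variant_pos <= end for start, end in exons)
    )


# Four made-up isoforms of one gene that share some exons but not others.
transcripts = {
    "tx1": [(100, 200), (300, 400)],
    "tx2": [(100, 200), (500, 600)],
    "tx3": [(300, 400), (500, 600)],
    "tx4": [(100, 200), (500, 600)],
}

# A variant at position 350 lies in an exon used by only two of the isoforms.
hit = affected_transcripts(350, transcripts)
fraction = len(hit) / len(transcripts)
print(hit, fraction)
```

In this toy gene the variant touches half of the isoforms, which is why a per-transcript report can be more informative than a single per-gene label.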
A deeper analysis of the data may reveal that only a quarter of the loss-of-function mutations are “simple truncating stop codons,” while the rest might be insertions, deletions, and frameshifts, he said. “You have to think about the transcript but then you also have to think about the splice structure, the gene, and sometimes what appears to be a frame shift will actually work out to be a cryptic splice site and so forth.”
Furthermore, “sometimes an insertion or deletion or even a mutation will affect the splicing and it’s not completely obvious how this does it, and the same is true for structural variants too,” he said.
Another level of complexity is that “many of these disabling mutations actually take the form of not a single variation but … a multiple nucleotide polymorphism where you have coupled events together — for instance, an insertion/deletion right next to a SNP … and again you have to take that into account carefully,” he noted.
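A small worked example shows why coupled variants have to be interpreted together. The sketch below uses two adjacent SNPs in the same codon rather than the indel-plus-SNP case Gerstein mentions, and it is not taken from VAT: the `apply_snps` helper and the reference codon are hypothetical, and the codon table is only the subset of the standard genetic code needed here.

```python
from typing import Iterable, Tuple

# Subset of the standard genetic code; "*" marks a stop codon.
CODON_TABLE = {"CAG": "Q", "TAG": "*", "CGG": "R", "TGG": "W"}

Snp = Tuple[int, str]  # (0-based offset within the codon, alternate base)


def apply_snps(codon: str, snps: Iterable[Snp]) -> str:
    """Return the codon after substituting each alternate base at its offset."""
    bases = list(codon)
    for offset, alt in snps:
        bases[offset] = alt
    return "".join(bases)


ref_codon = "CAG"   # glutamine (Q) in the reference
snp_a = (0, "T")    # on its own: CAG -> TAG, an apparent premature stop
snp_b = (1, "G")    # on its own: CAG -> CGG, arginine (R)

# Annotating each SNP independently suggests a loss-of-function stop codon.
separate = [CODON_TABLE[apply_snps(ref_codon, [s])] for s in (snp_a, snp_b)]

# Annotating them as one event on the same haplotype gives TGG, tryptophan (W).
combined = CODON_TABLE[apply_snps(ref_codon, [snp_a, snp_b])]

print(separate, combined)
```

Read one at a time, the first SNP looks like a disabling stop codon; read together, the two changes produce a missense substitution instead, so the call changes once the coupling is taken into account.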
VAT tries to take all of these possibilities into account and provides users with “graphical summaries” that show “where the mutation is relative to where all the transcripts are” and clarify the location of disabling mutations.
In later iterations of the tool, the researchers are looking into “incorporating more subtle aspects of gene annotation,” including methods of predicting nonsense-mediated decay of gene transcripts as well as the effects of upstream open reading frames in the untranslated regions of genes, Gerstein said.
The team also plans to develop tools for the analysis of non-coding regions of the genome, but that will likely be made available in a separate software tool, he said.
Other variant annotation tools, such as SIFT, PolyPhen, and Annovar, are similar to VAT, but they were not developed to do exactly the same things, and each program does some things that the others do not.
For example, SIFT and PolyPhen are more focused on analyzing the impact of non-synonymous changes and are not “oriented” to the effects of frameshifts on transcripts, for instance, because “that’s not really their mission,” he said.
Annovar, on the other hand, has more in common with VAT, but “it doesn’t have all the features,” he said. For example, it doesn’t target multiple nucleotide polymorphisms and some of the more complex variants.
In terms of speed and scalability, “it really depends on exactly how you configure” the tool, Gerstein said.
He explained that users can tweak VAT to suit their needs “in terms of the positioning of your data” on the compute infrastructure.
Furthermore, VAT “will parallelize well and it will run on the cloud, and if you set up in an optimum fashion, it will be extremely fast,” he said. By contrast, other variant annotation tools are not currently cloud-enabled, the Bioinformatics paper notes.