Scientists from the public health departments of the University of Virginia and Harvard University have developed Genome Mining, or GEMINI, a free tool for annotating and exploring inherited genetic variations in the context of Mendelian disease studies.
GEMINI, its developers explained in a recently published PLoS Computational Biology paper, extends data analysis concepts used by existing variant annotation software like ANNOVAR and BEDTools by offering a "flexible framework" for exploring genetic variation in disease and population genetic studies, including a standardized interface to query and explore data and to develop new methods that can be adapted as research needs evolve.
When variants are uploaded, GEMINI automatically annotates them with pre-installed curated annotations such as chromosomal cytobands, CpG islands, segmental duplications, and assembly gaps gathered from resources such as the Single Nucleotide Polymorphism database and ClinVar. GEMINI stores the annotated variants in an SQL database that researchers query to find variants based on things like sample genotypes and inheritance patterns, the researchers wrote. It also provides mechanisms for "ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses," they wrote.
Aaron Quinlan, an assistant professor at UVA and one of GEMINI's developers, explained that his team built the system to enable them to prioritize and more effectively explore variants identified by large-scale whole-genome and whole-exome sequencing studies aimed at exploring inherited diseases.
While there are a standard format and tools for describing and annotating genetic variation, "we found that to identify variants that meet specific inheritance patterns for Mendelian diseases … existing tools are really lacking," he told BioInform. Existing software like BEDTools, which filters variants based on overlaps between genome annotations, offers some of the needed annotation capabilities but lacks things like mechanisms for exploring genotypes from many different samples, he said.
Also, BEDTools requires many custom scripts, making it "laborious and error-prone," Quinlan et al. note in the paper. Other programs such as Harvard's PLINK/SEQ "are either focused primarily on applying disease association tests to variants identified among a study cohort, provide a limited set of annotations, or are more difficult to use because annotations are not directly integrated with genetic variation," they wrote.
As a result, "we sought to devise a framework where we could not only bring in these annotations so that we could place variants that we discover in context, but also enable ad hoc data analysis queries so that we can try different hypotheses easily using the same framework without having to write new scripts for each question we want to ask," Quinlan said.
To use GEMINI, researchers load a VCF file of genetic variants into the tool's database framework. They can also include information about sex, phenotypes, and relatedness of the samples to "facilitate downstream analysis searches." Once inside GEMINI, variants are then annotated with information from pre-installed annotation files, which are curated and maintained by Quinlan and his colleagues, as well as with data from any other repositories that the researchers want to use.
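To give a sense of what a loader has to pull out of a VCF file, the following is a minimal Python sketch, not GEMINI's own code. The function name `parse_vcf_line`, the sample names, and the example record are hypothetical; the sketch assumes the standard tab-delimited VCF column order and that each sample field carries a `GT` entry.

```python
def parse_vcf_line(line, sample_names):
    """Parse one VCF data line into a dict (hypothetical helper, not GEMINI's API)."""
    fields = line.rstrip("\n").split("\t")
    # Standard VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT samples...
    chrom, pos, variant_id, ref, alt = fields[:5]
    format_keys = fields[8].split(":")
    genotypes = {}
    for name, sample_field in zip(sample_names, fields[9:]):
        values = dict(zip(format_keys, sample_field.split(":")))
        genotypes[name] = values["GT"]  # assumes a GT entry is present
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # a site may list several alternate alleles
        "genotypes": genotypes,
    }


# Made-up header and record for a mother/father/child trio.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tmother\tfather\tchild"
sample_names = header.split("\t")[9:]
record = parse_vcf_line(
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:30\t0/0:28\t0/1:35",
    sample_names,
)
```

A real loader would also read the file's meta-information lines and the sample metadata (sex, phenotype, relatedness) described above; those steps are omitted here.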
Both the variants and the annotations are stored in database tables, which lets GEMINI "index variants by their genomic coordinates [and] by their associated annotations," thus speeding up "more sophisticated queries," the researchers wrote. Genotype information for the variants, meanwhile, is compressed and "stored as a single column for each variant row." This approach enables both "query performance and scalability while still providing necessary access to individual sample genotype information," the PLoS Comp. Bio. paper explains.
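The storage design the paper describes can be illustrated with a small Python sketch built on the standard `sqlite3` and `zlib` modules. This is an illustration under stated assumptions, not GEMINI's actual schema: the table and column names, the JSON-plus-zlib packing of genotypes, and the example rows are all hypothetical.

```python
import json
import sqlite3
import zlib

SAMPLES = ["mother", "father", "child"]  # hypothetical sample order


def pack_genotypes(genotypes):
    """Compress a list of per-sample genotypes into one blob."""
    return zlib.compress(json.dumps(genotypes).encode("utf-8"))


def unpack_genotypes(blob):
    """Reverse pack_genotypes and label each genotype with its sample."""
    return dict(zip(SAMPLES, json.loads(zlib.decompress(blob).decode("utf-8"))))


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE variants (
           variant_id INTEGER PRIMARY KEY,
           chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
           in_cpg_island INTEGER,  -- example annotation column
           gts BLOB                -- all sample genotypes, one column per row
       )"""
)
# Index by genomic coordinate and by annotation, as the paper describes.
conn.execute("CREATE INDEX idx_coord ON variants (chrom, pos)")
conn.execute("CREATE INDEX idx_cpg ON variants (in_cpg_island)")

rows = [
    ("chr1", 12345, "A", "G", 1, ["0/0", "0/0", "0/1"]),
    ("chr2", 500, "C", "T", 0, ["0/1", "0/1", "1/1"]),
]
conn.executemany(
    "INSERT INTO variants (chrom, pos, ref, alt, in_cpg_island, gts) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [(c, p, r, a, cpg, pack_genotypes(g)) for c, p, r, a, cpg, g in rows],
)

# Region query served by the coordinate index; genotypes are unpacked on demand.
hits = [
    (chrom, pos, unpack_genotypes(blob))
    for chrom, pos, blob in conn.execute(
        "SELECT chrom, pos, gts FROM variants WHERE chrom = ? AND pos BETWEEN ? AND ?",
        ("chr1", 12000, 13000),
    )
]
```

Keeping genotypes in a single compressed column means the table's width does not grow with the number of samples, while per-sample genotypes remain recoverable for any row a query returns.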
The end result of the process is a database that researchers can query to identify variants based on the annotations or on genotypes of specific samples being studied. For example, a researcher could run one of several standard queries to identify variants that meet an autosomal recessive or autosomal dominant model, or they could search for de novo mutations or compound heterozygotes, Quinlan said. They can also construct their own queries to identify variants that meet their own unique criteria, he said.
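The logic behind such inheritance queries can be sketched in a few lines of Python. This is a simplified illustration rather than GEMINI's implementation: the function names and example variants are hypothetical, and the sketch assumes a single trio with an affected child, two unaffected parents, and diploid genotypes written as VCF-style strings.

```python
def alt_allele_count(genotype):
    """Count non-reference alleles in a genotype such as '0/1' or '1|1'.

    Returns None when any allele is missing ('.').
    """
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(1 for allele in alleles if allele != "0")


def trio_counts(mother, father, child):
    """Return alt-allele counts for the trio, or None if any call is missing."""
    counts = [alt_allele_count(g) for g in (mother, father, child)]
    return None if None in counts else counts


def is_de_novo(mother, father, child):
    """Child is heterozygous; both parents are homozygous reference."""
    counts = trio_counts(mother, father, child)
    return counts is not None and counts == [0, 0, 1]


def is_autosomal_recessive(mother, father, child):
    """Child is homozygous alternate; both parents are heterozygous carriers."""
    counts = trio_counts(mother, father, child)
    return counts is not None and counts == [1, 1, 2]


# Made-up variants: (label, mother, father, child)
variants = [
    ("chr1:12345 A>G", "0/0", "0/0", "0/1"),
    ("chr2:500 C>T", "0/1", "0/1", "1/1"),
    ("chr3:42 G>A", "0/1", "0/0", "0/1"),
    ("chr4:7 T>C", "./.", "0/0", "0/1"),
]
de_novo = [v[0] for v in variants if is_de_novo(*v[1:])]
recessive = [v[0] for v in variants if is_autosomal_recessive(*v[1:])]
```

Real analyses also have to account for larger pedigrees, genotype quality, and sex chromosomes, none of which this sketch attempts.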
Moving forward, Quinlan and his collaborators are working on adding capabilities to GEMINI that will enable users to explore the impacts of rare genetic variants in complex disorders. Additionally, GEMINI's performance is being scaled up so that it can handle variant data from thousands of samples, he said.
Currently, GEMINI is being used in four research projects, including one focused on improving annotations for cancer studies and a second exploring Smith-Magenis syndrome.