As whole-genome sequencing gains ground in the clinical diagnostics market, the bioinformatics community is stepping up to help practitioners identify the most important disease-associated variants among billions of base pairs.
At the Genome Informatics conference held earlier this month at Cold Spring Harbor Laboratory, research groups from the Medical College of Wisconsin and the University of Toronto showcased two new software systems for identifying and annotating candidate disease-causing variants in whole-genome sequence data.
MCW's CarpeNovo and UT's MedSavant both let users wade through large lists of genomic variants and pare them down to a few candidates based on a number of factors, including whether they have been predicted to be damaging, whether they have already been associated with a disease, and whether they are found in genes with a known function. Both tools also link to publicly available data resources to pull in information that is used to annotate the variants.
But there are some differences between the two platforms — primarily in terms of their intended uses.
MedSavant, for example, was designed for use in clinical research to identify causal variants in patient populations. During his presentation at the Genome Informatics conference, Marc Fiume, a PhD student in UT's computer science department and one of the program's developers, said he began developing the tool as part of efforts to determine the genetic basis for autism at the Hospital for Sick Children in Canada.
The MCW team's CarpeNovo, on the other hand, is being used for research efforts as well as for clinical diagnostic interpretation, Elizabeth Worthey, assistant professor of bioinformatics, genomics, and genetics at MCW and one of the tool's developers, told BioInform.
Worthey said that MCW and its collaborators at the Children's Hospital of Wisconsin are currently using CarpeNovo to analyze clinical cases at the hospital, although she could not provide specific details for confidentiality reasons.
MCW last year launched a clinical whole-genome sequencing program for children with very rare, undiagnosed diseases (see sister publication Clinical Sequencing News, 3/29/2011). While it currently outsources the sequencing to Illumina, which has a CLIA-certified laboratory, Worthey said the college is working on setting up its own sequencing lab.
As part of that effort, CarpeNovo's developers are taking steps to have the tool CLIA-certified, Worthey said.
And even though MedSavant is currently intended for research use, its developers hope to follow a path similar to the one MCW has taken with CarpeNovo.
The eventual goal is for clinical geneticists to use MedSavant to find variants of interest once they have sequenced patient genomes, Michael Brudno, a UT associate professor of computer science and one of the tool's developers, told BioInform.
Aiding Clinical Diagnostics
CarpeNovo was initially developed in late 2009. Worthey told BioInform that MCW is currently using it to identify rare mutations in 11 clinical cases in addition to research efforts aimed at identifying mutations in cardiovascular disease, eye disease, cancer, and developmental and multisystem disorders.
CarpeNovo is "not decision-support software," Worthey noted. "It doesn’t tell you the answer but it gives you all of the information that’s required to find the answer."
The software links functional, positional, biochemical, and disease-association data for each variant in a patient's genome. It allows users to perform targeted analysis on particular genes, gene sets, or regions, and can perform cross-sample analyses involving multiple genomes.
According to its developers, users can load and analyze variants in a variety of file formats, such as VCF and BAM, vendor formats such as Illumina's CASAVA, and files from third-party bioinformatics companies like CLC Bio.
CarpeNovo runs Harvard University's PolyPhen and the J. Craig Venter Institute's SIFT to predict the likelihood that variants are deleterious. It also uses a series of internally developed applications to annotate genes and variants with information that is useful for whole-genome analysis, such as nucleotide-level conservation data and depth of coverage information, Worthey said. In addition, it calculates zygosity based on allele and total read depths for each variant and calls possible errors based on these results.
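The zygosity step described above can be illustrated with a minimal Python sketch. It is not CarpeNovo's code: the function name, the depth cutoff, and the allele-fraction thresholds are all assumptions chosen for illustration. The idea is only that the share of reads carrying the variant allele should sit near one half for a heterozygous call and near one for a homozygous call, and that anything in between is worth flagging.

```python
def call_zygosity(alt_depth, total_depth, min_depth=10,
                  het_range=(0.25, 0.75), hom_min=0.90):
    """Classify one variant from its allele read depth and total read depth.

    All thresholds are illustrative assumptions, not CarpeNovo's values.
    """
    if total_depth < 0 or not 0 <= alt_depth <= total_depth:
        raise ValueError("depths must satisfy 0 <= alt_depth <= total_depth")
    if total_depth == 0 or total_depth < min_depth:
        return "low_coverage"            # too few reads to make a call
    fraction = alt_depth / total_depth   # share of reads carrying the variant allele
    if fraction >= hom_min:
        return "homozygous"
    if het_range[0] <= fraction <= het_range[1]:
        return "heterozygous"
    return "possible_error"              # fraction fits neither expectation
```

Under these assumed thresholds, a variant seen in 15 of 30 reads would be called heterozygous, while one seen in 25 of 30 reads would be flagged as a possible error because its allele fraction falls between the two expected ranges.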
It also links the variants to disease associations using data from the Online Mendelian Inheritance in Man database and the Human Gene Mutation Database as well as polymorphism information from dbSNP and the system's internal repository, called the Variant Annotation, Listing and Classification Repository with Interface Environment (VALCRIE), which holds variants culled from publicly available data and from sequencing experiments.
"At the end of the analysis, we basically have a very richly annotated set of variants," which are then passed through CarpeNovo's filters, Worthey said.
The tool lets users filter variants by chromosome or chromosomal regions, or "if you have a set of candidate genes that you know are associated with a [separate] phenotype that seems similar to a patient's phenotype, you can also filter just to look for variants in those genes," for example, she said.
The software spits out a table of all of the genes containing the variants that match the user's criteria and links to annotation information, quality scores, and disease associations among other types of data.
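The filter-then-tabulate workflow Worthey describes can be sketched in a few lines of Python. This is a hypothetical illustration rather than CarpeNovo's implementation: the record fields (`chrom`, `pos`, `gene`) and both function names are assumptions, and a real annotated variant would carry many more attributes.

```python
from collections import defaultdict

def filter_variants(variants, chrom=None, start=None, end=None, genes=None):
    """Keep variants that match every criterion that was supplied.

    Each variant is a dict with assumed keys 'chrom', 'pos', and 'gene'.
    A criterion left as None is not applied.
    """
    kept = []
    for v in variants:
        if chrom is not None and v["chrom"] != chrom:
            continue
        if start is not None and v["pos"] < start:
            continue
        if end is not None and v["pos"] > end:
            continue
        if genes is not None and v["gene"] not in genes:
            continue
        kept.append(v)
    return kept

def group_by_gene(variants):
    """Build a gene -> list-of-variants table from the filtered variants."""
    table = defaultdict(list)
    for v in variants:
        table[v["gene"]].append(v)
    return dict(table)

# Made-up records with placeholder gene names, for illustration only.
example_variants = [
    {"chrom": "chr1", "pos": 1500, "gene": "GENE_A"},
    {"chrom": "chr1", "pos": 9000, "gene": "GENE_B"},
    {"chrom": "chr2", "pos": 1200, "gene": "GENE_A"},
]
candidates = filter_variants(example_variants, chrom="chr1", genes={"GENE_A", "GENE_B"})
gene_table = group_by_gene(candidates)
```

Treating each criterion as optional mirrors the description in the text, where a user may restrict by region, by candidate gene list, or by both.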
The MCW team has "a long features list" for later versions of CarpeNovo, Worthey said.
For example, the team is looking to add a third variant function prediction algorithm in addition to PolyPhen and SIFT, which tend to have high false-positive and false-negative rates, she said.
Additionally, she said the developers will add tools to identify promoters and enhancer elements, adding that these will be used in clinical research efforts to find variants that occur in non-coding regions of the genome. The team also plans to redesign the front end of the system.
At present, CarpeNovo isn't available for use outside MCW, Worthey said, although the team is not opposed to the idea and in fact has provided remote access for some of its collaborators.
"At the moment, there is enough work to be done to keep doing the development as a tool rather than … making it available for other people," she said. "I am still trying to work out how to fund making it a resource or a web application."
'Google for Genetic Variants'
The University of Toronto researchers, meantime, began developing MedSavant about six months ago and were recently awarded a $50,000 grant from the Ontario Genomics Institute to support their work (BI 11/11/2011).
Fiume told BioInform this week that besides its use in the autism genome project, MedSavant will be part of the infrastructure used in the Finding of Rare Disease Genes in Canada (FORGE Canada) initiative, a group that aims to identify genes involved in genetic diseases and cancers in children by collecting and sequencing data from hundreds of individuals in Canada and globally.
Earlier this year, the FORGE group was one of two beneficiaries of a C$4.5 million ($4.6 million) grant awarded by the Canadian government to support genomic research into childhood diseases (GWDN 2/22/2011).
Brudno explained that his team is handling "the data storage for the project" and that MedSavant "is the way we are going to allow others to look at FORGE data."
The tool is linked to the Savant genome browser — a tool for manually inspecting SNPs and structural variants that was also developed by the UT team. The researchers described it in a paper published in Bioinformatics last year.
"The idea behind the partnership between MedSavant and Savant is that you have a large list of genes that you are going to filter using MedSavant and once you have distilled down a small enough subset then you can look at manually, then you can export that and take a look at that in Savant," Fiume explained to BioInform this week, adding that the tool is "like a Google for genetic variants."
MedSavant consists of two parts: a graphical user interface and a backend database.
The database holds basic patient data such as age, sex, and pedigree; phenotype data such as disease, signs, and symptoms; and genotype information, which includes things like candidate variants, their types, and genomic locations. It also includes a compression tool that makes it possible to store large quantities of variant records.
The tool accepts data in the VCF file format and lets users visualize global trends in their data, create and run queries on it, and then analyze the results. It also incorporates known information about variants from resources like dbSNP, the Gene Ontology, and OMIM databases using a series of plug-ins.
For instance, one plug-in lets users select nodes from the Gene Ontology and then filter variants based on the intersection of transcripts that are associated with those nodes.
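One way to read the Gene Ontology filter described above is as a set intersection: collect the transcripts annotated to each selected node, keep only those common to all of them, and retain the variants that fall in those transcripts. The Python sketch below follows that reading. It is an assumption about the behavior, not MedSavant's plug-in code, and the function names, the `transcript` field, and the term and transcript labels are all placeholders.

```python
def transcripts_in_all_terms(term_to_transcripts, selected_terms):
    """Return the transcripts annotated to every selected ontology term."""
    sets = [set(term_to_transcripts.get(term, ())) for term in selected_terms]
    return set.intersection(*sets) if sets else set()

def filter_by_go_terms(variants, term_to_transcripts, selected_terms):
    """Keep variants whose transcript is shared by all selected terms.

    Each variant is a dict with an assumed 'transcript' key.
    """
    allowed = transcripts_in_all_terms(term_to_transcripts, selected_terms)
    return [v for v in variants if v["transcript"] in allowed]

# Placeholder annotations and variants, for illustration only.
annotations = {
    "TERM_A": {"tx1", "tx2", "tx3"},
    "TERM_B": {"tx2", "tx3", "tx4"},
}
variants = [
    {"id": "var1", "transcript": "tx1"},
    {"id": "var2", "transcript": "tx2"},
    {"id": "var3", "transcript": "tx4"},
]
shared = filter_by_go_terms(variants, annotations, ["TERM_A", "TERM_B"])
```

Here only `var2` survives, because `tx2` is the only transcript carrying a variant that is annotated to both placeholder terms.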
Like CarpeNovo, MedSavant runs PolyPhen and SIFT to predict the function of variants. It also incorporates a third functional prediction tool called Protein Analysis Through Evolutionary Relationships, or Panther, which categorizes genes by their functions based on published data and evolutionary relationships.
Following filtration, users can move their data over to the Savant genome browser, where they can manually inspect the read alignment data that supports the candidate variants.
Unlike CarpeNovo, MedSavant is available for general use. Users simply download a client that "talks" to a server that hosts MedSavant's database of variants, Fiume said, adding that the tool can run on any standard computer system.
"What we have now is a prototype, which we like, but the ultimate goal is to add on more analytic components, more visualization components," he said.
"I want users to think about the genome more as a functional unit, so being able to filter based on gene functions and participation in various pathways, which will involve a lot of intersection with external datasets," he added.