NEW YORK (GenomeWeb) – Researchers from Marshfield Clinic Research Foundation and other US-based institutions have published a description of a freely available Hadoop-based system they developed to identify mutations that may contribute to genetic disorders from next-generation sequencing data collected from families.
The so-called SeqHBase solution, described in a recently published Journal of Medical Genetics paper, leverages the open source Hadoop framework and HBase database infrastructure, which together provide the requisite scalable, distributed processing and hosting capabilities to support the analysis of hundreds of thousands of genomic datasets, the researchers wrote.
These particular resources, which have been harnessed for other NGS-based tasks such as sequence alignment and SNP calling, have the ability to address what the researchers called a "critical need" for tools that can quickly and efficiently manipulate variants and their associated annotations to help researchers identify candidate mutations for genetic disorders.
Although so far the method has been applied only to familial data, it could be used in population studies such as the UK's 10K and 100K Genomes projects or even the recently announced Precision Medicine Initiative in the US, Max He, an assistant professor of human genetics and biomedical informatics at Marshfield Clinic and one of the lead authors on the SeqHBase paper, told GenomeWeb.
As such, He and his team plan to include statistical methods for analyzing population-based sequencing data in a later iteration of the solution, and are currently seeking funds to support this effort.
Benchmarking studies described in the paper using familial data have already demonstrated the solution's ability to handle larger quantities of data than might typically be used in familial studies, he noted. While such studies typically focus on data from trios — two parents and a proband — with SeqHBase, the researchers were able to combine and analyze whole genome or exome sequence, variant, annotation, and coverage information from multiple family members in three separate studies and to predict candidate mutations for the respective diseases.
Having that extra information made it easier to exclude false positives as well as find additional support for likely disease-causing mutations, He said. Moreover, the system works quickly, requiring roughly a minute or less to return results, which makes it well suited to clinical use, he added.
SeqHBase takes as input BAM or pileup files, variant call format (VCF) files, and functional annotation files, and then uses the MapReduce programming model to split the input data into separate chunks that are processed in parallel. Specifically, "in conjunction with a pedigree file, coverage information, genetic variations, and variant annotations are extracted [from input files] by the reduce tasks in a parallel manner," the researchers wrote. Based on the provided information, the system suggests candidate de novo, inherited homozygous, or compound heterozygous mutations.
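As a rough illustration of the split, map, and reduce pattern described above, the following toy sketch runs entirely in one Python process. It is not SeqHBase code and does not use Hadoop or HBase; the record layout and the function names (`split_into_chunks`, `map_chunk`, `reduce_by_variant`) are assumptions made up for this example.

```python
from collections import defaultdict

def split_into_chunks(records, chunk_size):
    """Split the input into fixed-size chunks, as a framework would before mapping."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def map_chunk(chunk):
    """Map step: emit (variant_key, (sample, genotype)) pairs for one chunk."""
    for sample, chrom, pos, ref, alt, genotype in chunk:
        yield (chrom, pos, ref, alt), (sample, genotype)

def reduce_by_variant(mapped_pairs):
    """Reduce step: gather every family member's genotype under the same variant key."""
    grouped = defaultdict(dict)
    for key, (sample, genotype) in mapped_pairs:
        grouped[key][sample] = genotype
    return dict(grouped)

# Hypothetical per-sample variant records (sample, chrom, pos, ref, alt, genotype).
records = [
    ("child", "1", 100, "A", "G", "0/1"),
    ("father", "1", 100, "A", "G", "0/0"),
    ("mother", "1", 100, "A", "G", "0/0"),
    ("child", "2", 500, "C", "T", "1/1"),
]

chunks = split_into_chunks(records, chunk_size=2)
# In a real cluster each chunk would be mapped on a different node; here it is sequential.
pairs = [pair for chunk in chunks for pair in map_chunk(chunk)]
by_variant = reduce_by_variant(pairs)
```

Grouping genotypes by variant is what lets later steps compare parents and offspring at the same site.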
From annotation files, it extracts information on variants such as chromosome number, start and end position, reference and alternative allele, PolyPhen and SIFT scores, and more. From VCF files, SeqHBase pulls information on called variant genotypes, read depth, and phred quality scores; and it extracts coverage information from BAM files. To detect de novo variants, users supply adjustable input parameters such as variant frequency, minimum read depths, and predicted functional deleteriousness scores. The system then compares parental and offspring data "for all potential de novo mutations where the affected carries a heterozygous variant and both parents carry high coverage ... reference alleles," the researchers wrote.
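The de novo rule quoted above can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the `Call` class, the VCF-style genotype strings ("0/0", "0/1", "1/1"), and the default depth threshold of 10 are assumptions chosen for illustration, and the frequency and deleteriousness filters mentioned in the article are left out.

```python
from dataclasses import dataclass

@dataclass
class Call:
    genotype: str  # VCF-style genotype, e.g. "0/0", "0/1", "1/1" (assumed encoding)
    depth: int     # read depth at the site

def alleles(genotype):
    """Return the sorted allele list, treating phased ('|') and unphased ('/') alike."""
    return sorted(genotype.replace("|", "/").split("/"))

def is_candidate_de_novo(child, father, mother, min_depth=10):
    """True if the child is heterozygous and both parents are well-covered homozygous reference."""
    child_is_het = alleles(child.genotype) == ["0", "1"]
    parents_are_ref = all(alleles(p.genotype) == ["0", "0"] for p in (father, mother))
    parents_well_covered = all(p.depth >= min_depth for p in (father, mother))
    return child_is_het and parents_are_ref and parents_well_covered
```

The parental depth check matters because a low-coverage parent can appear homozygous reference simply because the variant allele was never sampled.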
For potential heritable homozygous mutations, the system compares data from parents and affected offspring to determine where the affected individuals carry a homozygous variant and both parents carry heterozygous variants. For potential compound heterozygous mutations, it compares the datasets to ensure that the affected individual carries two different variants in the same gene region and that each comes from a different parent. Moreover, the system incorporates sequence information from multiple healthy siblings to reduce false positive rates as well as data from affected siblings to bolster the chances of detecting true disease-contributing mutations, the paper states.
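The two inheritance checks described above might look like the following. This is a simplified sketch under stated assumptions, not SeqHBase code: genotypes are VCF-style strings, each variant is a plain dictionary with hypothetical keys (`id`, `child`, `father`, `mother`), and the sibling-based filtering the paper describes is omitted.

```python
def alleles(genotype):
    """Return the sorted allele list, treating phased ('|') and unphased ('/') alike."""
    return sorted(genotype.replace("|", "/").split("/"))

def is_het(genotype):
    return alleles(genotype) == ["0", "1"]

def is_hom_ref(genotype):
    return alleles(genotype) == ["0", "0"]

def is_hom_alt(genotype):
    return alleles(genotype) == ["1", "1"]

def is_candidate_inherited_homozygous(child, father, mother):
    """Affected child is homozygous for the variant; both parents are heterozygous carriers."""
    return is_hom_alt(child) and is_het(father) and is_het(mother)

def compound_het_pairs(gene_variants):
    """Pair up variants in one gene where one comes from the father and one from the mother."""
    paternal = [v for v in gene_variants
                if is_het(v["child"]) and is_het(v["father"]) and is_hom_ref(v["mother"])]
    maternal = [v for v in gene_variants
                if is_het(v["child"]) and is_hom_ref(v["father"]) and is_het(v["mother"])]
    return [(p["id"], m["id"]) for p in paternal for m in maternal]

# Hypothetical variants that all fall in the same gene.
gene_variants = [
    {"id": "var1", "child": "0/1", "father": "0/1", "mother": "0/0"},
    {"id": "var2", "child": "0/1", "father": "0/0", "mother": "0/1"},
    {"id": "var3", "child": "0/1", "father": "0/1", "mother": "0/1"},
]
pairs = compound_het_pairs(gene_variants)
```

A variant carried by both parents, such as `var3` here, is skipped because its parental origin in the child cannot be assigned from genotypes alone.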
To demonstrate the tool's efficacy, the researchers applied SeqHBase to datasets from a five-member nuclear family that included one child with Rodriguez syndrome; a four-member nuclear family that included a child with idiopathic hemolytic anemia (IHA); and a third 10-member extended family with two male siblings with severe intellectual disabilities, autistic behaviors, and attention deficit hyperactivity disorder.
In the five-member family study, the researchers were able to identify six candidate de novo mutations and two candidate compound heterozygous mutations that could be associated with Rodriguez syndrome. The system also suggested a candidate inherited homozygous mutation that occurs in a gene that does not have any known associations to the genetic condition.
In the four-member family analysis, the system identified 16 candidate de novo mutations, none of which occur in IHA-linked genes based on available literature, as well as two candidate compound heterozygous mutations whose associations with the disease have been documented in previous studies. Analysis of datasets from the 10-member family returned 18 candidate mutations shared between the two affected siblings. The list included an inherited homozygous mutation found in both siblings in a gene that has not previously been associated with their particular condition; a potentially pertinent X-linked non-synonymous mutation; and two possible compound heterozygous mutations.
In terms of speed, analysis times were almost linearly scalable with the number of data nodes, according to the paper. For example, with access to 20 data nodes, each equipped with 6 gigabytes of memory, two central processing units, and a terabyte of hard disk space, SeqHBase required roughly 16 seconds to identify de novo, inherited homozygous, and compound heterozygous mutations in WGS data from the five-member family. That compares with between 65 and 75 seconds for the same analysis with just five data nodes, and about 25 to 35 seconds with 15 nodes, according to one of the figures in the paper.
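One way to read the "almost linearly scalable" claim is to compare the observed speedup with the ideal speedup for each cluster size. The sketch below does that arithmetic; the runtimes are midpoints of the ranges reported above (70 seconds for five nodes, 30 seconds for 15 nodes), which is an assumption made here for illustration rather than a figure from the paper.

```python
# Approximate runtimes in seconds, keyed by number of data nodes.
# The 5- and 15-node values are midpoints of the reported ranges (an assumption).
runtimes_s = {5: 70.0, 15: 30.0, 20: 16.0}

def scaling_efficiency(runtimes, baseline_nodes):
    """Return observed speedup divided by ideal speedup for each cluster size."""
    baseline_time = runtimes[baseline_nodes]
    efficiency = {}
    for nodes, seconds in runtimes.items():
        observed_speedup = baseline_time / seconds
        ideal_speedup = nodes / baseline_nodes
        efficiency[nodes] = observed_speedup / ideal_speedup
    return efficiency

efficiency = scaling_efficiency(runtimes_s, baseline_nodes=5)
for nodes in sorted(efficiency):
    print(f"{nodes} nodes: efficiency {efficiency[nodes]:.2f}")
```

Values near 1.0 indicate near-linear scaling. Because the inputs are rough reads of a figure, the results should be treated as indicative only.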
Using the same 20-data-node setup to analyze whole exome sequence from the four-member nuclear family produced candidate mutations in all three aforementioned categories in roughly five seconds apiece, according to the paper. Finally, when the cluster was applied to datasets from the 10-member extended family unit, it took about 80 seconds to return candidate variants of each type.
SeqHBase is freely available for use by academic or non-profit organizations. Its source code can be downloaded after securing a licensing agreement with Marshfield Clinic Applied Sciences. The system can be deployed and run locally or in the cloud.