Researchers in the pathology and bioinformatics arms of the Memorial Sloan-Kettering Cancer Center are developing pipelines that will eventually be used to manage and analyze data and workflows for the center's next-generation sequencing-based clinical diagnostic assays.
Both pipelines combine open source software and internally developed applications into a scalable, automated infrastructure for managing and analyzing data. That infrastructure is currently being used in cancer research studies at MSKCC, including one focused on detecting low-frequency somatic mutations.
MSKCC scientists described the pipelines in two posters at the Genome Informatics conference held last week at Cold Spring Harbor Laboratory.
The first poster described the NGS Data Management System, or DMS. This system is intended to manage data generated by targeted hybridization capture assays performed primarily on Illumina instruments, although it can be extended relatively quickly to work with data from other sequencing platforms, Aijazuddin Syed, the lead bioinformatics engineer on the DMS pipeline project, told BioInform.
Features in the Python-based framework include tools to detect and monitor new sequence runs, ensure that they are completed, and process the resulting data in the FASTQ files, including running quality checks on the data and generating BAM files, Syed said. The system is also responsible for downsizing and archiving information and associated metadata in run folder files, and it generates summary statistical reports containing information about the completed runs.
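The run-monitoring step described above can be illustrated with a minimal Python sketch. It is not MSKCC's code: the function name `find_completed_runs` is hypothetical, and the sketch assumes that a run folder counts as complete once a marker file named `RTAComplete.txt` appears in it.

```python
from pathlib import Path

# Assumption: a run folder is treated as complete once this marker file exists.
COMPLETION_MARKER = "RTAComplete.txt"


def find_completed_runs(root, already_seen):
    """Return names of run folders under root that are complete and not yet processed."""
    completed = []
    for run_dir in sorted(Path(root).iterdir()):
        if not run_dir.is_dir():
            continue  # ignore stray files next to the run folders
        if run_dir.name in already_seen:
            continue  # this run was picked up on an earlier pass
        if (run_dir / COMPLETION_MARKER).exists():
            completed.append(run_dir.name)
    return completed
```

A scheduler could call a function like this periodically and hand each newly completed run to the downstream quality-check and BAM-generation steps.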
Basically, "we keep a record of everything," he said. "For every entity there is a history table, for every analysis pipeline we maintain a history of what kind of tools were used, what version of tools were used" and so on.
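One way to picture the kind of history table Syed describes is the following Python sketch, which uses the standard library's `sqlite3` module. The table name, column names, and function names are all hypothetical; the article does not say which database or schema the DMS uses.

```python
import sqlite3


def create_history_table(conn):
    """Create a hypothetical table recording which tool versions touched a run."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS analysis_history (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            run_id TEXT NOT NULL,
            tool_name TEXT NOT NULL,
            tool_version TEXT NOT NULL,
            recorded_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )


def record_tool_use(conn, run_id, tool_name, tool_version):
    """Append one history row; existing rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO analysis_history (run_id, tool_name, tool_version) "
        "VALUES (?, ?, ?)",
        (run_id, tool_name, tool_version),
    )
    conn.commit()


def history_for_run(conn, run_id):
    """Return (tool_name, tool_version) pairs for a run, oldest first."""
    rows = conn.execute(
        "SELECT tool_name, tool_version FROM analysis_history "
        "WHERE run_id = ? ORDER BY id",
        (run_id,),
    )
    return rows.fetchall()
```

Because rows are only ever appended, the table doubles as an audit trail showing exactly which tool versions produced a given result.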
The system is also responsible for setting up analysis jobs, ensuring that the raw sequence enters the appropriate pipelines in the system's analysis framework, which was the subject of the second poster at the Genome Informatics conference.
This Perl-based pipeline, which is intended to identify variation in multiple samples with high sensitivity and specificity, includes tools for trimming adaptors; mapping and duplicate masking; local realignment around indels; recalibrating base call quality scores; and calling single nucleotide variants, insertions, and deletions. The pipeline also provides variant annotation and filtering capabilities.
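The ordering of the stages listed above can be sketched as a simple sequential runner. The actual pipeline is written in Perl; this Python version is only an illustration, and the stage names, the `run_pipeline` function, and the idea of passing one stage's output directly to the next are assumptions rather than details from the posters.

```python
# Hypothetical stage names, in the order the article lists them.
STAGES = [
    "trim_adaptors",
    "map_reads",
    "mask_duplicates",
    "realign_around_indels",
    "recalibrate_base_qualities",
    "call_variants",
    "annotate_variants",
    "filter_variants",
]


def run_pipeline(sample, stage_functions, stages=STAGES):
    """Run each stage in order, feeding one stage's output into the next.

    stage_functions maps a stage name to a callable taking and returning
    the data for one sample. Returns the final data and the list of stages
    that completed, which could feed a history record.
    """
    completed = []
    data = sample
    for name in stages:
        data = stage_functions[name](data)
        completed.append(name)
    return data, completed
```

In practice each callable would wrap an external tool invocation; keeping the stage list as data makes it straightforward to swap one caller for another for a particular application.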
Ronak Shah, a bioinformatics engineer at MSKCC and one of the developers of the analysis pipeline, told BioInform that the researchers use open source programs such as the Burrows-Wheeler Aligner and the Broad Institute's Genome Analysis Toolkit. Other tools in the pipeline include MuTect, which is used to call somatic mutations; IndelLocator for calling indels, although VarScan, Pindel, and Dindel are used instead for some applications; and ANNOVAR for variant annotation.
In his poster describing the pipeline, Shah focused specifically on how it could be used to detect low-frequency somatic alterations. Among other assays, the analysis pipeline, along with the DMS, will be used to analyze data from a 340-cancer-gene panel that is being developed at MSKCC.
Both pipelines are currently being validated in research studies at MSKCC.