NEW YORK (GenomeWeb) — Scientists at the Ontario Institute for Cancer Research have developed a technique for predicting the quality and coverage of a sequencing experiment from a small set of data.
The statistical framework, called SeqControl, allows users to monitor the quality of production-scale sequencing. The researchers published a description of the approach, including a demonstration of its use for whole-genome sequencing of tumors and normal samples, in Nature Methods this week.
According to Paul Boutros, a principal investigator of informatics and biocomputing at OICR and the senior author of the study, many researchers do not know ahead of time how much data a sequencing experiment will yield, or how good that data will be. "Essentially, we're working in a retroactive mode, where we will do a piece of work and then evaluate how good it was," he said.
This means experiments sometimes fail to generate sufficient data of high enough quality. For example, according to the paper, about a quarter of normal and tumor samples in the Canadian Prostate Cancer Genome Project, CPC-GENE, did not reach their target sequencing depth and thus required additional sequencing. "This imprecision may be acceptable in research settings but not when sequencing is a component of clinical and industrial processes," the authors wrote.
For that reason, Boutros, who is also an assistant professor of medical biophysics at the University of Toronto, and his colleagues decided to develop so-called statistical process control for next-gen sequencing. "To our knowledge, there has never been a framework for the process control of sequencing data before — there is nothing in the literature," Boutros said, though he assumes that large sequencing centers have their own internal methods.
SeqControl works by rapidly assessing experimental sequencing data for a set of 15 quality metrics in four categories: overall coverage, coverage distribution, basewise coverage, and basewise variant-calling confidence scores.
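The article does not detail how each metric is computed, but as a rough illustration of the first three categories, a few coverage-style metrics can be derived from an aligned BAM file with pysam. The metric names and the 20x threshold below are illustrative assumptions, not the published SeqControl definitions.

```python
# Illustrative sketch only: a few coverage-style metrics of the kind
# SeqControl draws on, computed from an aligned BAM with pysam.
# Metric definitions are assumptions, not SeqControl's published ones.
import numpy as np
import pysam

def coverage_metrics(bam_path, contig, start, end, min_depth=20):
    bam = pysam.AlignmentFile(bam_path, "rb")
    depth = np.zeros(end - start, dtype=int)
    # Per-base depth across the region (basewise coverage).
    for col in bam.pileup(contig, start, end, truncate=True):
        depth[col.reference_pos - start] = col.nsegments
    return {
        "mean_coverage": depth.mean(),                         # overall coverage
        "coverage_cv": depth.std() / max(depth.mean(), 1e-9),  # evenness of the distribution
        "pct_bases_ge_20x": (depth >= min_depth).mean(),       # basewise adequacy
    }
```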
The data do not have to come from a complete experiment — a small subset is sufficient, Boutros said. The researchers then use the results from their analysis to make suggestions for how the experiment could be improved, or to predict how much sequencing is required to get enough data for robust statistical analysis. "It's kind of a machine-learning technique that starts off with those 15 metrics and then can predict anything you might want to determine about a sequencing experiment," Boutros said.
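Boutros describes the approach only in general terms; one plausible realization of "predict anything from the 15 metrics" is a standard supervised model trained on metric vectors from past runs. The scikit-learn sketch below is an assumption about the overall shape of such a predictor, not the published SeqControl model.

```python
# Hypothetical sketch: learn to map 15 pilot-run quality metrics to a
# pass/fail outcome for the full experiment. Placeholder data stands in
# for historical runs; this is not the published SeqControl model.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 15))      # one row of 15 metrics per past run (placeholder)
y = X[:, 0] + X[:, 3] > 1.0    # whether the run met its target (placeholder)

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X, y)

# For a new pilot run, predict the probability that the full
# experiment will reach its quality target.
pilot_metrics = rng.random((1, 15))
print(model.predict_proba(pilot_metrics)[0, 1])
```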
To test SeqControl, they applied it to existing whole-genome sequencing data from a set of 27 prostate cancers and 26 matched controls. They found that using about 2 percent of those data was enough to predict the quality of the entire experiment with high accuracy.
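The article does not say how the subsets were drawn; one simple way to pull a roughly 2 percent subset from existing aligned data is to sample reads by a hash of their names, which keeps mate pairs together. A minimal sketch, with paths and the hashing scheme as assumptions:

```python
# Illustrative sketch: draw a ~2 percent subsample of reads from a BAM,
# the scale at which the authors report accurate predictions for the
# full experiment. Paths and hashing scheme are assumptions.
import hashlib
import pysam

def subsample_bam(in_path, out_path, fraction=0.02):
    bam_in = pysam.AlignmentFile(in_path, "rb")
    bam_out = pysam.AlignmentFile(out_path, "wb", template=bam_in)
    for read in bam_in:
        # Hash the read name so both mates of a pair get the same decision.
        h = int(hashlib.md5(read.query_name.encode()).hexdigest(), 16)
        if (h % 10_000) / 10_000 < fraction:
            bam_out.write(read)
    bam_out.close()
```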
Moreover, using SeqControl, they were able to detect the effects of subtle changes to a sequencing experiment, such as protocol modifications, new reagent batches, or a different technician. The technique could therefore also be used to predict how such changes will affect the quality of a study, Boutros said.
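Detecting such shifts is the textbook statistical-process-control setting. As a generic illustration (not the paper's method), a Shewhart-style control chart flags runs whose metrics drift beyond a few standard deviations of the historical baseline:

```python
# Illustrative Shewhart-style control chart over one quality metric:
# flag runs that drift more than n_sigma standard deviations from the
# historical mean, e.g. after a reagent-lot or protocol change.
# This is a generic SPC tool, not the paper's method.
import numpy as np

def control_chart_flags(historical, new_values, n_sigma=3.0):
    mu, sigma = np.mean(historical), np.std(historical)
    return [abs(v - mu) > n_sigma * sigma for v in new_values]

baseline = [37.8, 38.4, 39.1, 38.0, 37.5, 38.9]     # mean coverage of past runs
print(control_chart_flags(baseline, [38.2, 33.1]))  # second run is flagged
```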
In practice, to predict data quality for a new experiment, the researchers run a set of pooled barcoded libraries in a single lane of an Illumina MiSeq sequencer overnight and apply SeqControl to assess the quality of the libraries. They then decide how much production sequencing on a HiSeq instrument each of the libraries requires.
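The article does not spell out that calculation, but once the pilot yields a predicted per-lane coverage for each library, the decision reduces to simple arithmetic. All numbers below are illustrative:

```python
# Back-of-the-envelope sketch of the pilot-to-production decision:
# given the per-lane coverage the MiSeq pilot predicts for each library,
# estimate how many HiSeq lanes it needs. All numbers are illustrative.
import math

def lanes_needed(target_coverage, predicted_coverage_per_lane):
    return math.ceil(target_coverage / predicted_coverage_per_lane)

predicted = {"tumor_A": 11.5, "normal_A": 14.0}  # predicted x coverage per lane
for library, per_lane in predicted.items():
    print(library, lanes_needed(50, per_lane))   # tumor_A: 5 lanes, normal_A: 4
```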
At OICR, SeqControl has been implemented in the sequencing production pipeline, where the software collects the 15 quality metrics for all sequence data that come off a sequencer. As a result, "you start building up this significant database of information about your experiment, and then it becomes quite rapid to do the quality control assessments," Boutros said.
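The article does not describe OICR's storage layer; as one minimal way to accumulate per-run metrics, a small SQLite table would do. The schema and column names below are assumptions:

```python
# Minimal sketch of a metrics store of the kind described: one row of
# quality metrics per library per run, recorded as data comes off the
# sequencer. Schema is an assumption, not OICR's actual pipeline.
import sqlite3

conn = sqlite3.connect("seq_metrics.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS run_metrics (
        run_id   TEXT,
        lane     INTEGER,
        library  TEXT,
        recorded_at TEXT DEFAULT CURRENT_TIMESTAMP,
        mean_coverage REAL,
        coverage_cv   REAL,
        pct_bases_ge_20x REAL
        -- ...columns for the remaining metrics, 15 in total
    )
""")
conn.execute(
    "INSERT INTO run_metrics "
    "(run_id, lane, library, mean_coverage, coverage_cv, pct_bases_ge_20x) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    ("RUN0001", 1, "tumor_A", 38.2, 0.21, 0.94),
)
conn.commit()
```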
SeqControl can be used on data from any sequencing platform. However, Boutros expects that it will be most useful for real-time single-molecule platforms like Oxford Nanopore's because it could provide feedback on an ongoing sequencing run in real time. "You can adaptively decide when to stop the sequencing experiment," he said.
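The article states only the idea; in practice, a real-time implementation might periodically recompute the metrics on the reads accumulated so far and stop once a model of the kind sketched above predicts the target quality will be met. A hypothetical monitoring loop:

```python
# Hypothetical sketch of adaptive run termination on a real-time
# platform: recompute metrics on the reads so far, ask a trained model
# for the predicted final quality, and stop once the prediction clears
# a threshold. Function and parameter names are assumptions.
import time

def monitor_run(get_reads_so_far, compute_metrics, model,
                threshold=0.95, poll_seconds=600):
    while True:
        metrics = compute_metrics(get_reads_so_far())  # 15-metric vector
        p_pass = model.predict_proba([metrics])[0, 1]
        if p_pass >= threshold:
            return "stop sequencing: predicted quality target reached"
        time.sleep(poll_seconds)  # let more reads accumulate
```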
So far, the researchers have demonstrated SeqControl for whole-genome sequencing of tumors and normal samples, and they suspect it will also work for exome sequencing and ChIP-seq studies, though they have not shown that yet. Applying it to RNA-seq studies would likely require some algorithmic modifications, Boutros said.
The current version, which is available from their website, is already "sufficient for a production-scale sequencing environment," he said. However, his team is working on improving the software to make it run faster and more efficiently, and to optimize it for targeted sequencing panels. They have also patented SeqControl and are in the process of identifying potential partners for commercialization.
When it comes to monitoring sequencing quality, large-scale sequencing centers can generate enough data to build quality models rapidly, but that might not be so easy for a small laboratory with a single sequencer, Boutros said. Therefore, it would be helpful if users submitted their sequencing quality metrics in a standardized format to a public database, such as dbGaP. "As a community, we would be able to say, 'this batch of reagents seems to have led to a really good increase in the quality of sequencing,' or 'this protocol change had a really bad effect on a certain type of error,'" he said.
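No such standardized format exists yet; purely as a hypothetical illustration, a submission record might look like the following, with every field name invented for the example:

```python
# Hypothetical example of a standardized quality-metrics record a lab
# might submit to a shared repository. All field names are invented;
# no such community schema currently exists.
import json

record = {
    "center": "example-lab",
    "platform": "Illumina HiSeq 2500",
    "reagent_lot": "LOT-1234",
    "protocol_version": "v2.1",
    "metrics": {
        "mean_coverage": 38.2,
        "coverage_cv": 0.21,
        "pct_bases_ge_20x": 0.94,
    },
}
print(json.dumps(record, indent=2))
```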
Sequencing instrument vendors or manufacturers of laboratory information systems could help by enabling users to collect quality data in a standardized format. "As a community, we could rapidly learn together, and it would allow people who don't even have access to data straight off the sequencer to start contributing to our understanding of sequencing data quality," he said.