Scientists at the University of Southern California have developed a mathematical method called Preseq that predicts the molecular complexity of sequencing libraries in order to give researchers an estimate of how deeply they need to sequence their DNA samples to achieve complete coverage.
The team, which comprises Andrew Smith, a computational biologist at USC, and Timothy Daley, a USC graduate student, believes the software could help researchers save time and cut costs associated with next-generation sequencing, particularly in the clinical setting, where costs are a key consideration.
In a recent Nature Methods paper that describes Preseq, the developers explain that the command-line software uses data from initial shallow sequencing runs to evaluate the complexity of sequencing libraries. These initial runs comprise between 1 million and 5 million reads, a small fraction of the several hundred million to billions of reads generated in a complete run.
Defining complexity to be "the expected number of distinct molecules that can be observed in a given set of sequenced reads," the authors explain that they use the initial run to determine how many previously unsequenced molecules would be obtained from additional sequencing.
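In rough terms (the notation here is ours and follows the older Good-Toulmin estimator from the statistics literature, not necessarily the paper's exact formulation): if n_j is the number of distinct molecules seen exactly j times in the shallow run, then scaling the experiment up by a factor t is predicted to uncover about

    \Delta(t) = \sum_{j \ge 1} (-1)^{j+1}\, t^{j}\, n_j

previously unseen molecules. This alternating series becomes unstable once t grows much beyond 1, which is what makes extrapolating far past the initial run the hard part of the problem.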
In other words, it predicts how much non-redundant information could be generated from the library, USC's Smith said.
After the initial shallow run, researchers can use the software to "look at the relative frequencies of molecules and … estimate how much we will actually need to be sequencing in order to really have surveyed the full molecular complexity of the library," he explained to BioInform.
Researchers can use this information to plan their sequencing experiments — for example, to determine whether it will be best to sequence more deeply from an existing library or to generate another library.
Smith said that he and Daley developed the method so that they would have a way to ensure that they were getting a diverse set of molecules from their NGS experiments.
"We found ourselves … frequently looking at DNA sequencing results where there were a lot of duplicate copies of molecules sequenced … when you have multiple reads that you know correspond to the same original molecule," he explained. "In a lot of studies, those aren't really very useful because they are just telling you what gets amplified better in PCR rather than what's actually present more frequently in a real biological sample."
He said that they first tried to use an older Poisson-based method that was developed for Sanger sequencing technologies by Eric Lander and Michael Waterman. However, that did not work for NGS data because it "does not account for the various biases typical in applications of high-throughput sequencing," the paper explains.
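For context, a textbook Poisson-sampling form of that idea (not quoted from the paper): if a library holds L distinct molecules and each is equally likely to be read, then after N reads the expected number of distinct molecules observed is roughly

    E[D] \approx L\left(1 - e^{-N/L}\right).

Uniform sampling is the best case, so once PCR amplification skews molecule frequencies this expression overstates how many new molecules further sequencing will return.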
After investigating a number of other statistical approaches, the team found one from ecology known as capture-recapture that seemed to work well for their purposes, Smith said.
In this method, individuals are captured and tagged so that researchers can tell when an individual is captured a second time. The number of times each individual is captured is then used to make inferences about the population as a whole, such as the number of gorillas remaining in the wild.
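The simplest form of that idea is the Lincoln-Petersen estimator. Here is a minimal sketch of it, standard ecology arithmetic rather than anything taken from Preseq, with a function name of our own choosing:

    def lincoln_petersen(marked_first: int, caught_second: int, recaptured: int) -> float:
        """Estimate total population size from two capture sessions."""
        # If tagged animals mix evenly back into the population, the share of
        # tags in the second catch mirrors their share in the whole population:
        # marked_first / N  ~=  recaptured / caught_second.
        if recaptured == 0:
            raise ValueError("no recaptures; the population estimate is unbounded")
        return marked_first * caught_second / recaptured

    # Example: tag 100 gorillas; later catch 60, of which 15 carry tags.
    print(lincoln_petersen(100, 60, 15))   # -> 400.0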
Building on this model, Preseq uses unique molecular identifiers such as DNA barcodes to track "the frequency of each unique observation," according to the Nature Methods paper. "Using these frequencies, we estimate the expected number of molecules that would be observed once, twice, and so on, in an experiment of the same size from the same library."
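As a toy illustration of that bookkeeping (this is not the authors' code, and the bootstrap resampling below is only a brute-force stand-in for their estimator): tally how often each tag is seen, then redraw experiments of the same size and count how many molecules turn up once, twice, and so on.

    import random
    from collections import Counter

    def duplicate_histogram(tags):
        """Map j -> number of distinct molecules observed exactly j times."""
        per_molecule = Counter(tags)            # reads per unique tag
        return Counter(per_molecule.values())   # frequency of those frequencies

    def expected_histogram_same_size(tags, trials=200, seed=0):
        """Bootstrap estimate of the expected histogram for another run of the same size."""
        rng = random.Random(seed)
        totals = Counter()
        for _ in range(trials):
            redraw = [rng.choice(tags) for _ in range(len(tags))]
            totals.update(duplicate_histogram(redraw))
        return {j: n / trials for j, n in sorted(totals.items())}

    # Tags stand in for molecular barcodes or mapping coordinates (made-up data).
    reads = ["A", "A", "B", "C", "C", "C", "D", "E", "E", "F"]
    print(duplicate_histogram(reads))            # Counter({1: 3, 2: 2, 3: 1})
    print(expected_histogram_same_size(reads))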
Prior to using Preseq, researchers need to sequence a small portion of their sample and then map their reads to the reference genome, Smith told BioInform. These mapped reads are then entered into the software, which generates a curve "that will tell you, [of the] reads you've sequenced, what proportion of them will be distinct molecules," he said.
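A toy sketch of what such a curve contains (again a simplification of our own, not the tool's implementation): draw growing numbers of reads from the mapped set and count how many distinct molecules they represent.

    import random

    def complexity_curve(tags, step=2, seed=0):
        """Distinct molecules observed as a function of reads drawn, without replacement."""
        rng = random.Random(seed)
        order = tags[:]                        # one tag per mapped read
        rng.shuffle(order)
        seen, curve = set(), []
        for i, tag in enumerate(order, start=1):
            seen.add(tag)
            if i % step == 0 or i == len(order):
                curve.append((i, len(seen)))   # (reads sequenced, distinct molecules)
        return curve

    # In a real run the tags would be mapping coordinates from the shallow run.
    reads = ["A", "A", "B", "C", "C", "C", "D", "E", "E", "F"]
    print(complexity_curve(reads))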
Smith said he and Daley have tested the software on sequencing data from Illumina's Genome Analyzer IIx, MiSeq, and HiSeq instruments.
The developers expect the tool to be most useful to researchers who might need to do deep sequencing to identify rare molecules, Smith said. In addition, the authors note in their paper, as clinical sequencing gains ground, "methods for evaluating libraries will be essential to controlling costs and interpreting the results of sequencing that potentially could inform clinical decisions."
Beyond its applications in sequencing, the USC researchers believe the algorithm underlying Preseq could be used in other settings. For example, public health officials could use it to estimate the number of HIV-positive individuals in a population; astronomers could use it to estimate how many exoplanets exist in the galaxy based on the ones they have already discovered; and biologists could use it to estimate the diversity of antibodies in an individual, they said.
Commenting on the Nature Methods paper, Xiaole Shirley Liu, a professor of biostatistics and computational biology at Harvard School of Public Health, described Preseq as “an interesting approach” and added that it will be “very valuable as many laboratories start using multiplex sequencing.”
She also noted that while there are similar methods, including one called ChIP-seq Analytics and Confidence Estimation, or CHANCE, Preseq “probably provides a more theoretical approach.”