Researchers from Harvard and the Massachusetts Institute of Technology have developed an algorithm, called lobSTR, for profiling short tandem repeats from next-generation whole-genome sequencing data.
Short tandem repeats, or STRs, are a class of genetic variation made up of repeated units of two to six nucleotides, and they comprise about a quarter of a million loci in the human genome. STRs have a higher spontaneous mutation rate than any other known genetic variation and have been implicated in genetic disorders such as Huntington's disease and fragile X syndrome. Additionally, they have applications in forensics and genealogy and can be used to trace cell lineages in cancer samples.
Currently, most STR profiling relies on capillary electrophoresis sequencing, although some groups are developing targeted sequencing protocols on Roche's 454 GS FLX platform. These include bioinformatics firm SoftGenetics, which offers software specifically for STR profiling from 454 sequencing.
NGS-based STR analysis offers the advantage of profiling "tens of thousands of STRs in one shot," Yaniv Erlich, a bioinformatics researcher at the Whitehead Institute of Biomedical Research at MIT, told In Sequence.
Current bioinformatics solutions for next-gen sequencing aren't effective for STR profiling, however, because STRs are so polymorphic compared to the reference sequence. "If you have a difference in only a couple of repeats, then it presents as a large indel that traditional mainstream aligners won't pick up on, so they'll just throw that read away," said Melissa Gymrek, undergraduate researcher in Erlich's lab and the lead author of the paper, which was published last month in Genome Research.
There are three basic steps to lobSTR, Gymrek explained. In the first step, the reads containing an STR are flagged and characterized for their motif — an AT repeat or a CAG repeat, for instance. This step is very fast and results in discarding nearly 97 percent of the total reads, so only the ones that might contain an STR are left.
The second step aligns the flanking regions around the STR to the reference genome. This step helps determine the position and length of the STR.
The third step takes into account noise generated from PCR to determine the most likely genotype of the STR.
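The three steps Gymrek describes can be illustrated with a toy sketch. This is not the lobSTR implementation: the function names, the exact-match flank search, and the read-support threshold that stands in for the PCR noise model are all assumptions made for illustration.

```python
from collections import Counter


def find_repeat(read, min_unit=2, max_unit=6, min_copies=3):
    """Step 1 (toy): return (start, end, motif) of the longest tandem repeat, or None."""
    best = None
    for unit in range(min_unit, max_unit + 1):
        for start in range(len(read) - unit * min_copies + 1):
            motif = read[start:start + unit]
            if len(set(motif)) == 1:
                continue  # ignore homopolymer runs
            end = start + unit
            while read[end:end + unit] == motif:
                end += unit
            copies = (end - start) // unit
            longer = best is None or end - start > best[1] - best[0]
            if copies >= min_copies and longer:
                best = (start, end, motif)
    return best


def locate_by_flanks(read, repeat, reference, flank=8):
    """Step 2 (toy): anchor the read by exact-matching its flanks to the reference.

    Returns (reference start of the STR, length difference from the reference in bp).
    A real aligner would tolerate mismatches; exact matching is an assumption here.
    """
    start, end, _ = repeat
    left = read[max(0, start - flank):start]
    right = read[end:end + flank]
    if len(left) < flank or len(right) < flank:
        return None  # the read does not span the whole repeat
    left_pos = reference.find(left)
    if left_pos < 0:
        return None
    ref_start = left_pos + len(left)
    ref_end = reference.find(right, ref_start)
    if ref_end < 0:
        return None
    return ref_start, (end - start) - (ref_end - ref_start)


def call_genotype(length_diffs, min_fraction=0.25):
    """Step 3 (toy): keep up to two allele lengths with enough read support.

    The threshold is a crude stand-in for a PCR noise model, not lobSTR's model.
    """
    counts = Counter(length_diffs)
    total = sum(counts.values())
    alleles = [a for a, n in counts.most_common(2) if n / total >= min_fraction]
    if len(alleles) == 1:
        alleles = alleles * 2  # homozygous call
    return tuple(sorted(alleles))


# Made-up sequences: the reference carries 5 CAG copies, the read carries 7.
reference = "GATTACCGTA" + "CAG" * 5 + "TGCCATGGAC"
read = "GATTACCGTA" + "CAG" * 7 + "TGCCATGGAC"
repeat = find_repeat(read)
print(repeat, locate_by_flanks(read, repeat, reference))
```

Anchoring on the flanks rather than on the repeat itself is what lets a read with a different repeat count still be placed at the right locus, which is the failure mode of general-purpose aligners that Gymrek describes.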
The researchers tested their method on an Illumina whole-genome sequencing library with 101-base paired reads and compared it to other mainstream aligners like BWA, Novoalign, Bowtie, and BLAT. They tested the different methods both with and without the GATK local indel realignment tool.
The researchers compared the different tools in terms of speed, the number of informative and noninformative reads each generated, the number of reads generated that differed from the reference genome, and the amount of memory each tool required.
By all metrics, lobSTR performed well. Perhaps most importantly, it was able to detect the largest number of informative reads with STR variations. The others tended to only call STR reads that contained the reference allele.
Previous studies have estimated that between 33 percent and 66 percent of STR sequencer reads should have a nonreference allele. Using lobSTR, 50 percent of the reads had a nonreference allele, compared to between 19 percent and 25 percent with the other tools.
Another major advantage of lobSTR is its speed: it is about 20 times faster than the BWA aligner and just over twice as fast as Bowtie, said Erlich.
He said the tool should be used to supplement traditional alignment programs to enable STR profiling from whole-genome sequencing data.
The researchers next tested lobSTR on whole genomes with diseases that involve repeat expansions, including oculopharyngeal muscular dystrophy, which is caused by a repeat expansion of GCN in the PABPN1 gene; and synpolydactyly, which is caused by a GCG expansion in the HOXD13 gene.
To simulate each of these conditions, the researchers generated 100 reads of 101 base pairs that were equally sampled from the disease locus, consisting of a normal and a pathogenic allele with 100-base-pair flanking regions upstream and downstream.
For both conditions, lobSTR accurately aligned the normal and pathogenic reads to the correct location in the genome and identified the disease loci with the correct repeat lengths.
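A simulation along the lines described above could be sketched as follows. The flanking sequences are random placeholders rather than the real PABPN1 or HOXD13 sequence, the repeat counts of 10 and 14 are illustrative values only, and the function name is hypothetical.

```python
import random


def simulate_reads(left_flank, motif, copies_by_allele, right_flank,
                   n_reads=100, read_len=101, seed=0):
    """Sample fixed-length reads, alternating between alleles so each is equally represented."""
    rng = random.Random(seed)
    # One sequence per allele: flank + repeated motif + flank.
    alleles = [left_flank + motif * c + right_flank for c in copies_by_allele]
    reads = []
    for i in range(n_reads):
        allele = alleles[i % len(alleles)]
        start = rng.randrange(len(allele) - read_len + 1)  # uniform start position
        reads.append(allele[start:start + read_len])
    return reads


# Placeholder 100-base-pair flanks (random, not genomic sequence).
flank_rng = random.Random(1)
left = "".join(flank_rng.choice("ACGT") for _ in range(100))
right = "".join(flank_rng.choice("ACGT") for _ in range(100))

# Illustrative repeat counts for a "normal" and an "expanded" allele.
reads = simulate_reads(left, "GCG", (10, 14), right)
print(len(reads), len(reads[0]))
```

Because start positions are uniform across the locus, only some of the simulated reads span the full repeat with flanking sequence on both sides, and only those are informative for sizing the allele.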
Erlich and colleagues also compared lobSTR's performance to CE sequencing, which is considered the gold standard for STR profiling, and found "good concordance."
The researchers sequenced a male genome on the Illumina GA to 36-fold coverage, identifying 1.6 million informative reads that mapped to about 140,000 STR loci, and also used a commercial forensic kit to genotype 14 autosomal STR markers on a CE platform.
They found that 13 out of the 14 markers identified with CE were covered by at least a single sequence read using the GA, and eight markers were covered by at least three sequence reads. The marker that was not covered spanned more than 129 base pairs, longer than Illumina's 101-base-pair reads.
Erlich acknowledged that the shorter Illumina reads are the major limitation of the lobSTR method. For medical genetics applications, many of the diseases caused by repeat expansions cannot be studied with Illumina sequencing because the loci are longer than Illumina's read lengths, he said.
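The read-length limit comes down to simple arithmetic: an informative read has to contain the whole repeat plus some unique sequence on each side. A minimal sketch, assuming a hypothetical minimum flank of 8 bases per side (the actual requirement in lobSTR is not stated here):

```python
def max_spannable_repeat(read_len, min_flank):
    """Longest repeat (in bp) a single read can contain with flanks on both sides."""
    return read_len - 2 * min_flank


def read_can_span(repeat_len, read_len=101, min_flank=8):
    """True if one read of read_len can cover the repeat plus both flanks."""
    return repeat_len <= max_spannable_repeat(read_len, min_flank)


# The uncovered forensic marker spanned more than 129 bp; the reads were 101 bp.
print(read_can_span(129))
```

Under these assumptions, any locus longer than the read length minus both flanks yields no informative reads, regardless of sequencing depth.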
However, he said, when he tested lobSTR with other platforms such as 454 and Ion Torrent, it seemed that sequencing with Illumina was the "most stable." Most STR sequences have long homopolymer runs, he said, which make detection with 454 and Ion Torrent difficult.
Erlich added that lobSTR will be made freely available, and said that he does not intend to commercialize the tool. The team's next steps will be to continue to improve the method based on feedback from other researchers.
While the lobSTR method may be useful when looking to do STR profiling from whole-genome sequence data, for practical forensic applications, a targeted sequencing approach will be faster and cheaper, said John McGuigan, a biologist at SoftGenetics, which makes software for profiling STRs from targeted next-gen sequence data.
When looking at STRs to determine human identity, it will "always be better to do a targeted sequencing approach, rather than looking at everything," McGuigan said.
"But, if you are looking at everything, then [the lobSTR] method could be useful for looking at STRs."
Additionally, he said that long reads are important because "you need to have the full repeat length, and it's helpful to have the unique sequence on each side" to align to the reference.
Currently, 454 is useful for this application, he said, but noted that the Illumina MiSeq and Ion Torrent PGM are increasing their read lengths and could be good platforms for this application as well. They are also much faster.
John Fosnacht, a cofounder and vice president of marketing at SoftGenetics, said that next-gen sequencing could be used in forensics to do STR profiling of already convicted offenders, for archiving purposes. "There's a tremendous amount of backlog," he said. Using NGS would be advantageous because many samples could be multiplexed at once, and he estimated the approach could offer a five-fold reduction in price compared to current CE-based methods.
For rapid identification, like at a crime scene, for instance, the goal is to have a technology that would enable identification in a couple of hours, so the whole-genome sequencing approach developed by Erlich's team would not be practical for this, Fosnacht said.
He noted that a number of other companies are working on rapid STR profiling for such purposes, including IntegenX, which is developing a CE sequencing-based system called RapidHIT 200 that employs SoftGenetics' software. That system promises to be able to identify a human sample within 90 minutes.