Leukemia Genome Project Highlights Second-Gen Sequencing Software Needs

The first effort to sequence a complete cancer genome has underscored the power of second-generation sequencing while further establishing the lack of a “killer software app” in the field.
In the study, published this week in Nature, a team of 48 scientists at the Genome Center of Washington University and elsewhere sequenced a female patient’s acute myeloid leukemia genome and compared it to the genome of her biopsied skin as well as reference genomes to uncover 10 cancer-associated mutations — eight of which were previously unknown.  
The team used two high-throughput sequencing platforms — the Illumina Genome Analyzer and the Roche/454 FLX platform — along with software tools such as Maq, Cross_Match, and BLAT, and Decision Tree analysis. The team also did its own scripting and algorithm development in the course of the project, Rick Wilson, director of the Genome Sequencing Center at Washington University School of Medicine, told BioInform.
The AML sequencing team applied several established software tools and algorithms as well as those developed specifically for the project, underscoring the fact that second-generation sequencing projects are not taking place in a one-pipeline-fits-all world.
As Wilson explained, the leukemia genome sequencing project started a little over a year ago with billions of short reads that needed to be aligned against several large datasets and queried in multiple ways.
“When we got started on that, we were just really clawing for software tools,” Wilson said. “So we borrowed a few things from here and there and ended up having to modify a lot of things we did borrow and we also ended up having to design a lot of our own algorithms.”
Second-generation sequencing is opening up new lines of previously inaccessible scientific inquiry “and the software is the key,” Andrew Fire, a professor of pathology and genetics at Stanford University School of Medicine, told BioInform. Fire was not involved in the AML sequencing study, but his lab is part of Stanford's High Throughput Sequencing Initiative, which is using new sequencing technologies to study cancer and other diseases.
For now, Fire said, "there is no killer app" for second-generation sequencing, which means that most labs must still develop new software tools on the fly for particular projects.
“We are finding the need for a wide variety of computational tools, from simple to complex," Fire said. Although existing packages like BLAT, BLAST, and Illumina's Eland can rapidly provide alignment data, and other packages provide great support at specific stages in the analysis, Fire said that bioinformaticians in his lab nearly always need to develop their own tools to extract meaningful answers to biological questions.
To be sure, the lack of a single decisive analysis pipeline is typical of quickly evolving fields such as high-throughput sequencing. “Here’s the short answer: Software will always be a problem. It will never be adequate,” said Wilson.
He said that researchers for the AML sequencing project developed software for all facets of alignment and analysis, but he cautioned that these tools may not be ready for other scientists to try because he and his colleagues are still validating their pipeline.
“It’s a bit early days,” he said. “Really what you can export to other users is algorithmic strategy as opposed necessarily to algorithms.”
The Wash U team also had to make some difficult hardware-related decisions for the project, particularly regarding storage. “You had to figure out exactly what data that came off the next-gen platforms you needed to save and which you could afford to toss,” Wilson said. “We are still learning that, I think.”
Wilson and other scientists said that an important element in large-scale sequencing projects is working out the right analysis pipeline. John McPherson, platform leader in cancer genomics and high-throughput screening at the Ontario Institute for Cancer Research, told BioInform that he has 10 second-generation sequencers in his keep — five Illumina Genome Analyzers and five Applied Biosystems SOLiDs — and that “just trying to keep up with the informatics on that is a real challenge.”
As the reads get longer, “it becomes computationally more rigorous to analyze them and as the machines [are] increasingly putting out more and more, it is a compounding problem right now that needs a clever solution,” said McPherson.
“Throwing more RAM and more CPUs at it isn’t a practical solution for most people,” he said. OICR has 600 cores at its disposal, but “we can saturate those with 10 instruments pretty easily.”
While it would be helpful for sequencing centers to swap notes on informatics strategies, Wilson said this approach is often not feasible because software tools developed for large-scale projects tend to be written for a specific center’s computational infrastructure.
“I could have my IT guy sit down with, say, John McPherson’s IT guy, [and] they would get a lot of good ideas from each other,” he said. “They’d ultimately have to say ‘OK, that’s a cool idea but it won’t work the way we have our own database structure and so forth set up.’”
In addition, Wilson said, different research groups approach similar computational challenges in different ways. “This is going to be like the six blind guys and the elephant for a while. We are all trying to use this technology for different applications and we are all going to come at it a little differently,” he said. “What might be difficult to some of us is a solved problem for others.”
Climbing the Decision Tree
For the leukemia genome-sequencing project, the team performed 98 runs on the Illumina Genome Analyzer with the leukemia genome and 34 runs with the patient’s skin cell genome. They aligned the reads to the human reference genome and to the genomes of Craig Venter and Jim Watson, and also compared them against the dbSNP database.
The Maq algorithm predicted 3.81 million single-nucleotide variants in the 98 billion sequenced bases of the tumor genome. Given this high starting number, the scientists developed filtering tools to separate true variants from false positives. They generated an experimental data set by re-sequencing Maq-predicted SNVs, then selected a training subset and a test data set, which they submitted to the Decision Tree C4.5 algorithm.
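In spirit, the filtering step applies learned rules to per-site features of each predicted SNV. The sketch below is a hypothetical illustration of that idea: the feature names (`consensus_quality`, `read_depth`, `variant_read_fraction`) and thresholds are assumptions for demonstration, not the rules the Wash U team's C4.5 analysis actually learned from its validation data.

```python
# Hypothetical decision-tree-style SNV filter. Feature names and
# thresholds are illustrative only; the study's actual rules were
# learned by C4.5 from resequencing-validated training data.

def passes_filter(snv):
    """Return True if a Maq-predicted SNV survives the rule cascade."""
    if snv["consensus_quality"] < 20:       # low-confidence consensus call
        return False
    if snv["read_depth"] < 4:               # too few supporting reads
        return False
    if snv["variant_read_fraction"] < 0.2:  # likely sequencing error
        return False
    return True

candidates = [
    {"consensus_quality": 35, "read_depth": 12, "variant_read_fraction": 0.45},
    {"consensus_quality": 15, "read_depth": 30, "variant_read_fraction": 0.50},
    {"consensus_quality": 40, "read_depth": 3,  "variant_read_fraction": 0.60},
]
passed = [s for s in candidates if passes_filter(s)]
```

In a real pipeline, the advantage of learning such rules from validated calls — rather than hand-picking the cutoffs — is exactly the sensitivity/specificity trade-off the researchers report.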
Of the original 3.81 million predictions, 2,647,695 were supported by the Decision Tree analysis in the tumor genome, and 2,584,418 of those were also detected in the skin genome. “Implementing rules obtained from the Decision Tree analysis resulted in 91.9 percent sensitivity and 83.5 percent specificity for validated SNVs,” the researchers wrote.
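The reported figures follow the standard definitions of sensitivity (true SNVs retained) and specificity (false calls removed). The counts below are hypothetical values chosen only to reproduce the published percentages, not the study's actual validation tallies.

```python
# Illustrative validation counts (hypothetical, not from the study).
tp, fn = 919, 81   # true SNVs kept vs. wrongly filtered out
tn, fp = 835, 165  # false calls removed vs. wrongly retained

sensitivity = tp / (tp + fn)  # fraction of true SNVs passing the filter
specificity = tn / (tn + fp)  # fraction of false calls rejected

print(f"sensitivity={sensitivity:.1%} specificity={specificity:.1%}")
```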
Around 11,200 of the tumor-specific variants were located within annotated genes, and the researchers winnowed this down to a set of 181 variants that were either non-synonymous or predicted to alter splice-site function. Further sequencing determined that 152 of these variants were false positives, 14 were inherited SNPs, and eight were somatic mutations.
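At its core, identifying tumor-specific candidates is set subtraction: remove from the tumor call set anything also seen in the matched normal tissue or in known-SNP catalogs. A minimal sketch with made-up positions:

```python
# Hypothetical variant positions, keyed by (chromosome, coordinate).
tumor_variants = {("chr1", 100), ("chr2", 250), ("chr7", 300), ("chr9", 410)}
skin_variants  = {("chr1", 100), ("chr2", 250)}  # also in normal tissue: inherited
dbsnp_sites    = {("chr9", 410)}                 # known polymorphism

# Candidate somatic mutations: present in tumor only.
somatic_candidates = tumor_variants - skin_variants - dbsnp_sites
```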
To better define the percentage of tumor cells containing each somatic mutation, the scientists amplified each mutation-containing locus and sequenced the amplicons on the Roche/454 FLX platform.
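Deep amplicon sequencing turns this into a counting problem: the fraction of reads carrying the variant estimates the variant allele fraction, and under the simplifying assumption of a heterozygous mutation, roughly twice that fraction of cells carry it. The read counts below are hypothetical.

```python
# Hypothetical deep-sequencing read counts at one mutated locus.
variant_reads, total_reads = 1840, 4000

variant_allele_fraction = variant_reads / total_reads  # 0.46 here

# Assuming a heterozygous somatic mutation (one mutant allele per
# mutated cell), about twice the allele fraction of cells carry it.
tumor_cell_fraction = min(2 * variant_allele_fraction, 1.0)
```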
To identify small indels, the scientists took 236 million reads that “were not confidently aligned by Maq to the reference genome” and used Cross_Match and BLAT to identify gapped alignments unique to the genome.
The researchers placed strong demands on their tools as they plowed through the data. “The most important thing was: Could we detect single-base variants, [and] could we do that in a way that we’re comparing tumor and normal at the same [time] we were comparing [them] with the reference sequence, dbSNP, with the Watson genome sequence, with the Venter genome sequence?” Wilson said.
While there were plenty of computational instruments in the Wash U tool box to start, “there were questions about sensitivity,” he said. “It was also a substantially bigger dataset than what we had dealt with before to do even similar things.”
With second-generation sequencing, he said, you can’t “buy a software package of programs, load it on a computer and be good to go.”
Wilson said that it took the research team six months to analyze the data for this project, but that effort is expected to pay off in future studies. “Now what we’ve done is develop a very robust pipeline, so [for] the second genome [that] is being sequenced, when that comes off [the instruments], it’s going to be a couple of weeks to turn the crank and find all these mutations,” he said.
Working with Biologists
Software development for the AML sequencing project “from day one involved both computational people and biology people; that’s a must,” said Wilson. “You do want your biologists in there.”
Stanford’s Fire agreed that collaboration between bioinformaticians and biologists is a key part of the workflow. In some situations, he said, "we have been having fun" with letting biologists in on some of the software tinkering.
In Fire’s experience, it’s a “great advantage for wet-lab biologists to be able to work directly with second-generation datasets, learning some programming along the way and setting up their own queries for specific scientific questions.”
It works the other way around as well, he said. Anyone “who has substantial knowledge of the insides and outsides of a computer and is willing to go and work in a biology lab and get their hands wet doing a mixed bench/computational project is in a position to do things that will be tremendously useful to science and society.”
McPherson concurred with this inter-disciplinary approach to second-gen sequencing analysis in large-scale ventures. “Most of the recent hires of faculty [at OICR] are people who cross the boundary — they do wet-bench and they do bioinformatics,” he said.
He said that he tries to format raw data so it is useful to biologists, such as an output of SNPs. “Then the biologists can start asking questions about where are the significant mappings to the genome, what genes are there, et cetera, trying to create networks out of the data,” he said.
“If you get a list of SNPs, you have to decide which ones are real, where is the confidence,” he said. While all of the software generates some kind of confidence value, those values “are hard to understand,” he added. “There has been very little validation. We are all trying to get a handle on where are the cutoffs on the data, what do we believe, and what do we question.”
More Challenges on the Horizon
Even as genome centers are getting a handle on the informatics challenges of the current crop of high-throughput sequencers, manufacturers are rapidly increasing the read length and throughput of these systems, which could pose future hurdles.
“As the reads get longer, I am concerned that performance is going to go down,” McPherson said. “It’s not going to be an eight-hour run — it’s going to be 24 hours of CPU time.”  
As read lengths increase, he explained, software packages developed for current second-generation sequencers “may not scale so easily because they use a lot of hashing tables, where they could hash 32 bases, but [once] you get up into 100 bases, you are starting to dabble back with BLAST and BLAT and that will never cut it when you want to do 100 million reads.” 
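McPherson's point is that these aligners key their hash tables on fixed-length seeds, so longer reads do not make lookups faster — they just leave more of each read to be handled by slower gapped extension beyond the seed. A toy sketch of such a seed index (with a short seed length and a tiny made-up reference for illustration):

```python
# Toy hashed seed index of the kind McPherson describes. The seed
# length and reference sequence are illustrative; real short-read
# aligners of the era hashed seeds of up to ~32 bases.
from collections import defaultdict

K = 8  # fixed seed length, regardless of read length

def build_index(reference):
    """Map every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - K + 1):
        index[reference[i:i + K]].append(i)
    return index

def seed_hits(read, index):
    """Exact seed lookups for each k-mer of the read; returns the
    implied read start position on the reference for each hit."""
    hits = []
    for j in range(len(read) - K + 1):
        for pos in index.get(read[j:j + K], []):
            hits.append(pos - j)
    return hits

reference = "ACGTACGTTTGACCAGGTACCATGGAACGTT"
index = build_index(reference)
hits = seed_hits("TTGACCAGGTACC", index)  # all seeds agree on one start
```

Clustered hits that agree on a start position anchor the read; everything the seeds do not cover still needs extension and mismatch handling, which is where longer reads add CPU time.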
Genome centers already have their hands full with the current systems. McPherson noted that it is sometimes difficult to set up vendor software to stream data off an appliance to a cluster, and that documentation for many tools is often lacking.
“You get output files and you don’t even know what the calls are and what they mean and you don’t know how they are calculated,” he said, “so that can be very frustrating.”
McPherson noted, however, that vendors are generally responsive to questions.
The dearth of out-of-the-box software for next-generation sequence analysis may not be all bad, according to some observers. Stephen Harrison, the head of the Laboratory for Structural Cell Biology at Harvard Medical School, told BioInform in an e-mail that “you need to write code in order to get your work done” in emerging areas of research, and that “gives a certain flexibility and adaptability to the investigators in those fields.”
As scientific areas mature, they are more likely to have shrink-wrapped software provided by vendors, “but it means that it is hard or impossible to modify for your own purposes,” he said.
