Last July, Washington University’s Genome Sequencing Center was one of the first two labs to receive Solexa’s Genome Analyzer under the company’s early-access program.
A researcher in the lab, which installed a second instrument in December, said the tool had to overcome some “hurdles,” but that he watched it mature over the last seven months from a beta version into a machine that outputs 1 gigabase of data per run.
Illumina, which acquired Solexa earlier this year, started commercializing the instrument broadly shortly after Solexa concluded its early-access phase.
Earlier this month, the researcher, Matt Hickenbotham, a staff scientist in the Genome Sequencing Center’s technology development group, gave a talk at the Advances in Genome Biology & Technology Conference in Marco Island, Fla., during which he showed how he and his group have been validating the Genome Analyzer.
The instrument first arrived at WashU in July 2006 and “was still maturing at that point,” Hickenbotham told In Sequence last week. “We were aware of this and gladly took on that challenge.”
Initially, the researchers tested “the basic functionality” of the machine, leaving the library and sample preparation up to Solexa. The aim of this initial validation was to make sure that the scientists could consistently generate data of a certain quality.
“When you run the instrument multiple times, are you able to achieve that quality standard consistently?” Hickenbotham explained.
They used control flow cells prepared by Solexa that already contained clusters of amplified DNA from a human BAC clone that Solexa had used for its internal validation.
The researchers set out to achieve an error rate below 1.5 percent per 25-base-pair read when aligning the reads back to the reference sequence. They also aimed to keep dephasing — when some clusters run ahead of the current incorporation cycle and others fall behind — below a certain threshold.
The aim was not to maximize data output: In each of the eight channels of the flow cell, the researchers only imaged 70 tiles, or software-defined areas. Solexa had prepared the clusters at a density of 5,000 per tile, of which about 75 percent passed quality filtering, Hickenbotham said, so the scientists could only expect to obtain about 50 megabases of sequence per run.
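The roughly 50-megabase expectation follows directly from those run parameters. A back-of-the-envelope check in Python, assuming 25-base reads as in the error-rate target above:

```python
# Expected yield for the initial validation runs, using figures from the
# article; the 25 bp read length is assumed from the error-rate target.
channels = 8              # channels per flow cell
tiles = 70                # tiles imaged per channel
clusters_per_tile = 5_000 # density Solexa prepared
pass_filter = 0.75        # fraction of clusters passing quality filtering
read_length = 25          # bases per read

reads = channels * tiles * clusters_per_tile * pass_filter
yield_mb = reads * read_length / 1e6
print(f"~{yield_mb:.1f} Mb per run")  # ~52.5 Mb, i.e. "about 50 megabases"
```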
“We did have some hurdles,” Hickenbotham recalled. Among these were bubbles in the system, maintaining focus stability, and a higher-than-expected rate of dephasing. The latter problem in particular proved tricky, and was not resolved until about the end of October.
“At the time, neither us nor the company really knew where [it] was coming from,” he said.
The problem, it turned out, stemmed from Solexa’s proprietary fluorescent nucleotides, which the company initially sent out pre-mixed with buffer. “The chemistry works much better if they send us the fluorescent nucleotides and we mix them up in buffer on site,” Hickenbotham said. “So it was a relatively simple solution, but it made a big difference in terms of quality.”
Once it overcame these initial problems, “the instrument began performing very consistently with those control flow cells,” he said, so the group decided in November to tackle sample preparation and cluster generation in a second validation phase.
For this, they used the same BAC clone that Solexa had put on the control flow cells. The aim was to generate clusters of which at least three quarters passed quality filters. Clusters that have no signal, that are too close together, or that are non-clonal are filtered out, Hickenbotham explained. They are generated on a separate instrument provided by the company, a so-called cluster station, which holds one flow cell and isothermally amplifies the DNA.
“That validation was very smooth and quick,” he said. “At that point, it was time to take samples of our own.”
So later in November, the researchers chose an area of chimp chromosome 7 for analysis, for which they had a high-quality consensus sequence generated by capillary sequencing. They had previously used BAC clones from the same region to test the 454 platform.
“Just to make it a little more fun, given the output of the [Illumina] instrument, we decided to take a multiplex approach, and we pooled several clones,” covering about four megabases of sequence, Hickenbotham said.
“As a first experiment, it was pretty encouraging,” Hickenbotham said. The researchers generated about 110 megabases of sequence, covering the four-megabase region about 20 times on average. There was some variability in coverage, he said, but this was probably due to the way the BAC clones were pooled rather than the sequencing.
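Mean depth of coverage is simply aligned bases divided by target size. A rough illustration of how the figures fit together — not the group's actual calculation — using the roughly 80 percent alignment rate given later in the article:

```python
# Mean coverage = aligned bases / target region size.
total_mb = 110       # sequence generated in the run
region_mb = 4        # chimp chromosome 7 region covered by the pooled BACs
aligned_frac = 0.80  # approximate share of reads that aligned (per the article)

coverage = total_mb * aligned_frac / region_mb
print(f"~{coverage:.0f}x mean coverage")  # ~22x, close to the reported ~20x
```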
Also, the researchers were able to generate 31-base-pair reads, and grew the clusters to varying densities across the channels, reaching up to 15,000 per tile, “which at the time was kind of pushing the envelope,” he said. They measured the quality scores for their reads after converting them to the Phred scale and found that although they “did seem to taper off a bit” with longer reads, quality scores remained above Q20 on average.
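Phred scores relate base-call error probability to quality logarithmically, so Q20 corresponds to a one-in-100 chance that a base call is wrong. A minimal conversion sketch:

```python
import math

def phred(error_prob: float) -> float:
    """Convert a base-call error probability to a Phred quality score."""
    return -10 * math.log10(error_prob)

def error_prob(q: float) -> float:
    """Inverse: the error probability implied by a Phred score."""
    return 10 ** (-q / 10)

print(phred(0.01))       # Q20: 1 error in 100 calls
print(error_prob(30.0))  # 0.001, i.e. 1 error in 1,000 calls
```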
Only 80 percent of the reads, though, aligned to the consensus sequence, “which at first seemed a little bit low,” Hickenbotham said. But it turned out that about 10 percent of the reads derived from E. coli and vector sequences that were left over from the preparation of the BAC DNA.
In addition, Solexa’s alignment tool excluded 5 to 6 percent of the reads because they mapped to multiple locations. That tool, he explained, also throws out all reads that do map uniquely but have more than two mismatches.
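The two filtering rules Hickenbotham describes — discard reads that map to multiple locations, and discard uniquely mapped reads with more than two mismatches — can be sketched as follows. This is a simplified illustration, not Solexa's actual tool; the `(position, mismatch_count)` hit structure is a hypothetical stand-in for real alignment output:

```python
def keep_read(hits, max_mismatches=2):
    """Keep a read only if it maps to exactly one location with no more
    than max_mismatches mismatches.
    `hits` is a list of (position, mismatch_count) alignments for one read."""
    if len(hits) != 1:          # multi-mapping (or unmapped): discard
        return False
    _, mismatches = hits[0]
    return mismatches <= max_mismatches

print(keep_read([(1200, 1)]))              # True: unique hit, 1 mismatch
print(keep_read([(1200, 1), (5400, 2)]))   # False: maps to two locations
print(keep_read([(1200, 3)]))              # False: too many mismatches
```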
Although the researchers were pleased with their results, the throughput of 110 megabases would not be sufficient to resequence entire human genomes, and did not come close to the 1-gigabase output that Solexa was touting for the first commercial version of its instrument.
“It just worked out that right about that same time Solexa also wanted to encourage users to increase data output per run,” Hickenbotham said, so the scientists embarked on another validation project for their instrument, as well as for their second one, which they received in December.
Data output can be increased in three ways, Hickenbotham said: by growing clusters at higher densities, by imaging more tiles in each channel, and by increasing the number of cycles, and thus read length. At the moment, he said, the researchers image 200 tiles per channel — two rows of 100 — and improvements by the company to its imaging software now allow them to grow and image clusters at a density of 20,000 per tile.
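These levers multiply. Plugging in the scaled-up figures, and carrying over the roughly 75 percent pass-filter rate and 25-base reads from the earlier validation runs as an assumption, reproduces a run yield of about 600 megabases:

```python
# How the scaling levers multiply out. Tile and cluster figures are from
# the article; the 75% pass-filter rate and 25 bp read length are assumed
# to carry over from the earlier validation runs.
channels = 8
tiles = 200                # two rows of 100 per channel
clusters_per_tile = 20_000
pass_filter = 0.75
read_length = 25

yield_mb = channels * tiles * clusters_per_tile * pass_filter * read_length / 1e6
print(f"~{yield_mb:.0f} Mb per run")  # ~600 Mb
```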
But scaling up posed a number of other problems. For example, the length of a single run increases from two days to three, during which focus can shift and fluorescent signals can decay.
Initially, the researchers performed a longer run with Solexa’s control flow cells, and “pretty much right away we were able to validate both our original instrument and [our] second one at about 600 megabases per run and 25 base-pair reads” with a very low average error rate per read compared to the reference sequence. “That was quite encouraging as we look forward to other applications, such as human resequencing,” Hickenbotham said.
Most recently, the group has begun testing the instrument for cancer genome resequencing and is hoping to expand this project in the future. As part of a collaboration with Tim Ley of the Washington University Siteman Cancer Center, the researchers have already characterized a number of samples from patients with acute myelogenous leukemia, mainly by PCR-based resequencing.
This “Genomics of AML” project is funded through a program project grant from the National Cancer Institute and aims to discover and characterize all the changes that occur in the DNA of AML patients.
The researchers have now started sequencing one of the AML samples on the Illumina platform to generate preliminary data in support of the grant renewal.
“This is the first sample [where] we have actually tried to resequence the entire patient,” Hickenbotham said.
The project is still ongoing, but so far, several runs have each generated 1 gigabase of high-quality data. If the grant is renewed, the researchers plan to use the Illumina platform to sequence, at 10x coverage, several AML samples that have already been characterized by PCR-based resequencing, to see how the two data sets compare.
But even resequencing a single human genome is no small feat at the moment, Hickenbotham cautioned. “These are very, very large datasets, and they are challenging to all of our informaticians,” he said. “As one of them told me recently, they are breaking all of the tools that they currently use.”
Finding new approaches to data analysis will still require “quite some time” for all next-gen sequencing platforms, he said. What differentiates them most from capillary sequencing, apart from the size of the data sets, is their short reads.
“There is value in them, but it’s, ‘Where is the value, and how can we use them the best way?’” Hickenbotham said. “I do understand that there is an enormous challenge there.”