This article, originally published Jan. 14, has been updated with additional information from Illumina.
A week after Illumina acquired San Francisco-based startup Moleculo for an undisclosed amount, some early-access customers spoke about their experience with the firm's long fragment sequences for the assembly of complex genomes.
Moleculo, which has developed a technology to break genomic DNA into large pieces, sequence them with short reads, and assemble these into so-called "long reads," said it sees applications in the de novo assembly of plant and other complex genomes, human genome sequencing, and cancer sequencing.
At the Plant and Animal Genome conference in San Diego over the weekend, early customers of the technology, which Moleculo has been providing as a service, showed first results in insect and fish genomes.
Early-access customers include researchers at the US Department of Agriculture; the Department of Energy's Joint Genome Institute; Stanford University; the University of California, Davis; the University of California, Santa Cruz; the Broad Institute; Monsanto; Syngenta; and several small pharmaceutical companies.
Although it is currently "not cheap," the technology "adds to Illumina's portfolio and expands the HiSeq's capabilities, particularly with regard to haplotype resolution," said Richard Michelmore, director of the UC Davis Genome Center, who has used Moleculo's service.
Moleculo was founded a year ago by Steve Quake, a professor of bioengineering at Stanford University; Mickey Kertesz, a postdoc in his lab at the time; and Dmitry Pushkarev, then a graduate student in Quake's lab. Soon after, the founders were joined by Tim Blauwkamp, another Stanford postdoc, who has led molecular biology development efforts at the company.
Besides Quake, the company's scientific advisory board includes Affymetrix founder Steve Fodor; University of California, Santa Cruz, bioinformatician David Haussler; former Solexa chief operating officer Brock Siegel; and Stanford computer scientist Serafim Batzoglou.
After raising less than $1 million in seed funding from angel investors, the company set up shop in the Mission Bay Innovation Center, an incubator of the California Institute for Quantitative Biosciences. Following last week's acquisition, the company will move from QB3 into Illumina's own facility in nearby Hayward and expand its R&D team.
Over the last year, Moleculo developed its technology and started offering it as a paid service to a number of early-access customers who tested it for various applications. This provided the company with early revenue, allowing it to proceed without additional fundraising, according to Kertesz, Moleculo's president and CEO.
As Moleculo prepared to close a "nice" Series A funding round, Illumina stepped in with an offer to acquire the company, Kertesz said. The company decided to accept the offer, which gives it access to Illumina's manufacturing, distribution, marketing, and sales capabilities, expertise that Moleculo did not have in house. "We are scientists, and we want to make sure that this hits the market and helps people with their science," Kertesz said.
In addition, all of Moleculo's customers were already using Illumina sequencing platforms – although the firm was considering developing its technology for the Ion Torrent platform as well – so teaming up with Illumina seemed like "a great fit," he said.
While he could not comment on the purchase price, Kertesz said that it was "a nice outcome for investors and for founders." The company's story may encourage others to step out of academia to commercialize their ideas, he added. "It doesn't take a huge investment; it doesn't take years of development to bring it to market, but really, if the idea is concrete enough, if the team is strong enough, this can be done in this accelerated way."
The company is not yet revealing details of its technology, which is IP-protected. Generally speaking, the method cuts genomic DNA into fragments of around 10 kilobases, tags these with unique barcodes, breaks them into shorter pieces, sequences those pieces with Illumina short-read technology, and assembles them back into the sequence of each original fragment.
The first part of the method – generating the tagged fragments – is still under wraps, but it results in a sequencing-ready library that is shipped to the customer for sequence analysis.
According to Illumina senior director of scientific research Geoff Smith, the company uses "standard molecular biology methods" for these steps. Tagging the DNA, he said, allows them to mark the ends of the fragments in order to ensure that non-overlapping pieces of DNA are assembled.
Customers' sequencing machines automatically send the short-read data to the company, which reconstructs the sequences of the fragments, or "long reads," and sends them back to the customer.
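Because the method itself is undisclosed, the reconstruction step can only be illustrated in general terms. The minimal Python sketch below shows one plausible first step, pooling short reads by the barcode of their parent fragment so that each pool can be assembled on its own. Everything in it is an assumption made for illustration: the 4-base barcode prefix, the toy reads, and the function name `group_reads_by_barcode` are hypothetical and do not describe Moleculo's actual design.

```python
from collections import defaultdict

BARCODE_LEN = 4  # hypothetical barcode length, chosen only for illustration


def group_reads_by_barcode(reads, barcode_len=BARCODE_LEN):
    """Bin short reads by their leading barcode and strip the barcode off."""
    groups = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        groups[barcode].append(insert)
    return dict(groups)


# Toy reads: the first four bases stand in for a fragment-specific barcode.
reads = ["ACGTTTGACC", "ACGTGACCAT", "TTAGCCGGAA", "ACGTCCATGG", "TTAGGGAATC"]
groups = group_reads_by_barcode(reads)

for barcode, inserts in groups.items():
    print(barcode, inserts)
```

In a scheme like this, each barcode group would then be passed to a local assembler, so that reads from different parent fragments are never merged with one another.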
Fragment sequences range in size, with a peak around 8 to 10 kilobases. The company has already generated longer sequences, and future versions might double the current size, but for most applications, "there is not a lot of advantage in going beyond 10 to 15 kilobases," Kertesz said, because most repeats in complex genomes are on the order of 2 to 5 kilobases long.
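The reasoning about read length and repeat size can be made concrete with a small calculation. The sketch below treats a repeat as resolvable when a read covers the whole repeat plus some unique sequence on both sides. The 1-kilobase flank requirement and the function name `spans_repeat` are assumptions for illustration, not figures from the company.

```python
def spans_repeat(read_len_kb, repeat_len_kb, min_flank_kb=1.0):
    """Return True if a read can cover the repeat plus unique flank on both sides.

    min_flank_kb is a hypothetical anchoring requirement, not a published value.
    """
    return read_len_kb >= repeat_len_kb + 2 * min_flank_kb


# A 10 kb long read against the 2 to 5 kb repeats mentioned in the article.
long_read_ok = [spans_repeat(10.0, r) for r in (2.0, 5.0)]

# A 150-base short read (0.15 kb) cannot span even the shortest of those repeats.
short_read_ok = spans_repeat(0.15, 2.0)

print(long_read_ok, short_read_ok)
```

Under these assumptions a 10-kilobase read already clears a 5-kilobase repeat with room to spare, which is consistent with the claim that going far beyond 10 to 15 kilobases adds little for most applications.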
The technology works on any type of genomic DNA, as long as it is not degraded. Sample prep currently takes less than two days, which could be further improved, Kertesz said.
A scientific publication detailing the method is currently under review, he added.
Moleculo has been providing its service at a "pretty expensive" price, Kertesz said, though it is still several orders of magnitude cheaper than producing and sequencing BAC libraries. And because the data makes the assembly of genomes much easier than assembly from short-read data alone, it allows researchers to save costs on the bioinformatics side.
It is unclear at what price Illumina will provide the technology, but customers expect that prices may come down once the firm scales up production of the reagents.
Smith said Illumina will provide pricing information "in due course," adding that generating phasing information for human DNA will likely require one additional lane of HiSeq data.
Illumina plans to continue providing Moleculo's technology as a service, starting in the second quarter, and to begin selling it as a kit, device, and method in the fourth quarter (IS 1/8/2012).
According to Smith, Moleculo's technology "provides an important 'missing piece' in the next-generation sequencing toolbox."
Several customers believe the acquisition by Illumina will benefit them. It "just shows that they recognize the Moleculo process is a game changer," said Geoff Waldbieser, a research molecular biologist with the USDA and an early customer. "Hopefully it will be transparent from a user's standpoint because the Moleculo group was very nice to work with," he told In Sequence via e-mail.
Also, "Illumina's deep pockets and experience will likely allow the technology to be developed further," according to UC Davis' Michelmore.
Early Customers, Early Results
Through its collaborations, Moleculo has shown applications for its technology in three main areas: de novo assembly of complex plant genomes as well as metagenome analysis, clinical human genome sequencing, and cancer genome sequencing.
In plant genomics, the data could be useful for seed development, and in metagenomics, for the development of commercial enzymes, Kertesz said. In clinical sequencing, it might help distinguish between functional genes with roles in pharmacogenomics and closely related pseudogenes, for example. And in cancer genomics, it could help detect copy number variants and other structural variants, he said.
Michelmore said his center has used Moleculo's technology to help assemble the genome of the pathogenic oomycete Bremia lactucae, which is very heterozygous.
The genome, which is about 100 megabases in size, has been "very problematic" to assemble with deep Illumina short-read data from several types of libraries alone, he said. Sequencing it to 3.6x depth with Moleculo's approach provided "an impressive assembly that was somewhat improved when combined with Illumina-generated scaffolds," he said, adding that his team has not yet validated these assemblies extensively.
"What was impressive was the clear separation into two distinct highly polymorphic haplotypes that were causing the assembly problems with solely Illumina reads," he said.
Michelmore said he is "sufficiently encouraged" by the results to try the technology on larger genomes now, starting with two 1.5-gigabase plant genomes, diploid progenitors of tetraploid peanut.
Geoff Waldbieser, a research molecular biologist with the USDA, and his colleagues at Auburn University have obtained Moleculo sequence data to help them produce a draft assembly of the blue catfish genome.
Overall, from a "tune-up" library to establish parameters and six production libraries, they have obtained about 700,000 long reads from Moleculo, which have an average read length of about 4.7 kilobases, ranging from 1.5 kilobases to 17.5 kilobases. In total, they have obtained 3.3 gigabases of Moleculo sequence data, about 3x coverage of the blue catfish genome, which they plan to assemble soon.
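The yield and coverage figures above can be cross-checked with simple arithmetic. The sketch below multiplies the reported read count by the reported average read length, then divides by the reported coverage to get the genome size those rounded numbers imply. The function names are hypothetical, and the implied genome size is a back-of-the-envelope value, not a measurement reported by the researchers.

```python
def total_bases(n_reads, mean_read_len_bp):
    """Total sequence yield in bases."""
    return n_reads * mean_read_len_bp


def implied_genome_size(total_bp, coverage):
    """Genome size implied by a given yield and fold coverage."""
    return total_bp / coverage


# Rounded figures from the article: about 700,000 reads averaging 4.7 kb, 3x coverage.
total = total_bases(700_000, 4_700)
genome = implied_genome_size(total, 3)

print(f"total yield: {total / 1e9:.2f} Gb")
print(f"implied genome size: {genome / 1e9:.2f} Gb")
```

The product comes to about 3.3 gigabases, matching the total quoted, and 3x coverage then implies a genome of roughly 1.1 gigabases.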
Waldbieser, who presented data from the project at the PAG Conference last weekend, told In Sequence via e-mail that he was impressed by the speed of Moleculo's computational assembly of the short-read data into long reads, which only took a couple of days and could probably be faster with better data upload speed.
He said a pairwise alignment of the Moleculo reads resulted in many perfect overlaps over kilobase stretches of sequence. Single-base mismatches seem to occur in low-complexity sequence, such as homopolymer stretches.
He also put together several contigs by hand from Moleculo data, resulting, for example, in one 28-kilobase contig from 11 long reads with only five base mismatches in the alignment. "Perfect matches of one kilobase or more certainly reduce the complexity of sequence assembly — I think my kids could assemble these on the living room floor," he said.
He also found about two million simple and complex repeats in the Moleculo reads. What he found to be the "most encouraging" result was that on average, there were two kilobases of sequence flanking the 5'-most and the 3'-most repeat. "I think having spanned so many repetitive elements will certainly increase the contiguity in any assembly," he said.
The accuracy of the long reads is also very good, he said, based on a 35-kilobase contig from nine long reads that spans a 12-exon gene from channel catfish, which was more than 99 percent identical in coding sequence and deduced protein sequence to Sanger EST sequences from unrelated catfish.
"Like everyone else, I'm hoping some day for a platform that produces sequence from one end of a chromosome to the other at 100 percent accuracy," Waldbieser said. "Until then, this is the best I've seen — long and very accurate."
The cost, he said, was "reasonable considering what we received." He now plans to start assembling Moleculo long reads along with other Illumina scaffolding reads, allowing them to detect any bias in coverage.
His group also plans to produce more Moleculo data for the channel catfish genome and other research projects. "I think these are a must for anyone who wants to assemble a eukaryotic genome de novo," he said, noting that the approach has utility for gene discovery, detection of copy number variants, resolution of recently duplicated chromosomal regions, resolution of repetitive elements, and mapping of coding and non-coding RNA. In addition, the long reads could improve metagenomic studies.
Todd Michael, a researcher at Monsanto, said during a presentation at the PAG conference that his group has used Moleculo data to improve the assembly of the corn rootworm genome, which is currently in progress.
The genome is complex and heterogeneous, even between individuals from the same population, and his team is working on a high-quality reference genome.
He said initial data from Moleculo — long reads on the order of 8 to 10 kilobases — are "really promising" because they are very accurate. In the future, he and his colleagues plan to add 20-kilobase mate pair reads, BAC sequences, as well as additional data from Moleculo.
Moleculo's technology will likely compete with other approaches that also offer long-range genomic data to provide information on structural variation or haplotypes, among them OpGen's optical mapping, Pacific Biosciences' long-read sequencing, BioNano Genomics' DNA mapping, Nabsys' genome mapping (see other article, this issue), and Complete Genomics' Long Fragment Read service.
Michelmore's team has sequenced the same oomycete genome with PacBio and Moleculo and is in the process of completing a side-by-side comparison of the results. Like Moleculo, PacBio provided "much better assemblies" than assemblies based on Illumina short-read data alone.
Kertesz said one of the main advantages of Moleculo's technology is that it works with sequencing equipment that users already have. "Thousands of customers currently have HiSeqs or MiSeqs; it immediately puts their existing infrastructure [to use], allowing them to get the long reads."
− Andrea Anderson provided additional reporting for this article from the Plant and Animal Genome conference in San Diego.