The National Center for Genome Resources is creating a sequencing center at its facilities in Santa Fe, NM, that is expected to help drive development of software for next-generation sequencing instruments.
This week, NCGR said that it is partnering with the New Mexico Institute of Mining and Technology to create the New Mexico Genome Sequencing Center and that it has secured $600,000 in funding from the state of New Mexico to purchase its first sequencer.
Stephen Kingsmore, president of NCGR, declined to specify which instrument the center intends to purchase, but confirmed that it would be a post-Sanger platform.
NCGR, a non-profit research institute founded in 1994 to provide informatics support for the Human Genome Project while it was still under the auspices of the Department of Energy, has a long history of computational methods development.
Kingsmore said that the creation of the sequencing center doesn’t signal a change in focus towards experimental science, but is envisioned as a means for accelerating development of informatics tools that can handle the vast amounts of data expected from next-generation sequencing platforms.
Kingsmore said that when he joined NCGR in 2004, the center primarily relied on experimental data from external research groups. “Pretty much all the software development we were doing was for third parties, and it was pretty clear that we needed to be principal investigator in at least some of the effort,” he said. “It was clear that we also need to be generating data, analyzing data, and that that type of rigor would improve the type of products that we were generating in terms of software.”
As next-generation sequencers began to emerge, “we foresaw that the bottleneck would move from data generation to data analysis, and that’s occurring now,” Kingsmore said. “If there’s going to be this new type of data that’s going to be very prominent for the next 10 years, we [said that we] really ought to get a head start on the software development by actually placing one of those instruments and being a client ourselves.”
Kingsmore stressed that the NMGSC isn’t going to go head-to-head with large-scale sequencing centers like the Broad Institute, Washington University, or Baylor College of Medicine. “Our remit is not to become one of the top five or six sequencing centers, but more to use these abilities to inform the software that we’re developing, and make sure we stay cutting edge in terms of the tools, the algorithms, and the functionality.”
NCGR may offer sequence-analysis software through an ASP model or via software as a service. “The new sequencing instruments are going to democratize sequencing, which hitherto has been done largely by a very small number of large sequencing centers,” Kingsmore said. “We believe that increasingly PIs and most universities and departments are going to have those capabilities, but few will have the IT hardware/software solutions to allow their PIs to mine that data.”
NCGR already has some experience analyzing next-gen sequencing data. Last year, the center received funding from the National Science Foundation, the Department of Energy, and the US Department of Agriculture to sequence the genome of Phytophthora capsici, a fungus-like oomycete pathogen that kills green chiles and other vegetable crops.
Kingsmore said that 454 Life Sciences provided sequencing services for the project, as did the DOE’s Joint Genome Institute, which used Sanger sequencing. NCGR has been working closely with JGI on “a hybrid assembly of Sanger reads and 454 reads,” Kingsmore said, and plans to present initial results from the project at next week’s Plant and Animal Genome conference in San Diego.
The project evaluated several assembly methods, including JGI’s Phrap-based approach, 454’s de novo assembly program called Newbler, and an algorithm called Forge that was designed for hybrid assemblies of Sanger reads and short reads.
Kingsmore said that the work has so far uncovered some “surprising” results, including the fact that very short paired ends “are very effective in tying together next-gen short sequence reads and giving a very acceptable scaffold size or contig size, which is something people hadn’t really thought.”
The results indicate that de novo sequencing of eukaryotes might be possible with next-generation technologies — a possibility that many in the community had ruled out.
He added that 454’s Newbler assembly “gave very acceptable gene models,” although it “looks kind of ugly in terms of contig size, scaffold size, [and] scaffold number.”
Kingsmore noted that emerging sequencing technologies will drive demand for many new algorithms. For example, he said, “there’s a need for a new algorithm that can align paired reads rather than singleton reads. And as the reads get very short, that’s going to be mandatory for giving unambiguous localization just with alignment.”
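To illustrate why pairing matters for short reads, here is a minimal Python sketch, not NCGR's method or any existing aligner. It assumes exact matching, both reads of a pair given on the same strand, and a known range for the distance between their start positions; real mates usually lie on opposite strands and contain sequencing errors. The reference sequence, the reads, and the function names `find_all` and `place_pair` are invented for the example.

```python
def find_all(reference, read):
    """Return every 0-based start position where read matches reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Resume one base later so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return hits


def place_pair(reference, read1, read2, min_gap, max_gap):
    """Return (pos1, pos2) placements whose start-to-start distance fits the range.

    Assumption: read2 lies downstream of read1 on the same strand.
    """
    hits1 = find_all(reference, read1)
    hits2 = find_all(reference, read2)
    return [
        (p1, p2)
        for p1 in hits1
        for p2 in hits2
        if min_gap <= p2 - p1 <= max_gap
    ]


# Toy reference (hypothetical) in which the 7-base read occurs twice.
repeat = "GATTACA"
reference = "CCC" + repeat + "TTTTGGGG" + "AAAC" + repeat + "CGCGCGTATA" + "GGCCTTAA"

# On its own, the short read cannot be localized: it matches two positions.
print(find_all(reference, repeat))  # [3, 22]

# Its mate is unique, and the expected spacing rules out one candidate.
print(place_pair(reference, repeat, "CGTATA", min_gap=8, max_gap=15))  # [(22, 33)]
```

In this toy case the singleton read is ambiguous, while the pair has exactly one placement consistent with the assumed spacing, which is the kind of unambiguous localization Kingsmore describes.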
In addition, he said that as instruments begin generating more and more data per run, “the speed of existing algorithms like MegaBlast just won’t cut it. Nobody’s compute cluster will be able to crunch that as fast as the data’s coming off the instrument.”
There will also be a need for improved data-mining techniques and for “nice user interfaces, filter sets that will allow us to go from vast amounts of information to candidate genes,” he said.
Kingsmore declined to provide further details of NCGR’s current software development activities.