100K Genome Project to Include Genome Finishing by Hybrid Assembly for Some Microbes


A five-year effort to sequence the genomes of around 100,000 microbes associated with foodborne illness will rely heavily on Illumina's HiSeq 2000 platform for generating draft genome sequences, while data from a second, longer-read platform will be used to finish a subset of the assemblies.

"We're going to finish 1,000 to 2,000 genomes completely and we're going to combine the Illumina reads with a second technology, which is still in negotiation, to finish those genomes," Bart Weimer, a veterinary medicine researcher at the University of California, Davis, and co-director of the BGI@UC Davis center, told In Sequence.

Weimer will direct the massive sequencing study, dubbed the 100K Genome Project, which was introduced last month by UC Davis, Agilent Technologies, and the US Food and Drug Administration.

The effort, funded through a public-private partnership that includes federal agencies, private companies, and UC Davis, is expected to yield genetic tools for studying the biology of foodborne pathogens, identifying culprits during food infection outbreaks, and tracing them back to their sources.

The Centers for Disease Control and Prevention and the Department of Agriculture's Food Safety and Inspection Service are also collaborating on the project, as are research groups in the US and internationally.

Data generated for the project will be deposited into open access sequence databases to serve as a resource for those working to prevent, diagnose, and track foodborne disease outbreaks and for researchers interested in more general questions about pathogen virulence and evolution.

The 100K Genome Project stemmed, in part, from past efforts by researchers at UC Davis and elsewhere to develop genomic resources for studying food pathogens, Weimer explained.

When it became clear that there was not enough genetic or genomic information available for some of the assays they had in mind, the team decided to turn to genome sequencing, something that their collaborators at the FDA were already starting to apply.

Weimer noted, for instance, that the FDA has been quickly shifting to a whole-genome sequencing-based approach for assessing pathogens during foodborne illness outbreaks.

But because "it's a very onerous task to do that for every outbreak," he explained, there is interest in finding quick and effective tools for detecting and identifying pathogens. By sequencing and comparing the genomes of many food pathogens from different parts of the world and with different virulence patterns, the researchers hope to find markers or stretches of sequence that are specific for such applications.

"We do think that this will be the basis for revising or remodeling the way detection is done in foods and the environment generally," Weimer said.

"[I]t is very likely the database will provide for better, more specific tests for food pathogens," FDA spokesman Curtis Allen told IS in an e-mail message.

In addition to providing a jumping-off point for finding more robust diagnostic biomarkers, the new genome sequences may also prove useful for answering broader questions about food pathogen biology.

Sequencing for the 100K Genome Project will be done using the Illumina HiSeq 2000 at BGI@UC Davis's next-generation sequencing facility, set to house 10 of the instruments (IS 11/1/2011).

The HiSeq platform boasts several advantages that make it well suited for the pathogen sequencing study, according to Weimer, including its high throughput, which the team plans to exploit by barcoding and multiplexing samples, and the depth of coverage it allows. He also noted that it has been possible to hammer out very competitive sequencing rates for the Illumina instrument owing to the capacity of the BGI@UC Davis facility itself.
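To illustrate what barcoding and multiplexing imply downstream, here is a minimal sketch of demultiplexing: sorting pooled reads back into per-sample bins by their barcode. It is not the project's pipeline, which the article does not describe. The function name, the sample labels, the four-base barcodes, and the assumption that the barcode sits at the start of each read with exact matching are all hypothetical; production workflows typically read the index separately and tolerate mismatches.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_len):
    """Group pooled reads by sample using an exact-match leading barcode.

    Hypothetical layout: the first `barcode_len` bases of each read are the
    barcode, and the remainder is the genomic insert.
    """
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        # Reads whose barcode is not recognized go into a catch-all bin.
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(insert)
    return dict(bins)


# Illustrative data only; barcodes and sample names are made up.
barcodes = {"ACGT": "salmonella_01", "TTAG": "listeria_07"}
reads = ["ACGTGGGCCC", "TTAGAAATTT", "ACGTTTTT", "CCCCGGGG"]
bins = demultiplex(reads, barcodes, barcode_len=4)
for sample, inserts in bins.items():
    print(sample, inserts)
```

Pooling many barcoded isolates into one run is what lets a high-throughput instrument spread its output across many small bacterial genomes.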

In addition to the multitude of draft genome sequences to be generated for the project, the team anticipates assembling at least 1,000 finished genome assemblies, which will be pieced together using a combination of Illumina short-read data and long-read data generated on a second, yet-to-be-determined instrument.

Several groups have recently adopted the Pacific Biosciences RS to aid with genome finishing (IS 8/7/2012), but Weimer declined to provide details about which long-read sequencing platform the project is considering for this aspect of the work.

Additional genomes may ultimately be finished as well, depending on the quality of the initial draft genome assemblies and/or the level of information required by those partnering with the 100K Genome Project.

"In some pilot projects, we've gotten anywhere from nine to 30 contigs with this approach," Weimer said. "At that level, it's not so crazy to think that we could go ahead and finish more of them."

"In the out years, we'll see what the read depth is, what the alignment scores are," he added, "and then, in specific genomes where it's needed biologically we'll go and finish those more carefully."

The group expects to rely on some commercially available software and some open source software. But the comparative nature of the project is expected to present some bioinformatics challenges as well, since none of the existing algorithms or software are equipped to accommodate and compare 100,000 genome sequences simultaneously, Weimer explained. In addition to the software groups that they are already working with, he noted that the 100K Genome Project members are "looking for partners who want to take on a challenge in this area."

"The biggest thing that needs to be done is there's no piece of software that can do this large-scale [genome] comparison side-by-side," he said.

Data generated for the 100K Genome Project will be made available to other members of the research community through open-access databases, including those hosted by the National Center for Biotechnology Information.

"NCBI has made the commitment to harmonize the annotations across all of the genomes — those coming from this project and other genomes that are being done by other groups," Weimer said. "So [for example] there will be a harmonious annotation set for all of Salmonella that are deposited to NCBI."

In addition, there will likely be sequence information deposited to similar databases in Europe and to the Pathosystems Resource Integration Center, or PATRIC, a pathogen genomics database hosted by the Virginia Bioinformatics Institute at Virginia Tech.

Organisms that have been prioritized for sequencing first, known as "tier 1" isolates, include bugs from the Salmonella, Campylobacter, Vibrio, and Listeria genera, as well as E. coli. After that, the team will sequence more or fewer isolates of a given pathogen depending on the interests of collaborating partners, the genome representations available, and so on.

Some of these organisms have not been well characterized genomically, while others have already been the subject of large-scale sequencing efforts but have so much genetic diversity that further sequencing is still needed.

For instance, although an astounding number of Campylobacter isolates have been collected, Weimer explained, just 35 or so have been sequenced.

On the other hand, several large-scale sequencing projects are underway for another tier 1 organism: Salmonella. But because so much serodiversity exists for the organism, the FDA and other members of the 100K Genome Project partnership are keen to include it in the new study.

"We'll be sequencing about 50 or 60 serotypes of Salmonella and we'll do 10 to 20 isolates of every serotype," he said. "So we'll get a really, really good perspective of what the genomic diversity is by serotype."
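As a rough check on what those figures imply, the sketch below multiplies the quoted ranges to bound the number of Salmonella isolates. It uses only the numbers in Weimer's quote and the 100,000-genome project target; the function name is hypothetical, and the result is a back-of-the-envelope bound, not a project commitment.

```python
def isolate_range(serotypes, isolates_per_serotype):
    """Return (low, high) isolate counts from two (min, max) ranges."""
    low = serotypes[0] * isolates_per_serotype[0]
    high = serotypes[1] * isolates_per_serotype[1]
    return low, high


# Ranges quoted in the article: 50 to 60 serotypes, 10 to 20 isolates each.
low, high = isolate_range((50, 60), (10, 20))
project_target = 100_000

print(f"Salmonella isolates: {low} to {high}")
print(f"Share of target: {low / project_target:.1%} to {high / project_target:.1%}")
```

On these figures the serotype survey would account for somewhere between 500 and 1,200 genomes, a small fraction of the overall target.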

The second tier includes Yersinia, Clostridium, Shigella, Enterococcus, and Cronobacter, while several food-associated viruses, including rotavirus, hepatitis, and norovirus, are among the tier 3 organisms.

The timeline for sequencing genomes within each tier will depend to some extent on how the production process goes and on how collaborations with other groups evolve, Weimer said.

The researchers estimate that they'll be able to plow through around 1,500 genomes in the first year while they're tweaking library preparation protocols and setting up the framework for automating the rest of the project.

During the second and subsequent years of the project, they hope to ramp up that sequencing output to around 25,000 organisms annually.
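The arithmetic behind that ramp-up can be sketched as follows, assuming the five-year span mentioned at the top of the article, roughly 1,500 genomes in year one, and a flat 25,000 per year afterward. The function name and the flat-rate assumption are illustrative; the actual schedule will depend on production and collaborations, as Weimer noted.

```python
def projected_genomes(years, first_year=1_500, later_years=25_000):
    """Cumulative genome count at the end of each project year."""
    totals = []
    running = 0
    for year in range(1, years + 1):
        # Year one is the lower-output setup year; later years run at full rate.
        running += first_year if year == 1 else later_years
        totals.append(running)
    return totals


totals = projected_genomes(5)
for year, total in enumerate(totals, start=1):
    print(f"Year {year}: {total:,} genomes")
```

Under these assumptions the cumulative count reaches 101,500 by the end of year five, which is consistent with the project's target of around 100,000 genomes.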

At the moment, members of the 100K Genome Project team are continuing to shore up collaborations with researchers at academic centers, agencies, and companies who are working on isolates of interest from different parts of the world.