By Julia Karow
This article was originally published Nov. 9.
In anticipation of cheaper DNA sequencing technologies coming online in the next few years, an international consortium of scientists is organizing to sequence the genomes of about 10,000 vertebrate species.
The "Genome 10K" project aims to understand the genetic basis of evolutionary processes, and the organizers have already compiled a virtual database of existing sample collections of more than 16,000 vertebrate species. Last week, they published an outline of the project in the Journal of Heredity.
Consortium leader David Haussler, a professor of biomolecular engineering at the University of California, Santa Cruz, said the project will cost an estimated $50 million to $100 million in total, including approximately $30 million in sequencing costs.
The consortium, which currently involves about 70 scientists from more than 40 institutions, is seeking funding from governments and private sources. Its goal is to embark on a pilot project within the next few years and to complete the entire project within five to 10 years after that.
According to Haussler, the sequencing estimate assumes a technology that can analyze a sample for about $3,000. "That does not exist at this point, but we anticipate that it will exist within a few years, and whatever technology comes along that will meet that requirement, we will use," he told In Sequence last week. Sequencing will likely be distributed among several as-yet-undetermined centers.
But even in the absence of funding and appropriate sequencing technology, he and the other leaders of the project — Stephen O'Brien, chief of the laboratory of genomic diversity at the National Cancer Institute, and Oliver Ryder, director of genetics at the San Diego Zoo's Institute for Conservation Research — said they felt the time was right to gather samples and collaborators for the project now.
"The sequencing gets easier all the time, but [collecting] the samples and the expertise needed doesn't," Haussler said. In April, he and his colleagues organized a workshop in Santa Cruz that was attended by representatives from 43 institutions with large collections of mostly frozen tissue samples.
The more than 16,000 vertebrate species contained in the database to date — of an estimated 60,000 vertebrate species overall — cover mammals, birds, reptiles, amphibians, and fish, including some recently extinct species. Haussler said more samples might be added in the future, and the consortium has established guidelines for how they should be collected and stored to make sure the DNA extracted from them can be sequenced. In addition, the database includes fibroblast cell lines from several hundred species, primarily mammals.
Besides $30 million for sequencing, the scientists estimate $20 million to $70 million in fixed costs for the project, associated with sample collection and handling, creating and maintaining databases, and data analysis. Those costs, Haussler said, will probably amount to several thousand dollars per genome, similar to other large-scale sequencing projects that he is involved in, such as The Cancer Genome Atlas or the 1000 Genomes Project. "You cannot really do this for less," he said.
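The figures quoted in this article fit together as simple arithmetic, and the short Python sketch below works through them. It uses only numbers reported here: roughly 10,000 genomes, about $3,000 per sample for sequencing, and $20 million to $70 million in fixed costs. The function name `project_cost` and its parameter names are hypothetical, chosen for illustration, and the calculation assumes one sequenced sample per species.

```python
def project_cost(n_genomes, seq_cost_per_genome, fixed_low, fixed_high):
    """Combine per-genome sequencing cost with a range of fixed costs.

    All inputs are in U.S. dollars except n_genomes. Assumes one
    sequenced sample per genome (an illustrative simplification).
    """
    sequencing = n_genomes * seq_cost_per_genome
    return {
        "sequencing": sequencing,
        "total_low": sequencing + fixed_low,
        "total_high": sequencing + fixed_high,
        # Fixed costs spread evenly over every genome in the project.
        "fixed_per_genome_low": fixed_low / n_genomes,
        "fixed_per_genome_high": fixed_high / n_genomes,
    }


# Figures as reported in the article.
estimate = project_cost(
    n_genomes=10_000,
    seq_cost_per_genome=3_000,
    fixed_low=20_000_000,
    fixed_high=70_000_000,
)

for name, dollars in estimate.items():
    print(f"{name}: ${dollars:,.0f}")
```

Under these assumptions, sequencing comes to $30 million, the total falls between $50 million and $100 million, and the fixed costs work out to between $2,000 and $7,000 per genome, in line with Haussler's "several thousand dollars per genome."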
The aim of the project is to generate vertebrate genomes that are of "substantial quality," said Haussler, who also helps run the UCSC Genome Browser, which currently includes 44 vertebrate genomes.
Genomes sequenced for the project may not be as high-quality as the human genome, but should reach at least the quality of the dog genome "or some of the other vertebrate genomes that have been done recently to something like 6x or 7x Sanger coverage per genome," he said.
Prior to embarking on the full-scale project, the consortium plans to conduct a pilot project, the specifics of which it is currently formulating. The pilot will use the 16,000-sample collection and "poke at this resource in a strategic way so that we can get some initially exciting scientific information from it, and we can demonstrate that the tissues are fully viable for sequencing," Haussler said. "One of the goals of the pilot is to absolutely demonstrate that there is no obstacle other than the need for cheap [sequencing] technology, that we can get the DNA information that we need from these samples."
The scientists plan to start the pilot as soon as possible, using sequencing technologies available at that time — which could be any of today's technologies, or a mix of them, Haussler said.
Sequencing for the pilot could be provided by the genome sequencing center at UCSC, which currently houses Applied Biosystems SOLiD and Roche/454 platforms, and is in the process of installing an Illumina Genome Analyzer (see In Sequence 10/27/2009).
In addition, Haussler said, the comparative genomics lab at the Institute of Molecular and Cell Biology in Singapore — co-headed by Nobel laureate Sydney Brenner, who is also a participant in the Genome 10K project — may provide some sequencing capacity, and others are also interested in participating in the pilot.
Analyzing the data is likely to be more challenging than sequencing 10,000 vertebrate genomes, according to Haussler. "It's certainly a much bigger project than anything we have tackled before," he said. For example, comparing just the 44 vertebrate genomes currently in the UCSC Genome Browser "completely maxes out our computational capabilities, which are substantial," and which include a compute farm with several thousand CPUs.
Doing the same with 10,000 genomes will be impossible, he said, and the scientists will have to come up with new analysis methods. "You have to organize [each piece of DNA] according to the phylogenetic relationship among the species, and you have to build a system that allows for any comparison to be made dynamically and quickly," Haussler said.
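To give a rough sense of the jump in scale between 44 and 10,000 genomes, the Python sketch below counts how many distinct pairs of species each set contains. This is an illustration only: it assumes comparisons are made pairwise, which the article does not state, and it says nothing about how the UCSC analyses are actually structured. The function name `species_pairs` is hypothetical.

```python
from math import comb


def species_pairs(n_genomes: int) -> int:
    """Number of distinct unordered pairs among n_genomes genomes."""
    return comb(n_genomes, 2)


# 44 genomes are in the UCSC Genome Browser today; 10,000 is the project goal.
current = species_pairs(44)
planned = species_pairs(10_000)
growth = planned / current

print(f"pairs among 44 genomes:     {current:,}")
print(f"pairs among 10,000 genomes: {planned:,}")
print(f"growth factor:              {growth:,.0f}x")
```

Under that pairwise assumption, 44 genomes yield 946 pairs, while 10,000 genomes yield 49,995,000, an increase of more than 50,000 times for a roughly 227-fold increase in genomes.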
Finally, correlating genetic differences with phenotypic traits will provide work for decades of research to come, according to Haussler. "We expect to lay the foundation for that work, we do not expect to complete that aspect of the project," he said. "That's not part of the mandate — otherwise it would be a 100-year project."