As the Human Genome Project participants prepare to submit their paper on the completed human genome for publication, one member, the University of California, Santa Cruz, has been working diligently to incorporate information into its assembly of the genome.
So far, this assembly, known internally as Golden Path, is one of two that are publicly available. The US National Center for Biotechnology Information placed its assembly on-line only last week.
Since the working draft of the human genome was completed in June, UCSC has been updating and re-assembling the public version with more data and more types of data, such as SNPs, as well as refining the assembly software, said David Haussler, a professor of computer science at Santa Cruz who is directing the assembly component of the Human Genome Project.
“The addition of the SNPs has happened in the last month. And there are many new tracks on the browser,” said Haussler. The tracks offer different information sources on the browser, which was designed by Jim Kent, a graduate student in biology at the university. Kent also developed the GigAssembler software that the Santa Cruz group used to assemble the genome. The assembly can be found at http://genome.ucsc.edu/.
Besides being added to the database, SNP data from the SNP Consortium have been used to improve the working draft and to add more detail. The genome project and the SNP Consortium, a group of prominent pharmaceutical and technology companies, announced in July that they would collaborate to upgrade the draft and also speed the generation of a higher-density SNP map.
The next data to be added to the assembly are cytogenetic and radiation hybrid maps, said Haussler.
In addition to SNPs, the browser also displays human sequence information, chromosome bands, ESTs, pufferfish sequence information, and gene prediction data.
The UCSC group plans to include gene predictions from the European Bioinformatics Institute, Massachusetts Institute of Technology, and the US National Center for Biotechnology Information. For now, the browser shows gene prediction information from Neomorphic’s Genie. Neomorphic, now part of Affymetrix, commercialized Genie based on a prototype that was developed at the university, said Haussler.
About one year ago, in the early days of the Santa Cruz effort, Neomorphic helped with design and analysis and developed software tools, including some for the synthetic chromosome 22 data that UCSC used to test its methods, said David Kulp, an Affymetrix vice president, who was formerly vice president of bioinformatics at Neomorphic.
Because GigAssembler is continually being updated and is therefore not finished yet, there are no immediate plans to distribute it – commercially or otherwise. Kent wrote the initial assembly software in four weeks and has been modifying it weekly.
Haussler expects the traffic on the Santa Cruz website, which receives some 10,000 information requests each day, to increase once the paper on the genome is published. He did not know when the paper would be submitted or published, but he expects the public effort’s paper to appear at the same time as Celera Genomics’ paper.
Celera announced last week that it had submitted its manuscript on the human genome to Science and expects the paper to be published in the first quarter of 2001.
The university has a Linux-based processor farm of 100 PCs that it has used to assemble the genome. Only four of those machines handle Web requests, so more power is available if needed. Haussler’s research group began using Linux about a year ago, when its involvement in the assembly effort began.
Kent said that he got involved in the assembly work because it needed to be done to enable him to do his research on alternative splicing. At this point, he would be happy if NCBI could take over.
“They [NCBI] very much want the job,” he said.
A Sanger Centre group will judge the Santa Cruz and NCBI assemblies to determine their respective merits. More tools will have to be developed just to evaluate the two assemblies. Beyond that, Kent said he is curious to see how the public efforts will match up against those of Celera. “Once the tools for comparing the two assemblies are in place, I’m hoping we’ll be able to apply them to Celera’s [assembly] to get a more or less objective and a more or less thorough answer to these questions,” Kent said.
Haussler said that Eric Lander, director of the Genome Center at MIT’s Whitehead Institute, invited him and his group to participate in a coordinated bioinformatics effort to assemble the human genome.
“We needed to scramble to increase the bioinformatics component of the Human Genome Project in order to put it all together in a short frame of time – shorter than it had originally been planned. So we stepped in and helped out in that way along with NCBI and EBI,” said Haussler.
Haussler’s group received data and assistance from NCBI and EBI.
Haussler said he welcomes other assemblies and other views of the data because they will help science advance more quickly, especially since uncertainties about the data and the annotation still exist. Having only one view “would not only stifle creativity but we would be casting in concrete something that is really only a fleeting glance, a temporary way station on our way towards a complete understanding of the genome,” he said, noting that there is much more to be done.
“To make the key advances that we want to make from the human genome, it’s essential that we use all of our bioinformatics capabilities to really dig deeply into the data. We’ve only scratched the surface here. There hasn’t been time for the more detailed algorithms and analyses to be run over this dataset,” said Haussler.
Haussler’s group has not received major funding for its work on the genome project, in part because it joined near the end and didn’t have enough time to go through the traditional grant process, said Haussler. The group has received some funding from the National Institutes of Health, the Department of Energy, the Packard Foundation, and the Sloan Foundation. He hopes to get more funding soon.