This is how genomics is supposed to work.
When the University of California, Santa Cruz, released its genome browser for the SARS virus draft assembly with some 200 tightly aligned mRNAs, it wasn’t the most striking moment for genomics researchers. But it marked a new way of attacking a spreading virus, and it showed the opportunities available through open source, genomics and proteomics databases, and cooperative research.
Angie Hinrichs, one of UCSC’s genome browser team members, was on an extended vacation in New Zealand when she started hearing about the SARS fears. When the organism’s genome sequence was deposited in GenBank in mid-April, she decided to see what she could do from an Internet café there.
Heather Trumbower, the QA manager for the UCSC team, says it came down to applying technologies and processes already in use for other genomes to SARS. So Hinrichs logged on from New Zealand, “grabbed all the other viral mRNAs” in addition to the GenBank sequence data, and started working on alignments with Jim Kent’s BLAT tool, according to Trumbower. With some hundreds of thousands of viral mRNAs to work with, Hinrichs connected to the UCSC Kilocluster, a Linux cluster of 500 dual-processor Dell boxes. The initial alignment took less than 24 hours on the Kilocluster, says Trumbower.
Meanwhile, the rest of the team at UCSC snapped into gear as well. “David Haussler was very supportive, as was Jim Kent,” Trumbower says. The mentality was a new one for the human-, rat-, and mouse-focused UCSC: “We didn’t think we were uncovering a lot of new ground,” she says. “But since it was fairly easy for us, we should crank it through and make it available … and perhaps it might be helpful.”
Kevin Karplus and Fan Hsu joined in to work on predicted proteins from the sequence. Robert Baertsch, a grad student, pulled in partial and complete sequence data for more comparisons while Hinrichs kept sending in her datasets from vacation. Victor Solovyev of Softberry, who was adding his gene predictions for build 33 of the human genome, also noticed that UCSC was working on the SARS genome, “and he thought, ‘Hey, I can add to that,’” Trumbower says. Solovyev used his FgenesV software to predict genes for the virus. The UCSC team also went to SwissProt, downloaded the entire database, and searched it for alignments, turning up a few hundred matching proteins.
The total effort lasted about three weeks and took up no more than 20 hours per person, Trumbower says, thanks to a manageable dataset: the SARS genome weighs in at just 29 kilobases. By UCSC’s estimates, the entire genome could be actively in use, and Trumbower says the team’s assembly shows the differences inherent in viral genomes, such as overlapping genes. The general idea in having it available, she says, is to provide “a reality check. Hopefully people can just quickly look and make sure that there’s nothing we’ve picked up that they missed, or vice versa.”
— Meredith Salisbury