CAMBRIDGE, Mass. — To keep pace with growing sequence datasets, the Genome Browser team at the University of California, Santa Cruz, plans to set up a system where researchers can host their datasets locally and view the data via the UCSC Genome Browser.
At the Workshop on Visualizing Biological Data held here this week, Robert Kuhn, associate director of the UCSC Genome Browser project, said that data storage constraints are increasing the time it takes to process and analyze large data sets and make them available in the browser.
The UCSC team currently has around 2 terabytes of storage for the custom tracks section of the browser.
“We are rapidly outgrowing our physical space and our ability to post [data] because datasets [are] multiple tens of gigabases in size and it’s harder and harder to transmit it [and] to use and display it,” he told BioInform, adding that the transmission time is often the rate-limiting step in the process.
Kuhn outlined his team’s plans during a presentation at the conference, where he described methods of visualizing data enabled by the UCSC Genome Browser as well as plans for additional functionalities.
When the new system is in place, “large datasets can live on [the researcher’s] site but only small pieces of it ever get transmitted,” Kuhn said. “Only two fairly small queries go back and forth instead of the whole dataset being uploaded.”
Rather than uploading entire datasets as part of a track, researchers will host their data locally and create indexes of it. They can then make the data available on the web and submit its location and a description to UCSC. To view a portion of the data, a researcher sends the browser a message telling it where to find the dataset. If, for example, the researcher wants to look at a region of chromosome 2, the browser consults the index, identifies the requested region, pulls only the information it needs out of the main dataset, and displays it for the user.
In this way, researchers could post their data on the browser “without our having to interact with it,” Kuhn said.
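The index-then-fetch pattern described above can be sketched roughly as follows. This is an illustration only, not UCSC's software or any of its file formats: the block index layout, the `FakeRemoteFile` class, and the functions `build_blob_and_index` and `query_region` are hypothetical names, and an in-memory byte string stands in for a file on the researcher's web server.

```python
# Illustrative sketch only: a toy block index over BED-like text records.
# All names and the index layout are assumptions, not UCSC's implementation.

def build_blob_and_index(records, block_size=2):
    """Serialize records sorted by (chrom, start) into one byte blob and
    build an index of (chrom, min_start, max_end, offset, length) blocks."""
    blob = bytearray()
    index = []
    i = 0
    while i < len(records):
        chrom = records[i][0]
        block = []
        # A block never spans two chromosomes.
        while i < len(records) and records[i][0] == chrom and len(block) < block_size:
            block.append(records[i])
            i += 1
        text = "".join(f"{c}\t{s}\t{e}\t{n}\n" for c, s, e, n in block).encode()
        index.append((chrom,
                      min(r[1] for r in block),
                      max(r[2] for r in block),
                      len(blob),
                      len(text)))
        blob += text
    return bytes(blob), index


class FakeRemoteFile:
    """Stands in for a dataset on the researcher's server; counts bytes served."""

    def __init__(self, data):
        self._data = data
        self.bytes_served = 0

    def read_range(self, offset, length):
        chunk = self._data[offset:offset + length]
        self.bytes_served += len(chunk)
        return chunk


def query_region(remote, index, chrom, start, end):
    """Return records overlapping [start, end) on chrom, fetching only
    the byte ranges of index blocks that overlap the region."""
    hits = []
    for c, block_min, block_max, offset, length in index:
        if c == chrom and block_min < end and block_max > start:
            for line in remote.read_range(offset, length).decode().splitlines():
                rc, rs, re_, name = line.split("\t")
                rs, re_ = int(rs), int(re_)
                if rs < end and re_ > start:
                    hits.append((rc, rs, re_, name))
    return hits


# Made-up example records: (chrom, start, end, name).
records = [
    ("chr1", 100, 200, "a"),
    ("chr1", 500, 600, "b"),
    ("chr2", 1000, 1100, "c"),
    ("chr2", 5000, 5200, "d"),
    ("chr2", 9000, 9100, "e"),
]
blob, index = build_blob_and_index(records)
remote = FakeRemoteFile(blob)
hits = query_region(remote, index, "chr2", 4900, 5100)
print(hits, remote.bytes_served, "of", len(blob), "bytes transferred")
```

In this toy setup the viewer reads the small index first and then requests only the byte range of the one block that overlaps the region, so most of the dataset never leaves the host. Over the web, the same idea would presumably rely on HTTP range requests against the hosted file.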
As part of the development process, Kuhn said UCSC’s Jim Kent is creating software for researchers to run on their systems in order to serve as hosts. In addition, he said that the team is considering working with so-called “data validators” to ensure that user data is in the proper format.
Currently, the browser can accept data in several formats, including BED files, Wiggle files, and BigBed files.
While he could not give a specific deadline, Kuhn said that the team plans to have the distributed system in place within a few months.
“The ultimate goal is to make it [possible for] people [to] simply submit their coordinates, their http address on the web, and … they will become tracks available on the browser,” Kuhn said. “Other users will be able to click on it.”
Furthermore, Kuhn said during his presentation that his team is planning some improvements to the browser, including the ability “to display differences between [a] new sequence and the reference assembly” as well as to “find a useful display for three-dimensional data such as chromosome folding data that do not map well to one-dimensional or two-dimensional plots.”
Kuhn’s team has begun working with Ting Wang at Washington University in St. Louis to develop a prototype of the planned system.