NEW YORK (GenomeWeb) – Researchers from the University of New South Wales in Australia and the Oswaldo Cruz Foundation (Fiocruz) in Brazil will use IBM's World Community Grid — a collection of crowd-sourced computers volunteered by thousands of individuals — to analyze millions of protein-coding genes from multiple microorganisms for a project called Uncovering Genome Mysteries.
The project aims to shed light on microbial gene and protein function to help researchers better understand and predict how these organisms cause disease, produce metabolites, break down environmental contaminants, and so on. The organizers intend to create an open access database of protein sequence comparison information that will serve as a reference for the scientific community.
Other goals for the project include augmenting existing knowledge about biochemical processes, exploring the ways through which organisms interact with each other and with their environments, and "document[ing] the current baseline microbial diversity, allowing us to understand how microorganisms change under environmental stresses, such as climate change," the team said on its website. Among other benefits, the researchers expect that the fruits of their analysis could be useful for research focused on identifying and designing new antibiotics and drugs or new enzymes for industrial applications such as food processing and production of green plastics or biofuels.
The UNSW and Fiocruz researchers plan to use the large computing infrastructure and parallel processing power provided by IBM's grid to run about 20 quadrillion comparisons of 200 million genes that they've sequenced from various marine microorganisms. They'll compare these sequences to data in public repositories such as RefSeq, and also compare their own data against itself, searching for similarities that could suggest the functions of analogous proteins, Torsten Thomas, an associate professor in UNSW's School of Biotechnology and Biomolecular Sciences and Center for Marine Bio-Innovation and one of the lead researchers on the project, told BioInform this week. The researchers will compare proteins from individual microbial genomes as well as genetic information from communities of organisms.
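The figure of about 20 quadrillion is consistent with an all-against-all comparison of the 200 million sequences in which each unordered pair is compared once. The short sketch below only checks that arithmetic. Treating the workload as one comparison per unique pair is an assumption, not a description of the project's actual pipeline, and it leaves out the additional comparisons against RefSeq. The function name `unique_pairs` is illustrative.

```python
def unique_pairs(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


N_GENES = 200_000_000  # 200 million genes, as reported in the article

pairs = unique_pairs(N_GENES)
# Roughly 2 x 10^16, i.e. about 20 quadrillion
print(f"{pairs:.3e} pairwise comparisons")
```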
The sequence data for the project come from microorganisms found in, among other sources, seawater and the surfaces of seaweed and sponges in Australia's marine environment, as well as in water from the Amazon River. Both environments are sources of largely unexplored microbial communities, Thomas said. The researchers might include additional samples as the project progresses, he said. Samples are being sequenced on both Illumina and Roche 454 platforms.
By using the grid to handle their computational needs, Thomas and his team will be able to cut down on the amount of time needed to analyze the protein data. Estimates indicate that running the calculations on a standard personal computer would take 40,000 years of continuous computing, compared to a matter of months on the grid.
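As a back-of-the-envelope check on that estimate, the sketch below divides 40,000 computer-years evenly across a pool of volunteer machines. The device count, the fraction of time each device is available, and the assumption of perfect parallelism are illustrative guesses rather than project figures, and the function name `wall_clock_months` is hypothetical.

```python
def wall_clock_months(cpu_years: float, devices: int, availability: float) -> float:
    """Idealized elapsed time, in months, to finish `cpu_years` of work
    split evenly over `devices`, each contributing a fraction `availability`
    of its time. Ignores scheduling overhead, redundant work units,
    and differences in device speed."""
    if devices <= 0 or not 0 < availability <= 1:
        raise ValueError("devices must be positive and availability in (0, 1]")
    return cpu_years * 12 / (devices * availability)


# Assumed values: 300,000 devices, each free for a quarter of the time.
months = wall_clock_months(cpu_years=40_000, devices=300_000, availability=0.25)
print(f"about {months:.1f} months under these assumptions")
```

Under different assumptions the answer shifts proportionally, so the result is best read as an order-of-magnitude sanity check on "a matter of months."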
The World Community Grid is made up of thousands of computers from all over the world, on loan from volunteers who have agreed to let researchers use their computing devices (both computers and smartphones) when the devices would otherwise sit idle. The donated capacity goes to research projects that revolve around health, sustainability, and poverty. The grid is enabled by the Berkeley Open Infrastructure for Network Computing (BOINC), software developed in 2002 at the University of California, Berkeley, with support from the National Science Foundation.
Since IBM launched the project about 10 years ago, nearly three million computers and mobile devices used by over 670,000 people and 460 institutions from 80 countries have contributed computing power to more than 20 research projects, according to the company, though not all of those systems are currently active. IBM's Viktors Berstis, the technology architect and lead scientist for the WCG, told BioInform that probably a few hundred thousand systems are currently being used for analysis.
Individuals interested in being part of the grid simply download and install an application that gets their systems plugged in. Then when their computers aren't being used, the installed software connects to the grid servers, receives small tasks and the requisite software for running those tasks on volunteers' systems, completes the tasks, and returns the results, Berstis explained. Volunteers can loan their systems to specific projects or just make them available whenever and wherever they are needed. To get time on the grid, applicants submit a proposal that's reviewed by IBM scientists to ensure that it fits with the stated purpose of the grid and can work with the software that the system uses, he said. The exact length of each project varies.
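The fetch, compute, and return cycle Berstis describes can be illustrated with a toy simulation. This is a minimal sketch, not the grid's actual client or the BOINC API: the `GridServer` class, the `volunteer_cycle` function, and the `count_matches` workload are all hypothetical names invented for illustration, and the "server" is just an in-memory queue.

```python
from collections import deque


class GridServer:
    """Hypothetical in-memory stand-in for the grid's servers."""

    def __init__(self, tasks):
        self.pending = deque(enumerate(tasks))  # (task_id, payload) pairs
        self.results = {}

    def request_task(self):
        """Hand out the next small task, or None when no work is left."""
        return self.pending.popleft() if self.pending else None

    def submit_result(self, task_id, result):
        """Record the result a volunteer machine sends back."""
        self.results[task_id] = result


def volunteer_cycle(server, compute, is_idle):
    """Fetch, run, and return tasks while the volunteer's machine is idle.
    Returns the number of tasks completed."""
    done = 0
    while is_idle():
        task = server.request_task()
        if task is None:
            break
        task_id, payload = task
        server.submit_result(task_id, compute(payload))
        done += 1
    return done


def count_matches(pair):
    """Toy workload: count identical positions in two equal-length sequences."""
    a, b = pair
    return sum(x == y for x, y in zip(a, b))


server = GridServer([("MKTAY", "MKSAY"), ("GAVLI", "GAVLI"), ("PQRST", "ABCDE")])
completed = volunteer_cycle(server, count_matches, is_idle=lambda: True)
print(completed, server.results)
```

In a real deployment the idle check, task download, and result upload would go over the network and be handled by the installed client software; the sketch only shows the order of the steps.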
Other omics-based projects currently using the grid include one at the Princess Margaret Cancer Center in Toronto focused on mapping cancer markers. Researchers there are using the computational power of the grid and internally developed software to analyze data from tissue and blood samples collected from cancer patients and healthy controls to identify biomarker combinations that are involved in the development, progression, and treatment of various kinds of cancer. The team is focusing initially on prostate, pancreatic, and breast cancers.
Past genomics projects include one from a computational biology research group at the University of Washington, Seattle, which focused on predicting the structure of proteins from major strains of rice, with an eye towards improving crop yield. The grid was also used for the Human Proteome Folding project, which was designed to help researchers better understand protein function and how it's affected by disease, and was the first project to run on the infrastructure. Some related efforts are using the grid to screen millions of drug compounds against molecular targets to identify more effective treatments, Berstis said.