NEW YORK (GenomeWeb) – A collaborative effort involving four UK universities is attempting to provide academic researchers with cloud-based compute, storage, and bioinformatics algorithms and pipelines for analyzing and making sense of microbial genomes.
The Cloud Infrastructure for Microbial Bioinformatics (CLIMB) involves researchers at Swansea University, the University of Warwick, University of Birmingham, and Cardiff University. It's composed of four components: the planned cloud system itself, updated bioinformatics facilities at Warwick and Swansea Universities, planned training workshops and events, and three microbial bioinformatics research fellows, who are developing and implementing pipelines on the system that support various analysis needs.
CLIMB has its roots in a funding call issued by the UK's Medical Research Council for investment in infrastructure in medical bioinformatics. Initially, the four universities involved in CLIMB submitted two separate funding applications to the council but were ultimately asked to merge their submissions into a single application focused on equipping researchers with microbial research tools. A total of £8.4 million (about $13 million) over five years has been provided for the project.
Those funds are being used to develop a system of shared servers and storage space as well as virtual machines that come with pre-installed bioinformatics algorithms and pipelines for performing a variety of genomic analysis tasks. A system like this helps remove the need to reinvent the wheel each time a new group wants to run a microbial research project by providing something that they can just take off the shelf and use right away, Mark Pallen, a professor of microbial genomics at Warwick and one of CLIMB's principal investigators, told GenomeWeb. It also provides a forum for sharing research pipelines publicly, where they can be used as they are and also improved upon by other research groups, he added.
Furthermore, improvements in next-generation sequencing technologies and hardware have made generating and computing across genomic datasets less onerous, Sam Sheppard, a chair in medical microbiology and infectious diseases at Swansea and one of the CLIMB PIs, noted. But investigators are still trying to address research questions — understanding genomic diversity across microbial strains as well as mechanisms of drug resistance, for instance. The real need now, he told GenomeWeb, is to invest in smarter tools and methods of analyzing large quantities of sequence data and that's something that CLIMB is set up to do.
For the first year of the project, the developers focused primarily on purchasing all the hardware needed for CLIMB and installing it at each of the four sites. About £3.7 million of the £8.4 million in total funding was spent on hardware, Pallen told GenomeWeb. They also spent about £700,000 on refurbishing space at two of the partner institutions, Swansea and Warwick, where the CLIMB group will provide bioinformatics training workshops and host hackathons.
Pallen told GenomeWeb that the CLIMB collaborators did mull whether or not it made more sense to use existing cloud platforms such as Amazon Web Services but ultimately decided not to because of the variable nature of the costs associated with using the cloud, which makes it difficult to know upfront exactly how much such a system could cost longer term. Also, since CLIMB is an investment by the UK government in science infrastructure, it is possible to provide the system to the community for free, he said. On a commercial cloud system, researchers would have to foot the bill for cloud compute costs themselves.
Also, concerns about the security of clinical data or metadata associated with patient isolates, for instance, might make some microbial researchers hesitant to use public clouds for data storage and analysis. Furthermore, at the time when the idea for CLIMB was first conceived, cloud technology was not quite as mature as it is today, Sheppard noted, making installing and running a local system the more sensible alternative. Moreover, as a non-commercial system owned and maintained by academia, CLIMB can be run completely openly, with all associated scripts, algorithms, and programs available as open source. However, both he and Pallen did say that commercial clouds will remain a potential option for CLIMB's future.
All of the purchased infrastructure has now been installed at each of the four universities. According to the CLIMB site, the complete system comprises over 7,600 CPU cores of processing power, 78 terabytes of random access memory, and a total raw storage capacity of just over 6,900 TB, with a usable capacity of just over 2,304 TB. The infrastructure is split mostly equally across all four institutions, with some minor differences that shouldn't affect users.
For this, the second year of the project, the developers are working on installing appropriate software, including OpenStack (open source software for creating private and public clouds) and tools that enable the VMs to communicate with the platform's storage. The system is designed to support over 1,000 VMs running simultaneously. When it's up and running, the system will function as a single unit, so users won't know where the particular server they are using is located, and if one site goes down, the other sites will simply pick up the slack, Pallen said.
CLIMB's developers have begun beta testing installations at some sites with a number of early adopters. One test group that Pallen talked about is the Genomics Virtual Laboratory in Australia, where researchers are currently testing and fine-tuning a microbial analysis pipeline they developed that includes Galaxy and other open-source tools that they have implemented on CLIMB. Another test group has pipelines implemented on CLIMB for analyzing data from Salmonella and Escherichia coli as part of efforts to better understand their patterns of spread.
The next step will be to open the system up for broader use by the UK academic research community. If all goes well with testing, the plan is to make CLIMB available more broadly by March next year, Pallen said. When the system goes live, users with verified academic identities will be allowed to register for accounts and will receive a standard package as a default that includes a to-be-determined amount of server capacity and memory as well as software for common tasks such as sequence search, assembly, and annotation, Pallen said. Researchers are also encouraged to install their own scripts and pipelines on CLIMB in addition to the standard algorithms offered on the system.
Also, users who have higher memory requirements or need more VMs than are initially provided will be able to put in requests for additional resources, and these will be reviewed by the developers. For now, the developers expect that the system they have put in place has sufficient capacity to support the needs of the academic community, but if that turns out not to be the case, they'll review and make adjustments to resource allocation as necessary.
Initially, the system will be available to academics only, but the CLIMB collaborators could eventually consider offering access to the infrastructure to commercial users for a fee, Pallen said. That won't happen, however, until the system has been running successfully for a number of years; for now, the priority is access for academic groups. Other immediate activities for CLIMB include hiring additional systems administrators, he said. Currently only one person is employed full time on the grant to manage the system, but several other developers have been donating their time to the project.