NIAID's Nephele Offers Free Compute Power, Software for Microbial Genome Analysis

UPDATE: This story has been updated to provide further clarification on Nephele's development and long term sustainability plans.

NEW YORK (GenomeWeb) – The National Institutes of Health's National Institute of Allergy and Infectious Diseases is offering free access to Nephele, a platform that offers cloud computing infrastructure and bioinformatics tools for analyzing 16S and whole genome sequencing datasets.

The developers kicked off a promotional phase earlier this month and are offering one-time-use access codes that will let users run multiple single analyses without cost on Nephele. It's an opportunity for the developers to collect data on Nephele's usage, gauge its value to the community, and mull possible funding models for maintaining the system long-term, Nick Weber, BCBB's translational bioinformatics program lead and Nephele's project manager, told GenomeWeb. The promotional period, he said, will run until the funds allocated for the project have run out, which Weber said could take about six months to a year, though it could last longer if the group is able to secure additional funding in the intervening months.

Nephele, named for a nymph from Greek mythology, was developed by the Bioinformatics and Computational Biosciences Branch of the NIAID Office of Cyber Infrastructure and Computational Biology in collaboration with partners at other institutions, some of whom were involved in the Human Microbiome Project, an effort to characterize microbial communities found at multiple human body sites and to look for correlations between changes in the microbiome and human health. Weber said that the team began creating the system about two and a half years ago. At the time, NIAID researchers and others at the NIH were interested in exploring how cloud computing resources could be used for scientific research.

"One of the ideas was to see how we could extend [the] investment that was made in HMP," Weber told GenomeWeb. "So we put together a small project team at the time to think about some [ideas and] coordinate with external collaborators who were involved in the HMP who either were associated with hosting the data, the data analysis and coordination center at the University of Maryland, or building tools to use those datasets." They ultimately decided to build a cloud-based platform that would make it easier for the research community to analyze microbiome data.

The system consists of a frontend webserver that runs on Amazon Web Services (AWS) and offers access, primarily, to two commonly used bioinformatics tools: Quantitative Insights Into Microbial Ecology (QIIME) and Mothur, which are both open-source pipelines for analyzing and processing microbial sequences. The system also includes two other pipelines, namely A5-MiSeq, an open-source microbial genome assembly tool; and biobakery, which offers various tools for microbial analysis.

In addition to AWS, the Nephele planning committee also evaluated Microsoft Azure and Google Compute Engine. AWS was the most mature of the cloud platforms evaluated and it had the largest number of research users. Amazon also offered funding to some of the Nephele collaborators through its research grant program, which helped get some development efforts off the ground, Weber said. The researchers have since had discussions with Microsoft about the possibility of making Nephele available on its cloud so that they can, for example, compare the experience of running the platform on this system versus Amazon in terms of costs, performance, and other factors.

Nephele uses AWS Lambda, an Amazon tool that lets users run code without provisioning or managing servers, to spin up compute instances and the requisite bioinformatics tools as well as transfer the data files for processing. The system is architected so that as researchers submit jobs, they are assigned to run on unique servers in the cloud. They generally receive results in about two to 10 hours. "We've seen some good performance and cost savings from using this model as compared to some other architectures that we tried out," Weber said.

For instance, they considered setting up an EC2 instance and then lining up submitted jobs as they came in and running them one after the other. But that proved too expensive and also resulted in longer wait times for users further down the queue. They also considered analyzing usage patterns over time, estimating how many concurrent instances they needed, reserving that number of instances, and then lining up and running jobs on those servers. This offered marginal time savings compared to the aforementioned option but was still not as good as using AWS Lambda, Weber said. "Definitely in terms of time to results, it's much lower in the way that we've set it up and it costs less based on our analysis, as well," he said.
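The wait-time side of that comparison can be illustrated with a small simulation. The following is a minimal sketch, not Nephele's actual code: the function name, the job durations, and the assumption that all jobs arrive at once are hypothetical, and it models only time to results, not cost.

```python
import heapq


def turnaround_times(durations, servers=None):
    """Return each job's time to results (waiting plus running), in hours.

    Assumption: all jobs are submitted at time 0 and served in list order.
    servers=None models one dedicated server per job, as in the per-job
    design described above; an integer models a fixed pool of that many
    servers working through a first-come, first-served queue.
    """
    if servers is None:
        # Every job starts immediately on its own server.
        return [float(d) for d in durations]

    # Min-heap of the times at which each pooled server becomes free.
    free_at = [0.0] * servers
    heapq.heapify(free_at)

    finished = []
    for duration in durations:
        start = heapq.heappop(free_at)   # earliest available server
        finish = start + duration
        heapq.heappush(free_at, finish)
        finished.append(finish)
    return finished


def mean(values):
    return sum(values) / len(values)


# Hypothetical job run times in hours, within the two-to-10-hour range quoted.
jobs = [2, 4, 6, 8, 10]

single_queue = turnaround_times(jobs, servers=1)   # one shared instance
fixed_pool = turnaround_times(jobs, servers=2)     # small reserved pool
per_job = turnaround_times(jobs)                   # one server per job

for label, times in [("single queue", single_queue),
                     ("pool of two", fixed_pool),
                     ("server per job", per_job)]:
    print(f"{label}: mean {mean(times):.1f} h, worst {max(times):.1f} h")
```

Under these made-up numbers, the last user in the single queue waits for every earlier job to finish, a fixed pool shortens that wait, and a server per job removes queueing altogether, which matches the ordering Weber describes.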

Researchers interested in trying out the system can go to the Nephele website and fill out a simple form that asks for details about the analysis they want to run as well as a way to contact them when the analysis is complete. Users can either upload their files directly to the cloud or provide a URL pointing to wherever their data is located. To get around slow upload times or to more easily move large datasets, users can put their files on publicly accessible resources such as Dropbox or Google Drive, Weber said.

Researchers receive five single-use access codes upfront and they can ask for more if they want. There's no official cap on how many codes a user could request during the promotion period, and so far there haven't been any impossible requests, Weber said, but there could be cases where it might make the most sense for users to set up their own Nephele system and foot the storage and compute bill.

"We do have to be somewhat conscientious about how many codes we give out to an individual user. If we were to give 2,000 codes to one person that would eat up a sizable chunk of our budget," he said. "It would give us some feedback that this is the type of user who really wants this resource but we want to be able to share it, at least in the promotional period, with as many diverse users to get that extra data."

Input files are generally small — the largest ones are about five or six gigabytes in size — and users have their pick of the QIIME and Mothur pipelines or bespoke pipelines that the Nephele development team put together based on information from the scientific literature and internally run microbiome analysis projects. Details of each tool are provided on the Nephele website and include diagrams of the different steps involved in running each of the pipelines.

QIIME was developed by researchers at the University of Colorado and elsewhere to capture and analyze large quantities of microbial sequence data. It provides tools for demultiplexing and quality filtering, OTU picking, taxonomic assignment, phylogenetic reconstruction, diversity analyses, and visualization. Additional details are available in a Nature Methods paper published in 2010.

In 2011, QIIME's developers deployed the tool on Amazon's EC2 infrastructure to make it more accessible to users with limited access to local compute power and storage. They also added QIIME to BaseSpace last year in a bid to simplify access to the tool, Rob Knight, a professor of pediatrics and computer science and engineering at the University of California, San Diego, and one of QIIME's developers, told GenomeWeb at the time. QIIME was used to analyze data from the Human Microbiome Project and it's currently being used for the American Gut Project, a crowdsourced, crowdfunded project to collect and study sequences gleaned from samples collected from members of the public in an effort to understand the microbial diversity of the human gut.

For its part, Mothur was developed by researchers in the University of Michigan's department of microbiology and immunology and elsewhere. The software lets users describe and compare data from microbial communities. It incorporates a number of algorithms that have been implemented in existing tools such as TreeClimber and also offers tools for visualizing data, screening sequence collections based on quality, aligning sequences, and calculating pairwise sequence distance among other features. The software is discussed in more detail in a paper published in 2009 in Applied and Environmental Microbiology.

The exact analysis pipeline that a user selects depends, in part, on the type of analysis that the user wants to run or the type of instrument used to generate the data but ultimately it comes down to user preference, Weber said. The pipelines are integrated to some extent because some tools are better for some tasks than others. 

The developers are also mulling ways to allow users to install and run their own pipelines on the Nephele cloud. "I think the best way in the short term is [that] users can spin off their own self-contained Nephele environment if they have their own AWS credentials" although that would require some expertise on the user's part, Weber said. "They could modify [the machine image], put the software that they want to use for their own analysis on that and then modify the scripts." The team is working on making the Nephele source code available so that more expert users can build and expand on it, Weber said.

They are also considering how best to support Nephele after the promotional period including sources of future funding as well as whether the NIAID should continue to be responsible for maintaining and extending the resource long term. "Compared to purchasing infrastructure as a capital expense, managing and maintaining that infrastructure internally, hiring people, and replacing it; renting it on the cloud can be cost effective. We are seeing that," Weber said. "I think there's good promise to this approach and obviously there is federal guidance telling us to go in this direction."

One way to make the tool available after the promotional period might be to place it on AWS Marketplace and let users be responsible for the costs of storage and compute resources. This particular mechanism might be a hard sell conceptually because "we are a government organization and people are typically used to getting things free but [it] is a possible mechanism moving forward," Weber said. They could also set up some sort of hybrid system that will include an option for NIH grantees to access the system using their grant funds. "We'll have to wait for guidance from our management as to what's going to be the most logical way to proceed there," he said.

The development team is also organizing a webinar to be held on March 7 that will introduce users to different kinds of microbial genome analysis including 16S and metagenome sequencing as well as teach them how to use Nephele to perform these types of analyses and what sort of results they can expect to see. The developers are also looking to expand the list of pipelines available in Nephele in future development cycles. That includes adding in new pipelines for things like 18S and fungal ITS analysis, Weber said. "We have a list of probably six or eight in our backlog that we are currently evaluating."

Nephele is similar to at least one other cloud-based microbial analysis platform. The Cloud Virtual Resource, or CloVR, which was developed by researchers at the University of Maryland, Baltimore's Institute for Genome Sciences, offers pre-configured analysis tools and automated pipelines for microbial genome analysis. Like Nephele, the system offers pipelines for 16S rRNA-based analysis; taxonomic and functional analysis of metagenomic whole-genome shotgun sequence data; bacterial single-genome sequence assembly and annotation; and large-scale Blast searches of sequence data. About two years ago, researchers at UMB received a grant from NIH to develop a CloVR-based platform for whole-genome microbial diagnostics in clinical settings.

Weber noted the similarities between the two systems — both platforms offer semi-automated analysis and use virtual machines — but there are differences, he pointed out. For example, Nephele is hosted entirely on the cloud while CloVR is built to run primarily on desktops through tools like VirtualBox, which let users run virtual machines on their own compute and storage resources.

CloVR is available as an Amazon machine image, so users do have the option to run it on the cloud if they want to. However, unlike Nephele, "as far as I can tell, it requires you to have an AWS account, manage the AWS resources (e.g., start up and shut down instances manually) and pay for use," Weber told GenomeWeb in an email. He also noted the similarities in pipelines offered by both systems. "There is a good deal of similarity at a high level in terms of what can be processed and for what purposes, though there are certainly differences in pipelines offered, tools that are used, data files that are accepted, parameters that can be modified, [and so on]," he said.

A separate but similar effort in the UK seeks to provide the academic research community there with cloud-based compute, storage, and bioinformatics algorithms and pipelines for analyzing and making sense of microbial genomes. The Cloud Infrastructure for Microbial Bioinformatics project is a collaborative effort involving researchers at Swansea University, the University of Warwick, University of Birmingham, and Cardiff University. The UK's Medical Research Council has made a total of £8.4 million (about $13 million) over five years available for the project.