NEW YORK (GenomeWeb) – The principal investigators of the iPlant Collaborative, a data management, storage, and computation platform for plant science, have rebranded the project, changing its name to CyVerse, which they believe better reflects the platform's capacity and their current mandate from the National Science Foundation to offer services to life sciences researchers more generally.
"We use the buzzword 'democratizing access'," Parker Antin, a CyVerse principal investigator and a professor at the University of Arizona's College of Medicine, said in a conversation with GenomeWeb this week. "NSF is very much supportive of this, and one of their goals is to democratize access to their most sophisticated resources which previously were available only to a small subset of scientists."
The new moniker, a contraction of the words cyber and universe, also better reflects the growing popularity of the platform among researchers in domains outside of the life sciences, including ecology, climate sciences, and astronomy, according to the PIs. When the iPlant Collaborative launched in 2008 with a $50 million grant from NSF's Directorate for Biological Sciences, the four participating institutions had an initial mandate to provide computational infrastructure for use in the plant sciences. The project is led by the University of Arizona in partnership with the Texas Advanced Computing Center, Cold Spring Harbor Laboratory, and the University of North Carolina Wilmington.
At the time, there was a "significant investment in data generation but there wasn’t a counterpart to the analytics that went with it," Nirav Merchant, director of the University of Arizona's biocomputing facility and one of the PIs, told GenomeWeb. "That was the motivation for the NSF to put out a call for creating an organization that would support the next generation of analytics needs for the plant sciences community." As a first step, the researchers spent nearly two years running about a dozen community workshops to explore the most pressing needs for plant science research and what sort of cyberinfrastructure could help address those needs.
These workshops offered the PIs an opportunity to hear from groups with various foci, including education, training, and outreach; computational analysis of plant data, including sequence data; and traditional STEM disciplines. Feedback from these groups was framed as grand challenges for the plant science community and used to direct the collaborative's development efforts. "A lot of them were focused around better management of the data [and the] ability to create applications that were specific to their needs," Merchant said. "They were the bottlenecks to letting people address grand challenge level questions ... and that's what we need to make sure we put out to the community."
The result is an infrastructure that is being used by both plant and animal scientists. The platform is composed of a series of building blocks that can be put together into tools that address specific research needs. When users create accounts, they gain access to resources such as the discovery environment, which provides hundreds of bioinformatics applications, including apps for phylogenetic, transcriptomic, genomic, and network analyses. Also available is DNA Subway, which offers education, outreach, and training materials for introducing undergraduate and high school students to genome and transcriptome analysis. It features simplified pipelines for tasks such as gene annotation and phylogenetic analysis. Other resources include the Bio-image Semantic Query User Environment, which provides tools for sharing and analyzing biological images and metadata, and the Data Commons, which provides space for sharing data with collaborators and depositing data in public repositories.
Underlying these tools is a data management infrastructure where uploaded datasets are hosted and stored in a way that makes them easier to share and analyze, and that provides different environments for running analysis applications including high-performance computing systems and cloud-based infrastructure. There are also application programming interfaces that offer access to platforms and tools that let users work back and forth between their local systems and the CyVerse platform. Full details of the platform are described in a PLOS Biology paper published this week.
Currently, there are about 30,000 registered users, with about 400 to 500 active users per compute platform per week. Typically, when they sign up for accounts, users are allocated about 100 gigabytes of storage space, but they can apply for more, Merchant said. A similar arrangement exists for access to compute space. To apply for more resources, researchers have to provide justification; for example, if they need additional space, they have to provide details about their data management plan and strategy.
Since its launch, the platform has supported projects such as The Arabidopsis Information Resource; the Planteome Project, an international collaboration to develop reference ontologies and applications for plant biology; Gramene, an open-source repository for comparative functional genomics in crops and model plant species; and the Comparative Genomics resource, an online resource of data and tools to analyze and visualize next-generation sequencing data. It has also supported research into the genetic response of grass species to viral infections and an effort to assemble and annotate the pineapple genome. Beyond plant science, it has been used by the bioinformatics coordination program of the US Department of Agriculture's National Animal Genome Research Program, which worked with the collaborative to implement a variant calling pipeline for the livestock research community. It also handles data from a plant breeding project under the Genomes to Fields initiative, a public research effort to improve maize production.
When the NSF renewed iPlant's funding in 2013 with another $50 million for five years, the agency noted the broader spectrum of life science users accessing the platform and expanded the collaborative's mandate, asking it to support the broader life sciences community. "The problems that we encountered in plant science, the tools that they needed, the problems they were solving with big data, those ran across the biological sciences," said David Micklos, an iPlant PI and executive director of the DNA Learning Center at Cold Spring Harbor Laboratory. "We had infrastructure that was ready to handle broad biological sciences, and the name was becoming in some sense unfortunate because people were wondering if our tools worked for all of their organisms."
But the platform is also proving attractive to non-life scientists. There are climatologists, for example, who run climate prediction models and combine them with other kinds of data using the collaborative's infrastructure, Merchant said. One group of geoscientists was working with computational models that generated hundreds of terabytes of data, which they needed to share with collaborators at multiple institutions. They gravitated towards the iPlant platform because it supported the data format they use, NetCDF, and because it provided the components they needed for sharing and analyzing data. There's similar interest from researchers in astronomy and other domains. "It's very interesting for me to see," Merchant said. "When we started iPlant, I went to these same communities to learn how they were analyzing this data, and we took the same tools that they were using and we extended them, and now they are turning around and taking our extensions and doing more with it, so it’s a beautiful cycle."
It's a testament to the openness and flexibility of the iPlant platform, according to its developers, who point out that there never was anything inherently plant specific about it. "It was just a label that was put on it to make it plant-science specific ... we don't draw a fence or a box and say 'if you aren't plant sciences, you cannot get an account or you cannot use our infrastructure'," Merchant said. "If you want to do science, and you want to work as a team, as a community, and be fairly open about it, here's how you can start using our infrastructure."
The CyVerse PIs expect to continue to largely support researchers in the life sciences and to maintain a presence at meetings such as the Plant and Animal Genome (PAG) and the Intelligent Systems for Molecular Biology conferences. "Our roots are still in life sciences, so it's not like we'll be frequenting the astronomy conferences," Merchant said. However, he added, "I very much expect our short term or even long term future will be a little bit beyond life sciences." Some of that will be due to non-life science CyVerse users presenting research at their specific conferences, thus providing concrete examples of how the infrastructure could be of use to potential users in their domains. "They almost become our ambassadors at the meeting," he said.
But the investigators are exploring the possibility of attending some new meetings where there might be interest in their tools, for example the Association of Biomolecular Resource Facilities annual meeting, as well as some computational sciences conferences, Merchant said. They're also exploring opportunities in the biomedical sciences arena, which Antin said the group is being "highly encouraged" to expand into. "They have huge needs in the biomedical space for data analysis capabilities ... one of our mission goals is to be integrated there and to better serve that very large research community."
They also intend to beef up their education and training efforts. Previously, Micklos told GenomeWeb, the collaborative has run a mix of virtual and face-to-face meetings with thousands of researchers, faculty, and students. They also continually gather feedback from attendees at meetings about their data use to inform the training they provide. "When we go to a conference, like PAG, we survey students, faculty, and researchers and 90 percent ... will tell you that they now use, or within 18 months will be using, a large scale dataset," he said, "but two-thirds of those same people will tell you that they don't have much experience in bioinformatics or computation [and] that they don't have sufficient computation resources at their institutions to analyze these kinds of datasets. That's the mission of iPlant. It's to give all these people access, and then to train them on how to find and use these resources."
One of the group's main outreach efforts at present targets college faculty who are trying to scale up experimentation, particularly at the freshman and sophomore level, Micklos said. Datasets from repositories such as the Sequence Read Archive and tools like DNA Subway are freely available and offer solid, low-cost opportunities for student research. "We've been involved in the past and we will increasingly be in the future," he said.
Other efforts by the group include implementing a version of the CyVerse infrastructure in the United Kingdom to provide dedicated resources for computing, storage, application development, and training for researchers there. The initiative, which was first announced last year, is a collaboration between CyVerse and scientists at the University of Warwick, the University of Liverpool, the University of Nottingham, and The Genome Analysis Centre. It is funded by the UK's Biotechnology and Biological Sciences Research Council, which provided about $2.6 million to implement the infrastructure. The group is also looking at opportunities to bring the infrastructure to researchers in South America, Antin said, as well as ways to integrate it with similar existing resources.