By Vivien Marx
This article was posted on March 9.
BioDiscovery has turned to the Amazon Elastic Compute Cloud to support a collaborative repository for genomic variation data that is linked with its Nexus Copy Number analysis software.
Last month, the company released the cloud-based Nexus DB, a repository for storing, querying, and sharing genomic variation data that is intended to provide researchers with "secure off-site storage" along with automated backup and version control. It lets users query results across multiple projects of their own, as well as datasets from collaborators.
Bert Eussen, a researcher in the department of clinical genetics at Erasmus University Medical Center in Rotterdam, the Netherlands, told BioInform via e-mail last week that he has been using Nexus DB as a beta tester since last June and that the database "functions as an independent backup" for storing data, which eliminates the need to create a custom database.
For small and medium-sized research groups and labs around the world, the cloud-based database offers "a low-cost data-sharing tool without the need to develop or buy a custom database solution," Eussen said.
Nexus DB will also be an important step toward a shared repository of genomic variation data, he said. "Even commercial array vendors can share their reference files," he noted.
Nexus DB joins a growing list of public domain databases that have arisen to meet the growing demand for information on copy number variants and other structural genomic-variation data, such as the Database of Genomic Variants hosted by Toronto's Center for Applied Genomics, and the Wellcome Trust Sanger Institute's Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources, or DECIPHER.
Soheil Shams, president of BioDiscovery, told BioInform that Nexus DB is "a complementary solution" to these resources. The main difference, Shams explained, is how the repository is accessed. With resources such as DECIPHER, scientists find information via a browser, whereas with Nexus DB they locate information through the software, an approach that changes the way displays are generated and the options for computational analysis, he said.
Nexus DB is built on the Amazon Elastic Compute Cloud and its data is stored in Amazon's Simple Storage Service. "It is our own home-built database that is optimized for handling genomic region queries across many thousands of samples," Shams said.
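To illustrate what a "genomic region query across many samples" involves, the following is a minimal Python sketch, not BioDiscovery's implementation. The Aberration record, the function names, and the sample data are all hypothetical; the example simply finds the samples with a copy number call overlapping a requested interval by scanning every record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aberration:
    """One copy number call for one sample (hypothetical record layout)."""
    sample_id: str
    chrom: str
    start: int  # inclusive coordinate
    end: int    # inclusive coordinate
    kind: str   # "gain" or "loss"


def overlaps(call, chrom, start, end):
    """True if the call shares at least one base with chrom:start-end."""
    return call.chrom == chrom and call.start <= end and start <= call.end


def samples_with_aberration(calls, chrom, start, end, kind=None):
    """Return sorted sample IDs with a call overlapping the region.

    If kind is given, only calls of that type ("gain" or "loss") count.
    """
    hits = set()
    for call in calls:
        if overlaps(call, chrom, start, end) and (kind is None or call.kind == kind):
            hits.add(call.sample_id)
    return sorted(hits)


# Made-up calls for illustration only.
calls = [
    Aberration("S1", "chr17", 35_000_000, 36_000_000, "gain"),
    Aberration("S2", "chr17", 37_500_000, 38_000_000, "loss"),
    Aberration("S3", "chr8", 127_000_000, 128_000_000, "gain"),
    Aberration("S4", "chr17", 35_900_000, 37_000_000, "gain"),
]

gain_samples = samples_with_aberration(calls, "chr17", 35_500_000, 36_500_000, "gain")
```

A linear scan like this would not scale to the thousands of samples Shams describes; a production system would more plausibly rely on an index such as an interval tree or genomic binning, though the article does not say which approach Nexus DB uses.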
BioDiscovery has added security features to the database to regulate individual and group-level access. Users access Nexus DB from within the firm's Nexus Copy Number software and there are three levels of data access: One level is the user's own data that is only accessible to the owner; the second is public data that BioDiscovery has aggregated from a number of resources, which is accessible by any Nexus DB user; and the third level is group access, which is available for "shared interest groups" around disease areas such as autism, breast cancer, and brain cancer, Shams said.
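The three access levels Shams describes can be pictured with a short Python sketch. This is an assumption-laden illustration rather than the actual Nexus DB security model: the Scope, Dataset, and User types and the can_read function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    PRIVATE = "private"  # visible only to the owner
    GROUP = "group"      # visible to members of a shared interest group
    PUBLIC = "public"    # visible to every user


@dataclass(frozen=True)
class Dataset:
    name: str
    scope: Scope
    owner: str = ""   # used when scope is PRIVATE
    group: str = ""   # used when scope is GROUP


@dataclass(frozen=True)
class User:
    name: str
    groups: frozenset = frozenset()


def can_read(user, dataset):
    """Apply the three-level rule: public, owner-only, or group membership."""
    if dataset.scope is Scope.PUBLIC:
        return True
    if dataset.scope is Scope.PRIVATE:
        return dataset.owner == user.name
    return dataset.group in user.groups


# Hypothetical users and datasets.
alice = User("alice", frozenset({"autism"}))
bob = User("bob")

own_data = Dataset("alice-project", Scope.PRIVATE, owner="alice")
shared = Dataset("autism-consortium", Scope.GROUP, group="autism")
public = Dataset("published-reference", Scope.PUBLIC)
```

In this sketch, alice can read all three datasets, while bob can read only the public one.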
The publicly available data in Nexus DB comes from the scientific literature, genome-wide association studies, the Cancer Genome Atlas project, the Gene Expression Omnibus, and other publicly available datasets for specific diseases, Shams said. Researchers can query the database to identify specific samples that meet their research criteria and then download that data to their local project for further study.
The public functionality is helpful, Eussen said, "because with this option you can use the published data directly in combination with your data using the same analysis settings."
Shams said that BioDiscovery chose Amazon to host the database because of the "robustness" of the EC2 system, its security, and "most importantly, the development tools available." He said that the company's development team spent several months learning and experimenting with the various EC2 tools.
Amazon has "just introduced a versioning feature and we are now already supporting this so we can roll back the database to any point in time as all data is stored and version controlled," Shams said. "We were able to do this very quickly because we took advantage of the development that Amazon had done."
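The idea behind rolling a database back "to any point in time" can be sketched in a few lines of Python. The example below is a toy in-memory store, not Amazon's versioning API and not BioDiscovery's code; the VersionedStore class and its methods are hypothetical. Every write is kept as a new version, so the state at any earlier point can be read back.

```python
class VersionedStore:
    """Append-only key/value store that keeps every version of every key."""

    def __init__(self):
        self._history = {}  # key -> list of (version_number, value), oldest first
        self._clock = 0     # global, monotonically increasing version counter

    def put(self, key, value):
        """Store a new version of key and return its version number."""
        self._clock += 1
        self._history.setdefault(key, []).append((self._clock, value))
        return self._clock

    def get(self, key, at=None):
        """Return the latest value of key, or its value as of version `at`."""
        for version, value in reversed(self._history.get(key, [])):
            if at is None or version <= at:
                return value
        raise KeyError(key)

    def snapshot(self, at):
        """Return the whole store as it looked at version `at`."""
        state = {}
        for key in self._history:
            try:
                state[key] = self.get(key, at)
            except KeyError:
                pass  # key did not exist yet at that version
        return state


store = VersionedStore()
v1 = store.put("sampleA", "segmentation-v1")
v2 = store.put("sampleB", "segmentation-v1")
v3 = store.put("sampleA", "segmentation-v2")
```

Here store.get("sampleA") returns the newest value, while store.get("sampleA", at=v2) returns the earlier one. The sketch does not handle deletions, which a real versioned store would also need to record.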
In a sense, uploading the data to Nexus DB is a form of pre-publication, Eussen said. Linking one's own data and results to other data from the community increases the chance that "people will be interested in your results [and] hits," he said.
If the cloud-based data service "can be linked to a discovery storage/cloud with proper metadata content, there are huge opportunities to develop new services and applications," Eussen said.
Shams said that BioDiscovery developed Nexus DB as a response to the data-sharing needs of institutions and consortia. "It allows users to enhance their dataset by querying for specific aberrations or phenotypes within results from the growing community-contributed data," he said in a statement.
He also sees the resource as a progression of the science itself, since experimental methods have made it possible "to easily and cost-effectively generate high-resolution copy number maps for any sample," which fuels more demand for such experiments, he said.
"Now having the ability to store and mine this data is just a natural next step," Shams said.
Another contributor to the demand for these services has to do with the complexity of the data. Finding copy number changes in healthy individuals "is complicating research," and scientists must determine if a particular aberration is normal or perhaps linked to a disease, he said.
Although there are several existing repositories that serve this purpose, such as the Database of Genomic Variants and DECIPHER, they all "have certain limitations," he said.
The "key" difference between Nexus DB and other web tools such as DGV and DECIPHER is that these public repositories are accessed using a browser as opposed to a software tool, Shams said.
The ability to access data through the Nexus Copy Number software "makes a huge difference in both the type and quality of displays we generate and, more importantly, the computational processes that we can perform on the data," Shams said.
Shams noted that BioDiscovery does not intend to duplicate the functionality of existing databases. DECIPHER and DGV offer "extensive sample information," which Nexus DB users can directly access from within the software, he said.
In addition, in contrast to existing resources, "the ability of creating consortia in Nexus DB has enabled us to create population-specific groups," such as a Northern European "normal" group and a Chinese population, which "can be very helpful in reviewing samples to make sure a matching 'normal' reference is being used," Shams said.
There are other new CNV repositories on the horizon. As BioInform sister publication BioArray News reported last October, the National Institutes of Health awarded a $3.5 million "Grand Opportunities" grant to Emory University and the International Standard Cytogenomic Array Consortium to develop an online database of copy number variation information related to abnormal phenotypes.
This database is also intended to complement the DGV, which focuses on CNVs from normal populations that are considered to be benign, rather than abnormal phenotypes. The Center for Applied Genomics at the Children's Hospital of Philadelphia also hosts a database of CNVs found in healthy individuals.
Shams cited this growing list of databases as proof that existing resources are inadequate. The Emory effort, for example, is intended to be a database "for just a subset of array types and phenotypes," he said.
Shams said that he expects Nexus DB will help "accelerate" growth for the company's platforms, "as we are creating communities of collaborators that can share results regardless of physical location or even [the] array type they have used."
Shams said Nexus DB includes features to enhance collaboration. Researchers who don't have a license to Nexus Copy Number can get a reader client at "low cost," he said, which will allow them to download data from the repository.
Eussen said that he and his colleagues are still determining what sort of data they will share through Nexus DB "and how it can be shared as a group."
That policy is still being adapted as Erasmus MC experiments with new storage systems. For now, Eussen and his colleagues store all data locally in the central storage and compute facility at Erasmus MC.
The center is collaborating with HP to test "new types of extreme data-storage systems including a meta-index application," he said. In this system, all data will be stored in its native data format along with metadata on who generated the data, clinical information, and copy number analysis results.
Eussen said that as the HP storage system unfolds, he expects that he and his colleagues will have the opportunity "to create a virtual storage policy" for sharing research data across different labs.