Developers of InSilico DB, a publicly available repository of genomic datasets, are preparing to launch a commercial service based on the resource that will let paying customers compare their proprietary RNA sequence data with public information and analyze it using open source analysis tools such as the Broad Institute’s GenePattern.
Members of the research team, primarily affiliated with Belgium’s Free University of Brussels, have formed a new company, dubbed InSilico Genomics, that has licensed the platform from the university and will support the future development of InSilico DB as well as handle its commercialization, David Weiss, the firm’s president and CEO, told BioInform this week.
The service, which will launch when the Brussels-based company officially opens its doors next month, specifically targets customers in academia and industry who want to keep their data private, he said. The company also expects the service to appeal to smaller labs that don’t have the resources to invest in the kind of infrastructure needed for large-scale RNA-seq analysis.
Customers will be able to upload their raw RNA sequence data to the system, which uses software tools such as TopHat, Cufflinks, and CummeRbund to process the data prior to analysis, he explained.
Once the processing step is complete, paying users can compare their datasets to existing RNA-seq data culled from public resources and then export them into external programs like GenePattern or an R/Bioconductor program developed by the InSilico DB team for further analysis.
The company will initially test the service on a small subset of its current user base, selected on a first-come, first-served basis, and then “scale up,” Weiss said. Pricing will be decided on a case-by-case basis, he added.
Users who are willing to share their RNA data with the community, by contrast, will continue to have free access to InSilico DB datasets and its external resources, which also include software available through the Broad’s GenomeSpace (BI 5/4/2012).
The group has published a paper in Genome Biology that provides detailed information about InSilico DB.
The researchers began developing the platform about five years ago to provide a resource that would handle “low-level tasks” such as data downloads and formatting, explained Weiss, who is also the principal investigator on the project. It also helps solve the genomics community’s data fragmentation problem by bringing disparate datasets into a single location, he added.
Furthermore, users have access to open source tools in an environment that’s free from the restrictions that accompany commercial offerings such as Compendia Bioscience’s Oncomine database — now owned by Life Technologies (BI 11/12/2012) — Weiss noted.
Also, while companies like CLC Bio and Omixon offer data processing algorithms with their software programs, these algorithms are proprietary, so “it’s just a black box” and “there is no reproducibility,” he said.
Since InSilico Genomics collaborates directly with the developers of tools like GenePattern, users can access these algorithms in a more “transparent” fashion, he said.
Finally, while DNAnexus offers access to the National Center for Biotechnology Information’s Sequence Read Archive (SRA), the company’s offering doesn’t include preprocessed datasets that are ready for analysis, and it doesn’t have the same link to external analysis resources that InSilico Genomics can offer, Weiss said.
So far, the team has gathered, curated, and annotated microarray and next-generation sequencing datasets from research groups and public repositories including the SRA, the Gene Expression Omnibus, the Cancer Genome Atlas, the Broad Institute, and Gemma, a database and software system that provides data from gene expression profile studies.
As of this summer, InSilico DB contained 6,784 public datasets accounting for 214,880 samples, of which 3,382 datasets and 151,131 samples have been manually curated, according to the Genome Biology paper.
The company will continue to accept curated datasets from the community to include in InSilico DB, Weiss said.
Currently, the database is being used at organizations including GlaxoSmithKline, Pfizer, Roche, Mayo Clinic, and the Broad.
At the Broad, for instance, data from InSilico DB was used in a research project that explored a simplified method of gene set enrichment analysis. It has also been used in a study done by researchers at the Free University of Brussels, which compared published and unpublished gene expression datasets in thyroid cancer.