NEW YORK (GenomeWeb) – Dutch bioinformatics firm InsideDNA is preparing to launch a Google-based platform it has developed to help life science researchers run and share computational tools for genomic analysis in a reproducible fashion, without requiring bioinformatics expertise or large local compute infrastructure.
The firm unveiled a beta version of its platform, which it describes as "Netflix for the life sciences," about eight months ago and plans a formal launch in September this year. The system currently offers access to more than 1,000 bioinformatics algorithms and programs for analyzing genomic data, including well-known programs for RNA-seq analysis such as TopHat and Cufflinks, for read-mapping such as BowTie and the Burrows-Wheeler Aligner, and for variant calling such as SamTools and the Genome Analysis Toolkit.
Researchers can select tools from the existing menu, or if a tool they would like to use is not there, they can request that the company add it to the platform. Biologists also can use the pipeline manager tool, which features an easy-to-use drag-and-drop interface to combine single tools into more complex analysis pipelines. And there is a command-line access option to data and tools for more computationally savvy researchers as well as an access option for researchers who want to interact with data using programs such as RStudio or iPython notebooks.
Since the system is based on the Google cloud, users have access to virtually unlimited compute resources that they can use for their analysis as well as an easy-to-use implementation of the Google Genomics application programming interface for researchers who want programmatic access to the platform. The platform also links to a growing number of open-data repositories including several managed by the National Center for Biotechnology Information. Researchers can easily connect to these resources, extract information from them, and analyze the data on the cloud.
One of the platform's most important features is a unique tool called interactive methods, or iMethods, which lets scientists package their analysis tools, pipelines, and protocols in Docker containers exactly as they ran them during their analyses and share them with others as part of their publications. Essentially, iMethods lets researchers create "an executable bundle comprising analytical tools, data and tool settings which can [be] published under a permanent or temporal web URL," Anna Kostikova, InsideDNA's founder and chief science officer, explained in an email to GenomeWeb. Researchers can add these URLs to their publications so that when others access their papers and click on these links, they are connected to the relevant tools and pipelines on InsideDNA's platform. Since the tools are already available on the Google cloud, other researchers can simply launch and run them on existing or new datasets.
Kostikova and her co-founder Andrey Khmelevskiy, also InsideDNA's CEO, said they began developing the platform about three years ago in response to concerns in the community about research reproducibility in genomics and life sciences. As a doctoral student at the University of Lausanne, Kostikova ran an evolutionary biology blog where she published some of her own code and method descriptions so that other researchers could rerun her analysis themselves. However, she said that she received almost daily queries from readers asking for her help in setting up her scripts or in running or interpreting their results. Seeking a more automated solution to the problem, Kostikova reached out to Khmelevskiy, a computer scientist by training, and asked for his help.
Specifically, they sought to address two issues, insufficient infrastructure and lack of bioinformatics know-how, which they claimed have contributed to a jump in irreproducible biomedical research results from 15 percent to over 60 percent. Efforts within the bioinformatics community to address aspects of the reproducibility problem include developing the Common Workflow Language, which provides standardized specifications for describing analysis tools and workflows that are intended to make it easier not just to share computational pipelines but also to run them on multiple platforms.
However, it is difficult for biologists with no bioinformatics training to implement and run open-source tools on their datasets, Kostikova told GenomeWeb. Researchers can and do add their code to repositories like GitHub, but there is still some level of expertise required in order for other researchers to download and implement the code. "It's not necessarily possible to run those tools straightaway."
The InsideDNA platform addresses both the expertise and insufficient infrastructure issues by providing access to cheap and virtually unlimited cloud compute and simple tools for putting together analysis pipelines, according to the company. While it might take a biology researcher with no bioinformatics expertise about six months and some $20,000 in direct and indirect costs to analyze 50 sequenced genomes, they can get the results in a few hours with an InsideDNA account and pay about $100 for the analysis, according to the company's estimates.
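To put the company's estimates side by side, the short Python sketch below works through the arithmetic. The figures are the company's own estimates as quoted above, not independent measurements, and the variable and function names are purely illustrative.

```python
# Company estimates quoted in the article (not independently verified).
GENOMES = 50
DIY_COST_USD = 20_000      # direct and indirect costs without the platform
PLATFORM_COST_USD = 100    # estimated cost with an InsideDNA account


def cost_per_genome(total_usd: float, genomes: int) -> float:
    """Return the average cost of analyzing one genome."""
    return total_usd / genomes


diy_per_genome = cost_per_genome(DIY_COST_USD, GENOMES)
platform_per_genome = cost_per_genome(PLATFORM_COST_USD, GENOMES)
cost_ratio = DIY_COST_USD / PLATFORM_COST_USD

if __name__ == "__main__":
    print(f"Do-it-yourself: ${diy_per_genome:.2f} per genome")
    print(f"Platform:       ${platform_per_genome:.2f} per genome")
    print(f"Claimed cost reduction: {cost_ratio:.0f}-fold")
```

By these estimates, the per-genome cost falls from about $400 to about $2, a 200-fold difference.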
Realizing that their solution could be of benefit to the much broader community, Kostikova and Khmelevskiy began taking steps to commercialize the platform. They chose to launch it on the Google cloud platform because it was the most affordable and offered the most flexibility compared to the alternatives. Initially, "we thought of it as a small side project which we could easily code ourselves," Kostikova said. But expansion proved a much bigger challenge than they had anticipated, she said, so about six months into the project they hired a part-time programmer. They have since added more employees, bringing the company's total headcount to 15 individuals, and they will likely hire more in the future.
The iMethods feature, in particular, sets InsideDNA apart from competing cloud-based genomic analysis platforms such as Illumina's BaseSpace and DNAnexus. "In addition, we provide fully integrated access to a big data infrastructure such [as] Hadoop, Spark, Jupyter, and RStudio which lets users leverage data mining and machine learning for genomic data analysis," Khmelevskiy added. Also, unlike BaseSpace, which is tied to Illumina's sequencers, InsideDNA's platform is vendor-independent, so users can work with any kind of sequencing data.
"We believe it has the potential to radically change how scientists run analysis and share their results," Kostikova said. "It should bring much better knowledge transfer within academia and from academia to industry." Since she moved the methods she once shared via her blog to the InsideDNA platform, she said, she no longer receives requests for installation help from researchers who want to use her code.
InsideDNA competes with conventional local clusters and server infrastructure, which some researchers prefer to the cloud for their analysis needs, but both of those are "inconvenient and expensive" for working with genomic data, Kostikova said. "Most researchers admit that it is nearly impossible to do reproducible research within the closed cluster environment," she said. "There is still a lot of education that needs to be done in explaining to people the advantages of cloud infrastructure." To help smooth the process for researchers making the move from cluster-based analysis to the cloud, "we tried to re-create a cluster/server-like experience in the cloud, so users would not need to change much their habits when migrating from local infrastructure to InsideDNA," she added.
Since the soft launch eight months ago, the platform has garnered 1,000 active researchers and 5,000 blog readers. The company has also received requests from researchers who are interested in implementing their tools on its platform, Kostikova said. The company will accept any open-source tools that are released under GPL licenses.
One such researcher is Nadir Alvarez, a professor in the University of Lausanne's department of ecology and evolution. He and other colleagues worked with Kostikova and Khmelevskiy to develop a computational pipeline for analyzing data from degraded DNA samples generated by a sequencing technique developed in his lab. The technique, called hybridization RAD sequencing, is similar to restriction site-associated DNA sequencing, or RADseq, but relies on hybridization capture, Alvarez explained to GenomeWeb. The technique works by using biotinylated RAD fragments from a random fraction of the genome to capture homologous fragments from genomic shotgun sequencing libraries.
For the analysis, the researchers first demultiplex and clean the reads and then map the captured fragments to different types of reference datasets. They correct for DNA damage and then call SNPs. Their pipeline uses well-known tools such as the SOAPdenovo software for assembling reads into contigs, BowTie for read mapping, and vcftools and FreeBayes for filtering and calling SNPs, respectively. Full details of both the laboratory and bioinformatics pipelines and protocols are provided in a paper co-authored with InsideDNA researchers that was published in PLOS One in March this year.
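The order of the analysis steps described above can be sketched in Python as a simple list of stages. This is only an illustration of the workflow as reported here, not the researchers' actual code: the `Stage` class and `describe` function are hypothetical, no real command lines or tool options are shown, and placing SNP filtering after SNP calling is an assumption. SOAPdenovo assembly is left out because its position in the sequence is not stated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    """One step of the analysis; `tool` is None where no tool is named."""
    name: str
    tool: Optional[str] = None


# Stage order follows the description in the article.
PIPELINE: List[Stage] = [
    Stage("demultiplex reads"),
    Stage("clean reads"),
    Stage("map captured fragments to reference datasets", tool="BowTie"),
    Stage("correct for DNA damage"),
    Stage("call SNPs", tool="FreeBayes"),
    # Assumption: filtering is applied to the SNPs after they are called.
    Stage("filter SNPs", tool="vcftools"),
]


def describe(pipeline: List[Stage]) -> List[str]:
    """Return one numbered, human-readable line per stage."""
    lines = []
    for number, stage in enumerate(pipeline, start=1):
        tool = stage.tool or "tool not named in the article"
        lines.append(f"{number}. {stage.name} ({tool})")
    return lines


if __name__ == "__main__":
    print("\n".join(describe(PIPELINE)))
```

Representing the pipeline as ordered data rather than a fixed script loosely mirrors the drag-and-drop idea, in which individual tools are combined into a larger workflow.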
Alvarez said that InsideDNA researchers worked with his team to develop and test the pipeline as well as to benchmark it against other pipelines. At the start of this project, "we had some ideas [for] this pipeline, and [InsideDNA] considered them and then came up with additional tools and benchmarks that really made things easier," he said. "They are really good at adapting to anyone's project and making it as optimal as possible." Also, "I really appreciate the possibility to build a pipeline by dragging and dropping modules," he added.
Other contributors to the platform include the developers of the Classifier based on Reduced K-mers, or CLARK, software, which is used for classifying metagenomics reads at the species or genus level quickly and accurately. Full details of that software were published in a paper last year in BMC Genomics. Rachid Ounit, a doctoral candidate in the University of California, Riverside's computer science and engineering department and one of CLARK's developers, told GenomeWeb that he first heard about InsideDNA after the company uploaded the first two metagenome classifiers of the CLARK series of tools. He and his colleagues are now partnering with the company to integrate more features from the CLARK tools into the InsideDNA platform.
"Compared to other platforms, InsideDNA is very user-oriented and hosts a wide selection of popular tools for sequence analysis, without requiring a strong background in computer science," Ounit told GenomeWeb in an email. "Their effort to provide a single platform [that is] easy to use with high computational resources exploited in the back for a large audience of life scientists is laudable." It's also "a great opportunity for researchers with limited resources to work on published data," he added. "An internet connection and your personal laptop is all you need to get your results."
Researchers can sign up for free trials to test most of the basic functionality of the system, but if they want permanent storage space and more compute resources for larger projects, they will have to pay for it. InsideDNA customers can either sign up for a pay-as-you-go option where they pay only for the compute resources they use or they can sign up for monthly plans.
InsideDNA takes a small percentage of the baseline Google Cloud Compute prices that users pay for computing and storage. Detailed pricing for Google's compute resources is available on the Google Cloud platform website. Exact costs vary depending on factors such as the type of machine needed, the number of CPUs, and the amount of memory used.
The company is also offering free use of the system for educators who want to use the InsideDNA platform in university classes or other online courses, Khmelevskiy said. Teachers can easily set up workspaces for their classes where they can share tools, data, and compute credits for student projects. At least one institution in Bogotá, Colombia, has used the platform for bioinformatics coursework, he said.
Kostikova and Khmelevskiy hope to sustain the company solely from the proceeds that they get from subscription sales as well as from the pay-per-use option. So far, the company is entirely self-funded and the founders do not have plans at this time to seek venture capital money. However, that could change.
"We have spent a lot of time and resources on building this platform, and if there is no way we can support it without additional funding, then we will need to search for additional investment," Kostikova said. "Nevertheless, we would prefer to have someone on board with a good network in academia and industry, so we could get in touch with more potential users and customers or angel investments."