NEW YORK (GenomeWeb) – FactBio, a developer of bioinformatics software to help pharmaceutical companies and academic research groups explore, manage, and share scientific data, opened for business this week.
The company, which was registered a year ago, is developing its first product, which it expects to launch by the second half of next year. The so-called Knowledge Sharing Platform, or Kusp, provides tools for integrating public and proprietary datasets, as well as for tracking information related to biological entities of interest, such as genes, pathways, proteins, and scientific papers, in so-called virtual BioBuckets. CEO James Malone told GenomeWeb this week that the company intends to beta-test the platform in the first quarter of 2016 with a number of unnamed early adopters that will be a mix of researchers from large pharma and academia.
When the system launches, customers will be able to track a limited number of biological entities in a set number of BioBuckets for free; the exact numbers are still being determined. To create additional buckets and host a larger number of entities, users will be required to pay a yet-to-be-determined subscription fee, and the company is mulling different price points for academia and industry. FactBio also plans to release an application programming interface for more computationally savvy users, access to which will likewise require a subscription fee.
FactBio developed Kusp in response to a perceived need among life sciences researchers for tools to effectively combine public and private datasets, as well as to link siloed information internally, Malone told GenomeWeb. "I have worked with quite a few different groups, who bring in public data themselves into an internal system to some bespoke pipeline that they've built and ... there's a legacy software core, and costs attached to maintaining that, and they all do this in lots of different ways," he said. Furthermore, different data types might require different pipelines, so labs have to install various types of software to support their research projects.
This scenario might be repeated in different labs operating within a single company, resulting in increased costs and lost opportunities for collaborative research. "It's quite possible one lab is working on something that's relevant to another lab in the same organization [and] they just don't know," Malone said. "This is something that really is a [missed] opportunity, one that may get worse as data becomes cheaper to produce."
Kusp offers a potential solution to these issues, not just by providing a forum for combining internally generated datasets with external data contained in databases such as Ensembl, Entrez, and PubMed, but also by enabling users to capture valuable information around their datasets that provides needed context and meaning. Specifically, the system semantically connects biological entities into large networks that help to elucidate the connections between things like genes and pathways. Moreover, it lets researchers make statements about their datasets in a computationally formal way, and these statements are captured as metadata within the platform. For example, they could note that the activity of a given gene regulates or knocks out a second gene.
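To make the idea of formal statements concrete, the following is a minimal sketch of how such assertions could be stored as subject-predicate-object triples and followed as a network. It is not FactBio's implementation, which the company has not described at this level; the names Statement, EntityGraph, and the predicates shown are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Statement(NamedTuple):
    """One formal assertion: subject --predicate--> obj (hypothetical model)."""
    subject: str
    predicate: str
    obj: str


class EntityGraph:
    """Toy network of biological entities linked by statements."""

    def __init__(self):
        # Maps each subject to the statements that start from it.
        self._outgoing = defaultdict(list)

    def add(self, subject, predicate, obj):
        statement = Statement(subject, predicate, obj)
        self._outgoing[subject].append(statement)
        return statement

    def statements_about(self, entity):
        """Return the statements in which `entity` is the subject."""
        return list(self._outgoing.get(entity, []))

    def reachable(self, start):
        """Return every entity connected to `start` by a chain of statements."""
        seen = set()
        stack = [start]
        while stack:
            node = stack.pop()
            for statement in self._outgoing.get(node, []):
                if statement.obj != start and statement.obj not in seen:
                    seen.add(statement.obj)
                    stack.append(statement.obj)
        return seen


# Placeholder identifiers, not real genes or pathways.
graph = EntityGraph()
graph.add("GENE_A", "regulates", "GENE_B")
graph.add("GENE_B", "participates_in", "PATHWAY_X")
connected = graph.reachable("GENE_A")
```

In this sketch, asking what is reachable from GENE_A returns both GENE_B and PATHWAY_X, which is the kind of indirect gene-to-pathway connection the article describes.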
All of this information is stored in so-called BioBuckets, which are essentially private repositories where users can hold information related to specific investigations or projects. For example, a researcher could create one to hold a list of genes, or a list of diseases of interest. Datasets placed in these buckets are automatically linked to pertinent information from the broader pool of public data that the Kusp platform has access to. These links are automatically updated as new information becomes available in the public domain. Users have the option to keep their buckets private or to open them up to collaborators.
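As a rough illustration of the bucket concept, the sketch below models a private repository that holds entities, can be opened to collaborators, and refreshes its links against a pool of public records. The BioBucket class, its fields, and the public_records dictionary are assumptions made for this example and do not reflect Kusp's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class BioBucket:
    """Hypothetical private repository for entities tied to one project."""
    name: str
    owner: str
    entities: set = field(default_factory=set)
    collaborators: set = field(default_factory=set)
    links: dict = field(default_factory=dict)

    def share_with(self, user):
        """Open the bucket to one collaborator."""
        self.collaborators.add(user)

    def can_view(self, user):
        """Buckets are private unless the user is the owner or a collaborator."""
        return user == self.owner or user in self.collaborators

    def refresh_links(self, public_records):
        """Re-link every entity to whatever the public pool currently holds."""
        for entity in self.entities:
            self.links[entity] = sorted(public_records.get(entity, []))


# Placeholder data standing in for public sources.
public_records = {"GENE_A": ["paper-1"]}
bucket = BioBucket(name="genes of interest", owner="alice", entities={"GENE_A"})
bucket.refresh_links(public_records)

# A new public record appears; refreshing picks it up.
public_records["GENE_A"].append("paper-2")
bucket.refresh_links(public_records)
```

Calling refresh_links again after the public pool changes is a stand-in for the automatic updating described above; a real system would presumably trigger this without user action.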
For now, the platform is limited to human data only, but the company plans to expand it to include data from model organisms, Malone said. "We are not focusing on a specific disease at the moment, we are just trying to essentially integrate as much of the type of data that people want for now," he said. The ontologies used to describe data in the system cover a wide range of diseases, although some ailments, such as cancers and autoimmune diseases, have much richer vocabularies than others by virtue of how much is available about them in the public domain. Other plans include adding manual curation capabilities, so that users can cross-check the automatic links between their data and existing information, correct mistakes, and feed their edits back into the system, he said.
In addition to Kusp, FactBio has plans to develop a range of other bioinformatics products. Among them is software that will be integrated with laboratory information management systems to capture data close to the data generation source, such as sequencers and other kinds of laboratory instruments. At issue here are the varying formats, vocabularies, and ontologies that are used to describe data after it has been generated, which make the data difficult to use, Malone explained. "[When] we look at databases that have been around for ten years ... you have to try and pick them apart and work out what [the] headers in this column mean in this database that came off some LIMS four years ago."
Applying consistent vocabularies and ontologies to the data early in the experimental process could ameliorate this challenge and simplify the task of adding new datasets as needed. "It's about making sure that everything that comes out of the LIMS system is fully described, fully annotated in this kind of semantic way," Malone said. He added that FactBio will likely offer this product as software separate from Kusp but that the two products would be integrated.
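One simple way to picture annotation at the source is a lookup that maps whatever column headers a LIMS export uses onto a single controlled vocabulary, flagging anything it cannot interpret. The sketch below assumes a hand-written mapping; HEADER_VOCABULARY, annotate_record, and the header and term names are all hypothetical, and a production system would more likely draw on published ontologies than on a hard-coded dictionary.

```python
# Hypothetical mapping from raw LIMS headers to controlled-vocabulary terms.
HEADER_VOCABULARY = {
    "gene": "gene_symbol",
    "gene name": "gene_symbol",
    "sample id": "sample_identifier",
    "sampleid": "sample_identifier",
    "conc (ng/ul)": "concentration_ng_per_ul",
}


def annotate_record(raw_record):
    """Rename known headers to vocabulary terms and collect unknown ones.

    Returns a tuple of (annotated record, list of headers with no mapping).
    """
    annotated = {}
    unmapped = []
    for header, value in raw_record.items():
        # Normalize whitespace and case before looking the header up.
        term = HEADER_VOCABULARY.get(header.strip().lower())
        if term is None:
            unmapped.append(header)
        else:
            annotated[term] = value
    return annotated, unmapped


# Placeholder export row with inconsistent headers.
raw = {"Gene Name": "GENE_A", " SampleID": "S-001", "col7": "?"}
annotated, unmapped = annotate_record(raw)
```

Here the opaque header "col7" is reported rather than silently dropped, so someone can resolve it while the experiment is still fresh instead of years later.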
Besides providing its software, FactBio also offers consulting services, developing bespoke solutions based on its technology to help clients integrate their in-house data, Malone said. Prices vary depending on the customer and the size of the project.
The company also plans to offer bioinformatics training events, starting with one in March 2016 in London that will focus on ontology development and deploying semantics in search and querying, he said. A second event, focused on data integration using graph structures, is scheduled for the summer.