NEW YORK (GenomeWeb) – Informatics firm Tamr said last week that it will offer its data integration platform and data science expertise free of charge to researchers affiliated with the White House's Cancer Moonshot Task Force, led by Vice President Joe Biden, to help them organize, unify, and integrate large quantities of genomic and other data for analysis.
Tamr representatives told GenomeWeb this week that researchers and organizations affiliated with the task force could apply for free unlimited licenses to the company's web-based software for use in genomics-based studies and other projects launched as part of the initiative. The company is also offering its expertise in computer science, genomics, bioinformatics, and computational biology to researchers involved in the project.
Based in Cambridge, Massachusetts, Tamr officially opened its doors in 2014. The company is backed by investors such as Google Ventures and New Enterprise Associates, and has raised $41.2 million to date. It offers machine learning-based software for preparing and aggregating data from disparate sources.
"There is a lot of data out there in the world that people are trying to use but it's really hard to unify that data or use it together in a comprehensive way," Nidhi Aggarwal, Tamr's head of product and strategy, told GenomeWeb. Tamr aims to solve that problem by providing tools that enable users to transform and combine datasets in various ways to enable meaningful research in cancer and other contexts.
Tamr isn't strictly a bioinformatics company (its software is used in various domains including information services, automotive manufacturing, and retail), but it has also found specific use cases in the life sciences and pharmaceutical domains, including genomics. Customers in these domains typically fall into two categories, Timothy Danford, a Tamr field engineer, explained to GenomeWeb. Some customers are primarily interested in aggregating and querying clinical data from patients, while others are researchers who want to use the software for more basic or translational research projects. The latter group typically needs to aggregate a much broader set of data types, including genomic and cheminformatics data as well as information from chemical registries.
Tamr's software helps scientists, statisticians, and other analysts build usable datasets out of their raw information. That includes internally generated datasets and information from public repositories, Danford said. "We've focused a lot on combining both the abilities of machine learning algorithms to automate the [integration] process with [input from] people who understand the data as it's collected and who can help drive or guide algorithms to the correct answer," he said.
Customers first use the software to search for and locate the datasets they need for their projects. For example, a customer might use the system to scan internal distributed file systems or databases and identify spreadsheets or text files that contain data that they might need for a clinical trial. Once the relevant datasets have been identified, the software then analyzes and indexes the data. "Basically, it assesses the outline, shape, and quality of the different data sources that are being primed for integration," Danford explained. "We go through and say what does [the data] look like now? What does it look like where we found it? [Then] we build succinct summaries of all of the data characteristics."
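As a rough illustration of what a summary of "data characteristics" can look like, the sketch below profiles a small CSV table using only the Python standard library. It is not Tamr's code or API; the function name `profile_table`, the sample adverse-event table, and the choice of statistics (row count, missing values, and distinct values per column) are all assumptions made for the example.

```python
import csv
import io


def profile_table(text):
    """Summarize a CSV table: row count plus per-column missing and distinct counts.

    Hypothetical helper for illustration only; not part of any Tamr product.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    summary = {"rows": len(rows), "columns": {}}
    for col in reader.fieldnames or []:
        # Treat empty or whitespace-only cells as missing.
        values = [(row[col] or "").strip() for row in rows]
        present = [v for v in values if v]
        summary["columns"][col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return summary


# Made-up sample data: three adverse-event records with two gaps.
sample = "patient_id,event,grade\nP1,nausea,2\nP2,,1\nP3,nausea,\n"
report = profile_table(sample)
for column, stats in report["columns"].items():
    print(column, stats)
```

A profile like this is the kind of compact description an analyst could review before deciding whether a source is complete enough to integrate.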
In the final data integration step, datasets are transformed and mapped into the customer's desired output. Tamr works with semi-structured datasets, and these do not have to be formatted in the same way. Transformation tools within the system let users take datasets that use different schemas, semantics, and so on and change them into formats that are more amenable to integration.
The exact datasets that are integrated depend on the question the customer wants to answer. A user might, for example, want to search for adverse event information for participants in multiple clinical studies. The software is able to extract the information from the source material and present the data in a single table with columns properly aligned and duplicates removed. The data integration process is largely but not completely automated, and some curation is required. However, customers can save the automated portions of the integration pipeline and reuse them to integrate their datasets in the future.
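The adverse-event example can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a description of how Tamr's software works: the two study exports, their column names, the hand-written column maps, and the `integrate` function are all hypothetical, and duplicates are detected by a simple case-insensitive match on the chosen key fields.

```python
def integrate(sources, column_maps, key_fields):
    """Rename each source's columns to a shared schema, then drop duplicate rows.

    Hypothetical helper for illustration only.
    """
    seen = set()
    unified = []
    for name, rows in sources.items():
        mapping = column_maps[name]  # source column -> target column
        for row in rows:
            record = {
                target: row.get(source, "").strip()
                for source, target in mapping.items()
            }
            # Compare keys case-insensitively so "Nausea" and "nausea " match.
            key = tuple(record[field].lower() for field in key_fields)
            if key not in seen:
                seen.add(key)
                unified.append(record)
    return unified


# Made-up exports from two studies that label the same fields differently.
sources = {
    "study_a": [
        {"PatientID": "P1", "AdverseEvent": "Nausea"},
        {"PatientID": "P2", "AdverseEvent": "Headache"},
    ],
    "study_b": [
        {"subject": "P1", "ae_term": "nausea "},
        {"subject": "P3", "ae_term": "Rash"},
    ],
}
column_maps = {
    "study_a": {"PatientID": "patient_id", "AdverseEvent": "event"},
    "study_b": {"subject": "patient_id", "ae_term": "event"},
}

table = integrate(sources, column_maps, key_fields=("patient_id", "event"))
for record in table:
    print(record)
```

Because the column maps are ordinary data, they can be saved and reapplied to later exports, which is the reuse the article describes for the automated portions of a pipeline.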
The actual integration step is moderated by two processes, according to Danford. As datasets are combined, the software applies a series of statistical models that try to summarize all of the processes involved in the integration. That includes steps involved in mapping columns from source datasets into target schemas or in transforming source material into correct vocabularies, Danford explained.
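To make the column-mapping step concrete, here is a deliberately simple stand-in. The article describes statistical models; the sketch below instead uses plain string similarity from Python's standard `difflib` module, which is a much cruder technique chosen only because it is easy to read. The function names, the sample column names, and the 0.6 threshold are assumptions for the example.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a column name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def suggest_mapping(source_columns, target_columns, threshold=0.6):
    """Suggest a target column for each source column, or None if nothing is close.

    Hypothetical helper: string similarity stands in for a trained model.
    """
    suggestions = {}
    for src in source_columns:
        scored = [
            (SequenceMatcher(None, normalize(src), normalize(tgt)).ratio(), tgt)
            for tgt in target_columns
        ]
        score, best = max(scored)
        # Leave low-confidence columns unmapped so a person can review them.
        suggestions[src] = best if score >= threshold else None
    return suggestions


mapping = suggest_mapping(
    ["Patient_ID", "AdverseEvent", "site"],
    ["patient_id", "adverse_event", "grade"],
)
print(mapping)
```

Columns that fall below the threshold come back as `None`, mirroring the point that some curation by people who understand the data is still required.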
Those summaries form the basis of future data integration projects. "If you have a task force and you had a number of data scientists and bioinformaticians who are in charge of building these datasets for analysis and answering specific questions, a lot of what they do with the same raw data sources will be the same thing over and over again," he said. Other researchers interested in searching the same data sources can simply duplicate their peers' efforts. This way, "you get provenance, recording, reproducibility, [and] the ability to understand not only what you did but why you did it," he said.
Provenance is one of the key benefits of the system that should be of particular relevance to researchers involved in the Cancer Moonshot, according to Danford. There are also time and, by extension, cost savings to be derived from using the system. "You can spend more money to make things happen quicker or you can do it on the cheap and often that requires doing it manually," he said. "We think that this will hit the sweet spot of cheap since it's free but also faster than you would have got if you had done it manually or a kind of ad hoc programming effort."
The software also helps address ever-present data-sharing issues, Aggarwal added, noting that Vice President Biden made a point of highlighting that particular problem when he discussed the Moonshot. "[He] pointed out that data sharing and making data interoperable is a key obstacle to doing any meaningful research. That's what Tamr enables you to do," she said.
Customers have the option to install the software on servers in their own data centers, on virtual private clouds, or on Tamr's own servers. For researchers not affiliated with the Cancer Moonshot initiative who are interested in using the company's solution, Aggarwal told GenomeWeb that the company charges per application and customers can purchase these licenses for as many applications as they want. Tamr does not disclose exactly how much it charges per application but the company did say that it has customers paying for licenses that range from $50,000 up to $1 million or more depending on the number of applications.