CHOP's Harvest Toolkit Offers Reusable Components for Biomedical Data Exploration

A team of informatics experts and biomedical researchers at the Children's Hospital of Philadelphia has developed a general framework called Harvest that provides bioinformatics developers with reusable tools to create customized applications for exploring and querying information in large multivariate biomedical databases.

According to a paper published last week in the Journal of the American Medical Informatics Association, the Harvest toolkit "facilitates construction of accessible biomedical data discovery applications by providing informatics researchers with open, standards-based components, an adaptable framework for defining domain-specific data concepts, and a user interface design that makes large and complex datasets accessible."

It is composed of three main parts. The first is a data abstraction layer, which helps with things like generating and managing application metadata and indexing text data for searches. Next is an application programming interface that enables web clients to consume data from the data abstraction component. And finally, it has a web client that generates and displays data visualizations such as histograms, bar charts, and pie charts.

These tools are intended, the developers wrote, to help researchers "generate meaningful views of raw data according to their domain expertise and their specific interests; dynamically query key aspects of a dataset based on the inherent characteristics of individual data attributes; combine single attribute queries into multiattribute set operation queries; and provide an actionable endpoint by exporting immediately available raw data in an analysis-ready format."
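To illustrate what combining single-attribute queries into multiattribute set operation queries can look like, here is a minimal Python sketch. It is not Harvest's API: the `attribute_query` helper, the record layout, and the sample values are all hypothetical, chosen only to show the idea of running one query per attribute and then combining the resulting sets of record IDs.

```python
from typing import Callable, Dict, List, Set

# Hypothetical record layout for illustration only; not Harvest's data model.
Record = Dict[str, object]


def attribute_query(
    records: List[Record],
    attribute: str,
    predicate: Callable[[object], bool],
) -> Set[int]:
    """Return the ids of records whose `attribute` value satisfies `predicate`."""
    return {
        r["id"]
        for r in records
        if attribute in r and predicate(r[attribute])
    }


# Made-up sample data.
records = [
    {"id": 1, "age": 4, "procedure": "echo"},
    {"id": 2, "age": 12, "procedure": "cath"},
    {"id": 3, "age": 7, "procedure": "echo"},
    {"id": 4, "age": 15, "procedure": "echo"},
]

# Two independent single-attribute queries.
under_ten = attribute_query(records, "age", lambda v: v < 10)
had_echo = attribute_query(records, "procedure", lambda v: v == "echo")

# Combine them with set operations.
both = under_ten & had_echo            # intersection (AND)
either = under_ten | had_echo          # union (OR)
echo_not_young = had_echo - under_ten  # difference (AND NOT)

print(sorted(both), sorted(either), sorted(echo_not_young))
```

Keeping each attribute query independent means a user interface can let a researcher build one filter at a time and decide afterwards how the filters should be combined.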

The paper also describes two use cases that highlight Harvest's effectiveness and adaptability across different biomedical research settings and data types. In the first, a Harvest-based application was used to analyze data from a cardiology database at CHOP that contained clinical information from 47,300 patients who had collectively undergone 24,900 catheterization and 54,000 echocardiogram procedures. The second application involved a dataset from the Open Medical Record System community with data on infection status, disease management, and clinical laboratory results from 5,300 patients.

In one project not discussed in the paper, Harvest was used to explore hearing impairment and genomic data in AudGenDB, an audiology database funded by the National Institute on Deafness and Other Communication Disorders. In fact, Harvest has its origins in this particular project, according to Michael Italia, manager of applications research at CHOP's Center for Biomedical Informatics and a co-author of the JAMIA paper.

He explained to BioInform that as he and colleagues assembled the tools needed to explore phenotypic, genomic, and other sorts of data in AudGenDB, they decided it made sense to combine what they considered to be general components for biomedical data discovery applications into a reusable framework that could be offered more broadly to bioinformatics researchers. This way, developers wouldn't have to recreate the same capabilities or build one-off solutions for each new project; they could simply reuse Harvest's components and adapt them as needed.

He estimates that Harvest's components take care of about 80 percent of the development work, but this isn't "shrink-wrapped, ready-to-go software," he said. "You have to configure it and you have to make decisions about how you are going to model your data and you are going to have to make some configuration changes and so on."

This flexible approach to biomedical data analysis distinguishes Harvest from similar open-source biomedical platforms that are designed with very specific applications in mind, such as the Informatics for Integrating Biology and the Bedside, or i2b2, platform, which offers tools for analyzing information in patient data warehouses.

Systems like these, the researchers wrote in JAMIA, provide "the convenience of a fixed database model, but at the expense of the benefits provided by normalized relational models such as database-level referential integrity and performance optimization through indexing." They also have "difficulty supporting ad hoc, attribute-centric queries on highly dimensional data, such as clinical and annotated genomic data," the paper states.

Since developing the framework, the CHOP researchers have also used it to analyze next-generation sequence datasets. They plan to publish those results in a separate paper, but they provided some details this week in a poster at the American Society of Human Genetics conference in Boston. According to the abstract, the team used Harvest to build an open-source integrated variant data warehouse, knowledgebase, and analysis suite called Varify, which they used to analyze over 100 million variants in exome and whole genome sequence data collected from several thousand patients.

Varify's capabilities, the researchers wrote, include "patient-specific query and filtration of variants using complex annotation criteria; calculation of allele frequency for custom cohorts; capture of analyst decisions and evidence on pathogenicity; and ad-hoc query of variants across patients based on gene, phenotype, and clinical characteristics." Currently, it's being used by the Newborn Screening Translational Research Network to explore data in the Longitudinal Pediatric Data Resource — an informatics system to enable data collection, sharing, management, and analysis for conditions identified as part of or that may benefit from newborn screening.
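As a rough illustration of what calculating allele frequency for a custom cohort involves, the following Python sketch computes the frequency of one variant's alternate allele across a chosen subset of patients. It is not Varify's implementation: the `allele_frequency` function, the genotype encoding (0, 1, or 2 alternate alleles per patient, `None` for a missing call), and the sample data are assumptions made for this example, and the calculation assumes diploid genotypes.

```python
from typing import Dict, Iterable, Optional


def allele_frequency(
    genotypes: Dict[str, Optional[int]],
    cohort: Iterable[str],
) -> Optional[float]:
    """Alternate-allele frequency for one variant within a cohort.

    `genotypes` maps a patient id to the number of alternate alleles
    (0, 1, or 2), or None when no genotype was called. This encoding is
    hypothetical. Patients without a called genotype are excluded from
    both the numerator and the denominator.
    """
    called = [genotypes[p] for p in cohort if genotypes.get(p) is not None]
    if not called:
        return None  # no usable genotypes in this cohort
    # Each diploid patient contributes two alleles to the denominator.
    return sum(called) / (2 * len(called))


# Made-up genotypes for a single variant.
genotypes = {"p1": 0, "p2": 1, "p3": 2, "p4": None, "p5": 1}

# A custom cohort is simply a chosen subset of patient ids.
cohort = ["p1", "p2", "p3", "p4"]
freq = allele_frequency(genotypes, cohort)
print(freq)  # 0.5: three alternate alleles out of six called alleles
```

Excluding uncalled genotypes from the denominator is one common convention; a production system would also need to handle details such as sex chromosomes and multiallelic sites.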

They're also developing Harvest instances to support medical imaging studies, according to the JAMIA paper. Other development efforts will focus on improving the toolkit's efficiency so that it can handle much larger datasets than it currently does, as well as creating capabilities that will help informatics researchers maintain their Harvest instances, Italia said. They're also hoping that the user community will provide feedback about bugs and software patches, and share with others the customizations they make to Harvest, he said.

Longer-term plans involve expanding Harvest to work with alternatives to relational databases. "There's a lot of stuff going on in the database world with non-relational technologies … [and] we've got our eye on that," Italia said.