CHICAGO (GenomeWeb) – In an effort to integrate additional types of data, the New York Genome Center has released a new version of its genomic data visualization tool MetroNome that supports RNA data, allowing researchers to see gene expression in addition to genomic variation for both individuals and patient populations.
The update, released at the end of June, also features side-by-side cohort comparison tools and support for both builds 38 and 37 of the Genome Reference Consortium's human reference genome. Other changes make it more likely that MetroNome will be able to integrate genotype and phenotype data, according to Christian Stolte, a data visualization designer at NYGC.
Behind the scenes, developers are actively working on adding data from the Genotype-Tissue Expression (GTEx) project so MetroNome users will be able to draw on the eQTL dataset. "That basically makes the connection between genomic variation and gene expression," Stolte said.
"One motivation [for creating MetroNome] was to show genomic data in the context of phenotypes," Stolte said. "The other thing was, we want to make all this data accessible to the broadest possible range of scientists."
NYGC believes it accomplishes the latter goal with the user interface on the MetroNome website. "People don't have to write any code. There's no software to install," Stolte said.
The institution has been working on MetroNome for about two years and has had a dedicated URL since last year. "However, we haven't really advertised widely because we were still very much working on adding functionality. Now, we feel like it's at the point where it's really become very useful," Stolte said.
His boss, NYGC informatics chief Toby Bloom, has long wanted to create a data repository and platform for connecting phenotypes with genotypes. When Stolte joined in 2015, he suggested the use of data visualization rather than just offering an application programming interface in order to make the information and tools more accessible.
MetroNome displays phenotypes in diagrams that attempt to show as much information as possible, applying a technique known as parallel coordinates to all variables that can be expressed numerically, including age, weight, and height.
"For each attribute, you draw a vertical axis," Stolte explained. "For each patient, you can find a point on that axis where you can plot the value for that individual."
Next, the system connects the points, drawing a line traversing each vertical axis to create what Stolte called a "web of lines" that shows relationships and helps researchers identify clusters and trends.
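As a rough illustration of the parallel coordinates technique Stolte describes, and not MetroNome's actual code, the following Python sketch scales each numeric attribute onto its own vertical axis and joins each patient's points into one line. The patient records, attribute values, and the function name polylines are invented for the example.

```python
# Hypothetical patient records; the values are invented for illustration.
patients = [
    {"age": 40, "weight": 60.0, "height": 160.0},
    {"age": 60, "weight": 80.0, "height": 180.0},
    {"age": 50, "weight": 90.0, "height": 170.0},
]
axes = ["age", "weight", "height"]  # left-to-right order of the vertical axes


def polylines(records, axes):
    """Return one polyline per record as a list of (axis_index, position) points.

    Each axis is scaled independently so that its minimum maps to 0.0
    and its maximum maps to 1.0.
    """
    lo = {a: min(r[a] for r in records) for a in axes}
    hi = {a: max(r[a] for r in records) for a in axes}
    lines = []
    for record in records:
        points = []
        for x, attribute in enumerate(axes):
            span = hi[attribute] - lo[attribute]
            # If every record shares the same value, place it mid-axis.
            y = (record[attribute] - lo[attribute]) / span if span else 0.5
            points.append((x, y))
        lines.append(points)
    return lines


lines = polylines(patients, axes)
for patient, line in zip(patients, lines):
    print(patient, "->", line)
```

Drawing each list of points as a connected line, for example with a plotting library, would give the "web of lines." Reordering the axes list corresponds to moving an axis in the display.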
Categorical attributes, including gender, ethnicity, and stage of cancer, are processed with a technique called parallel sets. "Basically, you subdivide each attribute into lines or line segments that are proportional to the percentage of individuals that you see that fall in that category," Stolte said, such as a 60-40 split by gender.
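The proportional subdivision can be sketched in a few lines of Python. This is an illustrative example under assumed data, not MetroNome's implementation: the cohort records and the function names segments and ribbons are hypothetical. The second function counts combinations of categories across two attributes, which is the quantity parallel sets diagrams generally use to size the links drawn between axes.

```python
from collections import Counter

# Hypothetical cohort; categories and counts are invented for illustration.
cohort = [
    {"gender": "F", "stage": "I"},
    {"gender": "F", "stage": "II"},
    {"gender": "F", "stage": "II"},
    {"gender": "M", "stage": "I"},
    {"gender": "M", "stage": "II"},
]


def segments(records, attribute):
    """Split one axis into (category, start, end) segments.

    Each segment's length is the share of records in that category,
    so all segments together cover the interval from 0.0 to 1.0.
    """
    counts = Counter(r[attribute] for r in records)
    total = len(records)
    result, start = [], 0.0
    for category in sorted(counts):
        end = start + counts[category] / total
        result.append((category, start, end))
        start = end
    return result


def ribbons(records, left, right):
    """Return the share of records in each (left, right) category pair."""
    counts = Counter((r[left], r[right]) for r in records)
    total = len(records)
    return {pair: n / total for pair, n in counts.items()}


print(segments(cohort, "gender"))
print(ribbons(cohort, "gender", "stage"))
```

With three women and two men in the sample cohort, the gender axis splits 60-40, matching the kind of division described above.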
"We can draw connections between the different dimensions as parallelograms that allow you to show combinations of attributes," he said. With both parallel coordinates and parallel sets, users can move any axis they like to compare various attributes side by side.
"When you click on one of these lines, that then becomes a filter for all the data that's shown in the [user interface]," he said. "For example, what are the numeric attributes like for these patients? What variants are present in the genomes of those individuals?"
With gene diagrams, MetroNome can, for example, plot an axis with genomic coordinates at the bottom, annotating introns and exons. A separate axis at the top shows the protein transcript.
"On that protein diagram, we show annotated functional domains that come from [the] Pfam [database], and we draw connections between the exons and those functional domains so that you can see how one maps to the other," Stolte said. Using RNA data, users can call up a heatmap on a gene-sample matrix sorted by tissue source.
Since MetroNome's launch two years ago, its primary user has been the NYGC-hosted ALS Consortium, a group of about 100 amyotrophic lateral sclerosis researchers from around the world. The MetroNome team is also working with collaborators from the Fred Hutchinson Cancer Research Center who track how many patients with Barrett's esophagus develop esophageal cancer.
For ALS, the development group created a diagram that shows gene expression in the context of the neuroaxis. "You have different areas of the brain colored according to whether gene expression is elevated or reduced for a particular gene," Stolte said.
MetroNome has integrated data from public sources, including the 1,000 Genomes Project and The Cancer Genome Atlas (TCGA), with GTEx to follow soon. Those repositories have varying levels of integration with phenotypic information.
"For 1,000 Genomes, it's mostly population-specific. For TCGA, there's a lot more medical information in there. For the ALS data, it's all the information that was collected by the project," Stolte explained.
The ALS data does lack RNA samples from the control group of healthy, living patients. "People don't like having their brain cut into," Stolte quipped. RNA from ALS patients comes from brain autopsies.
NYGC plans on integrating MetroNome with electronic health records at some point. "Harmonizing all of that data, of course, is a big challenge," Stolte said. NYGC is also actively working with the Human Phenotype Ontology on establishing a standard set of terminologies for ALS.
Other conditions involve a lot of curation. "You have to map terms from one data source to another and try and translate the values or recalculate them if necessary," Stolte said. In the US, HIPAA requires dates to be stripped out by recalculating them as days since birth or days since diagnosis.
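The date recalculation can be illustrated with a short Python sketch. It shows only the arithmetic of replacing calendar dates with day offsets from an anchor event. It is not a de-identification or compliance procedure, and it is not drawn from NYGC's pipeline. The record fields and the function name to_relative are hypothetical.

```python
from datetime import date

# Hypothetical clinical record; field names and dates are invented.
record = {
    "diagnosis": date(2015, 3, 1),
    "biopsy": date(2015, 3, 31),
    "follow_up": date(2016, 3, 1),
}


def to_relative(record, anchor_field):
    """Replace calendar dates with whole-day offsets from an anchor date.

    The anchor date itself is dropped, so no absolute date remains
    in the returned dictionary.
    """
    anchor = record[anchor_field]
    return {
        f"{name}_days_since_{anchor_field}": (value - anchor).days
        for name, value in record.items()
        if name != anchor_field
    }


print(to_relative(record, "diagnosis"))
```

The same function would produce days since birth if the record carried a birth date and that field were passed as the anchor.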
Outside users can start putting their own clinical data into MetroNome now, but it is a manual process.
"Down the road, we will be working on a more automated procedure that allows you to directly upload data," Stolte said, though he acknowledged that this may be impractical for large datasets. "One way we are cutting down on the size of data is that we are asking people to send us VCF files instead of the raw sequencing data," he said. That reduces file sizes by a factor of about 1,000.