Q&A: Sanger's Julian Parkhill on the Challenge of Keeping Genomic Databases Up to Date and Stable


In a recent paper, researchers at the Wellcome Trust Sanger Institute and the European Bioinformatics Institute proposed a three-tier structure for keeping reference genome information up to date and ensuring a steady stream of funding to maintain these genomic resources for the long term.

As it becomes easier and less expensive to sequence entire genomes, researchers are generating more and more genomic data as well as functional information such as gene expression and protein data. While there is a surfeit of algorithms and software to analyze and assemble this data, ultimately, to make sense of the information, researchers must turn to reference genomes.

And therein lies the rub, according to the Sanger/EBI researchers, who note that “the ability of the scientific community to maintain such resources is failing as a result of the onslaught of new data and the disconnect between the archival DNA databases and the new types of information and analysis… in the scientific literature.”

In the proposed structure, each tier builds on the one below it. On the first tier are laboratory-based specific databases that generate and analyze data primarily for publication; the second tier is made up of clade-specific biological databases; and the third tier amalgamates datasets from tier-two databases. The authors also identify funding sources for each tier, and note that “creating funding schemes that deliberately span two tiers is optimal.”

The structure proposed in the paper isn’t entirely novel, the authors concede, but is rather “in many ways a formalization of current best practices particularly in the model organism databases.”

Peter Good, program director of genome informatics at the National Human Genome Research Institute, agreed with the authors’ point that the structure outlined in the paper “exists right now to some extent” but said he would have liked to see more of an emphasis on communication “vertically” and “horizontally” in the paper.

“They argue that there is a problem with coordination and it’s not clear how calling these things tier one, tier two, and tier three is going to change that coordination,” he told BioInform.

Julian Parkhill, a researcher at the Sanger Institute and one of the authors of the paper, spoke to BioInform last week about the difficulties of updating older genomic information, the need to better link new and old information, and to focus funding where it is most needed. Below is an edited version of the interview.

Give me some background on why you wrote this paper.

We’ve been talking about the issues in the paper for a long time. I think the paper came about because we’ve been trying to work out how we can address this problem. I think everyone recognizes that once genome sequences are produced [and] put in a database, the information in them decays unless it’s kept up to date. It’s very difficult to address because if you go to a funding agency or if you have funds yourself and you have a choice between maintaining old data or getting new data and doing interesting new science, the funding will always come down [on the side of] the interesting new science.

At the same time, I think everyone recognizes that more and more, we are relying on reference genomes and the annotation and analysis of reference genomes. I think people thought for a while that very high-throughput sequencing and new technology would almost make [reference genomes] obsolete or make the problem go away, but in fact it’s become greater because all of the new sequencing technologies, to a greater or lesser extent, require comparison against a reference either because you are doing RNA sequencing or because you are doing variation detection. Therefore it’s become very clear that with very high-throughput new sequencing technologies, it’s become even more important that we have good reference information, and I think there’s a fundamental disconnect between the data that’s stored in the archival databases, which are archival databases for a very good reason, and the subsequent information that collects in people’s labs and in the literature about the function of genes and the genomes.

The mechanisms that are put in place to solve the problem of connecting the genomic information through to the functional information in the literature are kind of ad hoc. They’ve grown up in different ways [and] tend to be concentrated on well-supported model organisms so there is a big gap between the reference genomes that people go to and the up-to-date information on function that’s in the literature.

I think what we tried to do in the paper is identify ways that [the problem] can be solved not by waving a magic wand and creating a system that will solve it but by identifying ways that you can direct funding, the ways that you can build structures that will solve it. I think the fundamental problem that needs to be solved is that of transfer of information through from new experiments that are being done back onto the archival data that’s in the databases and sort of closing the loop between the experimental data and the genomic data.

So what we have tried to suggest are ways of funding that, rather than funding databases here and experimental work there, bridges the gaps, promotes the movement of data, [and] promotes this cycle of getting the data through onto the archival genomes.

In your paper you mention that there has been some success with open source resources. Could you elaborate?

I think an open source [resource] is one way of making sure that the information is easily viewable [and] is easily linkable. I think open source is fundamentally a solution to the problem.

In the end, one of the problems is, you can’t do what [open source resources] do for model organisms for everything. So for the Saccharomyces Genome Database and some of these well-constructed model organism databases, they solve the problem by having large numbers of people who read their literature and attach information back to the genome. Now you can’t do that for every organism. So what you need is to promote a system whereby the links become almost automatic, that you have open source information and you encourage the linking of information. So the database itself doesn’t store the information but it shows you where to get it and it shows you how to link to it.

Do you incorporate open-source resources into your structure?

I am not sure we specifically state it but it is implicit in the structure. All the resources have to be open source; otherwise it’s not really going to work.

The paper was published recently, but has there been a response to any of the suggestions in the paper?

I have not had any direct response but it only came out last week. But as I say we have been discussing it with a number of people over the years and I think most people seem to be very supportive of the model when you put it to them. I think in some ways it’s not a novel idea; this is the structure that’s being built around some of the model organism databases. I think the fundamental idea is that you have this kind of middle layer of what we called clade-specific databases or biological databases where you have a direct interaction between the people generating the data and the people who are interested in that particular organism who can collate data [and] make those links.

That happens in many places [where] people set up databases of their own interests. I think what we are trying to encourage is that those databases, rather than existing by themselves and being the end point, become a middle link so that the data that’s curated in those databases is then linked back into the large-scale archival aggregated databases. So there is this continuous flow of information rather than collecting information in one place.

You mention two issues in the paper — ensuring that bioinformatics resources are up to date and stable, and creating mechanisms to give end-users access to raw data — but you only focus on the first. Are there any plans to address the second issue?

I am not sure we will necessarily write a paper about it, but there [are] a number of potential solutions people are exploring. One is more and more use of cloud computing. The issue fundamentally with these very large datasets is [that] it's almost impossible for an individual or small group to download all the data they want to analyze and do the analysis locally. The solution to that, which is starting to become possible through these sorts of cloud computing approaches, is rather than downloading the dataset and analyzing it locally, they move their analysis software to the dataset, do the analysis locally and then just pull back the results.

For example, there is an instance of Ensembl in the Amazon cloud so that if you want to do large-scale computing across all of the Ensembl datasets, you don’t have to download the whole of Ensembl and you don’t have to go and work with a computer at EBI. You can buy, effectively, a small amount of processing time on the Amazon cloud, the data is there, and you can do your processing in the Amazon computing cloud and download the results. I think that’s a solution that a lot of people are looking at so that the analysis goes to the data rather than the data going to the analysis.

You mentioned in the paper that there was a workshop in 2008 and you said that some aspects of the model were discussed there. Could you tell me a little bit about what was discussed?

There was a workshop in London that was sponsored by the Wellcome Trust to look at these issues of how we keep genomic data up to date and how we structure genomic information, particularly in reference genomes. This is essentially the model that was presented to the meeting and it was refined after input from people at the meeting, but essentially I think again the people at the meeting were quite supportive of the model as it came out.

They were representatives of funding agencies and databases and researchers from the UK and from the US.

I spoke with Peter Good at NHGRI and he pointed out that while he agrees that there is a problem with coordination, he doesn’t quite see how your structure improves communication vertically and horizontally. Could you flesh that out a bit more?

The proposal is that funding agencies encourage — and support — proposals that explicitly link tiers, and thus promote transfer of data between them, rather than funding only standalone databases.

An example would be a joint application for funding for a clade-specific database that included co-funding for one of the top-tier aggregating databases and specifically funded data transfer to the aggregating database and consolidation of the data in that database. Another example could be funding for a [tier-one] data-generation group that included some specific funding for amalgamating the data into a relevant tier-two database.

Clearly, this already goes on to a certain extent, and we are trying to highlight this as the best way to promote integration.
