Q&A: EBI's Paul Flicek Discusses Reorganization Plans for Ensembl


Last month, a blog post from the Ensembl development team said that the Wellcome Trust Sanger Institute and the European Bioinformatics Institute planned to reorganize the project to "best leverage the strengths of Ensembl’s parent institutes to capitalize on emerging opportunities in genomics."

According to the post, the reorganization effort would include consolidating existing Ensembl services at the EBI to "facilitate closer links with UniProt, Ensembl Genomes, and the Expression Atlas." The post also stated that the developers would be working on new methods for storing and representing variation data as well as working more closely with projects that focus on providing resources to the clinical community, such as the Database of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources (DECIPHER).

This week, BioInform interviewed Paul Flicek, a senior scientist and team leader of vertebrate genomics at the EBI, about the details of the planned reorganization, including ways Ensembl could be of use in clinical settings. His team, along with their collaborators at the Sanger Institute, develops the Ensembl genome annotation resources and analysis infrastructure.

What follows is an edited version of the conversation.

What prompted the reorganization effort?

The science of genomics is changing from where we were when the project was founded in 1999, which was about creating reference genomes and information on reference genomes. Genomics overall is moving much more into an era of clinical translation and genomic medicine. Ensembl provides a fantastic foundation for that. [Also], the EBI and Sanger themselves have changed. Sanger has a greater focus now on translational aspects including some of the work they've done with the DECIPHER project and projects coming out of their human genetics and cancer programs, and EBI provides a larger and more comprehensive set of biological data resources.

How do you see Ensembl contributing to the clinical and genomic medicine arenas?

Fundamentally, Ensembl is not a clinical tool. It’s a reference resource for understanding what's in the genome. But it's already the case that people take the reference data we provide … and use that in interesting, clever, and valuable ways. We intend to keep Ensembl [in] its reference resource role even as we have closer ties to projects like DECIPHER which are directly appealing to clinicians.

Can Ensembl's resources be used as they are in more clinical-type settings, or do you foresee having to bring in additional sources of information to make them more relevant for the space?

We are always interested in creating the most comprehensive resource for genome annotation possible and we define annotation in the broadest scope from locations of genes, which we do with projects like GENCODE, to annotation of variation, which we have done by integrating lots of different data sources. Going forward, we will continue to look for all valuable and relevant reference data sources and will add those that are valuable.

The blog post you put up a couple of weeks back mentioned a few new activities that the Ensembl team will be working on. Can you talk a little bit about those?

There's one major new scientific direction that will be an Ensembl activity in the future, and that’s methods for storing and representing human variation data to be scientifically led by Richard Durbin. Following from projects like the 1000 Genomes and supporting new efforts to understand and share clinical information like the Global Alliance, questions of how we represent 1,000 or 1 million human genomes and the way the information in the genome exists are very important but also not solved.

For example, the linear genome sequence has enabled many discoveries based on lining up data on a genome sequence that served as a useful reference indexing system. But to go from a reference indexing or a reference coordinate system to something that represents the whole of human variation requires a new way of thinking about these data types. We think it also requires not only new data structures but also new ways of presenting the genome visually and interacting with it programmatically and computationally.

Can you go into a little bit more detail about how you are thinking about storing and representing data?

From the point of view of representation, it's more thinking about any genome as part of a graph-based structure where there are common haplotypes but also population-specific sequences and rare and common variation. The heart of the plan is to represent that in a coherent way.
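
As a rough illustration of the graph-based idea described above, the following Python sketch stores sequence fragments as nodes and haplotypes as paths through them. It is a toy model under stated assumptions, not Ensembl's design, which the interview does not describe: the class `SequenceGraph`, the function `build_example`, the path names, and all sequences are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class SequenceGraph:
    """Toy sequence graph (hypothetical, for illustration only).

    Nodes hold sequence fragments; an edge links two fragments that are
    adjacent in at least one haplotype; each haplotype is a named path.
    """

    def __init__(self):
        self.nodes = {}                 # node id -> sequence fragment
        self.edges = defaultdict(set)   # node id -> ids of successor nodes
        self.paths = {}                 # haplotype name -> ordered node ids

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_path(self, name, node_ids):
        """Register a haplotype and the edges it walks over."""
        for node_id in node_ids:
            if node_id not in self.nodes:
                raise KeyError(f"unknown node: {node_id}")
        for a, b in zip(node_ids, node_ids[1:]):
            self.edges[a].add(b)
        self.paths[name] = list(node_ids)

    def sequence(self, name):
        """Spell out the linear sequence of one haplotype."""
        return "".join(self.nodes[n] for n in self.paths[name])

    def shared_nodes(self):
        """Nodes visited by every haplotype (sequence common to all)."""
        node_sets = [set(p) for p in self.paths.values()]
        return set.intersection(*node_sets) if node_sets else set()


def build_example():
    """Build a small graph with made-up sequences."""
    g = SequenceGraph()
    g.add_node(1, "ACGT")  # shared flanking sequence
    g.add_node(2, "A")     # one allele at a variable site
    g.add_node(3, "G")     # alternative allele at the same site
    g.add_node(4, "TTC")   # shared sequence
    g.add_node(5, "GGA")   # insertion carried by only one haplotype
    g.add_node(6, "CA")    # shared flanking sequence
    g.add_path("reference", [1, 2, 4, 6])
    g.add_path("common_alt", [1, 3, 4, 6])
    g.add_path("population_specific", [1, 2, 4, 5, 6])
    return g


if __name__ == "__main__":
    graph = build_example()
    for name in graph.paths:
        print(name, graph.sequence(name))
    print("shared nodes:", sorted(graph.shared_nodes()))
```

In this sketch the linear reference is just one path among several, so a common variant and a population-specific insertion are represented on equal footing rather than as differences from a single coordinate system.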

One of the things mentioned in the blog post is that you plan to consolidate Ensembl's current services at the EBI and that this would support closer links with resources like UniProt. Can you talk a little bit about how you are hoping to achieve that?

A couple of things: Most importantly, from our users' perspective, nothing's changed. Since its inception, Ensembl has been and continues to be a joint project of the EBI and the Sanger Institute. Some of the activities that have been at the Sanger Institute since the start of the project — including the team that creates the gene annotation … as well as the team that creates the website — will move to the EBI. The new activities that I described at the Sanger will form part of our future directions. Really, the consolidation is more of people. We will continue to provide all the services that we’ve already provided.

What's the timeline for completing the reorganization?

The easy part to answer is people moving and that will happen sometime in the early part of the New Year. The harder part is a timeline for new directions. They are underway from the theoretical point of view and they will become more real relatively quickly as we go into next year. We've been strengthening connections with DECIPHER throughout this year and we'll continue to do so next year.

I'd like to follow up on that last comment. What have you been working on in terms of DECIPHER?

DECIPHER's … informatics infrastructure is shared with Ensembl or strongly based on it. For example, DECIPHER already uses things like the Ensembl Variant Effect Predictor in some of its analysis. It's these activities that we'll strengthen. We will also work closely with DECIPHER after the new human genome assembly comes out — it will be released in Ensembl in summer 2014 — so that they can support data and annotation on that assembly as necessary.

Going back to the post, you mentioned working to align Ensembl with the aims of the Global Alliance. Can you shed some light on how you are thinking about making that happen?

The Global Alliance is designed to enable sharing of secure genomic data. Understanding those data requires the reference resources that we have in Ensembl. Fundamental to the Global Alliance is the fact that people won't copy large files around but will access data via APIs or have other interaction with data resources that might sit in a distributed way. From that point of view, having Ensembl as a Global Alliance-compatible resource such that one could access it using the same kind of programmatic or API calls that you would access other secure datasets with, would [make] Ensembl more valuable in the context of the Alliance and in understanding individual genomes.

Will the new capabilities you are bringing onboard in any way impact the way scientists currently access and use Ensembl?

We have a long history of maintaining compatibility with existing or legacy analysis. The new capabilities should be seen as new features or new ways to do things. But we will be supporting the existing ways to access and interact with Ensembl. It's going to be new things and not direct replacements for existing things.

Is there any funding earmarked for the reorganization effort? If so, how much?

The new tasks are planned from the same core budget. Of course, we do anticipate applying for relevant grants.

