Wellcome Trust Centre for Gene Regulation and Expression,
College of Life Sciences,
University of Dundee
Jason Swedlow is co-organizer of this year’s Genome Informatics meeting, jointly run by Cold Spring Harbor Laboratory and the Wellcome Trust Sanger Institute (see BioInform’s coverage of the conference in this issue).
His research at the Wellcome Trust Centre for Gene Regulation and Expression, in the College of Life Sciences at the University of Dundee in Scotland, is part biology and part informatics.
To study cellular machinery such as the interaction of nuclear components of the cell, the choreography of mitosis, the formation of chromosomes, and chromosome proteomics, he applies different types of digital fluorescence imaging.
Swedlow has also been working out ways to manage and analyze this information. To that end, he and a group of collaborators have developed the Open Microscopy Environment, or OME, an informatics platform designed to make microscopy a matter of capturing and quantitatively measuring images.
Other team members are Ilya Goldberg, who directs the image informatics and computational biology unit at the National Institute on Aging; Peter Sorger, systems biology professor at Harvard Medical School; and Kevin Eliceiri at the Laboratory for Optical and Computational Instrumentation, LOCI, at the University of Wisconsin-Madison.
At Swedlow’s home institution in Dundee, construction has expanded the university’s physical plant, connecting three life-science buildings into a single long facility. The goal is to put around 750 staff members in informatics, microscopy, and biology in even closer proximity, enabling the kind of collaboration that, he told BioInform, is essential in all areas of bioinformatics.
An edited version of the conversation follows.
How does Genome Informatics compare these days to other meetings like RECOMB and ISMB?
One thing about this community that I appreciate and applaud is a very strong ethic of openness. If you want to get beaten up, come give a presentation at this meeting where your work, your results, and your software are not available as you present. Even if a complex dataset is not ready yet, tell us where it is going to be and when that is going to happen.
Openness and an insistence on making data, tools, and results available all play an important role at this meeting.
Is the meeting more focused on tools or more about biology?
In the early days there was a big emphasis on tool and resource development. As time has gone on, there has been a movement toward presenting biological results and insights obtained using these tools.
Since I began coming to this meeting, for example, I was interested in seeing more proteomics and imaging appear, and, along with others, I offered other ideas and suggestions. For this meeting we have a range of sessions, including alignments and algorithms, imaging atlases, combining informatics and imaging, epigenomics, a focus on Drosophila and C. elegans, and gene-expression atlases in the mouse.
We wanted to see if we could take classic bioinformatics and also move it into biology and applications: for example, image-based gene-expression analysis of all developmentally regulated genes in the context of the organism, where you are doing in situ hybridization across the entire embryo, across a range of developmental stages. So there is a developmental component, a temporal component, and obviously there is a spatial component.
The sessions cover geeky subjects like assembly and annotation, which have been here since the meeting began. Those are coming from the genome project scientists, such as Sanger and RIKEN and others.
The data-management session is one that Michele Clamp, Jim Kent, and I instigated and is a pretty geeky, hardcore kind of session. Then there is epigenomics and high-throughput sequencing. How are we going to take these short reads and turn them into anything useful?
Last year, a session on medical pathogenic genome analysis with a bias toward human and animal pathogens set off some bells. In each case researchers were saying, ‘We see trends which may or may not have to do with pathogenic organisms, or which might be common to all organisms, because of the bias toward sequencing pathogenic organisms.’ That field is exploding, because you can now say, ‘Let’s sequence every malarial strain out there.’ Or all infective yeasts. That was a session that I walked away from thinking this field is going to change the way we understand disease. This is what is happening in many areas.
The bioinformatics tools are now being used to understand critical questions in human health and biology. One area we haven’t done much of but we could is plant informatics, the genomics of agriculture, and that can also touch upon applications. The future is very exciting.
When you don’t have your conference-organizing hat on, you work on imaging and informatics on a rather large scale. How do you split your time?
It’s about 50 percent in cell biology and 50 percent in informatics these days. The two synergize. On any given day one may be more than the other, but I am really trying to do both. Developing software in the context of real science helps a huge amount, and being able to offer the software tools really helps the science as well.
Microscopy is our assay system; it is how we make our measurements.
You work on a large scale. What kind of challenges does that present?
All the microscopes we have brought in, currently 12, are part of the facility. Two have been purchased and at least one more is scheduled to be purchased. These are various versions of multi-dimensional, high spatial- and/or temporal-resolution imaging systems. Basically, they all end up having their sweet spot for specific applications, so you can use the microscope you need for a given measurement in an experiment. They are heavily used.
One of the major problems in funding science is funding infrastructure. For example: a salary for the person running the light microscopy facility and a salary for the person to run mass spec, salaries for people running all the computing infrastructure. When you are talking about high-performance technology, you are talking about serious levels of sophistication, so you have to pay people pretty well. With university salaries, that can get hard. We were able to obtain funding to pay close to 10 people and get maintenance contracts for our equipment.
In the US doing those things is nearly impossible; in the UK there are a number of organizations willing to fund these kinds of things, and the Wellcome Trust, quite frankly, stands out.
The Open Microscopy Environment involves several institutions and has company involvement. How does that all come together?
Dundee is OME’s center of development. All the development resources and the majority of developers are either based at Dundee or are working through Dundee.
Partnering organizations include the Center for Bio-Image Informatics at the University of California, Santa Barbara, and Vanderbilt University’s School of Medicine. We are working very heavily with the LOCI group based in Madison and the National Institute on Aging, where [Ilya Goldberg and his colleagues] are doing a lot of work on machine learning for applications that we are integrating into our informatics tools.
A number of companies have become involved in various ways, expressing their support, such as Leica, PerkinElmer, Zeiss. They don’t do that lightly. Some are releasing our software. For example, Applied Precision, for a little over a year, has been releasing our OMERO data-management software inside their microscopes. If somebody buys a Delta Vision microscope, inside there is our data-management server. Simplistically put, this is another way for us to get our software into people’s hands. Other companies I don’t want to name just yet are busy building our software into their instruments.
It is amazing to hear large companies say they want to sell a software product with an open API so the scientist-customer can bring in any applications they develop and hook them in, and to see them take on a community-led standard for those interfaces. That is pretty unusual, and it is so gratifying to hear from our end.
Another area you are interested in is open tools and storage for all the imaging data being generated in facilities such as yours. Sounds a bit like second-generation sequencing challenges.
Completely. It’s that same nightmare, again and again and again. There are many anecdotes but when people ask, ‘What’s the major problem in this business?’ the answer everyone gives is: data management.
The community has found that storing data is neither that hard nor terribly expensive. You can go out and get, for not tons of money, disk arrays with a nice set of flashing blue lights up front, convincing everyone they are doing all sorts of nice stuff as they suck a lot of power. But most people realize you need something more than the fact that you have been able to write the data down somewhere.
In an imaging experiment we do these time-lapse, 3D, multi-channel movies over, say, 48 hours on multiple points on multiple samples. So in some ways the numbers are small, but I have ten 15-GB files, each of which is a complex time-lapse movie. And I did another set last week. I can store them, sure. But how do I keep track of them? Which ones are good, which ones are bad? Which ones showed phenotype A, which ones showed phenotype B? When did phenotype B happen?
People have gotten very obsessed with the fact they can store data, but then how do I present that data in any useful way?
I have a bunch of 15 GB files, how am I supposed to show them? Should I send a collaborator a 250 GB disk and say, ‘There you are’?
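One minimal answer to the ‘which movie showed which phenotype’ question is a small metadata index kept alongside the raw files. This sketch, with invented paths and fields, only illustrates the idea; OME’s actual data-management server does far more than this.

```python
import sqlite3

# In-memory index of large movie files; the raw multi-gigabyte
# image files stay on disk untouched, only metadata is indexed here.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE movies (
    path TEXT PRIMARY KEY,
    acquired TEXT,       -- acquisition date
    quality TEXT,        -- 'good' or 'bad'
    phenotype TEXT,      -- e.g. 'A', 'B', or NULL
    onset_hours REAL     -- when the phenotype appeared, if known
)""")

rows = [
    ("exp1/movie01.ome.tif", "2008-10-01", "good", "A", 12.5),
    ("exp1/movie02.ome.tif", "2008-10-01", "bad",  None, None),
    ("exp2/movie01.ome.tif", "2008-10-08", "good", "B", 30.0),
]
db.executemany("INSERT INTO movies VALUES (?,?,?,?,?)", rows)

# "Which good movies showed phenotype B, and when did it happen?"
hits = db.execute(
    "SELECT path, onset_hours FROM movies "
    "WHERE quality='good' AND phenotype='B'"
).fetchall()
print(hits)  # [('exp2/movie01.ome.tif', 30.0)]
```

The point is that the questions in the paragraph above are queries over metadata, not over the pixel data itself, so they stay fast no matter how large the underlying files grow.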
What kind of informatics challenges are you facing? How is the field of image bioinformatics emerging?
There is a lot still to do in genomics; miRNAs, copy-number variation, and other discoveries continue to be made. Large institutes like the Sanger are installing second-generation sequencing machines, which will generate important data. That technology will require huge data processing, and the data-analysis needs it is going to generate are critical. It’s a whole new generation of problems.
Here is what the classic bioinformatics community has missed: The fundamental principle of genome informatics was that you have a resource, a single dataset. You have the genome. It is publicly available and everyone agrees that is the thing we are going to operate on.
There is no analogy for that in experimental biology. A student of mine generates somewhere between half a terabyte and 4 terabytes of data during their three- to four-year tenure in the lab. That is their data and it all needs to be analyzed, but it isn’t clear that it is a community resource. It’s not clear that the community wants to see that. It is hypothesis-driven, experimentally derived data.
Here is a big transition. It will be interesting to see what classic bioinformatics does. There was the idea there were going to be a small number of data producers and all the rest were going to analyze data. Now everyone has a second-generation sequencer already. Everybody is producing the data and doing experiments on the data. So how is it that we are going to work together?
In imaging we think there may be 75 or 80 different proprietary file formats out there, meaning in common use by some reasonable number of people. In genomics there is basically one file format, FASTA. Every sequence in the universe is written in FASTA. There is nothing that exists like that in experimental biology. The proteomics field is full of commercially driven data acquisition systems. Their file formats are whatever they decide they need to be.
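As an illustration of why a single shared format matters, a complete FASTA reader fits in a few lines, which is exactly why so much genomics tooling interoperates. This sketch is illustrative only and is not drawn from any tool mentioned in the interview.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into (header, sequence) pairs.

    Each record starts with a '>' header line; the sequence may be
    wrapped across any number of following lines.
    """
    records = []
    header, seq_parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq_parts)))
            header, seq_parts = line[1:], []
        else:
            seq_parts.append(line)
    if header is not None:
        records.append((header, "".join(seq_parts)))
    return records

example = """>seq1 test
ACGT
ACGT
>seq2
TTGG
"""
print(parse_fasta(example))
# [('seq1 test', 'ACGTACGT'), ('seq2', 'TTGG')]
```

No equivalent few-line reader exists for imaging, where each of the dozens of proprietary formats needs its own decoder, which is the gap the rest of this answer describes.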
This afternoon I did an experiment on the Orbitrap [mass spectrometer system]. There’s 20 GB of data, and guess what? A substantial number of the proteomics analysis tools have, as their interface, a web browser. That is a great idea, except that the digested [version] of that 20 GB of data is still 15 MB, which no modern browser is meant to display. Browsers aren’t built to take huge multi-megabyte text files and display them, and they don’t do it very well.
There are all kinds of problems, [and] this is an example of a trivial problem, easily solved, and things we have solved in OME. What happens when you want to do analysis on any of that kind of data? What happens when I have data at my location, many hundreds of gigabytes of files, and a data-analysis person has been developing algorithms and we want the analysis to be tested on my data, but I am running on Linux and the data-analysis person has written their programs on Windows and it is all in C++ and so on. That might sound stupid but that is the real world.
So what can and should be done to enhance collaboration on the tool-making side?
Not everything is Java. Almost all of the data-acquisition software systems run on C/C++. So there are a lot of issues.
After almost nine years of OME’s existence, there are no standardized file formats. The key, in our experience, is standardized interfaces: software tools provide interfaces that in turn provide a window into something. That is what we put our effort into, our file-formats library, which reads something on the order of 60 different file formats. It is a labor of love chasing file formats.
We can’t get everybody into a room to agree on a standard, so what we are going to say is: there is going to be a single interface, which everybody can agree on.
People on one side of the fence can do anything they want, and so can people on the other side; both are writing to a common API. In imaging we feel that with OME we have a reasonable start on this with Bio-Formats, a library for reading and writing microscopy file formats.
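The ‘single interface, many formats’ pattern described above can be sketched as an abstract reader API. All class and method names here are hypothetical, chosen only to illustrate the dispatch pattern a library like Bio-Formats uses; they are not its real API.

```python
from abc import ABC, abstractmethod


class ImageReader(ABC):
    """Common interface: every format-specific reader implements the same API,
    so code on the analysis side of the fence never sees format details."""

    @abstractmethod
    def can_read(self, filename: str) -> bool:
        """Report whether this reader handles the given file."""

    @abstractmethod
    def read_planes(self, filename: str) -> list:
        """Return the image planes (placeholder strings in this sketch)."""


class TiffReader(ImageReader):
    def can_read(self, filename):
        return filename.lower().endswith((".tif", ".tiff"))

    def read_planes(self, filename):
        return [f"plane from {filename}"]


class DeltaVisionReader(ImageReader):
    def can_read(self, filename):
        return filename.lower().endswith(".dv")

    def read_planes(self, filename):
        return [f"plane from {filename}"]


# Registry of format-specific back ends behind the one shared interface.
READERS = [TiffReader(), DeltaVisionReader()]


def open_image(filename):
    """Dispatch to the first reader that claims the file."""
    for reader in READERS:
        if reader.can_read(filename):
            return reader.read_planes(filename)
    raise ValueError(f"no reader for {filename}")


print(open_image("cells.dv"))  # ['plane from cells.dv']
```

Supporting a new proprietary format then means adding one reader class to the registry; nothing on the analysis side changes, which is the agreement the interview describes.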