Many have used the vocabulary of mapmaking to describe the human genome sequence, but Juan Enriquez, director of the Life Sciences Project at Harvard Business School, has bolder cartographic aspirations for that linear string of As, Cs, Gs, and Ts. Enriquez is literally tracking the course of genomic data across the globe.
Enriquez and his colleagues recently completed a preliminary map that charts which countries and domains accessed huge chunks of GenBank, DDBJ, and EMBL via FTP in September, October, and November 2000 and 2001. During this time, researchers across the globe downloaded 43 terabytes of data — 32.7 terabytes from GenBank alone. The vast majority of the data — 92 percent — went to users in 10 countries, and half of those users were in the US. Europe accessed just 22 percent of the data, and no country in Africa, Latin America, the Middle East, or Asia (except Japan) downloaded one percent or more. In addition, Enriquez and his colleagues found that commercial users downloaded about half of all the GenBank data, while universities downloaded only 38 percent.
BioInform recently spoke to Enriquez about his mission to map the genomic “download-ome.”
What was the motivation behind this project?
This is the first time that anybody’s asked if there is a customer for this stuff, and, if so, who is the customer? There’s been a lot of push, but the question that’s interesting to us is, ‘Where is the pull coming from?’ There are a couple of things that are important about the map. The first is that we’re not looking at individual gene downloads. So we’re not tracking whether a doctor is interested in BRCA2. What we are interested in finding out is who can take a big data dump — just take a core of this stuff and work with a billion letters — because I think there is a difference between somebody who is a specialist in genes and is looking at specific gene sequences across a particular disease vs. somebody where the firehose opens and the stuff gets drawn in and dumped.
Part of the reason why we did that was because you have much less of a privacy issue. We anonymized everybody so there was no individual data in this thing. You don’t want to know who is looking at specific disease genes because you’re going to run across patient privacy. So we worked with these centers [NCBI, DDBJ, EMBL] very carefully to make sure that this data remained anonymous.
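The aggregation Enriquez describes, summing traffic by top-level domain while discarding anything that identifies an individual user, can be sketched in a few lines. The log sample and hostnames below are invented for illustration; real GenBank, EMBL, and DDBJ transfer logs have richer formats, and this is only a minimal sketch of the idea.

```python
from collections import defaultdict

# Hypothetical FTP transfer-log entries: (client hostname, bytes transferred).
# These hostnames are made up; real logs would be far larger and messier.
SAMPLE_LOG = [
    ("node1.example.edu", 5_000_000_000),
    ("lab.example.com", 12_000_000_000),
    ("mirror.example.ac.jp", 3_000_000_000),
    ("node2.example.edu", 2_000_000_000),
]

def top_level_domain(host: str) -> str:
    """Return the last label of a hostname, e.g. 'edu', 'com', 'jp'."""
    return host.rsplit(".", 1)[-1].lower()

def aggregate_by_tld(log):
    """Sum bytes per top-level domain. Individual hostnames are discarded
    in the output, so no user-identifying data survives aggregation."""
    totals = defaultdict(int)
    for host, nbytes in log:
        totals[top_level_domain(host)] += nbytes
    return dict(totals)

totals = aggregate_by_tld(SAMPLE_LOG)
```

Mapping country-code domains like `.jp` to countries, and deciding what to do with generic domains like `.com` and `.net`, is where most of the real cartographic judgment would lie.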
We are interested in large-scale data flows in the same way some of the early maps of the Internet told you where the pipes went and what the thicknesses of those pipes were. I want to stress that this is not a perfect map; it’s a first crack at a map, so there are centers that probably mirror this information, but the folks that we talked to didn’t think that the mirrors were a significant part of this system.
But there are some really interesting questions brought up by this, like if these are the largest public libraries [for genomic information] in the world today, why is it that so few people are accessing this free data?
So you found the number of downloads to be smaller than expected?
Not smaller than expected, because the volume of downloads is huge. The volume of downloads is eight to nine times what people are downloading from the Library of Congress, so these are really large libraries, some of the most interesting libraries in the world today, and they’re free. That’s a really neat thing to have as a resource for humankind. On the other hand, there’s a really interesting question here: If this stuff is freely available, why is nobody out there in Asia, besides Japan, or Africa or Latin America reading large-scale databases? That doesn’t mean there isn’t good research going on in these places, but it does mean that the ability to handle really large-scale datasets may not be there.
Is that an IT infrastructure issue or an educational issue?
That’s actually some of the stuff that we’re focusing on writing now. We’re doing a couple of articles about what [these findings] imply in terms of long-term competitiveness across things like agriculture, and chemistry, and energy, and cosmetics, and foods, and pharma, and biotech.
So it wasn’t obvious that it was a technical issue — that researchers in these regions don’t have the computational capacity to download these datasets?
Downloading it is not really the problem. It’s being able to cope with really large-scale datasets. It’s, ‘What do you do with this stuff?’ Because the skill sets in wet and in silico biology are very different. So particularly in countries or companies that are conservative, that have been doing things very well in one way for a long time, it gets very hard to make some of these shifts, and to understand entirely new, very large datasets that are suddenly there. So you have these really odd situations, where if you look at stuff being downloaded from Europe into Canada — over 90 percent of the stuff from Europe into Canada goes to three educational institutions, three .edu domains. These are interesting patterns of communication. They tell you what these networks start to look like. They start to tell you who’s talking to whom, they start to tell you who’s reading what on a national level.
You found that .com domains downloaded more genomic data than .edu domains. Why did you find this surprising?
With a leading-edge technology like this, you’d expect the first users and the first researchers to be academic researchers. It takes time to figure out, on leading-edge technologies, what you’re going to do with these really large databases. It used to be that basic science research would be done at universities, and one of the things that is interesting to us is that a lot of the interesting, leading-edge research is not just being carried out now in universities. To the extent that it’s been hard for some of the top PhDs to get positions in universities…suddenly a company shows up and says, ‘Look, you can do really interesting, leading-edge research in the stuff you’re doing and, by the way, there’s the potential that you can really make a profit doing this.’
You looked at the data from 2000 and 2001. Isn’t it possible that the commercial organizations stopped downloading as much data in 2002? There was so much hype about the value of this information for biotech and pharmaceutical companies, but that died down in the last year and a half.
Yes, but that’s happened across a whole series of industries. It happened across the pharmaceutical industry with all the headlines, ‘Is interferon a cancer cure?’ It happened across the computer industry, and happened across the car industry, and across the railroad industry. You see this burst of companies coming up and then consolidating down. Yes, there was a dot-com bubble, and it burst, but did you quit using e-mail? Did you quit researching online or buying online? I suspect you’re going to see the same thing in biotech. There will be a bunch of bioinformatics companies that may not be there tomorrow, but bioinformatics is not going to go away. And the usefulness of this data is not going to go away. On the contrary, I think as you annotate, and have more genomes and such, the more people that have it and the more people that are literate, the more of a network you build with it.
All three of the databases you tracked are supposed to have the exact same data, so why is the number of downloads from GenBank so much larger than from EMBL and DDBJ? I’m surprised that more European users don’t just use EMBL.
There are a few things going on. One is ease of access. Sometimes people find the interfaces in one place or another easier, and things show up faster. Also, when you look at where things are actually being deposited, you can deposit the data anywhere, but most of it is being deposited into GenBank — it’s not just being accessed via GenBank, but it’s being deposited into GenBank. Partly that’s force of habit — ‘I was trained in a lab that did it this way, and this is the protocol, and I do it’ — but another is just who makes it easy to do what. The third thing that’s important is the annotation — and the annotation and the articles that are linked to some of this stuff are not identical across databases.
Did you look at all into how people are using genome browsers like the UCSC browser and Ensembl?
No. This is just rivers of data and where they’re flowing. One of the things we hope to do with this [project] is to get a series of people interested in that kind of question. Let’s start looking at the differences between the browsers, and how this starts branching out into specific use patterns — once you’ve got the raw block of data, then who’s accessing this raw block of data, and how does that block of data get broken out? The next stage of this mapping process has to be what the hubs look like — who’s hubbing out of this stuff? You transfer it and then you’ve got this block of data that acts like a big, public library. Then a whole series of people come into the library and start accessing books, and it would be really interesting to start looking at those access patterns. But what we’re looking at now is a big, rough map, and I hope it’s the first of many.
What’s next for this project?
One of the things that we’ve been looking at is what happens when information gets inside specific companies. So, all of a sudden the floodgates open in a company and a whole series of things start changing. All of a sudden things like the financing requirements of the company change, and the organizational behavior of the company changes, and the power structure inside the company changes, all because you open up a little spigot and start bringing in information. Except that that spigot is very, very broad, so it’s really interesting to see how this stuff starts flowing across.