GT: Roundtable Panelists
William Hayes, Associate Director
Research Informatics, Biogen Idec
Sandy Aronson, Director of Information Technology
Harvard Medical School – Partners HealthCare Center for Genetics & Genomics
Georges Grinstein, Director
Center for Biomolecular and Medical Informatics
University of Massachusetts, Lowell
Temple Smith, Director
BioMolecular Engineering Research Center
Earlier this year, four bioinformatics experts settled into a conference room in downtown Boston to hash out the second installment of the Genome Technology roundtable series. Keeping in mind the “lab of the future” topic, our fearless foursome addressed database automation and other issues, data storage concerns, and the need for leaps-and-bounds advances in designing and preparing bioinformatics experiments in order to keep systems biology scientists on the cutting edge.
The group consisted of Temple Smith, best known as the Smith of Smith-Waterman fame, from Boston University; William Hayes, associate director of research informatics in the Library and Information Services unit of Biogen Idec; Sandy Aronson, director of information technology, hailing from the Harvard-Partners Center for Genetics and Genomics; and Georges Grinstein, director of the Center for Biomolecular and Medical Informatics at the University of Massachusetts at Lowell.
What follows is a transcript of their conversation, edited for space.
Genome Technology: Let’s start with quick introductions to let our readers know what you do at your organizations.
William Hayes: I’m at Biogen Idec, where we’re starting to apply literature informatics techniques — approaches to managing the literature information — in support of drug discovery and development.
Sandy Aronson: We’re really focused on translational medicine. What that entails for us is we have a significant research component where we have laboratories that will assay samples for both our own researchers and other researchers in the community. We also have a clinical component which involves a CLIA-certified lab that will run tests for the purpose of helping clinicians figure out how to better treat their patients. So on both sides, we work to add IT support to those processes to make them more efficient, more robust, higher quality.
Georges Grinstein: My labs are involved in visual analytics using computation and visualization to solve data problems. The current largest areas are microarrays in a wide variety of applications; cheminformatics, in terms of drug structure; and the most recent project within the past year is on breast cancer risk modeling analysis and personalization.
Temple Smith: Our primary focus in the last couple of years has been looking at the origin of the eukaryotic cell system. We’ve published a couple of extremely controversial papers, which is quite entertaining — and of course we run a very large website and maintain the database for a couple of things including all of the WD propeller proteins.
GT: Speaking of databases, how do you decide which databases to use, which ones to invest in, and when to build your own?
Smith: I think basically all of the public databases have their problems and limitations — including GenBank, which I’m one of the founders of. They were all created by amateurs — not professional computer scientists or database people — and they’re legacy databases so they’re filled with redundancies, errors, historical errors, and so on.
I know some major pharmaceutical companies and big companies rebuild everything from scratch [because] they just can’t tolerate it. As the graduate students always say, ‘I think I’ll use the redundant nonredundant GenBank today.’ There have been efforts to try to correct some of this by groups trying to reorganize the databases by functional domains, by evolutionary domains.
None of the publicly available databases with all their funny Web links allow you to go on and ask complex queries. I can’t say I want to see all of a particular family of proteins from primates for which there are full cDNAs or ESTs and that might be backed up by RNAi knockouts. You can’t ask such a question. If you go to your IT department in most companies — at least this is the complaint I get — you give them that problem [and] they go to work on it, but often the people working on it don’t know enough biology to quite interpret that query the way you want it, so you’ll go a couple of rounds before you get that query done.
You’re talking about the lab of the future in industry or academia; it’s very unclear to me how to undo the legacy we have of all these separate databases. I have always been somewhat discouraged about how, in the small lab, in academic labs, we do this without tying up a new graduate student for weeks. I don’t really know how to solve this problem, but it is a problem that’s only going to get worse.
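[To make Smith’s example concrete, here is a minimal sketch of how such a question could be posed if the relevant records lived in a single integrated relational store. Everything in it is an assumption made for illustration: the table names (protein, transcript_evidence, rnai_knockout), the columns, and the sample rows are hypothetical and do not correspond to GenBank or to any other public database schema.]

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE protein (
            id INTEGER PRIMARY KEY,
            name TEXT, family TEXT, taxon_order TEXT
        );
        CREATE TABLE transcript_evidence (
            protein_id INTEGER REFERENCES protein(id),
            kind TEXT  -- 'full_cDNA' or 'EST'
        );
        CREATE TABLE rnai_knockout (
            protein_id INTEGER REFERENCES protein(id),
            phenotype TEXT
        );
    """)
    # Made-up rows, purely to exercise the query.
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?)", [
        (1, "protA", "WD-repeat", "Primates"),
        (2, "protB", "WD-repeat", "Primates"),
        (3, "protC", "WD-repeat", "Rodentia"),
        (4, "protD", "kinase", "Primates"),
        (5, "protE", "WD-repeat", "Primates"),  # no transcript evidence
    ])
    conn.executemany("INSERT INTO transcript_evidence VALUES (?, ?)", [
        (1, "full_cDNA"), (1, "EST"), (2, "EST"), (3, "EST"), (4, "full_cDNA"),
    ])
    conn.execute("INSERT INTO rnai_knockout VALUES (1, 'made-up phenotype')")
    return conn


# Family members from primates that have a full cDNA or an EST,
# each flagged (1 or 0) by whether an RNAi knockout backs it up.
QUERY = """
SELECT p.name,
       EXISTS (SELECT 1 FROM rnai_knockout k
               WHERE k.protein_id = p.id) AS has_rnai
FROM protein p
WHERE p.family = ?
  AND p.taxon_order = 'Primates'
  AND EXISTS (SELECT 1 FROM transcript_evidence t
              WHERE t.protein_id = p.id
                AND t.kind IN ('full_cDNA', 'EST'))
ORDER BY p.name
"""


def family_in_primates(conn, family):
    """Return (name, has_rnai) rows for one protein family."""
    return conn.execute(QUERY, (family,)).fetchall()
```

[Calling family_in_primates(build_demo_db(), "WD-repeat") returns the two primate family members that have transcript evidence. The query itself is short; the difficulty Smith describes is that the public resources are not organized so that one query can span them.]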
Hayes: I completely agree. The databases have been sadly lacking in quality control — or even any measure upon submission of what the considered opinion of the quality is, or the curated opinion of what the quality is. That has been a problem ever since genome sequence started being produced. I think you and Jim Fickett discussed that in the very earliest days of GenBank and decided to just let everything come in.
Smith: GenBank was originally in FRAMIS, one of the oldest relational databases. The advisory council told GenBank, ‘Look, it’ll never be big enough to justify that. Just make it in human-readable flat files.’ That was the advice.
Hayes: Having a clean database copy is really the remit of follow-on companies to take the raw data from the public sources, analyze it, repackage it, and deliver it. But there haven’t been very many customers for that — there have been a few companies that attempted it, but they’ve not made any progress.
Smith: You have to make money, though.
Hayes: That’s the thing: those databases haven’t been purchased.
Grinstein: That problem is not just for GenBank and others. Last year I ran a contest where we took the ACM and IEEE digital libraries and asked people to look for trends in a specific domain. In my lab, we spent several months with 10 graduate students cleaning the data. Now you’d think that the ACM and the IEEE digital libraries would be clean, but the citations were wrong, references were wrong, there were duplicate names — all the classic errors that one would expect. So the curation seems to be title, possibly some keywords, and that’s about the kinds of activities that go on. Now good interfaces are provided — you can retrieve documents with typical SQL queries — but the cleansing of the data is not there.
Smith: Can any of that be automated?
Grinstein: Parts of it. ISI actually spends a lot of money on cleaning the citations for correlations and analysis and so on, so there are commercial groups out there that have invested in parts of database cleanup.
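[As one illustration of the part that can be automated, here is a minimal sketch of flagging likely duplicate author names of the kind Grinstein mentions. It is not the method used by his lab or by ISI. The matching rule (surname plus first initial, with accents and periods stripped) and the function names name_key and candidate_duplicates are assumptions chosen for this example. Different people can share a key, so the output is a list of candidates for a human curator to review.]

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Reduce an author name to 'surname initial' for rough matching."""
    # Drop accents so that 'José' and 'Jose' compare equal.
    plain = (unicodedata.normalize("NFKD", name)
             .encode("ascii", "ignore").decode("ascii"))
    plain = plain.replace(".", " ").strip()
    if "," in plain:
        # 'Surname, Given' form
        surname, given = (part.strip() for part in plain.split(",", 1))
    else:
        # 'Given Surname' form; the last token is taken as the surname
        parts = plain.split()
        if not parts:
            return ""
        surname, given = parts[-1], " ".join(parts[:-1])
    return f"{surname.lower()} {given[:1].lower()}".strip()


def candidate_duplicates(names):
    """Group names that share a key; keep only groups with more than one spelling."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}
```

[Given the made-up list "Ada Lovelace", "Lovelace, A.", "José García", "Jose Garcia", and "Alan Turing", the function groups the first two and the middle two and leaves the last alone. Wrong citations and references are a harder problem and are not addressed by this sketch.]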
Aronson: I think that the impact of this problem is about to get much more significant, which may drive some of those solutions in a way. One of the things that we deal with is the need for these types of databases to be leveraged in a routine fashion on the clinical side. Clinicians and people involved in the clinical care delivery process more and more need this information — so there needs to be some effective way of getting it to them.
Within the HPCGG, what we’ve done is we’ve put together a database for the clinical side to help geneticists curate information about variants that have particular clinical relevance that can be then used in the clinical reporting process. The problem is that that process relies on geneticists doing curation, so it’s inherently limited in its ability to scale. The process of doing that curation involves interacting with a lot of these public databases, which is today for us a manual process. But I think that as genetics and genomics get used with increasing frequency in clinical medicine there’ll be a real driver to streamline that process.
I think that what that’s going to drive us towards is more robust interfaces to these databases and also federation strategies — and I don’t mean that from a technical database-federation sense, but from a sense of figuring out how things that are curated within our center can be shared with things curated in other centers in a way that’s literally intended for use in clinical care and patient care.
Smith: You raised the same issue — that current curation is not scalable — and that’s, I guess, what is depressing. I haven’t yet heard a proposal that’s going to get us around this at all. I work with FlyBase over at Harvard; they have a hall full of curators and I would say that [William] Gelbart’s group works very hard to minimize the errors — [but] they don’t seem to have any inkling of how to automate any of it.
GT: What’s the answer for getting this kind of data into labs, then?
Aronson: Really pushing it into the laboratory where it can be used to really improve laboratory processes — to put more information in the hands of the laboratory technicians so that they can be much more efficient in their workflow within the laboratory.
Grinstein: The term ‘more information’ really bothers me because more information is not a simplification.
Aronson: So maybe I should have stated it differently. Putting information in front of them that would allow them to make decisions that today have to be made by someone else — that was what I really intended.
Hayes: And we’ve been doing that as a continual progression. Blast? Blast is rarely run by a bioinformatician anymore. It’s usually run in the lab.
Smith: But then it’s terribly misused. I think what’s going to happen in the modern laboratory is something very different. Just looking inside this small company of ours [Smith is a cofounder of Modular Genetics], the technicians are vanishing. What counts is the guy sitting in the room with myself and some of our programmers and our advisors, designing what we’re going to do with all the data we can get our hands on — and then the robots do it. The robots do the analysis, they lay out the arrays. Nobody makes a decision, in a sense, until the data comes back. You have a design team; that’s even true now in a couple of the laboratories at MIT. You have a graduate student standing in the lab, making sure the Tecan’s doing what it’s supposed to do, but the rest of the time is spent in the conference room figuring out what experiment we should do. What I don’t see [is] a set of screens with visualizations supporting them. What’s not happening in the modern laboratory [are] all these design systems — what do you call them?
Hayes: Computer-aided design systems.
Smith: Yeah, the CAD/CAMs. I don’t see the equivalent of a CAD/CAM in molecular biology labs or clinical labs, and that is what’s missing. We sit around a table like this at our company and people bring up PowerPoints and stuff, but then we’re talking about putting mutants in this position and then you say, ‘Well, what mutants has nature already tried there?’ and then somebody has to run off, run Blast, and do all this stuff. But if I was doing that over at MIT, talking about designing a new computer chip, that data would be right in front of us on the table and we could search it, ask questions about it, test the design right then and there. What is the biological CAD/CAM?
Hayes: I think BioSPICE is an attempt at that.
Smith: So they’re linking LIMS systems to design?
Hayes: And iterating along with systems biology pathway analysis — with standard differential equations.
Smith: Well, that’s where the future lab I should think is going.
Hayes: That’s all built, though, on having really good infrastructure — and that’s the issue in bioinformatics. Our databases are really hard to manage, and there aren’t any companies that provide these effectively as services. Biogen Idec actually looked to see if there was a company that could just deliver us Blast databases and all we’d have to do is run queries, and have everything updated in the databases as well as the code updated on our services. We couldn’t find anyone.
Grinstein: I’m an optimist because I believe that those tools will eventually be well developed both by academic and commercial partners. I think that that’s going to happen more and more.
Aronson: We’re seeing an increasing number of large studies within the academic arena — and those large studies do lend themselves to investments in IT infrastructure to make the processes that underlie those studies as efficient as possible.
Hayes: I saw very few grants come across this last study section that had much in the way of infrastructure, and they are hard to get funded.
GT: Our readers are also interested in data storage. How do you best plan for the future when you think about data storage?
Grinstein: Up until about two years ago I was really worried about the same question — and then digital photography took off. The amount of storage by individuals who have digital cameras is reaching a terabyte per family, easily. That’s funding an immense amount of research into digital storage. 256 terabyte address spaces will be available in about a year.
Hayes: That’s holographic storage, right? Holographic storage is potentially world changing. But magnetic disk is getting pretty close to the limit.
Smith: Even without going to the 10-year-out stuff, my lab maintains these databases, has all the public databases in house. We just went out and got a 2-terabyte, striped, hot-swappable RAID system — it was $2,500, period. Any laboratory can put together a dozen-terabyte system if they so desire.
Hayes: But the better question is, how many of those images do you need to keep?
Aronson: I think that’s where the big challenge is here. The cost of individual units of storage is going down considerably, but the amount of data that we need to manage is going up — and it’s the management of that data that we see as a challenge. I have found that to be a challenge because of what each next generation of instrument is going to generate. We wound up looking to commercial partners to help us. We partner with HP to help us figure out how to optimize storage in a kind of unpredictable environment.
Hayes: The one thing that I have not seen at all is better data backup. The best technology for backup is tape, and tape is really painful.
Grinstein: If you’re generating data fast, tape systems cannot handle it.
Aronson: I think that all of this winds up being tradeoffs between how many dollars get put into infrastructure and how many dollars get put into operating people to manage that infrastructure. What we really need to look to the vendors to do is to not only make strides in giving us cheaper storage, but in giving us the tools so that the FTE count required to manage that over time [goes down].