University of Virginia researcher Ira Hall has received a five-year New Innovator Award from the National Institutes of Health — one of 55 such awards for "highly innovative research" that NIH granted in 2009 — to develop computational tools to study genomic structural variation.
Specifically, Hall plans to use the award, worth $2.3 million in fiscal year 2009, to explore the role of genomic structural variation in evolution and disease in mammals. According to the grant abstract, he is working on tools to map genome-wide structural variation with "modest" computational power.
Hall and his colleagues plan to apply these tools in three specific areas: among diverse mouse strains with shared genealogical origins, among related mouse colonies that are separated by approximately 2,000 generations of breeding, and among single cells from diverse somatic lineages of the body and brain.
Hall completed his PhD in genetics at Cold Spring Harbor Laboratory in 2003 and was a post-doctoral fellow there until the end of 2007, when he landed a position as assistant professor of biochemistry and molecular genetics at the University of Virginia School of Medicine.
BioInform spoke with Hall last week; what follows is an edited version of that conversation.
How does it feel to be a "new innovator"?
I'm relatively new. This is pretty much the first grant I ever applied for. It's hugely helpful [for] starting a lab.
Grants take so long. You get a good idea, you have to have money around to pursue that idea. It might be a year later by the time you get the money for it. You're obviously still interested in it but the pace of your brain is sometimes not matched by the pace of science.
I assume that [NIH] was trying to have an award that was high-risk, specifically for new investigators. Unlike other grants, they seem much more interested in you. They ask you for a proposal, but they are putting their faith in you and your ability that you are going to be able to do what you said, or at least do something good.
Will you be hiring or buying any types of instruments?
We are going to hire a couple more people but we already have a pretty good group, a good post-doc, programmer, and technician. I just need to get a few more students and post-docs in.
Mainly it's going to allow us to do all the experiments we want. We will have money to do sequencing and buy some bigger computers.
What's really exciting is that it's going to allow us to analyze the genomes that we want to analyze. That's the key to move beyond ideas.
Do you have access to a second-generation sequencer?
We have an Illumina GA II downstairs, and we've used it a bunch. My lab is half wet-lab, half computer lab.
What kind of computers do you have?
We have access to various different clusters. And for our own use, we have sort of taken a different approach, with high RAM, and 16- or 32-core machines. A lot of things we do take a lot of memory, and it makes it easier to have all the processors in one box. We have a couple of machines with 128 gigabytes of RAM, which makes them easier to program.
Are the algorithms memory hogs?
It is a problem, but it's what you have to deal with. Anything you write can be written for high memory usage or low memory usage. The higher memory usage is generally faster to write and to run. So if you have the memory, you might as well use it.
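The trade-off Hall describes can be illustrated with a toy example (our sketch, not the lab's code): the same read-depth computation written to be fast but memory-hungry, and again as a slower event sweep that scales with the number of reads rather than the genome size.

```python
# Hypothetical illustration of the memory/speed trade-off; the data
# shapes and function names are our own, not from any published tool.

def depth_high_memory(read_intervals, chrom_len):
    """Fast to write and run, but memory-hungry: one counter per base.
    For a ~2.5-Gb mammalian genome this array alone is gigabytes."""
    depth = [0] * chrom_len
    for start, end in read_intervals:
        for pos in range(start, end):
            depth[pos] += 1
    return depth

def depth_low_memory(read_intervals, chrom_len):
    """Low memory: sweep over sorted start/end events instead of
    materializing every base; returns (position, depth) change points."""
    events = []
    for start, end in read_intervals:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()
    depth, profile = 0, []
    for pos, delta in events:
        depth += delta
        profile.append((pos, depth))
    return profile

reads = [(0, 5), (3, 8), (4, 6)]
print(depth_high_memory(reads, 10))  # [1, 1, 1, 2, 3, 2, 1, 1, 0, 0]
```

Both return the same coverage information; with 128 GB of RAM on hand, the first version is often the pragmatic choice.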
Unlike researchers with two-year grants connected to the American Recovery and Reinvestment Act, you now have a five-year stretch ahead of you. What's the plan?
We're interested in a lot of genomic questions. One of the main lines has to do with genomic rearrangements and structural variation, which spans human genetics, evolution, and cancer. Genomic rearrangement has to do with almost any question you can think of.
We have a long history of using microarrays for this same purpose. When I moved to UVA, I decided to make a clean break and just start getting sequence data.
When we started this work, there were no tools. In the past year, a number of tools have come out — not a huge amount, but several algorithms for copy number differences by depth of coverage and for finding rearrangements by paired-end mapping.
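The depth-of-coverage approach Hall mentions can be sketched in a few lines (a toy illustration under our own assumptions about window size and normalization, not any published algorithm): bin the genome into fixed windows, count reads per window, and flag windows whose coverage ratio deviates from the genome-wide mean.

```python
# Toy depth-of-coverage copy-number screen; the window size and the
# mean-ratio normalization are illustrative assumptions.

def call_copy_number(read_starts, genome_len, window=100):
    """Count read starts per fixed window and report each window's
    ratio to the mean window coverage; elevated ratios suggest a
    duplication, depressed ratios a deletion."""
    n_windows = genome_len // window
    counts = [0] * n_windows
    for pos in read_starts:
        counts[min(pos // window, n_windows - 1)] += 1
    mean = sum(counts) / n_windows
    return [round(c / mean, 2) for c in counts]

# Uniform coverage everywhere except a doubled second window:
reads = [w * 100 + off for w in range(4) for off in range(0, 100, 10)]
reads += [100 + off for off in range(0, 100, 10)]  # extra reads in window 1
print(call_copy_number(reads, 400))  # [0.8, 1.6, 0.8, 0.8]
```

Window 1 stands out with twice the coverage of its neighbors; real callers add GC correction, mappability masks, and statistical segmentation on top of this basic signal.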
Ours is almost out. What is unique about our algorithm is that we designed it to be able to, in theory, detect all different classes of structural variants. So we should be able to see anything, no matter how repetitive it is.
A lot of the methods that have come out have relied on reads that map uniquely to the genome. This really limits you when you are talking about structural variation, because a huge amount of the variants are in duplicated genomic regions or involve transposons.
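The paired-end mapping idea underlying this answer can be sketched generically (this is an illustration of the technique, not Hall's algorithm; the insert-size range and orientation convention are assumptions): read pairs whose mapped distance or orientation disagrees with the expected library insert signal a rearrangement between the two ends.

```python
# Generic discordant-pair screen for paired-end mapping data; the
# expected insert-size window and strand convention are assumptions.

def find_discordant_pairs(pairs, min_insert=200, max_insert=600):
    """Each pair is (pos1, strand1, pos2, strand2) on one chromosome.
    A concordant pair points inward (+/-) and spans roughly the
    library insert size; anything else hints at an SV breakpoint."""
    discordant = []
    for pos1, strand1, pos2, strand2 in pairs:
        span = abs(pos2 - pos1)
        inward = strand1 == "+" and strand2 == "-" and pos1 < pos2
        if not inward:
            discordant.append(("orientation", pos1, pos2))  # e.g. inversion
        elif span > max_insert:
            discordant.append(("too_far", pos1, pos2))      # e.g. deletion
        elif span < min_insert:
            discordant.append(("too_close", pos1, pos2))    # e.g. insertion
    return discordant

pairs = [
    (1000, "+", 1400, "-"),  # concordant
    (5000, "+", 9000, "-"),  # spans a deletion
    (2000, "+", 2300, "+"),  # same-strand: possible inversion
]
print(find_discordant_pairs(pairs))
```

Requiring each read to map uniquely makes this screen simple; as Hall notes, the hard cases are exactly the pairs with many plausible mappings in repeats.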
How does the tool landscape look in this area and how does your tool measure up?
We haven't used other tools, [but] we've heard good things about BreakDancer [an algorithm developed by researchers at Washington University School of Medicine and published in Nature Methods in August].
[Computer scientist] Ben Raphael [at Brown University] has done really good work but I don't think they have a tool out yet.
The only tool that is out and that sort of takes the same approach we have taken — which is to try to look at all parts of the genome, not just the unique parts of the genome — is from a team at Simon Fraser University and the University of Washington [published in Genome Research in July].
Their tool VariationHunter tries to look at all possible mappings. It does so differently than ours, but I'm hesitant to say much about it. We're still testing it and are in the process of trying to see what it finds and what it doesn't find. It's the only other algorithm out there that does try to take that comprehensive approach, to try to see everything you can possibly see.
What happens with mammalian genomes that are so repetitive is that as soon as you try to take into account all the possible mappings, the computational problem becomes much more complex. If you're just trying to write code to find simple duplications and deletions and simple inversions in unique sequence, it really is not that hard.
What makes it hard is to try to be comprehensive and unbiased and to see whatever is there. That has been our approach. By taking the least biased approach possible, we will see new things. That's generally the way genomics works.
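The blow-up Hall describes is easy to quantify with a back-of-the-envelope sketch (ours, not any specific tool's model): if each end of a read pair maps to k places in repetitive sequence, a comprehensive caller has to weigh k × k candidate placements per pair.

```python
# Toy enumeration of ambiguous paired-end placements; the insert-size
# window and the example coordinates are illustrative assumptions.
from itertools import product

def candidate_placements(end1_hits, end2_hits, min_insert=200, max_insert=600):
    """Enumerate every pairing of ambiguous mappings and keep the ones
    consistent with a normal insert; with k hits per end this examines
    k*k combinations, which is why repeats make the problem hard."""
    consistent = []
    for p1, p2 in product(end1_hits, end2_hits):
        if min_insert <= abs(p2 - p1) <= max_insert:
            consistent.append((p1, p2))
    return consistent

# A read pair where each end hits a three-copy repeat:
end1 = [10000, 55000, 90000]
end2 = [10400, 70000, 90350]
print(candidate_placements(end1, end2))  # [(10000, 10400), (90000, 90350)]
```

Nine placements are examined to find two concordant ones; with transposon families numbering in the thousands of copies, the combinatorics explain why unique-mapping shortcuts are so tempting.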
Why did you start working with mouse genomes?
One of the other things that is unique about our algorithm, and which no one else has done, is that instead of working it all out on human [genomes], we worked it out on mouse, because we could re-sequence the reference genomes.
There is a lot of noise that comes from the quality of the reference genome and also from how carefully you map the reads to it. That noise is extremely high.
The technical goal [here] is to be able to do a reasonable and affordable amount of paired-end sequencing, using, say, the Illumina system or another one, and to be able to find all the structural differences between the genome you are sequencing and the reference genome. That has been the chore and there [are] all sorts of complications every step of the way.
Are you connected with Collaborative Cross, the large-scale cross of eight mouse strains ongoing at Oak Ridge National Laboratory?
That team really knows what it is doing. We're not planning to analyze any of those samples. I believe that is going to be pursued with Jackson Labs.
We're really interested in mouse germline genetic variation and our first study has looked at a couple of mouse strains. But we're probably not going to sequence many more mouse strains, because that is being done by much larger consortia.
How does your project unfold practically? You ring the doorbell at the sequencing office downstairs, drop off your samples, and the data goes to a cluster?
The instrument belongs to the core facility. We don't own it. It's a shared instrument, but we've been the main user so far. We make our libraries, drop them off downstairs, they sequence them, and we take the data and do everything else.
You don't ask for preliminary analysis?
There are other researchers who want bioinformatics support, but we're kind of wedded to doing things our own way, so we just want the data.
Is your storage adequate for your needs?
Yes, we have it all set up so we can get our data analyzed and stored.
We don't store everything. We store the images for three to six months. That's done by the core facility. They have a tape backup system, just in case the base-calling algorithm changes hugely.
Do you need to ramp up computationally for this project of yours?
We'll find out. We don't know yet. We're buying 30 terabytes [of storage], but that's not going to last the whole time.
Who will run the computational side of the project, handle data management, and analysis?
I have a programmer, a database engineer, whom I share with Bill Pearson's lab next door [Pearson is a computer scientist at the University of Virginia]. He knows how to make the computers go. He has a server room full of equipment, and all of the equipment works.
… That little room gets really hot in the summer and has to have a lot of air conditioning.
What is next on your list?
The first thing is to wrap up and get the algorithm out to the public and used. Then we're going to hunker down and start generating a lot more data.
After that, we're going to try and go after questions at the intersection between two fields: There [are] a lot of structural variants in the germ line of humans and mammals, and then if you look at cancer, there is a huge amount of rearrangement.
There is very little known about the normal body, about how divergent different cells become over the course of development and how that affects health, aging, or tumor progression.
We're working on ways to look at somatic variation in its natural context and it's really hard to do that.
That means looking at individual organs and tissues and taking that down to the genomic level?
The hard part is getting good samples to sequence and being able to interpret the information that comes out. I think now it can be done.
Can people try out your algorithm to detect structural variants?
Not yet. We're still working on it and expect it will be done pretty soon. It was written by a very talented post-doc, Aaron Quinlan, and myself. It is an algorithm to detect all classes of variants by paired-end mapping.