Stanley Chair in Interdisciplinary Engineering
Iowa State University
Srinivas Aluru is the Stanley Chair in Interdisciplinary Engineering in the department of electrical and computer engineering at Iowa State University and for the past several years, has done extensive research in bioinformatics on myriad topics such as parallel models, algorithms, and applications.
His department is completing a three-year project funded by the National Science Foundation to develop a better method of storing biomolecular sequences. The goal of the project, which began in September 2004, is to design storage structures, algorithmic techniques, and software for disk-resident sequence data, and then apply it to applications in computational biology.
Aluru recently spoke to BioInform by telephone to discuss this project, which has been extended through next August, as well as his other bioinformatics efforts at Iowa State. Following is an edited version of that discussion.
What would inspire an engineer to get into bioinformatics?
I’m actually a computer scientist by training. I am in the electrical and computer engineering department and [those disciplines] have so much in common … Computer scientists, perhaps, focus more on theory [whereas] engineers may focus more on problem-solving, so it’s not unusual that I am focusing in this area [of bioinformatics].
When I got into this field back in ‘98 I had no knowledge of biology; in fact I had not even taken any high school biology, [because in] … the Indian educational system one would chose either biology or math and I went along the mathematics track … [Later on] I was working at New Mexico State University when ACM [the Association for Computer Machinery] conducted [Research in Computational Molecular Biology], the first computational biology conference in Santa Fe in 1998 and … I was curious to go check it out, and came back quite impressed with this field and decided to get involved.
So I took a semester off to study biology in something we [at Iowa State] call a study-in-a-second discipline program that relieved me of my teaching duties for a semester. I think that was in 2002.
What are your principal areas of research that involve bioinformatics?
Most of my work is concentrated in genomics and I also have expertise in high- performance parallel and supercomputing so I am trying to use parallel computers and supercomputers to solve large-scale biological applications. One of the niche areas I look at is how do we solve large-scale problems using supercomputers effectively….[How do we, for example] carry out genome assemblies using 1,000 processors simultaneously to solve the same problem? [It] could be difficult to solve because it takes too much time or it takes too much memory ….
For example, If I have a parallel machine and 1,000 processors, each processor can have two gigabytes or four gigabytes, so collectively they will have a large amount of memory.
So that allows us to solve large problems that might be difficult on a desktop. The other … is speed. …If we find a way to solve it in parallel using multiple processors we can, of course, do things faster.
Typically it takes about three months to do [genome] assemblies. We are able to use an IBM Blue Gene supercomputer and we are able to do jobs in a day or two …
[We] can play with parameters … If you need to do some analysis it can easily take a year before the benefits of a sequencing project reach the community, but we can do it in a week or two.
What percentage of your time is spent on this?
I would say about 80 percent of my research is now in the bioinformatics area.
We are curious to hear more about the grant awarded back in 2004 regarding storage of biomolecular sequences and accessing them to determine sequence homologies. With the grant money running out in a year, how far along are you in your progress towards achieving your goals?
First, this project has nothing to do with parallel computing [which we were just talking about]. [Rather], the goal here is to double up technologies which allow fast access to sequence data – like DNA and protein sequences in a database.
When you use a database there is always an underlying storage structure of how the data is laid out … it’s very important because when you try to access it the disks are very slow compared to the rate at which the processors work. … Chips work at nanoseconds, but disks can only work at [the] millisecond [level] because they are a mechanical device … and it is very important to come up with methods using a minimum number [of] disk accesses.
So the way in which the data is laid out determines … how frequently you need to access the disk. A lot of work has been done on databases in the last two to three decades, but typically when we store data, it can be ordered somehow. For example, when we store student social security numbers we do that in a particular order [that] you can access very easily.
But to access [a] DNA sequence is very different…[for] we are looking for parts of the sequences. [For example, we are saying], ‘Here is a DNA sequence, can you find other sequences that have parts that look similar to parts of the sequence?’ So the way we query the sequence in biology is very different than how we do databases in other areas.
This research is basically about doubling up storage structures that are particularly suited to biological data, particularly string data because we model the biological sequences as strings.
So we made two major breakthroughs. One of them is that there is a data structure called “suffix trees” that is used in main memory to solve a lot of computational biology problems. One of the things we did was come up with a way to organize the suffix trees on disks so you get the optimal number of disk accesses. Suffix trees are used in a number of computational biological applications, but they all assume the tree fits in the main memory.
If you have very large [amounts of] data and it must be stored in the disk, what is the best way to organize the layout on the tree? We proved that there is a way to organize data on the suffix tree to guarantee the number of disk accesses is the best possible.
The second [breakthrough] is a continuation of the same thing, where suppose you have a series of biological queries that are coming … What you want to do is come up with a way to organize the data on the disk and keep modifying the way you lay out the data so the entire sequence of queries that are coming [allows you to] minimize the number of disk accesses you take to satisfy them.
What goals have you set for yourself? Anything you can add to what you’ve already stated?
Yes, the main goal for the project is to invent storage structures and mechanisms to access the sequence data. So in that sense these breakthroughs are tied into our goals.
The first [goal] is an open problem from 1985 … There is a classic paper published by [Bob] Tarjan, a faculty member at Princeton, and he left this as an open problem in 1985. In that paper he gives a way to answer a sequence of queries for numbered data and he leaves it as an open problem for how do you solve this for string data?
Good papers and good people always solve some problems but also pose some questions. He was a Turing award winner [in 1986, as an HP Senior Fellow].
What challenges have you faced toward this end?
The main problems are the intellectual difficulty. It was not clear when we started working on it that we would come up with the results that we came up with. I was not expecting I would be able to solve these problems.
[Tarjan] just posed the problem for people to work on, for the community to work on.
Interesting. So please now give an overview of your other major bioinformatics projects.
[T]he achievements I’ve talked about so far are breakthroughs in terms of algorithms, so I have taken a one-year extension on the project so we can actually go ahead and try to implement systems using these methods. I would like to implement these systems and see how they work in practice. The grant was actually till 2007 but NSF allows us to take a one-year extension.
There are a number of other projects, too. I currently have probably six NSF grants, and we have grants from other agencies. One we are involved in is the provision of assemblies for the Maize [Genome] Sequencing Project. That is an NSF/US. Department of Agriculture/Department of Energy, $30 million genome sequencing project led by the University of Washington in St. Louis … We are doing the gene assemblies using the Blue Gene supercomputer.
I also read about CREST (The Center for Excellence in Bioinformatics and Computational Biology) grant. What can you tell me about that?
Yes, regarding CREST, I have a sub-contract from New Mexico State University. They received the CREST funding to set up a center of excellence in bioinformatics … I have joint research with them and we are helping them set up a program at New Mexico in bioinformatics. We have a PhD program at Iowa State with more than 55 PhD students in bioinformatics, which is possibly the largest in the country. There will be a masters program at NM [in bioinformatics, as a result of this initiative.
And [all of this has been possible] through the IGERT program – meaning Integrated Graduate Education and Research Training program. We can recruit talented students, give them fellowships and direct them to certain research areas. … We have an IGERT to jointly carry out education in this area.
Fellowships will be given by NSF to offer IGERT fellowships at both schools, and offer interaction both in the informatics and research areas.
Who aids you in your research?
The graduate students are a very significant part of it. I have graduated three PhD students this past summer and I am still left with one post-doc researcher and seven [more] PhD students. So we are fairly large group.