David Searls, senior vice president of worldwide bioinformatics for GlaxoSmithKline, has been studying the overlap between computational linguistics and biology since before the term “bioinformatics” was in common use. Searls recently penned an overview of the use of linguistics within bioinformatics for the journal Nature. The review article, “The Language of Genes,” discussed how various methods developed for analyzing languages have been applied to molecular biology. BioInform caught up with Searls last week to chat about his findings.
What originally piqued your interest in studying the crossover between linguistics and bioinformatics?
To go back to the very beginning, I got an undergraduate degree in philosophy, actually, so I started with a little bit of a humanist approach. I also studied biology and did my PhD in biology. I had always been interested in computer science and the application of computation in biology, and this was in the early days — I hesitate to say how long ago that was — so there really wasn’t such a field as bioinformatics then. But I did want to study computer science seriously, so while I was doing a postdoc and after a postdoc I studied computer science and did a master’s degree at Penn and I concentrated in computational linguistics. I was very interested in Chomsky-style grammars and how they were used to specify languages, and with my background in biology I wondered why this sort of approach had never been taken to DNA, which has all the elements required: It’s got an alphabet and it occurs in strings, and it encodes information. I looked through the literature and there really wasn’t very much at all on the overlap between formal language theory and biological sequences. So I started working on the area myself and started putting together some basic observations and basic results and really surprised myself with how far you can go in terms of classifying the complexity of various kinds of structures and nucleic acids using all of these tools that had already been developed and extensively characterized in the field of formal language theory.
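[Editor's illustration] As a rough sketch of what treating DNA as a formal language can look like, the toy grammar below recognizes a simple stem-loop, in which a run of bases is followed by a loop and then the reverse complement of the opening run. The stem-loop example, the rule set, and the names `is_stem_loop`, `min_stem`, and `min_loop` are assumptions chosen for illustration; the interview does not specify any particular grammar.

```python
# Watson-Crick pairs allowed by the illustrative grammar (an assumption,
# not a grammar taken from the interview or the review article).
PAIRS = {("a", "t"), ("t", "a"), ("c", "g"), ("g", "c")}


def is_stem_loop(seq: str, min_stem: int = 3, min_loop: int = 3) -> bool:
    """Recognize strings derivable from the toy context-free grammar

        S -> x S x'   where x and x' are complementary bases
        S -> L        where L is any run of at least min_loop bases

    using at least min_stem applications of the first rule.
    """
    seq = seq.lower()
    if not set(seq) <= set("acgt"):
        return False
    i, j = 0, len(seq) - 1
    stem = 0
    # Peel matching outer pairs while enough bases remain for the loop.
    while (j - i + 1) - 2 >= min_loop and (seq[i], seq[j]) in PAIRS:
        i += 1
        j -= 1
        stem += 1
    return stem >= min_stem and (j - i + 1) >= min_loop
```

The nested dependency between the two arms of the stem is what places this pattern beyond regular expressions in the Chomsky hierarchy, which is the kind of complexity classification Searls describes.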
Are many more people applying linguistics to bioinformatics now than when you first started out?
It’s certainly not a mainstream approach in bioinformatics. My contention is that those of us in bioinformatics are actually using these techniques without even realizing it in many cases. So a lot of the examples I give in my review are examples where techniques in bioinformatics are actually reinventing techniques that had already existed in linguistics. Sometimes it’s apparent. For example, I talk a lot about how hidden Markov models, HMMs, have become a really dominant paradigm in bioinformatics, and they actually arose in the speech-processing application first and [then] were adapted to bioinformatics. I think more and more in bioinformatics we’re seeing more sophisticated modeling of domains analyzed with methods like HMMs.
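[Editor's illustration] A minimal sketch of the HMM idea, assuming a made-up two-state model that labels each base as belonging to an AT-rich or GC-rich region. The state names and all probabilities are hypothetical values picked for the example. The decoder is the standard Viterbi algorithm, run in log space to avoid numerical underflow.

```python
import math

# Hypothetical two-state model; every number here is an illustrative assumption.
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
EMIT = {
    "AT_rich": {"a": 0.4, "t": 0.4, "c": 0.1, "g": 0.1},
    "GC_rich": {"a": 0.1, "t": 0.1, "c": 0.4, "g": 0.4},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy model."""
    seq = seq.lower()
    if not seq:
        return []
    # Log-probability of the best path ending in each state at position 0.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for ch in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][ch])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Calling `viterbi("aaaaaacccccc")` labels the first six bases `AT_rich` and the last six `GC_rich`. The same dynamic-programming scheme is used in speech recognition, with acoustic observations in place of bases.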
What other important techniques in bioinformatics have been borrowed from the field of linguistics?
Well, the classic sort of algorithm in computational biology is parsing algorithms, and basically parsers do pattern recognition. When I was working actively in this field I used parsers for pattern recognition — things like gene finding [for] protein-coding genes and tRNA genes and things like that. You can write grammars and you can use parsers. Since then, that field has been more or less taken over by stochastic methods like HMMs, which are also a form of grammar or a linguistic approach. So the whole area of pattern recognition borrows a lot from computational linguistics. If you look in some of the standard textbooks, this is pretty well recognized. There are some other kinds of work on things like modeling algorithms with finite state machines and that kind of thing. So I’d say there’s been a steady borrowing from a lot of the basics in formal language theory as people take a more formal approach to sequence analysis algorithms.
More broadly speaking, there are a lot of other kinds of activities that linguists have long engaged in [where] people are starting to recognize that there’s some kind of crossover. And it’s not just one way. It goes both ways. One example is in the area of evolutionary biology: Some of the algorithms that people use for phylogenetic reconstruction have been applied to languages. Where did various languages come from, what’s the relationship between various modern-day languages, and what were their ancestral languages? So you can use the same kind of phylogenetic approaches to languages. But what people need to realize is that the original approach to that sort of cladistic view of evolution really goes back to the ’50s and even earlier in the linguistic community.
There’s a lot of interest nowadays in ontologies, like the GO ontology, and modeling the semantics of biological systems in terms of “is-a” hierarchies and “part-whole” hierarchies and these complex sort of graphical representations of the relationships among genes. Well, that really borrows heavily from work that’s been done for a long time in lexical semantics, where people try to understand and catalog the meaning of words and word families in the same sort of hierarchies. Linguists have been building ontologies for quite a while.
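[Editor's illustration] The "is-a" and "part-whole" hierarchies mentioned above can be sketched as small directed graphs. The terms and links below are invented for the example and are not actual GO entries. `ancestors` is a hypothetical helper that collects every term reachable by following one relation upward.

```python
# Toy relations, child -> list of parents. The terms are illustrative
# assumptions, not entries copied from any real ontology.
IS_A = {
    "hexokinase": ["kinase"],
    "kinase": ["enzyme"],
    "enzyme": ["protein"],
}
PART_OF = {
    "inner membrane": ["mitochondrion"],
    "mitochondrion": ["cytoplasm"],
    "cytoplasm": ["cell"],
}


def ancestors(term, relation):
    """Return all terms reachable from term by following relation upward."""
    seen = set()
    stack = list(relation.get(term, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(relation.get(parent, ()))
    return seen
```

The same traversal works for a lexical hierarchy, for instance a word list in which "sparrow" is-a "bird" is-a "animal", which is the parallel with lexical semantics that Searls draws.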
You mentioned the amount of reinvention of linguistics approaches that you’ve witnessed in bioinformatics. What specific benefits might bioinformatics developers gain from a closer study of linguistics?
Fundamentally, there’s nothing wrong with reinvention. It’s just that you can save yourself a little time if you’re aware of what’s gone before. And when these techniques and these ideas are applied in biology, the details are very much different than they are in language, so you can’t just take these techniques wholesale. In fact there are probably a lot more people working in bioinformatics now than there are in computational linguistics. So my argument is not that you can simply reuse wholesale the techniques that have been developed in linguistics; you’ll have to re-implement and tune them for the biological domain. The overall idea is where I think you can save time. Or you can benefit from having the two communities interact by just learning from each other at a very general level what the types of problems are and how different communities have approached them.
What kind of interaction currently exists between these communities?
There are occasional interactions that I hear about and know about. Actually, I helped to organize a workshop about a year and a half ago at the University of Pennsylvania with one of my colleagues there, Aravind Joshi, who’s a computational linguist, and the other co-organizer was Sean Eddy at Washington University, who of course is a prominent bioinformaticist. We put together a workshop where we invited a couple of dozen workers from each field, just people who didn’t necessarily know anything about the other field, but we thought that they would have something to talk about. And in fact we found that they had a lot to talk about. It was a very productive interaction and it has led to a couple of other interactions.
Are there any promising areas within linguistics that remain undiscovered by bioinformatics?
Almost any aspect of biology could benefit from what I call a linguistic sensibility. I think almost any field could benefit to some extent. As I said, it’s not going to be instant results because the two domains vary greatly at the level of detail, but I think some of the themes are in common between them.
Also, at an even higher level, there are some sociological aspects about how the fields work. Because biology is such a varied set of fields, just as linguistics is, you see some of the same sorts of phenomena playing out. Like literary genres, for example; I think you see different genres of bioinformatics. You see people who take primarily an evolutionary approach, and others primarily a structural approach, and others who depend more or less on model organisms. So the best bioinformatics is done when you look at all sorts of approaches in common, and I think the same thing has always happened in linguistics and in literary studies, where there are certain key techniques that are in common and then you get specialization, and sometimes you get controversy.
I was just reading about a controversy in the literary community back in the ’80s. There was a new edition of James Joyce’s Ulysses being put together, and the problem was that James Joyce was a little bit sloppy and there were all sorts of spelling errors, and as a result there were many different versions of Ulysses and they all varied quite a lot. There were thousands of differences between them. So in order to come up with what they called the authoritative text or the corrected text, there was one worker who basically did what I would call an assembly of the various different versions, and there were other people who objected to these methods and said that the assembly should have been done from a single authoritative text. So it became very controversial and there were scientific meetings with people shouting at each other, and it struck me how similar that whole situation was to the controversies around the assemblies of the human genome and the different camps that had their own approaches.
It just goes to show that even at a sociological level there’s not too much difference between the fields.
Looking over the field as a whole, where have you seen the biggest impact of this kind of approach?
Definitely the use of HMMs and stochastic methods. We’re beginning to see the use of the clustering algorithms and approaches to dimensionality reduction in gene expression studies. I have a little section in my paper about power law distributions that people have recognized now in all kinds of biological situations, and again that goes back to original work by a linguist. So I think there are a number of areas where there is interesting work going on. It’s not always explicitly tied back to the linguistic origins, but it’s sort of endemic at this point.
The best thing that can happen from recognizing these top-level connections is maybe to get people talking.