Research scientist, department of medicine
At A Glance
Name: Louis Cleveland
Position: Research scientist, department of medicine, Columbia University; director, Laboratory for Molecular Mechanisms in Human Diseases, St. Luke's-Roosevelt Hospital Center
Background: Various staff, then faculty positions in departments of microbiology and medicine, Columbia University, 1980-present.
Education: PhD, chemistry, Rutgers University, 1974; postdoc, immunology, Columbia University, 1975-1979.
Throughout his career, Louis Cleveland has been interested in immunology and the molecular underpinnings of diseases ranging from chronic lymphocytic leukemia to schizophrenia. More recently, Cleveland and colleagues at Columbia University have taken an interest in using computer science techniques to automate biological imaging as a means to facilitate their molecular biology research. The researchers published a paper on their work in the April 2006 issue of Computers in Biology and Medicine [2006 Apr; 36(4): 339-62], on which Cleveland was corresponding author. CBA News caught up with him this week to discuss his work.
This paper was just published in April, but your group submitted it in September 2004. How long has this project been going on?
It's been in the works for quite a while. In 1997, I had funding from the National Cancer Institute to develop an immunotherapeutic vaccine for chronic lymphocytic leukemia. During the course of that project, I made a manual device for capturing cells and monitoring gene expression. That gave me a tangible sense of what the possibilities would be if we were to build an automated system. Since that time, it has been my goal to build a robotic system that would facilitate the monitoring of gene expression at the single-cell level in viable cells. It was difficult to get funding for this highly interdisciplinary project. However, a proposal submitted to the National Cancer Institute in collaboration with Larry Yao was eventually successful.
From the beginning, it has been my intention to build a system that is as automated as possible. Ideally, one would like to replace the human operator with computer vision and other types of algorithms. A critical first step towards this goal is the development of an algorithm that can recognize cells in microscope images of cultures. In our grant proposal, we initially planned to use traditional image-analysis algorithms that would recognize fluorescent-stained nuclei. We also planned to have a human operator review the algorithm's results because we weren't confident that they'd be all that accurate. Only after human review would the image-derived information be used to control the hardware.
Soon after our grant application was funded, my collaborator, Larry Yao, and I were fortunate to recruit a graduate student, Xi Long, who had a strong background in computer vision. As discussions evolved, I decided to do something much more ambitious than we originally planned in our grant proposal. Essentially, I asked Xi to explore the use of statistical learning machines for cell recognition and localization. These have two major advantages. First, the end user simply trains the machine with pre-classified samples. Therefore, there is no need for an end user to struggle with the esoterica of traditional image analysis. Second, there is a high degree of flexibility. To work with a different cell type, one simply retrains the machine. This is in contrast with traditional image analysis approaches, where a detailed optimization must be done for each cell type.
I also asked Xi to try to avoid the use of fluorescent staining. This was important because we're interested in studying gene expression with fluorescent probes. With green fluorescent fusion proteins, for example, one can monitor gene expression in real time with a fluorescence microscope. In many experiments, one needs to monitor multiple biologically relevant parameters with fluorescent probes. Unfortunately, most microscopes can monitor only about six colors simultaneously (or eight sequentially). It is therefore highly desirable to reserve the limited number of fluorescence channels for studying something like gene expression, rather than using them just for cell identification.

Initially, Xi Long used an artificial neural network with a novel preprocessing algorithm. The results exceeded any expectations I had at the beginning: accuracy was in the low 90-percent range, and even cells in clumps could be recognized. With regard to accuracy, this algorithm is quite adequate for many practical applications. However, artificial neural networks are not ideal for biologists as end users, since they can be difficult and time-consuming to optimize. To get around this problem, Xi Long explored the use of support vector machines, which have the advantage of being much easier to optimize than artificial neural networks.
In our study with support vector machines, I asked Xi to reach for an even more ambitious goal, namely the discrimination of unstained viable and nonviable cells in images obtained with brightfield microscopy. To get a practical level of accuracy, it was necessary to develop a new strategy for training support vector machines, which we refer to as CISS, or compensatory iterative sample selection. With this strategy, the results again greatly exceeded expectations. Accuracy was in the low 90-percent range. At this point, we believe that we have an algorithm that is both robust and practical for end users.
Prior to this, were there other methods to detect and count viable cells besides fluorescence or manually counting?
The time-honored method for determining cell viability with a microscope is to use a stain such as nigrosine or trypan blue. As someone with decades of experience observing cells with an ordinary tissue-culture brightfield microscope, I can generally determine if an unstained cell is dead or alive. However, even with extensive experience, a human observer is not sufficiently reliable for rigorous measurements. For example, when I add nigrosine to a culture, I usually find surprises — some cells that looked dead before adding the stain may exclude dye, indicating their viability. Similarly, there may be some cells that take up dye even though they looked viable before adding the stain. Accordingly, published results on cell viability are virtually always dependent on use of a stain.
What do the terms 'support vector machine' and 'CISS' mean?
Starting in the late 1970s, Vladimir Vapnik began development of the so-called support vector machine, or SVM. SVMs are statistical learning machines that use supervised learning, meaning they are trained with a training set that is prepared by an independent means. SVMs classify data samples that are represented as vectors. A vector is simply an ordered set of numbers. A familiar example can be found in high school algebra, where vectors are represented as xy-coordinates in the Cartesian plane. With support vector machines, the vector spaces are abstract, and they can have large, even infinite, numbers of dimensions.
In the case of an SVM — let's say you're trying to solve a binary classification problem, where you have two classes: what you want, and everything else. There is a decision function that separates these two classes in a way that maximizes the margin. The graph of this function in the abstract vector space is a decision surface. The vectors that are closest to the decision surface are called support vectors. SVM technology has evolved over quite some time. It is now very sophisticated and is being used in diverse applications, ranging from face recognition to fraud detection. It's a very general technology.
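The maximum-margin idea he describes can be sketched in a few lines with an off-the-shelf SVM implementation. This is a generic toy illustration, not the group's code: two synthetic point clouds stand in for the two classes, and the trained classifier exposes the support vectors, i.e. the training points nearest the decision surface.

```python
# Toy sketch of a binary, maximum-margin SVM (illustrative data, not the paper's).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated synthetic classes in a 2-D vector space
class_a = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 50 + [1] * 50)

# A linear SVM finds the separating surface that maximizes the margin
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The samples closest to the decision surface are the support vectors
print(len(clf.support_vectors_))
print(clf.predict([[-2.0, -2.0], [2.0, 2.0]]))
```

With cleanly separated clusters like these, only a handful of points end up as support vectors; the rest of the training data could be discarded without changing the decision surface.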
So how have you applied these to cell detection and identification?
In our case, we take images of cultured cells, and generate vectors with what we call pixel patch decomposition. You take a typical image — say 640 by 480 pixels — and move a small rectangle across the image, covering all possible positions. The size of the rectangle is chosen so that it fits around the largest cell that we're dealing with — say 25 by 25 pixels. For each position of the rectangle, we get a vector consisting of the 625 pixel values in that rectangle. We're interested in pixel patches that have a centered cell. If we know that the cell is centered, then we know where the cell is in the image, since we know the position of the center of the rectangle. With this strategy, the classifier is taught to recognize pixel patches that contain a centered cell.
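The sliding-rectangle procedure described above can be sketched directly. This is a hedged illustration of the idea, not the authors' implementation; the function name and the small demo image are my own. Each window position yields one flattened vector of pixel values, and the window's center gives the candidate cell location.

```python
# Sketch of pixel patch decomposition: slide a 25x25 window over a
# grayscale image and collect each patch as a flattened 625-dim vector.
# (Illustrative code; on a full 640x480 frame this yields ~281,000 patches.)
import numpy as np

def pixel_patch_decomposition(image, patch=25):
    h, w = image.shape
    vectors, centers = [], []
    for top in range(h - patch + 1):
        for left in range(w - patch + 1):
            window = image[top:top + patch, left:left + patch]
            vectors.append(window.ravel())                  # 625 pixel values
            centers.append((top + patch // 2, left + patch // 2))
    return np.array(vectors), centers

# Small demo image to keep the example light
img = np.zeros((100, 100), dtype=np.uint8)
vecs, centers = pixel_patch_decomposition(img)
print(vecs.shape)  # (5776, 625): 76 x 76 window positions, 625 pixels each
```

Because each vector is tied to a known window center, classifying a patch as "centered cell" immediately localizes that cell in the original image.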
The pixel patch decomposition technique generates a large number of vectors, about 290,000 in the above example. On the other hand, the number of pixel patches with centered cells might be 100 or so. Essentially, we have a needle-in-a-haystack problem because a large fraction of the pixel patches you get are not what you're looking for. The first step in solving this problem is to reduce the dimensionality of the vectors. It turns out that you don't need 625 dimensions to capture the complexity of the objects being classified. To reduce dimensionality, we use a standard technique — principal component analysis — which reduces the dimensionality to a much lower number, typically 10. The support vector machine is then trained to classify the set of vectors having reduced dimensionality. Fortunately, this is a tractable problem for existing PC hardware.
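The reduce-then-classify step maps cleanly onto standard library calls. The sketch below uses random vectors as stand-ins for real patch data (the labels and data are synthetic, purely to show the pipeline shape): PCA projects each 625-dimensional patch vector down to 10 principal components before the SVM is trained.

```python
# Sketch of dimensionality reduction before SVM training (synthetic data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(1)
patches = rng.random((1000, 625))           # stand-in for real patch vectors
labels = rng.integers(0, 2, size=1000)      # 1 = centered cell, 0 = background

# Principal component analysis: 625 dimensions -> 10
pca = PCA(n_components=10)
reduced = pca.fit_transform(patches)
print(reduced.shape)                        # (1000, 10)

# The SVM is then trained on the reduced vectors, a tractable problem on a PC
clf = SVC(kernel="rbf").fit(reduced, labels)
```

At classification time, new patch vectors would be projected through the same fitted PCA (`pca.transform`) before being passed to the classifier.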
To prepare a training set, the vectors are divided into two classes: viable cells and everything else — needles and the rest of the haystack. One problem that has kept this technology from being widely used is that the full training set is far larger than existing computer hardware can handle. Consequently, you end up randomly choosing a portion of the training set. The problem is that the chosen subset may not be a representative sample. This is especially a problem for the large "haystack" class.
To solve the above problem, we developed the CISS technique. In this technique, you iteratively correct the set chosen for training. If a sample not in the initial training set is misclassified in a test experiment, you use it to replace correctly classified samples in the initial training set and repeat the training procedure. With this procedure we have gotten convergence to quite high accuracy. The CISS procedure is an important contribution from our perspective, since it facilitates the use of SVMs with the highly unbalanced training sets generated with pixel patch decomposition. I think it will be useful for anybody using an SVM with a needle-in-a-haystack problem.
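The iterative replacement loop he describes can be sketched as follows. This is my own minimal reading of the CISS idea, not the published algorithm: the training set is held at a fixed size, and at each round misclassified samples from outside the set are swapped in for correctly classified ones. The function name, the stratified initial draw (to guarantee both classes are present), and the toy data are all assumptions for illustration.

```python
# Hedged sketch of compensatory iterative sample selection (CISS):
# keep a fixed-size training set and iteratively swap in misclassified
# samples for correctly classified ones. Illustrative, not the paper's code.
import numpy as np
from sklearn.svm import SVC

def ciss_train(X, y, train_size=200, rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    # Stratified initial draw so both classes appear (an assumption here)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    idx = np.concatenate([rng.choice(pos, size=10, replace=False),
                          rng.choice(neg, size=train_size - 10, replace=False)])
    clf = SVC(kernel="rbf")
    for _ in range(rounds):
        clf.fit(X[idx], y[idx])
        wrong = np.setdiff1d(np.flatnonzero(clf.predict(X) != y), idx)
        if wrong.size == 0:
            break
        # Replace correctly classified training samples with misclassified ones
        correct = idx[clf.predict(X[idx]) == y[idx]]
        n_swap = min(wrong.size, correct.size)
        drop = rng.choice(correct, size=n_swap, replace=False)
        add = rng.choice(wrong, size=n_swap, replace=False)
        idx = np.union1d(np.setdiff1d(idx, drop), add)
    return clf

# Toy needle-in-a-haystack data: 50 "cells" among 2,000 background samples
rng = np.random.default_rng(2)
neg = rng.normal(0.0, 1.0, size=(2000, 10))
pos = rng.normal(3.0, 0.5, size=(50, 10))
X, y = np.vstack([neg, pos]), np.array([0] * 2000 + [1] * 50)
model = ciss_train(X, y)
print((model.predict(X) == y).mean())
```

The point of the swap step is that the fixed-size training set gradually accumulates the hard, boundary-defining samples from the huge "haystack" class, which a one-shot random draw would likely miss.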
In the paper, you said this method outperformed several commonly used methods. Can you talk about this?
What we said is a little bit misleading, because it implies there are other techniques out there that people are using to do the same thing. We should have said more clearly that, within our own studies, we've compared an SVM plus CISS to an SVM without CISS, and to an artificial neural network. We found CISS to be essential for success when using an SVM. To our knowledge, statistical learning machines have not previously been used to distinguish between unstained viable and nonviable cells.
Can this method be applied to automated fluorescence microscopy, as well, like what is used in high-content screening?
Absolutely. It's precisely what we intend it for. My laboratory is focused on human diseases, and I'm interested in things like schizophrenia and chronic lymphocytic leukemia. We are developing this technology to facilitate molecular studies on cells from people who have these illnesses. Back in 1997, we found that manual devices were absolutely hopeless. The development of an automatic machine has been a major goal since that time. What I envision is not just a robotic microscopy system with some automatic features. Rather, the key word is 'autonomous.' Once you press the start button, the system should run on its own for quite some time before a human operator is needed. Given the extraordinary discriminative power of SVMs, I think we're on the verge of a new era of autonomous systems that will involve fluorescence microscopy and other microscope techniques. A human operator looking into the microscope or even at a computer screen will not be needed in these systems.
To develop an autonomous system based on microscopy, several problems need to be solved. One of the hardest problems is cell identification without the use of stains. Once the needles in the haystack — or cells — are found, we want to use pattern recognition algorithms to recognize sub-cellular features of biological relevance, especially gene expression in real time. If you're doing complex time-lapse experiments, it's possible the machine would benefit from being able to make decisions, so we see decision-making algorithms as being an integral part of autonomous machines. It's not as if this device is extremely futuristic — I see it as a companion development to the autonomous vehicles that recently won the DARPA-sponsored competition. The basic building blocks needed for autonomous systems are now available.
Is there an interest in commercializing this?
Yes, all of what we do is ultimately protected through patent applications, and the Columbia University Science and Technology Ventures office is handling the commercialization. Although we now have NCI funding, the level of funding is only appropriate for proof-of-concept work. At a certain point, bringing this technology to practical devices that can be put in the hands of researchers is going to require industrial partners. Up until recently, it's been premature to pursue that, but we're at a point now where we're poised to move in that direction. We certainly see our device as being valuable in multiple applications: drug discovery, clinical diagnostics, manufacturing of protein therapeutics, and as a generic tool for cell biology research.