In Philip K. Dick’s complex science-fiction novel “Valis,” information is a living substance found in every person. For that reason, and because they envision a genomics software environment that behaves as a “vast active living intelligent system,” Bud Mishra’s team at New York University named its project VALIS.
It has nothing to do with all that alien weirdness in the novel, Mishra insists, yet the DOE-funded undertaking aims to bridge two worlds. Mishra, a computer scientist, has studied how biologists work. “You learn a lot of things from watching how they go about using software,” he says. He wants to build a system in which biologists can construct and modify programs in a language that is easy and natural for them to use. The VALIS environment will accommodate all stages of biologists’ work, from conceiving an experiment to reporting the results.
VALIS sprang from NYU’s optical mapping project. Current sequencing techniques tackle DNA in stretches of only about 750 base-pairs at a time. Information about the original location of segments is generally lost when DNA is separated into short pieces. To put Humpty Dumpty back together again, scientists need a reference map with known landmarks.
The optical mapping method, a single-molecule approach developed by David Schwartz, who is now at the University of Wisconsin, involves “stretching out” individual DNA molecules, binding them to glass surfaces, and slicing them with restriction enzymes that cut at particular recognition sequences. The DNA molecule, which remains fixed with its segments in their original order, is long enough to be imaged with an optical microscope. Fluorescent stain makes the cut sites stand out, and the image is recorded digitally.
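The idea of an ordered restriction map can be illustrated with a short sketch. This is not the NYU or Wisconsin software. The function name `restriction_map` and its `cut_offset` parameter are hypothetical, and the sketch works on a known sequence string, whereas optical mapping measures fragment sizes from images of real molecules. It simply shows what information the map carries: fragment lengths, kept in their original order.

```python
def restriction_map(sequence, site, cut_offset=0):
    """Return fragment lengths, in original order, after cutting at each `site`.

    `cut_offset` is where the cut falls inside the recognition site
    (an illustrative parameter; 1 models EcoRI, which cuts G^AATTC).
    """
    cuts = []
    start = 0
    while True:
        idx = sequence.find(site, start)
        if idx == -1:
            break
        cuts.append(idx + cut_offset)
        start = idx + 1  # keep scanning just past this match
    # Fragment boundaries are the molecule ends plus every cut position.
    boundaries = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(boundaries, boundaries[1:])]


if __name__ == "__main__":
    toy_dna = "AAGAATTCTTTTGAATTCCC"  # made-up sequence with two EcoRI sites
    print(restriction_map(toy_dna, "GAATTC", cut_offset=1))
```

Because the fragments come back in order, the list works as a set of landmarks: a sequenced piece whose own cut pattern matches part of the list can be placed at that position.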
The computational aspect of the technique is a challenge because there are several unavoidable error sources. Non-uniform extension of the DNA molecules yields sizing errors. Incomplete “digestion” of the DNA by the enzymes results in missing landmarks. Random breakage of the molecules leaves apparent landmarks where there shouldn’t be any. Mishra and colleague Thomas Anantharaman collaborated with Schwartz in developing analysis tools to deal with these problems and automate the optical mapping process. The tools included Gentig, a statistical algorithm using Bayesian inference, and the “Contig Visualization and Exploration tool” or ConVEx.
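To make the three error sources concrete, here is a toy noise model. It is not Gentig and performs no Bayesian inference. The function `simulate_observed_map` and its parameters (`digest_rate`, `n_false_cuts`, `sizing_cv`) are assumptions chosen for illustration, as are the uniform and Gaussian distributions used.

```python
import random


def simulate_observed_map(true_cuts, molecule_length, digest_rate=0.8,
                          n_false_cuts=1, sizing_cv=0.05, seed=None):
    """Return noisy fragment sizes for one molecule (toy model, all assumed)."""
    rng = random.Random(seed)
    # Incomplete digestion: each true cut site shows up only with
    # probability `digest_rate`, so some landmarks go missing.
    cuts = [c for c in true_cuts if rng.random() < digest_rate]
    # Random breakage: spurious cuts at uniformly random positions.
    cuts += [rng.uniform(0, molecule_length) for _ in range(n_false_cuts)]
    cuts.sort()
    boundaries = [0.0] + cuts + [float(molecule_length)]
    fragments = [b - a for a, b in zip(boundaries, boundaries[1:])]
    # Non-uniform extension: each measured size is scaled by a random factor.
    return [max(0.0, f * rng.gauss(1.0, sizing_cv)) for f in fragments]


if __name__ == "__main__":
    # Same underlying map, three independently "imaged" molecules.
    for molecule in range(3):
        print(simulate_observed_map([3000, 13000], 20000, seed=molecule))
```

Each simulated molecule gives a slightly different and possibly wrong map. Analysis tools of the kind described above are needed because a consensus has to be inferred from many such imperfect observations.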
Optical maps have been instrumental to sequencing efforts at some labs such as TIGR’s. But sequencing software toolkits were not designed to interface with mapping data, and require retrofitting to accommodate them. VALIS is intended to provide the “glue” to integrate the tools and algorithms.
VALIS will also provide a simulation environment in which a scientist can plan wet-lab experiments, check an experiment’s logic before performing it, and analyze the results. A library of probabilistic and statistical parameters representing biochemical processes will allow the results of some experimental steps to be predicted, so that unpromising procedures can be bypassed.
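As a rough illustration of predicting an experimental step from a probabilistic parameter, the sketch below asks how many molecules must be imaged before a given cut site is likely to be seen often enough. It is not VALIS code or VALIS syntax. The function names, the assumption that each molecule is cut independently with a fixed probability, and the 95 percent target are all hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability that a site is cut in at least k of n molecules,
    assuming each molecule is cut independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def molecules_needed(k, p, target=0.95, max_n=1000):
    """Smallest n for which prob_at_least(k, n, p) reaches `target`,
    or None if no n up to `max_n` is enough."""
    for n in range(k, max_n + 1):
        if prob_at_least(k, n, p) >= target:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical plan: want the site seen in at least 10 molecules
    # when the digestion rate is assumed to be 80 percent.
    print(molecules_needed(k=10, p=0.8))
```

A calculation like this, run before the wet-lab work, is one simple way a planned procedure could be flagged as unpromising, for example when the required number of molecules is impractically large.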
As a proof of concept, Mishra and his team implemented a skeletal design with language elements suitable for optical mapping data. Next they will work on expanding the language to accommodate DNA chip applications.
An upcoming meeting with biologists will test the system. “We will find out what we were doing right and what we were doing wrong,” Mishra says. Much of the prototype code will probably be thrown out, and the rate of progress depends on funding resources. But he predicts that formal release is a year or two away.