NEW YORK (GenomeWeb) – A researcher from Pennsylvania State University has received roughly $209,000 in grant funding from the National Science Foundation to develop novel, freely available computational methods for reconstructing large genomes from sequence data more scalably and efficiently than current methods allow.
According to the grant abstract, the project aims to develop novel algorithms and software that can address current de novo assembly challenges such as "scalability, accuracy, and adaptability to new technology." Specifically, the funds will be used to develop a scalable assembler, a modular assembly framework, and predictive models to guide experimental design and "characterize the relationship between string and de Bruijn graphs, [as well as] the structure of sequencing overlap graphs." The improved tools will enable "previously impractical assembly projects and allow biologists to perform assembly without needing expensive hardware," the abstract states.
While current de novo assembly methods work well for assembling smaller genomes such as bacterial genomes, assembling mammalian and plant genomes is still something of a challenge, Paul Medvedev, principal investigator on the grant and an assistant professor in Penn State's department of computer science and engineering, told GenomeWeb. This is in part because existing methods don't scale well to larger, more complex genomes and can take weeks or months to complete an assembly.
Laboratories with access to supercomputer-scale compute power can speed up de novo assembly of these larger genomes, he said. However, that much power at the ready isn't the norm for most research labs, which typically have much smaller hardware systems in house. Medvedev's goal is to develop algorithms and software that can assemble large genomes on more common lab infrastructure, he said. These tools might make it possible to, for instance, assemble a human genome de novo on a desktop computer, or a 20-billion-base-pair plant genome on a single multicore server with half a terabyte of random access memory.
Improved genome assembly tools would bolster efforts to study plant genomes, for example, in order to develop parasite-resistant strains or cultivate renewable sources of energy, the grant abstract states. These tools could also benefit large-scale genomics-based studies such as the Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative, which aims to detect variants tied to genetic disorders such as Alzheimer's, schizophrenia, autism, and epilepsy, the abstract states.
As part of its development efforts, Medvedev's lab will address memory usage challenges, he said. The researchers will work on methods for breaking up larger tasks into smaller chunks that can be loaded into compute memory individually, as well as methods for running compute tasks in parallel. They'll also leverage existing methods that were developed in collaboration with researchers from Canada's Michael Smith Genome Sciences Center and the Ontario Institute for Cancer Research, and were described in a paper published in the proceedings of last year's Research in Computational Molecular Biology (RECOMB) conference.
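Medvedev did not spell out the partitioning scheme, but the general idea of disk-backed chunking combined with parallel workers can be sketched in rough form. In the toy Python sketch below, the bucket count, hash function, and per-bucket work are illustrative placeholders, not the lab's actual design.

```python
# Toy illustration (not Medvedev's actual method): split a large k-mer set
# into on-disk buckets small enough to fit in memory one at a time, then
# process the buckets in parallel with a worker pool.
import os
from multiprocessing import Pool

K = 31               # k-mer length (placeholder)
NUM_BUCKETS = 64     # number of on-disk chunks (placeholder)

def bucket_of(kmer):
    """Assign a k-mer to a bucket; any cheap, well-spread hash will do."""
    return hash(kmer) % NUM_BUCKETS

def partition_reads(reads_path, workdir):
    """Stream the reads once, writing each k-mer to its bucket file on disk."""
    files = [open(os.path.join(workdir, f"bucket_{i}.txt"), "w")
             for i in range(NUM_BUCKETS)]
    with open(reads_path) as reads:
        for line in reads:
            seq = line.strip()
            for i in range(len(seq) - K + 1):
                kmer = seq[i:i + K]
                files[bucket_of(kmer)].write(kmer + "\n")
    for f in files:
        f.close()

def process_bucket(bucket_path):
    """Load a single bucket into memory and do the per-chunk work
    (here, simply counting distinct k-mers)."""
    counts = {}
    with open(bucket_path) as f:
        for line in f:
            kmer = line.strip()
            counts[kmer] = counts.get(kmer, 0) + 1
    return bucket_path, len(counts)

def run(reads_path, workdir, workers=8):
    """Partition on disk, then process each chunk in parallel."""
    partition_reads(reads_path, workdir)
    buckets = [os.path.join(workdir, f"bucket_{i}.txt")
               for i in range(NUM_BUCKETS)]
    with Pool(workers) as pool:
        return dict(pool.map(process_bucket, buckets))
```

Because each worker only ever holds one bucket in memory, peak memory usage is governed by the largest bucket rather than by the full k-mer set.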
A copy of the paper is available on the preprint server arXiv. It describes a data structure for representing de Bruijn graphs in low memory: k-mer counting software is used to "transform" the input data into a list of k-mers stored on disk, and a low-memory algorithm then uses frequency-based minimizers "to enumerate all maximal simple paths of the de Bruijn graph ... without loading the whole graph in memory." In experiments described in the paper, which tested the Penn State method on human whole-genome and chromosome datasets, the researchers reported a 46 to 60 percent improvement over existing methods.
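The published algorithm is considerably more involved, but the notion of a frequency-based minimizer can be illustrated with a short sketch. In the Python below, the minimizer length, tie-breaking rule, and frequency table are assumptions made for illustration and do not reflect the paper's implementation details.

```python
# Illustrative sketch of frequency-based minimizers: each k-mer is assigned
# to the partition named by its rarest constituent m-mer, so related k-mers
# can be processed together without holding the whole graph in memory.
from collections import Counter

M = 8  # minimizer (m-mer) length -- an assumption for this sketch

def mmer_frequencies(kmers, m=M):
    """Count how often each m-mer occurs across the k-mer set."""
    freq = Counter()
    for kmer in kmers:
        for i in range(len(kmer) - m + 1):
            freq[kmer[i:i + m]] += 1
    return freq

def frequency_minimizer(kmer, freq, m=M):
    """The k-mer's minimizer: its least-frequent m-mer (ties broken
    lexicographically). Ordering m-mers by frequency rather than
    alphabetically tends to spread k-mers more evenly across partitions."""
    mmers = (kmer[i:i + m] for i in range(len(kmer) - m + 1))
    return min(mmers, key=lambda x: (freq[x], x))

# Group k-mers by minimizer; each group can then be handled in its own
# low-memory pass.
kmers = ["ACGTACGTACGTACGTACGTACGTACGTACG",
         "CGTACGTACGTACGTACGTACGTACGTACGT"]
freq = mmer_frequencies(kmers)
groups = {}
for km in kmers:
    groups.setdefault(frequency_minimizer(km, freq), []).append(km)
```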
Other plans include developing methods for quickly and automatically estimating the assembler's input parameters from the input sequences, Medvedev said. Previously, he and a member of his lab developed KmerGenie, an approach for selecting one such parameter, the k-mer size. Full details of the software, as well as applications to various sequencing datasets, were published in Bioinformatics in 2013. The tool uses a sampling method to construct "abundance histograms" and then applies a fast heuristic that estimates the best possible value for the k-mer size from the histograms. As part of this grant, Medvedev will be looking to use a similar approach to estimate other parameters from data, he said.
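KmerGenie's published heuristic fits a statistical model to each abundance histogram; the sketch below replaces that model with a crude abundance cutoff purely to illustrate the histogram-then-score-then-pick workflow, so the cutoff and candidate k values are assumptions rather than the tool's actual behavior.

```python
# Simplified illustration of the KmerGenie idea: for each candidate k, build
# an abundance histogram from (a sample of) the reads, estimate how many
# distinct k-mers look "genomic" rather than erroneous, and pick the k that
# maximizes that estimate. The error cutoff stands in for the real model.
from collections import Counter

def abundance_histogram(reads, k):
    """Count k-mers in the sampled reads, then histogram their abundances
    (abundance -> number of distinct k-mers with that abundance)."""
    counts = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return Counter(counts.values())

def genomic_kmer_estimate(hist, error_cutoff=2):
    """Crude estimate of distinct 'genomic' k-mers: ignore k-mers seen fewer
    than error_cutoff times, which are likely to be sequencing errors."""
    return sum(n for abundance, n in hist.items() if abundance >= error_cutoff)

def pick_k(reads, candidate_ks=range(21, 102, 10)):
    """Choose the candidate k with the largest estimated genomic k-mer count."""
    scores = {k: genomic_kmer_estimate(abundance_histogram(reads, k))
              for k in candidate_ks}
    return max(scores, key=scores.get)
```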