A collaboration among British researchers has produced the first de novo assembly software capable of assembling multiple eukaryotic genomes simultaneously. The team includes researchers from the European Molecular Biology Laboratory-European Bioinformatics Institute, the University of Oxford, and The Genome Analysis Centre, all in the United Kingdom.
Called Cortex, the new software has already been used to jointly assemble more than 150 genomes from the 1000 Genomes Project, demonstrating that each individual has roughly 1.4 million DNA bases that differ from the reference genome. According to Zamin Iqbal, a postdoctoral researcher at the Wellcome Trust Centre for Human Genetics at Oxford, the impetus for developing Cortex was the large amount of memory (hundreds of gigabytes or even terabytes) that standard assemblers typically use when processing next-generation sequence data.
"We were sure that a careful and efficient design could reduce this overhead, which made assembly of large genomes almost impossible — certainly no one was thinking about assembly of more than one eukaryote genome, as they could only barely do one," Iqbal says. "However, we were able to make dramatic improvements, which opened up possibilities for not just looking at two or three genomes, but hundreds. Once you open the door to simultaneous assembly of many individuals, you can bring to bear the full power of population genetics into assembly. We show in our paper how effectively this can be used to get a good call set, even when you don't have a reference genome for your species." The team's paper was published online in Nature Genetics in January.
The Cortex team hopes its software will open up several new avenues of research, including the analysis of microbes and pathogens such as methicillin-resistant Staphylococcus aureus and Escherichia coli. "Suppose you have a longitudinal study of blood samples from a patient with a disease caused by a pathogen. It's becoming more common to use sequencing to study evolution of the pathogen in the host," Iqbal says. "However, often there is either no good reference genome, or it is reasonably diverged from your samples. So what you really want to do is just compare your samples directly and watch the mutations appearing. Cortex makes this easy. You compare samples directly, without using a reference as an intermediate."
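The idea of comparing samples directly, with no reference in between, can be illustrated with a toy sketch. The Python below is not Cortex's algorithm or interface. It assumes each sample is a small list of error-free reads and compares the sets of short substrings (k-mers) found in each. The function names, the choice of k = 5, and the example reads are all hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of one read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_kmers(reads, k):
    """Pool the k-mers from every read in a sample into one set."""
    pooled = set()
    for read in reads:
        pooled |= kmers(read, k)
    return pooled


def private_kmers(sample_a, sample_b, k=5):
    """Return (k-mers only in A, k-mers only in B); no reference involved."""
    a = sample_kmers(sample_a, k)
    b = sample_kmers(sample_b, k)
    return a - b, b - a


# Hypothetical reads from two time points of a longitudinal study.
early = ["ACGTACGTGA"]
late = ["ACGTACCTGA"]  # differs from the early read by one substitution (G -> C)

only_early, only_late = private_kmers(early, late, k=5)
print(sorted(only_early))  # k-mers spanning the original base
print(sorted(only_late))   # k-mers spanning the mutated base
```

In this toy case a single substitution changes every k-mer that overlaps it, so the k-mers private to each sample cluster around the mutated position. A real tool would also have to handle sequencing errors, coverage, and memory use, which this sketch ignores.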