NEW YORK (GenomeWeb) – Researchers from the University of Michigan have found that a popular mathematical model for inferring species boundaries based on genetic data alone can lead to inflated species estimates that are five to 13 times higher than the true numbers. The findings, which the researchers believe could have wide-ranging implications for a number of fields, were released today in the Proceedings of the National Academy of Sciences.
The "species" classification is the fundamental unit for all evolutionary and ecological studies. The model, formally known as the multispecies coalescent model, is widely used among evolutionary biologists to determine species boundaries quickly, without the painstaking process of comparing specimens in museum collections.
"It's been promoted as a way to speed up inventories of biodiversity by combining the automation of genomics with the statistical power of these models," Lacey Knowles, a professor in the UM Department of Ecology and Evolutionary Biology, curator of insects at the university's Museum of Zoology, and co-author on the paper, said in a statement.
"Suddenly it seemed like there was a magic bullet. You just have to push a button and you get your species," Jeet Sukumaran, an assistant research scientist in the UM Department of Ecology and Evolutionary Biology and study co-author, said in a statement.
"The only problem is, this method is not doing what we think it is doing, resulting in an overestimate of species numbers," Knowles added.
To test the accuracy of models such as the multispecies coalescent model, Sukumaran and Knowles generated species trees using a protracted speciation model. "Here we use the protracted speciation model as a generative model that allows us to simulate speciation as an extended process rather than an event, with a lag between initial population isolation or divergence of a lineage from an ancestral species and its development into true species," the researchers wrote in the paper.
The researchers then generated gene trees under the multispecies coalescent model, generated sequence alignments on the gene trees for each locus, and inferred species trees from the sequence data. Finally, they compared the inferred number of species with the actual number in each set of generated species trees.
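The core idea behind that comparison can be illustrated with a toy simulation. The sketch below is not the researchers' pipeline: it skips gene trees, sequence data, and species-tree inference entirely, and the function name, rates, and time span are hypothetical choices made only for illustration. It assumes a simplified protracted process with no extinction, in which every lineage buds off isolated "incipient" populations at one rate and each incipient population becomes a true species at a second, slower rate. Counting every isolated lineage as a species then gives a larger number than counting only completed species.

```python
import random


def simulate_protracted(split_rate, completion_rate, duration, seed=None):
    """Toy protracted-speciation process (illustrative assumption, no extinction).

    Returns (true_species, isolated_lineages) at the end of `duration`.
    """
    if split_rate <= 0:
        raise ValueError("split_rate must be positive")
    rng = random.Random(seed)
    good = 1        # lineages that have completed speciation (true species)
    incipient = 0   # isolated populations that are not yet true species
    t = 0.0
    while True:
        # Every lineage can bud off a new isolated population.
        total_split = split_rate * (good + incipient)
        # Only incipient populations can complete speciation.
        total_complete = completion_rate * incipient
        total = total_split + total_complete
        # Waiting time to the next event (Gillespie-style step).
        t += rng.expovariate(total)
        if t > duration:
            break
        if rng.random() < total_split / total:
            incipient += 1   # a new isolated population appears
        else:
            incipient -= 1   # an incipient population becomes a true species
            good += 1
    return good, good + incipient


if __name__ == "__main__":
    species, lineages = simulate_protracted(
        split_rate=1.0, completion_rate=0.2, duration=4.0, seed=1
    )
    print("true species:", species)
    print("isolated lineages:", lineages)
    print("ratio if every lineage is counted as a species:", lineages / species)
```

In this toy setup, the slower the completion rate relative to the splitting rate, the wider the gap between the two counts, which mirrors the lag between population isolation and true speciation that the protracted model is meant to capture.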
After completing their analysis, Knowles and Sukumaran determined that the multispecies coalescent model diagnoses genetic structure within a population, not species. Consequently, the model cannot accurately estimate the number of species.
"The overinflation of species due to the misidentification of general genetic structure for species boundaries has profound implications for our understanding of the generation and dynamics of biodiversity, as any ecological or evolutionary study that rely on species as their fundamental units will be impacted, as well as the very existence of this biodiversity, as conservation planning is undermined due to isolated populations incorrectly being treated as distinct species," they wrote.
"Everyone knows that speciation is not an instantaneous process. But what no one has questioned, until now, is how ignoring that fact changes the story this model is telling us," Sukumaran said. "This paper places that issue front and center."
"The irony is that the more genomic data we collect, the less certain we are as to where the species boundaries lie," Knowles said. "Going forward, we are going to need to both improve our models and fall back on alternate — and maybe even more traditional — forms of data to be able to identify species in the age of big data."