
Phylogenetics Researchers Push for Improved Data Archiving to Plug Holes in Evolutionary Trees


In a new study published this week in PLoS Biology, researchers from the University of Florida and other institutions call for journals, funding agencies, and researchers to adopt more rigorous archiving policies for phylogenetic data, arguing that failure to do so will foster the continued loss of data crucial to evolutionary research studies.

They contend that the current practice of archiving raw DNA sequences alone, a requirement imposed by many scientific journals and funding agencies, is insufficient for phylogenetic studies and detrimental to future research in the field. Scientists also need access to DNA sequence alignments and the resulting phylogenetic trees because these data are "pivotal for reproducibility, comparative purposes, meta-analyses, and ultimately synthesis," the researchers wrote.

The study grew out of the researchers' participation in the Open Tree of Life, an ongoing project funded by the National Science Foundation to use genetic sequences to construct a comprehensive phylogenetic tree linking all 1.9 million named species. Douglas Soltis, a professor in the Florida Museum of Natural History and UF's biology department and co-author on the paper, explained that the tree is built by stitching together multiple smaller trees generated for different species.

Soltis' team, which works on plants, was combing previously published literature to find building blocks for a plant phylogenetic tree when they realized that many of these smaller trees had never been deposited publicly. This was also true for organisms other than plants, he told BioInform. They also found that the alignment data used to generate the trees hadn’t been deposited either. That means that researchers can’t repeat the analysis themselves and try to regenerate the missing trees, he explained.

Missing alignments also make it difficult to "assess a publication's validity," according to Bryan Drew, the study's lead author and a postdoctoral researcher in UF's biology department. "There are ambiguities with the alignments, you have to make certain judgment calls, and so an alignment that I do is not going to be the same as an alignment that somebody else does."

Based on their analysis, the researchers estimate that approximately 70 percent of underlying phylogenetic data — sequence alignments and trees — produced in the last 12 years are no longer accessible. They came to this conclusion after examining over 7,500 peer-reviewed papers about animals, plants, fungi, bacteria, and more published between 2000 and 2012 in more than 100 journals. They found that only about 17 percent of these studies provided alignment and tree data. Efforts to obtain data directly from the 375 authors of these studies succeeded only 16 percent of the time, with most researchers simply not responding to requests for the added data, Soltis told BioInform.

The researchers also report that they evaluated 100 publications that used the Bayesian Evolutionary Analysis Sampling Trees, or BEAST, analysis package, a tool used to "obtain divergence times and phylogenies," and found that only 11 studies provided access to the XML input files needed to reproduce BEAST's results.

Possible reasons for the failure to submit these additional data, according to the PLoS Biology study, include complicated mechanisms for uploading data as well as varied, unclear, or poorly enforced data archiving practices. For instance, while most systematics and evolution journals don’t require researchers to submit sequence alignments or phylogenetic trees, about 35 journals have policies that encourage or require authors to deposit these data in resources like TreeBASE and Dryad, although enforcement of these policies is "generally lax," according to the paper.

Meanwhile, although funding agencies like the National Science Foundation require that grant proposals contain data management plans, "explicit requirements regarding post-publication data archiving are lacking, and there is little if any post-funding oversight into data archiving practices," the researchers wrote. Furthermore, because preparing and uploading data into public resources can be time consuming and labor intensive, researchers may be unwilling to repeat the process for data not required upfront for publication. There are also legitimate concerns about source attribution. The paper notes that many researchers may be wary of making their data available for fear that it will be reused without proper acknowledgement.

"This [problem] is not just limited to biology and phylogenetic trees," Soltis said. "There is a message here for all science. We generate a lot of data but we don't necessarily archive [it] really well." But "in this modern age of informatics and computer-driven research … all of these data can be useful in new ways [and] I think we are going to find that a lot of different disciplines were missing a lot more data than we realized," he said.

Soltis and his colleagues suggest some potential solutions in their paper. For example, they propose that scientific journals implement and enforce policies that require researchers to deposit data in public repositories. These depositions should "include program input files for popular programs such as BEAST, as well as any other relevant information needed to replicate the study," they said. "Optimally, all peer-reviewed journals that publish phylogenetic datasets should require deposition — and activation for public access — of alignments and trees prior to publication, and these trees and alignments will include the same characters and taxa — and taxon names — as in the published study."

They also suggest devising a new "data deposition metric" that could be used to "confer prestige to well-published and well-archived authors" and making data archiving a required component of grant proposals.

Finally, they've begun discussions with the NSF focused on how funding agencies could incentivize researchers to deposit more of their data, Soltis said. One potential solution would be to require researchers to provide explicit details about data archiving efforts as part of their annual and project reports, he said.

These efforts "could be quantified and rewarded by reporting previously archived data as part of new grant proposals," the researchers wrote. They also suggest that funding agencies make data archiving a required part of data management plans in grant proposals.
