Twenty-five. Using traditional, Linnaean classification techniques based on plant morphology, scientists identified 25 species in the genus Psiguria in 1916.
Six. Scientists have recently whittled that 1916 estimate down by incorporating genomic information, among other things, and have identified only six species in the genus.
Psiguria has a snarled taxonomic history in part because its leaf morphology changes drastically throughout its life cycle. Further, its species have monoecious, but temporally separated, male and female flowers and, as vines, thrive in different parts of the neotropical canopy.
"It's a very confusing group of plants to work on," says the University of Nebraska at Omaha's Roxanne Kellar, who has published much of her work as Roxanne Steele. Still, she chose Psiguria for her PhD work.
Kellar began her graduate training in 2004, learning classical taxonomic techniques. "When I started my work, it became obvious very quickly that many people were, in addition to the classical morphological characteristics, also using characters that can be found in DNA or genome sequences to distinguish species," she recalls. "So I decided to do both. I wanted to find out, based on both the morphological characteristics and the genomic characteristics: How many species are there?"
Consistent with others' revised estimates, "by combining those two pieces of evidence I came up with only six different species that I felt comfortable distinguishing," Kellar says.
Like many of her colleagues who had taken the molecular plunge, Kellar first used chloroplast intergenic spacers and the intron of a low-copy nuclear gene as markers to estimate phylogeny. She and her colleagues also identified Psiguria-specific DNA barcodes, which they then used to delineate species in work that appeared in the American Journal of Botany in 2009.
It was as a postdoc studying the plant order Asparagales that Kellar was introduced to next-generation sequencing. "Rather than using the one-, two-, or five-gene sequence regions that had commonly been used," she says, "my goal was to sequence the entire plastid genome to estimate the phylogeny."
Today, Kellar is using next-generation sequencing for species identification and phylogenetic analyses.
Learning to use any new technology can be tough. But, as Kellar says, getting a handle on next-generation sequencing for use on little-studied organisms — particularly those for which there is little to no existing genomic information — can be especially taxing. "The biggest challenge," she says, "is knowing whether you have the right answer."
But as the cost-per-basepair threshold to justify the transition to sequencing continues to fall, more and more researchers are diving in head first, opting to chart the complete transcriptome or even the whole genome of their chosen organisms.
"It's a real shift in how you do science from working with small datasets that you sequence on a PCR machine to getting gigabytes or terabytes of data from an Illumina run and not being able to use the same methods," says Karen Cranston, the bioinformatics project manager at the National Evolutionary Synthesis Center in Durham, NC.
She adds that "from a phylogenetics perspective, the rate of whole-genome sequencing is definitely increasing, but there's no way that we can incorporate all of the species we have into [not only] phylogenetics studies, but also into all sorts of questions that people have about biology just by looking at model organisms and the pipelines and procedures that people have used for model organisms." To fully realize the potential of phylogenomics, "we need to be able to look much more broadly across many, many species to ask the kind of biological questions we want to ask," she says.
Increasingly, researchers are looking to enhance the tree of life with genomic information for non-model organisms.
"There are millions of species on the earth today, and there are probably 1,000 extinct taxa for every one that's alive today. So, if you are interested in the history of life on earth, evolutionary changes that have occurred among living creatures, the origins of all the systems and all the features, all the biochemistry, all the anatomy, all that stuff, they're all there because of the history of the creatures," says Ward Wheeler, curator of invertebrate zoology at New York City's American Museum of Natural History. "If you want to understand that stuff, the model organisms are kind of irrelevant. It's the non-model ones that matter because that's where all the diversity is."
Markus Pfenninger, a professor of molecular ecology at the Biodiversity and Climate Research Center in Frankfurt, Germany, says that because "evolution is highly idiosyncratic," in order to move the field forward, "there is no way around next-generation sequencing ... in non-model organisms."
Today, phylogenomics research is seldom stalled by an inability to sequence, or even assemble, transcriptomes and genomes for non-model organisms. Rather, it is more typically slowed by the difficulty of comparing them directly.
"Surprisingly, what we have learned is that we don't always need a genome sequence or genetic sequence for an organism that is closely related in order to sequence non-model organisms," Nebraska's Kellar says. "However, they can make the bioinformatics part of the process a little cleaner."
Whether reference-based or de novo, genome assembly is still a challenge, though a tractable one. Experts expect that many of the current assembly issues will be abated as new machines — producing longer, more accurate reads — come online. As the University of Georgia's Travis Glenn puts it: "If [a new machine] can deliver long reads, even if they are not terribly accurate, then that gives us the opportunity to do these sort of 'Aha!'-type pipelines."
Benedict Paten from the University of California, Santa Cruz, adds that longer reads will be a game-changer. Read lengths on the order of thousands or even tens of thousands of bases, he says, "will radically change what we can do with the data. ... For non-model organisms, it's actually going to let us get much more complete genomes, as in much fewer contigs and scaffolds."
Assembly problems aside, "the main issue right now is: How to use, extract the comparative information from these complete genomes? And that's lagging behind the computational knowledge to assemble them," AMNH's Wheeler says.
In part, that is because it can be tough to tell where to look.
"What most people would do if they had a bunch of whole genomes from a bunch of metazoa right now is ... computationally extract the expressed segments and analyze those. Because the homology relationships are clear, the comparative relationships are clear," Wheeler says. However, he adds, "we don't have a good handle on how to deal with large amount of repetitive DNA or duplicated sections."
NESCent's Cranston says it can be especially difficult to know where to look in transcriptomes and genomes for non-model organisms. "Certainly one of the challenges ... is going [in] and asking: Which of those bag of genes that we sequence are appropriate for use in phylogenetic analyses?" she says. "It's much more challenging when we don't have a closely related genome."
And because most genes are not lone copies, duplications can be duplicitous — leading researchers to believe they have found elements related by evolutionary ancestry, when in fact they may not have. "When we have either incomplete sequencing or if we don't have a model organism to compare to, we can often — incorrectly — say that if we had copies A and B, we're putting together an A in one organism and a B in another, which can really confuse the algorithms we have for building trees," Cranston says.
Georgia's Glenn has his comparative analysis sights set on ultra-conserved elements, or UCEs — identical stretches of DNA in syntenic locations that are shared by at least two genomes. He and his colleagues have developed a sequence capture approach to enable the targeted enrichment and sequencing of thousands of orthologous loci across species, which they presented in a January Systematic Biology paper.
"Because many organismal lineages have UCEs, this type of genetic marker and the analytical framework we outline can be applied across the tree of life, potentially reshaping our understanding of phylogeny at many taxonomic levels," Glenn et al. wrote.
Then in a Biology Letters paper published in May, Glenn and his colleagues reported having applied this sequence capture approach to resolve the phylogenetic position of turtles based on sequences from UCEs.
Glenn says his team's sequence capture approach is versatile and can be adapted to suit a variety of studies. "You can develop one set of probes that works for up to thousands of different species," he says. "For example, the group we've worked on the most are amniotes — birds, mammals, reptiles. For those 25,000 species, we have one set of probes that works pretty well across all of those species."
Beyond phylogenomics, such probes might also be useful for population genetics studies, Glenn adds. "Using these UCEs, you can directly compare what's the heterozygosity, or what's the genetic diversity ... in humans versus alligators," he says.
Branch by branch
With so many facets — from the morphological to the molecular — to consider when comparing species, unraveling the labyrinthine limbs of the tree of life seems onerous.
Compared with assembling and comparatively analyzing genomes, "the problems of actually reconstructing phylogenetic trees from sequence data are vastly more difficult," AMNH's Wheeler says.
But that has not stopped NESCent's Cranston and her colleagues from trying to pull together a synthesized view of the evolutionary tree of all species through the US National Science Foundation-funded Open Tree of Life initiative.
"My interest is definitely both in constructing large phylogenies, but increasingly on the informatics side, [thinking about] how we best share evolutionary data and how we can use the tree of life as sort of a framework for organizing all the data we have about biodiversity," she says, adding that "the phylogenetics end of it is sort of lagging behind the genomics."
And more genomes are on the way. With large-scale initiatives like the Earth Microbiome Project, BGI's 1,000 Plant and Animal Genome Project, and the Genome 10K project out of Santa Cruz under way, producing sequence data for diverse species will surely not be nearly as difficult as placing those species on the tree.
UCSC's Paten points to the bird group within Genome 10K, which has sequenced more than 60 species to date. "They're currently really struggling with how to firm up the phylogeny of birds," Paten says.
"Without next-generation sequencing, I don't think that they could hope to fully resolve such a complicated phylogeny," he says. But having so many sequences is both a blessing and a curse, Paten adds, "because as you get more species, or if you are concerned with a phylogeny that involves a lot of species, then there are many, many possible trees."
Wheeler's lab at AMNH is tackling the tree problem with compute power. "We develop algorithms and software tools and mathematical techniques to aid in the reconstruction of trees from diverse sources of information — from morphological anatomy information to DNA sequences to synteny to chromosomal information, entire chromosome information, all sorts of things," he says.
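To make the reconstruction problem concrete, here is a minimal Python sketch of one classic scoring criterion, parsimony, computed with Fitch's algorithm on a fixed tree. It is not the software Wheeler's lab develops. The four-taxon alignment, the taxon names, and the function names are all invented for illustration.

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a leaf name or a pair of subtrees (hypothetical representation).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_site(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (candidate ancestral states, minimum changes) for one alignment column."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed here.
        return shared, left_cost + right_cost
    # Children disagree: one change is required on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree: Tree, alignment: Dict[str, str]) -> int:
    """Sum the per-column Fitch costs over the whole alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Toy data, invented for this example.
alignment = {"A": "AAGT", "B": "AAGC", "C": "CTGC", "D": "CTGT"}

# The three possible unrooted topologies for four taxa, written as rooted pairs.
candidates = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

scores = {name: parsimony_score(t, alignment) for name, t in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

With four taxa there are only three topologies, so every one can be scored and the lowest-cost tree picked outright. That exhaustive approach stops being feasible very quickly as taxa are added.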
But optimizing sequences on a tree and finding the best-scoring tree under selected criteria "are both within the class of the most difficult known computational problems; they are NP-hard optimizations," Wheeler adds. "Those problems are intractable in the sense [that] we won't get the exact solution, but we have reasonably well-behaved heuristics to help us get answers that we feel are pretty good. Those are two vastly difficult problems, and they are explosive with the number of creatures we have."
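The explosion Wheeler describes can be illustrated by counting topologies. For n labeled taxa, the number of distinct unrooted, fully bifurcating trees is (2n - 5)!!, the product of the odd numbers from 3 up to 2n - 5. The short Python sketch below computes that count; the function name is illustrative only.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted, fully bifurcating topologies for n labeled taxa."""
    if n_taxa < 3:
        return 1  # one or two taxa admit only a single trivial tree
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 60):
    print(n, count_unrooted_trees(n))
```

Four taxa give 3 possible trees, five give 15, and ten already give 2,027,025, which is why exact searches give way to heuristics long before a dataset reaches dozens of species.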
The question of whether researchers will be able to scale their efforts to keep pace with this surge of genomic data looms.
"Previously, when we had three or four genomes, it was not such a big deal," Paten says. "When you have 60 or 100 or 10,000, just actually running everything gets difficult."
Going forward, he adds, "just being able to handle that huge fire hose of data and put it in databases and transmit it around the Internet — those are all really difficult technical questions that we don't have the infrastructure to handle today."
On the other hand, more data will mean more opportunities for research. "When we have thousands of genomes aligned in these databases, there will be sufficient changes for us to be able to identify their individual DNA-binding sites, and understand their evolutionary history in a way that we can't today," Paten says. He adds that, given "orders of magnitude more data, the kind of questions we can ask will change."
NESCent's Cranston points to metagenomics as an example of how more data can power investigations that were hardly conceivable even five years ago.
"That's an area where you have rapid, rapid species discovery in areas of the tree where most of our biodiversity is," she says. "Now we have methods for people to sequence what's in a bucket of seawater and then match those sequences against reference trees and place those without having to be able to culture them in the lab or assign names to them."
It's a new-school approach that Cranston says could not have been realized without next-generation sequencing. "I think it's going to grow tremendously with NGS technologies in a very different way than the eukaryotic biodiversity has," she adds. "That's one of the most interesting areas for us."