Researchers at Pennsylvania State University have used 454’s sequencing technology and new analysis tools to study genetic diversity in endangered species without the use of a reference genome, In Sequence has learned.
As their first example, they have chosen the Tasmanian devil, which is under threat of extinction from a deadly infectious facial tumor.
Under a $1 million grant from the Gordon and Betty Moore Foundation, Penn State scientists led by Stephan Schuster, a professor of biochemistry and molecular biology, have begun to generate low-coverage 454 genome sequence data from two individuals of the Australian marsupial, whose scientific name is Sarcophilus harrisii.
Using new computational tools that do not require a reference genome, the team discovered SNPs that they hope will help their Australian colleagues select cancer-resistant devils for a breeding program.
“We have the means, by just comparing reads to one another, to do quite successful SNP detection that we then validate on SNP arrays,” said Schuster.
The project is somewhat unusual in that it aims to discover SNPs in a species whose genome has not been sequenced, and for which no closely related reference genome is available for alignment.
“The idea is we know nothing about the genome of this species, and we want to sequence and just throw the reads into some software and watch what comes out, and stop the sequencing when it’s good enough,” explained Webb Miller, a professor of biology and computer science and engineering at Penn State and inventor of the software used in the study.
Miller’s analytical tools, which were developed for this purpose, build on software he wrote to analyze sequence data from the woolly mammoth, an analysis he and his team published two weeks ago in Science (see Short Reads in this issue). For that project, however, the researchers had a low-coverage assembly of the African elephant genome available for comparison, which made it “a little easier,” according to Miller.
Miller, who for almost two decades has been developing analytical approaches for comparing DNA sequences, said the software uses roughly 400-base-pair 454 reads from each of two individuals from a species and provides SNP calls and approximately 200 base pairs of flanking sequence on either side. No genome assembly is involved other than micro-assemblies of small clusters of overlapping reads, he said.
Though he said he is not yet ready to reveal details of how the software works, Miller mentioned that one challenge has been the lack of information about certain families of repetitive elements in the devil’s genome.
“We had no knowledge ahead of time of the recently expanded repeat families in this animal, so you need to figure these out and mask them as you go,” he said.
The accuracy of the SNP calls increases with sequence coverage, and the researchers do not know ahead of time how much sequence data they will require.
“The idea is that you are going to start with some sense of how many SNPs you need and of what quality. And then, as you sequence, this code is going to be working away, and you can stop the sequencing when the SNPs are good enough,” he said.
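Miller has not disclosed how his software works, so the following is only a minimal sketch of the stopping rule he describes, not his implementation. It assumes that after each sequencing run the candidate SNPs are re-scored, and that sequencing can stop once enough of them pass a quality cutoff. The function name, the quality scores, and the thresholds are all hypothetical.

```python
from typing import Iterable, List, Tuple


def runs_until_good_enough(
    runs: Iterable[List[float]],
    min_quality: float,
    target_count: int,
) -> Tuple[int, int]:
    """Return (runs used, confident SNPs) at the point where sequencing could stop.

    Each element of `runs` is a hypothetical list of quality scores for all
    SNP candidates called from the data accumulated up to that run.
    If the target is never reached, the totals after the last run are returned.
    """
    used = 0
    confident = 0
    for scores in runs:
        used += 1
        # Count candidates whose score meets the quality cutoff.
        confident = sum(1 for s in scores if s >= min_quality)
        if confident >= target_count:
            break  # enough good SNPs: stop sequencing
    return used, confident


# Made-up scores: more candidates pass the cutoff as coverage accumulates.
example_runs = [
    [0.5, 0.9],
    [0.6, 0.95, 0.92],
    [0.9, 0.95, 0.92, 0.97],
]
print(runs_until_good_enough(example_runs, min_quality=0.9, target_count=3))
```

In this toy example the target of three confident SNPs is first met after the third run. In practice, the number of SNPs needed and the quality cutoff would be set by the downstream genotyping study.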
The software, which Miller expects to reveal in more detail next summer, is currently optimized for the 454 sequencing platform and depends on the relatively long reads of about 400 base pairs that the new Titanium chemistry provides. “I would not be happy with 35-base sequences,” Miller said.
So far, Schuster and Miller have sequenced two Tasmanian devils at less than one-fold coverage and have determined a first round of SNP markers, which they recently sent to their Australian collaborator Vanessa Hayes, a group leader at the Children’s Cancer Institute Australia in Randwick, just south of Sydney.
One of the devils, named Cedric, came to fame earlier this year when, after being inoculated with a tumor vaccine, it became immune to the facial cancer that has decimated the Tasmanian devil population over the last decade or so.
Hayes and her colleagues are now determining the allele frequencies and distribution of a set of 96 SNPs in 80 devils from across Tasmania, an island off the southern coast of Australia. Later on, they want to screen larger numbers of SNPs, Hayes told In Sequence by e-mail, and select “informative markers” that may help researchers selectively breed the animals.
Details in the Devil
The idea of this first part of the sequence project was to provide genotyping data quickly, prior to the start of the new breeding season early next year, Schuster said.
However, according to the Moore Foundation grant abstract, the bigger aim is to generate a draft version of the Tasmanian devil genome. Hayes approached Schuster about the possibility of such a project about a year ago.
“As a cancer geneticist, I am particularly interested in understanding the genetics of this unusual cancer and I believe that the way to move our understanding forward at the most rapid pace possible is via whole-genome sequencing,” she said.
The grant provides funding for 8-fold coverage of the devil genome, Schuster said, which he and his colleagues will likely distribute between the two individuals. Long-range paired-end 454 reads spanning 20 kilobases will help them build scaffolds of the genome. In addition, Schuster and his team are looking into using a short-read sequencing technology to validate the SNPs they find, though they have not decided yet which one.
Schuster is hoping to analyze other endangered species in the future in a similar manner, and believes that funding agencies are more interested in such projects than in sequencing extinct species, such as the mammoth. “We can see that the interest in this approach is growing,” he said.