Researchers Exploring De Novo Assembly Methods to Create Reference 'Super-Centenarian' Genome

By Andrea Anderson

Researchers from the Netherlands and the US are working to generate a de novo genome assembly based on SOLiD short-read sequence data from samples donated by a Dutch woman who lived to be 115 years old.

Those involved in the study intend to use the genome sequence to look for genetic clues about how the woman, dubbed W115 for the study, lived so long and stayed in good health for most of her life.

They also believe her genome might serve as a useful reference sequence in other studies of longevity, healthy aging, or age-related disease. To that end, the team has opted to do a de novo assembly of the genome rather than simply aligning it against the standard human reference genome, Hg19.

"If you compare a genome, or whatever sequence you have, to the common reference, to the Hg19 reference, there are always a lot of elements in there that correlate with diseases," Henne Holstege, a clinical geneticist at the Free University Amsterdam, told In Sequence. "For example, lots of the SNPs that you find in [genome-wide association] studies are also found in the Hg19 reference."

"We wanted to make a completely new sequence, so that that would not be a problem," said Holstege, who presented the project at the International Congress of Human Genetics/American Society of Human Genetics meeting in Montreal earlier this month.

Duke University post-doctoral researcher Liz Cirulli, who was not involved in the study, said that the notion of using the genome of a very old, healthy individual as a reference genome could be a "really great idea."

"It would be nice to have a reference genome that was cleaner," Cirulli, a researcher in David Goldstein's Duke laboratory, told IS.

Cirulli is currently running a centenarian sequencing project at the Duke Center for Human Genome Variation that is being done in collaboration with investigators from the Measurement to Understand Reclassification of Diseases of Cabarrus/Kannapolis, or MURDOCK, study. Researchers have so far enrolled 18 individuals who are 100 years old or older and have already sequenced 13 centenarian genomes with the Illumina HiSeq platform.

The group is currently looking at overall patterns in the centenarian genomes, Cirulli explained, trying to see if centenarians have fewer rare, deleterious variants than population controls. They likely won't be able to fold data from the newly sequenced W115 genome into that analysis, she said, since differences in the genome sequencing platforms used by the centenarian sequencing team and Holstege's group are expected to influence variant calling.

Once they have sequenced a larger set of centenarians, though, Cirulli said W115's genome sequence might help validate some variants that are enriched in the centenarians compared to controls or to allow them to compare patterns in genes containing more variants in the centenarian group.

"What I'm doing overall is looking for global differences in the genomes, so you have to make sure that everything's sequenced the same way to be able to do that kind of comparison," Cirulli said. "If I had a certain variant or a list of a few variants I was really interested in and I wanted to see if [the W115 genome] had them, then it would be very useful to go look at it."

Holstege, Cirulli, and their teams are not the only ones using whole-genome sequencing to explore the genetics of healthy aging or longevity.

Earlier this month, Complete Genomics announced that it is working with collaborators at Scripps Health to sequence the genomes of around 1,000 healthy individuals between 80 and 108 years old (CSN 10/5/2011). Investigators involved in that effort, dubbed the Wellderly project, want to explore the genetics of healthy aging and plan to use their population as a control group in studies of age-related diseases such as cancer, Alzheimer's, Parkinson's, and heart disease.

A Genetic Basis for Healthy Aging?

Although they are focused on a single individual, researchers involved in the W115 genome sequencing and assembly effort believe W115's exceptional health and family history make her genome valuable for exploring both longevity and healthy aging.

Tests done when the woman was 112 years old suggested she had the mental sharpness of someone closer to 60 years old. And after she died of stomach cancer at 115 years old, post-mortem analyses did not reveal significant signs of aging in her brain or vasculature.

Though there may be some environmental contributors as well, researchers believe genetic factors were key to the woman's long and healthy life, since individuals on her mother's side and her father's side of the family tended to live longer than was typical for their time.

"Both sides of her family were significantly older than the mean of their generation. However, individuals on her mother's side were really much older than the mean of their generations," Holstege said. "There's really something genetic that keeps them healthy and makes them old."

Even so, she explained, the researchers did not see any obvious changes in W115's overall SNP pattern when they compared her genotype data with patterns in the normal Dutch population, using data on individuals genotyped with Illumina arrays as part of a study led by collaborator Eline Slagboom at the Leiden University Medical Center.

"We found that there was absolutely no difference between our old lady and their Dutch control group," Holstege said. "She doesn't have a specific set of strange SNPs or something that makes her different from a normal population for those SNPs that were measured on the array."

By sequencing her complete genome, researchers hope to get a more complete look at all of the single nucleotide variants, copy number changes, small insertions and deletions, and larger structural variants that may have contributed to W115's health and longevity.

The team did paired-end and mate pair sequencing on DNA from the woman's blood and brain samples using the SOLiD 4 platform. Sequencing for the study was performed at Life Technologies.

So far, researchers have generated about 60-fold coverage each of the blood and brain genomes. Sequence reads from both tissues are being combined for the de novo germline assembly to get sequence covering 92 percent of the genome at an average depth of around 120-fold.
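The depth figures above follow from simple arithmetic: average depth is the total number of sequenced bases divided by the genome size, so pooling two samples sequenced at roughly 60-fold each gives roughly 120-fold. The short Python sketch below illustrates this. The genome size of 3.1 billion bases is an assumed round figure, and the function and variable names are hypothetical; none of this is taken from the team's own pipeline.

```python
# Assumption: approximate haploid human genome size, used only for illustration.
GENOME_SIZE_BP = 3_100_000_000


def mean_depth(total_sequenced_bases, genome_size=GENOME_SIZE_BP):
    """Average depth = total sequenced bases / genome size."""
    return total_sequenced_bases / genome_size


# Roughly 60-fold coverage from each tissue, as reported in the article.
blood_bases = 60 * GENOME_SIZE_BP
brain_bases = 60 * GENOME_SIZE_BP

# Pooling the reads from both tissues adds the depths together.
combined = mean_depth(blood_bases + brain_bases)
print(f"Combined average depth: {combined:.0f}-fold")
```

Note that average depth says nothing about how evenly reads are spread, which is why the article reports breadth (92 percent of the genome covered) separately from depth.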

"We wanted to make sure that we sequenced the whole genome to a very high degree of confidence," said Life Tech's Tim Harkins.

Evaluating Assemblers

Members of the team have already assembled chromosome 19, preliminary work that Holstege presented at the ICHG/ASHG meeting, and are now working on completing the assembly for the rest of the genome.

"We're working through the process now. Essentially we're looking at a very large amount of coverage, in terms of reads, that we need to assemble," explained Scripps Translational Science Institute researcher Samuel Levy, who directs genomic sciences for Scripps Health and is leading the W115 genome assembly effort. "We're essentially evaluating the best approach currently."

To do this, the researchers are testing the assemblers that are available for doing de novo assembly using SOLiD data: Velvet and ABySS.

Generally speaking, de novo assemblies that rely on short-read sequence data present special challenges, Levy explained, primarily related to read length and the size of insert libraries available.

For instance, two to three different insert sizes are usually used to get past repeat sequences in the human genome, Levy said. Although it's possible to reliably make 200-base and 3,000-base insert libraries for short-read platforms such as SOLiD or Illumina, it can be difficult to efficiently make longer insert libraries, he said.

"For the most part, you can get up to [3,000-base] inserts reliably with either technology, but getting longer libraries is very hard — and that's what you really need," he explained.

Moreover, because short-read platforms generate such a slew of sequence data, they can quickly saturate the memory available to many assemblers, Levy noted, making it challenging to come up with assemblies that make use of all of the available sequence data.

To address that issue, the team is working with Wayne Pfeiffer and Mark Miller at the San Diego Supercomputer Center to try to incorporate as much data as they can into the overall genome assembly using large-memory machines.
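To see why memory becomes the bottleneck, it helps to know that Velvet and ABySS are both de Bruijn graph assemblers: they break reads into overlapping k-mers and hold the resulting graph in memory. The Python sketch below is a toy illustration of that idea only. It uses plain base-space strings rather than SOLiD color-space data, the reads are made up, and the function name `build_de_bruijn` is hypothetical; it does not reproduce either assembler's actual data structures.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k across the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Made-up overlapping reads, for illustration only.
reads = ["ACGTAC", "CGTACG", "GTACGA"]
graph = build_de_bruijn(reads, k=4)

edge_count = sum(len(successors) for successors in graph.values())
print(f"{len(graph)} nodes with outgoing edges, {edge_count} distinct edges")
```

Repeated k-mers collapse into a single edge, so memory use tracks the number of distinct k-mers rather than the raw read count. At very high depth, however, sequencing errors keep introducing new distinct k-mers, which is one reason large human data sets tend to call for large-memory machines.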

The assembly efforts are still underway. In the meantime, researchers have started doing some preliminary analyses of the W115 genome based on comparisons with the standard human reference genome. Nevertheless, Holstege cautioned against over-interpreting information from a single genome.

"In every genome you sequence at the moment, you find novel SNPs and novel indels," she said. "So it's going to be a pretty difficult job to find out if these novel indels have anything to do with her becoming so very old or her not becoming sick."

"Usually you're looking for a deleterious event — what caused a cancer, what's led to a genetic disposition — and here we're looking for the positive: what has provided this woman to live very robustly until she was 115," Harkins added, noting that searching for these positive features in the genome can be tricky.

To get a clearer idea of which variants in the genome might be relevant to lifespan and disease resistance, the researchers said they will likely need to compare it with sequences generated for other longevity studies. To that end, Holstege said she hopes to get in contact with teams setting up cohorts for longevity and related studies.

Data from the W115 genome project is currently being housed in a password-protected site on the Amazon Elastic Compute Cloud environment. Once the genome assembly is complete, the data will be made available to other members of the research community.

Those involved with the study hope to complete the W115 genome by the end of the year, but said the timing depends on how long the remaining assembly steps take and on whether they need another long mate pair sequencing run to improve the quality of the de novo assembly.

"We want to make a really, really nice de novo assembly," Holstege said. "We're thinking about doing another run to get more long mate pair data and adding that to it, so we really get a high quality de novo assembly."

The researchers were not able to estimate the cost of sequencing the genome, in part because SOLiD sequencing was done in-house at Life Tech.


Have topics you'd like to see covered in In Sequence? Contact the editor at anderson [at] genomeweb [.] com.