NEW YORK (GenomeWeb) – Scientists have put together a Danish reference genome, based on the de novo assembly of 150 genomes from 50 family trios, highlighting the use of deep sequencing to discover a "rich set" of population-specific structural variation.
While the genome, published in Nature today, caps years of work for the investigators behind the study, they also believe it has set a precedent for cross-institution collaboration that should encourage scientists in the Scandinavian country as they embark on a new, national personalized medicine program.
"Part of this was to prove that we could establish this kind of project using a nationwide approach," said co-author Karsten Kristiansen, a professor of biology at the University of Copenhagen. "It took some time to get the different parts to work together, but researchers got very good contact with each other," he said.
According to Kristiansen, discussions about generating a Danish reference population date back to 2010, but it only became feasible within the past few years to carry out the project with the provided funds. He did not put an exact price tag on the reference genome but said financing came from both public and private sources and totaled around $10 million.
Ultimately, the research team, which included investigators from the University of Copenhagen, Aarhus University, and the Technical University of Denmark, among others, opted to sequence 50 family trios using a combination of paired-end and mate-pair libraries on the Illumina HiSeq 2000 instrument at an average read depth of 78x. Kristiansen noted that BGI-Europe carried out all the sequencing at its Copenhagen facility, a requirement stipulated by both the funding agencies and the project's ethics committee.
"We decided on the best strategy for getting the maximum amount of information using available resources," said Kristiansen. "That's one of the reasons we chose the trios, because it provides a lot of information."
Kristiansen added that the team decided to use de novo assemblies for the effort "because it had been shown that if you simply align to a reference genome, you are missing a lot of information."
Researchers globally are increasingly seeking to produce reference genomes or panels for specific populations of interest. Last year, for instance, saw the publication of a Korean reference genome. Scientists at the Estonian Biocenter also recently generated a reference panel for their population. However, in each case, the selection of technologies used to produce the reference genomes differed based on the strategies of the investigators as well as available resources.
"We decided to go for very deep sequencing to discover new sequences," said Søren Brunak, a coordinating professor of bioinformatics at the Technical University of Denmark and a co-author of the paper. "The long paired-end insert libraries that were used really aided in the de novo assembly and resulted in a very low level of gaps."
Brunak noted that the trios sequenced were selected to be highly representative of the Danish population. "We even screened out individuals who were of half Norwegian or half Inuit ancestry," he said. "Being as focused as possible with the reference you have allows you to screen out common variation in the Danish population that will then not be linked to disease."
The scientists used three different programs to construct de novo assemblies for each of the 150 individuals sequenced: SOAPdenovo2, SGA, and ALLPATHS-LG. According to the authors, the assemblies had a median scaffold length of 21 megabases. The 100 largest scaffolds in each of the 140 best assemblies typically covered more than 75 percent of the genome, they reported.
To gauge the accuracy of these assemblies, they aligned the scaffolds to the human reference genome. They also compared the assemblies to a long-read assembly based on BioNano mapping and PacBio sequencing. They said the long-read assembly was "less complete" than theirs, though it had similar scaffold lengths. The comparisons led the authors to describe the quality of the de novo assemblies as "similar to those obtained using the more expensive long-read technology."
They also used the assemblies to identify structural variants found in the Danish population, including insertions and deletions that the researchers hope will enable them to decipher known association mapping signals. To do this, they created a hybrid variant calling strategy that relied on identifying candidate variants on the basis of mapping and assembly, followed by genotyping them. The mapping approach yielded 11,469,657 non-SNV candidates, roughly 85 percent of which were then validated by genotyping the variants across the 150 individuals.
Using the data, the Danish team was also able to resolve major histocompatibility complex haplotypes in half the trios, resulting in 100 complete MHC haplotypes. In addition, they fully assembled about 20 megabases of the Y chromosome in long scaffolds, identifying 10,898 SNVs, 855 deletions, and 793 insertions in the process. They also discovered 181 indels in fixed major haplogroups R, I, and Q that had not been previously reported.
"I think we have a top notch sequence for the HLA region, which is really important for putting people into immune system boxes," said Brunak of the results. "Another highlight is the Y chromosome assembly," he added. "It is notoriously difficult to assemble the Y chromosome and we are assembling a third of it quite well."
Alongside these scientific achievements, though, comes the hope that the data will help improve the interpretation of clinical genetics in Denmark. Earlier this year the country launched a DKK 100 million ($14.2 million), three-year program called the National Strategy for Personalized Medicine, or PerMed for short, whose aims include integrating genomic data into electronic medical records and establishing a national genome center.
"One of the large, complex tasks of PerMed will be to keep updated actionable information on what genome features are relevant in treatment and diagnostic contexts to Danish patients," said Brunak. "Here the reference genome obviously will be of great value," he said.
He also noted that alongside the Danish reference genome, the researchers have developed infrastructure for handling person-sensitive data and established secure, collaborative private cloud environments that are physically located in Denmark. This included the creation of a national supercomputer with 10 petabytes of storage. Brunak said the supercomputer, dubbed Computerome, has provided Danish researchers with a "strong basis for further developing technology that may fit into the already well-established Danish healthcare sector infrastructure."
Though the effort is in its infancy, Brunak said he was optimistic about the opportunities created by PerMed, noting that the country's existing electronic medical records system was well positioned for integrating genomic data.
"Everything that happens to you healthwise is tagged by [a social security] number," Brunak noted. "Disease trajectories are hard to handle in other countries where different providers cannot merge their data," he said. "But we think we have better opportunities than in many other countries for matching the genomes with the clinical data."