NEW YORK (GenomeWeb) – Members of a subgroup of the data working arm of the Global Alliance for Genomics and Health are working on methods of representing and including genetic variation information in the canonical human reference genome. The collaborative project, which involves researchers from academia and industry, seeks to restructure the historically linear reference assembly as a graph that better reflects the different kinds of variation that occur across human populations.
The reference variation task team is co-chaired by Gilean McVean, a professor of statistical genetics at the University of Oxford and acting director of Oxford's Big Data Institute, and Benedict Paten, an assistant research scientist in comparative genomics at the University of California, Santa Cruz. Its mission, McVean told GenomeWeb, is to work on ways of describing genetic variation that move away from a single linear reference and make more use of the genetic variation information that has been assembled over time. GA4GH provides a platform for members of the community who have been thinking about this problem to pool their thoughts and expertise and "try to converge on what are sensible structures and how best to exploit it for the purpose of describing genomes," McVean said.
The goal of the group is to come up with "a much more comprehensive structure that integrates all the variant data, and not just the point mutations but also the structural variations, and places it within a reference structure so that right from the outset when you get a new sample, you are calling with the context of all the known variations," Paten told GenomeWeb.
The reference genome, as it stands, has been incredibly useful and important in providing a universal coordinate system for human variation, he said. But "we are moving into this world where … we are sequencing people like crazy and increasingly doing it for medical reasons." That means that in pretty short order, "we are going to have an enormous amount of data about human variation" and the "fragmentary" nature of the current system of reference mapping and variant calling and identification presents some challenges.
Currently, since variants are called with respect to the reference genome assembly, there is the problem of reference allele bias, which, as the name implies, means that it is much easier to identify alleles that are already present in the reference with confidence than it is to discover new ones. Furthermore, "we don't really have a single comprehensive place where the catalogue of all human variation is stored," Paten noted. "The way that we store variants is kind of balkanized right now, so we store the point mutations in dbSNP and we store the structural variants in dbVar and other databases, and so forth," he added, which makes searching for mutations that match those found in samples a more complex activity. In addition, "because we've been discovering the variations with different technologies, and because of a certain amount of uncertainty about the discovery of those variants … we've got a lot of noise in the system as well," he said.
Working off GRCh38, the most recent incarnation of the human reference assembly, participants in the GA4GH subgroup so far have put together a data model and a formal specification for an application programming interface, and are working on various implementations of both with an eye towards eventually comparing them to figure out what works best, David Haussler, a professor of biomolecular engineering at UC Santa Cruz and co-leader of the GA4GH's data working group, told GenomeWeb.
There are a number of issues involved in thinking about these structures and it's important to build on the current paradigm incrementally rather than make sudden, major changes to the status quo, McVean said. The goal is to move "gradually towards a more graph-based structure that is completely back-compatible with everything else that's done to date ... we need to be commensurate with all of those [earlier] ideas."
It's also important to keep in mind that there is an informatics tool chain that needs to grow out of using these graph structures, which "has got to be as efficient if not more so than the current established tool chain which is highly optimized and works really well for characterizing genome sequences through high-throughput sequencing," he said.
It's also wise to think about the data itself and the sources of information on variation, McVean said. The group plans to use three structurally diverse regions in the human genome for which there are alternative loci available in the reference as a test bed for their methods, including the major histocompatibility complex (MHC) and the BRCA gene regions.
A fourth consideration addresses what features should be included in a graph. "What are the properties that are useful to people … and that reflect what we know about the way in which diversity appears and recombines … [and] what's the best way of going from a set of alternative loci to an actual graph that best matches the sort of things that we want to achieve and allows downstream researchers to ask questions in an intelligent way," McVean said.
Members of the group have written at least two papers — both of which are freely available on bioRxiv — that are related to the concept of a graph-based reference. One of these is written by Paten and other researchers from UC Santa Cruz and describes what the authors consider to be "desirable properties" of reference structures.
Paten explained the paper's underlying concept this way: Adding variation to the reference requires some thought about how sequences meet, intersect, and line up. Currently, the process of comparing genomes is noisy with significant disagreement about exactly how two sequences should align — a figure in the Paten et al. paper illustrates this point by showing several alignment options for a relatively short string.
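The ambiguity Paten describes can be reproduced in a few lines of code. The sketch below (an illustration, not taken from the paper) enumerates every optimal alignment of two short strings under unit edit costs, showing that even trivially small sequences admit several equally good alignments.

```python
def all_optimal_alignments(a, b):
    """Return every minimum-cost alignment of a and b under unit edit costs."""
    n, m = len(a), len(b)
    # dp[i][j] = minimum edit cost to align a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0:
                dp[i][j] = j
            elif j == 0:
                dp[i][j] = i
            else:
                dp[i][j] = min(
                    dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match/mismatch
                    dp[i - 1][j] + 1,                           # gap in b
                    dp[i][j - 1] + 1,                           # gap in a
                )

    results = []

    def trace(i, j, top, bottom):
        # Follow every optimal traceback path, not just one.
        if i == 0 and j == 0:
            results.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            trace(i - 1, j - 1, top + a[i - 1], bottom + b[j - 1])
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            trace(i - 1, j, top + a[i - 1], bottom + "-")
        if j > 0 and dp[i][j] == dp[i][j - 1] + 1:
            trace(i, j - 1, top + "-", bottom + b[j - 1])

    trace(n, m, "", "")
    return results

# "AAA" vs "AA": the gap can sit in any of three positions, all equally optimal.
alignments = all_optimal_alignments("AAA", "AA")
print(alignments)  # → three distinct alignments, all with edit cost 1
```

Any pipeline that picks one of these alignments arbitrarily will describe the same variant differently from a pipeline that picks another, which is exactly the disagreement the paper's figure illustrates.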
"That becomes really important when you start thinking about … lining up all the different variations that are present in humanity," he told GenomeWeb. "If you want to line up my genome and your genome and everybody else's genome then you need to have pinned down exactly how you are going to do that. We argue [in the paper] if you are going to start building those graphs, you need to understand … and define … that process in detail."
In the paper, Paten et al. propose labeling variants with unique stable identifiers and also suggest a consistent mechanism of identifying variants that can simply be repeated for new variants. What that means is that "if I have a new sample, I don't have to go and use [the Burrows Wheeler Aligner] with some random set of parameters to describe the new variant," he explained. "I can use the canonical way of looking at the strings … and if I see an instance of a particular series of ACGTs, then it identifies a given instance of a particular variant in the graph." This way "[we] no longer are identifying variants independently and differently from one another but we all have a consistent way of identifying all the variants in the graph."
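A toy illustration of that idea, with an invented reference, variant table, and identifier scheme (none of which reflects the paper's actual data model): each variant carries a stable ID, and a new sample is matched against the canonical sequence each variant implies, rather than being re-called with arbitrary aligner parameters.

```python
REFERENCE = "ACGTACGT"

# Hypothetical variant table: stable ID -> (position, ref allele, alt allele)
VARIANTS = {
    "var:001": (3, "T", "G"),   # SNP at position 3
    "var:002": (5, "C", ""),    # single-base deletion at position 5
}

def identify_variants(sample):
    """Return the stable IDs of known variants the sample carries, by
    rebuilding the canonical sequence each variant implies and comparing.
    (Sketch only: handles one variant at a time, not combinations.)"""
    found = []
    for var_id, (pos, ref, alt) in VARIANTS.items():
        edited = REFERENCE[:pos] + alt + REFERENCE[pos + len(ref):]
        if sample == edited:
            found.append(var_id)
    return found

print(identify_variants("ACGGACGT"))  # → ['var:001']
```

Because the lookup is deterministic, two labs seeing the same string of ACGTs will always report the same variant identifier, which is the consistency Paten is describing.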
Not only would this make locating variants in new samples a more efficient and concrete process, "you can imagine adding all kinds of metadata about phenotypes and so forth, associated with a given variant to that graph structure and having it all tied to one elegant conceptual model," Paten said. He and his co-developers are currently working on an improved and more concrete version of their approach and also taking steps to implement it in actual practice, he told GenomeWeb. As part of a pilot, they plan to test the method on several tricky regions of the genome including the major histocompatibility complex (MHC), the killer cell immunoglobulin-like receptor locus, and the BRCA1 and BRCA2 regions.
It sounds simple but there is something of a conceptual barrier that needs to be surmounted "when you move away from a world [where you have] a nice linear set of chromosomes to having a structure that represents all the variation … and becomes a graph," Paten said. But there are big benefits to be had once that barrier is crossed. "Once you start thinking about it as a graph, all the different kinds of variation can be described pretty simply."
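A toy sequence graph makes that point concrete. In the sketch below (an invented example, not a proposed data model), a SNP, an insertion, and a deletion are all just alternative edges, and every haplotype the graph can express is a path from the first node to the last.

```python
# Node labels map to sequence fragments; edges point to possible successors.
NODES = {
    "start": "AC",
    "snpG": "G",
    "snpT": "T",
    "ins": "AA",
    "end": "T",
}
EDGES = {
    "start": ["snpG", "snpT"],   # SNP: two alternative bases
    "snpG": ["ins", "end"],      # the edge that skips "ins" encodes a deletion
    "snpT": ["ins", "end"],
    "ins": ["end"],
    "end": [],                   # sink
}

def haplotypes(node="start", prefix=""):
    """Enumerate every sequence spelled out by a path from `node` to the sink."""
    seq = prefix + NODES[node]
    successors = EDGES[node]
    if not successors:
        return [seq]
    out = []
    for nxt in successors:
        out.extend(haplotypes(nxt, seq))
    return out

print(sorted(haplotypes()))  # → ['ACGAAT', 'ACGT', 'ACTAAT', 'ACTT']
```

Two variant sites yield four paths; a linear reference, by contrast, can spell out only one of those four sequences.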
A second paper, written by McVean et al., provides a prototype of one possible graph structure and demonstrates its usefulness using data from the MHC region of the genome. The paper describes a structure where "the genomes of novel samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and variants."
The researchers applied the method, the paper states, to the 4.5-Mb extended MHC region on chromosome 6 which includes eight assembled haplotypes, sequences of known HLA alleles, and over 87,000 SNPs from the 1000 Genomes Project. "We demonstrate, using simulations, SNP genotyping, [and] short-read and long-read data, how the method improves the accuracy of genome inference," the researchers wrote. "Moreover, the analysis reveals regions where the current set of reference sequences is substantially incomplete, particularly within the Class II region, indicating the need for continued development of reference quality genome sequences."
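The path-reconstruction idea can be caricatured in a few lines. The sketch below is a heavy simplification of the published method: instead of a hidden Markov model with recombination between haplotypes, it simply scores a sample against each complete path through a toy graph and keeps the closest one.

```python
# Candidate haplotype sequences spelled out by paths through a toy graph
# (invented for illustration).
PATHS = ["ACGAAT", "ACGT", "ACTAAT", "ACTT"]

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # match/mismatch
        prev = cur
    return prev[-1]

def best_path(sample):
    """Reconstruct the sample as the closest path through the graph."""
    return min(PATHS, key=lambda p: edit_distance(sample, p))

print(best_path("ACGAGT"))  # → 'ACGAAT' (a single mismatch away)
```

The real model additionally lets the reconstructed path switch between haplotypes mid-region, which is what "allowing for recombination" refers to.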
One genome to rule them all
The need for a more complex reference is an issue that has marinated in the minds of many in the biomedical research community for some time. The canonical human reference genome, which serves as the backbone for a large swath of biomedical research, is a composite of sequences from multiple genomes with each genomic region — with some exceptions in the last two incarnations — represented by one possible set of sequences selected from the pool of contributors. The most recent reference includes sequence from a total of about 70 individuals.
Essentially "all of our science and medicine has been biased towards this one arbitrarily chosen reference [that] doesn't necessarily reflect the ethnically common variation that you see throughout the world," Haussler, who co-authored the aforementioned UC Santa Cruz paper, told GenomeWeb. Aside from the virtually ubiquitous reference bias across biomedical studies, much of the bioinformatics software in use today was not designed to work with these alternate sequences and has largely ignored them, he said. It's also meant that mappers repeatedly map reads incorrectly because of the limited options available in the reference.
A haploid reference made sense for a number of reasons when the Human Genome Project put together the initial assembly. Aside from cost considerations, Haussler said, a linear reference was simpler to use from an informatics point of view.
Also, at the time, "our understanding of population diversity was very simple," Deanna Church, the senior director of genomics and content at Personalis, told GenomeWeb. "We didn't think that structural variation was quite as prevalent in the human population as we now understand that it is" and based on the information available back then, it was thought that a haploid representation would be sufficient. Before she joined Personalis, Church was a staff scientist at the National Center for Biotechnology Information and she was involved in the founding of the Genome Reference Consortium, a global group that took responsibility for maintaining the human reference genome after the HGP wrapped up, and has sought to find mechanisms of making alternative sequences available as part of the assembly.
Because of the GRC's efforts, the current iteration of the reference, while not a graph in the mathematical sense of the word, is graph-like. Since 2009, the GRC has included alternative loci in parts of the genome with significant sequence and structural diversity such as the MHC and MAPT regions. These alternate sequences first showed up in GRCh37, which had three regions with a total of nine alternate loci. In GRCh38, which came out late last year, the number has climbed to 261 alternate loci in over 200 regions. These alternate sequences come from fosmids generated as part of the NHGRI's Structural Variation Project — 22 libraries in total — and from BAC clones from the CHM1 hydatidiform mole, Valerie Schneider, a staff scientist at the NCBI and one of the members of the GRC, told GenomeWeb. They're also drawn from finished genomic clones in GenBank that were divergent from the reference chromosomes as well as BAC clones derived from DNA from HLA-specific haplotype resources, she said.
The decision to add these alternate loci grew out of gaps observed in the reference that were caused by structural variations that occurred in the donor sequence, Church explained. "If the donor sequence had a structural variation, the unique sequence in one of the alleles could keep us from closing that gap and what we would end up doing is representing two haplotypes in the reference assembly rather than representing one valid haplotype." The GRC's solution to that problem was to come up with a model that would allow them to "represent both genotypes faithfully" rather than create "alleles that are a mixture of different haplotypes."
Today, the GRCh38 reference includes the primary assembly, which consists of the chromosomes and the unlocalized and unplaced scaffolds, while the alternate loci exist as standalone accessioned scaffolds, Schneider said. "What's really important about the assembly model now, and differs from how these alternate scaffolds used to float around in the past, is that we give all of these things a chromosome context," she explained. "We do that by aligning them to the chromosome and we actually release the alignments with the scaffold sequences."
What that means, she added, is that "when you go to download the assembly, you get not just all the sequences but you will find the alignments of the chromosomes and the alternate loci, [and] through these alignments you know how each base on the alternate sequence is related to a base on the chromosome."
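In code, that base-level relationship amounts to a lookup through aligned blocks. The sketch below uses invented coordinates (the GRC's real alignment files are far richer) to show how a position on an alternate locus translates to its chromosome, with alt-only sequence mapping to nothing.

```python
# Hypothetical aligned blocks between an alternate locus and its chromosome:
# (alt_start, chrom_start, length), 0-based coordinates.
BLOCKS = [
    (0, 1000, 500),     # alt 0..499   <-> chrom 1000..1499
    (600, 1500, 400),   # alt 600..999 <-> chrom 1500..1899
]                       # alt 500..599 is alt-only sequence (no chromosome base)

def alt_to_chrom(alt_pos):
    """Map a base on the alternate locus to its chromosome coordinate,
    or None if the base falls in unaligned, alt-only sequence."""
    for alt_start, chrom_start, length in BLOCKS:
        if alt_start <= alt_pos < alt_start + length:
            return chrom_start + (alt_pos - alt_start)
    return None

print(alt_to_chrom(10))   # → 1010
print(alt_to_chrom(550))  # → None (inside the alt-only insertion)
```

This is what makes the alternate scaffolds more than free-floating sequence: every aligned base has a defined chromosome context.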
The GRC's efforts have provided an immediate and practical solution to the problem of missing variation in the reference. Aside from releasing new iterations of the reference, the group regularly releases quarterly patch updates, which provide sequences that are very similar to the alternate loci. "They are standalone scaffolds that we align back to the existing reference so you know where they fit but they don't change any of the chromosome coordinates," Schneider told GenomeWeb.
They also release patches that provide previews of what researchers can expect to see in the next major release of the reference. "So if you are an investigator working in a particular genome region, and say there is a gap and your gene is in that gap and we've managed to get that gene, rather than you having to wait five years to see it in the assembly, we are releasing the scaffold sequence that contains it and now you have an accessioned genomic sequence that you can use for your research purposes," she said.
GRC members are also tapped into and supportive of the efforts of the GA4GH's data working group. "I think we are a little bit [away] from having a graph-based solution at the moment and by providing people with alternate loci, we are getting them some representation for these alternate loci that's needed," Schneider said. "There's still clearly work that needs to be done for aligners and variant callers to be able to handle even the alternate loci," she added.
Church noted that "it's pretty clear that the quality of your reference assembly has a very strong impact on your ability to identify variants and reconstruct the genotypes in the sample that you are trying to analyze," and that "ideally we would have a full graph representation that would actually hold all of the variation that's seen within the human population."
But "it's really going to take a while to develop that model and develop the tools that we need to use it," she added. "We think at least trying to move the bioinformatics community and the research community into using this whole GRCh38 reference" including the alternate loci "is at least a step in that direction."
Meanwhile, the GRC maintains close contact with the bioinformatics community to discuss what changes need to be made to current tools and algorithms to enable them to use all of the sequences currently available in the reference. In fact, at the Genome Informatics meeting last September in the UK, GRC members held a workshop to discuss tools, formats, the changing assembly model, alternate loci and more, Schneider told GenomeWeb.
Ongoing development efforts here include working on methods of distinguishing between allelic and paralogous duplication. Most analysis algorithms have a mechanism for downgrading reads that map to more than one location in the genome as a result of paralogous duplication. The addition of alternate loci, however, means that some regions have two separate representations — allelic duplication. Current mapping tools can't tell the difference between the two and "so we need to come up with some strategies and protocols for actually trying to do that," Church said.
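One possible strategy can be sketched as follows (the region names, coordinates, and pairing table are invented for illustration): a read hitting both a chromosome interval and the alternate scaffold aligned over that interval is an allelic multi-mapper and need not be downgraded, while a read hitting two unrelated locations is paralogous and gets treated as before.

```python
# Hypothetical pairing of alternate scaffolds to the chromosome intervals
# they are aligned over.
ALT_PLACEMENT = {
    "chr6_alt1": ("chr6", 28_000_000, 34_000_000),  # e.g. an MHC alternate locus
}

def classify_multimap(hit_a, hit_b):
    """hit_* = (sequence_name, position). Return 'allelic' if one hit lies
    on an alt scaffold placed over the other hit's location, else
    'paralogous'."""
    for alt, primary in ((hit_a, hit_b), (hit_b, hit_a)):
        placement = ALT_PLACEMENT.get(alt[0])
        if placement:
            chrom, start, end = placement
            if primary[0] == chrom and start <= primary[1] < end:
                return "allelic"
    return "paralogous"

print(classify_multimap(("chr6_alt1", 5_000), ("chr6", 29_000_000)))  # → allelic
print(classify_multimap(("chr6", 1_000), ("chr7", 1_000)))            # → paralogous
```

A mapper with access to the released alt-to-chromosome alignments could apply a rule like this to stop penalizing reads for matching two representations of the same allele.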
The informatics problem
The effects of a more complex reference will probably be felt most by the bioinformatics domain, with the burden landing squarely on alignment software, genome browsers, and even variant callers, which have largely been developed with a linear reference representation in mind.
As this new structure is developed, "it's really important that we don't end up with a structure that means you have to throw away what's been done so far," McVean stressed. In terms of mapping methods, "we are looking at modifications to existing tools" that "basically use the same structures to map to these graph-based references." Importantly, "the output of such a mapper would be something that looks like a BAM file but has small modifications," he said. McVean told GenomeWeb that his group and others in the community have begun working on these improvements, including a team led by Heng Li, a research scientist at the Broad Institute and the principal developer of many widely used tools in the bioinformatics community, including the Burrows-Wheeler Aligner.
There is also a computational cost arising from the complex ways in which reads might map to the graph structure. "That's the engineering challenge that really has to be solved," according to McVean. "We know how to do it in theory, we even know how to do it in practice in any one case [but] what we don't yet have is a highly streamlined, efficient way of doing it that will work at the billions of reads scale that we need for mapping reads as they come off a sequencing machine."
One effort to tackle the analysis issue is coming from industry. Seven Bridges Genomics was established with the concept of a graph-based reference genome in mind and has been working on alignment and variant calling algorithms that can work with this sort of structure, Deniz Kural, the company's CEO, told GenomeWeb. In fact, the company's name is a nod to Swiss mathematician Leonhard Euler's use of graph theory to solve the Seven Bridges of Königsberg problem.
With genomic datasets growing astronomically, graph-based structures represent a viable way to compress data to make it more manageable while also supporting computation and queries on that data that allow the community to learn from the information being collected, he said. "That's what's so exciting about the applications of this and that's why we've been tackling the computational aspects and emphasizing that we really need to do this at scale."
Kural told GenomeWeb that his firm's algorithms are as efficient as standard algorithms such as the Burrows-Wheeler Aligner and that they don't become slower as the graph becomes more complex. The company has demonstrated how its methods — one of several being supported by Genomics England as part of its SBRI program — can be used to make tumor analysis more personalized. Currently, Kural said, when labs compare tumor and normal samples, they compare both samples to the reference genome separately and then compare the comparisons. "We first sequence the normal genome and then we incorporate this normal genome into the graph reference genome so now you have a truly personalized reference, and then we align the tumor data against this personal reference," he explained. This approach provides "a much better picture of the tumor as an evolving population of cells."
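The workflow Kural describes can be sketched with strings standing in for genomes (the data structures here are invented for illustration and sidestep the graph machinery entirely): fold the normal sample's variants into the reference to get a personalized reference, then call the tumor against that, so only somatic changes remain.

```python
REFERENCE = "ACGTACGTAC"

def apply_variants(seq, variants):
    """Apply (pos, ref, alt) substitutions, highest position first so that
    earlier coordinates stay valid as the string changes length."""
    for pos, ref, alt in sorted(variants, reverse=True):
        assert seq[pos:pos + len(ref)] == ref, "ref allele must match"
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq

def diff(sample, ref):
    """Naive per-base diff (assumes equal lengths): list of (pos, ref, alt)."""
    return [(i, r, s) for i, (r, s) in enumerate(zip(ref, sample)) if r != s]

normal_variants = [(2, "G", "T")]                 # germline SNP from the normal sample
personal_ref = apply_variants(REFERENCE, normal_variants)

tumor = "ACTTACATAC"                              # germline SNP plus one somatic SNP
print(diff(tumor, REFERENCE))                     # → [(2, 'G', 'T'), (6, 'G', 'A')]
print(diff(tumor, personal_ref))                  # → [(6, 'G', 'A')] somatic only
```

Against the public reference, the germline and somatic changes are tangled together; against the personalized reference, only the somatic change remains, which is the "better picture of the tumor" Kural is pointing at.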
Seven Bridges is mulling how best to make its methods available to the community. It is considering making some components open source and other components available under a commercial model, Kural told GenomeWeb.
In terms of browsers, Paten believes that linear visualization tools like the UCSC Genome Browser can work with a graph structure. That's because researchers would only have to compare their input sequences to the subset of the graph that is relevant to their particular genome. Essentially, a graph allows you to "translate" between any two different genomes, he explained. It actually "should make some analysis much simpler because instead of having to … view things through the [lens] of the existing reference genome… you are now able to review things on your personal reference genome and have a concrete translation," he said.
Reference assemblies past, present, and future
There are multiple versions of the human reference assembly in existence that have been used in numerous large- and small-scale studies. Redoing those analyses on a new assembly that has different chromosome coordinates than its predecessor is no small feat. The GA4GH data working group is using GRCh38 in its activities and researchers involved in the 1000 Genomes Project plan to begin reanalyzing their data on the assembly sometime next year, Church told GenomeWeb. But, she added, "if you go to a lot of research groups and clinical labs, most of them are still using GRCh37 [and] my suspicion is that many people are going to continue using [it] probably for the foreseeable future."
So, should the fruits of the current reference variation efforts be applied retroactively to assemblies past? Church thinks not. Whatever reference assembly was used in a past project would have had a pretty significant impact on what the analysis and interpretation results were, and "you actually want to leave that frozen because you can at least understand how that reference assembly impacted that analysis," she said.
One point up for discussion is how much genetic variation to incorporate and also how to expand outwards from regions of high genetic variability to other parts of the genome. "I'd be incremental about it," McVean told GenomeWeb. "I think at the moment there are some regions where there is a real need and what we should do is get the methods working well in those regions, which would allow us to support … some really important use cases of genome sequencing," he said. If those are successful, he added, "I see no reason in not extending it further."
However, it's conceivable that if someday the community succeeded in sequencing a million or even a billion genomes, there would be variations at every point in the genome, and attempting to include every last SNP would make for one very messy graph, McVean said. "So, we might focus on larger and more common edits to the reference structure."
There are also "fascinating discussions revolving around the larger-scale structural differences that appear," Haussler added. "We have an enormous amount of differing DNA because big regions are missing, duplicated ... [It's] important to represent what are common variations and that's what a lot of these regions are trying to approach but little has been done with the hardest regions of the genome because we have a hard time actually getting the sequence right. I think simultaneously we need to improve our methodologies for sequencing those harder regions of the genome to get a definitive gold standard sequence which then you would add to the mix when you are deciding which variants to represent and which not to."
Separation of research and clinic?
While it's true that there are different requirements for genomic data use in research and clinical contexts, all the researchers that GenomeWeb spoke to believe that the community is best served by a single reference assembly.
"Even though the requirements in terms of variant calling and reporting in the clinical laboratory are much more stringent than they are in the research space, we still rely heavily on the work that the research community does," Church said. "In fact a lot of the tools that are developed in the research community are then later adapted for use in the clinical community." Moreover, experience has proven that working off two different assemblies is no easy feat, she added. "Anybody can tell you that as they moved from GRCh37 to GRCh38, trying to compare data between the two different assemblies is really, really challenging." The National Center for Biotechnology Information does offer a tool that helps with the comparison process, she noted, but making sense of the differences between the assemblies is not the easiest task.
McVean expressed similar sentiments. "I think we'd want to keep those two communities side by side," he said. In fact, the GA4GH's reference variation task team includes folks from the clinical world such as members of the Human Genome Variation Society, the Human Variome Project, and other researchers with clinical genetics experience "who understand the need for and are very used to talking about different models for genetic variation," he said.
Moreover "we wouldn't look to try and introduce what we are doing into the rest of the world until it was mature enough to be able to support those research and clinical uses," he added.