NEW YORK (GenomeWeb) – Researchers from Carnegie Mellon University, the University of Illinois at Urbana-Champaign, and elsewhere have developed an algorithm, called Weaver, that helps researchers identify and analyze complex structural variations and copy number alterations in whole-genome sequence data, allowing them to obtain more comprehensive pictures of cancer genomes.
Essentially, the method calculates the copy number of structural variants in cancer genomes, enabling researchers to analyze variant and copy number information at the same time. Jian Ma, an associate professor in Carnegie Mellon University’s Computational Biology Department and one of Weaver's developers, said in a statement that the new algorithm could become an important tool for identifying interactions between genetic alterations that drive the development of cancer cells. "This gives us a better view of the complexity of cancer genomes," Ma said. "This may help researchers better characterize cancers or understand which combinations of genetic changes might affect cancer behavior for the same type of cancer or for different cancer types."
In a paper published in Cell Systems last week, the developers applied Weaver to sequence data from the Michigan Cancer Foundation-7 (MCF-7) and HeLa cell lines as well as from ovarian cancer samples from The Cancer Genome Atlas (TCGA). They were able to generate allele-specific copy numbers of structural variants (SVs) for the cell lines and identify recurrent SV patterns in the TCGA data.
Weaver is also one of the first methods for identifying complex alterations, such as structural rearrangements, that account for the aneuploid nature of cancer genomes, Ma told GenomeWeb in an interview this week. Current methods for identifying alterations such as single nucleotide variants, small insertions and deletions, copy number variations, or breakpoints caused by structural rearrangements generally do not account for aneuploidy, he said. Some methods for quantifying copy number alterations do account for aneuploidy, but they do not integrate information on structural rearrangements, he added.
"We are interested in understanding the functions of these very complex genomic alterations in cancer genomes. Of course if you want to study that, the first step is to identify them accurately," Ma said. "Our goal [was] to develop a method to quantify the copy number of the structural rearrangements in an allele-specific manner" to provide more refined information about how these alterations are connected to each other. For example, with Weaver, researchers can identify deletions in the genome as well as determine which alleles have the deletion, he explained.
As explained in the Cell Systems paper, Weaver uses a probabilistic graphical model called a Markov random field that enables researchers to identify and visualize the copy numbers of structural variations as well as how these mutations are connected with each other. Weaver accepts aligned and unaligned sequence reads from tumor or matched normal samples. Its first step is to call all variants in the data, including SNPs and SVs – researchers can use variant calling tools of their choice for this task. The software then uses the variant calls to build a graph of the cancer genome that represents the connections between genomic regions in both the normal and cancer genome. It then converts the graph into a probabilistic model that captures structural variants, their allele-specific copy number phasing configurations, and whole-genome copy number changes.
Essentially, the software attempts to capture connections between genomic regions that are far away from each other in normal genomes but may be close to each other in a cancer genome, Ma explained. For example, "a piece of DNA on chromosome one and a piece of DNA on chromosome nine are not connected to each other in a normal genome but they may be right next to each other in a cancer genome since the genome is rearranged," he said. "We want to model those kinds of connections." In the cancer genome graph that Weaver constructs, nodes represent genomic segments and edges indicate dependencies among the segments. "The output of the algorithm for each node will be the allele-specific copy number and the edges will indicate whether they are involved in structural rearrangements," he said.
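The graph Ma describes can be pictured with a small data structure. What follows is only an illustrative sketch, not Weaver's actual code or file format: the names `Segment`, `GenomeGraph`, `add_segment`, `connect`, and `sv_edges`, along with the coordinates and copy numbers, are all hypothetical. It shows nodes carrying an allele-specific copy number and edges marked as either a normal-genome adjacency or a structural rearrangement.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """A genomic segment (a node in the graph). Hypothetical type."""
    chrom: str
    start: int
    end: int


@dataclass
class GenomeGraph:
    """Toy cancer genome graph; an assumption-laden sketch, not Weaver's model."""
    # Maps each segment to its (allele A, allele B) copy numbers.
    copy_number: dict = field(default_factory=dict)
    # Each edge is (segment, segment, kind), kind being "reference" or "sv".
    edges: list = field(default_factory=list)

    def add_segment(self, seg, allele_a, allele_b):
        self.copy_number[seg] = (allele_a, allele_b)

    def connect(self, a, b, kind):
        if kind not in ("reference", "sv"):
            raise ValueError("kind must be 'reference' or 'sv'")
        self.edges.append((a, b, kind))

    def sv_edges(self):
        """Return only the edges that represent structural rearrangements."""
        return [(a, b) for a, b, kind in self.edges if kind == "sv"]


# Made-up example: a piece of chromosome 1 joined to a piece of chromosome 9.
graph = GenomeGraph()
seg_chr1 = Segment("chr1", 1_000_000, 1_050_000)
seg_chr9 = Segment("chr9", 21_000_000, 21_040_000)
graph.add_segment(seg_chr1, allele_a=3, allele_b=1)
graph.add_segment(seg_chr9, allele_a=3, allele_b=0)
graph.connect(seg_chr1, seg_chr9, "sv")
```

In this toy version the copy numbers are entered by hand; in the method described in the paper they are what the probabilistic model infers.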
According to the paper, the researchers evaluated Weaver's performance by using it to analyze both simulated and real datasets, as well as by comparing its results to those generated by optical mapping of cancer genomes. In the simulated datasets, Weaver identified over 97 percent of the structural variants with correct copy numbers in each case. In the MCF-7 cell line, Weaver identified 546 structural variants, with about 83 percent having a copy number greater than one. Also, of a selected 268 structural variants identified by Weaver, 235 were consistent with the optical mapping results.
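For readers who want the MCF-7 figures as rates, the short calculation below works them out from the numbers quoted above. The variable names are ours, and because the 83 percent figure is itself approximate, the derived count is an estimate rather than a number reported in the paper.

```python
# Figures quoted in the article for the MCF-7 cell line.
total_svs = 546              # structural variants identified by Weaver
share_multi_copy = 0.83      # "about 83 percent" with copy number > 1
checked_svs = 268            # SVs selected for optical-mapping comparison
consistent_svs = 235         # of those, consistent with optical mapping

# Approximate count of SVs with copy number greater than one.
approx_multi_copy = round(total_svs * share_multi_copy)

# Percentage of the selected SVs that agreed with optical mapping.
concordance_pct = round(100 * consistent_svs / checked_svs, 1)

print(f"Roughly {approx_multi_copy} SVs with copy number above one")
print(f"Optical-mapping concordance: {concordance_pct} percent")
```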
"If you look at the MCF-7 result, we found that if you consider the structural rearrangement, you will better interpret functional genomic data, like chromatin interaction data," Ma noted. "That's very important because … large-scale projects like ENCODE use cancer cell lines, but typically, these kind of structural rearrangements are not specifically considered when we interpret epigenetic data. But I think it's very important [that] … when we interpret epigenetic data, we keep in mind that there are rearrangements involved."
Ma and his colleagues also used Weaver to assess data from 44 ovarian cancer samples from the TCGA that were sequenced at high coverage. According to the paper, their analysis showed that the CCNE1 gene was significantly amplified across all 44 samples. "Amplification of CCNE1, which encodes cyclin E1, is associated with primary treatment failure in ovarian cancer patients and has been validated as a dominant marker of patient outcome," they wrote. Also, "previous studies have reported that CCNE1 amplification is one of the most common focal [copy number alteration] events in ovarian cancer." They also found recurrent amplifications in some of the ovarian cancer samples that are driven by specific structural rearrangements, Ma said. Specifically, Weaver identified a portion of chromosome 19 that was enriched with fold-back inversions that have breakpoints around a gene that's a member of the KDM4 protein family, which previous studies have shown is perturbed in various cancers, the researchers wrote.
These findings have laid the foundation for at least one future direction for Weaver's developers.
"The sample size is still not big, only 44, but if [the alterations] show up four or five times, then at least it says something," Ma told GenomeWeb. "We intend to apply this to many more samples to identify these recurrent patterns." He said that the team is currently looking into applying the algorithm to additional TCGA datasets, as well as to high-throughput datasets from labs that are willing to partner with the Weaver team.
The researchers are also interested in applying Weaver to data generated using other sequencing technologies. With Illumina short reads, "certainly we are going to miss a lot of breakpoints because the read length is a limitation but there are other technologies out there that I think are complementary and could help us better interpret these structural rearrangements and these complex structures in cancer genomes," Ma said.
The paper also highlights Weaver's ability to estimate the timing of SVs relative to chromosome amplifications. This is information that could help researchers better understand the roles that structural rearrangements play in cancer genomes, Ma explained. For example, if there are five copies of a single allele of a chromosome region but only a single copy of a deletion that occurs in the middle of the allele, then that means the deletion occurred after the whole region was amplified. On the other hand, if there are five copies of the allele and five copies of the deletion, that indicates that the deletion is more likely to have occurred prior to the amplification of the allele. "If you notice that this deletion happened before the amplification and you observe this in multiple samples, it may give some idea of the potential function of that structural variant," he said.
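The reasoning in Ma's example can be written as a simple rule of thumb. The sketch below is a deliberately simplified illustration, not the statistical procedure used in the paper: the function name `sv_timing` and its labels are hypothetical, and it assumes the allele's copy number and the SV's copy number are already known.

```python
def sv_timing(allele_copies: int, sv_copies: int) -> str:
    """Guess when an SV arose relative to amplification of its allele.

    A simplified heuristic based on the example in the text, not
    Weaver's actual inference.
    """
    if allele_copies < 1 or not 0 < sv_copies <= allele_copies:
        raise ValueError("need 1 <= sv_copies <= allele_copies")
    if allele_copies == 1:
        # No amplification, so the copy numbers carry no timing signal.
        return "undetermined"
    if sv_copies == allele_copies:
        # Every copy carries the SV: it was likely copied along with the allele.
        return "likely before amplification"
    if sv_copies == 1:
        # Only one copy carries the SV: it likely arose after amplification.
        return "likely after amplification"
    # Partial sharing suggests the SV arose between rounds of amplification.
    return "intermediate"


# The two cases from the article: five copies of the allele.
print(sv_timing(5, 1))  # likely after amplification
print(sv_timing(5, 5))  # likely before amplification
```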
In the simulated dataset, for example, Weaver was able to detect about 97 percent of the pre-aneuploid structural variants and about 98 percent of the post-aneuploid structural variants. Also, in the MCF-7 line, Weaver was able to trace the evolution of two deletions within the MTAP-CDKN2A/B region by looking at how often the deletions occurred in amplified portions of the region.
The main benefit of Weaver is that it provides information that could help researchers better interpret the landscape of genomic alterations and their patterns. It provides a "recipe" of how these changes actually happen, Ma explained. "If you look at a large number of samples, it may give us some insights into how these structural rearrangements and copy number alterations are formed in different types of cancers or in different types of samples in the same kind of cancer," he said.
The study was supported by grants from the National Science Foundation and the National Institutes of Health.