NEW YORK (GenomeWeb) – Researchers from the University of Toronto's computer science department and the genomics arm of Canada's Hospital for Sick Children have developed freely available open-source software that offers more statistically sensitive tools for identifying small de novo copy number variants in fetal DNA.
The software is designed to analyze whole-genome sequence data from samples of cell-free DNA (cfDNA) extracted from maternal blood plasma, one of the newer and less invasive forms of prenatal genetic testing that are becoming more commonly used to screen fetuses for heritable genetic diseases and identify abnormalities. It's also designed, the researchers wrote in Bioinformatics, to detect the small CNVs that are linked to significant genetic disorders such as DiGeorge syndrome and Prader-Willi syndrome, and to do so with greater sensitivity than existing methods.
The technology underlying non-invasive prenatal testing is "really fascinating and can significantly simplify the prenatal care that pregnant women receive," Michael Brudno, an author on the paper and an associate professor and research chair in computational biology at the University of Toronto, said during a conversation with BioInform.
Non-invasive testing is starting to replace the more invasive and riskier tests for detecting fetal abnormalities, which are often inaccurate and can only be done during later stages of pregnancy. Maternal blood plasma contains a mixture of fetal and maternal DNA. "The fraction of fetal DNA in such an admixture varies depending on multiple factors, including maternal weight and size of the fetus, but typically builds up from 5–7 [percent] early in the pregnancy to 10 [percent] at week 10 to as much as 50 [percent] before delivery," according to the paper.
"We wanted to see within the study how far we can push the envelope in terms of the technology," said Brudno. The technology is routinely used to identify the whole-chromosome aberrations that underlie conditions such as Down syndrome, "but can we push it to identify smaller changes … that could lead to significant disorders but are not the size of a whole chromosome?" The researchers also wanted to develop analysis software that was open and freely available for download and use, he added. Many current tools are proprietary and only available commercially.
Some open-source programs are designed specifically to detect so-called sub-chromosomal CNVs, but they rely on a single source of data, making them less sensitive than the Toronto team's method. For instance, two of these, which are referenced in the Bioinformatics paper, make use of depth of coverage information. These methods, which use low-coverage WGS of cfDNA and a large number of samples, first map reads to the genome, divide the genome into bins, and identify the CNVs by comparing the number of reads mapped to each bin, according to the paper. "The key idea in these methods is that deletions/duplications will result in more/fewer fetal reads within a window, and this difference can be identified using statistical methods."
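The binning idea described above can be sketched in a few lines of Python. This is only an illustration of the general depth-of-coverage approach, not the code of any of the referenced tools: the function names (`bin_read_counts`, `bin_z_scores`), the use of a z-score against a panel of reference samples, and the toy numbers are all assumptions made for the example.

```python
from statistics import mean, pstdev


def bin_read_counts(read_positions, bin_size, n_bins):
    """Count mapped read start positions that fall in each fixed-width bin."""
    counts = [0] * n_bins
    for pos in read_positions:
        idx = pos // bin_size
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def to_fractions(counts):
    """Normalize per-bin counts by the sample's total so samples are comparable."""
    total = sum(counts)
    return [c / total for c in counts]


def bin_z_scores(sample_counts, reference_panel):
    """Z-score of each bin's read fraction against a panel of reference samples.

    Hypothetical helper: strongly negative scores hint at a deletion,
    strongly positive scores at a duplication.
    """
    sample = to_fractions(sample_counts)
    panel = [to_fractions(ref) for ref in reference_panel]
    scores = []
    for i, value in enumerate(sample):
        ref_values = [p[i] for p in panel]
        mu, sd = mean(ref_values), pstdev(ref_values)
        scores.append((value - mu) / sd if sd > 0 else 0.0)
    return scores


# Toy data (assumed): three reference samples and one test sample
# whose third bin has lost about a fifth of its reads.
panel = [[100, 100, 100, 100], [102, 98, 101, 99], [98, 102, 99, 101]]
scores = bin_z_scores([100, 100, 80, 100], panel)
lowest_bin = scores.index(min(scores))  # index of the most depleted bin
```

With so few bins, the drop in one bin also inflates the fractions of the others, which is one reason real depth-based callers work with many bins and many samples.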
In some ways, the depth-of-coverage approach is practical because it uses less data, but it can't identify CNVs at the same resolution as the Toronto team's method, Ladislav Rampášek, a computer science doctoral student at UofT, one of the developers of the tool, and a co-author on the Bioinformatics paper, told BioInform. The Toronto method requires whole-genome sequences from the mother and the father in addition to the fetal genome sequence extracted from the cfDNA. Furthermore, the other open-source [programs] are better at identifying larger CNVs of around 10 megabases, for example, he said, but are not as effective at calling the smaller deletions involved in conditions such as DiGeorge syndrome, which are on the order of one to five megabases long.
In contrast, the method that the Toronto team has developed uses multiple sources of information to make its calls more precise. Specifically, it uses a Hidden Markov Model to bring together data from three sources, including depth of coverage, according to the paper. It uses "allelic ratios, reflecting the changes in the expected observations of various alleles at SNP positions in the presence of the CNV; phasing information, allowing for the combining of allelic ratios across multiple SNP positions, thus improving the signal-to-noise ratio; and depth of coverage information, reflecting the change in expected sequencing depth in the presence of the CNV."
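To make the allelic-ratio and depth-of-coverage signals concrete, here is a minimal sketch of what each would be expected to look like in a simple maternal/fetal mixture. It is not the emission model from the paper; it assumes an idealized mixture in which a fraction `fetal_fraction` of the cfDNA is fetal, and the function and parameter names are hypothetical.

```python
def expected_depth_ratio(fetal_fraction, fetal_copies, maternal_copies=2):
    """Expected read depth relative to a region where both genomes are diploid."""
    f = fetal_fraction
    return ((1 - f) * maternal_copies + f * fetal_copies) / 2


def expected_alt_fraction(fetal_fraction, maternal_alt, fetal_alt,
                          fetal_copies, maternal_copies=2):
    """Expected fraction of reads carrying the alternate allele at one SNP.

    maternal_alt / fetal_alt are the number of alternate-allele copies
    in the maternal and fetal genomes at that position.
    """
    f = fetal_fraction
    alt = (1 - f) * maternal_alt + f * fetal_alt
    total = (1 - f) * maternal_copies + f * fetal_copies
    return alt / total


f = 0.13  # fetal fraction used in the simulated data reported in the article

deletion_depth = expected_depth_ratio(f, fetal_copies=1)      # fetus lost one copy
duplication_depth = expected_depth_ratio(f, fetal_copies=3)   # fetus gained one copy

# SNP where the mother carries no alternate allele and the fetus
# inherited one alternate copy from the father.
normal_alt = expected_alt_fraction(f, maternal_alt=0, fetal_alt=1, fetal_copies=2)
dup_alt = expected_alt_fraction(f, maternal_alt=0, fetal_alt=2, fetal_copies=3)
```

Under these assumptions, a single-copy fetal deletion at a 13 percent fetal fraction changes the expected depth by only about 6.5 percent, and the allelic shifts at a single SNP are similarly small, which suggests why pooling SNPs through phasing and adding depth information could help.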
The exact mechanics of how the data is combined and used are described in detail in the paper, but basically the method uses a statistical framework to combine two types of genomic signals, Rampášek explained. One of the methods of analyzing sequence data from cfDNA is to count reads. For example, if a sample is being analyzed for Down syndrome, one approach would be to simply count the number of reads that come from chromosome 21 to see whether they are significantly more than would normally be expected. Another option is to look at SNP information to gain some insight into whether duplications or deletions have occurred. "We use both of these signals … and try to unify them into one framework and try to deal with them together," he said.
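One simple way to picture unifying the two signals is to add their log-likelihood ratios for a single genomic window. The sketch below does that for "normal" versus "single-copy fetal deletion on the maternally inherited haplotype". It is a stand-in for illustration only and is not the Hidden Markov Model described in the paper; the Gaussian depth noise, its `depth_sigma` value, the binomial read model, and all function names are assumptions.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def binomial_loglik(k, n, p):
    """Binomial log-likelihood without the constant term, which cancels in ratios."""
    return k * math.log(p) + (n - k) * math.log(1 - p)


def combined_llr(depth_ratio, alt_reads, total_reads, fetal_fraction, depth_sigma=0.02):
    """Log-likelihood ratio of deletion vs. normal, summing depth and allele evidence.

    Assumes SNPs where the mother has no alternate allele and the fetus
    inherited one alternate copy from the father, and treats the two
    signals as independent.
    """
    f = fetal_fraction
    # Depth: expected ratio is 1 - f/2 under the deletion, 1 otherwise.
    depth_llr = (gaussian_logpdf(depth_ratio, 1 - f / 2, depth_sigma)
                 - gaussian_logpdf(depth_ratio, 1.0, depth_sigma))
    # Alleles: paternal-allele fraction is f/(2 - f) under the deletion, f/2 otherwise.
    allele_llr = (binomial_loglik(alt_reads, total_reads, f / (2 - f))
                  - binomial_loglik(alt_reads, total_reads, f / 2))
    return depth_llr + allele_llr


# Toy windows (assumed values) at a 13 percent fetal fraction.
deleted_window = combined_llr(0.935, alt_reads=70, total_reads=1000, fetal_fraction=0.13)
normal_window = combined_llr(1.0, alt_reads=65, total_reads=1000, fetal_fraction=0.13)
```

A positive value favors the deletion hypothesis and a negative value favors the normal one. A full method would also need to model how neighboring windows relate to each other, which is the role the paper assigns to its Hidden Markov Model.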
In tests involving in silico sequence data with 13 percent fetal DNA concentration, including 360 simulated CNVs, the researchers report that their software identified genetic changes larger than 400 kb with 90 percent sensitivity and that it was able to identify 50–400 kb CNVs with 40 percent sensitivity.
The researchers will present their paper and method at the Intelligent Systems for Molecular Biology conference in Boston, which starts this week.
Meanwhile, they've begun working on additional features for the software, including capabilities that allow it to detect CNVs in samples in the absence of data from the biological father of the fetus, a rather significant impediment for many clinical tests, Brudno said. They are also improving the precision of the software and getting it to work with targeted sequence data and not just whole-genome sequence. Many testing labs aren't currently using WGS for NIPT, opting instead to sequence just the DNA for the trisomies, he said. "We are collaborating with some projects" including a group in Ottawa "which do targeted sequencing to adjust the method to work with them, as well."
Adding this particular capability is important for possible practical applications of the method, Rampášek noted. As it stands, the software cannot be used in routine practice because it requires a lot of input data. "It's a really good proof-of-concept, but we are working hard to make it more practical and … to figure out how much data is really needed to guarantee some level of CNV resolution. It's not a finished work," he said.