NEW YORK (GenomeWeb) – Researchers from the University of California, San Diego have published a novel computational method they developed to increase researchers' ability to glean relevant biomarkers from existing data from Illumina 450K BeadChip arrays.
In the study, published online in Bioinformatics last week, the team also demonstrated the potential of the approach in translational medical research by applying it to a dataset from a study of methylation loci in samples from rheumatoid arthritis patients, showing that it could expand the number of CpG sites covered about 18-fold.
Gary Firestein, the study's corresponding author, said that he has been working with his colleague and co-author Wei Wang throughout his efforts to investigate the molecular underpinnings of rheumatoid arthritis.
Wang developed the group's newly published strategy in the context of his work with Firestein's research in RA epigenetics.
According to Firestein, although research in this and many other disease areas is moving inexorably toward whole-genome bisulfite sequencing, it hasn't gotten there yet. Meanwhile, investigators have amassed significant datasets using the 450K array technology and will continue to use newer, denser arrays as long as they offer cost and logistical advantages.
Recognizing that cost-competitive whole-genome bisulfite sequencing might still be some way off, Wang and colleagues sought a method that extrapolates from 450K array data to come closer to the coverage achievable with sequencing.
"We wanted to see how much data we could really get from these chips," Firestein explained. "These [arrays] really focus on promoter CpGs [covering less than 2 percent of all CpG loci in the human genome] but we know that some of the most interesting action goes on in other areas, particularly regulatory regions where there is relatively scant coverage."
"So, this was really an attempt to use the concept that methylation of a locus does not occur in isolation [but] that neighbors or similar surrounding sequences for other loci are frequently also methylated," he said.
Other approaches have been created to infer methylation loci outside of what is covered on the 450K chip, the study authors wrote, but these have been limited to a particular tissue type. Strategies have also been developed using MeDIP-seq or MRE-seq data to infer whole-genome methylation levels, but these types of data are not available for RA. Wang and colleagues wanted to be able to do this using 450K chips.
To develop their prediction model — which extrapolates the methylation of a broader number of loci from chip data alone — Wang and his lab began training against datasets for 14 tissue types for which there existed both chip data and whole-genome bisulfite sequencing data.
The approach relies on localized patterns of methylation across different tissues and patterns of parallel methylation at different sites within single tissue types to predict methylation of sites outside of those CpGs covered on the 450K chip.
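The intuition that nearby loci tend to share methylation status can be illustrated with a toy calculation. The sketch below is not the authors' model, which was trained on paired array and sequencing data across tissues; it is a hypothetical distance-weighted average, and the function name, the 1,000-base window, and the probe values are all assumptions made for illustration.

```python
from bisect import bisect_left


def predict_from_neighbors(position, probes, window=1000):
    """Estimate methylation (0 to 1) at `position` from nearby array probes.

    `probes` is a list of (genomic_position, beta_value) tuples sorted by
    position. Probes within `window` bases contribute, weighted by inverse
    distance. Returns None when no probe falls inside the window.
    """
    positions = [pos for pos, _ in probes]
    start = bisect_left(positions, position - window)
    weighted_sum = 0.0
    total_weight = 0.0
    for pos, beta in probes[start:]:
        if pos > position + window:
            break
        weight = 1.0 / (1 + abs(pos - position))  # closer probes count more
        weighted_sum += weight * beta
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# Hypothetical probes: two methylated sites nearby, one unmethylated far away.
probes = [(100, 0.9), (300, 0.8), (5000, 0.1)]
estimate = predict_from_neighbors(200, probes)
no_estimate = predict_from_neighbors(20000, probes)
```

Here the uncovered site at position 200 sits midway between two probes, so its estimate is their average, while the distant site has no probe in range and gets no prediction at all. A real predictor would also need the cross-tissue patterns the study describes.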
After training the model, Wang and colleagues cross-validated their predictor and found that it achieved an area under the ROC curve, or AUC, of about 0.9, where 1.0 would indicate perfect discrimination and 0.5 random guessing. Between 70 and 80 percent of loci in the bisulfite sequence results could be predicted by the computational method from 450K data alone.
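For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen methylated locus receives a higher predicted score than a randomly chosen unmethylated one. The following is a minimal sketch of that calculation using made-up labels and scores, not data from the study; the function name is an assumption.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 (methylated) or 0 (unmethylated) from the reference
    data; `scores` holds the predictor's output for the same loci.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one locus of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out loci: sequencing-derived labels vs. predicted scores.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
result = auc(labels, scores)
```

In this toy case five of the six methylated/unmethylated pairs are ranked correctly, so the AUC is 5/6. In cross-validation the same calculation would be repeated on each held-out fold and the results averaged.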
"Amazingly enough, it actually works," Firestein said. "It's not 100 percent accurate, but that's actually pretty good compared to trying to bear the costs of whole-genome sequencing,"
After these initial tests, Firestein proposed applying the predictor to actual 450K datasets that his group had been analyzing from clinical RA patients.
"My perspective was that this would take it out of a pure computational paper and give it some relevance to translational medicine," Firestein said. "We'd not just be looking at stem cells where everyone has the sequencing data already now, but doing an actual biological validation."
Applying the model to data from a study of DNA methylation in fibroblast-like synoviocytes, a type of cell that lines the joints and has been implicated in RA, the group was able to expand their CpG coverage about 18-fold compared to the initial coverage yielded by the 450K array.
Using the predictor, the group found several thousand genes that were differentially methylated between RA and controls. By comparing the results to other evidence sources, like GWAS data, the team narrowed down to 12 genes with the strongest potential, five of which overlapped with seven genes identified by the group from 450K data alone in their earlier study.
According to the authors, six of the 12 genes are already known to have roles in the disease, while the other six appear to be potential therapeutic targets.
"There were quite a few new genes identified, but, importantly, they largely fell within pathways that had already been implicated in RA," Firestein said. "What that means is that the identification of these genes wasn't random. It suggests the method actually did identify new loci that were most likely meaningful, and that we can use this information to broaden the database in understanding of the pathways without having to perform whole-genome sequencing."
According to Firestein, he and colleagues are already looking at adopting bisulfite sequencing to more comprehensively interrogate methylation in RA. However, many other investigators may not be there yet, and for both his lab and others, there is a "vast amount of [450K] data out there already where investigators can expand their understanding by reinterpreting and extrapolating from the data they already have."
"This is not designed to be the end of the work for CpGs or to put whole-genome sequencing out of business," he said. "It's really designed to help reinterpret data that already exists and help researchers extrapolate until a time comes where sequencing is more reachable."
Meanwhile, Illumina has superseded the HumanMethylation450K BeadChip with a newer-generation 850K array, the Infinium MethylationEPIC, which the authors wrote covers about 3 percent of CpGs in the human genome.
This is still a "woefully small percent of all the loci" that translational researchers are interested in, according to Firestein. The team's predictor source code is available online via the Wang lab website and is applicable to data from the newer 850K chips as well, Wang and coauthors wrote.