NEW YORK (GenomeWeb News) – Widespread genetic variation affects the regulation of most human genes, according to a study of functional variation appearing in the online early edition of Nature this week from a consortium of European researchers.
To map out functional variation in the human genome, the Genetic European Variation in Health and Disease consortium, known as GEUVADIS, turned to samples from five populations from the 1000 Genomes Project, on which they performed mRNA and small RNA sequencing. The sequencing work was spread among seven labs in Europe. From this, the team of researchers noted that variation in genomic regulation was common, and it was able to predict a number of causal regulatory variants.
"The richness of genetic variation that affects the regulation of most of our genes surprised us," Tuuli Lappalainen, the study coordinator who is now at Stanford University but was at the University of Geneva at the time the study was conducted, said in a statement. "It is important that we figure out the general laws of how the human genome works, rather than just delving into individual genes."
Understanding how the genome works and the mechanisms behind that, the researchers added, will enable better interpretation of personal genomes as well as the implementation of genomic medicine.
In their efforts to uncover functional variation, the consortium researchers sequenced mRNA and small RNAs from lymphoblastoid cell line samples from 462 individuals from CEPH, Finnish, British, Toscani, and Yoruban populations. They then mapped cis-quantitative trait loci to transcriptome traits of protein-coding and miRNA genes.
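As a rough illustration of what mapping a cis-eQTL involves, the sketch below regresses a gene's expression on genotype dosage (0, 1, or 2 copies of an allele) for each nearby variant and keeps the strongest association. It is a minimal sketch, not the consortium's pipeline: the one-megabase window, the function names, and the toy data are all assumptions made for illustration.

```python
from statistics import mean


def slope_and_r(dosages, expression):
    """Least-squares slope and Pearson correlation of expression on dosage."""
    mx, my = mean(dosages), mean(expression)
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in expression)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, expression))
    if sxx == 0 or syy == 0:
        # Monomorphic variant or constant expression: no association to report.
        return 0.0, 0.0
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


def top_cis_variant(gene_position, variants, expression, window=1_000_000):
    """Return (variant_id, slope, r) for the strongest variant inside the window.

    The window size is an assumed value, not one stated in the article.
    """
    best = None
    for variant_id, position, dosages in variants:
        if abs(position - gene_position) > window:
            continue  # outside the cis window, so not tested here
        slope, r = slope_and_r(dosages, expression)
        if best is None or abs(r) > abs(best[2]):
            best = (variant_id, slope, r)
    return best


# Hypothetical toy data: six individuals, one gene, three variants.
GENE_POSITION = 1_000_000
EXPRESSION = [1.0, 2.0, 3.0, 2.0, 1.0, 3.0]
VARIANTS = [
    ("var_near", 1_050_000, [0, 1, 2, 1, 0, 2]),
    ("var_weak", 900_000, [0, 0, 1, 1, 2, 2]),
    ("var_far", 5_000_000, [0, 1, 2, 1, 0, 2]),
]

best = top_cis_variant(GENE_POSITION, VARIANTS, EXPRESSION)
print(best)
```

Real analyses also correct for covariates and for the number of variants tested, which this sketch leaves out.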
Broadly, the researchers found that population differences explain about three percent of the total variation, and they identified between 263 genes and 4,370 genes with differential expression, as judged by ratio of transcripts, between population pairs, as they reported in Nature.
As the researchers noted in a companion Nature Biotechnology paper, they saw few differences in variation between laboratories, indicating that such dispersed sequencing can give reliable results.
To evaluate the reproducibility of RNA sequencing across the seven labs, the researchers randomly distributed 465 RNA samples, with each lab receiving between 48 and 113 samples. All of the labs used the same kit to prepare the samples, and all sequenced them on the Illumina HiSeq 2000. This generated a median of 58 million reads, and between 60 percent and 80 percent of aligned reads mapped to annotated exons.
While they noted some differences in average GC content and insert size across labs, the researchers added that such variations could be taken into consideration during analysis.
For example, they found that exons with high GC content had more variable expression levels between labs than exons with medium or low GC content, a finding they saw in both mRNA- and sRNA-seq data. They attributed this difference to thermocyclers with high ramping speeds.
To correct for variation in their data, the researchers used a Bayesian framework called PEER that uses factor analysis-based methods to infer what may explain those transcriptome-wide variations. After such correction, they noted that the samples clustered less by laboratory.
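The idea of inferring a hidden factor and removing it can be sketched in a few lines. The example below is not PEER and is not Bayesian: it is a much simpler stand-in that finds the single dominant axis of variation in a centered expression matrix by power iteration and subtracts it. The function names and the toy matrix (four samples from two hypothetical labs, three genes) are assumptions for illustration only.

```python
def center_columns(matrix):
    """Subtract each gene's (column's) mean across samples."""
    n_samples = len(matrix)
    n_genes = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_samples for j in range(n_genes)]
    return [[row[j] - means[j] for j in range(n_genes)] for row in matrix]


def dominant_factor(centered, iterations=100):
    """Unit-length gene loadings of the strongest factor, via power iteration."""
    n_genes = len(centered[0])
    loadings = [1.0 / (j + 1) for j in range(n_genes)]  # arbitrary fixed start
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, loadings)) for row in centered]
        updated = [sum(s * row[j] for s, row in zip(scores, centered))
                   for j in range(n_genes)]
        norm = sum(w * w for w in updated) ** 0.5
        if norm == 0:
            break  # the matrix has no variation along the current direction
        loadings = [w / norm for w in updated]
    return loadings


def remove_factor(centered, loadings):
    """Subtract each sample's projection onto the factor loadings."""
    corrected = []
    for row in centered:
        score = sum(x * w for x, w in zip(row, loadings))
        corrected.append([x - score * w for x, w in zip(row, loadings)])
    return corrected


def lab_gap(matrix, labs, gene):
    """Difference in mean expression of one gene between lab B and lab A."""
    a = [row[gene] for row, lab in zip(matrix, labs) if lab == "A"]
    b = [row[gene] for row, lab in zip(matrix, labs) if lab == "B"]
    return sum(b) / len(b) - sum(a) / len(a)


# Hypothetical toy data: lab B reads 4 units higher than lab A for every gene.
LABS = ["A", "A", "B", "B"]
EXPRESSION = [
    [10.0, 20.0, 30.0],
    [11.0, 19.0, 30.0],
    [14.0, 24.0, 34.0],
    [15.0, 23.0, 34.0],
]

centered = center_columns(EXPRESSION)
corrected = remove_factor(centered, dominant_factor(centered))
print(lab_gap(centered, LABS, 0), lab_gap(corrected, LABS, 0))
```

In this toy matrix the dominant factor is the lab offset by construction, so removing it closes the gap between labs. In real data the strongest factor may be biological, which is one reason dedicated methods are used rather than a blunt subtraction like this one.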
"[W]e have demonstrated that technical variation in RNA-seq experiments is small and that results from RNA-seq experiments performed in different laboratories are consistent," the team said. "This conclu¬sion is valid as long as all participating laboratories use the exact same protocols and versions of sample preparation and sequencing kits."
Still, based on their findings, the researchers suggested some half-dozen parameters for future studies to follow to ensure sample and data quality. For example, they suggested that investigators conduct quality checks, including examining the distribution of base quality scores; the average and width of GC content; and the percentage of reads mapping to the genome; as well as performing sample swap and contamination checks and outlier detections, among other control steps.
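A few of these checks are simple enough to sketch. The example below summarizes the average and width of GC content, the mean base quality, and the percentage of mapped reads for a small set of reads. It is a minimal sketch under stated assumptions: quality strings are taken to be Phred+33 encoded, "width" is taken to mean the standard deviation, and the function names and toy reads are hypothetical rather than drawn from the consortium's tooling.

```python
from statistics import mean, pstdev


def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def phred_scores(quality_string, offset=33):
    """Decode a quality string, assuming Phred+33 encoding."""
    return [ord(char) - offset for char in quality_string]


def qc_summary(reads, mapped_reads, total_reads):
    """Summarize GC content, base quality, and mapping rate for one sample.

    reads is a list of (sequence, quality_string) pairs.
    """
    gc_values = [gc_fraction(sequence) for sequence, _ in reads]
    qualities = [q for _, quality in reads for q in phred_scores(quality)]
    return {
        "gc_mean": mean(gc_values),
        "gc_width": pstdev(gc_values),  # standard deviation as the "width"
        "mean_base_quality": mean(qualities),
        "percent_mapped": 100.0 * mapped_reads / total_reads,
    }


# Hypothetical toy sample: three short reads, 90 of 100 reads mapped.
READS = [
    ("ACGT", "IIII"),
    ("GGCC", "IIII"),
    ("AATT", "5555"),
]

summary = qc_summary(READS, mapped_reads=90, total_reads=100)
print(summary)
```

Comparing such summaries across samples and labs is one way outliers could be flagged, though any cutoffs would have to be chosen for the data at hand.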
Meanwhile, in their Nature paper, the consortium researchers also reported that gene expression and transcript structure are both common, though independent, influencers of transcript variation. "Genetic regulatory variation is the rule rather than the exception in the genome with widespread allelic heterogeneity, and is the major determinant of allelic expression," they said.
Additionally, with such transcriptome data in hand, the researchers were able to estimate how frequently the top-ranked, most significant eQTL was likely to be the causal variant. They calculated that the best variant was likely to be causal in 55 percent of European eQTLs and 74 percent of Yoruban eQTLs.
For example, they noted that a SNP in the DGKD gene is associated with calcium levels, and the top eQTL, a two base-pair insertion, is 21 kilobases downstream and is the likely causal variant that affects calcium levels.
"Thus, the integration of genome sequencing and cellular phenotype data helps to not only understand causal genes and biological processes but also pinpoint putative causal genetic variants underlying GWAS associations," the researchers noted.
Having a better grasp of what causal variants are and how they affect cellular mechanisms will be necessary for predicting the effects of variants uncovered through personal genomics in the future, the researchers said.
"Understanding the cellular effects of disease-predisposing variants helps us understand causal mechanisms of disease," the University of Geneva's Emmanouil Dermitzakis added. "This is essential for developing treatments in the future."
The data from the study is freely available from EMBL-EBI's ArrayExpress functional genomics archive.