NEW YORK (GenomeWeb) – Affymetrix and researchers associated with the UK Biobank have worked together to develop new workflows and algorithms to weed out inconsistent and missing genotypes from the UKBB's 500,000-sample data set.
Scientists involved in the project described the approaches, which include multiple rounds of genotyping and data analysis, in a poster presented at the American Society of Human Genetics' annual meeting.
While the new methods were developed specifically to address the UKBB's needs, Affymetrix believes the approaches will "set a precedent and a new global standard for industrial-scale genotyping," Jeanette Schmidt, vice president of informatics at the Santa Clara, California-based company, said this week.
Moreover, Schmidt said that the analysis pipeline developed in connection with the UKBB is already being applied to other large prospective population-wide cohort studies with which Affymetrix is involved, such as the US Department of Veterans Affairs' Million Veteran Program (MVP). Because of this, Affymetrix views the UKBB project as "transformative."
"I think that the procedures they have developed are generalizable in principle," Desislava Petkova, a researcher at the Wellcome Trust Center for Human Genetics at Oxford University, who has worked on the UKBB project, told GenomeWeb. "It is important to keep in mind, though, that the UK Biobank naturally reflects the genetic makeup of the British population," she cautioned. "Some details might be slightly different if applied to other populations."
The UKBB and Affymetrix announced in March 2013 that the array vendor would genotype all 500,000 samples in its repository using customized Axiom array plates. Since 2005, the UKBB has collected blood, urine, and saliva samples from half a million Britons between 40 and 69 years of age at enrollment. The ultimate aim of the project is to improve healthcare in the UK.
Genotyping of the samples commenced later in 2013 at Affymetrix Research Services Lab on custom arrays, each of which contains roughly 800,000 markers relevant to the White British population. Altogether, the UKBB raised £21 million ($32 million) to support the genotyping.
Teresa Webster, senior director of algorithm and data analysis at Affymetrix, said this week that all of the 500,000 UK Biobank samples have now been genotyped. Though some interim data – from about a third of the samples – has been released, Webster said that the final analysis is ongoing.
"The current focus is on a few percent of genotypes that require advanced algorithmic analysis," said Webster.
Colin Freeman, a scientific programmer at the Wellcome Trust Center for Human Genetics, who is involved in the UKBB project, said that the full release of data from all 500,000 samples will likely come in summer 2016. While he provided a link to the quality-control measures the UKBB scientists have relied on as part of the project, he declined to elaborate on the results until next summer.
According to Webster, the partners' "stringent quality-control steps" have allowed them to focus on the few markers that require additional analysis and to make algorithm improvements that will be applied to the whole cohort.
Many of the improvements have focused on batch-to-batch consistency. Because of the size of the study, Webster said, genotyping was performed in batches of 4,700 individuals, each batch run on 50 Axiom 96-sample array plates, for a total of approximately 100 batches.
The batch sizes were chosen to maximize rare variant detection and were also influenced by shipment logistic considerations, Webster noted.
Affymetrix claims that its mask-based array manufacturing approach ensures that every array is identical in its SNP content for every production run, in contrast to Illumina's bead arrays, where the beads included on the arrays can vary from batch to batch. Even so, the company faced data consistency issues because of the scale of the UKBB project. These were related more to the size of the study than to array design, Webster stated: even though a mask-produced array may contain markers that call accurate genotypes 99.9 percent of the time, the remaining 0.1 percent of genotypes becomes visible when 800,000 markers are genotyped across 500,000 samples.
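To make that scale concrete, the arithmetic can be sketched in a few lines of Python. The inputs are the round figures quoted above; the 0.1 percent rate is Webster's illustrative example rather than a measured error rate, and the variable names are chosen here for illustration only.

```python
# Round figures quoted in the article; variable names are illustrative only.
markers = 800_000
samples = 500_000
problem_rate_per_thousand = 1  # 0.1 percent expressed as 1 in 1,000

# Every sample is genotyped at every marker.
total_genotypes = markers * samples

# Integer arithmetic keeps the result exact.
problem_genotypes = total_genotypes * problem_rate_per_thousand // 1000

print(f"{total_genotypes:,} genotypes in total")
print(f"{problem_genotypes:,} genotypes affected at a 0.1 percent rate")
```

At these figures, a rate that sounds negligible for a single array still corresponds to hundreds of millions of individual genotypes across the cohort.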
"Observing and studying small differences between analysis batches provided an opportunity to recognize variability and correct for it," said Webster. "In particular, data sets that look a little different also provide the greatest opportunity to characterize the genotype and correct for it by leveraging the power of 400 billion measurements."
According to Webster, the workflow Affymetrix and its partners developed to address these issues consists of two rounds of genotyping. After a primary round of standard genotyping, all of the batches are analyzed to select an exemplary batch for each probe set (the probes that interrogate a marker) as the source of SNP-specific prior (SSP) information.
These SSPs are then used by Affymetrix's AxiomGT1 algorithm to improve consistency and accuracy in the other batches in a second round of genotyping. Additionally, Webster pointed out that SSPs could also be used to improve detection of rare alleles found in the British population.
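The selection step in that workflow can be pictured with a small Python sketch. It is not Affymetrix's implementation: the article does not say how an exemplary batch is chosen, so the per-batch quality score, the probe set and batch names, and the function `pick_exemplary_batches` are all hypothetical stand-ins.

```python
from typing import Dict

# Hypothetical quality scores per probe set and batch (higher is better).
# The real selection criteria are not described in the article.
QualityTable = Dict[str, Dict[str, float]]


def pick_exemplary_batches(quality: QualityTable) -> Dict[str, str]:
    """Return, for each probe set, the batch with the highest score."""
    return {
        probe_set: max(scores, key=scores.get)
        for probe_set, scores in quality.items()
    }


# Toy first-round results for two probe sets across three batches.
quality = {
    "probe_set_A": {"batch_01": 0.991, "batch_02": 0.997, "batch_03": 0.985},
    "probe_set_B": {"batch_01": 0.999, "batch_02": 0.972, "batch_03": 0.990},
}

exemplars = pick_exemplary_batches(quality)
print(exemplars)
```

In this toy setup, the cluster positions from each probe set's chosen batch would then serve as the prior when the remaining batches are genotyped again in the second round.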
"With a 500,000 sample size, the minor allele of a rare SNP variant will inevitably now be observed multiple times as 0.01 percent on 500,000 sample results in an expected 50 alternate allele observations," said Webster. "This allows the identification of the specific location, an SSP, of an alternate rare allele with respect to the common allele," she said. "That information can be used to anchor its expected position and train the genotyping algorithm to detect such rare alternate alleles."
Using Affymetrix's approach, other probe set-specific modifications to the algorithm, such as the alteration of algorithmic parameters, or advanced normalization to mitigate plate-to-plate variation, can be applied to selected probe sets, as well, she noted.
A final round of analysis subsequently excludes a small percentage of probe sets that produce sub-optimal clusters, as well as probe sets with cluster patterns consistent with complex genetics at the marker site, resulting in more than three genotype clusters.
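A filter of this kind might look like the following Python sketch. The three-cluster limit reflects the three genotypes expected at a biallelic marker (AA, AB, and BB); the quality score, its cutoff, and all names are assumptions made for illustration, since the article does not describe the actual metrics.

```python
from dataclasses import dataclass

MAX_GENOTYPE_CLUSTERS = 3  # AA, AB and BB for a biallelic marker
MIN_CLUSTER_QUALITY = 0.9  # assumed cutoff, not taken from the article


@dataclass
class ProbeSetSummary:
    name: str
    cluster_count: int      # genotype clusters observed for the probe set
    cluster_quality: float  # hypothetical separation score between 0 and 1


def passes_final_filter(summary: ProbeSetSummary) -> bool:
    """Keep probe sets with at most three well-separated clusters."""
    return (
        summary.cluster_count <= MAX_GENOTYPE_CLUSTERS
        and summary.cluster_quality >= MIN_CLUSTER_QUALITY
    )


summaries = [
    ProbeSetSummary("probe_set_A", 3, 0.98),
    ProbeSetSummary("probe_set_B", 4, 0.97),  # extra cluster: complex genetics
    ProbeSetSummary("probe_set_C", 3, 0.62),  # sub-optimal clusters
]

kept = [s.name for s in summaries if passes_final_filter(s)]
print(kept)
```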
Webster also touted the development of a "highly parallel computational infrastructure" capable of supporting the analysis of 500,000 individuals on approximately 800,000 markers in less than three weeks. Taken together, the company and its partners noted in the ASHG abstract, these developments can decrease the missing information per batch from approximately 3 percent to less than 2 percent, increase the ability to detect rare genotypes, and increase the consistency of allele frequencies across batches.
Affymetrix is now hoping to parlay these developments into other biobank projects, Schmidt said.
"This process is absolutely applicable to other large sample cohorts typically found in national-scale biobank projects, GWAS consortia, or clinical trial cohorts," said Schmidt.
She reiterated that the process has been incorporated into the MVP, which upon completion will be twice the size of the UKBB project. But Affymetrix's methods are not applicable solely to industrial-scale genotyping.
"Interestingly, they have applicability for smaller sample-size projects as well," Schmidt noted, "as the specific location of a common and alternate allele at a given locus has a marker-dependent signature that is not sample- or array-dependent."
Because of this, Schmidt said, empirically determined marker signatures can be translated from a large project and "ported over to smaller sample size projects or replication studies."