TGen Lab Wins Microsoft Grant to Develop Universal Data Format for Genotyping Arrays

Name: John Pearson
Title: Head, Bioinformatics Research Unit, the Translational Genomics Research Institute
Background: Pearson completed his undergraduate degree in biochemistry at the University of Queensland, Australia, and has postgraduate qualifications in physiology and pharmacology; computer science; and engineering and technology management from the Queensland University of Technology and the University of Queensland.
Before joining TGen, he was the lead programmer in the Bioinformatics and Scientific Programming Core at the National Human Genome Research Institute in Bethesda, Md.

Microsoft Research last week awarded more than $850,000 to six research projects under its “Computational Challenges of Genome Wide Association Studies” program.
One of the six projects, selected from 40 proposals submitted by 39 academic institutions worldwide, is titled "A Universal Data Format for Genotype Microarrays." The specific funding amount for the project was undisclosed.
Led by John Pearson, head of the Bioinformatics Research Unit at the Translational Genomics Research Institute in Phoenix, the project aims to develop a data format that would “accommodate multiple vendor platforms into a single file and software library.”
Optimally, Pearson believes the Universal Data Format software could be available in six months, with upgrades to follow. The longer-term objective is to encourage more data integration from different genotyping platforms to facilitate research.
To learn more about the development of the format, BioArray News spoke with Pearson this week.
What is your background, and how will TGen help in this project?
I originally studied biochemistry, but these days I am purely computational. The Neurogenomics Division at TGen, of which I’m a part, does tens of thousands of genotype arrays per year, so for the last five years I have done a lot of work with genotyping arrays. That naturally led to this interest.
I already have a small lab and my group will add this to the tasks that we are already undertaking. I will definitely be collaborating with the other TGen labs, led by David Craig and Dietrich Stephan, which are part of the GenePool software grant from the National Heart, Lung, and Blood Institute. [TGen received NIH funding to create GenePool in 2006 — Editor]
This work will be centered in my lab but there will be collaborations with other TGen bioinformaticians. TGen has a pretty strong theoretical computational biology group and we will definitely be looking for some feedback from them.
Why is there a need for a universal data format?
We have a background in using pooled genomic DNA and we often want to combine data from multiple platforms. It is not unusual for us to combine data from an Affy platform with data from an Illumina platform, for example.
Secondly, we consider some of the Affy products to be, in essence, multiplatform themselves. So the 500K is really two 250K chips, for example. So every time you do a 500K experiment, you are doing a multiplatform experiment. The fact is that whether people think about it or not, a lot of people are already doing multiple-platform experiments, so we are just taking the next logical step.
Affy and Illumina both have good data analysis tools and if you use the existing Illumina and Affy tools, the complexity of the underlying files is hidden from you. But if you are a tool developer or [if] you want to do some novel analysis, you are back in the world where you have to understand all the nuances and complexity of the data and the data files. And that’s where we want to help, so that everybody can do data analysis without having to learn the data formats from scratch.
Why hasn't this problem been addressed previously and what methods have you been using in cross-platform data comparisons at TGen in the past?
We just do the task manually and we have expertise in the different platforms. We understand Affy and we understand Illumina. You do the analysis independently, and at the end you shuffle the results together like a deck of cards so that the results for all SNPs on the two platforms interleave.
I am sure that many folks have done it this way, but I have never seen a proper solution published. Everyone does it as a manual one-off task and nobody sort of formalizes the method.
You mentioned NIH funding. What have you accomplished on this project to date?
This is an outgrowth of an NIH grant we have from the NHLBI to work on genome-wide association methodology. Our particular project is to design software for analyzing genotyping experiments using pooled genomic DNA – our GenePool software. That’s a scenario where we do often want to combine results from different platforms.
So this Microsoft grant is a natural outgrowth of the NHLBI grant and as part of that grant we did some thinking about what we wanted the [universal data format] to look like and what features we should create. So now we have a picture of what we want to make and the Microsoft money will allow us to finally make it.
The grants are awarded for a single year, so they are not expecting you to conduct a long research project. We proposed a deliverable, and now it’s a case of putting our nose to the grindstone and delivering the deliverable.
Why are you doing it in C++?
C++ is fast and it is general. It runs on many platforms and that is important for availability. We are hoping to create bindings in other languages. For example, I have some prototype UDF code working in Perl. If folks want to use the UDF from other programming languages, hopefully the decision to use C++ will make it easier to create bindings for those other languages.
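Pearson doesn't show the UDF API in the interview, but the binding-friendly design he describes is commonly achieved by pairing a C++ core with a flat C-linkage facade, since foreign-function interfaces in Perl, Python, and other languages bind most easily to plain C symbols. The class and function names below are illustrative assumptions, not the actual UDF library:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch (not the real UDF API): a C++ core class plus
// an extern "C" wrapper layer, a common pattern for making a C++
// library easy to bind from scripting languages.
class UdfFile {
public:
    void add_snp(const std::string& id, std::vector<double> intensities) {
        snps_[id] = std::move(intensities);
    }
    std::size_t snp_count() const { return snps_.size(); }
private:
    std::map<std::string, std::vector<double>> snps_;
};

// C-linkage facade: opaque handle plus plain C types only, so a
// foreign-function interface never sees C++ name mangling.
extern "C" {
    UdfFile* udf_open() { return new UdfFile(); }
    void udf_add_snp(UdfFile* f, const char* id,
                     const double* vals, unsigned n) {
        f->add_snp(id, std::vector<double>(vals, vals + n));
    }
    unsigned udf_snp_count(const UdfFile* f) {
        return static_cast<unsigned>(f->snp_count());
    }
    void udf_close(UdfFile* f) { delete f; }
}
```

A Perl or Python binding would then only need to declare the four flat `udf_*` entry points rather than wrap the C++ class directly.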

What are some of the problems with integrating Affy and Illumina data into a UDF, and how do you plan to overcome them?

At TGen, we are most familiar with Affy, Agilent, and Illumina. Personally, I am most familiar with Illumina and Affy, so that’s why they are my first targets. Down the road, we would definitely also like to look at integrating chips from other vendors into the capabilities of the program.
Initially the UDF is for genotype microarrays, but we are trying to design the UDF to keep the door open for possibly storing other intensity-based chip types, at which time we’d look at the Agilent comparative genomic hybridization and gene expression chips.
The way Affy and Illumina output data is different for each chip. With Illumina you get a directory, and within that directory there are files for each strip – strips are physical features of the Illumina chip. Illumina chips have a variable number of beads for each SNP, so when you start your experiment, you don’t know how many intensity values will be present for each SNP.
Affy has a completely different layout for data files, but you do know ahead of time how many intensity values there are for each SNP. That number varies by platform, though; on one Affy platform there might be 40 features you have to read for each SNP. On Affy you have a DAT file, which is the image of the chip, and that is processed to give you a CEL file, which contains the intensities for each feature on the chip.
Then you use the CDF file, which is a ‘recipe file’ that acts like a decoder ring for the chip telling you which features are part of each SNP probe set. You do some math on the probe set intensities and that gives you your genotypes. At the end you have your genotypes but those are still annotated with the vendor’s ID numbers. Then you add an annotation file that shows you which dbSNP IDs correspond to the Affy IDs.
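The chain Pearson walks through – CEL intensities, a CDF "decoder ring," a genotype call, then an annotation lookup – can be sketched as a series of plain lookups. The structures and the simple calling rule below are illustrative assumptions, not the actual Affymetrix file layouts or calling algorithm:

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative sketch of the Affy stages described above. Field names
// and the genotype rule are assumptions, not the real CEL/CDF formats.

// CEL stage: one intensity per physical feature on the chip.
using CelIntensities = std::vector<double>;  // indexed by feature number

// CDF stage: the "decoder ring" mapping each SNP probe set to the
// chip features that interrogate its A and B alleles.
struct ProbeSet {
    std::vector<int> allele_a_features;
    std::vector<int> allele_b_features;
};
using Cdf = std::map<std::string, ProbeSet>;  // vendor SNP ID -> probe set

// Genotype call: compare summarized A and B intensities. This is a toy
// threshold rule; real callers cluster intensities across samples.
std::string call_genotype(const CelIntensities& cel, const ProbeSet& ps) {
    double a = 0, b = 0;
    for (int f : ps.allele_a_features) a += cel[f];
    for (int f : ps.allele_b_features) b += cel[f];
    if (a > 2 * b) return "AA";
    if (b > 2 * a) return "BB";
    return "AB";
}

// Annotation stage: vendor ID -> dbSNP ID, applied after calling.
using Annotation = std::map<std::string, std::string>;
```

The point of the sketch is the data flow: intensities are summarized per probe set, called, and only then translated from vendor IDs to dbSNP IDs.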
The commonality of UDF is that we will be storing intensity info. The reason we went with that rather than genotype is that statisticians like their data as raw as possible. The less you touch their data beforehand the better so we wanted to stay as far back in [the] data stream as possible. We want to store intensity data so that future tool builders will be able to reach back into the raw data to do their analysis.
In our planning document, we know how to treat both sets of data to put them into this format and we believe the preliminary design of the format will incorporate both without any difficulty.
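A record that keeps raw per-SNP intensities in a variable-length vector, as hinted at above, covers both vendors' quirks: Illumina's bead counts that are unknown until read time and Affy's counts that are fixed per platform. The names below are assumptions for illustration, not the actual UDF design:

```cpp
#include <map>
#include <string>
#include <vector>

// Sketch of a cross-platform intensity record (names are assumptions,
// not the actual UDF design). Storing raw intensities, untouched,
// matches the stated goal of staying as far back in the data stream
// as possible for statisticians and future tool builders.
struct SnpRecord {
    std::string vendor_id;            // e.g. an Affy or Illumina probe ID
    std::vector<double> intensities;  // raw values; length varies per SNP
};

using UniversalStore = std::map<std::string, SnpRecord>;

void store(UniversalStore& s, const std::string& vendor_id,
           std::vector<double> raw) {
    s[vendor_id] = SnpRecord{vendor_id, std::move(raw)};
}
```

Because each record carries its own length, a five-bead Illumina SNP and a fixed-count Affy SNP sit side by side in the same container.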
When and how do you expect this format will become available to vendors and other users?
The plan is to make the first version available in six months. That’s the plan but we are flexible with time and we are not going to release something horrible. We will put something out when it is ready for people to take a look at it.
It will be available under a BSD [open source] license, which places few restrictions on the further use of the code. Anybody with a research project or commercial software could incorporate the code into their software. It is not a GPL license, where someone cannot incorporate your code without releasing their own code.
The BSD license is particularly appropriate because one of the groups we’d like to adopt the UDF is the vendors. The biggest benefits would be realized if some of those folks picked it up so that UDF is one of the options in their software. This will be possible because the code will be under the BSD license, with next to no restrictions on what folks do with it down the line.
We have chatted with both vendors in the past, and we have definitely discussed our pooling software with those partners. We want to approach them once the format is designed so it can be incorporated into their software.
Do you have plans for follow-on versions?
I think there are some things we can hold off on for the six-month release. One example is the inclusion in the format of annotations about the SNPs. In the past you would end your analysis with the vendor-supplied IDs and then use an annotation file to convert vendor IDs to dbSNP IDs. We will put annotation information inside the UDF so that the annotation info for each SNP moves through the system with the intensity data for that SNP.
So the first release will just incorporate annotations, but over time the UDF library will need to have routines to “refresh” the annotations so that the UDF will add new info about the SNP to the data library. So the ability to refresh your library will definitely be part of the code. Those features could be put in after the initial six-month release.
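The "refresh" routine Pearson describes could be as simple as updating the vendor-ID-to-dbSNP mappings carried inside the file from a newer release, without touching the intensity data. This is a hypothetical sketch; the function name and map layout are assumptions:

```cpp
#include <map>
#include <string>

// Hypothetical sketch of the "refresh" idea: annotations ride inside
// the UDF, and a library routine updates them in place from a newer
// vendor-to-dbSNP mapping, leaving intensity data untouched.
using AnnotationMap = std::map<std::string, std::string>;  // vendor ID -> dbSNP ID

// Returns how many entries were added or changed by the refresh.
int refresh_annotations(AnnotationMap& in_file, const AnnotationMap& latest) {
    int updated = 0;
    for (const auto& [vendor_id, dbsnp_id] : latest) {
        auto it = in_file.find(vendor_id);
        if (it == in_file.end() || it->second != dbsnp_id) {
            in_file[vendor_id] = dbsnp_id;
            ++updated;
        }
    }
    return updated;
}
```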
