Last week, Applied Biosystems and Celera Genomics made good on a promise originally made in May to release genomic sequence data from the Celera Discovery System into the public domain [BioInform 05-02-05]. The companies ended up releasing much more than sequence data, however, contributing more than 400,000 PCR primer-pair designs, more than 7 million SNPs, and 540,000 TaqMan gene expression assays to several databases hosted by the National Center for Biotechnology Information.
BioInform spoke to Peter Li, director of assays and content bioinformatics at Applied Biosystems, to get a better idea of how the organizations collaborated to bring ABI's once-proprietary data into the public domain.
Can you tell me how you worked with NCBI to get all this data into the public domain, and what some of the technical specifics were in getting this all to work properly?
This goes back to around six months ago, when we were originally thinking about what to do with this data set, which is relevant to our assay business because much of it is the supporting annotation for our assays. With the termination of the Celera Discovery System, this particular data set would no longer be available to our users, so we were concerned about that. One option, of course, was to deposit this information into the public domain, partly for PR and partly for the greater contribution to mankind, you might say. So we approached [NCBI] close to the end of our fiscal year in June, because by that time we knew that the Celera Discovery System was to be turned off, and we started working with them on the release of the information.
Ultimately, all the data was put out onto their repositories. So far, we've pushed the genome assembly data out, and that is in GenBank. We pushed out the trace files for the original reads that went into the human, mouse, and rat assemblies, and that's in the Trace [Archive]. I believe that they're still processing that. Even though the data is on their site, they haven't fully released everything, because it was such a huge amount of data: 65 million traces takes a long time to process. [As of Oct. 26, NCBI listed 55,056,715 Celera traces available through the Trace Archive out of a total of 879,852,296 traces. — Ed.]
We also deposited our SNP data, which comprised about 5.1 million human and 2.5 million mouse [SNPs]. In addition, as part of the overall process, we deposited our information about gene expression TaqMan assays, and that's been put into GEO, the Gene Expression Omnibus, at NCBI. We submitted three separate platform files: one for human, one for mouse, and one for rat. There are approximately 540,000 assays. So if you bring up the platform files in GEO, there's a link that goes to our MyScience site at Applied Biosystems, where you can purchase the assays. Each assay is linked independently.
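GEO platform files like the ones described here are distributed in GEO's plain-text SOFT format, with `!`-prefixed metadata attributes and a tab-delimited data table between `!platform_table_begin` and `!platform_table_end` markers. The sketch below parses a small SOFT-style record; the accession, attribute values, assay IDs, and the `ASSAY_LINK` column are hypothetical, and real files should be checked against GEO's SOFT specification.

```python
# Minimal sketch of parsing a GEO SOFT-format platform record.
# The record content below is hypothetical (fake accession, fake
# TaqMan-style assay IDs, fake link column), for illustration only.

SOFT_TEXT = """\
^PLATFORM = GPL0000
!Platform_title = Example TaqMan Gene Expression Assays (human)
!Platform_organism = Homo sapiens
!platform_table_begin
ID\tGENE_SYMBOL\tASSAY_LINK
Hs00000001_m1\tGENE1\thttps://example.com/assay?id=Hs00000001_m1
Hs00000002_m1\tGENE2\thttps://example.com/assay?id=Hs00000002_m1
!platform_table_end
"""

def parse_soft_platform(text):
    """Split a SOFT platform record into metadata and data-table rows."""
    meta, rows, header = {}, [], None
    in_table = False
    for line in text.splitlines():
        if line == "!platform_table_begin":
            in_table = True
        elif line == "!platform_table_end":
            in_table = False
        elif in_table:
            fields = line.split("\t")
            if header is None:
                header = fields          # first table line is the header
            else:
                rows.append(dict(zip(header, fields)))
        elif line.startswith("!") and " = " in line:
            key, value = line[1:].split(" = ", 1)
            meta[key] = value            # e.g. Platform_title
    return meta, rows

meta, rows = parse_soft_platform(SOFT_TEXT)
```

With a record in this shape, each row carries its own per-assay link, which matches the "each assay is linked independently" arrangement described above.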
The other data set we deposited is our VariantSEQr resequencing primers, and that has been put into NCBI's new database, called ProbeDB. So it's available at NCBI, and there are approximately 430,000 of these primer pairs for resequencing.
That's a pretty big project, so what kind of coordination did you have between yourselves and NCBI in terms of making this all available?
It happens that in most of these situations, they had basic processes already in place for receiving these files. GEO has been there for a long time, dbSNP has been there for a long time, and GenBank, of course, has been there a very long time. So we just submitted through that standard process: we generated the files according to the format specifications for these submissions and put them in.
For GEO, we worked out with them which fields are linked, and we made initial contact just to sort out the details and make sure the formatting was correct, and that came out without any problems. Occasionally we had to ping them just to make sure they had actually received the files, because some of these files are very, very large.
We established multiple layers of communication with them, people at my level and people below me, who communicated with each other on the details and the idiosyncrasies, because whenever you have such a bolus of data, there are always some problems. Either there are glitches in our publishing of the files for them, or there are problems in their specification, which hadn't anticipated particular combinations of data attributes, so we tried to work those out over the last couple of months.
Now, ProbeDB is a special situation because it's a brand-new database, although they established it with a couple of other submissions. The setup right now is very open in terms of what's actually needed. When we put resequencing data into ProbeDB, they hadn't had a prior submission for resequencing, so they had to go in and design the template pages for resequencing primers, deciding what information should be included, and we worked with them on sorting out those details.
How much primer data is already in the public domain?
There's actually a fair amount of so-called primer data, as in PCR data, in dbSTS. But that data set has mostly been used for mapping, because STSs are the markers used for genome mapping; they're not really intended for sequencing, resequencing, or medical sequencing. The primers we deposited are targeted specifically at resequencing, so their characteristics are slightly different from those of so-called STS primers. Even though the same protocol works, since it's all based on PCR, the design is very different.
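One concrete example of a criterion that differs between mapping primers and resequencing primers is melting temperature: resequencing primer pairs are typically chosen with closely matched Tm values so both primers perform under the same cycling conditions. The sketch below estimates Tm with the Wallace rule (Tm = 2(A+T) + 4(G+C)), a textbook approximation for short oligos; it is not ABI's actual design pipeline, and the primer sequences are hypothetical.

```python
# Rough Tm estimate for a short oligo via the Wallace rule:
# Tm (deg C) = 2 * (count of A + T) + 4 * (count of G + C).
# Illustration only; real primer-design software uses more
# sophisticated nearest-neighbor thermodynamic models.

def wallace_tm(primer):
    """Approximate melting temperature of a short primer in deg C."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc

# Hypothetical forward/reverse primer pair with matched Tm estimates.
forward = "AGCTGACCTGAAGT"
reverse = "GGCATTCGAACTGA"
tm_f, tm_r = wallace_tm(forward), wallace_tm(reverse)
```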
Who is responsible for maintaining the links to MyScience in GEO?
The links will be maintained by us. They're part of the standard format specification for the deposition file. In fact, we wondered about that at the very beginning, so we started looking at it, and it turns out they had already thought about this issue and accommodated it by giving submitters a way to specify the link back to their own sites.
How does this relationship with NCBI's resources impact what you're doing through the MyScience portal?
It's a complementary process. In MyScience we have all the information about the details of the assays themselves, and we still have some proprietary data that's not yet deposited, although as time shifts and the public catches up, we will be moving that data into the public domain. Because it's tied into our product offerings, [MyScience] has a lot more connections to the supporting information about our products that is not available in the public domain. We submit to NCBI mostly information dealing with the scientific nature of the assays, while in MyScience we have more on the procedures and the way to use the information.
What are your plans for updating this data or any other future releases?
We are still looking at gene-annotation data. That release has been announced, but the data is still going through our analysis and filtering, because we want to make sure it's the best quality we can put out.