As second-generation sequencing data storms into databases at the National Center for Biotechnology Information, the European Bioinformatics Institute, and the DNA Data Bank of Japan, the databases are working together more intensely under the auspices of the International Nucleotide Sequence Database Collaboration.
EBI, DDBJ, and NCBI are now setting up mirrored short-read archives in which deposits made at any one of the archives will be available for search and download at all of them, similar to the current arrangement for Sanger sequencing data. The three short-read archives will share a common data model and a single accession space, and will mirror data and metadata updates daily.
The first official exchange of short-read data between EBI and NCBI took place this week, NCBI staff scientist Martin Shumway told BioInform in an e-mail. In addition, DDBJ has been regularly brokering Japanese submissions to NCBI.
One challenge the databases are working on is how to mirror the full set of instrumentation data gathered for a single sequencing run. One approach is to mirror only the base calls and qualities, while keeping the remainder at the data’s home archive. Users seeking a complete download can find that at the home archive, but most users, Shumway said, will opt for the more concise data served from the local archive.
There is also a generic short-read alignment format called SAM, for Sequence Alignment/Map, being developed by the 1000 Genomes Project “and other stakeholders” as a common, cross-platform data format, said Shumway.
SAM is a tab-delimited format for sequence read and mapping data. Shumway said that it is “flexible enough to store all the alignment information generated by various alignment programs, … simple enough to be easily generated by alignment programs or converted from existing alignment formats, … and compact in file size.”
It also allows most alignment operations to work on a stream without loading the whole alignment into memory. In addition, the file can be indexed by genomic position, which lets researchers retrieve all reads aligning to a locus.
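To make the format concrete, the minimal Python sketch below shows how a single tab-delimited alignment record might be split into named fields and processed as a stream, one line at a time. The field names and the example record are illustrative only and are not drawn from the SAM specification itself.

```python
# Illustrative only: splitting one tab-delimited SAM-style alignment record
# into named columns. Field names here are descriptive, not the official ones.
FIELDS = ["read_name", "flag", "ref_name", "pos", "map_quality", "cigar",
          "mate_ref", "mate_pos", "insert_size", "sequence", "qualities"]

def parse_alignment_line(line):
    """Return a dict of the mandatory columns plus any trailing optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, cols[:len(FIELDS)]))
    record["optional_tags"] = cols[len(FIELDS):]
    return record

# A made-up record: a 36-base read aligned to chromosome 1 at position 100.
example = "read_001\t0\tchr1\t100\t60\t36M\t*\t0\t0\t" + "ACGT" * 9 + "\t" + "I" * 36
print(parse_alignment_line(example)["pos"])  # prints 100 (as a string)
```

Because records are parsed line by line, an entire alignment file never needs to be held in memory, which is the streaming property Shumway describes.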
The SAM standard is currently being applied to pilot data from 180 individuals in the 1000 Genomes Project and by next December it will have been used to align the sequences of 1,200 individuals to the human genome reference assembly, Shumway said.
Another file format is the Sequence Read File format, or SRF, a common, cross-platform format for raw sequencing data. The second-generation sequencing technology vendors, production centers, and archives have committed to it as the standard interchange format for raw sequencing data. SRF serves as a container, carrying side information about sequencing runs as well as raw base calls, qualities, and signal-intensity measurements, Shumway said.
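Shumway did not detail SRF’s internal layout, but conceptually each entry in the container bundles per-read measurements with run-level information. The hypothetical Python sketch below is only a conceptual picture of that payload, not the actual binary format:

```python
# Conceptual sketch only: the kinds of information an SRF container carries,
# per the description above. This is NOT the real binary SRF layout.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class RawRead:
    read_id: str
    base_calls: str           # e.g. "ACGTTGCA..."
    qualities: List[int]      # per-base quality scores
    intensities: List[float]  # raw signal-intensity measurements

@dataclass
class SequencingRun:
    run_info: Dict[str, str]  # instrument, run date, chemistry, and so on
    reads: List[RawRead]
```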
NCBI Director David Lipman spoke to BioInform recently about the impact of second-generation sequence data at NCBI. The following text is an edited version of that conversation.
What volume of data downloads is NCBI experiencing at the moment?
Right now, just from our site, in general we are getting over 5 terabytes of data downloaded a day.
What are some of the challenges of second-generation sequencing data for databases?
I think the major sequence databases are in a difficult situation with next-gen sequencing for the following reason: The technology is still changing and … it is going to continue to change. For example, a company with one kind of product, the [Illumina Genome Analyzer] with 35-base pair sequencing, is pushing that technology [forward]. Another type of next-gen technology, single-molecule sequencing, which could be out in a year, could produce fairly long sequences, and lots of them. And the error characteristics of the sequencers are changing.
On the one hand you want to have a repository that can take a fair amount of information about the experiment so that others can analyze it and look at, for example, better methods for dealing with quality issues or for aligning the reads to genomes.
These machines work well enough now that biologists are answering questions with them. You can spend a lot of time working on the error characteristics but a lot of data matches the reference genome essentially 100 percent. … So you have a lot of data of use right now.
In a sense there’s a huge amount of work to deal with a problem that is going to be fairly transitory, which is holding the most raw form of the data. And there is the challenge of developing a model for what people will need to submit in the near future, so that others can benefit from the data just as they have from the sequence databases in the past.
For example, for epigenomics, or ChIP-seq [chromatin immunoprecipitation with sequencing], or for expression, it may well be that wiggle plot [a format used in the UCSC Genome Browser] type of information with the metadata is basically all you need.
There are still some issues to get to that point, because things are changing so much. … It’s a bit like the problem with expression data, or the older type of ChIP data, where the target was moving and the quality was shifting; you have to keep moving very quickly to where you think the problem is going to be a year from now.
We know that the most raw form of this data is not going to be saved in the near future; it won’t be worth it. But we are not quite there yet. Some groups are saving it, and some groups like ours need to be able to archive it so that some of the toolmakers and others can assess it.
When it comes to submissions the large sequencing centers probably face different challenges than smaller labs in terms of data transfer. How does that play out for NCBI?
Working with the community we have defined some submission formats, which are better suited to this kind of data. The idea behind them is, ‘If you can give us all of this type of information, beyond what is called FastQ [a format that combines sequence and base-quality data for each read], then we can take it all.’ That allows a wider range of people to ask different kinds of questions, many of which are methodological.
There is also a more streamlined, abbreviated form of this we can take from more of a ‘mom-and-pop’ lab that managed, through a contract or through a core lab, to get some ChIP-seq or expression work done, so that they can submit their data.
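For context, the FastQ format Lipman mentions stores each read as a short record pairing an identifier with its base calls and per-base qualities. A minimal, hypothetical Python reader for the conventional four-line record form:

```python
# Minimal FastQ reader sketch. Each record is four lines: an "@" identifier,
# the sequence, a "+" separator, and a quality string with one character per base.
def read_fastq(path):
    """Yield (read_id, sequence, quality) tuples from a FastQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                return
            seq = handle.readline().rstrip("\n")
            handle.readline()                      # "+" separator line
            qual = handle.readline().rstrip("\n")
            yield header.lstrip("@"), seq, qual

# A made-up record of the kind such a file contains:
# @read_001
# ACGTACGTACGT
# +
# IIIIIIIIIIII
```

Anything beyond this, such as signal intensities or instrument settings, is what the fuller submission formats are meant to capture.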
Data volume is an issue right now; in the near future what we are talking about is a different structure of data. For example, if you are studying regulation and looking at a few genes, looking at T cells treated with interleukin and at Jak-Stat [signaling], you might have seen a paper in which the scientists described what they found for one region, maybe the location of a binding site. Generally they wouldn’t even submit something to the databases. It would have been nice to get that data in, but generally speaking it wouldn’t get into the databases.
Now if you do a ChIP-seq experiment, we know which genome is the coordinate system: it’s the reference genome for human or for mouse, let’s say. You have the metadata about the experiment. Then you get the information from the wiggle plot, where you know these are roughly the regions where the proteins are binding and so forth. And you have information on how they are processing their raw data; there are a few methods people are using right now.
Then, yes, the data is fairly extensive because the coordinate system is three billion [bases] long. But in fact the amount of data you are talking about for a submission, if you are not actually sending all of the reads, is a fair amount, though not that huge. It is, however, a data structure that none of the databases have been taking before.
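As an illustration of the ‘wiggle plot’ type of submission described here, the hypothetical Python sketch below serializes binned read coverage along the reference coordinate system as a UCSC-style fixedStep wiggle track; the chromosome, bin size, and counts are invented, and this is only one of several ways such processed data can be represented.

```python
# Hypothetical sketch: writing binned ChIP-seq read coverage as a UCSC-style
# fixedStep wiggle track, i.e. the processed, coordinate-anchored form of the
# data rather than the raw reads. Values and coordinates are made up.
def write_wiggle_track(counts, chrom="chr1", start=10001, step=25,
                       name="ChIP-seq coverage"):
    lines = ['track type=wiggle_0 name="%s"' % name,
             "fixedStep chrom=%s start=%d step=%d span=%d"
             % (chrom, start, step, step)]
    lines.extend(str(count) for count in counts)
    return "\n".join(lines)

print(write_wiggle_track([12, 15, 40, 88, 35]))
```

A track like this, plus the experimental metadata, is far smaller than the underlying reads, which is why such a submission is substantial but not huge.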
I think that the challenge is more: What are the tools on our end for being able to handle that data? We’ve made a lot of progress; we’re basically there for a lot of it. It’s not totally in place, but a bunch of it is already there, working, and getting submissions.
The companies that manufacture sequencers are providing a fair amount of software. As the software on the researcher’s desktop side starts to stabilize, the goal is to be able to create a pipeline that makes it much easier for someone to do their experiment, do their own analysis, write their paper, and then have the data in a form that can go into the sequence databases.
The manufacturers are trying to coordinate with the sequencing centers in terms of formats so that their software can put the data in a format that can go right in [to the databases].
Is the format challenge deeply rooted in the differences between the instruments?
Not that there isn’t a challenge, but the community was much more prepared for it in some sense. There were meetings between the data producers, the companies, and us. There are still some challenges, primarily because the target is moving and many scientists are already obtaining useful results. We can’t wait and work out all of the quality-control and assessment problems now and forget about the people who are already getting useful answers, who are producing these other data structures and not keeping all that raw data, because they don’t feel they need it.
The databases sort of have to solve two problems at once, and the first one, dealing with the raw data, may well be much less of a problem in the near future, because enough of the problems will have been worked out. That won’t be so critical.
As reads get longer, will the Short Read Archive become the former Short Read Archive?
The Short Read Archive is set up to handle much larger quantities of data and is much more the way the machines actually produce the data, because they are getting lots of reads at once.
For some of the applications of the new technology, scientists won’t be storing any of the raw data. There won’t be any raw data associated with it. People will be confident enough about how the reads align to the genome that you are going to use a more processed form. If you are trying to do polymorphisms, you may need to retain more of the raw data. For certain experiments with clinical isolates that are very valuable, people will be willing to store more of the raw data because people will want to harvest from it.
How much additional data, such as annotations, should scientists submit with their new data?
We are going to want to encourage scientists to submit a fair amount of metadata. Even though it’s sequence data, you’re looking at expression, you’re looking at epigenetics, or ChIP-seq, all of which are completely influenced by the conditions and the biology.
What do you see coming around the bend to handle second-generation sequence data?
My concern, given that resources are tight, is balancing the amount of work we’re doing to make sure we can take a fairly raw form of the data, so that we can understand its characteristics and people developing tools can refine them, against the work on the more processed forms of the data, which biologists are going to be more concerned about. In the end, we want to serve the largest group in the biomedical community, who are going to be focused on these more processed forms of the data.
If someone submits a paper in which they were studying a particular phenomenon with ChIP-seq or expression, they answered their question with it. These datasets, many of them, look good enough and clean enough that somebody with a very different set of questions could get useful results from analyzing that.
The effort for the community to find those experiments in the database, extract them, get them into tools so they can do their own analysis, has to be low enough so that people can do that.
Some scientists have mentioned that the dataset published as a paper’s supplementary material may differ from the data associated with that paper at NCBI, and that sometimes there are even related datasets on the scientist’s own web page with yet different data, all relating to the same study. Does that concern you?
Right now, it’s a reality. Whenever I give a talk at a university, researchers will come up to me and say, ‘Geez, why can’t you get some of these genomes into GenBank? They’re on this or that web site, in this format or that format.’
It’s not that they love us so much, it’s just we’re the devil you know and we do it a certain standard way and they have figured out how to deal with that. Right now these datasets are, to some extent, every which way to Sunday. It will be easier if researchers can get it from us, EBI, or DDBJ, and it’s in a standard form. We’re not quite there yet.
Scientists want to compare large datasets with each other. How can databases help with that?
They are able to do that for expression data now. The community has figured out how to solve these sorts of problems. Really, having this phenomenal technology has been more of a boon than a bad thing. It’s sort of like the disappearing cat in Alice in Wonderland: the issue is you can’t focus too much on the raw data problem and forget the results people are getting. On the other hand, we can’t put all of our resources into how the data is going to come to us, because you want some of these canonical studies to be available to people.
Just when you think you are getting sort of on top of that, you have these surprise issues like single-molecule sequencing, which is coming along amazingly quickly and which will have other properties. The challenge is in prioritizing.
For the raw data, there is a volume issue you have to deal with in exchanging data between EBI and us: for the canonical datasets with raw data, those are big datasets to exchange. You can’t ignore the fact that there is a quantity problem for the raw data.
The third thing is that the way people are going to use these data is different from anything we’ve done before, so the data structures and software tools are different. We have to contend with qualitative differences in the tools. All that said, the community is working very well together, the companies have tried to be as helpful as they can be, and I give them credit for that.
How are the databases generally working together?
EBI, NCBI, and DDBJ are working well together. I feel confident we are going to sort this out. There is some software being shared by various sites, and the data formats are being established by a group. The companies manufacturing sequencers are making sure they can output the data in a format that follows the data specifications. People are not just talking; there is real coordinated work.
For the 1000 Genomes Project there is a joint analysis group that’s evaluating a lot of the methodologies. We’re part of that along with EBI and others.
Is NCBI intended to be not only a site where researchers deposit and retrieve data, but also a place where they can analyze their data?
Our primary concern is making sure we can provide the data. If we don’t do that, then having a bunch of tools is useless. We are not directly developing tools for aligning reads to the reference genome. We are evaluating tools but not developing them. We will provide ways that you can visualize wiggle plots and some simple viewing and analysis, but we suspect just like with expression datasets, that if someone is really going to be studying somebody else’s data they will download it. The primary concern is: is the data there, can they find it, can they retrieve it, can they download it in a uniform way into their tools so they can analyze it well?
These datasets are so data-rich — let’s just take expression — that we will probably be doing our own in-house analyses of them, for example finding new splice forms and things like that, to improve our retrieval tools. We have been doing that for years, comparing sequences.
We provide ways, if you are looking at a particular gene and its expression pattern, to find other genes that show that same pattern across the different conditions in the series of experiments that were done. … Since we pre-computed that, you can find them.
Once you’ve seen a few of these, you might want to say, ‘I want to download this dataset and work on it locally.’ But you can do some preliminary identification of interesting things online.
For high-throughput experiments we are not doing this now, but we will be. For epigenomics, if there is a really good dataset, if people have found regulatory regions or modified regions, then based on looking at multiple datasets, you can use that to annotate aspects of the reference genome. And then it might be useful to you not by first asking, ‘Do you have the experiment?’ but by looking at the genome and seeing, ‘This has putatively got some kind of modification under some condition.’ From scanning the genome you might decide you want to look at this dataset. That is where we are moving.
We want to make sure the data is there and people don’t have to sort through 10 different websites and 10 different formats to get at it. The next step is, when there are some very robust things that can be computed out of that data, can we provide that and help people discover that this experiment and this dataset is interesting to them?
How will the databases serve scientists trying to decipher the biological implications of the data?
The molecular picture of the organism is getting richer and more complex. That’s ultimately the real challenge for the databases: having an organizing principle that represents this complexity so that people can find the information they want.
Put bluntly, if the only way you can get understanding about the data is by reading papers and occasionally you pull a dataset and compute on it and discover something, that has less utility than if your information systems with the computable data actually have some structure to them that is well enough aligned to what is really going on biologically. That is, I think, what the community wants. When you have such a powerful new method, [we are seeing] that it is clearly going to cause us to change our views of gene regulation and epigenetics, [so] this is going to be a wild ride.