In two weeks, the National Center for Biotechnology Information will unveil a redesigned version of its Entrez Genome Project database, which will be renamed the BioProject database.
NCBI said that the change is being made to allow the resource to "more flexibly manage diverse research projects."
A well-used resource in the research community, Entrez Genome Project is a collection of nearly 3,000 complete and incomplete large-scale sequencing, assembly, annotation, and mapping projects for cellular organisms. The database is organized into organism-specific overviews that function as portals from which users can browse and retrieve projects that pertain to particular organisms.
In a note on its website, NCBI explained that the new BioProject database will represent "a higher order organization of research initiatives and the corresponding data which is deposited into several archival databases maintained by members of the [International Nucleotide Sequence Database Collaboration]."
BioInform spoke with NCBI's Ilene Karsch Mizrachi, GenBank coordinator, and Kim Pruitt, RefSeq coordinator, about the changes to the resource and what they mean for researchers. Below is an edited version of the conversation.
Let's start off with the most obvious question: why are you redesigning the Genome Project database?
Ilene Karsch Mizrachi: We are redesigning the database to make it more inclusive. At NCBI, in the different primary data archives, we archive data that’s more than just genomes. We have transcriptome data, we have variation data and sequence reads that are epigenetic. By changing from Genome Projects to BioProjects, we are including the other NCBI resources within the database.
Kim Pruitt: We were starting to realize the limitations in the old design. It was too limited in its focus on genomes and the structure of the database itself had become a limiting factor.
Can you provide some more details about the planned changes?
KP: We are streamlining the presentation interface. It will be more focused on the registered projects, so we are removing a lot of the 'decorations', if you will, that are on the current Genome Project page and that distract from the point of the project. So the new presentation interface will be a cleaner definition of the data model, which is a registration of meta-information about a research project.
Some of the information that I call decorations, [for example] images, will become available in the Entrez Genomes resource.
IKM: Our plan is to release it in Entrez in about two weeks. We are going through the final phases of testing now.
Any changes to the infrastructure you have in place? For example, are you ramping up your storage capacity?
KP: As part of this redesign, we did completely redevelop the underlying database structure. One of the changes gives us more flexibility in how we group projects together in the database. For example, for a large complex international collaboration that has many different sub-projects but is considered to be a single major initiative, such as the Human Microbiome Project or the ENCODE project, the new database design allows us to more flexibly represent the substructure of these large initiatives.
We also added some fields, which we haven’t fully populated yet, to provide information about keywords or major areas of relevance; this will help highlight key points about different types of projects.
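As a rough illustration of the grouping Pruitt describes, an umbrella initiative with nested sub-projects can be modeled as a simple tree. This is only a sketch: the `Project` class, its field names, the sub-project titles, and the keyword value are hypothetical and do not reflect NCBI's actual schema.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Project:
    """Hypothetical project record; fields are illustrative, not NCBI's schema."""
    title: str
    keywords: List[str] = field(default_factory=list)
    subprojects: List["Project"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Project"]]:
        """Yield (depth, project) pairs for this project and all nested sub-projects."""
        yield depth, self
        for child in self.subprojects:
            yield from child.walk(depth + 1)


# One major initiative with placeholder sub-projects (titles are made up).
umbrella = Project(
    title="Human Microbiome Project",
    keywords=["example-keyword"],  # placeholder for the new keyword fields
    subprojects=[
        Project("Hypothetical sub-project A"),
        Project(
            "Hypothetical sub-project B",
            subprojects=[Project("Hypothetical sub-project B1")],
        ),
    ],
)

# Print the initiative as an indented outline.
for depth, project in umbrella.walk():
    print("  " * depth + project.title)
```

A tree of this kind lets a single initiative be browsed as one entry while each sub-project keeps its own record.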
Talk a bit more about the limitations of the old structure.
KP: In our old structure, we were grouping things taxonomically, so we were very oriented on the known organism space. That worked fine if you are talking about projects related to the study of humans or projects related to the study of other known animals or bacteria, but it didn’t work for some of the environmental sample projects. The old database structure didn’t fully support representing non-organismal research projects such as environmental metagenome studies. The new database provides support for that.
What's the impact of these changes on researchers? Will they have to change the way they enter their data or include additional data in their submissions?
IKM: We are building a new submission interface, which in some ways I think … will be easier to use than the current page. The current page was focused on genome submissions and didn't address the other types of projects that submitters were working on and didn’t prompt them for the correct information for the other types of projects.
The other thing which we are starting to do is integrate some of the submission systems within NCBI so that in the future, a submitter can go to one place and enter information about their project and their samples and provide the data. That’s a future step.
I think the submission process for BioProjects is a bit more intuitive and guides the users better.
Will the system require researchers to submit additional information with their projects?
IKM: We are giving them the opportunity to provide grant information, so that program officers can query the BioProjects database to find out what kind of data is being submitted to NCBI that way. In the past, we have gotten requests from program officers at the National Institutes of Health to let them know which projects people are submitting data on.
They can also enter more information about the different aspects of their project; they can let us know within the form if they are part of a major initiative so that we could link their projects to the initiative.
KP: In environmental studies, [because] the old interface was very focused on genomes, submitters couldn’t indicate that their project was actually an expression study or an environmental sample type of project rather than a genome sequencing project. So we do collect more information in the new submission interface.
Is the additional information mandatory?
IKM: There is a lot of optional information. What is required is what the initiative is, the types of data you intend to submit within the project, and an organism name or description, for instance, a soil metagenome or a mouse gut metagenome.
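The split between required and optional information that Mizrachi outlines could be checked along these lines. This is a minimal sketch under stated assumptions: the field names and the `missing_required` helper are invented for illustration and are not part of any NCBI submission interface.

```python
# Hypothetical field names for the three required items mentioned in the interview.
REQUIRED_FIELDS = ("initiative", "data_types", "organism_or_description")


def missing_required(submission: dict) -> list:
    """Return the required fields that are absent or empty in a submission."""
    return [name for name in REQUIRED_FIELDS if not submission.get(name)]


# Optional details such as grant information may be left out entirely.
complete = {
    "initiative": "Example soil survey",
    "data_types": ["sequence reads"],
    "organism_or_description": "soil metagenome",
}
incomplete = {"initiative": "Example soil survey", "grant": "example grant"}

print(missing_required(complete))
print(missing_required(incomplete))
```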
When and why did the database expand to include more than just genomes?
IKM: I think it was more the evolution of the data. When we started Genome Projects several years ago, we were primarily receiving bacterial genome submissions; it was also the start of the human genome era. With the new developments in sequencing technology, and the technology becoming more affordable, researchers in other fields such as phylogenetics started using large-scale sequencing [and] the types of data that we are now receiving changed. Sequencing is not only focused on the genome of bacteria but also, for instance, what kinds of microorganisms live in the sea.
Your website also mentions that there will be some changes to the prokaryotic genomic resources to accommodate the large number of genomic sequences being deposited. Let's talk a little bit about these changes.
IKM: When we first started receiving bacterial genomes, our taxonomy database assigned a unique identifier for each strain. So if you are sequencing a strain of Escherichia coli, the taxonomic identifier would be the genus, species, and the strain. Initially when we started Genome Projects, the unique identifier for the projects was actually the strain-specific tax ID. We realized after we had gotten our fiftieth or so E. coli genome that this was becoming untenable.
Currently, there is a lot of bacterial surveillance research where individuals are looking at large outbreaks as well as doing basic research on bacteria. We are finding that it's going to become untenable in the future to assign a specific tax identifier to each genome initiative. We are not sure exactly when we are going to [make these changes] but probably sometime in the future.
We have also recently started a new database called BioSamples. Right now it's very sparsely populated and it's still in the development phase, but this database will be the place where you will deposit information about the individual strains within an initiative. Let’s say you have an initiative where you are looking at 400 Staphylococcus aureus strains. Each sample will be assigned a sample identifier and these will be clustered within a BioProject but they won't each receive a separate taxonomic ID.
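To make the clustering Mizrachi describes concrete, the sketch below registers 400 strains as samples under one project that carries a single species-level taxonomy ID. Everything here is an assumption made for illustration: the class names, the identifier format, and the placeholder taxonomy ID are hypothetical and are not how BioSamples assigns identifiers.

```python
from dataclasses import dataclass, field
from typing import List

PLACEHOLDER_TAX_ID = 999999  # illustrative value only, not a real taxonomy ID


@dataclass
class Sample:
    """Hypothetical sample record: one per sequenced strain."""
    sample_id: str
    strain: str


@dataclass
class StrainProject:
    """Hypothetical project that clusters samples under one species-level tax ID."""
    title: str
    species_tax_id: int
    samples: List[Sample] = field(default_factory=list)

    def add_strain(self, strain: str) -> Sample:
        """Register a strain and give it a sample identifier, not a new tax ID."""
        sample = Sample(sample_id=f"SAMPLE{len(self.samples) + 1:04d}", strain=strain)
        self.samples.append(sample)
        return sample


project = StrainProject("Staphylococcus aureus strain survey", PLACEHOLDER_TAX_ID)
for n in range(1, 401):
    project.add_strain(f"strain-{n}")

print(len(project.samples), project.samples[-1].sample_id)  # 400 SAMPLE0400
```

In this arrangement the number of taxonomy identifiers stays fixed no matter how many strains are added; only the list of sample identifiers grows.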