During the 30 years that Memorial Sloan-Kettering scientist Phuong-Van Luc investigated transcriptional regulation, no database existed for quickly finding out if a protein was part of a complex. As a result, Luc was left with few clues to help understand the function of a newly identified protein.
Luc tried to keep on top of the large number of proteins involved in transcriptional machinery by reading a lot of papers, but she found it hard to maintain a broader view of her bench work when she was spending long hours doing experiments.
To help researchers like herself, four years ago Luc decided to take a break from the bench to develop a nuclear protein complex database. The first version of the database is currently available over the web at http://pin.mskcc.org/home.html. A second version of the database, which includes a protein complex visualization tool, is available only to MSKCC scientists at present, but will soon become generally accessible, Luc said.
"I wanted to develop a database and tools to allow scientists who work hard at the bench to have a quick way of analyzing [their] protein and its function," said Luc, who is a research associate in Paul Tempst's laboratory at the Memorial Sloan-Kettering Cancer Center. "If there was a new protein identified, I wanted to be able to quickly go to that database and find out if it's part of a new complex, an existing complex, or if it's a new protein."
Working largely by herself over the past four years, Luc began methodically curating scientific papers that dealt with nuclear proteins, and developing software that would allow scientists to query what complex or complexes a protein belongs to.
"What we collect are proteins that are in the same complex in the nucleus," Luc explained. "Our database would be considered a 'gold standard' one: it's curated, and it contains only things that are verified."
Luc's database, called the Proteins Interacting in the Nucleus database, or PINdb, is different from large databases such as BIND and MIPS, in that it doesn't include protein complexes from high-throughput experiments, and it doesn't include protein-protein interaction data from yeast two-hybrid analyses. With BIND, a long list of hundreds of complexes may be identified when a researcher types in a protein, but because the database includes high-throughput experiments, it is hard to tell which complexes are verified, and which are not, Luc said.
The PINdb may eventually link to databases such as MIPS that catalogue protein-protein interactions, but right now it catalogues proteins by complexes, rather than by direct interactions, she added. "We don't try to replicate what other people have done," Luc explained. "There's no other database that catalogues only complexes."
A first version of the PINdb was made publicly available in late 2003, after Luc presented her work at a transcriptional regulation meeting at Cold Spring Harbor Lab. About half a year later, a paper describing the database was published in the April 2004 issue of Bioinformatics.
Luc is now finishing up a second version of the PINdb that includes a visualization tool that allows users to view a "spider web" diagram of nuclear complexes. The web diagram includes different shapes for different complexes, and lines linking complexes that share overlapping proteins.
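The linking rule behind such a diagram can be sketched in a few lines of Python. This is only an illustration of the idea of connecting complexes that share proteins, not PINdb's actual implementation; the complex and protein names below are hypothetical placeholders, and the function name `shared_protein_edges` is an assumption made for this sketch.

```python
from itertools import combinations

# Hypothetical complex memberships; placeholder names, not PINdb records.
complexes = {
    "ComplexA": {"P1", "P2", "P3"},
    "ComplexB": {"P3", "P4"},
    "ComplexC": {"P5", "P6"},
}


def shared_protein_edges(memberships):
    """Map each pair of complexes with overlapping subunits to the shared proteins."""
    edges = {}
    # Compare every unordered pair of complexes exactly once.
    for a, b in combinations(sorted(memberships), 2):
        shared = memberships[a] & memberships[b]
        if shared:  # draw a line only when at least one protein is shared
            edges[(a, b)] = shared
    return edges


edges = shared_protein_edges(complexes)
for (a, b), shared in edges.items():
    print(f"{a} -- {b}: shares {', '.join(sorted(shared))}")
```

In this toy data set only ComplexA and ComplexB are linked, because they are the only pair with a protein in common.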
The second version of the PINdb is currently available within Memorial Sloan-Kettering. According to Luc, it should also be available to the general public through the web after a few data security issues are resolved.
"The nice thing about this database is that it allows people to compare complexes side by side, to light up a subunit, and to determine if the root complexes are wrong or right," said Luc.
Another useful function of the PINdb is that it consolidates all the names that a protein may be known by. "A yeast protein can be known by so many different names, even people in the field would get confused," Luc noted. "One lab might call it by the name it was discovered by; others would call it by virtue of its homology to human or the fly."
To ease the confusion over names, users who type in a protein name get a list from the PINdb of all the different names that the protein is known by, including the protein's official name, which is determined by the HUGO Gene Nomenclature Committee. "The names are the first thing I need to be able to light up the subunit in all the complexes," Luc pointed out.
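The name consolidation described above amounts to an alias index that resolves any known name to one official name before complexes are searched. The Python sketch below shows one way this could work; it is not PINdb's code. The alias table, the complex memberships, the case-insensitive matching, and the function names `build_name_index` and `complexes_containing` are all assumptions made for illustration.

```python
# Hypothetical alias table: official name -> other names in use. Placeholders only.
aliases = {
    "GENE1": {"yfg1", "hGene1"},
    "GENE2": {"abc2"},
}

# Hypothetical complex memberships, keyed by official names.
complexes = {
    "ComplexA": {"GENE1", "GENE2"},
    "ComplexB": {"GENE1"},
    "ComplexC": {"GENE2"},
}


def build_name_index(alias_table):
    """Map every known name (lower-cased) to its official name."""
    index = {}
    for official, names in alias_table.items():
        for name in names | {official}:
            index[name.lower()] = official
    return index


def complexes_containing(name, index, memberships):
    """Resolve a name, then list every complex containing that protein."""
    official = index.get(name.lower())
    if official is None:
        return None, []  # unknown name: nothing to highlight
    hits = sorted(c for c, members in memberships.items() if official in members)
    return official, hits


index = build_name_index(aliases)
official, hits = complexes_containing("YFG1", index, complexes)
print(official, hits)
```

A real alias table would also need a policy for an alias shared by two different proteins; this sketch simply lets the last entry win.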
Tempst, who heads the laboratory at Memorial Sloan-Kettering where Luc works as a research associate, said that the PINdb has allowed his research group to organize and interpret proteomic analysis of compartments of the human and yeast transcriptional machinery.
"To my knowledge, [the database] is totally unique in the field of transcription," he said. "The only other way that some people in the field stay on top of this immensely complicated system is by reading a lot and having an encyclopedic memory."
Though Luc doesn't spend much time doing bench experiments these days, she is collaborating with members of her lab who do. Currently, she is working on a paper that shows that she can predict how a protein network works by using her database to do functional analysis. She is aiming to publish that paper in the Jan. 2006 special database issue of Nucleic Acids Research.
Chris Hogue, the principal investigator of the Blueprint Initiative, which curates and maintains the BIND database, said that it is important that research groups like Tempst's maintain their own small databases to further interest in and understanding of protein complexes.
"Without specific databases, it would clearly be far too complex for researchers like Phuong-Van Luc to make new discoveries in their field of nuclear protein complexes," Hogue said. "I encourage all researchers to post and share their own 'gold standard' data sets."
Hogue said that Blueprint archives special collections like Luc's in BIND, and that he would be happy to work with any researchers to see that their 'gold standard' sets are supported in BIND, with proper attributions.