Not long ago I found some recombinant DNA vector sequences as contaminants in GenBank entries that had been submitted relatively recently. Given that this was a major issue when I was a graduate student and had been the subject of much focus, it was surprising to see it crop up again. I don't wish to overstate the level of contamination I found; I was simply not expecting to see any.
How does this happen? One possibility is that the old lessons are being ignored, but it is more likely that someone just forgot. In one case, I found that a eukaryotic genome project had a pUC-type vector sequence in every chromosome. One plausible explanation is that someone accidentally turned off the vector check.
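For readers who have never run one, the idea behind a vector check can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not how any genome project or database actually screens submissions: real screens typically rely on alignment against a curated vector database, whereas this sketch looks only for exact shared k-mers. The function names, the default k and the "vector" string are all made up for the example; the string is not real pUC19 sequence.

```python
from typing import List, Set, Tuple


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """Collect every k-length substring of seq, upper-cased."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def vector_hits(query: str, vector: str, k: int = 12) -> List[Tuple[int, int]]:
    """Return merged [start, end) intervals of query that are covered by
    exact k-mers shared with the vector on either strand."""
    index = kmers(vector, k) | kmers(reverse_complement(vector), k)
    q = query.upper()
    hits: List[Tuple[int, int]] = []
    for i in range(len(q) - k + 1):
        if q[i:i + k] in index:
            if hits and i <= hits[-1][1]:
                # Overlaps or abuts the previous hit: extend it.
                hits[-1] = (hits[-1][0], i + k)
            else:
                hits.append((i, i + k))
    return hits


# Hypothetical toy "vector"; NOT a real plasmid sequence.
TOY_VECTOR = "GATTACAGGCTTAACCGGTTAGCATCGA"

# A made-up "genomic" read with 20 bases of the toy vector embedded in it.
contaminated = "T" * 10 + TOY_VECTOR[5:25] + "A" * 10
print(vector_hits(contaminated, TOY_VECTOR, k=8))  # [(10, 30)]
```

An exact k-mer screen like this would miss diverged or heavily rearranged vector fragments, and it says nothing about whether a hit is contamination or a sequence that belongs in the organism, which is the harder question.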
Contamination is mostly a nuisance, as long as you are careful. But contaminants can cause spurious sequence hits, which can lead to spurious conclusions and, worse, nonsensical experiments. As a graduate student, I came across one paper whose authors had sequenced a "human cDNA" from a tissue culture sample, synthesized the peptide "encoded" within and raised antibodies to that peptide. They never figured out that their cDNA was really Mycoplasma ribosomal RNA. Cases like this are rare, but nobody wants to fall into such a trap.
The trickiest contamination is the kind that isn't easy to distinguish from real sequence. It was obvious to me that pUC19-like sequences in a eukaryote didn't belong there. But the whole point of my search was to identify where such sequences occur naturally. If a pUC-like sequence sneaked into a Gram-negative bacterial report, would anyone note it as wrong? That is the real worry.
However, a bit later I was reminded that there are worse data deposition sins than committing an erroneous sequence to the database. I can defend my analyses against common types of error, but it's rather hard to analyze a sequence that I can't even find.
Undeposited sequence data was once a systematic problem in the scientific community. However, the databases and the journals conspired to require sequence data deposition as a condition for publication, which made an enormous difference. That's not to say it solved the problem. In particular, articles that did not think of themselves as sequencing papers, but nevertheless contained interesting sequencing data, often failed to adhere to this standard. For example, transposon mutagenesis papers would often not deposit the insertion site sequences. While these were tantalizing hints when the target genome was unfinished, they are even more critical for post facto mapping of mutants onto the genome. So scientific data was unnecessarily lost.
During the twilight of my term at Codon, I was plotting out some protein engineering projects that followed some well-trodden paths. I was surprised to discover, however, that depositing the sequence of your engineered protein is not the norm in the field. Even the patent literature in this field has been slow to include the sequences of what was engineered. Most annoyingly, several publications (and patents) used site-directed mutagenesis without including the sequences of the primers. Instead, one was given sequence coordinates, on a sequence that must be inferred by following through three papers' worth of cloning experiments.
I'd love to get on a very high horse about this, but at the same time I've discovered the need for humility. A colleague of mine is about to publish a paper in the field, and he didn't plan on depositing the sequence — and I had reviewed the paper prior to submission! Indeed, I have been involved in the early preparation of two other contributions to the field, and the issue hadn't occurred to me.
That said, sometimes it takes a sinner to fight the sin.
Keith Robison works in cancer drug discovery at Infinity Pharmaceuticals. His blog can be found at omicsomics.blogspot.com.