Researchers at the Medical College of Wisconsin have developed new web-based software that helps identify de novo peptide sequences based on a novel principle: the peptides' amino acid composition, rather than their amino acid sequence.
The software, called DeNovoID, is meant to be used as a last step, in conjunction with other types of de novo software, said Brian Halligan, a scientific member of Medical College of Wisconsin's Center of Biotechnology and Bioengineering, who led the development of the software.
DeNovoID "doesn't replace any of those commercial [de novo identification] software, such as DeNovoX, PEAKS, or Gutentach. It's more of a complement to those," said Halligan. "We take the results they get, and have an alternative way to search it."
Commercial de novo identification software often produces short, degenerate peptide sequences whose amino acids have not been completely ordered, Halligan said. DeNovoID takes those short, degenerate sequences and maps them to a database entry, he explained.
Researchers have tried to map the degenerate sequences using other search programs such as BLAST or FASTA, but those conventional programs don't work well with short or degenerate sequences, Halligan pointed out.
"Our program is very different from BLAST or FASTA. It's based on composition to start with," Halligan said.
DeNovoID is available free of charge from the website http://proteomics.mcw.edu/denovoid. Halligan said he intends to keep the software free. However, if someone is intending to use DeNovoID intensively, they might want to arrange to use a locally hosted copy, instead of the web version, he said.
"It's a good thing to see publicly available software for de novo sequencing — we need it," said David Tabb, a postdoctoral fellow at Oak Ridge National Laboratory who has spent some time writing de novo sequencing software himself.
Tabb added that having full-length de novo sequence inference tools is a necessity if scientists are to progress beyond what databases can tell them. "Sequence inference is the direction that bottom-up informatics must go," said Tabb. "Otherwise, we are limited to what databases can tell us. I know that the field has relied on database searching for the last decade, but to go deeper, we must have sequence tag and de novo algorithms."
Whereas BLAST and FASTA do paired comparisons of sequences, DeNovoID converts a peptide's composition into a 20-dimensional mathematical vector.
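As a rough illustration of what a 20-dimensional composition vector looks like, the following Python sketch counts each of the 20 standard amino acids in a peptide. The function name, the residue ordering, and the example peptides are assumptions made for illustration; the article does not describe how DeNovoID builds or scales its vectors.

```python
from collections import Counter

# The 20 standard amino acids in a fixed one-letter order (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(peptide: str) -> list[int]:
    """Count each standard residue; the order of residues in the peptide is ignored."""
    counts = Counter(peptide.upper())
    unknown = set(counts) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard residue(s): {sorted(unknown)}")
    return [counts[aa] for aa in AMINO_ACIDS]


# Two hypothetical peptides with the same residues in a different order
# produce the same vector, which is why ordering errors do not matter here.
v1 = composition_vector("PEPTIDE")
v2 = composition_vector("EDITPEP")
print(v1 == v2)
```

Because only counts are stored, a partially ordered sequence tag carries the same information as a fully ordered one in this representation.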
DeNovoID's approach is better for dealing with datasets that are error prone, Halligan pointed out.
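One way to see why a composition-based search tolerates errors: a single wrong residue moves the composition vector only a short distance, so the true match still ranks near the top. The sketch below ranks a small, made-up list of peptides by Euclidean distance to a query. The distance metric, the linear scan, and every peptide string are assumptions for illustration; the article does not specify DeNovoID's metric, and a real geometric index would avoid scanning every entry.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, fixed order


def composition_vector(peptide):
    """Return per-residue counts in AMINO_ACIDS order."""
    counts = Counter(peptide)
    return [counts[aa] for aa in AMINO_ACIDS]


def rank_by_composition(query, candidates):
    """Return (distance, candidate) pairs sorted with the closest composition first."""
    q = composition_vector(query)
    scored = [(math.dist(q, composition_vector(c)), c) for c in candidates]
    return sorted(scored)


# Hypothetical database entries (not real protein data).
database = ["SAMPLER", "TWINKLE", "GRADIENT", "WALKER"]

# A scrambled query with one wrong residue (K instead of R).
query = "MASPLEK"
ranked = rank_by_composition(query, database)
for distance, peptide in ranked:
    print(f"{peptide:10s} {distance:.2f}")
```

The scrambled, slightly wrong query still lands closest to the entry it was derived from, and the sorted list plays the role of a hierarchy of matches.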
"With BLAST or FASTA, it would be like looking for 'Smith' in the phone book by starting from the first page and comparing letter by letter," he explained. "What we do is geometric indexing, so even if you have 'emith', in 20-D space, that's pretty close to 'Smith'. The program creates a hierarchy" of the matches, based on which is the closest match.
The development of DeNovoID, which was first described in print in the July 1 issue of Nucleic Acids Research, began shortly after Halligan and his research group developed PepID, a proof-of-principle program that identifies peptides based on amino acid composition.
A key concept in developing PepID, which was released last year and described in the July-August 2004 issue of the Journal of Proteome Research, is that once a peptide reaches 8 amino acids in length, its amino acid composition is as good as the full peptide sequence for identifying which protein it comes from.
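The claim that composition becomes as informative as sequence for longer peptides can be explored with a simple collision count: group every length-k fragment by its composition and see how many compositions are shared by more than one distinct sequence. The sketch below is a minimal version of that idea using a sliding window over a made-up sequence; the function names and the toy input are assumptions, and a realistic check would use enzymatic digestion of an actual proteome rather than all k-mers.

```python
from collections import defaultdict


def composition_key(peptide):
    """Order-independent key: the peptide's residues in sorted order."""
    return "".join(sorted(peptide))


def ambiguous_compositions(proteins, k):
    """Map composition -> distinct length-k sequences sharing it, keeping only collisions."""
    seen = defaultdict(set)
    for seq in proteins.values():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            seen[composition_key(kmer)].add(kmer)
    return {key: kmers for key, kmers in seen.items() if len(kmers) > 1}


# Hypothetical toy sequence, chosen only to make the collisions easy to follow.
toy = {"toy_protein": "ACDCAG"}

short = ambiguous_compositions(toy, 2)   # short fragments collide easily
longer = ambiguous_compositions(toy, 5)  # longer fragments collide less often
print(len(short), len(longer))
```

Run against a real sequence database, the number of colliding compositions at each length would show where composition alone stops being ambiguous.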
"Once you have a peptide of a reasonable length, knowing the sequence doesn't add to the ability to map it back [to a protein]," said Halligan. "So our hope was to skip the order [of amino acids] part entirely. There are very, very few peptides, even in the human proteome, that are about 10 amino acids in length, that have the same composition but different sequences. Once you get to that length, composition and sequence are about equally good for protein mapping."
DeNovoID expands on PepID by taking the degenerate peptide sequence tags output by conventional de novo sequencing software and mapping them to a protein.
To speed up identification using DeNovoID, Halligan and his research team are collaborating with Applied Biosystems to see how well the program works when peptides are completely broken up into immonium ions, fragments that each derive from an individual amino acid. The team has also worked with Bruker and Agilent, but the data did not turn out as well using those instruments, Halligan said.
"Most people think they've done something bad when they get immonium ions, and most stock instruments have a deceleration mechanism to prevent getting immonium ions," said Halligan. "We're playing with turning that [deceleration mechanism] off. Once we have the immonium ions, we can do the identification very quickly."
Halligan said his research team is still lining up collaborators that can run samples without the deceleration mechanism, in order to create immonium ions.
Most researchers who use de novo sequencing software don't rely on it as a "first line" software, Halligan noted. Instead, they first use traditional mass spectra identification software such as Sequest or Mascot.
"What happens is, they put in 15,000 spectra in the beginning, and about 300 good protein identifications come out of it. The yield is pretty low," he said. "Then the researcher says, 'When I look at this spectra, it looks like it makes sense' — so [he or she] tries to match up that spectra result using a de novo sequencing software. The de novo is used to fill in."
Typically, de novo software is not used in a high-throughput manner, but is used instead to identify unmatched sequences that are of special interest, possibly because of their relative quantitation, Halligan said.
An inherent limitation of all bottom-up proteomic approaches is that peptides cannot always be unambiguously mapped back to a protein, because the protein may have splice variants, and multiple copies of a peptide may exist in a database, Halligan noted.
"Even if you know the peptide's sequence, there's a limited resolution to which you can determine the protein it comes from," he said.
— Tien-Shun Lee