It’s a familiar scenario: sequence data distributed across a score of public and proprietary sources, debate about the comprehensiveness of those sources and the degree of overlap between them, questions about the best way to integrate the information. Bioinformatics, right? Think again. The very same challenges that bioinformatics developers have grappled with for the past decade or so are now confronting a new user community: patent analysts.
The explosion of sequence-based patents in recent years is forcing biotech and pharma information professionals to master the nuances of sequence searching. Much like bioinformaticists, patent analysts must glean sequence data from public and private nucleotide and protein repositories, international patent databases, the scientific literature, and specialty value-added patent sequence databases such as Derwent’s GeneSeq or the Chemical Abstracts Service’s Registry file. But the average information professional lacks the computer science or biology background of a bioinformaticist, making the task that much more difficult.
“You have an interesting combination of people looking at this issue,” said Amy Dasch, a senior patent analyst at Genzyme. “There are people that are coming out of the bioinformatics world that are trying to apply that knowledge to patents, and there are people like myself, who have been focused on patents for a long time, who have had to come kicking and screaming into the world of bioinformatics.”
Much of the kicking and screaming may be due to the barriers that still exist in the field. Derwent and the CAS Registry, for example, only recently added Blast search engines to their databases. Before last year, a patent sequence search through the CAS Registry was only possible with some “very crude tools that were online based,” said Tony Trippe, senior staff investigator, intellectual property, at Vertex Pharmaceuticals. Furthermore, Trippe said, because the CAS service charges by the hour, “you had to use their system, and it was running on their machines, so you were paying online fees the whole time.” Heahyun Yoo, a patent analyst at Bristol-Myers Squibb, noted that her experiences with the CAS Registry’s bioinformatics capabilities have also been less than favorable. “They only let you do the search with only 200 base pairs or 200 amino acids,” she said, “so you had to do it sequentially and add them together.”
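The workaround Yoo describes, breaking a long query into pieces of no more than 200 residues and then stitching the results back together, can be sketched roughly as follows. This is only an illustration in Python, not the CAS procedure: the function names `split_query` and `to_query_coords` are hypothetical, and the actual search step is left out entirely.

```python
def split_query(sequence, max_len=200):
    """Split a query into consecutive segments of at most max_len residues.

    Returns (offset, segment) pairs. The offset is the 0-based start of the
    segment in the original query, so per-segment hits can be mapped back.
    The 200-residue default mirrors the limit quoted in the article.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [
        (start, sequence[start:start + max_len])
        for start in range(0, len(sequence), max_len)
    ]


def to_query_coords(offset, hit_start, hit_end):
    """Translate segment-relative hit coordinates to full-query coordinates."""
    return offset + hit_start, offset + hit_end


# A made-up 500-base query yields three segments starting at 0, 200, and 400.
segments = split_query("ACGT" * 125)
offsets = [offset for offset, _ in segments]
```

One limitation of this simple approach is that consecutive, non-overlapping segments can miss a match that straddles a segment boundary, which is part of why searching in pieces is cumbersome.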
For those who opt for public domain resources, benefits include the unbeatable low, low price of free, as well as the freedom to apply a variety of search and analysis tools to the data. The drawback, however, is the time required to aggregate and sift through the vast number of resources. In addition, while GenBank, EMBL, and DDBJ post patent information along with their sequence data, this information does not appear there as quickly as it does in other resources (see p. 6 for some pros and cons of available patent sequence options).
Getting What You Pay For
CAS provides its patent sequence data through STN International, a partnership with Fachinformationszentrum (FIZ) Karlsruhe in Germany and the Japan Science and Technology Corporation. It is making a concerted effort to improve its bioinformatics resources and win new subscribers from the growing ranks of patent analysts searching for sequence-related information. STN introduced an enhanced version of the CAS Registry Blast software in October and, in a further bid to bring its user base of IP professionals up to speed on biology, even posted a mini-tutorial on its website called “Bluff your Way in Genetics!”
Robert Austin, a US regional sales manager for FIZ Karlsruhe, spoke at a Patent Information Users’ Group meeting in October about the discrepancies between patent sequence data available in the CAS Registry, Derwent’s GeneSeq, and public nucleic acid databases. As if to prove the parallel between the worlds of patent data and bioinformatics, Austin noted that his presentation was spurred by a common question: “Why should I use your expensive database when there’s so much information out there for free?”
In STN’s case, Austin argued that the resource is far more comprehensive and up to date than other sources. For one thing, STN includes patent sequence information not only from Registry but also from GeneSeq (which it calls DGene), GenBank, and 38 international patent authorities. As of September 2002, Austin reported, DGene provided 900,000 peptide sequences and 2.1 million nucleotide sequences from patents, compared with 890,000 and 2.2 million for Registry, 167,000 and 800,000 for NCBI, and 338,000 and 800,000 for EMBL.
Most patent searchers tend to agree that STN is the closest thing to a one-stop shop for patent sequence information, but cite its expense and the difficulty of its online search and analysis tools as disadvantages. One serious drawback is that STN’s licensing model does not allow users to install the database in house, leaving them with Blast as their only search or analysis option.
Derwent, on the other hand, allows subscribers to do whatever they want with the database. “The GeneSeq data can be accessed or analyzed using many different sequence analysis programs,” said Yoo. “I could spend as much time as I want in sequence analysis with Derwent, but I’m not sure what people do who have to access the databases through a commercial interface like STN…It’s extremely expensive.”
GeneSeq has proven to be a popular option for companies with strong bioinformatics teams who can integrate the database into their internal informatics infrastructure. Yoo noted that in her case, there’s generally no need to pay the extra money for STN “because we have an excellent bioinformatics group in Bristol-Myers, and they accrue all the databases in house, including GeneSeq and GenBank.” However, she added, there are some situations in which access to the additional resource is worth the expense.
The choice of which databases to use “all comes down to the money risk,” said Dasch. “In the corporate world, there’s a certain acceptance that the cost of doing business includes the cost of doing diligence. But if the issue at hand is small, the public databases offer great resources.”
“Depending on how you intend to use the information, or the point you are at in your research effort, or who the client is … lots of different factors go into making the decision on how much time you’re going to spend and how many databases you’re going to look at,” said Trippe.
Even Austin recommended that STN be used in combination with public sources, because indexing policies differ from one organization to the next and differences in when patent information is posted can leave resources temporarily out of step with one another. The bottom line is that regardless of what their first or second choice for patent sequence searching may be, information professionals can’t rely on any single resource to provide them with the data they need to do their jobs.
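The kind of cross-checking this implies, running the same query against several resources and then seeing which hits each one uniquely contributes, might look something like the sketch below. It is an illustration only: the record layout (a patent number paired with a sequence), the source labels, and the function names `merge_hits` and `found_only_in` are all assumptions, and none of them reflects an actual STN, GeneSeq, or GenBank export format.

```python
from collections import defaultdict


def merge_hits(hits_by_source):
    """Combine per-source hit lists into one table.

    hits_by_source maps a source label to a list of (patent_number, sequence)
    pairs (a hypothetical layout). Keys are normalized so that spacing and
    letter case do not hide duplicates. Returns {key: set of source labels}.
    """
    merged = defaultdict(set)
    for source, hits in hits_by_source.items():
        for patent_number, sequence in hits:
            key = (patent_number.replace(" ", "").upper(), sequence.upper())
            merged[key].add(source)
    return dict(merged)


def found_only_in(merged, source):
    """List the hits that appear in exactly one named source."""
    return sorted(key for key, sources in merged.items() if sources == {source})


# Made-up records for illustration; these are not real patents or sequences.
example = {
    "commercial": [("XX 0000001", "acgtacgt"), ("XX 0000002", "ttgacc")],
    "public": [("XX0000001", "ACGTACGT")],
}
merged = merge_hits(example)
unique_to_commercial = found_only_in(merged, "commercial")
```

Exact matching on a normalized patent number and sequence is the simplest possible rule. Because indexing policies differ between organizations, a real comparison would likely need looser matching than this.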
Luckily for many patent searchers, the groundwork for assembling distributed sequence datasets and mining them effectively has already been laid by the bioinformatics community. “You’re looking at combining your searches to include the public patent databases...NCBI, EBI, so really you have a bioinformatics problem and a set of skills associated with that that many companies have mastered and others either haven’t or rely on a third-party service provider,” observed Ron Ranauro, executive vice president of worldwide business development at Gene-IT.
Seeing an opportunity for such a third-party service provider in the patent analysis market, Gene-IT recently launched a patent search consulting practice based around its GenePast search algorithm and bioinformatics expertise. “We’re trying to educate the marketplace… that prior art searching depends on a complete set of data in the same way that discovery would,” noted Ranauro.
The company may have found a comfortable niche, as several biotech and pharmaceutical firms have already found their in-house bioinformatics expertise to be a valuable IP resource. Noted Genzyme’s Dasch, “I’m utterly dependent on a bioinformatics expert who can help me understand the complexities of matching.” While patent searchers may pick up the mechanics of effective sequence searching easily enough, “we also need to be able to evaluate the results and QC what we’re doing. That’s still a big gap in our skill set.”
BMS’s Yoo agreed that her bioinformatics colleagues have become an indispensable component of the patent analysis process: “Our legal group is really behind the bioinformatics group to make sure that all the resources are available, and they really listen to users. Whenever I have a problem, all I have to do is ask them and their response is incredible.”
Of course, added Trippe, bioinformatics researchers could stand to learn a thing or two from the patent analysis side as well. “It’s been my experience that when the two groups sit down and talk to each other, the information professionals learn more about researchers’ needs and what they do and the tools they use, and the researchers learn more about the hidden gems that are out there that could provide valuable insight that they’re not currently aware of.”