
As NCBI Prepares to Retire LocusLink, Some Users Struggle with Transition to Entrez Gene


With a month remaining before the National Center for Biotechnology Information pulls the plug on its popular LocusLink resource on March 1, early feedback on the system’s replacement has been mixed.

NCBI is transitioning from LocusLink to Entrez Gene in order to integrate the resource more tightly with its other databases via the Entrez interface, and to provide access to information on many more organisms. LocusLink has historically focused on a dozen or so eukaryotes, while Entrez Gene has information on hundreds of organisms, and already includes more than 1.3 million records.

But these improvements are lost on some early users of the new offering, according to NCBI’s David Wheeler, who told BioInform that the center is “getting quite a bit of feedback” about the change — and not all of it is positive.

“Generally, [there are] those people who have learned to use LocusLink really well and are having trouble with Gene simply because it’s a little bit different. And it’s not so much that there is a deficiency in Gene, but that they’ve already learned one way and now they’re having to adjust their strategies a little bit,” he said.

Most of the comments, he said, “have been mainly of the type that, ‘This is different, I liked the old way, and I don’t see any reason to change it.’” In response, he said, “We’re going to try to make the new way as much like the old way as possible.”

One common complaint from end-users involves LocusLink’s use of colored boxes to highlight other NCBI resources that contain information related to a particular record. This feature, which provided a quick glimpse of supporting data, is missing from Entrez Gene, which instead uses a JavaScript menu that lists the links from the record and requires an extra mouse click.

No information has been lost, Wheeler said, but users seem to prefer the at-a-glance format, so “we’re intending to make the actual format of the gene report more like LocusLink, because we’ve had people write in and say they liked [how] LocusLink breaks the data up visually into the logical sections that they want to focus on, whereas Entrez Gene is kind of text-dense by comparison.”

But these types of interface issues are minor compared to some other user concerns. Developers who rely on bulk downloads of LocusLink data via ftp files on NCBI’s website may face more far-reaching consequences.

Many of the most popular of these files, such as the so-called LL_tmpl file, which contains functional annotation tables, are in a tag-value format that is easily parsed using Perl or Java. These ftp files will no longer be updated, and will be replaced with new files that are in ASN.1 format, Wheeler said.
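To illustrate why tag-value files are considered easy to parse, here is a minimal Python sketch of the kind of script that would need rewriting. The layout it assumes (a `>>` line opening each record, followed by `TAG: value` lines in which a tag may repeat) and the sample tags and values are illustrative assumptions, not a specification of the actual LL_tmpl file.

```python
from typing import Dict, List


def parse_tag_value(text: str, record_marker: str = ">>") -> List[Dict[str, List[str]]]:
    """Parse tag-value text into one dict per record.

    Assumed layout (hypothetical, for illustration only):
      - a line starting with `record_marker` opens a new record
      - every other non-blank line is "TAG: value"
      - a tag may repeat, so each tag maps to a list of values
    """
    records: List[Dict[str, List[str]]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(record_marker):
            current = {}
            records.append(current)
            continue
        if current is None or ":" not in line:
            # Skip lines outside a record or without a tag separator.
            continue
        tag, value = line.split(":", 1)
        current.setdefault(tag.strip(), []).append(value.strip())
    return records


# Made-up sample data in the assumed layout.
SAMPLE = """\
>>1
LOCUSID: 1
OFFICIAL_SYMBOL: A1BG
NOTE: first illustrative note
NOTE: second illustrative note
>>2
LOCUSID: 2
OFFICIAL_SYMBOL: A2M
"""

records = parse_tag_value(SAMPLE)
symbols = [r["OFFICIAL_SYMBOL"][0] for r in records]
```

A parser this short is the reason the format is popular in pipelines; the same logic does not carry over to ASN.1, whose nested structure calls for a real grammar or a dedicated toolkit.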

Bernice Packer, bioinformatics manager for the National Cancer Institute’s Core Genotyping Facility, said that her lab — like many others — relies on the LocusLink ftp files as part of a “complicated bioinformatics pipeline.” The switch to the new set of ftp files raises several concerns for such groups, she said.

First of all, she noted, “Not all the information that we get out of LocusLink will be immediately available” through the Entrez Gene ftp files. Only certain records in LL_tmpl, for example, will be included in the Entrez Gene equivalent, called GeneRIF. Packer cited Gene Ontology data as a particularly egregious omission.

In addition, the new format will require a rewrite of all scripts that currently parse the tag-value format. Packer estimated that this task would take five person-days in her lab. Nevertheless, she said, she considers herself fortunate because her group has remained intact since it first developed the bioinformatics pipeline several years ago. In many bioinformatics departments, she noted, the original developers of these pipelines are long gone, and it may be impossible to hunt down the bits of code that need to be rewritten. In the worst-case scenario, she said, “things are going to break.” More likely, she said, programs that aren’t rewritten will simply keep reading the same file on the LocusLink ftp site, which won’t be updated after March 1.

NCBI has been using ASN.1 for years, but the format “is less well known in the broad bioinformatics community” than other formats, like XML or tab-separated files, said Peter Robinson of the Institute of Medical Genetics at Germany’s Humboldt University. “There’s an NCBI toolkit for reading ASN.1, but I don’t believe that this is very much used in the community,” he said. “It’s not something that people are just going to pick up and learn in an afternoon like you can with BioPerl.”

Robinson said that he’s writing an ASN.1 parser for Entrez Gene using a Java tool called ANTLR (Another Tool for Language Recognition), and he plans on writing a similar tool for BioPerl. It would be easier for many users, he noted, if NCBI would make XML files available for ftp download.

Wheeler said that NCBI is developing a tool to convert the ASN.1 ftp files to XML. In the meantime, he said, NCBI is directing users to its E-Utilities (Entrez Programming Utilities) suite of programming tools (available at http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html).

“You can just make a series of E-Utility calls to pull down these gene records in XML format,” Wheeler said, noting that in an “experiment” he conducted at home, he was able to download more than 160,000 mammalian Entrez Gene records using the E-Utilities in about six hours. “You could download all of them like that in an afternoon on a modest sort of connection,” he said.
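A series of calls like the one Wheeler describes might be sketched as follows in Python. The sketch only builds the request URLs; it assumes the EFetch endpoint at its present-day address and the `db`, `id`, and `retmode` parameters, while the batch size of 200 and the helper names are arbitrary choices made for illustration.

```python
from typing import Iterator, List, Sequence
from urllib.parse import urlencode

# Assumed endpoint: the current EFetch address, not necessarily the one in use
# when the article was written.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def batches(ids: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of `ids`, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def efetch_urls(gene_ids: Sequence[int], batch_size: int = 200) -> List[str]:
    """Build one EFetch URL per batch, requesting Gene records as XML."""
    urls = []
    for chunk in batches(gene_ids, batch_size):
        query = urlencode({
            "db": "gene",                             # Entrez Gene database
            "id": ",".join(str(i) for i in chunk),    # comma-separated IDs
            "retmode": "xml",                         # XML rather than ASN.1
        })
        urls.append(f"{EFETCH_URL}?{query}")
    return urls


# Hypothetical ID list; a real run would first collect IDs, for example via ESearch.
example_urls = efetch_urls([1, 2, 3, 4, 5], batch_size=2)
```

Each URL could then be fetched with `urllib.request.urlopen`, pausing between requests in line with NCBI's usage guidelines. For scale, 160,000 records in six hours works out to roughly seven records per second, which batching makes achievable with relatively few requests.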

Wheeler had some additional advice as the March 1 deadline approaches. Links that currently go to LocusLink will be automatically redirected to Gene, he said, “but I would advise people to fix their links so that they go to Gene rather than LocusLink.”
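For sites with many hard-coded links, the fix Wheeler recommends could be scripted. The Python sketch below rewrites LocusLink report URLs into Entrez Gene URLs; both URL patterns are assumptions made for illustration and should be checked against NCBI's transition page, as should the assumption that a LocusID carries over unchanged as the Gene ID.

```python
import re

# Assumed LocusLink report URL pattern (hypothetical; verify before use).
LOCUSLINK_RE = re.compile(
    r"https?://www\.ncbi\.nlm\.nih\.gov/LocusLink/LocRpt\.cgi\?l=(\d+)"
)

# Assumed Entrez Gene URL pattern (hypothetical; verify before use).
GENE_TEMPLATE = (
    "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi"
    "?db=gene&cmd=Retrieve&list_uids={gene_id}"
)


def rewrite_links(text: str) -> str:
    """Replace every matching LocusLink URL with its assumed Gene equivalent."""
    return LOCUSLINK_RE.sub(
        lambda m: GENE_TEMPLATE.format(gene_id=m.group(1)), text
    )


page = 'See <a href="http://www.ncbi.nlm.nih.gov/LocusLink/LocRpt.cgi?l=672">this record</a>.'
updated = rewrite_links(page)
```

Text that contains no matching URL passes through unchanged, so the function can be run over whole files without first checking whether they contain LocusLink links.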

In addition, he said, end-users should begin familiarizing themselves with the Entrez interface, and give Entrez Gene “a good initial going over” sooner rather than later. “I think most people are putting this off until the last possible moment,” he said.

For developers, he said, it’s likely that the E-Utilities will offer an alternative to rewriting lines and lines of code, but many people in the community are either not aware of these tools, or are reluctant to use them. Robinson, for example, said that he didn’t know about the E-Utilities, but noted that “I often find that it’s easier to write my own program than to learn somebody else’s.”

Wheeler said that NCBI intends to add examples of how to use the E-Utilities on the Entrez Gene website. “The Entrez system has a lot of flexibility built into it, both in terms of what you can view and what you can download,” Wheeler said, “but it sometimes takes a little bit of exploration for people to become aware of it.”

NCBI maintains a web page with information about the transition that it updates regularly at http://www.ncbi.nlm.nih.gov/entrez/query/static/help/LL2G.html, and a FAQ on Entrez Gene is available at http://www.ncbi.nlm.nih.gov/entrez/query/static/help/genefaq.html.

— BT
