Bioinformatics Web Servers Continue to Grow, but Face Challenge of URL Decay

Online bioinformatics software tools are proliferating at a steady clip, according to the latest web server issue of Nucleic Acids Research, which includes papers on 94 web-based molecular biology tools, “the overwhelming majority” of them new, according to an editorial accompanying the annual feature.
 
Reflecting heightened interest in network analysis, microarray analysis, gene sequence and protein analysis, and text mining, the 2008 NAR web server issue and the accompanying Bioinformatics Links Directory — an online resource maintained by Michelle Brazas, Francis Ouellette, and colleagues at the Ontario Institute for Cancer Research — show a spike in the number of web servers in these fields (see table below for details).  
 
The Bioinformatics Links Directory now includes more than 1,200 links to web servers out of a total of more than 2,500 bioinformatics-related URLs.
 
Meanwhile, even as the number of web-based bioinformatics tools is on the rise, URL decay — the challenge of broken web links — remains unacceptably high in the bioinformatics field, according to a recent study in Bioinformatics.
 
“It’s a growing problem,” said Jonathan Wren, a researcher at the Oklahoma Medical Research Foundation and author of the study, which extracted 7,462 URLs from Medline abstracts and queried for their availability to find that nearly 20 percent were no longer available — a rate that has remained unchanged since 2003, the first time Wren conducted this exercise.
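 
Wren’s pipeline itself isn’t reproduced here, but the mechanics of such a survey are simple to sketch. The following Python sketch assumes URLs are harvested from abstract text with a crude regular expression and probed with HTTP HEAD requests; both choices are illustrative, not a description of his code:

```python
import re
import urllib.request
import urllib.error

# Crude URL pattern for pulling links out of free text such as a
# Medline abstract (illustrative; a production extractor would be
# more careful about trailing punctuation).
URL_RE = re.compile(r'https?://[^\s)>\]]+')

def extract_urls(text):
    """Return candidate URLs found in a block of text."""
    return URL_RE.findall(text)

def is_alive(url, timeout=10):
    """Return True if the URL answers an HTTP HEAD request without error."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        urllib.request.urlopen(req, timeout=timeout)
        return True
    except (urllib.error.HTTPError, urllib.error.URLError, OSError):
        # HTTPError covers the 404 "nobody is home" case; URLError
        # covers dead hosts and DNS failures.
        return False

def decay_rate(urls):
    """Fraction of a list of URLs that no longer responds."""
    if not urls:
        return 0.0
    return sum(1 for u in urls if not is_alive(u)) / len(urls)
```

Run over a corpus of abstracts, `decay_rate(extract_urls(text))` yields the kind of availability figure Wren reports.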
 
“URLs are like house addresses and the ‘404’ messages are essentially finding that nobody is home when you were told they were,” Wren said.
 
Furthermore, while journals vary in terms of the fraction of published URLs that are broken, “the journals that publish bioinformatics-related papers are on the top of the list in terms of those affected,” he said.
 
According to Wren’s study, 17 percent of the 441 URLs published in 2007 in Bioinformatics were down at the time his study was published, as were 10 percent of the 289 URLs in BMC Bioinformatics, 17 percent of 282 URLs in Nucleic Acids Research, and 29 percent of the 96 URLs in Genome Research.  
 
And online software programs appear to be the biggest victims of URL rot. Wren found that 43 percent of dead URLs in his study linked to computer programs, including web servers, making them the most common form of lost content. This was followed by scholarly content, such as raw data, with 38 percent, and databases, which comprised 19 percent of the defunct URLs.
 
Yet the annual NAR web server issue serves as evidence that the Internet is still the preferred way for many bioinformatics developers to make their software available to the broader research community.
 
“Most researchers now understand that if you want your work used, you've got to put it on a web server,” Gary Benson, associate professor in Boston University’s Departments of Biology and Computer Science and editor of the web server issue of NAR, told BioInform in an e-mail.
 
What’s Hot?
 
The annual web server issue also serves as an indicator of trends in the bioinformatics field. For example, some of the new resources parallel advances in second-generation sequencing and imaging technology that are letting scientists ask “more probing” biological questions on the role of networks and pathways in a given disease, Ouellette and colleagues wrote in an editorial describing the Bioinformatics Links Directory in NAR.
 
The number of resources on microarrays in the Links Directory rose to 101 from 89 last year, and for resources on microbes the number of links shot up to 45 from 38 in 2007.
 
In an e-mail to BioInform, Ouellette wrote that there is both more interest and data in these areas of late and “the tools are getting more sophisticated for these activities as well.”
 
In the case of microarrays, he said, “the tools are getting better, and for microbes, we know so many more, and there is next-generation sequencing, and metagenomes … lots of data!”
 
Proteins, however, make up the lion’s share of online resources, accounting for 850 of the more than 1,200 URLs in the Links Directory.
 


“Each year, the largest single group of websites we publish is for protein structure prediction,” Benson said.
 
This year’s version of the web server issue had a special focus on biological network analysis and text mining. “Establishing these special focus topics helps lead the field to work more on these problems, which I consider important,” said Benson.
 
The rise in text mining is reflected in the Links Directory, which includes 22 such resources, up from only 15 in 2007. “Text mining is primed to explode, in my opinion,” Benson said.
 
He added that the driver for development in this area is the vast amount of literature that scientists search. So far, he said, the research community has been relying on “a primitive sort of concept analysis with the use of keywords, but that is too limited.” 
 
Another driver for the growth in text-mining tools is the rise of open-access publishing, which makes the full text of an academic paper electronically available for free. “That's essential for useful text mining; you can't mine text that's fee-restricted,” Benson said. “A lot of … journals aren't there yet and it's hampering the development of sophisticated text-mining tools.”  
 
Next year’s web server issue will include metagenomics as a special focus, in addition to network analysis and biological text mining, Benson noted in his editorial.
 
The Heartbreak of URL Decay
 
But as online bioinformatics resources continue to grow, Wren said that bioinformatics developers and end-users alike should be aware of the consequences of URL decay. 
 
“The number of URLs being published in journals is increasing exponentially but they are still decaying at the same rate,” said Wren, noting that URLs may break for any number of reasons, such as lack of upkeep.
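 
A toy calculation shows why that combination is troubling: if publication volume grows exponentially while the decay fraction holds steady, the absolute number of lost resources grows exponentially too. The growth figure below is invented for illustration; only the roughly 20 percent loss rate comes from Wren’s findings:

```python
# Toy model: exponential growth in published URLs at a constant
# decay fraction means exponentially more resources lost each year.
published = 1000   # hypothetical URLs published in year 0
growth = 1.3       # hypothetical 30 percent annual growth
loss_rate = 0.20   # roughly the rate Wren reports

for year in range(5):
    lost = published * loss_rate
    print(f"year {year}: {published:7.0f} published, ~{lost:5.0f} eventually dead")
    published *= growth
```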
 
Wren’s interest in this topic was ignited by his own experience with software applications he developed in 2002. The URLs were up and running when he submitted his paper, but upon publication six months later, the URL for his genomic analysis tool, SIGNAL, was down because the server had been reorganized.
 
“If the rate [of decay] keeps up there will be a very consistent loss of these resources,” Wren said. Although he said he doesn’t want to extrapolate into the future, he does believe the problem needs to be addressed. 
 
In contrast to the overall decay rate of 20 percent for most URLs, Wren found that only 5 percent of URLs cited more than once or twice have decayed.
 
“One could argue that maybe it is only the less important stuff that’s going down, [but] I don’t think there is any data to support that,” Wren said, adding that it is more likely that scientists will cite a URL that is up rather than down. “That kind of biases those statistics in that only those that can be cited will be cited,” he said.
 
“Maintained URLs work because people care about their site,” Ouellette said in an e-mail to BioInform. “If they don't care about their site, then maybe our attempts to point to it should fail?”
 
Ouellette acknowledged the URL decay “is an ongoing problem,” but said that he and his colleagues “hope to stay on top of it.” For example, he said, “if a web resource or database is on a student's web site or some other non-preserved location, problems should be expected and addressed.”
 
The effects of URL decay on scientific research can “range from mild inconvenience to preventing study replication and/or loss of important data,” according to Wren. The overall effect can only be estimated because scientists don’t document how their projects were torpedoed by broken URLs, he said. “You usually just have to kind of plod ahead quietly and nobody really knows about your frustration.”
 
Although the sample size in his study was small, with a little over 7,000 URLs, Wren said he thinks the study proves that website content preservation is “a burden” and an “afterthought” for some researchers who post their resources online.
 
Overall, URL decay suffers from what Wren calls “a problem of advertising”: many scientists don’t consider that a cited website might become inaccessible or that its contents might change after they viewed and cited it.
 
“No one really wants their URLs to decay,” he said. “In some ways this is a new substantiation of an old problem of resource preservation.” It is comparable to a scenario in which a lab develops a cell line, but then “drops off the face of the Earth,” taking the cell line with it, he said.
 
“The nature of the problem has changed from physical to electronic,” he said.
 
Ideas to Stop the Rot
 
Wren said that one way to combat the problem of URL decay in the case of resources that have moved to new servers is to use uniform resource identifiers, or URIs, as opposed to URLs.
 
“If the resource has moved and you have a URI resolver that can relocate a unique ID associated with that URI, then you could find that resource even if it moved, and part of the 404 problem would be solved,” said Wren. However, he noted that “if that computer is down or no longer providing the resource, then you'll get a 404 just as if it were a URL.”
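 
Conceptually, such a resolver is a lookup table that maps a stable identifier to the resource’s current home, maintained independently of the citing paper. A minimal Python sketch, with a hypothetical registry and identifier:

```python
# Hypothetical registry mapping persistent identifiers to current
# locations; in practice this lives on a resolver service that
# maintainers update when a resource moves.
REGISTRY = {
    "urn:example:signal-tool": "http://new-host.example.org/signal/",
}

def resolve(uri):
    """Translate a persistent URI into the resource's current URL.

    Returns None when the registry has no entry: if the resource
    itself is gone, resolution fails just as a URL would 404.
    """
    return REGISTRY.get(uri)

location = resolve("urn:example:signal-tool")
if location is None:
    print("404: resource gone despite the stable identifier")
else:
    print("resource moved; fetch it from", location)
```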
 
A few tools are being advanced to prevent URL decay, such as PURL and WebCite, but neither method has caught on widely, Wren said.
 
A Persistent Uniform Resource Locator, or PURL, is an approach developed by the Online Computer Library Center that assigns a unique identifier comprising a URL and a redirect command that links to a central repository site. WebCite, meanwhile, is an online system for archiving URLs that includes the URL and a link to an archived copy of the material.
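 
From the reader’s side, a PURL behaves like any HTTP redirect: the client requests the stable address and the registrar forwards it to wherever the content currently lives. A sketch using Python’s standard library and a placeholder address rather than a real PURL:

```python
import urllib.request

# Placeholder stable address; a real PURL is hosted by a registrar
# such as purl.org and answers with an HTTP redirect.
purl = "http://purl.example.org/net/some-tool"

# urlopen follows 3xx redirects automatically, so the response URL
# is the resource's present location, not the PURL itself.
with urllib.request.urlopen(purl, timeout=10) as resp:
    print("stable address:", purl)
    print("resolves to:   ", resp.geturl())
```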
 
PURLs “can help, but if I understand PURLs, I don't think it is the solution, although I would love to be wrong,” said Ouellette.
 
“If it requires the instigator of the site to redirect his or her URLs, then it will fail,” he said. According to OCLC’s online documentation, creating a PURL is not automated; it must be performed by registered users.
 
In addition, a PURL redirects to a different location on the web and “doesn’t ensure that the reader sees what the author saw when he or she cited the digital object on the web,” according to Gunther Eysenbach, WebCite’s developer.
 
The difference between WebCite and PURL “is that we are creating a physical snapshot of how the URL content looked at a given moment in time,” he said. Eysenbach is a senior scientist at the Centre for Global eHealth at the University Health Network in Toronto and an associate professor in the Department of Health Policy at the University of Toronto.
 
“We receive XML files from journals and our archiving engine goes through the XML file and creates a snapshot of each cited URL,” Eysenbach said. “It assigns a digital identifier, [and] creates a digital fingerprint called a hash code.” The actual document is stored at WebCitation.org, which is collaborating with libraries and the Internet Archive. 
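 
WebCite’s actual engine isn’t described in code here, but the snapshot-plus-fingerprint idea can be sketched with the Python standard library; the function and record layout are illustrative, not WebCite’s implementation:

```python
import hashlib
import time
import urllib.request

def snapshot(url, timeout=10):
    """Capture a URL's current content with a verifiable fingerprint.

    Sketch of the idea: fetch the page, record when it was captured,
    and derive a digest so later readers can verify they are seeing
    the same bytes the citing author saw.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        content = resp.read()
    return {
        "url": url,
        "retrieved": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "sha256": hashlib.sha256(content).hexdigest(),  # the fingerprint
        "content": content,  # a real archive writes this to durable storage
    }
```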
 
However, Wren noted that web servers that must be queried, for example with a gene sequence or microarray experiment identifier, probably cannot be preserved by any existing method.
 
And as far as web server-based software is concerned, researchers would need to download an entire program to preserve it, said Wren. “To just take a snapshot of the webpage that links to it, that says, ‘Here is the documentation, here is the program’ — that would not be sufficient,” he said.
 
Traveling the Semantic Web
 
In their NAR editorial, Ouellette and his colleagues note that while the web has proved an invaluable tool for bioinformatics resources, “with the current pace of data output and the increasing need to synthesize research data from multiple sources, even use of the web to identify, access and extract meaningful information for research purposes is becoming a daunting task.”
 
However, they cite new web technologies such as the semantic web as an “opportunity to automate computers to navigate and integrate all of the biological information stored on the web, and output coalesced information to the researcher for interpretation.”
 
For example, the authors describe how traditional “uncharacterized” links on webpages do not enable a computer to recognize that Blast is related to the T-Coffee tool used for protein multiple sequence alignment. On the semantic web, however, relationships are captured using common URIs that a computer can recognize.
 
“Whenever two subjects, in this case Blast and T-Coffee, refer to identical URIs, in this case capacity for protein sequence alignment, then their topics of discourse are identical and data merging becomes possible,” the authors write. The output of that process is “coalesced information” for the researcher to further interpret.
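 
In practice this means two independently published descriptions can be merged mechanically whenever they name a concept with the identical URI. A sketch using the third-party rdflib library, with an invented namespace and capability URI:

```python
from rdflib import Graph, Namespace

# Invented namespace and capability URI, for illustration only.
EX = Namespace("http://example.org/bioinf#")

# Two tool descriptions published independently of each other.
blast = Graph()
blast.add((EX.Blast, EX.hasCapability, EX.ProteinSequenceAlignment))

tcoffee = Graph()
tcoffee.add((EX.TCoffee, EX.hasCapability, EX.ProteinSequenceAlignment))

# Because both graphs use the identical URI for the capability,
# merging is simply the union of their triples.
merged = Graph()
for triple in blast:
    merged.add(triple)
for triple in tcoffee:
    merged.add(triple)

# A computer can now discover that Blast and T-Coffee are related:
aligners = set(merged.subjects(EX.hasCapability, EX.ProteinSequenceAlignment))
print(aligners)  # {EX.Blast, EX.TCoffee}
```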
 
Using the semantic web, “researchers will thus be able to input a gene of interest from an experiment into a computer and explicitly ask the computer to return information on how this gene functions in another organism, or how the product of this gene affects a given biological process, or which compounds also affect that biological process and whether these compounds have been shown to have the same effect in other organisms,” they write.
 
They add that the current structure of the Bioinformatics Links Directory is “amenable to semantic web notation and upgrading of the directory to encompass this functionality is being explored.”

 
 
Largest Changes in Web Servers in the Bioinformatics Links Directory: 2006-2008*

Class of Web Server | Web Server Description                         | 2006 | 2007 | 2008 | % Change Since 2006
DNA                 | Phylogeny Reconstruction                       |   37 |   43 |   46 |  24.3%
DNA                 | Sequence Feature Detection                     |  118 |  142 |  145 |  22.9%
DNA                 | Utilities                                      |   19 |   20 |   23 |  21.1%
Expression          | cDNA, EST, SAGE                                |   29 |   36 |   44 |  51.7%
Expression          | Microarrays                                    |   75 |   89 |  101 |  34.7%
Expression          | Protein Expression                             |    8 |    9 |   17 | 112.5%
Human Genome        | Health and Disease                             |   14 |   19 |   23 |  64.3%
Human Genome        | Sequence Polymorphisms                         |   25 |   33 |   36 |  44.0%
Literature          | Text Mining                                    |   11 |   15 |   22 | 100.0%
Model Organisms     | Microbes                                       |   31 |   38 |   45 |  45.2%
Proteins            | 3-D Structural Features                        |   53 |   70 |   75 |  41.5%
Proteins            | 3-D Structure Comparison                       |   35 |   45 |   50 |  42.9%
Proteins            | Do-It-All Tools                                |    8 |    8 |   13 |  62.5%
Proteins            | Domains and Motifs                             |   86 |  112 |  115 |  33.7%
Proteins            | Molecular Dynamics and Docking                 |   19 |   21 |   27 |  42.1%
RNA                 | Functional RNAs                                |   14 |   19 |   26 |  85.7%
RNA                 | Structure Prediction, Visualization, and Design|   38 |   47 |   54 |  42.1%
Sequence Comparison | Multiple Sequence Alignments                   |   38 |   50 |   56 |  47.4%
Sequence Comparison | Pairwise Sequence Alignments                   |   22 |   23 |   26 |  18.2%

Source: NAR, 2008, Web Server Issue, W2-W4, Bioinformatics Links Directory.
* Change is three web servers or more.
