What percentage of links posted in published articles are dead?

asked Aug 7, 2015 in Open Science by Franck Dernoncourt (540 points)

Is there any research that has looked at what percentage of the links posted in published articles are dead? I am trying to find out to what extent dead links are an issue. E.g. on Stack Overflow, it looks like 10% of the links posted there are dead.

1 Answer

answered Aug 8, 2015 by Daniel Mietchen (1,145 points)
There have been multiple studies about this over the years and in different fields, indicating link rot on the order of a few percent per year. Here are some, in chronological order:

  • Randy J. Carnevale, Dominik Aronsky, The life and death of URLs in five biomedical informatics journals, International Journal of Medical Informatics, Volume 76, Issue 4, April 2007, Pages 269-273, doi:10.1016/j.ijmedinf.2005.12.001.

    Of the 19,108 references extracted from 606 printed and 86 in-press articles, 1112 (5.8%) references contained a URL. Of the 1049 unique URLs, 726 (69.2%) were alive, 230 (21.9%) were dead, and 93 (8.9%) were comatose. URLs from in-press articles included 212 URLs, of which 169 (79.7%) were alive, 21 (9.9%) were dead, and 22 (10.4%) were comatose. The average annual decay, or link rot, rate was 5.4%.

  • Edmund Russell, Jennifer Kane (2008). The Missing Link: Assessing the Reliability of Internet Citations in History Journals. Technology and Culture, Volume 49, Number 2, pp. 420-429. doi:10.1353/tech.0.0028

    We examined the reliability of worldwide web citations in two leading history journals (Journal of American History and American Historical Review) over seven years and found that 18 percent of web links cited over that period were inactive. The problem increased over time. In articles published seven years earlier, 38 percent of web citations were dead.

  • Jason Hennessey and Steven Xijin Ge (2013): A cross disciplinary study of link decay and the effectiveness of mitigation techniques. BMC Bioinformatics 14(Suppl 14):S5. doi:10.1186/1471-2105-14-S14-S5

    we downloaded 18,231 Web of Science (WOS) abstracts containing "http" in the title or abstract from the years under study (1996-2010), out of which 17,110 URLs (14,489 unique) were extracted and used. We developed Python scripts to access these URLs over a 30-day period. For the period studied, 69% of the published URLs (67% of the unique) were available on the live Internet, the Internet Archive's Wayback Machine had archived 62% (59% unique) of the total and WebCite had 21% (16% unique). Overall, 65% of all URLs (62% unique) were available from one of the two surveyed archival engines.

  • Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, Zhou K, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253.
    This study looked at link rot in article corpora from arXiv, Elsevier, and PMC:

    All three corpora show a moderate, yet alarming, link rot ratio for references made in recent articles, published in 2012: 13% for arXiv, 22% for Elsevier, and 14% for PMC. Not surprisingly when considering the dynamic nature of the web, for older articles the link rot ratio increases in all corpora. For publication year 2005, the link rot ratio stands at 18%, 41%, 36% for arXiv, Elsevier, and PMC, respectively. Going back to the earliest publication year in our corpora, 1997, the ratios become 34%, 66%, and 80%, respectively.

  • Zittrain, J, Albert, K and Lessig, L. (2014). Perma: scoping and addressing the problem of link and reference rot in legal citations. Harvard Law Review 127: 176–196.
    This study surveyed law journals and court decisions, with the following observations:

    We documented a serious problem of reference rot: more than 70% of the URLs within the above mentioned journals, and 50% of the URLs within U.S. Supreme Court opinions suffer reference rot — meaning, again, that they do not produce the information originally cited.
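
To compare a cumulative figure with a per-year rate, one rough approach is to assume that links die at a constant yearly rate, so that the fraction still alive after n years is (1 - r)^n. That constant-rate model is my assumption, not something the studies above claim, and the function name below is made up for illustration. A minimal Python sketch, applied to the Russell and Kane figure of 38 percent dead after seven years:

```python
def annual_decay_rate(dead_fraction: float, years: float) -> float:
    """Yearly decay rate implied by a cumulative dead fraction.

    Assumes a constant rate r, so that surviving = (1 - r) ** years.
    This is a simplification; real link rot need not be constant over time.
    """
    if not 0.0 <= dead_fraction < 1.0:
        raise ValueError("dead_fraction must be in [0, 1)")
    if years <= 0:
        raise ValueError("years must be positive")
    surviving = 1.0 - dead_fraction
    # Solve surviving = (1 - r) ** years for r.
    return 1.0 - surviving ** (1.0 / years)


# Russell and Kane (2008): 38 percent of web citations dead after seven years.
rate = annual_decay_rate(0.38, 7)
print(f"{rate:.1%}")  # about 6.6% per year under the constant-rate assumption
```

Under that assumption the history-journal figure works out to roughly 6 to 7 percent per year, which is in the same range as the 5.4% average annual rate reported by Carnevale and Aronsky.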

I also recall seeing a study that indicated that even "early online" articles start out at about 5-10% link rot, i.e. their links start rotting well before official publication. I could not find it just now, though.

