The Internet Archive, the Wikimedia Foundation, and volunteers from the Wikipedia community have now fixed more than 1 million broken outbound web links on English Wikipedia. This was possible because, in addition to its other web archiving projects, the Internet Archive has been monitoring all new and edited outbound links from English Wikipedia for three years, archiving them soon after changes are made to articles. As a result of this work, as pages on the Web become inaccessible, links to archived versions in the Internet Archive’s Wayback Machine can take their place. This has now been done for English Wikipedia, and more than 1 million links now point to preserved copies of missing web content.
This story is a testament to the sharing, cooperative nature and resulting benefits of the open world.
What do you do when good web links go bad? If you are a volunteer editor on Wikipedia, you start by writing software to examine every outbound link in English Wikipedia to make sure it is still available via the “live web.” If, for whatever reason, it is no longer good (e.g. if it returns a “404” error code or “Page Not Found”) you check to see if an archived copy of the page is available via the Internet Archive’s Wayback Machine. If it is, you instruct your software to edit the Wikipedia page to point to the archived version, taking care to let users of the link know they will be visiting a version via the Wayback Machine.
That is exactly what Maximilian Doerr and Stephen Balbach have done. As a result of their work, in close collaboration with the non-profit Internet Archive and the Wikimedia Foundation’s Wikipedia Library program and Community Tech team, more than 1 million broken links have now been repaired. For example, footnote #85 of the article about Easter Island now links to https://web.archive.org/web/20071011083729/http://islandheritage.org/faq.html, where before it linked to the missing page http://islandheritage.org/faq.html. Pretty cool, right?
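The repair loop described above can be sketched in Python using the Wayback Machine’s public Availability API (https://archive.org/wayback/available). This is a minimal illustration, not InternetArchiveBot’s actual implementation: the function names and the HEAD-based liveness check are assumptions for the sketch, and the URL format matches the repaired footnote shown above.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional


def is_link_dead(url: str, timeout: float = 10.0) -> bool:
    """Return True if the live web no longer serves the page (e.g. HTTP 404)."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status >= 400
    except urllib.error.HTTPError as err:
        return err.code >= 400
    except (urllib.error.URLError, OSError):
        # An unreachable host counts as a dead link for this sketch.
        return True


def closest_snapshot(url: str) -> Optional[str]:
    """Ask the Wayback Machine Availability API for the closest archived copy."""
    query = urllib.parse.urlencode({"url": url})
    with urllib.request.urlopen(f"https://archive.org/wayback/available?{query}") as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None


def wayback_url(timestamp: str, original_url: str) -> str:
    """Build a Wayback Machine link of the form used in the repaired footnotes."""
    return f"https://web.archive.org/web/{timestamp}/{original_url}"
```

A bot built on this pattern would run `is_link_dead` on each outbound citation URL and, when a snapshot exists, edit the article so the footnote points at the `web.archive.org` link instead, making clear to readers that they are visiting an archived version.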
“We are honored to work with the Wikipedia community to help maintain the cultural treasure that is Wikipedia,” said Brewster Kahle, founder and Digital Librarian of the Internet Archive, home of the Wayback Machine. “By editing broken outbound links on English Wikipedia to their archived versions available via the Wayback Machine, we are helping to provide persistent availability to reference information. Links that would have otherwise led to a virtual dead end.”
“What Max and Stephen have done in partnership with Mark Graham at the Internet Archive is nothing short of critical for Wikipedia’s enduring value as a shared repository of knowledge. Without dependable and persistent links, our articles lose their backbone of reliable sources. It’s amazing what a few people can do when they are motivated by sharing—and preserving—knowledge,” said Jake Orlowitz, head of the Wikipedia Library.
“Having the opportunity to contribute something big to the community with a fun task like this is why I am a Wikipedia volunteer and bot operator. It’s also the reason why I continue to work on this never-ending project, and I’m proud to call myself its lead developer,” said Maximilian, the primary developer and operator of InternetArchiveBot.
So, what is next for this collaboration between Wikipedia and the Internet Archive? Well… there are nearly 300 Wikipedia language editions to rid of broken links. And, we are exploring ways to help make links added to Wikipedia self-healing. It’s a big job and we could use help.
Making the web more reliable… one web page at a time. It’s what we do!
A huge Thank You! to Stephen Balbach, Maximilian Doerr, Vinay Goel, Mark Graham, Brewster Kahle, John Lekashman, Kenji Nagahashi, the Wikimedia Foundation, and Wikipedia community members.
Protip: anyone with a Drupal installation can fix broken links fairly quickly using a module.
The Internet Archive is the most reliable resource for rescuing dead links in Wikipedia. However, to keep on top of things, editors need to use the ability to archive individual pages that they use in citations, preferably every time they use a URL in a citation. I do check every URL I add in a citation to see if it is in (or can be stored in) the Internet Archive. I’d say that about 40% of all the links I’ve added I could and needed to add to IA. I am hoping that once a page is added, the domain is added to any active crawl lists … is this something that happens?
This is pretty cool.
For citation links it would be useful if they all eventually linked to a version of the page archived at the time of the original reference.