Robots.txt files meant for search engines don’t work well for web archives

Robots.txt files were invented 20+ years ago to advise “robots,” mostly search engine web crawlers, about which sections of a web site should be crawled and indexed for search.
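
For readers who have never looked inside one, a robots.txt file is just a plain-text list of directives served from the root of a site. A minimal, hypothetical example (the paths are invented for illustration) looks like this:

    User-agent: *            # rules for every crawler
    Disallow: /admin/        # please skip the administrative pages
    Disallow: /search        # please skip internal search result pages

    User-agent: Googlebot    # rules only for Google's crawler
    Allow: /

A compliant crawler fetches this file first, picks the group of rules that matches its own user-agent string, and skips any path covered by a Disallow line.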

Many sites use their robots.txt files to improve their SEO (search engine optimization) by excluding duplicate content like print versions of recipes, excluding search result pages, excluding large files from crawling to save on hosting costs, or “hiding” sensitive areas of the site, such as administrative pages. (Of course, over the years malicious actors have also used robots.txt files to identify those same sensitive areas!) Some crawlers, such as Googlebot, pay attention to robots.txt directives, while others do not.
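
For crawlers that do choose to honor the protocol, the check itself is simple. The sketch below, in Python, uses the standard library’s urllib.robotparser module; the site URL and crawler name are made up for illustration.

    # Sketch of how a well-behaved crawler might consult robots.txt before
    # fetching a page. The site URL and user-agent string are hypothetical.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # download and parse the directives

    # Ask whether this user agent may fetch this URL.
    url = "https://example.com/recipes/print/chocolate-cake"
    if rp.can_fetch("ExampleCrawler/1.0", url):
        print("robots.txt permits crawling", url)
    else:
        print("robots.txt asks crawlers to skip", url)

Nothing enforces this check: a crawler that skips it sees the whole site anyway, which is why robots.txt is advisory rather than a security boundary.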

Over time we have observed that robots.txt files geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge in the use of robots.txt files to remove entire domains from search engines when they transition from live web sites into parked domains, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business, the parked domain is “blocked” from search engines, and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints about these “disappeared” sites almost daily.
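
The robots.txt file that appears on a parked domain is frequently nothing more than a blanket exclusion, along these lines:

    User-agent: *
    Disallow: /

Two lines like these, added long after the original site went dark, have historically been enough to hide every archived snapshot of that site from Wayback Machine users.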

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access, this change has not caused problems, which we take as a good sign. We are now looking to do this more broadly.

We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user’s perspective.


4 Responses to Robots.txt files meant for search engines don’t work well for web archives

  1. Daniel says:

    So the plan is to no longer respect robots.txt files with directives that explicitly say User-Agent: ia_archiver or User-Agent: *? Many websites explicitly block the Internet Archive’s ia_archiver crawler while allowing other crawlers.

    Have you considered adopting AppleNewsBot’s policy of pretending to be Googlebot? Robots directives written for Googlebot are more permissive than rules for other crawlers. Also, many sites block everything but Googlebot.

    I’m kind of torn on whether improving the archive by ignoring the wishes of webmasters is a good thing. I often run into issues with pages missing from the archive, only to discover that the website has specifically excluded the ia_archiver. However, I still believe it’s important to preserve a standardized mechanism for controlling crawlers and bots of all kinds.

  2. Pingback: News Roundup | LJ INFOdocket

  3. Joshua says:

    A better choice would probably have been to respect robots.txt as of the time you crawled it; that is, once a page is archived, changing robots.txt later doesn’t change its visibility. Oh well.

    • MeditateOrDie says:

      I mostly agree with this, though some sites seem to block archive.org for no sensible reason; then, when their site eventually dies, as all sites do sooner or later, all of that good, useful info is lost forever.

      Perhaps giving more advanced users a means of overriding robots.txt on a per-save and per-read basis, while still keeping the default behaviors active for general use, might be a compromise that satisfies the needs of dutiful archivers, general users, and webmasters.

      Something like a URL modification could be used to do the trick (this used to work until the handy undocumented functionality was removed), e.g. adding varying numbers of “.” in the right parts of a URL used to work nicely for saves and reads until fairly recently.

      There’s not much point in us ‘saving the web’ if we human beings cannot access the archives because of our robot[s.txt] overlords!

      https://web.archive.org/web/https://u.cubeupload.com/ZkJ7hq.gif 🙂
