Robots.txt meant for search engines don’t work well for web archives

Posted on April 17, 2017 by Mark Graham

Robots.txt files were invented 20+ years ago to help advise “robots,” mostly search engine web crawlers, which sections of a web site should be crawled and indexed for search.

Many sites use their robots.txt files to improve their SEO (search engine optimization) by excluding duplicate content like print versions of recipes, excluding search result pages, excluding large files from crawling to save on hosting costs, or “hiding” sensitive areas of the site like administrative pages. (Of course, over the years malicious actors have also used robots.txt files to identify those same sensitive areas!) Some crawlers, like Google, pay attention to robots.txt directives, while others do not.
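
As a rough illustration of the kinds of directives described above, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt contents, the site `example.com`, and the crawler names `ExampleSearchBot` and `ExampleArchiveBot` are all made up for this example and do not describe any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary recipe site (an assumption for
# illustration only): one crawler is blocked entirely, everyone else is
# steered away from print pages, search results, and the admin area.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Disallow: /

User-agent: *
Disallow: /print/
Disallow: /search
Disallow: /admin/
Crawl-delay: 10
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
base = "https://example.com"

# Ask, for each hypothetical crawler, whether each path may be fetched.
for agent in ("ExampleSearchBot", "ExampleArchiveBot"):
    for path in ("/recipes/1", "/print/recipes/1", "/search?q=soup", "/admin/login"):
        print(agent, path, rules.can_fetch(agent, base + path))
```

Note that nothing here is enforced by the web server. The file is only advice, and it is up to each crawler to ask the question and honor the answer.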

Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints on these “disappeared” sites almost daily.
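
To make the parked-domain problem concrete, the sketch below contrasts two ways of deciding whether an old capture is shown: judging it by the most recent robots.txt, or judging it by the robots.txt that was in effect when the capture was made. This is only an illustration under stated assumptions, not a description of how the Wayback Machine is implemented. The `Capture` and `RobotsVersion` types, the function names, and the `ExampleArchiveBot` agent are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from urllib.robotparser import RobotFileParser


@dataclass
class RobotsVersion:
    seen_on: date  # date this version of robots.txt was observed
    text: str      # its contents


@dataclass
class Capture:
    url: str
    captured_on: date


def _allows(robots_text: str, agent: str, url: str) -> bool:
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


def visible_retroactive(capture, history, agent):
    """Judge every capture by the most recent robots.txt."""
    latest = max(history, key=lambda v: v.seen_on)
    return _allows(latest.text, agent, capture.url)


def visible_at_capture_time(capture, history, agent):
    """Judge a capture by the robots.txt in effect when it was made."""
    earlier = [v for v in history if v.seen_on <= capture.captured_on]
    if not earlier:
        return True  # assumption: no known robots.txt means no restriction
    in_effect = max(earlier, key=lambda v: v.seen_on)
    return _allows(in_effect.text, agent, capture.url)


# Hypothetical history: an open site that later became a parked domain.
OPEN_SITE = "User-agent: *\nDisallow:\n"
PARKED = "User-agent: *\nDisallow: /\n"
history = [
    RobotsVersion(date(2010, 1, 1), OPEN_SITE),
    RobotsVersion(date(2016, 6, 1), PARKED),
]
old_capture = Capture("http://example.com/article", date(2012, 3, 4))

print(visible_retroactive(old_capture, history, "ExampleArchiveBot"))
print(visible_at_capture_time(old_capture, history, "ExampleArchiveBot"))
```

Under the first rule the 2012 capture disappears once the parked domain's robots.txt shows up; under the second it stays visible.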

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.

We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user’s perspective.

34 thoughts on “Robots.txt meant for search engines don’t work well for web archives”

Daniel April 17, 2017 at 2:18 pm

So the plan is to no longer respect robots.txt files with directives that explicitly say User-Agent: ia_archiver? or User-Agent: *? Many websites explicitly block the Internet Archive’s ia_archiver crawler while allowing other crawlers.

Have you considered adopting AppleNewsBot’s policy of pretending to be Googlebot? Robots directives written for Googlebot are more permissive than rules for other crawlers. Also, many sites block everything but Googlebot.

I’m kind of torn on whether I think it’s a good thing if you improve the archive by ignoring the wishes of webmasters or not. I often run into issues with pages missing from the archive only to discover that the website has specifically excluded the ia_archiver. However, I still believe it’s important to preserve a standardized mechanism for controlling crawlers and bots of all kinds.
Joshua April 17, 2017 at 11:31 pm

A better choice would probably have been to respect robots.txt as of the time you crawled it; that is once archived changing robots.txt later doesn’t change its visibility. Oh well.
1. MeditateOrDie April 18, 2017 at 7:02 am
  
  I agree with this mostly, though some sites seem to block archive.org for no sensible reason – then when their site eventually dies, as all sites do sooner or later, all of that good, useful info is lost forever.
  
  Perhaps giving more advanced users a means of over-riding robots.txt on a per-save and per-read basis while still keeping the default behaviors active for general use, might be a compromise that satisfies the needs of dutiful archivers, general users and webmasters.
  
  Something like a URL modification could be used to do the trick (this used to work in the past until this handy undocumented functionality was removed). eg: Additions of varying numbers of “.” in the right parts of a URL used to work nicely for saves and reads, until fairly recently.
  
  There’s not much point in us ‘saving the web’ if we human beings cannot access the archives because of our robot[s.txt] overlords!
  
  https://web.archive.org/web/https://u.cubeupload.com/ZkJ7hq.gif 🙂
Shannon April 18, 2017 at 4:22 pm

Yay! Sites that have retroactively changed their robots.txt have caused me a *lot* of problems for (recent) historical research. The info is often still out there, but it can take additional hours to find it. And sometimes it’s just gone instead.

It never made any sense to block access to the archives today when the sites had allowed your crawling yesterday.
Adam April 19, 2017 at 12:55 pm

There is another problem with robots.txt that everyone here should know about. There are two ways an archive could accidentally be removed:

-If a website goes dead and a different owner opens a new website at the same URL with a robots.txt, the entire archive, including that of the dead website, will be removed, even if the new owner has no claim to the previous website.

-If a site gets hacked to add a robots.txt, the same thing happens.

The Wayback Machine should change its system so that the robots.txt check does not remove already-archived pages and only stops further archiving once the robots.txt is established. Removing all history would require contacting archive.org to “flag” the site so that users cannot archive it. There should also be protection so that owners who try to remove history that isn’t theirs are denied (they would be required to show proof of ownership) and cannot delete the dead site.

Please let me know by sending an email to Adomsyik@gmail.com.
MKKLSDKLS April 19, 2017 at 9:16 pm

As someone who has seen many good websites be “park-nuked” and kicked out of the publicly accessible Archive, I beg you people to ignore the parked website robots.txt’s wishes.

If we really want to archive the web, people who have literally no relation to the previous website beyond usurping the previous name via domain-squatting should have no say in what is archived.
Andy L April 20, 2017 at 6:56 pm

I’m happy to hear about this change.

I’ve always thought it was a shame that changes to the modern robots.txt files are able to reach back in time and scrub the site from existence.

I guess I understand why that policy was put into place, but it doesn’t seem to make sense long term. For a convenient domain, its current website and owner might be completely unrelated to the historic page that was there before.
Georgene Uddin April 20, 2017 at 7:26 pm

Hey There. I discovered your weblog the usage of msn. That is an extremely well written article. I’ll make sure to bookmark it and come back to read extra of your useful information. Thank you for the post. I’ll definitely return.
posty April 21, 2017 at 8:20 am

Could we expand this to more than just US government websites? Australian government websites do this too.

eg: http://operational.humanservices.gov.au/robots.txt

That website clearly details how our government’s social security system works, which changes and leaves the public at a disadvantage.
1. Mark Graham (Post author) April 25, 2017 at 6:48 pm
  
  Yes, in general terms we think information produced by governments around the world, and published via public websites, should be preserved and made available via the Wayback Machine.
Andre Borie April 21, 2017 at 10:20 am

I really don’t see any problem with this – if a human can access it, so should the archive be able to – anyone who doesn’t want their stuff being searchable/archived online should just put a password on it.

The only good thing about robots.txt is the rate-limiting, so smaller sites can limit the bandwidth allocated to crawling if they wish.

By the way, what does this mean for previously-archived sites that now changed their robots.txt to block the Archive? Do you still keep the original data, and in which case, would you be able to restore access to it? I’ve seen a few sites where they used to be accessible on the Wayback machine but are not anymore due to a recent robots.txt change, and I’d love to see them available again if the original data wasn’t deleted.
Jim Moores April 21, 2017 at 11:40 am

I’ve found that I can’t access archive material that I myself created because I let a domain I was no longer using expire and now it has a non-permissive robots.txt. At a minimum archive.org needs to respect the robots.txt only at the point of collection, but my personal opinion is that it should be ignored completely by archive.org and allow people to actively opt out in some other way.
Mo April 21, 2017 at 2:35 pm

In my case I was trying to retrieve an old web site of mine

A cybersquatter later bought the domain and put up a robots.txt

Now I can’t see my own site

The archive respects a new robots.txt file owned by a squatter who is effectively blocking a historical archive they had NOTHING to do with. That is INSANE.
1. Mark Graham (Post author) April 25, 2017 at 6:46 pm
  
  Thank you Mo.
  
  People write to us about the situation you describe every day. In many cases they implore us to make their content available again. This is exactly the harm we wish to address here.
  
  And, everyone, please remember you can always write to info@archive.org if you would like us to not crawl your site.
Ryan April 21, 2017 at 10:06 pm

Just this morning ia_archiver submitted a form on my site (the form was blank, but the point is that it clicked submit). Any crawler that submits forms is a jerk crawler. Would you consider redesigning your crawler to be less offensive?
1. Mark Graham (Post author) April 25, 2017 at 6:42 pm
  
  The “ia_archiver” User Agent is used by Alexa Internet, not the Internet Archive.
Henrik April 23, 2017 at 9:54 am

On tools.ietf.org, all the information is public. I use robots.txt primarily to steer web crawlers away from pages which require substantial CPU resources to generate.

Background: tools.ietf.org has been a pro-bono activity for 15 years, and runs on donated hardware; I don’t have the means to upgrade to a level of CPU resources to be able to serve generated pages at the rate the searchbots can hit them. The pages I steer robots away from are for instance source repository diffs, logs, commits etc., served through Trac.

If a crawler is sufficiently gentle, and is able to back down the rate of crawl if the time to serve pages is long or goes up, I’m perfectly happy to have all of the pages now denied by my robots.txt crawled.
Chris April 23, 2017 at 6:38 pm

I’d implore you to consider recognizing an “archive.txt”-like standard then. For people like myself who maintain a personal website, I tend to use it as a file server and would be quite annoyed if my resume (which contains an email address and contact phone number) ended up archived.

The alternative would be I remove everything I don’t want archived. I don’t think that’s your intended goal, so please rethink this strategy.
1. Mark Graham (Post author) April 25, 2017 at 6:39 pm
  
  Thank you for this Chris. Please do write to us at info@archive.org about any sites you manage. I promise we will be responsive.
Ross April 23, 2017 at 8:13 pm

Internet Archive, thank you for wanting to archive the web as users see it, which is the whole point of “Saving the Web!” I had respect for robots.txt 20 years ago, but today it’s clear that we cannot allow site owners to affect the public record by their own selfish choices. Stay the course, thanks again!
vinz April 24, 2017 at 2:20 am

alas, this comes too late for many of my favourite sites….2014-2015 took out a lot for some reason, as did mid-2008

I guess I’ll have to hold out until computers can reconstruct things straight from memory then do a big ol’ rip.

also wish I knew why it doesn’t save images properly sometimes, I run into a lot of those at self-hosted sites, unless the crawler just happened to hit it while a file was broken.
Darren Duncan April 24, 2017 at 8:50 am

This is a good move on the part of the Internet Archive in principle.

At the very least, something I remember requesting of the Internet Archive years ago, is that any respect they give robots.txt should be time sensitive.

If a domain’s robots.txt allows archiving in the present, then the Internet Archive should always make today’s version of that content available in perpetuity, even if tomorrow’s robots.txt for that domain denies archiving.

I would want any website I operate to be archived, and if I gave up any of my domain names in the future, I would not want the future owners of those domain names to be able to cause the Internet Archive to stop displaying the versions of the domain that existed while I controlled it.
Vix April 24, 2017 at 9:40 am

“Archiving relying (..) more on representing the web as it really was, and is, from a user’s perspective.”
I agree 100%. Robots.txt files don’t limit regular users, and the purpose of archiving is to reflect the users’ perspective, not SEO crawling. Go for it!
Michael Martinez April 24, 2017 at 3:29 pm

If you ignore “robots.txt” directives people will find other ways to block you. While it’s unfortunate that you don’t keep data live after a “robots.txt” change, that is your own bad policy. The Robots Exclusion “standard” is NOT a standard, it’s an arbitrary and voluntary set of guidelines. No one forced the archive to take content offline after domain names changed hands. You can easily correct that bad practice by changing your policy rather than blaming the non-standard “standard” (of which MOST PEOPLE are unaware) for the issue.

While you’re fixing the problems with your system, you could also make it easier for Webmasters who do know about both the “robots.txt” file and your archive to correct errors rather than have to wait 24 hours or longer for your crawler to see changes.
1. Mark Graham (Post author) April 25, 2017 at 6:37 pm
  
  Thank you Michael. We encourage people to write to us at info@archive.org to report bugs and make requests (including for content to be removed from the Wayback Machine and for sites to not be crawled). I assure you we read every message sent to us, and act on them as appropriate. Many of the features we add, and bugs we fix, are a direct result of user feedback.
nascent April 25, 2017 at 1:04 pm

There still needs to be a way of specifically preventing IA from archiving a domain.
1. Mark Graham (Post author) April 25, 2017 at 6:33 pm
  
  Please know that site owners can always write to info@archive.org and request that content from a site be removed from the Wayback Machine and from future crawling. We process requests like that every day.
Adam April 25, 2017 at 6:25 pm

It’s bad enough that robots.txt not only prevents archiving but also deletes the entire archive (in other words, if a site is archived and later adds a robots.txt, the archive is deleted). That includes a website being hacked to include a robots.txt.
John April 26, 2017 at 3:23 pm

Then please explain how I can keep a site ephemeral, as intended. Are there IP address ranges, HTTP headers, etc, that can be used to forbid access? What is the way to reliably keep sites out of the archive for the time you respected robots.txt, now, and forever? My sites explicitly tell robots “NOARCHIVE”. You shouldn’t even have the files on your systems. Retroactively making archives public is a dick move.
1. Mark Graham (Post author) April 26, 2017 at 3:54 pm
  
  Hi John,
  
  Please email your request to info@archive.org and we will promptly process it.
Chris Haines April 26, 2017 at 3:33 pm

“We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user’s perspective.”

– I agree. I think times have changed, and this reflects what users want from a web archive these days.