The Internet Archive is collecting webpages from over 6,000 government domains, over 200,000 hosts, and feeds from around 10,000 official federal social media accounts. Some have asked if we ignore URL exclusions expressed in robots.txt files.
The answer is a bit complicated. Historically, sometimes we have honored them and sometimes we have not; going forward, we will be honoring them even less.
Robots.txt files live at the top level of a website, at a URL like this: https://example.com/robots.txt. The standard was developed in 1994 to guide search engine crawlers in a variety of ways, including indicating which areas of a site they should avoid crawling. Search engines such as Google still use it.
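To illustrate how a crawler typically consults these directives, here is a minimal sketch in Python using the standard library's urllib.robotparser. It reuses the example.com URL above; the "ExampleCrawler" user-agent name and the page path are illustrative placeholders, not real values.

```python
from urllib import robotparser

# Fetch and parse the site's robots.txt (using the example.com URL from above).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether a given crawler is allowed to fetch a particular path.
# "ExampleCrawler" is a placeholder user-agent name.
print(rp.can_fetch("ExampleCrawler", "https://example.com/some/page.html"))
```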
These files were useful to the Internet Archive's crawlers 20 years ago, but they have become less and less so, because many sites do not maintain them with archiving in mind. Large or hosted websites often do not make it easy for their users to edit these files, and large websites increasingly guide or block crawlers with technological measures instead. Another problem is that when a domain name changes hands, the current robots.txt file says nothing about pages archived under a previous owner. For these reasons, we now encourage webmasters who want their sites excluded to send exclusion requests to info@archive.org and to specify the time period to which the requests apply.
Our End of Term crawls of .gov and .mil websites in 2008, 2012, and 2016 ignored exclusion directives in robots.txt in order to get more complete snapshots. Other crawls done by the Internet Archive and other entities have had different policies. We have received little negative feedback on this, and little positive feedback; in fact, little feedback at all. The Wayback Machine has also been replaying the captured .gov and .mil webpages in the beta Wayback for some time now, regardless of robots.txt.
Overall, we hope to capture government and military websites well, and hope to keep this valuable information available to users in the future.
Great news! I wish it were as easy to identify domains acquired by abusive domain parkers.
One idea is to keep a log of the current whois record for each domain being cached.
Then, when a domain has changed hands and is crawled again, it can be recognized as belonging to a new owner.
To reduce the amount of data that needs to be kept, and to identify whois privacy blocking, you could SHA-512 the whole record and compare the hash against a database/index of previous hashes. You would also need to remove the expiry date from the data that is hashed and keep the original purchase/registration date.
It's far from perfect, but it's a start.
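A rough sketch of this idea, assuming the whois record is already available as plain text; the field names matched for the expiry date, and the simple in-memory index, are assumptions for illustration, since real whois output varies by registrar:

```python
import hashlib
import re

def normalized_whois_hash(whois_text: str) -> str:
    """Hash a whois record after stripping fields that change on renewal.

    The expiry-date line is removed so a simple renewal does not look like
    a change of ownership; the registration/creation date is kept. The
    field names matched here are assumptions; real whois output varies.
    """
    lines = [
        line for line in whois_text.splitlines()
        if not re.match(r"\s*(Registry Expiry Date|Expiration Date)\s*:",
                        line, re.IGNORECASE)
    ]
    normalized = "\n".join(lines)
    return hashlib.sha512(normalized.encode("utf-8")).hexdigest()

# A crawl could keep one hash per domain; if the hash changes between
# crawls, the record (and possibly the owner) has changed, and the new
# snapshots can be flagged as belonging to a different era.
previous_hashes = {}  # domain -> last seen hash

def has_record_changed(domain: str, whois_text: str) -> bool:
    digest = normalized_whois_hash(whois_text)
    changed = previous_hashes.get(domain) not in (None, digest)
    previous_hashes[domain] = digest
    return changed
```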
Thanks for the good news! The .gov sites should be processed first.