Cloudflare now populating and using the Internet Archive’s Wayback Machine in its content distribution network application
Cloudflare and the Internet Archive are now working together to help make the web more reliable. Websites that enable Cloudflare’s Always Online service will now have their content automatically archived, and if by chance the original host is not available to Cloudflare, then the Internet Archive will step in to make sure the pages get through to users.
Cloudflare has become core infrastructure for the Web, and we are glad we can be helpful in making a more reliable web for everyone.
“The Internet Archive’s Wayback Machine has an impressive infrastructure that can archive the web at scale,” said Matthew Prince, co-founder and CEO of Cloudflare. “By working together, we can take another step toward making the Internet more resilient by stopping server issues from interrupting our customers and, in turn, businesses and users online.”
For more than 20 years the Internet Archive’s Wayback Machine has been archiving much of the public Web, and making those archives available to journalists, researchers, activists, academics and the general public, in total to hundreds of thousands of people a day. To date more than 468 billion Web pages are available via the Wayback Machine and we are adding more than 1 billion new archived URLs/day.
We archive URLs that are identified via a variety of different methods, such as “crawling” from lists of millions of sites, submission by users via the Wayback Machine’s “Save Page Now” feature, addition to Wikipedia articles, references in Tweets, and a number of other “signals” and sources, such as multiple feeds of “news” stories.
An additional source of URLs we will preserve now originates from customers of Cloudflare’s Always Online service. As new URLs are added to sites that use that service they are submitted for archiving to the Wayback Machine. In some cases this will be the first time a URL will be seen by our system and result in a “First Archive” event.
In all cases those archived URLs will be available to anyone who uses the Wayback Machine.
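As an illustration of how an archived copy can be looked up, here is a minimal sketch in Python built around the Wayback Machine's public availability endpoint (`https://archive.org/wayback/available`). The helper names `availability_query` and `closest_snapshot`, and the sample response, are assumptions made for this example; the sketch is not a description of how Cloudflare's Always Online integration is implemented.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url: str) -> str:
    """Build the lookup URL for a page (hypothetical helper)."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': page_url})}"


def closest_snapshot(response_body: str) -> Optional[str]:
    """Return the closest archived snapshot URL, or None if there is none.

    Assumes the JSON shape used in SAMPLE_RESPONSE below: an
    "archived_snapshots" object that holds an optional "closest" entry.
    """
    data = json.loads(response_body)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Illustrative response body, used here so the sketch runs without network access.
SAMPLE_RESPONSE = json.dumps({
    "url": "example.com",
    "archived_snapshots": {
        "closest": {
            "status": "200",
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
        }
    },
})

if __name__ == "__main__":
    print(availability_query("example.com"))
    print(closest_snapshot(SAMPLE_RESPONSE))
```

In practice the string returned by `availability_query` would be fetched over HTTP and the response body passed to `closest_snapshot`; a result of `None` means no archived copy was reported for that URL.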
By joining forces on this project we can do a better job of backing up more of the public Web, and in so doing help make the Web more useful and reliable.
If you have suggestions about how we can continue to improve our services, please don’t hesitate to drop us a note at email@example.com.
Excellent article. I definitely love this site. Stick with it!
Woah, this is an AWESOME collaboration.
Question: what would the process be to opt out of this feature for Cloudflare users? At the moment, you can block Wayback Machine crawlers via robots.txt, but this feature would negate that.
Cloudflare’s “Always Online” feature is only available to people who opt-in to use it.
The Wayback Machine has been really helpful in finding content from old and expired domains.
Hope the partnership with Cloudflare will allow even more sites to be indexed and added to the archive.
Great work. Thanks for the service in preserving the internet.
so great and informative post
I’ve been using Cloudflare for more than 2 years now and I love it. This new addition is definitely a must use for me, I think it’ll be a good one for the little <1% downtime that my blog experiences monthly.
It won't even be noticeable; it is better than having a downtime page.
I agree totally, it’s really a great combination. I am so looking forward to it.
Really a good step toward making the web more useful.
I have a few questions related to this:
1. What if the webmaster blocks the Internet Archive bot in the .htaccess file? Will it still archive pages via Cloudflare?
2. How about password-protected or paywalled pages, where a publisher shows the first paragraph of an article and hides the rest behind a payment wall? Will such pages be archived by the Internet Archive through Cloudflare?
3. Can a webmaster select which of their site’s pages are crawled and indexed by the Internet Archive via Cloudflare? Suppose I want the homepage and product pages archived, but not the shopping-cart page.
Customers of Cloudflare need to opt-in to use their “Always Online” service.
Is this the reason I am getting an error when using web.archive.org/save/youtube.com?
We can’t retrieve all the files we need to display that page. Please try again later.
If I’m reading this correctly, Cloudflare users will have their pages cached and indexed in the Wayback Machine?
Will there be an opt-out option?
Customers of Cloudflare have to opt-in to use their “Always Online” service.
Any plan for IPv6 support in Internet Archive?
Nice post thank you so much
That’s good news for both users and website owners. Thank you!
I have been using Cloudflare for 4 months. They are providing the best services.
This is an awesome collaboration. Great work. Thanks for the service in preserving the internet.
This is taking the meaning of “always online” a big step further!
It’s wonderful to have more such extensive and broad-based sources of URLs. I wish big search engines and DNS servers could find a privacy-friendly way to contribute too.
CloudFlare states both that they have their own crawler and that they’ll send the hostname (not specific URLs?) to Internet Archive for its own crawl.
> Enabling Always Online in the Cloudflare dashboard allows us to share your hostname with the Wayback Machine so that they can archive your website. When a website’s origin is down, Cloudflare will go to the Internet Archive to retrieve the most recently archived version of the site, so that visitors will still be able to view the site’s content.
> Our User Agent Mozilla/5.0 (compatible; CloudFlare-AlwaysOnline/1.0; +https://www.cloudflare.com/always-online) AppleWebKit/534.34
It seems however that the latter page is actually outdated, based on this comment from jgc at CloudFlare:
> Uh, no. We’re literally doing the opposite. We used to have our own caching infrastructure for “Always Online” and we’re getting rid of it and using archive.org instead. […] We tell archive.org about the URI, they crawl it. They handle robots.txt.
So it’s definitely not about CloudFlare making or sharing any WARCs by itself. Reusing Internet Archive’s services (for a decent fee, I suppose) is good use of resources!