Author Archives: Mark Graham

Robots.txt files meant for search engines don’t work well for web archives

Robots.txt files were invented 20+ years ago to help advise “robots,” mostly search engine web crawlers, which sections of a web site should be crawled and indexed for search.

Many sites use their robots.txt files to improve their SEO (search engine optimization) by excluding duplicate content like print versions of recipes, excluding search result pages, excluding large files from crawling to save on hosting costs, or “hiding” sensitive areas of the site like administrative pages. (Of course, over the years malicious actors have also used robots.txt files to identify those same sensitive areas!) Some crawlers, like Google’s, respect robots.txt directives, while others do not.
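
As an aside, here is a minimal sketch of how a well-behaved, search-engine-style crawler typically consults robots.txt before fetching a page, using Python’s standard urllib.robotparser. The example.com rules referenced in the comments are hypothetical, not taken from any real site:

    import urllib.robotparser

    # Hypothetical robots.txt of the SEO-oriented kind described above:
    #   User-agent: *
    #   Disallow: /admin/
    #   Disallow: /search
    #   Disallow: /recipes/print/

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    # A polite crawler checks each URL against the rules before fetching it.
    print(rp.can_fetch("*", "https://example.com/recipes/chocolate-cake"))
    print(rp.can_fetch("*", "https://example.com/recipes/print/chocolate-cake"))

With rules like the hypothetical ones above, the first check would come back allowed and the second disallowed, even though the “print” page is exactly the kind of content an archival crawler may still want to capture.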

Over time we have observed that robots.txt files geared toward search engine crawlers do not necessarily serve our archival purposes. The Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large files. We have also seen an upsurge in the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business, the parked domain is “blocked” from search engines, and no one can look at the history of that site in the Wayback Machine anymore. We receive inquiries and complaints about these “disappeared” sites almost daily.

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign.  We are now looking to do this more broadly.  

We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user’s perspective.

Wayback Machine Chrome extension now available

The Wayback Machine Chrome browser extension helps make the web more reliable by detecting dead web pages and offering to replay archived versions of them.  You can get it here.

For the past 20 years, the Internet Archive has recorded and preserved web pages, and hundreds of billions of them are available via the Wayback Machine.  This matters because we are learning that the web is fragile and ephemeral.  For example, a 2013 Harvard study found that 49% of the URLs referenced in U.S. Supreme Court decisions are now dead.  Those decisions affect everyone in the U.S., and the evidence the opinions are based on is disappearing.

When previously valid URLs stop responding and instead return a result code of 404, we call that link rot.  The Wayback Machine Chrome extension is designed to help mitigate link rot and other common web breakdowns.

When you use the Wayback Machine extension for Chrome, you are automatically offered the opportunity to view archived pages whenever one of several error conditions, including code 404, or “page not found,” is encountered.  When such a code is detected, the extension silently queries the Wayback Machine, in real time, to see if an archived version is available.  If one is, a notice is displayed in Chrome offering you the option to see the archived page.
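
For the curious, the kind of lookup the extension performs can be approximated with the Wayback Machine’s public availability API. The Python sketch below is illustrative only, not the extension’s actual code, and assumes the requests library is installed:

    import requests

    def find_archived_copy(url):
        # Ask the Wayback Machine whether it holds a capture of this URL.
        resp = requests.get("https://archive.org/wayback/available",
                            params={"url": url}, timeout=10)
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            return closest["url"]  # e.g. https://web.archive.org/web/<timestamp>/<url>
        return None

    print(find_archived_copy("http://www.pfaw.org:80/attacks.htm"))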

The Internet Archive considers the privacy of our users to be of critical importance. We try not to record IP addresses, and we have fought National Security Letters.  You can rest assured that using the Wayback Machine Chrome extension will not expose your browsing history.  In addition, we are in conversation with Google about adding a proxy server as an additional layer of protection.

Thank you for giving the Wayback Machine Chrome extension a try.  You can test it with this URL: http://www.pfaw.org:80/attacks.htm.  We are committed to supporting better web browsing experiences and welcome your feedback and suggestions about how we can improve.  Please send your bug reports, feature requests and other feedback directly to info@archive.org.

Making the Web More Reliable — 20 Years and Counting

As a part of our 20th anniversary, here are some highlights of tools and projects from the Internet Archive that help make the web a more reliable infrastructure for supporting our culture and commerce.

All in all, the Internet Archive is building collections and tools to help make the open web a permanent resource for current users and into the future.

Please donate to make it even better.

Thank you to the hundreds of people who have worked for the Internet Archive over the past 20 years, and to the thousands who have supported the Archive and contributed to the collections.

 

More than 1 million formerly broken links in English Wikipedia updated to archived versions from the Wayback Machine


The Internet Archive, the Wikimedia Foundation, and volunteers from the Wikipedia community have now fixed more than 1 million broken outbound web links on English Wikipedia. This was possible because, in addition to other web archiving projects, the Internet Archive has been monitoring all new, and edited, outbound links from English Wikipedia for three years and archiving them soon after changes are made to articles.  As a result of this work, as pages on the web become inaccessible, links to archived versions in the Internet Archive’s Wayback Machine can take their place.  On English Wikipedia, more than 1 million links now point to preserved copies of missing web content.

This story is a testament to the sharing, cooperative nature and resulting benefits of the open world.

What do you do when good web links go bad? If you are a volunteer editor on Wikipedia, you start by writing software to examine every outbound link in English Wikipedia to make sure it is still available via the “live web.” If, for whatever reason, it is no longer good (e.g. if it returns a “404” error code or “Page Not Found”) you check to see if an archived copy of the page is available via the Internet Archive’s Wayback Machine. If it is, you instruct your software to edit the Wikipedia page to point to the archived version, taking care to let users of the link know they will be visiting a version via the Wayback Machine.
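
A simplified sketch of that check-and-replace logic might look like the Python below. It is not the actual InternetArchiveBot code, which is far more involved; it uses the Wayback Machine’s public CDX index to look up captures, and assumes the requests library is installed:

    import requests

    CDX_API = "https://web.archive.org/cdx/search/cdx"

    def is_dead(url):
        # Treat 4xx/5xx responses and connection failures as a dead link.
        try:
            return requests.head(url, allow_redirects=True, timeout=10).status_code >= 400
        except requests.RequestException:
            return True

    def wayback_url(url):
        # Ask the CDX index for captures of this URL that returned a 200.
        resp = requests.get(CDX_API, params={"url": url, "output": "json",
                                             "filter": "statuscode:200"}, timeout=10)
        rows = resp.json() if resp.text.strip() else []
        if len(rows) < 2:  # the first row is just the column header
            return None
        timestamp, original = rows[-1][1], rows[-1][2]  # most recent good capture
        return f"https://web.archive.org/web/{timestamp}/{original}"

    link = "http://islandheritage.org/faq.html"
    if is_dead(link):
        print("Suggest replacing with:", wayback_url(link))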

That is exactly what Maximilian Doerr and Stephen Balbach have done. As a result of their work, in close collaboration with the non-profit Internet Archive and the Wikimedia Foundation’s Wikipedia Library program and Community Tech team, more than 1 million broken links have now been repaired. For example, footnote #85 from the article about Easter Island now links to https://web.archive.org/web/20071011083729/http://islandheritage.org/faq.html when before it linked to the missing page http://islandheritage.org/faq.html.  Pretty cool, right?

“We are honored to work with the Wikipedia community to help maintain the cultural treasure that is Wikipedia,” said Brewster Kahle, founder and Digital Librarian of the Internet Archive, home of the Wayback Machine. “By editing broken outbound links on English Wikipedia to their archived versions available via the Wayback Machine, we are helping to provide persistent availability to reference information. Links that would have otherwise led to a virtual dead end.”

“What Max and Stephen have done in partnership with Mark Graham at the Internet Archive is nothing short of critical for Wikipedia’s enduring value as a shared repository of knowledge. Without dependable and persistent links, our articles lose their backbone of reliable sources. It’s amazing what a few people can do when they are motivated by sharing—and preserving—knowledge,” said Jake Orlowitz, head of the Wikipedia Library.

“Having the opportunity to contribute something big to the community with a fun task like this is why I am a Wikipedia volunteer and bot operator.  It’s also the reason why I continue to work on this never-ending project, and I’m proud to call myself its lead developer,” said Maximilian, the primary developer and operator of InternetArchiveBot.

So, what is next for this collaboration between Wikipedia and the Internet Archive? Well… there are nearly 300 Wikipedia language editions to rid of broken links. And, we are exploring ways to help make links added to Wikipedia self-healing. It’s a big job and we could use help.

Making the web more reliable… one web page at a time. It’s what we do!

A huge Thank You! to Stephen Balbach, Maximilian Doerr, Vinay Goel, Mark Graham, Brewster Kahle, John Lekashman, Kenji Nagahashi, the Wikimedia Foundation, and Wikipedia community members.

FAQs for some new features available in the Beta Wayback Machine


The Beta Wayback Machine has some new features, including site search to help you find websites and a summary of the types of media on a site.

How can I use the Wayback Machine’s Site Search to find websites? The Site Search feature of the Wayback Machine is based on an index built by evaluating terms from hundreds of billions of links to the homepages of more than 350 million sites. Search results are ranked by the number of captures in the Wayback Machine and the number of relevant links to the site’s homepage.

Can I find sites by searching for words that are in their pages? No, at least not yet. Site Search for the Wayback Machine will help you find the homepages of sites, based on words people have used to describe those sites, as opposed to words that appear on pages from sites.

Can I search sites with text from multiple languages? Yes! In fact, you can search using any Unicode characters. If you can generate characters with your computer, you should be able to use them to search for sites via the Wayback Machine. Go ahead, try searching for правда.

Can I still find sites in the Wayback Machine if I just know the URL? Yes, just enter a domain or URL the way you have in the past and press the “Browse History” button.

What is the “Summary of <site>” link above the graph on the calendar page telling me? It shows you the breakdown of the web captures for a given domain by content type (text, images, videos, PDFs, etc.). In addition, it shows the number of captures, URLs, and new URLs, by year, for all the years available via the Wayback Machine, so you can see how a certain site has changed over time.

What are the sources of your captures? When you roll over individual web captures (the pop-ups that appear when you roll over the dots on the calendar page for a URL), you may notice some text links show up above the calendar, along with the word “why”. Those links will take you to the collection of web captures associated with the specific web crawl the capture came from. Every day hundreds of web crawls contribute to the web captures available via the Wayback Machine. Behind each crawl there is a story about who ran it, why, when and how.

Why are some of the dots on the calendar page different colors? We color the dots, and links, associated with individual web captures, or multiple web captures, for a given day. Blue means the web server result code the crawler got for the related capture was a 2nn (good); green means the crawler got a status code of 3nn (redirect); orange means the crawler got a status code of 4nn (client error); and red means the crawler saw a 5nn (server error). Most of the time you will probably want to select the blue dots or links.
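
Put another way, the color is derived from the first digit of the HTTP status code the crawler recorded. A hypothetical Python version of that mapping (not the Wayback Machine’s actual code):

    def dot_color(status_code):
        # Map an HTTP status class (2xx, 3xx, 4xx, 5xx) to a calendar-dot color.
        return {2: "blue", 3: "green", 4: "orange", 5: "red"}.get(status_code // 100, "gray")

    print(dot_color(200))  # blue
    print(dot_color(301))  # green
    print(dot_color(404))  # orange
    print(dot_color(503))  # red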

Can I restrict my search to sites in a specific domain? Yes, by adding “site:<domain>” your results will be restricted to the specified domain. E.g., “site:gov clinton” will search for sites related to the term “clinton” in the domain “gov”.

Persistent URL Service, purl.org, Now Run by the Internet Archive


OCLC and the Internet Archive today announced the results of a year-long cooperation to ensure the future of purl.org. The organizations have worked together to build a new service hosted by the Internet Archive that will manage the persistent URLs and sub-domain redirections for purl.org, purl.com and purl.net.

Since its introduction by OCLC Research in 1995, purl.org has provided a source of Persistent URLs (PURLs) that redirect users to the correct hosting location for documents, data, and websites as they change over time.
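
Under the hood, a PURL is simply an ordinary HTTP redirect served from a stable address that its owner can update over time. The short Python sketch below (using the requests library) shows how a client observes one resolving; Dublin Core’s well-known http://purl.org/dc/terms/ is used purely as an illustration:

    import requests

    # Resolve a PURL without following the redirect, to see where it points today.
    resp = requests.get("http://purl.org/dc/terms/", allow_redirects=False, timeout=10)

    print(resp.status_code)               # typically a 3xx redirect code
    print(resp.headers.get("Location"))   # the current hosting location set by the PURL's maintainer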

With more than 2,500 users including publishing and metadata organizations such as Dublin Core, purl.org has become important to the smooth functioning of the Web, data on the Web, and the Semantic Web in particular.

Brewster Kahle of the Internet Archive said, “We share a common belief with OCLC that what is shared on the Web should be preserved, so it makes perfect sense for us to add this important service to our set of tools and services, including the Wayback Machine, as part of our mission to promote universal access to all knowledge.”

Lorcan Dempsey of OCLC welcomed the announcement as “a major step in the future sustainability and independence of this key part of the Web and linked data architectures. OCLC is proud to have introduced persistent URLs and purl.org in the early days of the Web and we have continued to host and support it for the last twenty years. We welcome the move of purl.org to the Internet Archive which will help them continue to archive and preserve the World’s knowledge as it evolves.”

All previous PURL definitions have been transferred to the Internet Archive and can continue to be maintained by their owners through a new web-based interface located here.

About OCLC:
OCLC is a nonprofit global library cooperative providing shared technology services, original research and community programs so that libraries can better fuel learning, research and innovation. Through OCLC, member libraries cooperatively produce and maintain WorldCat, the most comprehensive global network of data about library collections and services. Libraries gain efficiencies through OCLC’s WorldShare, a complete set of library management applications and services built on an open, cloud-based platform. It is through collaboration and sharing of the world’s collected knowledge that libraries can help people find answers they need to solve problems. Together as OCLC, member libraries, staff and partners make breakthroughs possible.

About Internet Archive:
The Internet Archive (archive.org) is a 501(c)(3) non-profit that was founded to build an Internet library, with the purpose of offering permanent access for researchers, historians, and scholars to historical collections that exist in digital format.

No More 404s! Resurrect dead web pages with our new Firefox add-on.

Have you ever clicked on a web link only to get the dreaded “404 Document not found” (dead page) message? Have you wanted to see what that page looked like when it was alive? Well, now you’re in luck.

Recently, the Internet Archive and Mozilla announced “No More 404s”, an experiment to help you see archived versions of dead web pages in your Firefox browser. Using the “No More 404s” Firefox add-on, you are given the option to retrieve archived versions of web pages from the Internet Archive’s 20-year store of more than 490 billion web captures available via the Wayback Machine.


To try this free service, and begin to enjoy a more reliable web, view this page with Firefox (version 48 or newer), then:

  1. Install the Firefox “Test Pilot”: https://testpilot.firefox.com
  2. Enable the “No More 404s” add-on: https://testpilot.firefox.com/experiments/no-more-404s
  3. Try viewing this dead page: http://stevereads.com/cache/ephemeral_web_pages.html

See the banner that came down from the top of the window offering you the opportunity to view an archived version of this page?  Success!

For 20 years, the Internet Archive has been crawling the web, and is currently preserving web captures at the rate of one billion per week. With support from the Laura and John Arnold Foundation, we are making improvements, including weaving the Wayback Machine into the fabric of the web itself.

“We’d like the Wayback Machine to be a standard feature in every web browser,” said Brewster Kahle, founder of the Internet Archive. “Let’s fix the web — it’s too important to allow it to decay with rotten links.”

“The Internet Archive came to us with an idea for helping users see parts of the web that have disappeared over the last couple of decades,” explained Nick Nguyen, Vice President, Product, Firefox.

The Internet Archive started with a big goal — to archive the web and preserve it for history. Now, please help us. Test our latest experiment and email any feedback to info@archive.org.