Author Archives: Mark Graham

Brave Browser and the Wayback Machine: Working together to help make the Web more useful and reliable

The Web just got a little bit more reliable.

Starting today with version 1.4 of its desktop browser, Brave has added a 404 detection system with an automated Wayback Machine lookup process.

By default, it now offers users one-click access to archived versions of Web pages that might otherwise not be available. Specifically, we are checking for 14 HTTP error codes in addition to the 404 (page not found) condition: 408, 410, 451, 500, 502, 503, 504, 509, 520, 521, 523, 524, 525, and 526.
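As a rough illustration, the status check described above amounts to a set membership test. This is a minimal sketch, not Brave's actual implementation: the names `WAYBACK_TRIGGER_CODES` and `should_offer_wayback` are hypothetical, and only the list of codes comes from this post.

```python
# Status codes named in this post: 404 plus the 14 additional error codes.
# The constant and function names are illustrative, not taken from Brave's source.
WAYBACK_TRIGGER_CODES = frozenset({
    404,
    408, 410, 451, 500, 502, 503, 504, 509,
    520, 521, 523, 524, 525, 526,
})


def should_offer_wayback(status_code: int) -> bool:
    """Return True when a response status should prompt an archive lookup."""
    return status_code in WAYBACK_TRIGGER_CODES
```

A browser would call something like `should_offer_wayback(response_status)` after a failed page load and only then ask the Wayback Machine whether an archived copy exists.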

The Web is fragile. Just as nations rise and fall, so do the Websites of your favorite news orgs, brands, companies, governments, etc. Web pages are edited and pages are taken down. Studies suggest the average life expectancy of a single Web page is anywhere from 44 – 100 days. We’ve all hit the dreaded error code 404 “Page Not Found”. Is there any hope of seeing that Web page ever again?

If you are a Brave desktop browser user, the answer is now just a click away. But first – you have to update your browser. Then see the benefits of this new feature in action by clicking on this URL.

Over the past 23 years the Wayback Machine has archived more than 900 billion URLs and more than 400 billion Web pages, and it adds hundreds of millions more archived URLs each day. As such, there is a good chance that archived versions of the “missing” pages you are looking for are available.

This is not the first time the Internet Archive has partnered with Brave. In 2017 we announced our support of their micropayments system and then last year we shared an update about that effort. We appreciate how Brave continues to innovate and deliver new value and services through their browser.

We are grateful for their commitment to user privacy, helping advance alternatives to the current ad-supported Web, and focusing on improving the overall Web browsing experience. We applaud Brave’s leadership in these efforts and look forward to working with them on other ways to help make the Web more useful and reliable.

While native Wayback Machine 404 support is only available via the Brave desktop browser, various Wayback Machine functionality, including 404 detection and archived URL playback, is available via browser extensions for Safari, Chrome and Firefox.

If you have ideas about how we can improve the Wayback Machine, please share them with us via email to info@archive.org. Many of the recent features we have added are the result of suggestions from users of the service, and we appreciate all feedback. Together we can help make the Web more useful and reliable.

The Wayback Machine’s Save Page Now is New and Improved

Every day hundreds of millions of web pages are archived to the Internet Archive’s Wayback Machine. Tens of millions of them are submitted by users like you using our Save Page Now service. You can now do that in a way that is easier, faster and better than ever before.

Save Page Now (SPN) just got a major upgrade as a result of a total code rewrite, adding a slew of new and awesome features, with more on the way.  

Let’s explore what’s new with Save Page Now    

You can now save all the “outlinks” of a web page with a single click. By selecting the “save outlinks” checkbox you can save the requested page (and all the embedded resources that make up that page) and also all linked pages (and all the embedded resources that make up those pages). Often, a request to archive a single web page, with outlinks, will cause us to archive hundreds of URLs, every one of which is shown via the SPN interface as it is archived.
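To make the idea of “outlinks” concrete, here is a minimal sketch of how the links on a page could be collected using only Python's standard library. It is an illustration under stated assumptions, not how Save Page Now itself is implemented: the class `OutlinkCollector` and the function `find_outlinks` are hypothetical names, and the sketch looks only at `<a href>` links, ignoring embedded resources.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class OutlinkCollector(HTMLParser):
    """Collect absolute http(s) links found in <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop #fragments.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        is_web_link = absolute.startswith(("http://", "https://"))
        if is_web_link and absolute not in self.outlinks:
            self.outlinks.append(absolute)


def find_outlinks(html, base_url):
    """Return the de-duplicated outlinks of a page, in document order."""
    collector = OutlinkCollector(base_url)
    collector.feed(html)
    return collector.outlinks
```

Each URL returned by `find_outlinks` would then be a candidate for its own capture, which is why a single request with outlinks can turn into hundreds of archived URLs.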

My Web Archive keeps a record of the pages you personally saved in the Wayback Machine using Save Page Now.

The new and improved SPN is based on the modern, server-side Brozzler software, which is capable of running web page JavaScript when saving a URL. With this new approach, we can replay the original more faithfully than was possible before.  And, because this software is actively supported by several developers, bugs are quickly fixed, and new features added at a rapid pace. 

When users are logged in with their free Archive.org account, SPN-generated archives can be saved to that user’s “My web archive” public gallery of archived pages.  

In addition to capturing more high-quality archives of web page elements (HTML, JavaScript, Image files, etc.), SPN can now also produce a screenshot. If screenshots of archived pages are available, we will display an icon on corresponding playback pages and if selected the screenshot will be shown. 

Have you ever wanted to archive all the web pages linked from an email message?  Well, you are in luck because now you can forward that email to “savepagenow@archive.org” and after a few minutes you will get an email back filled with Wayback Machine playback URLs. 

Some of you might like the new “First capture” badge you will see if any of the URLs you submit to be archived (including outlinked URLs and URLs included in emails) have not been archived yet. And, yes, for those of you who are feeling competitive, we are planning to launch a “leader board” soon. Let the games begin!

Maybe you want the URLs embedded in a web-based PDF file, RSS feed, or JSON file archived. The new SPN will parse those files and archive all the URLs they contain.  To use this feature, simply submit PDF/RSS or JSON URLs to SPN, and don’t forget to select the “capture outlinks” checkbox.

This new version of SPN is also being used as the back-end support for a number of Wayback Machine services, including the iOS and Android apps as well as the Chrome, Firefox and Safari browser extensions. And, in case you wondered, those apps and extensions will also be getting major updates very soon.

And, yes, of course SPN has a brand new API that you can use to automate a range of Web archiving projects. Please write to us at info@archive.org if you would like to learn more about the API.

We have often gotten requests to archive URLs from a Google Sheet. We now support that feature for authorised users. Please write to us for access to this advanced capability at info@archive.org.

We LOVE hearing about ways we can make the Wayback Machine better. In fact most of these new SPN features started with your user suggestions.  

Please let us know what you think. Good, bad, or otherwise. Who knows, the next cool SPN feature might be invented by you!

And remember, “If you see something, save something!”

The Mueller Report – Now with Linked Footnotes and Accessible

The Mueller Report, originally released as a scanned image PDF, is now available as a text-based EPUB document with 747 live footnotes and is conformant with both Web and EPUB accessibility requirements.

The Mueller Report is arguably one of the most important documents in American politics. However, when the report was made available to the public by the Department of Justice (DOJ) on the morning of April 18th, 2019, the formatting left much to be desired. For one thing, it was initially published as a PDF image file with no text, which meant it could not be searched. That version of the report can be found here. An updated version of the report, with searchable text, was published by the DOJ on April 22nd at the same URL and with the same filename (report.pdf). More importantly, while the report had 2,390 footnotes, only 14 of those referenced links to live web pages. In addition, the report contained many formatting issues that made it less than accessible to people with reading disabilities, and it was not compliant with the U.S. government’s Section 508 accessibility standards.

The Internet Archive sought to make the report more useful by adding links to as many references in the footnotes as possible, and to make it more accessible to the reading disabled community. To do this, we teamed with MuckRock to crowdsource the identification of web-based resources referred to in footnotes. We then worked with a team of interns to carefully research every footnote and, in some cases, the multiple references each one contained. We identified 733 external resources (added to the 14 available in the original report, for a total of 747 links), which we archived via the Wayback Machine and the Internet Archive’s TV News Archive, and uploaded to the Internet Archive’s collections. We included links to archived webpages to guard against the ephemerality of web-based resources. In particular, referencing archives guards against link rot (when URLs go dead, e.g. return a status code 404) and content drift (when the content associated with a URL changes over time).

In addition, the report has been made fully conformant with both Web and EPUB accessibility requirements, as well as meeting the U.S. government’s Section 508 requirements. This includes proper heading markup and other accessibility markup, to facilitate the use of assistive technology, proper image descriptions for users unable to see the images (including the redactions), and accessibility metadata. It is now fully accessible for the print-disabled, which includes blind, low-vision, dyslexic, and other users with visual impairments. This work was done by Publishing Technology Partners and codeMantra.

The production of this enhanced EPUB edition of the Mueller Report was done in partnership with the Digital Public Library of America (DPLA). Their editors added the links we found, as well as the accessibility changes that had been identified, to a high quality EPUB edition of the report that they had previously created and published. We are happy to share that updated version here.

This version of the report still does not have links for every footnote.  That is because many of the underlying documents and interviews cited in the report are not yet available to the public, and in some cases the footnotes are points of clarification and no external resources are relevant.  We are monitoring open FOIA requests for documents that are currently unavailable and we hope to add more links to updated versions of the report as they become available.

We also know there may be some errors or other omissions in our links and edits and, as such, welcome any suggestions of additional resources that should be linked to references in the report. We also invite suggestions of other public documents that could be made more accessible.  Please write to info@archive.org with your thoughts. 

More than 9 million broken links on Wikipedia are now rescued

As part of the Internet Archive’s aim to build a better Web, we have been working to make the Web more reliable — and are pleased to announce that 9 million formerly broken links on Wikipedia now work because they go to archived versions in the Wayback Machine.

22 Wikipedia Language Editions with more than 9 million links now pointing to the Wayback Machine.

For more than 5 years, the Internet Archive has been archiving nearly every URL referenced in close to 300 Wikipedia sites as soon as those links are added or changed, at a rate of about 20 million URLs/week.

And for the past 3 years, we have been running a software robot called IABot on 22 Wikipedia language editions looking for broken links (URLs that return a ‘404’, or ‘Page Not Found’). When broken links are discovered, IABot searches for archives in the Wayback Machine and other web archives to replace them with. Restoring links ensures Wikipedia remains accurate and verifiable and thus meets one of Wikipedia’s three core content policies: ‘Verifiability’.

To date we have successfully used IABot to edit and “fix” the URLs of nearly 6 million external references that would have otherwise returned a 404. In addition, members of the Wikipedia community have fixed more than 3 million links individually. Now more than 9 million URLs, on 22 Wikipedia sites, point to archived resources from the Wayback Machine and other web archive providers.
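The check, look-up and replace cycle described above can be sketched in a few lines. This is only an outline of the general technique and not IABot's actual code: the function `rescue_links` is a hypothetical name, and the two callables it takes (`is_broken` and `find_archive`) stand in for the real link checker and web-archive search.

```python
def rescue_links(urls, is_broken, find_archive):
    """Map each broken URL to an archived replacement, when one exists.

    is_broken(url) -> bool and find_archive(url) -> str or None are
    supplied by the caller; they are placeholders for a real link
    checker and a real web-archive lookup.
    """
    fixes = {}
    for url in urls:
        if not is_broken(url):
            continue  # live links are left alone
        archived = find_archive(url)
        if archived:
            fixes[url] = archived  # broken links with no archive stay unfixed
    return fixes
```

Passing the checker and the archive search in as functions keeps the loop easy to test and makes it possible to consult more than one web archive provider.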

 

 

(Broken Link) (Rescued Page)

One way to measure the real-world benefit of this work is by counting the number of click-throughs from Wikipedia to the Wayback Machine. During a recent 10-day period, the Wikimedia Foundation started measuring external link click-throughs, as part of a new research project (in collaboration with a team of researchers at Stanford and EPFL) to study how Wikipedia readers use citations and external links. Preliminary results suggest that, by far, the most popular external destination was the Wayback Machine, three times the next most popular site, books.google.com. In real numbers, on average, more than 25,000 clicks/day were made from the English Wikipedia to the Wayback Machine.

From “Research:Characterizing Wikipedia Citation Usage/First Round of Analysis”

Running IABot on a given Wikipedia site requires both technical integration and operations support as well as the approval of each related Wikipedia community. Two key people have worked on this project.

Maximilian Doerr, known in the Wikipedia world as “Cyberpower”, is a long time volunteer with the Wikipedia community and now a consultant to the Internet Archive. He is the author of the InternetArchiveBot (IABot) software.

Stephen Balbach is a long time volunteer with the Wikipedia community who collaborates with Max and the Internet Archive. He has authored programs that find and fix data errors, verify existing archives on Wikipedia, and discover new archives amongst Wayback’s billions of pages and across dozens of other web archive providers.

The number of rescued links, and the quality of the edits, are the result of Max and Stephen’s dedicated, creative and patient work.

What have we learned?

We learned that links to resources on the live web are fragile and not a persistently reliable way to refer to those resources. See “49% of the Links Cited in Supreme Court Decisions Are Broken”, The Atlantic, 2013.

We learned that archiving live-web linked resources, as close to the time they are linked, is required to ensure we capture those links before they go bad.

We learned that the issue of “link rot” (when once-good links return a 404, 500 or other complete failure) is only part of the problem, and that “content drift” (when the content related to a URL changes over time) is also a concern. In fact, “content drift” may be a bigger problem for reliably using external resources because there is no way for the user to know the content they are looking at is not the same as the editor had originally intended.

We learned that by working in collaboration with staff members of the Wikimedia Foundation, volunteers from the Wikipedia communities, paid contractors and the archived resources of the Wayback Machine and other web archives, we can have a material impact on the quality and reliability of Wikipedia sites and in so doing support our mission of “helping to make the web more useful and reliable”.
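The distinction drawn above between “link rot” and “content drift” can be illustrated with a small classifier. This is a simplified sketch under stated assumptions, not a tool we run: the names `content_digest` and `classify_link` are hypothetical, and comparing exact hashes is deliberately strict, since real pages often change in trivial ways (ads, dates) that a practical check would need to tolerate.

```python
import hashlib


def content_digest(content: bytes) -> str:
    """Fingerprint page content so later versions can be compared."""
    return hashlib.sha256(content).hexdigest()


def classify_link(status_code: int, current_content: bytes, cited_digest: str) -> str:
    """Classify a cited link as 'link rot', 'content drift' or 'ok'.

    cited_digest is assumed to have been recorded when the link was
    first cited; this post does not describe such a record.
    """
    if status_code >= 400:
        return "link rot"  # the URL fails outright (404, 500, ...)
    if content_digest(current_content) != cited_digest:
        return "content drift"  # the URL works but shows different content
    return "ok"
```

The sketch also shows why content drift is harder to notice: the request succeeds, so nothing signals a problem unless an earlier version is available for comparison.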

What is next?

We will expand our efforts to check and edit more Wikipedia sites and increase the speed at which we scan those sites and fix broken links.

We will improve our processes to archive externally referenced resources by taking advantage of the Wikimedia Foundation’s new “EventStreams” web service.

We will explore how we might expand our link checking and fixing efforts to other media and formats, including more web pages, digital books and academic papers.

We will investigate and experiment with methods to support authors’ and editors’ use of archived resources (e.g. using Wayback Machine links in place of live-web links).

We will continue to work with the Wikimedia Foundation, and the Wikipedia communities world-wide, to advance tools and services to promote and support the use of persistently available and reliable links to externally referenced resources.

Why I Love Helping Back up the Public Web

Over the past couple of years the Wayback Machine has been written about, or referenced, by journalists, researchers, academics and students in more than a thousand published news articles.

This week a CNN article used the Wayback Machine to bring to light, in a relevant and current context, writings of a public figure that otherwise would have been lost. Reading the article made me happier about leading the Wayback Machine project than I have been at any time since I started 3 years ago.

I think it is fair to say that this article, written by Andrew Kaczynski (@KFILE) of CNN, makes the case more strongly and more clearly than any other for the importance of cultural memory in general, and the Wayback Machine in particular, in supporting a healthy political discourse and helping to hold those in power accountable.

The article cites two columns of now Vice President Mike Pence that were posted about 17 years ago and that can be read via the Internet Archive’s Wayback Machine here and here. However they return a 404 (page not found) error when accessed via the “Live Web” here and here, and have been gone from the live web for more than a decade.

The fruits of the Wayback Machine are the result of thousands of people over the past 20 years, working, volunteering and otherwise contributing to the Internet Archive’s efforts to preserve our cultural heritage and helping to make the web more useful and reliable.

If it were not for the Wayback Machine, the cogent and earnest writings of a columnist who became Vice President of the United States might not be available for us to reflect on, and benefit from, today.

To all those who value journalism, memory, context and perspectives, supporting the Internet Archive’s mission of Universal Access to All Knowledge is necessary now more than ever in our digital age.

Google Summer of Code 2018: Thank you Google and welcome students!

The Internet Archive is grateful to Google for running their “Google Summer of Code” (GSoC) program, providing support for students and open source projects.

This year the GSoC will support 5 students to work with the Internet Archive on the following projects:

Anish Kumar Sarangi – Continue development of the Chrome extension “Wayback Machine”. Today this extension is used by tens of thousands of people to help them archive URLs, access archived content from broken links (404s, etc.) and perform other functions to help make the web more useful and reliable. We will build on that work, adding features, fixing bugs and supporting efforts to bring this tool to millions of users.

Zhengyue Cheng – Inventory the Web to help the Wayback Machine do a better job of archiving it. Today the Wayback Machine archives about 1.5 billion URLs/week. A goal of this project will be to help inform the selection of “seeds” for that effort, to help ensure our coverage is as complete and distributed as possible. We don’t know what we don’t know and this project will help us fill in the blanks.

Fotios Tsalampounis – Add functionality to the Wayback Machine to help people learn about changes in web pages over time. Leveraging work done by the Environmental Data Governance Initiative (EDGI) we will continue to develop software to detect changes in the content of web pages and provide user-facing and API-based interfaces to those changes.

Salman Bhai – Improve OpenLibrary.org. Salman will lead an effort to write robots that will add hundreds of thousands of new modern book catalog records to OpenLibrary. He will also make OpenLibrary more robust and easier to deploy using Docker and Ansible.

Dave Barry – Continue development of the Google Home (voice) service “Internet Archive”. If you have a Google Home device you can use the service today by saying “Hey Google, ask Internet Archive”. Or, try some complete sentences like “Hey Google, ask the Internet Archive to randomly play the Grateful Dead” or “Hey Google, ask the Internet Archive to randomly play Jazz 78s”.

Each student has been paired with a “Mentor”, from the Internet Archive’s staff, who will help guide them to a successful engagement.

At the end of the Summer we will publish blog posts here about the outcome of each project.

Thank you Google!

Internet Archive to host conference about saving online news

The Internet Archive will host the “Dodging the Memory Hole” (DTMH) forum on Nov. 15 and 16. This will be the fifth in the series of outreach efforts over the past four years. Presented by the Donald W. Reynolds Journalism Institute, with support from the Institute of Museum and Library Services, the conference will address issues related to archiving and access to online news.

We are happy to be able to present a range of people, and projects, involved in a wide cross-section of activities related to news archiving, representing local, national and world-wide efforts.  As a bonus, our special guest speaker, Daniel Ellsberg, will highlight the value of the First Amendment and the need to make sure the public has free access to accurate information in the digital age.

News has been called the “first rough draft of history.”  Some think the risk to this history is at an all time high.  The possibility exists that large portions of our cultural record, as captured by journalists and others, will be lost forever if no action is taken to provide long-term solutions for access.  The loss of digital records is happening at an unprecedented pace – faster than the loss of comparable print and analog resources.  Access and preservation are two sides of the same coin in this regard.

The Internet Archive has become increasingly important as a means of collecting and preserving online news content.  As if the challenges of capturing more traditional news sources such as newspapers and television stations aren’t enough, the rise of social media as major distribution channels has made it even more difficult to address the complex set of issues involved.  Since many of the challenges end up being technical in nature, bringing Internet Archive staff together with the DTMH community offers the chance to identify problems and approach solutions to some of the stumbling blocks we’ve encountered at this point in the journey.

Journalists, memory institutions, technologists, historians, political scientists and anyone with an interest in having long-term access to a trustworthy and accurate record of life in the digital age will find this gathering of interest. I urge anyone interested in this urgent and important issue to come join us at the Internet Archive on Nov. 15-16. We have a limited number of seats available. Registration is required, but it is free. If you want to register in time to allow us to order food for you, please do so by Monday, Oct. 30. Final cutoff for registrations is Nov. 5. I hope to see you there!

To learn more about the conference click here.

To register for the conference click here.

Wayback Machine Playback… now with Timestamps!

The Wayback Machine has an exciting new feature: it can list the dates and times, the Timestamps, of all page elements compared to the date and time of the base URL of a page.  This means that users can see, for instance, that an image displayed on a page was captured X days before the URL of the page or Y hours after it.  Timestamps are available via the “About this capture” link on the right side of the Wayback Toolbar.  Here is an example:

The Timestamps list includes the URLs and date and time difference compared to the current page for the following page elements: images, scripts, CSS and frames. Elements are presented in a descending order. If you put your cursor over a list element on the page, it will be highlighted and if you click on it you will be shown a playback of just that element.

Under the hood

Web pages are usually a composition of multiple elements such as images, scripts and CSS. The Wayback Machine tries to archive and playback web pages in the best possible manner, including all their original elements.  Each web page element has its own URL and Timestamp, indicating the exact date and time it was archived. Page elements may have similar Timestamps but they could also vary significantly for various reasons which depend on the web crawling process. By using the new Timestamps feature, users can easily learn the archive date and time for each element of a page.
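The comparison behind the Timestamps list can be sketched as simple date arithmetic. This is an illustration only, not the Wayback Machine's own code: the names `parse_timestamp` and `element_offsets` are hypothetical, the 14-digit `YYYYMMDDhhmmss` form is assumed from the timestamps that appear in Wayback playback URLs, and sorting by the size of the difference is an assumption about what “descending order” means.

```python
from datetime import datetime

# Assumed 14-digit timestamp form, as seen in Wayback playback URLs.
WAYBACK_FORMAT = "%Y%m%d%H%M%S"


def parse_timestamp(timestamp: str) -> datetime:
    """Turn a 14-digit capture timestamp into a datetime."""
    return datetime.strptime(timestamp, WAYBACK_FORMAT)


def element_offsets(page_timestamp: str, elements: dict) -> list:
    """Return (element URL, capture time minus page capture time) pairs.

    A negative offset means the element was captured before the page.
    Pairs are sorted with the largest absolute difference first.
    """
    base = parse_timestamp(page_timestamp)
    offsets = [(url, parse_timestamp(ts) - base) for url, ts in elements.items()]
    return sorted(offsets, key=lambda pair: abs(pair[1]), reverse=True)
```

For a page captured at `20170101120000`, an image captured at `20161230120000` would show an offset of minus two days, which is the kind of “X days before” information the feature surfaces.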

Why this is important

The Wayback Machine is increasingly used in critical procedures such as legal evidence or political debate material.  It is important that what is presented is clear and transparent, even in the light of a web that was not designed to be archived. One of the ways a web archive could be confusing is via anachronisms, displaying content from different dates and times than the user expects. For example, when an archived page is played back, it could include some images from the current web, making it look like the image came from the past when it did not. We implemented Timestamps to provide users with more context about, and in turn hopefully greater confidence in, what they are seeing.

Robots.txt meant for search engines don’t work well for web archives

Robots.txt files were invented 20+ years ago to advise “robots,” mostly search engine web crawlers, about which sections of a web site should be crawled and indexed for search.

Many sites use their robots.txt files to improve their SEO (search engine optimization) by excluding duplicate content like print versions of recipes, excluding search result pages, excluding large files from crawling to save on hosting costs, or “hiding” sensitive areas of the site like administrative pages. (Of course, over the years malicious actors have also used robots.txt files to identify those same sensitive areas!)  Some crawlers, like Google, pay attention to robots.txt directives, while others do not.

Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes.  Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files.  We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine.  In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore.  We receive inquiries and complaints on these “disappeared” sites almost daily.
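The two situations described above can be reproduced with the robots.txt parser in Python's standard library. This is a sketch for illustration: the sample robots.txt files, the helper `allowed`, and the user agent name `example-archiver` are all made up, and the code says nothing about how our own crawlers evaluate these files.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical SEO-oriented file: hides print versions, search results, admin pages.
SEO_ROBOTS = """\
User-agent: *
Disallow: /print/
Disallow: /search
Disallow: /admin/
"""

# Hypothetical parked-domain file: blocks the entire site.
PARKED_ROBOTS = """\
User-agent: *
Disallow: /
"""
```

Under the first file, `allowed(SEO_ROBOTS, "example-archiver", "https://example.com/print/cake")` is refused even though the print version is part of a complete snapshot. Under the second, every URL on the domain is refused, which is how a parked domain can hide the whole history of the site that used to live there.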

A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org). As we have moved towards broader access it has not caused problems, which we take as a good sign.  We are now looking to do this more broadly.  

We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user’s perspective.

Wayback Machine Chrome extension now available

The Wayback Machine Chrome browser extension helps make the web more reliable by detecting dead web pages and offering to replay archived versions of them.  You can get it here.

For the past 20 years, the Internet Archive has recorded and preserved web pages, and hundreds of billions of them are available via the Wayback Machine.  This is good because we are learning the web is fragile and ephemeral.  For example a 2013 Harvard study found that 49% of the URLs referenced in U.S. Supreme Court decisions are now dead.  Those decisions affect everyone in the U.S., and the evidence the opinions are based on is disappearing.

When previously valid URLs don’t respond, but instead return a result code of 404, we call that link rot.  The Wayback Machine Chrome extension is designed to help mitigate link rot and other common web breakdowns.

By using the “Wayback Machine” extension for Chrome, users are automatically offered the opportunity to view archived pages whenever any one of several error conditions, including code 404, or “page not found,” is encountered.  If those codes are detected, the Wayback Machine extension silently queries the Wayback Machine, in real-time, to see if an archived version is available.  If one is available, a notice is displayed via Chrome, offering the user the option to see the archived page.
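The lookup step described above can be sketched against the Wayback Machine's public availability endpoint (`https://archive.org/wayback/available`). Whether the extension uses this exact endpoint is an assumption, and the helper names `availability_query_url` and `closest_snapshot` are hypothetical. To keep the sketch self-contained it only builds the request URL and reads a response body, without making a network call.

```python
import json
from typing import Optional
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query_url(page_url: str) -> str:
    """Build the availability query for the page that failed to load."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': page_url})}"


def closest_snapshot(response_body: str) -> Optional[str]:
    """Return the playback URL of the closest snapshot, or None if there is none."""
    data = json.loads(response_body)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

A caller would fetch `availability_query_url(dead_url)` with any HTTP client and pass the response text to `closest_snapshot`. A result of `None` means there is nothing to offer, so no notice needs to be shown.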

The Internet Archive considers the privacy of our users to be of critical importance. We try not to record IP addresses, and we have fought National Security Letters.  You can rest assured that the use of the Wayback Machine Chrome extension will not expose your browsing history.  In addition, we are in conversation with Google about adding a proxy server as an additional layer of protection.

Thank you for giving the Wayback Machine extension for Chrome a try.  You can test it with this URL: http://www.pfaw.org:80/attacks.htm.  We are committed to supporting better web browsing experiences and welcome your feedback and suggestions about how we can improve.  Please send us your bug reports, feature requests and other feedback directly to info@archive.org.