Category Archives: Wayback Machine – Web Archive

No More 404s! Resurrect dead web pages with our new Firefox add-on.

Have you ever clicked on a web link only to get the dreaded “404 Document not found” (dead page) message? Have you wanted to see what that page looked like when it was alive? Well, now you’re in luck.

Recently the Internet Archive and Mozilla announced “No More 404s”, an experiment to help you see archived versions of dead web pages in your Firefox browser. Using the “No More 404s” Firefox add-on, you are given the option to retrieve archived versions of web pages from the Internet Archive’s 20-year store of more than 490 billion web captures available via the Wayback Machine.


To try this free service, and begin to enjoy a more reliable web, view this page with Firefox (version 48 or newer), then:

  1. Install the Firefox “Test Pilot”: https://testpilot.firefox.com
  2. Enable the “No More 404s” add-on: https://testpilot.firefox.com/experiments/no-more-404s
  3. Try viewing this dead page: http://stevereads.com/cache/ephemeral_web_pages.html

See the banner that came down from the top of the window offering you the opportunity to view an archived version of this page?  Success!

For 20 years, the Internet Archive has been crawling the web, and is currently preserving web captures at the rate of one billion per week. With support from the Laura and John Arnold Foundation, we are making improvements, including weaving the Wayback Machine into the fabric of the web itself.

“We’d like the Wayback Machine to be a standard feature in every web browser,” said Brewster Kahle, founder of the Internet Archive. “Let’s fix the web — it’s too important to allow it to decay with rotten links.”

“The Internet Archive came to us with an idea for helping users see parts of the web that have disappeared over the last couple of decades,” explained Nick Nguyen, Vice President, Product, Firefox.

The Internet Archive started with a big goal — to archive the web and preserve it for history. Now, please help us. Test our latest experiment and email any feedback to info@archive.org.

Fixing Broken Links on the Internet


Today the Internet Archive announces a new initiative to fix broken links across the Internet.  We have 360 billion archived URLs, and now we want you to help us bring those pages back out onto the web to heal broken links everywhere.

When I discover the perfect recipe for Nutella cookies, I want to make sure I can find those instructions again later.  But if the average lifespan of a web page is 100 days, bookmarking a page in your browser is not a great plan for saving information.  The Internet echoes with the empty spaces where data used to be.  GeoCities – gone.  Friendster – gone.  Posterous – gone.  MobileMe – gone.

Imagine how critical this problem is for those who want to cite web pages in dissertations, legal opinions, or scientific research.  A recent Harvard study found that 49% of the URLs referenced in U.S. Supreme Court decisions are dead now.  Those decisions affect everyone in the U.S., but the evidence the opinions are based on is disappearing.

In 1996 the Internet Archive started saving web pages with the help of Alexa Internet.  We wanted to preserve cultural artifacts created on the web and make sure they would remain available for the researchers, historians, and scholars of the future.  We launched the Wayback Machine in 2001 with 10 billion pages.  For many years we relied on donations of web content from others to build the archive.  In 2004 we started crawling the web on behalf of a few big partner organizations and of course that content also went into the Wayback Machine.  In 2006 we launched Archive-It, a web archiving service that allows librarians and others interested in saving web pages to create curated collections of valuable web content.  In 2010 we started archiving wide portions of the Internet on our own behalf.  Today, between our donating partners, thousands of librarians and archivists, and our own wide crawling efforts, we archive around one billion pages every week.  The Wayback Machine now contains more than 360 billion URL captures.


FTC.gov directed people to the Wayback Machine during the recent shutdown of the U.S. federal government.

We have been serving archived web pages to the public via the Wayback Machine for twelve years now, and it is gratifying to see how this service has become a medium of record for so many.  Wayback pages are cited in papers, referenced in news articles and submitted as evidence in trials.  Now even the U.S. government relies on this web archive.

We’ve also had some problems to overcome.  This time last year the contents of the Wayback Machine were at least a year out of date.  There was no way for individuals to ask us to archive a particular page, so you could only cite an archived page if we already had the content.  And you had to know about the Wayback Machine and come to our site to find anything.  We have set out to fix those problems, and hopefully we can fix broken links all over the Internet as a result.

Up to date.  Newly crawled content appears in the Wayback Machine about an hour or so after we get it.  We are constantly crawling the Internet and adding new pages, and many popular sites get crawled every day.

Save a page. We have added the ability to archive a page instantly and get back a permanent URL for that page in the Wayback Machine.  This service allows anyone — Wikipedia editors, scholars, legal professionals, students, or home cooks like me — to create a stable URL to cite, share, or bookmark any information they want to still have access to in the future.  Check out the new front page of the Wayback Machine and you’ll see the “Save Page” feature in the lower right corner.
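As an illustration, here is a minimal sketch of requesting a capture programmatically through the public “Save Page Now” endpoint (https://web.archive.org/save/ followed by the URL you want saved). The response handling below is an assumption, not a guarantee of the service’s exact behavior:

    # A minimal sketch (not an official client): ask "Save Page Now" to
    # capture a URL on demand and report where the new snapshot lives.
    import requests

    def save_page(url):
        resp = requests.get("https://web.archive.org/save/" + url, timeout=120)
        resp.raise_for_status()
        # The service generally responds with (or redirects to) the fresh
        # snapshot; resp.url or the Content-Location header usually points at it.
        return resp.headers.get("Content-Location", resp.url)

    print(save_page("http://example.com/"))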

Do we have it?  We have developed an Availability API that will let developers everywhere build tools to make the web more reliable.  We have built a few tools of our own as a proof of concept, but what we really want is to allow people to take the Wayback Machine out onto the web.
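For developers, here is a small sketch of calling that Availability API from Python. The endpoint and JSON field names (archived_snapshots, closest, available, url) follow the public documentation; treat them as assumptions if the interface has since changed:

    # A sketch against the Wayback Machine Availability API.
    import requests

    def closest_snapshot(url, timestamp=None):
        params = {"url": url}
        if timestamp:                        # optional YYYYMMDDhhmmss hint
            params["timestamp"] = timestamp
        resp = requests.get("https://archive.org/wayback/available",
                            params=params, timeout=30)
        resp.raise_for_status()
        snap = resp.json().get("archived_snapshots", {}).get("closest")
        return snap["url"] if snap and snap.get("available") else None

    # Find the capture closest to January 2006, if any.
    print(closest_snapshot("http://example.com/", "20060101"))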

Fixing broken links.  We started archiving the web before Google, before YouTube, before Wikipedia, before people started to treat the Internet as the world’s encyclopedia. With all of the recent improvements to the Wayback Machine, we now have the ability to start healing the gaping holes left by dead pages on the Internet.  We have started by working with a couple of large sites, and we hope to expand from there.

WordPress.com is one of the top 20 sites in the world, with hundreds of millions of users each month.  We worked with Automattic to get a feed of new posts made to WordPress.com blogs and self-hosted WordPress sites.  We crawl the posts themselves, as well as all of their outlinks and embedded content – about 3,000,000 URLs per day.  This is great for archival purposes, but we also want to use the archive to make sure WordPress blogs are reliable sources of information.  To start with, we worked with Janis Elsts, a developer from Latvia who focuses on WordPress plugin development, to put suggestions from the Wayback into his Broken Link Checker plugin.  This plugin has been downloaded 2 million times, and now when his users find a broken link on their blog they can instantly replace it with an archived version.  We continue to work with Automattic to find more ways to fix or prevent dead links on WordPress blogs.

Wikipedia.org is one of the most popular information resources in the world with almost 500 million users each month.  Among their millions of amazing articles that all of us rely on, 125,000 of them currently contain dead links.  We have started crawling the outlinks for every new article and update as they are made – about 5 million new URLs are archived every day.  Now we have to figure out how to get archived pages back into Wikipedia to fix some of those dead links.  Kunal Mehta, a Wikipedian from San Jose, recently wrote a prototype bot that can add archived versions to any link in Wikipedia so that when those links are determined to be dead the links can be switched over automatically and continue to work.  It will take a while to work this through the process the Wikipedia community of editors uses to approve bots, but that conversation is under way.

Every webmaster.  Webmasters can add a short snippet of code to their 404 page that will let users know if the Wayback Machine has a copy of the page in our archive – your web pages don’t have to die!
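Purely as an illustration of the idea (not the actual snippet we provide, which runs in the visitor’s browser), here is a rough server-side sketch of the same behavior using Flask and the Availability API shown above; both choices are assumptions made for the example:

    # Not the official 404 snippet; a server-side sketch of the same idea.
    import requests
    from flask import Flask, request

    app = Flask(__name__)

    @app.errorhandler(404)
    def not_found(_err):
        lost = request.url
        try:
            resp = requests.get("https://archive.org/wayback/available",
                                params={"url": lost}, timeout=10)
            snap = resp.json().get("archived_snapshots", {}).get("closest")
        except requests.RequestException:
            snap = None
        if snap and snap.get("available"):
            # Offer the archived copy instead of a bare error page.
            return (f'Page not found, but the Wayback Machine has a copy: '
                    f'<a href="{snap["url"]}">{snap["url"]}</a>', 404)
        return "Page not found.", 404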

We started with a big goal — to archive the Internet and preserve it for history.  This year we started looking at the smaller goals — archiving a single page on request, making pages available more quickly, and letting you get information back out of the Wayback in an automated way.  We have spent 17 years building this amazing collection; let’s use it to make the web a better place.

Thank you so much to everyone who has helped to build such an outstanding resource, in particular:

Adam Miller
Alex Buie
Alexis Rossi
Brad Tofel
Brewster Kahle
Ilya Kreymer
Jackie Dana
Janis Elsts
Jeff Kaplan
John Lekashman
Kenji Nagahashi
Kris Carpenter
Kristine Hanna
Kunal Mehta
Martin Remy
Raj Kumar
Ronna Tanenbaum
Sam Stoller
SJ Klein
Vinay Goel

Blacked Out Government Websites Available Through Wayback Machine

 

(From the Internet Archive’s Archive-It group: Announcing the first ever Archive-It US Government Shutdown Notice Awards!)

Congress has caused the U.S. federal government to shut down and important websites have gone dark.  Fortunately, we have the Wayback Machine to help.

Many government sites are displaying messages saying that they are not being updated or maintained during the government shutdown, but the following sites are among those that have completely shut their doors today.  Each is listed below with its domain, and an archived capture of each site is available through the Wayback Machine.  Please donate to help us keep these government websites available.  You can also suggest pages for us to archive so that we can document the shutdown.

  • National Oceanic and Atmospheric Administration (noaa.gov)
  • National Park Service (nps.gov)
  • Library of Congress (loc.gov)
  • National Science Foundation (nsf.gov)
  • Federal Communications Commission (fcc.gov)
  • Bureau of the Census (census.gov)
  • U.S. Department of Agriculture (usda.gov)
  • United States Geological Survey (usgs.gov)
  • U.S. International Trade Commission (usitc.gov)
  • Federal Trade Commission (ftc.gov)
  • National Aeronautics and Space Administration (nasa.gov)
  • International Trade Administration (trade.gov)
  • Corporation for National and Community Service (nationalservice.gov)

 

80 terabytes of archived web crawl data available for research

The Internet Archive crawls and saves web pages and makes them available for viewing through the Wayback Machine because we believe in the importance of archiving digital artifacts for future generations to learn from.  In the process, of course, we accumulate a lot of data.

We are interested in exploring how others might be able to interact with or learn from this content if we make it available in bulk.  To that end, we would like to experiment with offering access to one of our crawls from 2011 with about 80 terabytes of WARC files containing captures of about 2.7 billion URIs.  The files contain text content and any media that we were able to capture, including images, flash, videos, etc.

What’s in the data set:

  • Crawl start date: March 9, 2011
  • Crawl end date: December 23, 2011
  • Number of captures: 2,713,676,341
  • Number of unique URLs: 2,273,840,159
  • Number of hosts: 29,032,069

The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date.  We used Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives.  The scope of the crawl was not limited except for a few manually excluded sites.  However, this was a somewhat experimental crawl for us, as we were using newly minted software to feed URLs to the crawlers, and we know there were some operational issues with it.  For example, in many cases we may not have crawled all of the embedded and linked objects in a page since the URLs for these resources were added into queues that quickly grew bigger than the intended size of the crawl (and therefore we never got to them).  We also included repeated crawls of some Argentinian government sites, so looking at results by country will be somewhat skewed.  We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available “warts and all” for people to experiment with.  We have also done some further analysis of the content.
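For those who do get access, here is a minimal sketch of iterating over the records in one of these WARC files. It uses the open-source warcio library, which is simply our choice for this example (any WARC reader would do), and a hypothetical filename:

    # A sketch of walking one WARC file from a crawl like this one.
    from warcio.archiveiterator import ArchiveIterator

    def captured_urls(path):
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type == "response":   # an actual page capture
                    uri = record.rec_headers.get_header("WARC-Target-URI")
                    status = (record.http_headers.get_statuscode()
                              if record.http_headers else None)
                    yield uri, status

    # "example.warc.gz" is a placeholder filename, not part of the dataset.
    for uri, status in captured_urls("example.warc.gz"):
        print(status, uri)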

Hosts Crawled pie chart

If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it.  We may not be able to say “yes” to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.

 

HTTP Archive joins with Internet Archive

It was announced today that HTTP Archive has become part of Internet Archive.

The Internet Archive provides an archive of web site content through the Wayback Machine, but we do not capture data about the performance of web sites.  Steve Souders’s HTTP Archive started capturing and archiving this sort of data in October 2010 and has expanded the number of sites covered to 18,000 with the help of Pat Meenan and WebPagetest.

Steve Souders will continue to run the HTTP Archive project, and we hope to expand its reach to 1 million sites.  To this end, the Internet Archive is accepting donations for the HTTP Archive project to support the growth of the infrastructure necessary to increase coverage.  The following companies have already agreed to support the project: Google, Mozilla, New Relic, O’Reilly Media, Etsy, Strangeloop, and dynaTrace Software. Coders are also invited to participate in the open source project.

Internet Archive is excited about archiving another aspect of the web for both present day and future researchers.

New Firefox Add-on for searching the Wayback Machine

Fellow time travelers,

We have a new Firefox add-on that allows you to search the Wayback Machine from your browser. You can get it at: https://addons.mozilla.org/en-US/firefox/addon/162148/.

For those who have yet to travel back in time, the Internet Archive Wayback Machine allows you to browse through over 150 billion web pages archived from 1996 to a few months ago.

So install the Wayback Machine Firefox add-on and take a trip.

-Jeff Kaplan

GeoCities, Preserved!

There’s a chance that in the 1990s, you were more familiar with neighborhoods on GeoCities than with the neighborhoods in your own town. As one of the most popular and oldest (nearly 15 years running) sites for self-expression on the web, GeoCities paved the way for other sites which would offer a sense of community and networking capabilities. Because it was one of the first ways for people to freely and openly become engaged with the internet, GeoCities will always be an important part of web history.

Yahoo! announced that it will close the site on October 26, 2009, steering users towards their paid service instead. We have been archiving GeoCities sites for years in our crawls, but, as goes with the territory of being web archivists, we want to make sure to gather as many of the pages as possible before the looming end of an era, 10-26-2009. If you have a page with GeoCities or are a fan of a particular page, please use our special collections page to ensure its preservation. Additionally, please refer to another independent project, the Archive Team, which is working to save cultural information that may be lost with the site closing. Yahoo! is also offering valuable advice at their help center.

–Cara Binder


Wayback Machine comes to life in new home

The Wayback Machine is a 150-billion-page web archive, with a front end that serves it through the archive.org website.

Today the new machine came to life, so if you are using the service, you are using a 20′ by 8′ by 8′ “machine” that sits in Santa Clara, courtesy of Sun Microsystems. It serves about 500 queries per second from the approximately 4.5 petabytes (4.5 million gigabytes) of archived web data. We think of the cluster of computers and the Modular Datacenter as a single machine because it acts like one and looks like one. If that is true, then it might be one of the largest current computers.

Also, we can do fun stats. We now know that the web weighs 26,500 pounds, that the average web page weighs 80 micrograms, and that each query takes about 160 joules.
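Those first two figures are consistent with the 150 billion pages mentioned above; a quick back-of-the-envelope check:

    # Rough consistency check of the figures above: ~150 billion pages
    # at ~80 micrograms each gives roughly the quoted total weight.
    pages = 150e9                  # archived pages (from this post)
    grams = pages * 80e-6          # 80 micrograms per page
    pounds = grams / 453.592       # grams per pound
    print(round(grams / 1000), "kg  ~", round(pounds), "lb")  # ~12,000 kg, ~26,455 lb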

On another note, we got a nice letter from the last living director of the Rocky and Bullwinkle Show, Gerard Baldwin, because he read about the “fantastic project”. Our Wayback Machine is a tribute to their more cleverly named “Waybac Machine”, which in turn was a reference to the Univac. Sherman and Peabody live on.

-brewster


Wayback Machine has 85 Billion Archived Webpages

The Internet Archive’s Wayback Machine now has 85,898,456,616 archived web objects in it, and is available, as always, to the public for free.

A snapshot of the World Wide Web is taken every 2 months and donated to the Internet Archive by Alexa Internet. Further, librarians all over the world have helped curate deep and frequent crawls of sites that could be especially important to future researchers, historians, and scholars.

As web pages are changed or deleted every 100 days, on average, having a resource like this is important for the preservation of our emerging cultural heritage.

The Wayback Machine is a database that serves thousands of users every day, and currently gets 300 requests per second. The database contains over 1.5 petabytes of data that came from the web (that is 1.5 million gigabytes) which makes it one of the largest databases of any kind.

To visit the Archive virtually, please go to www.archive.org. To visit us physically in San Francisco, please call and make an appointment; we are open to the public.

85 billion of anything is a big number; thank you all for making it possible.