For the last 15 years, users of the Wayback Machine have browsed past versions of websites by entering URLs into the main search box and clicking Browse History. With the generous support of The Laura and John Arnold Foundation, we’re adding an exciting new feature to this search box: keyword search!

With this new beta search service, users can now find the homepages of over 361 million websites preserved in the Wayback Machine just by typing keywords that describe these sites (e.g. “new york times”). As they type keywords into the search box, they are presented with a list of relevant archived websites, each with a snippet containing:

a link to the archived versions of the site’s homepage in the Wayback Machine
a thumbnail image of the site’s homepage (when available)
a short description of the site’s homepage
a capture summary of the site
- number of unique URLs by content type (webpage, image, audio, video)
- number of valid web captures over the associated time period

Key Features

Search as you type
- Instant results as you type — predictive, interactive and speedy
Multilingual
- Search in any language or using symbols — expanding scope and utility
Site-based Filtering
- Limit results to certain websites or domains using the site: operator (e.g. site:edu)
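
To make the site: operator concrete, here is a minimal Python sketch of how a query such as "site:gov climate change" could be split into keywords and a site filter. The names parse_query and host_matches_site are hypothetical, and the suffix-matching rule is an assumption for illustration, not a description of how the Wayback Machine actually implements the filter.

```python
from typing import List, NamedTuple, Optional


class ParsedQuery(NamedTuple):
    keywords: List[str]
    site: Optional[str]


def parse_query(query: str) -> ParsedQuery:
    """Split a query into keywords and an optional site: filter (hypothetical helper)."""
    keywords = []
    site = None
    for token in query.split():
        # Treat "site:<value>" as a filter; a bare "site:" stays a keyword.
        if token.lower().startswith("site:") and len(token) > len("site:"):
            site = token[len("site:"):].lower()
        else:
            keywords.append(token)
    return ParsedQuery(keywords, site)


def host_matches_site(host: str, site: Optional[str]) -> bool:
    """Assumed rule: a host matches if it equals the filter or ends with '.' + filter."""
    if site is None:
        return True
    host = host.lower().rstrip(".")
    return host == site or host.endswith("." + site)
```

Under this assumed rule, site:edu would match www.stanford.edu, while site:stanford.edu would match only stanford.edu and hosts beneath it.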

Behind the Scenes

Search index was built by processing over 250 billion webpages archived over 20 years
- Index contains more than a billion terms collected from over 400 billion hyperlinks to the homepages of websites
Search results are ranked based on the number of relevant hyperlinks to the site’s homepage and the total number of web captures from the site
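
As a rough illustration of ranking on these two signals, the Python sketch below combines a relevant-link count and a capture count into one score. The post does not say how the two signals are weighted, so the logarithmic damping, the capture_weight parameter, and the SiteStats type are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SiteStats:
    host: str
    relevant_links: int   # hyperlinks to the homepage judged relevant to the query
    total_captures: int   # web captures from the site


def score(site: SiteStats, capture_weight: float = 0.5) -> float:
    """Assumed scoring formula: damped link count plus weighted damped capture count."""
    return math.log1p(site.relevant_links) + capture_weight * math.log1p(site.total_captures)


def rank(sites: Iterable[SiteStats]) -> List[SiteStats]:
    """Return sites ordered from highest to lowest score."""
    return sorted(sites, key=score, reverse=True)
```

The logarithm keeps a site with an enormous number of captures from drowning out the link signal, but any monotonic combination of the two counts would fit the description above equally well.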

Example queries

Websites related to academic journals — academic journals
Searching in Greek to find websites related to Aristotle — Αριστοτέλειο
Government websites related to climate change — site:gov climate change
Stanford websites related to Asian studies — site:stanford.edu asian studies

We hope that this service, to search and discover archived web resources through time, will create new opportunities for scholarly work and innovation.

A big Thank You to: Vinay Goel, Kenji Nagahashi, Mark Graham, Bill Lubanovic, John Lekashman, Greg Lindahl, Vangelis Banos, Richard Caceres, Zijian He, Eugene Krevenets, Benjamin Mandel, Rakesh Pandey, Wendy Hanamura and Brewster Kahle

The Internet Archive has been archiving the web for 20 years and has preserved billions of webpages from millions of websites. These webpages are often made up of, and link to, many images, videos, style sheets, scripts and other web objects. Over the years, the Archive has saved over 510 billion such time-stamped web objects, which we term web captures.

We define a webpage as a valid web capture that is an HTML document, a plain text document, or a PDF.

A domain on the web is an owned section of the internet namespace, such as google.com, archive.org or bbc.co.uk. A host on the web is identified by a fully qualified domain name (FQDN) that specifies its exact location in the tree hierarchy of the Domain Name System. The FQDN consists of two parts: the hostname and the domain name. For example, in the case of the host blog.archive.org, the hostname is blog and the host is located within the domain archive.org.
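
The split into hostname and domain can be sketched in Python as follows. Domains such as bbc.co.uk show that taking the last two labels is not enough, so this sketch uses a tiny hardcoded set of multi-label suffixes. That set and the split_fqdn helper are illustrative assumptions; a real system would more likely consult the full Public Suffix List.

```python
from typing import Tuple

# Illustrative subset only; an assumption standing in for the Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au"}


def split_fqdn(fqdn: str) -> Tuple[str, str]:
    """Split an FQDN into (hostname, domain), e.g. blog.archive.org -> (blog, archive.org)."""
    labels = fqdn.lower().rstrip(".").split(".")
    # The public suffix is two labels long for entries like co.uk, otherwise one.
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    domain_len = suffix_len + 1  # the owned label plus the public suffix
    if len(labels) < domain_len:
        raise ValueError(f"{fqdn!r} does not contain a registrable domain")
    domain = ".".join(labels[-domain_len:])
    hostname = ".".join(labels[:-domain_len])  # empty when the host is the bare domain
    return hostname, domain


print(split_fqdn("blog.archive.org"))  # ('blog', 'archive.org')
print(split_fqdn("www.bbc.co.uk"))     # ('www', 'bbc.co.uk')
```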

We define a website to be a host that has served webpages and has at least one incoming link from a webpage belonging to a different domain.
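
The two definitions above translate into simple predicates. The Python sketch below is one possible reading: the MIME type strings, the HostRecord type, and the is_valid_capture flag are assumptions, since the post does not spell out what makes a capture valid or how content types are detected.

```python
from dataclasses import dataclass, field
from typing import Set

# Assumed MIME types for HTML, plain text and PDF documents.
WEBPAGE_MIME_TYPES = {"text/html", "text/plain", "application/pdf"}


def is_webpage(mime_type: str, is_valid_capture: bool) -> bool:
    """A webpage is a valid capture that is an HTML, plain text or PDF document."""
    base_type = mime_type.split(";")[0].strip().lower()  # drop parameters such as charset
    return is_valid_capture and base_type in WEBPAGE_MIME_TYPES


@dataclass
class HostRecord:
    host: str
    domain: str
    webpage_count: int = 0                                   # webpages served by this host
    linking_domains: Set[str] = field(default_factory=set)   # domains of pages linking in


def is_website(record: HostRecord) -> bool:
    """A website has served webpages and has an incoming link from a different domain."""
    has_external_link = any(d != record.domain for d in record.linking_domains)
    return record.webpage_count > 0 and has_external_link
```

Under this reading, a host that is linked to only from pages within its own domain would not count as a website, however many webpages it has served.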

As of today, the Internet Archive officially holds 273 billion webpages from over 361 million websites, taking up 15 petabytes of storage.

Internet Archive Blogs

A blog from the team at archive.org

Author Archives: Vinay Goel

Beta Wayback Machine – Now with Site Search!

Defining Web pages, Web sites and Web captures
