
Gone but Not Forgotten: Recovering the Dead Web

TL;DR: A Pew Research Center study found that 38% of webpages from a decade ago, and about 25% of pages sampled across the decade, are now inaccessible; our analysis shows that the Wayback Machine has rescued a large share of those otherwise dead pages: about 15% of all the sampled 2013 pages, and about half of the dead pages across the decade.

In 2024, the Pew Research Center published a link-rot study, “When Online Content Disappears”. They stated, “38% of webpages that existed in 2013 are no longer accessible a decade later”. They further noted, “a quarter of all webpages that existed at one point between 2013 and 2023 are no longer accessible”. This is not an isolated report that quantified the rate of loss of online information. Numerous other link-rot studies in the last two decades have reported similar numbers or worse, depending on the context and samples. For example, Ahrefs, an SEO company, reported in the same year, “At Least 66.5% of Links to Sites in the Last 9 Years Are Dead”. In 2021, Jonathan Zittrain published an article in The Atlantic, “The Internet Is Rotting”, in which his team analyzed about 2 million external links from New York Times (NYTimes) articles and reported that 25% of deep links have rotted. They further noted that 72% of the older links from 1998 were dead. A recent longitudinal study on link-rot from Old Dominion University (ODU), “Some URLs Are Immortal, Most Are Ephemeral”, analyzed 27.3 million URL samples from the Wayback Machine since 1996 and reported that about 65% of the sampled URLs were found dead on the live Web, when checked in 2023. Brewster Kahle, the founder of the Internet Archive, has been citing numbers from the early days of the Web and stating the average life of webpages to be anywhere from 40 to 100 days. A 2024 book, “Vanishing Culture: A Report on Our Fragile Cultural Record”, by Messarra et al. highlights underlying causes of numerous recent cultural digital losses while emphasizing the critical roles libraries and archives must play to maintain our cultural history for the future. Different studies have looked at the problem from different perspectives and contexts, hence it is often difficult to compare them side-by-side, but they all agree that an increasing number of links are rotting with the passage of time.
However, some of these studies (not all) have failed to acknowledge the existence of Web archives, such as the Wayback Machine, where a portion of the dead Web might be preserved and can be used as a fallback when a reference leads to a broken link.

In this post we go through some of the link-rot studies and look at them from the perspective of the Wayback Machine to see how much of the dead Web can be rescued. Table 1 shows the status of the dead and rescued Web at a glance as sampled by a few different studies.

Study          Year  Period     Samples  Dead  Rescued
Pew (All)      2024  2013-2023  5.4M     26%   16%
Pew (General)  2024  2013-2023  1M       27%   13%
Zittrain NYT*  2021  2013-2013  88K      40%   38%
ODU NYPW       2024  1996-2021  27.3M    65%   65%

Table 1: Dead links from various link-rot studies rescued by the Wayback Machine.
* The NYT numbers are based on our recreated dataset.

Let us begin by looking at the study from Pew Research Center. They have generously shared their dataset with us, so it was rather trivial for us (after performing some transformations and extractions, as the original dataset was stored in Parquet files) to check the URLs against the Wayback Machine to see if and when each of them was first archived. Their dataset contains 5.4 million unique URLs across the general, news, government, and Wikipedia references categories, sampled from the Common Crawl archive and Wikipedia pages. They also reported on Tweets in their post, but that dataset was not shared with us due to the restrictions posed by the usage policies.
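For readers who want to try a similar lookup, here is a minimal sketch that asks the Wayback Machine's public CDX API for the earliest capture of a URL. It is an illustration only, not the pipeline used for this analysis; the helper names are hypothetical, and it assumes the CDX server's `output=json` layout, in which the first row is a header naming the fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def first_capture_query(url):
    """Build a CDX query URL asking for the single earliest capture of `url`."""
    params = {
        "url": url,
        "output": "json",
        "limit": "1",                  # captures are listed oldest first
        "fl": "timestamp,original",    # only return the fields we need
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_first_capture(body):
    """Return the 14-digit timestamp of the first capture, or None if there is none."""
    rows = json.loads(body) if body.strip() else []
    # Row 0 is the header; a result needs at least one more row.
    if len(rows) < 2:
        return None
    header, first = rows[0], rows[1]
    return dict(zip(header, first))["timestamp"]


def first_capture(url, timeout=30):
    """Network call: fetch and parse the earliest capture for `url`."""
    with urlopen(first_capture_query(url), timeout=timeout) as resp:
        return parse_first_capture(resp.read().decode("utf-8"))
```

Keeping the query builder and the parser separate from the network call makes it easy to test the logic offline and to swap in a batch or rate-limited fetcher for a dataset of millions of URLs.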

Before we dive into our findings, below are brief descriptions of some terms that we will use frequently:

  • Alive: URLs that return 200 OK HTTP status code when resolved
  • Dead: URLs that return HTTP error status codes, TCP connection errors, or DNS failures when resolved
  • Preserved: URLs that are Alive on the live Web as well as present in a Web archive
  • Rescued: URLs that are Dead on the live Web, but are present in a Web archive
  • Endangered: URLs that are Alive on the live Web, but are not present in any Web archive
  • Vanished: URLs that are Dead on the live Web and also not present in any Web archive
  • Archived: Preserved + Rescued
  • Accessible: Preserved + Rescued + Endangered
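The four base categories above follow from two yes/no questions per URL: is it alive, and is it archived? The sketch below encodes that mapping. The names are hypothetical, and treating every response other than 200 OK (including a failed connection, represented here as `None`) as dead is a simplification of the definitions above.

```python
from enum import Enum


class Status(Enum):
    PRESERVED = "preserved"    # alive and archived
    RESCUED = "rescued"        # dead but archived
    ENDANGERED = "endangered"  # alive but not archived
    VANISHED = "vanished"      # dead and not archived


def is_alive(http_status):
    """Alive means 200 OK; None stands for a DNS or TCP failure."""
    return http_status == 200


def classify(alive, archived):
    """Map the two observations about a URL to one of the four categories."""
    if alive:
        return Status.PRESERVED if archived else Status.ENDANGERED
    return Status.RESCUED if archived else Status.VANISHED


def is_archived(status):
    """Archived = Preserved + Rescued."""
    return status in (Status.PRESERVED, Status.RESCUED)


def is_accessible(status):
    """Accessible = Preserved + Rescued + Endangered, i.e. everything but Vanished."""
    return status is not Status.VANISHED
```

For example, `classify(is_alive(404), archived=True)` yields `Status.RESCUED`, which counts as both archived and accessible.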

When we do not take any Web archives into account, about a quarter of all the 5.4 million sampled URLs would be considered inaccessible or dead as illustrated in Figure 1. However, when we leverage the Wayback Machine to access otherwise dead URLs, the fraction of inaccessible or vanished URLs drops from one in every four down to only one in every ten. The Wayback Machine has about 72% of the entire dataset archived, of which 56% are preserved from the URLs that are still alive on the live Web and 16% are rescued from the dead. Another 18% of the URLs in the sample are still alive but have not yet been archived in the Wayback Machine; we call these endangered, as they will vanish if they ever cease to exist on the live Web. It is worth noting that we did not account for any captures of these URLs that might be present in any of the many smaller Web archives other than the Wayback Machine, which, if accounted for, might increase the percentage of accessible URLs a little more. Moreover, we relied on HTTP status codes and did not look into the contents of the pages to check for any soft-404s (i.e., error pages that wrongly return a 200 OK HTTP status code) or other irrelevant content, which might change the numbers further.
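The percentages in this paragraph fit together as simple sums of the four base categories. The short sketch below reproduces that arithmetic from the rounded figures quoted above; the function name is hypothetical and the inputs are the rounded percentages from the text, not the raw counts.

```python
def breakdown(preserved, rescued, endangered):
    """Derive the remaining shares (in percent) from three of the four categories."""
    vanished = 100 - (preserved + rescued + endangered)
    dead = rescued + vanished
    return {
        "vanished": vanished,
        "alive": preserved + endangered,
        "dead": dead,
        "archived": preserved + rescued,
        "accessible": preserved + rescued + endangered,
        # Share of the dead URLs (not of the whole sample) that were rescued.
        "rescued_share_of_dead": rescued / dead,
    }


pew_all = breakdown(preserved=56, rescued=16, endangered=18)
```

With these inputs the dead share comes to 26% and the vanished share to 10%, matching the "one in four" and "one in ten" figures, and about 62% of the dead URLs count as rescued.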

Figure 1: Archiving status of all the URLs from the Pew dataset in the Wayback Machine.

A subset of about 1 million URLs from the Pew dataset is a sample of general webpages from the last decade, spanning 11 years from 2013 to 2023. They noted that about a quarter of the URLs from this subset were dead in 2023, with older URLs having a greater percentage of loss, all the way to 38% for links from 2013. We recreated their yearly graph in Figure 2 in orange, with an overlay in green of the URLs rescued by the Wayback Machine. We found that about 38% of the 38% dead URLs from 2013 (i.e., about 15% of the total) are rescued by the Wayback Machine. Moreover, of the roughly one quarter of all URLs in the general sample that were considered dead, about half were rescued by the Wayback Machine. It is worth noting that the last three years in Figure 2 seem to be rescued almost completely, but this is a side-effect of the ingestion of Common Crawl data from recent years into the Wayback Machine, since Common Crawl happens to be the source of the Pew sample.

Figure 2: Yearly archiving status of URLs from the general sample of the Pew dataset in the Wayback Machine.

We requested access to the dataset of about 2 million URLs from Zittrain’s NYTimes outlinks study, but have not received it yet. In the interim, we created our own dataset by downloading all the NYTimes pages published in 2013 that are present in the Wayback Machine, extracting all the outlinks from them, and excluding all the links to pages from NYTimes itself. We were able to collect about 88 thousand such URLs this way. Then we checked the live Web status of each of the URLs (after following up to 5 redirects, if any) and also checked for their presence in the Wayback Machine. We found that 40% of the external links from NYTimes pages from 2013 were dead on the live Web, but 96% of those dead URLs are archived in the Wayback Machine. This means that only about 2% of the URLs from this sample have vanished. However, this impressive number needs to be taken with a grain of salt because we do not have the original URL sample and our own sample is derived from pages present in the Wayback Machine, which has an inherent bias: outlinks from those pages are more likely to be archived than the outlinks of pages that are not present in the Wayback Machine. That said, we will be keen to revisit these numbers if and when we get access to the original sample of URLs used in Zittrain’s study.
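The outlink extraction step described above can be sketched with the Python standard library alone. This is an illustration under assumptions, not the code used to build the dataset: the class and function names are hypothetical, the site is matched by a simple hostname suffix, and the input is assumed to be the original page HTML (links in archived captures may have been rewritten to point at the archive).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_links(html, page_url, site="nytimes.com"):
    """Return unique absolute http(s) links that point outside `site`."""
    parser = LinkCollector()
    parser.feed(html)
    found = []
    for href in parser.hrefs:
        absolute = urljoin(page_url, href)   # resolve relative links
        parts = urlsplit(absolute)
        host = (parts.hostname or "").lower()
        if parts.scheme not in ("http", "https"):
            continue                         # skip mailto:, javascript:, etc.
        if host == site or host.endswith("." + site):
            continue                         # skip links back to the site itself
        if absolute not in found:
            found.append(absolute)
    return found
```

Called with a page's HTML and its URL, `external_links` returns only the links that leave the site, which can then be checked against the live Web and the Wayback Machine.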

A recent, and perhaps the most comprehensive, longitudinal link-rot study from ODU, on which we collaborated, analyzed 27.3 million URLs sampled from the index of the Wayback Machine spanning more than two and a half decades. They reported that about 65% of the sampled URLs from 1996 to 2021 were found dead in 2023. A significant number of these samples did not even resolve in DNS, indicating that many of those domain names were no longer registered. They found that most of the URLs die rapidly in the first few years of their existence, but some of the longest living sites are not dead yet. Luckily, all of the dead URLs in this sample are rescued by the Wayback Machine by virtue of it being the source of the sample in the first place. This also means that the ODU study cannot report the percentages of endangered or vanished URLs, because its dataset contains no URLs that were never archived.

In summary, all of the link-rot studies, with varying numbers, indicate that the Web is brittle and an increasing number of Web resources die with the passage of time. However, we found that Web archives like the Wayback Machine play an increasingly important role in rescuing the dead Web and minimizing the fracture of the knowledge graph on the Web, but there is a lot more to do. For example, the Turn All References Blue (TARB) project has fixed more than 30 million broken links (and counting) on hundreds of wikis with the help of the InternetArchiveBot, the WaybackMedic bot, and the Wayback Machine.

While there is not a lot that can be done to resurrect the vanished Web other than attempting to find alternate locations where the content might have moved to (via projects like FABLE), we are determined to minimize the percentage of the endangered URLs. However, there are some internal and external factors that limit our ability to make it ZERO, such as, resource limitations, JavaScript-heavy pages, bot blocking, loginwalls, paywalls, deepweb, lack of timely discovery, etc. We strive to narrow down the potential loss of our cultural heritage via different means such as ingesting feeds from MediaCloud, GDELT, Wikipedia EventStream, and more recently, becoming part of the IndexNow initiative for link discovery soon after corresponding page creation or update on the Web. Moreover, we have the Save Page Now (SPN) service and urge that when you “See Something, Save Something!”. Your continued support will help us preserve the Web more and better.

NOTE: This work was presented at the IIPC WAC 2025, with the talk recording available on YouTube and slides hosted in the UNT Digital Library. It was also presented at the WADL 2025.

ACKNOWLEDGEMENTS: We thank our friends at the Pew Research Center and the Old Dominion University and our colleagues Jake LaFountain, Stephen Balbach, Chris Freeland, and Mark Graham for their help and support in this work.


Dr. Sawood Alam
Research Lead, Wayback Machine
Internet Archive

Learning from Cyberattacks

The Wayback Machine, Archive.org, Archive-it.org, and OpenLibrary.org came up in stages over the week after cyberattacks with some of the contributor features coming up over the last couple of weeks.  A few to go.  Much of the development during this time has been focused on securing the services so they can still run while attacks continue.  

The Internet Archive is adapting to a more hostile world, where DDOS attacks are recurring periodically (such as yesterday and today), and more severe attacks might happen. Our response has been to harden our services and learn from friends. This note is to share some high level findings, without being so detailed as to help those that are still attacking archive.org.

By tightening firewall technologies, we have changed how data flows through our systems to improve monitoring and control. The downside is these upgrades have forced changes to software, some of it quite old. 

The bright side is this is forcing upgrades that we have long planned or hoped for.  We are greatly helped by the free and open source community’s improving tools that can be used by large corporations as well as non-profit libraries because they are freely available.

Also, some commercial companies have offered assistance that would generally be prohibitively expensive.  We are grateful for the support.

Where the Internet Archive has always focused on building collections and preserving them, we have been starkly reminded how important reliable access is to researchers, journalists, and readers. This is leading us to install technical defenses and increase staff to improve service availability.

Libraries in general, and the Internet Archive in particular, have been under attack for many years now.  For us it started with the book publishers suing (about lending books), and now the recording industry (about 78rpm records), which is a drain on our staff and financial resources. Now recurring DDOS attacks distract us from the goals of preservation and access to our digital heritage.

We don’t know why these attacks have started recently and if they are coordinated, but we are building defenses.

We are grateful for the support from our patrons, through social media, through donations, and through offers of help, which frankly, makes it worthwhile to keep building a library for all of us.

– Brewster Kahle

In Memoriam of Python 2

Generated by Bing Image Creator

Today, on the Day of the Dead 2023, we at the Internet Archive honor the death of Python 2. Having mostly emerged from one of the greatest software upgrade SNAFUs in history, the migration from Python 2 to Python 3, we now shed a tear for that old version that served us so well.

When Python 3 was launched in 2008, it contained a number of significant improvements which nevertheless broke compatibility with the previous version of Python at the syntax, string-handling, and library level. As terrible as this sounds, breaking changes are fairly normal for a major software upgrade.
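A few well-known examples of those breaks, written as Python 3 code with comments describing the Python 2 behavior. This is a small illustrative sample, not a list of the changes that affected our own code.

```python
# Syntax: print is a function in Python 3; the Python 2 statement form
# (print "hello") is a SyntaxError here.
print("hello")

# String handling: text (str) and binary data (bytes) are distinct types.
text = "café"
raw = text.encode("utf-8")
try:
    text + raw            # Python 3 refuses to mix str and bytes
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# Arithmetic: / is true division in Python 3 (Python 2 gave 3 for 7 / 2).
true_division = 7 / 2
floor_division = 7 // 2

# Library level: modules were renamed or reorganized,
# e.g. Python 2's urllib2 became urllib.request.
import urllib.request

# dict.keys() returns a view in Python 3, not a list as in Python 2.
keys = {"a": 1}.keys()
keys_is_list = isinstance(keys, list)
```

None of these differences is dramatic in isolation; the difficulty described next comes from having to change all of them, across all dependencies, at once.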

Rather, the chaos that followed was rooted in the fact that unlike most software transitions of this sort, it could not be done incrementally. Instead of being offered a way to gradually upgrade, remaining compatible with both versions and spreading the incremental costs over time, developers were given a risky all-or-nothing choice. The result has been a reluctant, glacial, expensive migration that continues to plague the world.

At the Internet Archive, we did not begin our migration in earnest until 2021, starting with Open Library and then this year focusing on Archive.org and its underlying services. However, we are now happy to declare migration of our core storage service, S3, which underlies all of the millions of items stored in the Archive, complete. We are grateful for the intensive efforts over many months by Chris, Scott, and Tracey, and everyone who supported them!

There are just a few more projects to go, but we are nearly there. And come our next OS upgrade, Python 2 will be but the whisper of a memory, preserved in the Archive and honored on a day like today. Rest in peace, Python 2. And please stay dead.

IMLS National Leadership Grant Supports Expansion of the ARCH Computational Research Platform

In June, we announced the official launch of Archives Research Compute Hub (ARCH), our platform for supporting computational research with digital collections. The Archiving & Data Services group at IA has long provided computational research services via collaborations, dataset services, product features, and other partnerships and software development. In 2020, in partnership with our close collaborators at the Archives Unleashed project, and with funding from the Mellon Foundation, we pursued cooperative technical and community work to make text and data mining services available to any institution building, or researcher using, archival web collections. This led to the release of ARCH, with more than 35 libraries and 60 researchers and curators participating in beta testing and early product pilots. Additional work supported expanding the community of scholars doing computational research using contemporary web collections by providing technical and research support to multi-institutional research teams.

We are pleased to announce that ARCH recently received funding from the Institute of Museum and Library Services (IMLS), via their National Leadership Grants program, supporting ARCH expansion. The project, “Expanding ARCH: Equitable Access to Text and Data Mining Services,” entails two broad areas of work. First, the project will create user-informed workflows and conduct software development that enables a diverse set of partner libraries, archives, and museums to add digital collections of any format (e.g., image collections, text collections) to ARCH for users to study via computational analysis. Working with these partners will help ensure that ARCH can support the needs of organizations of any size that aim to make their digital collections available in new ways. Second, the project will work with librarians and scholars to expand the number and types of data analysis jobs and resulting datasets and data visualizations that can be created using ARCH, including allowing users to build custom research collections that are aggregated from the digital collections of multiple institutions. Expanding the ability for scholars to create aggregated collections and run new data analysis jobs, potentially including artificial intelligence tools, will enable ARCH to significantly increase the type, diversity, scope, and scale of research it supports.

Collaborators on the Expanding ARCH project include a set of institutional partners that will be closely involved in guiding functional requirements, testing designs, and using the newly-built features intended to augment researcher support. Primary institutional partners include University of Denver, University of North Carolina at Chapel Hill, Williams College Museum of Art, and Indianapolis Museum of Art, with additional institutional partners joining in the project’s second year.

Thousands of libraries, archives, museums, and memory organizations work with Internet Archive to build and make openly accessible digitized and born-digital collections. Making these collections available to as many users in as many ways as possible is critical to providing access to knowledge. We are thankful to IMLS for providing the financial support that allows us to expand the ARCH platform to empower new and emerging types of access and research.

National Library Week 2023: Brenton, user experience

To celebrate National Library Week 2023, we are introducing readers to four staff members who work behind the scenes at the Internet Archive, helping connect patrons with our collections, services and programs.

Brenton Cheng learned to program in BASIC on an Apple II Plus at age 9. His mother was one of the earliest computer programmers and his dad was a marketing consultant for technology products in Portola Valley, California. By age 12, Cheng had written a series of animated games that he put together in a hand-assembled software package. It sold about four copies.

Now, Cheng is a senior engineer at the Internet Archive, where he leads the user experience (UX) team. “Our goal is to give our patrons a great experience on the Archive.org website while making sure that under the hood, our technologies are as simple, robust and maintainable as possible,” said Cheng, who has been at the organization for seven years.

Despite his early computer exposure, Cheng wanted to study something more tangible in college. He pursued mechanical engineering and earned a bachelor’s degree from Princeton University and a master’s from Stanford University. Along the way, he developed a love of contemporary dance and improvisation. Inspired by the creativity of movement, he veered toward biomechanical engineering in graduate school. 

Entering the job market, Cheng said he wanted a flexible schedule so he would be able to take workshops and occasionally go on tour with dance companies. He was a freelance computer programmer for about a decade, then worked at Astrology.com and NBCUniversal for another 10 years. 

In 2016, Cheng said he was drawn to the Internet Archive by its mission, reputation and people. “Being in the dance world, I was constantly surrounded with all kinds of eclectic, eccentric, fascinating, brilliant people,” he said. “There were certain common elements in the way the Archive embraces and benefits from diversity. I found many artists and engineers working in novel ways. That felt very much at home.”

From his experience working with improvisation in dance, Cheng said he loves trying to create the conditions within which people contribute their best work and feel good about what they’re doing. His team is focused on fighting for users and constantly making the website better for the public. “I also serve the digital librarians who are collecting and providing content for our patrons,” Cheng said. “I am giving them the tools, platform and environment to do their magic.” 

Tell us something about your role at the Internet Archive that most people wouldn’t know about.
Simultaneously with supporting the Archive’s mission and helping our patrons, I am always holding in the back of my mind the subtext of a “small team, long term.” These ideas guide choices around process, technologies and architecture. We regularly discard choices that would entail too much complexity or require too much on-going, hands-on maintenance. And we try to resist rushing features out the door that will only add to our technical debt later.

What is the most interesting project you’ve worked on at the Internet Archive?
I set up a wiki to allow scholars to submit transcriptions of scanned Balinese palm leaves.

What has been your greatest achievement (so far) at the Internet Archive?
Creating a team that likes working together, is resilient through conflicts and pushes each other to keep getting better.

What are you reading?
The Sense of Style by Steven Pinker. It’s a contemporary writing style manual that incorporates cognitive science and linguistics and acknowledges the evolving nature of language.

Thank you for helping us increase our bandwidth

Last week the Internet Archive upped our bandwidth capacity 30%, based on increased usage and increased financial support.  Thank you.

This is our outbound bandwidth graph that has several stories to tell…

A year ago, usage was 30Gbits/sec. At the beginning of this year, we were at 40Gbits/sec, and we were handling it.  That is 13 Petabytes of downloads per month.  This served materials to millions of users: those using the Wayback Machine, listening to 78 RPMs, browsing digitized books, streaming from the TV archive, etc.  We were about the 250th most popular website according to Alexa Internet.
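The conversion from a sustained rate to monthly volume is easy to check. The sketch below assumes a 30-day month, decimal units (1 Gbit = 10^9 bits, 1 PB = 10^15 bytes), and a constant rate; the function name is hypothetical.

```python
def petabytes_per_month(gbits_per_sec, days=30):
    """Convert a sustained rate in Gbit/s to petabytes transferred per month."""
    bytes_per_sec = gbits_per_sec * 1e9 / 8      # bits to bytes
    seconds = 86_400 * days                      # seconds in the month
    return bytes_per_sec * seconds / 1e15        # bytes to petabytes


monthly_pb = petabytes_per_month(40)
```

At 40Gbits/sec this works out to just under 13 Petabytes, in line with the figure above.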

Then Covid-19 hit and demand rocketed to 50Gbits/sec and overran our network infrastructure’s ability to handle it.  So much so, our network statistics probes had difficulty collecting data (hence the white spots in the graphs).   

We bought a second router with new line cards, and got it installed and running (and none of this is easy during a pandemic), and increased our capacity from 47Gbits/sec peak to 62Gbits/sec peak.   And we are handling it better, but it is still consumed.

Alexa Internet now says we are about the 160th most popular website.

So now we are looking at the next steps up, which will take more equipment and is more wizardry, but we are working on it.

Thank you again for the support, and if you would like to donate more, please know it is going to build collections to serve millions.  https://archive.org/donate

Farewell to IE11

At the end of the movie “Titanic,” from her makeshift raft Rose Calvert promises Jack Dawson, “I will never let go,” but then, well, a floating board is only so big…

On June 1, we will gently release Internet Explorer, version 11, from the list of browsers supported on our website Archive.org into the oceanic depths of the obsolete. To give you an idea of what this means to us, a member of the UX team composed this little remembrance:

We hate you. Good-bye.

Why the ichor? Why the bile? No doubt one too many sleepless nights struggling to make our website layout work with this venerable browser, released in 2013, which lacks support for so many features that are now standard in today’s browsers: module imports, web components, CSS Grid, ES6, the list goes on. Like its ancestor IE6, version 11 has clung to life far longer than it should have.

Though Microsoft support for it will not officially end until 2025, Microsoft’s Chief of Security, Chris Jackson, recently recommended in a blog post that people stop using IE11 as their default browser. It is considered a “compatibility solution,” something you should only use for services that require it. Our analytics indicate that a mere 0.8% of our users use IE11 to browse the site. (Even worldwide usage is at 1-3%, the bulk of it from a country in which we are blocked.)

Plus, maintaining compatibility with IE11 — with its need for polyfills, transpilation, and other workarounds — gets expensive, especially for a small team such as ours. Generously supported by donations from people like you, we are committed to doing the greatest good with the resources we have, making the world’s knowledge available to as many people as possible. IE11 is a distraction, with a diminished and ever diminishing return on our efforts.

So farewell, IE11. We will sleep better and rise with a little more spring in our step, knowing that your phase with us has reached its conclusion.

Two Thin Strands of Glass

There’s a tiny strand of glass inside that thick plastic coat.

Two thin strands of glass. When combined, these two strands of glass are so thin they still wouldn’t fill a drinking straw. That’s known in tech circles as a “fiber pair,” and these two thin strands of glass carry all the information of the world’s leading archive in and out of our data centers. When you think about it, it sounds kind of crazy that it works at all, but it does. Every day. Reliably.

Except this past Monday night, here in California…

On Monday, June 24, the real world had other ideas. As a result, the Internet Archive was down for 15 hours. For Californians, this was less of a big deal: those 15 hours stretched from mid-Monday evening (9:11pm on the US West Coast), to 11:51am on Tuesday. Many Californians were asleep during several hours of that time. But in the Central European time zone (e.g. France, Germany, Italy, Poland, Tunisia), that fell on early Tuesday morning (06:11) to mid-Tuesday evening (20:51). And in the entire country of India, it was late Tuesday morning (09:41) to just after midnight on Wednesday (00:21).
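The local times can be derived mechanically from the West Coast times. The sketch below uses fixed UTC offsets for Pacific Daylight Time (UTC-7), Central European Summer Time (UTC+2), and India Standard Time (UTC+5:30). The year 2019 is an assumption (the post does not state it), chosen because June 24 fell on a Monday that year.

```python
from datetime import datetime, timedelta, timezone

# Fixed summer-time offsets (assumed; no daylight-saving lookup is done).
PDT = timezone(timedelta(hours=-7), "PDT")
CEST = timezone(timedelta(hours=2), "CEST")
IST = timezone(timedelta(hours=5, minutes=30), "IST")

# Outage window as given for the US West Coast; the year is an assumption.
start = datetime(2019, 6, 24, 21, 11, tzinfo=PDT)
end = datetime(2019, 6, 25, 11, 51, tzinfo=PDT)


def local_window(tz):
    """Return the outage start and end as strings in the given time zone."""
    fmt = "%Y-%m-%d %H:%M"
    return start.astimezone(tz).strftime(fmt), end.astimezone(tz).strftime(fmt)


duration = end - start
europe = local_window(CEST)
india = local_window(IST)
```

Central European Summer Time is nine hours ahead of the West Coast and India is twelve and a half hours ahead, so the same outage lands in the middle of the working day for both.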


New Views Stats for the New Year

We began developing a new system for counting views statistics on archive.org a few years ago. We had received feedback from our partners and users asking for more fine-grained information than the old system could provide. People wanted to know where their views were coming from geographically, and how many came from people vs. robots crawling the site.

The new system will debut in January 2019. Leading up to that in the next couple of weeks you may see some inconsistencies in view counts as the new numbers roll out across tens of millions of items.  

With the new system you will see changes on both items and collections.

Item page changes

An “item” refers to a media item on archive.org – this is a page that features a book, a concert, a movie, etc. Here are some examples of items: Jerky Turkey, Emma, Gunsmoke.

On item pages the lifetime views will change to a new number.  This new number will be a sum of lifetime views from the legacy system through 2016, plus total views from the new system for the past two years (January 2017 through December 2018). Because we are replacing the 2017 and 2018 views numbers with data from the new system, the lifetime views number for that item may go down. I will explain why this occurs further down in this post where we discuss how the new system differs from the legacy system.

Collection page changes

Soon on collection page About tabs (example) you will see 2 separate views graphs. One will be for the old legacy system views through the end of 2018. The other will contain 2 years of views data from the new system (2017 and 2018). Moving forward, only the graph representing the new system will be updated with views numbers. The legacy graph will “freeze” as of December 2018.

Both graphs will be on the page for a limited time, allowing you to compare your collections stats between the old and new systems.  We will not delete the legacy system data, but it may eventually move to another page. The data from both systems is also available through the views API.

People vs. Robots

The graph for new collection views will additionally contain information about whether the views came from known “robots” or “people.”  Known robots include crawlers from major search engines, like Google or Bing. It is important for these robots to crawl your items – search engines are a major source of traffic to all of the items on archive.org. The robots number here is your assurance that search engines know your items exist and can point users to them.  The robots numbers also include access from our own internal robots (which is generally a very small portion of robots traffic).

One note about robots: they like text-based files more than audio/visual files.  This means that text items on the archive that have a publicly accessible text file (the djvu.txt file) get more views from robots than other types of media in the archive. Search engines don’t just want the metadata about the book – they want the book itself.

“People” are a little harder to define. Our confidence about whether a view comes from a person varies – in some cases we are very sure, and in others it’s more fuzzy, but in all cases we know the view is not from a known robot. So we have chosen to class these all together as “people,” as they are likely to represent access by end users.

What counts as a view in the new system

  • Each media item in the archive has a views counter.
  • The view counter is increased by 1 when a user engages with the media file(s) in an item.
    • Media engagement includes experiencing the media through the player in the item page (pressing play on a video or audio player, flipping pages in the online bookreader, emulating software, etc.), downloading files, streaming files, or borrowing a book.
    • All types of engagements are treated in the same way – they are all views.
  • A single user can only increase the view count of a particular item once per day.
    • A user may view multiple media files in a single item, or view the same media file in a single item multiple times, but within one day that engagement will only count as 1 view.
  • Collection views are the sum of all the view counts of the items in the collection.
    • When an item is in more than one collection, the item’s view counts are added to each collection it is in. This includes “parent” collections if the item is in a subcollection.
    • When a user engages with a collection page (sorting, searching, browsing etc.), it does NOT count as a view of the collection.
    • Items sometimes move in or out of collections. The views number on a collection represents the sum of the views of the items that are in the collection at that time (e.g. the September 1, 2018 views number for the collection represents the sum of the views on items that were in the collection on September 1, 2018. If an item moves out of that collection, the collection does not lose the views from September 1, 2018.).
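The counting rules above can be sketched in a few lines. This is a simplified model, not the production system: the class and method names are hypothetical, users are identified by an opaque string, and collection totals are computed from whatever set of items is passed in rather than from a dated membership history.

```python
from collections import defaultdict
from datetime import date


class ViewCounter:
    """Count at most one view per user, per item, per day."""

    def __init__(self):
        self._seen = set()                  # (user, item, day) already counted
        self.item_views = defaultdict(int)  # item -> total views

    def record(self, user, item, day):
        """Register an engagement; return True if it added a new view."""
        key = (user, item, day)
        if key in self._seen:
            return False                    # repeat engagement on the same day
        self._seen.add(key)
        self.item_views[item] += 1
        return True

    def collection_views(self, items):
        """Sum the views of the items in a collection."""
        return sum(self.item_views.get(item, 0) for item in items)


counter = ViewCounter()
counter.record("user-1", "emma", date(2018, 9, 1))   # counts
counter.record("user-1", "emma", date(2018, 9, 1))   # same day: ignored
counter.record("user-1", "emma", date(2018, 9, 2))   # next day: counts again
```

Because an item's views are added to every collection that contains it, calling `collection_views` for a subcollection and for its parent would both include the same item's count.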

How the new system differs from the legacy system

When we designed the new system, we implemented some changes in what counted as a “view,” added some functionality, and repaired some errors that were discovered.  

  • The legacy system updated item views once per day and collection views once per month. The new system will update both item and collection views once per day.
  • The legacy system updated item views ~24 hours after a view was recorded.  The new system will update the views count ~4 days after the view was recorded. This time delay in the new system will decrease to ~24 hours at some point in the future.
  • The legacy system had no information about geographic location of users. The new system has approximate geolocation for every view. This geographic information is based on obfuscated IP addresses. It is accurate at a general level, but does not represent an individual user’s specific location.
  • The legacy system had no information about how many views were caused by robots crawling the site. The new system shows us how well the site is crawled by breaking out media access by robots (vs. interactions from people).
  • The legacy system did not count all book reader interactions as views.  The new system counts bookreader engagements as a view after 2 interactions (like page flips).
  • On audio and video items, the legacy system sometimes counted views when users saw *any* media in the item (like thumbnail images). The new system only counts engagements with the audio or video media files in an item in those media types, respectively.

In some cases, the differences above can lead to drastic changes in views numbers for both items and collections. While this may be disconcerting, we think the new system more accurately reflects end user behavior on archive.org.

If you have questions regarding the new stats system, you may email us at info@archive.org.