We began developing a new system for counting views statistics on archive.org a few years ago. We had received feedback from our partners and users asking for more fine-grained information than the old system could provide. People wanted to know where their views were coming from geographically, and how many came from people vs. robots crawling the site.
The new system will debut in January 2019. Over the next couple of weeks leading up to that, you may see some inconsistencies in view counts as the new numbers roll out across tens of millions of items.
With the new system you will see changes on both items and collections.
Item page changes
An “item” refers to a media item on archive.org – this is a page that features a book, a concert, a movie, etc. Here are some examples of items: Jerky Turkey, Emma, Gunsmoke.
On item pages the lifetime views will change to a new number. This new number will be the sum of lifetime views from the legacy system through 2016, plus total views from the new system for the past two years (January 2017 through December 2018). Because we are replacing the 2017 and 2018 views numbers with data from the new system, the lifetime views number for an item may go down. The section on how the new system differs from the legacy system, further down in this post, explains why this can happen.
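As a rough illustration of the arithmetic described above, here is a minimal Python sketch. The function name, the per-year dictionaries, and all of the numbers are made up for the example; they are not real archive.org data or code.

```python
def lifetime_views(legacy_by_year, new_by_year, cutover_year=2017):
    """Legacy views before cutover_year plus new-system views from cutover_year on.

    Both arguments are hypothetical {year: views} dictionaries.
    """
    legacy_part = sum(v for year, v in legacy_by_year.items() if year < cutover_year)
    new_part = sum(v for year, v in new_by_year.items() if year >= cutover_year)
    return legacy_part + new_part


# Invented numbers for a single item.
legacy = {2015: 400, 2016: 500, 2017: 300, 2018: 350}
new = {2017: 220, 2018: 260}

old_total = sum(legacy.values())          # what the legacy system alone would show
new_total = lifetime_views(legacy, new)   # legacy through 2016 + new system 2017-2018
print(old_total, new_total)
```

In this invented case the new system recorded fewer views for 2017 and 2018 than the legacy system did, so the item's lifetime number goes down after the switch.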
Collection page changes
Soon on collection page About tabs (example) you will see 2 separate views graphs. One will be for the old legacy system views through the end of 2018. The other will contain 2 years of views data from the new system (2017 and 2018). Moving forward, only the graph representing the new system will be updated with views numbers. The legacy graph will “freeze” as of December 2018.
Both graphs will be on the page for a limited time, allowing you to compare your collections stats between the old and new systems. We will not delete the legacy system data, but it may eventually move to another page. The data from both systems is also available through the views API.
People vs. Robots
The graph for new collection views will additionally contain information about whether the views came from known “robots” or “people.” Known robots include crawlers from major search engines, like Google or Bing. It is important for these robots to crawl your items – search engines are a major source of traffic to all of the items on archive.org. The robots number here is your assurance that search engines know your items exist and can point users to them. The robots numbers also include access from our own internal robots (which is generally a very small portion of robots traffic).
One note about robots: they like text-based files more than audio/visual files. This means that text items on the archive that have a publicly accessible text file (the djvu.txt file) get more views from robots than other types of media in the archive. Search engines don’t just want the metadata about the book – they want the book itself.
“People” are a little harder to define. Our confidence about whether a view comes from a person varies – in some cases we are very sure, and in others it’s more fuzzy, but in all cases we know the view is not from a known robot. So we have chosen to class these all together as “people,” as they are likely to represent access by end users.
What counts as a view in the new system
- Each media item in the archive has a views counter.
- The view counter is increased by 1 when a user engages with the media file(s) in an item.
- Media engagement includes experiencing the media through the player in the item page (pressing play on a video or audio player, flipping pages in the online bookreader, emulating software, etc.), downloading files, streaming files, or borrowing a book.
- All types of engagements are treated in the same way – they are all views.
- A single user can only increase the view count of a particular item once per day.
- A user may view multiple media files in a single item, or view the same media file in a single item multiple times, but within one day that engagement will only count as 1 view.
- Collection views are the sum of all the view counts of the items in the collection.
- When an item is in more than one collection, the item’s view counts are added to each collection it is in. This includes “parent” collections if the item is in a subcollection.
- When a user engages with a collection page (sorting, searching, browsing etc.), it does NOT count as a view of the collection.
- Items sometimes move in or out of collections. The views number on a collection represents the sum of the views of the items that were in the collection at that time. For example, the September 1, 2018 views number for a collection is the sum of the views on items that were in the collection on September 1, 2018; if an item later moves out of that collection, the collection does not lose its views from September 1, 2018.
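The rules in this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual archive.org implementation: the `ViewCounter` class, the `user_id` values (the post does not say how a "user" is identified), and the item and collection names are all hypothetical.

```python
from collections import defaultdict
from datetime import date


class ViewCounter:
    """Toy model: one view per (user, item, day), summed into collections."""

    def __init__(self):
        self._seen = set()                        # (user_id, item_id, day) already counted
        self.item_views = defaultdict(int)        # (item_id, day) -> views
        self.collection_views = defaultdict(int)  # (collection_id, day) -> views

    def record(self, user_id, item_id, day, collections):
        """Record one media engagement; return True if it added a view.

        `collections` is every collection the item belongs to on `day`,
        including parent collections (assumed to be supplied by the caller).
        """
        key = (user_id, item_id, day)
        if key in self._seen:
            return False  # same user, same item, same day: still just 1 view
        self._seen.add(key)
        self.item_views[(item_id, day)] += 1
        for collection_id in set(collections):
            self.collection_views[(collection_id, day)] += 1
        return True


counter = ViewCounter()
sep1, sep2 = date(2018, 9, 1), date(2018, 9, 2)
counter.record("user-a", "emma", sep1, ["books", "texts"])  # counts
counter.record("user-a", "emma", sep1, ["books", "texts"])  # ignored, same day
counter.record("user-b", "emma", sep1, ["books", "texts"])  # counts, different user
counter.record("user-a", "emma", sep2, ["texts"])           # item has left "books"
```

Because each view is credited to the collections the item was in on that day, the "books" collection keeps its two views from September 1 even though the item is no longer in it on September 2.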
How the new system differs from the legacy system
When we designed the new system, we implemented some changes in what counted as a “view,” added some functionality, and repaired some errors that were discovered.
- The legacy system updated item views once per day and collection views once per month. The new system will update both item and collection views once per day.
- The legacy system updated item views ~24 hours after a view was recorded. The new system will update the views count ~4 days after the view was recorded. This time delay in the new system will decrease to ~24 hours at some point in the future.
- The legacy system had no information about geographic location of users. The new system has approximate geolocation for every view. This geographic information is based on obfuscated IP addresses. It is accurate at a general level, but does not represent an individual user’s specific location.
- The legacy system had no information about how many views were caused by robots crawling the site. The new system shows us how well the site is crawled by breaking out media access by robots (vs. interactions from people).
- The legacy system did not count all book reader interactions as views. The new system counts bookreader engagements as a view after 2 interactions (like page flips).
- On audio and video items, the legacy system sometimes counted views when users saw *any* media in the item (like thumbnail images). The new system counts only engagements with audio files in audio items and with video files in video items.
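The last two rules in this list can be expressed as simple checks. The Python sketch below is an assumption-laden illustration: the function names and the "audio"/"video"/"image" labels are invented for the example, and it reads "after 2 interactions" as "once the second interaction happens", which the post does not spell out.

```python
BOOKREADER_MIN_INTERACTIONS = 2  # threshold taken from the list above


def bookreader_is_view(interaction_count):
    """True once a bookreader session reaches 2 interactions (e.g. page flips)."""
    return interaction_count >= BOOKREADER_MIN_INTERACTIONS


def av_engagement_is_view(item_mediatype, file_kind):
    """For audio/video items, count only engagement with a file of the same kind.

    `item_mediatype` and `file_kind` use hypothetical labels such as
    "audio", "video", and "image".
    """
    if item_mediatype not in ("audio", "video"):
        raise ValueError("this check only applies to audio and video items")
    return file_kind == item_mediatype


print(bookreader_is_view(1))                    # a single page flip
print(bookreader_is_view(2))                    # second interaction
print(av_engagement_is_view("video", "image"))  # thumbnail on a video item
print(av_engagement_is_view("video", "video"))  # playing the video itself
```

Under these assumptions, a single page flip or a thumbnail load no longer produces a view, which is one way an item's numbers can drop relative to the legacy system.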
In some cases, the differences above can lead to drastic changes in views numbers for both items and collections. While this may be disconcerting, we think the new system more accurately reflects end user behavior on archive.org.
If you have questions regarding the new stats system, you may email us at email@example.com.
Thanks, this is something people were asking for in Italy as well. Other than known robots, will all downloads still count towards “human” pageviews? I’m thinking of torrent downloads, for instance, which may show up with unusual user agents.
Thanks for the new stats and the blog post explaining them, Alexis.
I think video producers will appreciate the new stats.
I’ll forward this new explanation to producers who ask about the stats in the Community Media Archive.
Some links on how difficult it is to get good video consumption metrics and many others: https://threader.app/thread/1078003966863200256
Great article and right on the money. Basically any attempt to get a fine-grained assessment of b—s–t, just gives one a fine-grained view of the same b—s–t. Note, I am not saying here that we should ‘not’ try to make the ‘views’ thing better, only that we should not delude ourselves into thinking that this ‘fine-grained’ version is any ‘better’ when in reality it is not and I can give numerous reasons why that is the case. Until we have a better explanation of what those bots are doing and where they are coming from, such assurances that this ‘new’ fine-grained approach will be any better are meaningless. Of course the question then becomes “why change it for something which cannot be shown to be any better?” Just rearranging the chairs on the Titanic when the end result is the same.
What happens when bots are used to harvest items for individuals to view off-line. This could never give a reliable measurement of human-views since the item can be viewed by multiple individuals after it is harvested. We would need much more information about who or what is accessing the items and thinking that what we have here is fine-grained does not in reality make it so.
Happy New Year