Tag Archives: documentation

Updates and fixes to archive.org download counts for collections of items

Every month, we look over the total download counts for all public items at archive.org.  We sum item counts into their collections.  At the end of 2014, we found various source reliability issues, as well as overcounting for “top collections” and many other issues.

archive.org public items tracked over time

To address these problems, we:

  • Built a new system that uses our database (DB) for item download counts, instead of our less reliable (and more prone to “drift”) SOLR search engine (SE).
  • Changed the monthly saved data from JSON and PHP serialized flatfiles to a new DB table, which is much easier to use now!
  • Fixed overcounting issues for collections: texts, audio, etree, movies
  • Fixed various overcounting issues caused by not unique-ing <collection> and <contributor> tags (more below)
  • Fixed character encoding issues on <contributor> tags

Bonus points!

  • We now track *all collections*.  Previously, we only tracked items tagged:
    • <mediatype> texts
    • <mediatype> etree
    • <mediatype> audio
    • <mediatype> movies
  • For items whose <contributor> tags we track (texts items), we now have a “Contributor page” that shows a table of historical data.
  • Graphs are now “responsive” (scale in width based on browser/mobile width)

 

The Overcount Issue for top collection/mediatypes

  • In the below graph, mediatypes and collections are shown horizontally, with a sample “collection hierarchy” today.
  • For each collection/mediatype, we show one example item (A, B, C, and D), with its downloads/streams/views count next to it in parentheses.  So these are four items, spanning four collections, that happen to be in a collection hierarchy (a single item can belong to multiple collections at archive.org).
  • The Old Way had a critical flaw: it summed *all* sub-collection counts, when really it should have summed only the *direct child* sub-collection counts (or gone with our New Way instead).

overcount

So we now treat <mediatype> tags like <collection> tags for counting purposes, and we unique all <collection> tags on each item, so that items with slightly nonideal (repeated) tags no longer cause another kind of overcounting.
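To make the difference concrete, here is a minimal sketch of the three counting strategies. The hierarchy, item names, tag lists, and counts are all hypothetical, and the functions illustrate the idea rather than our actual implementation.

```python
from collections import defaultdict

# Hypothetical hierarchy: texts > level1 > level2 > level3
CHILDREN = {"texts": ["level1"], "level1": ["level2"],
            "level2": ["level3"], "level3": []}

# Hypothetical items: id -> (collection tags, download count).
# The first tag is the item's most specific collection. Item C carries a
# repeated "level1" tag to mimic slightly messy metadata.
ITEMS = {
    "A": (["texts"], 10),
    "B": (["level1", "texts"], 20),
    "C": (["level2", "level1", "texts", "level1"], 30),
    "D": (["level3", "level2", "level1", "texts"], 40),
}

def new_way(items):
    """Add each item's count once to every *unique* tag it carries."""
    totals = defaultdict(int)
    for tags, downloads in items.values():
        for tag in set(tags):  # unique the tags so repeats count once
            totals[tag] += downloads
    return dict(totals)

def descendants(collection):
    """All sub-collections below `collection`, at any depth."""
    found = []
    for child in CHILDREN[collection]:
        found.append(child)
        found.extend(descendants(child))
    return found

def own_count(collection, items):
    """Downloads of items whose most specific collection is `collection`."""
    return sum(d for tags, d in items.values() if tags[0] == collection)

def old_way(collection, items):
    """Flawed: adds the totals of ALL descendants, so deep items repeat."""
    totals = new_way(items)
    return own_count(collection, items) + sum(
        totals[c] for c in descendants(collection))

def direct_children_way(collection, items):
    """Corrected variant of the Old Way: add only direct children."""
    totals = new_way(items)
    return own_count(collection, items) + sum(
        totals[c] for c in CHILDREN[collection])

print(new_way(ITEMS)["texts"])              # 100
print(old_way("texts", ITEMS))              # 210
print(direct_children_way("texts", ITEMS))  # 100
```

In this toy data the four items have 100 downloads in total, and the Old Way reports 210 for the top collection because item D is counted once per ancestor level.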

 

… and one more update from Feb/1:

We graph the “difference” between the absolute download counts for the current month and the prior month, for each month we have data for.  This gives us graphs that show downloads/month over time.  However, values can easily go *negative* in various scenarios (which is *wickedly* confusing to our poor users!)

Here’s that situation:

A collection has a really *hot* item one month, racking up downloads for that collection.  The next month, a DMCA takedown (or something else) makes the item unavailable, so it is no longer counted.  When the counts are summed over the collection’s public items in the next month’s run, the collection’s total can plummet.  So that collection would have a negative (net) downloads count change for this next month!

Here’s our fix:

Use the current month’s collection “item membership” list for both the current month *and* the prior month.  Sum the counts for all those items in each month, and graph the difference between the two sums.  Because an item’s cumulative download count does not normally decrease, in just about every situation that remains the graphed monthly download counts will be nonnegative (zero or positive).
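A minimal sketch of the two calculations, using made-up item names and counts. Treating an item with no prior-month count as zero is an assumption of this sketch, not something stated above.

```python
def naive_delta(prior_counts, current_counts):
    """Old approach: difference of the two monthly collection totals."""
    return sum(current_counts.values()) - sum(prior_counts.values())

def membership_delta(prior_counts, current_counts):
    """Fixed approach: use the current month's membership for both months."""
    members = current_counts.keys()
    current_total = sum(current_counts[item] for item in members)
    # Assumption: an item with no prior-month count contributes 0.
    prior_total = sum(prior_counts.get(item, 0) for item in members)
    return current_total - prior_total

# Hypothetical cumulative download counts per public item in one collection.
prior = {"hot_item": 5000, "steady_item": 100}
current = {"steady_item": 130, "new_item": 20}  # hot_item was taken down

print(naive_delta(prior, current))       # -4950
print(membership_delta(prior, current))  # 50
```

The removed item drops out of both sums, so the graphed value reflects only downloads of items that are still in the collection.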

 

 

How to use the Virtual Machine for Researchers

Some researchers who are working with the Internet Archive, such as those at the University of Massachusetts, have wanted closer access to some of our collections. We are learning how to support this type of “on-campus” use of the collections. This post documents how to use these machines.

Who can have access?

This is for joint projects with the Archive, usually an academic program, often funded by NSF.  So this is not a general offering, but more of a special case. Most researchers use the collections by downloading materials to their home machines. We have tools to help with this, and use “GNU Parallel” to make it go fast.

How to get an account?

Is there an agreement? Yes, there usually is. This is usually administered by Alexis Rossi.  All in all, these are shared machines, so please be respectful of others’ data and use of the machines.

How do I get access to the VM? To get an account you will need to forward a public SSH key to Jake Johnson. Please follow the steps below for more details.

Generate your SSH keys.

These instructions assume you’re on a Unix-like operating system. If you’re using Windows please see Mike Lichtenberg’s blog post, Generating SSH Keys on Windows.

  1. If you don’t already have an ~/.ssh directory, you will need to create one to store your SSH configuration files and keys:
    $ mkdir -p ~/.ssh
  2. Move into the ~/.ssh directory:
    $ cd ~/.ssh
  3. Create your keys (replacing {username} with the username you would like to use to login to the VM):
    $ bash -c 'ssh-keygen -t rsa -b 2048 -C "{username}@researcher0.fnf.archive.org"'
  4. You will be prompted to enter a filename to which your private SSH key will be saved. Use something like id_rsa.{username}@researcher0.fnf.archive.org (again replacing {username} with the username that you will use to log in to the VM):
    Enter file in which to save the key (~/.ssh/id_rsa): id_rsa.{username}@researcher0.fnf.archive.org
  5. You will be prompted again to enter a passphrase. Enter a passphrase, and continue.
    Enter passphrase (empty for no passphrase): [enter your passphrase]
    Enter same passphrase again: [enter your passphrase again]

You should now have two new files in your ~/.ssh directory, a private key and a public key. For example:

~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org
~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org.pub

Your public key is the key suffixed with “.pub”.

Adding your public key to the VM

Forward your public key to Jake Johnson. He will create a user for you and add your public key to the VM. Once you receive notification that your user has been created and your key has been added to the VM, proceed to the next step.

Logging into the VM via SSH

You can now use your private key to log in to the VM with the following command:

$ ssh -i ~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org {username}@researcher0.fnf.archive.org

How do I bulk download data from archive.org onto the VM?

We recommend using wget to download data from archive.org. Please see our blog post, Downloading in bulk using wget, for more details.

If you have privileges to an access-restricted collection, you can use your archive.org cookies to download data from this collection by adding the following --header flag to your wget command:

--header "Cookie: logged-in-user={email%40example.com}; logged-in-sig={private};"

(Note: replace {email%40example.com} with the email address associated with your archive.org account (encoding @ as %40), and {private} with the value of your logged-in-sig cookie.)

You can retrieve your logged-in-sig cookie using the following steps:

  1. In Firefox, go to archive.org and log in with your account
  2. Go to Firefox > Preferences
  3. Click on the Privacy tab
  4. Select “Use custom settings for history” in the drop-down menu in the History section
  5. Click the “Show cookies” button
  6. Find archive.org in the list of cookies and expand to show options
  7. Select the logged-in-sig cookie. The long string in the “Content:” field is the value of your logged-in-sig cookie. This is the value that you will need for your wget command (specifically, replacing {private} in the --header flag mentioned above).
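If you are scripting your downloads, the header can be assembled programmatically. The sketch below is only an illustration: the function names, the email address, and the cookie value are placeholders, and it simply builds the same --header argument described above.

```python
from urllib.parse import quote

def cookie_header(email, logged_in_sig):
    """Build the Cookie header value, encoding '@' in the email as %40."""
    user = quote(email, safe="")
    return f"Cookie: logged-in-user={user}; logged-in-sig={logged_in_sig};"

def wget_args(url, email, logged_in_sig):
    """Argument list for wget, suitable for passing to subprocess.run()."""
    return ["wget", "--header", cookie_header(email, logged_in_sig), url]

# Placeholder values; substitute your own account email and cookie value.
print(cookie_header("user@example.com", "SECRET"))
# Cookie: logged-in-user=user%40example.com; logged-in-sig=SECRET;
```

Passing the arguments as a list avoids shell quoting problems with the semicolons and spaces in the header.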

How do I bulk download metadata from archive.org onto the VM?

You can download all of an item’s metadata via our Metadata API.
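As a rough sketch, the Metadata API returns JSON for an item at https://archive.org/metadata/ followed by the item identifier. The helper names and the identifier below are hypothetical, and error handling and retries are left out.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

def metadata_url(identifier):
    """URL of the Metadata API record for one item identifier."""
    return "https://archive.org/metadata/" + quote(identifier, safe="")

def fetch_metadata(identifier, timeout=30):
    """Download and parse one item's metadata (makes a network request)."""
    with urlopen(metadata_url(identifier), timeout=timeout) as response:
        return json.load(response)

# "example-item" is a placeholder identifier, not a real item.
print(metadata_url("example-item"))
# https://archive.org/metadata/example-item
```

To process many items, call fetch_metadata once per identifier and please keep the request rate modest, since the machines are shared.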

How do I generate a list of identifiers for downloading data and metadata from collections in bulk?

You can use our advanced search engine. Please refer to the Create a file with the list of identifiers section in our Downloading in bulk using wget blog post.
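For scripted use, the search URL can be built in code. This sketch assumes the advanced search endpoint and query parameters (q, fl[], rows, page, output) follow the URL format shown in that blog post. The collection name is a placeholder, so check the parameters against the post before relying on them.

```python
from urllib.parse import urlencode

def identifier_search_url(collection, rows=50, page=1):
    """Advanced search URL that lists a collection's identifiers as CSV."""
    params = [
        ("q", f"collection:{collection}"),  # search query
        ("fl[]", "identifier"),             # return only the identifier field
        ("rows", rows),                     # results per page
        ("page", page),
        ("output", "csv"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)

# "example_collection" is a placeholder collection name.
print(identifier_search_url("example_collection"))
```

Saving the response of that URL to a file gives a list of identifiers that can then be fed to wget.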

How can I monitor usage of the VM?

You can monitor usage of the VM via MRTG (Multi Router Traffic Grapher) here: http://researcher0.fnf.archive.org:8088/mrtg/