Tag Archives: documentation

archive.org download counts of collections of items: updates and fixes

Posted on January 26, 2015 by traceypooh

Every month, we look over the total download counts for all public items at archive.org. We sum item counts into their collections. At year end 2014, we found various source reliability issues, as well as overcounting for “top collections” and many other issues.

archive.org public items tracked over time

To address the problems, we:

- Built a new system that uses our database (DB) for item download counts, instead of our less reliable (and more prone to “drift”) SOLR search engine (SE).
- Changed the monthly saved data from JSON and PHP serialized flatfiles to a new DB table, which is much easier to use now!
- Fixed overcounting issues for collections: texts, audio, etree, movies
- Fixed various overcounting issues related to not unique-ing <collection> and <contributor> tags (more below)
- Fixed character encoding issues on <contributor> tags

Bonus points!

We now track *all collections*. Previously, we only tracked items tagged:
- <mediatype> texts
- <mediatype> etree
- <mediatype> audio
- <mediatype> movies
For items whose <contributor> tags we track (texts items), we now have a “Contributor page” that shows a table of historical data.
Graphs are now “responsive” (scale in width based on browser/mobile width)

The Overcount Issue for top collection/mediatypes

In the graph below, mediatypes and collections are shown horizontally, with a sample “collection hierarchy” as it stands today.
For each collection/mediatype, we show one example item (A, B, C, and D), with its downloads/streams/views count next to it in parentheses. So these are four items, spanning four collections, that happen to be in a collection hierarchy (a single item can belong to multiple collections at archive.org).
The Old Way had a critical flaw: it summed all sub-collection counts, when really it should have summed only *direct child* sub-collection counts (or gone with our New Way instead).

So we now treat <mediatype> tags like <collection> tags for counting purposes, and we unique all <collection> tags on each item, so that items with slightly messy tags (for example, a repeated <collection> tag) do not cause another kind of overcounting.
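
A minimal sketch of this counting rule in Python. It is an illustration only, not the production code: the item records, the field names (`collections`, `mediatype`, `downloads`), the collection name `some_band`, and the function name are all made up for the example.

```python
from collections import defaultdict


def sum_downloads_by_collection(items):
    """Credit each item's downloads once to every distinct tag it carries."""
    totals = defaultdict(int)
    for item in items:
        # Treat the mediatype like a collection, and unique the tags so a
        # repeated <collection> tag cannot count the same item twice.
        tags = set(item["collections"]) | {item["mediatype"]}
        for tag in tags:
            totals[tag] += item["downloads"]
    return dict(totals)


# Hypothetical items: B carries the "etree" tag twice.
items = [
    {"id": "A", "mediatype": "audio", "collections": ["etree"], "downloads": 1},
    {
        "id": "B",
        "mediatype": "audio",
        "collections": ["etree", "etree", "some_band"],
        "downloads": 10,
    },
]
totals = sum_downloads_by_collection(items)
```

Because every item contributes to a collection at most once, nesting collections inside each other or repeating a tag no longer inflates the totals.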

… and one more update from Feb/1:

We graph the “difference” between the absolute download counts for the current month and the prior month, for each month for which we have data. This gives us graphs that show downloads/month over time. However, values can easily go *negative* in various scenarios (which is *wickedly* confusing to our poor users!)

Here’s that situation:

A collection has a really *hot* item one month, racking up downloads for that collection. The next month, a DMCA takedown (or something similar) removes the item from availability, so it is no longer counted. When the counts are summed over the collection’s public items again in the next month’s run, the collection’s downloads can plummet. So that collection would show a negative (net) change in downloads for that month!

Here’s our fix:

Use the current month’s collection “item membership” list for both the current month *and* the prior month. Sum the counts for all those items for each month, and graph the difference between the two sums. In just about every remaining situation, graphed monthly download counts will be nonnegative (positive or zero).
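
The fix can be sketched in Python as follows. This is a sketch under stated assumptions, not the actual implementation: the dictionaries mapping item identifiers to absolute counts, the item names, and the function name are hypothetical.

```python
def monthly_downloads(current_members, current_counts, prior_counts):
    """Downloads gained this month, using this month's membership for both sums."""
    current_total = sum(current_counts.get(item, 0) for item in current_members)
    prior_total = sum(prior_counts.get(item, 0) for item in current_members)
    return current_total - prior_total


# Hypothetical data: "hot_item" was taken down after last month's run.
prior_counts = {"hot_item": 1000, "steady_item": 10}
current_counts = {"steady_item": 15}
current_members = {"steady_item"}

# Old approach: each month summed over whatever was public at the time.
naive = sum(current_counts.values()) - sum(prior_counts.values())
# New approach: both months summed over the current membership list.
fixed = monthly_downloads(current_members, current_counts, prior_counts)
```

Here the old approach yields -995 while the new one yields 5, because the removed item drops out of both sums instead of only the current one.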

How to use the Virtual Machine for Researchers

Posted on July 4, 2013 by Brewster Kahle

Some researchers who are working with the Internet Archive, such as those at the University of Massachusetts, have wanted closer access to some of our collections. We are learning how to support this type of “on-campus” use of the collections. This post documents how to use these machines.

Who can have access?

This is for joint projects with the archive, usually an academic program, often funded by NSF. So this is not a general offering, but more of a special case. Most researchers use the collections by downloading materials to their home machines. We have tools to help with this, and use “GNU Parallel” to make it go fast.

How to get an account?

Is there an agreement? Yes, there usually is. This is usually administered by Alexis Rossi. All in all, these are shared machines, so please be respectful of others’ data and use of the machines.

How do I get access to the VM? To get an account you will need to forward a public SSH key to Jake Johnson. Please follow the steps below for more details.

Generate your SSH keys.

These instructions assume you’re on a Unix-like operating system. If you’re using Windows please see Mike Lichtenberg’s blog post, Generating SSH Keys on Windows.

If you don’t already have an ~/.ssh directory, you will need to create one to store your SSH configuration files and keys:
```
$ mkdir -p ~/.ssh
```
Move into the ~/.ssh directory:
```
$ cd ~/.ssh
```
Create your keys (replacing {username} with the username you would like to use to log in to the VM):
```
$ bash -c 'ssh-keygen -t rsa -b 2048 -C "{username}@researcher0.fnf.archive.org"'
```
You will be prompted to enter a filename to which your private SSH key will be saved. Use something like id_rsa.{username}@researcher0.fnf.archive.org (again replacing {username} with the username that you will use to log in to the VM):
```
Enter file in which to save the key (~/.ssh/id_rsa): id_rsa.{username}@researcher0.fnf.archive.org
```

You will be prompted again to enter a passphrase. Enter a passphrase, and continue.

Enter passphrase (empty for no passphrase): [enter your passphrase]
Enter same passphrase again: [enter your passphrase again]

You should now have two new files in your ~/.ssh directory, a private key and a public key. For example:

~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org
~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org.pub

Your public key is the key suffixed with “.pub”.

Adding your public key to the VM

Forward your public key to Jake Johnson. He will create a user for you, and add your public key to the VM. Once you receive notification that your user has been created and your key successfully added to the VM, proceed to the next step.

Logging into the VM via SSH

You can now use your private key to log in to the VM with the following command:

$ ssh -i ~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org {username}@researcher0.fnf.archive.org

How do I bulk download data from archive.org onto the VM?

We recommend using wget to download data from archive.org. Please see our blog post, Downloading in bulk using wget, for more details.

If you have privileges to an access-restricted collection, you can use your archive.org cookies to download data from this collection by adding the following --header flag to your wget command:

--header "Cookie: logged-in-user={email%40example.com}; logged-in-sig={private};"

(Note: replace {email%40example.com} with the email address associated with your archive.org account (encoding @ as %40), and {private} with the value of your logged-in-sig cookie.)

You can retrieve your logged-in-sig cookie using the following steps:

In Firefox, go to archive.org and log in with your account
Go to Firefox > Preferences
Click on the Privacy tab
Select “Use custom settings for History” in the drop-down menu in the History section
Click the “Show cookies” button
Find archive.org in the list of cookies and expand to show options
Select the logged-in-sig cookie. The long string in the “Content:” field is the value of your logged-in-sig cookie. This is the value that you will need for your wget command (specifically, replacing {private} in the --header flag mentioned above).
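
If you script your downloads, the header string can be assembled rather than typed by hand. A minimal Python sketch, assuming you already have your email address and the logged-in-sig value; the function name and the sample values are hypothetical:

```python
from urllib.parse import quote


def cookie_header(email, logged_in_sig):
    """Return the value to pass to wget's --header flag."""
    # quote() with safe="" percent-encodes "@" as %40.
    return "Cookie: logged-in-user={}; logged-in-sig={};".format(
        quote(email, safe=""), logged_in_sig
    )


# Placeholder values, not real credentials.
header = cookie_header("researcher@example.com", "SIGVALUE")
```

The resulting string is what goes inside the quotes after --header in your wget command.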

How do I bulk download metadata from archive.org onto the VM?

You can download all of an item’s metadata via our Metadata API.

How do I generate a list of identifiers for downloading data and metadata from collections in bulk?

You can use our advanced search engine. Please refer to the Create a file with the list of identifiers section in our Downloading in bulk using wget blog post.
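
As a rough illustration, a search URL for one collection’s identifiers might be built like this in Python. Everything specific here is an assumption that is not stated in this post and should be checked against the wget blog post: the `advancedsearch.php` endpoint and its `q`, `fl[]`, `rows` and `output` parameters, as well as the function name.

```python
from urllib.parse import urlencode


def identifier_search_url(collection, rows=50):
    """Build a search URL that asks only for the identifier field."""
    # Assumed parameter names; verify them before relying on this.
    params = [
        ("q", "collection:" + collection),
        ("fl[]", "identifier"),
        ("rows", str(rows)),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


url = identifier_search_url("etree")
```

Building the query with urlencode keeps characters such as “:” and “[]” correctly escaped.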

How can I monitor usage of the VM?

You can monitor usage of the VM via MRTG (Multi Router Traffic Grapher) here: http://researcher0.fnf.archive.org:8088/mrtg/

The Internet Archive Metadata API

Posted on July 4, 2013 by jeff kaplan

The Metadata API is intended for fast, flexible, and reliable reading and writing of Internet Archive items.

Metadata Read API

The Metadata Read API is the fastest and most flexible way to retrieve metadata for items on archive.org. We’ve seen upwards of 500 reads per second for some collections!

Overview

Returns all of an item’s metadata in JSON.

Resource URL

http://archive.org/metadata/:identifier

Parameters

identifier: The globally unique ID of a given item on archive.org.

Usage

For example, frenchenglishmed00gorduoft is the identifier for http://archive.org/details/frenchenglishmed00gorduoft. You can retrieve all of this item’s metadata from the Metadata API using the following curl command:

$ curl http://archive.org/metadata/frenchenglishmed00gorduoft

The Metadata API also supports HTTPS:

$ curl https://archive.org/metadata/frenchenglishmed00gorduoft

Sub-item Access

The Metadata API returns all of an item’s metadata by default. You can access specific metadata elements like so:

http://archive.org/metadata/:identifier/metadata
http://archive.org/metadata/:identifier/server
http://archive.org/metadata/:identifier/files_count
http://archive.org/metadata/:identifier/files?start=1&count=2
http://archive.org/metadata/:identifier/metadata/collection
http://archive.org/metadata/:identifier/metadata/collection/0
http://archive.org/metadata/:identifier/metadata/title
http://archive.org/metadata/:identifier/files/0/name
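
The same reads can be scripted. Below is a minimal Python sketch using only the standard library; the helper names are hypothetical, and it uses the HTTPS form of the resource URL. Only `read_metadata` touches the network.

```python
import json
from urllib.request import urlopen

METADATA_BASE = "https://archive.org/metadata"


def metadata_url(identifier, *path):
    """Build a Metadata Read API URL, optionally narrowed to a sub-item path."""
    return "/".join([METADATA_BASE, identifier, *map(str, path)])


def read_metadata(identifier, *path):
    """Fetch and decode the JSON the API returns (makes a network request)."""
    with urlopen(metadata_url(identifier, *path)) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `read_metadata("frenchenglishmed00gorduoft", "metadata", "title")` would request the `/metadata/title` sub-item shown above.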

Metadata Write API

The metadata write API is intended to make changes to metadata timely, safe and flexible.
It uses draft 02 of the JSON Patch standard.

Overview

timely

Callers receive results (success or failure) immediately.
Changes are quickly reflected through the metadata read API.

safe

All writes pass through the catalog, so all changes are recorded.
All writes are checked before they’re submitted to the catalog.
If there’s a problem, no catalog task is created. Goal: no redrows!
All checks are repeated when the catalog task is executed.

flexible

Supports arbitrary changes to multiple metadata targets through a unified API.
Changes are easy — no string concatenation or libraries needed.

Resource URL

http://archive.org/metadata/:identifier

Parameters

identifier: The globally unique ID of a given item on archive.org.

Targets

The Metadata Write API supports three kinds of target:

metadata: Changes item_meta.xml (e.g. http://archive.org/metadata/:identifier/metadata).
files/:filename: Changes the file entry in the item’s files.xml (e.g. http://archive.org/metadata/:identifier/files).
other: Changes other.json (e.g. http://archive.org/metadata/:identifier/other).

For XML targets (e.g. ‘metadata’ and ‘files’) patches should be composed against their JSON representation, as found in metadata read API results.

Usage

As an HTTP POST or GET to:

http://archive.org/metadata/:identifier

With the following url-encoded arguments:

-target: The metadata target you would like to modify.
-patch: The patch you are submitting to the Metadata API.
access: Your IA-S3 access key.
secret: Your IA-S3 secret key.

Authentication

NOTE: These calls must be made with appropriate authentication – at the moment, this means passing your Archive.org IA-S3 credentials. Please visit http://archive.org/account/s3.php to obtain your IA-S3 access key and secret key.

Patches

Patches are JSON strings. They should comply with the draft JSON Patch standard:

http://tools.ietf.org/html/draft-ietf-appsawg-json-patch-02

Examples

Writing to an item’s meta.xml

Add ‘scan_sponsor’ with value ‘Starfleet’ to target ‘metadata’ to the item metadata_test_item:

#!/bin/bash
ACCESS=<redacted>
SECRET=<redacted>
IDENTIFIER=metadata_test_item
TARGET=metadata
PATCH='{"add":"/scan_sponsor", "value":"Starfleet"}'

curl --data-urlencode -target=$TARGET \
     --data-urlencode -patch="$PATCH" \
     --data-urlencode access=$ACCESS \
     --data-urlencode secret=$SECRET \
     http://archive.org/metadata/$IDENTIFIER

returns a JSON object, like the following:

{"success":true,"task_id":114350522,"log":"http://www.us.archive.org/log_show.php?task_id=114350522"}

or perhaps

{"error":"Some problem applying the patch"}
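
For callers who prefer Python to curl, the same request might look like the sketch below. It is not an official client: the function names are hypothetical, it posts to the HTTPS form of the resource URL, and it assumes the reply is always JSON, as in the examples above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_write_request(identifier, target, patch, access, secret):
    """Return (url, body) for a Metadata Write API POST."""
    url = "https://archive.org/metadata/" + identifier
    fields = {
        "-target": target,
        # json.dumps guarantees the patch is valid JSON with plain ASCII quotes.
        "-patch": json.dumps(patch),
        "access": access,
        "secret": secret,
    }
    return url, urlencode(fields).encode("ascii")


def write_metadata(identifier, target, patch, access, secret):
    """Submit the patch and decode the JSON reply (makes a network request)."""
    url, body = build_write_request(identifier, target, patch, access, secret)
    with urlopen(url, data=body) as response:
        return json.loads(response.read().decode("utf-8"))


# Same patch as the shell example, with placeholder credentials.
url, body = build_write_request(
    "metadata_test_item",
    "metadata",
    {"add": "/scan_sponsor", "value": "Starfleet"},
    "ACCESS",
    "SECRET",
)
```

Serializing the patch from a dictionary avoids hand-quoting mistakes in the JSON string.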

Writing to a files.xml entry

#!/bin/bash
ACCESS=<redacted> 
SECRET=<redacted>
IDENTIFIER=metadata_test_item
TARGET='files/glogo.png'
PATCH='{"add":"/camera", "value":"Canon A150"}'

curl --data-urlencode -target=$TARGET \
     --data-urlencode -patch="$PATCH" \
     --data-urlencode access=$ACCESS \
     --data-urlencode secret=$SECRET \
     http://archive.org/metadata/$IDENTIFIER

Writing to metadata_test_item/foo_client.json

NOTE: Keys and values are binary-safe and unrestricted

#!/bin/bash
ACCESS=<redacted> 
SECRET=<redacted>
IDENTIFIER=metadata_test_item
TARGET='foo_client'
PATCH='{"add":"/of concern to foo", "value":{"foo-ness":["buckle", "shoe"]}}'

curl --data-urlencode -target=$TARGET \
     --data-urlencode -patch="$PATCH" \
     --data-urlencode access=$ACCESS \
     --data-urlencode secret=$SECRET \
     http://archive.org/metadata/$IDENTIFIER

After the above call, a metadata read of metadata_test_item will have a top-level member ‘foo_client’ with value:

{"foo-ness":["buckle", "shoe"]}