Some researchers who work with the Internet Archive, such as those at the University of Massachusetts, have wanted closer access to some of our collections. We are learning how to support this type of “on-campus” use of the collections. This post documents how to get access to, and use, the machines we provide for this purpose.
Who can have access?
Access is for joint projects with the Archive, usually academic programs, often funded by the NSF. It is not a general offering but a special case. Most researchers use the collections by downloading materials to their own machines. We have tools to help with this, and we use GNU Parallel to make the downloads fast.
How to get an account?
Is there an agreement? Yes, there usually is, and it is usually administered by Alexis Rossi. These are shared machines, so please be respectful of others’ data and use of the machines.
How do I get access to the VM? To get an account you will need to forward a public SSH key to Jake Johnson. Please follow the steps below for more details.
Generate your SSH keys.
These instructions assume you’re on a Unix-like operating system. If you’re using Windows please see Mike Lichtenberg’s blog post, Generating SSH Keys on Windows.
- If you don’t already have a ~/.ssh directory, create one to store your SSH configuration files and keys:
  $ mkdir -p ~/.ssh
- Move into the ~/.ssh directory:
  $ cd ~/.ssh
- Create your keys (replacing {username} with the username you would like to use to log in to the VM):
  $ bash -c 'ssh-keygen -t rsa -b 2048 -C "{username}@researcher0.fnf.archive.org"'
- You will be prompted to enter a filename in which to save your private SSH key. Use something like id_rsa.{username}@researcher0.fnf.archive.org (again replacing {username} with the username you will use to log in to the VM):
  Enter file in which to save the key (~/.ssh/id_rsa): id_rsa.{username}@researcher0.fnf.archive.org
- You will then be prompted to enter a passphrase. Enter a passphrase, and continue.
  Enter passphrase (empty for no passphrase): [enter your passphrase]
  Enter same passphrase again: [enter your passphrase again]
You should now have two new files in your ~/.ssh directory, a private key and a public key. For example:
~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org
~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org.pub
Your public key is the key suffixed with “.pub”.
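Before forwarding the public key, it can help to confirm that both files exist and that the private key is readable only by you, since OpenSSH refuses private keys that other users can read. The following is a minimal sketch; the helper name check_keypair is hypothetical and not part of the original instructions.

```shell
# Hypothetical helper (an illustration, not an Internet Archive tool):
# confirm that both key files exist and restrict their permissions.
# Usage: check_keypair <path-to-private-key>
check_keypair() {
    private_key="$1"
    public_key="${private_key}.pub"

    for f in "$private_key" "$public_key"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            return 1
        fi
    done

    # The private key must be readable by its owner only;
    # the public key may be world-readable.
    chmod 600 "$private_key"
    chmod 644 "$public_key"
    echo "ok: $private_key"
}

# Example (replace {username} with your username first):
# check_keypair ~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org
```

ssh-keygen normally creates the private key with these permissions already, so this check mostly matters if the key was copied from another machine.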
Adding your public key to the VM
Forward your public key to Jake Johnson. He will create a user for you, and add your public key to the VM. Once you receive notification that your user has been created and your key successfully added to the VM, proceed to the next step.
Logging into the VM via SSH
You can now use your private key to log in to the VM with the following command:
$ ssh -i ~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org {username}@researcher0.fnf.archive.org
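To avoid typing the key path on every login, the same settings can be stored in your SSH configuration file. The sketch below prints a configuration stanza for the VM; the function name print_ssh_config and the alias “researcher0” are assumptions chosen for illustration, and any alias would work.

```shell
# Print an ssh_config stanza for the VM.
# The function name and the "researcher0" alias are illustrative assumptions.
# Usage: print_ssh_config <username>
print_ssh_config() {
    username="$1"
    host="researcher0.fnf.archive.org"
    cat <<EOF
Host researcher0
    HostName ${host}
    User ${username}
    IdentityFile ~/.ssh/id_rsa.${username}@${host}
    IdentitiesOnly yes
EOF
}

# Example (replace {username} first): append the stanza to your
# configuration, then connect using the alias.
# print_ssh_config {username} >> ~/.ssh/config
# ssh researcher0
```

The IdentityFile path matches the key filename suggested earlier in this post; adjust it if you saved your key under a different name.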
How do I bulk download data from archive.org onto the VM?
We recommend using wget to download data from archive.org. Please see our blog post, Downloading in bulk using wget, for more details.
If you have privileges to an access-restricted collection, you can use your archive.org cookies to download data from this collection by adding the following --header flag to your wget command:
--header "Cookie: logged-in-user={email%40example.com}; logged-in-sig={private};"
(Note: replace {email%40example.com} with the email address associated with your archive.org account, encoding @ as %40, and {private} with the value of your logged-in-sig cookie.)
You can retrieve your logged-in-sig cookie using the following steps:
- In Firefox, go to archive.org and log in with your account
- Go to Firefox > Preferences
- Click on the Privacy tab
- Select “Use custom settings for History” in the drop-down menu in the History section
- Click the “Show cookies” button
- Find archive.org in the list of cookies and expand to show options
- Select the logged-in-sig cookie. The long string in the “Content:” field is the value of your logged-in-sig cookie. This is the value that you will need for your wget command (specifically, replacing {private} in the --header flag mentioned above).
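Once you have both values, the header can be assembled with a small helper so that the @ in your email address is encoded consistently. This is a minimal sketch: the function name build_cookie_header is hypothetical, and it encodes only the @ character, which is the only encoding the note above calls for.

```shell
# Hypothetical helper: build the Cookie header for an access-restricted
# download. Only "@" is encoded (as "%40").
# Usage: build_cookie_header <email> <logged-in-sig value>
build_cookie_header() {
    email="$1"
    sig="$2"
    encoded_email="$(printf '%s' "$email" | sed 's/@/%40/g')"
    printf 'Cookie: logged-in-user=%s; logged-in-sig=%s;\n' "$encoded_email" "$sig"
}

# Example (the email, cookie value, and {url} are placeholders):
# wget --header "$(build_cookie_header 'you@example.com' '{private}')" {url}
```

Treat the logged-in-sig value like a password: anyone on the shared VM who can read it can download as you, so avoid saving it in world-readable scripts.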
How do I bulk download metadata from archive.org onto the VM?
You can download all of an item’s metadata via our Metadata API.
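As a rough illustration, the Metadata API returns a JSON document for an item at https://archive.org/metadata/ followed by the item identifier. The helper names below (metadata_url, fetch_metadata) are hypothetical, and the output filename convention is an assumption.

```shell
# Hypothetical helper: print the Metadata API URL for an item identifier.
metadata_url() {
    printf 'https://archive.org/metadata/%s\n' "$1"
}

# Hypothetical helper: save an item's JSON metadata as <identifier>.json
# in the current directory (the filename convention is an assumption).
fetch_metadata() {
    wget -q -O "$1.json" "$(metadata_url "$1")"
}

# Example ({identifier} is a placeholder):
# fetch_metadata {identifier}
```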
How do I generate a list of identifiers for downloading data and metadata from collections in bulk?
You can use our advanced search engine. Please refer to the “Create a file with the list of identifiers” section in our blog post, “Downloading in bulk using wget”.
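The sketch below shows one way this could fit together: build an advanced search URL that returns a collection’s identifiers as CSV, strip the CSV header row and quotes, and fetch metadata for each identifier with GNU Parallel. The function names, the default of 1000 rows, the itemlist.txt filename, and the choice of 4 parallel jobs are all assumptions for illustration; check the query parameters against the blog post before relying on them.

```shell
# Hypothetical helper: print an advanced search URL that lists the
# identifiers in a collection as CSV. The default row count is an assumption.
# Usage: search_url <collection> [rows]
search_url() {
    collection="$1"
    rows="${2:-1000}"
    printf 'https://archive.org/advancedsearch.php?q=collection%%3A%s&fl%%5B%%5D=identifier&rows=%s&output=csv\n' "$collection" "$rows"
}

# Hypothetical helper: drop the CSV header row and remove the double quotes
# around each identifier. Reads stdin, writes stdout.
clean_identifier_csv() {
    sed -e '1d' -e 's/"//g'
}

# Example ({collection} is a placeholder; 4 jobs is an arbitrary choice):
# wget -q -O - "$(search_url {collection})" | clean_identifier_csv > itemlist.txt
# parallel -j 4 'wget -q -O {}.json https://archive.org/metadata/{}' :::: itemlist.txt
```

Because the VM is shared, it is worth keeping the number of parallel jobs modest.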
How can I monitor usage of the VM?
You can monitor usage of the VM via MRTG (Multi Router Traffic Grapher) here: http://researcher0.fnf.archive.org:8088/mrtg/
It is not quite clear if the researchers will be using GNU Parallel for their research, but if they are could you help them learn ‘parallel --bibtex’? Unfortunately it seems researchers often forget the citation of GNU Parallel.
The Internet Archive staff uses GNU Parallel often; it helps us overcome retrieval latency as well as use multiple processors effectively.
great tool.
-brewster