Author Archives: Jake Johnson

Distributed Preservation Made Simple

Library partners of the Internet Archive now have at their fingertips an easy way – from a Unix-like command line in a terminal window – to download digital collections for local preservation and access.

This post will show how to use a Internet Archive command-line tool (ia) to download all items in a collection stored on Archive.org, and keep their local collections in sync with the Archive.org collection.

To use ia, the only requirement is to have Python 2 installed on a Unix-like operating system (i.e. Linux, Mac OS X). Python 2 is pre-installed on Mac OS X and most Linux systems so there is nothing more that needs to be done, except to open up a terminal and follow these steps:

1.  Download the latest binary of the ia command-line tool by running the following command in your terminal:

curl -LO https://archive.org/download/ia-pex/ia

2. Make the binary executable:

chmod +x ia

3. Make sure you have the latest version of the binary, version 1.0.0:

./ia --version

4. Configure ia with your Archive.org credentials (This step is only needed if you need privileges to access the items). :

./ia configure

5. Download a collection:

./ia download --search 'collection:solarsystemcollection'

or

./ia download --search 'collection:JangoMonkey'

The above command to “Download a collection”, for example, will download all files from all items from the band JangoMonkey or the NASA Solar System collection. If re-run, by default, will skip over any files already downloaded, as rysnc does, which can help keep your local collection in sync with the collection on Archive.org.

If you would like to download only certain file types, you can use the –glob option. For example, if you only wanted to download JPEG files, you could use a command like:

./ia download --search 'collection:solarsystemcollection' --glob '*.jpeg|*.jpg'

Note that by default ia will download files into your current working directory. If you launch a terminal window without moving to a new directory, the files will be downloaded to your user directory. To download to a different directory, you can either cd into that directory or use the “–destdir” parameter like so:

mkdir solarsystemcollection-collection

./ia download --search 'collection:solarsystemcollection' --destdir solarsystemcollection-collection

Downloading in Parallel

GNU Parallel is a powerful command-line tool for executing jobs in parallel. When used with ia, downloading items in parallel is as easy as:

./ia search 'collection:solarsystemcollection' --itemlist | parallel --no-notice -j4 './ia download {} --glob="*.jpg|*.jpeg"'

The -j option controls how many jobs run in parallel (i.e. how many files are downloaded at a time). Depending on the machine you are running the command on, you might get better performance by increasing or decreasing the number of simultaneous jobs. By default, GNU Parallel will run one job per CPU.

GNU Parallel can be installed with Homebrew on Mac OS X (i.e.: brew install parallel), or your favorite package manager on Linux (e.g. on Ubuntu: apt-get install parallel, on Arch Linux: pacman -S parallel, etc.). For more details, please refer to: https://www.gnu.org/software/parallel/

For more options and details, use the following command:

./ia download --help

Finally, to see what else the ia command-line tool can do:

./ia --help

Documentation of the ia command-line tool is available at: https://internetarchive.readthedocs.org/en/latest/cli.html

There you have it. Library partners, download and store your collections now using this command-line tool from the Internet Archive. If you have any questions or issues, please write to info (at) archive.org. We are trying to make distributed preservation simple and easy!