As someone who’s uploaded hundreds of thousands of items to the Internet Archive’s stacks and who has probably done a few million transactions with the materials over the years, I just “know” about the Internet Archive Python client, and if you’re someone who wants to interact with the site as a power user (or were looking for an excuse to), it’ll help you to know about it too.
You might even be the kind of power user who is elbowing me out of the way saying “show me the code and show me the documentation”. Well, the documentation is here and the code is here. Have a great time.
Boy, they run fast.
So, for everyone still around, a little history about how this client came along and how, if you have a certain set of tasks and interactions you want to conduct with the massive treasures of archive.org, it might enable you to do some amazing things indeed. If you’ve never done command-line scripting before, here’s a great excuse to learn.
Started in 2012 and overseen primarily by Archive employee Jake Johnson, the internetarchive client (which is generally just called “ia”) is both a set of libraries and a command-line program for doing a wide range of activities with the archive without having to come in through the website. It has a range of advantages over the web interface, chiefly that it can be called from the command line and returns its results (success, failure, other information) right into your scripts. It is coded to be in lock-step with our APIs and system, and does its best to respect capacity as well as return informative messages about success or errors.
The command comes in the form of ia [command], where [command] is one of a variety of functions:
- It is possible to do an ia search command and return the item identifiers of every item that matches your query, which can then be fed to other scripts or utilized as a checklist for your own research.
- The ia metadata command will return as much metadata as possible, including file sizes, metadata pairs, content type, and other useful information baked into every object in the collections.
- The ia list command will tell you all the different files within an item identifier, to see which you might specifically want.
- The ia download and ia upload commands let you pull items down from the archive and upload items to it, setting all the attributes for uploads and adding conditions and specific matches for downloads.
- The ia tasks command lets your scripts know how adding your items to the archive went, as well as where they stand in terms of post-processing.
These are, in fact, all the commands a user might find themselves in desperate need of when the size or complexity of a task means clicking endlessly in a browser is just not going to cut it.
The client was originally created so the Archive could run many different processes itself, via scripts, in a way that would provide clear error messages, give accurate status updates, and allow the scripts to understand what was working or what needed modification. Many internal teams either use this client or depend on its output for information to do their tasks. With over six years of development behind it, the tool is very mature and is used thousands of times a day internally.
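To make the "right into your scripts" part concrete, here is a minimal Python sketch that wraps the ia command. It assumes the ia tool is installed and on your PATH, that ia search with the --itemlist flag prints one identifier per line, and that ia metadata prints a JSON document. The helper names (ia_command, parse_itemlist, parse_metadata, run_ia, search_identifiers) are made up for this illustration and are not part of the client.

```python
import json
import subprocess


def ia_command(*args):
    """Build the argument list for an `ia` call, e.g. ia_command("list", "some-item")."""
    return ["ia", *args]


def parse_itemlist(output):
    """Split `ia search ... --itemlist` output (assumed one identifier per line) into a list."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def parse_metadata(output):
    """Decode the JSON document that `ia metadata <identifier>` is assumed to print."""
    return json.loads(output)


def run_ia(*args):
    """Run `ia` and return (succeeded, stdout, stderr) so the calling script can branch."""
    result = subprocess.run(ia_command(*args), capture_output=True, text=True)
    return result.returncode == 0, result.stdout, result.stderr


def search_identifiers(query):
    """Return the identifiers matching `query`, or raise with the client's error text."""
    ok, out, err = run_ia("search", query, "--itemlist")
    if not ok:
        raise RuntimeError(f"ia search failed: {err.strip()}")
    return parse_itemlist(out)
```

Calling search_identifiers with a query of your own gives you a plain Python list to loop over, and each identifier can then be handed to run_ia("metadata", identifier) or run_ia("list", identifier). Keeping the parsing separate from the subprocess call means the parsing can be checked without touching the network.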
In my case, here are some automated or semi-automated tasks I use the ia client command set to do, often daily:
- Analyze the text of a set of documents to provide me with best guesses as to their publication date, which I then sign off on
- Take a donation of several hundred PDF files and turn them into individual items in a collection, including taking metadata from a .CSV sheet
- Compare and contrast screenshots within an item to find the best one and make that a thumbnail for the item
- Maintain “Pipelines” that pull from content located elsewhere (like the Bitsavers documentation project or the DNA Lounge) and place the resulting items into the Archive with no human intervention
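The PDF donation task is a good example of how little glue this takes. The sketch below is not my actual script, just an illustration under some assumptions: the CSV has an identifier column and a file column (both column names are hypothetical), every other column is a metadata field, and each field is passed to ia upload with --metadata=key:value. It only builds the commands, so you can look them over before running anything.

```python
import csv
import io

# Hypothetical column names: these two say what to upload and where,
# every other column is treated as a metadata field.
RESERVED_COLUMNS = ("identifier", "file")


def upload_commands(csv_text):
    """Turn CSV text into one `ia upload` argument list per row."""
    commands = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        command = ["ia", "upload", row["identifier"], row["file"]]
        for key, value in row.items():
            # Skip the reserved columns, stray extra cells, and empty values.
            if key in RESERVED_COLUMNS or not key or not value:
                continue
            command.append(f"--metadata={key}:{value}")
        commands.append(command)
    return commands


sample = (
    "identifier,file,title,mediatype\n"
    "example-item-1,one.pdf,First Manual,texts\n"
    "example-item-2,two.pdf,,texts\n"
)

for command in upload_commands(sample):
    print(command)
```

Each list can be passed straight to subprocess.run once you are happy with it, and the return code tells your script whether that row needs another look.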
For people who are using the Archive to simply play with and enjoy its many different materials, be they website histories, movies, music, or books, this tool is probably not what you need.
But for the scripting-comfortable folks… for people who want to become scripting-comfortable folks… for people who are maintaining collections or working hard with multiple uploads and doing a lot of manual work to enter metadata… this multi-tool of Internet Archive access is exactly what you need.
As mentioned above, the documentation is here and the code is here. Have a great time.