Want Some Terabytes from the Internet Archive to Play With?

Many computer science, decentralized storage, and digital humanities projects are looking for data to play with. You have come to the right place: the Internet Archive makes cultural information available to web users and data miners alike.

While many of our collections have rights issues and so require agreements and conversation, there are many that are openly available for public, bulk downloading.

Here are 3 collections: one of movies, another of audiobooks, and a third of scanned public domain books from the Library of Congress. If you have a Macintosh or Linux machine, you can run these command lines on it. If you run each for a little while, you can get just a few of the items (so you do not need to download terabytes).

These items are also available via BitTorrent, but we find the Internet Archive command line tool is really helpful for this kind of thing:

$ curl -LOs https://archive.org/download/ia-pex/ia
$ chmod +x ia
$ ./ia download --search="collection:prelinger" #17TB of public domain movies
$ ./ia download --search="collection:librivoxaudio" #20TB of public domain audiobooks
$ ./ia download --search="collection:library_of_congress" #166,000 public domain books from the Library of Congress (60TB)
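If stopping a full download partway feels too blunt, one option is to pick the first few identifiers and download only those. This is a minimal sketch, assuming the ia binary from the step above sits in the current directory and that its search command accepts --itemlist to print one identifier per line. The function name download_first_n and the IA variable are made up for illustration.

```shell
# Path to the ia tool; override IA if it is installed elsewhere (e.g. IA=ia).
IA="${IA:-./ia}"

# download_first_n COLLECTION N
# Hypothetical helper: download only the first N items of a collection.
download_first_n() {
    collection="$1"
    count="$2"
    # Assumption: --itemlist prints one identifier per line; head keeps the first N.
    "$IA" search "collection:${collection}" --itemlist |
        head -n "$count" |
        while read -r identifier; do
            # Redirect stdin so the download cannot consume the identifier list.
            "$IA" download "$identifier" < /dev/null
        done
}

# Example (uncomment to run; needs network access):
# download_first_n prelinger 5
```

Each item lands in a directory named after its identifier, so a small sample like this is easy to inspect or delete afterwards.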

Here is a way to figure out how much data is in each:

apt-get install jq > /dev/null
./ia search "collection:library_of_congress" -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
./ia search "collection:librivoxaudio" -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
./ia search "collection:prelinger" -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
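To see what the tail end of that pipeline does, it can help to run it on a few made-up numbers. The sketch below stands in for everything after the jq stage: it assumes one byte count per line on stdin, and it swaps bc for the shell's built-in arithmetic so it runs without extra packages. The function name sum_sizes and the sample sizes are invented for illustration.

```shell
# sum_sizes: read one byte count per line on stdin and print the total.
sum_sizes() {
    # paste -sd+ joins the lines into one expression such as 1000+2500+4096,
    # which shell arithmetic then evaluates (the pipeline above uses bc here).
    echo $(( $(paste -sd+ -) ))
}

# Made-up sizes standing in for the output of `jq -r .item_size`.
printf '%s\n' 1000 2500 4096 | sum_sizes    # prints 7596

# The same total in binary units (K, M, G, T) instead of grouped digits.
printf '%s\n' 1000 2500 4096 | sum_sizes | numfmt --to=iec
```

For multi-terabyte collections the unit form may be easier to read than a long grouped number; numfmt comes with GNU coreutils, so it may be missing on a stock Mac.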

Sorry to say we do not yet have a support group for people using these tools or finding out what data is available, so for the time being you are pretty much on your own.

11 thoughts on “Want Some Terabytes from the Internet Archive to Play With?”

  1. Mike Lichtenberg

    If you are a Windows 10 user with no access to a Linux or Mac machine, the Windows Subsystem for Linux that is available with Windows 10 will run these command line tools as well.

    BTW, love the examples that show how to determine the amount of data in a collection. Was recently asked about the size of a collection I work with daily, and didn’t know an easy way to get the answer.

  2. Jeff Benton

    I would like to download the Library of Congress documents. I am using a PC with Windows 10. I do not know how to find or use the “Windows Subsystem for Linux” referred to above. Is there someone who can tell me how to do this?

  3. Brewster Kahle Post author

    I got the IA tool to work on Windows! (I have not used Windows for years, so please forgive this naive approach; maybe someone could post a better way. I am using Windows 10 on VMware Fusion on my Mac.)

    I installed WSL https://docs.microsoft.com/en-us/windows/wsl/install-win10
    (I tried to upgrade to WSL 2, but that failed to get Ubuntu going, so I set the default version back to 1; I would suggest skipping the upgrade), then installed Ubuntu 20.04. It did not take long.

    Then I used:
    sudo apt install internetarchive
    then
    ia configure

    then ia download worked! (Notice I took out the “./” before the ia, since it was installed using apt.)

    brewster@DESKTOP-53SI5TR:~$ ia download --search="collection:library_of_congress"
    100000clubpaperp00ring (1/166599): dddddddddddddddddd - success
    1000assortedfact00hami (2/166599): ddddddddddddddddddd - success
    1000choicerecipe00mell (3/166599): ddddddddddddddd

    Hope this works for you!

    -brewster
