There are many computer science, decentralized storage, and digital humanities projects looking for data to play with. You have come to the right place: the Internet Archive offers cultural information to web users and dataminers alike.
While many of our collections carry rights restrictions and therefore require agreements and conversation, many others are openly available for public bulk downloading.
Here are three collections: one of movies, one of audiobooks, and one of scanned public domain books from the Library of Congress. If you have a Macintosh or Linux machine, you can run the command lines below. Run each for just a little while to get a few of the items (so you do not need to download terabytes).
These items are also available via BitTorrent, but we find the Internet Archive command-line tool really helpful for this kind of thing:
$ curl -LOs https://archive.org/download/ia-pex/ia
$ chmod +x ia
$ ./ia download --search='collection:prelinger' # 17TB of public domain movies
$ ./ia download --search='collection:librivoxaudio' # 20TB of public domain audiobooks
$ ./ia download --search='collection:library_of_congress' # 166,000 public domain books from the Library of Congress (60TB)
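If you only want a small sample rather than whole terabytes, one approach is to cap the search results before handing them to the downloader. This sketch assumes the `ia` tool's `--itemlist` option, which prints one item identifier per line:

```shell
# Hypothetical sketch: download only the first three items of a collection.
# 'ia search --itemlist' prints one identifier per line; 'head' caps the
# count, and xargs feeds each identifier to 'ia download' in turn.
./ia search 'collection:prelinger' --itemlist | head -3 | xargs -n1 ./ia download
```

Stopping the full `download --search` command with Ctrl-C partway through works too, as noted above; the `head` variant just makes the cutoff explicit.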
Here is a way to figure out how much data is in each:
$ apt-get install jq > /dev/null
$ ./ia search 'collection:library_of_congress' -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
$ ./ia search 'collection:librivoxaudio' -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
$ ./ia search 'collection:prelinger' -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
Sorry to say we do not yet have a support group for people using these tools or finding out what data is available, so for the time being you are pretty much on your own.