Many computer science, decentralized storage, and digital humanities projects are looking for data to play with. You have come to the right place: the Internet Archive offers cultural information to web users and data miners alike.
While many of our collections have rights issues and so require agreements and conversation, many others are openly available for public bulk downloading.
Here are 3 collections: one of movies, another of audiobooks, and a third of scanned public domain books from the Library of Congress. If you have a Macintosh or Linux machine, you can run these command lines on it. If you run each for a little while, you can get just a few of the items (so you do not need to download terabytes).
These items are also available via BitTorrent, but we find the Internet Archive command line tool is really helpful for this kind of thing:
$ curl -LOs https://archive.org/download/ia-pex/ia
$ chmod +x ia
$ ./ia download --search="collection:prelinger" #17TB of public domain movies
$ ./ia download --search="collection:librivoxaudio" #20TB of public domain audiobooks
$ ./ia download --search="collection:library_of_congress" #166,000 public domain books from the Library of Congress (60TB)
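If you would rather fetch a fixed number of items than interrupt a bulk download, something like the following sketch may help. It assumes the `ia` binary downloaded above sits in the current directory, and it relies on the tool's `--itemlist` search option to print one identifier per line. The helper name `download_sample` and the default of 5 items are made up for illustration.

```shell
#!/bin/sh
# Path to the ia tool; "./ia" matches the download step above.
# Override by setting IA if the tool is installed elsewhere (assumption).
IA="${IA:-./ia}"

# download_sample COLLECTION [COUNT]
# Hypothetical helper: downloads the first COUNT items (default 5)
# that a search of the given collection returns.
download_sample() {
    collection="$1"
    count="${2:-5}"
    # --itemlist prints only identifiers, one per line.
    "$IA" search "collection:$collection" --itemlist |
        head -n "$count" |
        while read -r identifier; do
            # Redirect stdin so the download cannot consume the identifier list.
            "$IA" download "$identifier" < /dev/null
        done
}

# Example (uncomment to run):
# download_sample prelinger 3
```

The search order is whatever the tool returns, so the items picked this way are not a random sample of the collection.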
Here is a way to figure out how much data is in each:
apt-get install jq > /dev/null
./ia search "collection:library_of_congress" -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
./ia search "collection:librivoxaudio" -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
./ia search "collection:prelinger" -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
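The pipelines above pull the item_size field out of each search result and add the values up. A variant of the same idea, sketched below, does the summing in awk instead of paste and bc and also prints the total in terabytes. It assumes jq is installed and that each input line is a JSON record like those produced by the search commands above; the function name `sum_item_sizes` is hypothetical, and a terabyte is taken here as 10^12 bytes.

```shell
#!/bin/sh
# sum_item_sizes: hypothetical helper that reads JSON records on stdin
# (one per line, each optionally carrying an "item_size" field in bytes)
# and prints the total in bytes and in terabytes (10^12 bytes, an assumption).
sum_item_sizes() {
    # Records without item_size count as 0 rather than printing "null".
    jq -r '.item_size // 0' |
        awk '{ total += $1 }
             END { printf "%.0f bytes (%.2f TB)\n", total, total / 1e12 }'
}

# Example (uncomment to run against a real collection):
# ./ia search "collection:prelinger" -f item_size | sum_item_sizes
```

Using awk keeps the arithmetic in one step, at the cost of floating-point rounding for totals far larger than the collections discussed here.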
Sorry to say we do not yet have a support group for people using these tools or finding out what data is available, so for the time being you are pretty much on your own.
If you are a Windows 10 user with no access to a Linux or Mac machine, the Windows Subsystem for Linux that is available with Windows 10 will run these command line tools as well.
BTW, love the examples that show how to determine the amount of data in a collection. Was recently asked about the size of a collection I work with daily, and didn’t know an easy way to get the answer.
Mike– Love that it can work on Windows! Can you add the commands to do it?
Thank you, Mike, for the tip of using Windows Subsystem for Linux. It worked for me, as I posted in a separate comment on this post.
Have you spoken with anyone at Plex to find out if they would be willing to essentially mirror everything? I bet they would.
I’ve tried, but it didn’t work ..
I use Windows 10
Hi, thanks for the help. I have now installed it successfully.
I would like to download the Library of Congress documents. I am using a PC with Windows 10. I do not know how to find or use the “Windows Subsystem for Linux” referred to above. Is there someone who can tell me how to do this?
Jeff– I have never heard of it either. I don’t use Windows, nor do most of us at the Archive, so maybe you could explore this option?
A quick search for “Windows Subsystem for Linux” found https://docs.microsoft.com/en-us/windows/wsl/install-win10
Jeff– see above to get it to work on Windows.
Excellent examples that show how to determine the amount of data. Thanks for that!
I got the IA tool to work on Windows! (I have not used Windows for years, so please forgive this naive approach, but maybe someone could post a better way. I am using Windows 10 on VMware Fusion on my Mac.)
I installed WSL https://docs.microsoft.com/en-us/windows/wsl/install-win10
(I tried to upgrade to 2, but that failed to get ubuntu going, so set the default version back to 1– I would suggest skipping the upgrade), then installed Ubuntu 20.04. It did not take long.
Then I used:
sudo apt install internetarchive
then
ia configure
then ia download worked! (Notice I took out the “./” before the ia, since it was installed using apt.)
brewster@DESKTOP-53SI5TR:~$ ia download --search="collection:library_of_congress"
100000clubpaperp00ring (1/166599): dddddddddddddddddd – success
1000assortedfact00hami (2/166599): ddddddddddddddddddd – success
1000choicerecipe00mell (3/166599): ddddddddddddddd
…
Hope this works for you!
-brewster