There are many computer science, decentralized storage, and digital humanities projects looking for data to play with. You have come to the right place: the Internet Archive offers cultural information to web users and dataminers alike.
While many of our collections carry rights restrictions and therefore require agreements and conversation, many others are openly available for public bulk downloading.
Here are three collections: one of movies, one of audiobooks, and one of scanned public domain books from the Library of Congress. If you have a Macintosh or Linux machine, you can run the command lines below. Run each for just a little while to get a few of the items (so you do not need to download terabytes).
These items are also available via BitTorrent, but we find the Internet Archive command-line tool really helpful for this kind of thing:
$ curl -LOs https://archive.org/download/ia-pex/ia
$ chmod +x ia
$ ./ia download --search='collection:prelinger' # 17TB of public domain movies
$ ./ia download --search='collection:librivoxaudio' # 20TB of public domain audiobooks
$ ./ia download --search='collection:library_of_congress' # 166,000 public domain books from the Library of Congress (60TB)
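If you only want a small sample rather than whole terabytes, one approach is to cap the search results before handing them to the downloader. This sketch assumes the `ia` tool's `--itemlist` option, which prints one item identifier per line:

```shell
# Hypothetical sketch: download only the first three items of a collection.
# 'ia search --itemlist' prints one identifier per line; 'head' caps the
# count, and xargs feeds each identifier to 'ia download' in turn.
./ia search 'collection:prelinger' --itemlist | head -3 | xargs -n1 ./ia download
```

Stopping the full `download --search` command with Ctrl-C partway through works too, as noted above; the `head` variant just makes the cutoff explicit.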
Here is a way to figure out how much data is in each:
$ apt-get install jq > /dev/null
$ ./ia search 'collection:library_of_congress' -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
$ ./ia search 'collection:librivoxaudio' -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
$ ./ia search 'collection:prelinger' -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
Sorry to say we do not yet have a support group for people using these tools or finding out what data is available, so for the time being you are pretty much on your own.