Category Archives: Technical

Archive video now supports WebVTT for captions

Posted on March 7, 2018 by traceypooh

We now support .vtt files (Web Video Text Tracks) in addition to .srt (SubRip) (.srt we have supported for years) files for captioning your videos.

It’s as simple as uploading a “parallel filename” to your video file(s).

Examples:

myvid.mp4
myvid.srt
myvid.vtt

Multi-lang support:

myvid.webm
myvid.en.vtt
myvid.en.srt
myvid.es.vtt

Here’s a nice example item:
https://archive.org/details/cruz-test

VTT with caption picker (and upcoming A/V player too!)

(We will have an updated A/V player with a better “picker” for so many language tracks in days, have no fear 😎

Enjoy!

Using Kakadu JPEG2000 Compression to Meet FADGI Standards

Posted on July 31, 2017 by jeff kaplan

The Internet Archive is grateful to the folks at Kakadu Software for contributing to Universal Access to Knowledge by providing the world’s leading implementation of the JPEG2000 standard, used in the Archive’s image processing systems.

Here at the Archive, we digitize over a thousand books a day. JPEG2000, an image coding system that uses compression techniques based on wavelet technology, is a preferred file format for storing these images efficiently, while also providing advantages for presentation quality and metadata richness. The Library of Congress has documented its adoption of the JPEG2000 file format for a number of digitization projects, including its text collections on archive.org.

Recently we started using their SDK to apply some color corrections to the images coming from our cameras. This has helped us achieve FADGI standards in our work with the Library of Congress.

Thank you, Kakadu, for helping make it possible for millions of books to be digitized, stored, and made available with high quality on archive.org!

If you are interested in finding out more about Kakadu Software’s powerful software toolkit for JPEG2000 developers, visit kakadusoftware.com or email info@kakadusoftware.com.

The Hidden Shifting Lens of Browsers

Posted on August 16, 2016 by Jason Scott

Some time ago, I wrote about the interesting situation we had with emulation and Version 51 of the Chrome browser – that is, our emulations stopped working in a very strange way and many people came to the Archive’s inboxes asking what had broken. The resulting fix took a lot of effort and collaboration with groups and volunteers to track down, but it was successful and ever since, every version of Chrome has worked as expected.

But besides the interesting situation with this bug (it actually made us perfectly emulate a broken machine!), it also brought into a very sharp focus the hidden, fundamental aspect of Browsers that can easily be forgotten: Each browser is an opinion, a lens of design and construction that allows its user a very specific facet of how to address the Internet and the Web. And these lenses are something that can shift and turn on a dime, and change the nature of this online world in doing so.

An eternal debate rages on what the Web is “for” and how the Internet should function in providing information and connectivity. For the now-quite-embedded millions of users around the world who have only known a world with this Internet and WWW-provided landscape, the nature of existence centers around the interconnected world we have, and the browsers that we use to communicate with it.

Avoiding too much of a history lesson at this point, let’s instead just say that when Browsers entered the landscape of computer usage in a big way after being one of several resource-intensive experimental programs. In circa 1995, the effect on computing experience and acceptance was unparalleled since the plastic-and-dreams home computer revolution of the 1980s. Suddenly, in one program came basically all the functions of what a computer might possibly do for an end user, all of it linked and described and seemingly infinite. The more technically-oriented among us can point out the gaps in the dream and the real-world efforts behind the scenes to make things do what they promised, of course. But the fundamental message was: Get a Browser, Get the Universe. Throughout the late 1990s, access came in the form of mailed CD-ROMs, or built-in packaging, or Internet Service Providers sending along the details on how to get your machine connected, and get that browser up and running.

As I’ve hinted at, though, this shellac of a browser interface was the rectangular window to a very deep, almost Brazil–like series of ad-hoc infrastructure, clumsily-cobbled standards and almost-standards, and ever-shifting priorities in what this whole “WWW” experience could even possibly be. It’s absolutely great, but it’s also been absolutely arbitrary.

With web anniversaries aplenty now coming into the news, it’ll be very easy to forget how utterly arbitrary a lot of what we think the “Web” is, happens to be.

There’s no question that commercial interests have driven a lot of browser features – the ability to transact financially, to ensure the prices or offers you are being shown, are of primary interest to vendors. Encryption, password protection, multi-factor authentication and so on are sometimes given lip service for private communications, but they’ve historically been presented for the store to ensure the cash register works. From the early days of a small padlock icon being shown locked or unlocked to indicate “safe”, to official “badges” or “certifications” being part of a webpage, the browsers have frequently shifted their character to promise commercial continuity. (The addition of “black box” code to browsers to satisfy the ability to stream entertainment is a subject for another time.)

Flowing from this same thinking has been the overriding need for design control, where the visual or interactive aspects of webpages are the same for everyone, no matter what browser they happen to be using. Since this was fundamentally impossible in the early days (different browsers have different “looks” no matter what), the solutions became more and more involved:

Use very large image-based mapping to control every visual aspect
Add a variety of specific binary “plugins” or “runtimes” by third parties
Insist on adoption of a number of extra-web standards to control the look/action
Demand all users use the same browser to access the site

Evidence of all these methods pop up across the years, with variant success.

Some of the more well-adopted methods include the Flash runtime for visuals and interactivity, and the use of Java plugins for running programs within the confines of the browser’s rectangle. Others, such as the wide use of Rich Text Format (.RTF) for reading documents, or the Realaudio/video plugins, gained followers or critics along the way, and were ultimately faded into obscurity.

And as for demanding all users use the same browser… well, that still happens, but not with the same panache as the old Netscape Now! buttons.

This puts the Internet Archive into a very interesting position.

With 20 years of the World Wide Web saved in the Wayback machine, and URLs by the billions, we’ve seen the moving targets move, and how fast they move. Where a site previously might be a simple set of documents and instructions that could be arranged however one might like, there are a whole family of sites with much more complicated inner workings than will be captured by any external party, in the same way you would capture a museum by photographing its paintings through a window from the courtyard.

When you visit the Wayback and pull up that old site and find things look differently, or are rendered oddly, that’s a lot of what’s going on: weird internal requirements, experimental programming, or tricks and traps that only worked in one brand of browser and one version of that browser from 1998. The lens shifted; the mirror has cracked since then.

This is a lot of philosophy and stray thoughts, but what am I bringing this up for?

The browsers that we use today, the Firefoxes and the Chromes and the Edges and the Braves and the mobile white-label affairs, are ever-shifting in their own right, more than ever before, and should be recognized as such.

It was inevitable that constant-update paradigms would become dominant on the Web: you start a program and it does something and suddenly you’re using version 54.01 instead of version 53.85. If you’re lucky, there might be a “changes” list, but that luck might be variant because many simply write “bug fixes”. In these updates are the closing of serious performance or security issues – and as someone who knows the days when you might have to mail in for a floppy disk to be sent in a few weeks to make your program work, I can totally get behind the new “we fixed it before you knew it was broken” world we live in. Everything does this: phones, game consoles, laptops, even routers and medical equipment.

But along with this shifting of versions comes the occasional fundamental change in what browsers do, along with making some aspect of the Web obsolete in a very hard-lined way.

Take, for example, Gopher, a (for lack of an easier description) proto-web that allowed machines to be “browsed” for information that would be easy for users to find. The ability to search, to grab files or writings, and to share your own pools of knowledge were all part of the “Gopherspace”. It was also rather non-graphical by nature and technically oriented at the time, and the graphical “WWW” utterly flattened it when the time came.

But since Gopher had been a not-insignificant part of the Internet when web browsers were new, many of them would wrap in support for Gopher as an option. You’d use the gopher:// URI, and much like the ftp:// or file:// URIs, it co-existed with http:// as a method for reaching the world.

Until it didn’t.

Microsoft, citing security concerns, dropped Gopher support out of its Internet Explorer browser in 2002. Mozilla, after a years-long debate, did so in 2010. Here’s the Mozilla Firefox debate that raged over Gopher Protocol removal. The functionality was later brought back externally in the form of a Gopher plugin. Chrome never had Gopher support. (Many other browsers have Gopher support, even today, but they have very, very small audiences.)

The Archive has an assembled collection of Gopherspace material here. From this material, as well as other sources, there are web-enabled versions of Gopherspace (basically, http:// versions of the gopher:// experience) that bring back some aspects of Gopher, if only to allow for a nostalgic stroll. But nobody would dream of making something brand new in that protocol, except to prove a point or for the technical exercise. The lens has refocused.

In the present, Flash is beginning a slow, harsh exile into the web pages of history – browser support dropping, and even Adobe whittling away support and upkeep of all of Flash’s forward-facing projects. Flash was a very big deal in its heyday – animation, menu interface, games, and a whole other host of what we think of as “The Web” depended utterly on Flash, and even specific versions and variations of Flash. As the sun sets on this technology, attempts to be able to still view it like the Shumway project will hopefully allow the lens a few more years to be capable of seeing this body of work.

As we move forward in this business of “saving the web”, we’re going to experience “save the browsers”, “save the network”, and “save the experience” as well. Browsers themselves drop or add entire components or functions, and being able to touch older material becomes successively more difficult, especially when you might have to use an older browser with security issues. Our in-browser emulation might be a solution, or special “filters” on the Wayback for seeing items as they were back then, but it’s not an easy task at all – and it’s a lot of effort to see information that is just a decade or two old. It’s going to be very, very difficult.

But maybe recognizing these browsers for what they are, and coming up with ways to keep these lenses polished and flexible, is a good way to start.

Those Hilarious Times When Emulations Stop Working

Posted on June 27, 2016 by Jason Scott

Jason Scott, Software Curator and Your Emulation Buddy, writing in.

With tens of thousands of items in the archive.org stacks that are in some way running in-browser emulations, we’ve got a pretty strong library of computing history afoot, with many more joining in the future. On top of that, we have thousands of people playing these different programs, consoles, and arcade games from all over the world.

Therefore, if anything goes slightly amiss, we hear it from every angle: twitter, item reviews, e-mails, and even the occasional phone call. People expect to come to a software item on the Internet Archive and have it play in their browser! It’s great this expectation is now considered a critical aspect of computer and game history. But it also means we have to go hunting down what the problem might be when stuff goes awry.

Sometimes, it’s something nice and simple, like “I can’t figure out the keys or the commands” or “How do I find the magic sock in the village.”, which puts us in the position of a sort of 1980s Software Company Help Line. Other times, it’s helping fix situations where some emulated software is configured wrong and certain functions don’t work. (The emulation might run too fast, or show the wrong colors, or not work past a certain point in the game.)

But then sometimes it’s something like this:

In this case, a set of programs were all working just fine a while ago, and then suddenly started sending out weird “Runtime” errors. Or this nostalgia-inducing error:

Here’s the interesting thing: The emulated historic machine would continue to run. In other words, we had a still-functioning, emulated broken machine, as if you’d brought home a damaged 486 PC in 1993 from the store and realized it was made of cheaper parts than you expected.

To make things even more strange, this was only happening to emulated DOS programs in the Google Chrome browser. And only Google Chrome version 51.x. And only in the 32-bit version of Google Chrome 51.x. (A huge thanks to the growing number of people who helped this get tracked down.)

This is what people should have been seeing, which I think we can agree looks much better:

The short-term fix is to run Firefox instead of Chrome for the moment if you see a crash, but that’s not really a “fix” per se – Chrome has had the bug reported to them and they’re hard at work on it (and working on a bug can be a lot of work). And there’s no guarantee an update to Firefox (or the Edge Browser, or any of the other browsers working today) won’t cause other weird problems going down the line.

All this, then, can remind people how strange, how interlocking, and even fragile our web ecosystem is at the moment. The “Web” is a web of standards dancing with improvisations, hacks, best guesses and a radically moving target of what needs to be obeyed and discarded. With the automatic downloading of new versions of browsers from a small set of makers, we gain security, but more-obscure bugs might change the functioning of a website overnight. We make sure the newest standards are followed as quickly as possible, but we also wake up to finding out an old trusted standard was deemed no longer worthy of use.

Old standards or features (background music in web pages, the gopher protocol, Flash) give way to new plugins or processes, and the web must be expected, as best it can, to deal with the new and the old and fail gracefully when it can’t quite do it. As part of the work of the Decentralized Web Summit was to bring forward the strengths of this world (collaboration, transparency, reproducibility) while pulling back from the weaknesses of this shifting landscape (centralization, gatekeeping, utter and total loss of history), it’s obvious a lot of people recognize this is an ongoing situation, needing vigilance and hard work.

In the meantime, we’ll do our best to keep on how the latest and greatest browsers deal with the still-fresh world of in-browser emulation, and try to emulate hardware that did come working from the factory.

In the meantime, enjoy some Apple II programs. On us.

Distributed Preservation Made Simple

Posted on February 26, 2016 by Jake Johnson

Library partners of the Internet Archive now have at their fingertips an easy way – from a Unix-like command line in a terminal window – to download digital collections for local preservation and access.

This post will show how to use a Internet Archive command-line tool (ia) to download all items in a collection stored on Archive.org, and keep their local collections in sync with the Archive.org collection.

To use ia, the only requirement is to have Python 2 installed on a Unix-like operating system (i.e. Linux, Mac OS X). Python 2 is pre-installed on Mac OS X and most Linux systems so there is nothing more that needs to be done, except to open up a terminal and follow these steps:

1. Download the latest binary of the ia command-line tool by running the following command in your terminal:

curl -LO https://archive.org/download/ia-pex/ia

2. Make the binary executable:

chmod +x ia

3. Make sure you have the latest version of the binary, version 1.0.0:

./ia --version

4. Configure ia with your Archive.org credentials (This step is only needed if you need privileges to access the items). :

./ia configure

5. Download a collection:

./ia download --search 'collection:solarsystemcollection'

or

./ia download --search 'collection:JangoMonkey'

The above command to “Download a collection”, for example, will download all files from all items from the band JangoMonkey or the NASA Solar System collection. If re-run, by default, will skip over any files already downloaded, as rysnc does, which can help keep your local collection in sync with the collection on Archive.org.

If you would like to download only certain file types, you can use the –glob option. For example, if you only wanted to download JPEG files, you could use a command like:

./ia download --search 'collection:solarsystemcollection' --glob '*.jpeg|*.jpg'

Note that by default ia will download files into your current working directory. If you launch a terminal window without moving to a new directory, the files will be downloaded to your user directory. To download to a different directory, you can either cd into that directory or use the “–destdir” parameter like so:

mkdir solarsystemcollection-collection

./ia download --search 'collection:solarsystemcollection' --destdir solarsystemcollection-collection

Downloading in Parallel

GNU Parallel is a powerful command-line tool for executing jobs in parallel. When used with ia, downloading items in parallel is as easy as:

./ia search 'collection:solarsystemcollection' --itemlist | parallel --no-notice -j4 './ia download {} --glob="*.jpg|*.jpeg"'

The -j option controls how many jobs run in parallel (i.e. how many files are downloaded at a time). Depending on the machine you are running the command on, you might get better performance by increasing or decreasing the number of simultaneous jobs. By default, GNU Parallel will run one job per CPU.

GNU Parallel can be installed with Homebrew on Mac OS X (i.e.: brew install parallel), or your favorite package manager on Linux (e.g. on Ubuntu: apt-get install parallel, on Arch Linux: pacman -S parallel, etc.). For more details, please refer to: https://www.gnu.org/software/parallel/

For more options and details, use the following command:

./ia download --help

Finally, to see what else the ia command-line tool can do:

./ia --help

Documentation of the ia command-line tool is available at: https://internetarchive.readthedocs.org/en/latest/cli.html

There you have it. Library partners, download and store your collections now using this command-line tool from the Internet Archive. If you have any questions or issues, please write to info (at) archive.org. We are trying to make distributed preservation simple and easy!

archive.org download counts of collections of items updates and fixes

Posted on January 26, 2015 by traceypooh

Every month, we look over the total download counts for all public items at archive.org. We sum item counts into their collections. At year end 2014, we found various source reliability issues, as well as overcounting for “top collections” and many other issues.

archive.org public items tracked over time

To address the problems we did:

Rebuilt a new system to use our database (DB) for item download counts, instead of our less reliable (and more prone to “drift”) SOLR search engine (SE).
Changed monthly saved data from JSON and PHP serialized flatfiles to new DB table — much easier to use now!
Fixed overcounting issues for collections: texts, audio, etree, movies
Fixed various overcounting issues related to not unique-ing <collection> and <contributor> tags (more below)
Fixes to character encoding issues on <contributor> tags

Bonus points!

We now track *all collections*. Previously, we only tracked items tagged:
- <mediatype> texts
- <mediatype> etree
- <mediatype> audio
- <mediatype> movies
For items we are tracking <contributor> tags (texts items), we now have a “Contributor page” that shows a table of historical data.
Graphs are now “responsive” (scale in width based on browser/mobile width)

The Overcount Issue for top collection/mediatypes

In the below graph, mediatypes and collections are shown horizontally, with a sample “collection hierarchy” today.
For each collection/mediatype, we show 1 example item, A B C and D, with a downloads/streams/views count next to it parenthetically. So these are four items, spanning four collections, that happen to be in a collection hierarchy (a single item can belong to multiple collections at archive.org)
The Old Way had a critical flaw — it summed all sub-collection counts — when really it should have just summed all *direct child* sub-collection counts (or gone with our New Way instead)

So we now treat <mediatype> tags like <collection> tags, in terms of counting, and unique all <collection> tags to avoid items w/ minor nonideal data tags and another kind of overcounting.

… and one more update from Feb/1:

We graph the “difference” between absolute downloads counts for the current month minus the prior month, for each month we have data for. This gives us graphs that show downloads/month over time. However, values can easily go *negative* with various scenarios (which is *wickedly* confusing to our poor users!)

Here’s that situation:

A collection has a really *hot* item one month, racking up downloads in a given collection. The next month, a DMCA takedown or otherwise removes the item from being available (and thus counted in the future). The downloads for that collection can plummet the next month’s run when the counts are summed over public items for that collection again. So that collection would have a negative (net) downloads count change for this next month!

Here’s our fix:

Use the current month’s collection “item membership” list for current month *and* prior month. Sum counts for all those items for both months, and make the graphed difference be that difference. In just about every situation that remains, graphed monthly download counts will be monotonic (nonnegative and increasing or zero).

Using Docker to Encapsulate Complicated Program is Successful

Posted on November 14, 2014 by Brewster Kahle

The Internet Archive has been using docker in a useful way that is a bit out of the mainstream: to package a command-line binary and its dependencies so we can deploy it on a cluster and use it in the same way we would a static binary.

Columbia University’s Daniel Ellis created an audio fingerprinting program that was used in a competition. It was not packaged as a debian package or other distribution approach. It took a while for our staff to find how to install it and its many dependencies consistently on Ubuntu, but it seemed pretty heavy handed to install that on our worker cluster. So we explored using docker and it has been successful. While old hand for some, I thought it might be interesting to explain what we did.

1) Created a docker file to make a docker container that held all of the code needed to run the system.

2) Worked with our systems group to figure out how to install docker on our cluster with a security profile we felt comfortable with. This included running the binary in the container as user nobody.

3) Ramped up slowly to test the downloading and running of this container. In general it would take 10-25 minutes to download the container the first time. Once cached on a worker node, it was very fast to start up. This cache is persistent between many jobs, so this is efficient.

4) Use the container as we would a shell command, but passed files into the container by mounting a sub filesystem for it to read and write to. Also helped with signaling errors.

5) Starting production use now.

We hope that docker can help us with other programs that require complicated or legacy environments to run.

Congratulations to Raj Kumar, Aaron Ximm, and Andy Bezella for the creative solution to problem that could have made it difficult for us to use some complicated academic code in our production environment.

Go docker!

Job Posting: Web Application/Software Developer for Archive-It

Posted on August 6, 2013 by jeff kaplan

The Internet Archive is looking for a smart, collaborative and resourceful engineer to lead and do the development of the next generation of the Archive-It service, a web based application used by libraries and archives around the world. The Internet Archive is a digital public library founded in 1996. Archive-It is a self-sustaining revenue generating subscription service first launched in 2006.

Primary responsibilities would be to extend the success of Archive-It, which librarians and archivists use to create collections of digital content, and then make them accessible to researchers, scholars and the general public. Widely considered to be the market leader since its’ inception, Archive-It’s partner base has archived over five billion web pages and over 260 terabytes of data. http://archive-it.org

Working for Archive-It program’s director, this position has technical responsibility to evolve this service while still being straightforward enough to be operated by 300+ partner organizations and their users with minimal technical skills. Our current system is primarily Java based and we are looking to help build the next-generation of Archive-It using the latest web technologies. The ideal candidate will possess a desire to work collaboratively with a small internal team and a large, vocal and active user community; demonstrating independence, creativity, initiative and technological savvy, in addition to being a great programmer/architect.

The ideal candidate will have: 

5+ years work experience in Java and Python web application development
Experience with Hadoop, specifically HBase and Pig
Experience developing web application database back-end (SQL or NoSQL).
Good understanding of latest web framework technologies, both JVM and non-JVM based, and trade-offs between them.
Strong familiarity with all aspects of web technology and protocols, including: HTTP, HTML, and Javascript
Experience with a variety of web applications, machine clusters, distributed systems, and high-volume data services.
Flexibility and a sense of humor
BS Computer Science, or equivalent work experience

Bonus points for:

Experience with web crawlers and/or applications designed to display [archived] web content (especially server-side apps)
Open source practices experience
Experience and/or interest in user interface design and information architecture
Familiarity with Apache SOLR or similar facet-based search technologies
Experience with the building/architecture of social media sites
Experience building out a mobile platform

To apply:

Please send your resume and cover letter to kristine at archive dot org with the subject line “Web App Developer Archive-It”.

The Archive thanks all applicants for their interest, but advises that only those selected for an interview will be contacted. No phone calls please!

We are an equal opportunity employer.

How to use the Virtual Machine for Researchers

Posted on July 4, 2013 by Brewster Kahle

Some researchers that are working with the Internet Archive, such as those at University of Massachusetts, have wanted closer access to some of our collections. We are learning how to support this type of “on-campus” use of the collections. This post is to document how to use these machines.

Who can have access?

This is for joint projects with the archive, usually some academic program often funded by NSF. So this is not a general offering, but more of a special case thing. Most use the collections by downloading materials to their home machines. We have tools to help with this, and use “GNU Parallel” to make it go fast.

How to get an account?

Is there an agreement? Yes, there usually is. This is usually administered by Alexis Rossi. All in all, these are shared machines, so please be respectful of others data and use of the machines.

How do I get access to the VM? To get an account you will need to forward a public SSH key to Jake Johnson. Please follow the steps below for more details.

Generate your SSH keys.

These instructions assume you’re on a Unix-like operating system. If you’re using Windows please see Mike Lichtenberg’s blog post, Generating SSH Keys on Windows.

If you don’t already have an ~/.ssh directory, you will need to create one to store your SSH configuration files and keys:
```
$ mkdir -p ~/.ssh
```
Move into the ~/.ssh directory:
```
$ cd ~/.ssh
```
Create your keys (replacing {username} with the username you would like to use to login to the VM):
```
$ bash -c 'ssh-keygen -t rsa -b 2048 -C "{username}@researcher0.fnf.archive.org"'
```
You will be prompted to enter a filename which your private SSH key will be saved to. Use something like id_rsa.{username}@researcher0.fnf.archive.org, again replacing {username} with your username that you will be using to login to the VM):
```
Enter file in which to save the key (~/.ssh/id_rsa): id_rsa.{username}@researcher0.fnf.archive.org
```

You will be prompted again to enter a passphrase. Enter a passphrase, and continue.

Enter passphrase (empty for no passphrase): [enter your passphrase]
Enter same passphrase again: [enter your passphrase again]

You should now have two new files in your ~/.ssh directory, a private key and a public key. For example:

~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org
~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org.pub

Your public key is the key suffixed with “.pub“.

Adding your public key to the VM

Forward your public key to Jake Johnson. He will create a user for you, and add your public key to the VM. Once you receive notification that your user has been created and your key successfully added to the VM, proceed to the next step.

Logging into the VM via SSH

You can now use your private key to login into the VM with the following command:

$ ssh -i ~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org {username}@researcher0.fnf.archive.org

How do I bulk download data from archive.org onto the VM?

We recommend using wget to download data from archive.org. Please see our blog post, Downloading in bulk using wget, for more details.

If you have privileges to an access-restricted collection, you can use your archive.org cookies to download data from this collection by adding the following --header flag to your wget command:

--header "Cookie: logged-in-user={email%40example.com}; logged-in-sig={private};"

(Note: replace {email%40example.com} with the email address associated with your archive.org account (encoding @ as %40), and {private} with the value of your logged-in-sig cookie.)

You can retrieve your logged-in-sig cookie using the following steps:

In Firefox , go to archive.org and log in with your account
Go to Firefox > Preferences
Click on the Privacy tab
Select “Use custom settings for History” in drop down menu in the history section
Click the “Show cookies” button
Find archive.org in the list of cookies and expand to show options
Select the logged-in-sig cookie. The long string in the “Content:” field is the value of your logged-in-sig cookie. This is the value that you will need for your wget command (specifically, replacing {private} in the --header flag mentioned above).

How do I bulk download metadata from archive.org onto the VM?

You can download all of an items metadata via our Metadata API.

How do I generate a list of identifiers for downloading data and metadata from collections in bulk?

You can use our advanced search engine. Please refer to the Create a file with the list of identifiers section in our Downloading in bulk using wget blog post.

How can I monitor usage of the VM?

You can monitor usage of the VM via MRTG (Multi Router Traffic Grapher) here: http://researcher0.fnf.archive.org:8088/mrtg/

The Internet Archive Metadata API

Posted on July 4, 2013 by jeff kaplan

The Metadata API is intended for fast, flexible, and reliable reading and writing of Internet Archive items.

Metadata Read API

The Metadata Read API is the fastest and most flexible way to retrieve metadata for items on archive.org. We’ve seen upwards of 500 reads per second for some collections!

Overview

Returns all of an item’s metadata in JSON.

Resource URL

http://archive.org/metadata/:identifier

Parameters

identifier: The globally unique ID of a given item on archive.org.

Usage

For example, frenchenglishmed00gorduoft is the identifier for http://archive.org/details/frenchenglishmed00gorduoft. You can retrieve all of this item’s metadata from the Metadata API using the following curl command:

$ curl http://archive.org/metadata/frenchenglishmed00gorduoft

The Metadata API also supports HTTPS:

$ curl https://archive.org/metadata/frenchenglishmed00gorduoft

Sub-item Access

The Metadata API returns all of an item’s metadata by default. You can access specific metadata elements like so:

http://archive.org/metadata/:identifier/metadata
http://archive.org/metadata/:identifier/server
http://archive.org/metadata/:identifier/files_count
http://archive.org/metadata/:identifier/files?start=1&count=2
http://archive.org/metadata/:identifier/metadata/collection
http://archive.org/metadata/:identifier/metadata/collection/0
http://archive.org/metadata/:identifier/metadata/title
http://archive.org/metadata/:identifier/files/0/name

Metadata Write API

The metadata write API is intended to make changes to metadata timely, safe and flexible.
It utilizes version 02 of the JSON Patch standard.

Overview

timely

Callers receive results (success or failure) immediately.
Changes are quickly reflected through the metadata read API.

safe

All writes pass through the catalog, so all changes are recorded.
All writes are checked before they’re submitted to the catalog.
If there’s a problem, no catalog task is created. Goal: no redrows!
All checks are repeated when the catalog task is executed.

flexible

Supports arbitrary changes to multiple metadata targets through a unified API.
Changes are easy — no string concatenation or libraries needed.

Resource URL

http://archive.org/metadata/:identifier

Parameters

identifier: The globally unique ID of a given item on archive.org.

Targets

The Metadata Write API supports three kinds of target:

metadata: Changes item_meta.xml (e.g. http://archive.org/metadata/:identifier/metadata).
files/:filename: Changes the file entry in the item’s files.xml (e.g. http://archive.org/metadata/:identifier/files).
other: Changes other.json (e.g. http://archive.org/metadata/:identifier/other).

For XML targets (e.g. ‘metadata‘ and ‘files‘) patches should be composed against their JSON representation, as found in metadata read API results.

Usage

As an HTTP post/get

http://archive.org/metadata/:identifier

With the following url-encoded arguments:

-target: The metadata target you would like to modify.
-patch: The patch you are submitting to the Metadata API.
access: Your IA-S3 access key.
secret: Your IA-S3 secret key.

Authentication

NOTE: These calls must be made with appropriate authentication – at the moment, this means passing your Archive.org IA-S3 credentials. Please visit http://archive.org/account/s3.php to obtain your IA-S3 access key and secret key.

Patches

Patches are JSON strings. They should comply to the draft Json-Patch standard:

http://tools.ietf.org/html/draft-ietf-appsawg-json-patch-02

Examples

Writing to an item’s meta.xml

Add ‘scan_sponsor’ with value ‘Starfleet’ to target ‘metadata’ to the item metadata_test_item:

#!/bin/bash
ACCESS=<redacted>
SECRET=<redacted>
IDENTIFIER=metadata_test_item
TARGET=metadata
PATCH='{"add":"/scan_sponsor", "value":"Starfleet"}'

curl --data-urlencode -target=$TARGET \
     --data-urlencode -patch="$PATCH" \
     --data-urlencode access=$ACCESS \
     --data-urlencode secret=$SECRET \
     http://archive.org/metadata/$IDENTIFIER

returns a JSON object, like the following:

{"success":true,"task_id":114350522,"log":"http://www.us.archive.org/log_show.php?task_id=114350522″}

or perhaps

{"error":"Some problem applying the patch"}

writing to files.xml entry

#!/bin/bash
ACCESS=<redacted> 
SECRET=<redacted>
IDENTIFIER=metadata_test_item
TARGET='files/glogo.png'
PATCH='{"add":"/camera", "value":"Canon A150″}'

curl --data-urlencode -target=$TARGET \
     --data-urlencode -patch="$PATCH" \
     --data-urlencode access=$ACCESS \
     --data-urlencode secret=$SECRET \
     http://archive.org/metadata/$IDENTIFIER

Writing to metadata_test_item/foo_client.json

NOTE: Keys and values are binary-safe and unrestricted

#!/bin/bash
ACCESS=<redacted> 
SECRET=<redacted>
IDENTIFIER=metadata_test_item
TARGET='foo_client'
PATCH='{"add":"/of concern to foo", "value":{"foo-ness":["buckle", "shoe"]}}'

curl --data-urlencode -target=$TARGET \
     --data-urlencode -patch="$PATCH" \
     --data-urlencode access=$ACCESS \
     --data-urlencode secret=$SECRET \     
     http://archive.org/metadata/$IDENTIFIER

After the above call, a metadata read of metadata_test_item will have a toplevel member ‘foo_client’ with value:

{"foo-ness":["buckle", "shoe"]}