News from the Archive 0004: Petabytes, Recap, and Ramadan

No. 4, 5 September 2012

Another Day, Another Petabyte

Did you wonder where the Internet Archive stores millions of books, movies, recordings, and 150 billion web pages? Not in some conceptual cloud, but on our custom-designed Petabox servers, that’s where. This week, we’re installing another petabyte of storage; that’s a thousand terabytes or a million gigabytes.

Each Petabox consists of ten racks; each rack holds thirty-eight three-terabyte hard drives, two of which are used for the operating systems, with the remaining thirty-six used for data.
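For readers who like to check the math, here is a minimal sketch of the arithmetic using only the figures in the paragraph above. It assumes decimal units (1 petabyte = 1,000 terabytes) and ignores redundancy and filesystem overhead; the function and constant names are ours, purely for illustration.

```python
# Figures taken from the paragraph above; names are illustrative only.
RACKS_PER_PETABOX = 10
DRIVES_PER_RACK = 38
OS_DRIVES_PER_RACK = 2
TB_PER_DRIVE = 3


def petabox_capacity_tb():
    """Return (raw_tb, data_tb) for one Petabox, in decimal terabytes."""
    raw_tb = RACKS_PER_PETABOX * DRIVES_PER_RACK * TB_PER_DRIVE
    data_drives = DRIVES_PER_RACK - OS_DRIVES_PER_RACK
    data_tb = RACKS_PER_PETABOX * data_drives * TB_PER_DRIVE
    return raw_tb, data_tb


if __name__ == "__main__":
    raw_tb, data_tb = petabox_capacity_tb()
    print(f"raw: {raw_tb} TB, data: {data_tb} TB")
```

By this count, the data drives alone come to 1,080 terabytes, a shade over one petabyte.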

Here’s what one of the racks looks like fresh out of the shipping container …

And when the racks are assembled into Petaboxes …

Brewster’s Report

I’m especially interested in the Recap collection because it is huge, useful, and an interesting example of an archive that builds itself. This set of court filings, all in electronic form, comes from the U.S. government’s Pacer database. When lawyers file documents in federal court, they submit them in electronic form such as a PDF, a Microsoft Word document, or a scanned paper printout. The documents that can be made public go into a database called Pacer, which is freely available to the public.

Well, not quite free. The government sells access to these public documents for ten cents a page, with a document cap of three dollars. This seems to be a fair price for someone who just needs a few documents, but the cost is prohibitive for someone who needs lots and lots of data for their research.
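To see why the cost adds up, here is a rough sketch of the fee rule as described above: ten cents a page, capped at three dollars per document. The function names are hypothetical, and the sketch deliberately leaves out any fee details not mentioned in this newsletter.

```python
# Fee rule as described above; amounts are kept in cents to avoid
# floating-point rounding. Names are hypothetical.
FEE_PER_PAGE_CENTS = 10
DOCUMENT_CAP_CENTS = 300


def document_fee_cents(pages):
    """Fee for a single document: ten cents a page, capped at three dollars."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return min(pages * FEE_PER_PAGE_CENTS, DOCUMENT_CAP_CENTS)


def bulk_cost_dollars(page_counts):
    """Total cost in dollars for a batch of documents, given their page counts."""
    return sum(document_fee_cents(p) for p in page_counts) / 100
```

A five-page filing costs fifty cents, which is hard to complain about. A hypothetical researcher pulling 100,000 ten-page filings, however, would owe $100,000.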

And that brings us to Recap (Pacer spelled backward). A group of academics and activists thought of an ingenious scheme to make wholesale access to court documents available for free, while also benefiting the individual users who make the project possible.

They created a Firefox browser plugin that notices when a visitor searches the Pacer site. If the court filing the user is looking for is available from the Internet Archive’s Recap collection, the document may be downloaded for free. If the researcher pays for and downloads a court filing from the Pacer site, it’s automatically added to the Recap collection.
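The flow described above can be sketched in a few lines. This is not the plugin's actual code: the class, the method names, and the in-memory dictionary standing in for the Recap collection are all assumptions made for illustration, and the fee uses the ten-cents-a-page, three-dollar-cap rule mentioned earlier.

```python
class RecapSketch:
    """Toy model of the check-the-archive-first flow (hypothetical names)."""

    def __init__(self):
        self.archive = {}      # stands in for the Recap collection
        self.spent_cents = 0   # what users have paid Pacer so far

    def fetch(self, doc_id, pages, buy_from_pacer):
        """Return (content, source) for a filing.

        buy_from_pacer is a caller-supplied function that returns the
        document content; it is only called when the archive has no copy.
        """
        if doc_id in self.archive:
            # Already contributed by an earlier user: free download.
            return self.archive[doc_id], "archive"
        content = buy_from_pacer(doc_id)
        self.spent_cents += min(pages * 10, 300)
        # The purchased copy is added automatically for everyone else.
        self.archive[doc_id] = content
        return content, "pacer"


if __name__ == "__main__":
    recap = RecapSketch()
    pretend_pacer = lambda doc_id: f"contents of {doc_id}"
    print(recap.fetch("filing-1", 12, pretend_pacer)[1])
    print(recap.fetch("filing-1", 12, pretend_pacer)[1])
    print(recap.spent_cents)
```

The first request is paid for; every later request for the same filing is served from the archive, which is how the collection builds itself.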

As a result, the Internet Archive hosts a large database of over 700,000 public court cases. This collection of millions and millions of documents, in a publicly accessible archive, can be freely used in bulk for research purposes.

This automated insertion into the Internet Archive was a new use of our S3-like interface; it required patience and debugging as the Princeton programmers and the Internet Archive staff worked out the kinks. As a result of meticulous work, the system has been running almost unattended for three years. The most popular case at the moment involves the Apple Computer and Samsung trademark dispute; it’s been downloaded 1,100 times in the last week. The most popular filing has been downloaded almost 35,000 times.

We are excited about building independent archive support into computer applications, and offering bulk access to materials for all sorts of uses beyond what was imagined by the original database builders. We hope more services become “Archive aware.”

Congratulations to Ed Felten, Aaron Swartz, Sam Stoller, Harlan Yu, and Tim Lee for making a new type of automated archive service work.

—Brewster Kahle, founder and digital librarian

From the Archive’s Mailbox

Since you use a gazillion hard drives, which brands are the best? Which brands should I avoid?

Thanks in advance,

— Nancy Miller

In our experience, they’re fungible. Hard drives all fail sooner or later, so we buy whatever’s the best value when it’s time to add another terabyte. We duplicate (back up) the data and replace the drives that die, which they generally do under warranty. Take care of your data and don’t worry about the fallibility of hardware.

Selected Collection: The Crittenden Automotive Library

The Crittenden Automotive Library was started in 2006 as a collection of automotive information including various forms of media (audio, video, and text) at

It is a large collection of information relating to not only cars, trucks, and motorcycles, but also the roads they drive on, the races they compete in, cultural works based on them, government regulation of them, and the people who design, build, and drive them. We are dedicated to the preservation and free distribution of information relating to all types of cars and road-going vehicles for those seeking the greater understanding of these very important elements of modern society, how automobiles have affected how people live around the world, or for the general study of automotive history and anthropology. In addition to the historical knowledge, we preserve current events for future generations.

Other Picks from the Archive

Too Late for Tears (1949)

It’s part of the film noir collection for a reason.

Without giving away the plot, here’s a relevant bit of dialogue:

Jane, Jane, what’s happening to us—what’s happening? The money sits down there in an old leather bag and yet it’s tearing us apart.

Enjoy! (Or not.)

— recommended by Seth Johannsen

Bathhouse Row Adaptive Use Program, The Fordyce Bathhouse: Technical Report 5 (1985)

This particular Bathhouse Row report is interesting for several reasons. One, its pictures show roughly what the interiors and exteriors of some historic bathhouses should look like in the present day, as well as what the interiors looked like when the bathhouses were operational. Two, the exterior drawing plan of the Fordyce Bathhouse is oh so intricate and lovely. Three, all the materials we are currently scanning relate to national parks, so it is neat to find an area in the National Park System where natural resources, like hot springs, were used rather than preserved in their natural state. Four, we are fortunate enough today to reap the benefits of what replaced the bathhouse movement of centuries ago: spas and personal baths.

— recommended by Sarah M. Lohmann

Ramadan 30, 1433 ~ Madeenah Tahajjud Audio

Ramadan ended a few weeks ago, an observance that went largely unnoticed outside of the Muslim community. These recordings document an aural environment literally unheard of by most people in the western world, and have the same resonance as a recording of a Kansas preacher might have for a Bedouin nomad.

— recommended by Boulaye Trevore

What are your Archive favorites? Please suggest a link or two and a few words about why you appreciate your recommendation to:

—David Glenn Rinehart

/ / / / /

To subscribe to this list, please visit:

If you don’t already have a free Internet Archive library card, you may get yours here:

There, enter your password into the “Change Your Account Settings” option, then click on the “Verify” button. That will bring you to your account settings page, where you may change your subscription status in the “Change Announcement Settings” section.

If the above URL is inoperable, make sure that you have copied the entire address. Some mail readers will wrap a long URL, breaking the link.

If you’re still having trouble, please contact the list owner at:

/ / / / / / /

David Glenn Rinehart is an artist in residence at the Internet Archive as well as a cartoonist, composer, filmmaker, musician, and writer. His work is at and elsewhere.

4 thoughts on “News from the Archive 0004: Petabytes, Recap, and Ramadan”

  1. Cement Science

    Sure, Petabox may be better than those cloud concept products, since a Petabox is more controllable by the Archive itself. The good news is that hardware is getting cheaper and cheaper.

    1. Just an noname Blogger

      Yeah, technology is getting cheaper and cheaper and more efficient at an exponential rate. Perhaps one day it will be possible to store all the Internet Archive data on a single USB stick or something like that. That would be great 😉

