The mission of the Internet Archive is “Universal Access to All Knowledge.” The knowledge we archive is represented as digital data. We often get questions related to “How much data does Internet Archive actually keep?” and “How do you store and preserve that knowledge?”
All content uploaded to the Archive is stored in “Items.” As with items in traditional libraries, Internet Archive items are structured to contain a single book, or movie, or music album — generally a single piece of knowledge that can be meaningfully cataloged and retrieved, along with descriptive information (metadata) that usually includes the title, creator (author), and other curatorial information about the content. From a technical standpoint, items are stored in a well-defined structure within Linux directories.
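To make that concrete, here is a small illustrative sketch (in Python) of what an item’s directory might look like. The identifier, storage root, and file names are hypothetical; they follow the common pattern of content files stored alongside XML metadata files, but they are an assumption for illustration rather than a description of any particular item.

```python
from pathlib import Path

# Illustrative only: a hypothetical item directory. The identifier, storage
# root, and file names are assumptions; the point is that each item is one
# Linux directory holding the content plus its descriptive metadata.
def item_layout(storage_root: str, identifier: str) -> dict:
    item_dir = Path(storage_root) / identifier
    return {
        "directory": item_dir,                              # one directory per item
        "content":   [item_dir / f"{identifier}.pdf"],      # the archived content itself
        "metadata":  item_dir / f"{identifier}_meta.xml",   # title, creator, etc.
        "file_list": item_dir / f"{identifier}_files.xml",  # per-file sizes and checksums
    }

print(item_layout("/srv/items", "example-lecture-1984"))
```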
Once a new item is created, automated systems quickly replicate that item across two distinct disk drives in separate servers that are (usually) in separate physical data centers. This “mirroring” of content is done both to minimize the likelihood of data loss or data corruption (due to unexpected hard drive or system failures) and to increase the efficiency of access to the content. Both of these storage locations (called “primary” and “secondary”) are immediately available to serve their copy of the content to patrons… and if one storage location becomes unavailable, the content remains available from the alternate storage location.
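The read side of this arrangement can be sketched in a few lines. This is only a toy illustration of the primary/secondary failover idea, assuming hypothetical host names and a plain HTTP fetch; it is not the Archive’s actual serving software.

```python
import urllib.request

# Toy illustration of the paired-storage read path: try the primary copy,
# fall back to the secondary if the primary is unreachable. Host names and
# paths are hypothetical.
def fetch_item_file(path: str, primary: str, secondary: str, timeout: float = 5.0) -> bytes:
    last_error = None
    for host in (primary, secondary):
        try:
            with urllib.request.urlopen(f"https://{host}/{path}", timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:           # covers connection errors and timeouts
            last_error = exc             # this copy unavailable: try the other one
    raise OSError(f"both copies of {path} are unreachable") from last_error
```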
We refer to this overall scheme as “paired storage.” Because of the dual-storage arrangement, when we talk about “how much” data we store, we usually refer to what really matters to the patrons — the amount of unique compressed content in storage — that is, the amount prior to replication into paired storage. So for the numbers below, the amount of physical disk space (“raw” storage) is typically twice the amount stated.
As we have pursued our mission, the need for storing data has grown. In October of 2012, we held just over 10 petabytes of unique content. Today, we have archived a little over 30 petabytes, and we add between 13 and 15 terabytes of content per day (web and television are the most voluminous).
Currently, Internet Archive hosts about 20,000 individual disk drives. Each of these is housed in specialized computers (we call them “datanodes”) that have 36 data drives (plus two operating system drives) per machine. Datanodes are organized into racks of 10 machines (360 data drives), and interconnected via high-speed ethernet to form our storage cluster. Even though our content storage has tripled over the past four years, our count of disk drives has stayed about the same, because of improvements in disk drive technology. Datanodes that were once populated with 36 individual 2-terabyte (2T) drives are today filled with 8-terabyte (8T) drives, moving single node capacity from 72 terabytes (64.8T formatted) to 288 terabytes (259.2T formatted) in the same physical space! This evolution of disk density did not happen in a single step, so we have populations of 2T, 3T, 4T, and 8T drives in our storage clusters.
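For the curious, the capacity arithmetic behind those numbers is straightforward. The sketch below just multiplies out drives per node and nodes per rack; the roughly 10% difference between raw and formatted capacity is taken from the 72T → 64.8T and 288T → 259.2T figures above.

```python
# Back-of-the-envelope capacity math for the node and rack figures above.
DATA_DRIVES_PER_NODE = 36
NODES_PER_RACK = 10
FORMATTED_FRACTION = 0.9   # 64.8 / 72 == 259.2 / 288

for drive_tb in (2, 3, 4, 8):
    node_raw = DATA_DRIVES_PER_NODE * drive_tb
    print(f"{drive_tb}T drives: node {node_raw}T raw "
          f"({node_raw * FORMATTED_FRACTION:.1f}T formatted), "
          f"rack {node_raw * NODES_PER_RACK}T raw")
# 2T drives: node 72T raw (64.8T formatted), rack 720T raw
# 8T drives: node 288T raw (259.2T formatted), rack 2880T raw
```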
Our data mirroring scheme ensures that information stored on any specific disk, on a specific node, and in a specific rack is replicated to another disk of the same capacity, in the same relative slot, and in the same relative datanode in another rack, usually in another datacenter. In other words, data stored on drive 07 of datanode 5 of rack 12 of Internet Archive datacenter 6 (fully identified as ia601205-07) has the same information stored in datacenter 8 (ia8) at ia801205-07. This organization and naming scheme makes tracking and monitoring 20,000 drives manageable for a small team.
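One plausible way to read that naming scheme is sketched below. The field widths are inferred from the single example above (ia601205-07 → datacenter 6, rack 12, datanode 5, drive 7), so treat the exact format as an assumption rather than a documented convention.

```python
import re

# Parse a drive name like "ia601205-07" into its location fields, and derive
# the name of its mirror in the paired datacenter. Field widths are inferred
# from the one example in the text and may not match every real host name.
NAME = re.compile(r"^ia(?P<dc>\d)(?P<rack>\d{3})(?P<node>\d{2})-(?P<drive>\d{2})$")

def parse_drive_name(name: str) -> dict:
    m = NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognized drive name: {name}")
    return {k: int(v) for k, v in m.groupdict().items()}

def mirror_of(name: str, mirror_dc: int) -> str:
    f = parse_drive_name(name)
    return f"ia{mirror_dc}{f['rack']:03d}{f['node']:02d}-{f['drive']:02d}"

print(parse_drive_name("ia601205-07"))  # {'dc': 6, 'rack': 12, 'node': 5, 'drive': 7}
print(mirror_of("ia601205-07", 8))      # ia801205-07
```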
We maintain our datacenters at ambient temperatures and humidity, meaning that we don’t incur the cost of operating and maintaining an air-conditioned environment (although we do use exhaust fans in hot weather). This keeps our power consumption down to just the operational requirements of the racks (about 5 kilowatts each), but does put some constraints on environmental specifications for the computers we use as datanodes. So far, this approach has (for the most part) worked in terms of both computer and disk drive longevity.
Of course, disk drives all eventually fail. So we have an active team that monitors drive health and replaces drives showing early signs of failure. We replaced 2,453 drives in 2015, and 1,963 year-to-date 2016… an average of 6.7 drives per day. Across all drives in the cluster, the average “age” (arithmetic mean of the time in service) is 779 days. The median age is 730 days, and the most tenured drive in our cluster has been in continuous use for 6.85 years!
So what happens when a drive does fail? Items on that drive are made “read only” and our operations team is alerted. A new drive is put in to replace the failed one, and immediately after replacement, the content from the mirror drive is copied onto the fresh drive and read/write status is restored.
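In outline, the workflow looks something like the sketch below. Every function here is a stub standing in for real operational tooling; this is just the sequence of steps described above, not the Archive’s actual code.

```python
# Stub functions standing in for operational tooling; only the sequence of
# steps matters here.
def set_items_read_only(drive): print(f"{drive}: items marked read-only")
def alert_operations_team(drive): print(f"{drive}: operations team alerted")
def copy_all_items(src, dst): print(f"re-mirroring content from {src} onto {dst}")
def restore_read_write(drive): print(f"{drive}: read/write status restored")

def handle_failed_drive(failed: str, mirror: str) -> None:
    set_items_read_only(failed)      # 1. freeze the items on the failed drive
    alert_operations_team(failed)    # 2. a technician swaps in a fresh drive
    copy_all_items(mirror, failed)   # 3. copy everything back from the mirror
    restore_read_write(failed)       # 4. resume normal paired operation

handle_failed_drive("ia601205-07", mirror="ia801205-07")
```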
Although there are certainly alternatives to drive mirroring for ensuring data integrity in a large storage system (erasure-coded or parity-based systems such as RAID arrays, Ceph, Hadoop, etc.), Internet Archive chooses the simplicity of mirroring in part to preserve the transparency of data on a per-drive basis. The risk of those approaches is that in the case of truly catastrophic events, falling below certain thresholds of disk population survival means a total loss of all data in that array. The mirroring approach means that any disk that survives the catastrophe has usable information on it.
Over the past 20 years, Internet Archive has learned many lessons related to storage. These include: be patient in adopting newly introduced technology (wait for it to mature a bit!); with ambient air comes ambient humidity — plan for it; uniformity of infrastructure components is essential (including disk firmware). One of several challenges we see on the horizon is a direct consequence of the increases in disk density — it takes a long time to move data to and from a high-capacity disk. Across pair-bonded 1Gbps node interconnects, transferring data to or from an 8T drive requires 8 hours and 11 minutes at “full speed” and in practice can extend to several days with network traffic and activity interruptions. This introduces a longer “window of vulnerability” for the unlikely “double-disk failure” scenario (both sides of the mirror becoming unusable). To address this, we are looking at increased speeds for node-to-node networking as well as alternative storage schemes that compensate for this risk.
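As a rough check on that “window of vulnerability,” the calculation below estimates the best-case copy time for a full drive, assuming the entire (decimal) capacity has to cross an effective 2 Gbps bonded link. It lands in the same ballpark as the figure above; the exact number depends on usable capacity, protocol overhead, and competing traffic, which is why the in-practice time stretches to days.

```python
# Best-case time to re-mirror a full drive over a bonded pair of 1 Gbps links,
# assuming an effective 2 Gbps and decimal terabytes. Real transfers are slower.
def copy_hours(capacity_tb: float, link_gbps: float = 2.0) -> float:
    bits = capacity_tb * 1e12 * 8
    return bits / (link_gbps * 1e9) / 3600

for tb in (2, 4, 8):
    print(f"{tb}T drive: ~{copy_hours(tb):.1f} hours at full speed")
# 2T drive: ~2.2 hours at full speed
# 4T drive: ~4.4 hours at full speed
# 8T drive: ~8.9 hours at full speed
```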
As a final note, I want to thank the small team of extremely hard-working individuals at Internet Archive who maintain and evolve the compute and storage infrastructure that enables us to pursue our mission and serve our patrons. Without their hard work and dedicated service, we would not be able to store and preserve the knowledge and information that the community works hard to collect and curate.
Thank you to the 2015-2016 Core Infrastructure Team (and contributors):
Andy Bezella, Hank Bromley, Dwalu Khasu, Sean Fagan, Ralf Muehlen, Tim Johnson, Jim Nelson, Mark Seiden, Samuel Stoller, and Trevor von Stein
-jcg (John C. Gonzalez)