The mission of the Internet Archive is “Universal Access to All Knowledge.” The knowledge we archive is represented as digital data. We often get questions related to “How much data does Internet Archive actually keep?” and “How do you store and preserve that knowledge?”
All content uploaded to the Archive is stored in “Items.” As with items in traditional libraries, Internet Archive items are structured to contain a single book, or movie, or music album — generally a single piece of knowledge that can be meaningfully cataloged and retrieved, along with descriptive information (metadata) that usually includes the title, creator (author), and other curatorial information about the content. From a technical standpoint, items are stored in a well-defined structure within Linux directories.
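For illustration, a simplified and hypothetical item might look something like this on disk, with the content files sitting alongside XML files that carry the descriptive metadata and per-file checksums (the identifier here is made up, and actual items vary in their exact file layout):

```
example-item-identifier/
    example-item-identifier.pdf         # the content itself
    example-item-identifier_meta.xml    # title, creator, and other metadata
    example-item-identifier_files.xml   # per-file sizes and checksums
```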
Once a new item is created, automated systems quickly replicate that item across two distinct disk drives in separate servers that are (usually) in separate physical data centers. This “mirroring” of content is done both to minimize the likelihood of data loss or data corruption (due to unexpected hard drive or system failures) and to increase the efficiency of access to the content. Both of these storage locations (called “primary” and “secondary”) are immediately available to serve their copy of the content to patrons… and if one storage location becomes unavailable, the content remains available from the alternate storage location.
We refer to this overall scheme as “paired storage.” Because of the dual-storage arrangement, when we talk about “how much” data we store, we usually refer to what really matters to the patrons — the amount of unique compressed content in storage — that is, the amount prior to replication into paired-storage. So for numbers below, the amount of physical disk space (“raw” storage) is typically twice the amount stated.
As we have pursued our mission, the need for storing data has grown. In October of 2012, we held just over 10 petabytes of unique content. Today, we have archived a little over 30 petabytes, and we add between 13 and 15 terabytes of content per day (web and television are the most voluminous).
Currently, Internet Archive hosts about 20,000 individual disk drives. Each of these is housed in a specialized computer (we call them “datanodes”) that has 36 data drives (plus two operating system drives) per machine. Datanodes are organized into racks of 10 machines (360 data drives), and interconnected via high-speed ethernet to form our storage cluster. Even though our content storage has tripled over the past four years, our count of disk drives has stayed about the same. This is because of improvements in disk drive technology. Datanodes that were once populated with 36 individual 2-terabyte (2T) drives are today filled with 8-terabyte (8T) drives, moving single-node capacity from 72 terabytes (64.8T formatted) to 288 terabytes (259.2T formatted) in the same physical space! This evolution of disk density did not happen in a single step, so we have populations of 2T, 3T, 4T, and 8T drives in our storage clusters.
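As a quick sanity check on those capacity figures, here is a back-of-the-envelope sketch (the roughly 10% raw-to-formatted gap is taken from the numbers above):

```python
# Rough capacity math for a single datanode and a rack of 10, using the figures above.
DRIVES_PER_NODE = 36
NODES_PER_RACK = 10
FORMATTED_FRACTION = 259.2 / 288  # ~0.90, implied by the formatted figures quoted above

for drive_tb in (2, 8):
    node_raw = DRIVES_PER_NODE * drive_tb            # 72 TB or 288 TB
    node_formatted = node_raw * FORMATTED_FRACTION   # 64.8 TB or 259.2 TB
    rack_formatted = node_formatted * NODES_PER_RACK
    print(f"{drive_tb}T drives: {node_raw} TB raw / {node_formatted:.1f} TB formatted per node, "
          f"{rack_formatted:.0f} TB formatted per rack")
```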
Our data mirroring scheme ensures that information stored on any specific disk, on a specific node, and in a specific rack is replicated to another disk of the same capacity, in the same relative slot, and in the same relative datanode in another rack, usually in another datacenter. In other words, data stored on drive 07 of datanode 5 of rack 12 of Internet Archive datacenter 6 (fully identified as ia601205-07) has the same information stored in datacenter 8 (ia8) at ia801205-07. This organization and naming scheme makes tracking and monitoring 20,000 drives manageable for a small team.
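For illustration only (this is not our actual tooling), a small parser for these identifiers might look like the sketch below; the field widths are inferred from the single example above:

```python
import re

def parse_drive_id(drive_id: str) -> dict:
    """Decompose an identifier like 'ia601205-07'.

    Assumed layout, inferred from the example in the post:
    'ia' + datacenter (1 digit) + rack (3 digits) + datanode (2 digits) + '-' + drive slot (2 digits).
    """
    m = re.fullmatch(r"ia(\d)(\d{3})(\d{2})-(\d{2})", drive_id)
    if m is None:
        raise ValueError(f"unrecognized drive identifier: {drive_id!r}")
    datacenter, rack, node, drive = (int(g) for g in m.groups())
    return {"datacenter": datacenter, "rack": rack, "node": node, "drive": drive}

# The mirror lives at the same rack/node/slot in the paired datacenter,
# e.g. ia601205-07 (datacenter 6) pairs with ia801205-07 (datacenter 8).
print(parse_drive_id("ia601205-07"))  # {'datacenter': 6, 'rack': 12, 'node': 5, 'drive': 7}
```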
We maintain our datacenters at ambient temperatures and humidity, meaning that we don’t incur the cost of operating and maintaining an air-conditioned environment (although we do use exhaust fans in hot weather). This keeps our power consumption down to just the operational requirements of the racks (about 5 kilowatts each), but does put some constraints on environmental specifications for the computers we use as data nodes. So far, this approach has (for the most part) worked in terms of both computer and disk drive longevity.
Of course, disk drives all eventually fail. So we have an active team that monitors drive health and replaces drives showing early signs of failure. We replaced 2,453 drives in 2015, and 1,963 year-to-date in 2016… an average of 6.7 drives per day. Across all drives in the cluster the average “age” (arithmetic mean of the time in service) is 779 days. The median age is 730 days, and the most tenured drive in our cluster has been in continuous use for 6.85 years!
So what happens when a drive does fail? Items on that drive are made “read only” and our operations team is alerted. A new drive is put in to replace the failed one and immediately after replacement, the content from the mirror drive is copied onto the fresh drive and read/write status is restored.
Although there are certainly alternatives to drive mirroring for ensuring data integrity in a large storage system (ECC systems like RAID arrays, CEPH, Hadoop, etc.), Internet Archive chooses the simplicity of mirroring in part to preserve the transparency of data on a per-drive basis. The risk of ECC approaches is that in the case of truly catastrophic events, falling below certain thresholds of disk population survival means a total loss of all data in that array. The mirroring approach means that any disk that survives the catastrophe has usable information on it.
Over the past 20 years, Internet Archive has learned many lessons related to storage. These include: be patient in adopting newly introduced technology (wait for it to mature a bit!); with ambient air comes ambient humidity — plan for it; uniformity of infrastructure components is essential (including disk firmware). One of several challenges we see on the horizon is a direct consequence of the increases in disk density — it takes a long time to move data to and from a high-capacity disk. Across pair-bonded 1Gbps node interconnects, transferring data to or from an 8T drive requires 8 hours and 11 minutes at “full speed” and in practice can extend to several days with network traffic and activity interruptions. This introduces a longer “window of vulnerability” for the unlikely “double-disk failure” scenario (both sides of the mirror becoming unusable). To address this we are looking at increased speeds for node-to-node networking as well as alternative storage schemes that compensate for this risk.
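As a rough back-of-the-envelope check on that transfer time (a sketch only; it ignores protocol overhead, interruptions, and the raw-versus-formatted difference, all of which change the real figure):

```python
# Time to stream a full 8 TB drive across pair-bonded 1 Gbps links,
# assuming the full 2 Gbps is usable and counting unformatted capacity.
DRIVE_BYTES = 8e12            # 8 TB
LINK_BITS_PER_SECOND = 2e9    # two bonded 1 Gbps links
hours = DRIVE_BYTES * 8 / LINK_BITS_PER_SECOND / 3600
print(f"~{hours:.1f} hours at theoretical line rate")  # ~8.9 hours
```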
As a final note, I want to thank the small team of extremely hard-working individuals at Internet Archive who maintain and evolve the compute and storage infrastructure that enables us to pursue our mission and service our patrons. Without their hard work and dedicated service, we would not be able to store and preserve the knowledge and information that the community works hard to collect and curate.
Thank you to the 2015-2016 Core Infrastructure Team (and contributors):
Andy Bezella, Hank Bromley, Dwalu Khasu, Sean Fagan, Ralf Muehlen, Tim Johnson, Jim Nelson, Mark Seiden, Samuel Stoller, and Trevor von Stein
-jcg (John C. Gonzalez)
Hey, that’s really cool, thanks for so many details. I once maintained a large system and it was not easy; the odds of weird things happening go way up when working at scale. I like your simplicity and transparency. I can’t tell you how horrible it was when our RAID system had a file corruption and we had to restore from tape, which meant days of downtime. And the hassle of swapping tapes, which get messed up also, etc.
The average disk life seems short to me. Is it a function of the drive quality, the temperatures, or high use? Would you tell us what brand drives you’re happy with currently? I know that you were using Seagate at one time. Thanks!
Thank you for the comment. If our average drive-life is indeed short, that is most likely due to temperature and vibration in our datacenter (we do not employ dampening mechanisms). However, based on data from Backblaze, our average in-use time of 779 days seems to exceed their drive life expectancy of 503 days (although that figure is for 4T drives).
As to manufacturers: we use a preponderance of Seagate drives (all 8T, and many of the 4T). We also have a healthy population of Hitachi 4T drives, which seem (on average) to have outlasted the Seagate 4T models. One thing I did not elaborate on is that our 8T drives are the “archival” type of drive which employs shingled write technology. These are early enough in their operational life that we don’t yet have meaningful statistics on failure rates or lifespan (or how that might be affected by shingled technology).
Hope that helps!
-jcg
Very cool to read about. I run the storage systems for a biomed; we have about 30+ PB of usable storage space. The only difference is that very little of ours is archive. All our stuff is high performance. That’s why we deal with EMC Isilon quite a bit; the node architecture can handle the compute. One thing I’m curious about is whether you’ve ever looked into cloud storage. I’m curious about the price difference between what you have currently and Glacier or the new Google Coldline storage. Access prices are brutal, but most of your stuff I assume is very cold.
Thanks for the article!
Generally, we do not consider using Cloud storage. First of all, we have done the calculations and it is VERY much more expensive per petabyte than the owned datacenter model we currently follow. Because we are charged with the permanent preservation of data, we are also cautious about entering into any arrangement where a change of funding (or of service provider governance) might result in accounts being summarily closed and data being destroyed.
Thanks for the post! I enjoy reading about the hardware and software that powers the Internet Archive.
I’m intrigued by your use of archival drives – in theory they perform just like conventional drives when reading.
How often do you experience both sides of a mirror failing? Anecdotally it seems very rare and is more often than not due to operator error or software/driver/firmware bugs.
Are you considering triple-mirroring? Disk capacity is increasing, network transfer speeds are the same, and drive transfer speeds are the same (or decreasing in the case of archival drives). The window for double disk failures is only getting larger with time.
A double-disk failure is EXTREMELY rare. In fact we have only ever experienced this during the brief period of time when we (and many other datacenters) were using the highly problematic 3T drives in the market shortly after the tsunami several years ago. Even in this case, because the data on the problematic drives is stored in a “transparent” manner, we were able to recover over 80% of the content of that single drive pair. While far from ideal, it did not prove to be a disaster.
For our current situation, you can calculate the probability of a double-disk failure using the “n choose k” calculation with N = 20,000 drives, k = 2, and probability of single drive failure (per day) = 6.7/N.
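As a sketch of that calculation (a straight binomial, so strictly speaking it estimates the chance of any two drives in the cluster failing on the same day, not specifically the two halves of one mirrored pair):

```python
from math import comb

N = 20_000        # drives in the cluster
p = 6.7 / N       # per-drive probability of failing on a given day
k = 2

# Binomial probability that exactly k of the N drives fail on the same day.
prob_two_same_day = comb(N, k) * p**k * (1 - p)**(N - k)
print(f"{prob_two_same_day:.4f}")
```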
Triple store would increase our storage expenditures by at least 50%, and is probably not feasible for our organization at this time.
Fantastic article – a very interesting read!
I’m quite curious about the point you made regarding newly introduced technology needing to mature a bit. Understanding that you probably can’t divulge too much in the way of details, can you tell us more?
Sure! Generally, we only want to use technologies that are widely adopted and stabilized. That means different things for different kinds of technology. For a new generation of hard drives (say 10T), we would want to see at least 6 months at production (not “limited availability”) volumes, as well as at least one firmware update. For more expansive software technologies we want to “wait and see” that they are widely adopted before jumping into the fray. So we tend to be “slow” at shifting to the latest and greatest new thing (as we have found that these are sometimes fads, rather than firm stepping stones for innovation). An example is that we used SOLR as our search framework for many years — and only last year did we begin a shift of our production environment off of SOLR and onto ElasticSearch (even though Elastic has been popular and trending for probably 3+ years now).
Do you do backups too, for example to guard against corrupt data getting mirrored across both copies, or accidental deletion?
We have done experiments to confirm that we can back up large portions of our corpus… but this is not a regular practice for us at this time.
“I want to thank the small and extremely hard-working individuals at Internet Archive”
Are they all really short then? 😉
(I think the word ‘team’ is missing somewhere in that sentence)
Touche! Fixed.
Thank you John for showing us how it’s done, and thanks to the team for its hard work at preserving knowledge! 🙂
Hadoop HDFS can run in a mirroring mode as well as an ECC mode.
For those who are new to the discussion: in HDFS, each file is divided into fixed-length 64MB chunks; then 3 copies of each chunk are made; each is saved on a separate datanode. (You could, of course, set the mirroring level to 2 instead of 3).
The advantage: when a single disk dies, the mirror copies of its chunks are sprinkled throughout the cluster instead of being concentrated on one other disk. Therefore re-replication of the required chunks can be much faster: a many-to-many process instead of a one-to-one process.
The surprising disadvantage: if you happen to lose three disks at the same time, you will often have one unlucky file which happened to have one of its chunks replicated on exactly those three disks. In a large enough cluster with enough files, pick any three disks and there’s always some set of chunk replicas that they have in common.
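A tiny simulation makes this concrete (a sketch with made-up numbers; real HDFS placement also accounts for racks and free space):

```python
import random
from math import comb

random.seed(0)
DISKS = 60         # hypothetical small cluster
CHUNKS = 500_000   # hypothetical number of 64 MB chunks (roughly 30 TB of data)
REPLICAS = 3

# Scatter each chunk's replicas across 3 distinct, randomly chosen disks.
covered = {frozenset(random.sample(range(DISKS), REPLICAS)) for _ in range(CHUNKS)}

print(f"{len(covered)} of {comb(DISKS, REPLICAS)} possible 3-disk sets hold all replicas of some chunk")
# With enough chunks relative to the number of 3-disk sets, essentially every set is
# covered, so losing any 3 disks at once means some chunk loses all of its replicas.
```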
Schemes such as Reed-Solomon / Erasure Coding are an attempt to mitigate this problem: instead of making 3 copies of each chunk, you instead apply a computation that produces 8 blocks (5 data, 3 code). Given any 5 out of the 8 blocks, you can recreate the original chunk. You’ve also drastically reduced your storage overhead, because now you’re storing about 1.6x for each 64MB chunk instead of 2x or 3x.
But you’ve only mitigated the drive-failure problem a little bit. Now, if it’s 4 drives you lose at the same time, there will still be some unlucky stripe in some unlucky file that had 4 of its 8 Reed-Solomon blocks on those 4 drives, and is now unrecoverable.
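Putting the overheads and per-stripe failure tolerance side by side (using the 5-data / 3-code example above):

```python
# Storage overhead vs. how many simultaneous disk losses one chunk/stripe survives,
# for the schemes discussed above.
schemes = {
    "2x mirroring":        {"overhead": 2.0, "survives": 1},
    "3x replication":      {"overhead": 3.0, "survives": 2},
    "RS (5 data, 3 code)": {"overhead": 8 / 5, "survives": 3},
}
for name, s in schemes.items():
    print(f"{name:22s} stores {s['overhead']:.1f}x the data, survives {s['survives']} lost disks per stripe")
```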
The “matched mirrored pairs of disks” strategy described may actually be more resilient against disk failure, up to a point determined by the Birthday Paradox.
To reduce the time when only a single copy is available after a drive failure, you could keep a spare drive or two in each server or rack (let failure stats and experience tell you how many). Once a drive starts failing, a local spare drive would automatically be allocated to replace it, and a third copy of the data would immediately start being copied to it.
If the failing drive is still largely readable, this copying could happen locally at disk drive speeds, using a tolerant sector copy program like GNU Ddrescue, or a more complicated error-tolerant file-by-file copier (rsyncrescue?) as yet unwritten. In either case, after the copying, the holes in the copy would be known (as output from the tolerant copy operation) and could be patched by rsyncing from the copy stored in another data center.
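As a rough illustration of that first local copy (hypothetical device paths and map-file location; it writes directly to the spare device, so treat it as a sketch rather than a recipe):

```python
import subprocess

# Hypothetical devices: /dev/sdX is the failing drive, /dev/sdY the local spare.
# ddrescue copies whatever it can read and records the unreadable regions in a map
# file, which identifies exactly which ranges still need to be patched afterward
# (e.g. by rsyncing from the mirror copy in the other data center).
subprocess.run(
    ["ddrescue", "-f", "/dev/sdX", "/dev/sdY", "/var/tmp/sdX-rescue.map"],
    check=True,
)
```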
Or, the second data center’s copy could be used to populate the newly allocated spare drive, more slowly, over the network. This would still shorten the vulnerable “we only have one good copy” time, by taking the human response time out of the delay. Once the third copy is complete, it could be used to satisfy network demand for reads. Soon a human would still have to physically remove the failing drive, move the third copy to the failed slot, and insert a new spare drive in the spare slot. This would restore the duplicate drives to physically matching pairs, for simple administration.
If no local spare drive is available, or possibly as a first resort, a pool of spare drives stored elsewhere (in another data center) could allocate a drive to become the new copy, with the copying occurring over the network from the duplicate drive. There would rapidly be two copies of the data after a failure. If it is easy to transport that new copy to replace the failing drive (e.g. if the data center is within an hour’s drive), then it can be directly moved. If it isn’t easy, the failing drive can eventually be physically replaced by a blank drive, and a second network copy operation would populate it. When it is back in operation, the “third” copy on the spare drive is free to go back into the spares pool. Using this scheme, a single spares pool co-located with the maintenance staff, or sitting near the fastest point in the network, would minimize the vulnerable time after each failure, while requiring no change in existing data center disk allocation patterns.