Author Archives: John Gonzalez

Internet Archive helps make books accessible for students with disabilities

The Internet Archive will be part of a team that is working to address a key challenge for students with disabilities: getting books in accessible formats. This participation aligns with an existing Internet Archive program to make materials available and accessible to readers with disabilities.

The number of students with disabilities at colleges and universities has grown over the past few decades. Many of those students have print disabilities, including the largest subgroup, those with learning differences. Students with print disabilities require text to be reformatted for screen readers, text-to-speech software, or other forms of audio delivery, often with human intervention. Universities are required to perform this reformatting on request but are rarely staffed to do that work at scale, and this type of reformatting and remediation can cost hundreds or even thousands of dollars. Once the work has been done for a student at one university, the reformatted book is almost never made available for use by students with disabilities at other universities. Without collaboration and coordination across campuses, efforts are wasted and students with disabilities often wait weeks to get texts in a form they can access and use.

A newly funded pilot project, “Federated Repositories of Accessible Materials for Higher Education,” aims to address this problem. The two-year pilot has recently been funded by a $1,000,000 grant from The Andrew W. Mellon Foundation to the University of Virginia (as principal investigator), with a primary goal of reducing the duplication of remediation activity across the seven (7) universities participating in the pilot. It will also support the cumulative improvement of accessible texts and decrease the turnaround time for delivering those texts to students and faculty.

Within this program, the Internet Archive will participate as one of several repositories of digitized books, both to provide initial digital copies (for remediation) and to receive and hold remediated book files. Those improved books can then be shared with other schools and organizations that provide services to people with disabilities. They may also be used as a starting point for further conversion into additional formats (such as Braille) that may be needed to support specific reader needs.

The Internet Archive’s role in this pilot project dovetails with our existing program to make materials available and accessible to readers with disabilities. Our current program allows any organization that is already working with people with disabilities, known as a Qualifying Authority, to access the digital files of over 1.8 million books (about 900,000 of which are otherwise unavailable). Those Qualifying Authorities, especially Disability Student Service teams at colleges and universities, are then able to streamline their preparation and remediation of these digital books for people with print disabilities. Because they work directly with individual readers, Qualifying Authorities are also able to grant existing (and qualified) Internet Archive users an account with disability access. With that access, these users can enjoy expanded and immediate access to the Internet Archive’s full collection of books (through the Internet Archive website or OpenLibrary).

We are excited to participate in and support the wider community of teams working to make books accessible for all.

Documentation for Public APIs at the Internet Archive

Internet Archive is well known for our interactive user services. These include the Wayback Machine, the website, and OpenLibrary. Less well known are the programmatic tools, or APIs (Application Programming Interfaces), that allow users and computer programs to access archived information “at scale.”

Our APIs evolved over time, adapting to address specific projects and expanding as we introduced new services and capabilities into our operations. Although not entirely uniform, these APIs were created to encourage developers to add media to the Internet Archive as well as to consume and repurpose metadata and media.

“Items” are the organizational units of Internet Archive.  Our primary APIs interact with items to perform fundamental actions:

  • Write and read metadata to and from Items
  • Write and read media or other files to and from Items

We have recently introduced two new capabilities:

  • Report the interaction and activity that an item has experienced
  • Discover what changes have happened to Internet Archive content

Documentation and examples for our most important APIs have now been organized at a single location. We invite our community to review this documentation and use it to work with the information and content in the Internet Archive.
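As a rough illustration of reading metadata from an item, here is a minimal Python sketch. It assumes the public metadata endpoint at `https://archive.org/metadata/<identifier>`, which is not named in this post, and it assumes the JSON response has a `metadata` object and a `files` list. The helper names (`metadata_url`, `fetch_item`, `summarize`) are hypothetical and exist only for this example.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: item metadata is served as JSON from this endpoint.
METADATA_ENDPOINT = "https://archive.org/metadata/"


def metadata_url(identifier: str) -> str:
    """Build the URL for one item's metadata record."""
    return METADATA_ENDPOINT + quote(identifier, safe="")


def fetch_item(identifier: str) -> dict:
    """Download and decode the JSON record for one item (needs network access)."""
    with urlopen(metadata_url(identifier), timeout=30) as response:
        return json.load(response)


def summarize(record: dict) -> dict:
    """Pick out a few fields; anything missing becomes None or an empty list."""
    meta = record.get("metadata", {})
    return {
        "title": meta.get("title"),
        "creator": meta.get("creator"),
        "file_names": [f.get("name") for f in record.get("files", [])],
    }
```

Calling `summarize(fetch_item("some-identifier"))` with a real item identifier would return the title, creator, and file names for that item. Writing metadata or files requires authentication and is not covered by this sketch.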

Internet Archive expands access to millions of books for people with disabilities

Now, disabled users who are certified by a growing number of organizations can borrow hundreds of thousands of modern books and download mostly older books, all for free.

Individuals who are already qualified users of NLS-BARD, Bookshare, or Ontario Council of University Libraries Scholar’s Portal (ACE) can link their accounts and gain access.

Individuals who are affiliated with any of these organizations can contact them to authorize their account for print-disabled access.

Individuals can also request verification for free by filling in this form to contact the Vermont Mutual Aid Society.

We welcome other organizations, such as libraries, schools, hospitals, and dedicated service organizations, to join this free program to certify users for access and to get full access to digital books for further remediation.

If you have questions or suggestions about this program, please contact the Internet Archive. We are excited to be able to offer these services to the print-disabled community.




20,000 Hard Drives on a Mission

The mission of the Internet Archive is “Universal Access to All Knowledge.” The knowledge we archive is represented as digital data. We often get questions related to “How much data does Internet Archive actually keep?” and “How do you store and preserve that knowledge?”

All content uploaded to the Archive is stored in “Items.” As with items in traditional libraries, Internet Archive items are structured to contain a single book, or movie, or music album — generally a single piece of knowledge that can be meaningfully cataloged and retrieved, along with descriptive information (metadata) that usually includes the title, creator (author), and other curatorial information about the content. From a technical standpoint, items are stored in a well-defined structure within Linux directories.

Once a new item is created, automated systems quickly replicate that item across two distinct disk drives in separate servers that are (usually) in separate physical data centers. This “mirroring” of content is done both to minimize the likelihood of data loss or data corruption (due to unexpected hard drive or system failures) and to increase the efficiency of access to the content. Both of these storage locations (called “primary” and “secondary”) are immediately available to serve their copy of the content to patrons… and if one storage location becomes unavailable, the content remains available from the alternate storage location.

We refer to this overall scheme as “paired storage.” Because of the dual-storage arrangement, when we talk about “how much” data we store, we usually refer to what really matters to the patrons — the amount of unique compressed content in storage — that is, the amount prior to replication into paired-storage. So for numbers below, the amount of physical disk space (“raw” storage) is typically twice the amount stated.

As we have pursued our mission, the need for storing data has grown. In October of 2012, we held just over 10 petabytes of unique content. Today, we have archived a little over 30 petabytes, and we add between 13 and 15 terabytes of content per day (web and television are the most voluminous).
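The relationship between unique content and raw disk space can be worked through with a little arithmetic. The sketch below uses only the figures quoted above (30 petabytes, 13 to 15 terabytes per day, two copies of everything). The function names are made up for illustration, and the decimal conversion of 1 PB = 1000 TB is an assumption.

```python
REPLICAS = 2  # paired storage: one primary copy plus one secondary copy


def raw_from_unique(unique: float, replicas: int = REPLICAS) -> float:
    """Physical ("raw") disk space needed to hold `unique` units of content."""
    return unique * replicas


def days_to_add(amount_tb: float, tb_per_day: float) -> float:
    """Days needed to ingest `amount_tb` at a steady daily rate."""
    return amount_tb / tb_per_day


unique_pb = 30                         # unique content today, from the text
daily_low_tb, daily_high_tb = 13, 15   # daily ingest range, from the text

print("raw storage:", raw_from_unique(unique_pb), "PB")
print("raw growth per day:",
      raw_from_unique(daily_low_tb), "to", raw_from_unique(daily_high_tb), "TB")

# Assumption: decimal units (1 PB = 1000 TB).
print("days to add 1 PB of unique content:",
      round(days_to_add(1000, daily_high_tb)), "to",
      round(days_to_add(1000, daily_low_tb)))
```

At the quoted ingest rate, each additional petabyte of unique content arrives in a little over two months and consumes about two petabytes of raw disk.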

Currently, Internet Archive hosts about 20,000 individual disk drives. Each of these is housed in a specialized computer (we call them “datanodes”) that has 36 data drives (plus two operating system drives) per machine. Datanodes are organized into racks of 10 machines (360 data drives) and interconnected via high-speed Ethernet to form our storage cluster. Even though our content storage has tripled over the past four years, our count of disk drives has stayed about the same. This is because of improvements in disk drive technology. Datanodes that were once populated with 36 individual 2-terabyte (2T) drives are today filled with 8-terabyte (8T) drives, moving single node capacity from 72 terabytes (64.8T formatted) to 288 terabytes (259.2T formatted) in the same physical space! This evolution of disk density did not happen in a single step, so we have populations of 2T, 3T, 4T, and 8T drives in our storage clusters.
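The node capacities above follow from simple multiplication. In the sketch below, the 90% formatted fraction is not stated anywhere in the text; it is inferred from the quoted pairs (64.8 / 72 and 259.2 / 288) and should be read as an assumption.

```python
DRIVES_PER_NODE = 36      # data drives per datanode, from the text
NODES_PER_RACK = 10       # datanodes per rack, from the text
FORMATTED_FRACTION = 0.9  # assumption inferred from 64.8/72 and 259.2/288


def node_capacity_tb(drive_tb: float) -> tuple:
    """Return (raw, formatted) capacity in terabytes for one datanode."""
    raw = DRIVES_PER_NODE * drive_tb
    return raw, round(raw * FORMATTED_FRACTION, 1)


def rack_data_drives() -> int:
    """Number of data drives in one rack."""
    return DRIVES_PER_NODE * NODES_PER_RACK


for size_tb in (2, 3, 4, 8):
    raw, formatted = node_capacity_tb(size_tb)
    print(f"{size_tb}T drives: {raw}T raw, {formatted}T formatted per node")

print("data drives per rack:", rack_data_drives())
```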

Our data mirroring scheme ensures that information stored on any specific disk, on a specific node, and in a specific rack is replicated to another disk of the same capacity, in the same relative slot, and in the same relative datanode in another rack, usually in another datacenter. In other words, data stored on drive 07 of datanode 5 of rack 12 of Internet Archive datacenter 6 (fully identified as ia601205-07) has the same information stored in datacenter 8 (ia8) at ia801205-07. This organization and naming scheme makes tracking and monitoring 20,000 drives manageable for a small team.
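To make the naming scheme concrete, here is a small Python sketch that parses a drive identifier and derives the name of its mirror. The field widths (one digit for the datacenter, three for the rack, two each for the datanode and the drive) are inferred from the single example ia601205-07 and are an assumption, not a documented format. The mirror datacenter is passed in explicitly because the text gives only one pairing (6 and 8).

```python
import re
from typing import NamedTuple


class DriveLocation(NamedTuple):
    datacenter: int
    rack: int
    node: int
    drive: int


# Assumed layout: ia + datacenter (1 digit) + rack (3) + node (2) + "-" + drive (2)
_PATTERN = re.compile(r"ia(\d)(\d{3})(\d{2})-(\d{2})")


def parse_drive_name(name: str) -> DriveLocation:
    """Split an identifier such as 'ia601205-07' into its parts."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized drive name: {name!r}")
    return DriveLocation(*(int(group) for group in match.groups()))


def format_drive_name(loc: DriveLocation) -> str:
    """Inverse of parse_drive_name."""
    return f"ia{loc.datacenter}{loc.rack:03d}{loc.node:02d}-{loc.drive:02d}"


def mirror_name(name: str, mirror_datacenter: int) -> str:
    """Same rack, node, and drive slot, but in the mirror datacenter."""
    loc = parse_drive_name(name)
    return format_drive_name(loc._replace(datacenter=mirror_datacenter))


print(parse_drive_name("ia601205-07"))
print(mirror_name("ia601205-07", mirror_datacenter=8))
```

Because only the datacenter digit changes, finding the partner of any drive is a string operation rather than a lookup in a separate mapping table.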

We maintain our datacenters at ambient temperatures and humidity, meaning that we don’t incur the cost of operating and maintaining an air-conditioned environment (although we do use exhaust fans in hot weather). This keeps our power consumption down to just the operational requirements of the racks (about 5 kilowatts each), but does put some constraints on environmental specifications for the computers we use as data nodes. So far, this approach has (for the most part) worked in terms of both computer and disk drive longevity.

Of course, disk drives all eventually fail. So we have an active team that monitors drive health and replaces drives showing early signs of failure. We replaced 2,453 drives in 2015, and 1,963 year-to-date 2016… an average of 6.7 drives per day. Across all drives in the cluster the average “age” (arithmetic mean of the time in-service) is 779 days. The median age is 730 days, and the most tenured drive in our cluster has been in continuous use for 6.85 years!
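The replacement average can be sanity-checked from the two yearly counts. The sketch below works backwards from the quoted 6.7 drives per day to the number of days the counts must cover. It does not assume any particular publication date.

```python
REPLACED_2015 = 2453       # full year, from the text
REPLACED_2016_YTD = 1963   # year to date, from the text
QUOTED_AVERAGE = 6.7       # drives per day, from the text
DAYS_IN_2015 = 365


def implied_days(total_replaced: int, average_per_day: float) -> float:
    """Number of days that would produce the quoted daily average."""
    return total_replaced / average_per_day


total = REPLACED_2015 + REPLACED_2016_YTD
days = round(implied_days(total, QUOTED_AVERAGE))

print(total, "drives replaced over about", days, "days")
print("days into 2016:", days - DAYS_IN_2015)
print("check:", round(total / days, 1), "drives per day")
```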

So what happens when a drive does fail? Items on that drive are made “read only” and our operations team is alerted. A new drive is put in to replace the failed one and immediately after replacement, the content from the mirror drive is copied onto the fresh drive and read/write status is restored.

Although there are certainly alternatives to drive mirroring to ensure data integrity in a large storage system (ECC systems like RAID arrays, CEPH, Hadoop, etc.), Internet Archive chooses the simplicity of mirroring, in part to preserve the transparency of data on a per-drive basis. The risk of ECC approaches is that in the case of truly catastrophic events, falling below certain thresholds of disk population survival means a total loss of all data in that array. The mirroring approach means that any disk that survives the catastrophe has usable information on it.
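The difference in failure behavior can be illustrated with a toy model. This is a deliberately simplified sketch: the coded array is modeled as a generic "any k of n drives" threshold scheme, which is an assumption made for illustration and does not describe any specific RAID level, CEPH, or Hadoop configuration. The drive counts are hypothetical.

```python
from typing import Iterable, List, Tuple


def mirrored_pairs_readable(survivors: Iterable[int],
                            pairs: List[Tuple[int, int]]) -> int:
    """Count mirrored pairs that still have at least one surviving drive."""
    alive = set(survivors)
    return sum(1 for a, b in pairs if a in alive or b in alive)


def threshold_array_readable(surviving_drives: int, minimum_needed: int) -> bool:
    """A k-of-n coded array is readable only if at least k drives survive."""
    return surviving_drives >= minimum_needed


# Hypothetical catastrophe: 12 drives, only drives 0, 1 and 2 survive.
pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9), (10, 11)]
survivors = [0, 1, 2]

print(mirrored_pairs_readable(survivors, pairs), "of", len(pairs),
      "mirrored pairs still readable")
print("8-of-12 coded array readable:",
      threshold_array_readable(len(survivors), 8))
```

In this model the mirrored layout degrades gradually, since every surviving drive still holds complete, directly readable content, while the coded array is either fully readable or not readable at all.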

Over the past 20 years, Internet Archive has learned many lessons related to storage. These include: be patient in adopting newly introduced technology (wait for it to mature a bit!); with ambient air comes ambient humidity, so plan for it; and uniformity of infrastructure components (including disk firmware) is essential. One of several challenges we see on the horizon is a direct consequence of the increases in disk density: it takes a long time to move data to and from a high-capacity disk. Across pair-bonded 1Gbps node interconnects, transferring data to or from an 8T drive requires 8 hours and 11 minutes at “full speed” and in practice can extend to several days with network traffic and activity interruptions. This introduces a longer “window of vulnerability” for the unlikely “double-disk failure” scenario (both sides of the mirror becoming unusable). To address this, we are looking at increased speeds for node-to-node networking as well as alternative storage schemes that compensate for this risk.
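The 8 hours and 11 minutes figure can be reproduced with a short calculation. The inputs below are one set of assumptions that happens to match the quoted number, not a statement of how it was originally derived: a formatted capacity of 7.2T for an 8T drive (the same 90% ratio as the node capacities quoted earlier), two bonded 1Gbps links giving 2Gbps in total, and binary prefixes for both capacity and link speed.

```python
def transfer_seconds(capacity_tib: float, link_gibit_per_s: float) -> float:
    """Seconds to move `capacity_tib` tebibytes over a link of the given speed."""
    bits = capacity_tib * 2**40 * 8
    return bits / (link_gibit_per_s * 2**30)


def hours_minutes(seconds: float) -> tuple:
    """Split a duration into whole hours and whole minutes."""
    return int(seconds // 3600), int(seconds % 3600 // 60)


FORMATTED_CAPACITY = 7.2   # assumption: 90% of an 8T drive
BONDED_LINK_SPEED = 2.0    # assumption: two bonded 1Gbps links

seconds = transfer_seconds(FORMATTED_CAPACITY, BONDED_LINK_SPEED)
hours, minutes = hours_minutes(seconds)
print(f"{hours} hours {minutes} minutes at full speed")
```

Doubling the link speed halves the result, which is why faster node-to-node networking shortens the window of vulnerability directly.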

As a final note, I want to thank the small team of extremely hard-working individuals at Internet Archive who maintain and evolve the compute and storage infrastructure that enables us to pursue our mission and service our patrons. Without their hard work and dedicated service, we would not be able to store and preserve the knowledge and information that the community works hard to collect and curate.

Thank you to the 2015-2016 Core Infrastructure Team (and contributors):
Andy Bezella, Hank Bromley, Dwalu Khasu, Sean Fagan, Ralf Muehlen, Tim Johnson, Jim Nelson, Mark Seiden, Samuel Stoller, and Trevor von Stein

-jcg (John C. Gonzalez)