Making the Web More Reliable — 20 Years and Counting

As part of our 20th anniversary, here are some highlights of tools and projects from the Internet Archive that are helping to make the web a more reliable infrastructure for supporting our culture and commerce.

All in all, the Internet Archive is building collections and tools to help make the open web a permanent resource for users today and into the future.

Please donate to make it even better.

Thank you to the hundreds of people who have worked for the Internet Archive over the past 20 years, and to the thousands who have supported the Archive and contributed to the collections.


Posted in Announcements, News | Leave a comment

Searching Through Everything

With over 20 million items in the Internet Archive’s many collections, having a good way to search through them to find exactly what you want is crucial. It is equally important to be able to filter the data in flexible ways so that you see subsets of the data most relevant to you. We are pleased to offer two new features that might change everything about how you search.

Faceted Filtering

Once you’ve executed a site search, either from the search form at the top right of every page or by going to the search page directly, you’ll see a bunch of new checkboxes down the left-hand side, in addition to the search results. These checkboxes are grouped into categories, such as “Media Type” and “Topics & Subjects”.

Clicking any of the checkboxes adds the corresponding term to the search criteria, allowing you to more precisely define the filtered set of search results. Checkmarking more than one term within the same category causes items that match any of the selected terms to be displayed, whereas checkmarking items from two different categories means that only items matching both terms will be shown. Play around with it, and you’ll see how intuitive it is. Checking or unchecking new terms causes search results to be re-filtered on the fly.
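As a rough sketch of that logic (with invented data, not our actual search code): terms checked within one category are OR'd together, and different categories are AND'd with each other.

```python
# Toy sketch of the facet logic described above, with invented data; this is
# not the Archive's actual search code. Terms checked within one category are
# OR'd together, and different categories are AND'd with each other.
items = [
    {"title": "Jazz LP", "mediatype": "audio", "subjects": {"jazz"}},
    {"title": "Film reel", "mediatype": "movies", "subjects": {"jazz", "history"}},
    {"title": "Physics text", "mediatype": "texts", "subjects": {"science"}},
]

def matches(item, selections):
    """selections maps a category name to the set of checked terms."""
    for category, terms in selections.items():
        value = item[category]
        values = value if isinstance(value, set) else {value}
        if not values & terms:   # no overlap: fails the OR within this category
            return False         # every category must pass: AND across categories
    return True

checked = {"mediatype": {"audio", "movies"}, "subjects": {"jazz"}}
print([i["title"] for i in items if matches(i, checked)])  # ['Jazz LP', 'Film reel']
```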

We were looking for a way to provide a more powerful, visual approach to filtering search results. When we user-tested the faceted search interface, our testers loved it. It was a familiar interface, already in use across the web, that offered both simplicity and richness.

Full-Text Search (in Beta)

Every day, we see an average of 50,000 hits on our search pages, as you, our users, search for title, creator, and various other metadata about the items we’ve archived. But you have long asked when you would be able to search not only across all items but within them as well. For years you’ve been able to search within the text of a single book using our BookReader, but never before have you been able to search across and within all 9 million available text items at the Internet Archive in a single shot. Until now.


And here’s all you have to do: On the search page, after entering your search query in the text field, checkmark “Search full text of books” just underneath the text field, and then click or tap “GO”. That’s it! In seconds, you’ll have the results of searching through millions of texts. Note that the facets at the left work a little differently from non-full-text searches; just click or tap one to add it as a filter criterion.

At the moment, we’re still in beta. Suffice it to say, we’ve faced quite a number of challenges in configuring and populating our full-text search engine, from creating the Elasticsearch clusters to dealing with optical character recognition (OCR) issues related to strange fonts, running page headers, or language recognition. We are continuing to make improvements, and still have a ways to go.
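For the curious, the sketch below shows the general shape of a phrase query against an Elasticsearch index of OCRed page text; the cluster address, index name, and field names are assumptions for illustration, not our production configuration.

```python
# Illustrative only: the general shape of a phrase query against an
# Elasticsearch index of OCRed page text (elasticsearch-py, 7.x-style calls).
# The cluster address, index name, and field names are assumptions for this
# sketch, not the Archive's production configuration.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # hypothetical cluster address

response = es.search(
    index="book-fulltext",                    # hypothetical index of OCRed pages
    body={
        "query": {"match_phrase": {"page_text": "little strokes fell great oaks"}},
        "highlight": {"fields": {"page_text": {}}},  # return matching snippets
        "size": 10,
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_source"].get("identifier"), hit["highlight"]["page_text"][0])
```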

But please use it! Try searching for some phrase that’s stuck in your head from a book long ago forgotten, and see what comes up. You now have the contents of 9 million texts at your fingertips.

Posted in Announcements, Books Archive, News | Leave a comment

More than 1 million formerly broken links in English Wikipedia updated to archived versions from the Wayback Machine


The Internet Archive, the Wikimedia Foundation, and volunteers from the Wikipedia community have now fixed more than 1 million broken outbound web links on English Wikipedia. This was possible because, in addition to its other web archiving projects, the Internet Archive has been monitoring all new and edited outbound links from English Wikipedia for three years and archiving them soon after changes are made to articles. As a result of this work, as pages on the web become inaccessible, links to archived versions in the Internet Archive’s Wayback Machine can take their place. This has now been done for English Wikipedia, and more than 1 million links now point to preserved copies of missing web content.

This story is a testament to the sharing, cooperative nature and resulting benefits of the open world.

What do you do when good web links go bad? If you are a volunteer editor on Wikipedia, you start by writing software to examine every outbound link in English Wikipedia to make sure it is still available via the “live web.” If, for whatever reason, it is no longer good (e.g. it returns a “404” error code or “Page Not Found”), you check to see if an archived copy of the page is available via the Internet Archive’s Wayback Machine. If it is, you instruct your software to edit the Wikipedia page to point to the archived version, taking care to let users of the link know they will be visiting a version via the Wayback Machine.
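The sketch below illustrates that basic check-and-replace idea; it is not InternetArchiveBot's actual code, and the example URL is hypothetical, but the availability endpoint is the Wayback Machine's public API.

```python
# A minimal sketch of the check-and-replace idea described above; this is not
# InternetArchiveBot's actual code. It tests whether a URL still resolves on
# the live web and, if not, asks the Wayback Machine's public availability API
# for an archived snapshot to link to instead. The example URL is hypothetical.
import requests

def find_replacement(url):
    try:
        live = requests.head(url, allow_redirects=True, timeout=10)
        if live.status_code < 400:
            return None                      # the live link still works
    except requests.RequestException:
        pass                                 # unreachable host: treat as a dead link

    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=10)
    snapshot = resp.json().get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        return snapshot["url"]               # e.g. a web.archive.org/web/... URL
    return None

print(find_replacement("http://example.com/some-vanished-page"))
```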

That is exactly what Maximilian Doerr and Stephen Balbach have done. As a result of their work, in close collaboration with the non-profit Internet Archive and the Wikimedia Foundation’s Wikipedia Library program and Community Tech team, more than 1 million broken links have now been repaired. For example, footnote #85 of the article about Easter Island now links to an archived copy in the Wayback Machine, where before it linked to a missing page. Pretty cool, right?

“We are honored to work with the Wikipedia community to help maintain the cultural treasure that is Wikipedia,” said Brewster Kahle, founder and Digital Librarian of the Internet Archive, home of the Wayback Machine. “By editing broken outbound links on English Wikipedia to their archived versions available via the Wayback Machine, we are helping to provide persistent availability to reference information. Links that would have otherwise led to a virtual dead end.”

“What Max and Stephen have done in partnership with Mark Graham at the Internet Archive is nothing short of critical for Wikipedia’s enduring value as a shared repository of knowledge. Without dependable and persistent links, our articles lose their backbone of reliable sources. It’s amazing what a few people can do when they are motivated by sharing—and preserving—knowledge,” said Jake Orlowitz, head of the Wikipedia Library.

“Having the opportunity to contribute something big to the community with a fun task like this is why I am a Wikipedia volunteer and bot operator.  It’s also the reason why I continue to work on this never-ending project, and I’m proud to call myself its lead developer,” said Maximilian, the primary developer and operator of InternetArchiveBot.

So, what is next for this collaboration between Wikipedia and the Internet Archive? Well… there are nearly 300 Wikipedia language editions to rid of broken links. And, we are exploring ways to help make links added to Wikipedia self-healing. It’s a big job and we could use help.

Making the web more reliable… one web page at a time. It’s what we do!

A huge Thank You! to Stephen Balbach, Maximilian Doerr, Vinay Goel, Mark Graham, Brewster Kahle, John Lekashman, Kenji Nagahashi, the Wikimedia Foundation, and Wikipedia community members.

Posted in News | 1 Comment


It’s always going to be an open question as to what parts of culture will survive beyond each generation, but there’s very little doubt that one of them is going to be memes.

Memes are, after all, their own successful transmission of entertainment. A photo, an image that you might have seen before, comes to you with a new context. A turn of phrase, used by a politician or celebrity and in some way ridiculous or unique, comes back to you in all sorts of new ways (Imma let you finish) and ultimately gets put back into your emails, instant messages, or even back into mass media itself.

However, there are some pretty obvious questions as to what memes even are or what qualifies as a meme. Everyone has an opinion (and a meme) to back up their position.

One can say that image macros, those combinations of an expressive image with big bold text, are memes; but it’s best to think of them as one (very prominent) kind of a whole spectrum of Meme.

Image Macros rule the roost because they’re platform independent. They slip into our lives through e-mails, texts, and websites, and are even posted on walls and doors. The chosen image (in this example, from the Baz Luhrmann-directed Great Gatsby) portrays an independent idea (Here’s to you) and the text complements or contrasts it. The smallest, atomic level of an idea. And it gets into your mind, like a piece of candy (or a piece of grit).

It can get way more complicated, however. This 1980s “Internet Archive” logo was automatically generated by an online script which does the hard work of layout, fonts and blending for you. When news of this tool broke in September of 2016 (it had been around a long time before that), this exact template showed up everywhere, from nightclub flyers to endless tweets. Within a short time, the ideas of both “using a computer to do art” and “the 1980s” became part of the payload of this image, as well as the inevitable feeling that it was ever more cliché and tired as hundreds piled on to using it. The long-term prospects of this “1980s art” meme are unknown.

And let’s not forget that “memes” (a term coined by Richard Dawkins in his 1976 book The Selfish Gene) themselves go back decades before the internet made its first carefully engineered cross-continental connections. Office photocopiers ran rampant with passed-along motivational (or de-motivational) posters, telling you that you didn’t need to be crazy to work here… but it helps! Suffering the pains of analog transfer, the endless remixing and hand touchups of these posters gave them a weathered look, as if aged by their very (relative) longevity. To many others, this whole grandparent of the internet meme had a more familiar name: Folklore.

Memes are therefore rich in history and a fundamental part of the online experience, passed along by the thousands every single day as a part of communicating with each other. They deserve study, and they’ve gotten it.

Websites have been created to describe both the contributing factors and the available examples of memes throughout the years. The most prominent has been Know Your Meme, which through several rounds of ownership and contributors has consistently provided access to the surprisingly deep dive of research a supposedly shallow “meme” has behind it.

But the very fluidity and flexibility of memes can be a huge weakness — a single webpage or a single version of an image will be the main reference point for knowing why a meme came to be, and the lifespan of these references is short indeed. Even when hosted at prominent hosting sites or as part of a larger established site, one good housecleaning or consolidation will shut off access to the information, possibly forever.

This is where the Internet Archive comes in. With hundreds of billions of URLs saved over 20 years in the Wayback Machine, we offer a neutral storehouse where not just the inspirations for memes but examples of the memes themselves are kept safe for retrieval, beyond the fleeting fads and whims of the present.

The metaphor of “the web” turns out to be more and more apt as time goes on — like a spider’s web, it is surprisingly strong, yet can be unexpectedly lost in an instant. Connections that seemed immutable and everlasting will drop off the face of the earth at the drop of a hat (or a server, or an unpaid hosting bill).

Memes are, as I said, compressed culture. And when you lose culture, you lose the context and meaning of the words and thoughts that came before. The Wayback Machine will be a part of ensuring they stick around for a long time to come.

Posted in News, Wayback Machine, Web Archive | Leave a comment

How the Internet Archive is hacking the election

There are thirteen days until Election Day — not that we’re counting.

In this most bizarre, unruly, terrifying, fascinating election year, the Internet Archive has been in the thick of it. We’re using technology to give journalists, researchers and the public the power to take the political junk food that’s typically spoon fed to all of us—the political ads, the presidential debates, the TV news broadcasts—and help us to scrutinize the labels, dig into the content, and turn that meal into something more nutritious.

Political ads. We’ve archived more than 2,600 different ads over at the Political TV Ad Archive and used the open source Duplitron created by senior technologist Dan Schultz to count nearly 300,000 airings of the TV ads across 26 media markets. We’ve linked the ads to information on the sponsors—whether it’s a super PAC, a candidate committee, or a nonprofit “dark money” group.

Journalists have used the underlying metadata to visualize this information creatively, whether it’s pinpointing the moment when anti-Trump ads started popping up in Florida, revealing how Ted Cruz favors “The Sound of Music,” or turning the experience of being an Iowa voter deluged with campaign ads into an 8-bit arcade-style video game (The Atlantic).

Meanwhile, our fact checking partners, including PolitiFact and The Washington Post’s Fact Checker, have fact checked 116 archived ads and counting, not just for the presidential candidates but for U.S. Senate, House, and local campaigns as well. Of the 70 ads fact checked by PolitiFact reporters, nearly half have earned ratings ranging from “Mostly False” to “Pants on Fire!”

Example: this “Pants on Fire!” ad played nearly 300 times in Cleveland, Ohio, in August, where Democrat Ted Strickland is facing incumbent Senator Rob Portman, a Republican, in a competitive race. The claim: that as governor, Democrat Ted Strickland proposed deep budget cuts and then “wasted over $250,000 remodeling his bathrooms at the governor’s mansion.” While it’s true Strickland proposed budget cuts in the wake of the 2008 financial crisis, the money used to renovate the governor’s mansion didn’t come from that pool of money. What’s more, the bathrooms in question were not for the governor’s personal use, but rather for tourists who come to visit the mansion.

Presidential debates. In the recent presidential debates, the Internet Archive opened up the TV News Archive to offer near real-time broadcasts while the candidates were still on the stage. Journalists and fact checkers used this online resource to share clips of key points in the debate.

Example: during the third presidential debate, reporter Farai Chideya linked to this clip in a live blog about the debate, noting that abortion is a key issue for Trump’s core supporters.

Twenty-five hours after the debate, we learned that the public had created 85 quote clips from our TV News Archive debate footage, and that viewers played these more than one million times—a healthy response to this brand new experiment.

TV News. When the debates were over, we used the Duplitron on TV news to tally which debate clips were shared on such networks as CNN, FOX News, and MSNBC and shows such as “Good Morning America” and the “Today show.” Journalists used our downloadable data to create visualizations to show how TV News shows present the debates to viewers.

Example: this interactive visualization in The New York Times shows readers how the different cable news networks presented the first debates, and highlights the differences between them.

The Wall Street Journal, the Economist, Fusion and The Atlantic all have used the data to visualize how the debates were portrayed for viewers. In addition, we’re keeping our eyes open and Duplitron turned on for tracking how TV news shows cover other key video. For example, we have data on how TV news shows used clips from the 2005 “Access Hollywood” tape, in which Trump bragged about groping women, and his subsequent apology.

In the thirteen days remaining before the election, we’ll continue to track airings of political ads in key battleground state markets, work with fact checking and journalist partners, and stay on the TV news beat with attention to breaking news.

And when it’s all over, we’re looking forward to working with our partners to figure out what just happened, what we’ve learned, and how we can help in the future.


Posted in Announcements, News | Tagged , , , , , , , , , , , , | Leave a comment

10 Years of Archiving the Web Together

As the Internet Archive turns 20, the Archive-It community is proud to celebrate an anniversary of its own: 10 years of working with thousands of librarians, archivists, and others to preserve the web and build rich, expansive collections of websites for discovery and use by future generations. Eighteen partners inaugurated the Archive-It service in 2006. Since then, that list has grown to include more than 450 organizations and individuals, each with its unique goals and collecting scope. In this time they added more than 17 billion (yes, with a “b”) URLs to their collections.


Archive-It partners over the years. Clockwise from top-left: Margaret Maes (Legal Information Preservation Alliance) and Nicholas Taylor (Stanford University); James Jacobs (Stanford University) and Kent Norsworthy (University of Texas at Austin); K12 web archivists at PS 174 in Queens; Renate Giacomuzzi, Elisabeth Sporer (University of Innsbruck), and Kristine Hanna (Internet Archive)

And to give you just a hint of how the overall collection has grown: that’s about 5 billion new URLs in just the last year! They’ve captured some momentous historical events, local community history, and social and cultural activity across more than 7,000 collections to date, everything from 700+ human rights sites to the tea party movement; tobacco industry records to Mormon missionaries’ blogs. And of course who can forget all of the LOLcats? They’ve collaborated on capturing breaking news, opened doors to the next generation of curators in our K12 web archiving program, and explored their own collections in new forms with datasets leveraging our researcher services.


The Archive-It pilot website in 2005

Archive-It is Internet Archive’s web archiving service that helps institutions build, preserve, and provide access to collections of archived web content. It was developed in response to the needs of libraries, archives, historical societies, museums, and other organizations who sought to use the same powerful technology behind the Wayback Machine to curate their own web archives. The service was then the first of its kind, but has grown and expanded to meet the needs of an ever-widening scope of partners dedicated to archiving the web.


Adding a website to a collection in Archive-It 2.0, as released in July 2006.

Our pilot partners, who began testing a beta version of the service in late 2005, helped to develop and improve the essential tools that such a service would provide and used those tools to create collections, documenting local and global histories in a new way. Based on feedback from the pilot partners, the Archive-It web application launched publicly in 2006 with the most basic of curation tools: create a collection, capture content, and make it publicly available. The service and the community grew exponentially from there.


Archive-It 5.0 realtime crawl tracking.

The myriad partner-driven technical (to say nothing of aesthetic!) improvements of the last ten years are reflected in this year’s release of Archive-It 5.0, the first full redesign of the Archive-It web application since its launch. In the meantime, Archive-It continues to work with the community to preserve and provide access to amazing collections and to develop new tools for archiving the web, including new capture technologies, data transfer APIs, and more.

With year 11 (and Archive-It 5.1) just around the corner, we look forward to helping our partner institutions use new tools, build new collections, and expand the broader community working to archive the web.

Posted in News | Leave a comment

Lending Launches on Archive.org, Plus Bookreader Updates

We have been loaning digital books through Open Library since 2010. We started with about 10,000 books in the lending collections, and soon there will be more than 500,000 books available.  

Today we launch lending on archive.org, so patrons no longer need to go to Open Library to borrow books. The same parameters for borrowing apply — books are free to borrow for logged in users, and they can be borrowed for a period of 2 weeks.

For Open Library users, the lending path has changed a bit — see this post for more information.

For archive.org users, you’re going to see many more modern books available in the coming weeks. These books will appear in collections and search results with a blue “Borrow” notice on them.



Logged in users will be able to borrow the book from the book’s details page where you see the full metadata. Remember, creating an account on archive.org is free, and so is borrowing books.


When you click “Borrow This Book” you will be taken to the new bookreader.  You can search, use the read aloud feature, zoom in and out, and change the number of pages you see at once. The book will be available in your browser for 2 weeks as long as you are connected to the Internet.

If you prefer to read your book offline, you can download a PDF or EPUB version of the book to be read in Adobe Digital Editions (free download).  You must install Adobe Digital Editions before you can read the offline version of your book.


When you want to return the book, you can return it from Adobe Digital Editions (if you chose to download) and from the bookreader.


In addition to the new borrow features, we have updated the bookreader to display better on mobile devices. The layout now changes when you are on a very small screen in order to make it easier to use.  You will see one page at a time, and some of the functions are located in the menu on the left.


If you would like to download an offline copy of the book accessible through Adobe Digital Editions (don’t forget to download the app first!), open the menu and choose “Loan Information.”


From here you can download a PDF or EPUB to read offline, or return the book.


We hope you will explore the books available for lending, and enjoy the features of the new bookreader.

Many thanks to: Richard Caceres, Brenton Cheng, Carolyn Li-Madeo, Tracey Jaquith, Jessamyn West, Jeff Kaplan, John Lekashman, Dwalu Khasu, John Gonzalez and Alexis Rossi.

Posted in Books Archive, News | Leave a comment

The New Memory Palace

By Paul D. Miller aka DJ Spooky

     “Sometimes it is the people no one can imagine anything of who do the things no one can imagine.”

– Alan Turing’s biopic, The Imitation Game, 2014

A lot of things have changed in the last 20 years. A lot of things haven’t. We’ve moved from the tyranny of physical media to the seemingly unlimited possibilities of total digital immersion. We’ve moved from a top-down, mega-corporate dominated media to a hyper-fragmented multiverse where any kind of information is accessible within reason (and sometimes without!). The fundamental issue is how “memory” responds to the digital etherealization of every aspect of the information economy we inhabit, and how that conditions everything we do in this 21st-century culture of post-, post-, post-everything contemporary America. Whether it’s the legions of people who walk the streets with Bluetooth enabled earbuds that allow them to ignore the physical reality of the world around them, or the Pokémon Go hordes playing the world’s largest video game, layering digital role playing over stuff that happens “IRL” (In Real Life): the diagnosis is pending. But the fundamental fact is clear: digital archives are more important than ever, and how we engage with and access the archival material of the past shapes and molds the way we experience the present and future. Playing with the Archive is a kind of digital analytics of the subconscious impulse to collage. It’s also really fun.

Mnemosyne was the Greek muse who personified memory. She was a Titaness, the daughter of Uranus (who represented “Sky”), the son and husband of Gaia, Mother Earth. When you break it down, Mnemosyne had a deeply complicated life, and ended up birthing the other muses with her nephew, Zeus. Ancient Greek myth was quite an incestuous place, and every deity had complicated and deeply interwoven histories that added layers and layers of what we would now call “intertextuality.” Look at it this way: a Titaness, Mnemosyne, gave birth to Urania (Muse of astronomy), Polyhymnia (Muse of hymns), Melpomene (Muse of tragedy), Erato (Muse of lyric poetry), Clio (Muse of history), Calliope (Muse of epic poetry), Terpsichore (Muse of dance), and Euterpe (Muse of music). It’s complicated. Mnemosyne also presided over her own pool in Hades as a counterpoint to the river Lethe, where the dead went to drink to forget their previous life. If you wanted to remember things, you went to Mnemosyne’s pool instead. You had to be clever enough to find it. Otherwise, you’d end up crossing the mythical river into the land of the dead, aka Hades, under the control of spirits guided by the “helmsman,” whose title comes from the Greek term “kybernētēs.” What’s amazing about the wildly “recombinant” logic of this cast of characters is that somehow it became the foundation of our modern methods for naming almost every aspect of digital media — including the term “media.” Media, like the term data, is a plural form of a word “appropriated” directly from Latin. But the eerie resonance it has with our era comes into play when we think of the ways “the archive” acts as a downright uncanny reflection site of language and its collision between code and culture.

Until the internet, the term cyber usually appeared in words about governance, and only later evolved toward how we look at computers, computer networks, and now things like augmented reality and virtual reality. The term traces back to the word cybernetics, which was popularized at MIT by the renowned mathematician Norbert Wiener, a founder of information theory. There’s a strange emergent logic that connects the dots here: permutation, wordplay, and above all, the use of borrowed motifs and ahistorical connections between utterly unassociated material. I guess William S. Burroughs was right: the world has become a mega-Cybertron, a place where everything is mixed, cut and paste style, to make new meanings from old. For people like Norbert Wiener, cybernetics usually refers to the study of mechanical and electronic systems designed, at heart, to replace human systems. The term “cyberspace” was coined by William Gibson to reflect the etherealized world of his 1982 classic, Burning Chrome. He used it again as a reference point for Neuromancer, his groundbreaking novel. A great, oft-cited passage gives you a sense of how resonant it is with our current time:

Cyberspace. A consensual hallucination experienced daily by billions of legitimate operators, in every nation, by children being taught mathematical concepts… a graphic representation of data abstracted from the banks of every computer in the human system. Unthinkable complexity. Lines of light ranged in the nonspace of the mind, clusters and constellations of data. Like city lights, receding…

When the Internet Archive asked me to do a megamix of their archive of recordings from their data files, I was a bit overwhelmed. There’s no way any human being could comb through even the way they’ve documented just the web, let alone the material they have asked people to upload. Where to start? Sir Tim Berners-Lee’s speech inaugurating the web, back when he came up with the term the “Semantic Web”? The first recordings from Edison? That could be cool. Maybe mix that with GW Bush’s State of the Union speech inaugurating the invasion of Iraq? Why not. Take Hedy Lamarr’s original blueprints for spread spectrum “secret communications systems” and mix that with recordings of William S. Burroughs and Malcolm X, with a beat made from open source 1920s jazz and 1950s New Orleans blues? Why not. Grab some clips of Cory Doctorow talking about the upcoming war on open computing and mix it with Parliament Funkadelic? Sure. Take the first “sound heard around the world,” the telemetry signals guiding the Sputnik satellite as it swirled around planet Earth to become our first orbital artificial moon? Cool. Why not? Take a speech from Margaret Sanger, the woman who started Planned Parenthood, and mix it with Public Enemy? Cool. Take D.W. Griffith’s “Birth of a Nation” and re-score it with the Quincy Jones theme from “Fat Albert”? That would actually be kind of cool, but would require a lot of editing.

The basic idea here is that once you have the recordings and documentation of all aspects of human activity from the last several centuries, that is a serious “mega-mix.”

What you will hear in the short track I made is a mini reflection of the density of the sheer volume of materials that the Internet Archive has onsite. It is a humble reminder that through the computer, the network, and the wireless transmission of information, we have an immaculate reflection of what Alan Turing may have called “morphogenesis” — the human, all too human, attempt to corral the world into anthropocentric metaphors that seek to convey the sublime, the edge of human understanding: the emergent patterns that occur when you recombine material with unexpectedly powerful new connections. I’m honored to be the first DJ to start. But I’m also honored that many, many more will follow. The Archive is a mirror of infinite recombinant potential. I hope that its gift of free culture and free exchange creates a place where we will be comfortable with whatever comes next, which is almost impossible to guess. It is not a “collaborative filter” but a place where you are invited to explore on your own and come up with new ways of seeing the infinite memory palace of the fragments of history, time, and space that make this modern 21st century world work.


Paul D. Miller aka DJ Spooky’s work ranges from creating the first DJ app to producing an impactful DVD anthology about the “Pioneers of African American Cinema.” According to a New York Times review, “there has never been a more significant video release than ‘Pioneers of African-American Cinema.'” The prolific innovator and artist also created 13 music albums and is about to release a fourteenth. Called “Phantom Dancehall,” it is an intense mix of hip hop, Jamaican ska and dancehall culture.

Posted in Announcements, Event, Music, News | Leave a comment

20,000 Hard Drives on a Mission

The mission of the Internet Archive is “Universal Access to All Knowledge.” The knowledge we archive is represented as digital data. We often get questions related to “How much data does Internet Archive actually keep?” and “How do you store and preserve that knowledge?”

All content uploaded to the Archive is stored in “Items.” As with items in traditional libraries, Internet Archive items are structured to contain a single book, or movie, or music album — generally a single piece of knowledge that can be meaningfully cataloged and retrieved, along with descriptive information (metadata) that usually includes the title, creator (author), and other curatorial information about the content. From a technical standpoint, items are stored in a well-defined structure within Linux directories.
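If you would like to poke at an item yourself, the sketch below uses the Archive's public internetarchive Python client (pip install internetarchive) to read an item's metadata and file list; the identifier shown is hypothetical, so substitute any real item identifier.

```python
# A small illustration of reading an item's descriptive metadata through the
# public `internetarchive` Python client (pip install internetarchive).
# The identifier below is hypothetical; substitute any real item identifier.
from internetarchive import get_item

item = get_item("example-scanned-book")      # hypothetical identifier
print(item.metadata.get("title"))            # descriptive metadata: title
print(item.metadata.get("creator"))          # descriptive metadata: creator/author
for f in item.files:                         # the files that make up the item
    print(f["name"], f.get("format"))
```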

Once a new item is created, automated systems quickly replicate that item across two distinct disk drives in separate servers that are (usually) in separate physical data centers. This “mirroring” of content is done both to minimize the likelihood of data loss or data corruption (due to unexpected hard drive or system failures) and to increase the efficiency of access to the content. Both of these storage locations (called “primary” and “secondary”) are immediately available to serve their copy of the content to patrons… and if one storage location becomes unavailable, the content remains available from the alternate storage location.
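As a toy illustration of that read path (the names here are hypothetical, and this is not our actual serving code), the paired-storage idea boils down to trying the primary copy and falling back to the mirror:

```python
# Toy illustration of the paired-storage read path described above; the names
# here are hypothetical and this is not the Archive's actual serving code.
def read_item(item_id, primary, secondary):
    for store in (primary, secondary):
        try:
            return store.read(item_id)   # serve from whichever copy answers
        except IOError:
            continue                     # that copy is unavailable; try its mirror
    raise IOError("both copies of %s are unreachable" % item_id)
```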

We refer to this overall scheme as “paired storage.” Because of the dual-storage arrangement, when we talk about “how much” data we store, we usually refer to what really matters to the patrons — the amount of unique compressed content in storage — that is, the amount prior to replication into paired-storage. So for numbers below, the amount of physical disk space (“raw” storage) is typically twice the amount stated.

As we have pursued our mission, the need for storing data has grown. In October of 2012, we held just over 10 petabytes of unique content. Today, we have archived a little over 30 petabytes, and we add between 13 and 15 terabytes of content per day (web and television are the most voluminous).

Currently, Internet Archive hosts about 20,000 individual disk drives. Each of these is housed in specialized computers (we call them “datanodes”) that have 36 data drives (plus two operating system drives) per machine. Datanodes are organized into racks of 10 machines (360 data drives), and interconnected via high-speed ethernet to form our storage cluster. Even though our content storage has tripled over the past four years, our count of disk drives has stayed about the same. This is because of disk drive technology improvements. Datanodes that were once populated with 36 individual 2-terabyte (2T) drives are today filled with 8-terabyte (8T) drives, moving single node capacity from 72 terabytes (64.8T formatted) to 288 terabytes (259.2T formatted) in the same physical space! This evolution of disk density did not happen in a single step, so we have populations of 2T, 3T, 4T, and 8T drives in our storage clusters.

Our data mirroring scheme ensures that information stored on any specific disk, on a specific node, and in a specific rack is replicated to another disk of the same capacity, in the same relative slot, and in the same relative datanode in another rack, usually in another datacenter. In other words, data stored on drive 07 of datanode 5 of rack 12 of Internet Archive datacenter 6 (fully identified as ia601205-07) has the same information stored in datacenter 8 (ia8) at ia801205-07. This organization and naming scheme makes tracking and monitoring 20,000 drives with a small team manageable.
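To show how compactly that identifier encodes location, here is a small sketch that parses an identifier like ia601205-07 and derives its mirror; the exact field widths are inferred from the example above, not taken from our internal tooling.

```python
import re

# Parse identifiers of the form ia<datacenter>0<rack><node>-<drive>, e.g.
# ia601205-07 = datacenter 6, rack 12, datanode 05, drive 07. The field widths
# are inferred from the example in the text and may differ from the Archive's
# actual tooling.
PATTERN = re.compile(r"^ia(\d)0(\d{2})(\d{2})-(\d{2})$")

def parse(drive_id):
    dc, rack, node, drive = PATTERN.match(drive_id).groups()
    return {"datacenter": int(dc), "rack": int(rack),
            "node": int(node), "drive": int(drive)}

def mirror(drive_id, mirror_datacenter):
    # The mirror holds the same rack/node/drive slot in another datacenter.
    loc = parse(drive_id)
    return "ia{dc}0{rack:02d}{node:02d}-{drive:02d}".format(
        dc=mirror_datacenter, rack=loc["rack"], node=loc["node"], drive=loc["drive"])

print(parse("ia601205-07"))        # {'datacenter': 6, 'rack': 12, 'node': 5, 'drive': 7}
print(mirror("ia601205-07", 8))    # ia801205-07
```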

We maintain our datacenters at ambient temperatures and humidity, meaning that we don’t incur the cost of operating and maintaining an air-conditioned environment (although we do use exhaust fans in hot weather). This keeps our power consumption down to just the operational requirements of the racks (about 5 kilowatts each), but does put some constraints on environmental specifications for the computers we use as data nodes. So far, this approach has (for the most part) worked in terms of both computer and disk drive longevity.

Of course, disk drives all eventually fail. So we have an active team that monitors drive health and replaces drives showing early signs of failure. We replaced 2,453 drives in 2015, and 1,963 year-to-date 2016… an average of 6.7 drives per day. Across all drives in the cluster the average “age” (arithmetic mean of the time in-service) is 779 days. The median age is 730 days, and the most tenured drive in our cluster has been in continuous use for 6.85 years!

So what happens when a drive does fail? Items on that drive are made “read only” and our operations team is alerted. A new drive is put in to replace the failed one and immediately after replacement, the content from the mirror drive is copied onto the fresh drive and read/write status is restored.

Although there are certainly alternatives to drive mirroring for ensuring data integrity in a large storage system (ECC systems like RAID arrays, Ceph, Hadoop, etc.), Internet Archive chooses the simplicity of mirroring in part to preserve the transparency of data on a per-drive basis. The risk of ECC approaches is that in the case of truly catastrophic events, falling below certain thresholds of disk population survival means a total loss of all data in that array. The mirroring approach means that any disk that survives the catastrophe has usable information on it.

Over the past 20 years, Internet Archive has learned many lessons related to storage. These include: be patient in adopting newly introduced technology (wait for it to mature a bit!); with ambient air comes ambient humidity — plan for it; uniformity of infrastructure components is essential (including disk firmware). One of several challenges we see on the horizon is a direct consequence of the increases in disk density — it takes a long time to move data to and from a high-capacity disk. Across pair-bonded 1Gbps node interconnects, transferring data to or from an 8T drive requires 8 hours and 11 minutes at “full speed” and in practice can extend to several days with network traffic and activity interruptions. This introduces a longer “window of vulnerability” for the unlikely “double-disk failure” scenario (both sides of the mirror becoming unusable). To address this we are looking at increased speeds for node-to-node networking as well as alternative storage schemes that compensate for this risk.
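For a back-of-the-envelope feel for that number, the sketch below estimates the copy time; the usable capacity and assumed efficiency are illustrative guesses chosen only to land in the same ballpark as the figure quoted above.

```python
# Back-of-the-envelope estimate of how long it takes to copy one drive's
# contents across the node interconnect. The usable capacity and assumed
# efficiency are illustrative guesses, chosen only to land in the same
# ballpark as the figure quoted in the text.
def transfer_hours(capacity_tb, link_gbps, efficiency=1.0):
    bits = capacity_tb * 1e12 * 8                    # drive contents, in bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

# Pair-bonded 1 Gbps links give roughly 2 Gbps of raw throughput.
print(round(transfer_hours(7.3, 2.0), 1))            # roughly 8 hours for ~7.3 TB
```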

As a final note, I want to thank the small team of extremely hard-working individuals at Internet Archive who maintain and evolve the compute and storage infrastructure that enables us to pursue our mission and service our patrons. Without their hard work and dedicated service, we would not be able to store and preserve the knowledge and information that the community works hard to collect and curate.

Thank you to the 2015-2016 Core Infrastructure Team (and contributors):
Andy Bezella, Hank Bromley, Dwalu Khasu, Sean Fagan, Ralf Muehlen, Tim Johnson, Jim Nelson, Mark Seiden, Samuel Stoller, and Trevor von Stein

-jcg (John C. Gonzalez)

Posted in News | 12 Comments

FAQs for some new features available in the Beta Wayback Machine


The Beta Wayback Machine has some new features including searching to find a website and a summary of types of media on a website.

How can I use the Wayback Machine’s Site Search to find websites? The Site Search feature of the Wayback Machine is based on an index built by evaluating terms from hundreds of billions of links to the homepages of more than 350 million sites. Search results are ranked by the number of captures in the Wayback Machine and the number of relevant links to the site’s homepage.
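As a toy illustration of that ranking idea (the data and scoring formula below are invented for the example and are not our actual algorithm), a site's score can be thought of as combining its capture count with its count of relevant inbound links:

```python
# Toy illustration of ranking sites by capture count and relevant inbound
# links, as described above. The data and scoring formula are invented for
# the example; this is not the Wayback Machine's actual algorithm.
import math

sites = [
    {"host": "example.org", "captures": 120_000, "relevant_links": 5_400},
    {"host": "example.net", "captures": 800, "relevant_links": 9_000},
]

def score(site):
    # Log-scale both signals so neither one can dominate entirely.
    return math.log1p(site["captures"]) + math.log1p(site["relevant_links"])

for site in sorted(sites, key=score, reverse=True):
    print(site["host"], round(score(site), 2))
```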

Can I find sites by searching for words that are in their pages? No, at least not yet. Site Search for the Wayback Machine will help you find the homepages of sites, based on words people have used to describe those sites, as opposed to words that appear on pages from sites.

Can I search sites with text from multiple languages? Yes! In fact, you can search using any unicode characters. If you can generate characters with your computer, you should be able to use them to search for sites via the Wayback Machine. Go ahead, try searching for правда.

Can I still find sites in the Wayback Machine if I just know the URL? Yes, just enter a domain or URL the way you have in the past and press the “Browse History” button.

What is the “Summary of <site>” link above the graph on the calendar page telling me? It shows you the breakdown of the web captures for a given domain by content type (text, images, videos, PDFs, etc.). In addition, it shows the number of captures, URLs, and new URLs, by year, for all the years available via the Wayback Machine, so you can see how a certain site has changed over time.

What are the sources of your captures? When you roll over individual web captures (that pop up when you roll over the dots on the calendar page for a URL) you may notice some text links show up above the calendar, along with the word “why”. Those links will take you to the Collection of web captures associated with the specific web crawl the capture came from. Every day hundreds of web crawls contribute to the web captures available via the Wayback Machine. Behind each one there is a story about who, why, when, and how.

Why are some of the dots on the calendar page different colors? We color the dots, and links, associated with individual web captures, or multiple web captures, for a given day. Blue means the web server result code the crawler got for the related capture was a 2nn (good); Green means the crawlers got a status code 3nn (redirect); Orange means the crawler got a status code 4nn (client error), and Red means the crawler saw a 5nn (server error). Most of the time you will probably want to select the blue dots or links.

Can I find sites by searching for a word specific to that site? Yes, by adding in “site:<domain>” your results will be restricted to the specified domain. E.g. “site:gov clinton” will search for sites related to the term “clinton” in the domain “gov”.

Posted in News | 3 Comments

Open Library New Features and Fixes

The Open Library team has added pages for 200,000 new modern works and rolled out a brigade of fixes and features.

screen shot of book reader

Prioritized by feedback from Open Library patrons:

  • Full-text search through all books hosted on the Internet Archive is back online and is faster than ever. You can try the new feature, for example, to see over 115,000 places where works reference Benjamin Franklin’s maxim: “Little strokes fell great oaks”.
  • An updated Book Reader, which looks great on mobile devices and provides a much clearer and simpler book borrowing experience. Try out the new Book Reader and see for yourself!

There are a few small changes in the BookReader that we think you’ll particularly like. EPUB and PDF loans can now be initiated from within an existing BookReader loan. What this means for Open Library users is two pretty cool things you’ve long requested:

  • Users who start loans from the BookReader can borrow either EPUB or PDF formats, and switch formats during the loan period.
  • Users who start loans from the BookReader can return loans early, even EPUBs and PDFs.


screen shot showing onscreen areas to download and return books

We hope these changes will delight readers, empower developers, and help the community to make even more quality contributions. The path ahead looks even more promising. With clear direction and exciting redesign concepts in the works, the Open Library team is eager to bring you an Open Library at the cutting edge of the 21st century while giving you access to five centuries of texts.

image from old reading textbook

Thank you to Jessamyn West, Brenton Cheng, Mek Karpeles, Giovanni Damiola, Richard Caceres, and the many volunteers in the community.

[from the Open Library blog]

Posted in Announcements, Books Archive, Open Library | Tagged , , | 1 Comment

Beta Wayback Machine – Now with Site Search!

Wayback Machine with Site Search
For the last 15 years, users of the Wayback Machine have browsed past versions of websites by entering URLs into the main search box and clicking on Browse History. With the generous support of The Laura and John Arnold Foundation, we’re adding an exciting new feature to this search box: keyword search!

With this new beta search service, users will now be able to find the home pages of over 361 million websites preserved in the Wayback Machine just by typing in keywords that describe these sites (e.g. “new york times”). As they type keywords into the search box, they will be presented with a list of relevant archived websites with snippets containing:

  • a link to the archived versions of the site’s home page in the Wayback Machine
  • a thumbnail image of the site’s homepage (when available)
  • a short description of the site’s homepage
  • a capture summary of the site
    • number of unique URLs by content type (webpage, image, audio, video)
    • number of valid web captures over the associated time period

keyword search in wayback machine

Key Features

  • Search as you type
    • Instant results as you type — predictive, interactive and speedy
  • Multilingual
    • Search in any language or using symbols — expanding scope and utility
  • Site-based Filtering
    • Limit results to certain websites or domains using the site: operator (e.g. site:edu)

Behind the Scenes

  • Search index was built by processing over 250 billion webpages archived over 20 years
    • Index contains more than a billion terms collected from over 400 billion hyperlinks to the homepages of websites
  • Search results are ranked based on the number of relevant hyperlinks to the site’s homepage and the total number of web captures from the site

Example queries

We hope that this service, to search and discover archived web resources through time, will create new opportunities for scholarly work and innovation.

A big Thank You to: Vinay Goel, Kenji Nagahashi, Mark Graham, Bill Lubanovic, John Lekashman, Greg Lindahl, Vangelis Banos, Richard Caceres, Zijian He, Eugene Krevenets, Benjamin Mandel, Rakesh Pandey, Wendy Hanamura and Brewster Kahle

Posted in Announcements, News | 4 Comments

SHOWCASE: the GIF Collider at Berkeley Art Museum


Image from the GIF Collider, by Greg Niemeyer and Olya Dubatova

by Greg Niemeyer, Director, Berkeley Center for New Media

Have you ever wondered what happened to all the GIF animations that sparkled in the dawn of the internet? According to artists Greg Niemeyer and Olya Dubatova, they have become part of the digital subconscious, and the Berkeley Art Museum & Pacific Film Archive (BAMPFA) is presenting what that subconscious might look like, in an exhibit called GIF Collider.

Niemeyer studied both the Internet Archive’s collections of GIF animations and the Prelinger Film Archives from the 1950s. He noticed how the film archives, which include ads, educational films and propaganda, show a heavy gender and racial bias. In comparison, the GIF animations from forty years later reflect less gender and racial bias—but we can’t help but wonder: with more historical distance, what kinds of bias will become apparent in the future? What about these GIFs do we not see now that will be obvious in 50 years?

For three days, from Wednesday, October 26 to Friday, October 28, from dawn to dusk, BAMPFA invites you to ponder these questions as thousands of GIF animations emerge and collide on the huge public outdoor screen in a ballet of memory and erasure. Call it an outstallation; it’s free, and all it takes is a walk to the intersection of Addison Avenue and Oxford Avenue in Berkeley. The GIFs will be presented in several chapters, playing for 30 minutes of every hour.

A special showcase with music made for the GIF Collider by Paz Lenchantin (Pixies) and with live music performances by Trevor Bajus and Space Town is planned for Friday, Oct 28, from 6-8 pm.

For more information:

Posted in Announcements, News | Leave a comment

Defining Web pages, Web sites and Web captures

The Internet Archive has been archiving the web for 20 years and has preserved billions of webpages from millions of websites. These webpages are often made up of, and link to, many images, videos, style sheets, scripts and other web objects. Over the years, the Archive has saved over 510 billion such time-stamped web objects, which we term web captures.

We define a webpage as a valid web capture that is an HTML document, a plain text document, or a PDF.

A domain on the web is an owned section of the internet namespace, such as archive.org. A host on the web is identified by a fully qualified domain name, or FQDN, that specifies its exact location in the tree hierarchy of the Domain Name System. The FQDN consists of the following parts: hostname and domain name. As an example, in the case of the host blog.archive.org, its hostname is blog and the host is located within the domain archive.org.

We define a website to be a host that has served webpages and has at least one incoming link from a webpage belonging to a different domain.
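To make these definitions concrete, here is a small sketch that applies them to a handful of captures; the capture records, MIME types, and link data are invented for illustration.

```python
# A small sketch applying the definitions above to a handful of web captures.
# The capture records, MIME types, and link data are invented for illustration.
from urllib.parse import urlparse

WEBPAGE_TYPES = {"text/html", "text/plain", "application/pdf"}

captures = [
    {"url": "https://blog.example.org/post", "mime": "text/html", "status": 200},
    {"url": "https://blog.example.org/logo.png", "mime": "image/png", "status": 200},
]
# hosts of webpages on *other* domains that link to each host
inbound_links = {"blog.example.org": {"news.example.com"}}

def is_webpage(capture):
    # A webpage is a valid capture that is an HTML document, plain text, or a PDF.
    return capture["status"] == 200 and capture["mime"] in WEBPAGE_TYPES

def is_website(host):
    # A website is a host that has served webpages and has at least one
    # incoming link from a webpage belonging to a different domain.
    served = any(is_webpage(c) and urlparse(c["url"]).hostname == host for c in captures)
    return served and bool(inbound_links.get(host))

print(is_website("blog.example.org"))   # True under these toy records
```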

As of today, the Internet Archive officially holds 273 billion webpages from over 361 million websites, taking up 15 petabytes of storage.

Posted in Announcements, News, Web Archive | 3 Comments

Authors Alliance and Internet Archive Team Up to Make Books Available

by Michael Wolfe, Executive Director, Authors Alliance
To write a book takes time, effort, and, more often than not, love. Happily, books are built to last, and with the proper stewardship remain relevant, provide insight and information, or entertain for generations. So why is it that, when the internet provides more avenues than ever for making work accessible, the vast majority of books written in the last 100 years are out of print and largely unavailable? Authors Alliance has been working with its members to help recover their unavailable books and give them another public life. Since the release of our guide to Understanding Rights Reversion in 2015, we have provided information, assistance, and know-how to authors on the topic of recovering rights in order to bring back works that have fallen out of view. While many authors choose to make these recovered titles available commercially, a growing contingent has instead committed to ensuring their works endure in the public eye by making them available under Creative Commons licenses or dedicating them to the public domain. Many of our members’ titles are already discoverable through the HathiTrust digital library, and we are now partnering with the Internet Archive to make these works available in full on our new Authors Alliance Collection Page.

Authors Alliance members Robert Darnton, Joseph Nye, and Thomas Leonard are just some of the authors whose books are now freely available in full-text digital versions under Creative Commons licenses. Join them to rescue your previously published work from obscurity, safeguard your intellectual legacy, and help us build a robust Internet Archive collection. If you have regained rights to your previously published book(s) and would like to feature them in the Internet Archive and on Open Library, this guide to sharing your work is a good place to start. If you have any trouble, contact us! We can help take care of the details and will even handle the scanning and ingest of pre-digital works. And, if you have a backlist but haven’t yet begun the process of regaining rights, we can help with that too. Check out our guide to Understanding Rights Reversion and our guide to crafting a reversion letter to get started. You can always reach out to us directly to help get you on track to unlock your books, regain your rights, and give your work new life online. Contact us to get started, and help us build the Authors Alliance collection page in the Internet Archive!

Posted in Announcements, News | 4 Comments

Dewey Defeats Truman, Pence Defeats Kaine



Physicist Niels Bohr may or may not have been thinking about The Chicago Daily Tribune’s famously erroneous 1948 “Dewey Defeats Truman” headline when he wrote, “Prediction is very difficult, especially about the future.” In a case of déjà vu all over again, history repeated itself last week.


No one expects a political party to provide an objective analysis of a debate, so the Republican party’s verdict on the Vice Presidential debate came as no surprise:

“Americans from all across the country tuned in to watch the one and only Vice Presidential debate. During the debate we helped fact check and monitor the conversation in real time @GOP. The consensus was clear after the dust settled, Mike Pence was the clear winner of the debate.”

What did raise more than a few eyebrows was the timing: the “results” of the debate were announced over two hours before the debate actually began. Although it’s not unusual for party officials to prepare article outlines in advance, what was atypical was the timing of the post. A staffer noticed the mistake and took down the pages touting Pence’s accomplishment, but not before the Internet Archive’s Wayback Machine preserved a web capture of the site. It’s a fine example of how our archives serve as timely and unedited historical records.

Posted in Announcements, News | 3 Comments

Access to Knowledge in Canada

Internet Archive Canada asked Lila Bailey to report on the policy landscape for digital libraries in Canada. This is a summary of her report: Looking good.

On September 30th, the Canadian National Institute for the Blind transferred accessible books in audio format to Australia through the book service of the Accessible Books Consortium (ABC). This transfer occurred without the legal obligation to request permission from the copyright owners. This effort was made possible by the Marrakesh Treaty, which creates exceptions in copyright law for the print-disabled. As we previously noted, Canada was the 20th signatory to the treaty, triggering it to enter into force.

Canada has made great strides towards increasing access to human knowledge in recent years. Judicial and legislative developments have brought balance into the law, ushering in more opportunities for public access and use of copyright protected works. And now, with the Marrakesh Treaty entering into effect, it seems a good time to highlight Canada’s contributions to the world’s accessible digital heritage.

Our sister organization, Internet Archive Canada, has digitized more than 530,000 books, microreproductions, archival fonds, and maps. Libraries and institutions that have collaborated with, financially supported, and contributed material to IAC stretch across the entire country, from Memorial University in Newfoundland to University of Victoria in British Columbia. Internet Archive Canada has been working on accessibility projects, and has digitized more than 10,000 texts in partnership with the Accessible Content E-Portal. To date, this material has only been available to students and scholars within Ontario’s university system. Joining the Marrakesh Treaty now makes it possible for accessible versions of works to be shared more broadly within Canada, and with the other countries that have ratified the treaty.

Canadiana is another group that has helped to advance access to knowledge in Canada. Initially created by Canadian universities in 1978 to microfilm National Library collections, Canadiana has more recently worked to digitize Canadian heritage with a focus mainly on public domain printed materials. The University of Toronto Library has also developed full-text digital collections, primarily consisting of public domain materials. These special collections contain a wide variety of items, including over 200,000 books, over 600 archived versions of local government websites, Canadian pamphlets and broadsides, and a fine art repository, among many other materials. Similarly, the University of Alberta has developed an open access digital portal called Peel’s Prairie Provinces – a collection containing both an online bibliography of books, pamphlets and other materials related to the settlement and development of the Canadian West, as well as a searchable full-text collection of digital versions of many of these materials. The portal allows access to a diverse collection that includes approximately 7,500 digitized books, over 66,000 newspaper issues, 16,000 postcards and 1,000 maps.

The above are just a few examples of Canadian efforts to bring analog materials into digital form to allow increased access to knowledge. Many more such projects can be found via the Canadian National Digital Heritage Index (CNDHI). Supported by funding from Library and Archives Canada and the Canadian Research Knowledge Network, CNDHI is designed to increase awareness of, and access to digital heritage collections in Canada, to support the academic research enterprise and to facilitate information sharing within the Canadian documentary heritage community.

These digitization activities have made significant strides towards opening access to human knowledge in Canada; however, to date, these efforts have been piecemeal. In June of 2016, Library and Archives Canada (LAC) announced a National Heritage Digitization Strategy in order “to bring Canada’s cultural and scientific heritage into the digital era to ensure that we continue to understand the past and document the present as guides to future action.” The goal of the strategy is to provide a cohesive path toward the digitization of Canadian memory institutions’ collections, thus ensuring the institutions remain relevant in the digital age by making their collections easily accessible. LAC wishes to complement the current efforts of Canadian memory institutions such as those described above by ensuring that a national plan of action is in place.

The public policy landscape in Canada has been generally supportive of access to knowledge efforts. For example, the Canadian Supreme Court has interpreted certain legal provisions, called “fair dealing,” as expansive user rights that cannot be unduly constrained. In a case called CCH Canadian Ltd. v. Law Society of Upper Canada, the Court held that it was fair dealing for the Law Society’s Great Library to make photocopies of court decisions on behalf of attorneys. In Alberta v. Access Copyright, the Supreme Court held that it is fair dealing for teachers to copy short excerpts of copyrighted works for students in their classes. The Court found that such copying was done for the acceptable purpose of research and private study because, as a user right, the relevant perspective from which to consider the purpose was the user/student whose research and private study was furthered by the teacher’s copying. The Court also held that the “amount of the dealing” factor should not be assessed in the aggregate. Instead, the court must look at the amount copied in proportion to the length of the whole work.

In SOCAN v. Bell Canada, the Supreme Court reaffirmed the principles articulated in the Access Copyright case. Here, the Court held that a commercial platform’s offering of 30-second preview clips of musical works, streamed by users deciding whether to purchase the work, was also fair dealing for the purpose of research. The Court reiterated that the purpose must be assessed from the perspective of the user and not the commercial entity that was trying to sell the music. In each of these cases, the Supreme Court of Canada acknowledged fair dealing as the exercise of users’ rights that must be broadly interpreted.

As a result of these decisions, many Canadian educational institutions developed reasonable fair dealing guidelines which provide educators with a set of criteria for determining whether a particular instance of copying requires permission, or whether it is protected by fair dealing. For example, the University of Toronto’s Fair Dealing Guidelines provide a step-by-step analysis of whether a given use of a copyright protected work may be fair dealing, as well as a few more specific guidelines about what constitutes fair dealing, allowing more uses of copyrighted works without permission.

Additionally, the Canadian legislature passed the Copyright Modernization Act (CMA). The CMA added several important user-oriented provisions, including the addition of education, parody, and satire as acceptable fair dealing purposes. Taken together with the recent Supreme Court decisions discussed above, these changes mean that Canadian law now allows quite a bit more flexibility in using copyrighted works without permission.

The CMA allows private individuals to do more with copyright protected works without legal liability. For example, the CMA created the so-called “YouTube exception” which allows for non-commercial sharing of user-generated content that contains copyrighted material. The provision is designed to permit activity that many ordinary Internet users engage in regularly, such as creating mashups, or using a popular song in the background of a personal home video. This provision is subject to conditions (i.e., identification of the source and author, legality of the original work or the copy used, and absence of a substantial adverse effect on the exploitation of the original work).

A series of additional provisions protect consumers from liability for other “ordinary activities that are commonly accepted,” but which had previously remained illegal under Canadian copyright law. For example, the CMA now permits format shifting of personal copies of works, such as transferring a song from CD to an MP3 player. Similarly, the CMA permits time shifting of copyrighted materials for later listening, reading, or viewing. Finally, the law permits individuals to make backup copies of copyrighted works, provided that, among other things, the individual does not give any of the reproductions away to others. However, each of these expansions of user rights to permit format shifting, time shifting, and the creation of backup copies is subject to the condition that the creation of the reproduction not circumvent a “technological protection measure.” As such, they may not be as user-friendly in practice as they may appear on paper.

The CMA also expanded the use rights of libraries, museums, and archives. For example, the law now allows these institutions to format shift works in their permanent collections if the original is in a format that is obsolete, or if the technology required to use the original is unavailable or becoming unavailable. Further, libraries, museums, and archives can distribute certain materials digitally, provided that they take certain measures to protect the copyright owner’s rights. There is a similar allowance for unpublished works deposited in archives. The CMA also allows the use of publicly accessible online materials for educational purposes, provided that the source and author are attributed, and unless the works are protected by “digital locks.”

The CMA also revised the statutory damages provisions in a user-friendly manner. The law now distinguishes commercial from non-commercial infringements for the purposes of statutory damages awards. Specifically, where the “infringements are for non-commercial purposes”, the court may order between $100 and $5,000 in damages “with respect to all infringements involved in the proceedings for all works.” In other words, statutory damages in a proceeding for non-commercial infringement are now limited to $5,000, no matter how many works were infringed. Furthermore, in exercising its discretion to award statutory damages for non-commercial infringements, the court is to consider “the need for an award [of damages] to be proportionate to the infringements, in consideration of the hardship the award may cause to the defendant, whether the infringement was for private purposes or not, and the impact of the infringements on the plaintiff.”

These recent developments in Canadian law, in conjunction with its ratification of the Marrakesh Treaty, make the landscape ripe for further expansions of digital access to knowledge in the future. Internet Archive Canada will be exploring opportunities for partnerships and projects to bring more of Canada’s heritage into digital form and help the nation become an international leader in access to knowledge.


Posted in News | 2 Comments

Oct 26th Event: Celebrating 20 Years of Archiving the Web


The Web dwells in a never-ending present. It is—elementally—ethereal, ephemeral, unstable, and unreliable.

–Jill Lepore, from “The Cobweb: Can the Internet be Archived?” in the New Yorker, January 26, 2015

For twenty years, here at the Internet Archive, we’ve been trying to capture lightning in a bottle. How do you archive the “ethereal, ephemeral, unstable and unreliable” Web? Since 1996, that has been part of our daily work. We crawl the Web, preserve it, try to make it play back, as if you were back in 1999 on your own GeoCities page, delighting in that animated Under Construction GIF you just posted.

On October 26, 2016, we will be celebrating our 20th Anniversary, and we hope you will join us. We’ve been grappling with how to convey the enormity of our task. How do you visualize the universe of the Web—the audio, images, Web pages, and software that we’ve been archiving for the last 20 years? When you come to our celebration, we’ll be presenting the work of media innovators, each trying to capture the ephemeral Web:


One view from Cyberscape, Owen Cornec and Vinay Goel’s visualization of the top 800,000 Web sites

  • Cyberscape—Data visualization engineer Owen Cornec and Internet Archive data scientist Vinay Goel team up to create an interactive exploration of the top sites on the Web, as captured by the Wayback Machine as early as 1996.


  • Deleted Cities—Artist Richard Vijgen’s interactive visualization of GeoCities, once the Web’s largest online community. When Yahoo decided GeoCities was obsolete in 2009, the Internet Archive and Archive Team rushed to preserve tens of millions of GeoCities “homestead” pages before they were erased. Vijgen’s work takes you back to the neighborhoods and virtual cities where a vibrant society once lived online.

Paul D. Miller aka DJ Spooky will perform a newly commissioned piece on October 26.

  • DJ Spooky aka Paul D. Miller & media innovator Greg Niemeyer join forces to create an audio and video composition, drawn completely from media preserved in the Internet Archive. DJ Spooky’s work ranges from producing 14 albums to the DVD anthology, “Pioneers of African American Cinema,” about which the New York Times wrote “there has never been a more significant video release.”
  • How Media & Messaging are Shaping the 2016 Election—journalist and former Managing Editor of the Sunlight Foundation, Kathy Kiely, explains how short snippets—of debates, political ads, cable news—are altering the Presidential landscape. This analysis is made possible in part by the Internet Archive’s Political Ad Archive, preserving key ads and debates and monitoring how they are used in swing states.


  • Defining Memes & Memories—perhaps the world’s only Free Range Archivist, Jason Scott, takes you on a wild ride through 20 years of memes that captured the global imagination. From the original Keyboard Cat to Three Wolf Moon, Scott explores the Archive items and collections that rocked the world.

And to round out the evening, Internet Archive Founder Brewster Kahle will reflect upon his lifelong obsession—backing up the Web, making it more reliable and secure. Our work is just beginning, but if we are successful, new generations of learners will be able to access the amazing universe of the Web, learn from it, and build societies that are even better.

GET YOUR FREE TICKET TO “How to Build an Archive—20 Years in the Making.” Wednesday, October 26, 2016 from 5-9:30 p.m. at the Internet Archive, 300 Funston Avenue, San Francisco.


Posted in Announcements, News | 7 Comments

Internet Archive data fuels journalists’ analyses of how TV news shows covered prez debate

The presidential debate between Hillary Clinton and Donald Trump on September 26 drew an audience of 84 million, shattering records. It was also a first for the Internet Archive, which made data publicly available, for free, on how TV news shows covered the debate. These data were generated by the Duplitron, the open source tool that counts ad airings for the Political TV Ad Archive and can also track how TV news shows used specific video clips.

Download TV News Archive presidential debate data here.

Journalists took these data and crunched away, creating novel visualizations to help the public understand how TV news presented the debates.
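
To give a feel for the kind of crunching involved, here is a minimal, hedged sketch of one possible analysis: tallying how often each network re-aired debate clips, hour by hour. The file name and the column names (“network”, “start_time”) are hypothetical placeholders, not the dataset’s actual schema, and this is not the journalists’ actual code.

import csv
from collections import Counter
from datetime import datetime

def airings_per_network_per_hour(csv_path):
    # Count clip airings per (network, hour) bucket from a hypothetical CSV export.
    counts = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # "start_time" is assumed to be an ISO-8601 timestamp; bucket it by hour.
            hour = datetime.fromisoformat(row["start_time"]).strftime("%Y-%m-%d %H:00")
            counts[(row["network"], hour)] += 1
    return counts

# Example usage with an assumed file name:
for (network, hour), n in sorted(airings_per_network_per_hour("debate_clips.csv").items()):
    print(f"{hour}  {network:<10} {n:>4} airings")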

The New York Times created a visual timeline of TV cable news coverage in the 24 hours following the presidential debate, with separate lines for CNN, MSNBC, and Fox News. Below the timeline were short explanations of the peaks and how the different networks varied in their presentations even when they all covered roughly the same ground. The project was the work of Jasmine C. Lee, Alicia Parlapiano, Adam Pearce, and Karen Yourish. For much of the day on Sept. 29, it was featured at the top of the New York Times website.


To see more visualizations created by journalists using TV News Archive data following the first presidential debate, visit the Political TV Ad Archive.

The Internet Archive will make similar data available on the upcoming vice presidential debate, as well as the remaining presidential debates. This effort is part of a collaboration with the Annenberg Public Policy Center to study how voters learn about candidates from debates.



Posted in Announcements, News | 2 Comments

Guest Post: Preserving Digital Music – Why Netlabel Archive Matters

The following entry is by Simon Carless, who worked for the Internet Archive in the early 2000s before moving on to work in media and conferences, while continuing to maintain collections at the Internet Archive and running the free game information site MobyGames.

It’s fascinating that early Internet-era (digital) data can sometimes be trickier to preserve & access than pre-Internet (analog) data. A prime example is the amazing work of the Netlabel Archive, which I wanted to both laud and highlight as ‘digital archiving done right’.

Created in 2016 by the amazing Zach Bridier, the Netlabel Archive has preserved the catalogs of 11 early ‘netlabels’ and counting, a number of which involve music that was either completely unavailable online, or difficult to listen to online. One of these netlabels is the one that I ran from 1996 to 2009, Mono/Monotonik. So obviously, I’m particularly delighted by that project. But a number of the other netlabels are also great and previously tricky to access, and I’m even more excited for those. (Reminder: all these netlabels freely distributed their music at the time, which makes it a great thing to archive and bring back.)

The nub of the problem with early netlabels – particularly from 1996 to 2003 – is that PCs & the Internet (& pre-Internet BBSes!) just weren’t fast enough, and didn’t have enough storage, to support MP3 downloads at the time.

So this early netlabel music – on PCs and even other computers like Commodore Amigas – was written as small (in kB!) module files, which are composed and played back using sample data and MIDI-style ‘note triggering’ with rudimentary real-time effects. This allows 5-minute-long songs to be just 30 kB-300 kB in size, versus the 5 MB or more that an MP3 takes.
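
To make the size economics concrete, here is a minimal, hedged sketch that reads the header of a classic 31-sample, 4-channel ProTracker .MOD and adds up what actually lives in the file: a kilobyte or so per note pattern plus the raw sample data. It assumes the standard “M.K.” layout (other tracker variants differ), and the file name is just a placeholder.

import struct

def mod_size_summary(path):
    # Minimal sketch for a classic 31-sample, 4-channel ProTracker module ("M.K.").
    with open(path, "rb") as f:
        data = f.read()

    title = data[0:20].rstrip(b"\x00").decode("ascii", errors="replace")

    # 31 sample headers of 30 bytes each start at offset 20;
    # each sample length is stored big-endian, in 16-bit words.
    sample_bytes = 0
    for i in range(31):
        off = 20 + i * 30
        sample_bytes += struct.unpack(">H", data[off + 22:off + 24])[0] * 2

    pattern_table = data[952:1080]
    magic = data[1080:1084].decode("ascii", errors="replace")  # usually "M.K."

    # Each 4-channel pattern is 64 rows x 4 channels x 4 bytes = 1024 bytes.
    pattern_bytes = (max(pattern_table) + 1) * 1024

    print(f"{title!r} ({magic}): ~{pattern_bytes} B of patterns + "
          f"{sample_bytes} B of samples = ~{pattern_bytes + sample_bytes} B total")

mod_size_summary("song.mod")  # placeholder file name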

For the more recent history of netlabels, I founded the Netlabels collection at the Internet Archive back in 2003, and by 2016 that had grown to hold over 65,000 individual music releases – and hundreds of thousands of tracks. But the Internet Archive’s collection was largely designed to hold MP3 and OGG files, and so the early .MODs, .XMs and .ITs were not always preserved as part of this collection – and they were certainly not listenable in-browser.

Additionally, there were a number of netlabels that used their own storage instead of the Internet Archive’s, even after 2003. But if that hosting disappeared, their data disappeared with it, and music files are generally large enough not to be archived by the saintly Wayback Machine.

So where early netlabel archives survive at all, it’s as ZIP/LHA archives on demoscene FTP sites. (Netlabels were spawned from the demoscene to some extent, since demo soundtracks use the same .MOD and .XM formats.) And tracker music is annoyingly hard to play on today’s PCs and Macs – there are programs (such as VLC & more specialist apps) which do it, but it’s not remotely mainstream & not web browser-streamable.

So what Zach has done is keep the original .ZIP/.LHA files, which often had additional ASCII art & release info in them, save the .MODs and .XMs, convert everything to .MP3, painstakingly catalog all of the releases, and then upload the entire caboodle (both original and converted files) to both the Internet Archive and YouTube, where there are gigantic playlists for each label. So there are now multiple opportunities for in-browser listening & the original files are also properly preserved.
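
For anyone tempted to do something similar, here is a rough, hedged sketch of the conversion step only: walking a directory of module files and writing MP3 copies while leaving the originals untouched. It assumes an ffmpeg build with module support (for example via libopenmpt); the directory names are placeholders, and this is not Zach’s actual tooling.

import subprocess
from pathlib import Path

MODULE_EXTS = {".mod", ".xm", ".it", ".s3m"}

def convert_modules(src_dir, out_dir):
    # Convert tracker modules to streamable MP3s; the originals are never modified.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for module in sorted(Path(src_dir).rglob("*")):
        if module.suffix.lower() not in MODULE_EXTS:
            continue
        target = out / (module.stem + ".mp3")
        # -n: never overwrite an existing file; -q:a 2: high-quality VBR MP3.
        # Requires an ffmpeg build that can decode modules (e.g. with libopenmpt).
        subprocess.run(["ffmpeg", "-n", "-i", str(module), "-q:a", "2", str(target)],
                       check=True)

convert_modules("netlabel_releases", "netlabel_mp3s")  # placeholder paths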

This means we can now all easily browse and listen to the complete catalog of Five Musicians, a seminal early global PC tracker group/netlabel, as well as the super-neat Finnish electronic music netlabel Milk, the aggressive chiptune/noise label mp3death, and a host of others. And I recently uploaded a rare FTP backup from 1998 which allowed him to put up the 10 releases (that we know about!) from funky electronic netlabel Cutoff. These may have been partially online in databases like Modland, but certainly weren’t this accessible, complete, or well-collected.

What’s somewhat crazy about this is that we’re not even talking about ancient history here – at most, these digital files are 20 years old. And they’re already becoming difficult to access, listen to, or in a few cases even find.

For example, I had to dig deep into backup CD-ROMs to find some of the secret bootleg No’Mo releases that we deliberately _didn’t_ put on the Mono website back in 1996 – opting to distribute them via BBSes instead. These files literally didn’t exist on the Internet any more, despite being small and digital-native.

I think that’s – hopefully – the exception rather than the rule. But without diligent work by Zach (much kudos to him!) & similar work by other citizen digital activists like the 4am Apple II archiver, Jason Scott (obviously!) and a host of others, we’d have issues. And we may need more help still – some of these digital-first materials may disappear permanently, as the CD-ROMs or other media they are on become unreadable.

But we’re still doing a PRETTY good job on preservation, especially with CD-ROMs being ingested in massive amounts onto the Internet Archive regularly. (I’m working with MobyGames & another to-be-announced organization on preserving video game press CD-ROMs, for example, and Jason Scott’s CD-ROM work is many magnitudes larger than mine.)

Yet I actually think contextualization and access to these materials is just as big a problem, if not bigger. Once we’ve got this raw data, who’s available to look through it, pick out the relevant stuff, and make it easily viewable or streamable to anyone who wants to see it? That’s why the game art/screenshots on those press CD-ROMs are also being extracted and uploaded to MobyGames for easy Google Images access, and why Netlabel Archive’s work to put streamable versions of the music on the Internet Archive and YouTube is so vital. (And why playable-in-browser emulation work is SO very important!)

In the end, you can preserve as much data as you want, but if nobody can find it or understand it, well – it’s not for naught, but it’s also not the reason you went to all the trouble of archiving it in the first place. And the fact the Netlabel Archive does both – the preserving AND the accessibility – makes it a gem worth celebrating. Thanks again for all your work, Zach.


Posted in News | 1 Comment