Music Analysis Beginnings

As mentioned in our recent Building Music Libraries post, we are working with researchers at Columbia University and UPF in Barcelona to run their code on the music collection to help their research and to provide new analyses that could help with exploration and understanding.

We are doing some pilot runs to generate files, which close observers may notice in the music item directories. Audio fingerprints from audfprint are stored in .afpt files, and music attributes from Essentia are in _esslow.json.gz and _esshigh.json.gz files.
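For readers who want to peek inside one of the gzipped JSON files, here is a minimal Python sketch. It assumes only what the suffix implies, namely that the file is gzip-compressed JSON; the internal layout of the Essentia output is not described in this post, so the helper just reports whatever top-level keys it finds. The function names and the example filename are illustrative, not part of any Archive tooling.

```python
import gzip
import json


def load_gzipped_json(path):
    """Decompress a .json.gz file and parse it as JSON."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def top_level_keys(document):
    """Return the sorted top-level keys, or an empty list if the
    parsed document is not a JSON object."""
    if isinstance(document, dict):
        return sorted(document.keys())
    return []


if __name__ == "__main__":
    import sys

    # Usage: python inspect_attributes.py some_track_esslow.json.gz
    # (the filename is a hypothetical example)
    attributes = load_gzipped_json(sys.argv[1])
    print(top_level_keys(attributes))
```

Listing the keys first is a cautious way to explore a file whose schema you have not seen yet.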

Spectrogram of a Grateful Dead track

We are also creating image files showing the audio spectrum.  We hope this is useful for those who want to see whether files have been compressed in the past (even if they are posted as FLAC files now).  There is also a .png for each audio file showing a basic waveform, which is being used on the archive’s beta site as eye candy.

More as it happens, but we wanted you to know there is some progress and you will see some new files.  If you would like to propose other analyses that would benefit from being run over a large corpus, please let us know by contacting info at archive dot org.

Thank you to the researchers and the Archive programmers who are working together to make this happen.


Posted in Audio Archive, Live Music Archive, Music | Comments Off on Music Analysis Beginnings

Using Docker to Encapsulate a Complicated Program is Successful

The Internet Archive has been using docker in a useful way that is a bit out of the mainstream: to package a command-line binary and its dependencies so we can deploy it on a cluster and use it in the same way we would a static binary.

Columbia University’s Daniel Ellis created an audio fingerprinting program that was used in a competition.  It was not distributed as a Debian package or in any other packaged form.  It took a while for our staff to work out how to install it and its many dependencies consistently on Ubuntu, and it seemed pretty heavy-handed to install all of that on our worker cluster.  So we explored using docker, and it has been successful.  While this is old hat for some, I thought it might be interesting to explain what we did.

1) Created a docker file to make a docker container that held all of the code needed to run the system.

2) Worked with our systems group to figure out how to install docker on our cluster with a security profile we felt comfortable with.   This included running the binary in the container as user nobody.

3) Ramped up slowly to test the downloading and running of this container.   In general it would take 10-25 minutes to download the container the first time. Once cached on a worker node, it was very fast to start up.    This cache is persistent between many jobs, so this is efficient.

4) Used the container as we would a shell command, passing files into it by mounting a sub-filesystem for it to read from and write to.  This also helped with signaling errors.

5) Starting production use now.
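The steps above can be sketched in Python. This is a minimal illustration, not the Archive's actual wrapper: the image name, the `/data` mount point, and the function names are all hypothetical. It uses only standard `docker run` flags (`--rm`, `--user`, `--volume`, `--workdir`) and relies on the fact that `docker run` exits with the exit code of the command run inside the container, which is one plausible reading of how the container "helped with signaling errors".

```python
import shlex
import subprocess
from pathlib import Path


def build_docker_command(image, workdir, tool_args, user="nobody"):
    """Build the argv for running a containerized command-line tool.

    The host directory `workdir` is mounted at /data (a hypothetical
    mount point) so the tool can read inputs and write outputs there.
    """
    host_dir = Path(workdir).resolve()
    return [
        "docker", "run",
        "--rm",                           # discard the container afterwards
        "--user", user,                   # run unprivileged, e.g. as nobody
        "--volume", f"{host_dir}:/data",  # share files with the container
        "--workdir", "/data",
        image,
        *tool_args,
    ]


def run_in_container(image, workdir, tool_args):
    """Run the tool and return its exit code, as a shell command would."""
    command = build_docker_command(image, workdir, tool_args)
    return subprocess.run(command, check=False).returncode


if __name__ == "__main__":
    # Hypothetical image name and arguments, shown without executing.
    print(shlex.join(build_docker_command(
        "example/fingerprinter:latest", ".", ["--help"])))
```

Keeping the command construction separate from its execution makes the wrapper easy to inspect and test on machines that do not have docker installed.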

We hope that docker can help us with other programs that require complicated or legacy environments to run.

Congratulations to Raj Kumar, Aaron Ximm, and Andy Bezella for the creative solution to a problem that could have made it difficult for us to use some complicated academic code in our production environment.

Go docker!

Posted in Music, Technical | 3 Comments

SEEKING: Visual Studies PostDoc for an Exciting New Opportunity at Internet Archive!

Council on Library and Information Resources

Today, the Internet Archive and the Council on Library and Information Resources (CLIR) announced a new position:

Visual Data Curation Fellow

Do you know a recent Ph.D. in Visual Studies (film, photography, information sciences, fine art) who would like to work at the Internet Archive? We’re looking for a talented postdoc to come work with our growing Film Archive. This two-year position is based at the Internet Archive offices in San Francisco and runs from July 1, 2015 through June 30, 2017. We want to thank CLIR and the Andrew W. Mellon Foundation for a generous grant to support this position. For more information visit CLIR.  Applications are open through December 29.

Posted in News | Comments Off on SEEKING: Visual Studies PostDoc for an Exciting New Opportunity at Internet Archive!

Lost Landscapes of San Francisco: Fundraiser Benefitting Internet Archive — Friday, December 19, 2014

Rick Prelinger’s Lost Landscapes of San Francisco is back for one final performance this year!  Now you can catch this perennially sold-out show and your ticket donation will benefit the Internet Archive, a nonprofit digital library which hosts the Prelinger Collection. Please give generously to support the effort.

Friday, December 19, 2014
6 pm Reception
7:30 pm Film

300 Funston Ave.
San Francisco, CA 94118

Get tickets here!

This year’s LOST LANDSCAPES brings together familiar and unseen archival film clips showing San Francisco as it was and is no more. Blanketing the 20th-century city from the Bay to Ocean Beach and the Presidio to Bayview, this screening includes San Franciscans at work and play; early hippies in the Haight; a highly privileged walk on the unfinished Golden Gate Bridge; newly discovered images of Playland and the waterfront; families living and playing in their neighborhoods; detail-rich streetscapes of the late 1960s; peace rallies in Golden Gate Park; 1930s color images of a busy Market Street; a selected reprise of greatest hits from years 1-8; and much, much more.

As usual, the viewers make the soundtrack — audience members are asked to identify places and events, ask questions, share their thoughts, and create an unruly interactive symphony of speculation about the city we’ve lost and the city we’d like to live in.

The film begins at 7:30 pm and is preceded by an informal reception that begins at 6:00 pm.

Posted in Announcements, Event, News | 2 Comments

Inviting the Internet Over to Play


At our Annual Event last week, the Archive announced a variety of new projects and plans, including our new beta interface, our compact book scanner, and our progress in tracking political ads on television. The event (full video is here) went very well, with lots of activities and social gathering before and afterwards, and included the first public unveiling of our newest project, the Internet Arcade.

Photo by Kyle Way

It was obvious we were on to something – the smallish room with the two stations set up to play emulated arcade games from the collection was constantly packed. Players young and old tried out classic video games, including parents showing their children games they’d played in their own teenage years. All of it was running off the Archive’s own web pages through standard web browsers, with no special plug-ins – and it held up well. We even tracked high scores.


The party, of course, was just the beginning – over the weekend, we quietly announced that the Internet Arcade was available through the main site. With over 900 arcade machines in the collection, most every major machine released between 1976 and 1988 was included. (The emulation system we use, JSMESS, is a Javascript port of a long-running emulation project called MESS/MAME, which has had hundreds of contributors over the years – we salute them.)

After an initial tweet or two, the Arcade’s existence went from a mention by Waxy and Laughing Squid, to sites like Hacker News and Mashable, and from there it hit larger and larger audiences. Within a few hours news had spread to a whole range of sites, including Joystiq, The Verge, Engadget, CNN, PC World, Gizmodo, Ars Technica… and, well, let’s just say a very large number of sites were reporting on this story.

And that’s when the world showed up.

We’re still counting, but we know hundreds of thousands of people came, many of them all at once, to play.

And as these thousands of curious visitors and first-time callers came to the Archive to try out our collection, minor inefficiencies became showstoppers and the site was temporarily crushed. Our brave administration team persevered, repairs were made, and the site settled in for the new reality:

Everything’s fine and normal… then we crash and fix things… and WOW that’s a lot of new visitors!

This crush of new visitors is coming to the Internet Archive, possibly for the first time ever, and we welcome them with open arms. After all, that’s what we were founded for: our stated purpose is to function as the Internet’s Library, with stored websites, digitized texts, music, movies and software.  It’s our mission as a non-profit library: make as much of culture and information available to as many people as possible. You can lose a workday or a whole winter in our virtual stacks, and our users often do.

Meanwhile, the story continues to have legs, appearing in newspapers, on radio shows, video podcasts, and message boards around the world.

And then we made it to TV news:


So now that we have (apparently) the world’s attention… ahem ahem..

Even we don’t know where this story is going to lead. But one thing is sure – video games and software are as important a part of history and culture as books, movies and music have been in the past.  And we’re dedicated to bringing all of this to you, the Internet. Sure, it can be a bit surprising when the entire internet comes over to play, but we wouldn’t have put out the welcome mat if we didn’t want you to visit.

As a non-profit, we depend heavily on user donations to stay afloat – we even take Bitcoin and subscriptions. Keeping 20 petabytes of information flowing, fast and free, is what we’re working on day and night and the positive messages and feedback we’ve gotten this past week (and over the years) tell us we’re doing the right thing.

The JSMESS emulation project is one of many open-source projects the Internet Archive is involved with, and while a lot of it is fun and games, we’ve got a serious side too, gathering up disappearing web resources and important historical events into our archives to preserve for future generations. We hope that after you relive your childhood or live out a second new one, you’ll stick around and see what else we have here. It’s quite a place.

Game on!

Posted in Announcements, News, Software Archive | 3 Comments


Last week we announced a new beta version of the site.  The beta is the first step toward inviting people to participate in building libraries together.

2014 beta site

Why redesign the site?

The Wayback Machine was launched in 2001, and the current look of the site was debuted in 2002 when we added movies, texts, software, and music.  There have been minor design changes and we’ve added features over the years to make the library materials more usable, but the current interface has just accumulated over time.  We have not “rethought” the site in a holistic way in the past 12 years.

A lot has changed since 2002, for the Internet Archive and on the web.  In 2002 the archive contained 5,000 non-Wayback items, about half movies from the Prelinger Archive and half live music concerts from the community with a few books and pieces of software sprinkled in. Those 5,000 files added up to about 3 terabytes of data.  Today we have more than 20 million media items that add up to about 10,000 terabytes of data (that’s not including 435 billion saved web pages that take up an additional 10,000 terabytes of space).

As we added more stuff to the archive, people came to visit.  We ended 2002 with about 9,000 registered users.  Today we have just a hair under 2 million registered users, and around 2.5 million individuals use the library materials every day.

Having thousands of movies available on the Internet in 2002 was actually pretty rare (remember, YouTube didn’t exist until 2005). Those 5,000 media items couldn’t be played on our site: you had to download them to your own computer to watch or listen. It was very difficult to add your own files to the Internet Archive, and who would have had the bandwidth to do it anyway?  In 2002 only 21% of U.S. homes had “high speed” internet connections.  High speed back then meant 200 kb per second. [1]

And of course, we can’t forget mobile. About 20-30% of our users today are on mobile devices, and the current web site is not serving them well.

Over the years the archive has grown immensely in terms of material and patrons. Our mission is Universal Access to All Knowledge.  And we think we can do better both with Access and with gathering All Knowledge if we have new tools and a better interface for the site.

Why this interface?

We started talking about the redesign in January of this year.  (Well, honestly we’ve been talking about it since 2006, but this was the first serious, archive-wide project.)

First we found a wonderful Creative Director, David Merkoski, and hired a great designer, Kristen Schlott.  We interviewed people, both users of the archive and people who had never heard of us, and asked them questions about how they use media. We examined how our site was being used, and talked about the intricacies and complications that come with archiving 20 million disparate things. We researched how other sites deal with large amounts of media. We used our current collections and use cases to understand how different designs would perform. Our lead developer, Tracey Jaquith, built prototypes and we user tested them. We talked to some of our power users and partners about our plans and showed them the prototype to get feedback. We had a LOT of meetings.

Idea clustering after user interviews

During this process we realized that we needed to find a way to open the archive up to more participation.  The Internet Archive has built some important and useful collections, both with partners and on our own.  We digitize 1,000 books per day.  We archive 1 billion URLs every week.  We capture television 24 hours per day, every single day.  But there is a lot of media out there in the world, and we can’t save all of it for the future without the help of experts.

Who are the experts?  You!  There are some amazing collections of media in the archive, out on the web, and sitting around on shelves and in basements that have been created by the people who know and care the most about saving those things and making sure their collections are complete and well described.  We want to create a place for those people to build communities around their interests where they can safely store these amazing collections and show them to as many people as possible.  If we all work together, we can create the most useful library the world has ever seen.


Today the beta has the same basic functions as the current site, with some great additions: more visual cues to help you find things, facets on collections to quickly get you where you want to go, easy searching within collections, user pages, and many more.  We think it’s already an improvement over the current site – otherwise, we wouldn’t be showing it to you yet!

But the tools that will allow you to create your own collections and collaborate with others are still being built.  These features will be released in stages so that we can test them out in the beta and see how they work for people.  We will use feedback from patrons – both what you tell us, and the usage logs for the beta – to make decisions about how things will evolve. (Don’t worry, we aren’t keeping IP addresses — the beta respects user privacy.) When you’re in the beta, you’re going to run into things that might not work quite the way you expected, or that have suddenly changed since you used them yesterday. Sometimes it will be slow or you’ll find bugs. New things will appear, and other things may disappear. New tools will suddenly start working. We hope that for our intrepid beta users, this will be part of the fun. (Because we certainly think it’s fun!)


What new things are coming?

To some extent, this remains to be seen.  We will in part make decisions based on how the beta is used, so please use it!

Our current ideas include: speeding up the site; allowing patrons to create their own collections; improving accessibility for the print disabled; and adding ways for patrons to collaborate around collections and items.

There’s a lot more to come.  We hope you will explore all of these new options with us, and help us build the library.  If you would like to give us feedback, please write to us at info at archive dot org, or leave comments here.



Posted in Archive Version 2, News | 6 Comments

New York Times: The Internet Archive, Trying to Encompass All Creation

Thanks to the New York Times for doing a great write-up of our annual celebration.  Check it out!



Posted in News | Comments Off on New York Times: The Internet Archive, Trying to Encompass All Creation

NYtimes Readers: Try our beta website

The NYtimes article on us has our beta website address incorrect as, but please visit instead.


Posted in News | Comments Off on NYtimes Readers: Try our beta website

Invitation to Aaron Swartz Day Nov. 8 in SF

Saturday, November 8, 2014
Internet Archive
300 Funston Ave
San Francisco, CA 94118


The Internet Archive is hosting an Aaron Swartz Day Celebration on what would have been Aaron’s 28th birthday: November 8, 2014, from 6-10:30 pm.



Although we are looking ahead, rather than dwelling on the past, this year’s theme is “Setting the record straight.”

Now that we have brought people together and shared information with each other, the smoke has cleared a bit, and we can clearly explain to the world exactly what Aaron actually did and did not do.

Reception: 6pm-7pm – Come mingle with the speakers and celebrate Aaron’s accomplishments.

Speakers: 7pm-8pm – The Year in Aaron 2014: A comprehensive update.

Movie: 8-9:45 pm – Watch The Internet’s Own Boy with Director Brian Knappenberger.

Q&A: 9:45 – Audience Q & A with Brian Knappenberger and Trevor Timm (co-founder and Executive Director of the Freedom of the Press Foundation) after the movie!


April Glaser (EFF, Freedom to Innovate Summit)
The Freedom to Innovate Summit is a collaboration between EFF and the Center for Civic Media at MIT that calls upon Universities to protect students who innovate at the boundaries of the law.

Yan Zhu (Yahoo, SF Hackathon Organizer)
Yan will explain the history, and evolution to the present day, of the Aaron Swartz International Hackathon.

Brewster Kahle (Digital Librarian, Internet Archive)
Internet Archive has just launched a new set of tools for building collaborative libraries online that were inspired by Aaron’s dreams and visions.

Cindy Cohn (EFF Legal Director – CFAA Reform)
A short and simple update on a very complicated subject: Why most attempts to reform the Computer Fraud and Abuse Act have largely stalled in Congress.

Kevin Poulsen (Journalist – FOIA case that MIT intervened in)
An update on the most recent batch of documents and video from Aaron’s FBI and Secret Service files that have finally trickled out of the U.S. government over this last year, after undergoing further redactions by MIT.

Garrett Robinson and James Dolan (SecureDrop)
2014 was a big year for Aaron’s whistleblowing submission platform, with 15 new instances including:  Forbes, Greenpeace New Zealand, The Guardian, The Intercept, The New Yorker, BayLeaks, and The Washington Post.

Daniel Purcell (Keker & Van Nest, one of Aaron’s lawyers)
Along with Elliot Peters, Dan Purcell was hired by Aaron and his family in September 2012 to defend Aaron at his criminal trial, set for March 2013. Dan will talk about Aaron’s defenses to the criminal charges and the expert testimony the legal team planned to present.

The event will take place following this year’s San Francisco-based Aaron Swartz International Hackathon, which is going on Saturday and Sunday from 11am-6pm at the Internet Archive. Confirmed 2014 cities include:  Berlin, Boston, Buenos Aires, Houston, Kathmandu, Los Angeles, Magdeburg, New York, Oakland, Oxford, and San Francisco.


On November 8, Pivot is airing The Internet’s Own Boy: The Story of Aaron Swartz.  Check local listings.

For more information, contact:
Lisa Rein, Coordinator, Aaron Swartz Day

Posted in Announcements, News | 1 Comment

Building Libraries Together: New Tools for a New Direction


(NYtimes on this announcement, video of talks)

Let’s work together to save all human knowledge.  Today the Internet Archive is announcing a new beta site and new tools to encourage everyone to lend a hand.

Prototype Table Top Scribe for scanning books

We were founded in 1996 as an archive OF the Internet; we saved web pages and made them available through the Wayback Machine starting in 2001. In 2002 we became an archive ON the internet when we began digitizing and hosting movies, books, TV, music and software by working closely with libraries and online communities. Much of the work of building the current archive has been done by us and a relatively small number of selected partners.

Today marks a change in direction.

Listening Room

We are creating new tools to help every media-based community build their own collections on a long term platform that is available to the entire world for free. Collectors will be able to upload media, reference media from other collections, use tools to coordinate the activities of their community, and create a distinct Internet presence while also offering users the chance to explore diverse collections of other content.

In this future, communities and libraries will take the central role in building collections, leveraging the tools and storage of the Internet Archive.

Political campaign pilot interface

This effort is still in early development, and the Internet Archive is looking for feedback and help in this new direction.  Shaping these tools will be a joint process with our library and community partners.

Introducing new tools today, with further developments to come:

    • Table-top book scanner that works with back-end Archive technology and staff to create beautiful online books
Beta preview of

The Internet Archive needs your help to create and use these tools.   Your donations of time, money, digital and physical materials can help us Build Libraries Together.

Posted in Announcements, Archive Version 2, News | 4 Comments

Building Music Libraries

The Internet Archive is working with partners to preserve our musical heritage. The music collections started 8 years ago with the live music recordings and grew when we started hosting netlabels.

Scanning an LP cover

Now through new efforts and partnerships we have begun to expand and explore the music collections further.  We are working with researchers, record labels, collectors, internet communities and other archives to gather music media, build tools for preservation and expand metadata for exploration.

We have already made tremendous progress. We have archived millions of tracks, we are working with the Archive of Contemporary Music to digitize portions of their extensive collections of physical media, the community has provided meticulous metadata, and researchers from university programs have begun to analyze the music.

Listening Room

A prototype “listening room” in the Internet Archive’s building in San Francisco is available free to the public to listen to the full musical holdings.  Access to these collections will also be provided to select computer science researchers via a secure “virtual reading room” in our data center.  As tools and the collections grow, we will offer everyone access to the metadata to help them explore, and then offer links to commercial sites for listening or purchasing.

We invite interested people to participate:

Archives. The Internet Archive and the Archive of Contemporary Music in New York have started digitizing ACM’s holdings with consistent, high quality, standards-based methods to build a scalable workflow.  We welcome other archives with similar projects, or who would like to help.  “Digitizing our large physical collections is an important step for our archive to allow others to learn from this deep legacy,” said Bob George, Director of the Archive of Contemporary Music, NYC.


Digitizing CDs at the Archive of Contemporary Music

Collectors.  Digitize, donate, or lend material for digitization.  Improve metadata or provide context to help others understand the depth and cultural relevance of these collections.  “Recycled Records is happy to have directed the donation of many thousands of LPs to the Internet Archive to help with their projects and for the love of music,” said Bruce Lyall, proprietor of Recycled Records.

Labels.  Preserving a complete collection of everything published by a label is best done by or with the record label.  We would like to work with labels to get their releases archived and properly cataloged.  “The upcoming Music Libraries program continues the very work that enables our label, and the musicians who record for us, to bring the music of earlier times to audiences today. We are proud to participate in a tradition of preservation that has brought joy to so many through music,” said David Fox, co-founder of Musica Omnia.

Cataloging services.  Commercial and non-commercial cataloging services can participate by making sure there are proper links from and to these collections.  The open, community-created catalog has already been very helpful.


Commercial vendors and streaming services.  Links from these collections to commercial services can help users buy and listen to full tracks.  These services might have valuable metadata as well that can help users navigate.

Musicians and bands.  Please create more great works that libraries can preserve and provide access to.  We would like to hear your ideas about making the site useful for both musicians and the general public.

Researchers, historians, and music lovers.  Annotate, organize, datamine, and surface music in the collections, and help us preserve those works not yet in the collections.  “Access to a comprehensive archive of commercial music audio is the key missing link for research relating signal processing to listener behavior,” said Daniel Ellis, professor at Columbia University.  By analyzing the rhythms, keys, instruments, and genres, researchers will help create more complete metadata and aid discovery.

Looking to the future, we hope to expand these shared music collections by uniting the work done by other archives and collectors.  By bringing all of this music and its metadata into a shared library, we hope to bring the richness of our musical heritage to people all over the world.

Visit the Listening Room

Internet Archive
300 Funston Ave
San Francisco, CA 94118
Hours: Fridays from 1-4pm, or by appointment.

If you would like to participate in any way, please email us.

Posted in Announcements, Live Music Archive, Music, News | 4 Comments

Archive of Contemporary Music and the Internet Archive Team up to Create a Music Library

When the personal record collection of music producer Bob George hit 47,000 discs, he knew something had to be done.  “I wanted to give them away, but they were mostly punk, reggae and hip-hop,” he recalled, “and no established library or archive was interested.” The only thing to do, it would seem, was to turn his collection into a non-profit archive in New York called the ARChive of Contemporary Music.  29 years later, the ARC is one of the largest popular music collections in the world, with some three million sound recordings, 19,000 music-related books, and millions of photos, press kits and artifacts.  Now this rich musical resource, used primarily by musicologists and the entertainment industry, is teaming up with one of the largest digital libraries in the world, the San Francisco-based Internet Archive, to create a music library that will preserve and provide researcher access to a wide range of music and the rich materials that surround it.

Powered by teams of volunteers, the two archives are partnering to digitize CDs and LPs and then use audio fingerprinting to match tracks with metadata from catalogs and other services.  Using Internet Archive scanners, the ARC is digitizing its books and photographs at its New York facility.  When complete, this music library will be a rich resource for historians, musicologists and the general public.

Listening Room

Starting today, the public can listen to millions of tracks for free, including many that are not available in Spotify or iTunes, at the Internet Archive’s new listening room in San Francisco.  “The Internet Archive has allowed us to move forward at unprecedented speed, originally with book scanning and now with the digitization of a wide range of audio formats,” said Bob George.  “The physical records from around the world that the ARC has archived are a unique treasure,” said Brewster Kahle, founder and digital librarian of the Internet Archive. “Soon these records will be studied in new ways because they will be digital as well.”

Since 1985, George, the ARC’s co-founder and director, has run the organization in Tribeca, New York City, supported by friends in the music industry including Paul Simon, David Bowie and Nile Rodgers.  The Rolling Stones guitarist Keith Richards endows a collection of blues and R&B recordings there. Filmmakers Martin Scorsese and Jonathan Demme stop by when trying to track down hard-to-find songs.  Yet for most of its almost three decades, the ARC has been a decidedly “analog” experience:  records, CDs and cassette tapes line its walls; to experience a song you usually have to drop a needle into a pristine vinyl groove.  The collaboration with the web-based Internet Archive represents a new direction.  “We feel that our primary mission, to collect and preserve this material, is near completion,” said Bob George. “Now we are seeking ways to allow greater access to this incredible collection.”

Scanning an LP cover

The Internet Archive may be best known for the 435 billion web pages in its Wayback Machine, but this digital library has always been a place where live music collectors go to preserve concerts on the web.  Its audio collections include some 130,000 live concerts by bands such as the Grateful Dead, Jack Johnson and Smashing Pumpkins, many with more than a million plays. Recently, the ARC shipped 46,000 seventy-eight rpm recordings to the San Francisco-based non-profit, and has donated tens of thousands of long-playing records. Music labels Musica Omnia and Other Minds are making their entire collections searchable at the Archive, in part because the Internet Archive is one of the few online platforms that preserves audio, texts, musical manuscripts, photos and films and makes them accessible forever, for free.

The Internet Archive listening room is now open to the public for free on Fridays from 1-4 pm, holidays excepted, and by appointment at 300 Funston Avenue, San Francisco, CA.  Those interested in donating physical music collections to the ARC or Internet Archive should contact or


Posted in Announcements, News | Comments Off on Archive of Contemporary Music and the Internet Archive Team up to Create a Music Library

Archive-It: Crawling the Web Together

A post by the Archive-It team

Today Phase 1 of the 5.0 release of the Archive-It web application was released for use by the 326 partners using the Archive-It service.

In 1996 when the Internet Archive was founded, we used automated crawlers to capture the web, snapping up millions of web pages and preserving them for history. Ironically, our digital record of humankind was being driven by computer algorithms.

As the years went by, it became clear that we needed people and communities to capture and save what is really and truly important. So in February 2006 we launched the Archive-It service, 1.0, which allowed traditional librarians and archivists to become web archivists by initiating focused, curated crawls of the live web using a simple web application with partner/tech support. Launching Archive-It meant we could help our colleagues create their own web collections for their own libraries and also foster a community around web archiving to work together to build a global digital public library at

Now, as we expand to the next generation of Archive-It with our 5.0 release, we hope to provide even greater tools for collection development. Released this week, 5.0 phase 1 highlights a shiny new user interface and significantly enhanced post-crawl reports that include infographics with visual representations of the data.

Figure 1: Screenshot from the Reports section of the new Archive-It 5.0 user interface

Back in 2006 there was little understanding of web archiving, and many organizations were questioning whether this was a valid activity that could or should be a part of their larger institutional collecting strategies. After all, the challenges were staggering: the quality of web content was all over the map; conflicting policies and organizational structures posed challenges; no one had yet established best practices for selecting content, handling metadata, or integrating this new type of content into other holdings and existing catalogs at the institution. Also, back then we could not have predicted the extent to which material that once existed in physical form would now only appear on the web in digital form.

We launched the Archive-It service with a small band of believers and supporters, among them librarians and archivists from Indiana University, University of Texas at Austin, Library of Virginia, Montana State Library, and North Carolina State Archives and State Library. Partners were very patient with us and with Archive-It 1.0, which was bare bones. Collaborating and working with the library and archive community has always been a top priority for the Internet Archive, and a defining characteristic of the Archive-It service. There have been many times during the past 8+ years when we have not known the answer to a question and we say: “Let’s ask the community and see what they think!” And the community has always gotten back to us with supportive answers – both illustrative and specific.

Figure 2: Screenshot from the North Carolina State Government Web Site Archive of the North Carolina State Archives and State Library of North Carolina.

As time went on, the community of web archivists grew and we were able to produce some compelling answers to the question: why web archive? Here are just a few:

  • To create a thematic or topical web archive
  • To fulfill a mandate to preserve institutional memory and history
  • To archive state or local agency publications no longer being deposited in print form
  • To archive records to meet university or government retention policies
  • To preserve a historical record of an institution’s web and/or social media presence
  • To capture a website before it is redesigned or taken offline
  • To archive online art, exhibitions, and artists’ materials


Figure 3: Screenshot from the Latin American Government Documents Archive, LAGDA of the University of Texas at Austin.

Figure 4: Screenshot from the Catalogues Raisonnés collection of the New York Art Resources Consortium (NYARC).

To date in 2014, 326 Archive-It partners have created 2,700 public collections on a wide range of topics, subjects, events and domains. These collections have become integral to these organizations’ collecting strategies and have helped to raise awareness and understanding about why web archiving is so important.

We like to say that the Archive-It service is both a partner and a vendor. We are a service provider and we strive to consistently deliver a high level of customer support, which we believe partners notice and appreciate. We also strive to be a partner to our community and work collaboratively on initiatives that we share, a few of which are: a) collaborative efforts around archiving spontaneous events (like the 2011 Japanese Earthquake collection), b) teaching web archiving in graduate-level MLIS programs and professional development workshops and c) the K12 Web Archiving program (now in its 7th year), where we work with 3rd to 12th graders around the country and ask them what they would like to archive for future generations. As one of the student archivists put it, “500 years from now, kids will think we were really cool.”

Many of the features and functionality that we see in the Archive-It service today are a direct result of a partner making a suggestion or request. Through face to face brainstorming sessions, online surveys, webinars, and support tickets, partners have expressed their ideas as well as offered constructive criticism. And we have listened.   We hope that as the service continues to grow and we launch Archive-It 5.0 that many of our partners will see themselves in Archive-It. Their collections will continue to be valuable to researchers, historians, scholars and the general public for many years to come.

Here are some links to just a few of those collections on the Archive-It website:

Columbia University’s collection on Human Rights:

National Museum of Women in the Arts’s collection on Contemporary Women Artists on the Web:

University of Alberta’s Circumpolar Collection:

Brigham Young University’s Mormon Missionary Collection:

Stanford University’s collection on Freedom of Information (FOIA):

As we continue down this road – excited for the future and what comes next – we know that it takes a community to archive the web and we look forward to working with our partners to build libraries together.

Posted in Announcements, Archive-It, News | 3 Comments

Media, Money & Elections: 2014 Philly Political Media Ad Watch

Philadelphia-region Political Media Ad Watch is a pilot project that allows citizens and journalists to go online to search every political message in the Philly television market, compare all the ads from a single sponsor (sample: Tom Wolf for Governor), positive and negative, and trace back who is paying for those ads.

She’s Dishonest!
He’s in Bed with an Accused Mobster!
This is what television audiences in Pennsylvania and Southern New Jersey are hearing a lot of this season. And it’s not Judge Judy or the Jerry Springer Show. Nope. It’s the deeply disturbing reality television show of our nation’s mid-term elections.

Dark accusations run back-to-back with heartwarming assurances of compassion.  All financed by increasingly unfettered flows of cash from ever more veiled donors.

Voters have a right to know who’s paying for these messages. And this flood of commercials raises a few critical questions for our democracy:

  • With so much heat, where can citizens find the light they need to make thoughtful choices?
  • Are the local media, many of whom make big bucks on election advertising, doing a good job giving voters the information and context they need to make sound decisions on Election Day?
  • Can we establish a baseline of metrics to evaluate the performance of local media during elections?

The project is a collaboration between the Internet Archive, the Sunlight Foundation, Philadelphia’s Committee of Seventy (a non-partisan government watchdog), the University of Delaware’s Center for Community Research & Service and the Linguistic Data Consortium at the University of Pennsylvania. It immediately enables local media to do a better job sifting between fact and fiction in political messaging and revealing financial sources of political influence.

In the coming year, University of Delaware researchers will sift project data to answer some basic questions about how local media is serving the public:

  • To what extent, if any, do local television news broadcasts examine the claims that are made in the political ads that appear on the newscasts?
  • Do the broadcasts cover the same issues that are the subject of political ads? If so, which issues are covered, which issues are not covered?
  • How much time is devoted to that coverage? Where does that coverage appear in the newscasts?

And in the long term, our pioneering work in the Philadelphia-region will help us create an affordable and technically scalable model to answer these questions in local markets nationwide leading up to the 2016 elections.

One of the exciting features of this project is that it brings cutting-edge technology together with campaign finance expertise and grassroots good-government advocates in Philadelphia to potentially provide a vastly greater understanding of who funds our political system and how they influence campaigns on the ground. Each of these organizations has a strong potential impact on its own; together, we have the ability to amplify the rich, revealing information that can move voters and sway debate toward better outcomes.

What We’re Doing

The Internet Archive is recording, indexing for search and presenting online Philadelphia TV Market Area television news (the market includes 22 counties in Pennsylvania and southern New Jersey); indexing for search all political ads therein; creating an interface for trained volunteers to identify and tag political advertising; joining indexed ads with sponsor information databases; making news and ads searchable, quotable and embeddable; and capturing and presenting, in a full-text searchable database, much of the region’s Web media ecosystem.

The Sunlight Foundation is training volunteer political ad sponsorship coders; creating adaptations of the Influence Explorer interface and database to include real-time Pennsylvania state campaign data; developing specialized optical character recognition algorithms for extracting Public Inspection File disclosures on sponsorship for TV political ad buys on its Political Ad Sleuth database; conducting outreach to journalists and others for their collaboration and use of resources for stories; integrating ad sponsor data into related Sunlight Foundation data tools and APIs; and working with the Internet Archive to sync up sponsorship data with the actual ads in the same interface.

The Committee of Seventy is organizing a team of volunteers; acting as liaison with Philadelphia-region civic organizations; conducting outreach to area press; and providing guidance on issues and political candidates to track.

The University of Delaware’s Center for Community Research & Service at the School of Public Policy & Administration will conduct an analysis of the broadcast news programs in the Philadelphia television market, aired September 1 through Election Day, November 4.  After Election Day, the University team will conduct content analysis to address the research questions above and publish findings next year.

The Linguistic Data Consortium at the University of Pennsylvania is providing technical support and advice regarding the Internet Archive’s broadcast monitoring in the Philadelphia area.

Project Resources

  • View all identified political TV ads
  • Watch video tour guide to using Philly-region TV news search
  • Search just Philadelphia content from the TV News Archive
  • Philadelphia stations’ political ad sponsor reports to FCC
  • Archived Philadelphia web media ecosystem sites (key word searchable)

Project Advisors

Kathleen Hall Jamieson, the Elizabeth Ware Packard Professor of Communication at the Annenberg School for Communication; and Walter and Leonore Annenberg Director of the Annenberg Public Policy Center at the University of Pennsylvania.

Travis N. Ridout, the Thomas S. Foley Distinguished Professor of Government and Public Policy and Associate Professor in the School of Politics, Philosophy and Public Affairs at Washington State University; and co-director of the Wesleyan Media Project.

David Westin, former president of ABC News; Founding CEO of NewsRight, a digital start-up spun off from the AP; and now Principal of Witherbee Holdings, LLC.

Supported in part by grants and other contributions from:

David Glassco
Democracy Fund
Rita Allen Foundation
Hawthorn Family Fund
Buck Foundation (NYC)
Kahle/Austin Foundation
John S. and James L. Knight Foundation
Philadelphia Foundation, from an anonymous contributor to their donor-advised funds

Project Collaborator Contacts

Internet Archive – Roger Macdonald
Sunlight Foundation – Kathy Kiely
University of Delaware – Danilo Yanich
Committee of Seventy – Ellen Kaplan
Linguistic Data Consortium – Denise DiPersio





Posted in Announcements, News | 6 Comments

Invitation to the Internet Archive Annual Event


Posted in Announcements, News | 1 Comment

Please Help Protect Net Neutrality

Please stand with the Internet Archive to Protect Net Neutrality by writing to your congressperson. Today, many organizations are putting “Internet Loading” symbols on their sites to raise awareness of what is at stake for those of us who would be at the mercy of the cable and phone companies, which could selectively slow down our sites for profit or just because they may not like our policies.

China started blocking the Internet Archive again a couple of months ago, we believe, because they do not like our open access policies. In this way, we have started to understand the power in the hands of the Internet service providers. Let’s keep our access to Internet sites “Neutral” and not at the discretion of companies and governments.

Please write to your congressperson.

Posted in Announcements, News | 2 Comments

Millions of historic images posted to Flickr

by Robert Miller, Global Director of Books, Internet Archive


“Reading a book from the inside out!” Well, not quite, but a new way to read our eBooks has just been launched. Check out this great BBC article:

Here is the fabulous Flickr Commons collection:

And here is our welcome to Flickr’s Commons post:

What is it and how did it get done?
A Yahoo research fellow at Georgetown University, Kalev Leetaru, extracted over 14 million images from 2 million Internet Archive public domain eBooks that span over 500 years of content.  Because we have OCR’d the books, we have now been able to attach about 500 words before and after each image. This means you can now see, click and read about each image in the collection. Think full-text search of images!
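For the curious, the idea of attaching surrounding text to each image can be sketched in a few lines of Python. This is only a rough illustration, not the code that was actually used: it assumes a book’s OCR text is available as a flat list of words and that each image’s position is known as an index into that list. The function name `context_window`, the `radius` parameter and the toy data are all hypothetical.

```python
def context_window(words, image_index, radius=500):
    """Return (before, after) word lists surrounding an image position.

    words       -- OCR text of the book as a list of words (assumed layout)
    image_index -- number of words that precede the image (assumed known)
    radius      -- how many words to keep on each side
    """
    if not 0 <= image_index <= len(words):
        raise ValueError("image_index out of range")
    start = max(0, image_index - radius)           # clamp at start of book
    end = min(len(words), image_index + radius)    # clamp at end of book
    return words[start:image_index], words[image_index:end]


# Toy OCR text: 1,200 placeholder words, with an image after word 700.
ocr_words = [f"w{i}" for i in range(1200)]
before, after = context_window(ocr_words, image_index=700)

# Text that could be stored alongside the image to make it searchable.
searchable_text = " ".join(before + after)
```

Images near the start or end of a book simply get a shorter window on that side, which is why the slice boundaries are clamped rather than padded.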

How many images are there?
As of today, 2.6 million of the 14 million images have been uploaded to Flickr Commons. Soon we will be able to add continuously to this collection from the more than 1,000 new eBooks we scan each day. Dr. Simon Chaplin, Head of the Wellcome Library, says, “This way of discovering and reading a book will help transform our medical heritage collection as it goes up online. This is a big step forward and will bring digitized book collections to new audiences.”

What is fun to do with this collection?
Try typing in the word “telephone” and enjoy the images that appear. Curious about how death has been characterized over 500 years of images – type in “mordis”. Feeling good about health care – type in “medicine” and prepare to be amazed. Remember, all of these images are in the public domain!

Future plans?
We will be working with our wonderful friends at Flickr and our great Library partners to make this collection even more interesting –  more images, more sub-collections and some very interesting ideas of how to use some image recognition tools to help us learn more about, well, anything!

Questions about this collection, projects or things to come?
Email me at

Posted in Books Archive, Cool items, News | 35 Comments

Zoia Horn, librarian and activist, dies

Ms. Horn presenting The Zoia Horn Intellectual Freedom Award to the Internet Archive’s Brewster Kahle

July 12, 2014 marked the passing of an extraordinary librarian, Zoia Horn. Ms. Horn was best known in library circles for spending three weeks in jail in 1972 for having refused to testify before a grand jury regarding information relating to Philip Berrigan’s library use. Ms. Horn stated: “To me it stands on: Freedom of thought — but government spying in homes, in libraries and universities inhibits and destroys this freedom.”

Throughout her life, Ms. Horn was at the forefront of the protection of academic and intellectual freedom, especially in libraries. She was an outspoken opponent of the PATRIOT Act. She won numerous awards for her work, and a Zoia Horn Intellectual Freedom Award was inaugurated in 2004 by the California Library Association.

The Internet Archive is proud to have been a recipient of that award in 2010, and Brewster Kahle was presented with the award by Ms. Horn herself.

Along with so many others who have fought for freedom, we will greatly miss Ms. Horn, and we honor her memory by continuing her work.

Zoia Horn’s autobiography (read online)

Posted in Announcements, News | 2 Comments

Free the Screenshots!

As the Archive moves more widely into the archiving of software, it quickly becomes apparent that there’s going to be an awful lot of programs online without much indication of what they are. With many thousands of programs or program collections to choose from, determining what might be inside becomes a pretty involved task.

In the case of movies, images and texts, there are previews that help show what is contained in the files in a given item. These are extremely helpful, as they not only show the quality or style of the works, but give all sorts of information that might not be reflected in the metadata.

Starting now, the same will be true for many types of software.


The Atari 800 graphical masterpiece Astro Chase.

Using a combination of the JSMESS emulator and screen capturing software, the Archive has begun automatic “playing out” of sets of programs, snagging shots of what the software does, and then providing it as a guidepost of what is to come with that program.

For example, work has just been completed on the playable Sega Genesis Library, where the directory view of the items in the collection shows helpful screenshots, and individual games show animated playthroughs of the beginning of the cartridge.

The process is still evolving – currently it requires real-time capture (that is, capturing the first five minutes of a program takes an actual five minutes), but with multiple machines moving through collections, screenshots will be available for huge numbers of programs in coming weeks and months.

Along with the obvious graphical prettiness comes an even greater cultural benefit: the freeing of screenshots.

As these shots have often been done manually or have been gathered by hand, there has arisen a tendency to put watermarks or credits on the images to indicate who did the work. While it’s an understandable urge to want some kudos for the effort, it meant that the very work being lauded (the graphics of the program) was being vandalized to ensure credit where credit was due.

None of the screenshots we are generating will have watermarks, and all of them can be used freely for other purposes as you see fit.

To celebrate this, we’ve created a compilation of all the Sega Genesis screenshots generated by the project so far. The compilation is here. Be warned – it’s 4.3 gigabytes of 16,900 screenshots of 573 cartridges! (There’s a way to browse it at this link.)

Many screenshots are simply informative, but many more are truly works of art, as artists and programmers strained the edges of these underpowered machines to create the most evocative images possible. With this screenshotting effort underway, that work will hopefully get a new life and respect on the web.

Free the Screenshots!




Posted in News, Software Archive | Comments Off on Free the Screenshots!

Working to Stop Rewriting Copyright Laws via TPP Treaty

The Internet Archive joined Our Fair Deal along with EFF and Public Knowledge to stop the US from using the Trans-Pacific Partnership treaty to change our copyright laws. The coalition sent two open letters to TPP negotiators today on critical issues that you can learn about here. Let’s foster open debate and proper process before further changes to copyright laws restrict public access even more.

Please consider joining this coalition.

Posted in Announcements, Books Archive, News | Comments Off on Working to Stop Rewriting Copyright Laws via TPP Treaty