Microphone Check: Thousands of Hip-Hop Mixtapes at the Archive

The Internet Archive has been growing an interesting sub-collection of music for the past few months: Hip-Hop Mixtapes. The resulting collection still has a way to go before it’s anywhere near what is out there (limited by bandwidth and a few other technical factors), but now that it’s past 150 solid days of music on there, it’s quite enough to browse and “get the idea”, should you be so inclined.

Note: Hip-Hop tends to be for a mature audience, both in subject matter and language.

I’m sure this is entirely old knowledge for some people, but it was new to me, so I’ll describe the situation and the thinking.


There are some excellent introductions and writeups about mixtapes in Hip-Hop culture at these external articles:

So, in quick summary, there have been mixtapes of many varieties for many years, going back to the 1970s and the dawn of what we call Hip-Hop, and throughout the time since, the “tapes” have become CDs and ZIP files and are now still being released out into “the internet” to be spread around. The goal is to gain traction and attention for your musical act, or for your skills as a DJ, or any of a dozen reasons related to getting music to the masses.

There is an entire ecosystem of mixtape distribution and access. There are easily tens of thousands of known mixtapes that have existed. This is a huge, already-extant environment out there, that was established, culturally critical, and born-digital.

It only made sense for a library like the Internet Archive to provide it as well.

There’s a lot coded into the covers of these mixtapes (not to even mention the stuff coded into the lyrics themselves) – there’s stressing of riches, drug use, power, and oppression. There’s commentary on government, on social issues, and on the meaning of entertainment and celebrity. There’s parody, there’s aggrandizement, and there’s every attempt to draw in the listeners in what is a pretty large pile of material floating around. It’s not about this song or that grandiose portrait, though – it’s about the fact this whole set of material has meaning, reality and relevance to many, many people.

How do I know this has relevance? Within 24 hours of the first set of mixtapes going onto the Archive, many of the albums already had hundreds of listeners, and one of them broke a thousand views. Since then, a good amount have had tens of thousands of listens. Somebody wants this stuff, that’s for sure. And that’s fundamentally what the Archive is about – bringing access to the world.

The end goal here is simple: Providing free access to huge amounts of culture, so people can reference, contextualize, enjoy and delight over material in an easy-to-reach, linkable, usable manner. Apparently it’s already taken off, but here you go too.

Get your drank on here.

Posted in Announcements, Music, News | 2 Comments

Wayback Machine captures Melania Trump’s deleted internet bio

Melania Trump’s personal website is now gone from the internet — but is preserved by the Internet Archive’s Wayback Machine — after a Huffington Post reporter and other news outlets began questioning elements of the would-be First Lady’s biography.

Yesterday Christina Wilkie, a national political reporter for the Huffington Post, published a story noting that Melania Trump’s elaborate website, www.melaniatrump.com, which existed as recently as July 20, now redirects to the Trump Organization’s official website. The removal of the website followed questions about a biography that appeared on it, which claimed that Melania Trump had “earned a degree in design and architecture at University in Slovenia.”

Many media outlets have followed suit, writing that the website has now disappeared.

Today Melania Trump tweeted that the website was taken down because “it does not accurately reflect my current and professional interests.”

Screenshot 2016-07-28 13.13.40


Wilkie and other reporters had questioned whether Trump truly obtained those degrees from the university. The inquiries took on new potency after she was accused of possible plagiarism in her speech before the Republican National Convention last week. The campaign has not answered questions about the biography. Snopes.com has reported that there is no “University of Slovenia.”

Meanwhile, Melania’s original biography is preserved on the Internet Archive’s Wayback Machine, which crawls websites to create a historical archive. The most recent snapshot was taken on July 20 — see the screenshot below.

Screenshot 2016-07-28 13.00.56


The Political TV Ad Archive is tracking and archiving political ads in the 2016 elections. In addition, we’ve set up a special Archive-It collection to track candidates’ and political organizations’ social media websites here, with more than 320 million captures to date.

Cross posted on the Political TV Ad Archive. July 29: quote from Melania Trump’s defunct website corrected.

Posted in Announcements, News | 9 Comments

Pokébarbarians at the Gate

Millions of people from around the world visit the Internet Archive every day to read books, listen to audio recordings, watch films, use the Wayback Machine to revisit almost half a trillion web pages, and much more. Lately, though, we’ve had a different kind of visitor: gaggles of Pokémon Go players.

(In case you’ve been living in a cave without Internet connectivity for the last month, Pokémon Go is an augmented reality Internet game. Participants on three different teams band together to find and capture as many types of Pokémon as they can, sending Nintendo a goldmine of personal data in the process.)


It turns out that the stairs of the Internet Archive’s San Francisco headquarters are a PokéGym, a site where players can train their Pokémon and fight with other Pokémon. Fortunately, the Pokémon warriors aren’t rowdy or disruptive; they resemble somnambulistic zombies stumbling around under the control of their glowing smartphone screens.

As Jean Cocteau noted, “Fashion is everything that goes out of fashion.” Pokémon will join pet rocks, beanie babies, and chia pets in the annals of popular fads sooner rather than later. Perhaps then the gamers will take advantage of their Internet devices to discover that the Internet Archive has much more to offer than the ephemeral, pixelated creatures outside of our doors.

Posted in Announcements, News | 7 Comments

The Copyright Office is trying to redefine libraries, but libraries don’t want it — Who is it for?

The Library Copyright Alliance (which represents the American Library Association and the Association of Research Libraries) has said it does not want changes; the Society of American Archivists has said it does not want changes. The Internet Archive does not want changes, DPLA does not want changes… So why is the Copyright Office holding “hush hush” meetings to “answer their last questions” before going to Congress with a proposed rewrite of the section of Copyright law that pertains to libraries?

This recent move, which has its genesis in an outdated set of proposals from 2008, is just another in a series of out-of-touch ideas coming from the Copyright Office. We’ve seen them propose “notice and staydown” filtering of the Internet and disastrous “extended collective licensing” for digitization projects. These and other proposals have led some to start asking whose Copyright Office this is, anyway. Now the Copyright Office wants to completely overhaul Section 108 of the Copyright Act, the “library exceptions,” in ways that could break the Wayback Machine and repeal fair use for libraries.

We are extremely concerned that Congress could take the Copyright Office’s proposal seriously, and believe that libraries are actually calling for these changes. That’s why we flew to Washington, D.C. to deliver the message to the Copyright Office in person: now is not the time for changes to Section 108. Libraries and technology have been evolving quickly. Good things are beginning to happen as a result. Drafting a law now could make something that is working well more complicated, and could calcify processes that would otherwise continue to evolve to make digitization efforts and web archiving work even better for libraries and content owners alike.

In fact, just proposing this new legislation will likely have the effect of hitting the pause button on libraries. It will lead to uncertainty for the libraries that have already begun to modernize by digitizing their analog collections and learning how to collect and preserve born-digital materials. It could lead libraries who have been considering such projects to “wait and see.”

Perhaps that’s the point. Because the Copyright Office’s proposal doesn’t seem to help libraries, or the public they serve, at all.

Posted in Announcements, News | 13 Comments

Is it 1968? Not really — but past convention video clips show controversy

Research by Robin Chin

Is it 1968? Many pundits have been asking this question in recent days, in the lead-up to what is expected to be a contentious–and, some worry, violent–GOP convention in Cleveland, where Donald Trump is expected to accept the GOP nomination. A spate of mass gun killings, the deaths of two African American men in recent weeks at the hands of police, the murder of five police officers by a sniper during a demonstration and then three more by a lone gunman in Baton Rouge, terrorism here and abroad, involvement overseas in intractable conflicts, growing economic inequality — none of these developments quite parallel the tumultuous events of the 1960s. But the situation was volatile then, and it’s volatile now.

To set the scene, thanks to the TV News Archive, the Internet Archive’s online free library of TV news clips, revisiting some of the more “crazy” conventions of years past (headline by Politico), or simply notable or controversial moments, is just a search away. All of these clips are editable, embeddable, and shareable on social media.

Chicago, 1968

When the Democrats met in Chicago in 1968, it was in the shadow of the assassinations of Martin Luther King and Democratic primary candidate Robert Kennedy. Vice President Hubert Humphrey had the support of some 60 percent of the delegates, largely local party leaders — people who would be super delegates today. Although Humphrey was a liberal, his support of the war as Lyndon B. Johnson’s vice president made him unpopular in the anti-war movement.

As described by Politico, “With Humphrey’s nomination all but certain, protesters associated with the Youth International Party (the Yippies) and National Mobilization Committee to End the War in Vietnam (the MOBE) took to the streets outside Chicago’s convention hall; inside, city policemen allied with the local political machine roughed up liberal delegates and journalists in plain view of news cameras. ‘I wasn’t sentenced and sent here!’ a prominent New York Democrat bellowed as a uniformed officer dragged him off the floor. ‘I was elected!’”

The clip below, from the CNN documentary series “The Sixties,” shows police beating up protesters on the streets. A special commission appointed to investigate the protests characterized the violent events as a “police riot” directed at protesters and recommended prosecution of police who used indiscriminate violence.

That same night, Humphrey took to the podium to accept the nomination. He referred to the violence outside when he said, “[O]ne cannot help but reflect, the deep sadness that we feel over the troubles and the violence which have erupted regrettably and tragically in the streets of this great city and for the personal injuries that have occurred. Surely we have now learned the lesson that violence breeds counter violence and it cannot be condoned whatever the source.”

San Francisco, 1964

In 1964, GOP moderates Nelson Rockefeller and George Romney, then governor of Michigan, led an unsuccessful campaign against conservative insurgent Barry Goldwater, at a convention Goldwater biographer Robert Alan Goldberg later dubbed the “Woodstock of the right.” (Romney was former presidential candidate Mitt Romney’s father.) Goldwater was a fierce opponent of the Civil Rights Act and strong supporter of military intervention against the Soviet Union.

Some have compared him to Trump because of his belligerence and unpopularity with the establishment Republicans. For example, like Trump, he was not one to mince words about his enemies. At the convention, when asked by a reporter about LBJ and the Civil Rights Act, he replied, “He’s the phoniest individual who ever came around.”

The convention was raucous, filled with delegates booing the moderates — as when Rockefeller called on the crowd to reject extremists. But the moment most remembered was when Goldwater took the podium to accept the nomination, when, to enormous applause, he said:

“I would remind you that extremism in the defense of liberty is no vice. [applause] And let me remind you also that moderation in the pursuit of justice is no virtue.”

Goldwater went on to lose the election, badly, to Lyndon B. Johnson.

Other historic moments

The TV News Archive is full of many other convention speech clips of moments that turned history’s tide. Here, for example, is John F. Kennedy, accepting the Democratic nomination in 1960, stating that voters should not “throw away” their vote because of concern about his religious affiliation. He went on to become the first Catholic president of the United States.

And here is Richard Nixon, in his 1968 nomination speech, talking about the increase in crime and criticizing those who say “law and order” was code for racism. He was speaking to the charged issues surrounding race and policing at the time:

“Time is running out for the merchants of corruption…and to those who say law and order is a code word for racism there and here is the reply. Our goal is justice for every American. If we are to have respect for law in America we must have laws that deserve respect.”

Nixon’s words, however, have a doubly ironic ring today. First, because the debate over policing in the African American community stubbornly persists decades later. And second, because of his own role in covering up the Watergate scandal, which involved dirty tricks against the Democrats during the 1972 campaign. Nixon would eventually resign from the presidency in 1974. Three years later, in 1977, the journalist David Frost asked Nixon under what circumstances a president can do something illegal. Nixon’s famous answer: “Well, when the president does it, that means that it is not illegal.”

For those wanting to plumb the riches of past convention speeches, below is a list, with links, of most major convention speeches by nominees, starting with Harry Truman in 1948 and going to Barack Obama in 2012. The speeches were broadcast on C-Span.

1948: Harry Truman acceptance speech at Democratic National Convention in Philadelphia, PA Part 1.

Harry Truman acceptance speech at Democratic National Convention in Philadelphia, PA Part 2.

1952: Adlai Stevenson acceptance speech at Democratic National Convention in Chicago, IL Part 1.

Adlai Stevenson acceptance speech at Democratic National Convention in Chicago, IL Part 2.

1956: Republican Convention and Eisenhower’s nomination Universal newsreel.

Dwight D. Eisenhower acceptance speech at Republican National Convention in Daly City, CA Part 1.

Dwight D. Eisenhower acceptance speech at Republican National Convention in Daly City, CA Part 2.

1960: John F. Kennedy acceptance speech at 1960 Democratic National Convention in Los Angeles, CA Part 1.

John F. Kennedy acceptance speech at 1960 Democratic National Convention in Los Angeles, CA Part 2.

Former President Herbert Hoover speech at Republican National Convention Chicago, IL.

Henry Cabot Lodge VP acceptance speech at Republican National Convention Chicago, IL.

1964: Barry Goldwater acceptance speech at Republican National Convention Daly City, CA.

Robert Kennedy speech at Democratic National Convention Atlantic City, NJ.

Lyndon Johnson acceptance speech Atlantic City, NJ Part 1.

Lyndon Johnson acceptance speech Atlantic City, NJ Part 2.

1968: Spiro Agnew VP acceptance speech at Republican National Convention in Miami Beach, FL.

Richard Nixon acceptance speech at Republican National Convention Miami Beach, FL.

Hubert Humphrey acceptance speech at Democratic National Convention Chicago, IL NBC News.

1972: McGovern acceptance speech at Democratic National Convention Miami Beach, FL Part 1.

McGovern acceptance speech at Democratic National Convention Miami Beach, FL Part 2.

Richard Nixon acceptance speech at Republican National Convention Miami Beach, FL.

Richard Nixon acceptance speech at Republican National Convention Miami Beach, Florida NBC News.

1976: Barbara Jordan keynote speech at Democratic Convention New York, NY.

Jimmy Carter acceptance speech at Democratic National Convention New York, NY Part 1.

Jimmy Carter acceptance speech at Democratic National Convention New York, NY Part 2.

August 17, 1976 Republican National Convention Kansas City, MO delegates debating Ronald Reagan rule requiring Ford to name VP before they vote CBS News Part 1.

August 17, 1976 Republican National Convention Kansas City, MO includes delegates debating Ronald Reagan rule C16 requiring Ford to name VP before they vote CBS News Part 2.

Gerald Ford acceptance speech at the Republican National Convention Kansas City, MO Part 1.

Gerald Ford acceptance speech at the Republican National Convention Kansas City, MO Part 2.

Ronald Reagan endorsement speech of Gerald Ford as Presidential Nominee at Republican National Convention Kansas City, MO.

1980: Ronald Reagan acceptance speech at the Republican National Convention Detroit, MI.

Ted Kennedy speech at Democratic National Convention in New York. Kennedy was a rival for the Democratic presidential nomination.

Jimmy Carter acceptance speech at Democratic National Convention in New York, NY Part 1.

Jimmy Carter acceptance speech at Democratic National Convention in New York, NY Part 2.

1984: Geraldine Ferraro VP acceptance speech at Democratic National Convention San Francisco, CA.

Walter Mondale acceptance speech at Democratic National Convention San Francisco, CA Part 1.

Walter Mondale acceptance speech at Democratic National Convention San Francisco, CA Part 2.

Ronald Reagan acceptance speech at Republican National Convention Dallas, TX.

Mario Cuomo keynote speech at Democratic National Convention San Francisco, CA.

1988: Ann Richards keynote speech at Democratic National Convention Atlanta, GA.

Michael Dukakis acceptance speech at Democratic National Convention Atlanta, GA Part 1.

Michael Dukakis acceptance speech at Democratic National Convention Atlanta, GA Part 2.

Dan Quayle VP acceptance speech at Republican National Convention New Orleans, LA.

George H.W. Bush acceptance speech at Republican National Convention New Orleans, LA.

1992: Barbara Jordan speech at Democratic National Convention New York, NY.

Al Gore VP acceptance speech at Democratic National Convention New York, NY.

Bill Clinton acceptance speech at the Democratic National Convention New York, NY.

Pat Buchanan Keynote speech at Republican National Convention Houston, TX.

Ronald Reagan speech at Republican National Convention Houston, TX Part 1.

Ronald Reagan speech at Republican National Convention Houston, TX Part 2.

George H. W. Bush acceptance speech at the Republican National Convention Houston, TX.

1996: Jack Kemp VP acceptance speech at Republican National Convention San Diego, CA.

Bob Dole acceptance speech at Republican National Convention San Diego, CA.

Hillary Clinton speech at the Democratic National Convention Chicago, IL.

Bill Clinton acceptance speech at the Democratic National Convention Chicago, IL. (Currently not available on the TV News Archive.)

2000: Dick Cheney VP 2000 acceptance speech at Republican National Convention in Philadelphia, PA.

George W. Bush acceptance speech at Republican National Convention in Philadelphia, PA Part 1.

George W. Bush acceptance speech at Republican National Convention in Philadelphia, PA Part 2.

Al Gore acceptance speech at Democratic National Convention in Los Angeles, CA.

2004: Barack Obama keynote speech at Democratic National Convention Boston, MA. (Currently not available on the TV News Archive.)

John Edwards speech at Democratic National Convention Boston, MA.

John Kerry acceptance speech at Democratic National Convention Boston, MA.

John McCain speech at Republican National Convention New York, NY.

Laura Bush speech at Republican National Convention New York, NY.

George W. Bush acceptance speech at Republican National Convention New York, NY. (Currently not available on the TV News Archive.)

2008: Ted Kennedy speech at Democratic National Convention Denver, CO.

Michelle Obama speech at Democratic National Convention Denver, CO.

Bill Clinton speech at Democratic National Convention Denver, CO.

Joe Biden VP portion of acceptance speech at Democratic National Convention Denver, CO.

Barack Obama acceptance speech at Democratic National Convention Denver, CO.

Sarah Palin VP acceptance speech at Republican National Convention St. Paul, MN.

Cindy McCain speech at Republican National Convention St. Paul, MN.

John McCain acceptance speech at Republican National Convention St. Paul, MN.

2012: Barack Obama acceptance speech at Democratic National Convention Charlotte, NC CSPAN coverage.

Mitt Romney acceptance speech at Republican National Convention Tampa, FL CSPAN coverage.

Posted in Announcements, News | 2 Comments

New Rita Allen Foundation grant fuels political ad tracking through Election Day

As the Democrats and Republicans convene at their national party conventions in coming weeks, the general election kicks into full swing. Thanks to generous support from the Rita Allen Foundation, we are delighted to announce that the Political TV Ad Archive, a project of the Internet Archive, will be ramping up to track political ads airing in eight key battleground states in the lead-up to Election Day.

The $110,000 grant will enable Political TV Ad Archive to continue the work begun during the primary months, when the project tracked more than 145,000 airings of ads in 23 markets in key primary states. The project uses audio fingerprinting algorithms to track occurrences of ads backed by candidates, political action committees, “dark money” nonprofit groups and more—all linked to information on where and when ads have aired, sponsors, subjects and messages.




The website provides a searchable database of all the political ads archived, and all ads are embeddable and shareable on social media. In addition, the underlying metadata on the frequency of ad airings is available for downloading, and journalists from such outlets as The Washington Post, Fox News, and FiveThirtyEight.com have used it to inform reporting, visualizations, and other creative uses to put these ads in context for readers. The Political TV Ad Archive also partners with respected journalism and fact checking organizations, such as the Center for Responsive Politics, PolitiFact, and FactCheck.org.

The Rita Allen Foundation supported the initial development of the Archive’s technology through a pilot project, the Philly Political Media Watch Project, which collected ads aired in the Philadelphia region in the lead-up to the 2014 midterm election. The Rita Allen Foundation also helped to sponsor the primary election phase of the Political TV Ad Archive, which received funding from the Knight News Challenge on Elections.


Posted in News | 4 Comments

Unlocking Books for the Blind and Visually Impaired

The Internet Archive has been making print materials more accessible to the blind and print disabled for years, but now, with Canada joining the Marrakesh Treaty, our sister organization, Internet Archive Canada, might be able to serve people in many more countries.

In 2010, we launched the Open Library Accessible Books collection, which now contains nearly 2 million books in accessible formats. Our sister organization, Internet Archive Canada, has also been working on accessibility projects, and has digitized more than 8500 texts in partnership with the Accessible Content E-Portal, which is on track to have over 10,000 items available in accessible formats by the end of the month.

On June 30th, Canada tipped the scales towards broader access to books for all by joining the Marrakesh Treaty. This move will allow the Treaty to go into effect on September 30, 2016 in the nations where it has been ratified, so that print-disabled and visually impaired people can more fully and actively participate in global society.

The goal of the Marrakesh Treaty is to help to end the “book famine” faced by people who are blind, visually impaired, or otherwise print disabled. Currently only 1% to 7% of the world’s published books ever become available in accessible formats. This is partly due to barriers to access created by copyright laws–something the Treaty helps to remove.

The Marrakesh Treaty removes barriers in two ways. First, it requires ratifying nations to have an exception in their domestic copyright laws for the blind, visually impaired, and their organizations to make books and other print resources available in accessible formats, such as Braille, large print, or audio versions, without needing permission from the copyright holder. Second, the Treaty allows for the exchange of accessible versions of books and other copyrighted works across borders, again without copyright holder permission. This will help to avoid the duplication of efforts across different countries, and will allow those with larger collections of accessible books to share them with visually impaired people in countries with fewer resources.

The first 20 countries to ratify or accede to the Marrakesh Treaty were: India, El Salvador, United Arab Emirates, Mali, Uruguay, Paraguay, Singapore, Argentina, Mexico, Mongolia, Republic of Korea, Australia, Brazil, Peru, Democratic People’s Republic of Korea, Israel, Chile, Ecuador, Guatemala and Canada. People in these countries will soon start realizing the tangible benefits of providing access to knowledge to those who have historically been left out.

To date this material has only been available to students and scholars within Ontario’s university system. The Marrakesh Treaty now makes it possible for these works to be shared more broadly within Canada, and with the other countries listed above. Hopefully the rest of the world will take note, and join forces to provide universal access to all knowledge.

Posted in Announcements, Books Archive, News | 2 Comments

Those Hilarious Times When Emulations Stop Working

Jason Scott, Software Curator and Your Emulation Buddy, writing in.

With tens of thousands of items in the archive.org stacks that are in some way running in-browser emulations, we’ve got a pretty strong library of computing history afoot, with many more joining in the future. On top of that, we have thousands of people playing these different programs, consoles, and arcade games from all over the world.

Therefore, if anything goes slightly amiss, we hear it from every angle: Twitter, item reviews, e-mails, and even the occasional phone call. People expect to come to a software item on the Internet Archive and have it play in their browser! It’s great this expectation is now considered a critical aspect of computer and game history. But it also means we have to go hunting down what the problem might be when stuff goes awry.

Sometimes, it’s something nice and simple, like “I can’t figure out the keys or the commands” or “How do I find the magic sock in the village?”, which puts us in the position of a sort of 1980s Software Company Help Line. Other times, it’s helping fix situations where some emulated software is configured wrong and certain functions don’t work. (The emulation might run too fast, or show the wrong colors, or not work past a certain point in the game.)

But then sometimes it’s something like this:


In this case, a set of programs were all working just fine a while ago, and then suddenly started sending out weird “Runtime” errors. Or this nostalgia-inducing error:


Here’s the interesting thing: The emulated historic machine would continue to run. In other words, we had a still-functioning, emulated broken machine, as if you’d brought home a damaged 486 PC in 1993 from the store and realized it was made of cheaper parts than you expected.

To make things even more strange, this was only happening to emulated DOS programs in the Google Chrome browser. And only Google Chrome version 51.x. And only in the 32-bit version of Google Chrome 51.x. (A huge thanks to the growing number of people who helped this get tracked down.)

This is what people should have been seeing, which I think we can agree looks much better:


The short-term fix is to run Firefox instead of Chrome for the moment if you see a crash, but that’s not really a “fix” per se – Chrome has had the bug reported to them and they’re hard at work on it (and working on a bug can be a lot of work). And there’s no guarantee an update to Firefox (or the Edge Browser, or any of the other browsers working today) won’t cause other weird problems going down the line.

All this, then, can remind people how strange, how interlocking, and even fragile our web ecosystem is at the moment. The “Web” is a web of standards dancing with improvisations, hacks, best guesses and a radically moving target of what needs to be obeyed and discarded. With the automatic downloading of new versions of browsers from a small set of makers, we gain security, but more-obscure bugs might change the functioning of a website overnight. We make sure the newest standards are followed as quickly as possible, but we also wake up to finding out an old trusted standard was deemed no longer worthy of use.

Old standards or features (background music in web pages, the gopher protocol, Flash) give way to new plugins or processes, and the web must be expected, as best it can, to deal with the new and the old and fail gracefully when it can’t quite do it. As part of the work of the Decentralized Web Summit was to bring forward the strengths of this world (collaboration, transparency, reproducibility) while pulling back from the weaknesses of this shifting landscape (centralization, gatekeeping, utter and total loss of history), it’s obvious a lot of people recognize this is an ongoing situation, needing vigilance and hard work.

In the meantime, we’ll do our best to keep up with how the latest and greatest browsers deal with the still-fresh world of in-browser emulation, and try to emulate hardware that did come working from the factory.

In the meantime, enjoy some Apple II programs. On us.

Posted in Emulation, Software Archive, Technical | 2 Comments

Decentralized Web Server: Possible Approach with Cost and Performance Estimates

At the first Decentralized Web Summit, Tim Berners-Lee asked if a content-addressable peer-to-peer server system scales to the demands of the World Wide Web. This is meant to be a partial answer to a piece of the puzzle. For background, this might help.

Decentralized web pages will be served by users, peer-to-peer, but there can also be high-performance super-nodes which would serve as caches and archives. These super-nodes could be run by archives, like the Internet Archive, and ISPs who want to deliver pages quickly to their users. I will call such a super-node a “Decentralized Web Server” or “D-Web Server” and work through a thought experiment on how much it would cost to have one that would store many webpages and serve them up fast.

Web objects, such as text and images, in the Decentralized Web are generally retrieved based on a computed hash of the content. This is called “content addressing.” Therefore, a request for a webpage from the network will be based on its hash rather than contacting a specific server. This object can be served from any D-Web server without worrying that it will be faked, because the recipient rehashes the contents and checks that the result matches the hash it asked for.
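To make the idea concrete, here is a minimal sketch of content addressing. It assumes SHA-256 as the hash and uses a plain dictionary as a stand-in for a D-Web server's storage; the function names `content_address` and `fetch_verified` are made up for illustration and do not come from any real system.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return the hex SHA-256 digest, used here as the object's address."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(store: dict, address: str) -> bytes:
    """Fetch an object from an untrusted store and verify it by rehashing."""
    data = store[address]
    if content_address(data) != address:
        # The server returned something other than what was asked for.
        raise ValueError("content does not match requested hash")
    return data


# Hypothetical store: any server could hold this mapping.
page = b"<html>hello, decentralized web</html>"
address = content_address(page)
store = {address: page}

print(fetch_verified(store, address) == page)
```

Because the check depends only on the bytes received, it does not matter which server answered the request.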

For the purposes of this post, we will use the basic machines that the Internet Archive currently uses as a data point. These are 4U-height machines with 24 cores, 250TBytes of disk storage (on 36 drives), 192GB of RAM, and a 2Gbit/sec network connection, and they cost about $14k each. Therefore:

  • $14k for 1 D-Web server

Let’s estimate that the average compressed decentralized web object is 50KBytes (an object is a page, JavaScript file, image, or movie: the things that make up a webpage). This is larger than the Internet Archive web crawl average, but it’s in the ballpark.

Therefore, if we use all the storage for web objects, then that would be 5 billion web objects (250TB/50KB). This would be maybe 1 million basic websites (each website would have 5 thousand web pieces which I would guess is much more than the average WordPress website, though there are of course notable websites with much more). Therefore, this is enough for a large growth in the decentralized web and it could keep all versions. Therefore:

  • Store 5 billion web objects, or 1 million websites
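The storage estimate can be checked with a few lines of Python. The constants are the figures from this post, using decimal units (1TB = 10^12 bytes); the variable names are only for illustration.

```python
SERVER_STORAGE_BYTES = 250 * 10**12  # 250 TBytes of disk per machine
AVG_OBJECT_BYTES = 50 * 10**3        # estimated 50 KBytes per compressed web object
OBJECTS_PER_SITE = 5_000             # guessed number of objects per website

objects_stored = SERVER_STORAGE_BYTES // AVG_OBJECT_BYTES
sites_stored = objects_stored // OBJECTS_PER_SITE

print(f"{objects_stored:,} objects, {sites_stored:,} websites")
```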

How many requests could it answer? Answering a decentralized website request would mean asking “do I have the requested object?” and, if yes, serving it. If this D-Web server is one of many, then it may not have all webpages on it, even though it seems we could probably store all pages for a long part of the growth of the Decentralized Web.

Let’s break it into two types: “Do we have it?” and “Here is the web object”. “Do we have it?” can be done efficiently with a Bloom filter. It works by taking the request, hashing it eight times, and looking those bits up in RAM to see if they are set. I will not explain it further than to say an entry can take about 3 bytes of RAM and questions can be answered very, very fast. Therefore, the lookup array for 5 billion objects would take 15GB, which is a small percentage of our RAM.
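For readers who want to see the idea, here is a toy Bloom filter in Python. It is a sketch, not a tuned implementation: it uses the post's figures of eight hashes and 3 bytes (24 bits) per entry, and it derives the eight bit positions by salting SHA-256 with a counter, which is just one convenient choice. The class and method names are invented for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: 8 hash probes and 24 bits of RAM per expected entry."""

    def __init__(self, capacity: int, bits_per_entry: int = 24, num_hashes: int = 8):
        self.num_bits = capacity * bits_per_entry
        self.num_hashes = num_hashes
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Hash the key num_hashes times, each time with a different one-byte salt.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely not stored"; True means "probably stored".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


bloom = BloomFilter(capacity=1000)
stored = hashlib.sha256(b"some web object").digest()
missing = hashlib.sha256(b"an object we never stored").digest()
bloom.add(stored)

have_stored = bloom.might_contain(stored)    # a stored key is always reported present
have_missing = bloom.might_contain(missing)  # false positives are possible but rare

# RAM needed for the full lookup array at 3 bytes per entry.
ram_gb = 5_000_000_000 * 3 / 10**9
print(have_stored, have_missing, ram_gb)
```

A Bloom filter never misses an object it holds, but it can occasionally say "yes" for one it does not hold, so a "yes" still has to be confirmed by the actual disk lookup.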

I don’t know how fast this can run, but it is probably in excess of 100k requests per second. (This paper seemed to put the number over 1 million per second.) A request is a sha256 hash, which, if recorded in binary, is 32 bytes. So 3.2MBytes/sec would be the incoming bandwidth rate at 100k requests per second, which is not a problem. Therefore:

  • 100k “Do We Have It?” requests processed per second (guess).

The number of requests able to be served could depend on the bandwidth of the machine, and it could depend on the file system. If a web object is 50KB compressed, and served compressed, then with 2Gbits/second, we could serve a maximum of 5,000 per second based on bandwidth. If each hard drive can do about 200 seeks per second, and a retrieval is four seeks on average (this is an estimate), then with 36 hard drives, that would be 1,800 retrievals per second. If there were popular pages, these would stay in RAM or on an SSD, so it could be considerably faster. But assuming 1,800 per second, this would be about 700Mbits/sec, which is not stretching the proposed machines. Therefore:

  • 1,800 “Here is the web object” requests processed per second maximum.
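The serving estimate is the smaller of two limits, network and disk. A short Python sketch with the post's numbers (variable names are illustrative only):

```python
NETWORK_BITS_PER_SEC = 2 * 10**9   # 2 Gbit/sec network
AVG_OBJECT_BYTES = 50 * 10**3      # 50 KBytes per object
DRIVES = 36
SEEKS_PER_DRIVE_PER_SEC = 200      # rough figure for a hard drive
SEEKS_PER_RETRIEVAL = 4            # the post's estimate

# Limit imposed by the network: bytes per second divided by object size.
bandwidth_limit = NETWORK_BITS_PER_SEC // 8 // AVG_OBJECT_BYTES

# Limit imposed by the disks: total seeks per second divided by seeks per object.
disk_limit = DRIVES * SEEKS_PER_DRIVE_PER_SEC // SEEKS_PER_RETRIEVAL

served_per_sec = min(bandwidth_limit, disk_limit)
outgoing_mbits = served_per_sec * AVG_OBJECT_BYTES * 8 / 10**6

print(bandwidth_limit, disk_limit, served_per_sec, outgoing_mbits)
```

The exact outgoing rate works out to 720Mbits/sec, which the post rounds to about 700.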

How many users would this serve? To make a guess, maybe we could look at how mobile devices use web servers. At least in my family, web use is a small percentage of the total traffic, and even the sites that are used are unlikely to be decentralized websites (like YouTube). So if a user uses 1GByte per month on web traffic, and 5% of that goes to decentralized websites, then 50MB/month per user of decentralized websites could give an estimate. If the server can serve at 700Mbits/sec, then that is 226Terabytes/month. At the 50MB usage that would be over 4 million users. Therefore:

  • Over 4 million users can be served from that single server (again, a guess).
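The user estimate, worked through in Python. The only figure added here is the assumption of a 30-day month; everything else is taken from the paragraph above.

```python
SERVE_BITS_PER_SEC = 700 * 10**6         # about 700 Mbits/sec sustained
SECONDS_PER_MONTH = 30 * 24 * 3600       # assumes a 30-day month
WEB_BYTES_PER_USER_MONTH = 10**9         # 1 GByte of web traffic per user per month
DECENTRALIZED_PERCENT = 5                # guessed share going to decentralized sites

bytes_served_per_month = SERVE_BITS_PER_SEC // 8 * SECONDS_PER_MONTH
bytes_per_user = WEB_BYTES_PER_USER_MONTH * DECENTRALIZED_PERCENT // 100

terabytes_per_month = bytes_served_per_month / 10**12
users_served = bytes_served_per_month // bytes_per_user

print(terabytes_per_month, users_served)
```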

So, by this argument, a single Decentralized Web Server can serve a million websites to 4 million users and cost $14,000. Even if it does not perform this well, this could work well for quite a while.

Obviously, we do not want just one Decentralized Web Server, but it is interesting to know that one computer could serve the whole system during early stages, and then more can be added at any time. If there were more, then the system would be more robust, could scale to larger amounts of data, and could serve users faster because the content could be brought closer to users.

Performance and cost do not seem to be a problem—in fact, there may be an advantage to the decentralized web over current web server technology.

Posted in Announcements, News | 3 Comments

Geez, Now Internet Insurance?

We seem to make some people mad.

The Internet Archive, a non-profit library, hosts many things. Many, many things. Billions of old webpages, lots of concerts, nostalgia computer games, TV, books, old movies, contributed books, music, and video, and much more.

But some of it seems to make some people mad. China is blocking us, Russia recently stopped blocking us, and India took a crack at blocking us last year. And then there are the occasional denial-of-service attacks by who-knows-who? One recent DDoS attack was apparently claimed by some Anonymous-linked group. Another one seemed to ask for a bitcoin to turn it off. Yup, “Pay us $400 and we will put you back on the air.” Really? (We didn’t give it to them.)

Each time this happens, it causes a bunch of engineers and managers to run around to deal with it. Thankfully, a bunch of people donated this last time, out of sympathy, I guess — thank you!

We have tried to handle these without architectural changes, but it is getting hard. This last time we had to call a vacationing engineer in the middle of his night… Zeus knows we have enough self-inflicted screwups and growing pains to deal with. But now this?

One change we could make would be to send our traffic through CloudFlare, or similar, to filter out unwelcome packets as an “Insurance against Internet attackers.” Some people go to “cloud services” that have the sysadmins filter out the zealous ones. Both of these solutions would mean that our traffic would go through someone else’s hosts, which means $, privacy loss, and general loss of the end-to-end Internet. It is like converting to Gmail because there are so many spammers on the net and Google is capable of filtering out those losers.

The Internet Archive is trying to demonstrate that an affordable, end-to-end strategy works:

  •     we protect our readers’ privacy by running our own servers, and try not to log IP addresses;
  •     we don’t want to have co-location centers that control physical access to our servers, so we build our own;
  •     we don’t like having someone else run our email servers, but we get deluged with spam;
  •     we do not want to have someone else control our IP addresses, so we have our own ASN;
  •     we want the web to be even more resilient against the censors and the rot of time, so we pioneer the Decentralized Web.

Having our traffic filtered by a third party only when we are attacked may not be so bad, but it shows it is harder and harder for normal people to run their own servers.

Let’s work together to keep the Internet a welcoming place to both large and small players without needing insurance and third-party protectors.

Optimistically yours,

Brewster Kahle

Founder and Digital Librarian

Posted in Announcements, News | 6 Comments

Decentralized Web Summit: Towards Reliable, Private, and Fun


[See coverage by the NYtimes, Fortune, Boing Boing, other press]


Internet Archive founder Brewster Kahle; Vint Cerf, “father of the Internet”; and Sir Tim Berners-Lee, “father of the World Wide Web,” at the first Decentralized Web Summit in San Francisco.

More than 300 web architects, activists, archivists and policy makers gathered at the Internet Archive for the first Decentralized Web Summit, where I was honored to share a stage with internet pioneers Vint Cerf and Sir Tim Berners-Lee. We wanted to bring together the original “fathers of the internet and World Wide Web” with a new generation of builders to see if together we could align around, and in some cases reinvent, a Web that is more reliable, private, and fun. Hackers came from Bangkok to Boston, London and Lisbon, New York and Berlin to answer our call to “Lock Open the Web.”

Building a web that is decentralized, where many websites are delivered through a peer-to-peer network, would lead to the web being hosted from many places, which in turn leads to more reliable access, availability of past versions, access from more places around the world, and higher performance. It can also lead to more reader privacy because it is harder to watch or control what one reads. Integrating a payments system into a decentralized web can help people make money by publishing on the web without the need for third parties. This meeting focused on the values, technical, policy, and deployment issues of reinventing basic infrastructure like the web.

First, in the opening welcome, Mitchell Baker, head of Mozilla, reported that Mozilla, the company that made open mainstream, is going back to the core values, focusing on what users want the Web to be. Mitchell said Mozilla is rethinking everything, even what a browser should be in the coming age. She highlighted four principles we need to think about when building a Decentralized Web: that the Web should be Immediate, Open, Universal and have Agency, meaning that there are policies and standards that help users mediate and control their own Web experiences. Talking about the values that need to be baked into the code turned out to be the dominant theme of the event.


Next, Vint Cerf, Google’s Internet Evangelist and “father of the Internet,” called for a “Self-Archiving Web” in the first keynote address. He described a “digital dark age” when our lives online have disappeared and how a more advanced Web, one that archives itself throughout time, could help avoid that outcome. Over the three days of events, how to actually build a Web that archives itself came to seem quite doable. In fact, several talented groups, including IPFS and the Dat Project, demonstrated pieces of what could make a Decentralized Web real.

Tim Berners-Lee (father of the Web) opened by saying the current technology and protocols could and should evolve to incorporate what we want from our Web. He told us he created the Web to be decentralized, so that anyone could set up their own server or host their own domain. Over time the Web has become “siloized” and we have “sold our soul of privacy in order to get stuff for free.” When Tim said rethinking the HTTP specification is feasible, the possibilities for change and improvement opened up for everyone.


Brewster Kahle of the Internet Archive (me) ventured we wanted a Web that baked our values into the code itself: Universal Access to all Knowledge, freedom of expression, reliability, reader privacy, and fun.

To build reliable access requires serving websites from multiple places on the net. We heard proposals to build “multi-home” websites using content-addressable structures rather than contacting a single website for answers. There were demonstrations of ZeroNet, IPFS, and DAT that did this.

Protecting reader privacy is difficult when all traffic to a website can be monitored, blocked, or controlled. The security panel that included Mike Perry of Tor and Paige Peterson of MaidSafe, said that having one’s requests and retrieved documents “hopping around” rather than going straight from server to client can help ensure greater privacy. Combining this with multi-homed access seems like a good start.

We can start making a smooth transition from the current Web to leverage these ideas by using all of our current infrastructure of browsers and URLs, without requiring people to download software. While not ideal, we can build a Decentralized Web on top of the current Web using JavaScript, so each reader of the Decentralized Web is also a server of it, allowing the Web naturally to scale and reinforce itself as more readers join in. The Internet Archive has already started supporting these projects with free machines and storage.

“Polyfill” was the final bit of advice I got from Tim Berners-Lee before he left. Polyfill, he said, is a kind of English version of Spackle, which is used to fix and patch walls. In this case, Polyfill is JavaScript. He said that almost all proposals to make a change to the Web are prototyped in JavaScript and then can be built in as they are debugged and demonstrated to be useful.

There we have it: let’s make polyfill additions to the existing Web to demonstrate how a Reliable, Private, and Fun Web can emerge.

Congratulations to the Internet Archive for pulling this together.


Posted in Announcements | 2 Comments

Copyright Office’s Proposed Notice and Staydown System Would Force the Internet Archive and Other Platforms to Censor the Web

In May, the US Copyright Office came to San Francisco to hear from various stakeholders about how well Section 512 of the Digital Millennium Copyright Act or DMCA is working. The Internet Archive appeared at these hearings to talk about the perspective of nonprofit libraries. The DMCA is the part of copyright law that provides for a “notice and takedown” process for copyrighted works on the Internet. Platforms that host content can get legal immunity if they take down materials when they get a complaint from the copyright owner.

This is an incredibly powerful tool for content owners–there is no other area of law that allows content to be removed from the web with a mere accusation of guilt. Victims of harassment, defamation, invasions of privacy, or any other legal claim, have to go to court to have anything taken down.

Unfortunately, this tool can be, and has been, abused. We see this every day at the Internet Archive when we get overbroad DMCA takedown notices, claiming material that is in the public domain, is fair use, or is critical of the content owner. More often than not, these bad notices are just mistakes, but sometimes notices are sent intentionally to silence speech. Since this tool can be so easily abused, it is one that should be approached with extreme caution.

We were very concerned to hear that the Copyright Office is strongly considering recommending changing the DMCA to mandate a “Notice and Staydown” regime. This is the language that the Copyright Office uses to talk about censoring the web. The idea is that once a platform gets a notice regarding a specific copyrighted work, like a specific picture, song, book, or film, that platform would then be responsible for making sure that the work never appears on the platform ever again. Other users would have to be prevented, using filtering technology, from ever posting that specific content ever again. It would have to “Stay Down.”

This idea is dangerous in a number of ways:

  • No Due Process. Notice and Staydown would remove all of the user protections built in to the DMCA. Currently, the statute allows users who believe material they have posted was taken down in error to file a counter-notification. If the copyright holder does not choose to bring a lawsuit, then the content can be reposted. The law also prohibits the sending of false notices, and allows users who have been falsely accused to bring a claim against their accuser. These protections for the user would simply go away if platforms were forced to proactively filter content.
  • Requires Platforms to Monitor User Activity. The current statute protects user privacy by explicitly stating that platforms have no duty to monitor user activity for copyright infringement. Notice and Staydown would change this–requiring platforms to be constantly looking over users’ shoulders.
  • Promotes Censorship. Notice and Staydown has a serious First Amendment problem. The government mandating the use of technology to affirmatively take speech offline before it’s even posted, without any form of review, potentially violates free speech laws.
  • It Just Won’t Work In Most Cases. Piracy on the web is a real problem for creators. However, filtering at the platform level is just very unlikely to stop the worst of the piracy problem. Filtering doesn’t work for links. It doesn’t work well for certain types of content, like photographs, which are easily altered to avoid the filter. And so far, no computer algorithm has been developed that can determine whether a particular upload is fair use. Notice and Staydown would force many cases of legitimate fair use off the web. Further, intermediaries are not the right party to be implementing this technology. They don’t have all the facts about the works, such as whether they have been licensed. Most platforms are not in a good position to be making legal judgements, and they are motivated to avoid the potential for high statutory damages. All this means that platforms are likely to filter out legitimate uses of content.
  • Lacks Transparency.  These technical filters would act as a black box that the public would have no ability to review or appeal. It would be very difficult to know how much legitimate activity was being censored.
  • Costly and Burdensome. Developing an accurate filter that will work for each and every platform on the web will be an extremely costly endeavor. YouTube spent $60 million developing its Content ID system, which only works for audio and video content. It is very expensive to do this well. Nonprofits, libraries, and educational institutions who act as internet service providers would be forced to spend a huge amount of their already scarce resources policing copyright.
  • Technology Changes Quickly, Law Changes Slowly. The DMCA requires registered DMCA agents to provide a fax number. In 1998, that made sense. Today it is silly. Technology changes far too quickly for law to keep up. Governments should not be in the business of mandating the use of technology to solve a specific policy problem.

The DMCA has its problems, but Notice and Staydown would be an absolute disaster. Unfortunately, members of the general public were not invited to the Copyright Office proceedings last week. The many thousands of comments submitted by Internet users on this subject were not considered valuable input; rather, one panelist characterized them as a “DDoS attack” on the Copyright Office website, showing how little the people who are seeking to regulate the web actually understand it.

The Copyright Office has called for more research on how the DMCA is working for copyright holders and for platforms. We agree that this research is important. However, we must remember that the rest of the online world will also be impacted by changes to the DMCA.

Posted in Announcements, News | 30 Comments

Web Archiving with National Libraries


After the Internet Archive started web archiving in the late 1990s, national libraries also took their first steps towards systematic preservation of the web. Over 30 national libraries currently have a web archiving programme. Many among them archive the web under a legal mandate, which is an extension of the Legal Deposit system to cover non-print publications and enable heritage institutions such as a national library to collect copies of online publications within a country or state.

The Internet Archive has a long tradition of working with national libraries. As a key provider of web archiving technologies and services, Internet Archive has made available open source software for crawling and access, enabling national bodies to undertake web archiving locally. The Internet Archive also runs a global web archiving service for the general public, a tailored broad crawling service for national libraries and Archive-It, a subscription service for creating, managing, accessing and storing web archive collections. Many national libraries are partners of these services.

The Internet Archive conducted a stakeholders’ consultation exercise between November 2015 and March 2016, with the aim to understand current practices, and then review Internet Archive’s current services in this light and explore new aspects for national libraries. Thirty organizations and individuals were consulted, representing national libraries, archives, researchers, independent consultants and web archiving service providers.

The main findings of the consultation are summarized below, which give an overview of the current practices of web archiving at national libraries, as well as a general impression of the progress in web archiving and specific feedback on Internet Archive’s role and services.

  • Strategy and organization
    Web archiving has become increasingly important in national libraries’ strategy. Many have wanted to own the activity and develop the capability in-house. This requires integration of web archives with the library’s other collections and the traditional library practice for collection development. Budget cuts and lack of resources were observed at many national libraries, making it difficult to sustain the ongoing development of tools for web archiving.
  • Quality and comprehensiveness of collection
    There is a general frustration about the content gaps in the web archives. National libraries also have strong desires to collect the portion of Twitter, YouTube, Facebook and other social media which is considered part of their respective national domain. They would also like to leverage web archiving as a complementary collecting tool for digital objects that are on the web and included in web archives, such as eBooks, eJournals, music and maps.
  • Access and research use
    National web archives are, in general, poorly used due to access restrictions. Many national libraries wish to support research use of their web archives, by engaging with researchers to understand requirements and eventually embedding web archive collections into the research process.
  • Reflection on 20 years of web archiving
    While there is recognition of the progress in web archiving, there is also a general feeling that the community is stuck with a certain way of doing things without making any significant technological progress in the last ten years, and being outpaced by the fast evolving web.
  • Perception and expectation of Internet Archive’s services
    Aspects of Internet Archive’s current services are unknown or misperceived. Stakeholders wish for services that are complementary to what national libraries undertake locally and help them put in place better web archives. There is a strong expectation for the Internet Archive to lead the ongoing collaborative development of (especially) Heritrix and the Wayback software. A number of national libraries have expressed the need for a service supporting the use of key software including maintenance, support and new features. There are also clearly expressed interests in services that can help libraries collect advanced content such as social media and embedded videos.

The Internet Archive would like to thank the participants again for being open with us and providing us with valuable input which will inform the development and improvement of our services.

The full consultation report can be accessed at https://archive.org/details/InternetArchiveStakeholdersConsultationFindingsPublic.

Posted in Announcements, News | 5 Comments

IA + ARC + Cuba


Cuba Music Week is a live and online effort – both crowd sourced and curated – to highlight the importance and beauty of Cuban Music. One goal is to introduce people to ideas and music from this vibrant culture.

In the past we have created “weeks” on Muslim music, Brazil and India. To do this we contact artists, academic institutions, bloggers, broadcasters, venues and collectors to send essays, activities and events that could be coordinated with our event. Sometimes the response is great, sometimes not.

Cuba is our fourth attempt and we have partnered with Cubadiscos, a Cuban government organization that hosts a weeklong music festival and a symposium on the music in Havana. Cuba has a few problems with the internet, so there is no website. We have posted a list of their activities on our site from a list that we only got the day before the festival began!

Just for fun have a look at the galleries of record covers, cha cha maybe? Our galleries are one of the best features we create. The ARC doesn’t scan images of other people’s holdings or borrow materials for the site – we own everything pictured. A few of the recordings are taken from the joint ARC and Internet Archive collection stored out in the Richmond warehouses. Here are two sweet ‘almost’ Cuban, afro-Cuban recordings from this collection. They were donated by the family of Jerry Adams.



Mr. Adams was a radio DJ who became a major voice in promoting the Monterey Jazz Festival and helped Clint Eastwood build his collection. So some very nice stuff here. A good reason why the Internet Archive is, and should be, going after audio collections of quality with us.

Among the best features of the site are the databases, listing the Cuban recordings here at the ARC, and the glossaries of genres and instruments, with many hundreds of styles and instruments briefly described. It’s info that is only available here. Soon everything will be stolen by Wikipedia, but for now this is probably the only easy-to-find source for much of this information. For audio fun we have worked with the Peabody Award winning radio show Afropop Worldwide to bring everyone 18 hours on Cuban Music. Soon all of their 25+ years of audio will be available on the Internet Archive.

An important outgrowth of this project is our work (both the Internet Archive’s and the Archive of Contemporary Music’s) with the Cuban National Library José Martí. Last year I met with Pedro Urra, who was working on a project to convert the library’s old typed and handwritten index cards on the recordings in their collection into OCR-readable form. So for us they rushed this project forward and now there are more than 30,000 cards scanned, making this data available online for scholars for the first time. Catalog available here and one example below.

Our Cuba site will remain active as an online resource to make this culturally significant body of work readily available to people around the globe for study and enjoyment.

Do have a look at Cuba Music Week and spread the word.

Thanks,  B. George,

Director, The ARChive of Contemporary Music, NYC.

Sound Curator, The Internet Archive, San Francisco


Posted in News | Comments Off on IA + ARC + Cuba

Join us for the first Decentralized Web Summit — June 8-9, in SF

Decentralized Web Summit: Locking The Web Open at the Internet Archive

The first Decentralized Web Summit is a call for dreamers and builders who believe we can lock the Web open for good. The goal of the Summit (June 8) and the Meetup featuring lightning talks and workshops (June 9) is to spark collaboration and take concrete steps to create a better Web.

Together we can build a more reliable, more dynamic, and more private Web on top of the existing web infrastructure.

At the Summit on June 8, the “father of the Internet,” Vint Cerf, will share with us his “Lessons from the Internet,” the things he’s learned in his 40+ years that may help us create a new, more secure, private and robust Web. EFF’s Cory Doctorow, such a fine weaver of digital dystopias in his science fiction, will share what has gone awry with the current Web and what kind of values we need to build into the code this time.

Current builders of decentralized technologies will be on hand to share their visions of how we can build a fully decentralized Web. The founders and builders of IPFS, the Dat Project, WebTorrent, Tahoe-LAFS, zcash, Zeronet.io, BitTorrent, Ethereum, BigChainDB, Blockstack, Interledger, Mediachain, MaidSafe, Storj and others will present their technologies and answer questions. If you have a project or workshop to share on June 9, we’d love to hear from you at Dwebsummit@archive.org.

You can join the conversation in our Decentralized Web Slack channel, or — as a decentralized option — you can join the Slack as a guest through Matrix.

It will take the passion and expertise of many to lock the Web open. As Internet Archive founder, Brewster Kahle, wrote last year:

We can make openness irrevocable.
We can build this.

We can do it together.

On June 8-9, let’s collaborate to get there.

For more information and official schedule, go to decentralizedweb.net.

Event Info:

Wednesday, June 8, 2016 at 8:00 AM to Thursday, June 9, 2016 at 8:00 PM

Internet Archive, 300 Funston Avenue, San Francisco, CA 94118

Please register on our Eventbrite (limit 250 participants on June 8).

Posted in Announcements, Event | 10 Comments

The tech powering the Political TV Ad Archive

Ever wonder how we built the Political TV Ad Archive? This post explains what happens backstage: how we are using advanced technology to generate the counts for how many times a particular ad has aired on television, where, and when, in markets that we track.

There are three pieces to the Political TV Ad Archive:

  • The Internet Archive collects, prepares, and serves the TV content in markets where we have feeds. Collection of TV is part of a much larger effort to meet the organization’s mission of providing “Universal Access to All Knowledge.” The Internet Archive is the online home to millions of free books, movies, software, music, images, web pages and more.
  • The Duplitron 5000 is our whimsical name for an open source system responsible for taking video and creating unique, compressed versions of the audio tracks. These are known as audio fingerprints. We create an audio fingerprint for each political ad that we discover, which we then match against our incoming stream of broadcast television to find each new copy, or airing, of that ad. These results are reported back to the Internet Archive.
  • The Political TV Ad Archive is a WordPress site that presents our data and our videos to the rest of the world. On this website, for the sake of posterity, we also archive copies of political ads that may be airing in markets we don’t track, or exclusively on social media. But for the ads that show up in areas where we’re collecting TV, we are able to present the added information about airings.


Step 1: recording television

We have a whole bunch of hardware spread around the country to record television. That content is then pieced together to form the programs that get stored on the Internet Archive’s servers. We have a few ways to collect TV content. In some cases, such as the San Francisco market, we own and manage the hardware that records local cable. In other cases, such as markets in Ohio and Iowa, the content is provided to us by third party services.

Regardless of how we get the data, the pipeline takes it to the same place. We record in minute-long chunks of video and stitch them together into programs based on what we know about the station’s schedule. This results in video segments of anywhere from 30 minutes to 12 hours. Those programs are then turned into a variety of file formats for archival purposes.
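To make the stitching step concrete, here is a toy Python sketch. It is not the Archive's actual pipeline: the schedule format, the function name `stitch`, and the program titles are all invented for illustration. Each recorded minute is assigned to whichever scheduled program covers it.

```python
from typing import Dict, List, Tuple

# Hypothetical schedule entries: (title, start minute, end minute), end exclusive.
Schedule = List[Tuple[str, int, int]]


def stitch(recorded_minutes: List[int], schedule: Schedule) -> Dict[str, List[int]]:
    """Group minute-long chunks into programs according to the station schedule."""
    programs: Dict[str, List[int]] = {title: [] for title, _, _ in schedule}
    for minute in sorted(recorded_minutes):
        for title, start, end in schedule:
            if start <= minute < end:
                programs[title].append(minute)
                break
    return programs


schedule = [("Evening News", 0, 30), ("Late Movie", 30, 150)]
# Pretend minute 42 was lost in transmission.
recorded = [m for m in range(150) if m != 42]

programs = stitch(recorded, schedule)
for title, minutes in programs.items():
    print(title, len(minutes), "minutes")
```

A gap like the missing minute here is the kind of thing that shows up later as a time shift or missing content in the stitched program.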

The ad counts we publish are based on actual airings, as opposed to reported airings. This means that we are not estimating counts by analyzing Federal Election Commission (FEC) reports on spending by campaigns. Nor are we digitizing reports filed by broadcasting stations with the Federal Communications Commission (FCC) about political ads, though that is a worthy goal. Instead we generate counts by looking at what actually has been broadcast to the public.

Because we are working from the source, we know we aren’t being misled. On the flip side, this means that we can only report counts for the channels we actively track and record. In the first phase of our project, we tracked more than 20 markets in 11 key primary states (details here.) We’re now in the process of planning which markets we’ll track for the general elections. Our main constraint is simple: money. Capturing TV comes at a cost.

A lot can go wrong here. Storms can affect reception, and packets can be lost or corrupted before they reach our servers. The result can be time shifts or missing content. But most of the time the data winds up sitting comfortably on our hard drives, unscathed.

Step 2: searching television

Video is terrible when you’re trying to look for a specific piece of it. It’s slow, it’s heavy, it is far better suited for watching than for working with, but sometimes you need to find a way.

There are a few things to try. One is transcription; if you have a time-coded transcript you can do almost anything, such as create a text editor for video or search for key phrases like “I approve this message.”
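As a rough illustration of why a time-coded transcript is so useful, the sketch below searches one for a phrase and returns the moment it starts. The transcript format (a list of `(seconds, word)` pairs) and the `find_phrase` helper are assumptions made up for this example, not a format we actually use.

```python
def find_phrase(transcript, phrase):
    """Return the start time (in seconds) of each occurrence of a phrase.

    transcript is a list of (start_seconds, word) pairs.
    """
    target = phrase.lower().split()
    # Normalize each word: lowercase and strip surrounding punctuation.
    words = [word.lower().strip('.,!?"') for _, word in transcript]
    hits = []
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            hits.append(transcript[i][0])
    return hits


transcript = [
    (12.0, "I'm"), (12.3, "Jane"), (12.6, "Doe"), (13.0, "and"),
    (13.2, "I"), (13.4, "approve"), (13.8, "this"), (14.0, "message."),
]
print(find_phrase(transcript, "I approve this message"))  # [13.2]
```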

The problem is that most television is not precisely transcribed. Closed captions are required for most U.S. TV programs, but not for advertisements. Shockingly, most political ads are not captioned. There are a few open source tools out there for automated transcript generation, but the results leave much to be desired.

Introducing audio fingerprinting

We use a free and open tool called audfprint to convert our audio files into audio fingerprints.

An audio fingerprint is a summarized version of an audio file, one that has removed everything except the most interesting pieces of every few milliseconds. The trick is that the summaries are formed in a way that makes it easy to compare them, and because they are summaries, the resulting fingerprint is a lot smaller and faster to work with than the original.

The audio fingerprints we use are based on a thing called frequency. Sounds are made up of waves, and each wave repeats, or oscillates, at a different rate. Faster repetitions are linked to higher sounds, and slower repetitions to lower sounds.

An audio file contains instructions that tell a computer how to generate these waves. Audfprint breaks the audio files into tiny chunks (around 20 chunks per second) and runs a mathematical function on each fragment to identify the most prominent waves and their corresponding frequencies.

The rest is thrown out, the summaries are stored, and the result is an audio fingerprint.
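The idea can be sketched in a few lines of Python. This is a deliberately simplified toy, not audfprint's actual algorithm or API: it keeps only the single strongest frequency in each chunk, uses a slow hand-written Fourier transform, and assumes a made-up sample rate of 2,000 samples per second so that 20 chunks per second works out to 100 samples each. All function names here are hypothetical.

```python
import cmath
import math

SAMPLE_RATE = 2000          # samples per second (assumed for this toy)
CHUNK = SAMPLE_RATE // 20   # 100 samples, i.e. 20 chunks per second


def dominant_frequency(chunk):
    """Return the frequency (Hz) of the strongest wave in one chunk."""
    n = len(chunk)
    best_bin, best_magnitude = 0, 0.0
    for k in range(1, n // 2):
        # Discrete Fourier transform coefficient for frequency bin k.
        coefficient = sum(
            sample * cmath.exp(-2j * math.pi * k * i / n)
            for i, sample in enumerate(chunk)
        )
        if abs(coefficient) > best_magnitude:
            best_bin, best_magnitude = k, abs(coefficient)
    return best_bin * SAMPLE_RATE / n


def fingerprint(samples):
    """Summarize audio as one dominant frequency per chunk."""
    return [
        dominant_frequency(samples[i:i + CHUNK])
        for i in range(0, len(samples) - CHUNK + 1, CHUNK)
    ]


def tone(frequency, seconds):
    """Generate a pure sine wave to use as test audio."""
    return [
        math.sin(2 * math.pi * frequency * t / SAMPLE_RATE)
        for t in range(int(SAMPLE_RATE * seconds))
    ]


# Half a second of a 200 Hz tone followed by half a second of 440 Hz.
audio = tone(200, 0.5) + tone(440, 0.5)
print(fingerprint(audio))
```

One second of audio (2,000 numbers) is reduced to 20 numbers: ten chunks reporting 200 Hz, then ten reporting 440 Hz. Real fingerprints keep more than one peak per chunk, but the shrinkage works the same way.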

If the same sound exists across two files, a common set of dominant frequencies will be seen in both fingerprints. Audfprint makes it possible to compare the chunks between two sound files, count how many they have in common, and count how many appear at roughly the same distance from one another.
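Here is a minimal sketch of that comparison, again as an illustration rather than audfprint's real matching code. It assumes each fingerprint is just a list with one dominant frequency per chunk, and the `best_alignment` function is a hypothetical name. Every chunk the two files share "votes" for the time offset that would line them up; a true copy produces many votes for the same offset.

```python
from collections import Counter


def best_alignment(ad_fingerprint, program_fingerprint):
    """Find the chunk offset where the ad best lines up with the program.

    Returns (offset_in_chunks, number_of_agreeing_chunks).
    """
    # Index the program: frequency -> every chunk position where it occurs.
    positions = {}
    for index, frequency in enumerate(program_fingerprint):
        positions.setdefault(frequency, []).append(index)

    votes = Counter()
    for ad_index, frequency in enumerate(ad_fingerprint):
        for program_index in positions.get(frequency, []):
            # Chunks in common that sit the same distance apart in both
            # files all vote for the same offset.
            votes[program_index - ad_index] += 1

    if not votes:
        return None, 0
    return votes.most_common(1)[0]


ad = [200, 440, 320, 200, 560, 440]
program = [100, 120, 140, 160] + ad + [180, 100, 120]
offset, matches = best_alignment(ad, program)
print(offset, matches)  # 4 6
```

At 20 chunks per second, an offset of 4 chunks means the ad starts 0.2 seconds into the program. Stray matches (the 200 Hz chunk appears twice, for example) vote for scattered offsets and are outnumbered.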

This is what we use to find copies of political ads.

Step 3: cataloguing political ads

When we discover a new political ad the first thing we do is register it on the Internet Archive, kicking off the ingestion process. The person who found it types in some basic information such as who the ad mentions, who paid for it, and what topics are discussed.

The ad is then sent to the system we built to manage our fingerprinting workflow, which we whimsically call the Duplitron 5000, or the “DT5k.” It uses audfprint to generate fingerprints, organizes how the fingerprints are stored, processes the comparison results, and allows us to scale across millions of minutes of television.

DT5k generates a fingerprint for the ad, stores it, and then compares that fingerprint with hundreds of thousands of existing fingerprints for the shows that had been previously ingested into the system. It takes a few hours for all of the results to come in. When they do, the Duplitron makes sense of the numbers and tells the archive which programs contain copies of the ad and what time the ad aired.
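The last step, turning raw comparison results into a list of airings, might look something like the sketch below. The result fields (`program_id`, `program_start`, `offset_seconds`, `matched_fraction`), the `airings` function, and the 0.8 threshold are all assumptions for illustration; they are not the Duplitron's real data model or settings.

```python
from datetime import datetime, timedelta


def airings(match_results, min_fraction=0.8):
    """Keep strong matches and convert each offset into a wall-clock air time."""
    found = []
    for result in match_results:
        # Weak matches are likely coincidences, so drop them.
        if result["matched_fraction"] >= min_fraction:
            aired_at = result["program_start"] + timedelta(
                seconds=result["offset_seconds"]
            )
            found.append((result["program_id"], aired_at))
    return sorted(found, key=lambda pair: pair[1])


results = [
    {"program_id": "news-6pm", "program_start": datetime(2016, 5, 1, 18, 0),
     "offset_seconds": 754, "matched_fraction": 0.97},
    {"program_id": "infomercial", "program_start": datetime(2016, 5, 1, 2, 0),
     "offset_seconds": 120, "matched_fraction": 0.35},
]
for program_id, aired_at in airings(results):
    print(program_id, aired_at.isoformat())
```

Here only the evening news match survives, and the ad is reported as airing at 18:12:34.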

These results end up being fairly accurate, but not perfect. The matches are based on audio, not video, which means we run into trouble when a political ad uses the same soundtrack as, for instance, an infomercial.

We are working on improving the system to filter out these kinds of false positives, but even with no changes these fingerprints have provided solid data across the markets we track.


The Duplitron 5000, counting political ads. Credit: Lyla Duey.

Step 4: enjoying the results

And so you understand a little bit more about our system. You can download our data and watch the ads at the Political TV Ad Archive. (For more on our metadata, including what’s in it and what you can do with it, read here.)

Over the coming months we are working to make the system more accurate. We are also exploring ways to identify newly released political ads without any need for manual entry.

P.S. We’re also working to make it as easy as possible for any researchers to download all of our fingerprints to use in their own local copies of the Duplitron 5000. Would you like to experiment with this capability? If so, contact me on Twitter at @slifty.

Posted in Announcements, News, Television Archive | Comments Off on The tech powering the Political TV Ad Archive

Discover Books Donates Large Numbers of Books

Internet Archive is proud to partner with Discover Books, a major used book seller, to help let the stories in books live on. Discover Books is donating books that the Internet Archive does not yet own and that would otherwise have gone to a landfill. Through this process the Internet Archive has more books to digitize and preserve.

Together we are giving books the longest life possible both in print and online.

Thank you to discoverbooks.com.

Posted in Announcements, Books Archive, News | 3 Comments

Reflections on From Clay to the Cloud: The Internet Archive and Our Digital Legacy, a.k.a. The Internet Archive – The Exhibition!


Photograph by Jason Scott

By Carolyn Peter

It started with a visit to Nuala Creed’s ceramics studio in Petaluma in the spring of 2014. My interest was piqued as she described “a commission of sculptures for the Internet Archive” that was ever-growing. She heavily encouraged me to stop by the Archive to experience the famous Friday lunch and to see her work. I’m so glad she did.

While enjoying a tasty lunch of sausage and salad, I listened to Archive staff members talk about their week and curious visitors who shared their inspiration for coming to the Archive for a meal and a tour. I did not understand all the technical vocabulary, but I was struck by all the individuals who were working together on a project which, prior to this day, I had only experienced as a website on my computer screen.

Of course, I fell in love with Nuala’s sculptures as soon as we stepped inside the Great Room, where 100+ colorful figures stood facing the stage as if waiting for a performance or lecture to begin. The odd objects in their hands, their personally fashioned clothing, and their quirky expressions reinforced the idea that the Internet Archive was the shared, creative effort of a huge number of individuals. The technical was becoming human to me. By the time Brewster had brought his visitors back down into the common workspace, my mind was racing with ideas and questions.

As a museum professional who has spent her career making choices about what works of art to acquire and preserve for future generations and as someone who takes great joy in handling and caring for objects, I wondered what threads ran from this digital archive through to more traditional archives and libraries. If I had sleepless nights wondering how to best protect a work of art for posterity, how was the Internet Archive going to ensure that its vast data was going to survive for millennia to come?

Before I knew what I was doing, I heard myself telling Brewster that I would love to do an exhibition about the Internet Archive. I don’t think he or I fully registered what I was saying. That would take more time.

This curatorial challenge brewed in my mind. The more I thought about it, the more I thought an exploration of the past, present and future of archives and libraries and the basic human desire to preserve knowledge for future generations would be a perfect topic for an exhibition in my university art gallery. I knew Nuala’s series could serve as the core artistic and humanizing element for such a show, but I wondered how I would be able to convey these ideas and questions in an accessible and interesting way, how to make this invisible digital world visible? And turning the tables—if Brewster had brought art into the world of technology with his commission of the Internet Archivists series, how could I bring technology into the artistic realm?

When I approached Brewster for a second time about a year later with a proposal, he thought I was crazy. He has said it was as if I had told him I wanted to do “The Internet Archive, The Musical.” A few conversations and months later, he agreed to let me run with the idea.

I have to admit, at times, I too wondered if I was crazy. I wrestled with devising ways to visibly convey the Archive’s unfathomable vastness while also trying to spotlight the diverse aspects of the Archive through hands-on displays.


Photograph by Jason Scott.

While some big dreams had to be let go, I was able to achieve most of the goals I set out for the exhibition. Transporting thirty-two of Nuala’s fragile sculptures to Los Angeles required two days of careful packing and a fine art shipping truck committed solely to this special load. Along with film editor Chris Jones and cameraman Scott Oller, I also created a film that documents the story of the Internet Archivists sculpture series through interviews with Nuala, Brewster and a number of the archivists who have had their sculptural portraits made.

When visitors entered the gallery, they were greeted by three of the Internet Archivist figures and a full-scale shipping container (a trompe-l’oeil work of art by Makayla Blanchard) that conveyed Brewster’s often-repeated claim that he had fit the entire World Wide Web inside a shipping container. The exhibition was filled with juxtapositions of the old and new. To the right of the three archivists was a case filled with a dozen ancient clay cuneiforms and pieces of Egyptian papyrus introducing very early forms of archiving. A china hutch displayed out of fashion media formats that the Internet Archive has been converting into digital form such as record albums, cassette tapes, slides, and VHS tapes. I partnered with LMU’s librarians to bring the mystery of archiving out into the light. Using one of the Archive’s Tabletop Scribes, the librarians scanned and digitized numerous rare books from their collection. The exhibition also included displays and computer monitors so visitors could explore the Wayback Machine, listen to music from the archive’s collections, play vintage video games and test out the Oculus Rift.


Photograph by Brian Forrest.

In the end, I think the exhibition asked a lot more questions than it answered. Nevertheless, I hope this first exhibition will spark others to think of ways to make the abstract ideas and invisible aspects of digital archives more tangible. Who knows, maybe, a musical is in the Internet Archive’s future.

I was sad to pack up the clay archivists and say goodbye to their smiling faces. I’m sure they are happy to be back with the rest of their friends in the Great Room on Funston Avenue, but oh, the stories they have to tell of their travels to a gallery in Los Angeles.

Carolyn Peter is the director and curator of the Laband Art Gallery at Loyola Marymount University. She curated From Clay to the Cloud: The Internet Archive and Our Digital Legacy, which was on view from January 23-March 20, 2016 at the Laband.

Posted in Announcements, Event | Comments Off on Reflections on From Clay to the Cloud: The Internet Archive and Our Digital Legacy, a.k.a. The Internet Archive – The Exhibition!

Google Library Project Legal: Let the Robots Read!


The decade-long legal battle over Google’s massive book scanning project is finally over, and it’s a huge win for libraries and fair use. On Monday, the Supreme Court declined to hear an appeal by the Authors Guild, which had argued that Google’s scanning of millions of books was an infringement of copyright on a grand scale. The Supreme Court’s decision means that the Second Circuit’s holding, that Google’s creation of a database including millions of digital books is fair use, still stands. The appeals court explained how its fair use rationale aligns with the very purpose of copyright law: “[W]hile authors are undoubtedly important intended beneficiaries of copyright, the ultimate, primary intended beneficiary is the public, whose access to knowledge copyright seeks to advance by providing rewards for authorship.”

Google Books gives readers and internet users the world over access to millions of works that had previously been hidden away in the archives of our most elite universities. As a Google representative said in a statement, “The product acts like a card catalog for the digital age by giving people a new way to find and buy books while at the same time advancing the interests of authors.”

Google began scanning books in partnership with a group of university libraries in 2004. In 2005, author and publisher groups filed a class action lawsuit to put a stop to the project. The parties agreed to settle the lawsuit in a manner that would have forever changed the legal landscape around book rights. The District Court judge rejected the settlement in 2011, based on concerns about competition, access, and fairness, and so litigation over the core question of fair use resumed.

Judge Chin, Judge Leval, and the Supreme Court all made the right decisions along the long and winding path to Google’s victory. Libraries around the country are now free to rely on fair use as they determine how to manage their own digitization projects–encouraging innovation and increasing our access to human knowledge.

Posted in Announcements, Books Archive, News | 6 Comments

Truck and Back Again: The Internet Archive Truck Takes a Detour

When one of our employees came out of his home over the weekend, he saw an empty parking space. Granted, in San Francisco, that’s a pretty precious thing, but since this empty parking space had held the Internet Archive Truck for the previous two days, he was not feeling particularly lucky.

A staff conversation then ensued, the city was called to see if the truck had been towed, and after a short time, it became obvious that no, somebody had stolen the Truck.

This in itself is not news: thousands of vehicles are stolen in the Bay Area every year. But what makes this unusual was the nature of the vehicle stolen… the Truck is a pretty unique looking vehicle.



Once the report was filed with the police and a few more checks were made to ensure that the truck was absolutely, positively missing and presumed stolen, the truck’s theft was announced on Twitter, where it garnered tens of thousands of views and the news spread far and wide. Thanks to everyone who got the word out.

What was not expected, besides the initial theft, was that a lot of people wondered why the Internet Archive, essentially a website, would have a truck. So, here’s a little bit about why.

Besides providing older websites, books, movies, music, software and other materials to millions of visitors a day, the Internet Archive also has buildings for physical storage located in Richmond, just outside the limits of San Francisco. In these buildings, we hold copies of books we’ve scanned, audio recordings, software boxes, films, and a variety of other materials that we are either turning digital or holding for the future. It turns out you can’t be a 100% online experience – physical life just gets in the way. We also have multiple data centers and the need to transport equipment between them.

Therefore, we’ve had a hard-working vehicle for getting these materials around: a 2003 GMC Savana Cutaway G3500, often parked out front of the Archive’s 300 Funston Avenue address and making up to several trips a week between our various locations.

In a touch of whimsy, the truck has had a unique paint job for most of its life with the Archive. Notably, this isn’t even the first mural it had on its sides; here is a shot with the previous mural:


We’re not sure of the motivation in stealing this rather unique and noticeable vehicle, and there seems to be some evidence it was driven around the city for a while after it was taken. But yesterday, we were contacted by the San Francisco Police Department with really great news:

The Truck has been recovered!

Left abandoned by the side of the road, the truck was found and is about to be returned to the Archive, and with good luck, back and in service helping us prepare and transport materials related to our mission: to bring the world’s knowledge to everyone.

Again, thanks to everyone who sounded the original call for the truck’s return, and to the SFPD for recovering the truck so quickly after it was gone.

Posted in Announcements, Cool items | 9 Comments