Decentralized Web Server: Possible Approach with Cost and Performance Estimates

At the first Decentralized Web Summit, Tim Berners-Lee asked whether a content-addressable peer-to-peer server system could scale to the demands of the World Wide Web. This post is meant to be a partial answer to one piece of that puzzle. For background, this might help.

Decentralized web pages will be served by users, peer-to-peer, but there can also be high-performance super-nodes which would serve as caches and archives. These super-nodes could be run by archives, like the Internet Archive, and ISPs who want to deliver pages quickly to their users. I will call such a super-node a “Decentralized Web Server” or “D-Web Server” and work through a thought experiment on how much it would cost to have one that would store many webpages and serve them up fast.

Web objects, such as text and images, in the Decentralized Web are generally retrieved based on a computed hash of the content. This is called “content addressing.” Therefore, a request for a webpage from the network is based on its hash rather than on contacting a specific server. The object can be served from any D-Web server without worrying that it has been faked, because the client verifies the contents by rehashing them and comparing the result to the hash it requested.
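As a concrete illustration of that verification step, here is a minimal sketch in Python; the choice of sha256 matches the request format discussed below, but the object layout is just an example, not any particular D-Web protocol:

```python
import hashlib

def content_address(data: bytes) -> str:
    """The content address of a web object: here, its hex sha256."""
    return hashlib.sha256(data).hexdigest()

def verify(requested_hash: str, data: bytes) -> bool:
    """Accept bytes from any untrusted peer or D-Web server only if they
    rehash to the address that was requested."""
    return content_address(data) == requested_hash

page = b"<html><body>Hello, Decentralized Web</body></html>"
addr = content_address(page)          # this hash is what a client requests
assert verify(addr, page)             # a genuine copy passes
assert not verify(addr, page + b"!")  # a tampered copy is rejected
```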

For the purposes of this post, we will use the basic machines that the Internet Archive currently uses as a data point. These are 24-core, 250TByte disk storage (on 36 drives), 192GB RAM, 2Gbit/sec network, 4u height machines that cost about $14k. Therefore:

  • $14k for 1 D-Web server

Let’s estimate the average compressed decentralized web object size at 50KBytes (an object is a page, JavaScript file, image, or movie—the things that make up a webpage). This is larger than the Internet Archive’s web crawl average, but it’s in the ballpark.

If we use all the storage for web objects, that would be 5 billion web objects (250TB / 50KB). This would be maybe 1 million basic websites (each website having 5 thousand web objects, which I would guess is much more than the average WordPress website, though there are of course notable websites with much more). So this is enough for a large amount of growth in the decentralized web, and it could keep all versions. Therefore:

  • Store 5 billion web objects, or 1 million websites
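The arithmetic behind those numbers is simple enough to check in a few lines of Python (a back-of-envelope sketch using the figures above):

```python
DISK_BYTES = 250e12           # 250 TB of disk per D-Web server
OBJECT_BYTES = 50e3           # assumed average compressed web object, 50 KB
OBJECTS_PER_SITE = 5_000      # assumed size of a "basic" website

objects = DISK_BYTES / OBJECT_BYTES      # 5.0e9  -> 5 billion web objects
websites = objects / OBJECTS_PER_SITE    # 1.0e6  -> 1 million websites
print(f"{objects:,.0f} objects, {websites:,.0f} websites")
```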

How many requests could it answer? Answering a decentralized website request means asking “do I have the requested object?” and, if yes, serving it. If this D-Web server is one of many, it may not have every webpage on it, even though it seems one machine could probably store all pages for a long stretch of the Decentralized Web’s growth.

Let’s break it into two types of request: “Do we have it?” and “Here is the web object.” “Do we have it?” can be answered efficiently with a Bloom Filter. It works by taking the request, hashing it eight times, and looking those bits up in RAM to see if they are set. I will not explain it further than to say an entry takes about 3 bytes of RAM and can answer questions very, very fast. Therefore, the lookup array for 5 billion objects would take 15GB, which is a small percentage of our RAM.
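For the curious, here is a toy Bloom filter in Python. The eight hashes and roughly 3 bytes (24 bits) per entry match the figures above; a production D-Web server would of course use a tuned native implementation rather than this sketch:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions per key, one bit array in RAM."""

    def __init__(self, n_items: int, bits_per_item: int = 24, n_hashes: int = 8):
        self.size = n_items * bits_per_item      # total bits (~3 bytes/entry)
        self.n_hashes = n_hashes
        self.bits = bytearray(self.size // 8 + 1)

    def _positions(self, key: bytes):
        # Derive 8 bit positions from a single sha256 of the key (8 x 4 bytes).
        digest = hashlib.sha256(key).digest()
        for i in range(self.n_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means definitely absent; True means "we probably have it".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# RAM for 5 billion entries at 24 bits each:
print(5_000_000_000 * 24 / 8 / 1e9, "GB")   # 15.0, as estimated above
```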

I don’t know exactly how fast this can run, but it is probably in excess of 100k requests per second. (This paper seemed to put the number over 1 million per second.) A request is a sha256 hash which, recorded in binary, is 32 bytes, so at 100k requests per second the incoming bandwidth would be 3.2MBytes/sec, which is not a problem. Therefore:

  • 100k “Do we have it?” requests processed per second (guess).

The number of requests that can be served depends on the bandwidth of the machine and on the file system. If a web object is 50KB compressed, and served compressed, then with 2Gbits/second we could serve a maximum of 5,000 per second based on bandwidth. If each hard drive can do about 200 seeks per second, and a retrieval takes four seeks on average (this is an estimate), then with 36 hard drives that would be 1,800 retrievals per second. Popular pages would stay in RAM or on an SSD, so those could be served even faster. But assuming 1,800 per second, this would be about 700Mbits/sec, which is not stretching the proposed machines. Therefore:

  • 1,800 “Here is the web object” requests processed per second, maximum.
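A quick check of that serving-rate arithmetic (the seek figures are the guesses above):

```python
LINK_BPS = 2e9                 # 2 Gbit/sec network link
OBJECT_BITS = 50e3 * 8         # 50 KB object, served compressed
SEEKS_PER_DRIVE = 200          # seeks per second per spinning disk
SEEKS_PER_RETRIEVAL = 4        # estimate from above
DRIVES = 36

bandwidth_limit = LINK_BPS / OBJECT_BITS                     # 5,000 objects/sec
seek_limit = DRIVES * SEEKS_PER_DRIVE / SEEKS_PER_RETRIEVAL  # 1,800 objects/sec

rate = min(bandwidth_limit, seek_limit)                      # disk-seek bound
print(rate, "objects/sec;", rate * OBJECT_BITS / 1e6, "Mbit/sec")  # 1800; 720
```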

How many users would this serve? To make a guess, maybe we can look at how mobile devices use web servers. At least in my family, web use is a small percentage of total traffic, and even the most-used sites are unlikely to be decentralized websites (YouTube, for example). So if a user consumes 1GByte per month of web traffic, and 5% of that goes to decentralized websites, then 50MB/month per user of decentralized websites gives an estimate. If the server can serve at 700Mbits/sec, that is 226 Terabytes/month. At 50MB per user, that would be over 4 million users. Therefore:

  • Over 4 million users can be served from that single server (again, a guess).
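And the user estimate, using the guessed 50MB/month of decentralized-web traffic per user:

```python
SERVE_BPS = 700e6                    # ~700 Mbit/sec sustained, from above
SECONDS_PER_MONTH = 30 * 24 * 3600
PER_USER_BYTES = 50e6                # 50 MB/month per user (a guess)

monthly_bytes = SERVE_BPS / 8 * SECONDS_PER_MONTH
print(monthly_bytes / 1e12, "TB/month")                 # ~227
print(monthly_bytes / PER_USER_BYTES / 1e6, "M users")  # ~4.5 million
```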

So, by this argument, a single Decentralized Web Server can serve a million websites to 4 million users and cost $14,000. Even if it does not perform quite this well, one such machine could carry the system for quite a while.

Obviously, we do not want just one Decentralized Web Server, but it is interesting to know that one computer could serve the whole system during its early stages, and that more can be added at any time. With more servers, the system would be more robust, could scale to larger amounts of data, and could serve users faster because content could be brought closer to them.

Performance and cost do not seem to be a problem—in fact, there may be an advantage to the decentralized web over current web server technology.

Posted in Announcements, News | Leave a comment

Geez, Now Internet Insurance?

We seem to make some people mad.

The Internet Archive, a non-profit library, hosts many things. Many, many things. Billions of old webpages, lots of concerts, nostalgia computer games, TV, books, old movies, contributed books, music, and video, and much more.

But some of it seems to make some people mad. China is blocking us, Russia recently stopped blocking us, and India took a crack at blocking us last year. And then there are the occasional denial-of-service attacks by who-knows-who. One recent DDoS attack was apparently claimed by an Anonymous-linked group. Another seemed to ask for a bitcoin to turn it off. Yup, “Pay us $400 and we will put you back on the air.” Really?  (We didn’t give it to them.)

Each time this happens, it causes a bunch of engineers and managers to run around to deal with it. Thankfully, a bunch of people donated this last time, out of sympathy, I guess — thank you!

We have tried to handle these without architectural changes, but it is getting hard. This last time we had to call a vacationing engineer in the middle of his night… Zeus knows we have enough self-inflicted screwups and growing pains to deal with. But now this?

One change we could make would be to send our traffic through CloudFlare, or similar, to filter out unwelcome packets as an “Insurance against Internet attackers.” Some people go to “cloud services” that have the sysadmins filter out the zealous ones. Both of these solutions would mean that our traffic would go through someone else’s hosts, which means $, privacy loss, and general loss of the end-to-end Internet. It is like converting to Gmail because there are so many spammers on the net and Google is capable of filtering out those losers.

The Internet Archive is trying to demonstrate that an affordable, end-to-end strategy works:

  • we protect our readers’ privacy by running our own servers, and try not to log IP addresses;
  • we don’t want co-location centers that control physical access to our servers, so we build our own;
  • we don’t like having someone else run our email servers, so we run our own, even though we get deluged with spam;
  • we do not want someone else to control our IP addresses, so we have our own ASN;
  • we want the web to be even more resilient against the censors and the rot of time, so we pioneer the Decentralized Web.

Having our traffic filtered by a third party only when we are attacked may not be so bad, but it shows it is harder and harder for normal people to run their own servers.

Let’s work together to keep the Internet a welcoming place to both large and small players without needing insurance and third-party protectors.

Optimistically yours,

Brewster Kahle

Founder and Digital Librarian

Posted in Announcements, News | 1 Comment

Decentralized Web Summit: Towards Reliable, Private, and Fun


[See coverage by the NYtimes, Fortune, Boing Boing, other press]


Internet Archive founder Brewster Kahle, “father of the Internet” Vint Cerf, and Sir Tim Berners-Lee, “father of the World Wide Web,” at the first Decentralized Web Summit in San Francisco.

More than 300 web architects, activists, archivists and policy makers gathered at the Internet Archive for the first Decentralized Web Summit, where I was honored to share a stage with internet pioneers Vint Cerf and Sir Tim Berners-Lee. We wanted to bring together the original “fathers of the internet and World Wide Web” with a new generation of builders to see if together we could align around–and in some cases reinvent–a Web that is more reliable, private, and fun. Hackers came from Bangkok to Boston, London and Lisbon, New York and Berlin to answer our call to “Lock Open the Web.”

Building a web that is decentralized–where many websites are delivered through a peer-to-peer network–would lead to the web being hosted from many places, resulting in more reliable access, availability of past versions, access from more places around the world, and higher performance. It can also lead to more reader privacy, because it is harder to watch or control what one reads. Integrating a payments system into a decentralized web can help people make money by publishing on the web without the need for third parties. This meeting focused on the values, technical, policy, and deployment issues of reinventing basic infrastructure like the web.

First, in the opening welcome, Mitchell Baker, head of Mozilla, reported that Mozilla, the company that made “open” mainstream, is going back to its core values, focusing on what users want the Web to be. Mitchell said Mozilla is rethinking everything, even what a browser should be in the coming age. She highlighted four principles we need to think about when building a Decentralized Web: that the Web should be Immediate, Open, Universal and have Agency–that there are policies and standards that help users mediate and control their own Web experiences. Talking about the values that need to be baked into the code turned out to be the dominant theme of the event.

 

Next, Vint Cerf, Google’s Internet Evangelist and “father of the Internet,” called for a “Self-Archiving Web” in the first keynote address. He described a “digital dark age” in which our lives online have disappeared, and how a more advanced Web, one that archives itself throughout time, could help avoid that outcome. Over the three days of events, how to actually build a Web that archives itself came to seem quite doable. In fact, several talented groups, including IPFS and the Dat Project, demonstrated pieces of what could make a Decentralized Web real.

Tim Berners-Lee (father of the Web) opened by saying the current technology and protocols could and should evolve to incorporate what we want from our Web. He told us he created the Web to be decentralized, so that anyone could set up their own server or host their own domain. Over time the Web has become “siloized” and we have “sold our soul of privacy in order to get stuff for free.” When Tim said rethinking the HTTP specification is feasible, the possibilities for change and improvement opened up for everyone.

 

Brewster Kahle of the Internet Archive (me) ventured that we want a Web that bakes our values into the code itself: Universal Access to All Knowledge, freedom of expression, reliability, reader privacy, and fun.

Building reliable access requires serving websites from multiple places on the net. We heard proposals to build “multi-home” websites using content-addressable structures rather than contacting a single website for answers. There were demonstrations of ZeroNet, IPFS, and Dat that did this.

Protecting reader privacy is difficult when all traffic to a website can be monitored, blocked, or controlled. The security panel, which included Mike Perry of Tor and Paige Peterson of MaidSafe, said that having one’s requests and retrieved documents “hopping around” rather than going straight from server to client can help ensure greater privacy. Combining this with multi-homed access seems like a good start.

We can start making a smooth transition from the current Web by leveraging these ideas with all of our current infrastructure of browsers and URLs, and without requiring people to download software. While not ideal, we can build a Decentralized Web on top of the current Web using JavaScript, so that each reader of the Decentralized Web is also a server of it, allowing the Web to naturally scale and reinforce itself as more readers join in. The Internet Archive has already started supporting these projects with free machines and storage.

“Polyfill” was the final bit of advice I got from Tim Berners-Lee before he left. Polyfill, he said, is a kind of English equivalent of Spackle, used to fix and patch walls. In this case, polyfill is JavaScript. He said that almost all proposals to change the Web are prototyped in JavaScript and can then be built in once they are debugged and demonstrated to be useful.

There we have it: let’s make polyfill additions to the existing Web to demonstrate how a Reliable, Private, and Fun Web can emerge.

Congratulations to the Internet Archive for pulling this together.

Arms Raised Group Shot Builders Day

Posted in Announcements | 2 Comments

Copyright Office’s Proposed Notice and Staydown System Would Force the Internet Archive and Other Platforms to Censor the Web

In May, the US Copyright Office came to San Francisco to hear from various stakeholders about how well Section 512 of the Digital Millennium Copyright Act, or DMCA, is working. The Internet Archive appeared at these hearings to offer the perspective of nonprofit libraries. The DMCA is the part of copyright law that provides a “notice and takedown” process for copyrighted works on the Internet. Platforms that host content can get legal immunity if they take down materials when they get a complaint from the copyright owner.

This is an incredibly powerful tool for content owners–there is no other area of law that allows content to be removed from the web on a mere accusation of guilt. Victims of harassment, defamation, invasion of privacy, or any other legal wrong have to go to court to have anything taken down.

Unfortunately, this tool can be, and has been abused. We see this every day at the Internet Archive when we get overbroad DMCA takedown notices, claiming material that is in the public domain, is fair use, or is critical of the content owner. More often than not, these bad notices are just mistakes, but sometimes notices are sent intentionally to silence speech. Since this tool can be so easily abused, it is one that should be approached with extreme caution.

We were very concerned to hear that the Copyright Office is strongly considering recommending changing the DMCA to mandate a “Notice and Staydown” regime. This is the language that the Copyright Office uses to talk about censoring the web. The idea is that once a platform gets a notice regarding a specific copyrighted work, like a specific picture, song, book, or film, that platform would then be responsible for making sure that the work never appears on the platform ever again. Other users would have to be prevented, using filtering technology, from ever posting that specific content ever again. It would have to “Stay Down.”

This idea is dangerous in a number of ways:

  • No Due Process. Notice and Staydown would remove all of the user protections built into the DMCA. Currently, the statute allows users who believe material they have posted was taken down in error to file a counter-notification. If the copyright holder does not choose to bring a lawsuit, then the content can be reposted. The law also prohibits the sending of false notices, and allows users who have been falsely accused to bring a claim against their accuser. These protections for the user would simply go away if platforms were forced to proactively filter content.
  • Requires Platforms to Monitor User Activity. The current statute protects user privacy by explicitly stating that platforms have no duty to monitor user activity for copyright infringement. Notice and Staydown would change this–requiring platforms to be constantly looking over users’ shoulders.
  • Promotes Censorship. Notice and Staydown has a serious First Amendment problem. The government mandating the use of technology to affirmatively take speech offline before it’s even posted, without any form of review, potentially violates free speech laws.
  • It Just Won’t Work In Most Cases. Piracy on the web is a real problem for creators. However, filtering at the platform level is just very unlikely to stop the worst of the piracy problem. Filtering doesn’t work for links. It doesn’t work well for certain types of content, like photographs, which are easily altered to avoid the filter. And so far, no computer algorithm has been developed that can determine whether a particular upload is fair use. Notice and Staydown would force many cases of legitimate fair use off the web. Further, intermediaries are not the right party to be implementing this technology. They don’t have all the facts about the works, such as whether they have been licensed. Most platforms are not in a good position to be making legal judgements, and they are motivated to avoid the potential for high statutory damages. All this means that platforms are likely to filter out legitimate uses of content.
  • Lacks Transparency.  These technical filters would act as a black box that the public would have no ability to review or appeal. It would be very difficult to know how much legitimate activity was being censored.
  • Costly and Burdensome. Developing an accurate filter that will work for each and every platform on the web will be an extremely costly endeavor. YouTube spent $60 million developing its Content ID system, which only works for audio and video content. It is very expensive to do this well. Nonprofits, libraries, and educational institutions who act as internet service providers would be forced to spend a huge amount of their already scarce resources policing copyright.
  • Technology Changes Quickly, Law Changes Slowly. The DMCA requires registered DMCA agents to provide a fax number. In 1998, that made sense. Today it is silly. Technology changes far too quickly for law to keep up. Governments should not be in the business of mandating the use of technology to solve a specific policy problem.

The DMCA has its problems, but Notice and Staydown would be an absolute disaster. Unfortunately, members of the general public were not invited to the Copyright Office proceedings last week. The many thousands of comments submitted by Internet users on this subject were not considered valuable input; rather, one panelist characterized them as a “DDoS attack” on the Copyright Office website, showing how little the people who are seeking to regulate the web actually understand it.

The Copyright Office has called for more research on how the DMCA is working for copyright holders and for platforms. We agree that this research is important. However, we must remember that the rest of the online world will also be impacted by changes to the DMCA.

Posted in Announcements, News | 30 Comments

Web Archiving with National Libraries

National Library of Australia

After the Internet Archive started web archiving in the late 1990s, national libraries also took their first steps towards systematic preservation of the web. Over 30 national libraries currently have a web archiving programme. Many of them archive the web under a legal mandate, an extension of the Legal Deposit system that covers non-print publications and enables heritage institutions such as a national library to collect copies of online publications within a country or state.

The Internet Archive has a long tradition of working with national libraries. As a key provider of web archiving technologies and services, the Internet Archive has made available open source software for crawling and access, enabling national bodies to undertake web archiving locally. The Internet Archive also runs a global web archiving service for the general public, a tailored broad-crawling service for national libraries, and Archive-It, a subscription service for creating, managing, accessing and storing web archive collections. Many national libraries are partners of these services.

The Internet Archive conducted a stakeholders’ consultation exercise between November 2015 and March 2016, with the aim of understanding current practices, reviewing the Internet Archive’s current services in that light, and exploring new services for national libraries. Thirty organizations and individuals were consulted, representing national libraries, archives, researchers, independent consultants and web archiving service providers.

The main findings of the consultation, summarized below, give an overview of current web archiving practice at national libraries, as well as a general impression of progress in web archiving and specific feedback on the Internet Archive’s role and services.

  • Strategy and organization
    Web archiving has become increasingly important in national libraries’ strategies. Many want to own the activity and develop the capability in-house. This requires integrating web archives with the library’s other collections and with traditional library practice for collection development. Budget cuts and lack of resources were observed at many national libraries, making it difficult to sustain the ongoing development of tools for web archiving.
  • Quality and comprehensiveness of collection
    There is general frustration about content gaps in the web archives. National libraries also have a strong desire to collect the portions of Twitter, YouTube, Facebook and other social media that are considered part of their respective national domains. They would also like to leverage web archiving as a complementary collecting tool for digital objects that are on the web and included in web archives, such as eBooks, eJournals, music and maps.
  • Access and research use
    National web archives are, in general, poorly used due to access restrictions. Many national libraries wish to support research use of their web archives, by engaging with researchers to understand requirements and eventually embedding web archive collections into the research process.
  • Reflection on 20 years of web archiving
    While there is recognition of the progress in web archiving, there is also a general feeling that the community is stuck with a certain way of doing things, has made no significant technological progress in the last ten years, and is being outpaced by the fast-evolving web.
  • Perception and expectation of Internet Archive’s services
    Aspects of the Internet Archive’s current services are unknown or misperceived. Stakeholders wish for services that complement what national libraries undertake locally and help them put better web archives in place. There is a strong expectation for the Internet Archive to lead the ongoing collaborative development of (especially) Heritrix and the Wayback software. A number of national libraries have expressed the need for a service supporting the use of key software, including maintenance, support and new features. There is also clearly expressed interest in services that can help libraries collect advanced content such as social media and embedded videos.

The Internet Archive would like to thank the participants again for being open with us and providing us with valuable input which will inform the development and improvement of our services.

The full consultation report can be accessed at https://archive.org/details/InternetArchiveStakeholdersConsultationFindingsPublic.

Posted in Announcements, News | 5 Comments

IA + ARC + Cuba


Cuba Music Week is a live and online effort – both crowd sourced and curated – to highlight the importance and beauty of Cuban Music. One goal is to introduce people to ideas and music from this vibrant culture.

In the past we have created “weeks” on Muslim music, Brazil and India. To do this we contact artists, academic institutions, bloggers, broadcasters, venues and collectors to send essays, activities and events that can be coordinated with our event. Sometimes the response is great, sometimes not.

Cuba is our fourth attempt, and we have partnered with Cubadiscos, a Cuban government organization that hosts a weeklong music festival and a symposium on the music in Havana. Cuba has a few problems with the internet, so there is no website. We have posted their activities on our site, from a list that we only got the day before the festival began!

Just for fun have a look at the galleries of record covers, cha cha maybe? Our galleries are one of the best features we create. The ARC doesn’t scan images of other people’s holdings or borrow materials for the site – we own everything pictured. A few of the recordings are taken from the joint ARC and Internet Archive collection stored out in the Richmond warehouses. Here are two sweet ‘almost’ Cuban, afro-Cuban recordings from this collection. They were donated by the family of Jerry Adams.

AfroDizzieGillespie

Flautista

Mr. Adams was a radio DJ who became a major voice in promoting the Monterey Jazz Festival and helped Clint Eastwood build his collection. So there is some very nice stuff here, and a good reason why the Internet Archive is, and should be, going after quality audio collections with us.

One of the best features of the site is the databases, listing the Cuban recordings here at the ARC, and glossaries of genres and instruments – many hundreds of styles and instruments briefly described. It’s info that is only available here. Soon everything will be stolen by Wikipedia, but for now this is probably the only easy-to-find source for much of this information. For audio fun we have worked with the Peabody Award-winning radio show Afropop Worldwide to bring everyone 18 hours on Cuban music. Soon all of their 25+ years of audio will be available on the Internet Archive.
An important outgrowth of this project is our work – both the Internet Archive’s and the ARChive of Contemporary Music’s – with the Cuban National Library José Martí. Last year I met with Pedro Urra, who was working on a project to convert the library’s old typed and handwritten index cards on the recordings in their collection into OCR-readable form. For us they rushed this project forward, and now more than 30,000 cards have been scanned, making this data available online to scholars for the first time. Catalog available here and one example below.
DulceRezazo

Our Cuba site will remain active as an online resource to make this culturally significant body of work readily available to people around the globe for study and enjoyment.

Do have a look at Cuba Music Week and spread the word.

Thanks,  B. George,

Director, The ARChive of Contemporary Music, NYC.

Sound Curator, The Internet Archive, San Francisco

 

Posted in News | Comments Off on IA + ARC + Cuba

Join us for the first Decentralized Web Summit — June 8-9, in SF

Decentralized Web Summit: Locking The Web Open at the Internet Archive

The first Decentralized Web Summit is a call for dreamers and builders who believe we can lock the Web open for good. The goal of the Summit (June 8) and the Meetup featuring lightning talks and workshops (June 9) is to spark collaboration and take concrete steps to create a better Web.

Together we can build a more reliable, more dynamic, and more private Web on top of the existing web infrastructure.

At the Summit on June 8, the “father of the Internet,” Vint Cerf, will share with us his “Lessons from the Internet,” the things he’s learned in his 40+ years that may help us create a new, more secure, private and robust Web. EFF’s Cory Doctorow, such a fine weaver of digital dystopias in his science fiction, will share what has gone awry with the current Web and what kind of values we need to build into the code this time.

Current builders of decentralized technologies will be on hand to share their visions of how we can build a fully decentralized Web. The founders and builders of IPFS, the Dat Project, WebTorrent, Tahoe-LAFS, zcash, Zeronet.io, BitTorrent, Ethereum, BigChainDB, Blockstack, Interledger, Mediachain, MaidSafe, Storj and others will present their technologies and answer questions. If you have a project or workshop to share on June 9, we’d love to hear from you at Dwebsummit@archive.org.

You can join the conversation in our Decentralized Web Slack channel, or — as a decentralized option — you can join the Slack as a guest through Matrix.

It will take the passion and expertise of many to lock the Web open. As Internet Archive founder, Brewster Kahle, wrote last year:

We can make openness irrevocable.
We can build this.

We can do it together.

On June 8-9, let’s collaborate to get there.

For more information and official schedule, go to decentralizedweb.net.

Event Info:

Wednesday, June 8, 2016 at 8:00 AM to Thursday, June 9, 2016 at 8:00 PM

Internet Archive, 300 Funston Avenue, San Francisco, CA 94118

Please register on our Eventbrite (limit 250 participants on June 8).

Posted in Announcements, Event | 10 Comments

The tech powering the Political TV Ad Archive

Ever wonder how we built the Political TV Ad Archive? This post explains what happens backstage: how we use advanced technology to generate the counts of how many times a particular ad has aired on television, where, and when, in markets that we track.

There are three pieces to the Political TV Ad Archive:

  • The Internet Archive collects, prepares, and serves the TV content in markets where we have feeds. Collection of TV is part of a much larger effort to meet the organization’s mission of providing “Universal Access to All Knowledge.” The Internet Archive is the online home to millions of free books, movies, software, music, images, web pages and more.
  • The Duplitron 5000 is our whimsical name for an open source system responsible for taking video and creating unique, compressed versions of the audio tracks. These are known as audio fingerprints. We create an audio fingerprint for each political ad that we discover, which we then match against our incoming stream of broadcast television to find each new copy, or airing, of that ad. These results are reported back to the Internet Archive.
  • The Political TV Ad Archive is a WordPress site that combines our data with our videos and presents them to the rest of the world. On this website, for the sake of posterity, we also archive copies of political ads that may be airing in markets we don’t track, or exclusively on social media. But for the ads that show up in areas where we’re collecting TV, we are able to present the added information about airings.

 

Step 1: recording television

We have a whole bunch of hardware spread around the country to record television. That content is then pieced together to form the programs that get stored on the Internet Archive’s servers. We have a few ways to collect TV content. In some cases, such as the San Francisco market, we own and manage the hardware that records local cable. In other cases, such as markets in Ohio and Iowa, the content is provided to us by third party services.

Regardless of how we get the data, the pipeline takes it to the same place. We record in minute-long chunks of video and stitch them together into programs based on what we know about the station’s schedule. This results in video segments of anywhere from 30 minutes to 12 hours. Those programs are then turned into a variety of file formats for archival purposes.

The ad counts we publish are based on actual airings, as opposed to reported airings. This means that we are not estimating counts by analyzing Federal Election Commission (FEC) reports on spending by campaigns. Nor are we digitizing reports filed by broadcasting stations with the Federal Communications Commission (FCC) about political ads, though that is a worthy goal. Instead we generate counts by looking at what actually has been broadcast to the public.

Because we are working from the source, we know we aren’t being misled. On the flip side, this means that we can only report counts for the channels we actively track and record. In the first phase of our project, we tracked more than 20 markets in 11 key primary states (details here.) We’re now in the process of planning which markets we’ll track for the general elections. Our main constraint is simple: money. Capturing TV comes at a cost.

A lot can go wrong here. Storms can affect reception, packets can be lost or corrupted before they reach our servers. The result can be time shifts or missing content. But most of the time the data winds up sitting comfortably on our hard drives unscathed.

Step 2: searching television

Video is terrible when you’re trying to look for a specific piece of it. It’s slow, it’s heavy, it is far better suited for watching than for working with, but sometimes you need to find a way.

There are a few things to try. One is transcription; if you have a time-coded transcript you can do anything. Like create a text editor for video, or search for key phrases, like “I approve this message.”

The problem is that most television is not precisely transcribed. Closed captions are required for most U.S. TV programs, but not for advertisements. Shockingly, most political ads are not captioned. There are a few open source tools out there for automated transcript generation, but the results leave much to be desired.

Introducing audio fingerprinting

We use a free and open tool called audfprint to convert our audio files into audio fingerprints.

An audio fingerprint is a summarized version of an audio file, one that has removed everything except the most interesting pieces of every few milliseconds. The trick is that the summaries are formed in a way that makes it easy to compare them, and because they are summaries, the resulting fingerprint is a lot smaller and faster to work with than the original.

The audio fingerprints we use are based on a thing called frequency. Sounds are made up of waves, and each wave repeats–oscillates–at different rates. Faster repetitions are linked to higher sounds, lower repetitions are lower sounds.

An audio file contains instructions that tell a computer how to generate these waves. Audfprint breaks the audio files into tiny chunks (around 20 chunks per second) and runs a mathematical function on each fragment to identify the most prominent waves and their corresponding frequencies.

The rest is thrown out, the summaries are stored, and the result is an audio fingerprint.

If the same sound exists in two files, a common set of dominant frequencies will be seen in both fingerprints. Audfprint makes it possible to compare the chunks of two sound files, count how many they have in common, and check how many appear at roughly the same distance from one another.

This is what we use to find copies of political ads.
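To make the idea concrete, here is a toy version of that process in Python. It is not audfprint’s actual algorithm (audfprint hashes pairs of spectral peaks); it only illustrates the two steps described above: summarizing audio as a dominant frequency per chunk, then counting how many chunks two fingerprints share at a given alignment.

```python
import numpy as np

def toy_fingerprint(samples: np.ndarray, rate: int = 8000, chunk_ms: int = 50):
    """Summarize audio as the loudest frequency bin of each short chunk."""
    chunk = int(rate * chunk_ms / 1000)
    peaks = []
    for i in range(len(samples) // chunk):
        spectrum = np.abs(np.fft.rfft(samples[i * chunk:(i + 1) * chunk]))
        peaks.append(int(spectrum.argmax()))   # keep only the dominant bin
    return peaks

def best_match(ad_fp, broadcast_fp):
    """Slide the ad fingerprint along the broadcast and count agreeing chunks."""
    best = 0
    for offset in range(len(broadcast_fp) - len(ad_fp) + 1):
        window = broadcast_fp[offset:offset + len(ad_fp)]
        best = max(best, sum(a == b for a, b in zip(ad_fp, window)))
    return best   # a high count at some offset suggests an airing of the ad
```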

Step 3: cataloguing political ads

When we discover a new political ad the first thing we do is register it on the Internet Archive, kicking off the ingestion process. The person who found it types in some basic information such as who the ad mentions, who paid for it, and what topics are discussed.

The ad is then sent to the system we built to manage our fingerprinting workflow, which we whimsically call the Duplitron 5000—or the “DT5k.” It uses audfprint to generate fingerprints, organizes how the fingerprints are stored, processes the comparison results, and allows us to scale across millions of minutes of television.

DT5k generates a fingerprint for the ad, stores it, and then compares that fingerprint with hundreds of thousands of existing fingerprints for the shows that had been previously ingested into the system. It takes a few hours for all of the results to come in. When they do, the Duplitron makes sense of the numbers and tells the archive which programs contain copies of the ad and what time the ad aired.

These results end up being fairly accurate, but not perfect. The matches are based on audio, not video, which means we run into trouble when the same soundtrack is used in a political ad and in, for instance, an infomercial.

We are working on improving the system to filter out these kinds of false positives, but even with no changes these fingerprints have provided solid data across the markets we track.


The Duplitron 5000, counting political ads. Credit: Lyla Duey.

Step 4: enjoying the results

And now you understand a little bit more about our system. You can download our data and watch the ads at the Political TV Ad Archive.  (For more on our metadata–what’s in it, and what you can do with it–read here.)

Over the coming months we are working to make the system more accurate. We are also exploring ways to identify newly released political ads without any need for manual entry.

P.S. We’re also working to make it as easy as possible for any researchers to download all of our fingerprints to use in their own local copies of the Duplitron 5000. Would you like to experiment with this capability? If so, contact me on Twitter at @slifty.

Posted in Announcements, News, Television Archive | Tagged , , , , , | Comments Off on The tech powering the Political TV Ad Archive

Discover Books Donates Large Numbers of Books

The Internet Archive is proud to partner with Discover Books, a major used book seller, to help let the stories in books live on. Discover Books is donating books that the Internet Archive does not yet own and that would otherwise have gone to a landfill. Through this process the Internet Archive has more books to digitize and preserve.

Together we are giving books the longest life possible both in print and online.

Thank you to discoverbooks.com.

Posted in Announcements, Books Archive, News | 4 Comments

Reflections on From Clay to the Cloud: The Internet Archive and Our Digital Legacy, a.k.a. The Internet Archive – The Exhibition!


Photograph by Jason Scott

By Carolyn Peter

It started with a visit to Nuala Creed’s ceramics studio in Petaluma in the spring of 2014. My interest was piqued as she described “a commission of sculptures for the Internet Archive” that was ever-growing. She heavily encouraged me to stop by the Archive to experience the famous Friday lunch and to see her work. I’m so glad she did.

While enjoying a tasty lunch of sausage and salad, I listened to Archive staff members talk about their week and curious visitors who shared their inspiration for coming to the Archive for a meal and a tour. I did not understand all the technical vocabulary, but I was struck by all the individuals who were working together on a project which, prior to this day, I had only experienced as a website on my computer screen.

Of course, I fell in love with Nuala’s sculptures as soon as we stepped inside the Great Room, where 100+ colorful figures stood facing the stage as if waiting for a performance or lecture to begin. The odd objects in their hands, their personally fashioned clothing, and their quirky expressions reinforced the idea that the Internet Archive was the shared, creative effort of a huge number of individuals. The technical was becoming human to me. By the time Brewster had brought his visitors back down into the common workspace, my mind was racing with ideas and questions.

As a museum professional who has spent her career making choices about what works of art to acquire and preserve for future generations and as someone who takes great joy in handling and caring for objects, I wondered what threads ran from this digital archive through to more traditional archives and libraries. If I had sleepless nights wondering how to best protect a work of art for posterity, how was the Internet Archive going to ensure that its vast data was going to survive for millennia to come?

Before I knew what I was doing, I heard myself telling Brewster that I would love to do an exhibition about the Internet Archive. I don’t think he or I fully registered what I was saying. That would take more time.

This curatorial challenge brewed in my mind. The more I thought about it, the more I thought an exploration of the past, present and future of archives and libraries and the basic human desire to preserve knowledge for future generations would be a perfect topic for an exhibition in my university art gallery. I knew Nuala’s series could serve as the core artistic and humanizing element for such a show, but I wondered how I would be able to convey these ideas and questions in an accessible and interesting way, how to make this invisible digital world visible? And turning the tables—if Brewster had brought art into the world of technology with his commission of the Internet Archivists series, how could I bring technology into the artistic realm?

When I approached Brewster for a second time about a year later with a proposal, he thought I was crazy. He has said it was as if I had told him I wanted to do “The Internet Archive, The Musical.” A few conversations and months later, he agreed to let me run with the idea.

I have to admit, at times, I too wondered if I was crazy. I wrestled with devising ways to visibly convey the Archive’s unfathomable vastness while also trying to spotlight the diverse aspects of the Archive through hands-on displays.


Photograph by Jason Scott.

While some big dreams had to be let go, I was able to achieve most of the goals I set out for the exhibition. Transporting thirty-two of Nuala’s fragile sculptures to Los Angeles required two days of careful packing and a fine art shipping truck committed solely to this special load. Along with film editor Chris Jones and cameraman Scott Oller, I also created a film that documents the story of the Internet Archivists sculpture series through interviews with Nuala, Brewster and a number of the archivists who have had their sculptural portraits made.

When visitors entered the gallery, they were greeted by three of the Internet Archivist figures and a full-scale shipping container (a trompe-l’oeil work of art by Makayla Blanchard) that conveyed Brewster’s often-repeated claim that he had fit the entire World Wide Web inside a shipping container. The exhibition was filled with juxtapositions of the old and new. To the right of the three archivists was a case filled with a dozen ancient clay cuneiforms and pieces of Egyptian papyrus introducing very early forms of archiving. A china hutch displayed out of fashion media formats that the Internet Archive has been converting into digital form such as record albums, cassette tapes, slides, and VHS tapes. I partnered with LMU’s librarians to bring the mystery of archiving out into the light. Using one of the Archive’s Tabletop Scribes, the librarians scanned and digitized numerous rare books from their collection. The exhibition also included displays and computer monitors so visitors could explore the Wayback Machine, listen to music from the archive’s collections, play vintage video games and test out the Oculus Rift.


Photograph by Brian Forrest.

In the end, I think the exhibition asked a lot more questions than it answered. Nevertheless, I hope this first exhibition will spark others to think of ways to make the abstract ideas and invisible aspects of digital archives more tangible. Who knows, maybe, a musical is in the Internet Archive’s future.

I was sad to pack up the clay archivists and say goodbye to their smiling faces. I’m sure they are happy to be back with the rest of their friends in the Great Room on Funston Avenue, but oh, the stories they have to tell of their travels to a gallery in Los Angeles.

Carolyn Peter is the director and curator of the Laband Art Gallery at Loyola Marymount University. She curated From Clay to the Cloud: The Internet Archive and Our Digital Legacy, which was on view from January 23-March 20, 2016 at the Laband.

Posted in Announcements, Event | Comments Off on Reflections on From Clay to the Cloud: The Internet Archive and Our Digital Legacy, a.k.a. The Internet Archive – The Exhibition!

Google Library Project Legal: Let the Robots Read!

Guardian of Law by James Earle Fraser, US Supreme Court

The decade-long legal battle over Google’s massive book scanning project is finally over, and it’s a huge win for libraries and fair use. On Monday, the Supreme Court declined to hear an appeal by the Authors Guild, which had argued that Google’s scanning of millions of books was an infringement of copyright on a grand scale. The Supreme Court’s decision means that the Second Circuit’s holding that Google’s creation of a database including millions of digital books is fair use still stands. The appeals court explained how its fair use rationale aligns with the very purpose of copyright law: “[W]hile authors are undoubtedly important intended beneficiaries of copyright, the ultimate, primary intended beneficiary is the public, whose access to knowledge copyright seeks to advance by providing rewards for authorship.”

Google Books gives readers and internet users the world over access to millions of works that had previously been hidden away in the archives of our most elite universities. As a Google representative said in a statement, “The product acts like a card catalog for the digital age by giving people a new way to find and buy books while at the same time advancing the interests of authors.”

Google began scanning books in partnership with a group of university libraries in 2004. In 2005, author and publisher groups filed a class action lawsuit to put a stop to the project. The parties agreed to settle the lawsuit in a manner that would have forever changed the legal landscape around book rights. The District Court judge rejected the settlement in 2011, based on concerns about competition, access, and fairness, and so litigation over the core question of fair use resumed.

Judge Chin, Judge Leval, and the Supreme Court all made the right decisions along the long and winding path to Google’s victory. Libraries around the country are now free to rely on fair use as they determine how to manage their own digitization projects–encouraging innovation and increasing our access to human knowledge.

Posted in Announcements, Books Archive, News | 6 Comments

Truck and Back Again: The Internet Archive Truck Takes a Detour

When one of our employees came out of his home over the weekend, he saw an empty parking space. Granted, in San Francisco, that’s a pretty precious thing, but since this empty parking space had held the Internet Archive Truck for the previous two days, he was not feeling particularly lucky.

A staff conversation then ensued, the city was called to see if the truck had been towed, and after a short time, it became obvious that no, somebody had stolen the Truck.

This in itself is not news: thousands of vehicles are stolen in the Bay Area every year. But what made this one unusual was the nature of the vehicle stolen… the Truck is a pretty unique-looking vehicle.



Once the report was filed with the police and a few more checks were made to ensure that the truck was absolutely, positively missing and presumed stolen, the truck’s theft was announced on Twitter, which garnered tens of thousands of views and spread the news very far. Thanks to everyone who got the word out.

What was not expected, besides the initial theft, was that a lot of people wondered why the Internet Archive, essentially a website, would have a truck. So, here’s a little bit about why.

Besides providing older websites, books, movies, music, software and other materials to millions of visitors a day, the Internet Archive also has buildings for physical storage located in Richmond, just outside the limits of San Francisco. In these buildings, we hold copies of books we’ve scanned, audio recordings, software boxes, films, and a variety of other materials that we are either turning digital or holding for the future. It turns out you can’t be a 100% online experience – physical life just gets in the way. We also have multiple data centers and the need to transport equipment between them.

Therefore, we’ve had a hard-working vehicle for getting these materials around: a 2003 GMC Savana Cutaway G3500, often parked out front of the Archive’s 300 Funston Avenue address and making up to several trips a week between our various locations.

In a touch of whimsy, the truck has had a unique paint job for most of its life with the Archive. Notably, this isn’t even the first mural it had on its sides; here is a shot with the previous mural:


We’re not sure of the motivation in stealing this rather unique and noticeable vehicle, and there seems to be some evidence it was driven around the city for a while after it was taken. But yesterday, we were contacted by the San Francisco Police Department with really great news:

The Truck has been recovered!

Left abandoned by the side of the road, the truck was found and is about to be returned to the Archive and, with luck, will soon be back in service helping us prepare and transport materials related to our mission: to bring the world’s knowledge to everyone.

Again, thanks to everyone who sounded out the original call for the truck’s return, and to the SFPD for getting a hold of the truck so quickly after it was gone.

Posted in Announcements, Cool items | 9 Comments

Join us for “How Digital Memory is Shaping our Future” with Abby Smith Rumsey– April 26

Abby Smith Rumsey. Photo by Cindi de Channes.

What is the future of human memory? What will people know about us when we are gone?

Abby Smith Rumsey, historian and author, has explored these important questions and more in her new book When We Are No More: How Digital Memory is Shaping Our Future.

On the evening of Tuesday, April 26 at 7 p.m., the Internet Archive hosts Abby Smith Rumsey as she takes us on a journey of human memory from prehistoric times to the present, highlighting the turning points in technology that have allowed us to understand more about the history of the world around us.

Each step along the way – from paintings on cave walls to cuneiform on clay tablets, from the Gutenberg printing press to the recent technological advances of digital storage – shows how humans have adapted to the increasing need for new methods to share knowledge with a widening community. In addition to these milestones of human communication, the development of machinery in the industrial age helped unlock the geological record of the physical world around us, changing how our societies think about time and change to the natural environment on a grand scale.

Examining the past helps us understand where the future might lead us. Yet with our current methods of digital storage, what will still be accessible and what steps can we take to make sure knowledge persists? Out of the vast amounts of data that we are capable of saving, what will be considered important? Only time will tell, and it will be when “we are no more.”  The Internet Archive, under the leadership of Brewster Kahle, is one organization playing an important role in bringing our civilization’s record of knowledge into the future. Smith Rumsey will share her insights into how we can leave a legacy for those in the future to best understand our lives, our struggles, our passions – our very humanity.

We hope you’ll join us for an enlightening evening with this thought-provoking author, historian and librarian.

Event Info:
How Digital Memory is Shaping Our Future:  A Conversation with Abby Smith Rumsey
Tuesday, April 26, 2016
Internet Archive, 300 Funston Avenue, San Francisco, CA 94118

Doors open at 6:30 PM; talk begins at 7:00 PM
Reception and book signing to follow presentation

This event is free and open to the public.  Please RSVP to our Eventbrite at:
http://www.eventbrite.com/e/abby-smith-rumsey-how-digital-memory-is-shaping-our-future-tickets-22473471759

For more information about Abby Smith Rumsey and her book, please visit her website at www.rumseywrites.com.

Posted in News | 3 Comments

Upcoming changes in epub generation

Epub is a format for ebooks that is used on book reader devices. It is often mostly text, but can incorporate images. The Internet Archive offers epubs in two cases: when a user uploads them, and when they are created from other formats, such as scanned books or uploaded PDFs made up of images of pages.

The Internet Archive creates them from images of pages using “optical character recognition” (OCR) technology. The result is then reformatted into the epub format (currently epub v2). These files are sometimes created “on-the-fly” and sometimes created as files and stored in our item directories. All “on-the-fly” epubs use the newest code, whereas stored ones use the code available at the time of generation.

Because of a change in the output format of our OCR engine last August, many of the epubs generated between then and last week have been faulty. Newly generated epubs are now fixed, and we will soon go back and fix the faulty ones that were stored. We have also discovered that some of the older epubs are faulty, and it is difficult to know which.

To fix this we are shifting to “on-the-fly” generation for all epubs, so that every epub gets the newest code. This is how we already generate daisy, mobi, and many zip files as well. To access the epub for a book we have scanned, the URL is https://archive.org/download/ID/ID.epub, for instance https://archive.org/download/recordofpennsylv00linn/recordofpennsylv00linn.epub.

More generally, an epub can be generated for an item when the ocr field in its meta.xml does not say “language not currently OCRable” and the item contains an abbyy-format file. For instance, in an item’s file list, the presence of an abbyy file downloadable at http://archive.org/download/file_abbyy.gz means a corresponding epub file can be downloaded at http://archive.org/download/file.epub.
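As a sketch of how a script might check this in Python, the example below assumes the public archive.org metadata endpoint (https://archive.org/metadata/IDENTIFIER), which lists an item’s files and metadata; the epub URL pattern is the one given above, and the identifier is the example item from this post:

```python
import json
from urllib.request import urlopen

def epub_url_if_available(identifier: str):
    """Return the on-the-fly epub URL for an archive.org item, or None.

    Assumes the archive.org metadata endpoint, which lists an item's files
    and metadata; the epub URL pattern is the one described in this post.
    """
    meta = json.load(urlopen(f"https://archive.org/metadata/{identifier}"))
    if meta.get("metadata", {}).get("ocr") == "language not currently OCRable":
        return None
    names = (f.get("name", "") for f in meta.get("files", []))
    if any(name.endswith("_abbyy.gz") for name in names):
        return f"https://archive.org/download/{identifier}/{identifier}.epub"
    return None

print(epub_url_if_available("recordofpennsylv00linn"))
```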

Posted in News | Comments Off on Upcoming changes in epub generation

New video shows rich resources available at Political TV Ad Archive

Since our launch on January 22, the Political TV Ad Archive has archived more than 1,080 ads with more than 155,000 airings. We’ve trained hundreds of journalists, students, and other interested members of the public with face-to-face trainings. But much as we would like to, we can’t talk to each of you individually. That’s why we created this video.

Watch the video for an overview of the project, the wealth of information it provides, and how fact checkers and journalists have been using it to enrich their reporting. It is a great introduction for educators to use with students, for civic groups to engage their membership in the political process, and for reporters who want to get the basics on how to use the site.

And remember: we want to hear from you about how you are using the Political TV Ad Archive. Please drop us an email at politicalad@archive.org or tweet us @PolitAdArchive. Over the week ahead, we’ll be highlighting examples of how educators have used the project in their classrooms. We’d love to feature examples of how other members of the public are using this collection to enhance deeper understanding of the 2016 elections.

Going forward, we are tracking ads in the New York City, Philadelphia, San Francisco, and Washington, DC markets. These markets will provide a window on political ads appearing in several upcoming primary states: California, Maryland, New Jersey, New York, and Pennsylvania. 

Enjoy!

Posted in Announcements, News | Comments Off on New video shows rich resources available at Political TV Ad Archive

Getting back to “View Source” on the Web: the Movable Web / Decentralized Web

The Web 1.0 moved so fast partly because you could “View Source” on a webpage you liked and then modify and re-use it to make your own webpages. This even worked with pages with JavaScript programs—you could see how it worked, modify and re-use it. The Web jumped forward.

Then came Web 2.0, where the big thing was interaction through “APIs,” or application programming interfaces. This meant that the guts of a website lived on the server, and you only got to ask approved questions and receive approved answers, or have the server specially format a webpage for you with your answer on it. The plus side was that websites had more dynamic webpages, but learning from how others did things became harder.

Power to the People went to Power to the Server.

Can we get both? I believe we can, with a new Web built on top of the existing Web. A “decentralized web” or “movable web” has many privacy and archivability features, but another feature could be knowledge reuse. In this way, the files that make up a website—text/HTML, programs, and data—are available to users who want to see them.

The decentralized Web works by distributing the files that make up a website peer-to-peer; the website then runs in your browser. Because it is completely portable, the website carries all the pieces it needs: text, programs, and data. It can all be versioned, archived, and examined.

[Upcoming Summit on the Decentralized Web at the Internet Archive June 8th, 2016]

For instance, this demo stores the pages of a blog in a peer-to-peer file system called IPFS, along with the site’s search engine, written in JavaScript, which runs locally in the browser. The browser downloads the pages, the JavaScript, and the search-engine index from many places on the net and renders them locally. The complete website, including its search engine and index, is therefore downloadable and inspectable.
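
To make the content-addressed fetching concrete, here is a minimal Python sketch. The gateway URLs are examples and the content identifier in the comment is hypothetical; the point is only that the same hash can be requested from unrelated sources:

    import requests

    # Example public IPFS HTTP gateways; a local daemon's gateway at
    # http://127.0.0.1:8080 would work the same way.
    GATEWAYS = ["https://ipfs.io", "https://dweb.link"]

    def fetch_everywhere(cid):
        """Fetch one content identifier (CID) from several unrelated gateways.

        Because the request names the content's hash rather than a particular
        server, every source should return byte-identical data."""
        copies = [requests.get(f"{gw}/ipfs/{cid}", timeout=30).content
                  for gw in GATEWAYS]
        assert all(c == copies[0] for c in copies)
        return copies[0]

    # Hypothetical usage -- substitute the CID of any published page:
    # page = fetch_everywhere("QmSomePublishedPageHash")

Since the site’s JavaScript and search-engine index travel as ordinary files, the same kind of fetch retrieves them too.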

This new Web could be a way to distribute datasets, because the data would move together with programs that can make use of it, helping to document the dataset. This use of the decentralized Web became clear to me while talking with Karissa McKelvey and Max Ogden of the DAT Data project, who are working on distributing scientific datasets.

What if scientific papers evolved to become movable websites (call them “distributed websites” or “decentralized websites”)? That way, the text of the paper, the code, and the data would all move around together, documenting one another. The whole package could be archived, shared, and examined.

Now that would be “View Source” we could all live with and learn from.

Posted in News | Comments Off on Getting back to “View Source” on the Web: the Movable Web / Decentralized Web

The Internet Archive, ALA, and SAA Brief Filed in TV News Fair Use Case

The Internet Archive, joined by the American Library Association, the Association of College and Research Libraries, the Association of Research Libraries, and the Society of American Archivists filed an amicus brief in Fox v. TVEyes on March 23, 2016. In the brief, the Internet Archive and its partners urge the court to issue a decision that will support rather than hinder the development of comprehensive archives of television broadcasts.

The case involves a copyright dispute between Fox News and TVEyes, a service that records all content broadcast by more than 1,400 television and radio stations and transforms the content into a searchable database for its subscribers. Fox News sued TVEyes in 2013, alleging that the service violates its copyright. TVEyes asserted that its use of Fox News content is protected by fair use.

Drawing on the Internet Archive’s experience with its TV News Archive and Political TV Ad Archive, the friend-of-the-court brief highlights the public benefits that flow from archiving and making television content available for public access. “The TV News Archive allows the public to view previously aired broadcasts–as they actually went out over the air–to evaluate and understand statements made by public officials, members of the news media, advertising sponsors, and others, encouraging public discourse and political accountability,” said Roger Macdonald, Director of the TV Archive.

Moreover, creating digital databases of television content allows aggregated information about the broadcasts themselves to come to light, unlocking researchers’ ability to process, mine, and analyze media content as data. “Like library collections of books and newspapers, television archives like the TV News Archive and the Political TV Ad Archive allow anyone to thoughtfully assess content from these influential media, enhancing the work of journalists, scholars, teachers, librarians, civic organizations, and other engaged citizens,” said Tomasz Barczyk, a Berkeley Law student from the Samuelson Law, Technology & Public Policy Clinic who helped author the brief.

The brief also explains the importance of fostering a robust community of archiving organizations. Because television broadcasts are ephemeral, content is easily lost if efforts are not made to preserve it systematically.  In fact, a number of historically and culturally significant broadcasts have already been lost, from BBC news coverage of 9/11 to early episodes of Doctor Who. Archiving services prevent this disappearance by collecting, indexing, and preserving broadcast content for future public access.

A decision in this case against fair use would chill these services and could result in the loss of significant cultural resources. “This is an important case for the future of digital archives,” explained William Binkley, the other student attorney who worked on the brief. “If the court rules against TVEyes, there’s a real risk it could discourage efforts by non-profits to create searchable databases of television clips. That would deprive researchers and the general public of a tremendously valuable source of knowledge.”

The Internet Archive would like to thank Tomasz Barczyk, William Binkley, and Brianna Schofield from the Samuelson Law, Technology & Public Policy Clinic at Berkeley Law for helping to introduce an important library perspective as the Second Circuit court considers this case with important cultural implications.

Posted in Announcements, News | 1 Comment

Three takeaways after logging 1,032 political ads in the primaries

The Political TV Ad Archive launched on January 22, 2016, with the goal of archiving airings of political ads across 20 local broadcast markets in nine key primary states and embedding fact checks and source checks of those ads by our journalism partners. We’re now wrapping up this first phase of the project, and are preparing for the second, where we’ll fundraise so we can apply the same approach to political ads in key 2016 general election battleground states.

But first: here are some takeaways from our collection after logging 1,032 ads. Of those ads, we captured 263 airing at least 100 times apiece, for a combined total of more than 145,000 airings.

1. Only a small number of ads earned “Pants on Fire!” or “Four Pinocchio” fact checking ratings. Just four ads received the worst ratings possible from our fact-checking partners.

Donald Trump’s campaign won the only “Pants on Fire” rating awarded by fact checking partner PolitiFact for a campaign ad: “Trump’s television ad purports to show Mexicans swarming over ‘our southern border.’ However, the footage used to support this point actually shows African migrants streaming over a border fence between Morocco and the Spanish enclave of Melilla, more than 5,000 miles away,” wrote PolitiFact reporters C. Eugene Emery Jr. and Louis Jacobson in early January, when Trump released the ad, his very first paid ad of the campaign. The ad aired more than 1,800 times, most heavily in the early primary states of Iowa and New Hampshire.

Trump also won a “four Pinocchio” rating from the Washington Post’s Fact Checker for this ad, which charges John Kasich with helping “Wall Street predator Lehman Brothers destroy the world economy.” “[I]t’s preposterous and simply not credible to say Kasich, as one managing director out of 700, in a firm of 25,000, ‘helped’ the firm ‘destroy the world economy,’” wrote reporter Michelle Ye Hee Lee.

Two other ads received the “four Pinocchio” rating from the Washington Post’s Fact Checker. This one, from Ted Cruz’s campaign, claims that Marco Rubio supported an immigration plan that would have given President Obama the authority to admit Syrian refugees, including ISIS terrorists. “[T]his statement is simply bizarre,” wrote Glenn Kessler. “With or without the Senate immigration bill, Obama had the authority to admit refugees, from any country, under the Refugee Act of 1980, as long as they are refugees and are admissible….What does ISIS have to do with it? Nothing. Terrorists are not admissible under the laws of the United States.”

This one, from Conservative Solutions PAC, the super PAC supporting Rubio, claims that there was only one “Republican hopeful” who had “actually done something” to dismantle the Affordable Care Act, by inserting a provision that prevented insurance companies from being protected against losses if they did not make accurate premium estimates in the first three years of the law. “Rubio goes way too far in claiming credit here,” wrote Kessler. “He raised initial concerns about the risk-corridor provision, but the winning legislative strategy was executed by other lawmakers.”

Overall, our fact-checking and journalism partners—the Center for Responsive Politics, the Center for Public Integrity, FactCheck.org, PolitiFact, and the Washington Post’s Fact Checker—wrote 57 fact- and source-checks of 50 ads sponsored by presidential campaigns and outside groups. (The American Press Institute and the Duke Reporters’ Lab, also partners, provided training and tools for journalists fact-checking ads.)

Of the 25 fact checks done by PolitiFact, 60 percent of the ads earned “Half True,” “Mostly True,” or “True” ratings, with the remainder earning “Mostly False,” “False,” or “Pants on Fire” ratings. The Washington Post’s Fact Checker, the other fact-checking group that uses ratings, fact-checked 11 ads; of these, seven earned ratings of three or four Pinocchios. A series of ads featuring former employees and students denouncing Trump University, from a “dark money” group that doesn’t disclose its donors, earned the coveted “Geppetto Checkmark” for accuracy. Those ads aired widely in Florida and Ohio leading up to the primaries there.

The ad that produced the most fact checks and source checks was this one from the very same group, the American Future Fund: an attack ad on John Kasich. Robert Farley of FactCheck.org wrote, “An ad from a conservative group attacks Ohio Gov. John Kasich as an ‘Obama Republican,’ and misleadingly claims his budget ‘raised taxes by billions, hitting businesses hard and the middle class even harder.'” PolitiFact Ohio reporter Nadia Pflaum gave the ad a “False” rating; Michelle Ye Hee Lee of the Washington Post’s Fact Checker awarded it “Three Pinocchios.” The Center for Public Integrity described the American Future Fund as “a conservative nonprofit linked to the billionaire brothers Charles and David Koch that since 2010 has inundated federal and state races with tens of millions of dollars.”

This ad from Donald Trump’s campaign earned a “Pants on Fire” rating from PolitiFact.

2. Super Campaign Dodger, and other creative ways to experience and analyze political ads. Journalists did some serious digging into the downloadable metadata the Political TV Ad Archive provides here to analyze trends in presidential ad campaigns.

The Economist mashed up data about airings in Iowa and New Hampshire with polling data and asked the question: does political advertising work? The answer—“a bit of MEH” (or “minimal-effects hypothesis”)—in other words, voters are persuaded, but only a little.

Farai Chideya of FiveThirtyEight and Kate Stohr of Fusion delved into data on anti-Trump ads airing ahead of the Florida primary—which Trump went on to win handily, despite the onslaught.

Nick Niedzwiadek plumbed the collection when writing about political ad gaffes for The Wall Street Journal. Nadja Popovich of The Guardian graphed Bernie Sanders’s surge in ad airings in Nevada, ahead of the contest there.

William La Jeunesse of Fox News reported on negative ads here. Philip Bump of The Washington Post used gifs to illustrate just how painful it was to be a TV-watching voter in South Carolina in the lead-up to the primary there.

And in what was the most interactive use of the project’s metadata, Andrew McGill, a senior associate editor for The Atlantic, created an old-style video game, where the viewer uses the space key on a computer keyboard to try to dodge all the ads that aired on Iowa airwaves ahead of the caucuses there. For links to other journalists’ uses of the Political TV Ad Archive, click here.

3. Candidates’ campaigns dominated; super PACs favored candidates who failed. In our collection, candidates’ official campaigns sponsored the most ad airings—63 percent. Super PACs accounted for another 27 percent, and nonprofit groups, often called “dark money” groups because they do not disclose their donors, accounted for nine percent of ad airings.

Bernie Sanders‘ and Hillary Clinton‘s campaigns had the most ad airings—29,347 and 26,891 respectively. Of the GOP candidates, who faced a more divided competition, it was Marco Rubio’s campaign that had the most airings—11,798—and Donald Trump was second, with 9,590. However, in the Republican field, super PACs played a much bigger role, particularly those advocating for candidates who have since pulled out of the race. Conservative Solutions PAC, the super PAC that supported Marco Rubio in his candidacy, showed 12,851 airings; Right to Rise, which supported Jeb Bush, had 12,543.

This pair of issue ads, sponsored by AARP (the American Association of Retired Persons), aired at least 9,653 times; the ads focus on Social Security and have been broadcast across the markets monitored by the Political TV Ad Archive.

The biggest non-news shows that featured political ads were “Jeopardy!,” “Live With Kelly and Michael,” and “Wheel of Fortune.” Fusion did an analysis that identified the most popular entertainment shows targeted by presidential candidates and mashed that up with Nielsen data about viewership. For example, Bernie Sanders’ campaign favored “Jimmy Kimmel Live,” while Hillary Clinton’s campaign liked “The Ellen DeGeneres Show.”

 


The Political TV Ad Archive–a project of the Internet Archive’s TV News Archive–is now conducting a thorough review of this phase of the project, which was funded by a grant from the Knight News Challenge, an initiative of the John S. and James L. Knight Foundation. The Challenge is a joint effort of the Rita Allen Foundation, the Democracy Fund, and the Hewlett Foundation.

Stay tuned for news of the Political TV Ad Archive’s plans for covering future primaries in California, New York, and Pennsylvania, and for the second phase of this project: fundraising to track ads in key battleground states in the general election.

This post is cross-posted at the Political TV Ad Archive.

Posted in Announcements, News | 1 Comment

Save our Safe Harbor: Submission to Copyright Office on the DMCA Safe Harbor for User Contributions

The United States Copyright Office is seeking feedback on how the “notice and takedown” system created by the Digital Millennium Copyright Act, also known as the “DMCA Safe Harbors,” is working. Congress decided that in this country, users of the Internet should be allowed to share their ideas with the world via Internet platforms. In order to facilitate this broad goal, Congress established a system that protects platforms from liability for the copyright infringement of their users, as long as the platforms remove material when a copyright holder complains. The DMCA also allows users to challenge improper takedowns.

We filed comments this week, explaining that the DMCA is generally working as Congress intended it to. These provisions allow platforms like the Internet Archive to provide services such as hosting and making available user-generated content without the risk of getting embroiled in lawsuit after lawsuit. We also offered some thoughts on ways the DMCA could work better for nonprofits and libraries, for example, by deterring copyright holders from using the notice and takedown process to silence legitimate commentary or criticism.

The DMCA Safe Harbors, while imperfect, have been essential to the growth of the Internet as an engine for innovation and free expression. We are happy to provide our perspective on this important issue to the Copyright Office.

Posted in Announcements, News | 2 Comments

Guess what we find in books? A look inside our Midwest Regional Digitization Center – by Jeff Sharpe

The history of a book isn’t captured merely by the background of the author, its publishing date, or its written content. Most books were purchased and read by someone; they are from a specific time and place. That too is part of each book’s history. Sometimes in digitizing books we find pressed flowers, a single leaf, or pieces of paper that were used as bookmarks and then forgotten. We even found a desiccated chameleon in one book. When we find something like that at the Internet Archive’s Digitization Centers, we digitize the object because it is part of the history of that book. We see our mission as archiving each book exactly as it was found, so that when you flip through a book, you are seeing it as if you had the physical copy in your hands, not just black text on a white page.

Take, for example, this book from the Lincoln Financial Foundation Collection: The Life and Speeches of Henry Clay. In the chapter on Clay’s speeches, you can see what Abraham Lincoln highlighted, points he thought worthy of noting.

[Image: Lincoln’s notations in the book]

In fact, by seeing what Lincoln underscored as he read this book and by reading his notes, you get a glimpse into what may have shaped his ideas, and how he might then have used certain concepts to express his thoughts and policies about slavery and its abolition. The history of this book, which was held and read and annotated by Abraham Lincoln, had a direct effect on the history of this nation. It is a historic book that also has a history of its own.

We’ve digitized over 125,000 items here at the Midwest Regional Digitization Center at the Allen County Public Library in Fort Wayne, Indiana. In several books we digitized for the University of Pittsburgh’s Darlington Collection, we found some treasures. In one, we found a note written in 1803 by William Henry Harrison, then governor of the Indiana Territory. (Scroll down the pages to see the letters in situ.)

In another we found a promissory note by Aaron Burr from 1796 for a large sum of money. Burr was a controversial person to say the least. He was not only a Revolutionary War hero, Thomas Jefferson’s Vice President and a presidential candidate himself, but also the man who shot and killed Alexander Hamilton in a duel.

Once, someone at the University of Pittsburgh contacted me regarding an item a digital reader had made them aware of: a previously unknown, original survey report written by none other than Daniel Boone! He asked me if I knew anything about it. I verified that we had found and digitized it–along with the note by Aaron Burr and the letter by William Henry Harrison. I got a shocked reply: “Where??” Apparently digitizing not only opened up access to these books, it also rediscovered long-lost manuscripts stuck between the pages, penned by important figures in American history.

The history of these books turned out to contain the history of this country, highlighted in a very personal way. Whether it is a violet pressed between the pages, Abe Lincoln researching abolition, or a forgotten survey report by Daniel Boone, sometimes the material we digitize can bring our past alive. What will you discover lodged between the pages of our three million digital books?

Take a tour of the Midwest Regional Digitization Center with Jeff Sharpe in this recent video.


Jeff Sharpe is Senior Digitization Manager for the Midwest Region.

Jeff’s work experience in administration and research led him to the Internet Archive’s digitization center in the Allen County Public Library in Fort Wayne, Indiana. He’s proud of his role in helping to bring well over a hundred thousand books online for universal access, including more than fifteen thousand items digitized by volunteers at the Midwest Center. Jeff is a voracious reader and loves books. He has a passion for history and archaeology, particularly the Mayan civilization, which has led him to travel extensively to Mayan ruins. He enjoys, among other things, bicycle riding, gardening, and hanging out with his wife, two kids, and their two dogs.

Posted in Announcements, News | 11 Comments