Internet Archive files amicus brief in support of fair use and innovation in libraries

Today marks the beginning of Fair Use Week, which celebrates the importance of fair use for libraries, students, teachers, journalists, creators, and the public. Last week, the Internet Archive joined the American Library Association, the Association of Research Libraries, and the Association of College and Research Libraries on a friend-of-the-court brief in the Capitol Records v. ReDigi case. This case raises the important question of whether it is legal to resell lawful copies of digital music files; that is, whether the first sale right exists in digital form, and how that right interacts with fair use. The first sale right, codified at Section 109(a) of the Copyright Act, is the same law that allows libraries to lend books and other copyrighted works to the public. As library collections become increasingly digital, libraries are relying on fair use and first sale rights in order to perform their everyday duties, including preservation and lending.

The brief argues first that the court’s fair use analysis should favor secondary uses that have the same underlying purpose as the first sale right.
“In Authors Guild v. HathiTrust… [the Second Circuit Court] used the rationale for a specific exception—17 U.S.C. § 121, which permits the making of accessible format copies for the print disabled—to support a finding of a valid purpose under the first factor. Likewise, the Copyright Office has repeatedly based fair use conclusions on specific exceptions in the context of a rulemaking under section 1201 of the Digital Millennium Copyright Act, 17 U.S.C. § 1201. As this Court did in HathiTrust or the Copyright Office did in the section 1201 rulemaking, the district court should have recognized that the purpose behind the first sale doctrine tilted the first fair use factor in favor of ReDigi.”

Second, the brief argues that a positive fair use determination in the ReDigi case would enable libraries to provide new and innovative digital services to their users. The brief states:
“Fair use findings in technology cases have encouraged libraries to provide new, digitally-based services such as the HathiTrust Digital Library. In addition to enabling researchers to find relevant texts and perform critical data-mining, HathiTrust provides full-text access to over fourteen million volumes to people who have print disabilities. A fair use finding in this case would provide libraries with additional legal certainty to roll out innovative services such as the Internet Archive’s Open Library. Such a result would increase users’ access to important content without diminishing authors’ incentive to create new works.”

You can read the full text of the brief here.

Posted in Announcements, News | Leave a comment

Internet Archive Reaches Semifinals in MacArthur Foundation’s Competition for $100 Million Grant

by Wendy Hanamura

The Internet Archive headquarters: a temple to universal access to knowledge.

At the Internet Archive, we believe that libraries can be instruments of change.

So we are proud to announce that the Internet Archive is one of eight groups named semifinalists today in 100&Change, a global competition for a single $100 million grant from the John D. and Catherine T. MacArthur Foundation. The competition seeks bold solutions to critical problems of our time. Here’s how we propose creating transformative, lasting change:

Our vision empowers libraries to unlock their rich analog collections for a new generation of learners, enabling free, long-term, public access to knowledge.

In today’s digital world, a new generation explores knowledge largely through their computers and phones. So as digital librarians, we worry when millions of books, representing a century of knowledge, are still not accessible online to scholars, journalists, students, and the public. Libraries have been stymied by huge costs, restrictions on eBooks, and missing technology. The legal path forward has not been clear. All of this means libraries haven’t been able to meet the digital demands of a new generation. And access to libraries is still not universal or equitable.

Our plan provides libraries and learners with free digital access to four million books. With our partners, we will curate, digitize, and enable digital lending of these digital volumes to any library in the country that owns the physical book. We plan to start with the books most widely held and used in libraries and classrooms. The scale of the project will help reduce digitization costs by 50 percent or more. How do we know this can work? We’ve been prototyping this model for six years at Open Library, digitizing 540,000 modern books originating from 100 partners.  Through Open Library, we lend books to the public in a manner that respects the rights of authors and publishers, in a process that mirrors the traditional way libraries circulate physical books.

What makes this a game-changer? Today, the Internet Archive already offers public access to 2.5 million books in the public domain, and 540,000 modern works. We need to be bigger and bolder. At the Internet Archive, we only lend one copy at a time, so in order to serve more learners, we seek thousands of libraries to join us. That can happen if we build the technical infrastructure that allows libraries everywhere to leverage those digital books. Plus, this is an issue of dollars and cents. Libraries should never pay to digitize a book more than once. Right now libraries pay an average of $17.50 for each interlibrary loan of a physical book. As books become electronic, those funds can be directed to more urgent needs. And above all, this grant will help all libraries become digital libraries, releasing the tremendous value in the collections they have curated over centuries.

With so many brilliant, effective thinkers applying to 100&Change, it always felt as if our chances were one in a hundred, and indeed they were! There was robust participation: 7,069 competition registrants submitted 1,904 proposals. Of those, 801 passed an initial administrative review and were evaluated by a panel of expert judges who each provided ratings on four criteria: meaningfulness, verifiability, durability, and feasibility. MacArthur’s Board of Directors made the final selection. To be one of eight semifinalists from about 800 qualified applicants is a tremendous honor.
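
For the curious, the odds quoted above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the figures in this post; the helper name `share` is ours and purely illustrative.

```python
# Figures quoted in this post for the 100&Change competition.
PROPOSALS = 1904        # proposals submitted
PASSED_REVIEW = 801     # proposals that passed the administrative review
SEMIFINALISTS = 8       # groups named semifinalists


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


if __name__ == "__main__":
    # Share of reviewed proposals that became semifinalists.
    print(f"Of reviewed proposals: {share(SEMIFINALISTS, PASSED_REVIEW):.2f}%")
    # Share of all submitted proposals that became semifinalists.
    print(f"Of all proposals:      {share(SEMIFINALISTS, PROPOSALS):.2f}%")
```

Eight out of 801 works out to almost exactly one in a hundred.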

Eileen Alfaro, San Francisco fifth-grader. One day she could be carrying 4 million eBooks under her arm.

And as we work hard to hone our plans in the months ahead, here’s what propels us forward: Eileen Alfaro, the Internet Archive’s brightest rising star. Every day after school, this San Francisco fifth-grader does her homework at the Internet Archive, while her mother Roxana works. A straight-A student, Eileen loves nothing more than reading. We can put four million of the best books into her hands. Forever. For free.

Our proposal? Making libraries instruments of change for a new generation of learners like Eileen.

A summary of the Internet Archive’s solution, an overview video of its project, and a MacArthur video describing our proposal are available at www.macfound.org/InternetArchive.

Posted in Announcements, News | 4 Comments

Internet Archive Offers to Host PACER Data

The Internet Archive has long supported the efforts of the Free Law Movement to make the laws and edicts of government of the United States more broadly available. With our colleague Aaron Swartz and the efforts of numerous groups across the country including the Free Law Foundation and Princeton’s Center for Information Technology Policy, we host the RECAP repository of documents from the federal district courts. Many of these public domain documents were downloaded by users of the government’s PACER system for $0.10 per page and uploaded to the Internet Archive. The RECAP repository is available for free, and in bulk, which is useful for researchers.

On Tuesday, February 14, the U.S. Congress will hold the first hearings in over a decade examining the operation of the PACER system. The hearing will be before the Subcommittee on Courts, Intellectual Property and the Internet of the Judiciary Committee in the House of Representatives. The Internet Archive was pleased to accept the committee’s invitation to submit a statement for the record and we have submitted the following, which includes an offer to host the PACER data now and forever to make the works of our federal courts more readily available to inform the citizenry and to further the effective and fair administration of justice.

Our courts must function in the light of day, and in this day and age that means on the Internet. The Internet Archive is happy to try to help.

February 10, 2017

The Honorable Darrell Issa, Chairman
The Honorable Jerry Nadler, Ranking Member
Subcommittee on Courts, Intellectual Property and the Internet
Committee on the Judiciary
House of Representatives
Washington, DC 20515

Dear Chairman Issa and Ranking Member Nadler,

Thank you for the opportunity to submit comments on the Judiciary Committee’s hearing entitled “Judicial Transparency and Ethics.” I write on behalf of the Internet Archive, a non-profit digital library that is based in San Francisco with facilities throughout the world.

For more than 20 years, the Internet Archive has been archiving digital collections and making them available at no cost and with no restriction on the Internet. The Internet Archive works with the Library of Congress, the National Archives, and numerous national libraries around the world to collect, store, and provide permanent access to millions of books, videos, audio and hundreds of millions of pages of U.S. government documents, including over 14,000 hours of video of Congressional hearings.

By this submission, the Internet Archive would like to clearly state to the Judiciary Committee, as well as to the Administrative Office of the U.S. Courts and the Judicial Conference of the United States, that we would be delighted to archive and host—for free, forever, and without restriction on access to the public—all records contained in PACER.

People download more than 20 million books from the Internet Archive each month. We preserve 1 billion web pages each week for public access through the “Wayback Machine.” Indeed, the Wayback Machine is the only publicly accessible archive of all the websites of Congress. At any given moment, we are delivering about 30 gigabits of data per second. We host more than 20 petabytes of data in total.

By comparison, the PACER corpus is a fraction of a petabyte and does not use a significant amount of bandwidth. We have the capacity to host this information, and I know there are many other organizations on the Internet who would be able to make dramatic increases in the usability and utility of our Federal Judiciary’s database if it were made available in a more modern fashion and without artificial restrictions on use.

The stated purpose of PACER is to make public court records “freely available to the greatest extent possible.” Sixteen years ago, the United States Courts predicted that PACER would allow the public to “surf to the courthouse door on the Internet.” Today, anyone visiting a federal courthouse can view the public record for free. PACER, on the other hand, charges users per-page fees that are prohibitive for many members of the public. The Judiciary could resolve this unfortunate discrepancy—immediately—at no cost. This is our offer.

The Internet Archive has deep experience with collections of this kind. In fact, we already host the records from over a million federal court cases that have been donated by the public as part of the RECAP Project. However, a million cases is a small portion of the hundreds of millions of cases that PACER contains, and we are frustrated that it is so difficult to obtain and serve the workings of our federal courts to the public. This is a fairly trivial technical task, and we would welcome the opportunity to make much more data available.

I must also note that the Internet Archive is not alone in being well-equipped to offer this service. There are other large digital repositories that similarly serve the public for free. I cannot speak for them, but I believe that once the corpus is available for no fee and without restriction, they too will replicate it and offer similar service. Indeed, others may build useful tools for reading, searching, and studying the corpus of public court records that makes up our federal case law.

In order to recognize the vision of universal free access to public court records, the Federal Judiciary would essentially have to do nothing. We are experts at “crawling” online databases in an efficient and careful fashion that does not burden those systems. We are already able to comprehensively crawl PACER from a technical perspective, but the resulting fees would be astronomical. The Federal Judiciary has a Memorandum of Understanding with both the Executive Office for US Trustees and with the Government Printing Office that gives each entity no-fee access for the public benefit. The collection we would provide to the public would be far more comprehensive than the GPO’s current court opinion program—although I must laud that program for providing a digitally-authenticated collection of many opinions.

By making federal judicial dockets available in this manner, the Federal Judiciary would enable free and unlimited public access to all records that exist in PACER, finally living up to the name of the program. In today’s world, public access means access on the Internet. Public access also means that people can work with big data without having to pass a cash register for each document.

This PACER collection we would maintain and improve would have far more detailed metadata and contextual information than the GPO service or the PACER Case Locator service. And, that’s just for starters, because we know that there are thousands of eager researchers, journalists, and government workers (including Congressional staff) who would immediately jump in and work with us.

By providing no-cost access to the Internet Archive to PACER and accepting our commitment to make this information available for use without restriction in perpetuity, we believe we can work with our government to make the workings of our court more usable to government attorneys, to members of the bar, and to the public at large.

Sincerely yours,

Brewster Kahle
Digital Librarian and Founder, Internet Archive

Notes:

  1. S. Rep. 107–174, 107th Cong., 2d Sess., at 23 (2002), https://www.govinfo.gov/content/pkg/CRPT-107srpt174/pdf/CRPT-107srpt174.pdf.
  2. Electronic Public Access at 10, THE THIRD BRANCH: NEWSLETTER OF THE FEDERAL COURTS, Sep. 2000, at 3, https://archive.org/details/thirdbranch32332200001fede/.

Posted in News | 1 Comment

Apple Pie Potluck and Constitutional Law Teach-In — Friday Feb 17th 5:30-9PM


Initial information — more details to come:
In honor of the General Strike:

Constitutional Law Teach-in at the Internet Archive with EFF and Others

EFF and other lawyers will lead a conversation about the current issues and threats in constitutional law. Focusing on specific sections and amendments we will talk about current cases on censorship, surveillance, search and seizure, and more.

Workshops on using encryption tools, and maybe musical performances, will accompany the teach-in.
If you want to present, perform, or have other ideas, please email us.

When: Friday, February 17th 5:30pm-9pm (program 6-8)
Where: Internet Archive
300 Funston Ave. SF, CA 94118
Potluck-style: Please bring apple pie or other food
Reserve your free ticket here
Streamed via Facebook Live
Donations welcome

Lawyers Attending:

  • Cindy Cohn – Executive Director of EFF
  • Corynne McSherry – Legal Director of EFF
  • Victoria Baranetsky – First Look Media Technology Legal Fellow for the Reporters Committee for Freedom of the Press
  • Geoff King – Lecturer at UC Berkeley, and Non-Residential Fellow at Stanford Center for Internet and Society
  • Bill Fernholz – Lecturer In Residence at Berkeley Law

For those who cannot attend in person, we will stream the event on Facebook Live, so make sure you’re following us on Facebook.

Posted in Announcements, News | Leave a comment

This week’s TV news highlights with fact checks

by Katie Dahl

As part of a new regular feature, the Internet Archive presents highlights from our national fact checking partners of TV news segments aired over the past week. These include President Donald Trump’s assertion that the number of police officers killed on the beat has increased; his latest attack on the press; his claim that sanctuary cities breed crime; the proposition that Nordstrom’s decision to drop Ivanka Trump’s apparel line was political;  several Trump statements from his Super Bowl interview with O’Reilly, and background on the silencing of Sen. Elizabeth Warren, D., Mass., on the floor of the Senate. 

Claim: Number of officers shot and killed in line of duty increased (true)

Trump earned a rare “Geppetto checkmark” for truthfulness from The Washington Post’s Fact Checker when he told a gathering of law enforcement that “The number of officers shot and killed in the line of duty last year increased by 56 percent from the year before.” Reporter Michelle Ye Hee Lee wrote, “Trump’s grim statistic seemed too remarkable to be correct:…But the figure is solid. Last year was a notable year in police deaths, largely because of the number of police officers who were fatally shot in ambush attacks across the country.”

Claim: press doesn’t want to report on terrorism (wrong)

From our Trump Archive: in describing “radical Islamic terrorist” attacks around the world, President Trump claimed the “very very dishonest press doesn’t want to report” them. The fact-checkers at PolitiFact found no evidence for this assertion, rating the claim as “Pants on Fire”: “The media may sometimes be cautious about assigning religious motivation to a terrorist attack when the facts are unclear or still being investigated. But that’s not the same as covering them up through lack of coverage.” Reporters at FactCheck.org called Trump’s claim “nonsense.”

Claim: Sanctuary cities breed crime (no evidence)

Also from the Trump Archive: in an interview on FOX News, host Bill O’Reilly asked for Trump’s reaction to news that officials in California are discussing whether to become a sanctuary state. Trump responded that he is opposed to sanctuary cities, saying they “breed crime.” PolitiFact reporter Allison Graves wrote that there isn’t much research on the impact of sanctuary cities on crime, but that at least one recent study shows no effect on crime rates. Writing for The Washington Post’s Fact Checker, Michelle Ye Hee Lee gave the claim “three Pinocchios”: “Trump goes too far declaring that the cities ‘breed crime.’ He not only makes a correlation, but also ascribes a causation, without facts to support either.”

Claim: Putin’s a killer (experts say yes)

In the Super Bowl interview, O’Reilly pressed President Trump about his respect for Putin, saying “Putin’s a killer.” Trump’s response was “We got a lot of killers. You think our country is so innocent?” PolitiFact’s Graves reported on O’Reilly’s assertion that Putin is a killer, writing that “the political climate in Russia is responsible for a sizable amount of journalists murders in the country…. Many of the perpetrators are thought to be government and military officials and political groups.”

Claim: Three million undocumented immigrants voted illegally in November elections (no evidence)

Trump continued his unsubstantiated claim that three million undocumented immigrants voted illegally in the November election. When pushed on the need for evidence, Trump was undeterred, saying “[m]any people have come out and said I’m right. You know that.” PolitiFact repeated its finding that there is no evidence for this kind of voter fraud: “Trump’s claim is undermined by years of publically available information such as a report that found just 56 cases of noncitizens voting between 2000 and 2011.”

Claim: Nordstrom’s decision to drop Ivanka Trump’s apparel line was political (no evidence)

After Nordstrom dropped his daughter Ivanka Trump’s apparel line, President Trump attacked the decision as political. His press secretary, Sean Spicer, followed at a news conference saying, “[T]his is a direct attack on his policies and her name.” Reporting for The Washington Post Fact Checker, Lee cited an internal company email from November 2016, which states the company would continue to sell the brand as long as it was profitable. Then on February 2, Nordstrom announced it was dropping the line, because of “poor sales.” Lee gave the claim “four Pinocchios.”

Explainer: what is “Senate rule XIX” (rarely invoked)

During a Senate floor debate about the nomination of then-Sen. Jeff Sessions, R., Ala., to be attorney general, Senate Majority Leader Mitch McConnell, R., Ky., silenced Sen. Elizabeth Warren, D., Mass., as she read from a letter by Coretta Scott King. In doing so, he cited an obscure rule, known as Senate rule XIX, which reads: “[N]o Senator in debate shall, directly or indirectly, by any form of words impute to another Senator or to other Senators any conduct or motive unworthy or unbecoming a Senator.” PolitiFact reporter Louis Jacobson provided a useful primer on the rule, including statistics on how often it’s been invoked in Senate history: most likely, only twice, once in 1915 and another time in 1952.

Katie Dahl is a research associate with the TV News Archive.

Posted in Announcements, News | 1 Comment

Upgraded Secure Communications Applications I am Now Using

I am upgrading the security of my communications while keeping them easy to use. I thought I would share what I currently use in case it is helpful to copy, and I would appreciate comments.

I want end-to-end encryption so nobody can intercept what I am saying (unless they have infected my phone or computer, but that is another issue), and bonus points for making it so that it is unknown who I am communicating with and when (private metadata and traffic). Skype, phonecalls, sms/texts, slack and email are now known to not be private (at least by default) thanks to Edward Snowden. This is too bad since I still use these. (Slack is not end-to-end encrypted even for direct messages, which it could and should be.) So far I have only partially achieved the first step: end-to-end encryption. I am migrating to:

  • txt and sms replacement, somewhat phonecalls: Signal for point-to-point instant messaging replacing sms and skype. Free software, free of cost, and open source, works on smart phones. I have donated.
  • skype texting replacement: Signal for laptops, with a chrome-based desktop Signal app on my Mac (which is what I mostly use). It uses phone numbers as identifiers, which is kind of a pain. An EFF friend called this “best of breed” for security. Small development staff. There is a tip for updating it to show names rather than phone numbers: go to the … menu, go to settings, at the bottom is update contacts.
  • skype video/slack audiovideo replacement: appear.in for 1-on-1 and small group video chat that is end-to-end encrypted, replacing Skype for me. This does not require a download or an account. Go to the homepage, type a bunch of characters to make a meeting room, then send the resulting url to someone and they can use that throw-away meeting room. Super easy. Uses webrtc (now standard in browsers) with https, and they say it is end-to-end encrypted. They have an iphone app as well, but I don’t know about its security. This does not seem designed for super high security, but seems to be pretty good.
  • webex replacement: zoom.us for larger group video chats replacing Webex for me. Free of cost for most of my uses, easy to use (requires download, but is super easy). It says it is end-to-end encrypted, with a little lock icon when in use and encrypted.
  • Facetime occasionally on my iphone replacing cellphone calls to friends with an iphone. Apple says that it is end-to-end encrypted.
  • Thunderbird + Enigmail to sign all email, receive encrypted email, and sometimes send encrypted email, with an organizational email server (archive.org not gmail). Enigmail is moderately hard to set up; I had help in a meetup. Cost free, and I believe free and open source software. I am donating.
  • encrypted notes file (the mac Notes app) on my mac for high priority secure notes. It syncs the encrypted file with my iphone via icloud.
  • Breadwallet, bitcoin wallet on my iphone, for small amounts of bitcoin for casual purchases. Super easy and a full wallet (does not hang off a server). Love this wallet. Cost free. I invested a tiny amount of money in the company– great guys.
  • Torbrowser for private web browsing beyond Firefox’s Private browsing feature. Free and open source software, cost free. I have donated.
  • On Macintosh os/x it’s easy to turn on full disk encryption (FileVault). Go to the “Security and Privacy” setting and turn on FileVault. If you do, be sure *not* to accept its offer to store the key in iCloud. Write down the “recovery key”, and hide it somewhere away from the computer. The security of this approach is based on the security of your normal login password, so if it’s lame, change it to something that can’t be guessed or brute forced easily.  (from a commenter, Eric Blossom)
  • Web search: DuckDuckGo or StartPage.com. (from a commenter, Reinout)

Any comments or ideas are welcome. I realize I have traded off security for ease of use. I hope stronger tools get easier and I suggest we all invest in tools based on donations and development help. I wish I knew my mac and iphone were not compromised. Not sure how to do that.

I have tried ricochet as an instant messaging client that secures who I am talking to via Tor, easy to use, but few I know use it, so I don’t use it often. I have tried encrypting my email using pgp via enigmail but have run into trouble with others being able to read it, so I do not encrypt email by default. As an aside, encryption is related in a funny way to content-addressable systems, which is a different subject, but this is magic and the future.

(earlier version of this post is on http://brewster.kahle.org )

Posted in Announcements, News | 4 Comments

Micropayments to Archive.org by using the Brave Browser (and bitcoin)

I hope Ted Nelson is proud. The Internet Archive just signed up for getting micropayments from participating Brave Browser users.  Brave Browser is an alt-browser for controlling ads, mostly, but they added a micropayments feature (beta).

You need to put in some bitcoin that will then be distributed to the sites you visit in a month. Cool! (they help you get bitcoin)

We don’t expect it will raise the money we need to make a copy of archive.org in Canada, but we are glad to participate in this program.  Thank you, Brave, and our intrepid users.

Posted in Announcements, News | 2 Comments

If You See Something, Save Something – 6 Ways to Save Pages In the Wayback Machine

In recent days many people have shown interest in making sure the Wayback Machine has copies of the web pages they care about most. These saved pages can be cited, shared, linked to – and they will continue to exist even after the original page changes or is removed from the web.

There are several ways to save pages and whole sites so that they appear in the Wayback Machine.  Here are 6 of them.

1. Save Page Now

Put a URL into the form, press the button, and we save the page.  You will instantly have a permanent URL for your page.

save page now

At the moment, there are a few exceptions for this method – some sites prohibit crawling, a few have SSL (security) settings that make it break – but this method will work for most pages.  The feature saves the page you enter including the images and CSS.  It does not save any of the outlinks, and can’t be used to initiate a crawl of an entire web site. We do not keep your IP address, so your submission is anonymous.
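
If you would rather script this than use the form, here is a minimal Python sketch. It assumes that a capture can be requested by fetching `https://web.archive.org/save/` followed by the page address; the helper name `save_page_now_url` is made up for illustration, and the sketch only builds the address without sending anything.

```python
from urllib.parse import quote

# Assumed address behind the Save Page Now form.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_page_now_url(page_url: str) -> str:
    """Build the address that asks the Wayback Machine to capture page_url."""
    # Add a scheme if the caller left it off.
    if not page_url.startswith(("http://", "https://")):
        page_url = "http://" + page_url
    # Keep normal URL punctuation; escape characters such as spaces.
    return SAVE_ENDPOINT + quote(page_url, safe=":/?&=%#")


if __name__ == "__main__":
    # Opening the printed address in a browser would request a capture.
    print(save_page_now_url("example.com/some page"))
```

As with the form, this covers a single page at a time, and the same exceptions apply.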

2. Chrome extension

Install the Wayback Machine Chrome extension in your browser.  Go to a page you want to archive, click the icon in your toolbar, and select Save Page Now. We will save the page and give you a permanent URL.

Chrome extension allows save page now

The same provisos from “Save Page Now” apply – there are some pages where it won’t work, and it only saves one page at a time.  One plus to installing the extension though is that now as you surf around, when you run into a missing page we will alert you if we have a saved copy.
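
To give a feel for how such a lookup can work, here is a rough Python sketch built around the Wayback Machine availability endpoint (`https://archive.org/wayback/available`). Whether the extension uses exactly this endpoint is an assumption on our part, the helper names are hypothetical, and the sample response is made-up data in the shape the endpoint returns, not a real capture.

```python
import json
from typing import Optional
from urllib.parse import urlencode

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_query(page_url: str) -> str:
    """Build the lookup address for a page that may be missing."""
    return AVAILABILITY_API + "?" + urlencode({"url": page_url})


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the address of the closest saved copy, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Made-up sample response, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20170101000000/http://example.com/",
            "timestamp": "20170101000000",
            "status": "200",
        }
    }
})

if __name__ == "__main__":
    print(availability_query("example.com/missing-page"))
    print(closest_snapshot(SAMPLE_RESPONSE))
    print(closest_snapshot('{"archived_snapshots": {}}'))
```

The sketch parses a canned response so it runs offline; fetching the query address would be the only networked step.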

We also have a Firefox add-on; it will have Save Page Now functionality soon.  We are working on a Safari extension as well.

3. Wikipedia JavaScript Bookmarklet

Nobody loves a primary source more than a Wikipedia editor.  To that end, they offer a Wayback Machine JavaScript Bookmarklet that allows you to quickly save a web page from any browser.

wikipedia wayback bookmarklet

4. Volunteer for Archive Team

Archive Team is an entirely volunteer driven group who are interested in saving Internet history.  Many of the sites and pages they save end up in the Wayback Machine.  Visit the Archive Team site to learn more about how to volunteer with them.

Archive Team

5. Sign up for an Archive-It Account

Archive-It is a subscription service provided by Internet Archive that allows you to run your own crawling projects without any technical expertise.  Tell us what to crawl and how often to crawl it, and we execute the crawl and put the results in the Wayback Machine.

Archive-It

Archive-It is a paid subscription service with technical and web archivist support. This option is most appropriate for organizations that have a mandate to save certain types or categories of web content on a regular basis. If your institution is a current Archive-It partner, contact them for how you can contribute.

6. End of Term Archive

Every time the US government administration changes, Internet Archive works with partners to make a copy of government-related sites and web presences.  We call it the End of Term Archive.  You can help us discover new government sites by using the Nomination Tool to suggest pages or sites.  These nominations are added to the crawl and end up in the Wayback Machine.

End of term archive nomination tool

The Internet Archive has been saving web pages for 20 years. This archive has been built by thousands of people, and we would like you to help. Use one of the methods above to make sure we have the pages you care about.

Posted in Announcements, News, Wayback Machine, Web Archive | 13 Comments

In the news: Trump Archive, end-of-term preservation, & link rot

News outlets have been getting the word out on Internet Archive efforts to preserve President-elect Donald Trump’s statements; the outgoing Obama Administration’s web page and government data; as well as preventing that nasty experience of encountering a “404” when you click on a link online, aka “link rot.”

Trump Archive 

A number of journalists have been exploring the riches contained within the newly launched Trump Archive, a collection of TV news clips of the president-elect speaking, peppered with links to more than 500 fact checks by national fact-checking groups.

Anna Wiener, writing for The New Yorker, immerses herself in Trump statements and discovers 56 mentions of the escalator in Trump Tower, and that Trump:

“is a fan of the word ‘sleaze,’ and of the phrase ‘tough cookie,’ which he has used to describe policemen, his opponents’ political donors, Paul LePage, ‘real-estate guys in New York and elsewhere,’ an unnamed friend who is a ‘great financial guy,’ ISIS, three professional football players, Reince Priebus, Lyndon Johnson, and Trump’s father, Fred.” After watching long stretches of video, she writes, “It occurred to me that spending time online in the Trump Archive could be a form of immersion therapy: a means of overcoming shock through prolonged exposure.”

Geoffrey Fowler, tech columnist for The Wall Street Journal, bemoans the lack of easy-to-use tech tools to help people be responsible citizens overall, but also notes the promise–and challenge–of a curated collection like the Trump Archive:

“The Trump Archive shows what’s hard about using tech to hold officials accountable. It’s assembled and hand-curated by humans. Yet even using the transcripts, it can be hard to tell the difference between a spoken name and a person who’s actually speaking. Archive officials say making their database applicable to hundreds or thousands more politicians would require help from tech firms with capabilities in machine learning and voice and facial recognition.”

Fowler also published this video, featuring plenty of Trump; an interview with Roger Macdonald, director of the TV News Archive; and ample footage of the Internet Archive’s San Francisco headquarters.

The Trump Archive was also featured in Marketplace Tech®, The Hill, Forbes, Newsweek, BuzzFeed News, TechPlz, VentureBeat, Engadget, and more.

Preserving Obama Administration websites, social media

The Internet Archive’s efforts to help preserve government websites via the Wayback Machine during and after the transition have continued to garner attention. Wired reports on a group of climate scientists working against the clock to archive government websites related to global warming:

One half was setting web crawlers upon NOAA web pages that could be easily copied and sent to the Internet Archive. The other was working their way through the harder-to-crack data sets—the ones that fuel pages like the EPA’s incredibly detailed interactive map of greenhouse gas emissions, zoomable down to each high-emitting factory and power plant.

The New Scientist also writes on efforts to archive climate data:

Fears that data could be misused or altered have prompted crowd-sourcing to back up federal climate and environmental data, including Climate Mirror, a distributed volunteer effort supported by the Internet Archive and the Universities of Pennsylvania and Toronto.

The Los Angeles Times and Quartz offer reports on archiving climate data.

Internet Archive works against link rot

Tech publications were quick to inform their readers about the Internet Archive’s new Chrome extension that fights link rot by directing users to archived web pages. Here is Mashable:

Now Internet Archive has built a Wayback Machine Chrome extension. It works like this: If you click on a link that would normally lead to an error page (think 404), the extension will instead give users the option to load an archived version of the page. The link is no longer simply gone.

Also writing on the fight against link rot: Network World, VentureBeat, The Tech Portal, Bleeping Computer, and ZDNet.

 

 

 

Posted in Announcements, News | 8 Comments

Lost Landscapes of San Francisco: Fundraiser Benefitting Internet Archive — Monday, January 30th, 2017

By Rick Prelinger, Prelinger Archives

Internet Archive presents the 11th annual Lost Landscapes of San Francisco show on Monday, January 30 at 7:30 pm. The show will be preceded by a small reception at 6:30 pm, when doors will open.

Get tickets here!

While this is the seventh year we’ve been presenting this participatory archival film show at the Archive, the story goes back much further. I’ve been collecting historical footage of San Francisco and the Bay Area in earnest since 1993, when we acquired the collection assembled by noted local historian and film preservationist Bert Gould. Since that time I’ve worked to collect film material showing the history of this dynamic and complex region. Much of it is online for free viewing, downloading and reuse as part of the Prelinger Collection.

In 1996 Chris Carlsson and LisaRuth Elliott of Shaping San Francisco encouraged me to put together a little show of historical footage for a talk at CounterPULSE. Shaping SF, by the way, is a highly active local history organization, a longtime partner of the Archive and presently working with IA to digitize a large collection of San Francisco community newspapers. I made a program and planned a narration. The little CounterPULSE dance studio theater filled quickly on show night and we had to turn many away, but the people who were able to get in talked their way through the show, asking questions, identifying places and people and arguing over precise identifications with their neighbors. It was a wonderful event — nothing like the kind of film showing that takes place in church-like silence, but an active, participatory event where people freely shared their knowledge and experience of San Francisco’s history. A new show the year afterward was also jammed. Long Now Foundation stepped up and offered to make this event part of their Seminars on Long-Term Thinking talk series, and in year 3 we moved to the 400-seat Cowell Theater at Fort Mason. This was at once a wonderful experience and an occasion for great chagrin, because at least 250 people who showed up were unable to get in. And so we moved to the beautiful Herbst Theater and in 2011 to the 1410-seat Castro Theatre, where we’ve been every year since then. And for the last eight years we’ve also been putting on Lost Landscapes at Internet Archive. Many great things have happened at the Archive showings: people have recognized their relatives in the films, and many have seen their own streets and neighborhoods as they’ve never before seen them.

Combining favorites from past years with this year’s footage discoveries, the 11th annual feature-length program shows San Francisco’s neighborhoods, infrastructures, celebrations and people from 1906 through the 1970s. This year’s program features new scenes of San Franciscans working, playing, marching and partying during the Great Depression; unseen footage of Seals Stadium and the Cow Palace in the late 1930s; newly-discovered footage of the San Francisco Produce Market in operation; glimpses of neighborhoods now gone; Cathedral Hill on the cusp of redevelopment; 1960s antiwar activism; newly found footage of Tom Mooney’s victory parade after his release from Alcatraz in 1939; Bay ferries in operation; rare images of southeastern San Francisco and the Hunters Point drydock; the 1975 Gay Freedom Day parade; a 1940s-era ode to our fog; and many more newly discovered gems.

As always, the audience makes the soundtrack! This is a great room for the show, as the shape of the Great Room makes it easy for participants to hear one another’s comments. Come prepared to identify places, people and events, to ask questions and to engage in spirited real-time repartee with fellow audience members, and look for hints of San Francisco’s future in the shape of its lost past.

Monday, January 30th
6:30 pm Reception
7:30 pm Interactive Film Program

Internet Archive
300 Funston Ave.
San Francisco, CA 94118

Get tickets here!

Posted in Announcements, News | 3 Comments

See Trump Archive fact checks in one place

Robin Chin, Katie Dahl, Tracey Jaquith, Roger Macdonald, Nancy Watzman, and Dan Schultz are contributing research and engineering for the Trump Archive. 

Now it’s easier to find fact checks of specific statements by President-elect Donald Trump in our new Trump Archive, an experimental collection of TV news clips featuring Trump–including fact checks of his press conference on January 11, his first since July 2016.

We’ve got 500+ fact checks by FactCheck.org, the Pulitzer Prize-winning PolitiFact, and The Washington Post’s Fact Checker embedded within the Trump Archive; these are now viewable on this dedicated page, with the option of downloading a CSV containing links to fact checks, links to TV news clips, date of airing, and topics covered.

The Internet Archive’s Trump Archive launched on January 5 with 700+ televised speeches, interviews, debates, and other news broadcasts related to President-elect Donald Trump, and it continues to grow.

We created the Trump Archive in response to journalists and scholars who had trouble finding clips of Trump speaking through the caption search function in our TV News Archive library. We are hand-curating this collection as an experimental prototype for learning how to engineer solutions so similar archives can be created–whether by the Internet Archive or members of the public–about other elected officials and topics of interest. We are looking for collaborative partners to explore artificial intelligence approaches to creating such collections, with an ease and scale far beyond what can be accomplished now by hand.

The list of fact checks in the Trump Archive includes claims made by Trump during his press conference on January 11 covering issues from health care to ISIS to Trump’s connections to Russia. Here’s a sampling.

Health care

Trump said: “Obamacare is a complete and total disaster. It’s imploding as we said. Some states have over 100 percent increase.”

FactCheck.org: “Only Arizona has an average increase that high, and 84 percent with marketplace coverage in 2016 received tax credits to purchase insurance.”

PolitiFact: “While the average premium increase in Arizona rose by 145 percent in 2017, it is the only state with a triple-digit increase. Alabama saw the second highest increase, 71 percent. On the other end, a few states saw decreases. The average premium increase across all states was 25 percent.”

The Washington Post’s Fact Checker: “Trump exaggerates here, and appears to misunderstand a fundamental part of the Affordable Care Act. State-by-state weighted average increases range from just 1.3 percent in Rhode Island to as high as 71 percent in Oklahoma. But the most common plans in the marketplace in 2017 experienced an average increase of 22 percent. These plans have been used as the benchmark to calculate government subsidies.”

ISIS

Trump: “I mean if you look, this administration created ISIS by leaving at the wrong time. The void was created, ISIS was formed.”

FactCheck.org: “Trump continues to oversimplify the situation by placing the entirety of the blame for the creation of ISIS on Obama’s decision to withdraw troops from Iraq.”

PolitiFact: “This is a more tempered version of Trump’s previous Pants on Fire claim that Obama and Clinton “founded ISIS.” Experts told PolitiFact that you can reasonably criticize the Obama administration’s withdrawal from Iraq, lack of support to anti-Assad rebels in Syria, and intervention in Libya for contributing to the power of ISIS. But the timeline was set in motion by the Bush administration.”

The Washington Post’s Fact Checker: “Trump greatly simplifies a complex situation.”

Russia

Trump: “I have no deals that could happen in Russia, because we’ve stayed away. And I have no loans with Russia.”  

PolitiFact:  “It’s true that Trump has yet to build a hotel or tower in Russia, but he has eyed the Moscow skyline for decades.

We don’t know for sure about the extent of Trump’s business dealings in Russia, because he hasn’t released his tax returns. But his son, Donald Trump Jr., said in a 2008 real estate conference that “Russians make up a pretty disproportionate cross-section of a lot of our assets.”

We do know that Trump agreed to host the Miss Universe pageant in Moscow in 2013, a $20 million deal facilitated by a Russian real estate mogul and billionaire Aras Agalarov. (Trump also cameoed in Agalarov’s son’s dance-pop music video). He also made millions selling a 17-bedroom Florida mansion to a Russian billionaire.”

The Washington Post’s Fact Checker: “Trump is being misleading when he says he has stayed away from Russia. Trump repeatedly sought deals in Russia. In 1987, he went to Moscow to find a site for luxury hotel; no deal emerged. In 1996, he sought to build a condominium complex in Russia; that also did not succeed. In 2005, Trump signed a one-year deal with a New York development company to explore a Trump Tower in Moscow, but the effort fizzled.

In a 2008 speech, Donald Trump Jr. made it clear that the Trumps want to do business in Russia, but were finding it difficult. ‘Russians make up a pretty disproportionate cross-section of a lot of our assets,’ Trump’s son said at a real estate conference in 2008, according to an account posted on the website of eTurboNews, a trade publication. ‘We see a lot of money pouring in from Russia.’”

Posted in News | 7 Comments

Wayback Machine Chrome extension now available

The Wayback Machine Chrome browser extension helps make the web more reliable by detecting dead web pages and offering to replay archived versions of them.  You can get it here.

For the past 20 years, the Internet Archive has recorded and preserved web pages, and hundreds of billions of them are available via the Wayback Machine.  This matters because we are learning that the web is fragile and ephemeral.  For example, a 2013 Harvard study found that 49% of the URLs referenced in U.S. Supreme Court decisions are now dead.  Those decisions affect everyone in the U.S., and the evidence the opinions are based on is disappearing.

When previously valid URLs don’t respond, but instead return a result code of 404, we call that link rot.  The Wayback Machine Chrome extension is designed to help mitigate link rot and other common web breakdowns.

By using the “Wayback Machine” extension for Chrome, users are automatically offered the opportunity to view archived pages whenever any one of several error conditions, including code 404, or “page not found,” is encountered.  If one of those codes is detected, the Wayback Machine extension silently queries the Wayback Machine, in real time, to see if an archived version is available.  If one is available, a notice is displayed via Chrome, offering the user the option to see the archived page.
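The check described above can be sketched in a few lines of Python. This is only an illustration, not the extension's actual code: it assumes the lookup goes through the public Wayback Machine availability endpoint (https://archive.org/wayback/available), the set of error codes is a guess at what "several error conditions" might cover, and the function names and the `fetch` callable are hypothetical.

```python
import json
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint. Whether the extension uses
# this exact endpoint is an assumption made for this sketch.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"

# Hypothetical list of error conditions; the post only names 404 explicitly.
ERROR_STATUSES = {404, 408, 410, 451, 500, 502, 503, 504}


def availability_query(url):
    """Build the availability lookup URL for a page that failed to load."""
    return AVAILABILITY_ENDPOINT + "?" + urlencode({"url": url})


def closest_snapshot(response_text):
    """Return the archived URL from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def suggest_archived_copy(status_code, url, fetch):
    """Look for an archived copy only when the live page returned an error.

    `fetch` is any callable that takes a URL and returns the response body
    as text, so the logic can be exercised without network access.
    """
    if status_code not in ERROR_STATUSES:
        return None
    return closest_snapshot(fetch(availability_query(url)))
```

In a real client, `fetch` could wrap `urllib.request.urlopen`; passing it in as a parameter keeps the decision logic separate from the network call.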

The Internet Archive considers the privacy of our users to be of critical importance. We try not to record IP addresses, and we have fought National Security Letters.  You can rest assured that the use of the Wayback Machine Chrome extension will not expose your browsing history.  In addition, we are in conversation with Google about adding a proxy server as an additional layer of protection.

Thank you for giving the Wayback Machine for Chrome extension a try.  You can test it with this URL: http://www.pfaw.org:80/attacks.htm  We are committed to supporting better web browsing experiences and welcome your feedback and suggestions about how we can improve.  Please send us your bug reports, feature requests and other feedback directly to info@archive.org.

Posted in Announcements, News | 29 Comments

Internet Archive’s Trump Archive launches today

The Trump Archive launches today with 700+ televised speeches, interviews, debates, and other news broadcasts related to President-elect Donald Trump, created using the Internet Archive’s TV News Archive.

A work in progress, the growing collection now includes more than 520 hours of Trump video. The earliest excerpt dates from December 2009, and the collection continues through the present. It includes more than 500 video statements fact checked by FactCheck.org, PolitiFact, and The Washington Post’s Fact Checker covering such controversial topics as immigration, Trump’s tax returns, Hillary Clinton’s emails, and health care.

Full list of fact checks with links to video statements in TV News Archive.
Note: We are working to update this spreadsheet with improved links. Stay tuned.

Visit the Trump Archive.

Reporters, researchers, Wikipedians, and the general public are invited to quote, compare and contrast televised statements made by Trump.

  • Use clips in your articles and videos.
  • Create supercuts on topics like Trump’s perspectives of the US press, made with our online “Popcorn” video editor.  
  • Let us know what content we are missing.  
  • If you have the technical resources, help us enhance search and discovery by collaborating in experiments to apply artificial intelligence-driven facial recognition, voice identification, and other video content analysis approaches.
  • How would you like to use such an archive?  Comment below, or write us at info@archive.org.

Why a Trump Archive?

We draw on the TV News Archive, and our experience with building the successful Political TV Ad Archive, to create a curated collection of material related to Trump, with an emphasis on fact-checked statements. The video is searchable, quotable, and shareable on social media.

In response to requests by our fact checking partners on the Political TV Ad Archive project and other media, we hope to provide assistance for those tracking Trump’s evolving statements on public policy issues.

For example: in July 2016, Trump told ABC’s George Stephanopoulos, “I have no relationship with Putin…I don’t think I’ve ever met him.” Stephanopoulos pressed him on this point during the interview, saying that Trump had previously claimed a relationship with him. PolitiFact ruled this statement by Trump as a “full flip flop”: “Trump’s denial of a relationship with Putin contradicted what he had said on multiple previous occasions.”

By providing a free and enduring source for TV news broadcasts of Trump’s statements, the Internet Archive hopes to make it more efficient for the media, researchers, and the public to track Trump’s statements while fact-checking and reporting on the new administration. The Trump Archive can also serve as a rich treasure trove of video material for any creative use: comedy, art, documentaries, wherever people’s inspiration takes them.

We consider the Trump Archive to be an experimental model for creating similar archives for other public officials. For example, we’ll explore the idea of creating curated collections for Trump’s nominees to head federal agencies; members of Congress of both parties (for example, perhaps the Senate and House majority and minority leadership); Supreme Court nominees, and so on.

While we’ve largely hand-curated this collection, we hope to collaborate with researchers to apply machine intelligence to expand this collection, building others and making search of our entire TV library vastly more efficient.

Such experimentation builds on our experience with first prototyping and then developing the Political TV Ad Archive. Our first collection of political TV ads, covering ads aired in Philadelphia during the 2014 mid-term elections, was built largely by hand. However, in preparation for the Political TV Ad Archive, we created a new open source tool, the Duplitron, that was able to identify ad airings by deploying audio fingerprinting. During the course of the project, we collected nearly 3,000 ads and documented more than 364,000 ad airings.

Why now?

Just because something is broadcast or posted on the internet doesn’t mean it’s forever. Reporters and the public may take it for granted that a news story or a piece of broadcast video is only a Google search away, but as newspapers, companies, and organizations fail and change, vital information is often lost. The web is far more fragile than is generally understood.

The Internet Archive’s core mission is to preserve and make accessible our cultural heritage. The Wayback Machine preserves websites over time, so if pages or sites are deleted, they can still be found. For example, Rachel Maddow of MSNBC reported on how the president-elect had deleted a web page from the official transition website that had touted Trump properties.

We also preserve political and news content through the TV News Archive, which contains news broadcasts by major networks back to 2009, searchable via closed captioning. The Political TV Ad Archive archives 2016 election ads along with relevant fact checks and follow-the-money reporting by our journalism partners. Our Political Campaign web archive is preserving election-related online media, such as select candidate and political groups’ websites and Twitter and Instagram feeds.

What’s next

The Trump Archive is a work in progress; we will continue to refine the content. We hope to work with others to broaden the materials available, to make search more efficient, and otherwise make it more useful for the public. We’d like your feedback and suggestions.

The great American author William Faulkner wrote, “The past is never dead. It’s not even past.” We believe that the Trump Archive, in preserving the past, can help the public engage more knowledgeably with our future.

Many thanks to the thoughtful contributions of Robin Chin, Jessica Clark, Katie Dahl, Katie Donnelly, John Gonzalez, Wendy Hanamura, Tracey Jaquith, Jeff Kaplan, Roger Macdonald, Ralf Muehlen, Craig Newmark, Sylvia Paull, Alexis Rossi, Dan Schultz, Nancy Watzman, our Partners & Funders and the Vanderbilt Television News Archive – on whose shoulders we stand.

Posted in Announcements, News | 82 Comments

Join us for a White House Social Media and Gov Data Hackathon!

Join us at the Internet Archive this Saturday, January 7, for a government data hackathon! We are hosting an informal hackathon working with White House social media data, government web data, and data from election-related collections. We will provide more gov data than you can shake a script at! If you are interested in attending, please register using this form. The event will take place at our 300 Funston Avenue headquarters from 10am-5pm.

We have been working with the White House on their admirable project to provide public access to eight years of White House social media data for research and creative reuse. Read more on their efforts at this blog post. Copies of this data will be publicly accessible at archive.org. We have also been furiously archiving the federal government web as part of our collaborative End of Term Web Archive and have also collected a voluminous amount of media and web data as part of the 2016 election cycle. Data from these projects — and others — will be made publicly accessible for folks to analyze, study, and do fun, interesting things with.

At Saturday’s hackathon, we will give an overview of the datasets available, have short talks from affiliated projects and services, and point to tools and methods for analyzing the hackathon’s data. We plan for a loose, informal event. Some datasets that will be available for the event and publicly accessible online:

  • Obama Administration White House social media from 2009-current, including Twitter, Tumblr, Vine, Facebook, and (possibly) YouTube
  • Comprehensive web archive data of current White House websites: whitehouse.gov, petitions.whitehouse.gov, letsmove.gov and other .gov websites
  • The End of Term Web Archives, a large-scale collaborative effort to preserve the federal government web ( .gov/.mil) at presidential transitions, including web data from 2008, 2012, and our current 2016 project
  • Special sub-collections of government data, such as every PowerPoint in the Internet Archive’s web archive from the .mil web domain
  • Extensive archives of social media data related to the 2016 election including data from candidates, pundits, and media
  • Full text transcripts of Trump candidate speeches
  • Python notebooks, cluster computing tools, and pointers to methods for playing with data at scale.

Much of this data was collected in partnership with other libraries and with the support of external funders. We thank, foremost, the current White House Office of Digital Strategy staff for their advocacy for open access and working with us and others to make their social media open to the public. We also thank our End of Term Web Archive partners and related community efforts helping preserve the .gov web, as well as the funders that have supported many of the collecting and engineering efforts that make all this data publicly accessible, including the Institute of Museum and Library Services, Altiscale, the Knight Foundation, the Democracy Fund, the Kahle-Austin Foundation, and others.

Posted in Announcements, News | 19 Comments

Would Like to Archive Government Web Services, not just Web Sites– Please help

Archiving .gov and .mil websites is going on now, with lots of help—but what if we could archive full government web services? This would mean keeping interactive sites that include databases and forms, available for future use even if the original website changes or is removed.

We like this idea because we would preserve how websites worked, not just what they looked like. As websites become more database driven and interactive, this would be a bigger help than the already helpful Wayback Machine.

We believe this is possible now given the increased use of virtual machines and cloud services. Webmasters are adjusting to having their systems work in an isolated environment and one that can be snapshot’d.

What we need are some webmasters who would like to try this. We think that government websites would be perfect because they tend to change as administrations change and the datasets are often public data.

If you run a website and would like to participate in this experiment or would like to help on the receiving end, please send a note to info@archive.org or reply to this post.

Archiving web services could usher in a completely new age in archiving of Internet resources.

 

 

Posted in Announcements, News | 4 Comments

A Year-end Message from the TV News Archive

by Katie Donnelly

Over the past extremely unpredictable election year, the Internet Archive invented new methods and tools to give journalists, researchers, and the public the power to access, scrutinize, share, and thoroughly fact-check political ads, presidential debates, and TV news broadcasts.

Our efforts were designed to help citizens better understand the patterns of political messages designed to persuade them and find factual, reliable information in what is disturbingly being seen as a “post-truth” world.

The Political TV Ad Archive project proved to be highly useful to our high-profile fact-checking partners, as well as reporters at an array of outlets including The New York Times, The Washington Post, FOX News, The Economist, The Atlantic, and more. By providing data about when, where, and how many times political ads aired on TV in key markets, the project unlocked new creative potential for data reporters to analyze how campaigns and outside groups were targeting messages to voters in different locations.

Breaking events, like political debates and speeches, also offered a chance for archived TV content to shine, allowing reporters to isolate and share clips in near-real time, and fact-checkers to harvest dubious statements for further exploration. In addition, the project’s experience with developing audio fingerprinting (through a new invention we call the Duplitron) for identifying instances of ads inspired a new use: tracking candidate debate sound bites in subsequent TV news shows.

In this way, reporters and researchers were able to analyze and report on which political statements were trending across different TV programs and networks, revealing the ideological, agenda-setting, and other editorial choices news producers made about which issues to highlight and which to overlook.

As Roger Macdonald, director of the TV News Archive, wrote to project partners: “Citizens will increasingly hunger for sound information to inform wise electoral decisions. With our Republic being riven by increasing socio-political chaos and infectious divisions, whose magnitude has not been seen since before our Civil War, we think there are uncommon opportunities to serve citizens with the information for which they will increasingly yearn. We have an historic opportunity to thoughtfully place some grains of sand on the balance pan of reason.”

The project was supported by a generous grant from the Knight News Challenge, funded in partnership with the Knight Foundation, the Democracy Fund, the Hewlett Foundation and the Rita Allen Foundation, and received additional support from the Rita Allen Foundation, the Democracy Fund, PLCB Foundation, Craig Newmark, Christopher Buck, and others.

Here is a quick look at project accomplishments:

Political TV Ad Archive

  • Total number of archived ad views, most embedded in partner sites: 2,036,063
  • Number of ads collected: 2,991
  • Political ads broadcast 364,822 times over 26 markets
  • Number of fact and source checks: 131
  • Press coverage: 156 articles

Katie Donnelly is associate director at Dot Connectors Studio, a Philadelphia-based strategy firm that has worked with the Political TV Ad Archive.

Posted in News

New Research Tool for Visualizing Two Million Hours of Television News

Guest post by Kalev Leetaru

Today the Internet Archive announces a new interactive timeline visualization–the Television Explorer–that lets you trace how any keyword–think “emails”, “tax returns”, “alt-right”–has been covered on U.S. television news over the past half-decade.

See the Television Explorer, a new tool for exploring TV News.

Over the past year and a half, the GDELT Project and the Internet Archive’s Television News Archive have worked closely together to visualize how U.S. television news has covered the contentious 2016 political campaign.

One of the tools we created was the 2016 Candidate Television Tracker, which used closed captioning to count how many times each of the presidential candidates was mentioned on television and offered a day-by-day timeline showing the ebbs and flows of who was “winning” the free media wars. (Answer: President-elect Donald Trump.) This tool was used by such media outlets as The Atlantic, The Washington Post, FiveThirtyEight, Politico and The Guardian, among many others.

Now we are adapting this tool to allow more sophisticated searches: rather than just the presidential candidates, now you can trace television news coverage of any keyword of your choosing. You can even run advanced searches that find words in conjunction with other words or phrases, such as finding mentions of Hillary Clinton that also discuss her email server. All search results are available for download via CSV and JSON export, making it possible for data journalists, researchers, and advocates to fine-tune their analysis of the data.

When searching, you get back a visual timeline showing how often that word or phrase has appeared on American television news over the past half-decade. Nearly two million hours of television news totaling more than 5.7 billion words from over 150 distinct stations spanning July 2009 to present (though not all stations were monitored for the entire period) are searchable in this interface.

Unlike the Internet Archive’s Television News Archive interface, which returns results at the level of an hour or half-hour “show,” the interface here reaches inside of those six and a half years of programming and breaks the more than one million shows into individual sentences and counts how many of those sentences contain your keyword of interest. Instead of reporting that CNN had 24 hour-long shows yesterday that mentioned Donald Trump one or more times, the interface here will count how many sentences uttered on CNN yesterday mentioned his name–a vastly more accurate metric for assessing media attention.
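To make the difference between the two metrics concrete, here is a toy Python sketch. It is not the Television Explorer's implementation: the regex-based sentence splitter, the function names, and the sample caption text are all assumptions chosen for illustration.

```python
import re


def split_sentences(caption_text):
    """Naively split caption text into sentences at ., ! or ? plus whitespace.

    Real closed captions are messier; this splitter is a simplifying assumption.
    """
    parts = re.split(r"(?<=[.!?])\s+", caption_text.strip())
    return [part for part in parts if part]


def count_matching_sentences(caption_text, keyword):
    """Sentence-level metric: how many sentences mention the keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return sum(1 for s in split_sentences(caption_text) if pattern.search(s))


def count_matching_shows(shows, keyword):
    """Show-level metric: how many shows mention the keyword at least once."""
    return sum(1 for text in shows if count_matching_sentences(text, keyword) > 0)


# Made-up caption text standing in for two shows.
shows = [
    "Trump spoke today. The weather was mild. Trump then left the stage.",
    "Markets rose sharply. Nothing else happened.",
]
show_hits = count_matching_shows(shows, "Trump")
sentence_hits = sum(count_matching_sentences(text, "Trump") for text in shows)
```

Here the show-level count is 1 while the sentence-level count is 2, which is the distinction the paragraph above draws: a show that mentions a name once and a show that mentions it constantly look identical at the show level.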

Explore how CNN covered the presidential campaign of 2012 versus 2016 and understand just how big of a media event this year’s election really was. See precisely when Edward Snowden burst onto the scene and how Wikileaks got more coverage during the 2016 presidential election than its debut in 2010. Watch the seasonal spikes of Thanksgiving, or see how Ebola received little attention, even as thousands died in Africa, becoming a topic only after the first Americans became infected.

Using the “near” search feature, plot coverage of Wikileaks that also mentioned either “Podesta,” “email,” or “emails” nearby and discover that FOX paid far more attention to the DNC and Podesta email hacks than CNN, MSNBC, CNBC or Bloomberg. In contrast, CNN focused more intensely on the Trayvon Martin shooting (Aljazeera America and Bloomberg were not yet being monitored by the Archive), while Aljazeera led coverage of the Michael Brown and Eric Garner deaths.


Search of term “Wikileaks” near Podesta, emails, Clinton
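One way to think about a “near” search is as a word-window test: a keyword counts as a match only if one of the companion terms appears within a few words of it. The sketch below illustrates that idea in Python. The window size, the tokenizer, and the function name are assumptions for illustration, and the tool's real matching rules may differ.

```python
import re


def mentions_near(text, keyword, companions, window=4):
    """Return True if any companion term occurs within `window` words of keyword.

    The window size and tokenization are illustrative assumptions.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    companion_set = {c.lower() for c in companions}
    target = keyword.lower()
    for i, token in enumerate(tokens):
        if token == target:
            # Words just before and just after this occurrence of the keyword.
            nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            if companion_set.intersection(nearby):
                return True
    return False


# Hypothetical sentences, invented for this example.
close_hit = mentions_near(
    "Wikileaks released more Podesta emails today",
    "Wikileaks",
    ["Podesta", "email", "emails"],
)
far_miss = mentions_near(
    "Wikileaks was founded in 2006 and years later the Podesta story broke",
    "Wikileaks",
    ["Podesta", "email", "emails"],
)
```

Here `close_hit` is true because “Podesta” sits three words after the keyword, while `far_miss` is false because the companion term falls outside the four-word window.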

Search for “ivory” to see that Aljazeera America (which ceased operation in April 2016) devoted vastly more of its coverage to elephant poaching in Africa than any other monitored national network. It also paid the most attention to “Africa” and to the “refugee” crisis. On the other hand, Bloomberg has devoted much more of its time to “China” and to the economic crisis in “Greece” last year.

We look forward to seeing what people do with this new tool. Please share your favorite searches on Twitter with the hashtag “#internetarchivetvsearch”. If you have any questions, please email kalev.leetaru5@gmail.com or nancyw@archive.org.

Kalev Leetaru is an independent data journalist. 

Posted in Announcements, News | 3 Comments

Robots.txt Files and Archiving .gov and .mil Websites


The Internet Archive is collecting webpages from over 6,000 government domains, over 200,000 hosts, and feeds from around 10,000 official federal social media accounts. Some have asked if we ignore URL exclusions expressed in robots.txt files.

The answer is a bit complicated.  Historically, sometimes yes and sometimes no; but going forward the answer is “even less so.”

Robots.txt files live at the top level of a website at a URL like this: https://example.com/robots.txt. This standard was developed in 1994 to guide search engine crawlers in a variety of ways, including identifying areas to avoid crawling. This standard is used by Google, for instance.

These files were useful 20 years ago for the Internet Archive’s crawlers, but have become less and less so over the years because many sites have not actively maintained the files from the point of view of archiving. Also, large websites or hosted websites often do not make it easy for their users to edit these files, and large websites increasingly guide or block crawlers with technological measures. Another problem is that domain names change hands, so a current robots.txt file may not be relevant to an earlier era of the site. Over time, we have come to encourage webmasters who want to exclude their sites to send exclusion requests to info@archive.org and to specify the time period the request applies to.

Our end-of-term crawls of .gov and .mil websites in 2008, 2012, and 2016 have ignored exclusion directives in robots.txt in order to get more complete snapshots. Other crawls done by the Internet Archive and other entities have had different policies.  We have had little or no negative feedback on this, and little or no positive feedback — in fact little feedback at all. The Wayback Machine has also been replaying the captured .gov and .mil webpages for some time in the beta wayback, regardless of robots.txt.   
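For readers unfamiliar with how a crawler consults robots.txt, the sketch below uses Python's standard `urllib.robotparser` module to show the difference between honoring and ignoring exclusion directives. It is an illustration only and not the Internet Archive's crawler code: the sample robots.txt, the URLs, the user agent name `example-archiver`, and the `honor_robots` flag are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that excludes one directory for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def urls_to_crawl(urls, robots_txt, user_agent, honor_robots=True):
    """Return the URLs a crawler would fetch under the given policy.

    With honor_robots=False the exclusion directives are ignored and
    every URL is kept.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url for url in urls
        if not honor_robots or parser.can_fetch(user_agent, url)
    ]


candidate_urls = [
    "https://example.com/index.html",
    "https://example.com/private/report.html",
]
respecting = urls_to_crawl(candidate_urls, ROBOTS_TXT, "example-archiver")
ignoring = urls_to_crawl(
    candidate_urls, ROBOTS_TXT, "example-archiver", honor_robots=False
)
```

With the directives honored, only the index page is kept. With them ignored, both URLs are kept, which is how a crawl ends up with a more complete snapshot.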

Overall, we hope to capture government and military websites well, and hope to keep this valuable information available to users in the future.

Posted in News, Wayback Machine, Web Archive | 3 Comments

Preserving U.S. Government Websites and Data as the Obama Term Ends

Long before the 2016 presidential election cycle, librarians understood this often-overlooked fact: vast amounts of government data and digital information are at risk of vanishing when a presidential term ends and administrations change. For example, 83% of .gov PDFs disappeared between 2008 and 2012.

That is why the Internet Archive, along with partners from the Library of Congress, University of North Texas, George Washington University, Stanford University, California Digital Library, and other public and private libraries, is hard at work on the End of Term Web Archive, a wide-ranging effort to preserve the entirety of the federal government web presence, especially the .gov and .mil domains, along with federal websites on other domains and official government social media accounts.

While it is not the only project the Internet Archive is undertaking to preserve government websites, FTP sites, and databases at this time, the End of Term Web Archive is a far-reaching one.

The Internet Archive is collecting webpages from over 6,000 government domains, over 200,000 hosts, and feeds from around 10,000 official federal social media accounts. The effort is likely to preserve hundreds of millions of individual government webpages and data and could end up totaling well over 100 terabytes of archived material. Over its full history of web archiving, the Internet Archive has preserved over 3.5 billion URLs from the .gov domain, including over 45 million PDFs.

This end-of-term collection builds on similar initiatives in 2008 and 2012 by original partners Internet Archive, Library of Congress, University of North Texas, and California Digital Library to document the “gov web,” which has no mandated, domain-wide single custodian. For instance, here is the National Institute of Literacy (NIFL) website in 2008. The domain went offline in 2011. Similarly, the Sustainable Development Indicators (SDI) site was later taken down. Other websites, such as invasivespecies.gov, were later folded into larger agency domains. Every web page archived is accessible through the Wayback Machine, and past and current End of Term specific collections are full-text searchable through the main End of Term portal. We have also worked with additional partners to provide access to the full data for use in data-mining research and projects.

The project has received considerable press attention this year, with related stories in The New York Times, Politico, The Washington Post, Library Journal, Motherboard, and others.

“No single government entity is responsible for archiving the entire federal government’s web presence,” explained Jefferson Bailey, the Internet Archive’s Director of Web Archiving.  “Web data is already highly ephemeral and websites without a mandated custodian are even more imperiled. These sites include significant amounts of publicly-funded federal research, data, projects, and reporting that may only exist or be published on the web. This is tremendously important historical information. It also creates an amazing opportunity for libraries and archives to join forces and resources and collaborate to archive and provide permanent access to this material.”

This year has also seen a significant increase in citizen- and librarian-driven “hackathons” and “nomination-a-thons” where subject experts and concerned information professionals crowdsource lists of high-value or endangered websites for the End of Term archiving partners to crawl. Librarian groups in New York City are holding nomination events to make sure important sites are preserved. And universities such as the University of Toronto are holding events for “guerrilla archiving” focused specifically on preserving climate-related data.

We need your help too! You can use the End of Term Nomination Tool to nominate any .gov or government website or social media site and it will be archived by the project team.   If you have other ideas, please comment here or send ideas to info@archive.org.   And you can also help by donating to the Internet Archive to help our continued mission to provide “Universal Access to All Knowledge.”

Posted in Announcements, News | 14 Comments

Internet Archive Canada and National Security Letter in the news: roundup

The Internet Archive garnered major media attention over the past week, first, on our plan to create a Canadian copy, and second, on the news we received a National Security Letter (NSL) requesting personal information about a user, the second in our history.

Canadian copy

Brewster Kahle’s post explaining why, in light of the new administration, the Internet Archive is raising money to build a copy of its collections in Canada hit a nerve.  More details were in a FAQ.

On November 29, Rachel Maddow led her MSNBC show with a segment about how the Internet Archive’s Wayback Machine helps reporters by preserving a record of what politicians say online, even when they later delete it.

One of her main examples: how, soon after winning the election, President-elect Donald Trump’s official federal transition web page included a “rundown … of all of the ‘world’s top properties’ that Donald Trump owns.”

The website has since been deleted, Maddow noted.

Maddow also called the Internet Archive a “national treasure…an international treasure.” (We’re blushing.)

Meanwhile, Paul Sawers noted in Venture Beat:

 Given that lies and fake news played a crucial part in the 2016 U.S. presidential election narrative, it is somewhat notable that the Internet Archive had launched the Political TV Ad Archive back in January to help journalists fact-check claims made during political campaigning.

In The Washington Times, Andrew Blake wrote about the Internet Archive’s plans to create a Canadian copy and also reported:

Mr. Trump’s office did not immediately respond to a request for comment Wednesday. Prior to being elected president, however, the Republican businessman suggested taking action to prevent Americans from becoming radicalized online by the Islamic State terror group’s social media recruitment efforts.

Here’s a link to Trump’s speech referenced by The Washington Times.

Sam Thielman reported in The Guardian on challenges facing libraries generally, including the Internet Archive’s decision to create a Canadian copy of data. The piece also discusses how the New York Public Library has changed its privacy policies to assure readers that it will not keep user data longer than expected.

Other media outlets reporting on the Internet Archive’s news include NBC News, the BBC, the New Republic, Recode Daily, and Newsweek.

Increasing transparency on National Security Letters

Last week the Internet Archive also revealed we received a National Security Letter (NSL), requesting we turn over personal information about a particular user, the second in our history. We worked with the Electronic Frontier Foundation (EFF) to challenge the letter and gain the right to release it in redacted form; in the process, we also highlighted an error in the NSL about the right to appeal, which may have affected thousands of other letters.

Kim Zetter, a reporter for The Intercept, reported at length about how the Internet Archive took the unusual step of challenging the NSL–and won:

Now, Kahle and the archive are notching another victory, one that underlines the progress their original fight helped set in motion. The archive, a nonprofit online library, has disclosed that it received another NSL in August, its first since the one it received and fought in 2007. Once again it pushed back, but this time events unfolded differently: The archive was able to challenge the NSL and gag order directly in a letter to the FBI, rather than through a secretive lawsuit. In November, the bureau again backed down and, without a protracted battle, has now allowed the archive to publish the NSL in redacted form.

Dhrumil Mehta of FiveThirtyEight.com reported on the error exposed by the Internet Archive and the EFF: the NSL incorrectly described how a recipient could appeal the gag order that prevents an organization from publicizing such a letter. Mehta has filed a Freedom of Information Act (FOIA) request to find out how many letters sent out by the Federal Bureau of Investigation (FBI) contain this error:

This letter was particularly troublesome to privacy advocates because it contained misinformation about the rights of a letter recipient to challenge the nondisclosure requirement. The letter stated that the Internet Archive could “make an annual challenge to the nondisclosure requirement.” The Electronic Frontier Foundation, an advocacy organization that is legally representing the Internet Archive, pointed out in a press release that the passage of the USA Freedom Act in June of 2015 changed the law to allow letter recipients to challenge the National Security Letter at any time, not just once annually. In response to the EFF’s claim, the FBI withdrew its National Security Letter, allowed the Internet Archive to publish a redacted version of the letter containing the error and promised to correct the mistake by informing everyone else who got the same erroneous language.

It’s not just us

Tim Johnson of McClatchyDC drew all the themes together, linking the Internet Archive’s Canada announcement, the news on the NSL, and actions other library organizations are taking, all in one piece.

It turns out the nonprofit Internet Archive isn’t alone in taking action.

The New York Public Library announced a change this week to its privacy policy, informing users that it would retain less information about their activities.

The American Library Association, headquartered in Chicago, embraced that move and encourages others, including telling public libraries to encrypt all communications and lock up stored data to protect it from a prying government.

 

Posted in Announcements, News | 17 Comments