Internet Archive Canada and National Security Letter in the news: roundup

The Internet Archive garnered major media attention over the past week: first, for our plan to create a Canadian copy, and second, for the news that we received a National Security Letter (NSL) requesting personal information about a user, the second in our history.

Canadian copy

Brewster Kahle’s post explaining why, in light of the new administration, the Internet Archive is raising money to build a copy of its collections in Canada hit a nerve.  More details were in a FAQ.

On November 29, Rachel Maddow led her MSNBC show with a segment about how the Internet Archive’s Wayback Machine helps reporters by preserving a record of what politicians say online, even when they later delete it.

One of her main examples: soon after he won the election, President-elect Donald Trump’s official federal transition web page included a “rundown … of all of the ‘world’s top properties’ that Donald Trump owns.”

The website has since been deleted, Maddow noted.

Maddow also called the Internet Archive a “national treasure…an international treasure.” (We’re blushing.)

Meanwhile, Paul Sawers noted in Venture Beat:

 Given that lies and fake news played a crucial part in the 2016 U.S. presidential election narrative, it is somewhat notable that the Internet Archive had launched the Political TV Ad Archive back in January to help journalists fact-check claims made during political campaigning.

In The Washington Times, Andrew Blake wrote about the Internet Archive’s plans to create a Canadian copy and also reported:

Mr. Trump’s office did not immediately respond to a request for comment Wednesday. Prior to being elected president, however, the Republican businessman suggested taking action to prevent Americans from becoming radicalized online by the Islamic State terror group’s social media recruitment efforts.

Here’s a link to Trump’s speech referenced by The Washington Times.

Sam Thielman reported in The Guardian on challenges facing libraries generally, including the Internet Archive’s decision to create a Canadian copy of data. The piece also discusses how the New York Public Library has changed its privacy policies to assure readers that it will not keep user data longer than expected.

Other media outlets reporting on the Internet Archive’s news include NBC News, the BBC, the New Republic, Recode Daily, and Newsweek.

Increasing transparency on National Security Letters

Last week the Internet Archive also revealed we received a National Security Letter (NSL), requesting we turn over personal information about a particular user, the second in our history. We worked with the Electronic Frontier Foundation (EFF) to challenge the letter and gain the right to release it in redacted form; in the process, we also highlighted an error in the NSL about the right to appeal, which may have affected thousands of other letters.

Kim Zetter, a reporter for The Intercept, reported at length about how the Internet Archive took the unusual step of challenging the NSL–and won:

Now, Kahle and the archive are notching another victory, one that underlines the progress their original fight helped set in motion. The archive, a nonprofit online library, has disclosed that it received another NSL in August, its first since the one it received and fought in 2007. Once again it pushed back, but this time events unfolded differently: The archive was able to challenge the NSL and gag order directly in a letter to the FBI, rather than through a secretive lawsuit. In November, the bureau again backed down and, without a protracted battle, has now allowed the archive to publish the NSL in redacted form.

Dhrumil Mehta of FiveThirtyEight.com reported on the error exposed by the Internet Archive and the EFF: the NSL incorrectly described how a recipient can appeal the gag order that prevents an organization from publicizing such a letter. Mehta has filed a Freedom of Information Act (FOIA) request to find out how many letters sent out by the Federal Bureau of Investigation (FBI) contain this error:

This letter was particularly troublesome to privacy advocates because it contained misinformation about the rights of a letter recipient to challenge the nondisclosure requirement. The letter stated that the Internet Archive could “make an annual challenge to the nondisclosure requirement.” The Electronic Frontier Foundation, an advocacy organization that is legally representing the Internet Archive, pointed out in a press release that the passage of the USA Freedom Act in June of 2015 changed the law to allow letter recipients to challenge the National Security Letter at any time, not just once annually. In response to the EFF’s claim, the FBI withdrew its National Security Letter, allowed the Internet Archive to publish a redacted version of the letter containing the error and promised to correct the mistake by informing everyone else who got the same erroneous language.

It’s not just us

Tim Johnson of McClatchyDC drew all the themes together, linking the Internet Archive’s Canada announcement, the news on the NSL, and actions other library organizations are taking, all in one piece.

It turns out the nonprofit Internet Archive isn’t alone in taking action.

The New York Public Library announced a change this week to its privacy policy, informing users that it would retain less information about their activities.

The American Library Association, headquartered in Chicago, embraced that move and encourages others, including telling public libraries to encrypt all communications and lock up stored data to protect it from a prying government.

 

Posted in Announcements, News | 2 Comments

FAQs about the Internet Archive Canada

Responses from Brewster Kahle, Founder & Digital Librarian of the Internet Archive

Based on interest from our letter that mentioned our raising money to make a copy of Internet Archive’s digital collections in Canada, press and others have asked a bunch of good questions. Here is a compendium of our answers:

Q. Were you working on a back-up before the election of Trump?
Yes, we have partial copies of the Internet Archive in Alexandria, Egypt, and in Amsterdam, the Netherlands.

And also before the election we had been planning with the University of Toronto and University of Alberta to host the materials digitized from Canadian libraries at the Internet Archive Canada, which is a completely separate nonprofit from ours.

The statements by Trump on the campaign trail (see below) have kicked us into higher gear, moving us further and faster than we otherwise would have gone. The election led us to think bigger.

Q. Was there anything specific about Trump’s win that made you want to step up your game in terms of a backup archive? What in particular concerns you about what he has said/done? What potential risks do you see?
Upon his election, we looked through our archive to find what his stance on Internet policy might be, and we found several statements.

At this point, I think it would be prudent to take President-elect Trump at his word. Here are some of his statements, preserved in our Television News Archive. https://archive.org/tv

CNN Republican Presidential Debate
CNN December 15, 2015
Wolf Blitzer: Mr. Trump, are you open to closing parts of the internet?
Donald Trump: I would certainly be open to closing areas where we are at war with somebody. I sure as hell don’t want to let people that want to kill us and kill our nation use our internet. Yes, sir, I am.

https://archive.org/details/CSPAN_20151208_063000_Key_Capitol_Hill_Hearings
Donald Trump quote at a campaign rally at the USS Yorktown in South Carolina CSPAN broadcast speech on December 8, 2015
Donald Trump: So the press has to be responsible. They’re not being responsible, because we are losing a lot of people because of the internet. We have to do something. We have to go see Bill Gates and a lot of different people that really understand what is happening. We have to talk to them, maybe in certain areas, closing that internet up in some way. Some of you will say, “Oh, freedom of speech, freedom of speech.” These are foolish people. We have a lot of foolish people. We have a lot of foolish people. We have got to maybe do something with the internet because they are recruiting by the thousands.

Donald Trump on freedom of the press:
https://archive.org/details/R_macdonald-trumpOnPressV6

Q. How does this work? What goes into creating a backup of this magnitude (in whatever brief lay terms you can condense it to)?
There are stages we can take to achieve our overall goal. The first stage would be done with the University of Toronto and University of Alberta: to make a copy of what has been digitized from these Canadian collections (books and microfilm) and move that onto their university servers.

The next stage is to create a partial mirror at the Internet Archive Canada, which we have been planning to do.

Then the next stage is to create a “backup copy” in Canada for researchers. The best case scenario would be to have an active organization running a live copy of as much of the Internet Archive’s collections as makes sense. This is what we would like to do.

Q: Is there a specific dollar amount that you are aiming for?
To build a running archive in Canada will cost approximately $5 million, which is our goal. But we can take steps in this direction with less. Then there is ongoing support.

Q: How will you raise the money?
Great question. We are asking for donations from our users and supporters. Donations to the Internet Archive are tax-deductible in the US and can be made at https://archive.org/donate/

Q. What is the Internet Archive of Canada? Can I make a donation to it?
The Internet Archive Canada is a Not-For-Profit Corporation, registered under number 435509-1. It has been running for years and employs 11 book scanners in Toronto and Alberta. It is not a registered public charity, and donations are tax-deductible on donors’ US income only. To donate, please send cheques to:

Internet Archive Canada
130 St. George St.
Suite 7001
Toronto, ON M5V 3T5
CANADA

Q. What does it mean when you say you archive the “Internet.” Is this national? Or is it a global endeavor?
The Internet Archive archives many things (books, music, video, web pages, television) and makes these materials available for free on the archive.org, openlibrary.org, and archive-it.org sites. Take, for instance, the scope of our Web archiving in the Wayback Machine: https://archive.org/web. It houses a massive archive of over 250 billion web pages, made up of many collections. The Wayback Machine is freely accessible to anyone and is used by hundreds of thousands of people every day. It is a global project to archive these pages.

Q. What else does the Internet Archive preserve, beyond the Wayback Machine?
The Internet Archive is a non-profit digital library founded by Brewster Kahle in 1996 with the mission to provide “Universal access to all Knowledge.” The organization seeks to preserve the world’s cultural heritage and to provide open access to our shared knowledge in the digital era, supporting the work of historians, scholars, journalists, students, the blind and reading disabled, as well as the general public. The Internet Archive’s digital collections include more than 26 petabytes of data: 279 billion web pages, moving images (2.2 million films and videos), audio (2.5 million recordings, 140,000 live concerts), texts (8 million texts including 3 million digital books), software (100,000 items) and television (3 million hours). Each day, 2-3 million visitors use or contribute to the Internet Archive, making it one of the world’s top 250 sites. It has created new models for digital conservation by forging alliances with more than 450 libraries, universities and national archives around the world.

Posted in News | 9 Comments

Help Us Keep the Archive Free, Accessible, and Reader Private

The history of libraries is one of loss. The Library of Alexandria is best known for its disappearance.

Libraries like ours are susceptible to different fault lines:

Earthquakes,

Legal regimes,

Institutional failure.

So this year, we have set a new goal: to create a copy of Internet Archive’s digital collections in another country. We are building the Internet Archive of Canada because, to quote our friends at LOCKSS, “lots of copies keep stuff safe.” This project will cost millions. So this is the one time of the year I will ask you: please make a tax-deductible donation to help make sure the Internet Archive lasts forever. (FAQ on this effort).

On November 9th in America, we woke up to a new administration promising radical change. It was a firm reminder that institutions like ours, built for the long-term, need to design for change.

For us, it means keeping our cultural materials safe, private and perpetually accessible. It means preparing for a Web that may face greater restrictions.

It means serving patrons in a world in which government surveillance is not going away; indeed it looks like it will increase.

Throughout history, libraries have fought against terrible violations of privacy—where people have been rounded up simply for what they read.  At the Internet Archive, we are fighting to protect our readers’ privacy in the digital world.

We can do this because we are independent, thanks to broad support from many of you. The Internet Archive is a non-profit library built on trust. Our mission: to give everyone access to all knowledge, forever. For free. The Internet Archive has only 150 staff but runs one of the top-250 websites in the world. Reader privacy is very important to us, so we don’t accept ads that track your behavior.  We don’t even collect your IP address. But we still need to pay for the increasing costs of servers, staff and rent.

You may not know this, but your support for the Internet Archive makes more than 3 million e-books available for free to millions of Open Library patrons around the world.

Your support has fueled the work of journalists who used our Political TV Ad Archive in their fact-checking of candidates’ claims.

It keeps the Wayback Machine going, saving 300 million Web pages each week, so no one will ever be able to change the past just because there is no digital record of it. The Web needs a memory, the ability to look back.

If you find our work has been useful to you, please take a minute to donate whatever you can afford today. Help ensure the Internet Archive lasts forever. I promise you: it will be money well spent.

Posted in Announcements | 238 Comments

National Security Letter to Us from FBI Includes Error – How Many Like it were Sent to Others?

The Internet Archive, with the help of the Electronic Frontier Foundation (EFF), is making public the second National Security Letter (NSL) issued to the Archive in our history (we received our first NSL in 2007 and successfully contested it with help from EFF and the ACLU). In response to our challenging this new NSL, the FBI has agreed to correct its standard NSL template and send clarifications about the law to potentially thousands of communications providers who have received NSLs in the last year and a half.

NSLs are a controversial tool that the FBI uses to demand specific types of private account information from service providers without a judge’s prior approval. NSLs also come with a gag order on the recipient. Their constitutionality is currently being litigated in courts.

The NSL we received includes incorrect and outdated information regarding the options available to a recipient of an NSL to challenge its gag. Specifically, the NSL states that such a challenge can only be issued once a year. But in 2015, Congress did away with that annual limitation and made it easier to challenge gag orders. The FBI has confirmed that the error was part of a standard NSL template and other providers received NSLs with the same significant error. We don’t know how many, but it is possibly in the thousands (according to the FBI, they sent out around 13,000 NSLs last year). How many recipients might have delayed or even been deterred from issuing challenges due to this error? Thankfully, the FBI says that they will now be issuing corrections regarding the law. You can see their letter to us here.

Publishing this NSL is also important because only a few have ever been made public due to their across-the-board gag restriction, in spite of the fact that hundreds of thousands of NSLs have been issued since 2001.

Information regarding the individual targeted by this NSL and the issuing office is redacted in the version that we are releasing. We didn’t find any documents in our records responsive to the NSL, so nothing was turned over.

We are deeply appreciative of the assistance of EFF in this matter, enabling us to make public an example of a mostly obscured practice with very significant implications for individual privacy and civil liberties. See EFF’s press release as well as their excellent collection of blog posts for more background and analysis.

Posted in Announcements, News | 8 Comments

The Internet Arcade becomes an Archive Reality

A couple years back, we introduced the Internet Arcade, which enabled people around the world to play a number of Arcade titles from the last 40 years in their browsers, instantly. We’ve also had collections of console games, and a general library of tens of thousands of software programs which has also proven very popular.

The work continues to expand the emulated systems and refresh what titles are available, but a project we’ve had going on the side for a while just came to fruition.

Among the organizations that turned out to benefit from having our browser-based emulations was X-Arcade, manufacturers of high-quality joysticks and control panels for use with computers and software. Meant to have the original Arcade feel, a few examples of these controllers were gifted to the Archive and we’ve used them pretty extensively in demonstration days and special events.

Last year, X-Arcade announced an old-school full-sized arcade machine case for sale, and generously offered to send one to the Archive as well. We contacted Mar Williams of Sudux.com, an excellent artist who has done work for the DEFCON hacking conference and many other events, and she put together custom Internet Archive-themed arcade side art for the machine. Here’s what she came up with:

[Image: ia-mockup]

The machine has made its way through shipping and moving companies and arrived at the Internet Archive’s 300 Funston Avenue headquarters in great shape, along with all the electronics and parts to make it go soon.

It’s one thing to see a mockup, and another to see the actual machine in your lobby:

[Image: img_2662]

Over the next few weeks, the system will be set up to run with the Internet Archive systems and provide a really nice demonstration station for the many guests and visitors we see. It really jazzes up the place!

In the meantime, we’re now providing you with links to download the artwork files, in case you want to use them yourself.

Thanks again to X-Arcade for the lovely addition to our lobby, and to Mar Williams for such fantastic art!

Posted in Announcements, Cool items, Emulation | 9 Comments

Please: Help Build the 2016 U.S. Presidential Election Web Archive

Help us build a web archive documenting reactions to the 2016 Presidential Election. You can submit websites and other online materials, and provide relevant descriptive information, via this simple submission form. We will archive and provide ongoing access to these materials as part of the Internet Archive Global Events collection.

Since its beginning, the Internet Archive has worked with a global partner community of cultural heritage institutions, researchers and scholars, and citizens to build crowdsourced topical web archives that preserve primary sources documenting significant global events. Past collections include the Occupy Movement, the 2013 US Government Shutdown, the Jasmine Revolution in Tunisia, and the Charlie Hebdo attacks. These collections leverage the power of individual curators and motivated citizens to help expand our collective efforts to diversify and augment the historical record. Any webpages, sites, or other online resources about the 2016 Presidential Election are in scope. This web archive will build upon our affiliated efforts, such as the Political TV Ad Archive, and other collecting strategies, to provide permanent access to current political events.

As we noted in a recent blog post, the Internet Archive is “well positioned, with our mission of Universal Access to All Knowledge, to help inform the public in turbulent times, to demonstrate the power in sharing and openness.” You can help us in this mission by submitting websites that preserve the online record of this unique historical moment.

Posted in Announcements, Archive-It, News, Web Archive | 8 Comments

US Election Results

I am a bit shell-shocked: I did not think the election would go the way it did. I want to reassure everyone that we are safe; our funding, mission, and partners have no reason to change. I find this reassuring, and hopefully you do as well.

As we take the next weeks to have this sink in, I believe we will come to find we will have new responsibilities, increased roles to play, in keeping the world an open and free environment.

We are well positioned, with our mission of Universal Access to All Knowledge, to help inform the public in turbulent times, to demonstrate the power in sharing and openness.

I look forward to working with our staff, our partners, and the new partners that this creates, to see what our role should be to build the best damn library we can to serve the Maximum Public Good.

Over the next couple of weeks, please think through what we might do.  Looking forward to your ideas.

yours,

Brewster Kahle
Digital Librarian
brewster@archive.org

Posted in Announcements | 9 Comments

Aaron Swartz Weekend

by Lisa Rein, Cofounder and Coordinator, Aaron Swartz Day

In memory of Aaron Swartz, whose social, technical, and political insights still touch us daily, Lisa Rein, in partnership with the Internet Archive, will be hosting a weekend of events on Saturday, November 5 and Sunday, November 6. Friends, collaborators, and hackers can participate in a two-day Hackathon and Aaron Swartz Day Evening Reception.

Schedule of events held at the Internet Archive:

Saturday, November 5, from 10 am – 6 pm and Sunday, November 6, from 11am – 5pm — Participate in the Hackathon, which will focus on SecureDrop, the whistleblower submission system originally created by Aaron just before he passed away.

Saturday night, November 5th, from 6:30pm – 9:30pm — Celebrate and remember Aaron, and also the grand tradition of working hard to make the world a better place, at the Aaron Swartz Day Evening Celebration:

Reception: 6:30pm – 7:30pm – Come mingle with the speakers and enjoy nectar, wine & tasty nibbles.

Migrate your way upstairs: 7:30-8:00pm – We decided to give folks a little window of time to finish up their nibbles and wine at the reception, exchange contact info, and make their way upstairs to grab a seat to watch the speakers, which will begin promptly at 8pm.

Speakers 8:00pm – 9:30pm:

A Special Statement from Chelsea Manning (in celebration of this year’s Aaron Swartz Day and International Hackathon)

Tiffiniy Cheng (Co-founder and Co-director Fight for the Future)

Cindy Cohn (Executive Director, Electronic Frontier Foundation)

Shari Steele (Executive Director, Tor Project)

Yan Zhu (Security Expert, Friend of Chelsea Manning)

Alison Macrina (Founder and Executive Director, Library Freedom Project)

Conor Schaefer (DevOps Engineer, SecureDrop)

Brewster Kahle (Digital Librarian, Internet Archive) w/Vinay Goel  (Senior Data Engineer, Internet Archive)

Please RSVP to this event

For more information, contact:
lisa@lisarein.com
http://www.aaronswartzday.org

Posted in Event | 1 Comment

Election Night at the Internet Archive

The Internet Archive is informally open to our employees, their families and friends, and our community to watch the election results next Tuesday night. This is a spur-of-the-moment invitation and an experiment. If there are enough people interested, we will use the great room.

To cover the cost of pizza and soda, please purchase a $10 “ticket” on our Eventbrite.

The event will run from 6pm until the election is called — 11pm at the latest. We will limit the number of people and we reserve the right to ask anyone to leave for any reason.

If you are interested in volunteering to help that evening, please contact Salem at salem@archive.org.

Posted in News | Comments Off on Election Night at the Internet Archive

GifCities: The GeoCities Animated GIF Search Engine

 

Try the Internet Archive’s animated GIF search engine at GifCities.org! You can now get your early-web GIF fix and have a fun way to browse the web archive. Search for snowglobes or butterflies or balloons or (naturally) cats. If you click on a GIF, it brings you to the original page from the Wayback Machine. (Then please consider donating to the Archive.)

One of the goals for our 20th anniversary event last week was to highlight the amusing and wacky corners of the web, as represented in our web archive, in order to provide a light-hearted, novel perspective on the history of this amazing publication platform that we have worked to preserve over the years.

The animated GIF is perhaps the iconic, indomitable filetype of the early web. Meme-vessel, page-spacer, action-graphic-maker: GIFs are a quintessential feature of the 1990s web aesthetic, but remain just as popular today as they were twenty years ago. GeoCities, the first major web hosting platform for individual users to create their own pages, and once the third most visited site on the web before being shut down in 2009, occupies a similarly notable place in the history of the web.

So we combined these two aspects of web history by extracting every animated GIF from GeoCities in our web archive and building a search engine on top of them. Behold, for your viewing pleasure, over 4,500,000 animated GIFs (1,600,000 unique), searchable based on filename and URL path, with most GIFs linking to the archived GeoCities web page where they were originally displayed.
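To illustrate what "searchable based on filename and URL path" can mean in practice, here is a minimal sketch of a keyword index built from URL paths. It is not the GifCities implementation: the function names, the tokenization rule, and the sample URLs are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import unquote, urlparse

# Runs of letters or runs of digits count as search tokens (an assumed rule).
TOKEN_PATTERN = re.compile(r"[a-z]+|[0-9]+")


def tokenize(url):
    """Split the path of a URL (directories plus filename) into lowercase tokens."""
    path = unquote(urlparse(url).path).lower()
    return set(TOKEN_PATTERN.findall(path))


def build_index(gif_urls):
    """Map each token to the set of GIF URLs whose path contains it."""
    index = defaultdict(set)
    for url in gif_urls:
        for token in tokenize(url):
            index[token].add(url)
    return index


def search(index, query):
    """Return the URLs whose path contains every term in the query, sorted."""
    terms = TOKEN_PATTERN.findall(query.lower())
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


# Hypothetical sample URLs, invented for this example.
SAMPLE_GIFS = [
    "http://www.geocities.com/Heartland/1234/cat_dance.gif",
    "http://www.geocities.com/Area51/5678/images/butterfly.gif",
    "http://www.geocities.com/Heartland/9999/pets/cat.gif",
]

if __name__ == "__main__":
    index = build_index(SAMPLE_GIFS)
    for url in search(index, "cat"):
        print(url)
```

Matching here is on whole tokens only, so a query for "cats" would not find "cat_dance.gif"; a real search engine would likely add stemming or substring matching.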

Soft-launched at our anniversary event on Wednesday, where we also projected GifCities on the side of our headquarters in San Francisco, the project has been featured in The Guardian, BoingBoing, the A.V. Club, CNET,  and others. The GeoCities GIF collection was also made available for creative reuse by artists and researchers, and featured in work such as the GifCollider project currently showing at BAMPFA (see the videos online) and the Hall of GIFs data visualization at NCSU. Shout-outs also go to others working with the GeoCities web archive, including the Geocities Research Institute and historians. More details on the project can be found at the GifCities about page.

And yes, like every other upstanding web citizen, we GifCities’ed ourselves.

Posted in Announcements, News | 3 Comments

Making the Web More Reliable — 20 Years and Counting

As a part of our 20th anniversary, here are some highlights of tools and projects from the Internet Archive that help make the web a more reliable infrastructure for supporting our culture and commerce.

All in all, the Internet Archive is building collections and tools to help make the open web a permanent resource for current users and into the future.

Please donate to make it even better.

Thank you to the hundreds of people who have worked for the Internet Archive over the past 20 years, and to the thousands who have supported the Archive and contributed to the collections.

 

Posted in Announcements, News | 3 Comments

Searching Through Everything

With over 20 million items in the Internet Archive’s many collections, having a good way to search through them to find exactly what you want is crucial. It is equally important to be able to filter the data in flexible ways so that you see subsets of the data most relevant to you. We are pleased to offer two new features that might change everything about how you search.

Faceted Filtering

Once you’ve executed a site search, either from the search form at the top right of every page or by going to the search page directly, you’ll see a bunch of new checkboxes down the left-hand side, in addition to the search results. These checkboxes are grouped into categories, such as “Media Type” and “Topics & Subjects”.

Clicking any of the checkboxes adds the corresponding term to the search criteria, allowing you to more precisely define the filtered set of search results. Checkmarking more than one term within the same category causes items that match any of the selected terms to be displayed, whereas checkmarking items from two different categories means that only items matching both terms will be shown. Play around with it, and you’ll see how intuitive it is. Checking or unchecking new terms causes search results to be re-filtered on the fly.

We were looking for a way to provide a more powerful, visual approach to filtering search results. When we user-tested the faceted search interface, our testers loved it. It was a familiar interface already in use throughout the Internet which offered both simplicity and richness.
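The filtering rule described above (any checked term within one category, combined with all categories that have something checked) can be sketched in a few lines. This is an illustration only, not the Archive's search code; the field names `mediatype` and `subject` and the sample items are assumptions.

```python
def matches(item, selected):
    """Return True if an item satisfies the checked facets.

    `selected` maps a category name to the set of terms checked in it.
    Within a category, matching ANY checked term is enough (OR);
    across categories, EVERY category with checked terms must match (AND).
    """
    for category, terms in selected.items():
        if not terms:
            continue  # nothing checked in this category, so no constraint
        values = item.get(category, ())
        if isinstance(values, str):
            values = (values,)  # treat a single value as a one-element list
        if not terms.intersection(values):
            return False
    return True


def apply_facets(items, selected):
    """Filter a list of search results down to those matching the facets."""
    return [item for item in items if matches(item, selected)]


# Hypothetical search results, invented for this example.
ITEMS = [
    {"id": "a", "mediatype": "texts", "subject": ["history", "canada"]},
    {"id": "b", "mediatype": "audio", "subject": ["history"]},
    {"id": "c", "mediatype": "texts", "subject": ["cooking"]},
]

if __name__ == "__main__":
    checked = {"mediatype": {"texts", "audio"}, "subject": {"history"}}
    print([item["id"] for item in apply_facets(ITEMS, checked)])
```

Re-running `apply_facets` whenever a checkbox changes gives the "re-filtered on the fly" behavior, at least for result sets small enough to filter in memory.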

Full-Text Search (in Beta)

Every day, we see an average of 50,000 hits on our search pages, as you, our users, search for title, creator, and various other metadata about the items we’ve archived. But you have long asked when you would be able to search not only across all items but within them as well. For years you’ve been able to search within the text of a single book using our BookReader, but never before have you been able to search across and within all 9 million available text items at the Internet Archive in a single shot. Until now.

And here’s all you have to do: On the search page, after entering your search query in the text field, checkmark “Search full text of books” just underneath the text field, and then click or tap “GO”. That’s it! In seconds, you’ll have the results of searching through millions of texts. Note that the facets at the left work a little differently from non-full-text searches; just click or tap one to add it as a filter criterion.

At the moment, we’re still in beta. Suffice it to say, we’ve faced quite a number of challenges in configuring and populating our full-text search engine, from creating the Elasticsearch clusters to dealing with optical character recognition (OCR) issues related to strange fonts, running page headers, or language recognition. We are continuing to make improvements, and still have a ways to go.

But please use it! Try searching for some phrase that’s stuck in your head from a book long ago forgotten, and see what comes up. You now have the contents of 9 million texts at your fingertips.

Posted in Announcements, Books Archive, News | 9 Comments

More than 1 million formerly broken links in English Wikipedia updated to archived versions from the Wayback Machine

The Internet Archive, the Wikimedia Foundation, and volunteers from the Wikipedia community have now fixed more than 1 million broken outbound web links on English Wikipedia. This was possible because, in addition to other web archiving projects, the Internet Archive has been monitoring all new and edited outbound links from English Wikipedia for three years and archiving them soon after changes are made to articles. As a result of this work, as pages on the Web become inaccessible, links to archived versions in the Internet Archive’s Wayback Machine can take their place. This has now been done for the English Wikipedia, and more than 1 million links are now pointing to preserved copies of missing web content.

This story is a testament to the sharing, cooperative nature and resulting benefits of the open world.

What do you do when good web links go bad? If you are a volunteer editor on Wikipedia, you start by writing software to examine every outbound link in English Wikipedia to make sure it is still available via the “live web.” If, for whatever reason, it is no longer good (e.g. if it returns a “404” error code or “Page Not Found”) you check to see if an archived copy of the page is available via the Internet Archive’s Wayback Machine. If it is, you instruct your software to edit the Wikipedia page to point to the archived version, taking care to let users of the link know they will be visiting a version via the Wayback Machine.

That is exactly what Maximilian Doerr and Stephen Balbach have done. As a result of their work, in close collaboration with the non-profit Internet Archive and the Wikimedia Foundation’s Wikipedia Library program and Community Tech team, more than 1 million broken links have now been repaired. For example, footnote #85 from the article about Easter Island now links to https://web.archive.org/web/20071011083729/http://islandheritage.org/faq.html when before it linked to the missing page http://islandheritage.org/faq.html. Pretty cool, right?
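The repair loop described above can be sketched in a few lines of Python. This is only an illustration, not InternetArchiveBot's actual code: the names `wayback_url` and `repair_link` are hypothetical, and the liveness check and the snapshot lookup are passed in as plain callables (stubbed here with dictionaries) so the sketch runs without network access. The archived URL format is taken from the Easter Island example.

```python
from typing import Callable, Optional


def wayback_url(original: str, timestamp: str) -> str:
    """Build a Wayback Machine URL from a 14-digit capture timestamp."""
    return f"https://web.archive.org/web/{timestamp}/{original}"


def repair_link(
    url: str,
    is_live: Callable[[str], bool],
    find_snapshot: Callable[[str], Optional[str]],
) -> str:
    """Return the URL a citation should point to.

    is_live and find_snapshot are hypothetical stand-ins for an HTTP
    check and an archive lookup; both are assumptions of this sketch.
    """
    if is_live(url):
        return url  # link still works, leave it alone
    timestamp = find_snapshot(url)
    if timestamp is None:
        return url  # no archived copy known, nothing to substitute
    return wayback_url(url, timestamp)


# Stubbed data mirroring the Easter Island footnote.
dead_links = {"http://islandheritage.org/faq.html"}
snapshots = {"http://islandheritage.org/faq.html": "20071011083729"}

fixed = repair_link(
    "http://islandheritage.org/faq.html",
    is_live=lambda u: u not in dead_links,
    find_snapshot=snapshots.get,
)
print(fixed)
```

A real bot would also need to label the rewritten citation so readers know they are visiting an archived copy, which this sketch leaves out.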

“We are honored to work with the Wikipedia community to help maintain the cultural treasure that is Wikipedia,” said Brewster Kahle, founder and Digital Librarian of the Internet Archive, home of the Wayback Machine. “By editing broken outbound links on English Wikipedia to their archived versions available via the Wayback Machine, we are helping to provide persistent availability to reference information. Links that would have otherwise led to a virtual dead end.”

“What Max and Stephen have done in partnership with Mark Graham at the Internet Archive is nothing short of critical for Wikipedia’s enduring value as a shared repository of knowledge. Without dependable and persistent links, our articles lose their backbone of reliable sources. It’s amazing what a few people can do when they are motivated by sharing—and preserving—knowledge,” said Jake Orlowitz, head of the Wikipedia Library.

“Having the opportunity to contribute something big to the community with a fun task like this is why I am a Wikipedia volunteer and bot operator.  It’s also the reason why I continue to work on this never-ending project, and I’m proud to call myself its lead developer,” said Maximilian, the primary developer and operator of InternetArchiveBot.

So, what is next for this collaboration between Wikipedia and the Internet Archive? Well… there are nearly 300 Wikipedia language editions to rid of broken links. And, we are exploring ways to help make links added to Wikipedia self-healing. It’s a big job and we could use help.

Making the web more reliable… one web page at a time. It’s what we do!

A huge Thank You! to Stephen Balbach, Maximilian Doerr, Vinay Goel, Mark Graham, Brewster Kahle, John Lekashman, Kenji Nagahashi, the Wikimedia Foundation, and Wikipedia community members.

Posted in News | 3 Comments

I CAN HAZ MEME HISTORY??

It’s always going to be an open question as to what parts of culture will survive beyond each generation, but there’s very little doubt that one of them is going to be memes.

Memes are, after all, their own successful transmission of entertainment. A photo, an image that you might have seen before, comes to you with a new context. A turn of phrase, used by a politician or celebrity and in some way ridiculous or unique, comes back to you in all sorts of new ways (Imma let you finish) and ultimately gets put back into your emails, instant messages, or even back into mass media itself.

However, there are some pretty obvious questions as to what memes even are or what qualifies as a meme. Everyone has an opinion (and a meme) to back up their position.

One can say that image macros, those combinations of an expressive image with big bold text, are memes; but it’s best to think of them as one (very prominent) kind of a whole spectrum of Meme.

Image Macros rule the roost because they’re platform independent. They slip into our lives from e-mails, texts, and websites, and are even posted on walls and doors. The chosen image (in this example, from the Baz Luhrmann-directed Great Gatsby) portrays an independent idea (Here’s to you) and the text complements or contrasts it. The smallest, atomic level of an idea. And it gets into your mind, like a piece of candy (or a piece of grit).

It can get way more complicated, however. This 1980s “Internet Archive” logo was automatically generated by an online script which does the hard work of layout, fonts and blending for you. When news of this tool broke in September of 2016 (it had been around a long time before that), this exact template showed up everywhere, from nightclub flyers to endless tweets. Within a short time, the ideas of both “using a computer to do art” and “the 1980s” became part of the payload of this image, as well as the inevitable feeling it was even more cliche and tired as hundreds piled on to using it. The long-term prospects of this “1980s art” meme are unknown.

And let’s not forget that “memes” (a term coined by Richard Dawkins in his 1976 book The Selfish Gene) themselves go back decades before the internet made its first carefully engineered cross-continental connections. Office photocopiers ran rampant with passed along motivational (or de-motivational) posters, telling you that you didn’t need to be crazy to work here… but it helps! Suffering the pains of analog transfer, the endless remixing and hand touchups of these posters gave them a weathered look, as if aged by their very (relative) longevity. To many others, this whole grandparent of the internet meme had a more familiar name: Folklore.

Memes are therefore rich in history and a fundamental part of the online experience, passed along by the thousands every single day as a part of communicating with each other. They deserve study, and they’ve gotten it.

Websites have been created to describe both the contributing factors and the available examples of memes throughout the years. The most prominent has been Know Your Meme, which through several rounds of ownership and contributors has consistently provided access to the surprisingly deep dive of research a supposedly shallow “meme” has behind it.

But the very fluidity and flexibility of memes can be a huge weakness — a single webpage or a single version of an image will be the main reference point for knowing why a meme came to be, and the lifespan of these references is short indeed. Even when hosted at prominent hosting sites or as part of a larger established site, one good housecleaning or consolidation will shut off access to the information, possibly forever.

This is where the Internet Archive comes in. With hundreds of billions of saved URLs from 20 years stored in the Wayback Machine, we offer a neutral storehouse where not just the inspirations for memes but examples of the memes themselves are kept safe for retrieval beyond the fleeting fads and whims of the present.

The metaphor of “the web” turns out to be more and more apt as time goes on — like spider webs, they’re both surprisingly strong, but also can be unexpectedly lost in an instant. Connections that seemed immutable and everlasting will drop off the face of the earth at the drop of a hat (or a server, or an unpaid hosting bill).

Memes are, as I said, compressed culture. And when you lose culture, you lose context and meaning to the words and thoughts that came before. The Wayback Machine will be a part of ensuring they stick around for a long time to come.

Posted in News, Wayback Machine, Web Archive | 2 Comments

How the Internet Archive is hacking the election

There are thirteen days until Election Day — not that we’re counting.

In this most bizarre, unruly, terrifying, fascinating election year, the Internet Archive has been in the thick of it. We’re using technology to give journalists, researchers and the public the power to take the political junk food that’s typically spoon fed to all of us—the political ads, the presidential debates, the TV news broadcasts—and help us to scrutinize the labels, dig into the content, and turn that meal into something more nutritious.

Political ads. We’ve archived more than 2,600 different ads over at the Political TV Ad Archive and used the open source Duplitron created by senior technologist Dan Schultz to count nearly 300,000 airings of the TV ads across 26 media markets. We’ve linked the ads to OpenSecrets.org information on the sponsors—whether it’s a super PAC, a candidate committee, or a nonprofit “dark money” group.

Journalists have used the underlying metadata to visualize this information creatively, whether it’s the moment when anti-Trump ads started popping up in Florida (FiveThirtyEight.com), revealing how Ted Cruz favors “The Sound of Music”  (Time.com), or turning the experience of being an Iowa voter deluged with campaign ads into an 8-bit arcade-style video game (The Atlantic).

Meanwhile, our fact checking partners at FactCheck.org, PolitiFact, and The Washington Post’s Fact Checker have fact checked 116 archived ads and counting, not just for the presidential candidates but for U.S. Senate, House, and local campaigns as well. Of the 70 ads fact checked by PolitiFact reporters, nearly half have earned ratings ranging from “Mostly false” to “Pants on Fire!”

Example: this “Pants on Fire!” ad played nearly 300 times in Cleveland, Ohio, in August, where Democrat Ted Strickland is facing incumbent Senator Rob Portman, a Republican, in a competitive race. The claim: that as governor, Democrat Ted Strickland proposed deep budget cuts and then “wasted over $250,000 remodeling his bathrooms at the governor’s mansion.” While it’s true Strickland proposed budget cuts in the wake of the 2008 financial crisis, the money used to renovate the governor’s mansion didn’t come from that pool of money. What’s more, the bathrooms in question were not for the governor’s personal use, but rather for tourists who come to visit the mansion.

Presidential debates. In the recent presidential debates, the Internet Archive opened up the TV News Archive to offer near real-time broadcasts while the candidates were still on the stage. Journalists and fact checkers used this online resource to share clips of key points in the debate.

Example: during the third presidential debate, Farai Chideya, a reporter for FiveThirtyEight.com, linked to this clip in a live blog about the debate, noting that abortion is a key issue for Trump’s core supporters.

Twenty-five hours after the debate, we learned that the public made 85 quotes from our TV News Archive debate footage, and that viewers played these more than one million times—a healthy response to this brand new experiment.

TV News. When the debates were over, we used the Duplitron on TV news to tally which debate clips were shared on such networks as CNN, FOX News, and MSNBC and shows such as “Good Morning America” and the “Today show.” Journalists used our downloadable data to create visualizations to show how TV News shows present the debates to viewers.

Example: this interactive visualization in The New York Times shows readers how the different cable news networks presented the first debates, and highlights the differences between them.

The Wall Street Journal, the Economist, Fusion and The Atlantic all have used the data to visualize how the debates were portrayed for viewers. In addition, we’re keeping our eyes open and Duplitron turned on for tracking how TV news shows cover other key video. For example, we have data on how TV news shows used clips from the 2005 “Access Hollywood” tape, in which Trump bragged about groping women, and his subsequent apology.

In the thirteen days remaining before the election, we’ll continue to track airings of political ads in key battleground state markets, work with fact checking and journalist partners, and stay on the TV news beat with attention to breaking news.

And when it’s all over, we’re looking forward to working with our partners to figure out what just happened, what we’ve learned, and how we can help in the future.

Posted in Announcements, News | Comments Off on How the Internet Archive is hacking the election

10 Years of Archiving the Web Together

As the Internet Archive turns 20, the Archive-It community is proud to celebrate an anniversary of its own: 10 years of working with thousands of librarians, archivists, and others to preserve the web and build rich, expansive collections of websites for discovery and use by future generations. Eighteen partners inaugurated the Archive-It service in 2006. Since then, that list has grown to include more than 450 organizations and individuals, each with its unique goals and collecting scope. In this time they added more than 17 billion (yes, with a “b”) URLs to their collections.

Archive-It partners over the years. Clockwise from top-left: Margaret Maes (Legal Information Preservation Alliance) and Nicholas Taylor (Stanford University); James Jacobs (Stanford University) and Kent Norsworthy (University of Texas at Austin); K12 web archivists at PS 174 in Queens; Renate Giacomuzzi, Elisabeth Sporer (University of Innsbruck), and Kristine Hanna (Internet Archive)

And to give you just a hint of how the overall collection has grown: that’s about 5 billion new URLs in just the last year! They’ve captured some momentous historical events, local community history, and social and cultural activity across more than 7,000 collections to date, everything from 700+ human rights sites to the tea party movement; tobacco industry records to Mormon missionaries’ blogs. And of course who can forget all of the LOLcats? They’ve collaborated on capturing breaking news, opened doors to the next generation of curators in our K12 web archiving program, and explored their own collections in new forms with datasets leveraging our researcher services.

The Archive-It pilot website in 2005

Archive-It is Internet Archive’s web archiving service that helps institutions build, preserve, and provide access to collections of archived web content. It was developed in response to the needs of libraries, archives, historical societies, museums, and other organizations who sought to use the same powerful technology behind the Wayback Machine to curate their own web archives. The service was then the first of its kind, but has grown and expanded to meet the needs of an ever-widening scope of partners dedicated to archiving the web.

Adding a website to a collection in Archive-It 2.0, as released in July 2006.

Our pilot partners, who began testing a beta version of the service in late 2005, helped to develop and improve the essential tools that such a service would provide and used those tools to create collections, documenting local and global histories in a new way. Based on feedback from the pilot partners, the Archive-It web application launched publicly in 2006 with the most basic of curation tools: create a collection, capture content, and make it publicly available. The service and the community grew exponentially from there.

Archive-It 5.0 realtime crawl tracking.

The myriad partner-driven technical (to say nothing of aesthetic!) improvements of the last ten years are reflected in this year’s release of Archive-It 5.0, the first full redesign of the Archive-It web application since its launch. In the meantime, Archive-It continues to work with the community to preserve and provide access to amazing collections and to develop new tools for archiving the web, including new capture technologies, data transfer APIs, and more.

With year 11 (and Archive-It 5.1) just around the corner, we look forward to helping our partner institutions use new tools, build new collections, and expand the broader community working to archive the web.

Posted in News | Comments Off on 10 Years of Archiving the Web Together

Lending Launches on Archive.org, Plus Bookreader Updates

We have been loaning digital books through Open Library since 2010. We started with about 10,000 books in the lending collections, and soon there will be more than 500,000 books available.  

Today we launch lending on Archive.org, so patrons no longer need to go to Open Library to borrow books. The same parameters for borrowing apply — books are free to borrow for logged in users, and they can be borrowed for a period of 2 weeks.

For Open Library users, the lending path has changed a bit — see this post for more information.

For Archive.org users, you’re going to see many more modern books available in the coming weeks. These books will appear in collections and search results with a blue “Borrow” notice on them.

Logged in users will be able to borrow the book from the book’s details page where you see the full metadata. Remember, creating an account on archive.org is free, and so is borrowing books.

When you click “Borrow This Book” you will be taken to the new bookreader.  You can search, use the read aloud feature, zoom in and out, and change the number of pages you see at once. The book will be available in your browser for 2 weeks as long as you are connected to the Internet.

If you prefer to read your book offline, you can download a PDF or EPUB version of the book to be read in Adobe Digital Editions (free download).  You must install Adobe Digital Editions before you can read the offline version of your book.

When you want to return the book, you can return it from Adobe Digital Editions (if you chose to download) or from the bookreader.

In addition to the new borrow features, we have updated the bookreader to display better on mobile devices. The layout now changes when you are on a very small screen in order to make it easier to use.  You will see one page at a time, and some of the functions are located in the menu on the left.

If you would like to download an offline copy of the book accessible through Adobe Digital Editions (don’t forget to download the app first!), open the menu and choose “Loan Information.”

From here you can download a PDF or EPUB to read offline, or return the book.

We hope you will explore the books available for lending, and enjoy the features of the new bookreader.

Many thanks to: Richard Caceres, Brenton Cheng, Carolyn Li-Madeo, Tracey Jaquith, Jessamyn West, Jeff Kaplan, John Lekashman, Dwalu Khasu, John Gonzalez and Alexis Rossi.

Posted in Books Archive, News | Comments Off on Lending Launches on Archive.org, Plus Bookreader Updates

The New Memory Palace

By Paul D. Miller aka DJ Spooky

     “Sometimes it is the people no one can imagine anything of who do the things no one can imagine.”

– Alan Turing’s biopic, The Imitation Game, 2014

DJ Spooky at Internet Archive’s 20th Anniversary Celebration
Photo Credit: Mitchell Maher

A lot of things have changed in the last 20 years. A lot of things haven’t. We’ve moved from the tyranny of physical media to the seemingly unlimited possibilities of total digital immersion. We’ve moved from a top down, mega corporate dominated media, to a hyper-fragmented multiverse where any kind of information is accessible within reason (and sometimes without!). The fundamental issue is that “memory,” and how it responds to the digital etherealization of all aspects of the information economy we inhabit, conditions everything we do in this 21st-century culture of post-, post-, post-everything contemporary America. Whether it’s the legions of people who walk the streets with Bluetooth enabled earbuds that allow them to ignore the physical reality of the world around them, or the Pokémon Go hordes playing the world’s largest video game as it’s overlaid on stuff that happens “IRL” (In Real Life) that layer digital role playing over the world: diagnosis is pending. But the fundamental fact is clear: digital archives are more important than ever, and how we engage and access the archival material of the past shapes and molds the way we experience the present and future. Playing with the Archive is a kind of digital analytics of the subconscious impulse to collage. It’s also really fun.

Mnemosyne was the Greek muse who personified memory. She was a Titaness who was the daughter of Uranus (who represented “Sky”), the son and husband of Gaia, Mother Earth. When you break it down, Mnemosyne had a deeply complicated life, and ended up birthing the other muses with her nephew, Zeus. Ancient Greek myth was quite an incestuous place, and every deity had complicated and deeply interwoven histories that added layers and layers of what we would now call “intertextuality.” Look at it this way: a Titaness, Mnemosyne, gave birth to Urania (Muse of Astronomy), Polyhymnia (Muse of hymns), Melpomene (Muse of tragedy), Erato (Muse of lyric poetry), Clio (Muse of history), Calliope (Muse of epic poetry), Terpsichore (Muse of dance), and Euterpe (Muse of music). It’s complicated. Mnemosyne also presided over her own pool in Hades as a counterpoint to the river Lethe, where the dead went to drink to forget their previous life. If you wanted to remember things, you went to Mnemosyne’s pool instead. You had to be clever enough to find it. Otherwise, you’d end up crossing the river under the control of spirits guided by the “helmsman” whose title translates from the Greek term “kybernētēs” across the mythical river into the land of the dead aka Hades. What’s amazing about the wildly “recombinant” logic of this cast of characters is that somehow it became the foundation of our modern methods for naming almost every aspect of digital media — including the term “media.” Media, like the term data, is a plural form of a word “appropriated” directly from Latin. But the eerie resonance it has with our era comes into play when we think of the ways “the archive” acts as a downright uncanny reflection site of language and its collision between code and culture.

Until the internet, the term cyber was usually used to measure words about governance and then later evolved to how we look at computers, computer networks, and now things like augmented reality and virtual reality. The term traces back to the word cybernetics, which was popularized by the renowned mathematician Norbert Wiener, founder of Information theory, at MIT. There’s a strange emergent logic that connects the dots here: permutation, wordplay, and above all, the use of borrowed motifs and ahistorical connections between utterly unassociated material. I guess William S. Burroughs was right: the world has become a mega-Cybertron, a place where everything is mixed, cut and paste style, to make new meanings from old. With people like Norbert Wiener, cybernetics usually refers to the study of mechanical and electronic systems designed, at heart, to replace human systems. The term “cyberspace” was coined by William Gibson, to reflect the etherealized world of his 1982 classic, Burning Chrome. He used it again as a reference point for Neuromancer, his groundbreaking novel. A great, oft-cited passage gives you a sense how resonant it is with our current time:

Cyberspace. A consensual hallucination experienced daily by billions of legitimate operators, in every nation, by children being taught mathematical concepts… a graphic representation of data abstracted from the banks of every computer in the human system. Unthinkable complexity. Lines of light ranged in the nonspace of the mind, clusters and constellations of data. Like city lights, receding…

When the Internet Archive asked me to do a megamix of their archive of recordings from their data files, I was a bit overwhelmed. There’s no way any human being could comb through even the way they’ve documented just the web, let alone the material they have asked people to upload. Where to start? Sir Tim Berners-Lee’s speech inaugurating the internet back when he came up with the term the “Semantic Web?” The first recordings from Edison? That could be cool. Maybe mix that with GW Bush’s State of the Union speech inaugurating the invasion of Iraq? Why not. Take Hedy Lamarr’s original blueprints for spread spectrum “secret communications systems” and mix that with recordings of William S. Burroughs and Malcolm X, with a beat made from open source 1920’s jazz and 1950’s New Orleans blues? Why not. Grab some clips of Cory Doctorow talking about the upcoming war on open computing and mix it with Parliament Funkadelic? Sure. Take the first “sound heard around the world,” the telemetry signals guiding the Sputnik satellite as it swirled around planet Earth to become our first orbital artificial moon? Cool. Why not? Take a speech from Margaret Sanger, the woman who started Planned Parenthood, and mix it with Public Enemy? Cool. Take D.W. Griffith’s “Birth of a Nation” and re-score it with the Quincy Jones theme from “Fat Albert?” That would actually be kind of cool, but would require a lot of editing.

The basic idea here is that once you have the recordings and documentation of all aspects of human activity from the last several centuries, that is a serious “mega-mix.”

What you will hear in the short track I made is a mini reflection of the density of the sheer volume of materials that the Internet Archive has onsite. It is a humble reminder that through the computer, the network, and the wireless transmission of information, we have an immaculate reflection of what Alan Turing may have called “morphogenesis” — the human, all too human, attempt to corral the world into anthropocentric metaphors that seek to convey the sublime, the edge of human understanding: the emergent patterns that occur when you recombine material with unexpectedly powerful new connections.

Memory Palace on flexi vinyl
Photo Credit: Mitchell Maher

I’m honored to be the first DJ to start. But I’m also honored that many, many more will follow. The Archive is a mirror of infinite recombinant potential. I hope that its gift of free culture and free exchange creates a place where we will be comfortable with what is almost impossible to guess comes next. It is not a “collaborative filter” but a place where you are invited to explore on your own and come up with new ways of seeing the infinite memory palace of the fragments of history, time, and space that make this modern 21st century world work.

Enjoy.

Paul D. Miller aka DJ Spooky’s work ranges from creating the first DJ app to producing an impactful DVD anthology about the “Pioneers of African American Cinema.” According to a New York Times review, “there has never been a more significant video release than ‘Pioneers of African-American Cinema.’” The prolific innovator and artist also created 13 music albums and is about to release a fourteenth. Called “Phantom Dancehall,” it is an intense mix of hip hop, Jamaican ska and dancehall culture.

Posted in Announcements, Event, Music, News | 1 Comment

20,000 Hard Drives on a Mission

The mission of the Internet Archive is “Universal Access to All Knowledge.” The knowledge we archive is represented as digital data. We often get questions related to “How much data does Internet Archive actually keep?” and “How do you store and preserve that knowledge?”

All content uploaded to the Archive is stored in “Items.” As with items in traditional libraries, Internet Archive items are structured to contain a single book, or movie, or music album — generally a single piece of knowledge that can be meaningfully cataloged and retrieved, along with descriptive information (metadata) that usually includes the title, creator (author), and other curatorial information about the content. From a technical standpoint, items are stored in a well-defined structure within Linux directories.

Once a new item is created, automated systems quickly replicate that item across two distinct disk drives in separate servers that are (usually) in separate physical data centers. This “mirroring” of content is done both to minimize the likelihood of data loss or data corruption (due to unexpected hard drive or system failures) and to increase the efficiency of access to the content. Both of these storage locations (called “primary” and “secondary”) are immediately available to serve their copy of the content to patrons… and if one storage location becomes unavailable, the content remains available from the alternate storage location.

We refer to this overall scheme as “paired storage.” Because of the dual-storage arrangement, when we talk about “how much” data we store, we usually refer to what really matters to the patrons — the amount of unique compressed content in storage — that is, the amount prior to replication into paired-storage. So for numbers below, the amount of physical disk space (“raw” storage) is typically twice the amount stated.

As we have pursued our mission, the need for storing data has grown. In October of 2012, we held just over 10 petabytes of unique content. Today, we have archived a little over 30 petabytes, and we add between 13 and 15 terabytes of content per day (web and television are the most voluminous).
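To make the relationship between unique content, raw disk space, and growth concrete, here is a small arithmetic sketch using only the figures quoted above. It assumes decimal units (1 petabyte = 1,000 terabytes), which the text does not state, and treats "a little over 30" and "just over 10" petabytes as round numbers.

```python
UNIQUE_PB = 30           # unique content today, from the text
UNIQUE_PB_2012 = 10      # unique content in October 2012, from the text
COPIES = 2               # paired storage keeps two copies of every item
DAILY_TB = (13, 15)      # range of daily additions, from the text
TB_PER_PB = 1000         # assumption: decimal units

raw_pb = UNIQUE_PB * COPIES
growth_factor = UNIQUE_PB / UNIQUE_PB_2012
# Days needed to add one petabyte of unique content at each daily rate.
days_per_pb = tuple(TB_PER_PB / tb for tb in DAILY_TB)

print(f"raw storage: about {raw_pb} PB")
print(f"growth since October 2012: about {growth_factor:.0f}x")
print(f"days per new unique PB: {days_per_pb[1]:.0f} to {days_per_pb[0]:.0f}")
```

Under these assumptions the Archive adds a new petabyte of unique content (two petabytes of raw disk) roughly every ten weeks.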

Currently, Internet Archive hosts about 20,000 individual disk drives. Each of these is housed in a specialized computer (we call them “datanodes”) that has 36 data drives (plus two operating system drives) per machine. Datanodes are organized into racks of 10 machines (360 data drives), and interconnected via high-speed ethernet to form our storage cluster. Even though our content storage has tripled over the past four years, our count of disk drives has stayed about the same. This is because of improvements in disk drive technology. Datanodes that were once populated with 36 individual 2-terabyte (2T) drives are today filled with 8-terabyte (8T) drives, moving single node capacity from 72 terabytes (64.8T formatted) to 288 terabytes (259.2T formatted) in the same physical space! This evolution of disk density did not happen in a single step, so we have populations of 2T, 3T, 4T, and 8T drives in our storage clusters.

Our data mirroring scheme ensures that information stored on any specific disk, on a specific node, and in a specific rack is replicated to another disk of the same capacity, in the same relative slot, and in the same relative datanode in another rack, usually in another datacenter. In other words, data stored on drive 07 of datanode 5 of rack 12 of Internet Archive datacenter 6 (fully identified as ia601205-07) has the same information stored in datacenter 8 (ia8) at ia801205-07. This organization and naming scheme keeps tracking and monitoring 20,000 drives manageable for a small team.
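The capacity figures above follow from simple multiplication, sketched below. The 0.9 formatted-to-raw ratio is not stated in the text; it is inferred from the quoted pairs (64.8/72 and 259.2/288). The datanode count is a rough estimate from "about 20,000" drives and ignores operating system drives.

```python
from typing import Tuple

DRIVES_PER_NODE = 36
NODES_PER_RACK = 10
FORMATTED_FRACTION = 0.9   # inferred from 64.8/72 and 259.2/288 in the text


def node_capacity_tb(drive_tb: float) -> Tuple[float, float]:
    """Return (raw, formatted) capacity of one datanode in terabytes."""
    raw = DRIVES_PER_NODE * drive_tb
    return raw, round(raw * FORMATTED_FRACTION, 1)


for size in (2, 3, 4, 8):
    raw, formatted = node_capacity_tb(size)
    print(f"{size}T drives: {raw} TB raw, {formatted} TB formatted")

drives_per_rack = DRIVES_PER_NODE * NODES_PER_RACK
approx_nodes = 20_000 // DRIVES_PER_NODE   # rough estimate only
print(f"{drives_per_rack} data drives per rack, roughly {approx_nodes} datanodes")
```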
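The naming scheme makes it possible to compute a drive's mirror from its identifier alone. The sketch below is a guess at how that might look: the field layout is inferred from the single example `ia601205-07`, the 6-to-8 datacenter pairing is the only one the text mentions, and the function name `mirror_of` is hypothetical.

```python
import re

# Layout inferred from the single example "ia601205-07": one datacenter
# digit, five digits locating rack and datanode, then a two-digit drive slot.
NAME = re.compile(r"^ia(?P<dc>\d)(?P<rest>\d{5})-(?P<drive>\d{2})$")

# Datacenter pairing taken from the example (6 <-> 8); other pairs are unknown.
MIRROR_DC = {"6": "8", "8": "6"}


def mirror_of(drive_id: str) -> str:
    """Return the identifier of the drive holding the mirrored copy."""
    match = NAME.match(drive_id)
    if match is None:
        raise ValueError(f"unrecognized drive identifier: {drive_id!r}")
    dc = match["dc"]
    if dc not in MIRROR_DC:
        raise ValueError(f"no known mirror for datacenter {dc}")
    # Only the datacenter digit changes; rack, node, and slot stay the same.
    return f"ia{MIRROR_DC[dc]}{match['rest']}-{match['drive']}"


print(mirror_of("ia601205-07"))
```

Because the mapping is symmetric, applying `mirror_of` twice returns the original identifier, which is a convenient sanity check for monitoring scripts.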

We maintain our datacenters at ambient temperatures and humidity, meaning that we don’t incur the cost of operating and maintaining an air-conditioned environment (although we do use exhaust fans in hot weather). This keeps our power consumption down to just the operational requirements of the racks (about 5 kilowatts each), but does put some constraints on environmental specifications for the computers we use as data nodes. So far, this approach has (for the most part) worked in terms of both computer and disk drive longevity.

Of course, disk drives all eventually fail. So we have an active team that monitors drive health and replaces drives showing early signs of failure. We replaced 2,453 drives in 2015, and 1,963 year-to-date 2016… an average of 6.7 drives per day. Across all drives in the cluster the average “age” (arithmetic mean of the time in-service) is 779 days. The median age is 730 days, and the most tenured drive in our cluster has been in continuous use for 6.85 years!
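For readers who want to check the replacement numbers, here is a rough back-of-the-envelope sketch. It treats the fleet as a constant 20,000 drives, which is only the approximate figure quoted earlier, so the fleet share should be read as an estimate.

```python
# Replacement counts quoted above. FLEET_SIZE is the approximate "about 20,000"
# figure from earlier in the post and is treated as constant (an assumption).
REPLACED_2015 = 2453
REPLACED_2016_YTD = 1963
FLEET_SIZE = 20_000


def per_day(count, days):
    """Average replacements per day over a period of the given length."""
    return count / days


def implied_days(total, rate_per_day):
    """Number of days a total would span at the given daily rate."""
    return total / rate_per_day


if __name__ == "__main__":
    total = REPLACED_2015 + REPLACED_2016_YTD
    print(f"2015 rate: {per_day(REPLACED_2015, 365):.1f} drives per day")
    print(f"Total replaced: {total}")
    print(f"Days spanned at 6.7 per day: {implied_days(total, 6.7):.0f}")
    print(f"2015 share of fleet: {100 * REPLACED_2015 / FLEET_SIZE:.0f}%")
```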

So what happens when a drive does fail? Items on that drive are made “read only” and our operations team is alerted. A new drive is put in to replace the failed one and immediately after replacement, the content from the mirror drive is copied onto the fresh drive and read/write status is restored.

Although there are certainly alternatives to drive mirroring to ensure data integrity in a large storage system (ECC systems like RAID arrays, Ceph, Hadoop, etc.), Internet Archive chooses the simplicity of mirroring in part to preserve the transparency of data on a per-drive basis. The risk of ECC approaches is that in the case of truly catastrophic events, falling below certain thresholds of disk population survival means a total loss of all data in that array. The mirroring approach means that any disk that survives the catastrophe has usable information on it.
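The difference between the two failure modes can be sketched with a deliberately simplified model. This is not how any particular RAID or erasure-coded system behaves in detail. It only captures the all-or-nothing threshold described above, and both function names are hypothetical.

```python
def mirrored_recoverable(pairs):
    """Fraction of mirrored pairs whose content is still readable.

    pairs: list of (primary_alive, mirror_alive) booleans, one per pair.
    A pair's content survives if either of its two disks survives.
    """
    if not pairs:
        return 0.0
    readable = sum(1 for primary, mirror in pairs if primary or mirror)
    return readable / len(pairs)


def threshold_recoverable(alive, needed):
    """Fraction of an array's content that is readable in a threshold scheme.

    alive: list of booleans, one per disk in the array.
    needed: minimum number of surviving disks required to rebuild the data.
    Simplified model: everything is readable at or above the threshold,
    nothing below it.
    """
    return 1.0 if sum(alive) >= needed else 0.0


if __name__ == "__main__":
    # Ten disks, four survivors, arranged as five mirrored pairs.
    pairs = [(True, False), (False, False), (True, True),
             (False, True), (False, False)]
    print(mirrored_recoverable(pairs))

    # The same ten disks as one array that needs eight survivors to rebuild.
    alive = [True, False, False, False, True, True, False, True, False, False]
    print(threshold_recoverable(alive, needed=8))
```

In this toy scenario the mirrored layout still serves the content of three of its five pairs, while the threshold array, with four of ten disks left and eight required, serves nothing.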

Over the past 20 years, Internet Archive has learned many lessons related to storage. These include: be patient in adopting newly introduced technology (wait for it to mature a bit!); with ambient air comes ambient humidity, so plan for it; uniformity of infrastructure components is essential (including disk firmware). One of several challenges we see on the horizon is a direct consequence of the increases in disk density: it takes a long time to move data to and from a high-capacity disk. Across pair-bonded 1Gbps node interconnects, transferring data to or from an 8T drive requires 8 hours and 11 minutes at “full speed” and in practice can extend to several days with network traffic and activity interruptions. This introduces a longer “window of vulnerability” for the unlikely “double-disk failure” scenario (both sides of the mirror becoming unusable). To address this we are looking at increased speeds for node-to-node networking as well as alternative storage schemes that compensate for this risk.

As a final note, I want to thank the small team of extremely hard-working individuals at Internet Archive who maintain and evolve the compute and storage infrastructure that enables us to pursue our mission and service our patrons. Without their hard work and dedicated service, we would not be able to store and preserve the knowledge and information that the community works hard to collect and curate.

Thank you to the 2015-2016 Core Infrastructure Team (and contributors):
Andy Bezella, Hank Bromley, Dwalu Khasu, Sean Fagan, Ralf Muehlen, Tim Johnson, Jim Nelson, Mark Seiden, Samuel Stoller, and Trevor von Stein

-jcg (John C. Gonzalez)


FAQs for some new features available in the Beta Wayback Machine


The Beta Wayback Machine has some new features including searching to find a website and a summary of types of media on a website.

How can I use the Wayback Machine’s Site Search to find websites? The Site Search feature of the Wayback Machine is based on an index built by evaluating terms from hundreds of billions of links to the homepages of more than 350 million sites. Search results are ranked by the number of captures in the Wayback and the number of relevant links to the site’s homepage.

Can I find sites by searching for words that are in their pages? No, at least not yet. Site Search for the Wayback Machine will help you find the homepages of sites, based on words people have used to describe those sites, as opposed to words that appear on pages from sites.

Can I search sites with text from multiple languages? Yes! In fact, you can search for any Unicode character. If you can generate characters with your computer, you should be able to use them to search for sites via the Wayback Machine. Go ahead, try searching for правда

Can I still find sites in the Wayback Machine if I just know the URL? Yes, just enter a domain or URL the way you have in the past and press the “Browse History” button.

What is the “Summary of <site>” link above the graph on the calendar page telling me? It shows you the breakdown of the web captures for a given domain by content type (text, images, videos, PDFs, etc.). In addition, it shows the number of captures, URLs and new URLs by year, for all the years available via the Wayback Machine, so you can see how a certain site has changed over time.

What are the sources of your captures? When you roll over individual web captures (which pop up when you roll over the dots on the calendar page for a URL) you may notice that some text links show up above the calendar, along with the word “why”. Those links will take you to the Collection of web captures associated with the specific web crawl the capture came from. Every day hundreds of web crawls contribute to the web captures available via the Wayback Machine. Behind each, there is a story about factors like who, why, when and how.

Why are some of the dots on the calendar page different colors? We color the dots, and links, associated with individual web captures, or multiple web captures, for a given day. Blue means the web server result code the crawler got for the related capture was a 2nn (good); green means the crawler got a status code 3nn (redirect); orange means the crawler got a status code 4nn (client error); and red means the crawler saw a 5nn (server error). Most of the time you will probably want to select the blue dots or links.
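If it helps to see the color rule spelled out, here is a minimal sketch of the mapping. The function name `dot_color` is hypothetical and this is not the Wayback Machine's own code. Status codes outside the four ranges described above return None because the answer does not say how they are displayed.

```python
def dot_color(status_code):
    """Map an HTTP status code to the calendar dot color described above.

    Returns None for codes outside the 2nn-5nn ranges, whose color the
    FAQ does not specify.
    """
    if 200 <= status_code <= 299:
        return "blue"    # 2nn: good
    if 300 <= status_code <= 399:
        return "green"   # 3nn: redirect
    if 400 <= status_code <= 499:
        return "orange"  # 4nn: client error
    if 500 <= status_code <= 599:
        return "red"     # 5nn: server error
    return None


if __name__ == "__main__":
    for code in (200, 301, 404, 503):
        print(code, dot_color(code))
```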

Can I find sites by searching for a word specific to that site? Yes, by adding in “site:<domain>” your results will be restricted to the specified domain. E.g. “site:gov clinton” will search for sites related to the term “clinton” in the domain “gov”.
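As an illustration of how a “site:” restriction can work in principle, here is a small sketch that separates the domain filter from the search terms and checks whether a host falls inside that domain. It is an assumption-laden example, not a description of how Site Search is actually implemented, and `split_site_query` and `in_domain` are hypothetical names.

```python
def split_site_query(query):
    """Split a query into (domain, terms).

    A token of the form 'site:<domain>' sets the domain restriction.
    All other tokens are treated as search terms.
    """
    domain = None
    terms = []
    for token in query.split():
        if token.lower().startswith("site:") and len(token) > len("site:"):
            domain = token[len("site:"):].lower()
        else:
            terms.append(token)
    return domain, terms


def in_domain(host, domain):
    """Return True if host equals the domain or is a subdomain of it."""
    if domain is None:
        return True
    host = host.lower()
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    domain, terms = split_site_query("site:gov clinton")
    print(domain, terms)
    print(in_domain("www.whitehouse.gov", domain))
    print(in_domain("example.com", domain))
```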
