Author Archives: Brewster Kahle

Thank you for helping us increase our bandwidth

Last week the Internet Archive upped our bandwidth capacity 30%, based on increased usage and increased financial support.  Thank you.

This is our outbound bandwidth graph that has several stories to tell…

A year ago, usage was 30Gbits/sec. At the beginning of this year, we were at 40Gbits/sec, and we were handling it.  That is 13 Petabytes of downloads per month.  This has connected millions of users to materials in the Wayback Machine, and served those listening to 78 RPMs, browsing digitized books, streaming from the TV archive, etc.  We were about the 250th most popular website according to Alexa Internet.
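The 13 Petabyte figure follows directly from the sustained rate. Here is a minimal sketch of that arithmetic, assuming a 30-day month, decimal units (1 Gbit = 10^9 bits, 1 PB = 10^15 bytes), and a rate that holds around the clock. The function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def petabytes_per_month(gbits_per_sec, days=30):
    """Convert a sustained rate in Gbit/s to decimal petabytes per month.

    Assumes the rate is constant for the whole month, which real traffic is not.
    """
    bits = gbits_per_sec * 10**9 * SECONDS_PER_DAY * days
    return bits / 8 / 10**15


# The three usage levels mentioned in this post.
for rate in (30, 40, 50):
    print(f"{rate} Gbit/s is about {petabytes_per_month(rate):.1f} PB per month")
```

At 40 Gbit/s this works out to just under 13 PB per month, which matches the number quoted above.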

Then Covid-19 hit and demand rocketed to 50Gbits/sec and overran our network infrastructure’s ability to handle it.  So much so, our network statistics probes had difficulty collecting data (hence the white spots in the graphs).   

We bought a second router with new line cards, and got it installed and running (and none of this is easy during a pandemic), and increased our capacity from 47Gbits/sec peak to 62Gbits/sec peak.   And we are handling it better, but it is still consumed.

Alexa Internet now says we are about the 160th most popular website.

So now we are looking at the next steps up, which will take more equipment and more wizardry, but we are working on it.

Thank you again for the support, and if you would like to donate more, please know it is going to build collections to serve millions.  https://archive.org/donate

The National Emergency Library – Who Needs It? Who Reads It? Lessons from the First Two Weeks

At a time when every day can feel like a month, it’s hard to believe that the National Emergency Library has only existed for two weeks. Recognizing the unique challenges of connecting students and readers with books now on shelves they cannot reach, the Internet Archive loosened the restrictions on our controlled digital lending library to allow increased lending of materials. Reactions have been passionate, to say the least—elation by teachers able to  access our virtual stacks, concern by authors about the program’s impact, and fundamental questions about our role as a library in these dire times when one billion students worldwide are cut off from their classrooms and libraries.

For those of you who are being introduced to us for the first time due to the National Emergency Library: Welcome! The doors of the Internet Archive have been open for nearly 25 years and we’ve served hundreds of millions of visitors—we’ve always got room to welcome one more. And for those of you who have tracked our evolution through the years, we know you have questions.

When we turned off waitlists for our lending library on March 24th, it was in response to messages and requests we’d been getting from many sources: librarians who were closing their doors in response to lockdowns, school teachers who were concerned their students could no longer do research and discovery through the primary sources they had on campus, and organizations we respected who knew we had the capability to fill an unexpected gap. It was a need that we knew we could meet quickly.

We moved in “Internet Time” and the speed and swiftness of our solution surprised some and caught others off guard. In our rush to help we didn’t engage with the creator community and the ecosystem in which their works are made and published. We hear your concerns and we’ve taken action: the Internet Archive has added staff to our Patron Services team and we are responding quickly to the incoming requests to take books out of the National Emergency Library. While we can’t go back in time, we can move forward with more information and insight based on data the National Emergency Library has generated thus far.

The Internet Archive takes reader privacy seriously, so we don’t have specific analytics or logs to share (we took the government to court to ensure we didn’t have to do that), but we do have some general information that may be of use to authors, publishers and readers about the ways patrons are using the National Emergency Library. We will be sharing more in the coming weeks of this crisis.

Majority of books are borrowed for less than 30 minutes

Even with a preview function where readers can see the first few pages of a book, most people who go through the check out process are looking at the book for less than 30 minutes, with no more interactions until it is automatically returned two weeks later. We suspect that fewer than 10% of books borrowed are actually opened again after the first day (but we have more work to do to confirm this). Patrons may be using the checked-out book for fact checking or research, but we suspect a large number of people are browsing the book in a way similar to browsing library shelves.

The total number of books that are checked out and read is about the number of books borrowed from a town library

Trying to compare a physical check-out of a book with a digital check-out is difficult. Assuming that the number of physical books borrowed from a library corresponds to digitally borrowed books that are read after the first day, then the Internet Archive currently lends about as many as a US library that serves a population of about 30,000.

Our usage pattern may be more like a serendipitous walk through a bookstore or the library stacks. In the real world, a patron takes a book off the shelf, flips through to see if it’s of interest, and then either selects the book or puts it back on the shelf. However, in our virtual library, to flip fully through the book you have to borrow it. The large number of books that have no activity beyond the first few minutes of interaction suggests patrons are using our service to browse books.

90% of the books borrowed were published more than 10 years ago, two-thirds were published during the 20th century

The books in the National Emergency Library were published between 1925 and 5 years ago, because books older than that are in the public domain—out of copyright and fully downloadable. Books newer than 5 years are not in the National Emergency Library. Unlike the age of most books in bookstores, the books readers are borrowing are older books, with 10% being from the last 10 years. Two-thirds of these books were published during the 20th century.

And when people find what they need, it solves a problem, such as this subject librarian who found a book published in 1975:

A bit of Fun: Some of the least common subject categories of borrowed books

These subject tags come from library catalog records and from annotations by organizations such as ISKME, which tagged the Universal School Library collection to aid search and discovery of resources for educators.

We’ll continue to glean and share what we can as this project continues and we hope that the needs that gave rise to the National Emergency Library come to an end soon.

When school’s out, what will we learn?

More than 100 countries have closed their schools, including 43 states in the U.S.

Forty years ago as a freshman, I pulled my first book off the shelves of Hayden Library at MIT. This month, every MIT undergraduate departed from campus in an attempt to contain COVID-19, leaving behind the vast resources of that library. Ready or not, we are all being thrust into an enormous experiment in online learning. One that can have positive and permanent outcomes, if we handle it right.

With schools closing from Changshu to Cambridge, suddenly students are cut off from the physical resources they rely on: the teachers, the classrooms and libraries that are the backbone of learning. And in this flux, those in marginalized communities, from rural areas without broadband to schools with few online books, are even more profoundly challenged. The Economist reports that in the United States, “7 million school-age children cannot access the internet at home.”

“If this is just a prolonged pause in our education and economy, without the benefits of learning and adapting, one of the most profound impacts of COVID-19 may be…a “quiet brain drain.” It will be time our children never get back.”

But here’s the good news: we know how to do this, to impart knowledge at scale over the Internet. Online courses, online libraries and broadband all exist—but we need to expand and upgrade them to meet the needs of the close to one billion learners around the world whose classrooms have been shuttered.

24 years ago, I founded the Internet Archive, a nonprofit digital library that now serves more than a million learners every day. Today, the Internet Archive is working with hundreds of public, school and university libraries to digitize their core collections and make them freely available over the Internet. Even as MIT was sending students home, we were working with MIT Libraries to see how many of their books we have already digitized. In 24 hours, we were able to hand them back 166,000 digitized books to lend online through their catalogue and via archive.org. This week, the Internet Archive created a National Emergency Library of 1.4 million digitized books to serve the needs of students, educators and learners who can now access them from home.

At archive.org/nel or OpenLibrary.org, you can borrow 1.4 million digitized books for free during the COVID-19 crisis.

Think of this as a huge experiment. In one big push, we can improve online learning and its infrastructure in a way that may otherwise have taken years. This crisis encourages universities to be bold, to make investments that ultimately may mean many more students can benefit. Perhaps 500 undergraduates can fill a hall at MIT, but how many millions can take an online MIT course, once the books, materials and lessons are online?

China is a few weeks ahead of the United States when it comes to experimenting with online learning. In January, my son, Caslon, was teaching English to 4th graders in Changshu. Now he is teaching them from San Francisco, with recorded lessons and online interaction. Next month, his school in China is poised to reopen, but I suspect it will be forever changed.

If this is just a prolonged pause in our education and economy, without the benefits of learning and adapting, one of the most profound impacts of COVID-19 may be what Dr. Kate Tairyan, Chief Medical Officer of the online college NextGenU.org, calls a “quiet brain drain.” It will be time our children never get back.

But we have the opportunity to harness American ingenuity to build a stronger, more robust educational system—by leveraging the Internet, new technologies, and our investments in digitizing books at scale into something that democratizes learning for a generation to come.

Brewster Kahle is the founder and Digital Librarian of the Internet Archive. A passionate advocate for public Internet access and a successful entrepreneur, he has spent his career intent on a singular focus: providing Universal Access to All Knowledge. Kahle graduated from the Massachusetts Institute of Technology, where he studied artificial intelligence.

Internet Archive Staff and Covid-19: Work-at-Home for Most, Full-Pay Furlough & Medical for Scanners

This is an unsettling time, and the Internet Archive has been working with staff, partner libraries, and patron communities to weather this storm.

Our staff and community are the core of who we are– we are not the data, we are people. We care deeply and have been taking the following steps to support staff.

Most of the Internet Archive staff now work at home– this is going well: zoom, slack, jitsi, whereby, google docs, broadband– the miracles of our Internet world make this possible.  Fortunately, we had already become a largely distributed staff because of prices in San Francisco and our interest in engaging the best people we could no matter where they live.

For the 50 book scan center staff that work in libraries that are now closed, we do not have enough productive remote work and no paid work. (Libraries paying for our scanning services is a major source of earned income for the Internet Archive.) For these important employees we are leveraging government assistance to accomplish a furlough for 3 months at regular pay with medical benefits. So our scanners are safe, not working, and paid.

Figuring out how to do this in England, the US, and Canada has been challenging, especially trying to leverage ever-changing government subsidies.  Fortunately England announced added help for furloughed workers, and the United States seems to be working on expanded benefits. We always look to save money but we will make sure our furloughed employees are fully paid with medical during this period in any case. We have made sure they are safe now and that they know we want them to come back to work.

For the few that will not have jobs after the lights come back on, based on org changes, we have supported them at a higher level than those on furlough to help them through this time and relaunch.

To pay for these measures, we have gotten some donations and some employees have offered to work 4 days a week for the coming months to help, but it will hurt. Your support is most welcome.

Thankfully, so far, the libraries that support us are planning to restart scanning when it is safe to do so. Based on the now-apparent need to digitize modern books for remote digital access, we hope more libraries will support our scanning services.  

With strong staff and partnerships we can grow to produce new services that are appropriate for these times such as the National Emergency Library that is now lending books to thousands of displaced students.

Thank you for your support and stay safe.

Libraries and Publishing Now– Viva la Library!

Readers consume publishers’ products many hours every day– and consume on publishers’ terms. Publishers’ framing on our screens, publishers’ business models, publishers’ flow and pacing. Yes, there are many publishers now, but we are, mostly, locked into their presentation forms. We check into their black box theaters and consume as intended.

Libraries have always bought publishers’ products but have traditionally offered alternative access modes to these materials, and can again. As an example let’s take newspapers. Published with scoops and urgency, yesterday is “old news,” the paper it was printed on is then only useful the next day as “fish wrap”– the paper piles up and we feel guilty about the trash. That is the framing of the publisher: old is useless, new is valuable. This has carried into social media– flip up to read on. Scroll through your “feed” (gosh, the word “feed” is illustrative, what happens after “feed” is “fed”?  Well, it comes out the other end in a way we do not cherish 🙂 ).

But a library gives old news a new life, not a commercial life, but a life that encourages reflection, perspective, critique, analysis. In a word– “History”. The library keeps the former “news” and offers it in new ways in a new framing, with new tools– not just flip flip flip. It can be quoted and placed side by side with other publishers’ news, and it enables researchers to inject commentary.

This capture, representation, searching, rethinking is not a crime– it is thought, it is memory and our history– it builds to become our culture. It has been supported, nurtured, taught.

But the library is in danger in our digital world. In print, one could keep what one had read. In digital that is harder technically, and publishers are specifically making it harder. Technical enforcement measures and laws are making remembering difficult, and worse, a crime.

Libraries live to offer new ways to see published works that were often produced for a different purpose. But this is difficult in a digital world.

Digital newspapers sometimes disappear from their web presence. App-based newspapers cannot be pointed to with a citation or URL. Archives, sometimes available, are segmented into each publisher’s platforms.

Similarly, digital books live in proprietary digital book readers that disappear the books. If “cut and paste” functions at all, it often works just inside that “platform.” Annotations are stored with the vendor, with their terms and conditions.

A personal library now means a purchase list on a website.

Libraries and publishers have lived together throughout the paper era, not always peacefully, but libraries were possible because of paper technologies, laws, and funding. Multiple copies were kept in different libraries ensuring preservation and creating different access modes for different communities.

Once publications became electronic, preservation and access became harder. Radio and television did not fit into the library mold. Early tele-text, Lexis-Nexis, Westlaw, and AOL really did not work as library collections in traditional libraries. Academic journal publishing shifted to digital and libraries moved to serve as customer service departments for leased database access.

Some of us helped build the Internet so digital works could be archived and “libraried”. And then we made archives of Web pages and created services around them.

But it turns out that few of us did this, and the biggest, Google, did it privately and for profit.  The Internet Archive was created to help and has archived billions of Web Pages, millions of hours of TV and radio, millions of books, records, movies and software.

Most traditional libraries have done little to preserve digital materials. The Internet Archive is unusual in focusing on this mission and, I would say, under-supported. Encouragingly, however, 100,000 individuals a year now donate to support the Internet Archive’s public services. Hope is there.

We need libraries of digital materials, tools to use these libraries, and ways to protect them, fund them and integrate them into schools and our lives more generally. This way we can remember, think, and build on the past.

With so much in digital form, and storage and communication so easy, it should be the librarian’s day!  It can be the library user’s day…

Let’s build that world… of preservation and access, of reflection and critique, with confidence that what happened actually happened so that our histories can rely on immutable evidence.

Libraries do not command the world, but libraries are necessary in the functioning of a thoughtful world.

Thank you for supporting the Internet Archive.

Viva la Library!

Weaving Books into the Web—Starting with Wikipedia

[announcement video, Wired]

The Internet Archive has transformed 130,000 references to books in Wikipedia into live links to 50,000 digitized Internet Archive books in several Wikipedia language editions including English, Greek, and Arabic. And we are just getting started. By working with Wikipedia communities and scanning more books, both users and robots will link many more book references directly into Internet Archive books. In these cases, diving deeper into a subject will be a single click.

Moriel Schottlender, Senior Software Engineer, Wikimedia Foundation, speech announcing this program

“I want this,” said Brewster Kahle’s neighbor Carmen Steele, age 15, “at school I am allowed to start with Wikipedia, but I need to quote the original books. This allows me to do this even in the middle of the night.”

For example, the Wikipedia article on Martin Luther King, Jr cites the book To Redeem the Soul of America, by Adam Fairclough. That citation now links directly to page 299 inside the digital version of the book provided by the Internet Archive. There are 66 cited and linked books on that article alone. 

In the Martin Luther King, Jr. article of Wikipedia, page references can now take you directly to the book.
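As a rough illustration of what such a page-level link might look like, here is a small sketch. The `https://archive.org/details/<identifier>` form is the item address used elsewhere on this blog. The `/page/<number>` suffix is an assumption about the reader's URL pattern, and both the function name and the identifier below are hypothetical placeholders.

```python
from urllib.parse import quote


def page_link(identifier, page):
    """Build a link to one page of a digitized book.

    Assumption: the book reader accepts a /page/<number> suffix after the
    item address. The identifier passed in is whatever the item is called
    on archive.org; this sketch does not look it up or validate it.
    """
    return f"https://archive.org/details/{quote(identifier)}/page/{quote(str(page))}"


# "example-book-id" is a made-up identifier, used only for illustration.
print(page_link("example-book-id", 299))
```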

Readers can see a couple of pages to preview the book and, if they want to read further, they can borrow the digital copy using Controlled Digital Lending in a way that’s analogous to how they borrow physical books from their local library.

“What has been written in books over many centuries is critical to informing a generation of digital learners,” said Brewster Kahle, Digital Librarian of the Internet Archive. “We hope to connect readers with books by weaving books into the fabric of the web itself, starting with Wikipedia.”

You can help accelerate these efforts by sponsoring books or funding the effort. It costs the Internet Archive about $20 to digitize and preserve a physical book in order to bring it to Internet readers. The goal is to bring another 4 million important books online over the next several years.  Please donate or contact us to help with this project.

From a presentation on October 23, 2019 by Moriel Schottlender, Tech lead at the Wikimedia Foundation.

“Together we can achieve Universal Access to All Knowledge,” said Mark Graham, Director of the Internet Archive’s Wayback Machine. “One linked book, paper, web page, news article, music file, video and image at a time.”


Thank you for the donation of 78rpm records from a Craigslist poster

Mark Ellis alerted us to a Craigslist post of a storage locker of records being offered for free in San Jose in 2 hours. The owner wanted them gone. The Internet Archive sprang into action and our truck rolled.

Lots of people who wanted specific records for free had responded to the ad, but not many wanted 78rpm records. We love 78rpm records. We preserve them and digitize ones we do not have for the Great 78 Project.  At the end we got 1 pallet full of 78’s, maybe 2,700 discs, and they are queued for digitization.

Thank you to Joey Myers for posting on Craigslist, to Mark Ellis for alerting Jason Scott of the Archive, and the Archive staff that jumped on it.

Correct Metadata is Hard: a Lesson from the Great 78 Project

We have been digitizing about 8,000 78rpm record sides each month and now have 122,000 of them done. These have been posted on the net and over a million people have explored them. We have been digitizing, typing the information on the label, and linking to other information like discographies, databases, reviews and the like.

Volunteers, users, and internal QA checkers have been pointing out typos, so we decided to go back over a couple of months’ metadata and found problems. Then we contracted with professional proofreaders and they found even more (2% of the records at this point had something to point out; some are matters of opinion or aesthetics, some lead to corrections).

We are going to pay the professional proofreaders to correct the 5 most important fields for all 122,000 records, but can use more help. We are pointing these out here in hopes of interesting volunteer proofreaders and to share our experience in continually improving our collections.

Here are some of the issues with the primary performer field that we have now corrected in the June 2019 transfers, shown as before | after. We hope to upload the corrections in the next couple of weeks:

Jose Melis And His Latin American Ensemble | Jose Melis And His-Latin American Ensemble
Columbia-Orchestra | Columbia-Orchester
S. Formichi and T. Chelotti | S. Formichi e T. Chelotti
Dennis Daye and The Rhythmaires | Dennis Day and The Rhythmaires
Harry James and His Orchestra | Harry James and His Orch.
Charles Hart & Elliot Shaw | Charles Hart & Elliott Shaw
Peerless Quartet | Peerless Quartette

Some of the title corrections:

O Vino Fa ‘Papla (Wine Makes You Talk) | ‘O Vino Fa ‘Papla (Wine Makes You Talk)
Masked Ball Salaction | Masked Ball Selection
Moonlight and Roses (Brings Mem’ries Of You) | Moonlight and Roses (Bring Mem’ries Of You)
Que Bonita Eres Tu (You Are Beutiful) | Que Bonita Eres Tu (You Are Beautiful)
Buttered Roll | “Buttered Roll”
Paradise | “Paradise”
Got a Right to Cry | “Got a Right to Cry”
Blue Moods | “Blue Moods”
Auf Wiederseh’n Sweerheart | Auf Wiederseh’n Sweetheart
George M. Cohan Medley – Part 1 | George M. Cohan Medley – Part 2
Dewildered | Bewildered
Lolita (Seranata) | Lolita (Serenata)
Got a Right to Cry | “Got a Right to Cry” Joe Liggins and His Honeydrippers
Blue Moods | “Blue Moods”
Body and Soul | “Body and Soul”
Mais Qui Est-Ce | Mais Qui Est-Ce?
Wail Till the Sun Shines Nellie Blues | Wait Till the Sun Shines Nellie Blues
Que Te Pasa Joe (What Happens Joe) | Que Te Pasa Jose (What Happens Joe)
SAMSON AND DELILAH Softly Awakens My Heart | SAMSON AND DELILAH Softly Awakes My Heart
I’m Gonna COO, COO, COO | (I’m Gonna) COO, COO, COO
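Many of the corrections above change only punctuation or quoting, while others change the words themselves. Here is a rough sketch of how a proofreading queue might separate the two, so reviewers can look at wording changes first. The function names and category labels are assumptions for illustration, not part of the Great 78 Project's tooling.

```python
def normalize(text):
    """Lowercase the text and replace anything that is not a letter or digit with a space."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def classify_correction(before, after):
    """Return a coarse label for a before/after metadata pair."""
    if before == after:
        return "unchanged"
    if normalize(before) == normalize(after):
        # Only quotes, hyphens, brackets, spacing, or letter case differ.
        return "punctuation-or-case"
    return "wording"


pairs = [
    ("Buttered Roll", "“Buttered Roll”"),
    ("Dewildered", "Bewildered"),
    ("Mais Qui Est-Ce", "Mais Qui Est-Ce?"),
    ("Masked Ball Salaction", "Masked Ball Selection"),
]
for before, after in pairs:
    print(f"{classify_correction(before, after):20} {before} | {after}")
```

A split like this only sorts the work. Whether a quoting change is a correction or a matter of taste still needs a person.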

Most 20th Century Books Unavailable to Internet Users – We Can Fix That

The books of the 20th century are largely not online.  They are mostly not available from even the biggest booksellers. And, libraries who have collected hard copies of these books have not been able to deliver them in a cost-efficient, simple, digital form to their patrons. 

The way libraries could fill that gap is to adopt and deliver a controlled digital lending service. The Internet Archive is trying to do its part but needs others to join in. 

The Internet Archive has worked with 500 libraries over the last 15 years to digitize 3.5 million books. But based on copyright concerns the selection has often been restricted to pre-1923 books. We need complete libraries and comprehensive access to nurture a well-informed citizenry. The following graph shows the number of books digitized by the Internet Archive, binned by decade:

Up until 1923 the graph shows our collection increasing and mirroring the rise in publications. Then it dips and slows because of concerns and confusion about copyright protections for books published after that date. It picks up again in the 1990s because these books are more readily available and separate funding has helped us digitize some recent modern books. Nevertheless, the end result is that the gap is big – the digital world is missing a huge chunk of the 20th Century.

Users can’t even fill that gap by buying the books from that time period. According to a recent paper by Professor Rebecca Giblin, the commercial life of a book is typically exhausted 1.4 to 5 years from publication; some 90% of titles become unavailable in physical form within just two years. Most older books are therefore not available to be purchased in either physical or digital form. The following graph, pulled from a study by Professor Paul Heald, shows books by decade that are available on Amazon.com. It shows that the world’s largest bookseller has the same huge gap – the 20th century is simply missing. 

The 20th Century represents a significant portion of published knowledge – approximately one-third of all books – as shown in the graph below.  These books are largely unavailable commercially, BUT they are not completely lost. Many of these books are on library shelves, accessible only if you physically visit the library that owns those books. Even if you’re willing to visit, those books might still not be accessible. Libraries, pressed to repurpose their buildings, have increasingly moved volumes to off-site storage facilities.

The way to make 20th Century books available to library patrons is to digitize those books and let every library who owns a physical copy lend that book in digital form. This type of service has come to be known as controlled digital lending (CDL).  The Internet Archive has been doing this for years. We lend out-of-copyright and in-copyright volumes that we physically own. We’ve reformatted the physical volume, produced a digital version and lend only that digital version to one user at a time. Our experience shows that this responds to a real demand, fills a genuine need satisfactorily, gives new life to older books, and brings important knowledge to a new audience. Check out this case study for CDL involving the book Wasted which figured prominently in the Brett Kavanaugh Supreme Court nomination hearings.  

Our experience has been replicated by other early adopters and providers of a CDL service. Here’s a list of some of them. We believe every library can transform itself into a digital library. If you own the physical book, you can choose to circulate a digital version instead.

We urge more libraries to join Open Libraries and lend digitized versions of their print collections, making more copies of books available for loan and getting more books into the hands of digital  readers everywhere. 

Helping us judge a book by its cover: software help request

The Internet Archive would appreciate some help from a volunteer programmer to create software that would help determine if a book cover is useful to our users as a thumbnail or if we should use the title page instead. Many of our older books have cloth covers that are not useful, for instance:

But others are useful:

Just telling by age is not enough, because even 1923 cloth covers are sometimes good indicators of what the book is about (and are nice looking):

We would like a piece of code that can help us determine if the cover is useful or not to display as the thumbnail of a book. It does not have to be exact, but it would be useful if it knew when it didn’t have a good determination so we could run it by a person.

To help any potential programmer volunteers, we have created folders of hundreds of examples in 3 categories: year 1923 books with not-very-useful covers, year 1923 books with useful covers, and year 2000 books with useful covers. The filenames of the images are the Internet Archive item identifier that can be used to find the full item: 1922forniaminera00bradrich.jpg would come from https://archive.org/details/1922forniaminera00bradrich. We would like a program (hopefully fast, small, and free/open source) that would say useful or not-useful and give a confidence.
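To make the request concrete, here is one possible starting point, offered as a sketch rather than a solution. It assumes plain cloth covers are nearly uniform in brightness, so low contrast hints at "not-useful". It also assumes the image has already been decoded into grayscale values from 0 to 255, which would need an imaging library not shown here. The threshold and margin are made-up numbers that have not been tuned on the example folders.

```python
from statistics import pstdev


def cover_usefulness(gray_pixels, threshold=30.0, margin=10.0):
    """Guess whether a cover is useful as a thumbnail.

    gray_pixels: iterable of grayscale values (0-255) for the cover image.
    threshold, margin: hypothetical tuning values, not measured on real covers.
    Returns (label, confidence) where label is "useful", "not-useful",
    or "unsure" when the contrast is too close to the threshold to call.
    """
    values = list(gray_pixels)
    if not values:
        raise ValueError("no pixels given")
    contrast = pstdev(values)  # spread of brightness across the cover
    distance = abs(contrast - threshold)
    confidence = round(min(1.0, distance / threshold), 2)
    if distance < margin:
        return "unsure", confidence  # send this one to a person
    if contrast > threshold:
        return "useful", confidence
    return "not-useful", confidence


# Toy inputs standing in for decoded images.
print(cover_usefulness([120] * 100))      # flat, cloth-like cover
print(cover_usefulness([0, 255] * 50))    # strongly contrasting cover
print(cover_usefulness([90, 150] * 50))   # borderline case
```

The "unsure" band is the part that matters most for the request above: it lets the program say when it does not have a good determination. A real classifier would likely need richer features than contrast alone, since some decorated 1923 cloth covers are useful.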

Interested in helping? Brenton at archive.org is a good point of contact on this project.   Thank you for considering this. We can use the help. You can also use the comments on this post for any questions.

FYI: To create these datasets, I ran these command lines, and then by hand pulled some of the 1923 covers into the “useful” folder.

bash-3.2$ ia search "date:1923 AND mediatype:texts AND NOT collection:opensource AND NOT collection:universallibrary AND scanningcenter:*" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/cloth/{}.jpg"

bash-3.2$ ia search "date:2000 AND mediatype:texts AND scanningcenter:cebu" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/picture/{}.jpg"