Author Archives: Brewster Kahle

From Brewster Kahle—I Set Out to Build the Next Library of Alexandria. Now I Wonder: Will There Be Libraries in 25 Years?

Editorial note: This op-ed first ran in Time Magazine in 2021. We are reposting it here with permission as we head into oral argument for our appeal in the publishers’ lawsuit against our library, scheduled for next Friday, June 28, 2024.

When I started the Internet Archive 25 years ago, I focused our non-profit library on digital collections: preserving web pages, archiving television news, and digitizing books. The Internet Archive was seen as innovative and unusual. Now all libraries are increasingly electronic, and necessarily so. To fight disinformation, to serve readers during the pandemic, and to be relevant to 21st-century learners, libraries must become digital.

But just as the Web increased people’s access to information exponentially, an opposite trend has evolved. Global media corporations—emboldened by the expansive copyright laws they helped craft and the emerging technology that reaches right into our reading devices—are exerting absolute control over digital information. These two conflicting forces—towards unfettered availability and completely walled access to information—have defined the last 25 years of the Internet. How we handle this ongoing clash will define our civic discourse in the next 25 years. If we fail to forge the right path, publishers’ business models could eliminate one of the great tools for democratizing society: our independent libraries.

These are not small mom-and-pop publishers: a handful of publishers dominate all book sales and distribution, including trade books, ebooks, and textbooks. Right now, these corporate publishers are squeezing libraries in ways that may render it impossible for any library to own digital texts in five years, let alone 25. Soon, librarians will be reduced to customer service reps for a Netflix-like rental catalog of bestsellers. If that comes to pass, you might as well replace your library card with a credit card. That’s what these billion-dollar publishers are pushing.

The libraries I grew up with would buy books, preserve them, and lend them for free to their patrons. If my library did not have a particular book, then it would borrow a copy from another library for me. In the shift from print to digital, many commercial publishers are declaring each of these activities illegal: they refuse libraries the right to buy ebooks, preserve ebooks, or lend ebooks. They demand that libraries license ebooks for a limited time or for limited uses at exorbitant prices, and some publishers refuse to license audiobooks or ebooks to libraries at all, making those digital works unavailable to hundreds of millions of library patrons.

Although we’re best known for the Wayback Machine, a historical archive of the World Wide Web, the Internet Archive also buys ebooks from the few independent publishers that will sell, really sell, ebooks to us. We lend these ebooks to one reader at a time, protected with the same technologies that publishers use to protect their ebooks. The Internet Archive also digitizes print books that were purchased or donated. Similarly, we lend them to one reader at a time, following a practice employed by hundreds of libraries over the last decade called “controlled digital lending.”

Last year,* four of the biggest commercial publishers in the world sued the Internet Archive to stop this longstanding library practice of controlled lending of scanned books. The publishers filed their lawsuit early in the pandemic, when public and school libraries were closed. In March 2020, more than one hundred shuttered libraries signed a statement of support asking that the Internet Archive do something to meet the extraordinary circumstances of the moment. We responded as any library would: making our digitized books available, without waitlists, to help teachers, parents, and students stranded without books. This emergency measure ended two weeks before the intended 14-week period.

The lawsuit demands that the Internet Archive destroy 1.4 million digitized books, books we legally acquired and scanned in cooperation with dozens of library partners. If the publishers win this lawsuit, then every instance of online reading would require the permission and license of a publisher. It would give publishers unprecedented control over what we can read and when, as well as troves of data about our reading habits.

Publishers’ bullying tactics have stirred lawmakers in Maryland, New York, Massachusetts, and Rhode Island to draft laws requiring that publishers treat libraries fairly. Maryland’s legislature passed the law unanimously. In those states, if an ebook is licensed to consumers, publishers will be required by law to license it to libraries on reasonable terms. But lobbyists for the publishing industry claim even these laws are unconstitutional. This is a dangerous state of affairs. Libraries should be free to buy, preserve, and lend all books regardless of the format.

Suing libraries is not a new tactic for these billion-dollar corporations and their surrogates: Georgia State University’s law library battled a copyright lawsuit for 12 years; HathiTrust Digital Library battled the Authors Guild for five years. In each case, the library organization won, but it took millions of dollars that libraries can ill afford.

Libraries spend billions of dollars on publishers’ products, supporting authors, illustrators, and designers. If libraries become mere customer service departments for publishers’ pre-packaged product lines, the role that librarians play in highlighting marginalized voices, providing information to the disadvantaged, and preserving cultural memory independent of those in power will be lost.

As we shift from print to digital, we can and must support institutions and practices that were refined over hundreds of years, starting with selling ebooks to readers and libraries.

So if we all handle this next phase of the Internet well, I believe the answer is, yes, there will be libraries in 25 years, many libraries—and many publishers, many booksellers, millions of compensated authors, and a society in which everyone will read good books.

*Editorial note: This op-ed was first published in 2021.

Aruba’s Bold Support of Library Digital Rights, by Brewster Kahle

Aruba’s Prime Minister, Evelyn Wever-Croes: “Give them the opportunity to search for the truth.”

Last week Aruba launched the island nation’s digital heritage portal online: Coleccion Aruba. Wired trumpeted the news as “The Internet Archive Just Backed Up an Entire Caribbean Island,” but really the credit goes to Aruba. Digitizing their national cultural heritage (100k items) and putting it online for free public access is a huge achievement.

I met with the Prime Minister (pictured above), the Minister of Culture, and the Minister of Education who backed the efforts made by the National Librarian, National Archivist, and their digital strategist. Never have I seen such unified support for cultural preservation and access. They brought together people from the Dutch islands and the Internet Archive to share the news and to inspire and to lead.

Aruba was the first to sign onto the Four Digital Rights of Memory Institutions: right to Collect, Preserve, provide Access, and interlibrary Collaboration. These are bad times when we have to reclaim these rights that are being taken from all libraries, but Aruba is making a stand. Go Aruba!

Aruba’s National Librarian, Astrid Britten, signs the Four Rights, as the National Archivist, Raymond Hernandez, and Brewster Kahle look on.

If libraries are reduced to only subscribing to commercial database products rather than owning and curating collections, we will be beholden to external corporations and subject to their whims over what’s in licensed collections, and how patrons can access them. The “Spotify for Books” model is not the way we want our libraries to go. 

To top it off, the Prime Minister, Evelyn Wever-Croes, inspired us when she told us that for the next generation, we need to “Give them the opportunity to search for the truth.” Yes.

Inspiring to see a country lead so well. I hope we have the honor of working with other nations that will also assert Digital Rights for Libraries, and live by those principles.

– Brewster Kahle

Brewster Goes to Washington – Congressional Hearing on the Copyright Office Modernization Committee

A good day in Washington. After I spent two years on the Copyright Office Modernization Committee, helping advise the Copyright Office on their new registration and recordation process, a Republican and a Democrat from the House of Representatives held a hearing to ask questions of committee members. It was such a refreshing scene because it was bipartisan, they knew the issues, and they were spending time finding out what we suggested.

This all matters because the Copyright Office is moving to filings being digital, which is an improvement, and because it could make way for efficient submissions of digital files.   This would be a major way for the Library of Congress to get copies of books they would own, preserve, and make somewhat accessible.

Another attendee said they had gone to congressional meetings for 30 years and this one had the most engagement of any of them.  A good day in Washington, indeed.

Let us serve you, but don’t bring us down

What just happened on archive.org today, as best we know:

Tens of thousands of requests per second for our public domain OCR files were launched from 64 virtual hosts on Amazon’s AWS services. (Even by web standards, tens of thousands of requests per second is a lot.)

This activity brought archive.org down for all users for about an hour.

We are thankful to our engineers who could scramble on a Sunday afternoon on a holiday weekend to work on this.

We got the service back up by blocking those IP addresses.

But, another 64 addresses started the same type of activity a couple of hours later.  

We figured out how to block this new set, but again with an outage of about an hour.

—- 

How this could have gone better for us:

Those wanting to use our materials in bulk should start slowly, and ramp up. 

Also, if you are starting a large project, please contact us at info@archive.org. We are here to help.

If you find yourself blocked, please don’t just start again. Reach out.

Again, please use the Internet Archive, but don’t bring us down in the process.
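
For anyone planning a bulk project, here is a rough sketch of what "start slowly, and ramp up" could look like in Python. It is only an illustration, not an official Internet Archive client: the function names, the starting rate, the growth factor, and the choice to treat HTTP 403, 429, and 503 as "blocked or overloaded" are all assumptions made for this example. The caller supplies the `fetch` function, so nothing here depends on a particular HTTP library.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def ramp_up_delays(start_rps: float = 1.0, max_rps: float = 10.0,
                   growth: float = 1.5,
                   requests_per_step: int = 20) -> Iterator[float]:
    """Yield the pause (in seconds) to take before each request.

    The rate starts at start_rps and is multiplied by growth after every
    requests_per_step requests, never exceeding max_rps.
    All default values are illustrative assumptions.
    """
    rps = start_rps
    while True:
        for _ in range(requests_per_step):
            yield 1.0 / rps
        rps = min(rps * growth, max_rps)


def polite_bulk_fetch(urls: Iterable[str],
                      fetch: Callable[[str], Tuple[int, str]],
                      sleep: Callable[[float], None] = time.sleep,
                      **ramp_kwargs) -> List[str]:
    """Fetch urls one at a time, ramping up slowly and stopping when blocked.

    fetch(url) is provided by the caller and returns (status_code, body).
    """
    results: List[str] = []
    delays = ramp_up_delays(**ramp_kwargs)
    for url in urls:
        sleep(next(delays))
        status, body = fetch(url)
        if status in (403, 429, 503):
            # Assumed signs of being blocked or of an overloaded server:
            # stop here and get in touch rather than starting again.
            break
        results.append(body)
    return results
```

The important design choice is the `break`: when the server pushes back, the sketch stops instead of retrying, which leaves time to write to info@archive.org before continuing.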

Anti-Hallucination Add-on for AI Services Possibility

Chatbots, like OpenAI’s ChatGPT, Google’s Bard and others, have a hallucination problem (their term, not ours). They can make something up and state it authoritatively. It is a real problem. But there can be an old-fashioned answer, as a parent might say: “Look it up!”

Imagine for a moment the Internet Archive, working with responsible AI companies and research projects, could automate “Looking it Up” in a vast library to make those services more dependable, reliable, and trustworthy. How?

The Internet Archive and AI companies could offer an anti-hallucination service ‘add-on’ to the chatbots that could cite supporting evidence and counter claims to chatbot assertions by leveraging the library collections at the Internet Archive (most of which were published before generative AI).

By citing evidence for and against assertions based on papers, books, newspapers, magazines, TV, radio, and government documents, we can build a stronger, more reliable knowledge infrastructure for a generation that turns to their screens for answers. Although many of these generative AI companies are already linking, or intend to link, their models to the internet, what the Internet Archive can uniquely offer is our vast collection of “historical internet” content. We have been archiving the web for 27 years, which means we have decades of human-generated knowledge. This might become invaluable in an age when we might see a drastic increase in AI-generated content. So an Internet Archive add-on is not just a matter of leveraging knowledge available on the internet, but also knowledge available on the history of the internet.

Is this possible? We think yes, because we are already doing something like this for Wikipedia, by hand and with special-purpose robots like Internet Archive Bot. Wikipedia communities, and these bots, have fixed over 17 million broken links and have linked one million assertions to specific pages in over 250,000 books. With the help of the AI companies, we believe we can make this an automated process that could respond to the customized essays their services produce. Much of the same technologies used for the chatbots can be used to mine assertions in the literature and find when, and in what context, those assertions were made.

The result would be a more dependable World Wide Web, one where disinformation and propaganda are easier to challenge, and therefore weaken.

Yes, there are 4 major publishers suing to destroy a significant part of the Internet Archive’s book corpus, but we are appealing this ruling. We believe that one role of a research library like the Internet Archive is to own collections that can be used in new ways by researchers and the general public to understand their world.

What is required? Common purpose, partners, and money. We see a role for a Public AI Research laboratory that can mine vast collections without rights issues arising. While the collections are significant already, we see collecting, digitizing, and making available the publications of the democracies around the world to expand the corpus greatly.

We see roles for scientists, researchers, humanists, ethicists, engineers, governments, and philanthropists, working together to build a better Internet.

If you would like to be involved, please contact Mark Graham at mark@archive.org.

AI Audio Challenge: Audio Restoration of 78rpm Records based on Expert Examples

http://great78.archive.org/

Hopefully we have a dataset primed for AI researchers to do something really useful, and fun: taking the noise out of digitized 78rpm records.

The Internet Archive has 1,600 examples of quality human restorations of 78rpm records where the best tools were used to ‘lightly restore’ the audio files. This takes away scratchy surface noise while trying not to impair the music or speech. Also in those items are the unrestored original files that were used.

But then the Internet Archive has over 400,000 unrestored files that are quite scratchy and difficult to listen to.

The goal is, or rather the hope is, for a program that can take all or many of the 400,000 unrestored records and make them much better. How hard this is is unknown, but hopefully it is a fun project to work on.

Many of the recordings are great and worth the effort. Please comment on this post if you are interested in diving in.

AI@IA — Extracting Words Sung on 100 year-old 78rpm records

A post in the series about how the Internet Archive is using AI to help build the library.

Freely available Artificial Intelligence tools are now able to extract words sung on 78rpm records.  The results may not be full lyrics, but we hope it can help browsing, searching, and researching.

Whisper is an open source tool from OpenAI “that approaches human level robustness and accuracy on English speech recognition.”  We were surprised how far it could get with recognizing spoken words on noisy disks and even words being sung.

For instance in As We Parted At The Gate (1915) by  Donald Chalmers, Harvey Hindermyer, and E. Austin Keith, the tool found the words:

[…] we parted at the gate,
I thought my heart would shrink.
Often now I seem to hear her last goodbye.
And the stars that tune at night will
never die as bright as they did before we
parted at the gate.
Many years have passed and gone since I
went away once more, leaving far behind
the girl I love so well.
But I wander back once more, and today
I pass the door of the cottade well, my
sweetheart, here to dwell.
All the roads they flew at fair,
but the faith is missing there.
I hear a voice repeating, you’re to live.
And I think of days gone by
with a tear so from her eyes.
On the evening as we parted at the gate,
as we parted at the gate, I thought my
heart would shrink.
Often now I seem to hear her last goodbye.
And the stars that tune at night will
never die as bright as they did before we
parted at the gate.

All of the extracted texts are now available; we hope they are useful for understanding these early recordings. Bear in mind that these are historical materials, so they may be offensive and may also be incorrectly transcribed.

We are grateful that University of California Santa Barbara Library donated an almost complete set of transfers of 100-year-old Edison recordings to the Internet Archive’s Great 78 Project this year. The recordings and the transfers were so good that the automatic tools were able to make out many of the words.
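
For readers who want to try something similar on their own transfers, here is a minimal sketch using the open source `whisper` Python package (installed as `openai-whisper`). It is not the pipeline the Internet Archive ran: the helper names, the `"base"` model size, and the file name in the usage note are assumptions for illustration. The sketch relies only on `whisper.load_model` and the model's `transcribe` method, which returns a dictionary that includes a list of `segments`, each with a `text` field.

```python
from typing import Dict, List


def transcribe_record(audio_path: str, model_name: str = "base") -> Dict:
    """Run Whisper on one audio file and return its raw result dictionary.

    The import is done lazily so the helper below can be used without
    the openai-whisper package installed. model_name is an assumed default.
    """
    import whisper  # provided by the openai-whisper package

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path)


def segments_to_lines(result: Dict) -> List[str]:
    """Turn Whisper's timed segments into plain text lines, skipping blanks."""
    lines = []
    for segment in result.get("segments", []):
        text = segment["text"].strip()
        if text:
            lines.append(text)
    return lines
```

A call such as `segments_to_lines(transcribe_record("as_we_parted.mp3"))` (a hypothetical file name) would give one line of text per recognized segment, which is roughly the shape of the excerpt shown above.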

The next step is to integrate these texts into the browsing and searching interfaces at the Internet Archive.

Our Digital History Is at Risk

This piece was first published by TIME Magazine, in their Ideas section, as Amid Musk’s Chaotic Reign at Twitter, Our Digital History Is at Risk. My thanks to the wonderful team at Time for their editorial and other assistance.

As Twitter has entered the Musk era, many people are leaving the platform or rethinking its role in their lives. Whether they join another platform like Mastodon (as I have) or continue on at Twitter, the instability occasioned by Twitter’s change in ownership has revealed an underlying instability in our digital information ecosystem. 

Many have now seen how, when someone deletes their Twitter account, their profile, their tweets, even their direct messages, disappear. According to the MIT Technology Review, around a million people have left so far, and all of this information has left the platform along with them. The mass exodus from Twitter and the accompanying loss of information, while concerning in its own right, shows something fundamental about the construction of our digital information ecosystem:  Information that was once readily available to you—that even seemed to belong to you—can disappear in a moment. 

Losing access to information of private importance is surely concerning, but the situation is more worrying when we consider the role that digital networks play in our world today. Governments make official pronouncements online. Politicians campaign online. Writers and artists find audiences for their work and a place for their voice. Protest movements find traction and fellow travelers. And, of course, Twitter was a primary publishing platform of a certain U.S. president.

If Twitter were to fail entirely, all of this information could disappear from their site in an instant. This is an important part of our history. Shouldn’t we be trying to preserve it?

I’ve been working on these kinds of questions, and building solutions to some of them, for a long time. That’s part of why, over 25 years ago, I founded the Internet Archive. You may have heard of our “Wayback Machine,” a free service anyone can use to view archived web pages from the mid-1990s to the present. This archive of the web has been built in collaboration with over a thousand libraries around the world, and it holds hundreds of billions of archived webpages today, including those presidential tweets (and many others). In addition, we’ve been preserving all kinds of important cultural artifacts in digital form: books, television news, government records, early sound and film collections, and much more.

The scale and scope of the Internet Archive can give it the appearance of something unique, but we are simply doing the work that libraries and archives have always done: Preserving and providing access to knowledge and cultural heritage. For thousands of years, libraries and archives have provided this important public service. I started the Internet Archive because I strongly believed that this work needed to continue in digital form and into the digital age. 

While we have had many successes, it has not been easy. Like the record labels, many book publishers  didn’t know what to make of the internet at first, but now they see new opportunities for financial gain. Platforms, too, tend to put their commercial interests first. Don’t get me wrong: Publishers and platforms continue to play an important role in bringing the work of creators to market, and sometimes assist in the preservation task. But companies close, and change hands, and their commercial interests can cut against preservation and other important public benefits. 

Traditionally, libraries and archives filled this gap. But in the digital world, law and technology make their job increasingly difficult. For example, while a library could always simply buy a physical book on the open market in order to preserve it on their shelves, many publishers and platforms try to stop libraries from preserving information digitally. They may even use technical and legal measures to prevent libraries from doing so. While we strongly believe that fair use law enables libraries to perform traditional functions like preservation and lending in the digital environment, many publishers disagree, going so far as to sue libraries to stop them from doing so. 

We should not accept this state of affairs. Free societies need access to history, unaltered by changing corporate or political interests. This is the role that libraries have played and need to keep playing. This brings us back to Twitter.

In 2010, Twitter had the tremendous foresight of engaging in a partnership with the Library of Congress to preserve old tweets. At the time, the Library of Congress had been tasked by Congress “to establish a national digital information infrastructure and preservation program.” It appeared that government and private industry were working together in search of a solution to the digital preservation problem, and that Twitter was leading the way.  

It was not long before the situation broke down. In 2011, the Library of Congress issued a report noting the need for “legal and regulatory changes that would recognize the broad public interest in long-term access to digital content,” as well as the fact that “most libraries and archives cannot support under current funding” the necessary digital preservation infrastructure. But no legal and regulatory changes have been forthcoming, and even before the 2011 report, Congress pulled tens of millions of dollars out of the preservation program. In these circumstances, it is perhaps unsurprising that, by 2017, the Library of Congress had ceased preserving most old tweets, and the National Digital Information Infrastructure and Preservation Program (NDIIPP) is no longer an active program at the Library of Congress. Furthermore, it is not clear whether Twitter’s new ownership will take further steps of its own to address the situation.

Whatever Musk does, the preservation of our digital cultural heritage should not have to rely on the beneficence of one man. We need to empower libraries by ensuring that they have the same rights with respect to digital materials that they have in the physical world. Whether that means archiving old tweets, lending books digitally, or even something as exciting (to me!) as 21st century interlibrary loan, what’s important is that we have a nationwide strategy for solving the technical and legal hurdles to getting this done. 

What is Democracy’s Library?

Illustration created with MidJourney

Democracies require an educated citizenry to flourish, and because of this, democratic governments, at all levels, spend billions of dollars publishing reports, manuals, books, and videos so that all can read and learn. That is the good news. The bad news is that in our digital age, much of this is not accessible. Democracy’s Library aims to change this.

The aim of the Internet Archive Democracy’s Library is to collect, preserve and make freely available all the published works of all the democracies (the federal, provincial, and municipal government publications) so that we can efficiently learn from each other to solve our biggest challenges in parallel and in concert.

Democracy’s Library is the foundational information of free people.

We call this “Democracy’s Library” because Democracy is an open system that trusts its citizens to learn, grow and have independent agency. Democratic governments publish openly because they want important information spread widely.  There are no paywalls to the works of government, or there shouldn’t be. 

We need access to all the river reports so we can help understand and manage our declining clean water. Access to agricultural research to help farm more sustainably. To materials research to build better products and devices. To local hearings on project results so other cities can overcome the same challenges. To training materials and textbooks for many professions. All free, and in ways you can find them.

Bringing free public access to the public domain is the opportunity of the Internet: an infrastructure that effectively costs nothing to distribute information that has been collected and organized.

Yes, this will cost a small fortune, but it is within our grasp to collect and organize billions of documents and datasets, preserve the materials for the ages and make them available for many purposes. While scoping projects in the United States and Canada have now begun, we estimate this project will cost at least $100 million. The big money has not been committed yet, and we’re still fundraising. But to get things kicked off, Filecoin Foundation (FF) and Filecoin Foundation for the Decentralized Web (FFDW) are supporting the project. The Internet Archive has ramped up collecting government websites and datasets as well as digitizing print materials with many library partners.

Thankfully, we do not have the rights and paywall problems that have been strangling the Internet’s best feature: an essentially free information distribution system.   

Democracy’s Library can be a free public library available on your phone and your laptop.  

Democracy’s Library will be the foundation of new services, both non-commercial and commercial, that leverage language understanding, machine learning, automatic translation, speech recognition, and visualizations.

Democracies publish openly, so let’s take advantage of this. Let’s leverage our library system to not just lease commercial publishers’ database products, but build open collections that everyone can use and reuse without limitation.

Let’s build direct conduits from governments into Democracy’s Library for long term preservation and access. A public-public partnership that long served us in the paper era, that took a pause in the mainframe era of commercial databases, can flourish again in the Internet era.

“Public Access to the Public Domain” can be a rallying cry for Democracy’s Library.

Democracy’s Library can be a flowering of information services for free people.

Please join in and help. Jamie Joyce of the Internet Archive (jamiejoyce@archive.org) is leading the effort in the United States, and Andrea Mills of the Internet Archive Canada (andrea@archive.org) is leading the Canadian effort. The project is overseen by Brewster Kahle (brewster@archive.org).

If you’d like to stay connected, sign up for the #EmpoweringLibraries newsletter.

Have ideas?  Have materials?  Have a use case?  Have resources to bring to bear?   This can only happen if we work together.  

Let’s build Democracy’s Library, together.

Digital Books wear out faster than Physical Books

Ever try to read a physical book passed down in your family from 100 years ago?  Probably worked well. Ever try reading an ebook you paid for 10 years ago?   Probably a different experience. From the leasing business model of mega publishers to physical device evolution to format obsolescence, digital books are fragile and threatened.

For those of us tending libraries of digitized and born-digital books, we know that they need constant maintenance—reprocessing, reformatting, re-invigorating or they will not be readable or read. Fortunately this is what libraries do (if they are not sued to stop it). Publishers try to introduce new ideas into the public sphere. Libraries acquire these and keep them alive for generations to come.

And, to serve users with print disabilities, we have to keep up with the ever-improving tools they use.

Mega-publishers are saying electronic books do not wear out, but this is not true at all. The Internet Archive processes and reprocesses the books it has digitized as new optical character recognition technologies come around, as new text understanding technologies open new analysis, as formats change from djvu to daisy to epub1 to epub2 to epub3 to pdf-a and on and on. This work takes thousands of computer-months and programmer-years. This is what libraries have signed up for: our long-term custodial roles.

Also, the digital media they reside on change, too: from Digital Linear Tape to PATA hard drives to SATA hard drives to SSDs. If we do not actively tend our digital books they become unreadable very quickly.

Then there is cataloging and metadata. If we do not keep up with the ever-changing expectations of digital learners, then our books will not be found. This is ongoing and expensive.

Our paper books have lasted hundreds of years on our shelves and are still readable. Without active maintenance, we will be lucky if our digital books last a decade.

Also, how we use books and periodicals in the decades after they are published changes from how they were originally intended. We are seeing researchers use books and periodicals in machine learning investigations to find trends that were never easy in a one-by-one world, or in the silos of the publisher databases. Preparing these books for this type of analysis is time consuming and now threatened by publishers’ lawsuits.

If we want future access to our digital heritage we need to make some structural changes:  changes to institution and publisher behaviors as well as supportive funding, laws, and enforcement.

The first step is to recognize that preservation of and access to our digital heritage is a big job and one worth doing. Then, find ways that institutions (educational, government, non-profit, and philanthropic) could make preservation a part of our daily responsibility.

Long live books.

Illustration: midjourney AI generated.