Author Archives: Brewster Kahle

Brewster Goes to Washington – Congressional Hearing on the Copyright Office Modernization Committee

A good day in Washington. After two years on the Copyright Office Modernization Committee, helping advise the Copyright Office on its new registration and recordation process, a Republican and a Democrat from the House of Representatives held a hearing to ask questions of committee members. It was such a refreshing scene because it was bipartisan, they knew the issues, and they were spending time finding out what we suggested.

This all matters because the Copyright Office is moving to digital filings, which is an improvement, and because it could make way for efficient submissions of digital files. This would be a major way for the Library of Congress to get copies of books that it would own, preserve, and make somewhat accessible.

Another attendee said they had gone to congressional meetings for 30 years and this one had the most engagement of any of them.  A good day in Washington, indeed.

Let us serve you, but don’t bring us down

What just happened on archive.org today, as best we know:

Tens of thousands of requests per second for our public domain OCR files were launched from 64 virtual hosts on Amazon’s AWS services. (Even by web standards, tens of thousands of requests per second is a lot.)

This activity brought archive.org down for all users for about an hour.

We are thankful to our engineers who could scramble on a Sunday afternoon on a holiday weekend to work on this.

We got the service back up by blocking those IP addresses.

But, another 64 addresses started the same type of activity a couple of hours later.  

We figured out how to block this new set, but again at the cost of about an hour of outage.

—- 

How this could have gone better for us:

Those wanting to use our materials in bulk should start slowly, and ramp up. 

Also, if you are starting a large project, please contact us at info@archive.org; we are here to help.

If you find yourself blocked, please don’t just start again; reach out.

Again, please use the Internet Archive, but don’t bring us down in the process.
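One way a bulk user might follow this advice is to start at a low request rate, raise it gradually, and back off sharply when the server signals trouble. The sketch below is only an illustration, not an Internet Archive client or an official policy: the class and function names, the starting rate, the ceiling, the growth and backoff factors, and the choice of HTTP status codes 429 and 503 as "slow down" signals are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request


class RampUpThrottle:
    """Tracks an allowed request rate that starts low and grows slowly.

    Every number here is an illustrative assumption, not a published limit.
    """

    def __init__(self, start_rps=0.5, max_rps=5.0, growth=1.1,
                 backoff=0.5, min_rps=0.05):
        self.rate = start_rps    # current requests per second
        self.max_rps = max_rps   # never go faster than this
        self.growth = growth     # multiply the rate by this after a success
        self.backoff = backoff   # multiply the rate by this after trouble
        self.min_rps = min_rps   # never go slower than this

    def delay(self):
        """Seconds to wait before sending the next request."""
        return 1.0 / self.rate

    def record_success(self):
        self.rate = min(self.max_rps, self.rate * self.growth)

    def record_trouble(self):
        self.rate = max(self.min_rps, self.rate * self.backoff)


def fetch_all(urls, throttle, max_consecutive_failures=5):
    """Fetch URLs one at a time, pausing between requests.

    Stops early after repeated 'slow down' responses, so that a blocked
    client does not simply keep hammering the server.
    """
    results = {}
    failures = 0
    for url in urls:
        time.sleep(throttle.delay())
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                results[url] = response.read()
            throttle.record_success()
            failures = 0
        except urllib.error.HTTPError as err:
            if err.code in (429, 503):  # assumed "slow down" signals
                throttle.record_trouble()
                failures += 1
                if failures >= max_consecutive_failures:
                    break  # time to stop and get in touch, not to retry
            else:
                results[url] = None  # other HTTP errors: record and move on
        except urllib.error.URLError:
            throttle.record_trouble()
            failures += 1
            if failures >= max_consecutive_failures:
                break
    return results
```

With these defaults a run begins at one request every two seconds and needs a few dozen successful requests to reach the ceiling. Requests are sent strictly one after another, so the load on the server stays small and predictable.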

Anti-Hallucination Add-on for AI Services Possibility

Chatbots, like OpenAI’s ChatGPT, Google’s Bard and others, have a hallucination problem (their term, not ours). They can make something up and state it authoritatively. It is a real problem. But there can be an old-fashioned answer, as a parent might say: “Look it up!”

Imagine for a moment the Internet Archive, working with responsible AI companies and research projects, could automate “Looking it Up” in a vast library to make those services more dependable, reliable, and trustworthy. How?

The Internet Archive and AI companies could offer an anti-hallucination service ‘add-on’ to the chatbots that could cite supporting evidence and counterclaims to chatbot assertions by leveraging the library collections at the Internet Archive (most of which were published before generative AI).

By citing evidence for and against assertions based on papers, books, newspapers, magazines, TV, radio, and government documents, we can build a stronger, more reliable knowledge infrastructure for a generation that turns to their screens for answers. Although many of these generative AI companies are already linking, or intend to link, their models to the internet, what the Internet Archive can uniquely offer is our vast collection of “historical internet” content. We have been archiving the web for 27 years, which means we have decades of human-generated knowledge. This might become invaluable in an age when we might see a drastic increase in AI-generated content. So an Internet Archive add-on is not just a matter of leveraging knowledge available on the internet, but also knowledge available on the history of the internet.

Is this possible? We think yes, because we are already doing something like this for Wikipedia, by hand and with special-purpose robots like Internet Archive Bot. Wikipedia communities, and these bots, have fixed over 17 million broken links and have linked one million assertions to specific pages in over 250,000 books. With the help of the AI companies, we believe we can make this an automated process that could respond to the customized essays their services produce. Much of the same technology used for the chatbots can be used to mine assertions in the literature and find when, and in what context, those assertions were made.
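As a small illustration of automated “looking it up,” the sketch below asks the Wayback Machine’s public availability endpoint whether an archived copy of a URL exists near a given date, which is the kind of lookup that helps when repairing a broken link. It is not how Internet Archive Bot is implemented. The helper names are made up for this example, and the JSON shape it parses (an `archived_snapshots` object holding a `closest` entry) is our assumption about the endpoint’s response and should be checked against its documentation.

```python
import json
import urllib.parse
import urllib.request

# Public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url, timestamp=None):
    """Build the lookup URL; timestamp is an optional YYYYMMDD string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode(params)


def closest_snapshot(payload):
    """Return the archived URL from a decoded response, or None.

    Assumes the response looks like:
    {"archived_snapshots": {"closest": {"available": true, "url": "..."}}}
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def look_it_up(url, timestamp=None):
    """Ask the Wayback Machine for the snapshot closest to the timestamp."""
    query = build_query(url, timestamp)
    with urllib.request.urlopen(query, timeout=30) as response:
        return closest_snapshot(json.load(response))
```

Calling `look_it_up("example.com", "20060101")` would return the address of the nearest archived copy, or `None` if there is none. Linking an assertion to a specific page in a book is a much harder problem that this sketch does not attempt.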

The result would be a more dependable World Wide Web, one where disinformation and propaganda are easier to challenge, and therefore weaken.

Yes, four major publishers are suing to destroy a significant part of the Internet Archive’s book corpus, but we are appealing the ruling in that case. We believe that one role of a research library like the Internet Archive is to own collections that can be used in new ways by researchers and the general public to understand their world.

What is required? Common purpose, partners, and money. We see a role for a Public AI Research laboratory that can mine vast collections without rights issues arising. While the collections are significant already, we see collecting, digitizing, and making available the publications of the democracies around the world to expand the corpus greatly.

We see roles for scientists, researchers, humanists, ethicists, engineers, governments, and philanthropists, working together to build a better Internet.

If you would like to be involved, please contact Mark Graham at mark@archive.org.

AI Audio Challenge: Audio Restoration of 78rpm Records based on Expert Examples

http://great78.archive.org/

Hopefully we have a dataset primed for AI researchers to do something really useful and fun: taking the noise out of digitized 78rpm records.

The Internet Archive has 1,600 examples of quality human restorations of 78rpm records where the best tools were used to ‘lightly restore’ the audio files. This takes away scratchy surface noise while trying not to impair the music or speech. Also included in those items are the unrestored originals that were used.

But then the Internet Archive has over 400,000 unrestored files that are quite scratchy and difficult to listen to.

The goal, or rather the hope, is that a program can take all or many of the 400,000 unrestored records and make them much better. How hard this is is unknown, but hopefully it is a fun project to work on.

Many of the recordings are great and worth the effort. Please comment on this post if you are interested in diving in.

AI@IA — Extracting Words Sung on 100 year-old 78rpm records

A post in the series about how the Internet Archive is using AI to help build the library.

Freely available Artificial Intelligence tools are now able to extract words sung on 78rpm records.  The results may not be full lyrics, but we hope it can help browsing, searching, and researching.

Whisper is an open source tool from OpenAI “that approaches human level robustness and accuracy on English speech recognition.”  We were surprised how far it could get with recognizing spoken words on noisy disks and even words being sung.

For instance in As We Parted At The Gate (1915) by  Donald Chalmers, Harvey Hindermyer, and E. Austin Keith, the tool found the words:

[…] we parted at the gate,
I thought my heart would shrink.
Often now I seem to hear her last goodbye.
And the stars that tune at night will
never die as bright as they did before we
parted at the gate.
Many years have passed and gone since I
went away once more, leaving far behind
the girl I love so well.
But I wander back once more, and today
I pass the door of the cottade well, my
sweetheart, here to dwell.
All the roads they flew at fair,
but the faith is missing there.
I hear a voice repeating, you’re to live.
And I think of days gone by
with a tear so from her eyes.
On the evening as we parted at the gate,
as we parted at the gate, I thought my
heart would shrink.
Often now I seem to hear her last goodbye.
And the stars that tune at night will
never die as bright as they did before we
parted at the gate.

All of the extracted texts are now available, and we hope they are useful for understanding these early recordings. Bear in mind that these are historical materials, so they may be offensive and may also be incorrectly transcribed.
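For readers who want to try this on their own recordings, here is a minimal sketch using the open source `openai-whisper` Python package. It is not the pipeline the Internet Archive ran: the function names, the "base" model size, and the timestamped output format are assumptions chosen for the example, and it needs the package (and the ffmpeg tool it relies on) to be installed.

```python
def segments_to_lines(segments):
    """Turn Whisper-style segments into '[mm:ss] text' lines.

    Each segment is expected to be a dict with a 'start' time in seconds
    and a 'text' string, as found in the 'segments' list of a Whisper result.
    """
    lines = []
    for segment in segments:
        minutes, seconds = divmod(int(segment["start"]), 60)
        text = segment["text"].strip()
        lines.append(f"[{minutes:02d}:{seconds:02d}] {text}")
    return lines


def transcribe_record(audio_path, model_name="base"):
    """Transcribe one audio file and return timestamped lines."""
    # Imported here so the formatting helper above works even where the
    # package is not installed.
    import whisper

    model = whisper.load_model(model_name)  # "base" is an arbitrary choice
    result = model.transcribe(audio_path)
    return segments_to_lines(result["segments"])
```

A call such as `transcribe_record("as_we_parted_at_the_gate.mp3")`, where the file name is hypothetical, would return a list of lines like the ones quoted above, each prefixed with the time at which it was sung. Larger models are slower but may cope better with surface noise.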

We are grateful that University of California Santa Barbara Library donated an almost complete set of transfers of 100 year-old Edison recordings to the Internet Archive’s Great 78 Project this year.  The recordings and the transfers were so good that the automatic tools were able to make out many of the words.

The next step is to integrate these texts into the browsing and searching interfaces at the Internet Archive.

Our Digital History Is at Risk

This piece was first published by TIME Magazine, in their Ideas section, as Amid Musk’s Chaotic Reign at Twitter, Our Digital History Is at Risk. My thanks to the wonderful team at Time for their editorial and other assistance.

As Twitter has entered the Musk era, many people are leaving the platform or rethinking its role in their lives. Whether they join another platform like Mastodon (as I have) or continue on at Twitter, the instability occasioned by Twitter’s change in ownership has revealed an underlying instability in our digital information ecosystem. 

Many have now seen how, when someone deletes their Twitter account, their profile, their tweets, even their direct messages, disappear. According to the MIT Technology Review, around a million people have left so far, and all of this information has left the platform along with them. The mass exodus from Twitter and the accompanying loss of information, while concerning in its own right, shows something fundamental about the construction of our digital information ecosystem:  Information that was once readily available to you—that even seemed to belong to you—can disappear in a moment. 

Losing access to information of private importance is surely concerning, but the situation is more worrying when we consider the role that digital networks play in our world today. Governments make official pronouncements online. Politicians campaign online. Writers and artists find audiences for their work and a place for their voice. Protest movements find traction and fellow travelers. And, of course, Twitter was a primary publishing platform of a certain U.S. president.

If Twitter were to fail entirely, all of this information could disappear from their site in an instant. This is an important part of our history. Shouldn’t we be trying to preserve it?

I’ve been working on these kinds of questions, and building solutions to some of them, for a long time. That’s part of why, over 25 years ago, I founded the Internet Archive. You may have heard of our “Wayback Machine,” a free service anyone can use to view archived web pages from the mid-1990s to the present. This archive of the web has been built in collaboration with over a thousand libraries around the world, and it holds hundreds of billions of archived webpages today, including those presidential tweets (and many others). In addition, we’ve been preserving all kinds of important cultural artifacts in digital form: books, television news, government records, early sound and film collections, and much more. 

The scale and scope of the Internet Archive can give it the appearance of something unique, but we are simply doing the work that libraries and archives have always done: Preserving and providing access to knowledge and cultural heritage. For thousands of years, libraries and archives have provided this important public service. I started the Internet Archive because I strongly believed that this work needed to continue in digital form and into the digital age. 

While we have had many successes, it has not been easy. Like the record labels, many book publishers  didn’t know what to make of the internet at first, but now they see new opportunities for financial gain. Platforms, too, tend to put their commercial interests first. Don’t get me wrong: Publishers and platforms continue to play an important role in bringing the work of creators to market, and sometimes assist in the preservation task. But companies close, and change hands, and their commercial interests can cut against preservation and other important public benefits. 

Traditionally, libraries and archives filled this gap. But in the digital world, law and technology make their job increasingly difficult. For example, while a library could always simply buy a physical book on the open market in order to preserve it on their shelves, many publishers and platforms try to stop libraries from preserving information digitally. They may even use technical and legal measures to prevent libraries from doing so. While we strongly believe that fair use law enables libraries to perform traditional functions like preservation and lending in the digital environment, many publishers disagree, going so far as to sue libraries to stop them from doing so. 

We should not accept this state of affairs. Free societies need access to history, unaltered by changing corporate or political interests. This is the role that libraries have played and need to keep playing. This brings us back to Twitter.

In 2010, Twitter had the tremendous foresight of engaging in a partnership with the Library of Congress to preserve old tweets. At the time, the Library of Congress had been tasked by Congress “to establish a national digital information infrastructure and preservation program.” It appeared that government and private industry were working together in search of a solution to the digital preservation problem, and that Twitter was leading the way.  

It was not long before the situation broke down. In 2011, the Library of Congress issued a report noting the need for “legal and regulatory changes that would recognize the broad public interest in long-term access to digital content,” as well as the fact that “most libraries and archives cannot support under current funding” the necessary digital preservation infrastructure. But no legal and regulatory changes have been forthcoming, and even before the 2011 report, Congress pulled tens of millions of dollars out of the preservation program. In these circumstances, it is perhaps unsurprising that, by 2017, the Library of Congress had ceased preserving most old tweets, and the National Digital Information Infrastructure and Preservation Program (NDIIPP) is no longer an active program at the Library of Congress. Furthermore, it is not clear whether Twitter’s new ownership will take further steps of its own to address the situation. 

Whatever Musk does, the preservation of our digital cultural heritage should not have to rely on the beneficence of one man. We need to empower libraries by ensuring that they have the same rights with respect to digital materials that they have in the physical world. Whether that means archiving old tweets, lending books digitally, or even something as exciting (to me!) as 21st century interlibrary loan, what’s important is that we have a nationwide strategy for solving the technical and legal hurdles to getting this done. 

What is the Democracy’s Library?

Illustration created with MidJourney

Democracies require an educated citizenry to flourish– and because of this, Democratic governments, at all levels, spend billions of dollars publishing reports, manuals, books, videos so that all can read and learn. That is the good news.  The bad news is that in our digital age, much of this is not accessible.   Democracy’s Library aims to change this.   

The aim of the Internet Archive Democracy’s Library is to collect, preserve and make freely available all the published works of all the democracies– the federal, provincial, and municipal government publications– so that we can efficiently learn from each other to solve our biggest challenges in parallel and in concert.

Democracy’s Library is the foundational information of free people.

We call this “Democracy’s Library” because Democracy is an open system that trusts its citizens to learn, grow and have independent agency. Democratic governments publish openly because they want important information spread widely.  There are no paywalls to the works of government, or there shouldn’t be. 

We need access to all the River reports so we can help understand and manage our declining clean water.   Access to Agricultural research to help farm more sustainably.  To Materials research to build better products and devices. To Local hearings on project results so other cities can overcome the same challenges.  To Training materials and text books for many professions.   All free– and in ways you can find them.

Bringing free public access to the public domain is the opportunity of the Internet– an infrastructure that effectively costs nothing to distribute information that has been collected and organized.

Yes, this will cost a small fortune, but it is within our grasp to collect and organize billions of documents and datasets, preserve the materials for the ages, and make them available for many purposes. While scoping projects in the United States and Canada have now begun, we estimate this project will cost at least $100 million. The big money has not been committed yet, and we’re still fundraising. But to get things kicked off, Filecoin Foundation (FF) and Filecoin Foundation for the Decentralized Web (FFDW) are supporting the project. The Internet Archive has ramped up collecting government websites and datasets as well as digitizing print materials with many library partners.

Thankfully, we do not have the rights and paywall problems that have been strangling the Internet’s best feature: an essentially free information distribution system.   

Democracy’s Library can be a free public library available on your phone and your laptop.  

Democracy’s Library will be the foundation of new services, both non-commercial and commercial, that leverage language understanding, machine learning, automatic translation, speech recognition, and visualizations.

Democracies publish openly; let’s take advantage of this. Let’s leverage our library system not just to lease commercial publishers’ database products, but to build open collections that everyone can use and reuse without limitation.

Let’s build direct conduits from governments into Democracy’s Library for long-term preservation and access. A public-public partnership that long served us in the paper era, and that took a pause in the mainframe era of commercial databases, can flourish again in the Internet era.

“Public Access to the Public Domain” can be a rallying cry for Democracy’s Library.

Democracy’s Library can be a flowering of information services for free people.

Please join in and help. Jamie Joyce of the Internet Archive (jamiejoyce@archive.org) is leading the effort in the United States; Andrea Mills of the Internet Archive Canada (andrea@archive.org) is leading the Canadian effort. The project is overseen by Brewster Kahle (brewster@archive.org).

If you’d like to stay connected, sign up for the #EmpoweringLibraries newsletter.

Have ideas?  Have materials?  Have a use case?  Have resources to bring to bear?   This can only happen if we work together.  

Let’s build Democracy’s Library, together.

Digital Books wear out faster than Physical Books

Ever try to read a physical book passed down in your family from 100 years ago?  Probably worked well. Ever try reading an ebook you paid for 10 years ago?   Probably a different experience. From the leasing business model of mega publishers to physical device evolution to format obsolescence, digital books are fragile and threatened.

For those of us tending libraries of digitized and born-digital books, we know that they need constant maintenance—reprocessing, reformatting, re-invigorating or they will not be readable or read. Fortunately this is what libraries do (if they are not sued to stop it). Publishers try to introduce new ideas into the public sphere. Libraries acquire these and keep them alive for generations to come.

And, to serve users with print disabilities, we have to keep up with the ever-improving tools they use.

Mega-publishers are saying electronic books do not wear out, but this is not true at all. The Internet Archive processes and reprocesses the books it has digitized as new optical character recognition technologies come around, as new text understanding technologies open new analysis, as formats change from djvu to daisy to epub1 to epub2 to epub3 to pdf-a and on and on. This work takes thousands of computer-months and programmer-years. This is what libraries have signed up for: our long-term custodial role.

Also, the digital media they reside on changes, too—from Digital Linear Tape to PATA hard drives to SATA hard drives to SSDs. If we do not actively tend our digital books they become unreadable very quickly.

Then there is cataloging and metadata. If we do not keep up with the ever-changing expectations of digital learners, then our books will not be found. This is ongoing and expensive.

Our paper books have lasted hundreds of years on our shelves and are still readable. Without active maintenance, we will be lucky if our digital books last a decade.

Also, how we use books and periodicals in the decades after they are published changes from how they were originally intended to be used. We are seeing researchers use books and periodicals in machine learning investigations to find trends that were never easy to find in a one-by-one world, or in the silos of the publisher databases. Preparing these books for this type of analysis is time consuming and now threatened by publishers’ lawsuits.

If we want future access to our digital heritage we need to make some structural changes:  changes to institution and publisher behaviors as well as supportive funding, laws, and enforcement.

The first step is to recognize that preserving and providing access to our digital heritage is a big job and one worth doing. Then, find ways that institutions (educational, government, non-profit, and philanthropic) could make preservation a part of our daily responsibility.

Long live books.

Illustration: midjourney AI generated.

We have added a Mastodon Server

The Internet Archive has recently set up its own server for Mastodon, a federated, decentralized open source social media package that has garnered lots of attention lately.

We use it in the ways that we use Twitter now (we are not leaving Twitter):
• @internetarchive@mastodon.archive.org for events, announcements, and fun things
• Staff accounts (e.g. my account @brewsterkahle@mastodon.archive.org) for, well, whatever.

Why?  We need a game with many winners, not just a few powerful players.  

Through our dweb work, the Internet Archive has catalyzed decentralized web technologies through conferences, summits, meet-ups and camps for 6 years. We need new tech to help with privacy, robustness, and work around issues of disinformation and corporate consolidation.  Mastodon is built on open standards so others can build alternative clients and integrate it into other systems.  

Looking forward to many social media alternatives: Blue Sky, Matrix, and many others. 

Personally, I want to see the evolution and combination of features of Slack, Twitter, SMS, Signal, email, Discord, Facebook, IRC, Zoom, Google Meet, and other ways we communicate. While we are at it, how about a more integrated environment of Zendesk, Jira, WordPress, and Google Docs? Free and open technologies that invite interoperability while communities maintain control would be ideal. And in my day-to-day I would love fewer systems to monitor that also limit my direct exposure to celebrities, influencers, and politicians. Oh, I can dream…


Please help us learn, this time about Mastodon.   Thank you, all!

Guide to the exhibition galleries of the Departament of Geology and Palaeontology in the British Museum pg 18

“Doors Open” — Go Behind–the-Scenes at the Physical Archive of the Internet Archive

Please join us on October 18th from 6:00 to 8:00 pm as we take a peek behind the doors of the Physical Archive in Richmond, California.

In anticipation of launching Democracy’s Library on October 19th we are excited to offer a behind-the-scenes tour of our physical collections of books, music, film, and video in Richmond, California.

With this special insider event we are opening the doors to an often unseen place. See the lifecycle of physical books acquired by the Internet Archive — donation, preservation, digitization, and access. We’ll also present samples from generous donations and acquisitions of books, records, microfiche, and film, and demonstrate the Archive’s high-end motion-picture film scanner.

We look forward to offering this glimpse into a very important part of the Internet Archive in its mission to bring Universal Access to All Knowledge. 

Light refreshments will be provided

RSVP HERE

Cost: $10

DOORS OPEN: 6 PM – 8 PM

ADDRESS: 2512 Florida Avenue Richmond, CA

THANK YOU FOR REGISTERING IN ADVANCE