The Wayback Machine, Archive.org, Archive-it.org, and OpenLibrary.org came back up in stages over the week after the cyberattacks, with some of the contributor features returning over the last couple of weeks. A few remain to be restored. Much of the development during this time has focused on securing the services so they can keep running while attacks continue.
The Internet Archive is adapting to a more hostile world, where DDOS attacks are recurring periodically (such as yesterday and today), and more severe attacks might happen. Our response has been to harden our services and learn from friends. This note is to share some high level findings, without being so detailed as to help those that are still attacking archive.org.
By tightening firewall technologies, we have changed how data flows through our systems to improve monitoring and control. The downside is these upgrades have forced changes to software, some of it quite old.
The bright side is this is forcing upgrades that we have long planned or hoped for. We are greatly helped by the free and open source community’s improving tools that can be used by large corporations as well as non-profit libraries because they are freely available.
Also, some commercial companies have offered assistance that would generally be prohibitively expensive. We are grateful for the support.
Where the Internet Archive has always focused on building collections and preserving them, we have been starkly reminded how important reliable access is to researchers, journalists, and readers. This is leading us to install technical defenses and increase staff to improve service availability.
Libraries in general, and the Internet Archive in particular, have been under attack for many years now. For us it started with the book publishers suing (about lending books), and now the recording industry (about 78rpm records), which is a drain on our staff and financial resources. Now recurring DDOS attacks distract us from the goals of preservation and access to our digital heritage.
We don’t know why these attacks have started recently and if they are coordinated, but we are building defenses.
We are grateful for the support from our patrons, through social media, through donations, and through offers of help, which frankly, makes it worthwhile to keep building a library for all of us.
Last week, along with a DDOS attack and exposure of patron email addresses and encrypted passwords, the Internet Archive’s website JavaScript was defaced, leading us to bring the site down to assess and improve our security.
The stored data of the Internet Archive is safe and we are working on resuming services safely. This new reality requires heightened attention to cyber security and we are responding. We apologize for the impact of these library services being unavailable.
The Wayback Machine, Archive-It, scanning, and national library crawls have resumed, as well as email, blog, helpdesk, and social media communications. Our team is working around the clock across time zones to bring other services back online. In coming days more services will resume, some starting in read-only mode as full restoration will take more time.
We’re taking a cautious, deliberate approach to rebuild and strengthen our defenses. Our priority is ensuring the Internet Archive comes online stronger and more secure.
On Sept 4, 2024, the US Court of Appeals in New York affirmed the lower court ruling in the lawsuit filed against us by Hachette Book Group, HarperCollins Publishers, John Wiley & Sons, and Penguin Random House. While the Internet Archive is disappointed by this opinion—it was never the Internet Archive’s intention to get into a lawsuit over lending digitized books—we respect the outcome.
To date, we have removed over 500,000 books from lending on archive.org (and therefore also openlibrary.org). While we are reviewing all available options, this judicial opinion will lead to the removal of many more books from lending. It is important for the Internet Archive and all libraries to continue to have a healthy relationship with publishers and authors.
Please be assured that millions of digitized books will still be available to those with print disabilities, small sections will be available for those linking into them from Wikipedia and through interlibrary loan, books will continue to be preserved for the long term, and other protected library uses will continue to inform digital learners everywhere.
The Internet Archive is also increasing its investment in digital books from publishers willing to sell ebooks that libraries can own and lend. While this is currently from a small number of publishers, the number is growing and we see it as a future for the long-term sustainability of authors, publishers, and libraries. Encouragingly, the Independent Publishers Group recently endorsed selling ebooks to libraries. The growing number of libraries purchasing and owning digital books brings fair compensation to authors and publishers, along with permanent preservation and access to authors’ works for communities everywhere.
We respect the opinion of the courts and, while we are saddened by how this setback affects our patrons and the future of all libraries, the Internet Archive remains strong and committed to our mission of Universal Access to All Knowledge. Thank you for your help and support.
Editorial note: This op-ed first ran in Time Magazine in 2021. We are reposting it here with permission as we head into oral argument for our appeal in the publishers’ lawsuit against our library, scheduled for next Friday, June 28, 2024.
When I started the Internet Archive 25 years ago, I focused our non-profit library on digital collections: preserving web pages, archiving television news, and digitizing books. The Internet Archive was seen as innovative and unusual. Now all libraries are increasingly electronic, and necessarily so. To fight disinformation, to serve readers during the pandemic, and to be relevant to 21st-century learners, libraries must become digital.
But just as the Web increased people’s access to information exponentially, an opposite trend has evolved. Global media corporations—emboldened by the expansive copyright laws they helped craft and the emerging technology that reaches right into our reading devices—are exerting absolute control over digital information. These two conflicting forces—towards unfettered availability and completely walled access to information—have defined the last 25 years of the Internet. How we handle this ongoing clash will define our civic discourse in the next 25 years. If we fail to forge the right path, publishers’ business models could eliminate one of the great tools for democratizing society: our independent libraries.
These are not small mom-and-pop publishers: a handful of publishers dominate all book sales and distribution, including trade books, ebooks, and textbooks. Right now, these corporate publishers are squeezing libraries in ways that may render it impossible for any library to own digital texts in five years, let alone 25. Soon, librarians will be reduced to customer service reps for a Netflix-like rental catalog of bestsellers. If that comes to pass, you might as well replace your library card with a credit card. That’s what these billion-dollar publishers are pushing.
The libraries I grew up with would buy books, preserve them, and lend them for free to their patrons. If my library did not have a particular book, then it would borrow a copy from another library for me. In the shift from print to digital, many commercial publishers are declaring each of these activities illegal: they refuse libraries the right to buy ebooks, preserve ebooks, or lend ebooks. They demand that libraries license ebooks for a limited time or for limited uses at exorbitant prices, and some publishers refuse to license audiobooks or ebooks to libraries at all, making those digital works unavailable to hundreds of millions of library patrons.
Although we’re best known for the Wayback Machine, a historical archive of the World Wide Web, the Internet Archive also buys ebooks from the few independent publishers that will sell, really sell, ebooks to us. With these ebooks, we lend them to one reader at a time, protected with the same technologies that publishers use to protect their ebooks. The Internet Archive also digitizes print books that were purchased or donated. Similarly, we lend them to one reader at a time, following a practice employed by hundreds of libraries over the last decade called “controlled digital lending.”
Last year,* four of the biggest commercial publishers in the world sued the Internet Archive to stop this longstanding library practice of controlled lending of scanned books. The publishers filed their lawsuit early in the pandemic, when public and school libraries were closed. In March 2020, more than one hundred shuttered libraries signed a statement of support asking that the Internet Archive do something to meet the extraordinary circumstances of the moment. We responded as any library would: making our digitized books available, without waitlists, to help teachers, parents, and students stranded without books. This emergency measure ended two weeks before the intended 14-week period.
The lawsuit demands that the Internet Archive destroy 1.4 million digitized books, books we legally acquired and scanned in cooperation with dozens of library partners. If the publishers win this lawsuit, then every instance of online reading would require the permission and license of a publisher. It would give publishers unprecedented control over what we can read and when, as well as troves of data about our reading habits.
Publishers’ bullying tactics have stirred lawmakers in Maryland, New York, Massachusetts, and Rhode Island to draft laws requiring that publishers treat libraries fairly. Maryland’s legislature passed the law unanimously. In those states, if an ebook is licensed to consumers, publishers will be required by law to license it to libraries on reasonable terms. But lobbyists for the publishing industry claim even these laws are unconstitutional. This is a dangerous state of affairs. Libraries should be free to buy, preserve, and lend all books regardless of the format.
Suing libraries is not a new tactic for these billion-dollar corporations and their surrogates: Georgia State University’s law library battled a copyright lawsuit for 12 years; HathiTrust Digital Library battled the Authors Guild for five years. In each case, the library organization won, but it took millions of dollars that libraries can ill afford.
Libraries spend billions of dollars on publishers’ products, supporting authors, illustrators, and designers. If libraries become mere customer service departments for publishers’ pre-packaged product lines, the role that librarians play in highlighting marginalized voices, providing information to the disadvantaged, and preserving cultural memory independent of those in power will be lost.
As we shift from print to digital, we can and must support institutions and practices that were refined over hundreds of years, starting with selling ebooks to readers and libraries.
So if we all handle this next phase of the Internet well, I believe the answer is, yes, there will be libraries in 25 years, many libraries—and many publishers, many booksellers, millions of compensated authors, and a society in which everyone will read good books.
*Editorial note: This op-ed was first published in 2021.
Last week Aruba launched the island nation’s digital heritage portal online: Coleccion Aruba. As trumpeted in Wired: “The Internet Archive Just Backed Up an Entire Caribbean Island,” but really the credit goes to Aruba. Digitizing their national cultural heritage (100k items) and putting it online for free public access is a huge achievement.
I met with the Prime Minister (pictured above), the Minister of Culture, and the Minister of Education who backed the efforts made by the National Librarian, National Archivist, and their digital strategist. Never have I seen such unified support for cultural preservation and access. They brought together people from the Dutch islands and the Internet Archive to share the news and to inspire and to lead.
Aruba was the first to sign onto the Four Digital Rights of Memory Institutions: right to Collect, Preserve, provide Access, and interlibrary Collaboration. These are bad times when we have to reclaim these rights that are being taken from all libraries, but Aruba is making a stand. Go Aruba!
If libraries are reduced to only subscribing to commercial database products rather than owning and curating collections, we will be beholden to external corporations and subject to their whims over what’s in licensed collections, and how patrons can access them. The “Spotify for Books” model is not the way we want our libraries to go.
To top it off, the Prime Minister, Evelyn Wever-Croes, inspired us when she told us that for the next generation, we need to “Give them the opportunity to search for the truth.” Yes.
Inspiring to see a country lead so well. I hope we have the honor of working with other nations that will also assert Digital Rights for Libraries, and live by those principles.
A good day in Washington. After two years of being on the Copyright Office Modernization Committee, helping advise the Copyright Office on their new registration and recordation process, a Republican and a Democrat from the House of Representatives held a hearing to ask questions of committee members. It was such a refreshing scene because it was bipartisan, they knew the issues, and they were spending time finding out what we suggested.
This all matters because the Copyright Office is moving to digital filings, which is an improvement, and because it could make way for efficient submissions of digital files. This would be a major way for the Library of Congress to get copies of books they would own, preserve, and make somewhat accessible.
Another attendee said they had gone to congressional meetings for 30 years and this one had the most engagement of any of them. A good day in Washington, indeed.
What just happened on archive.org today, as best we know:
Tens of thousands of requests per second for our public domain OCR files were launched from 64 virtual hosts on Amazon’s AWS services. (Even by web standards, tens of thousands of requests per second is a lot.)
This activity brought archive.org down for all users for about an hour.
We are thankful to our engineers who could scramble on a Sunday afternoon on a holiday weekend to work on this.
We got the service back up by blocking those IP addresses.
But, another 64 addresses started the same type of activity a couple of hours later.
We figured out how to block this new set, but again, with about an hour outage.
—-
How this could have gone better for us:
Those wanting to use our materials in bulk should start slowly, and ramp up.
Also, if you are starting a large project please contact us at info@archive.org, we are here to help.
If you find yourself blocked, please don’t just start again, reach out.
Again, please use the Internet Archive, but don’t bring us down in the process.
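For anyone scripting a bulk project, the advice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an official client: the function names, the starting and maximum request rates, and the set of HTTP statuses treated as "blocked" are all hypothetical choices made for the example. The `fetch` argument stands in for whatever download call you already use and is assumed to return an HTTP status code.

```python
import time
from typing import Callable, Iterable, Iterator, List

# Assumption: these statuses are treated as "blocked or overloading the server".
BLOCKED_STATUSES = {403, 429, 503}


def ramp_up_delays(start_rps: float, max_rps: float,
                   growth: float, step_every: int) -> Iterator[float]:
    """Yield the pause (in seconds) before each request: slow at first, then faster."""
    rate = start_rps
    sent = 0
    while True:
        yield 1.0 / rate
        sent += 1
        # After every `step_every` requests, raise the rate, up to a ceiling.
        if sent % step_every == 0:
            rate = min(max_rps, rate * growth)


def polite_fetch(urls: Iterable[str],
                 fetch: Callable[[str], int],
                 sleep: Callable[[float], None] = time.sleep,
                 start_rps: float = 1.0,
                 max_rps: float = 4.0,
                 growth: float = 2.0,
                 step_every: int = 100) -> List[str]:
    """Fetch URLs one at a time, ramping up slowly and stopping if blocked."""
    done: List[str] = []
    delays = ramp_up_delays(start_rps, max_rps, growth, step_every)
    for url, delay in zip(urls, delays):
        sleep(delay)
        status = fetch(url)
        if status in BLOCKED_STATUSES:
            # Do not retry or restart from other hosts: stop and get in touch.
            break
        done.append(url)
    return done
```

The loop stops at the first blocked response instead of retrying, which mirrors the request to reach out rather than start again. The returned list shows how far the job got, so it can be resumed later at an agreed pace.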
Chatbots, like OpenAI’s ChatGPT, Google’s Bard and others, have a hallucination problem (their term, not ours). They can make something up and state it authoritatively. It is a real problem. But there can be an old-fashioned answer, as a parent might say: “Look it up!”
Imagine for a moment the Internet Archive, working with responsible AI companies and research projects, could automate “Looking it Up” in a vast library to make those services more dependable, reliable, and trustworthy. How?
The Internet Archive and AI companies could offer an anti-hallucination service ‘add-on’ to the chatbots that could cite supporting evidence and counter claims to chatbot assertions by leveraging the library collections at the Internet Archive (most of which were published before generative AI).
By citing evidence for and against assertions based on papers, books, newspapers, magazines, TV, radio, and government documents, we can build a stronger, more reliable knowledge infrastructure for a generation that turns to their screens for answers. Although many of these generative AI companies are already linking, or intend to link, their models to the internet, what the Internet Archive can uniquely offer is our vast collection of “historical internet” content. We have been archiving the web for 27 years, which means we have decades of human-generated knowledge. This might become invaluable in an age when we might see a drastic increase in AI-generated content. So an Internet Archive add-on is not just a matter of leveraging knowledge available on the internet, but also knowledge available on the history of the internet.
Is this possible? We think yes, because we are already doing something like this for Wikipedia by hand and with special-purpose robots like Internet Archive Bot. Wikipedia communities, and these bots, have fixed over 17 million broken links, and have linked one million assertions to specific pages in over 250,000 books. With the help of the AI companies, we believe we can make this an automated process that could respond to the customized essays their services produce. Much of the same technologies used for the chatbots can be used to mine assertions in the literature and find when, and in what context, those assertions were made.
The result would be a more dependable World Wide Web, one where disinformation and propaganda are easier to challenge, and therefore weaken.
Yes, there are 4 major publishers suing to destroy a significant part of the Internet Archive’s book corpus, but we are appealing this ruling. We believe that one role of a research library like the Internet Archive is to own collections that can be used in new ways by researchers and the general public to understand their world.
What is required? Common purpose, partners, and money. We see a role for a Public AI Research laboratory that can mine vast collections without rights issues arising. While the collections are significant already, we see collecting, digitizing, and making available the publications of the democracies around the world to expand the corpus greatly.
We see roles for scientists, researchers, humanists, ethicists, engineers, governments, and philanthropists, working together to build a better Internet.
If you would like to be involved, please contact Mark Graham at mark@archive.org.
Hopefully we have a dataset primed for AI researchers to do something really useful, and fun: taking the noise out of digitized 78rpm records.
The Internet Archive has 1,600 examples of quality human restorations of 78rpm records where the best tools were used to ‘lightly restore’ the audio files. This takes away scratchy surface noise while trying not to impair the music or speech. Also in those items are the unrestored original files that were used.
But then the Internet Archive has over 400,000 unrestored files that are quite scratchy and difficult to listen to.
The goal, or rather the hope, is a program that can take all or many of the 400,000 unrestored records and make them much better. How hard this is is unknown, but hopefully it is a fun project to work on.
Many of the recordings are great and worth the effort. Please comment on this post if you are interested in diving in.
A post in the series about how the Internet Archive is using AI to help build the library.
Freely available Artificial Intelligence tools are now able to extract words sung on 78rpm records. The results may not be full lyrics, but we hope it can help browsing, searching, and researching.
Whisper is an open source tool from OpenAI “that approaches human level robustness and accuracy on English speech recognition.” We were surprised how far it could get with recognizing spoken words on noisy disks and even words being sung.
For instance in As We Parted At The Gate (1915) by Donald Chalmers, Harvey Hindermyer, and E. Austin Keith, the tool found the words:
[…] we parted at the gate, I thought my heart would shrink. Often now I seem to hear her last goodbye. And the stars that tune at night will never die as bright as they did before we parted at the gate. Many years have passed and gone since I went away once more, leaving far behind the girl I love so well. But I wander back once more, and today I pass the door of the cottade well, my sweetheart, here to dwell. All the roads they flew at fair, but the faith is missing there. I hear a voice repeating, you’re to live. And I think of days gone by with a tear so from her eyes. On the evening as we parted at the gate, as we parted at the gate, I thought my heart would shrink. Often now I seem to hear her last goodbye. And the stars that tune at night will never die as bright as they did before we parted at the gate.
All of the extracted texts are now available. We hope they are useful for understanding these early recordings. Bear in mind these are historical materials, so they may be offensive and may also be incorrectly transcribed.
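To make the "searching" idea concrete, here is a minimal sketch of a keyword index over extracted texts, in Python. It is an assumption-laden illustration, not how archive.org search works: the item identifiers are made up, the two sample texts are phrases taken from the transcription above, and the tokenizer is deliberately simple.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(transcripts: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each word to the set of item identifiers whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for item_id, text in transcripts.items():
        for word in tokenize(text):
            index[word].add(item_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the items that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical identifiers; the texts are short phrases from the sample above.
sample_transcripts = {
    "example-item-a": "we parted at the gate, I thought my heart would shrink",
    "example-item-b": "often now I seem to hear her last goodbye",
}
sample_index = build_index(sample_transcripts)
```

Calling `search(sample_index, "parted gate")` returns only the first item. Because the transcriptions can contain errors, an exact-word index like this will miss some recordings; it is a starting point for browsing rather than a complete lyrics search.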