Author Archives: Brewster Kahle

Internet Archive Joins IDS Project for Interlibrary Loan

The Internet Archive is pleased to announce it has joined the Information Delivery Services (IDS) Project, a mutually supportive resource-sharing cooperative whose 120 members include public and private academic libraries from across the country. As a member of the IDS Project, the Internet Archive expands its ability to support libraries and library patrons by making two million monographs and three thousand periodicals in its physical collections available for non-returnable interlibrary loan (ILL) fulfillment.

“The Internet Archive is a wonderful addition to the IDS Project’s team of libraries.  It is a great honor to be able to help IA reach more libraries and more patrons through the integration with IDS Logic,” said Mark Sullivan, Executive Director of the IDS Project.

If you want to learn more about the IDS Project and the Internet Archive, I will be speaking at the 17th Annual IDS Summer Conference on July 29th.

In addition to the IDS Project, the Internet Archive is also piloting a program with libraries through RapidILL. If there are other resource sharing efforts that we should investigate as we expand our ILL service, please reach out to me at brewster@archive.org.

Reflections as the Internet Archive turns 25

Photo by Rory Mitchell, The Mercantile, 2020 – CC by 4.0
(L-R) Brewster Kahle, Tamiko Thiel, Carl Feynman at Thinking Machines, May 1985. Photo courtesy of Tamiko Thiel.

A Library of Everything

As a young man, I wanted to help make a new medium that would be a step forward from Gutenberg’s invention hundreds of years before. 

By building a Library of Everything in the digital age, I thought the opportunity was not just to make it available to everybody in the world, but to make it better–smarter than paper. By using computers, we could make the Library not just searchable, but organizable; make it so that you could navigate your way through millions, and maybe eventually billions of web pages.

The first step was to make computers that worked for large collections of rich media. The next was to create a network that could tap into computers all over the world: the Arpanet that became the Internet. Next came augmented intelligence, which came to be called search engines. I then helped build WAIS–Wide Area Information Server–that helped publishers get online to anchor this new and open system, which came to be enveloped by the World Wide Web.  

By 1996, it was time to start building the library.

This library would have all the published works of humankind. This library would be available not only to those who could pay the $1 per minute that LexisNexis charged, or only at the most elite universities. This would be a library available to anybody, anywhere in the world. Could we take the role of a library a step further, so that everyone’s writings could be included–not only those with a New York book contract? Could we build a multimedia archive that contains not only writings, but also songs, recipes, games, and videos? Could we make it possible for anyone to learn about their grandmother in a hundred years’ time?

From the San Francisco Chronicle, Business Section, May 7, 1998. Photo by Jerry Telfer.

Not about an Exit or an IPO

From the beginning, the Internet Archive had to be a nonprofit because it contains everybody else’s things. Its motives had to be transparent. It had to last a long time.

In Silicon Valley, the goal is to find a profitable exit, either through acquisition or IPO, and go off to do your next thing. That was never my goal. The goal of the Internet Archive is to create a permanent memory for the Web that can be leveraged to make a new Global Mind. To find patterns in the data over time that would provide us with new insights, well beyond what you could do with a search engine.  To be not only a historical reference but a living part of the pulse of the Internet.

John Perry Barlow, lyricist for the Grateful Dead & founder of the Electronic Frontier Foundation, accepting the Internet Archive Hero Award, October 21, 2015. Photograph by Brad Shirakawa – CC by 4.0

Looking Way Back

My favorite things from the early era of the Web were the dreamers. 

In the early Web, we saw people trying to make a more democratic system work. People tried to make publishing more inclusive.

We also saw the other parts of humanity: the pornographers, the scammers, the spammers, and the trolls. They, too, saw the opportunity to realize their dreams in this new world. At the end of the day, the Internet and the World Wide Web–it’s just us. It’s just a history of humankind. And it has been an experiment in sharing and openness.

The World Wide Web at its best is a mechanism for people to share what they know, almost always for free, and to find one’s community no matter where you are in the world. 

Brewster Kahle speaking at the 2019 Charleston Library Conference. Photo by Corey Seeman – CC by 4.0

Looking Way Forward

Over the next 25 years, we have a very different challenge. It’s solving some of the big problems with the Internet that we’re seeing now. Will this be our medium or will it be theirs? Will it be for a small controlling set of organizations or will it be a common good, a public resource? 

So many of us trust the Web to find recipes, how to repair your lawnmower, where to buy new shoes, who to date. Trust is perhaps the most valuable asset we have, and squandering that trust will be a global disaster. 

We may not have achieved Universal Access to All Knowledge yet, but we still can.

In another 25 years, we can have writings from not a hundred million people, but from a billion people, preserved forever. We can have compensation systems that aren’t driven by advertising models that enrich only a few. 

We can have a world with many winners, with people participating, finding communities of like-minded people they can learn from all over the world.  We can create an Internet where we feel in control. 

I believe we can build this future together. You have already helped the Internet Archive build this future. Over the last 25 years, we’ve amassed billions of pages, 70 petabytes of data to offer to the next generation. Let’s offer it to them in new and exciting ways. Let’s be the builders and dreamers of the next twenty-five years.

See a timeline of Key Moments in Access to Knowledge, videos & an invitation to our 25th Anniversary Virtual Celebration at anniversary.archive.org.

Internet Archive Launches New Pilot Program for Interlibrary Loan

Photo by Alfons Morales on Unsplash

The pandemic has resulted in a renewed focus on resource sharing among libraries. In addition to joining resource sharing organizations like the Boston Library Consortium, the Internet Archive has started to participate in the longstanding library practice of interlibrary loan (ILL). 

Internet Archive is now making two million monographs and three thousand periodicals in its physical collections available for non-returnable fulfillment through a pilot program with RapidILL, a prominent ILL coordination service. To date, more than seventy libraries have added the Internet Archive to their reciprocal lending list, and Internet Archive staff are responding to, on average, twenty ILL requests a day. If your library would like to join our pilot in Rapid, please reach out to Mike Richins at Mike.Richins@exlibrisgroup.com and request that Internet Archive be added to your library’s reciprocal lending list.

If there are other resource sharing efforts that we should investigate as we pilot our ILL service, please reach out to Brewster Kahle at brewster@archive.org.

Thank you Ubuntu and Linux Communities

The Internet Archive is wholly dependent on Ubuntu and the Linux communities that create a reliable, free (as in beer), free (as in speech), rapidly evolving operating system. It is hard to overestimate how important that is to creating services such as the Internet Archive.

When we started the Internet Archive in 1996, Sun and Oracle donated technology and we bought tape robots. By 1999, we shifted to inexpensive PCs in a cluster, running varying Linux distributions.

At this point, almost everything that runs on the servers of the Internet Archive is free and open-source software. (I believe our JP2 compression library may be the only piece of proprietary software we use.)

For a decade now, we have been upgrading the operating system on the cluster to Ubuntu’s long-term support (LTS) server releases. Thank you, thank you. We have never paid anything for it, but we submit code patches as the need arises.

Does anyone know the number of contributors to all the Linux projects that make up the Ubuntu distribution? How many tens or hundreds of thousands? Staggering.   

Ubuntu has ensured that every six months a better release comes out, and every two years a long-term release comes out. Like clockwork. Kudos. I am sure it is not easy, but it is inspiring, valuable and important to the world.

We started with Linux in 1997 and with Ubuntu in 2004, at the Warty Warthog server release, and we are now in the process of moving to Focal Fossa (Ubuntu 20.04).

Depending on free and open software is the smartest technology move the Internet Archive ever made.

1998: https://www.sfgate.com/business/article/Archiving-the-Internet-Brewster-Kahle-makes-3006888.php

Internet Archive servers running at the Bibliotheca Alexandrina, circa 2002.

2002: https://archive.org/about/bibalex.php

2013: https://www.theguardian.com/technology/2013/apr/26/brewster-kahle-internet-archive


2021: Internet Archive

Discogs Thank You! A commercial community site with bulk data access

https://thequietus.com/articles/24529-discogs-more-than-200-million-dollars

Discogs has cracked the nut, struck the right balance, and is therefore an absolute Internet treasure. Thank you.

If you don’t know them, Discogs is a central resource for the LP/78/CD music communities, and as Wikipedia said “As of 28 August 2019 Discogs contained over 11.6 million releases, by over 6 million artists, across over 1.3 million labels, contributed from over 456,000 contributor user accounts—with these figures constantly growing…”

When I met the founder, Kevin Lewandowski, a year ago, he said the Portland-based company supports 80 employees and is growing. They make money by being a marketplace for buyers and sellers of discs. An LP dealer I met in Oklahoma sells most of his discs through Discogs as well as at record fairs.

The data about records is spectacularly clean. Compare it to eBay, where the data is scattershot, and you have something quite different and reusable. It combines the best parts of MusicBrainz, CDDB, and eBay–users can catalog their collections and buy and sell records. By starting with the community function, Kevin said, the quality started out really good, and adding the marketplace later led to its success.

But there is something else Discogs does that sets it apart from many other commercial websites, and this makes All The Difference:

Discogs also makes their data available, in bulk, and with a free-to-use API.

The Great 78 Project has leveraged this bulk database to help find the dates of release for 78s. Just yesterday, I downloaded the new dataset and added it to our 78rpm date database; in the last year tens of thousands more 78s were added to Discogs, and we found 1,500 more dates for our existing 78s. Thank you!

The Internet Archive Lost Vinyl Project leverages the API by looking up records we will be digitizing to find track listings.
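For a sense of what this bulk access makes possible, here is a minimal sketch (Python standard library only) of streaming release dates out of a Discogs-style releases XML dump. The element names below mirror the monthly dump format as best I can tell; treat them as an assumption rather than a spec, and note the real file is gzipped and multi-gigabyte, which is why the code streams and frees elements as it goes.

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Tiny stand-in for the monthly releases dump: one <release> per record.
SAMPLE = b"""<releases>
  <release id="12345" status="Accepted">
    <title>Some 78rpm Side</title>
    <released>1947</released>
  </release>
  <release id="67890" status="Accepted">
    <title>Undated Pressing</title>
  </release>
</releases>"""

def release_dates(stream):
    """Stream over <release> records, yielding (id, title, released)."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "release":
            yield (elem.get("id"),
                   elem.findtext("title"),
                   elem.findtext("released"))  # None when no date is known
            elem.clear()  # free memory; essential on the full dump

dates = {rid: rel for rid, title, rel in release_dates(BytesIO(SAMPLE)) if rel}
print(dates)  # {'12345': '1947'}
```

A date lookup table like `dates` is all a project such as the Great 78 Project needs to join Discogs release years against its own catalog.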

A donor to our CD project used the public price information to appraise the CDs he donated for a tax write-off.

We want to add links back from Discogs to the Internet Archive and they have not allowed that yet (please please), but there is always something more to do.

I hope other sites, even commercial ones, will allow bulk access to their data (an API is not enough).

Thank you, Discogs.

FOSS wins again: Free and Open Source Communities come through on 19th Century Newspapers (and Books and Periodicals…)

I have never been more encouraged by, and thankful to, the Free and Open Source communities. Three months ago I posted a request for help with OCR’ing and processing 19th century newspapers, and we got soooo many offers to help. Thank you–that was heartwarming and concretely helpful. Already, based on these suggestions, we are switching our OCR and PDF software completely to FOSS, making big improvements, and building partnerships with FOSS developers–in companies, in universities, and as individuals–that will propel the Internet Archive to much better digitized texts. I am so grateful. So encouraging.

I posted a plea for help on the Internet Archive blog: Can You Help us Make the 19th Century Searchable? We got many social media offers and over 50 comments on the post–maybe a record response rate.

We are already changing over our OCR to Tesseract/OCRopus and leveraging many PDF libraries to create compressed, accessible, and archival PDFs.
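Tesseract can emit hOCR, an HTML-based format that carries word bounding boxes and confidences alongside the recognized text, which is what makes a hidden text layer in a PDF possible. As a rough illustration (not our production code), here is how one might pull words, boxes, and confidences out of an hOCR fragment with the Python standard library:

```python
import re
import xml.etree.ElementTree as ET

# Minimal hOCR fragment of the kind Tesseract emits; real files carry
# full page/paragraph/line structure around these word spans.
HOCR = """<div class="ocr_page">
  <span class="ocrx_word" title="bbox 10 12 58 30; x_wconf 96">Liberty</span>
  <span class="ocrx_word" title="bbox 64 12 120 30; x_wconf 41">Bell</span>
</div>"""

def words_with_boxes(hocr):
    """Yield (text, (x0, y0, x1, y1), confidence) for each ocrx_word span."""
    for span in ET.fromstring(hocr).iter("span"):
        if "ocrx_word" not in span.get("class", ""):
            continue
        title = span.get("title", "")
        bbox = tuple(int(n) for n in
                     re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title).groups())
        conf = int(re.search(r"x_wconf (\d+)", title).group(1))
        yield span.text, bbox, conf

words = list(words_with_boxes(HOCR))
```

Low-confidence words (like the second one above) are exactly the spots where retraining on corrected text pays off.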

Several people suggested the German government-led initiative called OCR-D, which has made production-level tools for helping OCR and segment complex and old materials such as newspapers in the old German script Fraktur, or blackletter. (The Internet Archive had never been able to process these, and now we are doing it at scale.) We are also able to OCR more Indian languages, which is fantastic. This government project is FOSS, and has money for outreach to make sure others use the tools: this is a step beyond most research grants.

Tesseract has made major steps forward in the last few years. When we last evaluated its accuracy it was not as good as the proprietary OCR, but that has changed: our new evaluations show it is just as good, and it can get better for our application because of its new architecture.

Underlying the new Tesseract is an LSTM engine similar to the one developed for OCRopus2/ocropy, a project led by Tom Breuel (funded by Google, his former German university, and probably others–thank you!). He has continued working on the project even after leaving academia. A machine-learning-based engine also introduces us to GPU-based processing, which is an extra win, and it can be trained on corrected texts so it can keep getting better.

Proprietary example from an Anti-Slavery newspaper from my blog post:

New one, based on free and open source software that is still faulty but better:

The compute time on our cluster is approximately the same, but if we add GPUs we should be able to speed up OCR and PDF creation, maybe 10 times, which would help a great deal since we are processing millions of pages a day.

The PDF generation is a balance between achieving small file size, rendering quickly in browsers, providing useful functionality (text search, page numbers, cut-and-paste of text), and complying with archival (PDF/A) and accessibility (PDF/UA) standards. At the heart of the new PDF generation is the “archive-pdf-tools” Python library, which performs Mixed Raster Content (MRC) compression, creates a hidden text layer using a modified Tesseract PDF renderer that can read hOCR files as input, and ensures the PDFs are compatible with archival standards (VeraPDF is used to verify every PDF that we generate against the archival PDF standards). The MRC compression decomposes each image into a background, a foreground, and a foreground mask, heavily compressing (and sometimes downscaling) each layer separately. The mask is compressed losslessly, ensuring that the text and lines in an image do not suffer from compression artifacts and look clear. Using this method, we observe a 10x compression factor for most of our books.
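To make the MRC idea concrete, here is a toy decomposition in pure Python (not the archive-pdf-tools implementation, just an illustration of why splitting a page into three layers pays off): the mask keeps glyph edges exact, while the background and foreground become smooth, low-resolution layers that compress very well.

```python
# Toy grayscale "page": 0 = black ink, 255 = white paper.
page = [
    [255, 250, 245, 250],
    [250,   0,   0, 245],
    [245,   0,   0, 250],
    [250, 245, 250, 255],
]

THRESHOLD = 128

# 1. Mask: 1 where a pixel is "ink". Stored losslessly in real MRC,
#    so text and line edges stay crisp.
mask = [[1 if px < THRESHOLD else 0 for px in row] for row in page]

# 2. Background: the page with ink pixels filled in as paper white, then
#    downscaled 2x by block averaging; this layer tolerates lossy,
#    low-resolution compression.
paper = [[255 if m else px for px, m in zip(row, mrow)]
         for row, mrow in zip(page, mask)]
background = [[sum(paper[2*i + di][2*j + dj]
                   for di in range(2) for dj in range(2)) // 4
               for j in range(len(paper[0]) // 2)]
              for i in range(len(paper) // 2)]

# 3. Foreground: the ink color (a single average in this toy), which also
#    varies slowly and compresses heavily in the real pipeline.
ink = [px for row, mrow in zip(page, mask)
       for px, m in zip(row, mrow) if m]
foreground = sum(ink) // len(ink)
```

The decoder would upscale the background, paint the foreground color through the mask, and recover a page whose text is pixel-exact even though two of the three layers were stored small and lossy.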

The PDFs themselves are created using the high-performance MuPDF library and its PyMuPDF Python bindings: both projects were supportive and promptly fixed various bugs, which propelled our efforts forward.

And best of all, we have expanded our community to include people all over the world who are working together to make cultural materials more available. We now have a Slack channel for OCR researchers and implementers that you can join if you would like (to join, drop an email to merlijn@archive.org). We look to contribute software and datasets to these projects to help them improve (led by Merlijn Wajer and Derek Fukumori).

Next steps to fulfill the dream of Vannevar Bush’s Memex, Ted Nelson’s Xanadu, Michael Hart’s Project Gutenberg, Tim Berners-Lee’s World Wide Web, and Raj Reddy’s call for Universal Access to All Knowledge (now the Internet Archive’s mission statement):

  • Find articles in periodicals, and extract their titles, authors, and footnotes
  • Link footnote citations to other documents
  • OCR Balinese palm-leaf manuscripts based on 17,000 hand-entered pages
  • Improve Tesseract page handling to improve OCR and segmentation
  • Improve epub creation, including images from pages
  • Improve OCRopus by creating training datasets

Any help here would be most appreciated.

Thank you, Free and Open Source Communities!  We are glad to be part of such a sharing and open world.

Want Some Terabytes from the Internet Archive to Play With?

There are many computer science, decentralized storage, and digital humanities projects looking for data to play with. You came to the right place: the Internet Archive offers cultural information available to web users and dataminers alike.

While many of our collections have rights issues and so require agreements and conversation, many others are openly available for public, bulk downloading.

Here are three collections: one of movies, another of audiobooks, and a third of scanned public domain books from the Library of Congress. If you have a Macintosh or Linux machine, you can use it to run the command lines below. If you run each for a little while you can get just a few of the items (so you do not need to download terabytes).

These items are also available via bittorrent, but we find the Internet Archive command line tool is really helpful for this kind of thing:

$ curl -LOs https://archive.org/download/ia-pex/ia
$ chmod +x ia
$ ./ia download --search="collection:prelinger"           # 17TB of public domain movies
$ ./ia download --search="collection:librivoxaudio"       # 20TB of public domain audiobooks
$ ./ia download --search="collection:library_of_congress" # 166,000 public domain books from the Library of Congress (60TB)

Here is a way to figure out how much data is in each:

apt-get install jq > /dev/null
./ia search "collection:library_of_congress" -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
./ia search "collection:librivoxaudio" -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
./ia search "collection:prelinger" -f item_size | jq -r .item_size | paste -sd+ - | bc | numfmt --grouping
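The same tally can be done in a few lines of Python. This sketch assumes the JSON-lines output shape that the jq pipeline above also assumes (one object per line with an `item_size` field), with inline sample data standing in for a real `./ia search` run:

```python
import json

# Stand-in for `./ia search "collection:..." -f item_size` output,
# which is one JSON object per line; a real run streams thousands of rows.
rows = """\
{"identifier": "movie-one", "item_size": 1234567890}
{"identifier": "movie-two", "item_size": 987654321}
"""

total = sum(json.loads(line)["item_size"] for line in rows.splitlines())
print(f"{total:,}")  # same digit grouping that numfmt --grouping produces
```

Piping a real search into this (reading `sys.stdin` instead of `rows`) avoids the `jq`/`paste`/`bc` chain entirely.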

Sorry to say we do not yet have a support group for people using these tools or finding out what data is available, so for the time being you are pretty much on your own.

Can You Help us Make the 19th Century Searchable?

In 1847, Frederick Douglass started a newspaper advocating the abolition of slavery that ran until 1851.  After the Civil War, there was a newspaper for freed slaves, the Freedmen’s Record.  The Internet Archive is bringing these and many more works online for free public access. But there’s a problem: 

Our Optical Character Recognition (OCR), while the best commercially available OCR technology, is not very good at identifying text from older documents.  

Take for example, this newspaper from 1847. The images are not that great, but a person can read them:

The problem is that our computers’ optical character recognition tech gets it wrong, and the columns get confused.

What we need is “Culture Tech” (a riff on fintech, or biotech) and Culture Techies to work on important and useful projects–the things we need, but are probably not going to get gushers of private equity interest to fund. There are thousands of professionals taking on similar challenges in the field of digital humanities and we want to complement their work with industrial-scale tech that we can apply to cultural heritage materials.

One such project would be to work on technologies to bring 19th-century documents fully digital. We need to improve OCR to enable full-text search, but we also need help segmenting documents into columns and articles. The Internet Archive has lots of test materials, and thousands of people are uploading more documents all the time.

What we do not have is a good way to integrate work on these projects with the Internet Archive’s processing flow.  So we need help and ideas there as well.

Maybe we can host an “Archive Summer of CultureTech” or something… just ideas. Maybe working with a university department that would want to build programs and classes around Culture Tech… If you have ideas or skills to contribute, please post a comment here or send an email to info@archive.org with your ideas.

Libraries lend books, and must continue to lend books: Internet Archive responds to publishers’ lawsuit

Yesterday, the Internet Archive filed our response to the lawsuit brought by four commercial publishers to end the practice of Controlled Digital Lending (CDL), the digital equivalent of traditional library lending. CDL is a respectful and secure way to bring the breadth of our library collections to digital learners. Commercial ebooks, while useful, only cover a small fraction of the books in our libraries. As we launch into a fall semester that is largely remote, we must offer our students the best information to learn from—collections that were purchased over centuries and are now being digitized. What is at stake with this lawsuit? Every digital learner’s access to library books. That is why the Internet Archive is standing up to defend the rights of  hundreds of libraries that are using Controlled Digital Lending.

The publishers’ lawsuit aims to stop the longstanding and widespread library practice of Controlled Digital Lending, and stop the hundreds of libraries using this system from providing their patrons with digital books. Through CDL, libraries lend a digitized version of the physical books they have acquired as long as the physical copy doesn’t circulate and the digital files are protected from redistribution. This is how Internet Archive’s lending library works, and has for more than nine years. Publishers are seeking to shut this library down, claiming copyright law does not allow it. Our response is simple: Copyright law does not stand in the way of libraries’ rights to own books, to digitize their books, and to lend those books to patrons in a controlled way.  


“The Authors Alliance has several thousand members around the world and we have endorsed the Controlled Digital Lending as a fair use,” stated Pamela Samuelson, Authors Alliance founder and Richard M. Sherman Distinguished Professor of Law at Berkeley Law. “It’s really tragic that at this time of pandemic that the publishers would try to basically cut off even access to a digital public library like the Internet Archive…I think that the idea that lending a book is illegal is just wrong.”

These publishers clearly intend this lawsuit to have a chilling effect on Controlled Digital Lending at a moment in time when it can benefit digital learners the most. For students and educators, the 2020 fall semester will be unlike any other in recent history. From K-12 schools to universities, many institutions have already announced they will keep campuses closed or severely limit access to communal spaces and materials such as books because of public health concerns. The conversation we must be having is: how will those students, instructors and researchers access information — from textbooks to primary sources? Unfortunately, four of the world’s largest book publishers seem intent on undermining both libraries’ missions and our attempts to keep educational systems operational during a global health crisis.

Ten percent of the world’s population experience disabilities that impact their ability to read. For these learners, digital books are a lifeline. The publishers’ lawsuit against the Internet Archive calls for the destruction of more than a million digitized books.

The publishers’ lawsuit does not stop at seeking to end the practice of Controlled Digital Lending. These publishers call for the destruction of the 1.5 million digital books that Internet Archive makes available to our patrons. This form of digital book burning is unprecedented and unfairly disadvantages people with print disabilities. For the blind, ebooks are a lifeline, yet less than one in ten exists in accessible formats. Since 2010, Internet Archive has made our lending library available to the blind and print disabled community, in addition to sighted users. If the publishers are successful with their lawsuit, more than a million of those books would be deleted from the Internet’s digital shelves forever.

I call on the executives at Hachette, HarperCollins, Wiley, and Penguin Random House to come together with us to help solve the pressing challenges to access to knowledge during this pandemic. Please drop this needless lawsuit.

Libraries have been bringing older books to digital learners: Four publishers sue to stop it

I wanted to share my thoughts in response to the lawsuit against the Internet Archive filed on June 1 by the publishers Hachette, HarperCollins, Wiley, and Penguin Random House.

I founded the Internet Archive, a non-profit library, 24 years ago as we brought the world digital. As a library we collect and preserve books, music, video and webpages to make a great Internet library.

We have had the honor to partner with over 1,000 different libraries, such as the Library of Congress and the Boston Public Library, to accomplish this by scanning books and collecting webpages and more. In short, the Internet Archive does what libraries have always done: we buy, collect, preserve, and share our common culture.

But remember March of this year—we went home on a Friday and were told our schools were not reopening on Monday. We got cries for help from teachers and librarians who needed to teach without physical access to the books they had purchased.

Over 130 libraries endorsed lending books from our collections, and we used Controlled Digital Lending technology to do it in a controlled, respectful way.  We lent books that we own—at the Internet Archive and also the other endorsing libraries. These books were purchased and we knew they were not circulating physically. They were all locked up. In total, 650 million books were locked up just in public libraries alone.  Because of that, we felt we could, and should, and needed to make the digitized versions of those books available to students in a controlled way to help during a global emergency. As the emergency receded, we knew libraries could return to loaning physical books and the books would be withdrawn from digital circulation. It was a lending system that we could scale up immediately and then shut back down again by June 30th.

And then, on June 1st, we were sued by four publishers, who demanded that we stop lending digitized books in general and that we permanently destroy millions of digital books. Even though the temporary National Emergency Library was closed before June 30th, its planned end date, and we are back to traditional controlled digital lending, the publishers have not backed down.

Schools and libraries are now preparing for a “Digital Fall Semester” for students all over the world, and the publishers are still suing.

Please remember that what libraries do is Buy, Preserve, and Lend books.

Controlled Digital Lending is a respectful and balanced way to bring our print collections to digital learners. A physical book, once digital, is available to only one reader at a time. Going on for nine years and now practiced by hundreds of libraries, Controlled Digital Lending is a longstanding, widespread library practice.

What is at stake with this suit may sound insignificant—that it is just Controlled Digital Lending—but please remember: this is fundamental to what libraries do: buy, preserve, and lend.

With this suit, the publishers are saying that in the digital world, we cannot buy books anymore, we can only license and on their terms; we can only preserve in ways for which they have granted explicit permission, and for only as long as they grant permission; and we cannot lend what we have paid for because we do not own it.  This is not a rule of law, this is the rule by license. This does not make sense. 

We say that libraries have the right to buy books, preserve them, and lend them even in the digital world. This is particularly important with the books that we own physically, because learners now need them digitally.

This lawsuit is already having a chilling impact on the Digital Fall Semester we’re about to embark on. The stakes are high for so many students who will be forced to learn at home via the Internet or not learn at all.  

Librarians, publishers, authors—all of us—should be working together during this pandemic to help teachers, parents and especially the students.

I call on the executives at Hachette, HarperCollins, Wiley, and Penguin Random House to come together with us to help solve the pressing challenges to access to knowledge during this pandemic. 


Please drop this needless lawsuit.  

–Brewster Kahle, July 22, 2020