Author Archives: Brewster Kahle

Weaving Books into the Web—Starting with Wikipedia

[announcement video, Wired]

The Internet Archive has transformed 130,000 references to books in Wikipedia into live links to 50,000 digitized Internet Archive books in several Wikipedia language editions including English, Greek, and Arabic. And we are just getting started. By working with Wikipedia communities and scanning more books, both users and robots will link many more book references directly into Internet Archive books. In these cases, diving deeper into a subject will be a single click.

Moriel Schottlender, Senior Software Engineer at the Wikimedia Foundation, in a speech announcing this program

“I want this,” said Brewster Kahle’s neighbor Carmen Steele, age 15. “At school I am allowed to start with Wikipedia, but I need to quote the original books. This allows me to do this even in the middle of the night.”

For example, the Wikipedia article on Martin Luther King, Jr. cites the book To Redeem the Soul of America, by Adam Fairclough. That citation now links directly to page 299 inside the digital version of the book provided by the Internet Archive. There are 66 cited and linked books on that article alone.

In the Martin Luther King, Jr. article of Wikipedia, page references can now take you directly to the book.
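For the technically curious, these deep links use archive.org’s details-page URL scheme, which can address an individual page of a digitized book. A minimal sketch (the item identifier below is made up for illustration, not the real one for this title):

# Build an archive.org link that opens a digitized book to a specific page.
def book_page_url(identifier: str, page: int) -> str:
    return f"https://archive.org/details/{identifier}/page/{page}"

# e.g. a citation pointing at page 299 of a book
# ("exampleidentifier00" is a hypothetical item identifier):
print(book_page_url("exampleidentifier00", 299))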

Readers can see a couple of pages to preview the book and, if they want to read further, they can borrow the digital copy using Controlled Digital Lending in a way that’s analogous to how they borrow physical books from their local library.

“What has been written in books over many centuries is critical to informing a generation of digital learners,” said Brewster Kahle, Digital Librarian of the Internet Archive. “We hope to connect readers with books by weaving books into the fabric of the web itself, starting with Wikipedia.”

You can help accelerate these efforts by sponsoring books or funding the effort. It costs the Internet Archive about $20 to digitize and preserve a physical book in order to bring it to Internet readers. The goal is to bring another 4 million important books online over the next several years.  Please donate or contact us to help with this project.

From a presentation on October 23, 2019 by Moriel Schottlender, Tech lead at the Wikimedia Foundation.

“Together we can achieve Universal Access to All Knowledge,” said Mark Graham, Director of the Internet Archive’s Wayback Machine. “One linked book, paper, web page, news article, music file, video and image at a time.”


Thank you for the donation of 78rpm records from a Craigslist poster

Mark Ellis alerted us to a Craigslist post: a storage locker of records being offered for free in San Jose, available for only 2 hours. The owner wanted them gone. The Internet Archive sprang into action and our truck rolled.

Lots of people had responded to the ad wanting specific records for free, but not that many wanted the 78rpm records. We love 78rpm records. We preserve them and digitize the ones we do not already have for the Great 78 Project. In the end we got one pallet full of 78s, maybe 2,700 discs, and they are queued for digitization.

Thank you to Joey Myers for posting on Craigslist, to Mark Ellis for alerting Jason Scott of the Archive, and to the Archive staff who jumped on it.

Correct Metadata is Hard: a Lesson from the Great 78 Project

We have been digitizing about 8,000 78rpm record sides each month and now have 122,000 of them done. These have been posted on the net, and over a million people have explored them. For each record we digitize the audio, type in the information on the label, and link to other information like discographies, databases, and reviews.

Volunteers, users, and internal QA checkers have pointed out typos, so we decided to go back over a couple of months’ metadata, and we found problems. Then we contracted with professional proofreaders, and they found even more (at this point 2% of the records had something to point out; some are matters of opinion or aesthetics, some lead to corrections).

We are going to pay the professional proofreaders to correct the 5 most important fields for all 122,000 records, but we can use more help. We are sharing these examples in hopes of interesting volunteer proofreaders, and to pass along our experience in continually improving our collections.

Here are some of the issues with the primary performer field that we have now corrected in the June 2019 transfers (shown as before | after), which we hope to upload in the next couple of weeks:

Jose Melis And His Latin American Ensemble | Jose Melis And His-Latin American Ensemble
Columbia-Orchestra | Columbia-Orchester
S. Formichi and T. Chelotti | S. Formichi e T. Chelotti
Dennis Daye and The Rhythmaires | Dennis Day and The Rhythmaires
Harry James and His Orchestra | Harry James and His Orch.
Charles Hart & Elliot Shaw | Charles Hart & Elliott Shaw
Peerless Quartet | Peerless Quartette

Some of the title corrections:

O Vino Fa ‘Papla (Wine Makes You Talk) | ‘O Vino Fa ‘Papla (Wine Makes You Talk)
Masked Ball Salaction | Masked Ball Selection
Moonlight and Roses (Brings Mem’ries Of You) | Moonlight and Roses (Bring Mem’ries Of You)
Que Bonita Eres Tu (You Are Beutiful) | Que Bonita Eres Tu (You Are Beautiful)
Buttered Roll | “Buttered Roll”
Paradise | “Paradise”
Got a Right to Cry | “Got a Right to Cry”
Blue Moods | “Blue Moods”
Auf Wiederseh’n Sweerheart | Auf Wiederseh’n Sweetheart
George M. Cohan Medley – Part 1 | George M. Cohan Medley – Part 2
Dewildered | Bewildered
Lolita (Seranata) | Lolita (Serenata)
Got a Right to Cry | “Got a Right to Cry” Joe Liggins and His Honeydrippers
Blue Moods | “Blue Moods”
Body and Soul | “Body and Soul”
Mais Qui Est-Ce | Mais Qui Est-Ce?
Wail Till the Sun Shines Nellie Blues | Wait Till the Sun Shines Nellie Blues
Que Te Pasa Joe (What Happens Joe) | Que Te Pasa Jose (What Happens Joe)
SAMSON AND DELILAH Softly Awakens My Heart | SAMSON AND DELILAH Softly Awakes My Heart
I’m Gonna COO, COO, COO | (I’m Gonna) COO, COO, COO

Most 20th Century Books Unavailable to Internet Users – We Can Fix That

The books of the 20th century are largely not online. They are mostly not available from even the biggest booksellers. And libraries that have collected hard copies of these books have not been able to deliver them in a cost-efficient, simple, digital form to their patrons.

The way libraries could fill that gap is to adopt and deliver a controlled digital lending service. The Internet Archive is trying to do its part but needs others to join in. 

The Internet Archive has worked with 500 libraries over the last 15 years to digitize 3.5 million books. But because of copyright concerns, the selection has often been restricted to pre-1923 books. We need complete libraries and comprehensive access to nurture a well-informed citizenry. The following graph shows the number of books digitized by the Internet Archive, binned by decade:

Up until 1923 the graph shows our collection increasing, mirroring the rise in publications. Then it dips and slows because of concerns and confusion about copyright protections for books published after that date. It picks up again in the 1990s because these books are more readily available and separate funding has helped us digitize some recent modern books. Nevertheless, the end result is that the gap is big: the digital world is missing a huge chunk of the 20th century.

Users can’t even fill that gap by buying the books from that time period. According to a recent paper by Professor Rebecca Giblin, the commercial life of a book is typically exhausted 1.4 to 5 years from publication; some 90% of titles become unavailable in physical form within just two years. Most older books are therefore not available to be purchased in either physical or digital form. The following graph, pulled from a study by Professor Paul Heald, shows books by decade that are available on Amazon.com. It shows that the world’s largest bookseller has the same huge gap – the 20th century is simply missing. 

The 20th Century represents a significant portion of published knowledge – approximately one-third of all books – as shown in the graph below.  These books are largely unavailable commercially, BUT they are not completely lost. Many of these books are on library shelves, accessible only if you physically visit the library that owns those books. Even if you’re willing to visit, those books might still not be accessible. Libraries, pressed to repurpose their buildings, have increasingly moved volumes to off-site storage facilities.

The way to make 20th Century books available to library patrons is to digitize those books and let every library that owns a physical copy lend that book in digital form. This type of service has come to be known as controlled digital lending (CDL). The Internet Archive has been doing this for years. We lend out-of-copyright and in-copyright volumes that we physically own: we reformat the physical volume to produce a digital version, then lend only that digital version to one user at a time. Our experience shows that this responds to a real demand, fills a genuine need satisfactorily, gives new life to older books, and brings important knowledge to a new audience. Check out this case study for CDL involving the book Wasted, which figured prominently in the Brett Kavanaugh Supreme Court nomination hearings.

Our experience has been replicated by other early adopters and providers of a CDL service. Here’s a list of some of them. We believe every library can transform itself into a digital library. If you own the physical book, you can choose to circulate a digital version instead.

We urge more libraries to join Open Libraries and lend digitized versions of their print collections, making more copies of books available for loan and getting more books into the hands of digital readers everywhere.

Helping us judge a book by its cover: software help request

The Internet Archive would appreciate some help from a volunteer programmer: we need software that can determine whether a book cover is useful to our users as a thumbnail, or whether we should use the title page instead. Many of our older books have cloth covers that are not useful, for instance:

But others are useful:

Just going by age is not enough, because even 1923 cloth covers are sometimes good indicators of what the book is about (and are nice looking):

We would like a piece of code that can help us determine whether a cover is useful to display as the thumbnail of a book. It does not have to be exact, but it would be useful if it knew when it did not have a good determination, so we could run those cases by a person.

To help any potential programmer volunteers, we have created folders of hundreds of examples in 3 categories: year 1923 books with not-very-useful covers, year 1923 books with useful covers, and year 2000 books with useful covers. The filenames of the images are the Internet Archive item identifiers, which can be used to find the full item: 1922forniaminera00bradrich.jpg would come from https://archive.org/details/1922forniaminera00bradrich. We would like a program (hopefully fast, small, and free/open source) that would say useful or not-useful, along with a confidence.
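As a seed for discussion, here is one crude baseline a volunteer might start from: plain cloth covers tend to be nearly uniform, while useful covers carry text and art, so simple edge density separates many cases. This is only a sketch, not the requested solution; the threshold is a guess to be tuned against the example folders, and a small classifier trained on them would likely do much better.

from PIL import Image, ImageFilter
import numpy as np

def cover_usefulness(path):
    """Guess whether a cover image makes a useful thumbnail.

    Returns ("useful" or "not-useful", confidence in [0, 1]).
    """
    img = Image.open(path).convert("L").resize((256, 256))
    # Edge density: plain cloth covers have little local contrast,
    # covers with text or artwork have a lot.
    edges = np.asarray(img.filter(ImageFilter.FIND_EDGES), dtype=float) / 255.0
    score = edges.mean()
    threshold = 0.05  # a guess; tune on the labeled example folders
    label = "useful" if score > threshold else "not-useful"
    # Low confidence near the boundary is the signal to route to a person.
    confidence = min(1.0, abs(score - threshold) / threshold)
    return label, confidence

print(cover_usefulness("1922forniaminera00bradrich.jpg"))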

Interested in helping? Brenton at archive.org is a good point of contact on this project. Thank you for considering this. We can use the help. You can also use the comments on this post for any questions.

FYI: To create these datasets, I ran these command lines, and then by hand pulled some of the 1923 covers into the “useful” folder.

bash-3.2$ ia search "date:1923 AND mediatype:texts AND NOT collection:opensource AND NOT collection:universallibrary AND scanningcenter:*" --itemlist --sort=downloads\ desc | he\
ad -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/cloth/{}.jpg"

bash-3.2$ ia search "date:2000 AND mediatype:texts AND scanningcenter:cebu" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.\
org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/picture/{}.jpg"

Low Vision? Disability? 1.8 Million digital books free, now Worldwide

You can take action now to expand access to 1.8 million books worldwide.

Individuals can qualify now to access 1.8 million digitized books, free. The Internet Archive has recently expanded its program for those with low vision and disabilities.

Libraries, hospitals, schools, and other organizations (worldwide!) can now sign up to authorize users, as well as get digital files for further remediation for qualifying users.

Publishers, please contribute your books for this program!

Service organizations, please host your digital books on archive.org to make access to all books seamless. Free.

Press: please help get the word out, and let us know how we can help.

Donations are needed to bring 4 million more books online. Donate books or money (about $20 per book).

These books are now available for access by anyone with disabilities worldwide, and anyone can contribute books for people with disabilities: both made possible by the US recently adopting the Marrakesh Treaty.

Now is our time to bring together a great library for those with disabilities.

Together we can.


Physical Archive Party October 28th, 2018

Archive friends and family, please join us for the Physical Archive party, 2-6pm at 2512 Florida Avenue, Richmond, CA. Please RSVP so we can keep a count. Staff, partners, and friends of the Internet Archive are all invited.

RSVP HERE

This is a unique opportunity to see some of the behind-the-scenes activities of the Internet Archive and millions of books, music and movie items. The Internet Archive is well known for its digital collections — the digital books, software, film and music that millions of people access every day through archive.org, but did you know that much of our online content is derived from physical collections, stored in the East Bay in Richmond?

2018 has been a year of focus on inventory and ramping up throughput at our digitization centers. At the physical archive we see collections of film, software, documents, books, and music at least three times before they are finally archived in Richmond: once coming in the door to be inventoried, again as we ship them out to be digitized in Hong Kong or Cebu, and a third time coming back to us for long-term storage.

This year, the staff would like you to see the physical archive and celebrate our achievements in 2018. Bring your roller skates and drones for competitive battles, get a drink at the ‘Open Container Bar’ and then peruse the special collections in our dedicated space. Uninhibitedly show off your dance skills at our silent server disco and enjoy brews and Halloween gruel.

We will also be showcasing our collection of books, music and film in both a working environment and our special collections room. We will have tours of our facilities and demonstrations of how we get hundreds of thousands of books a year digitized at our Hong Kong (and now Cebu) Super Centers and safely back again.

This is a Halloween event, so costumes are encouraged! There will be a prize for the scariest librarian costume.

Software Help Requested to Segment Tracks on LP’s

Machine Learning + Music Lovers: the Internet Archive is seeking technical volunteers, interns, or low-cost contractors with a passion for music to build an open-source software library capable of identifying which songs are on LPs (given a waveform or audio track of the sides). We have a training set of ~5k manually labeled LPs and thousands more that need your help.

Challenges:

  • detecting the start and stop of songs
  • getting track titles from OCR’ed covers or labels
  • an engaging UI for QA of uncertain automated output

The Internet Archive is interested in digitizing “Lost Vinyl”: those recordings that did not make it to CD or Spotify. We have been getting donations of physical LP’s (but we can always use more, please think of us…), and at the end of the year we would like to start digitizing them. We are not sure how available we can make the resulting audio files, but let’s make sure these fabulous recordings are at least preserved.

We are looking for help in separating the tracks on an LP. It sounds easy, but we have not been able to do it automatically yet.

For instance, this is an awesome Bruce record:

We want to detect timings and track titles:

<info>
  <title>Dancing in the Dark</title>
  <artist>Bruce Springsteen</artist>
  <trackinfo unit="mm:ss">
    <track target="01_Dancing_in_the_Dark__Blaster_mix"
           title="Dancing in the Dark (Blaster mix)"
           start="0:09" duration="6:11" end="6:20"/>
    <track target="02_Dancing_in_the_Dark__Radio"
           title="Dancing in the Dark (Radio)"
           start="6:42" duration="4:43" end="11:25"/>
    <track target="03_Dancing_in_the_Dark__Dub"
           title="Dancing in the Dark (Dub)"
           start="11:25" duration="5:33" end="16:58"/>
  </trackinfo>
</info>

https://archive.org/download/dancingindarksou00spri/dancingindarksou00spri_segments.xml
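For anyone poking at these files, here is a minimal sketch of reading the timings back out, using only the Python standard library:

import xml.etree.ElementTree as ET

def read_segments(path):
    """Yield (title, start, end) for each <track> in a *_segments.xml file."""
    root = ET.parse(path).getroot()  # the <info> element
    for track in root.iter("track"):
        yield track.get("title"), track.get("start"), track.get("end")

for title, start, end in read_segments("dancingindarksou00spri_segments.xml"):
    print(f"{start}-{end}  {title}")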

We have 5,000 of these that were done by hand and can be used as a training set, and we want to do the next many thousand using a computer and human QA. Sometimes we know how many tracks there are on a side, which can help, but ideally we would not have to know.

We have derivative waveforms and fingerprints already computed, and full audio if needed.

What we would like is a piece of code, ideally Python and open source, that takes an mp3, flac, or png and produces a set of timings for the tracks on it. If the code needs the number of tracks, we can supply that as well.
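To make the shape of the problem concrete, here is a minimal sketch of the obvious baseline: silence-gap detection, using the open-source librosa library. On real records, surface noise means a fixed threshold like this will not be good enough on its own, which is exactly why we need help.

import librosa

def rough_track_bounds(audio_path, expected_tracks=None):
    """Very naive track segmentation: split where the audio gets quiet."""
    y, sr = librosa.load(audio_path, sr=22050, mono=True)
    # Intervals (in samples) where the signal is within top_db of the peak.
    intervals = librosa.effects.split(y, top_db=40)
    bounds = [(start / sr, end / sr) for start, end in intervals]
    # If the track count is known, merge across the shortest gaps
    # until that many segments remain.
    if expected_tracks:
        while len(bounds) > expected_tracks:
            gaps = [bounds[i + 1][0] - bounds[i][1] for i in range(len(bounds) - 1)]
            i = gaps.index(min(gaps))
            bounds[i] = (bounds[i][0], bounds[i + 1][1])
            del bounds[i + 1]
    return bounds  # list of (start_seconds, end_seconds)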

Then we would like to take label images and use them to create the track titles for the metadata above. (We OCR the labels, but the output will be a bit lossy.)

In other words, we would like to take the photographs and the digitization of the 2 sides of an album, and then get the titles with start and stop times.
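Since the OCR will be lossy, the title step probably looks like fuzzy matching rather than exact transcription. Here is a sketch using the standard library’s difflib, assuming a candidate title list is available (say, from a discography lookup); the strings are illustrative:

import difflib

def best_title_match(ocr_text, candidate_titles):
    """Return the closest known title, or None if nothing is close enough."""
    matches = difflib.get_close_matches(ocr_text, candidate_titles, n=1, cutoff=0.6)
    return matches[0] if matches else None

# OCR misreads like "Blasler" still resolve to the right title:
print(best_title_match("Dancing in the Dark (Blasler mix)",
                       ["Dancing in the Dark (Blaster mix)",
                        "Dancing in the Dark (Radio)",
                        "Dancing in the Dark (Dub)"]))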

We have done this for 5,000 LP’s already, and we would like help automating this process so we can do it for all LP’s that did not make it to CD.

Up for helping? We can give you access to the existing 5,000. What we would love is robust code that we could run on newly digitized LP’s so we can at least preserve, and maybe even bring access to, the Lost Vinyl of the 20th century.

This is not as easy as it looks, but please do not be discouraged; we could use the help.

Existing open source projects could get us a long way there:

https://github.com/yu-tseng-chou/CS696-AMN-Project
https://github.com/bonnici/scrobble-along-scrobbler
https://github.com/NavJ/slicer/blob/master/slicer.py
https://github.com/tyiannak/pyAudioAnalysis

If you are interested, please write to info@archive.org.

Digital opportunity for the academic scholarly record

[MIT Libraries is holding a workshop on Grand Challenges for the scholarly record.  They asked participants for a problem/solution statement.  This is mine. -brewster kahle]

The problem with the academic scholarly record now:

University library budgets are spent on closed rather than open: we invest dollars in closed/subscription services (Elsevier, JSTOR, Hathi) rather than ones open to all users (PLOS, Arxiv, Internet Archive, eBird) – and for a reason. There is only so much money, our community demands access to the closed services, and the open ones are there whether we pay for them or not.

We want open access AND digital curation and preservation – but have no means to spend cooperatively.

University libraries funded the building of Elsevier / JSTOR / HathiTrust: closed, subscription services.

We need to invest most University Library acquisition dollars in open: PLOS, Arxiv, Wikipedia, Internet Archive, eBird.

We have solved it when:

Anyone anywhere can get ALL information available to an MIT student, for free.

Everyone everywhere has the opportunity to contribute to the scholarly record as if they were MIT faculty, for free.

What should we do now?

Analog -> Digital conversion of all published scholarship must be completed soon. And it must be completely open, available in bulk.

Curation and Digital Preservation of born-digital research products: papers/websites/research data.

“Multi-homing” digital research products (papers, websites, research data) via peer-to-peer backends.

Who can best implement?

Vision and tech ability: Internet Archive, PLOS, Wikipedia, arxiv.

Funding now comes from researchers, individuals, and rich people.

Funding should come from University Library acquisition budgets.

Why might MIT lead?

OpenCourseware was bold.  MIT might invest in opening the scholarly record.

How might MIT do this?

Be bold.

Spend differently.

Lead.

Mass downloading 78rpm record transfers

To preserve or discover interesting 78rpm records, you can download them to your own machine (rather than using our collection pages). You can download lots of them onto a Mac or Linux machine by using a command-line utility.

Preparation: download the IA command-line tool, like so:

$ curl -LO https://archive.org/download/ia-pex/ia
$ chmod +x ia
$ ./ia help

Option 1: if you want just a set of mp3’s to play, download them to your /tmp directory:

./ia download --search "collection:georgeblood" --no-directories --destdir /tmp -g "[!_][!7][!8]*.mp3"

or just blues (or hillbilly or other searches):

./ia download --search "collection:georgeblood AND blues" --no-directories --destdir /tmp -g "[!_][!7][!8]*.mp3"

Option 2: if you want to preserve the FLAC, MP3, and metadata files for the best version of each 78rpm record we have. (If you are using a Mac, install Homebrew and then type “brew install parallel”. On Linux, try “apt-get install parallel”.)

./ia search 'collection:georgeblood' --sort=publicdate\ asc --itemlist > itemlist.txt
cat itemlist.txt | parallel --joblog download.log './ia download {} --destdir /tmp -g "[!_][!7][!8]*"'

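If any downloads fail, this re-runs just the failures recorded in the joblog: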
parallel --retry-failed --joblog download.log './ia download {} --destdir /tmp -g "[!_][!7][!8]*"'