Author Archives: Brewster Kahle

Helping us judge a book by its cover: software help request

The Internet Archive would appreciate some help from a volunteer programmer to create software that would help determine whether a book cover is useful to our users as a thumbnail or whether we should use the title page instead. Many of our older books have cloth covers that are not useful, for instance:

But others are useful:

Just telling by age is not enough, because even 1923 cloth covers are sometimes good indicators of what the book is about (and are nice looking):

We would like a piece of code that can help us determine if the cover is useful or not to display as the thumbnail of a book. It does not have to be exact, but it would be useful if it knew when it didn’t have a good determination so we could run it by a person.

To help any potential programmer volunteers, we have created folders of hundreds of examples in three categories: year 1923 books with not-very-useful covers, year 1923 books with useful covers, and year 2000 books with useful covers. The filenames of the images are the Internet Archive item identifiers that can be used to find the full items: 1922forniaminera00bradrich.jpg would come from https://archive.org/details/1922forniaminera00bradrich. We would like a program (hopefully fast, small, and free/open source) that would say useful or not-useful, along with a confidence.
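A crude heuristic may already get partway there: cloth covers tend to be low-contrast and nearly monochrome, while pictorial covers have more color variance and edge detail. Here is a minimal sketch along those lines, a guessed-at baseline rather than a finished solution; the thresholds are illustrative and would need tuning against the sample folders:

from PIL import Image, ImageFilter
import numpy as np

def classify_cover(path):
    """Return ('useful' | 'not-useful', confidence in [0, 1])."""
    img = Image.open(path).convert("RGB").resize((256, 256))
    pixels = np.asarray(img, dtype=np.float32)

    # Color spread: standard deviation of each RGB channel, averaged.
    color_spread = pixels.std(axis=(0, 1)).mean() / 128.0

    # Edge density: fraction of strong edges after an edge filter.
    edges = np.asarray(img.convert("L").filter(ImageFilter.FIND_EDGES))
    edge_density = (edges > 40).mean()

    # Combine the two cues; 0.5 is the decision boundary.
    score = 0.5 * min(color_spread, 1.0) + 0.5 * min(edge_density * 4.0, 1.0)
    label = "useful" if score >= 0.5 else "not-useful"
    confidence = abs(score - 0.5) * 2.0  # near zero: send to a person
    return label, confidence

A low confidence here is exactly the "no good determination" case described above, where the cover would be routed to a person.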

Interested in helping? Brenton at archive.org is a good point of contact on this project.   Thank you for considering this. We can use the help. You can also use the comments on this post for any questions.

FYI: To create these datasets, I ran these command lines, and then by hand pulled some of the 1923 covers into the “useful” folder.

bash-3.2$ ia search "date:1923 AND mediatype:texts AND NOT collection:opensource AND NOT collection:universallibrary AND scanningcenter:*" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/cloth/{}.jpg"

bash-3.2$ ia search "date:2000 AND mediatype:texts AND scanningcenter:cebu" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/picture/{}.jpg"

Low Vision? Disability? 1.8 Million digital books free, now Worldwide


You can take action now to expand access to 1.8 million books worldwide.

Individuals can qualify now to access 1.8 million digitized books, free. The Internet Archive has recently expanded its program for those with low vision and disabilities.

Libraries, hospitals, schools, and other organizations (worldwide!) can now sign up to authorize users, as well as get digital files for further remediation for qualifying users.

Publishers, please contribute your books for this program!

Service organizations, please host your digital books on archive.org to make access to all books seamless. Free.

Press: please help get the word out, and let us know how we can help.

Donations needed to get 4 million more books. Donate books or money (about $20 per book).

Now available for access by anyone with disabilities worldwide… and for anyone to contribute books for people with disabilities: both helped by the US recently adopting the Marrakesh Treaty.

Now is our time to bring together a great library for those with disabilities.

Together we can.


Physical Archive Party October 28th, 2018

Archive friends and family, please join us for the Physical Archive party, 2-6pm at 2512 Florida Avenue, Richmond, CA. Please RSVP so we can keep a count. Staff, partners, and friends of the Internet Archive are all invited.

RSVP HERE

This is a unique opportunity to see some of the behind-the-scenes activities of the Internet Archive and millions of books, music, and movie items. The Internet Archive is well known for its digital collections (the digital books, software, film, and music that millions of people access every day through archive.org), but did you know that much of our online content is derived from physical collections, stored in the East Bay in Richmond?

2018 has been a year of focus on inventory and on ramping up throughput at our digitization centers. At the physical archive we see collections of film, software, documents, books, and music at least three times before they are finally archived in Richmond: first coming in the door to be inventoried, second as we ship them out to be digitized in Hong Kong or Cebu, and third coming back to us for long-term storage.

This year, the staff at the physical archive would like you to see the physical archive and celebrate our achievements in 2018. Bring your roller skates and drones for competitive battles, get a drink at the ‘Open Container Bar’, and then peruse the special collections in our dedicated space. Uninhibitedly show off your dance skills at our silent server disco and enjoy brews and Halloween gruel.

We will also be showcasing our collection of books, music, and film in both a working environment and our special collections room. We will have tours of our facilities and demonstrations of how we get hundreds of thousands of books a year digitized at our Hong Kong (and now Cebu) Super Centers and safely back again.

Prize for scariest librarian costume.
This is a Halloween event, so costumes are encouraged!

Software Help Requested to Segment Tracks on LP’s

Machine Learning + Music Lovers: the Internet Archive is seeking technical volunteers, interns, or low-cost contractors with a passion for music to make an open-source software library capable of identifying which songs are on LPs (given a waveform or audio track of the sides). We have a training set of ~5,000 manually labeled LPs and thousands more that are in need of your help.

Challenges:

  • detecting the start and stop of songs
  • getting track titles from OCR’ed covers or labels
  • an engaging UI for QA of uncertain automated output

The Internet Archive is interested in digitizing “Lost Vinyl”: those recordings that did not make it to CD or Spotify. We have been getting donations of physical LP’s (but we can always use more, please think of us…), and at the end of the year we would like to start digitizing them. We are not sure how available we can make the resulting audio files, but let’s make sure these fabulous recordings are at least preserved.

We are looking for help in separating the tracks on an LP.  Sounds easy, but we have not been able to do it automatically yet.

For instance, this is an awesome Bruce record:

We want to detect timings and track titles:

<info>
  <title>Dancing in the Dark</title>
  <artist>Bruce Springsteen</artist>
  <trackinfo unit="mm:ss">
    <track target="01_Dancing_in_the_Dark__Blaster_mix"
           title="Dancing in the Dark (Blaster mix)"
           start="0:09" duration="6:11" end="6:20"/>
    <track target="02_Dancing_in_the_Dark__Radio"
           title="Dancing in the Dark (Radio)"
           start="6:42" duration="4:43" end="11:25"/>
    <track target="03_Dancing_in_the_Dark__Dub"
           title="Dancing in the Dark (Dub)"
           start="11:25" duration="5:33" end="16:58"/>
  </trackinfo>
</info>

https://archive.org/download/dancingindarksou00spri/dancingindarksou00spri_segments.xml

We have 5,000 of these that have been done by hand and can be used as a training set, and we want to do the next several thousand using a computer and human QA. Sometimes we know how many tracks there are on a side, which can help, but ideally we would not have to know.
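For anyone wanting to use the hand-made files as training labels, the segments XML is easy to read. A small sketch, assuming the mm:ss format shown above:

import xml.etree.ElementTree as ET

def parse_segments(path):
    """Read a *_segments.xml file into (title, start_sec, end_sec) tuples."""
    def seconds(mmss):
        minutes, secs = mmss.split(":")
        return int(minutes) * 60 + int(secs)
    tracks = []
    for track in ET.parse(path).getroot().iter("track"):
        tracks.append((track.get("title"),
                       seconds(track.get("start")),
                       seconds(track.get("end"))))
    return tracks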

We have derivative waveforms and fingerprints already computed, and full audio if needed.

What we would like is a piece of code, ideally Python and open source, that would take an mp3, flac, or waveform png and create a set of timings for the tracks on it. If the code needed the number of tracks, we could supply that as well.
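One obvious baseline is silence detection: look for quiet gaps between tracks. Here is a minimal sketch using the pydub library. It is a naive approach that we expect surface noise and quiet passages to defeat (which is why we are asking for help), but it shows the desired input and output:

from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def track_timings(path, min_gap_ms=1500, silence_margin_db=16):
    """Return [(start_sec, end_sec), ...] for each candidate track."""
    audio = AudioSegment.from_file(path)  # mp3, flac, wav, ...
    # Treat anything well below the side's average loudness as "silence".
    threshold = audio.dBFS - silence_margin_db
    spans = detect_nonsilent(audio, min_silence_len=min_gap_ms,
                             silence_thresh=threshold)
    return [(start / 1000.0, end / 1000.0) for start, end in spans]

if __name__ == "__main__":
    import sys
    for n, (start, end) in enumerate(track_timings(sys.argv[1]), 1):
        print(f"track {n}: {start:7.1f}s - {end:7.1f}s")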

Then we would like to take label images such as:


to create the track titles for the metadata above. (We OCR the labels, but the output will be a bit lossy.)
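As a sketch of that OCR step, pytesseract (a wrapper around the Tesseract OCR engine) can pull candidate title lines off a label photo; matching those noisy lines to actual track titles is the interesting part we need help with:

from PIL import Image
import pytesseract

def label_lines(path):
    """OCR a label photo and return lines that might be track titles."""
    text = pytesseract.image_to_string(Image.open(path))
    # Labels are noisy; keep only lines with enough letters to be a title.
    return [line.strip() for line in text.splitlines()
            if sum(c.isalpha() for c in line) >= 4]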

In other words, we would like to take the photographs and digitizations of the two sides of an album, and then get the titles with start and stop times.

We have done this for 5,000 LP’s already, and we would like help in automating this process so we can do it for all LP’s that did not make it to CD.

Up for helping? We can give access to the existing 5,000, and what we would love is robust code that we could run on newly digitized LP’s so we can at least preserve, and maybe even bring access to, the Lost Vinyl of the 20th century.

This is not as easy as it looks, but please do not be discouraged; we could use the help.

Existing open source projects could get us a long way there:

https://github.com/yu-tseng-chou/CS696-AMN-Project
https://github.com/bonnici/scrobble-along-scrobbler
https://github.com/NavJ/slicer/blob/master/slicer.py
https://github.com/tyiannak/pyAudioAnalysis

If you are interested, please write to info@archive.org.

Digital opportunity for the academic scholarly record

[MIT Libraries is holding a workshop on Grand Challenges for the scholarly record.  They asked participants for a problem/solution statement.  This is mine. -brewster kahle]

The problem of the academic scholarly record now:

University library budgets are spent on closed rather than open: we invest dollars in closed/subscription services (Elsevier, JSTOR, HathiTrust) rather than in ones open to all users (PLOS, Arxiv, Internet Archive, eBird), and for a reason: there is only so much money, our community demands access to the closed services, and the open ones are there whether we pay for them or not.

We want open access AND digital curation and preservation, but have no means to spend cooperatively.

University libraries funded the building of Elsevier / JSTOR / HathiTrust: closed, subscription services.

We need to invest most University Library acquisition dollars in open: PLOS, Arxiv, Wikipedia, Internet Archive, eBird.

We have solved it when:

Anyone anywhere can get ALL information available to an MIT student, for free.

Everyone everywhere has the opportunity to contribute to the scholarly record as if they were MIT faculty, for free.

What should we do now?

Analog -> Digital conversion of all published scholarly works must be completed soon. And the result must be completely open, available in bulk.

Curation and Digital Preservation of born-digital research products: papers/websites/research data.

“Multi-homing” digital research products (papers, websites, research data) via peer-to-peer backends.

Who can best implement?

Vision and tech ability: Internet Archive, PLOS, Wikipedia, arxiv.

Funding now is coming from researchers, individuals, rich people.

Funding should come from University Library acquisition budgets.

Why might MIT lead?

OpenCourseware was bold.  MIT might invest in opening the scholarly record.

How might MIT do this?

Be bold.

Spend differently.

Lead.

Mass downloading 78rpm record transfers

To preserve or discover interesting 78rpm records, you can download them to your own machine (rather than using our collection pages). You can download many at once onto a Mac/Linux machine by using a command-line utility.

Preparation: download the IA command-line tool, like so:

$ curl -LO https://archive.org/download/ia-pex/ia
$ chmod +x ia
$ ./ia help

Option 1: if you want just a set of mp3’s to play, download to your /tmp directory:

./ia download --search "collection:georgeblood" --no-directories --destdir /tmp -g "[!_][!7][!8]*.mp3"

or just blues (or hillbilly or other searches):

./ia download --search "collection:georgeblood AND blues" --no-directories --destdir /tmp -g "[!_][!7][!8]*.mp3"

Option 2: if you want to preserve the FLAC, MP3, and metadata files for the best version of each 78rpm record we have. (If you are on a Mac, install Homebrew, then type “brew install parallel”. On Linux, try “apt-get install parallel”.)

./ia search 'collection:georgeblood' --sort=publicdate\ asc --itemlist > itemlist.txt
cat itemlist.txt | parallel --joblog download.log './ia download {} --destdir /tmp -g "[!_][!7][!8]*"'

If any downloads fail, re-run just the failed ones:

parallel --retry-failed --joblog download.log './ia download {} --destdir /tmp -g "[!_][!7][!8]*"'
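If you prefer Python to shell pipelines, the same functionality is available through the internetarchive library that powers the ia tool (pip install internetarchive). A rough equivalent of Option 2, using the same glob pattern:

from internetarchive import search_items, download

# Walk the collection and download matching files for each item
# into /tmp/<identifier>/.
for result in search_items('collection:georgeblood'):
    identifier = result['identifier']
    download(identifier, glob_pattern='[!_][!7][!8]*',
             destdir='/tmp', retries=3)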

Digital Books on archive.org

Many people think of the Internet Archive as just the Wayback Machine or just one collection or another, but there is much more.  For instance, books!

As a nonprofit library we buy and lend books to the public, but we do even more than that. Working with hundreds of libraries, we buy ebooks, digitize physical books, offer them to the print-disabled, and lend books to one reader at a time, all for free via archive.org and openlibrary.org.

Archive.org is the website that offers free public access to all sorts of materials uploaded by users, collected by the Internet Archive, and digitized by the Internet Archive.  Archive.org includes books, music, video, webpages, and software.  OpenLibrary.org, a site that is maintained by the Internet Archive, is a catalog of books with the mission to offer “One webpage for every book.”  This open source catalog site, started in 2005, is editable by its users and has many code contributors. It links to various resources about that book, for instance, links to amazon.com and betterworldbooks.org to buy the book, to local libraries that own the book, to archive.org for print-disabled access or to borrow a digitized version of the book, and to other sites that have digital versions.

The goals of libraries are preservation and access. For physical books, we buy and receive donations of hundreds of thousands of books that we preserve for the long term in archival, non-circulating stacks. Support for this comes from libraries, used book vendors, foundations, and tens of thousands of individual donors to the Internet Archive, a public charity.

We also work with more than 500 libraries to help digitize their books, now more than 3 million of them, to preserve them digitally and offer online access. These libraries make their older books (mostly pre-1923) available for free public downloading, and, fantastically, over 25 million older books are viewed every month.

Unfortunately, the books of the 20th century are largely not available either physically or digitally. These graphs show how the 20th century’s books are not available through Amazon for purchase, or from the Internet Archive. Some have reasoned this is because of copyright. 1923 is a special date in US copyright law because works published before this date are in the Public Domain, while afterwards copyright status can be very complicated. Unfortunately, 1923 in these graphs also marks a sharp drop in the commercial availability of many books. These books are often only available through libraries.

Starting 10 years ago, the Internet Archive began digitizing modern books, mostly from the 20th century, for access by the blind and dyslexic. Those who are certified disabled by the Library of Congress get a decryption key for accessing Library of Congress scanned books. This key can also decrypt digitized books available on archive.org. This, combined with special formats for the blind and dyslexic of the older books, has brought millions of books to people who have had difficulty in the past. We are working to make these books more available to these communities in other special formats.

Publishers have been using digital protection technologies for years for ebooks sold to retail customers, often referred to as DRM (digital rights management).  Libraries lend ebooks using the same DRM, and the Internet Archive has followed that lead, using Adobe Digital Editions.

The digital protection allows books to be lent via downloads that disappear (or become inaccessible) when the loan period ends (e.g. two weeks).  For users who prefer to read their ebooks directly in a browser, the same thing happens. The book becomes inaccessible at the end of the loan period, and the next reader in line has a chance to borrow it.
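The bookkeeping behind this is simple to picture. Here is a toy illustration of one-copy, one-reader lending with a waitlist, a hypothetical model for explanation only, not the Internet Archive's actual implementation:

import time
from collections import deque

LOAN_SECONDS = 14 * 24 * 3600  # e.g. a two-week loan

class LendingCopy:
    """One digitized copy, lent to one reader at a time."""
    def __init__(self):
        self.borrower = None
        self.due = None
        self.waitlist = deque()

    def borrow(self, reader, now=None):
        now = now if now is not None else time.time()
        if self.borrower is not None and now < self.due:
            self.waitlist.append(reader)  # copy is out: wait in line
            return False
        self.borrower, self.due = reader, now + LOAN_SECONDS
        return True

    def check_expiry(self, now=None):
        """When a loan ends, the copy passes to the next reader in line."""
        now = now if now is not None else time.time()
        if self.borrower is not None and now >= self.due:
            self.borrower = None
            if self.waitlist:
                self.borrow(self.waitlist.popleft(), now)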

While it is technically possible to break the digital protections of these technologies, it is illegal to do so. Moreover, the typical user does not do this, allowing for a flourishing ebook marketplace for current books. The Internet Archive is able to make available for loan older books that are not available in ebook format. In every case, an authorized print copy has been acquired and made unavailable for simultaneous loan.

Many of the books in our collection are books that libraries believe to be of historical importance, such that they do not want to throw them away, but that are not worth keeping on their physical shelves. The digitized versions are therefore made available to a single user at a time, while the physical book no longer circulates. Since the books which are lent using the controlled digital lending technologies are limited to one reader at a time, it works best for “long tail” books, books that are not available in other ways. Fortunately, many of these books are wonderful and important, and we are proud to bring them to a generation of digital learners who may not have physical access to major public libraries.

We hope many more libraries start controlled digital lending of their books as this is a way to bring public access to the purchases and collections they have built over centuries.

We have recently made available a small number of books (currently 61 books) published between 1923 and 1941 under a provision of US Copyright law that was written to permit libraries to copy and lend titles that are no longer subject to commercial exploitation, and selection is currently overseen by  lawyers expert in US copyright law.

As a completely separate service from buying ebooks and loaning to users with controlled digital lending, the Internet Archive offers free hosting for cultural works (texts, audio, moving images) that are uploaded by the general public. Millions of documents from court cases, and digitized books from other projects such as the Google book program and the Digital Library of India have been uploaded over the years.

When a rights holder wants a work that was uploaded by a user taken down, a well known “Notice and Takedown” procedure is in place. The Internet Archive takes prompt action and follows the procedure, generally resulting in the work being taken down.

Where is this all going? We are looking for partners and ideas to help bring more books to more people in more ways: more books (and more accessible books) for the print-disabled, complete collections of books from the 20th century online and available, clickable footnotes for books cited in Wikipedia that bring up the full text on the right page, and many more books in bookstores and libraries. This generation of digital learners is looking for this and expecting this. Collectively, libraries, booksellers, publishers, and authors, old and new, share these same interests. The good news is the technologies are now available; we all have to do our parts to serve digital learners everywhere.

As a library, we strive to provide “Universal Access to All Knowledge.” The digital technologies make this a feasible dream.  We are working with publishers, booksellers, authors, other libraries, and most of all digital learners to find balanced and respectful ways to try to achieve this goal. If you want to help, or have ideas on what we can do to get there, please let us know.


Building Digital 78rpm Record Collections Together with Minimal Duplication

By working together, libraries who are digitizing their collections can minimize duplication of effort, saving time and money to preserve other things. This month we made progress with 78rpm record collections.

The goal is to bring many collections online as cost effectively as possible. Ideally, we want to show each online collection as complete but only digitize any particular item once. Therefore one digitized item may belong virtually to several collections. We are now doing this with 78rpm records in the Great 78 Project.

It starts with great collections of 78s (18 contributors so far). For each record, we look up the record label, catalog number, and title/performer to see if we have it already digitized. If we have it already, then we check the condition of the digitized copy against the new one; if the new copy would improve the collection, we digitize it. If we do not need to digitize it, we add a note to the existing item that it now also belongs to another collection, as well as noting where the duplicate physical item can be found.
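In pseudocode form, the decision for each incoming disc looks roughly like this (an illustrative sketch with a made-up data model, not our production code):

# Key each disc on label, catalog number, and title/performer; only
# digitize when the new copy would improve on what we already have.
already_digitized = {}  # (label, catalog_no, title_performer) -> condition

def should_digitize(label, catalog_no, title_performer, condition):
    key = (label.lower(), catalog_no, title_performer.lower())
    existing = already_digitized.get(key)
    if existing is None:
        return True                # never seen before: digitize it
    return condition > existing    # better copy than ours: re-digitize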

For instance, the KUSF collection we are digitizing has many fabulous records we have never seen before, including sound effect records. But about half are records we have already digitized better copies of, so we are not digitizing most of those. We still attribute the existing digital files to the KUSF collection so it will have a digital file in the online collection for each of its physical discs.

It takes about half as long to determine that a record is a duplicate as to fully digitize it, and given that about half of our records now turn out not to need digitizing, we are looking for ways to speed this up.

OCLC has many techniques to help with deduplication of books and we are starting to work with them on this, but for 78s we are making progress in this way. Please enjoy the 78s.

Thank you to GeorgeBlood L.P., Jake Johnson, B. George, and others.

Dreaming Open, Dreaming Big

Disheartened by anti-Net Neutrality moves in the US, but inspired by the reddit poster offering to donate $87 million in bitcoin (5057 BTC) to good causes (including the awesome EFF)?

It got us dreaming open and dreaming big: What could we do if the open world had lots of money, and specifically what could the Internet Archive do with it?

$100m (5930 BTC)
Bring 4 million books to billions of people and improve the quality of information on the internet for everyone.



$7m (415 BTC)

Rebuild the building next to ours to house interns, fellows, and researchers to create the next generation of open.



$12m (711 BTC)

Bring a million songs from the 78rpm era to the internet by fueling the Great 78 Project.


$5m (296 BTC)
Digitize and preserve the sound archives of the Boston Public Library, 250,000 discs (LP’s, 78’s, and CD’s), to make a public library of music for all.


$52.5M (3113 BTC)
Preserve and keep available our existing 35 petabytes of data forever. Based on a study by David Rosenthal of the LOCKSS project at Stanford University, the forever cost of storing a terabyte is $1,500; 35 petabytes is 35,000 terabytes, so 35,000 × $1,500 = $52.5 million. Just long-term preservation of the 200 TB of US government information from the end of the last administration is $300k (17 BTC).

$15m (889 BTC)
Pay all Internet Archive employees with bitcoin for 2018.


$10m (593 BTC)
Buy the next building to store the millions of books, records, and films being donated to the Internet Archive. We are now filling up our current two buildings.



$4m (237 BTC)

Launch the Decentralized Web as a project to build a more private, reliable, and flexible web.

 

In fact, we are working towards all of these projects. But they can go much faster with donations and interest on the part of you, our Internet Archive supporters. Contributions of all sizes make a huge difference. The average donation this year is about $20. Together we can build a robust open world.

5,000 78rpm sides for the Great 78 Project are now posted

From the David Chomowicz and Esther Ready Collection.
Click to listen.

This month’s transfers of 5,000 78rpm sides for the Great 78 Project are now posted.

Many are Latin American music from the David Chomowicz and Esther Ready Collection.

Others are square dance music, with and without the calls, from the Larry Edelman Collection. (Thank you David Chomowicz, Esther Ready, and Larry Edelman for the donations.)

We are still working on some of the display issues with this month’s materials, so some changes are yet to come.

From the Larry Edelman Collection.
Click to listen.

Unfortunately, we have only found dates for about half of this month’s batch using our automatic techniques of looking through 78discography.com, 45worlds, discogs, DAHR, and full-text searching of Cashbox Magazine. There are currently over 2,000 songs with missing dates.

If you like internet sleuthing, or leveraging our scanned discographies (or your own), and would like to join in on finding dates and reviews, please jump in. We have a Slack channel for those doing this.

Congratulations to B George’s group, George Blood’s group, and the collections group at the Internet Archive for another large batch of largely disappeared 78’s.