Category Archives: Music

Saving the 78s

Written by B. George, the Director of ARChive of Contemporary Music in NYC, and Curator of Sound Collections at the Internet Archive in San Francisco.

While audio CDs whiz by at about 500 revolutions per minute, the earliest flat discs offering music whirled at 78rpm. They were mostly made from shellac, i.e., beetle (the bug, not The Beatles) resin, and were the brittle predecessors to the LP (microgroove) era. The format is obsolete, the surface noise is often unbearable, and just picking them up can break your heart as they break apart in your hands. So why does the Internet Archive have more than 200,000 in our physical possession?

A little over a year ago New York’s ARChive of Contemporary Music (ARC) partnered with the Internet Archive to focus on preserving and digitizing audio-visual materials. ARC is the largest independent collection of popular music in the world. When we began in 1985 our mandate was microgroove recordings – meaning vinyl – LPs and forty-fives. CDs were pretty much rumors then, and we thought that other major institutions were doing a swell job of collecting earlier formats, mainly 78rpm discs. But donations and major research projects like making scans for The Grammy Museum and The Ertegun Jazz Hall of Fame placed about 12,000 78s in our collection.

For years we had been getting calls offering 78 collections that we were unable to accept. But when space and shipping became available through the Internet Archive, it was now possible to begin preserving 78s. Here’s a short history of how in only a few years ARC and the Internet Archive have created one of the largest collections in America.

Our first major donation came from the Batavia Public Library in Illinois, part of the Barrie H. Thorp Collection of 48,000 78s.

We’re always a tad suspicious of large collections like these. First thought is, “Must be junk.” Secondly, “It’s been cherrypicked.” But the Thorp Collection was screened by former ARC Board member Tom Cvikota, who found the donor, helped negotiate the gift and stored it. That was in 2007. Between then and our 2015 pickup Tom arranged for some of the recordings to be part of an exhibition at the Greengrassi Gallery, London, (UK, Mar-Apr, 2014) by artist Allen Ruppersberg, titled, For Collectors Only (Everyone is a Collector).

What makes the Thorp collection unique is the obsessive typewritten card catalog featured in a short film hosted on the exhibition’s webpage. Understanding why you collect and how you give your interests meaning is a part of Allen’s work – artworks that focus on the collector’s mentality. One nice quote by Allen referenced in Greil Marcus’ book, The History of Rock n’ Roll in Ten Songs is, “In some cases, if you live long enough, you begin to see the endings of things in which you saw the beginnings.”

Philosophical musings aside, there are 48,000 discs to deal with. That meant taking poorly packed boxes, many of them open for 20 years, and re-boxing them for proper storage. The picture below shows an example of how they arrived (on the right), and how they were palletized (on the left).

The trick to repacking in a timely fashion is to not look at the records. It’s a trick that is never performed successfully. Handling fragile 78s requires grabbing one or just a few at a time. So we’re endlessly reading the labels, sleeving and resleeving, all the time checking for rarities, breakage and dirt.

Now we didn’t do all this work on our own. Working another part of the warehouse was two-and-a-half month old Zinnia Dupler — the youngest volunteer ever to give us a hand. Mom also helped a bit.

A few minutes after the snap I found this gem in the Thorp collection. Coincidence? I don’t think so…

“Burpin” is a country novelty tune from out of Texas by Austin broadcaster and humorist Richard “Cactus” Pryor (1923 – 2011). It came from a box jam-packed with country and hillbilly discs. This was a pleasant surprise, as we expected the collection to be like most we encounter – big band and bland pop. But here was box-after-box of hillbilly, country, and Western swing records. Now, I use’ta think I knew a bit about music. But with this collection, it was back to school for me. Just so many artists I’ve never heard of or held a record by. As we did a bit of sorting, in the ‘G’s alone there’s Curly Gribbs, Lonnie Glosson and the Georgians. Geeez! Did you know that Hank Snow had a recordin’ kid, Jimmy, and he cut “Rocky Mountain Boogie” on 4 Star records, or that Cass Daley, star of stage and screen, was the “Queen of Musical Mayhem”? Me neither. The Davis Sisters, turns out, included a young Skeeter Davis(!), and are not to be confused with the Davis Sisters gospel group, also in this collection. Then there’s them Koen Kobblers, Bill Mooney and his Cactus Twisters, and Ozie Waters and the Colorado Hillbillies. No matter they should be named the Colorado Mountaineers, they’re new to me.

For us this donation is a dream: it allows us to preserve material that was otherwise going to be thrown away; it has a larger cultural value beyond the music; and it contained a mountain of unfamiliar music, much of it quite rare. And most of it is not available online.

It was a second large donation that prompted the Internet Archive to move toward the idea that we should digitize all of our 78s. The Joe Terino Collection came to us through a cold call; the collection was professionally appraised at $500,000. The 70,000-plus 78s were stored in a warehouse for more than 40 years, originally deposited by a distributor. Here’s the kicker: they said that we could have it all, but we had to move it – NOW! Internet Archive did and it came in on 72 pallets, in three semis, from Rhode Island to San Francisco, looking like this…

So Fred Patterson and the crackerjack staff out in our Richmond warehouses (Marc Wendt, Mark Graves, Sean Fagan, Lotu Tii, Tracey Gutierrez, Kelly Ransom, and Matthew Soper) pulled everything off the ramshackle pallets and carefully reboxed this valuable material.

How valuable? Well, we’re really not so sure yet, despite the appraisal, as just receiving and reboxing was such a chore. One hint is this sweet blues 78 that we managed to skim off the top of a pile.

muddywaters

The next step is curating this material, acquiring more collections and moving towards preservation through digitization. Already we have a pilot project in the works with master preservationist George Blood to develop workflow and best digitization practices.

We’re doing all this because there’s just no way to predict if the digital will outlast the physical, so preserving both will ensure the survival of cultural materials for future generations to study and enjoy. And, it’s fun.

 

Microphone Check: Thousands of Hip-Hop Mixtapes at the Archive

The Internet Archive has been growing an interesting sub-collection of music for the past few months: Hip-Hop Mixtapes. The resulting collection still has a way to go before it’s anywhere near what is out there (limited by bandwidth and a few other technical factors), but now that it’s past 150 solid days of music on there, it’s quite enough to browse and “get the idea”, should you be so inclined.

Note: Hip-Hop tends to be for a mature audience, both in subject matter and language.

I’m sure this is entirely old knowledge for some people, but it was new to me, so I’ll describe the situation and the thinking.

There are some excellent introductions and writeups about mixtapes in Hip-Hop culture in these external articles:

So, in quick summary, there have been mixtapes of many varieties for many years, going back to the 1970s to the dawn of what we call Hip-Hop, and throughout the time since the “tapes” have become CDs and ZIP files and are now still being released out into “the internet” to be spread around. The goal is to gain traction and attention for your musical act, or for your skills as a DJ, or any of a dozen reasons related to getting music to the masses.

There is an entire ecosystem of mixtape distribution and access. There are easily tens of thousands of known mixtapes that have existed. This is a huge, already-extant environment out there, that was established, culturally critical, and born-digital.

It only made sense for a library like the Internet Archive to provide it as well.

There’s a lot coded into the covers of these mixtapes (not to even mention the stuff coded into the lyrics themselves) – there’s stressing of riches, drug use, power, and oppression. There’s commentary on government, on social issues, and on the meaning of entertainment and celebrity. There’s parody, there’s aggrandizement, and there’s every attempt to draw in the listeners in what is a pretty large pile of material floating around. It’s not about this song or that grandiose portrait, though – it’s about the fact this whole set of material has meaning, reality and relevance to many, many people.

How do I know this has relevance? Within 24 hours of the first set of mixtapes going onto the Archive, many of the albums already had hundreds of listeners, and one of them broke a thousand views. Since then, a good number have had tens of thousands of listens. Somebody wants this stuff, that’s for sure. And that’s fundamentally what the Archive is about – bringing access to the world.

The end goal here is simple: Providing free access to huge amounts of culture, so people can reference, contextualize, enjoy and delight over material in an easy-to-reach, linkable, usable manner. Apparently it’s already taken off, but here you go too.

Get your drank on here.

archive.org download counts of collections of items updates and fixes

Every month, we look over the total download counts for all public items at archive.org.  We sum item counts into their collections.  At year end 2014, we found various source reliability issues, as well as overcounting for “top collections” and many other issues.

archive.org public items tracked over time

To address the problems, we:

  • Built a new system that uses our database (DB) for item download counts, instead of our less reliable (and more prone to “drift”) SOLR search engine (SE).
  • Changed monthly saved data from JSON and PHP serialized flatfiles to a new DB table (much easier to use now!)
  • Fixed overcounting issues for collections: texts, audio, etree, movies
  • Fixed various overcounting issues related to not unique-ing <collection> and <contributor> tags (more below)
  • Fixed character encoding issues on <contributor> tags

Bonus points!

  • We now track *all collections*.  Previously, we only tracked items tagged:
    • <mediatype> texts
    • <mediatype> etree
    • <mediatype> audio
    • <mediatype> movies
  • For items whose <contributor> tags we track (texts items), we now have a “Contributor page” that shows a table of historical data.
  • Graphs are now “responsive” (scale in width based on browser/mobile width)

 

The Overcount Issue for top collection/mediatypes

  • In the graph below, mediatypes and collections are shown horizontally, with a sample “collection hierarchy” today.
  • For each collection/mediatype, we show one example item (A, B, C and D), with a downloads/streams/views count next to it in parentheses. So these are four items, spanning four collections, that happen to be in a collection hierarchy (a single item can belong to multiple collections at archive.org).
  • The Old Way had a critical flaw: it summed all sub-collection counts, when really it should have summed only *direct child* sub-collection counts (or gone with our New Way instead).

overcount

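So we now treat <mediatype> tags like <collection> tags for counting purposes, and we unique all <collection> tags on each item. That way an item with slightly messy metadata (for example, the same collection tag listed twice) is not counted twice, which was another source of overcounting.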
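
As a minimal sketch of the counting rule just described (not the Archive's actual code), the Python below treats the mediatype as one more collection tag, de-duplicates each item's tags, and adds the item's downloads to each tag exactly once. The item records, the field names and the collection names (`live`, `band`) are hypothetical.

```python
from collections import defaultdict


def collection_totals(items):
    """Sum downloads per collection, counting each item once per tag."""
    totals = defaultdict(int)
    for item in items:
        # Treat the mediatype like a collection tag, and use a set so a
        # repeated tag on one item cannot add its downloads twice.
        tags = set(item["collections"]) | {item["mediatype"]}
        for tag in tags:
            totals[tag] += item["downloads"]
    return dict(totals)


# Four hypothetical items in a hierarchy: audio > live > band.
items = [
    {"id": "A", "mediatype": "audio", "collections": [], "downloads": 1},
    {"id": "B", "mediatype": "audio", "collections": ["live"], "downloads": 10},
    {"id": "C", "mediatype": "audio", "collections": ["live", "band"], "downloads": 100},
    # Item D carries a duplicated "band" tag, the kind of nonideal data
    # that used to inflate counts.
    {"id": "D", "mediatype": "audio", "collections": ["live", "band", "band"], "downloads": 1000},
]

totals = collection_totals(items)
```

Because every item lists all the collections it belongs to, no sub-collection totals need to be added together, so there is nothing to double count as the hierarchy gets deeper.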

 

… and one more update from Feb/1:

We graph the “difference” between the absolute download counts for the current month and the prior month, for each month we have data for. This gives us graphs that show downloads/month over time. However, values can easily go *negative* in various scenarios (which is *wickedly* confusing to our poor users!)

Here’s that situation:

A collection has a really *hot* item one month, racking up downloads in a given collection. The next month, a DMCA takedown or some other action removes the item from being available (and thus from being counted in the future). The downloads for that collection can plummet in the next month’s run, when the counts are summed over public items for that collection again. So that collection would have a negative (net) downloads count change for this next month!

Here’s our fix:

Use the current month’s collection “item membership” list for the current month *and* the prior month. Sum counts for all those items for both months, and make the graphed value the difference between the two sums. In just about every situation that remains, graphed monthly download counts will be nonnegative (positive or zero), because each item’s cumulative count only grows.
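
A small Python sketch of that fix follows, under assumed data shapes: each month's counts are a dict from item identifier to cumulative downloads, and membership is a set of identifiers. The function and item names are hypothetical and this is not the production code.

```python
def monthly_delta(current_members, counts_now, counts_prev):
    """Downloads gained this month, measured over the current membership only."""
    # Both sums run over the SAME item list, so an item removed since last
    # month drops out of both months instead of only the current one.
    now = sum(counts_now.get(item, 0) for item in current_members)
    prev = sum(counts_prev.get(item, 0) for item in current_members)
    return now - prev


# Cumulative counts: "hot-item" was taken down between the two runs.
counts_prev = {"hot-item": 5000, "steady-item": 100}
counts_now = {"steady-item": 150}
members_prev = {"hot-item", "steady-item"}
members_now = {"steady-item"}

# Old behaviour: each month summed over its own membership list.
naive = (sum(counts_now[i] for i in members_now)
         - sum(counts_prev[i] for i in members_prev))

# New behaviour: current membership applied to both months.
fixed = monthly_delta(members_now, counts_now, counts_prev)
```

Here `naive` comes out at -4950, while `fixed` is 50, the downloads the remaining item actually gained.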

 

 

Music Analysis Beginnings

As mentioned in our recent Building Music Libraries post, we are working with researchers at Columbia University and UPF in Barcelona to run their code on the music collection to help their research and to provide new analyses that could help with exploration and understanding.

We are doing some pilot runs to generate files which some close observers may see in the music item directories on archive.org.  Audio fingerprints from audfprint are .afpt and music attributes from Essentia are in _esslow.json.gz (download sample) and _esshigh.json.gz.
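
For anyone who wants to peek inside one of the gzipped JSON files after downloading it, a minimal Python sketch is below. It assumes only what the file extension suggests (gzip-compressed JSON); the layout of the attributes inside the file is not described here, so the code makes no assumptions about it, and the function name is our own.

```python
import gzip
import json


def load_gzipped_json(path):
    """Read a gzip-compressed JSON file and return the parsed object."""
    # "rt" opens the compressed stream in text mode so json can read it.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `load_gzipped_json` on a downloaded `_esslow.json.gz` file and listing the top-level keys of the result is a quick way to see which attributes are present.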

Spectrogram of a Grateful Dead track

We are also creating image files showing the audio spectrum used.  We hope this is useful for those that want to see if files have been compressed in the past (even if they are posted as flac files now).  There is also a .png for each audio file of a basic waveform that is being used in the archive’s beta site as eye candy.

More as it happens, but we wanted you to know there is some progress and you will see some new files. If you have proposed other analyses that would benefit from being run over a large corpus, please let us know by contacting info at archive dot org.

Thank you to the researchers and the Archive programmers who are working together to make this happen.

 

Using Docker to Encapsulate a Complicated Program is Successful

The Internet Archive has been using docker in a useful way that is a bit out of the mainstream: to package a command-line binary and its dependencies so we can deploy it on a cluster and use it in the same way we would a static binary.

Columbia University’s Daniel Ellis created an audio fingerprinting program that was used in a competition. It was not packaged as a Debian package or in any other distributable form. It took a while for our staff to figure out how to install it and its many dependencies consistently on Ubuntu, and it seemed pretty heavy-handed to install all of that on our worker cluster. So we explored using docker and it has been successful. While old hat for some, I thought it might be interesting to explain what we did.

1) Created a docker file to make a docker container that held all of the code needed to run the system.

2) Worked with our systems group to figure out how to install docker on our cluster with a security profile we felt comfortable with.   This included running the binary in the container as user nobody.

3) Ramped up slowly to test the downloading and running of this container.   In general it would take 10-25 minutes to download the container the first time. Once cached on a worker node, it was very fast to start up.    This cache is persistent between many jobs, so this is efficient.

4) Used the container as we would a shell command, but passed files into the container by mounting a sub filesystem for it to read and write to. This also helped with signaling errors.

5) Starting production use now.
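
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the setup we actually deployed: the image name, the `/data` mount point and the function names are hypothetical. It only relies on standard `docker run` options (`--rm`, `--user`, `--volume`).

```python
import subprocess
from pathlib import Path


def docker_command(image, work_dir, tool_args, user="nobody", mount_point="/data"):
    """Build a `docker run` argument list that behaves like one shell command."""
    host_dir = Path(work_dir).resolve()
    return [
        "docker", "run",
        "--rm",                                    # discard the container afterwards
        "--user", user,                            # run the binary as an unprivileged user
        "--volume", f"{host_dir}:{mount_point}",   # share files through a mounted directory
        image,
        *tool_args,
    ]


def run_in_container(image, work_dir, tool_args):
    """Run the containerized tool and hand back its exit code to the caller."""
    # A non-zero return code is how the caller learns the tool failed.
    result = subprocess.run(docker_command(image, work_dir, tool_args))
    return result.returncode
```

A worker would call something like `run_in_container("example/fingerprinter", "/tmp/job", ["/data/track.mp3"])` and treat a non-zero return value as a failed job, just as it would for a static binary.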

We hope that docker can help us with other programs that require complicated or legacy environments to run.

Congratulations to Raj Kumar, Aaron Ximm, and Andy Bezella for the creative solution to a problem that could have made it difficult for us to use some complicated academic code in our production environment.

Go docker!

Building Music Libraries

The Internet Archive is working with partners to preserve our musical heritage. The music collections started 8 years ago with the etree.org live music recordings and grew when we started hosting netlabels.

Scanning an LP cover

Now through new efforts and partnerships we have begun to expand and explore the music collections further.  We are working with researchers, record labels, collectors, internet communities and other archives to gather music media, build tools for preservation and expand metadata for exploration.

We have already made tremendous progress. We have archived millions of tracks, we are working with the Archive of Contemporary Music to digitize portions of their extensive collections of physical media, the MusicBrainz.org community has provided meticulous metadata, and researchers from university programs have begun to analyze the music.

Listening Room

A prototype “listening room” in the Internet Archive’s building in San Francisco is available free to the public to listen to the full musical holdings.  Access to these collections will also be provided to select computer science researchers via a secure “virtual reading room” in our data center.  As tools and the collections grow, we will offer everyone access to the metadata to help them explore, and then offer links to commercial sites for listening or purchasing.

We invite interested people to participate:

Archives. The Internet Archive and the Archive of Contemporary Music in New York have started digitizing ACM’s holdings with consistent, high quality, standards-based methods to build a scalable workflow.  We welcome other archives with similar projects, or who would like to help.  “Digitizing our large physical collections is an important step for our archive to allow others to learn from this deep legacy,” said Bob George, Director of the Archive of Contemporary Music, NYC.

Digitizing CDs at the Archive of Contemporary Music

Collectors.  Digitize, donate, or lend material for digitization.  Improve metadata or provide context to help others understand the depth and cultural relevance of these collections.  “Recycled Records is happy to have directed the donation of many thousands of LPs to the Internet Archive to help with their projects and for the love of music,” said Bruce Lyall, proprietor of Recycled Records.

Labels.  Preserving a complete collection of everything published by a label is best done by or with the record label.  We would like to work with labels to get their releases archived and properly cataloged.  “The upcoming Music Libraries program continues the very work that enables our label, and the musicians who record for us, to bring the music of earlier times to audiences today. We are proud to participate in a tradition of preservation that has brought joy to so many through music,” said David Fox, Co-founder of Musica Omnia.

Cataloging services.  Commercial and non-commercial cataloging services can participate by making sure there are proper links from and to these collections.  The musicbrainz.org open, community-created catalog has already been very helpful.

Commercial vendors and streaming services.  Links from these collections to commercial services can help users buy and listen to full tracks.  These services might have valuable metadata as well that can help users navigate.

Musicians and bands.  Please create more great works that libraries can preserve and provide access to.  We would like to hear your ideas about making the site useful for both musicians and the general public.

Researchers, historians, and music lovers.  Annotate, organize, datamine, and surface music in the collections, and help us preserve those works not yet in the collections.  “Access to a comprehensive archive of commercial music audio is the key missing link for research relating signal processing to listener behavior,” said Daniel Ellis, professor at Columbia University.  By analyzing the rhythms, keys, instruments, and genres, researchers will help create more complete metadata and aid discovery.

Looking to the future, we hope to expand these shared music collections by uniting the work done by other archives and collectors.  By bringing all of this music and its metadata into a shared library, we hope to bring the richness of our musical heritage to people all over the world.

Visit the Listening Room

Internet Archive
300 Funston Ave
San Francisco, CA 94118
Hours: Fridays from 1-4pm, or by appointment.

If you would like to participate in any way, please email us.