Category Archives: Music

30 Days of Stuff

Jason Scott, free-range archivist, reporting in as 2017 draws to a close.

As part of our end-of-year fundraising drive, I thought it might be fun to tweet highlighted parts of the vast stacks of content that the Internet Archive makes available for free to millions. A lot of folks know about our Wayback Machine and its 20+ years of website history, but there’s petabytes of media and works available to see throughout the site. I called it “30 Days of Stuff”, and for the last 30 days I’ve been pointing out great items at the Archive, once a day.

You won’t have to swim upstream through my tweets; here on the last day, I’ve compiled the highlighted items in this entry. Enjoy these jewels in the Archive’s collection, a small sample of the wide range of items we provide.

Books and Texts

  • The Latch Key of my Bookhouse was one of the first books scanned by the Internet Archive in its book scanner tests, and it’s a 1921 directory of Children’s Literature that is filled with really nice illustrations that came out great.
  • As part of our ever-growing Defense Technical Information Center collection, we have The Role of the Citizens Band Radio Service and Travelers Information Stations In Civil Preparedness Emergencies Final Report, a 1978 overview of CB Radio and what role it might play in civil emergencies. Many thousands of taxpayer-funded educational and defense items are mirrored in this collection.
  • Also in the DTIC collection is The Battalion Commander’s Handbook 1980, which besides the crazy front page of stamps, approvals and sign-offs, is basically a manager’s handbook written from the point of view of the US Army.
  • There are hundreds of tractor manuals at the Archive. Hundreds! Of all types, languages (a lot of them Russian) and level of information. Tractors are one of those tools that can last generations and keeping the maintenance on them in the field can make a huge difference in livelihood.
  • A lovely 1904 catalog for plums called The Maynard Plum Catalogue was scanned in with one of our partner organizations and it’s a breathless and inspiring declaration of the future wonder of the plums this wizard of plum-growing, Luther Burbank, was bringing to the world.
  • Xerox Corporation released “A Metamorphosis of Creative Copying” in 1964, which seems to function as both promotion for Xerox and a weird gift to give to your kids to color in.
  • In 2014, a short zine called The Tao of Bitcoin was released, telling people the dream of $10,000 bitcoin would be real.
  • The 1888 chapbook Goody Two-Shoes has lovely illustrations, and a fine short story.
  • Working with a lovely couple who brought in a 1942 black-owned-businesses directory, I scanned the pages by hand and put them up into this item.
  • Inside that directory was an ad for a school of whistling that said it taught using the methods of Agnes Woodward, and a quick scan of the Archive’s stacks showed that we had an entire copy of her book Whistling as an Art!
  • The medical treatise Sleep and Its Derangements, from 1869, is William A. Hammond, MD’s overview of sleep, and what can go wrong. Scanned from the Francis A. Countway Library of Medicine, it’s one of many thousands of books we’ve scanned with partners.
  • Let Hartman Feather Your Nest could be described as “A furniture catalog” in the same way the Sistine Chapel could be described as “a place of worship”. The catalog is a thundering, fist-pounding declaration of the superiority of the Hartman enterprise and the quality and breadth of furniture and service that will arrive at your door and be backed up to the far reaches of time.

Magazines

  • Photoplay considered itself the magazine for the motion picture industry in the first part of the 20th century, and this multi-volume compilation of photos, articles and advertisements is a truly lovely overview.
  • There’s over 140 issues of the classic Maximum RockNRoll zine, truly the king of music zines for a very long time. On its newsprint pages are howls and screeches of all manner of punk, rock and the needs of musicians.
  • A magazine created by the Walt Disney Company to trumpet various parts of Disneyland and its attractions was called Vacationland, and this Fall 1965 issue covers all sorts of stuff about the park’s first decade.

Movies

  • Rescued from a warehouse years ago, a collection of Hollywood movie “B-Roll”, unused secondary scenes often filmed by different crew, has been digitized. My personal favorite is [Western Film Scenes], which is circa 1950s footage of a Western Town, all of it utterly fake but feeling weirdly real, to be used in a western. Don’t miss everyone standing around looking right at you and looking like they agree quite energetically with you!
  • No compilation could be complete without the legendary Duck and Cover, a cartoon/PSA that explained the simple ways to avoid injury in a nuclear blast. Just lie down! It’ll be fine. Please note: This Probably Won’t Work. But the song is very catchy.
  • The very weird Electric Film Format Acid Test from 1990 has a semi-interested model holding up a color bar plate in a wide, wide variety of film and video formats. Filmed just a few blocks away from the Internet Archive’s current headquarters.
  • I snuck in a 1992 interview with the Archive’s founder, Brewster Kahle, recorded when he was 33 and working at WAIS, a company or two before the Archive. In it he is asked about his thoughts on information and gathering of data. It’s quite interesting to hear the consistency of thought.
  • The Office of War Information worked with Disney to create “Dental Health“, a film to show to troops about proper dental care. It’s a combination of straightforward animation and industrial film-making worth enjoying.

Audio

  • We have a collection of hours of the radio show The Shadow from 1938-1939, starring Orson Welles at 23, at the height of his performance powers, playing the dual main role.
  • For Christmas Eve, we pointed to “Christmas Chopsticks”, a 1953 78rpm record of “Twas the Night Before Christmas” performed to the tune of the classic piano piece “Chopsticks”; one of tens of thousands of 78rpm records the Archive has been adding this year.
  • On Christmas, a user of the Archive uploaded two obscure albums he’d purchased on eBay – remnants of the S. S. Kresge Company, which became K-Mart, and which were played over the PA system for shoppers. He got his hands on Albums #261 and #294.
  • Earlier in the month before the user uploaded those Christmas albums, I linked to a different holiday collection of K-Mart items, a 1974 Reel-to-Reel that started with a K-Mart jingle and went full holiday from there.
  • Before he was a (retired) talk show host, and before he was a stand-up comedian, David Letterman worked and trained in radio. Happily, we have recordings of Dave Letterman, DJ, from when he was 22, at Ball State University.
  • Ron “Boogiemonster” Gerber has been hosting his weekly pop music recycling radio show, “Crap from the Past”, for over 25 years, and he’s been uploading and cataloging his show to the Archive for well over 10 of those years, including all the way back to the beginning of his show. The full Crap From The Past archive is up and is hundreds of hours of fun.
  • The truly weird “Conquer the Video Craze” is a 1982 record album with straightforward descriptions of how to beat games like Centipede, Defender, Stargate, Dig Dug, and more. This album has been sampled from by multiple DJs to bring that extra spice to a track.
  • Over 3,000 shows at the DNA Lounge are at the archive, including “Bootie: Gamer Night“, which combines mash-up tracks and video games. Bootie has been playing at DNA Lounge for years, and puts the audio from one song with the singing from another, and… it’s quite addicting, like games. This night was for the nearby Game Developers’ Conference being held the same week.

Software

  • In 2011, as part of a “retrocomputing” competition, we saw the release of “Paku-Paku”, a pac-clone program which ran in an obscure early PC-Compatible graphics mode that was very colorful and very small (160×100) and was built perfectly for it. You can play the game in your browser by clicking here.
  • Psion Chess is a game for the Macintosh that can play both you and itself with pretty high levels of skill and really sharp and crisp black and white graphics. It makes a really great screensaver in self-playing mode.

People often overuse a phrase like “Barely scratched the surface”, but I assure you there are millions of amazing items in the archive, and it’s been a pleasure to bring some to light. While the 30 Days of Stuff was a fun way to stretch out a month of fundraising with stuff to see every day, we’re here 24/7 to bring you all these items, and welcome you finding jewels, gems and clunkers throughout our hard drives whenever you want.

Thanks for another year!

Dreaming of Semantic Audio Restoration at a Massive Scale

I believe we can do a fabulous job of bringing the music from the 78rpm era back to vibrant life if we really understand wear and if we could model the instruments and voices.

In other words, I believe we could reconstruct a performance by semantically modeling the noise and distortion we want to get rid of, as well as modeling the performer’s instruments.

To follow this reasoning—what if we knew we were examining a piano piece and knew what notes were being played on what kind of piano and exactly when and how hard for each note—we could take that information to make a reconstruction by playing it again and recording that version. This would be similar to what optical character recognition (OCR) does with images of pages with text—it knows the language and it figures out the words on the page and then makes a new page in a perfect font. In fact, with the OCR’ed text, you can change the font, make it bigger, and reflow the page to fit on a different device.

What if we OCR’ed the music? This might work well for the instrumental accompaniment, because then we would handle a voice, if any, differently. We could have a model of the singer’s voice based on not only this recording and other recordings of this song, but also all other recordings of that singer. With those models we could reconstruct the voice without any noise or distortion at all.

We would balance the reconstructed and the raw signals to maintain the subtle variations that make great performances. Some of the original noise could also be kept for context, much as digital filmmakers sometimes add scratched-film effects.

So, there can be a wide variety of restoration tools if we make the jump into semantics and big data analysis.

The Great 78 Project will collect and digitize over 400,000 78rpm recordings to make them publicly available, creating a rich data set for large-scale analysis. These transfers are being done with four different styli shapes and sizes at the same time, all recorded as 96KHz/24bit lossless samples in stereo (even though the records are mono, this provides more information about the contours of the groove). This means each groove has 8 different high-resolution representations for every 11 microns. Furthermore, there are often multiple copies of the same recording that would have been stamped and used differently. So, modeling the wear on the record and using that to reconstruct what would have been on the master may be possible.

Many important records from the 20th century, such as jazz, blues, and ragtime, have only a few performers on each, so modeling those performers, instruments, and performances is quite possible.  Analyzing whole corpuses is now easier with modern computers, which can provide insights beyond restoration as well as understand playing techniques that are not commonly understood.

If we build full semantic models of instruments, performers, and pieces of music, we could even create virtual performances that never existed.  Imagine a jazz performer virtually playing a song that had not been written in their lifetime. We could have different musician combinations, or singers performing with different cadences. Areas for experimentation abound once we cross the threshold of full corpus analysis and semantic modeling.

We hope the technical work done on this project will have a far-reaching effect on a full media type, since the Great 78 Project will digitize and hold a large percentage of all 78rpm records produced from 1908 to 1950. Any techniques built upon these recordings can therefore be used to restore a great many records.

Please dive in and have fun with a great era of music and sound.

 

(We get a sample every 11 microns when digitizing the outer rim of a 78rpm record at 96KHz. Given that we now have 8 different readings of that, each with 24bit resolution, we can hopefully get a good idea of the groove. There are optical techniques that are very cool, but those have their own issues, I am told.

10″ × 3.14 = 31.4″ circumference = 80cm/revolution

@ 78rpm: 60 seconds/minute / 78 revolutions/minute = .77 seconds/revolution

80cm/rev / (.77sec/rev) = 104cm/sec

96Ksamples/sec

104cm/sec / (96Ksamples/sec) = 11 microns)
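
The same back-of-the-envelope arithmetic can be written as a short Python sketch. The function name and its default arguments are illustrative only; it assumes a 10-inch disc measured at its outer rim, and it uses the unrounded value of pi, so the result lands slightly under the 11 microns quoted above.

```python
import math

def groove_sample_spacing_microns(disc_diameter_inches=10.0, rpm=78.0, sample_rate_hz=96_000):
    """Distance the groove travels under the stylus between two consecutive samples."""
    circumference_cm = math.pi * disc_diameter_inches * 2.54  # inches to centimeters
    seconds_per_revolution = 60.0 / rpm
    groove_speed_cm_per_s = circumference_cm / seconds_per_revolution
    spacing_cm = groove_speed_cm_per_s / sample_rate_hz
    return spacing_cm * 10_000  # centimeters to microns

# Four styli, each captured in stereo, give eight readings per groove position.
readings_per_position = 4 * 2

spacing = groove_sample_spacing_microns()  # roughly 10.8 microns at the outer rim
print(f"{spacing:.1f} microns between samples, {readings_per_position} readings each")
```

Passing a smaller diameter shows how the spacing tightens toward the label, since the groove moves more slowly there while the sample rate stays fixed.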

 

Listening to the 78rpm Disc Collection


By Jessica Thompson, Coast Mastering

The Great 78 Project
A few times a year, I join B. George in the Internet Archive’s warehouses to help sort and pack 78rpm discs to ship to George Blood L.P. for digitization. As a music fan and a professional mastering and restoration engineer, I get a thrill from handling the heavy, grooved discs, admiring the fonts and graphic designs on the labels, and chuckling at amusing song titles. Now digitized, these recordings offer a wealth of musicological, discographic and technical information, documenting and contextualizing music and recording history in the first half of the 20th century.

The sheer scale of this digitization project is unprecedented. At over 15,000 recordings and counting, the value strictly in terms of preservation is clear, especially given the Internet Archive’s focus on digitizing music less commonly available to researchers. Music fans can take a deep dive into early blues, Hawaiian, hillbilly, comedy and bluegrass. I even found several early Novachord synthesizer recordings from 1941.

As a researcher and audio restoration engineer, the real goldmine is in the aggregation of discographic and technical metadata accompanying these recordings. Historians can search for and cross reference recordings based on label, artist, song title, year of release, personnel, genre, and, importantly, collection. (The Internet Archive documents the provenance of the 78rpm discs so that donated collections remain digitally intact and maintain their contextual significance.) General users can submit reviews with notes to amend or add to metadata, and the content of those reviews is searchable, so metadata collection is active. No doubt it will continue to improve as dedicated and educated users fill in the blanks.

Access to the technical metadata offers a valuable teaching tool to those of us who practice audio preservation. For audio professionals new to 78s and curious about how much difference a few tenths of a millimeter of stylus can make, the Internet Archive offers 15,000+ examples of this. Play through the different styli options, and it quickly becomes apparent that particular labels, years and even discs do respond better to specific styli sizes and shapes. This is something audio preservationists are taught, but rarely are we presented with comprehensive audio examples. To be able to listen to and analyze the sonic and technical differences in these versions marries the hard science with the aesthetic.

Playback speeds were not standardized until the late 1920s or early 1930s, and most discs were originally cut at speeds ranging from 76-80rpm (and some well beyond). The discs in the George Blood Collection were all digitized at a playback speed of 78rpm. Preservationists and collectors debate extensively about the “correct” speed at which discs ought to be played back, and whether one ought to pitch discs individually. However, performance, recording and manufacturing practices varied so widely that even if a base speed could generally be agreed upon, there will always be exceptions. (For more on this, please check out George Blood’s forthcoming paper Stylus Size And Speed Selection In Pre-1923 Acoustic Recordings in Sustainable audiovisual collections through collaboration: Proceedings of the 2016 Joint Technical Symposium. Bloomington, IN: Indiana University Press.)

Every step of making a recording involves so many aesthetic decisions – choices of instrumentation, methods of sound amplification, microphone placement, the materials used in the disc itself, deliberate pitching of the instruments and slowing or speeding of the recording – that playback speed simply becomes one of many aesthetic choices in the chain. As preservationists, we are preserving the disc as an historic record, not attempting to restore or recreate a performance. (Furthermore, speed correction is possible in the digital realm, should anyone want to modify these digital files for their own personal enjoyment.)
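
For a rough sense of what is at stake in the speed debate, the sketch below converts a mismatch between cutting speed and playback speed into a pitch offset in semitones. The function names are hypothetical, and the sketch assumes simple varispeed behavior in which pitch scales directly with rotational speed.

```python
import math

def pitch_offset_semitones(playback_rpm, cut_rpm):
    """Pitch shift heard when a disc cut at cut_rpm is played back at playback_rpm."""
    return 12 * math.log2(playback_rpm / cut_rpm)

def speed_correction_factor(playback_rpm, cut_rpm):
    """Factor by which a digital transfer would be sped up (>1) or slowed (<1) to undo the mismatch."""
    return cut_rpm / playback_rpm

# Discs cut at the edges of the 76-80rpm range, all transferred at 78rpm.
for cut_rpm in (76.0, 78.0, 80.0):
    offset = pitch_offset_semitones(78.0, cut_rpm)
    factor = speed_correction_factor(78.0, cut_rpm)
    print(f"cut at {cut_rpm:.0f}rpm: {offset:+.2f} semitones, correct by x{factor:.4f}")
```

Under these assumptions a disc cut at 76 or 80rpm lands a little under half a semitone away from its intended pitch when transferred at 78rpm.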

How do they sound? Each 78rpm disc has an inherent noise fingerprint based on the frequency and dynamic range the format can replicate (limited, compared to contemporary digital playback formats) and the addition of surface noise from dust, dirt and stylus wear in the grooves. As expected, the sound quality in this collection varies. Some of these discs were professionally recorded, minimally played, stored well, and play back with a tolerable, even ignorable level of surface noise relative to the musical content. Others were recorded under less professional circumstances, and/or were much loved, frequently played, stored without sleeves in basements and attics, and therefore suffer from significant surface noise that can interfere with enjoyment (and study) of the music.

Yet, a compelling recording can cut through noise. Take this 1944 recording of Josh White performing St. James Infirmary, Asch 358-2A. This side has been released commercially several times, so if you look it up on a streaming service like Spotify, you can listen to different versions sourced from the same recording (though almost certainly not from the same 78rpm disc). They play at different speeds, some barely perceptibly faster or slower but at least one nearly a half-step faster than the preservation copy digitized by George Blood L.P. They also have a range of noise reduction and remastering aesthetics, some subtle and some downright ugly and riddled with digital artifacts. The version on the Internet Archive offers a benchmark. This is what the recording sounded like on the original 78rpm disc. Listen to the bend in the opening guitar notes. That technique cuts through the surface noise and should be preserved and highlighted in any restored version (which is another way of saying that any noise reduction should absolutely not interfere with the attack and decay of those luscious guitar notes).

McGill University professor of Culture and Technology Jonathan Sterne wrote a book, The Audible Past: Cultural Origins of Sound Reproduction, that is worth reading for anyone interested in a cultural history of early recording formats, including 78s. As Sterne says, sound fidelity is “ultimately about deciding the values of competing and contending sounds.” So, in listening to digital versions of 78s on the Internet Archive, music fans, researchers, and audio professionals alike engage in a process of renegotiating concepts of acceptable thresholds of noise and what that noise communicates about the circumstances of the recording and its life on a physical disc.

Fortunately, our brains are very good at calibrating to accept different ratios of signal to noise, and, I found, the more I listened to 78rpm recordings on the Internet Archive, the less I was bothered by the inherent noise. Those of us who grew up on CDs or digitally recorded and distributed music are not used to the intrusions of surface noise. However, when listening to historic recordings, we are able to adjust our expectations and process a level of noise that would be ridiculous in contemporary music formats. (Imagine this week’s Billboard Top 100 chart topper, Bruno Mars’s “That’s What I Like,” with the high and low end rolled off, covered in a sheen of crackles and pops). The fact that these 78rpm recordings sound, to us, like they were made in the 1920s, 1930s, 1940s lets them get away with a different scale of fidelity. The very nature of their historicity gets them off the hook.

In analog form, crackles and pops can be mesmerizing, almost like the sound of a crackling fire. However, once digitized, those previously random pops become fixed in time. What may have been enjoyable in analog form becomes a permanent annoyance in digital form. The threshold of acceptable noise levels moves again.

This means that noise associated with recording carriers such as 78rpm discs is almost always preferable to noises introduced in the digital realm through the process of attempted noise reduction. Sound restorationists understand that their job is to follow a sonic Hippocratic oath: do no harm. Though noise reduction tools are widely available, they range in quality (and accordingly in cost), and are merely tools to be used with a light or heavy touch, by experienced or amateur restorationists.

The question of whether noise reduction of the Internet Archive’s 78rpm recordings could be partially automated makes my heart palpitate. Though I know from experience that, for example, auto-declickers exist that could theoretically remove a layer of noise from these recordings with minimal interference with the musical signal, I don’t believe the results would be uniformly satisfactory. It is so easy to destroy the aura of a recording with overzealous, heavy-handed, cheap, or simply unnecessary noise reduction. Even a gentle touch of an auto-declicker or de-crackler will have widely varying results on different recordings.

I tried this with a sampling of selections from the /georgeblood/ collection. I chose eleven songs from different genres and years and ran two different, high quality auto-declickers (the iZotope RX6 Advanced multiband declicker and CEDAR Audio’s declick) on the 24bit FLAC files. The results were uneven. Some of the objectively noisier songs, such as Blind Blake’s Tampa Bound, Paramount 12442-B, benefited from having the most egregious surface noises gently scrubbed.

Tampa Bound Flat Transfer vs Tampa Bound Declicked, Dehissed and Denoised
that’s a lot of noise!

However, a song with a strong musical presence and mild surface noise such as Trio Schmeed’s Yodel Cha Cha, ABC-Paramount 9660, actually suffered more from light auto-declicking because the content of the horns and percussive elements registered to the auto-declicker as aberrations from the meat of the signal and were dulled. A pop presents as an aberration across all frequencies. Mapped visually across frequency, time and intensity, it looks like a spike cutting through the waveform. A snare hit looks similar and is therefore likely to be misinterpreted by an auto-declicker unless the threshold at which the declicker deploys is set very carefully. This difference is why good restorationists earn their pay.

Yodel Cha Cha flat transfer and denoised. Notice the “clicks and pops” have been scrubbed,
but so has wanted high end content in the music.
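
The threshold problem can be illustrated with a toy detector. This is only a sketch on synthetic data, not how iZotope RX or CEDAR work internally: it flags any sample-to-sample jump that exceeds a multiple of the typical jump, and every name and number in it is an assumption chosen for the demonstration.

```python
import math
from statistics import median

SAMPLE_RATE = 8000  # Hz, arbitrary for this toy signal

def make_test_signal(n=1000):
    """Quiet 100 Hz tone with one pop (sample 300) and one snare-like burst (from sample 600)."""
    signal = [0.2 * math.sin(2 * math.pi * 100 * i / SAMPLE_RATE) for i in range(n)]
    signal[300] += 0.7  # single-sample spike standing in for a surface pop
    for k in range(200):
        # Rapidly alternating, decaying burst standing in for a percussive hit
        signal[600 + k] += 0.15 * (-1) ** k * math.exp(-k / 20)
    return signal

def detect_clicks(samples, threshold):
    """Return indices whose jump from the previous sample exceeds threshold x the median jump."""
    jumps = [abs(samples[i] - samples[i - 1]) for i in range(1, len(samples))]
    typical = median(jumps) or 1e-12  # guard against an all-silent input
    return [i for i, jump in enumerate(jumps, start=1) if jump > threshold * typical]

signal = make_test_signal()
aggressive = detect_clicks(signal, threshold=10)  # low threshold: flags the burst as well
careful = detect_clicks(signal, threshold=40)     # higher threshold: flags only the pop

print("aggressive:", len(aggressive), "samples flagged")
print("careful:", careful)
```

In this contrived case a higher threshold happens to separate the pop from the percussion. On real discs the two overlap far more, which is why the setting has to be judged by ear for each recording.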

I am approaching this collection as a listener and music fan, as a researcher, and as an audio professional, three very different modes of listening and interacting with music. In all cases, the Internet Archive 78rpm collection offers massive amounts of music and data to be explored, discovered, enjoyed, studied and utilized. Whether you want to listen to early Bill Monroe tunes, crackles, pops and all, or explore hundreds of recordings of pre-war polkas, or analyze the effects of stylus size on 1930s Victor discs, the Internet Archive provides the raw materials in digital form and, not to be underestimated, preserves the original discs too.

The New Memory Palace

By Paul D. Miller aka DJ Spooky

     “Sometimes it is the people no one can imagine anything of who do the things no one can imagine.”

– Alan Turing’s biopic, The Imitation Game, 2014

DJ Spooky at Internet Archive’s 20th Anniversary Celebration
Photo Credit: Mitchell Maher

A lot of things have changed in the last 20 years. A lot of things haven’t. We’ve moved from the tyranny of physical media to the seemingly unlimited possibilities of total digital immersion. We’ve moved from a top down, mega corporate dominated media, to a hyper-fragmented multiverse where any kind of information is accessible within reason (and sometimes without!). The fundamental issue that “memory” and how it responds to the digital etherealization of all aspects of the information economy we inhabit conditions everything we do in this 21st-century culture of post-, post-, post-everything contemporary America. Whether it’s the legions of people who walk the streets with Bluetooth enabled earbuds that allow them to ignore the physical reality of the world around them, or the Pokémon Go hordes playing the world’s largest video game as it’s overlaid on stuff that happens “IRL” (In Real Life) that layer digital role playing over the world: diagnosis is pending. But the fundamental fact is clear: digital archives are more important than ever and how we engage and access the archival material of the past, shapes and molds the way we experience the present and future. Playing with the Archive is a kind of digital analytics of the subconscious impulse to collage. It’s also really fun.

Mnemosyne was the Greek muse who personified memory. She was a Titaness who was the daughter of Uranus (who represented “Sky”), the son and husband of Gaia, Mother Earth. When you break it down, Mnemosyne had a deeply complicated life, and ended up birthing the other muses with her nephew, Zeus. Ancient Greek myth was quite an incestuous place, and every deity had complicated and deeply interwoven histories that added layers and layers of what we would now call “intertextuality.” Look at it this way: a Titaness, Mnemosyne, gave birth to Urania (Muse of astronomy), Polyhymnia (Muse of hymns), Melpomene (Muse of tragedy), Erato (Muse of lyric poetry), Clio (Muse of history), Calliope (Muse of epic poetry), Terpsichore (Muse of dance), and Euterpe (Muse of music). It’s complicated. Mnemosyne also presided over her own pool in Hades as a counterpoint to the river Lethe, where the dead went to drink to forget their previous life. If you wanted to remember things, you went to Mnemosyne’s pool instead. You had to be clever enough to find it. Otherwise, you’d end up crossing the mythical river into the land of the dead, aka Hades, under the control of spirits guided by the “helmsman,” whose title comes from the Greek term “kybernētēs.” What’s amazing about the wildly “recombinant” logic of this cast of characters is that somehow it became the foundation of our modern methods for naming almost every aspect of digital media, including the term “media.” Media, like the term data, is a plural form of a word “appropriated” directly from Latin. But the eerie resonance it has with our era comes into play when we think of the ways “the archive” acts as a downright uncanny reflection site of language and its collision between code and culture.

Until the internet, the term cyber was usually attached to words about governance, and it later evolved to describe how we look at computers, computer networks, and now things like augmented reality and virtual reality. The term traces back to the word cybernetics, which was popularized by the renowned mathematician Norbert Wiener, a pioneer of information theory, at MIT. There’s a strange emergent logic that connects the dots here: permutation, wordplay, and above all, the use of borrowed motifs and ahistorical connections between utterly unassociated material. I guess William S. Burroughs was right: the world has become a mega-Cybertron, a place where everything is mixed, cut and paste style, to make new meanings from old. With people like Norbert Wiener, cybernetics usually refers to the study of mechanical and electronic systems designed, at heart, to replace human systems. The term “cyberspace” was coined by William Gibson, to reflect the etherealized world of his 1982 classic, Burning Chrome. He used it again as a reference point for Neuromancer, his groundbreaking novel. A great, oft-cited passage gives you a sense of how resonant it is with our current time:

Cyberspace. A consensual hallucination experienced daily by billions of legitimate operators, in every nation, by children being taught mathematical concepts… a graphic representation of data abstracted from the banks of every computer in the human system. Unthinkable complexity. Lines of light ranged in the nonspace of the mind, clusters and constellations of data. Like city lights, receding…

When the Internet Archive asked me to do a megamix of their archive of recordings from their data files, I was a bit overwhelmed. There’s no way any human being could comb through even the way they’ve documented just the web, let alone the material they have asked people to upload. Where to start? Sir Tim Berners-Lee’s speech inaugurating the internet back when he came up with the term the “Semantic Web?” The first recordings from Edison? That could be cool. Maybe mix that with GW Bush’s State of the Union speech inaugurating the invasion of Iraq? Why not. Take Hedy Lamarr’s original blueprints for spread spectrum “secret communications systems” and mix that with recordings of William S. Burroughs and Malcolm X, with a beat made from open source 1920’s jazz and 1950’s New Orleans blues? Why not. Grab some clips of Cory Doctorow talking about the upcoming war on open computing and mix it with Parliament Funkadelic? Sure. Take the first “sound heard around the world,” the telemetry signals guiding the Sputnik satellite as it swirled around planet Earth to become our first orbital artificial moon? Cool. Why not? Take a speech from Margaret Sanger, the woman who started Planned Parenthood, and mix it with Public Enemy? Cool. Take D.W. Griffith’s “Birth of a Nation” and re-score it with the Quincy Jones theme from “Fat Albert?” That would actually be kind of cool, but would require a lot of editing.

The basic idea here is that once you have the recordings and documentation of all aspects of human activity from the last several centuries, that is a serious “mega-mix.”

What you will hear in the short track I made is a mini reflection of the density of the sheer volume of materials that the Internet Archive has onsite. It is a humble reminder that through the computer, the network, and the wireless transmission of information, we have an immaculate reflection of what Alan Turing may have called “morphogenesis” — the human, all too human, attempt to corral the world into anthropocentric metaphors that seek to convey the sublime, the edge of human understanding: the emergent patterns that occur when you recombine material with unexpectedly powerful new connections.

Memory Palace on flexi vinyl
Photo Credit: Mitchell Maher

I’m honored to be the first DJ to start. But I’m also honored that many, many more will follow. The Archive is a mirror of infinite recombinant potential. I hope that its gift of free culture and free exchange creates a place where we will be comfortable with what is almost impossible to guess comes next. It is not a “collaborative filter” but a place where you are invited to explore on your own and come up with new ways of seeing the infinite memory palace of the fragments of history, time, and space that make this modern 21st century world work.

Enjoy.

Paul D. Miller aka DJ Spooky

DJ Spooky’s work ranges from creating the first DJ app to producing an impactful DVD anthology about the “Pioneers of African American Cinema.” According to a New York Times review, “there has never been a more significant video release than ‘Pioneers of African-American Cinema.’” The prolific innovator and artist also created 13 music albums and is about to release a fourteenth. Called “Phantom Dancehall,” it is an intense mix of hip hop, Jamaican ska and dancehall culture.

Rock Against the TPP is Coming to San Francisco…TOMORROW!

On Friday, September 9th hip hop icons Dead Prez, actress Evangeline Lilly, punk legend Jello Biafra, Grammy winners La Santa Cecilia, and others will play a free concert at the Regency Ballroom in San Francisco to protest the Trans-Pacific Partnership (TPP).

The TPP is a contentious trade agreement that is getting quite a bit of negative press in the 2016 U.S. election cycle. Among many other issues, the TPP would govern how signatory countries protect and enforce intellectual property rights. The TPP could have a large negative impact on libraries by increasing copyright term limits and neglecting the essential limitations on copyright law that libraries around the world rely on. Many different groups have vocally opposed the TPP, both for its substance and for the secrecy of the negotiations process.

Organized by Fight for the Future and Rage Against the Machine guitarist Tom Morello, the tour is designed to pull new audiences into the fight against the TPP. See more details and a full lineup at https://www.rockagainstthetpp.org/san-francisco-ca

The concert will be followed by a teach-in on “How to Fight the TPP” on Saturday, Sept. 10th from 1pm – 3pm at 1999 Bryant Street, hosted by experts from a wide range of organizations opposing the TPP.

Saving the 78s

Written by B. George, the Director of ARChive of Contemporary Music in NYC, and Curator of Sound Collections at the Internet Archive in San Francisco.

While audio CDs whiz by at about 500 revolutions per minute, the earliest flat disks offering music whirled at 78rpm. They were mostly made from shellac, i.e., beetle (the bug, not The Beatles) resin and were the brittle predecessors to the LP (microgroove) era. The format is obsolete, the surface noise is often unbearable, and just picking them up can break your heart as they break apart in your hands. So why does the Internet Archive have more than 200,000 in our physical possession?

A little over a year ago New York’s ARChive of Contemporary Music (ARC) partnered with the Internet Archive to focus on preserving and digitizing audio-visual materials. ARC is the largest independent collection of popular music in the world. When we began in 1985 our mandate was microgroove recordings – meaning vinyl – LPs and forty-fives. CDs were pretty much rumors then, and we thought that other major institutions were doing a swell job of collecting earlier formats, mainly 78rpm discs. But donations and major research projects like making scans for The Grammy Museum and The Ertegun Jazz Hall of Fame placed about 12,000 78s in our collection.

For years we had been getting calls offering 78 collections that we were unable to accept. But when space and shipping became available through the Internet Archive, it was now possible to begin preserving 78s. Here’s a short history of how in only a few years ARC and the Internet Archive have created one of the largest collections in America.

Our first major donation came from the Batavia Public Library in Illinois, part of the Barrie H. Thorp Collection of 48,000 78s.

We’re always a tad suspicious of large collections like these. First thought is, “Must be junk.” Secondly, “It’s been cherrypicked.” But the Thorp Collection was screened by former ARC Board member Tom Cvikota, who found the donor, helped negotiate the gift and stored it. That was in 2007. Between then and our 2015 pickup Tom arranged for some of the recordings to be part of an exhibition at the Greengrassi Gallery, London, (UK, Mar-Apr, 2014) by artist Allen Ruppersberg, titled, For Collectors Only (Everyone is a Collector).

What makes the Thorp collection unique is the obsessive typewritten card catalog featured in a short film hosted on the exhibition’s webpage. Understanding why you collect and how you give your interests meaning is a part of Allen’s work – artworks that focus on the collector’s mentality. One nice quote by Allen referenced in Greil Marcus’ book, The History of Rock ’n’ Roll in Ten Songs, is, “In some cases, if you live long enough, you begin to see the endings of things in which you saw the beginnings.”

Philosophical musings aside, there are 48,000 discs to deal with. That meant taking poorly packed boxes — many of them open for 20 years — and re-boxing them for proper storage. The picture below shows an example of how they arrived (on the right), and how they were palletized (on the left.)

The trick to repacking in a timely fashion is to not look at the records. It’s a trick that is never performed successfully. Handling fragile 78s requires grabbing one or just a few at a time. So we’re endlessly reading the labels, sleeving and resleeving, all the time checking for rarities, breakage and dirt.

Now we didn’t do all this work on our own. Working another part of the warehouse was two-and-a-half month old Zinnia Dupler — the youngest volunteer ever to give us a hand. Mom also helped a bit.


A few minutes after the snap I found this gem in the Thorp collection. Coincidence? I don’t think so…

“Burpin” is a country novelty tune from out of Texas by Austin broadcaster and humorist Richard “Cactus” Pryor (1923 – 2011). It came from a box jam-packed with country and hillbilly discs. This was a pleasant surprise, as we expected the collection to be like most we encounter – big band and bland pop. But here was box-after-box of hillbilly, country, and Western swing records. Now, I use’ta think I knew a bit about music. But with this collection, it was back to school for me. Just so many artists I’ve never heard of or held a record by. As we did a bit of sorting, in the ‘G’s alone there’s Curly Gribbs, Lonnie Glosson and the Georgians. Geeez! Did you know that Hank Snow had a recordin’ kid, Jimmy, and he cut “Rocky Mountain Boogie” on 4 Star records, or that Cass Daley, star of stage and screen, was the “Queen of Musical Mayhem”? Me neither. The Davis Sisters, turns out, included a young Skeeter Davis(!) and are not to be confused with the Davis Sisters gospel group, also in this collection. Then there’s them Koen Kobblers, Bill Mooney and his Cactus Twisters, and Ozie Waters and the Colorado Hillbillies. No matter they should be named the Colorado Mountaineers, they’re new to me.

For us this donation is a dream: it allows us to preserve material that was otherwise going to be thrown away; it has a larger cultural value beyond the music; and it contained a mountain of unfamiliar music, much of it quite rare. And most of it is not available online.

It was a second large donation that prompted the Internet Archive to move toward the idea that we should digitize all of our 78s. The Joe Terino Collection came to us through a cold call, the collection professionally appraised at $500,000. The 70,000 plus 78s were stored in a warehouse for more than 40 years, originally deposited by a distributor. Here’s the kicker: they said that we could have it all, but we had to move it – NOW! Internet Archive did, and it came in on 72 pallets, in three semis, from Rhode Island to San Francisco, looking like this…

So Fred Patterson and the crackerjack staff out in our Richmond warehouses (Marc Wendt, Mark Graves, Sean Fagan, Lotu Tii, Tracey Gutierrez, Kelly Ransom, and Matthew Soper) pulled everything off the ramshackle pallets and carefully reboxed this valuable material.


How valuable? Well, we’re really not so sure yet, despite the appraisal, as just receiving and reboxing was such a chore. One hint is this sweet blues 78 that we managed to skim off the top of a pile.


The next step is curating this material, acquiring more collections and moving towards preservation through digitization. Already we have a pilot project in the works with master preservationist George Blood to develop workflow and best digitization practices.

We’re doing all this because there’s just no way to predict if the digital will outlast the physical, so preserving both will ensure the survival of cultural materials for future generations to study and enjoy. And, it’s fun.

 

Microphone Check: Thousands of Hip-Hop Mixtapes at the Archive

The Internet Archive has been growing an interesting sub-collection of music for the past few months: Hip-Hop Mixtapes. The resulting collection still has a way to go before it’s anywhere near what is out there (limited by bandwidth and a few other technical factors), but now that it’s past 150 solid days of music on there, it’s quite enough to browse and “get the idea”, should you be so inclined.

Note: Hip-Hop tends to be for a mature audience, both in subject matter and language.

I’m sure this is entirely old knowledge for some people, but it was new to me, so I’ll describe the situation and the thinking.


There are some excellent introductions and writeups about mixtapes in Hip-Hop culture at these external articles:

So, in quick summary, there have been mixtapes of many varieties for many years, going back to the 1970s and the dawn of what we call Hip-Hop. In the time since, the “tapes” have become CDs and ZIP files, and they are still being released out into “the internet” to be spread around. The goal is to gain traction and attention for your musical act, or for your skills as a DJ, or any of a dozen reasons related to getting music to the masses.

There is an entire ecosystem of mixtape distribution and access. There are easily tens of thousands of known mixtapes that have existed. This is a huge, already-extant environment out there, that was established, culturally critical, and born-digital.

It only made sense for a library like the Internet Archive to provide it as well.

There’s a lot coded into the covers of these mixtapes (not to even mention the stuff coded into the lyrics themselves) – there’s stressing of riches, drug use, power, and oppression. There’s commentary on government, on social issues, and on the meaning of entertainment and celebrity. There’s parody, there’s aggrandizement, and there’s every attempt to draw in the listeners in what is a pretty large pile of material floating around. It’s not about this song or that grandiose portrait, though – it’s about the fact this whole set of material has meaning, reality and relevance to many, many people.

How do I know this has relevance? Within 24 hours of the first set of mixtapes going onto the Archive, many of the albums already had hundreds of listeners, and one of them broke a thousand views. Since then, a good number have had tens of thousands of listens. Somebody wants this stuff, that’s for sure. And that’s fundamentally what the Archive is about – bringing access to the world.

The end goal here is simple: Providing free access to huge amounts of culture, so people can reference, contextualize, enjoy and delight over material in an easy-to-reach, linkable, usable manner. Apparently it’s already taken off, but here you go too.

Get your drank on here.

archive.org download counts of collections of items updates and fixes

Every month, we look over the total download counts for all public items at archive.org.  We sum item counts into their collections.  At year end 2014, we found various source reliability issues, as well as overcounting for “top collections” and many other issues.

archive.org public items tracked over time


To address these problems, we made the following changes:

  • Built a new system that uses our database (DB) for item download counts, instead of our less reliable (and more prone to “drift”) SOLR search engine (SE).
  • Changed the monthly saved data from JSON and PHP serialized flatfiles to a new DB table, which is much easier to use now!
  • Fixed overcounting issues for collections: texts, audio, etree, movies
  • Fixed various overcounting issues caused by not unique-ing <collection> and <contributor> tags (more below)
  • Fixed character encoding issues on <contributor> tags

Bonus points!

  • We now track *all collections*.  Previously, we only tracked items tagged:
    • <mediatype> texts
    • <mediatype> etree
    • <mediatype> audio
    • <mediatype> movies
  • For items whose <contributor> tags we track (texts items), we now have a “Contributor page” that shows a table of historical data.
  • Graphs are now “responsive” (scale in width based on browser/mobile width)

 

The Overcount Issue for top collection/mediatypes

  • In the graph below, mediatypes and collections are shown horizontally, with a sample “collection hierarchy” as it stands today.
  • For each collection/mediatype, we show one example item (A, B, C and D) with its downloads/streams/views count next to it in parentheses. So these are four items, spanning four collections, that happen to sit in a collection hierarchy (a single item can belong to multiple collections at archive.org).
  • The Old Way had a critical flaw: it summed all sub-collection counts, when really it should have summed only the *direct child* sub-collection counts (or gone with our New Way instead).


So we now treat <mediatype> tags like <collection> tags for counting purposes, and we unique all <collection> tags on each item, so that items with slightly nonideal tag data (such as a repeated tag) do not cause another kind of overcounting.
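The difference between the counting approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hierarchy (`texts > top > mid > leaf`), the item records, and every function name here are hypothetical, and the “New Way” is assumed to mean crediting each item once to each distinct tag it carries.

```python
from collections import defaultdict

# Hypothetical data: four items with download counts, arranged in a made-up
# hierarchy texts > top > mid > leaf. Item C carries a repeated tag.
ITEMS = [
    {"id": "A", "mediatype": "texts", "collections": [], "downloads": 1},
    {"id": "B", "mediatype": "texts", "collections": ["top"], "downloads": 10},
    {"id": "C", "mediatype": "texts", "collections": ["top", "mid", "mid"], "downloads": 100},
    {"id": "D", "mediatype": "texts", "collections": ["top", "mid", "leaf"], "downloads": 1000},
]
CHILDREN = {"texts": ["top"], "top": ["mid"], "mid": ["leaf"], "leaf": []}


def direct_count(coll):
    """Downloads of items whose most specific tag is `coll`."""
    total = 0
    for item in ITEMS:
        # Assumption: the last collection tag is the most specific one.
        home = item["collections"][-1] if item["collections"] else item["mediatype"]
        if home == coll:
            total += item["downloads"]
    return total


def descendants(coll):
    """Every sub-collection below `coll`, at any depth."""
    found = []
    for child in CHILDREN[coll]:
        found.append(child)
        found.extend(descendants(child))
    return found


def old_way(coll):
    """Flawed: adds the rolled-up total of every descendant, not just children."""
    return direct_count(coll) + sum(old_way(d) for d in descendants(coll))


def direct_children_way(coll):
    """Adds only the rolled-up totals of the direct children."""
    return direct_count(coll) + sum(direct_children_way(c) for c in CHILDREN[coll])


def new_way():
    """Credit each item once to every distinct tag it carries."""
    totals = defaultdict(int)
    for item in ITEMS:
        tags = {item["mediatype"], *item["collections"]}  # the set de-duplicates tags
        for tag in tags:
            totals[tag] += item["downloads"]
    return dict(totals)


if __name__ == "__main__":
    print(old_way("texts"), direct_children_way("texts"), new_way()["texts"])
```

With this toy data the four items total 1,111 downloads. The flawed roll-up reports 4,211 for `texts`, while both the direct-child roll-up and the per-item tally report 1,111, and the repeated `mid` tag on item C is counted only once.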

 

… and one more update from Feb/1:

We graph the “difference” between the absolute download counts for the current month and the prior month, for each month we have data for. This gives us graphs that show downloads/month over time. However, values can easily go *negative* in various scenarios (which is *wickedly* confusing to our poor users!)

Here’s that situation:

A collection has a really *hot* item one month, racking up downloads in a given collection. The next month, a DMCA takedown or some other action removes the item from being available (and thus from being counted in the future). The downloads for that collection can plummet in the next month’s run, when the counts are summed over public items for that collection again. So that collection would have a negative (net) downloads count change for this next month!

Here’s our fix:

Use the current month’s collection “item membership” list for both the current month *and* the prior month. Sum the counts for all those items for each of the two months, and make the graphed difference be the difference between those sums. In just about every situation that remains, graphed monthly download counts will be nonnegative (positive or zero), because an individual item’s cumulative count normally only grows.
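As a rough sketch of that fix, assuming cumulative per-item download counts are available as plain dictionaries (the function names, item identifiers, and numbers below are all hypothetical, and an item with no prior-month count is assumed to start from zero):

```python
def collection_total(members, counts):
    """Sum cumulative download counts over the given item identifiers."""
    return sum(counts.get(item_id, 0) for item_id in members)


def naive_delta(members_prior, counts_prior, members_now, counts_now):
    """Old behaviour: each month is summed over that month's own membership."""
    return collection_total(members_now, counts_now) - collection_total(members_prior, counts_prior)


def fixed_delta(members_now, counts_prior, counts_now):
    """Fix: use the current month's membership list for both months."""
    return collection_total(members_now, counts_now) - collection_total(members_now, counts_prior)


if __name__ == "__main__":
    # Hypothetical scenario: "hot-item" was taken down between the two runs.
    counts_prior = {"hot-item": 5000, "steady-item": 100}
    counts_now = {"steady-item": 130}
    members_prior = {"hot-item", "steady-item"}
    members_now = {"steady-item"}

    print(naive_delta(members_prior, counts_prior, members_now, counts_now))
    print(fixed_delta(members_now, counts_prior, counts_now))
```

In this made-up scenario the naive difference is -4970, while the membership-aligned difference is 30, which is the number of downloads the remaining item actually gained.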

 

 

Music Analysis Beginnings

As mentioned in our recent Building Music Libraries post, we are working with researchers at Columbia University and UPF in Barcelona to run their code on the music collection to help their research and to provide new analyses that could help with exploration and understanding.

We are doing some pilot runs to generate files which some close observers may see in the music item directories on archive.org.  Audio fingerprints from audfprint are .afpt and music attributes from Essentia are in _esslow.json.gz (download sample) and _esshigh.json.gz.
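For anyone who wants to peek inside one of these files, here is a minimal Python sketch for reading a gzip-compressed JSON file such as `_esslow.json.gz`. It uses only the standard library and makes no assumption about which keys Essentia writes, other than that the top level is a JSON object; the function names are hypothetical.

```python
import gzip
import json


def load_gzipped_json(path):
    """Read a gzip-compressed JSON file and return the parsed object."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def top_level_keys(path):
    """List the top-level keys, as a first look at what the file contains."""
    data = load_gzipped_json(path)
    return sorted(data.keys())
```

Calling `top_level_keys()` on a downloaded file is a quick way to see how the analysis output is organized before digging into individual attributes.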

Spectrogram of a Grateful Dead track


We are also creating image files showing the audio spectrum used.  We hope this is useful for those that want to see if files have been compressed in the past (even if they are posted as flac files now).  There is also a .png for each audio file of a basic waveform that is being used in the archive’s beta site as eye candy.

More as it happens, but we wanted you to know there is some progress and that you will see some new files. If you have proposed other analyses that would benefit from being run over a large corpus, please let us know by contacting info at archive dot org.

Thank you to the researchers and the Archive programmers who are working together to make this happen.

 

Using Docker to Encapsulate Complicated Program is Successful

The Internet Archive has been using docker in a useful way that is a bit out of the mainstream: to package a command-line binary and its dependencies so we can deploy it on a cluster and use it in the same way we would a static binary.

Columbia University’s Daniel Ellis created an audio fingerprinting program that was used in a competition. It was not packaged as a Debian package or in any other distributable form. It took a while for our staff to work out how to install it and its many dependencies consistently on Ubuntu, but it seemed pretty heavy-handed to install all of that on our worker cluster. So we explored using docker, and it has been successful. While this is old hat for some, I thought it might be interesting to explain what we did.

1) Created a docker file to make a docker container that held all of the code needed to run the system.

2) Worked with our systems group to figure out how to install docker on our cluster with a security profile we felt comfortable with.   This included running the binary in the container as user nobody.

3) Ramped up slowly to test the downloading and running of this container.   In general it would take 10-25 minutes to download the container the first time. Once cached on a worker node, it was very fast to start up.    This cache is persistent between many jobs, so this is efficient.

4) Used the container as we would a shell command, passing files into the container by mounting a sub filesystem for it to read from and write to. This also helped with signaling errors.

5) Starting production use now.
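The “container as a shell command” pattern from steps 2 and 4 might look roughly like the Python sketch below. It is not the Archive’s actual wrapper: the image name, paths, and tool arguments are placeholders, and only standard `docker run` options are used (`--rm`, `--user`, `-v`, `-w`).

```python
import subprocess


def build_docker_command(image, host_dir, tool_args, user="nobody"):
    """Build the argv for running a containerized tool like a local binary."""
    return [
        "docker", "run",
        "--rm",                      # discard the container when the tool exits
        "--user", user,              # run the binary as an unprivileged user
        "-v", f"{host_dir}:/data",   # mount a working directory for input/output
        "-w", "/data",               # start the tool inside the mounted directory
        image,
        *tool_args,
    ]


def run_in_container(image, host_dir, tool_args):
    """Run the tool; docker run exits with the tool's own exit status."""
    command = build_docker_command(image, host_dir, tool_args)
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Placeholder image name and arguments, for illustration only.
    print(build_docker_command("example/fingerprinter", "/tmp/job-123", ["input.flac"]))
```

Because `docker run` passes through the exit status of the command inside the container, the calling job can check `returncode` just as it would for a static binary, and any output files appear in the mounted directory.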

We hope that docker can help us with other programs that require complicated or legacy environments to run.

Congratulations to Raj Kumar, Aaron Ximm, and Andy Bezella for the creative solution to a problem that could have made it difficult for us to use some complicated academic code in our production environment.

Go docker!