Machine Learning + Music Lovers: the Internet Archive is seeking technical volunteers, interns, or low-cost contractors with a passion for music to build an open-source software library that can identify which songs are on an LP (given a waveform or audio track of each side). We have a training set of ~5k manually labeled LPs and thousands more that need your help. The main pieces are:
- detecting the start and stop of songs
- getting track titles from OCR’ed covers or labels
- building an engaging UI for QA of uncertain automated output
The Internet Archive is interested in digitizing “Lost Vinyl”: those recordings that did not make it to CD or Spotify. We have been getting donations of physical LPs (but we can always use more, please think of us…), and at the end of the year we would like to start digitizing them. We are not sure how available we can make the resulting audio files, but let’s make sure these fabulous recordings are at least preserved.
We are looking for help in separating the tracks on an LP. Sounds easy, but we have not been able to do it automatically yet.
For instance, this is an awesome Bruce record:
We want to detect timings and track titles:
<info>
  <title>Dancing in the Dark</title>
  <artist>Bruce Springsteen</artist>
  <trackinfo unit="mm:ss">
    <track target="01_Dancing_in_the_Dark__Blaster_mix" title="Dancing in the Dark (Blaster mix)" start="0:09" duration="6:11" end="6:20"/>
    <track target="02_Dancing_in_the_Dark__Radio" title="Dancing in the Dark (Radio)" start="6:42" duration="4:43" end="11:25"/>
    <track target="03_Dancing_in_the_Dark__Dub" title="Dancing in the Dark (Dub)" start="11:25" duration="5:33" end="16:58"/>
  </trackinfo>
</info>
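A record like this is easy to read back with Python's standard library. The following is only a minimal sketch, not the Archive's own tooling: the helper names `to_seconds` and `read_tracks` are made up for illustration, and it assumes every `<track>` carries `start`, `duration`, and `end` attributes in `m:ss` form. It also checks that start plus duration equals end, which is a cheap sanity test for hand-entered data.

```python
import xml.etree.ElementTree as ET


def to_seconds(stamp):
    """Convert "m:ss" (or "h:mm:ss") to whole seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def read_tracks(xml_text):
    """Return a list of (title, start_s, end_s) tuples from an <info> record."""
    root = ET.fromstring(xml_text)
    tracks = []
    for track in root.iter("track"):
        start = to_seconds(track.get("start"))
        end = to_seconds(track.get("end"))
        duration = to_seconds(track.get("duration"))
        if start + duration != end:
            raise ValueError(f"inconsistent timing for {track.get('title')!r}")
        tracks.append((track.get("title"), start, end))
    return tracks


# First two tracks of the record shown above.
RECORD = """
<info>
  <title>Dancing in the Dark</title>
  <artist>Bruce Springsteen</artist>
  <trackinfo unit="mm:ss">
    <track target="01_Dancing_in_the_Dark__Blaster_mix" title="Dancing in the Dark (Blaster mix)" start="0:09" duration="6:11" end="6:20"/>
    <track target="02_Dancing_in_the_Dark__Radio" title="Dancing in the Dark (Radio)" start="6:42" duration="4:43" end="11:25"/>
  </trackinfo>
</info>
"""

print(read_tracks(RECORD)[0])  # ('Dancing in the Dark (Blaster mix)', 9, 380)
```

Working in seconds rather than `mm:ss` strings makes it simpler to compare hand-labeled timings with whatever an automated splitter produces.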
We have 5,000 of these, done by hand, that can serve as a training set, and we want to do the next many thousand using a computer plus human QA. Sometimes we know how many tracks there are on a side, which can help, but ideally we would not have to know.
We have derivative waveforms and fingerprints already computed, and full audio if needed.
What we would like is a piece of code, ideally Python and open source, that would take an MP3, FLAC, or PNG and create a set of timings for the tracks on it. If the code needed the number of tracks, we could supply that as well.
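To make the request concrete, here is a rough baseline of the kind of code we mean: split a side wherever the level stays low for long enough. It is a sketch under several assumptions, not a solution. It assumes the audio has already been decoded to a list of float samples in the range -1 to 1 (decoding MP3 or FLAC is left to another tool). The function names and the defaults for `threshold`, `min_gap_s`, and `frame_s` are guesses that would need tuning against the hand-labeled set. Real vinyl has surface noise, so the gaps are never true digital silence, and soft passages can look like gaps.

```python
import math


def frame_rms(samples, frame_len):
    """RMS level of each non-overlapping frame of `frame_len` samples."""
    levels = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        levels.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return levels


def find_gaps(levels, frame_s, threshold=0.02, min_gap_s=1.5):
    """Return (start_s, end_s) for each quiet run lasting at least `min_gap_s`."""
    gaps, run_start = [], None
    # The infinite sentinel closes a quiet run that reaches the end of the side.
    for i, level in enumerate(levels + [float("inf")]):
        if level < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if (i - run_start) * frame_s >= min_gap_s:
                gaps.append((run_start * frame_s, i * frame_s))
            run_start = None
    return gaps


def split_tracks(samples, rate, n_tracks=None, frame_s=0.1, **gap_opts):
    """Estimate (start_s, end_s) for each track on one LP side."""
    levels = frame_rms(samples, int(rate * frame_s))
    total_s = len(levels) * frame_s
    gaps = find_gaps(levels, frame_s, **gap_opts)

    # Lead-in and run-out silence trim the side; they are not track breaks.
    side_start, side_end = 0.0, total_s
    if gaps and gaps[0][0] == 0.0:
        side_start = gaps.pop(0)[1]
    if gaps and gaps[-1][1] >= total_s:
        side_end = gaps.pop()[0]

    if n_tracks is not None:
        # If the track count is known, keep only the longest gaps, in time order.
        longest = sorted(gaps, key=lambda g: g[1] - g[0], reverse=True)
        gaps = sorted(longest[:max(n_tracks - 1, 0)])

    starts = [side_start] + [g[1] for g in gaps]
    ends = [g[0] for g in gaps] + [side_end]
    return list(zip(starts, ends))
```

Calling `split_tracks(samples, rate)` returns the timings in seconds; passing `n_tracks=3` keeps only the two longest gaps, which is one simple way to use a known track count. A learned model trained on the 5,000 labeled LPs would presumably replace the fixed threshold.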
Then we would like to take label images such as:
From these we would create the track titles for the metadata above. (We OCR the labels, but the result will be a bit lossy.)
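One way to cope with lossy OCR is to match each OCR'ed line against a list of candidate titles and keep only close matches. The sketch below uses `difflib` from the Python standard library. It assumes a candidate list exists from some other source (for example a discography database), which may not be true for many of these records. The function names and the `cutoff` value are illustrative choices.

```python
import difflib
import re


def normalize(text):
    """Lowercase and collapse everything except letters and digits to spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_titles(ocr_lines, candidate_titles, cutoff=0.6):
    """Map each OCR line to its closest candidate title, or None if nothing is close."""
    by_norm = {normalize(title): title for title in candidate_titles}
    matches = {}
    for line in ocr_lines:
        close = difflib.get_close_matches(
            normalize(line), list(by_norm), n=1, cutoff=cutoff
        )
        matches[line] = by_norm[close[0]] if close else None
    return matches


candidates = [
    "Dancing in the Dark (Blaster mix)",
    "Dancing in the Dark (Radio)",
    "Dancing in the Dark (Dub)",
]
# Made-up OCR output: one misread title and one line of label boilerplate.
ocr = ["Danc1ng in the Dark (B1aster mix)", "STEREO 33 1/3 RPM"]
print(match_titles(ocr, candidates))
```

Lines that match nothing (catalog numbers, speed markings, copyright text) come back as `None`, which makes them natural items to route to human QA.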
In other words, we would like to take photographs and digitizations of the two sides of the album, and then get the titles with start and stop times.
We have done this for 5,000 LPs already, and we would like help in automating this process so we can do it for all LPs that did not make it to CD.
Up for helping? We can give access to the existing 5,000. What we would love is robust code that we could run on newly digitized LPs so that we can at least preserve, and maybe even provide access to, the Lost Vinyl of the 20th century.
This is not as easy as it looks, but please do not be discouraged; we could use the help.
Existing open source projects could get us a long way there:
- https://github.com/yu-tseng-chou/CS696-AMN-Project
- https://github.com/bonnici/scrobble-along-scrobbler
- https://github.com/NavJ/slicer/blob/master/slicer.py
- https://github.com/tyiannak/pyAudioAnalysis
If you are interested, please write to email@example.com.
Code must exist somewhere, since many commercial products do this (Roxio Easy CD & DVD Burning, for example, which is the one I use). I can tell you from digitizing LPs that the problem is _usually_ too many tracks rather than too few: the detection is very sensitive to volume and to music with very soft passages. I’m not a “machine learning” programmer (most of my programming is on IBM mainframes), but I would suggest a method: first, find the unique LP identifier (usually it’s a company followed by a code, such as “Command RS-946 SD”); then perhaps there are online databases which can give the software an idea of the timings. Use that to guide the algorithm. Just an idea.
To add to my other post: here’s a link to a page for developers at Discogs. Maybe those fine folk would work with you?
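The idea above of letting known timings guide the splitter could be sketched roughly as follows. This assumes two inputs that are hypothetical here: a list of candidate gaps from some detector, and a list of expected track durations in seconds looked up by catalog number. The function name and all numbers in the example are made up. For each expected track break it keeps the detected gap nearest the predicted time and ignores the rest, which targets the "too many tracks" problem.

```python
def pick_boundaries(candidate_gaps, expected_durations, side_start=0.0):
    """Choose one detected gap per expected track break, nearest the predicted time.

    candidate_gaps: (start_s, end_s) pairs from a detector, possibly too many.
    expected_durations: per-track lengths in seconds from an external source.
    """
    chosen = []
    cursor = side_start
    remaining = sorted(candidate_gaps)
    # The last track ends at the run-out, so it needs no gap after it.
    for duration in expected_durations[:-1]:
        if not remaining:
            break
        predicted = cursor + duration
        best = min(remaining, key=lambda gap: abs(gap[0] - predicted))
        chosen.append(best)
        # Only gaps after the chosen one can separate later tracks.
        remaining = [gap for gap in remaining if gap[0] > best[0]]
        cursor = best[1]  # the next track starts where the chosen gap ends
    return chosen


# Illustrative values: four detected gaps, two of them false (soft passages).
gaps = [(150.0, 152.0), (380.0, 402.0), (500.0, 501.5), (685.0, 686.0)]
print(pick_boundaries(gaps, [371, 283, 333], side_start=9.0))
# [(380.0, 402.0), (685.0, 686.0)]
```

Published durations are often approximate, so in practice a tolerance and a fallback to human QA would be needed when no gap lies near a predicted break.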