Correct Metadata is Hard: a Lesson from the Great 78 Project

We have been digitizing about 8,000 78rpm record sides each month and now have 122,000 of them done. These have been posted on the net and over a million people have explored them. We have been digitizing, typing the information on the label, and linking to other information like discographies, databases, reviews and the like.

Volunteers, users, and internal QA checkers have been pointing out typos, so we decided to go back over a couple of months’ metadata, and we found problems. We then contracted with professional proofreaders, and they found even more: about 2% of the records at this point had something to point out. Some of these are matters of opinion or aesthetics; others lead to corrections.

We are going to pay the professional proofreaders to correct the 5 most important fields for all 122,000 records, but we can use more help. We are pointing these issues out here in the hope of interesting volunteer proofreaders and to share our experience in continually improving our collections.

Here are some of the issues with the primary performer field that we have now corrected from the June 2019 transfers, shown as before | after. We hope to upload the corrections in the next couple of weeks:

Jose Melis And His Latin American Ensemble | Jose Melis And His-Latin American Ensemble
Columbia-Orchestra | Columbia-Orchester
S. Formichi and T. Chelotti | S. Formichi e T. Chelotti
Dennis Daye and The Rhythmaires | Dennis Day and The Rhythmaires
Harry James and His Orchestra | Harry James and His Orch.
Charles Hart & Elliot Shaw | Charles Hart & Elliott Shaw
Peerless Quartet | Peerless Quartette

Some of the title corrections:

O Vino Fa ‘Papla (Wine Makes You Talk) | ‘O Vino Fa ‘Papla (Wine Makes You Talk)
Masked Ball Salaction | Masked Ball Selection
Moonlight and Roses (Brings Mem’ries Of You) | Moonlight and Roses (Bring Mem’ries Of You)
Que Bonita Eres Tu (You Are Beutiful) | Que Bonita Eres Tu (You Are Beautiful)
Buttered Roll | “Buttered Roll”
Paradise | “Paradise”
Got a Right to Cry | “Got a Right to Cry”
Blue Moods | “Blue Moods”
Auf Wiederseh’n Sweerheart | Auf Wiederseh’n Sweetheart
George M. Cohan Medley – Part 1 | George M. Cohan Medley – Part 2
Dewildered | Bewildered
Lolita (Seranata) | Lolita (Serenata)
Got a Right to Cry | “Got a Right to Cry” Joe Liggins and His Honeydrippers
Blue Moods | “Blue Moods”
Body and Soul | “Body and Soul”
Mais Qui Est-Ce | Mais Qui Est-Ce?
Wail Till the Sun Shines Nellie Blues | Wait Till the Sun Shines Nellie Blues
Que Te Pasa Joe (What Happens Joe) | Que Te Pasa Jose (What Happens Joe)
SAMSON AND DELILAH Softly Awakens My Heart | SAMSON AND DELILAH Softly Awakes My Heart
I’m Gonna COO, COO, COO | (I’m Gonna) COO, COO, COO
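
For readers who want to review pairs like these in bulk, here is a minimal Python sketch that shows exactly which characters differ between the two sides of each line. It is an illustration only, not the project's actual tooling: the function names parse_pair and changed_spans are hypothetical, and the sketch assumes each line uses " | " as the separator, as in the lists above.

```python
import difflib


def parse_pair(line: str) -> tuple[str, str]:
    """Split a 'before | after' line into its two stripped halves."""
    before, after = line.split(" | ", 1)
    return before.strip(), after.strip()


def changed_spans(before: str, after: str) -> list[tuple[str, str, str]]:
    """Return (operation, old_text, new_text) for each span that differs."""
    # autojunk=False keeps difflib from ignoring frequent characters.
    matcher = difflib.SequenceMatcher(None, before, after, autojunk=False)
    return [
        (tag, before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# A few pairs copied from the title corrections above.
pairs = [
    "Masked Ball Salaction | Masked Ball Selection",
    "Dewildered | Bewildered",
    "Mais Qui Est-Ce | Mais Qui Est-Ce?",
]

for line in pairs:
    before, after = parse_pair(line)
    print(after, changed_spans(before, after))
```

A one-character replacement (Dewildered) and an added question mark (Mais Qui Est-Ce?) come out as different operations, which might help a proofreader separate plain typos from punctuation or style decisions.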

10 thoughts on “Correct Metadata is Hard: a Lesson from the Great 78 Project”

  1. Nemo

    Thank you! Cataloguing is expensive: libraries would often spend some 5 $ per book, I can’t imagine for such old discs. It’s nice to see the results of computational methods together with (sampled?) QA.

    “S. Formichi e T. Chelotti” is an interesting example: this duo is going to be a bit easier to find for Italians now that the Italian conjunction is used. Multi-lingual metadata is even harder.

  2. Henriette Avram

    This is a good example of why we transcribe information as seen on the label, warts and all, and in addition supply users with terms from controlled vocabularies and authorities. 78 labels often contain typos, sometimes outright errors, plus abbreviations, parallel languages, etc. The transcription is meant to serve as a surrogate for the actual object, the metadata gives us an idea of what that real-world object looks like. It’s impossible to say which version is correct without having an original to compare it to. There is no way to know if “Harry James and His Orchestra” or “Harry James and His Orch.” is correct without a reference point: the physical disc or a photograph of it. Harry James Orchestra was quite prolific, and the space on the label to record information was very small, and the group’s name appeared in a variety of ways. To accommodate the reality of the messiness of the world, catalogers also must include in the metadata an unambiguous reference to this performing ensemble, so all recordings can be grouped together in a cataloger, without having to search each of the name variants, abbreviations, etc:
    There are many other issues present in your examples that illustrate the importance of using principles of information organization.

    1. Henriette Avram

      Sorry for my typo: “together in a cataloger” should of course be “together in a catalog.” Humans are extremely good at making mistakes–another reason to include unambiguous, machine-readable references. 😉

  3. Lost in the 21st century


    I wanted to thank Brewster Kahle for the reply he gave to a question I had posed on an earlier blog entry (“Most 20th Century Books Unavailable to Internet Users”):

    Yes, you can borrow the book and read it in the web browser, so Adobe software is not needed.

    Many thanks for taking the time and trouble to reply, sir; somehow, viewing the books in-browser would never have occurred to me. Thanks to what you wrote, I have already borrowed some half-dozen books from the Internet Archive so far and plan to borrow many more in the future.

  4. Stefano

    I wish we had a “Wikipedia”-like approach, where a user can edit the metadata and submit it for a revision.

    1. Brewster Kahle Post author

      Yes, that would be better. What we do now is that users put information in the reviews, and an admin changes the metadata based on this.

      For the professional proofreaders we made a spreadsheet so they can do it more efficiently, then we import the changes. If anyone would like to play with this, or help proofread this way, we would be happy to share.

