In the spirit of continuing to celebrate female authors past the confines of Women’s History Month, we’ve gathered some of these books into a special collection called Great Books by Women Authors to make it easier to find your next exceptional read. You will also find these books via Open Library as listed below. Happy reading!
Radio remains one of the most-consumed forms of traditional media today, with 89% of Americans listening to radio at least once a week as of 2018, a number that is actually increasing during the pandemic. News is the most popular radio format and 60% of Americans trust radio news to “deliver timely information about the current COVID-19 outbreak.”
Local talk radio is home to a diverse assortment of personality-driven programming that offers unique insights into the concerns and interests of citizens across the nation. Yet radio has remained stubbornly inaccessible to scholars due to the technical challenges of monitoring and transcribing broadcast speech at scale.
Debuting this past July, the Internet Archive’s Radio Archive uses automatic speech recognition technology to transcribe this vast collection of daily news and talk radio programming, dating back to 2016, into searchable text. It continues to archive and transcribe a selection of stations through the present, making them browsable and keyword searchable.
Ngrams data set
Building on this incredible archive, the GDELT Project and I have transformed it into a research dataset of radio news ngrams spanning 26 billion English-language words across portions of 550 stations, from 2016 to the present.
You can keyword search all 3 million shows, but for researchers interested in diving into the deeper linguistic patterns of radio news, the new ngrams dataset includes 1-5grams at 10-minute resolution, covering all four years and updated every 30 minutes. For those less familiar with the concept, “ngrams” are word frequency tables: the transcript of each broadcast is broken into words, and for each 10-minute block of airtime on each station a list is compiled of all the words spoken in those 10 minutes and how many times each was mentioned.
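To make the idea concrete, here is a minimal sketch of how a frequency table like this could be built. It is not the actual GDELT pipeline: the input format (a list of `(seconds_offset, word)` pairs for one station) and the function names are assumptions made for illustration, and this sketch does not let an ngram span two blocks, which the real dataset may handle differently.

```python
from collections import Counter


def ngrams(words, n):
    """Return every run of n consecutive words as a space-joined string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams_by_block(timed_words, n=1, block_seconds=600):
    """Group (seconds_offset, word) pairs into fixed blocks and count ngrams.

    timed_words is a hypothetical input format: one station's transcript as
    (offset in seconds from the start of the day, word) pairs, in airtime order.
    Returns {block_start_seconds: Counter({ngram: count})}.
    """
    blocks = {}
    for offset, word in timed_words:
        # 600 seconds = one 10-minute block of airtime
        block_start = int(offset // block_seconds) * block_seconds
        blocks.setdefault(block_start, []).append(word.lower())
    return {start: Counter(ngrams(words, n)) for start, words in blocks.items()}


transcript = [(0, "We"), (5, "are"), (10, "we"), (605, "the"), (610, "news")]
unigrams = count_ngrams_by_block(transcript, n=1)
bigrams = count_ngrams_by_block(transcript, n=2)
```

With `n=1` you get single-word counts per block; raising `n` up to 5 gives the longer phrases.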
Some initial research using these ngrams
How can researchers use this kind of data to gain new insights into radio news?
The graph below looks at pronoun usage on BBC Radio 4 FM, comparing the percentage of words spoken each day that were either (“we”, “us”, “our”, “ours”, “ourselves”) or (“i”, “me”, “i’m”). “Me” words are used more than twice as often as “we” words, but look closely at February 2020: as the pandemic began sweeping the world, “we” words started increasing as governments began adopting language to emphasize togetherness.
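A percentage like this can be computed from one day's word counts in a few lines. The sketch below assumes the counts have already been loaded into a Python `Counter` (the loading step and the function name `pronoun_share` are my own illustration, not part of the dataset), and it accepts both straight and curly apostrophes because I am not certain which one the transcripts use.

```python
from collections import Counter

WE_WORDS = {"we", "us", "our", "ours", "ourselves"}
ME_WORDS = {"i", "me", "i'm", "i’m"}  # both apostrophe styles, as a precaution


def pronoun_share(word_counts, pronouns):
    """Return the percentage of all spoken words that fall in the pronoun set."""
    total = sum(word_counts.values())
    if total == 0:
        return 0.0
    hits = sum(count for word, count in word_counts.items()
               if word.lower() in pronouns)
    return 100.0 * hits / total


# Made-up counts for a single day, purely for illustration.
day_counts = Counter({"we": 2, "i": 5, "the": 13})
we_share = pronoun_share(day_counts, WE_WORDS)
me_share = pronoun_share(day_counts, ME_WORDS)
```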
TV vs. Radio
Combined with the television news ngrams that I previously created, it is possible to compare how topics are being covered across television and radio.
The graph below compares the percentage of spoken words that mentioned Covid-19 since the start of this year across BBC News London (television) versus radio programming on BBC World Service (international focus) and BBC Radio 4 FM (domestic focus).
All three show double surges at the start of the year as the pandemic swept across the world, a peak in early April and then a decrease since. Yet BBC Radio 4 appears to have mentioned the pandemic far less than the internationally-focused BBC World Service, though the two are now roughly equal even as the pandemic has continued to spread. Overall, television news has emphasized Covid-19 more than radio.
For now, you can download the entire dataset to explore on your own computer, but there will also be an interactive visualization and analysis interface available sometime in mid-spring.
It is important to remember that these transcripts are generated through computer speech recognition, so they are imperfect and do not properly recognize all words or names, especially rare or novel terms like “Covid-19.” Experimentation may be required to yield the best results.
Researchers can ask questions that for the first time simultaneously look across audio, video, imagery and text to understand how ideas, narratives, beliefs and emotions diffuse across mediums and through the global news ecosystem. Helping to seed the future of such at-scale research, the Internet Archive and GDELT are collaborating with a growing number of media archives and researchers through the newly formed Media Data Research Consortium to better understand how critical public health messaging is meeting the challenges of our current global pandemic.
About Kalev Leetaru
For more than 25 years, GDELT’s creator, Dr. Kalev H. Leetaru, has been studying the web and building systems to interact with and understand the way it is reshaping our global society. One of Foreign Policy Magazine’s Top 100 Global Thinkers of 2013, his work has been featured in the presses of over 100 nations and fundamentally changed how we think about information at scale and how the “big data” revolution is changing our ability to understand our global collective consciousness.
On January 1st, 2021, many books, movies and other media from 1925 will enter the public domain in the United States. Some of them are quite famous — jump ahead to see lists of those well known books and movies that you can enjoy on the Internet Archive — or take the scenic route with me.
What does this all mean? Essentially, many items created in 1925 in the US that are still under copyright will become free and open for people to use in any way they see fit in the new year. But check out Duke Law’s Center for the Study of the Public Domain article for a more in-depth explanation.
As part of this yearly ritual, I explore our collections to unearth these newly freed items, and I invariably run across a few things that hit a nerve. This year, it started with this intertitle in “Isn’t Life Terrible?” Less than 20 seconds into this 1925 film, and suddenly I’m dumped back into 2020.
Rude, right? I don’t even have a front yard to enjoy during shelter in place.
Gondolas still glide under the Bridge of Sighs, and the Tower of Pisa is still leaning, but the 1925 version of the Colosseum certainly lacks today’s fake gladiator photo ops.
Looking at the past with the eyes of today
Every toe dipped into the past has the potential to surprise or shock. The story of a pantry shelf, an outline history of grocery specialties, is only mildly interesting on the surface. Essentially, it’s a sales pitch to food manufacturers encouraging them to advertise in a set of women’s magazines. The book contains short case histories of successful food brands like Maxwell House Coffee, Campbell Soup, Coca Cola, etc. (all of whom advertise with them, naturally).
The book gives you a glimpse of why people were so enthusiastic about mass-produced, packaged foods. Unsanitary conditions, bugs in your sugar, milk going bad overnight: these are things modern shoppers never think about.
It puts this glowing praise of Kraft Cheese into perspective: “…a pasteurized product, blended to obtain a uniformity of quality and flavor, a thing greatly lacking in ordinary types of cheese.” (page 149)
That’s pretty entertaining if you’re a cheese lover. I think most people would agree that Kraft cheese is no longer on the cutting edge.
But keep poking around and you find a much deeper cultural divergence. While The story of a pantry shelf is extolling the virtues of the home economics training available at Cornell, you stumble across this horrifying sentence (page 12).
I was not expecting to read about orphaned babies being used as “learning aids” while flipping through stories about Jell-O. Intellectually, I know that attitudes towards children have changed over the years — the Fair Labor Standards Act, which set federal standards for child labor, wasn’t even passed until 1938. But this casual aside tossed in amongst the marketing hype still packs an emotional punch. It’s important to remember how far we have come.
Even writing that was forward-thinking for the time, like the booklet Homo-sexual life, is terribly backward according to today’s standards. It’s from the Little Blue Book series; we have many that were published in 1925, and the publisher was quite prolific for many years. The series provided working-class people with inexpensive access to all kinds of topics including philosophy, sexuality, science, religion, law, and government. Post WWII, they published criticism of J. Edgar Hoover and the founder was subsequently targeted by the FBI for tax evasion. But in 1925, they were going strong and one of their prolific writers was Clarence Darrow.
Controversies of the Age
Darrow was writing about prohibition for the Little Blue Book series in 1925, but that is also the year he defended John T. Scopes for teaching evolution in his Tennessee classroom. The Scopes Trial generated a huge amount of publicity, pitting religion against science, and even giving rise to popular songs like these two 78rpm recordings from 1925.
Like the Scopes trial, prohibition had its passionate adherents and detractors. This was the “Roaring 20s” — the year The Great Gatsby was published — with speakeasies and flappers and iconic cocktails. And yet the pro-prohibition silent film Episodes in the Life of a Gin Bottle follows a bottle around as it lures people into a state of dissolution.
And the most unchanging part of this particular season, of course — children still anticipate the arrival of Santa Claus with questions, wishes and schemes.
The silent film Santa Claus features two children who want to know where Saint Nick lives and how he spends his time. We follow him to the North Pole (Alaska in disguise) to see Santa’s workshop, snow castle, reindeer, and friends and neighbors. Jack Frost, introduced around 14:20, appears to be wearing the prototype for Ralphie’s bunny suit in “A Christmas Story” (but with a magic wand). Stick around for the sleigh crash at 20:45, and right around 22:20 Santa wipes out on the ice.
And just in case you’re still doing your holiday shopping, I feel like I should pass on a recommendation from this ad in a 1925 The Billboard magazine: Armadillo Baskets make beautiful Christmas gifts. And you can still buy vintage versions online – trust me, I looked. You’re welcome.
Juneteenth celebrates when enslaved people actually became free in 1865. The date, June 19th, commemorates General Gordon Granger of the Union Army announcing the executive order in Galveston, Texas, freeing all enslaved people in Texas.
Community access TV stations around the country have shown local celebrations of Juneteenth for years, and we thought this 2013 talk by Dr. Shennette Garrett-Scott at the Allen Public Library in Texas (via Allen City TV) was particularly helpful in understanding the history of this important day.
You could listen to multiple people recite the first 50 digits of pi in various styles, including to the tune of the Battle Hymn of the Republic (my personal favorite), in the voice of Bullwinkle, as an infomercial, in Latin, while laughing, in Morse Code, and while eating actual pie.
When you visit a public library, you get to meet the librarians and others who build and care for those collections. You know there are people who empty the garbage cans, who put back the borrowed books, who maintain the computers, and who determine what ends up on the shelf.
A digital library, on the other hand, is “just” a web site. You don’t really see the people who build it – we are often anonymous. But the Internet Archive wasn’t built by computers and algorithms.
From its inception, the Internet Archive has been built by thousands of people who understand that we have an opportunity to use the Internet to give everyone access to knowledge. Every person on the planet should have the opportunity to learn and to make a contribution.
This goal – Universal Access to All Knowledge – inspires the people who have built the Internet Archive over the past 23 years.
People clean and repair the buildings that we occupy. People do payroll, choose our health plans, answer the phones, plan our events, reply to user emails, clean up spam, and pay our bills. People design and build the computers that hold the collections. People construct the network that carries data to every corner of the world. People write software that processes, backs up, and delivers files. People design and test and build interfaces. People digitize analog media and type in metadata. People curate collections, establish collaborations, and manage projects.
There’s no way I can mention all of these people by name. Even if I listed every employee from the past 23 years, I would still be missing the volunteers, the people from other organizations who worked on joint projects with us, the pro bono lawyers, the delightfully compulsive collectors, the funding organizations, the idea generators, our sounding boards for crazy ideas, the individuals who have donated money or materials, and the hundreds of thousands of people who have uploaded media into the archive.
Libraries are built by people, for people. Thank you so much to all of the people who have contributed to building the Internet Archive, whether they were employees or our huge group of friends and family. We would not be here without you, and we hope you will continue to help bring universal access to all knowledge in the future.
About a year and a half ago, the Internet Archive launched a collection of older books that were determined to qualify for the “Last 20” provision in Copyright Law, also known as Section 108(h) for the lawyers. As I understand this provision, it states that published works in the last twenty years of their copyright term may be digitized and distributed by libraries, archives and museums under certain circumstances. At the time, the small number of books that went into the collection were hand-researched by a team of legal interns. As you can imagine, this is a process that would be difficult to perform one by one for a large and ever-growing corpus of works.
So we set out to automate it. Amazon has an API with book information, so I figured with a little data massaging it shouldn’t be too hard to build a piece of software to do that job for us. Pull the metadata from our MARC* metadata records, send it to Amazon, and presto!
I was wrong. It was hard.
Library Catalog Names are different from Book Seller’s Names
Library-generated metadata is often very detailed, which leads to problems when we try to match the metadata provided by librarians to the metadata used on consumer-oriented web sites. For example, an author listed in a MARC record might appear as
Purucker, G. de (Gottfried), 1874-1942
But when you look on Amazon, that same author appears as
G. de Purucker
If we search the full author from the MARC on Amazon (including full name and birth and death dates), we may miss potential matches. And this is just one simple example. We have to transform every author field we get from MARC using a set of rules that may continue to expand as we find new problems to solve. Here are the current rules just for transforming this one field:
General rules for transforming MARC author to Amazon author:
Maintain all accented or non-Roman characters as-is
If there are no commas, semicolons or parentheses in the string, use the whole string as-is
If there are no commas in the string, but there are semicolons and/or parentheses, use everything before the first semicolon or parenthesis as the entire author string
If there are commas in the string:
Everything before the first comma should be used as the author’s last name
Everything after the first comma but BEFORE any of these should be used as the author’s first name:
comma [ , ],
semicolon [ ; ],
open parenthesis [ ( ]
any number [0-9]
end of string
Remaining information should be discarded
Period [ . ] and apostrophe [ ‘ ] and other symbols should not be used to delimit any name and should be maintained as-is in the transformed string.
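For the programmers in the audience, the rules above can be sketched in a few lines of Python. This is an illustration of the rules as written, not our production code: the function name is hypothetical, and putting the first name before the last name in the output is an assumption based on the "G. de Purucker" example.

```python
import re


def marc_author_to_search_author(marc_author: str) -> str:
    """Transform a MARC author string into a bookseller-style author string."""
    text = marc_author.strip()

    if "," not in text:
        # No commas: keep everything before the first semicolon or
        # parenthesis. If neither is present, the whole string is kept.
        return re.split(r"[;()]", text, maxsplit=1)[0].strip()

    # Everything before the first comma is the last name.
    last_name, remainder = text.split(",", 1)

    # The first name runs until the next comma, semicolon, open parenthesis,
    # digit, or end of string. Periods, apostrophes and accented characters
    # are not delimiters, so they pass through untouched.
    first_name = re.split(r"[,;(0-9]", remainder, maxsplit=1)[0]

    # Assumed output order: "First Last". Anything left over is discarded.
    return f"{first_name.strip()} {last_name.strip()}".strip()


example = marc_author_to_search_author("Purucker, G. de (Gottfried), 1874-1942")
```

Running it on the example above turns the MARC form into `G. de Purucker`. New rules would most likely end up as extra characters in those two patterns, or as additional branches.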
An Account of the Saga of the Never-ending Title: as told to the author by three blah blah blahs…
Some older books have really long titles. The MARC record contains the entire title, of course! Why wouldn’t it?! But consumer-oriented sites like Amazon often carry these books with shortened or modified titles.
For example, here’s the title of a real page-turner:
American authors, 1600 – 1900 a biographical dictionary of American literature ; compl. in 1 vol. with 1300 biographies and 400 portraits
But on Amazon that title is:
American Authors 1600-1900: A Biographical Dictionary of American Literature (Wilson Authors)
As you can imagine, it’s far more difficult to reliably match books with longer titles. A human can look at those two titles and think “yeah, that’s probably the same book,” but software doesn’t work quite that well.
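To give a flavor of what "probably the same book" looks like in code, here is one possible heuristic. To be clear, it is a stand-in I am using for illustration, not a description of our matching software: it strips parenthetical additions such as series names, ignores case and punctuation, and then compares only as many leading words as the shorter title has, using Python's standard `difflib`. The function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase a title, drop parentheticals and punctuation, return words."""
    title = re.sub(r"\([^)]*\)", " ", title)  # e.g. "(Wilson Authors)"
    title = re.sub(r"[^\w\s]", " ", title.lower())  # punctuation to spaces
    return title.split()


def title_similarity(title_a, title_b):
    """Score 0.0-1.0 comparing the leading words shared by both titles."""
    words_a, words_b = normalize_title(title_a), normalize_title(title_b)
    shortest = min(len(words_a), len(words_b))
    if shortest == 0:
        return 0.0
    # Compare only the prefix, so a long MARC subtitle is not penalized.
    return SequenceMatcher(None, words_a[:shortest], words_b[:shortest]).ratio()


marc_title = ("American authors, 1600 – 1900 a biographical dictionary of "
              "American literature ; compl. in 1 vol. with 1300 biographies "
              "and 400 portraits")
amazon_title = ("American Authors 1600-1900: A Biographical Dictionary of "
                "American Literature (Wilson Authors)")
score = title_similarity(marc_title, amazon_title)
```

The two titles above score a perfect match under this heuristic. The catch is that prefix comparison will also happily match different volumes or editions that share an opening, so a score like this can only be one signal among several.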
Now that the librarians have had a laugh, let’s explain that for everybody else! Think back to the days of yore when you went to the library and looked things up in a physical card catalog. If you wanted to know where a serial or periodical was located within the library collections, you really just needed one card to tell you that. It’s on this shelf in this area and the collection contains these years.
Great! Except when you’re looking at digital versions of these serials, they are distinct entities – they have different dates, different topics, different authors sometimes, etc. And yet they often still have just one MARC record – the digital equivalent of that one card in the catalog.
And that means that the publication dates pulled from the MARC records are sometimes very wrong.
For example, we have several items from the annual series The Book of Knowledge – 1947, 1957, 1958, 1959, 1974… The date provided in the MARC file for all of these is 1940.
As you can imagine, when we are filtering texts by year for various purposes, serials are a consistent issue.
Even when we have a correct date, Amazon does not match very well on volume and other serial or periodical-based information. For example, when we search for a particular month of a magazine, we are likely to match an entirely different month of that same magazine.
Not All Metadata is Good Metadata
Unbelievably, librarians do make mistakes. Sometimes the data we have from MARC records has typos, or a MARC record for a different publication date was attached to the book. For example, we have an author named Fkorence A Huxley, but her name is really Florence. Not according to the MARC record, though! Fat-finger errors don’t just happen on phones. Another example: we scanned a book originally published in 1924, and *republished* in 1971. We have the 1971 version. But the MARC record tells us it’s from 1924.
Essentially, our search is only as good as our metadata. If there are typos, or the wrong MARC record, or wrong data, our search and/or filtering will not be accurate.
Commercial APIs Are Not Built to Solve Library Problems
Amazon’s API is built to sell books to end users. Yes, it helps you find a particular book, but the other data the API contains about availability, formats and pricing is less accurate. Because the Section 108(h) exemption for libraries (read more here) involves knowing whether copies are being sold at reasonable prices, we need to know about these aspects of the book to determine whether they qualify. But Amazon’s API is incomplete in this area. So we found ourselves needing to use the API to find a match for the title and author, and then go to the page and scrape it to actually get accurate availability and pricing information.
This increased the complexity of the programming required to use Amazon as a source for information, and greatly lengthened the process of building tools for this purpose.
We are making a determination about whether a book meets the qualifications for Section 108(h) at a particular point in time. Even with all of the issues discussed here, the accuracy of the data we can now pull about book availability and price is high. But it’s only accurate for the moment that we pull the data, because Amazon’s marketplace is constantly changing. If we don’t find a book on Amazon today, that doesn’t mean it won’t appear on the site tomorrow.
Because of this, when we make an item available to the public via Section 108(h), we write into the item’s metadata the date on which the determination was made.
Who Wants In!?
Since I’ve made this process sound SO appealing, I would imagine that any number of other library institutions are going to line up around the block wanting to try it out for themselves. Or not. But here’s the good news! If we digitize your books, the Internet Archive may be able to do the Section 108(h) determination on your behalf. Please contact us if you would like to participate.
*A MARC record is a MAchine-Readable Cataloging record. Essentially, it is the digital equivalent of the physical card from a card catalog.
Commercial radio broadcasting began in the 1920s, bringing entertainment, news and music into people’s homes. Now, instead of needing to play a 78rpm disc on your phonograph, you could just tune in to listen to popular songs.
But why are we focusing on 1923? Because for the first time in 20 years, new works are entering the public domain in the United States (read more: 1, 2, 3). And those works were all published in, you guessed it, 1923.