Author Archives: Alexis Rossi

Radio Ngrams Dataset Allows New Research into Public Health Messaging

Posted on January 7, 2021 by Alexis Rossi

Guest post by Dr. Kalev Leetaru

Radio remains one of the most-consumed forms of traditional media today, with 89% of Americans listening to radio at least once a week as of 2018, a number that is actually increasing during the pandemic. News is the most popular radio format and 60% of Americans trust radio news to “deliver timely information about the current COVID-19 outbreak.”

Local talk radio is home to a diverse assortment of personality-driven programming that offers unique insights into the concerns and interests of citizens across the nation. Yet radio has remained stubbornly inaccessible to scholars due to the technical challenges of monitoring and transcribing broadcast speech at scale.

Debuting this past July, the Internet Archive’s Radio Archive uses automatic speech recognition technology to transcribe this vast collection of daily news and talk radio programming into searchable text dating back to 2016, and continues to archive and transcribe a selection of stations through present, making them browsable and keyword searchable.

Ngrams data set

Building on this incredible archive, the GDELT Project and I have transformed this massive archive into a research dataset of radio news ngrams spanning 26 billion English language words across portions of 550 stations, from 2016 to the present.

You can keyword search all 3 million shows, but for researchers interested in diving into the deeper linguistic patterns of radio news, the new ngrams dataset includes 1-5grams at 10 minute resolution covering all four years and updated every 30 minutes. For those less familiar with the concept of “ngrams,” they are word frequency tables in which the transcript of each broadcast is broken into words and for each 10 minute block of airtime a list is compiled of all of the words spoken in those 10 minutes for each station and how many times each word was mentioned.

Some initial research using these ngrams

How can researchers use this kind of data to understand new insights into radio news?

The graph below looks at pronoun usage on BBC Radio 4 FM, comparing the percentage of words spoken each day that were either (“we”, “us”, “our”, “ours”, “ourselves”) or (“i”, “me”, “i’m”). “Me” words are used more than twice as often as “we” words but look closely at February of 2020 as the pandemic began sweeping the world and “we” words start increasing as governments began adopting language to emphasize togetherness.

*“We” (orange) vs. “Me” (blue) words on BBC Radio 4 FM, showing increase of “we” words beginning in February 2020 as Covid-19 progresses*

TV vs. Radio

Combined with the television news ngrams that I previously created, it is possible to compare how topics are being covered across television and radio.

The graph below compares the percentage of spoken words that mentioned Covid-19 since the start of this year across BBC News London (television) versus radio programming on BBC World Service (international focus) and BBC Radio 4 FM (domestic focus).

All three show double surges at the start of the year as the pandemic swept across the world, a peak in early April and then a decrease since. Yet BBC Radio 4 appears to have mentioned the pandemic far less than the internationally-focused BBC World Service, though the two are now roughly equal even as the pandemic has continued to spread. Over all, television news has emphasized Covid-19 more than radio.

Covid-19 mentions on Television vs. Radio. The chart compares BBC News London (TV) in blue, versus BBC World Service (Radio) in orange and BBC Radio 4 FM (Radio) in grey.

For now, you can download the entire dataset to explore on your own computer but there will also be an interactive visualization and analysis interface available sometime in mid-Spring.

It is important to remember that these transcripts are generated through computer speech recognition, so are imperfect transcriptions that do not properly recognize all words or names, especially rare or novel terms like “Covid-19,” so experimentation may be required to yield the best results.

The graphs above just barely scratch the surface of the kinds of questions that can now be explored through the new radio news ngrams, especially when coupled with television news and 152-language online news ngrams.

From transcribing 3 million radio broadcasts into ngrams to describing a decade of television news frame by frame, cataloging the objects and activities of half a billion online news images, to inventorying the tens of billions of entities and relationships in half a decade of online journalism, it is becoming increasingly possible to perform multimodal analysis at the scale of entire archives.

Researchers can ask questions that for the first time simultaneously look across audio, video, imagery and text to understand how ideas, narratives, beliefs and emotions diffuse across mediums and through the global news ecosystem. Helping to seed the future of such at-scale research, the Internet Archive and GDELT are collaborating with a growing number of media archives and researchers through the newly formed Media Data Research Consortium to better understand how critical public health messaging is meeting the challenges of our current global pandemic.

About Kalev Leetaru

For more than 25 years, GDELT’s creator, Dr. Kalev H. Leetaru, has been studying the web and building systems to interact with and understand the way it is reshaping our global society. One of Foreign Policy Magazine’s Top 100 Global Thinkers of 2013, his work has been featured in the presses of over 100 nations and fundamentally changed how we think about information at scale and how the “big data” revolution is changing our ability to understand our global collective consciousness.

January 1st brings public domain riches from 1925

Posted on December 15, 2020 by Alexis Rossi

On January 1st, 2021, many books, movies and other media from 1925 will enter the public domain in the United States. Some of them are quite famous — jump ahead to see lists of those well known books and movies that you can enjoy on the Internet Archive — or take the scenic route with me.

Book cover: Mrs. Dalloway by Virginia Woolf

What does this all mean? Essentially, many items created in 1925 in the US that are still under copyright will become free and open for people to use in any way they see fit in the new year. But check out Duke Law’s Center for the Study of the Public Domain article for a more in-depth explanation.

We have a party every year to celebrate the new works entering the public domain, and this year is no exception. Join us on Thursday, Dec. 17th to toast these newly available additions.

Traveling from Home

As part of this yearly ritual, I explore our collections to unearth these newly freed items, and I invariably run across a few things that hit a nerve. This year, it started with this intertitle in “Isn’t Life Terrible?” Less than 20 seconds into this 1925 film, and suddenly I’m dumped back into 2020.

Silent film intertitle that reads, "Charley Chase as The poor young man with only two places to go -- Front yard and back yard"

Rude, right? I don’t even have a front yard to enjoy during shelter in place.

But the magic of media is that it can transport us to different places and times. Photo books like Picturesque Italy, Picturesque Mexico, and Picturesque Palestine, Arabia and Syria show us both how much and how little has changed in the past 95 years.

Screen shot thumbnail images from the book Picturesque Italy. The 12+ photos feature tourist sites in Venice, Italy like the Doges Palace, the Bridge of Sighs, and Piazza San Marco.

Gondolas still glide under the Bridge of Sighs, and the Tower of Pisa is still leaning, but the 1925 version of the Colosseum certainly lacks today’s fake gladiator photo ops.

Looking at the past with the eyes of today

Every toe dipped into the past has the potential to surprise or shock. The story of a pantry shelf, an outline history of grocery specialties is only mildly interesting on the surface. Essentially, it’s a sales pitch to food manufacturers encouraging them to advertise in a set of women’s magazines. The book contains short case histories of successful food brands like Maxwell House Coffee, Campbell Soup, Coca Cola, etc. (all of whom advertise with them, naturally).

The book gives you a glimpse of why people were so enthusiastic about mass produced, packaged foods. Unsanitary conditions, bugs in your sugar, milk going bad over night; things modern shoppers never think about.

It puts this glowing praise of Kraft Cheese into perspective: “…a pasteurized product, blended to obtain a uniformity of quality and flavor, a thing greatly lacking in ordinary types of cheese.” (page 149)

That’s pretty entertaining if you’re a cheese lover. I think most people would agree that Kraft cheese is no longer on the cutting edge.

But keep poking around and you find a much deeper cultural divergence. While The story of a pantry shelf is extolling the virtues of the home economics training available at Cornell, you stumble across this horrifying sentence (page 12).

Passage from "The Story of a Pantry Shelf" which reads, "Indeed, the Practice House, where students learn housekeeping in its every phase, even includes the complete care of a baby, adopted each year by Cornell for the benefit of these 'mothers' who, under the direction of trained Home Economics women, feed, bathe, dress and tend an infant from the tender age of two weeks throughout the session."

I was not expecting to read about orphaned babies being used as “learning aids” while flipping through stories about Jell-O. Intellectually, I know that attitudes towards children have changed over the years — the Fair Labor Standards Act, which set federal standards for child labor, wasn’t even passed until 1938. But this casual aside tossed in amongst the marketing hype still packs an emotional punch. It’s important to remember how far we have come.

Even writing that was forward-thinking for the time, like the booklet Homo-sexual life, is terribly backward according to today’s standards. It’s from the Little Blue Book series — we have many that were published in 1925, and the publisher was quite prolific for many years. The series provided working class people with inexpensive access to all kinds of topics including philosophy, sexuality, science, religion, law, and government. Post WWII, they published criticism of J. Edgar Hoover and the founder was subsequently targeted by the FBI for tax evasion. But in 1925, they were going strong and one of their prolific writers was Clarence Darrow.

Controversies of the Age

Darrow was writing about prohibition for the Little Blue Book series in 1925, but that is also the year he defended John T. Scopes for teaching evolution in his Tennessee classroom. The Scopes Trial generated a huge amount of publicity, pitting religion against science, and even giving rise to popular songs like these two 78rpm recordings from 1925.

The John T. Scopes Trial (The Old Religion’s Better After All) by Vernon Dalhart and Company

Monkey Biz-ness (Down in Tennessee) by International Novelty Orchestra with Billy Murray

Like the Scopes trial, prohibition had its passionate adherents and detractors. This was the “Roaring 20s” — the year The Great Gatsby was published — with speakeasies and flappers and iconic cocktails. And yet the pro-prohibition silent film Episodes in the Life of a Gin Bottle follows a bottle around as it lures people into a state of dissolution.

We even see an entire book about throwing parties that includes no alcoholic beverages at all.

The more things change, the more they stay the same

But as much as some things have changed, other aspects of our lives remain unchanged. People still want to tell you about their pets, rely on self help books, read stories to their kids, follow celebrities, tell each other jokes, and make silly videos.

And the most unchanging part of this particular season, of course — children still anticipate the arrival of Santa Claus with questions, wishes and schemes.

The silent film Santa Claus features two children who want to know where Saint Nick lives and how he spends his time. We follow him to the North Pole (Alaska in disguise) to see Santa’s workshop, snow castle, reindeer, and friends and neighbors. Jack Frost, introduced around 14:20, appears to be wearing the prototype for Ralphie’s bunny suit in “A Christmas Story” (but with a magic wand). Stick around for the sleigh crash at 20:45, and right around 22:20 Santa wipes out on the ice.

And just in case you’re still doing your holiday shopping, I feel like I should pass on a recommendation from this ad in a 1925 The Billboard magazine: Armadillo Baskets make beautiful Christmas gifts. And you can still buy vintage versions online – trust me, I looked. You’re welcome.

Advertisement with a picture of an armadillo and a basket made from an armadillo. Text reads, "Armadillo Baskets Make Beautiful Christmas Gifts. From these nine-banded horn-shelled little animals we make beautiful baskets. We are the original dealers in Armadillo Baskets. We take their shells, polish them, and then line with silk. They make ideal work baskets, etc. Let us tell you about these unique baskets. Write for Free Booklet. Apelt Armadillo Co., Comfort, Texas."

The Famous Stuff

And now on to the blockbusters of 1925…

Books First Published in 1925

The Great Gatsby by F. Scott Fitzgerald
Mrs. Dalloway by Virginia Woolf
Der Prozess by Franz Kafka (The trial in English)
The professor’s house by Willa Cather
No more parades by Ford Madox Ford
The Thundering Herd by Zane Grey
Arrowsmith by Sinclair Lewis
The mother’s recompense by Edith Wharton
The writing of fiction by Edith Wharton
Christina Alberta’s father by H. G. Wells
A year of prophesying by H. G. Wells
Wild Geese by Martha Ostenso
Manhattan transfer John Dos Passos
Porgy by DuBose Heyward (became Porgy & Bess in 1935)
Gentlemen prefer blondes by Anita Loos

Movies Released in 1925

The Big parade
The Freshman (Harold Lloyd)
The Gold Rush (Charlie Chaplin)
The Lost World
Don Q Son of Zorro (douglas fairbanks)
Little Annie Rooney (mary pickford)
The Phantom of the Opera (Lon Chaney)
The Unholy Three (Lon Chaney)
Ben Hur (Ramon Novarro)
Curses (Fatty Arbuckle)
The Eagle (Rudolph Valentino)
Go West (Buster Keaton)
The Lady (Norma Talmadge)
The Monster (Lon Chaney)
The Plastic Age (Clara Bow)
Seven Chances (Buster Keaton)
The Wizard of Oz (Oliver Hardy)
Lady of the Night (Norma Shearer)

Juneteenth – Freedom Day

Posted on June 19, 2020 by Alexis Rossi

The Emancipation Proclamation went into effect on January 1st, 1863, legally freeing 3.5 million enslaved people in the Confederate states. But of course, this executive order from President Abraham Lincoln came in the midst of the United States Civil War, which didn’t end until April of 1865 – the order could not be enforced until the war was over.

Juneteenth celebrates when enslaved people actually became free in 1865. The date, June 19th, commemorates General Gordon Granger of the Union Army announcing the executive order in Galveston, Texas, freeing all enslaved people in Texas.

Community access TV stations around the country have shown local celebrations of Juneteenth for years, and we thought this 2013 talk by Dr. Shennette Garrett-Scott at the Allen Public Library in Texas (via Allen City TV) was particularly helpful in understanding the history of this important day.

More resources:

Pretend you’re here with Internet Archive Zoom backgrounds

Posted on May 22, 2020 by Alexis Rossi

Have you seen these gorgeous library backgrounds you can use to pretend you’re amongst the smell of of old books and hushed page turning?

When I saw them I got a little jealous and thought, “computers are just as soothing!” So without further ado, welcome to your Internet Archive virtual Zoom backgrounds.

We’ve got a pretty majestic building you could sit in front of. There’s free wifi.

Or you can come inside and sit in the Great Room with us, stained glass dome and all.

Sit quietly amongst the pews with our little Internet Archivist sculptures by Nuala Creed.

Or have them be your backup dancers / Greek chorus on all your calls.

You can sit amongst the films waiting to be digitized.

Or pretend to be digitizing them yourself.

Scan books seated in front of a Table Top Scribe.

Or sit with the constant hum of busy servers in the background.

Happy Pi(e) Day

Posted on March 14, 2020 by Alexis Rossi

In honor of the esteemed mathematical constant, we invite you to celebrate Pi Day with us!

If you’re a math geek, we have you covered:

A history of pi (1989)
Pi, a sourcebook (1997)
Pi in the sky : counting, thinking, and being (1992)
Easy as pi? : an introduction to higher mathematics (1999)
Pi : a biography of the world’s most mysterious number (2004)

If your mathematical knowledge could use a little refresher, maybe try this one instead:
Sir Cumference and the dragon of pi : a math adventure.

You could listen to multiple people recite the first 50 digits of pi in various styles, including to the tune of the Battle Hymn of the Republic (my personal favorite), in the voice of Bullwinkle, as an infomercial, in Latin, while laughing, in Morse Code, and while eating actual pie.

If you’re just obsessive, here’s

Have insomnia? Listen to the first 1,000 digits of pi for 9.5 minutes straight… problem solved!

But most importantly, if you want to celebrate by eating pie we can help you make one! Winner of the Best Title Award definitely goes to Pies and tarts with schmecks appeal by the inimitable Edna Staebler. A close second goes to Tarts with Tops On by Tamasin Day-Lewis. But take your pick from amongst a wide array of pie cookbooks to find the right one for you.

And most importantly, we wish you infinite pi(e).

A Love Letter to the People Who Build the Internet Archive

Posted on February 14, 2019 by Alexis Rossi

canvas-1-done When you visit a public library, you get to meet the librarians and others who build and care for those collections. You know there are people who empty the garbage cans, who put back the borrowed books, who maintain the computers, and who determine what ends up on the shelf.

A digital library, on the other hand, is “just” a web site. You don’t really see the people who build it – we are often anonymous. But the Internet Archive wasn’t built by computers and algorithms.

From its inception, the Internet Archive has been built by thousands of people who understand that we have an opportunity to use the Internet to give everyone access to canvas-2-done knowledge. Every person on the planet should have the opportunity to learn and to make a contribution.

This goal – Universal Access to All Knowledge – inspires the people who have built the Internet Archive over the past 23 years.

People clean and repair the buildings that we occupy. People do payroll, choose our health plans, answer the phones, plan our events, reply to user emails, clean up spam, and pay our bills. People design and build the computers that hold the collections. People construct the network that carries data to every corner of the world. People write software that processes, backs up, and delivers files. People design and test and build interfaces. People digitize analog media and type in metadata. People curate collections, establish collaborations, and manage projects.

holidaypic

There’s no way I can mention all of these people by name. Even if I listed every employee from the past 23 years, I would still be missing the volunteers, the people from other organizations who worked on joint projects with us, the pro bono lawyers, the delightfully compulsive collectors, the funding organizations, the idea generators, our sounding boards for crazy ideas, the individuals who have donated money or materials, and the hundreds of thousands of people who have uploaded media into the archive.

staff2011

Libraries are built by people, for people. Thank you so much to all of the people who have contributed to building the Internet Archive, whether they were employees or our huge group friends and family. We would not be here without you, and we hope you will continue to help bring universal access to all knowledge in the future.

Happy Valentine’s Day!

151020-archive-staff-large

Want to read like a celebrity?

Posted on February 13, 2019 by Alexis Rossi

Apparently you’re not alone. I ran across a list of celeb’s favorite books and thought you might like to check out a few. (See what I did there? Librarian pun.) Happy reading!

Anna Kendrick
All Quiet on the Western Front by Erich Maria Remarque
Slaughterhouse-Five by Kurt Vonnegut
The Things They Carried by Tim O’Brien

Bill Murray
Huckleberry Finn by Mark Twain
A Story Like the Wind by Laurens Van Der Post
A Far Off Place by Laurens Van Der Post
The Plague by Albert Camus

**Bill Murray**
(photo by Georges Biard, CC BY-SA 3.0, from Wikimedia Commons)

Emma Watson
Le Petit Prince by Antoine de Saint-Exupéry

Olivia Munn
Replay by Ken Grimwood

Michelle Obama
Song of Solomon by Toni Morrison

Kit Harington
1984 by George Orwell

Dolly Parton
The Little Engine That Could by Watty Piper
(And check out Dolly Parton’s Imagination Library, which gives free books to kids!)

**Dolly Parton**
(photo by Josef Just [CC BY-SA 3.0, from Wikimedia Commons)

Robin Williams
Foundation trilogy by Isaac Asimov (or individually at 1, 2, 3)

Daniel Radcliffe
The Master and Margarita by Mikhail Bulgakov

Rachel McAdams
When You Are Engulfed in Flames by David Sedaris

Zooey Deschanel
A Supposedly Fun Thing I’ll Never Do Again by David Foster Wallace

Donald Glover
The Curious Incident of the Dog in the Night-Time by Mark Haddon
Extremely Loud And Incredibly Close by Jonathan Safran Foer

**Donald Glover**
(photo by NASA/Bill Ingalls [Public domain], via Wikimedia Commons)

Alec Baldwin
The Phantom Tollbooth by Norton Juster

Hillary Clinton
The Brothers Karamazov by Fyodor Dostoyevsky
Runaway by Alice Munro

Jessica Biel
Tender Is the Night by F. Scott Fitzgerald

Chelsea Handler
Mawson’s Will by Lennard Bickel
One Thousand White Women by Jim Fergus
Anna Karenina by Leo Tolstoy

Keira Knightley
The Passion by Jeanette Winterson

J. K. Rowling
The Woman Who Walked Into Doors by Roddy Doyle

Halle Berry
Some Love, Some Pain, Sometime by J. California Cooper

Jamie Chung
The Orphan Master’s Son by Adam Johnson

**Jamie Chung**
(photo by David Shankbone [CC BY 3.0], from Wikimedia Commons)

Jennifer Lawrence
Catcher in the Rye by J. D. Salinger
Raise High the Roof Beam, Carpenters; and Seymour by J. D. Salinger

Lady Gaga
Letters to a Young Poet by Rainer Maria Rilke

John Hamm
Arcadia by Tom Stoppard

Cher
Music for Chameleons by Truman Capote
Stranger in a Strange Land by Robert A. Heinlein

Kesha
Still Life with Woodpecker by Tom Robbins

Anne Hathaway
The Secret Garden by Frances Hodgson Burnett

Zoe Saldana
Shawshank Redemption by Stephen King

**Zoe Saldana**
(photo by Gage Skidmore [CC BY-SA 3.0], from Wikimedia Commons)

George R. R. Martin
Lord of the Rings by J. R. R. Tolkien

Matt Damon
A People’s History of the United States by Howard Zinn

Nas
Convictions by Richard Pryor

Natalie Portman
Cloud Atlas by David Mitchell

Bill Gates
Better Angels of our Nature by Steven Pinker

Joan Didion
Victory by Joseph Conrad

Making Out-of-Print Pre-1942 books available with “Last 20” provision

Posted on February 12, 2019 by Alexis Rossi

About a year and a half ago, the Internet Archive launched a collection of older books that were determined to qualify for the “Last 20” provision in Copyright Law, also known as Section 108(h) for the lawyers. As I understand this provision, it states that published works in the last twenty years of their copyright term may be digitized and distributed by libraries, archives and museums under certain circumstances. At the time, the small number of books that went into the collection were hand-researched by a team of legal interns. As you can imagine, this is a process that would be difficult to perform one-by-one for a large and ever-growing corpus of works.

So we set out to automate it. Amazon has an API with book information, so I figured with a little data massaging it shouldn’t be too hard to build a piece of software to do that job for us. Pull the metadata from our MARC* metadata records, send it to Amazon, and presto!

I was wrong. It was hard.

Library Catalog Names are different from Book Seller’s Names

Library-generated metadata is often very detailed, which leads to problems when we try to match the metadata provided by librarians to the metadata used on consumer-oriented web sites. For example, an author listed in a MARC record might appear as

Purucker, G. de (Gottfried), 1874-1942

But when you look on Amazon, that same author appears as

G. de Purucker

If we search the full author from the MARC on Amazon (including full name and birth and death dates), we may miss potential matches. And this is just one simple example. We have to transform every author field we get from MARC using a set of rules that may continue to expand as we find new problems to solve. Here are the current rules just for transforming this one field:

General rules for transforming MARC author to Amazon author:

Maintain all accented or non-Roman characters as-is
If there are no commas, semicolons or parentheses in the string, use the whole string as-is
If there are no commas in the string, but there are semicolon and/or parentheses, use anything before semicolon or parentheses as the entire author string
If there are commas in the string:
- Everything before the first comma should be used as the author’s last name
- Everything after the first comma but BEFORE any of these should be used as the author’s first name:
  - comma [ , ],
  - semicolon [ ; ],
  - open parentheses [ ( ]
  - any number [0-9]
  - end of string
- Remaining information should be discarded
Period [ . ] and apostrophe [ ‘ ] and other symbols should not be used to delimit any name and should be maintained as-is in the transformed string.

An Account of the Saga of the Never-ending Title: as told to the author by three blah blah blahs…

Some older books have really long titles. The MARC record contains the entire title, of course! Why wouldn’t it?! But consumer-oriented sites like Amazon often carry these books with shortened or modified titles.

For example, here’s the title of a real page-turner:

American authors, 1600 – 1900 a biographical dictionary of American literature ; compl. in 1 vol. with 1300 biographies and 400 portraits

But on Amazon that title is:

American Authors 1600-1900: A Biographical Dictionary of American Literature (Wilson Authors)

As you can image, it’s far more difficult to reliably match books with longer titles. A human can look at those two titles and think “yeah, that’s probably the same book,” but software doesn’t work quite that well.

$%!@$ Serials

Now that the librarians have had a laugh, let’s explain that for everybody else! Think back to the days of yore when you went to the library and looked things up in a physical card catalog. If you wanted to know where a serial or periodical was located within the library collections, you really just needed one card to tell you that. It’s on this shelf in this area and the collection contains these years.

Great! Except when you’re looking at digital versions of these serials, they are distinct entities – they have different dates, different topics, different authors sometimes, etc. And yet they often still have just one MARC record – the digital equivalent of that one card in the catalog.

And that means that the publication dates pulled from the MARC records are sometimes very wrong.

For example, we have several items from the annual series The Book of Knowledge – 1947, 1957, 1958, 1959, 1974… The date provided in the MARC file for all of these is 1940.

As you can imagine, when we are filtering texts by year for various purposes, serials are a consistent issue.

Even when we have a correct date, Amazon does not match very well on volume and other serial or periodical-based information. For example, when we search for a particular month of a magazine, we are likely to match an entirely different month of that same magazine.

Not All Metadata is Good Metadata

Unbelievably, librarians do make mistakes. Sometimes the data we have from MARC records has typos, or a MARC record for a different publication date was attached to the book. For example, we have an author named Fkorence A Huxley, but her name is really Florence. Not according to the MARC record, though! Fat finger errors don’t just happen on phones. Another example: we scanned a book originally published in 1924, and *republished* in 1971. We have the 1971 version. But the MARC record tells us it’s from 1924.

Essentially, our search is only as good as our metadata. If there are typos, or the wrong MARC record, or wrong data, our search and/or filtering will not be accurate.

Commercial APIs Are Not Built to Solve Library Problems

Amazon’s API is built to sell books to end users. Yes, it helps you find a particular book, but the other data the API contains about availability, formats and pricing is less accurate. Because the Section 108(h) exemption for libraries (read more here) involves knowing whether copies are being sold at reasonable prices, we need to know about these aspects of the book to determine whether they qualify. But Amazon’s API is incomplete in this area. So we found ourselves needing to use the API to find a match for the title and author, and then go to the page and scrape it to actually get accurate availability and pricing information.

This increases the complexity of the programming required to use Amazon as a source for information, and greatly lengthened the process of building tools for this purpose.

Everything changes

We are making a determination about whether a book meets the qualifications for Section 108(h) at a particular point in time. Even with all of the issues discussed here, the accuracy of the data we can now pull about book availability and price is high. But it’s only accurate for the moment that we pull the data, because Amazon’s marketplace is constantly changing. If we don’t find a book on Amazon today, that doesn’t mean it won’t appear on the site tomorrow.

Because of this, when we make an item available to the public via Section 108(h), we write into the item’s metadata the date on which the determination was made.

Who Wants In!?

Since I’ve made this process sound SO appealing, I would imagine that any number of other library institutions are going to line up around the block wanting to try it out for themselves. Or not. But here’s the good news! If we digitize your books, the Internet Archive may be able to do the Section 108(h) determination on your behalf. Please contact us if you would like to participate.

*A MARC record is a MAchine-Readable Cataloging record. Essentially, it is the digital equivalent of the physical card from a card catalog.

A Public Peek into 1923

Posted on January 5, 2019 by Alexis Rossi

Commercial radio broadcasting began in the 1920s, bringing entertainment, news and music into people’s homes. Now, instead of needing to play a 78rpm disc on your phonograph, you could just tune in to listen to popular songs.

And in 1923 that means you would have been listening to one of the many versions of “Yes! We Have No Bananas” written by Frank Silver and Irving Cohn.

You could listen to the Billy Jones version (play below), the Billy Murray version, a Yiddish version, or an Italian version, among others.

Yes! We Have No Bananas by Billy Jones from the 78rpm collection

Then you could have moved on to dancing the Charleston, popularized by the song of the same name from the 1923 musical “Runnin’ Wild.” And with the explosion of recordings by African American musicians, you could also enjoy “Baby Won’t You Please Come Home” by Bessie Smith and “Dipper Mouth Blues” by Louis Armstrong.

In the news of the day you saw the first flight of an autogyro (the precursor to the helicopter).

Jack Dempsey defended his World Heavyweight Championship title against Tommy Gibbons and Luis Firpo.

And Howard Carter’s team finally entered the burial chamber of King Tutankhamen, as covered in books, sheet music and song!

But why are we focusing on 1923? Because for the first time in 20 years, new works are entering the public domain in the United States (read more: 1, 2, 3). And those works were all published in, you guessed it, 1923.

Settle in with a Reese’s Peanut Butter Cup, a Butterfinger, or a refreshing Popsicle (all invented in 1923!) while you watch Cecil B. DeMille’s The Ten Commandments, The White Sister starring Lillian Gish, or The Hunchback of Notre Dame starring Lon Chaney. Or any one of 50 other films available on archive.org from that year.

After your movie marathon, you can turn to your “new” reading materials to learn about sewing the latest women’s fashions, try an old recipe from a cook book (we recommend the Marshmallow Loaf), learn about theatrical lighting, construct yourself a bungalow (um, check the lastest building codes first), grab some sheet music, read up on Benito Mussolini, and learn “How You Can Keep Fit” from Rudolph Valentino (!).

Finally, settle in to read some Robert Frost, Virginia Woolf, Edith Wharton, or Kahlil Gibran. And while you’re here, take a look at the 20,000 other texts we have available from 1923.

We look forward to introducing you to 1924 NEXT January!

New Views Stats for the New Year

Posted on December 19, 2018 by Alexis Rossi

We began developing a new system for counting views statistics on archive.org a few years ago. We had received feedback from our partners and users asking for more fine-grained information than the old system could provide. People wanted to know where their views were coming from geographically, and how many came from people vs. robots crawling the site.

The new system will debut in January 2019. Leading up to that in the next couple of weeks you may see some inconsistencies in view counts as the new numbers roll out across tens of millions of items.

With the new system you will see changes on both items and collections.

Item page changes

An “item” refers to a media item on archive.org – this is a page that features a book, a concert, a movie, etc. Here are some examples of items: Jerky Turkey, Emma, Gunsmoke.

On item pages the lifetime views will change to a new number. This new number will be a sum of lifetime views from the legacy system through 2016, plus total views from the new system for the past two years (January 2017 through December 2018). Because we are replacing the 2017 and 2018 views numbers with data from the new system, the lifetime views number for that item may go down. I will explain why this occurs further down in this post where we discuss how the new system differs from the legacy system.

Collection page changes

Soon on collection page About tabs (example) you will see 2 separate views graphs. One will be for the old legacy system views through the end of 2018. The other will contain 2 years of views data from the new system (2017 and 2018). Moving forward, only the graph representing the new system will be updated with views numbers. The legacy graph will “freeze” as of December 2018.

Both graphs will be on the page for a limited time, allowing you to compare your collections stats between the old and new systems. We will not delete the legacy system data, but it may eventually move to another page. The data from both systems is also available through the views API.

People vs. Robots

The graph for new collection views will additionally contain information about whether the views came from known “robots” or “people.” Known robots include crawlers from major search engines, like Google or Bing. It is important for these robots to crawl your items – search engines are a major source of traffic to all of the items on archive.org. The robots number here is your assurance that search engines know your items exist and can point users to them. The robots numbers also include access from our own internal robots (which is generally a very small portion of robots traffic).

One note about robots: they like text-based files more than audio/visual files. This means that text items on the archive that have a publicly accessible text file (the djvu.txt file) get more views from robots than other types of media in the archive. Search engines don’t just want the metadata about the book – they want the book itself.

“People” are a little harder to define. Our confidence about whether a view comes from a person varies – in some cases we are very sure, and in others it’s more fuzzy, but in all cases we know the view is not from a known robot. So we have chosen to class these all together as “people,” as they are likely to represent access by end users.

What counts as a view in the new system

Each media item in the archive has a views counter.
The view counter is increased by 1 when a user engages with the media file(s) in an item.
- Media engagement includes experiencing the media through the player in the item page (pressing play on a video or audio player, flipping pages in the online bookreader, emulating software, etc.), downloading files, streaming files, or borrowing a book.
- All types of engagements are treated in the same way – they are all views.
A single user can only increase the view count of a particular item once per day.
- A user may view multiple media files in a single item, or view the same media file in a single item multiple times, but within one day that engagement will only count as 1 view.
Collection views are the sum of all the view counts of the items in the collection.
- When an item is in more than one collection, the item’s view counts are added to each collection it is in. This includes “parent” collections if the item is in a subcollection.
- When a user engages with a collection page (sorting, searching, browsing etc.), it does NOT count as a view of the collection.
- Items sometimes move in or out of collections. The views number on a collection represents the sum of the views of the items that are in the collection at that time (e.g. the September 1, 2018 views number for the collection represents the sum of the views on items that were in the collection on September 1, 2018. If an item moves out of that collection, the collection does not lose the views from September 1, 2018.).

How the new system differs from the legacy system

When we designed the new system, we implemented some changes in what counted as a “view,” added some functionality, and repaired some errors that were discovered.

The legacy system updated item views once per day and collection views once per month. The new system will update both item and collection views once per day.
The legacy system updated item views ~24 hours after a view was recorded. The new system will update the views count ~4 days after the view was recorded. This time delay in the new system will decrease to ~24 hours at some point in the future.
The legacy system had no information about geographic location of users. The new system has approximate geolocation for every view. This geographic information is based on obfuscated IP addresses. It is accurate at a general level, but does not represent an individual user’s specific location.
The legacy system had no information about how many views were caused by robots crawling the site. The new system shows us how well the site is crawled by breaking out media access by robots (vs. interactions from people).
The legacy system did not count all book reader interactions as views. The new system counts bookreader engagements as a view after 2 interactions (like page flips).
On audio and video items, the legacy system sometimes counted views when users saw *any* media in the item (like thumbnail images). The new system only counts engagements with the audio or video media files in an item in those media types, respectively.

In some cases, the differences above can lead to drastic changes in views numbers for both items and collections. While this may be disconcerting, we think the new system more accurately reflects end user behavior on archive.org.

If you have questions regarding the new stats system, you may email us at info@archive.org.

Internet Archive Blogs

A blog from the team at archive.org

Author Archives: Alexis Rossi

Radio Ngrams Dataset Allows New Research into Public Health Messaging

Guest post by Dr. Kalev Leetaru

Ngrams data set

Some initial research using these ngrams

TV vs. Radio

About Kalev Leetaru

January 1st brings public domain riches from 1925

Traveling from Home

Looking at the past with the eyes of today

Controversies of the Age

The John T. Scopes Trial (The Old Religion’s Better After All) by Vernon Dalhart and Company

Monkey Biz-ness (Down in Tennessee) by International Novelty Orchestra with Billy Murray

The more things change, the more they stay the same

The Famous Stuff

Books First Published in 1925

Movies Released in 1925

Juneteenth – Freedom Day

Pretend you’re here with Internet Archive Zoom backgrounds

Happy Pi(e) Day

A Love Letter to the People Who Build the Internet Archive

Want to read like a celebrity?

Making Out-of-Print Pre-1942 books available with “Last 20” provision

Library Catalog Names are different from Book Seller’s Names

An Account of the Saga of the Never-ending Title: as told to the author by three blah blah blahs…

$%!@$ Serials

Not All Metadata is Good Metadata

Commercial APIs Are Not Built to Solve Library Problems

Everything changes

Who Wants In!?

A Public Peek into 1923

New Views Stats for the New Year

Guest post by Dr. Kalev Leetaru

Ngrams data set

Some initial research using these ngrams

TV vs. Radio

About Kalev Leetaru

Traveling from Home

Looking at the past with the eyes of today

Controversies of the Age

The John T. Scopes Trial (The Old Religion’s Better After All) by Vernon Dalhart and Company

Monkey Biz-ness (Down in Tennessee) by International Novelty Orchestra with Billy Murray

The more things change, the more they stay the same

The Famous Stuff

Books First Published in 1925

Movies Released in 1925

Library Catalog Names are different from Book Seller’s Names

An Account of the Saga of the Never-ending Title: as told to the author by three blah blah blahs…

*$%!@$* Serials

Not All Metadata is Good Metadata

Commercial APIs Are Not Built to Solve Library Problems

Everything changes

Who Wants In!?

$%!@$ Serials