Last week, Internet Archive welcomed more than 150 attendees to the webinar, “Protect Our Future Memory: Join the Call for Library Digital Rights.” Held on January 27, the event brought together legal experts, library leaders, and advocates to talk about Our Future Memory and the global coalition working to secure the protections that memory institutions need in our increasingly digital and networked world.
Watch the session recording:
The webinar opened with a stark reality check: For generations, libraries, archives, museums, and other memory institutions have relied on social and legal norms that allow them to collect, preserve, and lend materials. But nowadays, digital content is increasingly being controlled by restrictive licenses on gated, paywalled platforms. This new distribution stream prohibits memory institutions from doing what they’ve historically been able to do in the physical world, curtailing their essential functions of preserving and providing long-term access to knowledge.
Webinar attendees heard from recent signatories Charlie Barlow, Executive Director of the Boston Library Consortium, and John Chrastka, Executive Director of the EveryLibrary Institute. Their participation highlighted the crisis facing memory institutions—and the demands necessary to overcome it.
“When we have publishers or vendors coming in and saying that we can’t do something that we perceive as foundational and essential,” said Barlow, “we’re in real trouble.”
Chrastka added, “We’ve got gases, solids, liquids, plasma, and ebooks! Seriously, when you think about it, I can’t own it unless the IP owner wants to distribute that right to us. It’s a violation, in some ways, of a natural order.”
To combat this dire situation, Our Future Memory is building consensus around the Statement on Digital Rights for Protecting Memory Institutions Online. Originating from discussions at the Library Leaders Forum and first endorsed by the National Library of Aruba in 2024, the Statement proposes the simple solution of letting memory institutions do what they were always able to do before the digital age. Specifically, they need the legal rights and practical ability to:
Collect digital materials
Preserve digital collections
Provide controlled digital access
Cooperate across institutions
The Statement’s focus on foundational norms is what compelled the Boston Library Consortium to join the coalition, and Barlow emphasized its value as a tool for asserting that traditional library functions must not be treated as negotiable.
“We chose to sign this one because for us, it really established a clear, public baseline that we can point to when long-standing library rights are being treated as optional or the exception,” he explained. “It really is about making those foundational rights visible and shared and harder to dismiss.”
For Chrastka and the EveryLibrary Institute, endorsing the Statement was a necessary step toward building the political momentum required to change the status quo.
“We haven’t been necessarily talking as a sector out loud together as frequently and as vociferously as we need to about what this should all look like,” Chrastka said. “We want to lean into this conversation.”
How can organizations participate?
Because memory institutions speak louder when they stand together, Our Future Memory is actively accepting signatures from institutions, organizations, and government entities. If you are ready to stand with a global community committed to protecting the past to power the future, here is how you can join:
Download the Statement from ourfuturememory.org (or email campaigns@internetarchive.eu for a copy).
Sign the document (either by hand or using an electronic signature tool).
Send the signed document back to campaigns@internetarchive.eu.
Once received, your organization will be added to the list of signatories.
Want to learn more? If you missed the live event, you can watch the full recording or visit the Our Future Memory website for resources to help you advocate for these rights in your own community.
The following Q&A between writer Caralee Adams and journalist Philip Bump of The Washington Post is part of our Vanishing Culture series, highlighting the power and importance of preservation in our digital age. Read more essays online or download the full report now.
Philip Bump is a columnist for The Washington Post based in New York. He writes the weekly newsletter How To Read This Chart. He’s also the author of The Aftermath: The Last Days of the Baby Boom and the Future of Power in America.
Caralee Adams: What does it mean for an individual journalist to have their work preserved? Why is it important to have easy access to news stories from the past?
Philip Bump: One of the nice things about my career has been that I’ve worked for outlets that I feel confident are doing their own preservation, like The Washington Post. I’m not particularly worried about losing access to my writing. However, it’s less of a concern for me than it is for other outlets, unfortunately. It is unquestionably the case that I find the Internet Archive useful and use it regularly for a variety of things—both for its preservation of online content and collection of closed captioning for news programs.
Any recent examples of when you’ve found the Internet Archive particularly useful?
I use the search tool on closed captioning more than anything else. The other day I was trying to find an old copy of a webpage. I was writing about Donald Trump’s comments on Medal of Honor recipients. As it turns out, there is not an immediately accessible resource for when Medals of Honor were granted to members of the military. You can see aggregated—how many there are—but you can’t see who was given a medal and when they served. I actually used the Internet Archive to see how the metrics changed between the beginning of Trump’s presidency and by the end of it. I was able to see that there were medals awarded to about 11 people who served during the War on Terror, three who served in Vietnam, and one during World War II. Then, I was able to go back and double check against the Trump White House archive, which is done by the National Archives, and see the people to whom he had given this award. That’s a good example of being able to take those two snapshots in time and then compare them in order to see what the difference was to get this problem solved.
Why is it important for the public to have free public access to an archive of the news for television or print?
It’s the same reason that it’s important, in general, to have any sort of archive: it increases accountability and increases historical accuracy. The Internet Archive is essential at ensuring that we have an understanding of what was happening on the internet at a given point in time. That is not something that is constantly useful, but it is something that is occasionally extremely useful. I do a lot of work in politics and get to see what people are saying at certain points in time, which are important checks and accountability for elected officials. The public can know what they were saying when they were running in the primary as compared with the general [election]. The Archive allows anyone to be able to get information from websites that are no longer active. If you’re looking for something and you have the old link to Gawker or the old link to a tweet, you can often [find] it archived. The Internet Archive doesn’t capture everything—it couldn’t possibly do so. But it captures enough to generally answer the questions that need to get answered. There’s nowhere else that does that. There are other archiving sites, but none that do so as comprehensively, or none with an archive that goes back that far.
Has any of your journalism vanished from the public? Do you have any examples where you’ve been looking for something and it’s been missing?
Yes. One of the challenges is that multimedia content has often, in the past, been overlooked. There are old news reports that I’ve been unable to find because they’re on video in the era before there was a lot of accessibility and transcripts. Therefore, yes, there are certainly things like that which come up with some regularity. Also, particularly in the era of 2005 to 2015, there were a lot of independent sites that had useful news reports—particularly since we’re talking about the cast of political characters that have been around in the public eye at that point in time. It’s often the case that it’s hard to track those things down. Or if you’re trying to track down the original source or verify a rumor, you might need to dip into the Archive. There are a lot of sites from that era of “bespoke” blogs that the Internet Archive often captures.
How does limited access to historical data or previous coverage impact you as a journalist?
It is hard to say, because relatively speaking, I am advantaged by the fact that I live in this era. If I were doing this in 1990, [I’d use] basically whatever was at the New York Public Library and on microfiche. It is far better than it used to be, but the amount of content being produced is also far larger. It is both a positive and a negative that it is far easier to do that sort of research here from my desk at home than it would possibly have been 30 years ago. In fact, I was working on a project where I relied heavily on a local newspaper in a small town in Pennsylvania that wasn’t available online. I literally had to hire someone in the town to go to the library, find [coverage from] the particular date and the local paper and to get the scans done. It cost me hundreds of dollars, but that was the only way to do it. You can see how getting these things done is problematic and challenging.
When Paramount deleted the MTV News Archive in June, there was a lot of dismay, but some say it was frivolous, disposable, and kind of meant to be thrown away. How do you feel about that?
My first writing gig online was at MTV News in college, so that actually had a personal resonance for me. I was at Ohio State in the early to mid 1990s, and I got this little internship with MTV News. I wrote one piece about this band called The Hairy Patt Band. It ended up on the MTV News website. I was very excited. I haven’t seen that in 30 years. It’s one of those things where I wondered what ever happened to that story or if it exists anywhere, in any form. So, that [news] actually had resonance. It’s a bummer. Is it as important to maintain the archives of MTV News as it is The Washington Post? I’m biased, but I would say, no. But it is still a loss of culture—and it is a unique loss of culture. This was a unique and novel form of information that was emergent in the 1990s and now is lost. In the moment, its very existence captured the culture in a way that is worth preserving.
How do you feel about the future of digital preservation of news, data, and information?
I’m more pessimistic than I used to be. I came of age with the internet. When it was new, I used to describe it as the emergence from a new dark age. We had all this information and there was no more going back. All this existed. Everything was online, and we had archives. Now, we see, in part because the scale has increased so quickly that economic considerations come into play, and all of a sudden… the internet isn’t just an endless archive anymore. There are very few places that are doing what libraries do to capture these things on microfiche or store books for the public’s benefit. There is so much of it and that becomes the problem.
Why is it important to pay attention to this issue and preserve journalism for future reporters?
It is obviously the case that we are creating information, culture, and benchmarks for society faster than we can figure out how we’re going to make sure they’re preserved. I think that’s probably always been the case, except that what’s different now is that we are more cognizant of the process of preservation and the challenges of preservation. We expect there to be this thing that exists forever. We don’t yet know how to balance the interest in having as few things be ephemeral as possible, versus the value in doing that… maybe it’s not even possible to preserve everything in the way that we would want to at scale. We have created a process by which it is possible to record and observe nearly everything, and now we’re realizing that that is potentially in conflict with our desire to also store and preserve all this information indefinitely.
Anything you’d like to add?
I think it’s worth noting that preservation is one of the few areas in which I think artificial intelligence bears some potential benefit. One of the things that I’ve long found frustrating is that The New York Times, The Washington Post, and other major news outlets, have enormous storehouses of information—not all of it textual. The New York Times must have, in its archives, photos of every square inch of New York City at some point in time over the course of the past 100 years. Artificial intelligence is a great tool for indexing and documenting. We now have tools that allow us to go deeper into our archives and extract more information from them, which I think is a positive development, and is something I’ve advocated for a long time publicly. Only with the advent of artificial intelligence does large-scale preservation become something that seems feasible. One can go through the National Archive and extract an enormous amount of information that is currently stored there in an accessible form, which saves someone from having to stumble upon a particular image. I think that is beneficial. I don’t think that necessarily solves the storage at scale issue, but it does address the fact that so much information is currently locked away and inaccessible, which is another facet of the challenge.
About the author
Caralee Adams is a journalist based in Bethesda, Maryland. She is a graduate of Iowa State University and received her master’s in political science at the University of New Orleans. After working at newspapers and magazines, she has been a freelancer covering education, science, tech and health for a variety of publications for more than 30 years.
The RFC Series contains documents that define how the Internet functions. The first RFC was published in 1969, when just a few organizations were trying to figure out how to communicate digitally. Now, 53 years later, more than 9,200 RFCs have been written by thousands of volunteers and these documents and protocols are the underpinnings of the Internet systems we use every day.
Alexis joins the IETF team to help maintain the archival quality of the RFC Series, and to provide guidance on the policies and processes for publishing these important documents. She will also continue in her role with the Internet Archive, managing the organization’s millions of digital items.
The Internet Archive’s founder, Brewster Kahle, who has his own informational RFC (RFC 1625) published in 1994 for WAIS (Wide Area Information Servers), said of Alexis’s new role, “From my own days working on WAIS, I know how important these documents have been to the development of today’s Web. I’m glad to know that someone with so much experience will be helping to keep this Series preserved.”
Radio remains one of the most-consumed forms of traditional media today, with 89% of Americans listening to radio at least once a week as of 2018, a number that is actually increasing during the pandemic. News is the most popular radio format and 60% of Americans trust radio news to “deliver timely information about the current COVID-19 outbreak.”
Local talk radio is home to a diverse assortment of personality-driven programming that offers unique insights into the concerns and interests of citizens across the nation. Yet radio has remained stubbornly inaccessible to scholars due to the technical challenges of monitoring and transcribing broadcast speech at scale.
Debuting this past July, the Internet Archive’s Radio Archive uses automatic speech recognition technology to transcribe this vast collection of daily news and talk radio programming into searchable text dating back to 2016, and continues to archive and transcribe a selection of stations through the present, making them browsable and keyword searchable.
Ngrams data set
Building on this incredible archive, the GDELT Project and I have transformed this massive archive into a research dataset of radio news ngrams spanning 26 billion English language words across portions of 550 stations, from 2016 to the present.
You can keyword search all 3 million shows, but for researchers interested in diving into the deeper linguistic patterns of radio news, the new ngrams dataset includes 1-grams through 5-grams at 10-minute resolution, covering all four years and updated every 30 minutes. For those less familiar with the concept, ngrams are word frequency tables: the transcript of each broadcast is broken into words, and for each 10-minute block of airtime a list is compiled of all the words spoken on each station in those 10 minutes, along with how many times each word was mentioned.
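The ngram construction described above can be sketched in a few lines of Python. This is a simplified illustration of the general technique, not the actual GDELT pipeline; the toy transcript is made up:

```python
from collections import Counter

def ngram_counts(transcript_words, n=1):
    """Count n-grams (word sequences of length n) in a list of words."""
    return Counter(
        tuple(transcript_words[i:i + n])
        for i in range(len(transcript_words) - n + 1)
    )

# Each 10-minute block of airtime becomes one frequency table per station.
# (Toy transcript; real transcripts come from speech recognition.)
block = "we are sick and tired of hearing about your damn emails".split()

unigrams = ngram_counts(block, n=1)
bigrams = ngram_counts(block, n=2)

print(unigrams[("we",)])              # how often "we" was spoken in this block
print(bigrams[("damn", "emails")])    # how often the two-word phrase appeared
```

At dataset scale, one such table is kept per station per 10-minute window, which is why downstream analyses can be run without ever re-reading the full transcripts.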
Some initial research using these ngrams
How can researchers use this kind of data to understand new insights into radio news?
The graph below looks at pronoun usage on BBC Radio 4 FM, comparing the percentage of words spoken each day that were “we” words (“we”, “us”, “our”, “ours”, “ourselves”) versus “me” words (“i”, “me”, “i’m”). “Me” words are used more than twice as often as “we” words, but look closely at February 2020, as the pandemic began sweeping the world: “we” words start increasing as governments began adopting language to emphasize togetherness.
“We” (orange) vs. “Me” (blue) words on BBC Radio 4 FM, showing increase of “we” words beginning in February 2020 as Covid-19 progresses
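A comparison like the one charted above falls straight out of the unigram tables. Here is a minimal sketch with made-up daily counts (the real dataset is keyed by station and 10-minute block, and these numbers are purely illustrative):

```python
# Hypothetical daily unigram counts for one station (not real BBC data).
day_counts = {"we": 420, "us": 130, "our": 260, "ours": 5, "ourselves": 15,
              "i": 900, "me": 340, "i'm": 410,
              "the": 5200, "and": 3100}  # ...plus every other word spoken

WE_WORDS = {"we", "us", "our", "ours", "ourselves"}
ME_WORDS = {"i", "me", "i'm"}

# Express each group as a share of all words spoken that day,
# which normalizes away differences in total airtime.
total = sum(day_counts.values())
we_pct = 100 * sum(day_counts.get(w, 0) for w in WE_WORDS) / total
me_pct = 100 * sum(day_counts.get(w, 0) for w in ME_WORDS) / total

print(f"'we' words: {we_pct:.2f}% of all words spoken")
print(f"'me' words: {me_pct:.2f}% of all words spoken")
```

Running this per day and plotting the two series is all it takes to reproduce the shape of the chart above for any station in the dataset.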
TV vs. Radio
Combined with the television news ngrams that I previously created, it is possible to compare how topics are being covered across television and radio.
The graph below compares the percentage of spoken words that mentioned Covid-19 since the start of this year across BBC News London (television) versus radio programming on BBC World Service (international focus) and BBC Radio 4 FM (domestic focus).
All three show double surges at the start of the year as the pandemic swept across the world, a peak in early April and then a decrease since. Yet BBC Radio 4 appears to have mentioned the pandemic far less than the internationally-focused BBC World Service, though the two are now roughly equal even as the pandemic has continued to spread. Overall, television news has emphasized Covid-19 more than radio.
Covid-19 mentions on Television vs. Radio. The chart compares BBC News London (TV) in blue, versus BBC World Service (Radio) in orange and BBC Radio 4 FM (Radio) in grey.
For now, you can download the entire dataset to explore on your own computer but there will also be an interactive visualization and analysis interface available sometime in mid-Spring.
It is important to remember that these transcripts are generated through computer speech recognition, so are imperfect transcriptions that do not properly recognize all words or names, especially rare or novel terms like “Covid-19,” so experimentation may be required to yield the best results.
Researchers can ask questions that for the first time simultaneously look across audio, video, imagery and text to understand how ideas, narratives, beliefs and emotions diffuse across mediums and through the global news ecosystem. Helping to seed the future of such at-scale research, the Internet Archive and GDELT are collaborating with a growing number of media archives and researchers through the newly formed Media Data Research Consortium to better understand how critical public health messaging is meeting the challenges of our current global pandemic.
About Kalev Leetaru
For more than 25 years, GDELT’s creator, Dr. Kalev H. Leetaru, has been studying the web and building systems to interact with and understand the way it is reshaping our global society. One of Foreign Policy Magazine’s Top 100 Global Thinkers of 2013, his work has been featured in the presses of over 100 nations and fundamentally changed how we think about information at scale and how the “big data” revolution is changing our ability to understand our global collective consciousness.
The goal of the News Measures Research Project is to examine the health of local community news by analyzing the amount and type of local news coverage in a sample of communities. In order to generate a random and unbiased sample of communities, the team used US Census data. Prior research suggested that average income in a community is correlated with the amount of local news coverage; thus the team decided to focus on three different income brackets (high, medium and low), using the Census data to break the communities into categories. Rural areas and major cities were eliminated from the sample in order to reduce the number of outliers; this left a list of 1,559 communities ranging in population from 20,000 to 300,000 and in average household income from $21,000 to $215,000. Next, a random sample of 100 communities was selected, and a rigorous search process was applied to build a list of 663 news outlets that cover local news in those communities (based on Web searches and established directories such as Cision).
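The sampling approach described above—stratifying by income bracket before drawing a random sample—can be sketched as follows. The income cutoffs, field names, and generated records here are illustrative assumptions, not the study’s actual criteria:

```python
import random

# Toy community records carrying census-style attributes (not real census data).
communities = [
    {"name": f"community_{i}",
     "population": random.randint(20_000, 300_000),
     "avg_income": random.randint(21_000, 215_000)}
    for i in range(1559)
]

def income_bracket(c):
    """Assign high/medium/low using illustrative cutoffs."""
    if c["avg_income"] < 60_000:
        return "low"
    if c["avg_income"] < 120_000:
        return "medium"
    return "high"

# Group communities by bracket, then draw from each bracket in proportion
# to its size so the 100-community sample mirrors the income distribution.
by_bracket = {}
for c in communities:
    by_bracket.setdefault(income_bracket(c), []).append(c)

sample = []
for bracket, group in by_bracket.items():
    k = round(100 * len(group) / len(communities))
    sample.extend(random.sample(group, k))

print(len(sample))  # ~100, give or take rounding per bracket
```

Proportional allocation is one of several reasonable designs; the study could equally have drawn equal numbers from each bracket to guarantee coverage of low-income communities.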
The News Measures Research Project web captures provide a unique snapshot of local news in the United States. The work is focused on analyzing the nature of local news coverage at a local level, while also examining the broader nature of local community news. At the local level, the 100 community sample provides a way to look at the nature of local news coverage. Next, a team of coders analyzed content on the archived web pages to assess what is being covered by a given news outlet. Often, the websites that serve a local community are simply aggregating content from other outlets, rather than providing unique content. The research team was most interested in understanding the degree to which local news outlets are actually reporting on topics that are pertinent to a given community (e.g. local politics). At the global level, the team looked at interaction between community news websites (e.g. sharing of content) as well as automated measures of the amount of coverage.
The primary data for the researchers was the archived local community news data, but in addition, the team worked with census data to aggregate other measures, such as circulation data for newspapers. These data allowed the team to examine how the amount and type of local news changes depending on the characteristics of the community. Because the team was using multiple datasets, the Web data is just one part of the puzzle. The WAT data format proved particularly useful in this regard: it allowed the team to examine high-level structure without needing to dig into the content of each and every WARC record. Down the road, the WARC data allows for a deeper dive, but the lighter metadata format of the WAT files has enabled early analysis.
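The WAT/WARC distinction matters in practice: a WAT record is a compact JSON envelope of metadata about a capture, so link structure between sites can be read without parsing full page bodies from the WARC. A minimal sketch of the idea, where the JSON below mimics the general WAT envelope layout (the real schema carries many more fields):

```python
import json

# A trimmed WAT-style metadata record (illustrative, not a complete envelope).
wat_record = json.loads("""
{
  "Envelope": {
    "WARC-Header-Metadata": {"WARC-Target-URI": "http://example-local-news.com/"},
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Links": [
            {"url": "http://example-local-news.com/politics"},
            {"url": "http://other-outlet.com/shared-story"}
          ]
        }
      }
    }
  }
}
""")

def outlinks(record):
    """Extract outbound link URLs from a WAT envelope, never touching page bodies."""
    html_meta = (record["Envelope"]["Payload-Metadata"]
                 ["HTTP-Response-Metadata"].get("HTML-Metadata", {}))
    return [link["url"] for link in html_meta.get("Links", [])]

# Cross-site links like the second URL are what reveal content sharing
# between community news websites.
print(outlinks(wat_record))
```

Because each WAT record is orders of magnitude smaller than the page it describes, an analysis like the team’s link-sharing measure can run over an entire crawl before anyone opens a single WARC.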
Stay tuned for more updates as research utilizing this data continues! The websites selected will continue to be archived and much of the data are publicly available.
A few hours after Sean Spicer, the White House press secretary, compared Syrian President Bashar Assad to Adolf Hitler, saying, “We didn’t use chemical weapons in World War II…You had … someone as despicable as Hitler who didn’t even sink to using chemical weapons,” the media speculation began. Where did Spicer get the idea to compare Assad to Hitler?
On Twitter, a liberal blogger named Yashar Ali pointed to a Fox News segment that had aired on April 10, featuring a Skype interview with Kassim Eid, a Syrian activist who has written about surviving an earlier gas attack, seen below on the TV News Archive. Eid said, “He displaced half of the country. He destroyed the country. He gassed women and children. Who can be worse than him? He’s worse than Hitler.”
Ali’s tweet was picked up later that afternoon by NJ.com in a report about the social media criticism following Spicer’s statement. At 4:50 p.m., Charlie Warzel, a reporter for BuzzFeed, posted a piece hypothesizing that the Fox Business News interview might have been the inspiration for Spicer’s statement.
Of course only Spicer himself knows if the Fox News report inspired his statement, which he eventually apologized for after several hours of harsh criticism. After all, he is certainly not the first public official to run into trouble when making statements about Hitler.
In an era where news no longer solely arrives on newsprint on front doorsteps, tracing the provenance of a statement, idea, story, or report across media platforms–social media, television, news websites–has become a common pursuit. This has been, perhaps, fueled by the president, who has made such references himself.
As a library, the Internet Archive can help. Our Wayback Machine preserves websites online, with more than 286 million websites saved over time. And our TV News Archive provides an online, public library with 1.3 million shows and counting. Here we have the original source for many types of statements by public officials: news conferences, appearances before congressional committees, appearances on TV news shows, and more. The 60-second segment format allows for editing your own clips up to three minutes long and makes them shareable on social media and embeddable on websites.
For example, in February, Trump made a reference at a Florida rally about Sweden: “Look at what’s happening last night in Sweden. Sweden, who would believe this? Sweden. They took in large numbers. They’re having problems like they never thought possible.” Fact-checkers reported that nothing had happened in Sweden the night before.
Trump later tweeted, however, that his statement about Swedish problems was inspired by a Fox News report.
My statement as to what’s happening in Sweden was in reference to a story that was broadcast on @FoxNews concerning immigrants & Sweden.
In that report, Fox showed an interview with a Swedish filmmaker, Ami Horowitz, who asserts that refugees are responsible for “an absolute surge in both gun violence and rape in Sweden once they began this open door policy.”
Robert Farley, a reporter for FactCheck.org, wrote that this claim is contested by “Swedish authorities and criminologists.”
Several weeks later, Trump credited a “talented legal mind” on Fox News as the source for his March 2017 tweet accusing former President Barack Obama of ordering wiretapping of Trump Tower during the presidential election.
Following Trump’s statement, Shepard Smith, chief news anchor for Fox News, said that “Fox News cannot confirm Judge Napolitano’s commentary. Fox News knows of no evidence of any kind that the president of the United States was surveilled at any time in any way, full stop.”
The question of how political rhetoric travels across media platforms goes far beyond the Trump administration. Media researchers are developing methodologies to track messages and stories as they travel across the news ecosphere. Understanding these phenomena is essential to figuring out effective ways to improve overall media literacy and fight the spread of misinformation.
As an early experiment in making such research easier, we’ve been developing hand-curated collections of statements by public officials, starting with the Trump Archive and now branching out to creating archives (still in development) for the congressional leadership on both sides of the party aisle: Senate Majority Leader Mitch McConnell, R-Ky.; Senate Minority Leader Charles Schumer, D-N.Y.; House Speaker Paul Ryan, R-Wis.; and House Minority Leader Nancy Pelosi, D-Calif.
We’re working now to develop partnerships to use machine learning approaches, such as speaker identification and natural language processing, to make our resources more useful for researchers. Ultimately, we’ll improve search to make it simpler to search across our different collections and types of media.
Hot off the internet presses, here is media analyst Kalev Leetaru’s visualization tool, fueled by Internet Archive data, which enables users to trace particular phrases used in broadcast news coverage in the first 24 hours after would-be presidential nominees appeared in the first Democratic debate of the 2016 election.
Scroll down and what sticks out immediately are the two subjects that captured most of the news broadcasters’ attention: Bernie Sanders’ “damn emails” quote and guns.
When the subject came up of the controversy over Clinton’s decision to do public work from a private email server, rather than attack Clinton, Sanders defended her:
“Let me say — let me say something that may not be great politics. But I think the secretary is right, and that is that the American people are sick and tired of hearing about your damn e-mails.”
According to Internet Archive data, that sound bite aired 496 times across stations.
The other issue that grabbed attention was gun violence: Sanders, who hails from gun-friendly rural Vermont, was called to task for his vote to make it tougher to hold gun manufacturers liable when the guns they make are used in a crime. Answering a question by CNN moderator Anderson Cooper, on whether Sanders is tough enough on guns, Clinton said:
“No, not at all. I think that we have to look at the fact that we lose 90 people a day from gun violence. This has gone on too long and it’s time the entire country stood up against the NRA. The majority of our country…(APPLAUSE)… supports background checks, and even the majority of gun owners do.”
This clip aired 260 times across stations.
However, these are just the top take-aways from this massive data crunching tool. It provides a search mechanism for the user to do deeper dives into the data and discover trends across and within certain types of news broadcasts.
Leetaru’s own analysis is here, on the Washington Post’s Monkey Cage. Among his observations:
There was also variation in how much attention each network paid to each candidate (you can see for yourself using the interactive visualization). Telemundo favored Sanders with 41 percent, followed by O’Malley with 24 percent and Clinton at just 21 percent, though admittedly, they broadcast a relatively small number of excerpts. FOX Business also favored Sanders 50 percent to Clinton’s 38 percent, as did CSPAN with Sanders at 52 percent to Clinton’s 44 percent. All other networks favored Clinton, though sometimes by a relatively close margin — like CNBC (50 percent Clinton to 43 percent Sanders) or PBS affiliates (41 percent Clinton to 38 percent Sanders).
This tool is also part of the Internet Archive’s testing of technology that we’ll use in our new Knight Foundation funded project to track political TV ads in key primary states, which will launch in early December.