Follow the Changes: 9 Ways Web Archives are Used in Digital Investigations

Guest post from Thais Lobo, Liliana Bounegru & Jonathan W. Y. Gray, King’s College London.

This work was supported by the Centre for Digital Culture and Department of Digital Humanities at King’s College London and developed further through collaborations with researchers and students at the University of Amsterdam.


Digital journalists increasingly turn to web archives like the Wayback Machine to follow how things on the Internet break, change or disappear – from deleted posts to quietly edited pages.

The web has become not only a source of information but also the subject of media investigations, prompting journalists, researchers and activists to use digital archives to reconstruct timelines, verify claims, uncover hidden connections and hold powerful actors to account.

As online materials grow more fragile and prone to disappearance, the Internet Archive’s Wayback Machine has been critical in making “lost” web pages available, and it recently celebrated archiving more than a trillion pages.

As we’ve previously written on this blog, the Wayback Machine is an important resource for our work as media researchers, helping us to trace histories of digital media objects (for example, changes in ad tracker signatures of viral “fake news” sites over time).

We are also interested in how others use web archives across fields, and what we can learn from each other.

In this piece we draw on the Internet Archive’s News Stories collection to surface practices and use cultures of the Wayback Machine amongst journalists and media organisations. We analysed a dataset of about 8,600 news articles, assembled by the IA via daily Google News keyword searches since 2018.

Drawing on a combination of digital methods, machine learning and lots of reading, we surfaced nine ways that journalists use the Wayback Machine in their reporting.

***

1. following what is deleted

Shifting political alliances are a common driver of online footprint erasure. Deleted tweets have revealed that some current allies were once critics (here and here), and current career aspirations have been juxtaposed with earlier, conflicting stances on personal blogs and websites (here, here, here and here).

Unannounced takedowns of collections or site sections on government websites often prompt investigations using archival snapshots. Examples include removed editions of presidential newsletters and deleted staff contact lists for services supporting vulnerable groups, signaling access-to-information breaches. 

The removal of official publications has also prompted further contextualisation, revealing cases in which information was deleted because it was incomplete, inaccurate or inconveniently timed.

Beyond politics, erasures on corporate websites highlight commercial and reputational pressures, such as deleted statements on forced labour, product safety and climate deception.

2. following what has been altered

Subtle alterations on webpages can also reveal a plain-to-see effort to reshape narratives.

Reporting based on archived pages shows how wording edits can move in opposite directions: from hardening language on migration ahead of a policy announcement to softening controversial statements in view of a political nomination, or erasing customer protection promises prior to a bankruptcy filing. 

In other cases, small additions to online content have proved just as revealing. Before-and-after snapshots of a blog post showed how a supposed early warning about a virus threat was added only after the pandemic began. Similarly, changes to a social media platform’s API rules appeared shortly after third-party apps were banned, subtly reframing the policy to align with new restrictions.
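For readers who want to try this kind of before-and-after comparison, here is a minimal Python sketch. It assumes the visible text of two archived snapshots has already been saved as plain strings. The sample sentences and the function name `changed_lines` are invented for illustration and are not taken from any of the investigations above.

```python
import difflib
from itertools import islice


def changed_lines(before: str, after: str) -> list[str]:
    """Return only the lines added (+) or removed (-) between two texts."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        lineterm="",
        n=0,  # no unchanged context lines, only the changes themselves
    )
    # The first two lines of a unified diff are the '---' / '+++' file header.
    body = islice(diff, 2, None)
    # Drop the '@@ ... @@' hunk markers and keep the changed lines.
    return [line for line in body if not line.startswith("@@")]


# Invented sample text standing in for two snapshots of the same page.
earlier = "We warned about flu season.\nStay safe."
later = "We warned about flu season.\nWe also warned about a novel virus.\nStay safe."

for line in changed_lines(earlier, later):
    print(line)
```

A line-based diff like this works best on text extracted from the page rather than on raw HTML, where small template changes can drown out the edits that matter.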

3. following what is banned

Sometimes removals are deliberate, often at the request of companies seeking to enforce copyright, control branding, or limit liability.

Reports from media investigations highlight how such bans can affect games (here, here, here and here), apps and technical reviews.

In some cases, the bans intersect with political pressures, such as Hong Kong news outlets being shuttered under pro‑Beijing pressure, and disinformation networks being taken down due to links to state actors.

4. following what is broken

Archived snapshots are also often the only way to reconstruct what preceded a link break, when it happened, and what information was effectively cut off.

For example, an investigation into a set of broken URLs on a government website revealed that the pages themselves had not been removed, but the links pointed to outdated servers, creating a false impression of secrecy that sparked a conspiracy theory.

In another case, a major technical glitch took multiple Nigerian government websites offline, cutting off access to official information and showing how even unintentional failures can undermine transparency.
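As a rough illustration of how one might date a link break, the Python sketch below asks the Wayback Machine’s CDX index for the capture history of a URL and looks for the first error response that follows a successful capture. It is a sketch under assumptions rather than the method used in the reporting above: the helper names are hypothetical, the status codes in the index reflect what the crawler saw at capture time, and some captures carry a “-” instead of a status code.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(page_url, fields=("timestamp", "statuscode")):
    """Build a CDX query that returns the chosen fields as JSON."""
    params = {"url": page_url, "output": "json", "fl": ",".join(fields)}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def rows_to_dicts(rows):
    """CDX JSON output is a list of rows whose first row names the fields."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def fetch_captures(page_url):
    """Download the capture history for one URL (needs network access)."""
    with urlopen(cdx_query_url(page_url), timeout=30) as response:
        return rows_to_dicts(json.load(response))


def first_failure(captures):
    """Return (last good timestamp, first failing timestamp), or None."""
    last_ok = None
    for capture in sorted(captures, key=lambda c: c["timestamp"]):
        status = capture["statuscode"]
        if status == "200":
            last_ok = capture["timestamp"]
        elif status[:1] in ("4", "5") and last_ok is not None:
            return last_ok, capture["timestamp"]
    return None
```

Calling `first_failure(fetch_captures(url))` for a page of interest would give a window in which the break was first recorded. The break itself may have happened at any point between the two captures.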

5. following what is hacked

Compromised versions of hacked websites and social media accounts present another way of using archived snapshots as a traceable historical record.

For example, past screenshots of Twitter’s bio page revealed inconsistencies in claims about an alleged takeover of the US president’s social media account. In other cases, such snapshots helped surface a forensic trail and distinguish unauthorised activity carried out by activists (here and here) from activity linked to cybercriminal groups (here).

6. following what is connected

Archived web data often uncovers unexpected links in the ownership of domains that appear unrelated on the surface.

For example, journalists used analytics codes found in copies of sites preserved by the Wayback Machine to uncover disinformation networks. In another investigation, archived records verified that a website redirect to Joe Biden’s presidential campaign was unrelated to him, debunking conspiracy theories about the domain’s ownership.
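To give a sense of how analytics codes can link sites, the Python sketch below scans saved HTML for two common identifier formats (Universal Analytics IDs such as `UA-1234567-1` and AdSense publisher IDs such as `pub-` followed by 16 digits) and reports which identifiers appear on more than one domain. The domains, IDs and function names are made up for illustration, and a shared code is only a lead to check, since templates and copied pages can carry someone else’s ID.

```python
import re
from collections import defaultdict

# Each pattern captures the account-level part of the identifier.
ID_PATTERNS = [
    re.compile(r"\b(UA-\d{4,10})-\d{1,4}\b"),  # Universal Analytics property
    re.compile(r"\b(pub-\d{16})\b"),           # AdSense publisher ID
]


def extract_ids(html: str) -> set[str]:
    """Collect every analytics or ad account ID found in a page's HTML."""
    found = set()
    for pattern in ID_PATTERNS:
        found.update(match.group(1) for match in pattern.finditer(html))
    return found


def shared_ids(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each ID to the domains using it, keeping IDs seen on 2+ domains."""
    domains_by_id = defaultdict(set)
    for domain, html in pages.items():
        for tracker_id in extract_ids(html):
            domains_by_id[tracker_id].add(domain)
    return {
        tracker_id: domains
        for tracker_id, domains in domains_by_id.items()
        if len(domains) > 1
    }


# Invented HTML fragments standing in for archived copies of three sites.
archived_pages = {
    "site-a.example": "<script>ga('create', 'UA-1234567-1');</script>",
    "site-b.example": "<script>ga('create', 'UA-1234567-2');</script>",
    "site-c.example": "<ins data-ad-client='ca-pub-1234567890123456'></ins>",
}
print(shared_ids(archived_pages))
```

Grouping on the account part of the ID (`UA-1234567`) rather than the full property ID is a design choice: it treats different properties registered under the same account as connected.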

Snapshots of a fake Black Lives Matter Facebook page and its associated websites allowed reporters to trace the individuals behind the operation. Similarly, archived versions of Amazon storefronts exposed networks of accounts generating affiliate revenue from coordinated product listings.

7. following what is reported

Archived web pages have proven vital for tracing how stories are presented across media outlets and platforms.

Investigations have examined archived versions of individual pages, such as headline coverage relying heavily on unverified claims, a news agency’s premature editorial assessment, or the unflagging of branded content.

In another case, snapshots of the Google homepage captured during the 2018 State of the Union speech disproved a viral claim that Google ignored Donald Trump’s address in favour of Barack Obama.

8. following what is unchanged

In other investigations, the most revealing detail is what did not change.

For example, during a bushfire crisis in Australia, archived pages showed that a key policy statement by the Greens party was left untouched, despite a disinformation campaign claiming the contrary.

Similarly, a social media account reported as having been reactivated under a new wave of laissez-faire moderation had, in fact, never been suspended.
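One way to check for “no change” at scale is to compare the content digests that the Wayback Machine’s CDX index stores for each capture. The Python sketch below assumes the captures have already been fetched as dictionaries with `timestamp` and `digest` fields; the function names and sample values are invented. A digest changes when any byte of the page changes, so identical digests are strong evidence that nothing changed, while differing digests on a dynamic page do not by themselves show that a particular statement was edited.

```python
from itertools import groupby


def stable_runs(captures):
    """Collapse captures into runs of identical content.

    Returns a list of (first_timestamp, last_timestamp, digest) tuples,
    ordered by time.
    """
    ordered = sorted(captures, key=lambda c: c["timestamp"])
    runs = []
    for digest, group in groupby(ordered, key=lambda c: c["digest"]):
        stamps = [capture["timestamp"] for capture in group]
        runs.append((stamps[0], stamps[-1], digest))
    return runs


def unchanged_between(captures, start, end):
    """True if at least two captures fall in the window and all match.

    Timestamps are 14-digit strings (YYYYMMDDhhmmss), so plain string
    comparison orders them correctly.
    """
    window = [c for c in captures if start <= c["timestamp"] <= end]
    return len(window) >= 2 and len({c["digest"] for c in window}) == 1


# Invented capture records; real digests are longer hash strings.
sample = [
    {"timestamp": "20191101000000", "digest": "AAA"},
    {"timestamp": "20191201000000", "digest": "AAA"},
    {"timestamp": "20200115000000", "digest": "BBB"},
]
print(stable_runs(sample))
print(unchanged_between(sample, "20191101000000", "20191231235959"))
```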

9. following what is saved 

When forums, platforms and websites vanish, it is the work of crowdsourced archivists that captures their traces before they disappear for good.

In several reported cases, users raced to preserve spaces such as a long-running forum for sex workers, a 16-year-old Q&A site, a meme-sharing platform, and a free music library.

Archiving web pages can become part of the story.

***

These are some of the ways we’ve noticed journalists using web archives – and there are many more! If you know of other interesting examples, we’d love to hear from you.

We hope that these nine ways may help to inspire critical and creative uses of web archives to “follow the changes” – exploring what they can tell us about digital culture and society, and the times we live in.


About the authors

Thais Lobo is a research associate at the Department of Digital Humanities, King’s College London, with a previous career in journalism.

Jonathan W. Y. Gray is Co-director of the Centre for Digital Culture and Reader in Critical Infrastructure Studies at the Department of Digital Humanities, King’s College London. He is also co-founder of the Public Data Lab and a research associate at the Digital Methods Initiative (University of Amsterdam) and the médialab (Sciences Po, Paris). More about his work can be found at jonathangray.org.

Liliana Bounegru is Senior Lecturer (Associate Professor) in Digital Media, Culture and Society at the Department of Digital Humanities, King’s College London. She is also co-founder of the Public Data Lab, member of the Digital Methods Initiative at the University of Amsterdam and associate of the Sciences Po Paris médialab. More about her work can be found at lilianabounegru.org.

Library as Laboratory Recap: Opening Television News for Deep Analysis and New Forms of Interactive Search

Watching a single episode of the evening news can be informative. Tracking trends in broadcasts over time can be fascinating. 

The Internet Archive has preserved nearly 3 million hours of U.S. local and national TV news shows and made the material open to researchers for exploration and non-consumptive computational analysis. At a webinar on April 13, TV News Archive experts shared how they’ve curated the massive collection and leveraged technology so scholars, journalists and the general public can make use of the vast repository.

Roger Macdonald, founder of the TV News Archive, and Kalev Leetaru, collaborating data scientist and GDELT Project founder, spoke at the session. Chris Freeland, director of Open Libraries, served as moderator and Internet Archive founder Brewster Kahle offered opening remarks.

“Growing up in the television age, [television] is such an influential, important medium—persuasive, yet not something you can really quote,” Kahle said. “We wanted to make it so that you could quote, compare and contrast.” 

The Internet Archive built on the work of the Vanderbilt Television Archive and the UCLA Library Broadcast NewsScape to give the public a broader “macro view,” said Kahle. The trends seen in at-scale computational analyses of news broadcasts can be used to understand the bigger picture of what is happening in the world and the lenses through which we see the world around us.

In 2012, with donations from individuals and philanthropies such as the Knight Foundation, the Archive started repurposing the closed captioning data stream required of all U.S. broadcasters into a search index. “This simple approach transformed the antiquated experience of searching for specific topics within video,” said Macdonald, who helped lead the effort. “The TV caption search enabled discovery at internet speed with the ability to simultaneously search millions of programs and have your results plotted over time, down to individual broadcasters and programs.”

Scholars and journalists were quick to embrace this opportunity, but the team kept experimenting with deeper indexing. Techniques like audio fingerprinting, Optical Character Recognition (OCR) and Computer Vision made it possible to capture visual elements of the news and improve access, Macdonald said. 

Sub-collections of political leaders’ speeches and interviews have been created, including an extensive Donald Trump Archive. Some of the Archive’s most productive advances have come from collaborating with outsiders who have requested more access to the collection than is available through the public interface, Macdonald said. With appropriate restrictions to maintain respect for broadcasters and distribution platforms, the Archive has worked with select scientists and journalists as partners to use data in the collection for more complex analyses.

Treating television as data

Treating television news as data creates vast opportunities for computational analysis, said Leetaru. Researchers can track how often words are used in the news and how that has changed over time. For instance, it’s possible to look at mentions of COVID-related words across selected news programs and see when they surged and leveled off with each wave before plummeting, as shown in the graph below.

The newly computed metadata can help provide context and assist with fact-checking efforts to combat misinformation. It can allow researchers to map the geography of television news, showing how certain parts of the world are covered more than others, Leetaru said. Through the collections, researchers have explored which presidential tweets challenging election integrity got the most exposure on the news. OCR of every frame has been used to create models that identify the name of every “Dr.” depicted on cable TV after the outbreak of COVID-19 and calculate the air time devoted to medical doctors commenting on one of the virus variants. Reverse image lookup of images in TV news has been used to determine the source of photos and videos. Visual entity search tools can even reveal the increasing prevalence of bookshelves as backdrops during home interviews in the pandemic, as well as appearances of books by specific authors or titles. Open datasets of computed TV news metadata are available that include all visual entity and OCR detections, 10-minute interval captioning ngrams and second-by-second inventories of each broadcast cataloging whether it was “News” programming, “Advertising” programming or “Uncaptioned” (in the case of television news this is almost exclusively advertising).
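The word-frequency tracking described above can be sketched in a few lines of Python. This is a simplified illustration, not the TV News Archive’s own pipeline: it assumes caption text is available as `(date, text)` pairs with ISO-formatted dates, and the caption snippets and the function name `monthly_mentions` are invented.

```python
import re
from collections import Counter


def monthly_mentions(captions, terms):
    """Count whole-word, case-insensitive mentions of any term per month.

    captions: iterable of (iso_date, caption_text), e.g. ("2020-03-01", "...")
    terms:    list of words to look for
    Returns a Counter keyed by "YYYY-MM".
    """
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    counts = Counter()
    for iso_date, text in captions:
        month = iso_date[:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        counts[month] += len(pattern.findall(text))
    return counts


# Invented caption snippets for illustration.
sample_captions = [
    ("2020-03-01", "Covid cases rise as COVID testing expands."),
    ("2020-03-15", "The coronavirus briefing begins shortly."),
    ("2020-04-02", "A Covid update from the governor."),
]
for month, count in sorted(monthly_mentions(sample_captions, ["covid", "coronavirus"]).items()):
    print(month, count)
```

Plotting these monthly counts over a longer period would give the kind of surge-and-decline curve described above. Raw counts also depend on how many hours of programming were captured each month, so a real analysis would normally normalise by airtime.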

From television news to digitized books and periodicals, dozens of projects rely on the collections available at archive.org for computational and bibliographic research across a large digital corpus. Data scientists or anyone with questions about the TV News Archive can contact info@archive.org.

Up Next

This webinar was the fourth in a series of six sessions highlighting how researchers in the humanities use the Internet Archive. The next will be about Analyzing Biodiversity Literature at Scale on April 27. Register here.

Library as Laboratory Recap: Supporting Computational Use of Web Collections

For scholars, especially those in the humanities, the library is their laboratory. Published works and manuscripts are their materials of science. Today, to do meaningful research, that also means having access to modern datasets that facilitate data mining and machine learning.

On March 2, the Internet Archive launched a new series of webinars highlighting its efforts to support data-intensive scholarship and digital humanities projects. The first session focused on the methods and techniques available for analyzing web archives at scale.

“If we can have collections of cultural materials that are useful in ways that are easy to use — still respectful of rights holders — then we can start to get a bigger idea of what’s going on in the media ecosystem,” said Internet Archive Founder Brewster Kahle.

Just what can be done with billions of archived web pages? The possibilities are endless. 

Jefferson Bailey, Internet Archive’s Director of Web Archiving & Data Services, and Helge Holzmann, Web Data Engineer, shared some of the technical issues libraries should consider and tools available to make large amounts of digital content available to the public.

The Internet Archive gathers information from the web through different methods including global and domain crawling, data partnerships and curation services. It preserves different types of content (text, code, audio-visual) in a variety of formats.

Learn more about the Library as Laboratory series & register for upcoming sessions.

Social scientists, data analysts, historians and literary scholars make requests for data from the web archive for computational use in their research. Institutions use its service to build small and large collections for a range of purposes. Sometimes the projects can be complex and it can be a challenge to wrangle the volume of data, said Bailey.

The Internet Archive has worked on a project reviewing changes to the content of 800,000 corporate home pages since 1996. It has also mined data for a language analysis project, producing custom extractions for Icelandic, Norwegian and Irish translation.

Transforming data into useful information requires data engineering. As librarians consider how to respond to inquiries for data, they should look at their tech resources, workflow and capacity. While such datasets are more complicated to produce, their potential has expanded given the size, scale and longitudinal analysis now possible.

“We are getting more and more computational use data requests each year,” Bailey said. “If librarians, archivists, cultural heritage custodians haven’t gotten these requests yet, they will be getting them soon.”

Up next in the Library as Laboratory series:

The next webinar in the series will be held March 16, and will highlight five innovative web archiving research projects from the Archives Unleashed Cohort Program. Register now.