Library as Laboratory: A New Series Exploring the Computational Use of Internet Archive Collections

From web archives to television news to digitized books & periodicals, dozens of projects rely on the collections available at archive.org for computational & bibliographic research across a large digital corpus. This series will feature six sessions highlighting the innovative scholars who are using Internet Archive collections, services and APIs to support data-driven projects in the humanities and beyond.

Many thanks to the program advisory group:

  • Dan Cohen, Vice Provost for Information Collaboration and Dean, University Library and Professor of History, Northeastern University
  • Makiba Foster, Library Regional Manager for the African American Research Library and Cultural Center, Broward County Library
  • Mike Furlough, Executive Director, HathiTrust
  • Harriett Green, Associate University Librarian for Digital Scholarship and Technology Services, Washington University Libraries

Session Details:

March 2 @ 11am PT / 2pm ET

Supporting Computational Use of Web Collections
Jefferson Bailey, Internet Archive
Helge Holzmann, Internet Archive

What can you do with billions of archived web pages? In our kickoff session, Jefferson Bailey, Internet Archive’s Director of Web Archiving & Data Services, and Helge Holzmann, Web Data Engineer, will take attendees on a tour of the methods and techniques available for analyzing web archives at scale. 

Read the session recap & watch the video:


March 16 @ 11am PT / 2pm ET

Applications of Web Archive Research with the Archives Unleashed Cohort Program

Launched in 2020, the Cohort program engages researchers in a year-long collaboration and mentorship with the Archives Unleashed Project and the Internet Archive to support web archival research.

Web archives provide a rich resource for exploration and discovery! This session will feature the program’s inaugural research teams, who will discuss the innovative ways they are exploring web archival collections to tackle interdisciplinary topics and methodologies. Projects from the Cohort program include:

  • AWAC2 — Analysing Web Archives of the COVID Crisis through the IIPC Novel Coronavirus dataset—Valérie Schafer (University of Luxembourg)
  • Everything Old is New Again: A Comparative Analysis of Feminist Media Tactics between the 2nd- to 4th Waves—Shana MacDonald (University of Waterloo)
  • Mapping and tracking the development of online commenting systems on news websites between 1996–2021—Robert Jansma (University of Siegen)
  • Crisis Communication in the Niagara Region during the COVID-19 Pandemic—Tim Ribaric (Brock University)
  • Viral health misinformation from Geocities to COVID-19—Shawn Walker (Arizona State University)

UPDATE: Quinn Dombrowski from Saving Ukrainian Cultural Heritage Online (SUCHO) will give an introductory presentation about the team of volunteers racing to archive Ukrainian digital cultural heritage.

Read the session recap & watch the video:


March 30 @ 11am PT / 2pm ET

Hundreds of Books, Thousands of Stories: A Guide to the Internet Archive’s African Folktales
Laura Gibbs, Educator, writer & bibliographer
Helen Nde, Historian & writer

Join educator & bibliographer Laura Gibbs and researcher, writer & artist Helen Nde as they give attendees a guided tour of the African folktales in the Internet Archive’s collection. Laura will share her favorite search tips for exploring the treasure trove of books at the Internet Archive, and how to share the treasures you find with colleagues, students, and fellow readers in the form of a digital bibliography guide. Helen will share how she uses the Internet Archive’s collections to tell the stories of individuals and cultures that aren’t often represented online through her work at Mythological Africans (@MythicAfricans). Helen will explore how she uses technology to continue the African storytelling tradition in spoken form, and she will discuss the impacts on the online communities that she is able to reach.

Read the session recap & watch the video:


April 13 @ 11am PT / 2pm ET

Television as Data: Opening TV News for Deep Analysis and New Forms of Interactive Search
Roger Macdonald, Founder, TV News Archive
Kalev Leetaru, Data Scientist, GDELT

How can treating television news as data create fundamentally new kinds of opportunities for both computational analysis of influential societal narratives and the creation of new kinds of interactive search tools? How could derived (non-consumptive) metadata be open-access and respectful of content creator concerns? How might specific segments be contextualized by linking them to related analysis, like professional journalist fact checking? How can tools like OCR, AI language analysis and knowledge graphs generate terabytes of annotations making it possible to search television news in powerful new ways?

For nearly a decade, the Internet Archive’s TV News Archive has enabled closed captioning keyword search of a growing archive that today spans nearly three million hours of U.S. local and national TV news (2,239,000+ individual shows) from mid-2009 to the present. This public interest library is dedicated to helping journalists, scholars, and the public compare, contrast, cite, and borrow specific portions of the collection. Using a range of algorithmic approaches, users are moving beyond simple captioning search towards rich analysis of the visual side of television news.

In this session, Roger Macdonald, founder of the TV News Archive, and Kalev Leetaru, collaborating data scientist and GDELT Project founder, will report on experiments applying full-screen OCR, machine vision, speech-to-text and natural language processing to assist exploration, analysis and data visualization of this vast television repository. They will survey the resulting open metadata datasets and demonstrate the public search tools and APIs they’ve created, which enable powerful new forms of interactive search of television news, and show what it looks like to ask questions of more than a decade of television news.

Read the session recap & watch the video:


April 27 @ 11am PT / 2pm ET

Analyzing Biodiversity Literature at Scale
Martin R. Kalfatovic, Smithsonian Library & Archives
JJ Dearborn, Biodiversity Heritage Library Data Manager

Imagine the great library of life, the library that Charles Darwin said was necessary for the “cultivation of natural science” (1847). And imagine that this library is not just hundreds of thousands of books printed from 1500 to the present, but also the data contained in those books that represents all that we know about life on our planet. That library is the Biodiversity Heritage Library (BHL). The Internet Archive has provided an invaluable platform for the BHL to liberate taxonomic names, species descriptions, habitat descriptions and much more. Connecting and harnessing the disparate data from over five centuries is now BHL’s grand challenge. The unstructured textual data generated at the point of digitization holds immense untapped potential. Tim Berners-Lee provided the world with a semantic roadmap to address this global deluge of dark data and Wikidata is now executing on his vision. As we speak, BHL’s data is undergoing rapid transformation from legacy formats into linked open data, fulfilling the promise to evaporate data silos and foster bioliteracy for all humankind.

Martin R. Kalfatovic (BHL Program Director and Associate Director, Smithsonian Library and Archives) and JJ Dearborn (BHL Data Manager) will explore how books in BHL become data for the larger biodiversity community.

Watch the video:


May 11 @ 11am PT / 2pm ET

Lightning Talks
In this final session of the Internet Archive’s digital humanities expo, Library as Laboratory, you’ll hear from scholars in a series of short presentations about their research and how they’re using collections and infrastructure from the Internet Archive for their work.

Watch the session recording:

Talks include:

  • Forgotten Histories of the Mid-Century Coding Bootcamp, [watch] Kate Miltner (University of Edinburgh)
  • Japan As They Saw It, [watch] Tom Gally (University of Tokyo)
  • The Bibliography of Life, [watch] Rod Page (University of Glasgow)
  • Q&A #1 [watch]
  • More Than Words: Fed Chairs’ Communication During Congressional Testimonies, [watch] Michelle Alexopoulos (University of Toronto)
  • WARC Collection Summarization, [watch] Sawood Alam (Internet Archive)
  • Automatic scanning with an Internet Archive TT scanner, [watch] Art Rhyno (University of Windsor)
  • Q&A #2 [watch]
  • Automated Hashtag Hierarchy Generation Using Community Detection and the Shannon Diversity Index, [watch] Spencer Torene (Thomson Reuters Special Services, LLC)
  • My Internet Archive Enabled Journey As A Digital Humanities Citizen Scientist, [watch] Jim Salmons
  • Web and cities: (early internet) geographies through the lenses of the Internet Archive, [watch] Emmanouil Tranos (University of Bristol)
  • Forgotten Novels of the 19th Century, [watch] Tom Gally (University of Tokyo)
  • Q&A #3 [watch]