We are excited to announce the public availability of ARCH (Archives Research Compute Hub), a new research and education service that helps users easily build, access, and analyze digital collections computationally at scale. ARCH represents a combination of the Internet Archive’s experience supporting computational research for more than a decade by providing large-scale data to researchers and dataset-oriented service integrations like ARS (Archive-it Research Services) and a collaboration with the Archives Unleashed project of the University of Waterloo and York University. Development of ARCH was generously supported by the Mellon Foundation.
What does ARCH do?
ARCH helps users easily conduct and support computational research with digital collections at scale – e.g., text and data mining, data science, digital scholarship, machine learning, and more. Users can build custom research collections relevant to a wide range of subjects, generate and access research-ready datasets from collections, and analyze those datasets. In line with best practices in reproducibility, ARCH supports open publication and preservation of user-generated datasets. ARCH is currently optimized for working with tens of thousands of web archive collections, covering a broad range of subjects, events, and timeframes, and the platform is actively expanding to include digitized text and image collections. ARCH also works with various portions of the overall Wayback Machine global web archive totaling 50+ PB going back to 1996, representing an extensive archive of contemporary history and communication.
ARCH, In-Browser Visualization
Who is ARCH for?
ARCH is for any user that seeks an accessible approach to working with digital collections computationally at scale. Possible users include but are not limited to researchers exploring disciplinary questions, educators seeking to foster computational methods in the classroom, journalists tracking changes in web-based communication over time, to librarians and archivists seeking to support the development of computational literacies across disciplines. Recent research efforts making use of ARCH include but are not limited to analysis of COVID-19 crisis communications, health misinformation, Latin American women’s rights movements, and post-conflict societies during reconciliation.
ARCH, Generate Datasets
What are core ARCH features?
Build: Leverage ARCH capabilities to build custom research collections that are well scoped for specific research and education purposes.
Access: Generate more than a dozen different research-ready datasets (e.g., full text, images, pdfs, graph data, and more) from digital collections with the click of a button. Download generated datasets directly in-browser or via API.
Analyze: Easily work with research-ready datasets in interactive computational environments and applications like Jupyter Notebooks, Google CoLab, Gephi, and Voyant and produce in-browser visualizations.
Publish and Preserve: Openly publish datasets in line with best practices in reproducible research. All published datasets will be preserved in perpetuity.
Support: Make use of synchronous and asynchronous technical support, online trainings, and extensive help center documentation.
How can I learn more about ARCH?
To learn more about ARCH please reach out via the following form.
From web archives to television news to digitized books & periodicals, dozens of projects rely on the collections available at archive.org for computational & bibliographic research across a large digital corpus. This series will feature six sessions highlighting the innovative scholars that are using Internet Archive collections, services and APIs to support data-driven projects in the humanities and beyond.
Many thanks to the program advisory group:
Dan Cohen, Vice Provost for Information Collaboration and Dean, University Library and Professor of History, Northeastern University
Makiba Foster, Library Regional Manager for the African American Research Library and Cultural Center, Broward County Library
Mike Furlough, Executive Director, HathiTrust
Harriett Green, Associate University Librarian for Digital Scholarship and Technology Services, Washington University Libraries
March 2 @ 11am PT / 2pm ET
Supporting Computational Use of Web Collections Jefferson Bailey, Internet Archive Helge Holzmann, Internet Archive
What can you do with billions of archived web pages? In our kickoff session, Jefferson Bailey, Internet Archive’s Director of Web Archiving & Data Services, and Helge Holzmann, Web Data Engineer, will take attendees on a tour of the methods and techniques available for analyzing web archives at scale.
Applications of Web Archive Research with the Archives Unleashed Cohort Program
Launched in 2020, the Cohort program is engaging with researchers in a year-long collaboration and mentorship with the Archives Unleashed Project and the Internet Archive, to support web archival research.
Web archives provide a rich resource for exploration and discovery! As such, this session will feature the program’s inaugural research teams, who will discuss the innovative ways they are exploring web archival collections to tackle interdisciplinary topics and methodologies. Projects from the Cohort program include:
AWAC2 — Analysing Web Archives of the COVID Crisis through the IIPC Novel Coronavirus dataset—Valérie Schafer (University of Luxembourg)
Everything Old is New Again: A Comparative Analysis of Feminist Media Tactics between the 2nd- to 4th Waves—Shana MacDonald (University of Waterloo)
Mapping and tracking the development of online commenting systems on news websites between 1996–2021—Robert Jansma (University of Siegen)
Crisis Communication in the Niagara Region during the COVID-19 Pandemic—Tim Ribaric (Brock University)
Viral health misinformation from Geocities to COVID-19—Shawn Walker (Arizona State University)
UPDATE: Quinn Dombrowski from Saving Ukrainian Cultural Heritage Online (SUCHO) will give an introductory presentation about the team of volunteers racing to archive Ukrainian digital cultural heritage.
Hundreds of Books, Thousands of Stories: A Guide to the Internet Archive’s African Folktales Laura Gibbs, Educator, writer & bibliographer Helen Nde, Historian & writer
Join educator & bibliographer Laura Gibbs and researcher, writer & artist Helen Nde as they give attendees a guided tour of the African folktales in the Internet Archive’s collection. Laura will share her favorite search tips for exploring the treasure trove of books at the Internet Archive, and how to share the treasures you find with colleagues, students, and fellow readers in the form of a digital bibliography guide. Helen will share how she uses the Internet Archive’s collections to tell the stories of individuals and cultures that aren’t often represented online through her work at Mythological Africans (@MythicAfricans). Helen will explore how she uses technology to continue the African storytelling tradition in spoken form, and she will discuss the impacts on the online communities that she is able to reach.
Television as Data: Opening TV News for Deep Analysis and New Forms of Interactive Search Roger MacDonald, Founder, TV News Archive Kalev Leetaru, Data Scientist, GDELT
How can treating television news as data create fundamentally new kinds of opportunities for both computational analysis of influential societal narratives and the creation of new kinds of interactive search tools? How could derived (non-consumptive) metadata be open-access and respectful of content creator concerns? How might specific segments be contextualized by linking them to related analysis, like professional journalist fact checking? How can tools like OCR, AI language analysis and knowledge graphs generate terabytes of annotations making it possible to search television news in powerful new ways?
For nearly a decade, the Internet Archive’s TV News Archive has enabled closed captioning keyword search of a growing archive that today spans nearly three million hours of U.S. local and national TV news (2,239,000+ individual shows) from mid-2009 to the present. This public interest library is dedicated to facilitating journalists, scholars, and the public to compare, contrast, cite, and borrow specific portions of the collection. Using a range of algorithmic approaches, users are moving beyond simple captioning search towards rich analysis of the visual side of television news. In this session, Roger Macdonald, founder of the TV News Archive, and Kalev Leetaru, collaborating data scientist and GDELT Project founder, will report on experiments applying full-screen OCR, machine vision, speech-to-text and natural language processing to assist exploration, analyses and data-visualization of this vast television repository. They will survey the resulting open metadata datasets and demonstrate the public search tools and APIs they’ve created that enable powerful new forms of interactive search of television news and what it looks like to ask questions of more than a decade of television news.
Analyzing Biodiversity Literature at Scale Martin R. Kalfatovic, Smithsonian Library & Archives JJ Dearborn, Biodiversity Heritage Library Data Manager
Imagine the great library of life, the library that Charles Darwin said was necessary for the “cultivation of natural science” (1847). And imagine that this library is not just hundreds of thousands of books printed from 1500 to the present, but also the data contained in those books that represents all that we know about life on our planet. That library is the Biodiversity Heritage Library (BHL) The Internet Archive has provided an invaluable platform for the BHL to liberate taxonomic names, species descriptions, habitat description and much more. Connecting and harnessing the disparate data from over five-centuries is now BHL’s grand challenge. The unstructured textual data generated at the point of digitization holds immense untapped potential. Tim Berners-Lee provided the world with a semantic roadmap to address this global deluge of dark data and Wikidata is now executing on his vision. As we speak, BHL’s data is undergoing rapid transformation from legacy formats into linked open data, fulfilling the promise to evaporate data silos and foster bioliteracy for all humankind.
Martin R. Kalfatovic (BHL Program Director and Associate Director, Smithsonian Library and Archives) and JJ Dearborn (BHL Data Manager) will explore how books in BHL become data for the larger biodiversity community.
Watch the video:
May 11 @ 11am PT / 2pm ET
Lightning Talks In this final session of the Internet Archive’s digital humanities expo, Library as Laboratory, you’ll hear from scholars in a series of short presentations about their research and how they’re using collections and infrastructure from the Internet Archive for their work.
Watch the session recording:
Forgotten Histories of the Mid-Century Coding Bootcamp, [watch] Kate Miltner (University of Edinburgh)
Japan As They Saw It, [watch] Tom Gally (University of Tokyo)
The Bibliography of Life, [watch] Rod Page (University of Glasgow)