Tag Archives: web archiving

Rural Libraries Receive Support from the Internet Archive to Preserve Community Stories

Public librarians are shaping the future of the historical record. As experts in community knowledge dedicated to serving local information needs, these librarians are uniquely positioned to preserve and provide access to their community’s stories. Since 2017, Internet Archive’s Community Webs program has provided training, support, and services to empower public libraries to preserve local digital heritage. 

For rural public libraries, this crucial work can be particularly challenging. While a range of cultural heritage institutions may play a role in local preservation initiatives focused on larger communities, the public library may be the only organization engaging in this work in a rural area. Resource constraints, however, make it difficult for rural libraries to take on new initiatives, and they may lack access to the tools, training, and technology to support these efforts. Yet documenting history as it happens in these communities is essential for ensuring a more complete historical record. Without participation from rural libraries, these local stories may go untold, unheard, and undocumented.

Librarians from rural and small libraries across the country gathered in Albuquerque for a workshop hosted by Internet Archive’s Community Webs program.

In response to these challenges and opportunities, Internet Archive has recently focused on recruiting rural libraries into the Community Webs program, providing them with access to web archiving and digital preservation services as well as training and support at no cost. On September 20th, a group of these program members from across the country came together to learn about practical methods and accessible resources that can be used to document, preserve, and share local history in rural communities. Hosted in conjunction with the Association for Rural and Small Libraries Conference in Albuquerque, New Mexico, the event was an opportunity for participants to work with Internet Archive staff and their peers from similar institutions to develop plans for implementing community-focused preservation initiatives. 

A screenshot from a website captured by workshop participant Belen Public Library. Belen, New Mexico is south of Albuquerque with a population under 8,000.

Over the course of the workshop, participants learned strategies for developing community partnerships, providing access to digital collections, and ensuring long term preservation of digital assets. Participatory preservation initiatives such as community scanning days and oral history programs were also covered. Particular attention was paid to the preservation of web-based local content. From the websites of community organizations to local news sites to neighborhood blogs, web archiving is critical for libraries working to preserve their community’s story as it unfolds. Attendees learned how to use Archive-It to both preserve and provide access to web archive collections. They then brainstormed about what local online information possessed enduring value for their current and future community members. Many attendees cited local newspapers that had moved to online-only distribution, town or county government webpages, and online information about community resources and services as content they would include in their web archives. 

Internet Archive will continue to offer support through the Community Webs program for these libraries as they take what they learned in this workshop and begin to apply it locally. Thank you to the Mellon Foundation whose support allows our team to host events like this and continue to expand the Community Webs program. We also wish to thank all of the libraries that participated in our recent workshop:

Asotin Public Library (Washington), Belen Public Library (New Mexico), Cairo Public Library (New York), Charlotte Public Library (Vermont), Dodge Center Public Library (Minnesota), Hillsboro Community Library (New Mexico), Holbrook Public Library (Massachusetts), Jemez Springs Public Library (New Mexico), Kendall Young Library (Iowa), Middlebury Public Library (Indiana), Milltown Public Library (New Jersey), Mount Pleasant Public Library (Texas), Randolph County Public Libraries (North Carolina), Salem-South Lyon District Library (Michigan), Scott County Library System (Iowa), Smithville Public Library (Texas), Sweet Home Public Library (Oregon), Van Horn Public Library (Minnesota), Westford Public Library (Vermont), and Yavapai County Free Library District (Arizona)

Interested in learning more about Community Webs? Explore Community Webs collections, read the latest program news, or apply to join!

Archiving Resilience: How a Public Library Preserved Their Community’s Response to a Local Disaster

The following guest post from Joanna Kolosov, Librarian and Archivist at the Sonoma County Library in California, is part of a series written by members of Internet Archive’s Community Webs program. Community Webs advances the capacity of community-focused memory organizations to build web and digital archives documenting local histories.

Sonoma County Library joined Community Webs back in 2017, the same year the North San Francisco Bay was hit by devastating wildfires. Realizing that many of the stories, videos, and pieces of information about the emergency response, aftermath, and recovery efforts were being shared online and constantly changing, we sensed the urgency to capture stories as the crisis unfolded and the community navigated new territory. We received our new Archive-It account and started learning by doing, creating the “North Bay Fires, 2017” collection.

One of the first websites we archived was cartoonist Brian Fies’ blog, The Fies Files, where he posted a webcomic that he penned in the days following the fire that consumed his house and much of the neighborhood of Coffey Park. He later published it as a graphic memoir called A Fire Story. In preserving the first draft from his blog, we have also saved the numerous comments elicited by his powerful and intimate account.

Screenshots from an archived blogpost by Brian Fies—“A Fire Story, COMPLETE,” The Fies Files, 15 October 2017.

Also included in the North Bay Fires collection is a video by Sutter Health recounting how staff at the Santa Rosa Regional Hospital came together to evacuate the hospital in the early hours of October 9th. Combining firsthand accounts and security camera footage, Firestorm: The First Hours shows healthcare workers rising to the challenge of an unprecedented emergency.

Sutter Health’s Heroes Among Us interview project.


The collection also features websites of volunteer-run groups that sprang up to meet the needs of their communities, providing essential information about cleanup and rebuilding, disaster preparedness, and disaster relief. Some examples include Coffey Strong, a site that provided the community with resources on comparing builders, debris removal, and landscaping. Fire Safe Occidental included evacuation and cell coverage maps as well as a wildfire action plan. UndocuFund.org was the online presence of a mutual aid project set up to help the county’s most vulnerable residents.

Some of the archived content in the collection reflects on past wildfire disasters, such as “The Forgotten Fires of Fountaingrove and Coffey Park,” a blog post by the late Jeff Elliott, author of SantaRosaHistory.com, who places the fire phenomenon in its broader historical context. Reporters Eric Sagara and Patrick Michaels traced the development of unchecked growth in the wildfire path in the March 14, 2018 episode of Reveal’s podcast “Built to Burn.” Stepping back even further allows us to consider the history of the landscape. In a video posted by staff of the Bouverie Preserve, fire ecologist Sasha Berleman contrasts past policies of fire suppression with a deeper understanding of fire’s impact on ecosystems, grounded in Indigenous knowledge and stewardship. The archive also documented the aftermath in the years following the fires, showing evidence of how the community continued to regroup, remember, and recover.

An archived tweet from the Santa Rosa Fire Department at the one-year anniversary

Following the Los Angeles fires of January 2025, this collection has taken on new meaning as an archive of resilience and hope, offering testimonies of recovery and regrowth for LA fire survivors. 

The experience of documenting the 2017 wildfires prepared us for preserving Sonoma County’s response to the COVID-19 pandemic beginning in 2020. The Sonoma Responds project was an online archive that invited our community to collectively build the historical record of living through COVID-19, the Black Lives Matter movement, and the impacts of these events on daily life locally. Members of the public could upload a photo, audio/video file, or PDF that embodied their experiences and impressions of life in lockdown. We also encouraged people to nominate a website, webpage, blog post, news article, or online video for inclusion in the web archives. While we expected to receive links to news articles and the like, most submissions came from content creators nominating their own music videos, journals, and blogs. These included singer/songwriter Chris Herrod’s album, I Don’t Play Xmas Songs, I Play Coronavirus Songs (watch all 10 tracks by clicking the “play” button in the Wayback banner at the top of the page). Michael Mann created a series of live journal entries on his blog “riding the viral apocalypse” that documented the happenings of pandemic life, from the mundane to the surreal. “Book of Days: A Covid Kitchen Chronicle” was created by Liat Goldman Douglas, who described herself as “a mom and elementary school teacher presently working with a neighborhood Pandemic Pod of Tk-2nd graders; baking my way through and sharing my story as I go.” 

An image from the archived page “Book of Days: A Covid Kitchen Chronicle” by Liat Goldman Douglas

Another notable submission encapsulating that time was a crowdsourced list of Black-owned restaurants and businesses in Sonoma County, an effort that has since been expanded to include Native, POC-immigrant, and people of color-owned businesses.

Screenshot of collectively created directory, archived 23 October 2020

Now more than ever, we recognize and appreciate the value of preserving the web to ensure that reliable sources of information, vital pieces of the historical record, endure. To that end, the library is embarking on a new collection—Community Roots/Raíces Comunitarias—a shift from event collecting to preserving the websites of local organizations who work to support the needs and aspirations of marginalized groups.

This change in focus warrants a new approach to collecting, as we seek permission from organizations to archive their web content. This requires us to be intentional and transparent about our collecting. This accountability acknowledges the asymmetrical relationship between archival institutions and communities of color that has led to mistrust, silencing, and harm; it is vital in maintaining equitable partnerships. It is also an opportunity to let local organizations know who we are and the preservation work we have been doing. 

We hope this opens a dialogue and leads to future collaboration. At the very least, it is a chance for the library to say, “What you are doing in our community matters, and the library is here to support, celebrate and further your work.” So far, we’ve received an enthusiastic response from organizations such as Positive Images, an LGBTQIA+ Community Center, and the North Bay Organizing Project, a social justice coalition.

Explore the web archives of Sonoma County Library.

CARTA: A Collective Approach to the Preservation of Online Art Resources

Art historians, critics, curators, and humanities scholars rely on the records of artists, galleries, museums, and arts organizations to understand and contextualize contemporary artistic practice. Yet much of the art-related material that was once published in print is now available primarily or solely on the web, where it is ephemeral by nature. In response to this challenge, more than 40 art libraries, museums, and organizations from across the United States and Canada have partnered with Internet Archive to establish a collective approach to the preservation of web-based art content at scale: the Collaborative ART Archive (CARTA).

Since 2018, members of CARTA have worked together to identify, preserve, and provide access to at-risk online content related to the arts. The program relies on the expertise of those working in art libraries and museums by asking them to nominate sites for inclusion in the archive. Internet Archive web archivists then work to capture the sites and make the preserved content available in the CARTA collections portal.

Sumitra Duncan, Head of the Web Archiving Program at the New York Art Resources Consortium/The Frick Collection, was one of the founding program members and currently serves as the CARTA Advisory Board Chair. In speaking about the program, she reflected “It’s been tremendous to see what we’ve all been able to achieve together with CARTA in just a few years of working collaboratively, with over 40 member organizations having contributed. This work isn’t as easily accomplished alone (especially for those who are part of a small museum staff, face shrinking budgets for subscriptions, or are the solo archivist/librarian at their organization), so CARTA has allowed many art library colleagues to join the effort and share their expertise for collection development to ensure that these ephemeral materials are being preserved before disappearing from the live web.” 

Image from Galeria Superficie art gallery website, contributed by NYARC.
See more of the archived website here.

While CARTA is a member-supported program, mission-aligned organizations experiencing financial constraints may apply to join through the Sponsored Membership Program. One of CARTA’s sponsored members is the American Craft Council, a nonprofit organization that celebrates the history, practice, and unique storytelling of American craftwork. “I was very happy to be invited back to join CARTA as a sponsored member,” said Beth Goodrich, Archivist at the American Craft Council. “It was very important to me to see that the field of craft is recognized and reflected in the archival record of art in America and around the world.”

Image from Patti Warashina’s artist website, contributed by the American Craft Council.
See more of the archived website here.

Each CARTA member brings their own unique expertise to the program, often contributing nominations connected to the regions, styles, and media represented in their institution’s collections. Marie Chant, Digital Archivist at the Museum of Glass, explained “Museum of Glass has been digitally documenting vibrant and innovative glass artists in our state-of-the-art Hot Shop for over twenty years. Joining CARTA was a natural next step for our work and will help further support our collection of born-digital glass art documentation. We are excited to work with the Internet Archive and other CARTA institutions committed to preserving significant web-based contemporary art resources for generations to come.” 

In addition to nonprofit organizations and museums, CARTA’s membership also includes university art libraries. One of these contributors is Kristy Waller, Archivist at Emily Carr University of Art + Design. “Emily Carr University (ECU) was pleased to be selected to participate in CARTA as a sponsored member and we are excited to contribute Canadian art and design content. ECU supports both emerging and established artists by documenting arts education and practice through its websites and resources. We tried to crawl these sites manually using open-source tools, but arts content is often complicated and media heavy, making this work unsustainable on our budget. Through our involvement in CARTA, we are able to preserve content for the ECU community and beyond; as well as collaborate with local arts organizations to nominate artist-run centres and artists’ web sites – always with the goal of increasing meaningful access to arts content for future researchers.”

Image from Intuition Commons artist website, contributed by Emily Carr University of Art + Design.
See more of the archived website here.

As the CARTA collections and membership continue to grow, collaborators are pursuing more opportunities to preserve and provide access to art resources from communities and organizations across the world. “I’m very grateful to CARTA members and the Internet Archive staff for their dedication and shared vision for the success and continued growth of this program via coordinated collaboration,” said Duncan. “I’m excited to see how we can further get the word out about the wonderful resources we have within the CARTA collections and to recruit additional members to the CARTA cohort who can bring unique perspectives to subject areas not yet represented by the sites we’ve archived thus far.” 

“CARTA is transformative in the realm of preserving web-based art history,” said Heather Slania, who began her involvement with the program while working at the Maryland Institute College of Art and now serves as the Chief Librarian of the National Gallery of Art. “Its collaborative nature is vital for managing the vast and interconnected art world. I strongly encourage large and small institutions to join this essential endeavor. By contributing to CARTA, you are preserving art information and ensuring that future generations have a rich and diverse understanding of today’s art landscape.”

Learn more about the CARTA program, explore the CARTA collections portal, or reach out to the CARTA program team for more information.

Internet Archive and Partners Receive Press Forward Funding to Support Preserving Local News

We are excited to announce that Internet Archive, working with partners Investigative Reporters & Editors (IRE) and The Poynter Institute, has received a $1 million grant from Press Forward, a national initiative to reimagine local news. The funding is part of Press Forward’s Open Call on Infrastructure, which is providing $22.7 million to 22 projects that address the urgent challenges local newsrooms face today. 

The grant will support development of the “Today’s News for Tomorrow” national program by Internet Archive, IRE, and The Poynter Institute to provide infrastructure, preservation services, training, and community building that enable local newsrooms and journalists to ensure the archiving and perpetual access of their publications, digital assets, and other materials. As the first draft of history, local news published today is a critical resource documenting the lives and stories of the American people as well as an essential record for use by students, historians, and researchers. The “Today’s News for Tomorrow” program will address the financial and operational challenges that many local news organizations face in managing and preserving their digital materials, both for their immediate internal needs and the future information needs of their communities. 

The Press Forward funding will allow the program partners to provide infrastructure and training to over 300 newsrooms and journalists across the country, with a focus on vital local online news that is particularly at risk. Internet Archive’s web archive has long been an essential resource for journalists in their reporting. Pairing Internet Archive’s preservation infrastructure and services with IRE’s and The Poynter Institute’s experience in training and community support for journalists will further Press Forward’s goal to strengthen communities by revitalizing local news. The “Today’s News for Tomorrow” program also builds on Internet Archive’s successful “Community Webs” national program which has received nearly $3M in funding to provide preservation services and cohort-based training to over 275 libraries, museums, and municipalities from 46 states and 7 Canadian provinces in support of their work documenting the history of their communities. 

We thank Press Forward and The Miami Foundation for their support of “Today’s News for Tomorrow.” We are excited to work closely with IRE and The Poynter Institute supporting newsrooms and journalists and are honored to be part of the group of organizations receiving funding as part of Press Forward’s Open Call on Infrastructure. The full list of recipients is available online at pressforward.news/infrastructure25.

New Ways to Search Archived Music News

First crawl of CMT News on January 10, 2002.

When MTVNews.com went offline in late June, Internet users were quick to discover that some (but sadly, not all) of the site had been archived in the Internet Archive’s Wayback Machine. While you can no longer browse MTV News directly on the web, the archived pages are available via the Wayback Machine, starting with the first crawl of the site on July 5, 1997.

The same is true for CMT (Country Music Television) News, which was first crawled by the Internet Archive on January 10, 2002.
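First-capture dates like these can be looked up programmatically with the Wayback Machine’s public CDX API. A minimal sketch, assuming the API’s default space-separated field layout; the sample result line below is illustrative, not a real capture record:

```python
# Looking up a site's first capture via the Wayback Machine CDX API.
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def first_capture_query(url: str) -> str:
    """Build a CDX query for the earliest capture of `url`.

    CDX results are sorted oldest-first, so limit=1 returns the first crawl.
    """
    return f"{CDX_ENDPOINT}?{urlencode({'url': url, 'limit': 1})}"

def replay_url(cdx_line: str) -> str:
    """Turn one space-separated CDX result line into a Wayback replay URL.

    Default CDX fields:
    urlkey timestamp original mimetype statuscode digest length
    """
    fields = cdx_line.split()
    timestamp, original = fields[1], fields[2]
    return f"https://web.archive.org/web/{timestamp}/{original}"

# A CDX result line of the shape the API returns (values illustrative):
sample = ("com,mtv)/news 19970705212939 http://www.mtv.com:80/news "
          "text/html 200 IQ2FVJIDENidYRLVqGKTUQ 2345")
print(first_capture_query("mtv.com/news"))
print(replay_url(sample))
```

Fetching the query URL with any HTTP client returns the capture record, whose timestamp plugs directly into a `web.archive.org/web/` replay link.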

In response to patron requests, our engineers have created new search indexes for each site.

Why provide search indexes to music news? Because, as Michael Alex, founding editor of MTV News Digital, wrote in an op-ed for Variety, “the archives of MTV News and countless other news and entertainment organizations have a similar value: They’re a living record of entertainment history as it happened.”

It’s important to remember that these collections were captured as a routine part of the daily work conducted by more than one thousand libraries and archives collaborating with the Internet Archive to archive the web. For centuries, libraries have been the trusted repositories of culture and knowledge. As our news and information sources become increasingly digital, the role of libraries like the Internet Archive and our partners has changed to meet these new demands. This is why libraries like ours exist, and why web archiving is critical for preserving our shared digital culture.

Using DLARC, Amateur Radio Operators are Resurrecting Technical Ideas from the Past, Using 21st Century Tech

A Thank You to Internet Archive’s Digital Library of Amateur Radio & Communications
by Steve Stroh N8GNJ

In 2021, I was a member of the committee that recommended approval of a significant grant from Amateur Radio Digital Communications (ARDC) to Internet Archive to create the Digital Library of Amateur Radio & Communications (DLARC). I could foresee the potential of DLARC then… but I couldn’t then imagine the scale of what DLARC would become, nor how useful DLARC would prove to be for the entirety of the Amateur (Ham) Radio community worldwide.

In my newsletter Zero Retries, I write about interesting developments in Ham Radio to folks like me whose primary interest in Ham Radio is experimenting with the more advanced technological possibilities of Ham Radio. Such developments include communicating with data modes locally and worldwide (Packet Radio), using Ham Radio satellites and communicating with Ham Radio astronauts on the International Space Station, and developing M17, a new two way radio technology based entirely on open source (to mention just a few).

One of my favorite ways to use the DLARC (nearly 120,000 items now, and still growing) is to re-explore ideas that were proposed or attempted in Ham Radio but, for various reasons, didn’t quite become mainstream. Typically, the technology of earlier eras simply wasn’t up to some proposed ideas. But with the technology of the 2020s, such as cheap, powerful computers and software defined radio, many old ideas can be reexamined and perhaps succeed in becoming mainstream now. The problem has been that much of the source material for such “reimagining” has been languishing in file cabinets or bookcases of Ham Radio Operators like me, with nowhere to go. With the grant, the Internet Archive hired a dedicated archivist and began receiving, scanning, hosting, and aggregating electronic versions of old Ham Radio material.

One of my favorite examples of “maybe we should try this again?” is a one-page flyer for a radio unit designed for data: the NW Digital Radio UDRX-440. That radio was a leading-edge idea in 2013, but didn’t become a product. One reason for that fate was that it required a small but powerful computer that NW Digital Radio was forced to develop itself, which was expensive. More than a decade later, the computer that NW Digital Radio required, with a quad-core 1.8 GHz processor and 1 GB of RAM, is available off the shelf for $35. Perhaps it’s time for an innovative Ham Radio manufacturer to try creating something like the UDRX-440 again. Being able to provide a link to illustrate such a concept, and prove that one manufacturer got as far as the design stage, can be inspirational.

Another example of “maybe we should try this again?” is the PACSAT system, a data-communications protocol and hardware specification for Ham Radio satellites that combined multiple receivers with a single high speed transmitter for more efficient throughput of data. In the 1990s, PACSAT was proposed and several satellites were actually built and put into orbit. But at the time, PACSAT required dedicated, expensive, specialized hardware suitable only for a satellite. In the 2020s, a PACSAT system could replace a Ham Radio repeater using a software defined receiver (which can now listen to multiple frequencies) and a few other off-the-shelf parts. The difference that DLARC makes is that all the original reference material for PACSAT can easily be found in DLARC. If some graduate student were to email me looking for a project, I could suggest that they create a “PACSAT 2025” and point them to all of the PACSAT material in DLARC.

Many new Ham Radio Operators live in “restricted” living arrangements such as apartments, condominiums, or communities that don’t allow external antennas. Thus, to operate on the Ham Radio “High Frequency” (HF, or shortwave) bands, some “creativity” is required: a stealthy antenna. One of my favorite collections within DLARC is 73 Magazine, which was published monthly for 43 years, with many, many antenna construction articles such as the “compressed W3EDP” HF antenna that would fit into an attic. Unlike current Ham Radio magazines, all 516 issues of 73 Magazine can be browsed and downloaded, and because Internet Archive does optical character recognition (OCR), every word of every issue is keyword searchable. That is powerful and ample “food for the imagination” for Ham Radio Operators looking to the past for some interesting projects to tackle.

Those are just a few examples of the utility of DLARC from my perspective. Ham Radio has existed for more than a century, but prior to DLARC there was no comprehensive online archive of Ham Radio material: there were some personal archives, and some Ham Radio clubs and organizations had their newsletters online, but nothing comprehensive. DLARC is now the archive that Ham Radio has been missing. Most significantly, unlike some Ham Radio organizations, material in DLARC is free for public access (though some material is subject to Controlled Digital Lending). DLARC includes club newsletters (from all over the world), Ham Radio books and magazines (some from very early in the 20th century), audio recordings, video recordings, conference proceedings… literally a treasure trove of knowledge, ideas, and inspiration.

Thank you, Internet Archive and Archivist Kay Savetz K6KJN, for all the hard work in creating and growing the Digital Library of Amateur Radio & Communications. We really appreciate it (and I use it nearly every day).

Steve Stroh
Amateur Radio Operator N8GNJ
Bellingham, Washington, USA

End of Term Web Archive – Preserving the Transition of a Nation

It’s that time again. The 2024 End of Term crawl has officially begun! The End of Term Web Archive (#EOTArchive) is a collaborative initiative to archive U.S. government websites in the .gov and .mil web domains, as well as those harder-to-find government websites hosted on .org, .edu, and other top-level domains (TLDs), as one presidential administration ends and a new term begins. 

End of Term crawls have been completed for term transitions in 2004, 2008, 2012, 2016, and 2020. The results of these efforts are preserved in the End of Term Web Archive. In total, over 500 terabytes of government websites and data have been archived through the End of Term Web Archive efforts. These archives can be searched full-text via the Internet Archive’s collections search and also downloaded as bulk data for machine-assisted analysis.

The purpose of the End of Term Web Archive is to preserve a record of government websites for historical and research purposes. It is important to capture these websites because they can provide a snapshot of government messaging before and after the transition of terms. The End of Term Web Archive preserves information that may no longer be available on the live web for open access.

The End of Term Archive is a collaborative effort by the Internet Archive along with the University of North Texas (UNT), Stanford University, Library of Congress (LC), U.S. Government Publishing Office (GPO), and National Archives and Records Administration (NARA). Past partners include the California Digital Library (CDL) at the University of California, George Washington University, and the Environmental Data and Governance Initiative (EDGI).

Four images of Whitehouse.gov captured between 2008 and 2020
Whitehouse.gov captures from: 2008 Sept. 15; 2013 Mar. 21; 2017 Feb. 3; and 2021 Feb. 25

We are committed to preserving a record of U.S. government websites. But we need your help to complete the 2024 End of Term crawl. 

How can you help?! 

We have a list of top-level domains from the General Services Administration (GSA) and from previous End of Term crawls. But we need volunteers to help us out. We are currently accepting nominations for websites to be included in the 2024 End of Term Web Archive.

Submit a URL nomination by going to digital2.library.unt.edu/nomination/eth2024/.
We encourage you to nominate any and all U.S. federal government websites that you want to make sure get captured. Nominating URLs deep within .gov and .mil websites helps to make our web crawls as thorough and complete as possible. 

Individuals and institutions nominating seed URLs are recognized on the individual contributors leaderboard and the institutions leaderboard!

Explore the End of Term Web Archive with full text search and download the data!

Community Webs Receives $750,000 Grant to Expand Community Archiving by Public Libraries

Started in 2017, our Community Webs program has over 175 public libraries and local cultural organizations working to build digital archives documenting the experiences of their communities, especially those patrons often underrepresented in traditional archives. Participating public libraries have created over 1,400 collections documenting local civic life totaling nearly 100 terabytes and tens of millions of individual documents, images, audio/video files, blogs, websites, social media, and more. You can browse many of these collections at the Community Webs website. Participants have also collaborated on digitization efforts to bring minority newspapers online, held public programming and outreach events, and formed local partnerships to help preservation efforts at other mission-aligned organizations. The program has conducted numerous workshops and national symposia to help public librarians gain expertise in digital preservation and cohort members have done dozens of presentations at professional conferences showcasing their work. In the past, Community Webs has received support from the Institute of Museum and Library Services, the Mellon Foundation, the Kahle Austin Foundation, and the National Historical Publications and Records Commission.

We are excited to announce that Community Webs has received $750,000 in funding from The Mellon Foundation to continue expanding the program. The award will allow additional public libraries to join the program and will enable new and existing members to continue their web archiving collection building using our Archive-It service. The funding will also give members access to Internet Archive’s new Vault digital preservation service, enabling them to build and preserve collections of any type of digital material. Lastly, leveraging members’ prior success in local partnerships, Community Webs will now include an “Affiliates” program so member public libraries can nominate local nonprofit partners to receive access to archiving services and resources. Funding will also support the continuation of the program’s professional development training in digital preservation and community archiving and its overall cohort and community building activities of workshops, events, and symposia.

We thank The Andrew W. Mellon Foundation for their generous support of Community Webs. We are excited to continue to expand the program and empower hundreds of public librarians to build archives that document the voices, lives, and events of their communities and to ensure this material is permanently available to patrons, students, scholars, and citizens.

Moving Getty.edu “404-ward” With Help From The Internet Archive API

This is a guest post from Teresa Soleau (Digital Preservation Manager), Anders Pollack (Software Engineer), and Neal Johnson (Senior IT Project Manager) from the J. Paul Getty Trust.

Project Background

Getty pursues its mission in Los Angeles and around the world through the work of its constituent programs—Getty Conservation Institute, Getty Foundation, J. Paul Getty Museum, and Getty Research Institute—serving the general interested public and a wide range of professional communities to promote a vital civil society through an understanding of the visual arts. 

In 2019, Getty began a website redesign project, changing the technology stack and updating the way we interact with our communities online. The legacy website contained more than 19,000 web pages and we knew many were no longer useful or relevant and should be retired, possibly after being archived. This led us to leverage the content we’d captured using the Internet Archive’s Archive-It service.

We’d been crawling our site since 2017, but had treated the results more as a record of institutional change over time than as an archival resource to be consulted after a page was deleted. We needed to direct traffic to our Wayback Machine captures, ensuring deleted pages remain accessible when a user requests a deprecated URL. We decided to dynamically display a link to the archived page from our site’s 404 error “Page not found” page.

Getty.edu 404 error “Page not found” message including the dynamically generated instructions and Internet Archive page link.

The project to audit all existing pages required us to educate content owners across the institution about web archiving practices and purpose. We developed processes for completing human reviews of large amounts of captured content. This work is described in more detail in a 2021 Digital Preservation Coalition blog post that mentions the Web Archives Collecting Policy we developed.

In this blog post we’ll discuss the work required to use the Internet Archive’s data API to add the necessary link on our 404 pages pointing to the most recent Wayback Machine capture of a deleted page.

Technical Underpinnings

Getty’s Wayback Machine integration workflow diagram.

Implementation of our Wayback Machine integration was straightforward from a technical point of view. The first example on the Wayback Machine APIs documentation page gave us the technical guidance we needed for our use case: displaying a link to the most recent capture of any page deleted from our website. With no requirements for authentication or management of keys or platform-specific software development kit (SDK) dependencies, our development process was simplified. We chose to incorporate the Wayback API using Nuxt.js, the web framework used to build the new Getty.edu site.

Since the Wayback Machine API is highly performant for simple queries, with a typical response delay in milliseconds, we are able to query the API before rendering the page using a Nuxt route middleware module. API error handling and a request timeout were added to ensure that edge cases such as API failures or network timeouts do not block rendering of the 404 response page.
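As a rough sketch of the pattern described above (Getty’s actual implementation lives in a Nuxt.js route middleware module in JavaScript, not Python), the availability query, response parsing, timeout, and error fallback might look like this; the helper names are ours, not Getty’s:

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

AVAILABILITY_API = "https://archive.org/wayback/available"

def availability_url(page_url: str) -> str:
    """Build the Wayback Machine Availability API query for a page URL."""
    return AVAILABILITY_API + "?" + urllib.parse.urlencode({"url": page_url})

def closest_snapshot(payload: dict) -> Optional[str]:
    """Pull the closest archived capture's URL out of an API response, if any."""
    snap = payload.get("archived_snapshots", {}).get("closest", {})
    return snap.get("url") if snap.get("available") else None

def lookup_archived_page(page_url: str, timeout: float = 2.0) -> Optional[str]:
    """Query the API with a short timeout; on any failure return None,
    so the 404 page still renders (just without an archive link)."""
    try:
        with urllib.request.urlopen(availability_url(page_url), timeout=timeout) as resp:
            return closest_snapshot(json.load(resp))
    except Exception:
        return None
```

The key design point mirrors Getty’s: the archive lookup is strictly best-effort, so a slow or failed API call degrades to a plain 404 page rather than blocking the response.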

The only Internet Archive API feature missing for our initial list of requirements was access to snapshot page thumbnails in the JSON data payload received from the API. Access to these images would allow us to enhance our 404 page with a visual cue of archived page content.

Results and Next Steps

Our ability to include a link to an archived version of a deleted web page on our 404 response page helped ease the tough decisions content stakeholders were obliged to make about what content to archive and then delete from the website. We could guarantee availability of the content in perpetuity without incurring the long-term cost of maintaining the information ourselves.

The API returns the most recent Wayback Machine capture by default, which is sometimes not created by us and has not necessarily passed through our archive quality assurance process. We intend to develop our application further so that we privilege the display of Getty’s own page captures, ensuring we deliver the highest quality capture to users.
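One way to move beyond the default most-recent capture is the Wayback CDX server API, which lists every capture of a URL (with output=json, the first row is a header and later rows are captures). The public CDX output does not identify which organization made a capture, so as a stand-in this sketch prefers the newest capture with an HTTP 200 status; the function names and selection rule are our own illustration, not Getty’s implementation:

```python
def wayback_url(capture: dict) -> str:
    """Build a Wayback Machine playback URL from a CDX capture row."""
    return f"https://web.archive.org/web/{capture['timestamp']}/{capture['original']}"

def best_capture(cdx_rows, prefer=lambda row: row["statuscode"] == "200"):
    """Pick the newest capture satisfying `prefer` from CDX JSON output.
    Falls back to the newest capture overall if none match."""
    if len(cdx_rows) < 2:          # header only, or nothing at all
        return None
    header = cdx_rows[0]
    rows = [dict(zip(header, r)) for r in cdx_rows[1:]]
    candidates = [r for r in rows if prefer(r)] or rows
    return max(candidates, key=lambda r: r["timestamp"])
```

Swapping in a different `prefer` predicate is where an institution-specific rule (such as matching captures against a log of its own crawls) would plug in.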

Google Analytics has been configured to report on traffic to our 404 pages and will track clicks on links pointing to Internet Archive pages, providing useful feedback on what portion of archived page traffic is referred from our 404 error page.

To work around the challenge of providing navigational affordances to legacy content, and to ensure the page titles of old content remain accessible to search engines, we intend to provide an up-to-date index of all archived getty.edu pages.

As we continue to retire obsolete website pages and complete this monumental content archiving and retirement effort, we’re grateful for the Internet Archive API which supports our goal of making archived content accessible in perpetuity.

Leveraging Technology to Scale Library Research Support: ARCH, AI, and the Humanities

Kevin Hegg is Head of Digital Projects at James Madison University (JMU) Libraries. He has held many technology positions within JMU Libraries, with experience spanning a wide variety of technology work, from managing computer labs and server hardware to developing a large open-source software initiative. We are thankful to Kevin for taking time to talk with us about his experience with ARCH (Archives Research Compute Hub), AI, and supporting research at JMU.

Thomas Padilla is Deputy Director, Archiving and Data Services. 

Thomas: Thank you for agreeing to talk more about your experience with ARCH, AI, and supporting research. I find that folks are often curious about what set of interests and experiences prepares someone to work in these areas. Can you tell us a bit about yourself and how you began doing this kind of work?

Kevin: Over the span of 27 years, I have held several technology roles within James Madison University (JMU) Libraries. My experience ranges from managing computer labs and server hardware to developing a large open-source software initiative adopted by numerous universities around the world. Today I manage a small team that supports faculty and students as they design, implement, and evaluate digital projects that enhance, transform, and promote scholarship, teaching, and learning. I also co-manage Histories Along the Blue Ridge, which hosts over 50,000 digitized legal documents from courthouses along Virginia’s Blue Ridge mountains.

Thomas: I gather that your initial interest in using ARCH was to see what potential it afforded for working with James Madison University’s Mapping the Black Digital and Public Humanities project. Can you introduce the project to our readers? 

Kevin: The Mapping the Black Digital and Public Humanities project began at JMU in Fall 2022. The project draws inspiration from established resources such as the Colored Convention Project and the Reviews in Digital Humanities journal. It employs Airtable for data collection and Tableau for data visualization. The website features a map that not only geographically locates over 440 Black digital and public humanities projects across the United States but also offers detailed information about each initiative. The project is a collaborative endeavor involving JMU graduate students and faculty, in close alliance with JMU Libraries. Over the past year, this interdisciplinary team has dedicated hundreds of hours to data collection, data visualization, and website development.

Mapping the Black Digital and Public Humanities, project and organization type distribution

The project has achieved significant milestones. In Fall 2022, Mollie Godfrey and Seán McCarthy, the project leaders, authored “Race, Space, and Celebrating Simms: Mapping Strategies for Black Feminist Biographical Recovery,” highlighting the value of such mapping projects. At the same time, graduate student Iliana Cosme-Brooks undertook a monumental data collection effort. During the winter months, Mollie and Seán spearheaded an effort to refine the categories and terms used in the project through comprehensive research and user testing. By Spring 2023, the project was integrated into the academic curriculum, where a class of graduate students actively contributed to its inaugural phase. Funding was obtained to maintain and update the database and map during the summer.

Looking ahead, the project team plans to present their work at academic conferences and aims to diversify the team’s expertise further. The overarching objective is to enhance the visibility and interconnectedness of Black digital and public humanities projects, while also welcoming external contributions for the initiative’s continual refinement and expansion.

Thomas: It sounds like the project adopts a holistic approach to experimenting with and integrating the functionality of a wide range of tools and methods (e.g., mapping, data visualization). How do you see tools like ARCH fitting into the project and research services more broadly? What tools and methods have you used in combination with ARCH?

Kevin: ARCH offers faculty and students an invaluable resource for digital scholarship by providing expansive, high-quality datasets. These datasets enable more sophisticated data analytics than typically encountered in undergraduate pedagogy, revealing patterns and trends that would otherwise remain obscured. Despite the increasing importance of digital humanities, a significant portion of faculty and students lack advanced coding skills. The advent of AI-assisted coding platforms like ChatGPT and GitHub CoPilot has democratized access to programming languages such as Python and JavaScript, facilitating their integration into academic research.

For my work, I employed ChatGPT and CoPilot to further process ARCH datasets derived from a curated sample of 20 websites focused on Black digital and public humanities. Using PyCharm, an IDE freely available for educational purposes, with the CoPilot extension improved my coding efficiency tenfold.

Next, I leveraged ChatGPT’s Advanced Data Analysis plugin to deconstruct visualizations from Stanford’s Palladio platform, a tool commonly used for exploratory data visualization but lacking a built-in way to share the results. With the aid of ChatGPT, I developed JavaScript-based web applications that faithfully replicate Palladio’s graph and gallery visualizations. Specifically, I instructed ChatGPT to employ the D3 JavaScript library for ingesting my modified ARCH datasets into client-side web applications. The final products, including HTML, JavaScript, and CSV files, were made publicly accessible via GitHub Pages (see my graph and gallery on GitHub Pages).
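The data preparation step behind a workflow like this, reshaping a tabular ARCH-derived dataset into the nodes-and-links structure that D3 graph layouts consume, can be sketched in a few lines of Python. The column names below (`source`, `target`, `count`) are hypothetical placeholders, not the actual ARCH dataset schema:

```python
import csv
import io

def to_d3_graph(csv_text: str) -> dict:
    """Convert an edge-list CSV (hypothetical columns: source,target,count)
    into the {"nodes": [...], "links": [...]} shape D3 force/graph layouts expect."""
    links, names = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        links.append({"source": row["source"],
                      "target": row["target"],
                      "value": int(row["count"])})
        # Collect each distinct endpoint once, preserving first-seen order.
        for name in (row["source"], row["target"]):
            if name not in names:
                names.append(name)
    return {"nodes": [{"id": n} for n in names], "links": links}
```

The resulting dictionary can be serialized with `json.dump` and loaded client-side with `d3.json`, keeping the visualization entirely static and hostable on GitHub Pages.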

Black Digital and Public Humanities websites, graph visualization

In summary, the integration of Python and AI-assisted coding tools has not only enhanced my use of ARCH datasets but also enabled the creation of client-side web applications for data visualization.

Thomas: Beyond pairing ChatGPT with ARCH, what additional uses are you anticipating for AI-driven tools in your work?

Kevin: AI-driven tools have already radically transformed my daily work. I am using AI to reduce or even eliminate repetitive, mindless tasks that would otherwise take tens or hundreds of hours. For example, as part of the Mapping project, ChatGPT+ helped me transform an Airtable with almost 500 rows and two dozen columns into a series of 500 blog posts on a WordPress site. ChatGPT+ understands the structure of a WordPress export file, and after a couple of hours of iterating through my design requirements with it, I was able to import the 500 blog posts into a WordPress website. Without this intervention, the task would have required over a hundred hours of tedious copying and pasting. Additionally, we have been using AI-enabled platforms like Otter and Descript to transcribe oral interviews.

I foresee AI-driven tools playing an increasingly pivotal role in many facets of my work. For instance, natural language processing could automate the categorization and summarization of large text-based datasets, making archival research more efficient and our analyses richer. AI can also be used to identify entities in large archival datasets. Archives hold a treasure trove of artifacts waiting to be described and discovered. AI offers tools that will supercharge our construction of finding aids and item-level metadata.  

Lastly, AI could facilitate more dynamic and interactive data visualizations, like the ones I published on GitHub Pages. These will offer users a more engaging experience when interacting with our research findings. Overall, the potential of AI is vast, and I’m excited to integrate more AI-driven tools into JMU’s classrooms and research ecosystem.

Thomas: Thanks for taking the time Kevin. To close out, whose work would you like people to know more about? 

Kevin: Engaging in Digital Humanities (DH) within the academic library setting is a distinct privilege, one that requires a collaborative ethos. I am fortunate to be a member of an exceptional team at JMU Libraries, a collective too expansive to fully acknowledge here. AI has introduced transformative tools that border on magic. However, loosely paraphrasing Immanuel Kant, it’s crucial to remember that technology devoid of content is empty. I will use this opportunity to spotlight the contributions of three JMU faculty whose work celebrates our local community and furthers social justice.

Mollie Godfrey (Department of English) and Seán McCarthy (Writing, Rhetoric, and Technical Communication) are the visionaries behind two inspiring initiatives: the Mapping Project and the Celebrating Simms Project. The latter serves as a digital, post-custodial archive honoring Lucy F. Simms, an educator born into enslavement in 1856 who impacted three generations of young students in our local community. Both Godfrey and McCarthy have cultivated deep, lasting connections within Harrisonburg’s Black community. Their work strikes a balance between celebration and reparation. Collaborating with them has been as rewarding as it is challenging.

Gianluca De Fazio (Justice Studies) spearheads the Racial Terror: Lynching in Virginia project, illuminating a grim chapter of Virginia’s past. His relentless dedication led to the installation of a historical marker commemorating the tragic lynching of Charlotte Harris. De Fazio, along with colleagues, has also developed nine lesson plans based on this research, which are now integrated into high school curricula. My collaboration with him was a catalyst for pursuing a master’s degree in American History.

Racial Terror: Lynching in Virginia

Both the Celebrating Simms and Racial Terror projects are highlighted in the Mapping the Black Digital and Public Humanities initiative. The privilege of contributing to such impactful projects alongside such dedicated individuals has rendered my extensive tenure at JMU both meaningful and, I hope, enduring.