Tag Archives: CARTA

CARTA: A Collective Approach to the Preservation of Online Art Resources

Art historians, critics, curators, and humanities scholars rely on the records of artists, galleries, museums, and arts organizations to understand and contextualize contemporary artistic practice. Yet much of the art-related material that was once published in print is now available primarily or solely on the web, where it is ephemeral by nature. In response to this challenge, more than 40 art libraries, museums, and organizations from across the United States and Canada have partnered with the Internet Archive to establish a collective approach to the preservation of web-based art content at scale: the Collaborative ART Archive (CARTA).

Since 2018, members of CARTA have worked together to identify, preserve, and provide access to at-risk online content related to the arts. The program relies on the expertise of those working in art libraries and museums by asking them to nominate sites for inclusion in the archive. Internet Archive web archivists then work to capture the sites and make the preserved content available in the CARTA collections portal.

Sumitra Duncan, Head of the Web Archiving Program at the New York Art Resources Consortium/The Frick Collection, was one of the founding program members and currently serves as the CARTA Advisory Board Chair. In speaking about the program, she reflected “It’s been tremendous to see what we’ve all been able to achieve together with CARTA in just a few years of working collaboratively, with over 40 member organizations having contributed. This work isn’t as easily accomplished alone (especially for those who are part of a small museum staff, face shrinking budgets for subscriptions, or are the solo archivist/librarian at their organization), so CARTA has allowed many art library colleagues to join the effort and share their expertise for collection development to ensure that these ephemeral materials are being preserved before disappearing from the live web.” 

Image from Galeria Superficie art gallery website, contributed by NYARC.
See more of the archived website here.

While CARTA is a member-supported program, mission-aligned organizations experiencing financial constraints may apply to join through the Sponsored Membership Program. One of CARTA’s sponsored members is the American Craft Council, a nonprofit organization that celebrates the history, practice, and unique storytelling of American craftwork. “I was very happy to be invited back to join CARTA as a sponsored member,” said Beth Goodrich, Archivist at the American Craft Council. “It was very important to me to see that the field of craft is recognized and reflected in the archival record of art in America and around the world.”

Image from Patti Warashina’s artist website, contributed by the American Craft Council.
See more of the archived website here.

Each CARTA member brings their own unique expertise to the program, often contributing nominations connected to the regions, styles, and media represented in their institution’s collections. Marie Chant, Digital Archivist at the Museum of Glass, explained “Museum of Glass has been digitally documenting vibrant and innovative glass artists in our state-of-the-art Hot Shop for over twenty years. Joining CARTA was a natural next step for our work and will help further support our collection of born-digital glass art documentation. We are excited to work with the Internet Archive and other CARTA institutions committed to preserving significant web-based contemporary art resources for generations to come.” 

In addition to nonprofit organizations and museums, CARTA’s membership also includes university art libraries. One of these contributors is Kristy Waller, Archivist at Emily Carr University of Art + Design. “Emily Carr University (ECU) was pleased to be selected to participate in CARTA as a sponsored member and we are excited to contribute Canadian art and design content. ECU supports both emerging and established artists by documenting arts education and practice through its websites and resources. We tried to crawl these sites manually using open-source tools, but arts content is often complicated and media heavy, making this work unsustainable on our budget. Through our involvement in CARTA, we are able to preserve content for the ECU community and beyond; as well as collaborate with local arts organizations to nominate artist-run centres and artists’ web sites – always with the goal of increasing meaningful access to arts content for future researchers.”

Image from Intuition Commons artist website, contributed by Emily Carr University of Art + Design.
See more of the archived website here.

As the CARTA collections and membership continue to grow, collaborators are pursuing more opportunities to preserve and provide access to art resources from communities and organizations across the world. “I’m very grateful to CARTA members and the Internet Archive staff for their dedication and shared vision for the success and continued growth of this program via coordinated collaboration,” said Duncan. “I’m excited to see how we can further get the word out about the wonderful resources we have within the CARTA collections and to recruit additional members to the CARTA cohort who can bring unique perspectives to subject areas not yet represented by the sites we’ve archived thus far.” 

“CARTA is transformative in the realm of preserving web-based art history,” said Heather Slania, who began her involvement with the program while working at the Maryland Institute College of Art and now serves as the Chief Librarian of the National Gallery of Art. “Its collaborative nature is vital for managing the vast and interconnected art world. I strongly encourage large and small institutions to join this essential endeavor. By contributing to CARTA, you are preserving art information and ensuring that future generations have a rich and diverse understanding of today’s art landscape.”

Learn more about the CARTA program, explore the CARTA collections portal, or reach out to the CARTA program team for more information.

Collective Web-Based Art Preservation and Access at Scale 

Art historians, critics, curators, humanities scholars, and many others rely on the records of artists, galleries, museums, and arts organizations to conduct historical research and to understand and contextualize contemporary artistic practice. Yet much of the art-related material that was once published in print is now available primarily or solely on the web and is thus ephemeral by nature. In response to this challenge, more than 40 art libraries spent the last 3 years developing a collective approach to the preservation of web-based art materials at scale.

Supported by the Institute of Museum and Library Services and the National Endowment for the Humanities, the Collaborative ART Archive (CARTA) community has successfully aligned effort across libraries large and small, from Manoa, Hawaii to Toronto, Ontario – resulting in the preservation of and access to 800 web-based art resources, organized into 8 collections (art criticism, art fairs and events, art galleries, art history and scholarship, artists' websites, arts education, arts organizations, auction houses) and totaling nearly 9 TB of data with continued growth. All collections are preserved in perpetuity by the Internet Archive.

Today, CARTA is excited to launch the CARTA portal – providing unified access to CARTA collections.

CARTA portal

The CARTA portal includes web archive collections developed jointly by CARTA members, as well as preexisting art-related collections from CARTA institutions, and non-CARTA member collections. CARTA portal development builds on the Internet Archive’s experience creating the COVID-19 Web Archive and Community Webs portal. 

CARTA collections are searchable by contributing organization, collection, site, and page text. Advanced search supports more granular exploration by host, results per host, file types, and beginning and end dates.

CARTA search

In addition to the CARTA portal, CARTA has worked to promote research use of collections through a series of day-long computational research workshops – Working to Advance Library Support for Web Archive Research – backed by ARCH (Archives Research Compute Hub). A call for applications for the next workshop, held concurrently with the annual Society of American Archivists meeting, is now open.

Moving forward, CARTA aims to grow and diversify its membership in order to increase its collective ability to preserve web-based art materials. If your art library would like to join CARTA, please express interest here.

Getting Started with Machine Learning and GLAM (Galleries, Libraries, Archives, Museums) Collections

Guest Post by Daniel Van Strien, Machine Learning Librarian, Hugging Face

Machine learning has many potential applications for working with GLAM (galleries, libraries, archives, museums) collections, though it is not always clear how to get started. This post outlines some of the possible ways in which open source machine learning tools from the Hugging Face ecosystem can be used to explore web archive collections made available via the Internet Archive’s ARCH (Archives Research Compute Hub). ARCH aims to make computational work with web archives more accessible by streamlining web archive data access, visualization, analysis, and sharing. Hugging Face is focused on the democratization of good machine learning. A key component of this is not only making models available but also doing extensive work around the ethical use of machine learning. 

Below, I work with the Collaborative ART Archive (CARTA) collection focused on artist websites. This post is accompanied by an ARCH Image Dataset Explorer Demo. The goal of this post is to show how using a specific set of open source machine learning models can help you explore a large dataset through image search, image classification, and model training.

Later this year, Internet Archive and Hugging Face will organize a hands-on hackathon focused on using open source machine learning tools with web archives. Please let us know if you are interested in participating by filling out this form.

Choosing machine learning models

The Hugging Face Hub is a central repository which provides access to open source machine learning models, datasets and demos. Currently, the Hugging Face Hub has over 150,000 openly available machine learning models covering a broad range of machine learning tasks.

Rather than relying on a single model that may not be comprehensive enough, we’ll select a series of models that suit our particular needs.

A screenshot of the Hugging Face Hub task navigator presenting a way of filtering machine learning models hosted on the hub by the tasks they intend to solve. Example tasks are Image Classification, Token Classification and Image-to-Text.

Working with image data 

ARCH currently provides access to 16 different “research ready” datasets generated from web archive collections. These include but are not limited to datasets containing all extracted text from the web pages in a collection, link graphs (showing how websites link to other websites), and named entities (for example, mentions of people and places). One of the datasets is made available as a CSV file, containing information about the images from webpages in the collection, including when the image was collected, when the live image was last modified, a URL for the image, and a filename.

Screenshot of the ARCH interface showing a preview for a dataset. This preview includes a download link and an “Open in Colab” button.
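As a rough sketch of working with this dataset, the rows of the CSV can be loaded with Python's standard library and filtered or iterated from there. The column names below are illustrative stand-ins, not the exact ARCH schema:

```python
import csv
import io

# A small in-memory sample mimicking the ARCH image dataset CSV.
# Column names here are illustrative, not the exact ARCH schema.
sample_csv = """crawl_date,last_modified_date,url,filename
20230101000000,20221115093000,https://example.org/art/painting.jpg,painting.jpg
20230101000000,20220304120000,https://example.org/art/sculpture.png,sculpture.png
"""

# DictReader maps each row to a dict keyed by the header columns,
# ready for filtering, deduplication, or downloading the images.
with io.StringIO(sample_csv) as f:
    rows = list(csv.DictReader(f))

for row in rows:
    print(row["filename"], row["url"])
```

With the real dataset, you would pass an open file handle for the downloaded CSV instead of the `io.StringIO` sample.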

One of the challenges with a collection like this is understanding what it contains at scale – looking through thousands of images by hand is impractical. We address that challenge by making use of tools that help us better understand the collection as a whole.

Building a user interface

Gradio is an open source library supported by Hugging Face that helps create user interfaces that allow other people to interact with various aspects of a machine learning system, including the datasets and models. I used Gradio in combination with Spaces to make an application publicly available within minutes, without having to set up and manage a server or hosting. See the docs for more information on using Spaces. Below, I show examples of using Gradio as an interface for applying machine learning tools to ARCH generated data.

Exploring images

I use the Gradio tab for random images to begin assessing images in the dataset. Looking at a randomized grid of images gives a better idea of what kind of images are in the dataset. This begins to give us a sense of what is represented in the collection (e.g., art, objects, people, etc.).

Screenshot of the random image gallery showing a grid of images from the dataset.

Introducing image search models 

Looking at snapshots of the collection gives us a starting point for exploring what kinds of images are included in the collection. We can augment our approach by implementing image search. 

There are various approaches we could take which would allow us to search our images. If we have the text surrounding an image, we could use this as a proxy for what the image might contain. For example, we might assume that if the text next to an image contains the words “a picture of my dog snowy”, then the image contains a picture of a dog. This approach has limitations – text might be missing, unrelated or only capture a small part of what is in an image. The text “a picture of my dog snowy” doesn’t tell us what kind of dog the image contains or if other things are included in that photo.

Making use of an embedding model offers another path forward. An embedding model takes an input (e.g., text or an image) and returns a vector of numbers. For example, the text prompt ‘an image of a dog’ would be passed through an embedding model, which ‘translates’ the text into a vector of numbers. What is special about these numbers is that they should capture some semantic information about the input; the embedding for a picture of a dog should somehow capture the fact that there is a dog in the image. Since these embeddings consist of numbers, we can compare one embedding to another to see how close they are to each other: we expect the embeddings for similar images to be closer to each other and the embeddings for dissimilar images to be farther apart. Without getting too far into the weeds of how this works, it’s worth mentioning that these embeddings don’t just represent one aspect of an image, i.e., the main object it contains, but also other components, such as its aesthetic style. You can find a longer explanation of how this works in this post.
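The “how close are two embeddings” comparison is commonly done with cosine similarity. Here is a minimal sketch using toy vectors (real CLIP embeddings have hundreds of dimensions, and the values below are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 means very similar direction,
    # close to 0.0 means largely unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings standing in for real model outputs.
dog_photo = [0.9, 0.1, 0.3, 0.0]
dog_text = [0.8, 0.2, 0.25, 0.1]   # embedding of "an image of a dog"
landscape = [0.1, 0.9, 0.0, 0.4]

print(cosine_similarity(dog_photo, dog_text))   # high: related content
print(cosine_similarity(dog_photo, landscape))  # lower: different content
```

In practice, libraries like numpy or faiss perform this comparison efficiently across millions of embeddings at once.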

Finding a suitable image search model on the Hugging Face Hub

To create an image search system for the dataset, we need a model to create embeddings. Fortunately, the Hugging Face Hub makes it easy to find models for this.  

The Hub has various models that support building an image search system. 

A screenshot of the Hugging Face Hub showing a list of hosted models.

All models will have various benefits and tradeoffs. For example, some models will be much larger. This can make a model more accurate but also make it harder to run on standard computer hardware.

Hugging Face Hub provides an ‘inference widget’, which allows interactive exploration of a model to see what sort of output it provides. This can be very useful for quickly understanding whether a model will be helpful or not. 

A screenshot of a model widget showing a picture of a dog and a cat playing the guitar. The widget assigns the label “playing music” the highest confidence.

For our use case, we need a model which allows us to embed both our input text, for example, “an image of a dog,” and compare that to embeddings for all the images in our dataset to see which are the closest matches. We use a variant of the CLIP model hosted on Hugging Face Hub: clip-ViT-B-16. This allows us to turn both our text and images into embeddings and return the images which most closely match our text prompt.

A screenshot of the search tab showing a search for “landscape photograph” in a text box and a grid of images resulting from the search. This includes two images containing trees and images containing the sky and clouds.

While the search implementation isn’t perfect, it does give us an additional entry point into an extensive collection of data which is difficult to explore manually. It is possible to extend this interface to accommodate an image similarity feature. This could be useful for identifying a particular artist’s work in a broader collection.
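Given precomputed embeddings, the ranking step behind such a search can be sketched in a few lines. The vectors and filenames below are toy stand-ins; in the real system the embeddings would come from the clip-ViT-B-16 model (for example via the sentence-transformers library, as noted in the comments):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# In the real system these would be produced by the model, e.g.:
#   model = SentenceTransformer("clip-ViT-B-16")
#   query_embedding = model.encode("landscape photograph")
#   image_embeddings = model.encode([Image.open(p) for p in paths])
# Toy stand-in vectors for illustration:
image_embeddings = {
    "trees.jpg": [0.2, 0.9, 0.1],
    "portrait.jpg": [0.9, 0.1, 0.2],
    "clouds.jpg": [0.1, 0.8, 0.3],
}
query_embedding = [0.15, 0.85, 0.2]  # pretend embedding of "landscape photograph"

# Rank images by similarity to the text query, best match first.
ranked = sorted(
    image_embeddings,
    key=lambda name: cosine(query_embedding, image_embeddings[name]),
    reverse=True,
)
print(ranked)
```

Because text and images share the same embedding space in CLIP-style models, this one sorting step is all the “search” really is; the rest of the work is computing and caching the image embeddings up front.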

Image classification 

While image search helps us find images, it doesn’t help us as much if we want to describe all the images in our collection. For this, we’ll need a slightly different type of machine learning task – image classification. An image classification model will put our images into categories drawn from a list of possible labels. 

We can find image classification models on the Hugging Face Hub. The “Image Classification Model Tester” tab in the demo Gradio application allows us to test most of the 3,000+ image classification models hosted on the Hub against our dataset.

This can give us a sense of a few different things:

  • How well do the labels for a model match our data? A model for classifying dog breeds probably won’t help us much!
  • It gives us a quick way of inspecting possible errors a model might make with our data. 
  • It prompts us to think about what categories might make sense for our images.

A screenshot of the image classification tab in the Gradio app which shows a bar chart with the most frequently predicted labels for images assigned by a computer vision model.

We may find a model that already does a good job working with our dataset – if we don’t, we may have to look at training a model.
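The kind of label summary the demo visualizes as a bar chart can be sketched with a standard-library tally. The predictions below are hypothetical stand-ins for a model's top-1 outputs over a sample of the collection:

```python
from collections import Counter

# Hypothetical top-1 predictions from an image classification model
# run over a sample of images from the collection.
predictions = [
    "painting", "sculpture", "painting", "photograph",
    "painting", "photograph", "website screenshot",
]

# Tally the most frequently predicted labels -- the same summary
# the demo app renders as a bar chart.
label_counts = Counter(predictions)
for label, count in label_counts.most_common(3):
    print(label, count)
```

Skimming this tally (and spot-checking images behind surprising labels) is a quick way to judge whether a candidate model's categories fit the collection before committing to it.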

Training your own computer vision model

The final tab of our Gradio demo allows you to export the image dataset in a format that can be loaded by Label Studio, an open-source tool for annotating data in preparation for machine learning tasks. In Label Studio, we can define the labels we would like to apply to our dataset. For example, we might decide we’re interested in pulling out particular types of images from this collection. We can then use Label Studio to create an annotated version of our dataset by assigning the correct label to each image. Although this process can take some time, it can be a useful way of further exploring a dataset and making sure your labels make sense.

With a labeled dataset, we need some way of training a model. For this, we can use AutoTrain. This tool allows you to train machine learning models without writing any code, letting us create a model trained on our dataset with the labels we are interested in. It’s beyond the scope of this post to cover all of AutoTrain’s features, but this post provides a useful overview of how it works.

Next Steps

As mentioned in the introduction, you can explore the ARCH Image Dataset Explorer Demo yourself. If you know a bit of Python, you could also duplicate the Space and adapt or change the current functionality it supports for exploring the dataset.

Internet Archive and Hugging Face plan to organize a hands-on hackathon later this year focused on using open source machine learning tools from the Hugging Face ecosystem to work with web archives. The event will include building interfaces for web archive datasets, collaborative annotation, and training machine learning models. Please let us know if you are interested in participating by filling out this form.

Working to Advance Library Support for Web Archive Research 

This Spring, the Internet Archive hosted two in-person workshops aimed at helping to advance library support for web archive research: Digital Scholarship & the Web and Art Resources on the Web. These one-day events were held at the Association of College & Research Libraries (ACRL) conference in Pittsburgh and the Art Libraries Society of North America (ARLIS) conference in Mexico City. The workshops brought together librarians, archivists, program officers, graduate students, and disciplinary researchers for full days of learning, discussion, and hands-on experience with web archive creation and computational analysis. The workshops were developed in collaboration with the New York Art Resources Consortium (NYARC) – and are part of an ongoing series of workshops hosted by the Internet Archive through Summer 2023.

Internet Archive Deputy Director of Archiving & Data Services Thomas Padilla discussing the potential of web archives as primary sources for computational research at Art Resources on the Web in Mexico City.

Designed in direct response to library community interest in supporting additional uses of web archive collections, the workshops had the following objectives: introduce participants to web archives as primary sources in the context of computational research questions; develop familiarity with research use cases that make use of web archives; and provide an opportunity to acquire hands-on experience creating web archive collections and computationally analyzing them using ARCH (Archives Research Compute Hub) – a new service set to publicly launch in June 2023.

Internet Archive Community Programs Manager Lori Donovan walking workshop participants through a demonstration of Palladio using a dataset generated with ARCH at Digital Scholarship & the Web In Pittsburgh, PA.

In support of those objectives, Internet Archive staff walked participants through web archiving workflows, introduced a diverse set of web archiving tools and technologies, and offered hands-on experience building web archives. Participants were then introduced to the Archives Research Compute Hub (ARCH). ARCH supports computational research with web archive collections at scale – e.g., text and data mining, data science, digital scholarship, machine learning, and more. ARCH does this by streamlining the generation of and access to more than a dozen research-ready web archive datasets, in-browser visualization, dataset analysis, and open dataset publication. Participants further explored data generated with ARCH in Palladio, Voyant, and RAWGraphs.

Network visualization of the Occupy Web Archive collection, created using Palladio based on a Domain Graph Dataset generated by ARCH.

Gallery visualization of the CARTA Art Galleries collection, created using Palladio based on an Image Graph Dataset generated by ARCH.

At the close of the workshops, participants were eager to discuss web archive research ethics, research use cases, and a diverse set of approaches to scaling library support for researchers interested in working with web archive collections – truly vibrant discussions – and perhaps the beginnings of a community of interest!  We plan to host future workshops focused on computational research with web archives – please keep an eye on our Event Calendar.