We are excited to announce the public availability of ARCH (Archives Research Compute Hub), a new research and education service that helps users easily build, access, and analyze digital collections computationally at scale. ARCH builds on more than a decade of Internet Archive experience supporting computational research by providing large-scale data to researchers, on dataset-oriented service integrations like ARS (Archive-It Research Services), and on a collaboration with the Archives Unleashed project of the University of Waterloo and York University. Development of ARCH was generously supported by the Mellon Foundation.
What does ARCH do?
ARCH helps users easily conduct and support computational research with digital collections at scale – e.g., text and data mining, data science, digital scholarship, machine learning, and more. Users can build custom research collections relevant to a wide range of subjects, generate and access research-ready datasets from collections, and analyze those datasets. In line with best practices in reproducibility, ARCH supports open publication and preservation of user-generated datasets. ARCH is currently optimized for working with tens of thousands of web archive collections, covering a broad range of subjects, events, and timeframes, and the platform is actively expanding to include digitized text and image collections. ARCH also works with various portions of the overall Wayback Machine global web archive totaling 50+ PB going back to 1996, representing an extensive archive of contemporary history and communication.
ARCH, In-Browser Visualization
Who is ARCH for?
ARCH is for any user who seeks an accessible approach to working with digital collections computationally at scale. Possible users include, but are not limited to, researchers exploring disciplinary questions, educators fostering computational methods in the classroom, journalists tracking changes in web-based communication over time, and librarians and archivists supporting the development of computational literacies across disciplines. Recent research efforts making use of ARCH include analysis of COVID-19 crisis communications, health misinformation, Latin American women’s rights movements, and post-conflict societies during reconciliation.
ARCH, Generate Datasets
What are core ARCH features?
Build: Leverage ARCH capabilities to build custom research collections that are well scoped for specific research and education purposes.
Access: Generate more than a dozen different research-ready datasets (e.g., full text, images, PDFs, graph data, and more) from digital collections with the click of a button. Download generated datasets directly in-browser or via API.
Analyze: Easily work with research-ready datasets in interactive computational environments and applications like Jupyter Notebooks, Google Colab, Gephi, and Voyant, and produce in-browser visualizations.
Publish and Preserve: Openly publish datasets in line with best practices in reproducible research. All published datasets will be preserved in perpetuity.
Support: Make use of synchronous and asynchronous technical support, online trainings, and extensive help center documentation.
How can I learn more about ARCH?
To learn more about ARCH please reach out via the following form.
Machine learning has many potential applications for working with GLAM (galleries, libraries, archives, museums) collections, though it is not always clear how to get started. This post outlines some of the possible ways in which open source machine learning tools from the Hugging Face ecosystem can be used to explore web archive collections made available via the Internet Archive’s ARCH (Archives Research Compute Hub). ARCH aims to make computational work with web archives more accessible by streamlining web archive data access, visualization, analysis, and sharing. Hugging Face is focused on the democratization of good machine learning. A key component of this is not only making models available but also doing extensive work around the ethical use of machine learning.
Below, I work with the Collaborative Art Archive (CARTA) collection focused on artist websites. This post is accompanied by an ARCH Image Dataset Explorer Demo. The goal of this post is to show how using a specific set of open source machine learning models can help you explore a large dataset through image search, image classification, and model training.
Later this year, Internet Archive and Hugging Face will organize a hands-on hackathon focused on using open source machine learning tools with web archives. Please let us know if you are interested in participating by filling out this form.
Choosing machine learning models
The Hugging Face Hub is a central repository which provides access to open source machine learning models, datasets and demos. Currently, the Hugging Face Hub has over 150,000 openly available machine learning models covering a broad range of machine learning tasks.
Rather than relying on a single model that may not be comprehensive enough, we’ll select a series of models that suit our particular needs.
A screenshot of the Hugging Face Hub task navigator presenting a way of filtering machine learning models hosted on the hub by the tasks they intend to solve. Example tasks are Image Classification, Token Classification and Image-to-Text.
Working with image data
ARCH currently provides access to 16 different “research ready” datasets generated from web archive collections. These include but are not limited to datasets containing all extracted text from the web pages in a collection, link graphs (showing how websites link to other websites), and named entities (for example, mentions of people and places). One of the datasets is made available as a CSV file, containing information about the images from webpages in the collection, including when the image was collected, when the live image was last modified, a URL for the image, and a filename.
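As a minimal sketch of what working with this image dataset might look like, the following loads a CSV with Python’s standard csv module. The column names used here (`crawl_date`, `last_modified_date`, `url`, `filename`) are illustrative assumptions, not the actual ARCH schema, and the inline sample stands in for a downloaded dataset file.

```python
import csv
import io

# A tiny inline sample standing in for an ARCH image-metadata CSV.
# Column names here are hypothetical, not the actual ARCH schema.
sample = """\
crawl_date,last_modified_date,url,filename
20210315,20200101,https://example.org/art1.jpg,art1.jpg
20210316,20190615,https://example.org/art2.png,art2.png
"""

# DictReader gives one dict per image record, keyed by the header row.
rows = list(csv.DictReader(io.StringIO(sample)))

print(len(rows))            # number of image records in the sample
print(rows[0]["filename"])  # filename of the first record
```

In practice you would point the reader at the CSV downloaded from ARCH rather than an inline string.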
Screenshot of the ARCH interface showing a preview for a dataset. This preview includes a download link and an “Open in Colab” button.
One of the challenges we face with a collection like this is understanding what it contains – looking through thousands of images manually is impractical. We address that challenge by making use of tools that help us better understand the collection at scale.
Building a user interface
Gradio is an open source library supported by Hugging Face that helps create user interfaces that allow other people to interact with various aspects of a machine learning system, including the datasets and models. I used Gradio in combination with Spaces to make an application publicly available within minutes, without having to set up and manage a server or hosting. See the docs for more information on using Spaces. Below, I show examples of using Gradio as an interface for applying machine learning tools to ARCH-generated data.
I use the Gradio tab for random images to begin assessing images in the dataset. Looking at a randomized grid of images gives a better idea of what kind of images are in the dataset. This begins to give us a sense of what is represented in the collection (e.g., art, objects, people, etc.).
Screenshot of the random image gallery showing a grid of images from the dataset.
Introducing image search models
Looking at snapshots of the collection gives us a starting point for exploring what kinds of images are included in the collection. We can augment our approach by implementing image search.
There are various approaches we could take which would allow us to search our images. If we have the text surrounding an image, we could use this as a proxy for what the image might contain. For example, we might assume that if the text next to an image contains the words “a picture of my dog snowy”, then the image contains a picture of a dog. This approach has limitations – text might be missing, unrelated or only capture a small part of what is in an image. The text “a picture of my dog snowy” doesn’t tell us what kind of dog the image contains or if other things are included in that photo.
Making use of an embedding model offers another path forward. An embedding model takes an input, such as text or an image, and returns a vector: a list of numbers. For example, the text prompt ‘an image of a dog’ would be passed through an embedding model, which ‘translates’ the text into a vector. What is special about these numbers is that they capture semantic information about the input; the embedding for a picture of a dog should somehow capture the fact that there is a dog in the image. Since these embeddings are just numbers, we can compare one embedding to another to see how close they are. We expect the embeddings for similar images to be close to each other and the embeddings for dissimilar images to be farther apart. Without getting too far into the weeds of how this works, it’s worth mentioning that these embeddings don’t represent just one aspect of an image, i.e. the main object it contains, but also other components, such as its aesthetic style. You can find a longer explanation of how this works in this post.
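The idea of comparing embeddings can be sketched with plain NumPy. The three-dimensional vectors below are toy values chosen for illustration; real embedding models produce vectors with hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 for vectors pointing the same way,
    # lower for dissimilar ones.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
dog_photo = np.array([0.9, 0.1, 0.2])
dog_text = np.array([0.8, 0.2, 0.1])
cat_photo = np.array([0.1, 0.9, 0.3])

# The text "a dog" should land closer to the dog photo than to the cat photo.
print(cosine_similarity(dog_text, dog_photo) > cosine_similarity(dog_text, cat_photo))  # True
```

This closeness comparison is exactly what an image search system does, just over many images at once.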
Finding a suitable image search model on the Hugging Face Hub
To create an image search system for the dataset, we need a model to create embeddings. Fortunately, the Hugging Face Hub makes it easy to find models for this.
The Hub has various models that support building an image search system.
Hugging Face Hub showing a list of hosted models.
All models will have various benefits and tradeoffs. For example, some models will be much larger. This can make a model more accurate but also make it harder to run on standard computer hardware.
The Hugging Face Hub provides an ‘inference widget’, which allows interactive exploration of a model to see what sort of output it provides. This can be very useful for quickly understanding whether a model will be helpful or not.
A screenshot of a model widget showing a picture of a dog and a cat playing the guitar. The widget assigns the label “playing music” the highest confidence.
For our use case, we need a model which allows us to embed both our input text, for example, “an image of a dog,” and compare that to embeddings for all the images in our dataset to see which are the closest matches. We use a variant of the CLIP model hosted on Hugging Face Hub: clip-ViT-B-16. This allows us to turn both our text and images into embeddings and return the images which most closely match our text prompt.
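A sketch of the ranking step: assuming text and image embeddings have already been computed (for example with the sentence-transformers library’s clip-ViT-B-16 model), search reduces to sorting images by cosine similarity to the query embedding. The embeddings below are random stand-ins so the logic can run on its own.

```python
import numpy as np

def search(query_emb, image_embs, filenames, top_k=3):
    # Normalize all vectors, then rank images by cosine similarity to the query.
    q = query_emb / np.linalg.norm(query_emb)
    m = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(scores)[::-1][:top_k]
    return [(filenames[i], float(scores[i])) for i in order]

# Stand-in embeddings; in practice these would come from a model such as
# SentenceTransformer("clip-ViT-B-16").encode(...) for both text and images.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(5, 8))
filenames = [f"img_{i}.jpg" for i in range(5)]
query_emb = image_embs[2] + 0.01 * rng.normal(size=8)  # query nearly identical to image 2

print(search(query_emb, image_embs, filenames, top_k=1)[0][0])  # -> img_2.jpg
```

The demo works the same way, except the query embedding comes from a text prompt typed into the interface.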
A screenshot of the search tab showing a search for “landscape photograph” in a text box and a grid of images resulting from the search. This includes two images containing trees and images containing the sky and clouds.
While the search implementation isn’t perfect, it does give us an additional entry point into an extensive collection of data which is difficult to explore manually. It is possible to extend this interface to accommodate an image similarity feature. This could be useful for identifying a particular artist’s work in a broader collection.
While image search helps us find images, it doesn’t help us as much if we want to describe all the images in our collection. For this, we’ll need a slightly different type of machine learning task – image classification. An image classification model will put our images into categories drawn from a list of possible labels.
We can find image classification models on the Hugging Face Hub. The “Image Classification Model Tester” tab in the demo Gradio application allows us to test most of the 3,000+ image classification models hosted on the Hub against our dataset.
This can give us a sense of a few different things:
How well do the labels for a model match our data? A model for classifying dog breeds probably won’t help us much!
It gives us a quick way of inspecting possible errors a model might make with our data.
It prompts us to think about what categories might make sense for our images.
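Once a classification model has been run over the images (for example via the transformers library’s image-classification pipeline), summarizing its output is simple counting. The label list below is a made-up stand-in for model predictions, not real output from the CARTA collection.

```python
from collections import Counter

# Hypothetical top-1 predictions, one per image, standing in for the
# output of an image-classification pipeline run over the dataset.
predictions = [
    "painting", "sculpture", "painting",
    "website screenshot", "painting", "sculpture",
]

# Tally how often each label was predicted, most frequent first,
# which is the data behind a bar chart like the one in the demo.
label_counts = Counter(predictions)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

Sorting by frequency like this quickly surfaces both dominant categories and suspicious labels worth inspecting.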
A screenshot of the image classification tab in the Gradio app which shows a bar chart with the most frequently predicted labels for images assigned by a computer vision model.
We may find a model that already does a good job working with our dataset – if we don’t, we may have to look at training a model.
Training your own computer vision model
The final tab of our Gradio demo allows you to export the image dataset in a format that can be loaded by Label Studio, an open-source tool for annotating data in preparation for machine learning tasks. In Label Studio, we can define the labels we would like to apply to our dataset. For example, we might decide we’re interested in pulling out particular types of images from this collection and use Label Studio to create an annotated version of our dataset with those labels. Although assigning labels to every image takes some time, the process can be a useful way of further exploring a dataset and making sure your labels make sense.
With a labeled dataset, we need some way of training a model. For this, we can use AutoTrain. This tool allows you to train machine learning models without writing any code, letting us create a model trained on our dataset with the labels we are interested in. It’s beyond the scope of this post to cover all AutoTrain features, but this post provides a useful overview of how it works.
As mentioned in the introduction, you can explore the ARCH Image Dataset Explorer Demo yourself. If you know a bit of Python, you could also duplicate the Space and adapt or change the current functionality it supports for exploring the dataset.
Internet Archive and Hugging Face plan to organize a hands-on hackathon later this year focused on using open source machine learning tools from the Hugging Face ecosystem to work with web archives. The event will include building interfaces for web archive datasets, collaborative annotation, and training machine learning models. Please let us know if you are interested in participating by filling out this form.
The experimental Trump Archive, which we launched in January, is a collection of President Donald Trump’s appearances on TV news shows, including interviews, speeches, and press conferences dating back to 2009. Currently largely hand-curated, the Trump Archive is a prototype of the kind of collection on a public figure or topic that can be made with material from our library of TV news. We are starting to reach out to machine learning collaborators to develop tools that make creating such collections more efficient, and we have plans to publish similar collections on Congressional leadership on both sides of the aisle.
The growing Trump Archive contains a lot of content (928 clips and counting), so we’ve put together some pointers and ideas for how to use the collection.
Anna Wiener at The New Yorker used the Trump Archive for “immersion therapy: a means of overcoming shock through prolonged exposure,” while The Wall Street Journal’s Geoffrey A. Fowler proposed the Trump Archive could be used to hold politicians accountable by people doing their own fact-checking: “At a time when facts are considered up for debate, there’s more value than ever in being able to check the tape yourself.”
From this page you can explore TV programs that include at least one fact-checked Trump statement. After choosing a program, look for the fact-checked icon on the program timeline. When you click on that icon, you’ll be able to watch the video of the statement and then click through to a fact-checking article by one of our partners.
And if you’re eager to look for a specific topic, such as “terrorism,” or “immigration,” this table is a great place to start. You can search for a topic using the trusty find function on your computer, or download the table and view the list as a spreadsheet. Find a list of topics at PolitiFact and FactCheck.org.
Search the Trump Archive
The search function, on the left side of the screen on the front page of the Trump Archive, allows you to find words or phrases within the closed captioning for a particular clip. Since the transcribers work in real time and at lightning speed, the captions don’t produce a perfect transcript, but they will get you really close to where you need to be.
For example, I searched for “believe me” in the Trump Archive and came up with hundreds of results. While that particular example may only be useful for artists and linguists, the functionality can be applied in many ways. For example, there are almost 200 results for a search of “Iran Deal,” 70+ results for “radical Islamic terrorists,” and when you search “jobs,” the results almost match the number in our total collection, revealing how often Donald Trump talks about jobs.
When we heard the President would be taking action to remove an expansion of rights for the transgender community, we looked for what he may have said about it before by searching “transgender” in the caption search. It yielded six programs in which he spoke publicly about it.
Because of the imperfect nature of closed captioning transcripts, your search is often more successful if you don’t try for an exact quote. For example, you may know Trump said something like “we can make the kind of change together that you dream of.” The closed captioning quote may actually be “an make the find a change together that you beam of.” But in those circumstances where you need to search for an exact quote, try appending ~0 after the closing quotation mark: for example, “the lion, the witch, and the wardrobe”~0. The ~0 tells the search box to look for all these words with no other words in between them, thus next to each other.
Browse the Trump Archive by TV show
If you know the program name of the Trump statement you’re looking for, you can use the “Topics & Subjects” filter on the left side navigation. So for instance, you may recall that Trump said something you want to find on an episode of 60 Minutes. Find Topics & Subjects on the left side of the page and click on “More.”
Then check the boxes of the relevant program(s), in this case, 60 Minutes. Hit “apply your filters,” and then browse all the 60 Minutes programs in the Trump Archive.
Make your own shareable TV clips
Once you find that video clip where Trump says something you want to share, you can make your own video clip of up to three minutes that can be easily embedded into a post. If you post the link on Twitter, the clip appears within the body of the tweet and can be played without clicking through to the TV News Archive.
To start, click on the icon to “Share, embed or refine this clip!”
A window will then open up to present (highlighted in orange) the closed captioning of the 60-second pre-defined segment—and the captioning of the 60 seconds before and after (not highlighted) for context. Important: the captions come from real-time closed captioning, which means they are often incomplete, garbled and not precisely aligned. This is all still an experiment, remember. Be sure to watch your clip before you post to make sure you captured what you meant to.
“Grab” the quote marks at the beginning and end of the highlighted segment and through a bit of trial and error, find the right in and out points for the clip. Note that each time the quote marks move, the player starts to play and the URL changes to update the “start” and “end” points of the clip — named to reflect the number of seconds into the entire program. Remember: Watch your clip before you post.
Pro tip: If you clip a quote that’s fewer than 10 seconds it might not play, so give it a bit of time to run. Copy the URL and paste it elsewhere. Click one of the variety of share method icons on the bottom of the edit window (Twitter, Facebook, etc.) The embed icon </> will offer two flavors of embed codes for the portion you have selected—one for an iFrame, the other for many WordPress sites.
Fun, right? Now go share another. Let us know if you have any questions by emailing us at firstname.lastname@example.org; and please, do share what uses you find for the Trump Archive.