As an academic librarian helping connect students and faculty with the research materials they need, Sanjeet Mann has turned to the Internet Archive many times.
“I really value having the Wayback Machine as an additional tool in my librarian’s toolbox,” Mann said. “Information preservation is an essential, but often overlooked, part of the infrastructure for teaching and learning.”
Mann, currently working as the Systems & Discovery Librarian at California State University, San Bernardino (CSUSB), said he first learned about the value of the Internet Archive in 2006 during his library science master’s program.
Over his career, Mann has worked at various libraries, tapping into the Archive on the job.
Assisting budding writers, composers and artists as Arts Librarian at University of Redlands, Mann found that the vast amount of free information online, including biographies, can shape students’ projects.
“We can draw on the Archive whenever we need inspiration for creative work, or when we need to understand how current scholarship and the issues that we’re facing now aren’t completely new—they’re based on this history of work by scholars, by politicians, by citizens active in the public interest,” he said. “These issues tend to recur over time. As a society, we need to know where we have been in order to meet the challenges of the future.”
At CSUSB, Mann also helps computer science and business students use the Archive’s collections to better understand the cultural roots of new technologies—the historical context for their innovations.
“It is the only entity I’m aware of that preserves the Internet’s scholarly and historical record at this scale,” Mann said.
On a practical note, Mann leveraged information through the Wayback Machine when he was researching how to set up a campus laptop loaner program for University of Redlands. This can be an essential service that libraries provide students who have trouble with their computers.
Mann wanted to understand policies at other universities, such as how they handled the return of damaged laptops. Looking at archived versions of university library websites through the Wayback Machine, Mann was able to learn about other approaches and find contacts to follow up for additional details.
The Internet Archive is a source to verify information that is no longer listed on websites, he said.
“Companies themselves don’t have any incentive to archive the history of their website. New products get launched. The platform gets migrated from one platform to another,” Mann said. “An organization like the Internet Archive, being a library, is uniquely positioned to meet the need in society of ensuring some kind of continuity of memory and having a public record. Especially with the government being very partisan these days, I think there’s value in the Internet Archive being an independent, not-for-profit that operates in the public interest.”
Mann added: “Without the Archive, we would lose decades of information about our society at a crucial turning point in its development, eroding trust in online systems and requiring educators, students and researchers to reconsider the way we do our work and share it with others.”
This is a guest post from Teresa Soleau (Digital Preservation Manager), Anders Pollack (Software Engineer), and Neal Johnson (Senior IT Project Manager) from the J. Paul Getty Trust.
Project Background
Getty pursues its mission in Los Angeles and around the world through the work of its constituent programs—Getty Conservation Institute, Getty Foundation, J. Paul Getty Museum, and Getty Research Institute—serving the general interested public and a wide range of professional communities to promote a vital civil society through an understanding of the visual arts.
In 2019, Getty began a website redesign project, changing the technology stack and updating the way we interact with our communities online. The legacy website contained more than 19,000 web pages and we knew many were no longer useful or relevant and should be retired, possibly after being archived. This led us to leverage the content we’d captured using the Internet Archive’s Archive-It service.
We’d been crawling our site since 2017, but had treated the results more as a record of institutional change over time than as an archival resource to be consulted after deletion of a page. We needed to direct traffic to our Wayback Machine captures, thus ensuring that deleted pages remain accessible when a user requests a deprecated URL. We decided to dynamically display a link to the archived page from our site’s 404 error “Page not found” page.
Getty.edu 404 error “Page not found” message including the dynamically generated instructions and Internet Archive page link.
The project to audit all existing pages required us to educate content owners across the institution about web archiving practices and purpose. We developed processes for completing human reviews of large amounts of captured content. This work is described in more detail in a 2021 Digital Preservation Coalition blog post that mentions the Web Archives Collecting Policy we developed.
In this blog post we’ll discuss the work required to use the Internet Archive’s data API to add the necessary link on our 404 pages pointing to the most recent Wayback Machine capture of a deleted page.
Technical Underpinnings
Implementing our Wayback Machine integration was straightforward from a technical point of view. The first example in the Wayback Machine APIs documentation gave us the guidance we needed to display a link to the most recent capture of any page deleted from our website. Because the API requires no authentication, key management, or platform-specific software development kit (SDK) dependencies, our development process was simpler. We chose to incorporate the Wayback API using Nuxt.js, the web framework used to build the new Getty.edu site.
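As a rough illustration of the kind of query described above, the following sketch builds a request URL for the public Wayback Machine availability endpoint and pulls the capture link out of the JSON payload. It is written in Python for brevity rather than in the Nuxt.js stack the team used, and the helper names `build_availability_url` and `closest_snapshot_url` are hypothetical, not part of any Getty or Internet Archive code.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Public endpoint described in the Wayback Machine APIs documentation.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url: str) -> str:
    """Return the availability query URL for a page (hypothetical helper)."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': page_url})}"


def closest_snapshot_url(payload: dict) -> Optional[str]:
    """Extract the archived page link from a decoded JSON payload, if any."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")


# Sample payload in the shape shown in the API documentation.
sample = json.loads("""
{
  "url": "example.com",
  "archived_snapshots": {
    "closest": {
      "status": "200",
      "available": true,
      "url": "http://web.archive.org/web/20130919044612/http://example.com/",
      "timestamp": "20130919044612"
    }
  }
}
""")

print(build_availability_url("example.com/old-page"))
print(closest_snapshot_url(sample))
```

When no capture exists, the payload carries an empty `archived_snapshots` object, so the second helper returns `None` and the 404 page can simply omit the link.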
Since the Wayback Machine API is highly performant for simple queries, with a typical response delay in milliseconds, we are able to query the API before rendering the page using a Nuxt route middleware module. API error handling and a request timeout were added to ensure that edge cases such as API failures or network timeouts do not block rendering of the 404 response page.
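The defensive pattern described above can be sketched as follows: the lookup is given a short timeout, and any failure is turned into "no link" so that the 404 page still renders. This is an assumption-laden Python sketch, not the Nuxt route middleware itself; the function names, the one-second default timeout, and the injectable `fetcher` parameter are all illustrative choices.

```python
import json
from typing import Callable, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def fetch_json(url: str, timeout: float) -> dict:
    """Default fetcher: GET the URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def find_archived_link(
    page_url: str,
    timeout: float = 1.0,  # assumed value; tune to the page's latency budget
    fetcher: Callable[[str, float], dict] = fetch_json,
) -> Optional[str]:
    """Return the closest capture link, or None on any failure."""
    query = f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': page_url})}"
    try:
        payload = fetcher(query, timeout)
        closest = payload["archived_snapshots"]["closest"]
        return closest["url"] if closest.get("available") else None
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # OSError covers network errors and timeouts, ValueError covers
        # malformed JSON, and the rest cover empty or unexpected payloads.
        return None
```

A caller would render the 404 page unconditionally and add the archive link only when `find_archived_link(requested_url)` returns a value. Passing a stub as `fetcher` makes the failure paths easy to exercise without network access.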
The only Internet Archive API feature missing for our initial list of requirements was access to snapshot page thumbnails in the JSON data payload received from the API. Access to these images would allow us to enhance our 404 page with a visual cue of archived page content.
Results and Next Steps
Our ability to include a link to an archived version of a deleted web page on our 404 response page helped ease the tough decisions content stakeholders were obliged to make about what content to archive and then delete from the website. We could guarantee availability of content in perpetuity without incurring the long-term cost of maintaining the information ourselves.
By default, the API returns the most recent Wayback Machine capture, which is sometimes not one we created and has not necessarily passed through our archive quality assurance process. We intend to develop our application further so that we prioritize the display of Getty’s own page captures. This will ensure we’re delivering the highest quality capture to users.
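One possible way to favor reviewed captures, sketched here purely as an assumption, is for the site owner to keep its own record of capture timestamps that passed quality assurance and to link directly to the newest of them. The `approved` list and the function name below are hypothetical; the only external convention relied on is the familiar Wayback Machine link pattern of a 14-digit timestamp followed by the original URL.

```python
from typing import Iterable, Optional

WAYBACK_PREFIX = "https://web.archive.org/web"


def preferred_capture_url(page_url: str, approved: Iterable[str]) -> Optional[str]:
    """Link to the newest approved capture, or None if there is none.

    `approved` holds 14-digit Wayback timestamps (YYYYMMDDhhmmss); this list
    is a hypothetical record kept by the site owner, not data from the API.
    """
    valid = [ts for ts in approved if len(ts) == 14 and ts.isdigit()]
    if not valid:
        return None
    # Fixed-width digit strings sort chronologically, so max() is the newest.
    return f"{WAYBACK_PREFIX}/{max(valid)}/{page_url}"
```

If the function returns `None`, the page could fall back to whatever capture the availability API reports by default.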
Google Analytics has been configured to report on traffic to our 404 pages and will track clicks on links pointing to Internet Archive pages, providing useful feedback on what portion of archived page traffic is referred from our 404 error page.
To work around the challenge of providing navigational affordances to legacy content and ensure web page titles of old content remain accessible to search engines, we intend to provide an up-to-date index of all archived getty.edu pages.
As we continue to retire obsolete website pages and complete this monumental content archiving and retirement effort, we’re grateful for the Internet Archive API which supports our goal of making archived content accessible in perpetuity.
As Elena Rowan researches the ways that activist archivers gather and make sense of data, she often relies on the Internet Archive. She is a graduate student in sociology at Concordia University in Montreal, Canada, with an interest in the debate around copyright and e-books in public libraries.
“I look at why archives and libraries are important to society and culture as a whole,” said Rowan, who uses materials preserved in the Wayback Machine and the Internet Archive. “Without the Internet Archive, so much of the knowledge and information on the Internet would be lost, and most of my research would be impossible.”
Rowan is in her second year of her master’s program and works as a research assistant at the Data Justice Hub. It is a collaborative research project that pursues data-related skills development for social activists, critical researchers and the general public, and aims to understand how data activists gather and make sense of data.
The Internet Archive has been valuable, she said, in providing information for the project and its podcast, Data Decoded.
For a recent class on sociology theory, Rowan said she’s found it useful to search for work by early researchers such as W.E.B. Du Bois in the Internet Archive’s collection. Her university library has a wealth of materials, but she says there are times when she can only find an older book through the Archive and, being digital, it’s easier to locate.
With an event sponsored by the Milieux Institute, which offers programs at the intersection of fine arts, digital culture, and information technology, Rowan leveraged the Internet Archive in another way. She created a one-hour Curating Nostalgia workshop where participants could explore resources in the digital collection to create their own personal nostalgia archive.
Logging into the Internet Archive, Rowan taught people how to search for historical documents and pop culture items. For example, she found a beloved video game that came in a cereal box from her childhood, as well as an audio walking tour of her neighborhood from a decade earlier before gentrification changed the landscape. Other workshop participants found books they read as kids, Club Penguin memorabilia and a Nancy Drew game.
“For scholarly work and nostalgia researchers, it’s a treasure trove of goodies,” Rowan says of the Internet Archive.
In her personal life, Rowan said she’s enjoyed perusing old magazines and obscure cookbooks. She’s found recipes for ambitious cakes, sewing patterns and vintage designs that give her ideas for how to pull together her eclectic mix of old furniture.
“The colors, writing and patterns of the past offer infinite inspiration for creative hobbies and help cultivate domestic bliss,” she said. “I am grateful to everyone at the Internet Archive for creating, maintaining and continuing to expand and fight for this truly amazing public resource!”
At this year’s annual celebration in San Francisco, the Internet Archive team showcased its innovative projects and rallied supporters around its mission of “Universal Access to All Knowledge.”
Brewster Kahle, Internet Archive’s founder and digital librarian, welcomes hundreds of guests to the annual celebration on October 12, 2023.
“People need libraries more than ever,” said Brewster Kahle, founder of the Internet Archive, at the October 12 event. “We have a set of forces that are making libraries harder and harder to happen—so we have to do something more about it.”
Efforts to ban books and defund libraries are worrisome trends, Kahle said, but there are hopeful signs and emerging champions.
Among the headliners of the program was Connie Chan, Supervisor of San Francisco’s District 1, who was honored with the 2023 Internet Archive Hero Award. In April, she authored a resolution backing the Internet Archive and the digital rights of all libraries, which the San Francisco Board of Supervisors passed unanimously.
Chan spoke at the event about her experience as a first-generation, low-income immigrant who relied on books in Chinese and English at the public library in Chinatown.
“Having free access to information was a critical part of my education—and I know I was not alone,” said Chan, who is a supporter of the Internet Archive’s role as a digital, online library. “The Internet Archive is a hidden gem…It is very critical to humanity, to freedom of information, diversity of information and access to truth…We aren’t just fighting for libraries, we are fighting for our humanity.”
Several users shared testimonials about how resources from the Internet Archive have enabled them to advance their research, fact-check politicians’ claims, and inspire their creative works. Content in the collection is helping improve machine translation of languages. It is preserving international television news coverage and Ukrainian memes on social media during the war with Russia.
Quinn Dombrowski, of the Saving Ukrainian Cultural Heritage Online project, shows off Ukrainian memes preserved by the project.
Technology is changing things—some for the worse, but a lot for the better, said David McRaney, speaking via video to the audience in the auditorium at 300 Funston Ave. “And when [technology] changes things for the better, it’s going to expand the limited capabilities of human beings. It’s going to extend the reach of those capabilities, both in speed and scope,” he said. “It’s about a newfound freedom of mind, and time, and democratizing that freedom so everyone has access to it.”
Open Library developer Drini Cami explained how the Internet Archive is using artificial intelligence to improve access to its collections.
In the past, when a book was digitized, scanning operators had to crop the photographs of pages manually. The Internet Archive recently trained a custom machine learning model to automatically suggest page boundaries, allowing staff to double the processing rate. In addition, an open-source machine learning tool converts images into text, making it possible for books to be searchable and for the collection to be available for bulk research, cross-referencing, and text analysis, as well as to be read aloud to people with print disabilities.
Open Library developer Drini Cami.
“Since 2021, we’ve made 14 million books, documents, microfiche, records—you name it—discoverable and accessible in over 100 languages,” Cami said.
As AI technology advanced this year, Internet Archive engineers piloted a metadata extractor, a tool that automatically pulls key data elements from digitized books. This extra information helps librarians match the digitized book to other cataloged records, beginning to resolve the backlog of books with limited metadata in the Archive’s collection. AI is also being leveraged to assist in writing descriptions of magazines and newspapers—reducing the time from 40 to 10 minutes per item.
“Because of AI, we’ve been able to create new tools to streamline the workflows of our librarians and the data staff, and make our materials easier to discover, and work with patrons and researchers,” Cami said. “With new AI capabilities being announced and made available at a breakneck rate, new ideas of projects are constantly being added.”
Jamie Joyce & AI hackathon participants.
A recent Internet Archive hackathon explored the risks and opportunities of AI by using the technology itself to generate content, said Jamie Joyce, project lead with the organization’s Democracy’s Library project. One of the hackathon volunteers created an autonomous research agent to crawl the web and identify claims related to AI. With a prompt-based model, the machine was able to generate nearly 23,000 claims from 500 references. The information could be the basis for creating economic, environmental and other arguments about the use of AI technology. Joyce invited others to get involved in future hackathons as the Internet Archive continues to expand its AI potential.
Peter Wang, CEO and co-founder at Anaconda, said interesting kinds of people and communities have emerged around cultures of sharing. For example, those who participate in the DWeb community are often both humanists and technologists, he said, with an understanding about the importance of reducing barriers to information for the future of humanity. Wang said rather than a scarcity mindset, he embraces an abundant approach to knowledge sharing and applying community values to technology solutions.
Peter Wang, CEO and co-founder at Anaconda.
“With information, knowledge and open-source software, if I make a project, I share it with someone else, they’re more likely to find a bug,” he said. “They might improve the documentation a little bit. They might adapt it for a novel use case that I can then benefit from. Sharing increases value.”
The Internet Archive’s Joy Chesbrough, director of philanthropy, closed the program by expressing appreciation for those who have supported the digital library, especially in these precarious times.
“We are one community tied together by the internet, this connected web of knowledge sharing. We have a commitment to an inclusive and open internet, where there are many winners, and where ethical approaches to genuine AI research are supported,” she said. “The real solution lies in our deep human connection. It inspires the most amazing acts of generosity and humanity.”
As scholars of digital media studies, Liliana Bounegru and Jonathan Gray say the Internet Archive preserves artifacts that are integral to their work.
The two academics work at King’s College London in the Department of Digital Humanities—Bounegru is a lecturer in digital media and Gray is a senior lecturer in critical infrastructure studies. They are both interested in studying how media has changed with digital technology. The Internet Archive collection has been useful as they examine the history of the web, trends and evolution of websites and changes in technology, society and culture.
In one study of online myths and disinformation, the researchers used the Wayback Machine to understand how tracker signatures (snippets of code that embed ads and analytics on a website) of viral “fake news” sites changed over time. As websites were blacklisted from major ad networks, they looked up the archived versions of the websites to follow how their money-making practices via ads changed over time. This project was completed in collaboration with BuzzFeed News, which published an article about the findings and analytical techniques.
This investigation builds on work that Bounegru and Gray did with First Draft, a nonprofit that works with journalists to support investigations around misinformation. They analyzed the tracker signatures of mainstream news sites alongside those of junk news sites to understand their different monetization and audience economics practices.
As a result of their investigations, the researchers created A Field Guide to Fake News that explores how digital methods can be used to study false viral news, political memes, and trolling practices. “It became widely used by a network of hundreds of media organizations and fact checking groups as well as for training people doing investigative work on disinformation,” Gray said. Together with other collaborators at the Public Data Lab which they co-founded, Bounegru and Gray wrote a paper in New Media & Society about the threat of misleading junk news on social, economic and political life and the questions that it raises about social media and online content sharing platforms.
Gray has long been interested in the politics of open and public data and is writing a book on the subject. This involves tracing how open data policies and practices have developed around the world, and he said it’s been valuable to be able to search and analyze open data websites through the Wayback Machine. As part of research for the book he published an article in Data & Policy, from Cambridge University Press, about the rise of data portals as online devices for making data public.
“In the case of data portals such as data.gov.uk we see a shift from more sociable and experimental design approaches aiming to surface questions, engage communities and support cultures of socially oriented invention to more muted, minimal expert facing infrastructures,” said Gray. “It could be considered a certain kind of success for open data advocates that portals have become so established and institutionalized, but also suggests that maybe there’s less interest in being inclusive, accessible, responsive or thoughtful in reaching communities that may be less technically oriented or those who don’t already know what they are looking for or what kinds of data is likely to be found.”
In working with their students, both Bounegru and Gray share ways that the Internet Archive can be useful for research. Through hands-on research activities with the Wayback Machine they explore how it can show how web content, user interfaces and web categories change. It can even provide evidence of broader societal change, such as how political views have shifted over time. The Archive can reveal large-scale changes and allow researchers, journalists, students and community groups to gain a richer appreciation of digital media history.
Added Bounegru: “We use the Internet Archive a lot. It is an essential tool for our research.”
Slide on how the Wayback Machine is being used, from Bounegru and Gray’s “web histories” class, part of a digital methods course at King’s College London.
Rachel Simmons first used the Wayback Machine for research projects at her Sacramento, California, high school. Now a senior at UCLA, she’s discovered even more ways to find material not available elsewhere.
Simmons, whose mother and grandmother were both librarians, is an applied math major with a minor in film, television and digital media. As she looks up information about media figures or needs to find a rare film, she says the Internet Archive’s digital collection has been an invaluable resource.
“It’s really great to have access to information for anyone to use from their home computer,” Simmons says. “I don’t physically have to go into a library. If I’m working on something late at night, it’s convenient.”
When taking a class on American film history last year, she was assigned to research a famous actor; she chose Peter Lorre.
“I’m a big fan of classic horror films and he’s an icon whose legacy has continued long past his career,” she said. “I just wanted to learn more about him and what people thought of him at the time.”
To find those contemporary views of Lorre’s work, Simmons turned to the fan magazine collection in the Archive’s Media History Digital Library. There she found interviews with the actor and reviews of his movies from the 1930s. Although Lorre appeared as a mysterious figure on film, Simmons says she learned that the interviews present him as a conventional, regular guy. She gained even more insight through the published fan letters in the magazines. “I found it really interesting that I was reading these letters from almost one hundred years ago,” Simmons said.
For another UCLA course, Simmons tapped into the Internet Archive to view silent German films that were discussed in class. While she was studying, Simmons found herself stumbling onto trailers for other films, which led her to checking out similar movies for fun after her projects were complete. Many of the more obscure titles that interest her are not available on streaming services, she notes.
Simmons says she tells others about the resources available through the Internet Archive—including her family of librarians.
When Graeme Currie was working at a university, he went to the campus library for research and often lingered in the stacks just to enjoy the collection.
Now, as a freelance translator and editor operating remotely from a small town near Hamburg, Germany, Currie doesn’t have that same access. Without an institutional affiliation, he relies on materials in the Internet Archive for his work.
“It’s been vital for me because, at times, it’s the only way I can find what I need,” says Currie, 51, who is originally from Scotland. “For freelancers who are working from home without a library nearby and using obscure sources and out-of-print books, there’s nothing to replace the Internet Archive.”
Currie first heard about the Wayback Machine in the early 2000s as a means to check changes in websites. Then he discovered other services that the Internet Archive provides, including its audio and book library.
As he edits and translates academic books from German to English, Currie says he often has to check book citations—looking up page numbers and verifying passages. The virtual collection has been helpful as he researches a range of topics in the arts, social sciences and the humanities. Currie says he’s borrowed titles related to philosophy, criminality and global urban history, including the early history of tourism in Sicily.
Not only are many of the books hard to find, but Currie says logistically, they are difficult to obtain. Without the Internet Archive, Currie says he would have to wait weeks for interlibrary loans or try to contact the book authors, who are often unavailable.
“I simply could not do my job without access to a virtual library,” says Currie, who has been freelancing for about five years. “The Internet Archive is like having a university library on your desktop.”
Sarah Barry wanted to become a fighter for something—but she didn’t know exactly what.
Citizen journalist Sarah Barry
“I was frustrated with all that was going on in the world. I knew I couldn’t wave a magic wand and fix everything, but I wanted to help in some small way,” said the 28-year-old who lives in Columbus, Ohio, and works in IT.
She decided to leverage her research skills to help correct misinformation about vaccines and public health.
For Barry, the Wayback Machine has been critical in tracking the science and sharing what she’s discovered. Without the Internet Archive, she said, valuable internet history that she needs to do effective research would have been completely lost.
“I use the Internet Archive to look up old links and resources that have since gone defunct,” said Barry. “I also use the Archive to actively input web pages that need to be saved or saved again to ensure that any resources I’m currently using are saved for mine or other’s future reference.”
She has turned into a citizen journalist and independent activist, volunteering for nonprofit organizations to better inform the public. Barry has given public presentations on her findings and provided materials to reporters that have appeared in a variety of news outlets.
As a millennial, Barry said she grew up being active online and has long used the Internet Archive as a tool. “It’s a common language among people like me who do research,” she said. “We all know the Internet Archive is legit.”
As a doctoral student in anthropology at Yale University, Spencer Kaplan often relies on the Internet Archive for his research. He is an anthropologist of technology who studies virtual communities. Kaplan said he uses the Wayback Machine to create a living archive of data that he can analyze.
Doctoral student Spencer Kaplan
Last summer, Kaplan studied the blockchain community, which is active on Twitter and constantly changing. As people were sharing their views of the market and helping one another, he needed a way to save the data before their accounts disappeared. A failed project might have prompted the users to take down the information, but Kaplan used the Wayback Machine to preserve the social media exchanges.
In his research, Kaplan said he discovered an environment of mistrust online in the blockchain community and an abundance of scams. He followed how people were navigating the scams, warning one another online to be careful, and actually building trust in some cases. While blockchain is trying to build technologies that avoid trust in social interaction, Kaplan said it was interesting to observe blockchain enthusiasts engaging in trusting connections. He uses the text of tweets to build a corpus that he can then code and analyze to track or show trends.
The Wayback Machine can be helpful, Kaplan said, in finding preserved discussions on Twitter, early versions of company websites or pages that have been taken down altogether—a start-up company that went out of business, for example. “It’s important to be able to hold on to that [information] because our research takes place at a very specific moment in time and we want to be able to capture that specific moment,” Kaplan said.
The Internet Archive’s Open Library has also been essential in Kaplan’s work. When he was recently researching the invention of the “corporate culture” concept, he had trouble finding the first editions of many business books written in the late 80s and early 90s. His campus library often bought updated volumes, but Kaplan needed the originals. “I needed the first edition because I needed to know exactly what they said first and I was able to find that on the Internet Archive,” Kaplan said.
A precious tool. That’s how Laura Ranca describes the Wayback Machine in her work.
As a researcher at the Berlin-based organization Tactical Tech and its Exposing the Invisible Project, she helps people use technology to inform, educate and advance causes. Ranca trains journalists, human rights activists, scholars and everyday citizens to use the internet to investigate and gather evidence.
The Wayback Machine has been particularly useful in finding and retrieving lost websites, said Ranca. She also makes sure materials she produces are preserved online so future researchers can build on her work. As people try to document how the public is interacting with technology, the material stored by the Internet Archive has been essential to investigators, Ranca said.
“We face the challenge of websites and webpages being modified, altered or intentionally taken down. Sometimes it’s to hide something that was previously published, but is no longer relevant, or it now has maybe a different connotation than was intended,” Ranca said. “For us, this is very valuable to access historical records and to save different web pages and resources online using the Wayback Machine.”
When researching environmental issues, Ranca has discovered material that reflects missed early warning signs. Finding 20-year-old mining reports, video footage or other documentation affecting the climate can be important evidence in making the case for climate action. These items need to be protected, Ranca said, and the Wayback Machine provides that security. Ranca and the team at Exposing the Invisible conduct workshops on how to navigate the Wayback Machine, as well as train-the-trainer sessions on investigative skills more broadly. She also created guides on how to use Internet Archive content, available as open source through Creative Commons.