Category Archives: Wayback Machine – Web Archive

Reliving the Past & Redesigning the Present with Animated GIFs

Posted on March 25, 2024 by Chris Freeland

As an editorial strategist and tech journalist, JD Shadel spends a lot of time thinking about how the content on the internet continues to rapidly evolve. One telling example they’ve followed closely is the evolution of GIFs. Two decades ago, the web was filled with millions of jittery, pixelated, handmade GIFs wherever you looked. And for many of us, there’s a nostalgia for the early days of the web when things felt a bit wilder and untamed.

That nostalgia for the version of the internet they grew up with is what first sparked Shadel’s interest in collecting old-school GIFs. During the first months of pandemic lockdowns in 2020, Shadel started spending a lot of their spare time diving deep into the Internet Archive’s GifCities collection. Shadel’s personal fascination began with “under construction” GIFs, a rich niche in the GifCities collection full of animated construction workers and tools. Then came seeking out GIFs of Furbies, Tamagotchi, and other cultural touchstones that the 33-year-old came of age with online. Over the next few years, downloading and organizing GIFs became a hobby for Shadel.

Recently, it came time to update Shadel’s professional website. “It’s one of those evergreen chores it’s easy to obsess over as a freelancer, when your website is your calling card for new work,” said Shadel, who found themself digging back through the hundreds of GIFs they’ve curated thanks to the Internet Archive.

Early cyberspace-themed GIFs became the theme for their new and somewhat unconventional portfolio, which features more than two dozen images sourced entirely from GifCities. Users can, for example, click on a spinning globe for an introduction or a British Furby to learn about Shadel’s background as an American now based in London—including editorial work for outlets such as Vice, The Washington Post, and Conde Nast Traveler and consulting for clients including Airbnb and Adidas.

“I’m so happy GifCities exists to capture that specific snapshot of the internet,” Shadel said. “It really relates, metaphorically, to a lot of my work where the real world and the internet blur, where the digital and the physical intersect.”

In addition to GifCities, the Wayback Machine has also been useful to Shadel. Professionally, it is a resource when reporting and fact checking stories. Personally, they recently found material from a band they played in years ago.

“The Internet Archive just touches my digital life in so many different ways,” Shadel said. “As a journalist, it’s a fact-checking tool. Having the ephemeral internet preserved for future researchers, writers, reporters and editors is a huge service to democracy. And it’s also just fun.”

JD Shadel

On the website with its Space Jam-like navigation, Shadel wanted to reference the history of the internet — and maybe even inspire visitors to think more actively about their own role in charting the future. “I think we can reclaim our digital lives and rekindle the notion of ourselves as ‘netizens’—citizens of the internet and not just passive participants,” Shadel said.

“That’s why the work of the Internet Archive is so important,” they continued. “Despite the fact that we have access to more information than ever before, it’s really easy to forget digital histories and the lessons that we can learn from that.”

Shadel’s writing touches on a range of intersecting topics—such as tech, travel and queerness—but the one thing they hope everyone takes away from their work is the idea that we’re all netizens with a role to play in shaping what we want these shared public spaces to be.

“If we all have some shared sense of ownership of the internet, which is so involved in our lives, I believe we have a greater chance to make it better.” Sometimes, that can start in simple ways—in this case, building a DIY website with a bunch of old GIFs reminded one tech journalist in London that there are lessons we can take from the early internet. “We all have a part to play in making the internet a better place.” And at the least, they hope you enjoy the GIFs they’ve selected.

Digital archives: a time machine for the web

Posted on March 5, 2024 by Chris Freeland

This post was originally published in a newsletter by Project Liberty, February 20, 2024. Image by Project Liberty.

In the summer of 2023, the New York Times ran an article titled “Ways You Can Still Cancel Your Federal Student Loan Debt.”

The article outlined six ways to cancel student debt, with the final being:

“Death
This is not something that most people would choose as a solution to their debt burden.”

At least that was the sixth reason until the New York Times revised it with a stealth edit. When you read the article today, choosing death as a solution to a debt burden has been replaced, but there’s no mention that this article was revised. The timestamp is still the day it was originally published.

If not for Internet Archive’s Wayback Machine, this discrepancy wouldn’t have been caught. The Wayback Machine is a digital archive of the internet, and as such, it captured multiple previous versions.

The internet is constantly being revised in ways that allow history to be rewritten and a shared sense of truth to be questioned. With AI-generated disinformation, the potential to exert control over the future by rewriting the past has never been greater.

This week we’re exploring how digital archives are crucial in developing a record of truth in an ever-changing web.

The need for digital archives

Mark Graham, Director of the Wayback Machine, spoke with the Project Liberty Foundation and shared the key reasons why there’s an even greater need for digital archives:

The importance of the internet. So much of what humanity publishes and makes available lives only on the internet. Given how much time we spend online, the internet has become a central medium of human expression, history, and culture.

The fragile and ephemeral nature of the internet. Graham shared two stats that underscore how fragile today’s internet is:

A study found that of the two million hyperlinks in New York Times articles from 1996 to 2019, 25% of all links were broken (described as link rot).
The Wayback Machine has repaired 20 million broken links in Wikipedia articles, replacing them with working ones.

“The web itself is a living thing. Webpages change. They go away on quite a frequent basis. There’s no backup system or version control system for the web,” Graham explained. That is, except for archives like the Wayback Machine.

The Wayback Machine

The Wayback Machine is a “time machine for the web,” in Graham’s words. It allows users to trace the evolution (or disappearance) of a webpage over time, enabling them to establish a record of what happened on the internet.

For example, the Apple.com URL has been archived 539,000 times since its first archived page in October 1996.
The Wayback Machine has archived over 866 billion webpages in its 28-year history. Today, it archives hundreds of millions of webpages every day and has become one of the most important archives of online content in the world.

How it works

The Wayback Machine “crawls” the web and downloads publicly accessible information. Webpages, documents, and data are stored with a time-stamped URL.
For information that’s not publicly accessible, Internet Archive offers web archiving services through Archive-It for 1,200 organizations in 24 countries around the world (from libraries to research institutions).
The Wayback Machine also relies on everyday people to help it archive the internet. Anyone can go to Save Page Now to archive a webpage or article.
The Wayback Machine partners with 1,200 fact-checking organizations globally to help it reference material on the web that was the source of disinformation. It has built a library of more than 200,000 examples where a claim has been made, and the Wayback Machine has provided additional context on whether that claim is true (known as a review of the claim).
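
As a rough illustration of the time-stamped URLs and Save Page Now mentioned above, the sketch below builds the two address forms used on web.archive.org. The helper names are made up for this example, and the URL patterns are the publicly visible ones rather than anything spelled out in this newsletter.

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "https://web.archive.org/web"
SAVE_PREFIX = "https://web.archive.org/save"


def wayback_url(page_url: str, when: datetime) -> str:
    """Build a link to the capture nearest to `when` (hypothetical helper)."""
    # Wayback timestamps are 14 digits in UTC: YYYYMMDDhhmmss.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{page_url}"


def save_page_now_url(page_url: str) -> str:
    """Build the Save Page Now address for a page (hypothetical helper)."""
    return f"{SAVE_PREFIX}/{page_url}"


if __name__ == "__main__":
    october_1996 = datetime(1996, 10, 1, tzinfo=timezone.utc)
    print(wayback_url("https://www.apple.com/", october_1996))
    print(save_page_now_url("https://example.com/article"))
```

Opening the first address in a browser should land on the capture nearest to the requested moment; the second asks the Wayback Machine to make a fresh capture.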

Archive of facts

Fixing links, archiving webpages, and fact-checking digital articles are part of a deeper, more important project to chronicle digital history and establish a record of facts.

Last month, the archive of press releases from a sitting member of Congress, New York’s Elise Stefanik, vanished after she came under scrutiny. The Wayback Machine documented this erasure and provided a time-stamped record of past versions of her website and press releases.
In 2018, a US Appeals court ruled that the Wayback Machine’s archive of webpages can be used as legitimate legal evidence.
The Internet Archive has countless examples of when the press have referenced the Wayback Machine to correct disinformation and dispel rumors. In one example from last year, the Associated Press relied on the Wayback Machine to set the record straight that the CDC did not say the polio vaccine gave millions of Americans a “cancer virus.”

With the rise of AI-generated disinformation, there’s reason to believe such attempts at rewriting history (even if that history is just yesterday) will become more prevalent, and that the social contract that has governed web crawlers is coming to an end.

A citizen-powered web

Building digital archives is a bulwark against those attempting to rewrite history and spread misinformation. An archived, time-stamped webpage is not just unimpeachable evidence, it’s a foundational building block of a shared sense of reality.

In 2014, when Malaysia Airlines Flight 17 went down over Ukraine, the Wayback Machine captured evidence that a pro-Russian group was behind the missile attack. But it wasn’t the Wayback Machine’s algorithms that captured the evidence by crawling the internet; it was an individual who found an obscure blog post from a Ukrainian separatist leader touting the shooting down of a plane. That individual identified the blogpost as important enough to be archived, and it became a critical piece of evidence, even after that post disappeared from the internet.

As Graham said, “You don’t know what you got until it’s gone. If you see something, save something.”

What pages can you help archive? Archive them with the Wayback Machine on Save Page Now.

Fair Use in Action at the Internet Archive

Posted on March 1, 2024 by Lila Bailey

As we celebrate Fair Use/Fair Dealing Week, we are reminded of all the ways these flexible copyright exceptions enable libraries to preserve materials and meet the needs of the communities they serve. Indeed, fair use is essential to the functioning of libraries, and underlies many of the ordinary library practices that we all take for granted. In this blog post, we wanted to describe a few of the ways the fair use doctrine has helped us build our library.

Fair use in action: Web Archives and the Wayback Machine

The Internet Archive has been archiving the web since the mid-1990s. Our web collection now includes more than 850 billion web pages, with hundreds of millions added each day. The Wayback Machine is a free service that lets people visit these archived websites. Users can type in a URL, select a date range, and then begin surfing on an archived version of the web.

Web archives are used for a variety of important purposes, many of which are themselves fair uses. News reporting and investigative journalism is one such use of the Wayback Machine. Indeed, thousands of news articles have relied upon historical versions of the web from the Wayback Machine. Just last week, 13 links to the Wayback Machine were used in a CNN story about an Ohio GOP Senate candidate’s previous statements that were critical of former President Trump. Our web archive also becomes an urgent backup for media sites that are shut down suddenly, whether by authoritarian governments or for other reasons, often becoming the only accessible source both for the authors of these stories and for the public. Another important purpose web archives can serve is as evidence in legal disputes. Attorneys use the Wayback Machine in their daily practice for evidentiary and research purposes. In 2023 alone, the Internet Archive attested to 450 affidavits in cases where Wayback Machine captures were used as evidence in court.

The Wayback Machine also makes other parts of the web, such as Wikipedia, more useful and reliable. To date, the Internet Archive has been able to repair over 19 million broken links, that is, URLs that had returned a 404 (Page Not Found) error message, across 320 different Wikipedia language editions. There are many reasons, including bit rot and content drift, why links stop working. Restoring links ensures that Wikipedia remains an accurate and verifiable source of information for the public good. And we hope to build new tools and partnerships to help create a more dependable knowledge ecosystem as more and more content on the web is created by generative AI.

The Fair Use doctrine is broadly considered to be what makes web archiving possible. Without it, much of our knowledge and cultural heritage–huge amounts of which are now artifacts in digital form–would be at risk. In today’s chaotic information ecosystem, safeguarding this material in an open, accessible, and transparent way is vital for history and vital for democracy.

Fair use in action: Manuals collection

Whether you are an individual who has rendered an appliance useless because you lost the instructions, or a professional mechanic looking to fix an old vehicle, owners’ manuals are invaluable. As the right to repair movement has amply demonstrated, copyright should not stand as an obstacle to using machines you’ve bought and paid for. This is a place where fair use can shine.

Over the years, the Internet Archive has received manuals, instruction sheets and informational pamphlets of all kinds. The Manuals collection has well over a million items for users to access 24/7 at no cost. This resource gives people the right to repair and extend the life of their products. Whether you are a rocket scientist needing to operate your space shuttle, a mechanic who needs to repair a vintage VW Bug, or a curious kid trying to fix up your mom’s old computer, having free online access to the technical documentation you need is essential. And in many cases, there would appear to be no other way to get access to this crucial information.

Some preserved manuals are a single printed page with poorly constructed diagrams. Others are multi-volume tomes that give exacting details on operation of a complex piece of machinery. These materials are more than instructions or a list of components. They reflect the priorities and approaches that companies and individuals take with products, as well as the artistic and visual efforts to make an item clear to the reader.

This collection is a cool example of how fair use provides a framework for the Internet Archive to share critical knowledge with consumers. At the same time, it provides a historical timeline of sorts for innovation and the development of technology.

From preserving our digital history to providing access to manuals of obsolete devices, fair use helps libraries like ours serve our community. And while there are no doubt a variety of commercial projects that properly rely on fair use, fair use is at heart about the public good. As we celebrate Fair Use week, we should remember the crucial role it plays, and ensure that we preserve and protect fair use for the good of future generations. For more on events and news on Fair Use/Fair Dealing Week, visit FairUseWeek.org.

Genealogist uncovers family histories with help of Internet Archive

Posted on December 11, 2023 by Caralee Adams

In tracing her family history, Taneya Koonce discovered stories about her African American ancestors in records going back to the late 1700s. Many were enslaved. She followed the path of some descendants from North Carolina to New York in the Great Migration.

Taneya Koonce

The Internet Archive is among the many sources that Koonce has relied on in her research. From her home in Tampa, Florida, she regularly accesses the collection’s online yearbooks, newspapers, location histories, and government records to piece together her family’s story—and has also contributed material in hopes of helping others.

“As a genealogist and family historian, the breadth of digitized materials in the Internet Archive is essential to my research and an invaluable source of information in my family history quest,” said Koonce, who works as an information scientist at an academic medical center.

Koonce began to record stories in her family by interviewing her grandmothers nearly 30 years ago. She learned about several siblings of her maternal grandmother who died in infancy and the hardships they faced in life. Rediscovering her notes from those conversations after her grandmothers died, Koonce began to dive into genealogy in earnest in 2005.

Her interest turned from a hobby to a passion in recent years. Koonce maintains a family genealogy website, created a web database for research of Koonce surnames from all over the country, publishes on her genealogy blog, and runs a collaborative genealogy-focused online community, the Academy of Legacy Leaders.

Having found so many historical items on the Internet Archive, Koonce teaches others how to use the collection in their own research. She’s active in genealogy societies, frequently presenting to others about the wealth of materials online.

Koonce applauded the Archive for preserving New York voter lists that helped her find one of her ancestors. After researching slaveholders by the name of Koonce, she connected with a man in Wisconsin who had published a “Koonce to Koonce” newsletter on the family’s history. With his cooperation, Taneya digitized and uploaded the newsletter to the Archive to preserve it for others. She always documents her findings, should they be of interest to others pursuing their family history.

“I specialize in helping family historians be very cognizant about planning for the future and leaving a legacy,” said Koonce, who has presented about the importance of saving family history research for the next generation. “One strategy is sharing material on the Internet Archive. I want to help educate people that it is a library. It’s dedicated to preserving content for the future. If we can contribute information to the collection, we can spread the word about what we’re doing and make sure it’s long lasting.”

Using the Wayback Machine to Understand the Cultural Roots of New Technologies

Posted on November 6, 2023 by Caralee Adams

As an academic librarian helping connect students and faculty with the research materials they need, Sanjeet Mann has turned to the Internet Archive many times.

“I really value having the Wayback Machine as an additional tool in my librarian’s toolbox,” Mann said. “Information preservation is an essential, but often overlooked, part of the infrastructure for teaching and learning.”

Mann, currently working as the Systems & Discovery Librarian at California State University, San Bernardino (CSUSB), said he first learned about the value of the Internet Archive in 2006 during his library science master’s program.

Over his career, Mann has worked at various libraries, tapping into the Archive on the job.

Assisting budding writers, composers and artists as Arts Librarian at University of Redlands, Mann found that the vast amount of free information online, including biographies, can shape students’ projects.

“We can draw on the Archive whenever we need inspiration for creative work, or when we need to understand how current scholarship and the issues that we’re facing now aren’t completely new—they’re based on this history of work by scholars, by politicians, by citizens active in the public interest,” he said. “These issues tend to recur over time. As a society, we need to know where we have been in order to meet the challenges of the future.”

At CSUSB, Mann also helps computer science and business students use the Archive’s collections to better understand the cultural roots of new technologies—the historical context for their innovations.

“It is the only entity I’m aware of that preserves the Internet’s scholarly and historical record at this scale,” Mann said.

On a practical note, Mann leveraged information through the Wayback Machine when he was researching how to set up a campus laptop loaner program for University of Redlands. This can be an essential service that libraries provide students who have trouble with their computers.

Mann wanted to understand policies at other universities, such as how they handled the return of damaged laptops. Looking at archived versions of university library websites through the Wayback Machine, Mann was able to learn about other approaches and find contacts to follow up for additional details.

The Internet Archive is a source to verify information that is no longer listed on websites, he said.

“Companies themselves don’t have any incentive to archive the history of their website. New products get launched. The platform gets migrated from one platform to another,” Mann said. “An organization like the Internet Archive, being a library, is uniquely positioned to meet the need in society of ensuring some kind of continuity of memory and having a public record. Especially with the government being very partisan these days, I think there’s value in the Internet Archive being an independent, not-for-profit that operates in the public interest.”

Mann added: “Without the Archive, we would lose decades of information about our society at a crucial turning point in its development, eroding trust in online systems and requiring educators, students and researchers to reconsider the way we do our work and share it with others.”

Moving Getty.edu “404-ward” With Help From The Internet Archive API

Posted on November 2, 2023 by jefferson

This is a guest post from Teresa Soleau (Digital Preservation Manager), Anders Pollack (Software Engineer), and Neal Johnson (Senior IT Project Manager) from the J. Paul Getty Trust.

Project Background

Getty pursues its mission in Los Angeles and around the world through the work of its constituent programs—Getty Conservation Institute, Getty Foundation, J. Paul Getty Museum, and Getty Research Institute—serving the general interested public and a wide range of professional communities to promote a vital civil society through an understanding of the visual arts.

In 2019, Getty began a website redesign project, changing the technology stack and updating the way we interact with our communities online. The legacy website contained more than 19,000 web pages and we knew many were no longer useful or relevant and should be retired, possibly after being archived. This led us to leverage the content we’d captured using the Internet Archive’s Archive-It service.

We’d been crawling our site since 2017, but had treated the results more as a record of institutional change over time than as an archival resource to be consulted after deletion of a page. We needed to direct traffic to our Wayback Machine captures, ensuring that deleted pages remain accessible when a user requests a deprecated URL. We decided to dynamically display a link to the archived page from our site’s 404 error “Page not found” page.

Getty.edu 404 error “Page not found” message including the dynamically generated instructions and Internet Archive page link.

The project to audit all existing pages required us to educate content owners across the institution about web archiving practices and purpose. We developed processes for completing human reviews of large amounts of captured content. This work is described in more detail in a 2021 Digital Preservation Coalition blog post that mentions the Web Archives Collecting Policy we developed.

In this blog post we’ll discuss the work required to use the Internet Archive’s data API to add the necessary link on our 404 pages pointing to the most recent Wayback Machine capture of a deleted page.

Technical Underpinnings

Implementation of our Wayback Machine integration was very straightforward from a technical point of view. The first example on the Wayback Machine APIs documentation page gave us the technical guidance we needed to display a link to the most recent capture of any page deleted from our website. With no requirements for authentication, management of keys, or platform-specific software development kit (SDK) dependencies, our development process was simplified. We chose to incorporate the Wayback API using Nuxt.js, the web framework used to build the new Getty.edu site.

Since the Wayback Machine API is highly performant for simple queries, with a typical response delay in milliseconds, we are able to query the API before rendering the page using a Nuxt route middleware module. API error handling and a request timeout were added to ensure that edge cases such as API failures or network timeouts do not block rendering of the 404 response page.
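
The approach described above can be sketched roughly as follows. This is not Getty’s Nuxt middleware; it is a minimal Python sketch, assuming the public availability endpoint at archive.org/wayback/available and its `archived_snapshots.closest` response shape. The function names, the two-second timeout, and the injectable `fetch` parameter are choices made for this example.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional

# Assumed endpoint: the public Wayback availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def fetch_availability(page_url: str, timeout: float = 2.0) -> dict:
    """Ask the availability endpoint about `page_url` and return parsed JSON."""
    query = urllib.parse.urlencode({"url": page_url})
    request_url = f"{AVAILABILITY_ENDPOINT}?{query}"
    with urllib.request.urlopen(request_url, timeout=timeout) as response:
        return json.load(response)


def closest_capture(payload: dict) -> Optional[str]:
    """Return the closest capture's URL, or None if nothing usable is listed."""
    if not isinstance(payload, dict):
        return None
    closest = (payload.get("archived_snapshots") or {}).get("closest") or {}
    if closest.get("available") and closest.get("url"):
        return closest["url"]
    return None


def archived_link_for_404(
    page_url: str,
    fetch: Callable[[str], dict] = fetch_availability,
) -> Optional[str]:
    """Look up an archive link without letting a failure block the 404 page."""
    try:
        return closest_capture(fetch(page_url))
    except (OSError, ValueError):
        # OSError covers network errors and timeouts; ValueError covers bad JSON.
        # In either case the 404 page is rendered without an archive link.
        return None
```

Passing `fetch` as a parameter keeps the lookup testable without a network call, and returning `None` on any failure mirrors the requirement that the 404 page render regardless of what the API does.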

The only Internet Archive API feature missing for our initial list of requirements was access to snapshot page thumbnails in the JSON data payload received from the API. Access to these images would allow us to enhance our 404 page with a visual cue of archived page content.

Results and Next Steps

Our ability to include a link to an archived version of a deleted web page on our 404 response page helped ease the tough decisions content stakeholders were obliged to make about what content to archive and then delete from the website. We could guarantee availability of content in perpetuity without incurring the long term cost of maintaining the information ourselves.

The API brings back the most recent Wayback Machine capture by default, which is sometimes not created by us and hasn’t necessarily passed through our archive quality assurance process. We intend to develop our application further so that we privilege the display of Getty’s own page captures. This will ensure we’re delivering the highest quality capture to users.

Google Analytics has been configured to report on traffic to our 404 pages and will track clicks on links pointing to Internet Archive pages, providing useful feedback on what portion of archived page traffic is referred from our 404 error page.

To work around the challenge of providing navigational affordances to legacy content and ensure web page titles of old content remain accessible to search engines, we intend to provide an up-to-date index of all archived getty.edu pages.

As we continue to retire obsolete website pages and complete this monumental content archiving and retirement effort, we’re grateful for the Internet Archive API which supports our goal of making archived content accessible in perpetuity.

Grad Student Finds Nostalgic ‘Treasure Trove of Goodies’ Through the Internet Archive

Posted on October 23, 2023 by Caralee Adams

As Elena Rowan researches the ways that activist archivers gather and make sense of data, she often relies on the Internet Archive. She is a graduate student in sociology at Concordia University in Montreal, Canada, with an interest in the debate around copyright and e-books in public libraries.

“I look at why archives and libraries are important to society and culture as a whole,” said Rowan, who uses materials preserved in the Wayback Machine and the Internet Archive. “Without the Internet Archive, so much of the knowledge and information on the Internet would be lost, and most of my research would be impossible.”

Rowan is in her second year of her master’s program and works as a research assistant at the Data Justice Hub. It is a collaborative research project that pursues data-related skills development for social activists, critical researchers and the general public, and aims to understand how data activists gather and make sense of data.

The Internet Archive has been valuable, she said, in providing information for the project and its podcast, Data Decoded.

For a recent class on sociology theory, Rowan said she’s found it useful to search for work by early researchers such as W.E.B. Du Bois in the Internet Archive’s collection. Her university library has a wealth of materials, but she says there are times when she can only find an older book through the Archive and, being digital, it’s easier to locate.

With an event sponsored by the Milieux Institute, which offers programs at the intersection of fine arts, digital culture, and information technology, Rowan leveraged the Internet Archive in another way. She created a one-hour Curating Nostalgia workshop where participants could explore resources in the digital collection to create their own personal nostalgia archive.

Logging into the Internet Archive, Rowan taught people how to search for historical documents and pop culture items. For example, she found a beloved video game that came in a cereal box from her childhood, as well as an audio walking tour of her neighborhood from a decade earlier before gentrification changed the landscape. Other workshop participants found books they read as kids, Club Penguin memorabilia and a Nancy Drew game.

“For scholarly work and nostalgia researchers, it’s a treasure trove of goodies,” Rowan says of the Internet Archive.

In her personal life, Rowan said she’s enjoyed perusing old magazines and obscure cookbooks. She’s found recipes for ambitious cakes, sewing patterns and vintage designs that give her ideas for how to pull together her eclectic mix of old furniture.

“The colors, writing and patterns of the past offer infinite inspiration for creative hobbies and help cultivate domestic bliss,” she said. “I am grateful to everyone at the Internet Archive for creating, maintaining and continuing to expand and fight for this truly amazing public resource!”

Internet Archive Celebrates Research and Research Libraries at Annual Gathering

Posted on October 19, 2023 by Caralee Adams

At this year’s annual celebration in San Francisco, the Internet Archive team showcased its innovative projects and rallied supporters around its mission of “Universal Access to All Knowledge.”

Brewster Kahle, Internet Archive’s founder and digital librarian, welcomes hundreds of guests to the annual celebration on October 12, 2023.

“People need libraries more than ever,” said Brewster Kahle, founder of the Internet Archive, at the October 12 event. “We have a set of forces that are making libraries harder and harder to happen—so we have to do something more about it.”

Efforts to ban books and defund libraries are worrisome trends, Kahle said, but there are hopeful signs and emerging champions.

Watch the full live stream of the celebration

Among the headliners of the program was Connie Chan, Supervisor of San Francisco’s District 1, who was honored with the 2023 Internet Archive Hero Award. In April, she authored a resolution, passed unanimously by the San Francisco Board of Supervisors, backing the Internet Archive and the digital rights of all libraries.

Chan spoke at the event about her experience as a first-generation, low-income immigrant who relied on books in Chinese and English at the public library in Chinatown.

Watch Supervisor Chan’s acceptance speech

“Having free access to information was a critical part of my education—and I know I was not alone,” said Chan, who is a supporter of the Internet Archive’s role as a digital, online library. “The Internet Archive is a hidden gem…It is very critical to humanity, to freedom of information, diversity of information and access to truth…We aren’t just fighting for libraries, we are fighting for our humanity.”

Several users shared testimonials about how resources from the Internet Archive have enabled them to advance their research, fact-check politicians’ claims, and inspire their creative works. Content in the collection is helping improve machine translation of languages. It is preserving international television news coverage and Ukrainian memes on social media during the war with Russia.

Quinn Dombrowski, of the Saving Ukrainian Cultural Heritage Online project, shows off Ukrainian memes preserved by the project.

Technology is changing things—some for the worse, but a lot for the better, said David McRaney, speaking via video to the audience in the auditorium at 300 Funston Ave. “And when [technology] changes things for the better, it’s going to expand the limited capabilities of human beings. It’s going to extend the reach of those capabilities, both in speed and scope,” he said. “It’s about a newfound freedom of mind, and time, and democratizing that freedom so everyone has access to it.”

Open Library developer Drini Cami explained how the Internet Archive is using artificial intelligence to improve access to its collections.

When a book is digitized, scanning operators used to have to crop photographs of the pages manually. The Internet Archive recently trained a custom machine learning model to suggest page boundaries automatically, allowing staff to double their processing rate. In addition, an open-source machine learning tool converts images into text, making books searchable and making the collection available for bulk research, cross-referencing, and text analysis, as well as for reading aloud to people with print disabilities.

“Since 2021, we’ve made 14 million books, documents, microfiche, records—you name it—discoverable and accessible in over 100 languages,” Cami said.

As AI technology advanced this year, Internet Archive engineers piloted a metadata extractor, a tool that automatically pulls key data elements from digitized books. This extra information helps librarians match the digitized book to other cataloged records, beginning to resolve the backlog of books with limited metadata in the Archive’s collection. AI is also being leveraged to assist in writing descriptions of magazines and newspapers—reducing the time from 40 to 10 minutes per item.

“Because of AI, we’ve been able to create new tools to streamline the workflows of our librarians and the data staff, and make our materials easier to discover, and work with patrons and researchers,” Cami said. “With new AI capabilities being announced and made available at a breakneck rate, new ideas of projects are constantly being added.”

Jamie Joyce & AI hackathon participants.

A recent Internet Archive hackathon explored the risks and opportunities of AI by using the technology itself to generate content, said Jamie Joyce, project lead with the organization’s Democracy’s Library project. One of the hackathon volunteers created an autonomous research agent to crawl the web and identify claims related to AI. With a prompt-based model, the machine was able to generate nearly 23,000 claims from 500 references. The information could be the basis for creating economic, environmental and other arguments about the use of AI technology. Joyce invited others to get involved in future hackathons as the Internet Archive continues to expand its AI potential.

Peter Wang, CEO and co-founder at Anaconda, said interesting kinds of people and communities have emerged around cultures of sharing. For example, those who participate in the DWeb community are often both humanists and technologists, he said, with an understanding about the importance of reducing barriers to information for the future of humanity. Wang said rather than a scarcity mindset, he embraces an abundant approach to knowledge sharing and applying community values to technology solutions.

“With information, knowledge and open-source software, if I make a project, I share it with someone else, they’re more likely to find a bug,” he said. “They might improve the documentation a little bit. They might adapt it for a novel use case that I can then benefit from. Sharing increases value.”

The Internet Archive’s Joy Chesbrough, director of philanthropy, closed the program by expressing appreciation for those who have supported the digital library, especially in these precarious times.

“We are one community tied together by the internet, this connected web of knowledge sharing. We have a commitment to an inclusive and open internet, where there are many winners, and where ethical approaches to genuine AI research are supported,” she said. “The real solution lies in our deep human connection. It inspires the most amazing acts of generosity and humanity.”

***

If you value the Internet Archive and our mission to provide “Universal Access to All Knowledge,” please consider making a donation today.

From Fake News to Open Data: Studying the Histories of Digital Media Using the Wayback Machine

Posted on October 9, 2023 by Caralee Adams

As scholars of digital media studies, Liliana Bounegru and Jonathan Gray say the Internet Archive preserves artifacts that are integral to their work.

The two academics work at King’s College London in the Department of Digital Humanities—Bounegru is a lecturer in digital media and Gray is a senior lecturer in critical infrastructure studies. They are both interested in studying how media has changed with digital technology. The Internet Archive collection has been useful as they examine the history of the web, trends and evolution of websites and changes in technology, society and culture.

In one study of online myths and disinformation, the researchers used the Wayback Machine to understand how tracker signatures (snippets of code that embed ads and analytics on a website) of viral “fake news” sites changed over time. As websites were blacklisted from major ad networks, the researchers looked up archived versions of the sites to follow how their ad-based money-making practices changed. This project was completed in collaboration with BuzzFeed News, which published an article about the findings and analytical techniques.

This investigation builds on work that Bounegru and Gray did with First Draft, a nonprofit that works with journalists to support investigations around misinformation. They analyzed the tracker signatures of mainstream news sites alongside those of junk news sites to understand their different monetization and audience economics practices.
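For readers curious how a tracker-signature comparison might work in practice, here is a minimal Python sketch. It is not the researchers' actual method: the two regular expressions (classic Google Analytics and AdSense identifiers), the function names, and the sample HTML snippets are all assumptions made for illustration. The only detail taken as given is the familiar Wayback Machine address format, in which a capture timestamp precedes the original URL.

```python
import re
from typing import Dict, Set

# Assumed signature patterns (illustrative only): classic Google Analytics
# property IDs and Google AdSense publisher IDs embedded in page source.
TRACKER_PATTERNS = {
    "google_analytics": re.compile(r"\bUA-\d{4,10}-\d{1,4}\b"),
    "adsense": re.compile(r"\bca-pub-\d{10,20}\b"),
}


def wayback_url(timestamp: str, url: str) -> str:
    """Build the address of an archived capture (timestamp: YYYYMMDDhhmmss)."""
    return f"https://web.archive.org/web/{timestamp}/{url}"


def extract_signatures(html: str) -> Dict[str, Set[str]]:
    """Collect every tracker ID found in one capture's HTML, by tracker type."""
    return {name: set(p.findall(html)) for name, p in TRACKER_PATTERNS.items()}


def diff_signatures(earlier: Dict[str, Set[str]],
                    later: Dict[str, Set[str]]) -> Dict[str, Dict[str, Set[str]]]:
    """Report which IDs disappeared or appeared between two captures."""
    result = {}
    for name in TRACKER_PATTERNS:
        before = earlier.get(name, set())
        after = later.get(name, set())
        result[name] = {"removed": before - after, "added": after - before}
    return result


# Hypothetical page source for two captures of the same site.
SNAPSHOT_EARLY = (
    '<script>ga("create", "UA-1234567-1");</script>'
    '<ins data-ad-client="ca-pub-1234567890123456"></ins>'
)
SNAPSHOT_LATE = '<script>ga("create", "UA-1234567-1");</script>'

changes = diff_signatures(
    extract_signatures(SNAPSHOT_EARLY),
    extract_signatures(SNAPSHOT_LATE),
)
print(changes)
```

In this made-up example the analytics ID persists while the ad network ID vanishes from the later capture, the kind of change one might expect after a site loses access to an ad network. A real study would fetch the HTML from archived captures and would need a far broader set of signatures.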

As a result of their investigations, the researchers created A Field Guide to Fake News that explores how digital methods can be used to study false viral news, political memes, and trolling practices. “It became widely used by a network of hundreds of media organizations and fact checking groups as well as for training people doing investigative work on disinformation,” Gray said. Together with other collaborators at the Public Data Lab which they co-founded, Bounegru and Gray wrote a paper in New Media & Society about the threat of misleading junk news on social, economic and political life and the questions that it raises about social media and online content sharing platforms.

Gray has long been interested in the politics of open and public data and is writing a book on the subject. This involves tracing how open data policies and practices have developed around the world, and he said it’s been valuable to be able to search and analyze open data websites through the Wayback Machine. As part of research for the book he published an article in Data & Policy, from Cambridge University Press, about the rise of data portals as online devices for making data public.

“In the case of data portals such as data.gov.uk we see a shift from more sociable and experimental design approaches aiming to surface questions, engage communities and support cultures of socially oriented invention to more muted, minimal expert facing infrastructures,” said Gray. “It could be considered a certain kind of success for open data advocates that portals have become so established and institutionalized, but also suggests that maybe there’s less interest in being inclusive, accessible, responsive or thoughtful in reaching communities that may be less technically oriented or those who don’t already know what they are looking for or what kinds of data is likely to be found.”

In working with their students, both Bounegru and Gray share ways that the Internet Archive can be useful for research. Through hands-on research activities with the Wayback Machine they explore how it can show how web content, user interfaces and web categories change. It can even provide evidence of broader societal change, such as how political views have shifted over time. The Archive can reveal large-scale changes and allow researchers, journalists, students and community groups to gain a richer appreciation of digital media history.

Added Bounegru: “We use the Internet Archive a lot. It is an essential tool for our research.”

Slide on how the Wayback Machine is being used, from Bounegru and Gray’s “web histories” class, part of a digital methods course at King’s College London.

Student’s Use of Internet Archive Expands from High School to College

Posted on September 18, 2023 by Caralee Adams

Rachel Simmons first used the Wayback Machine for research projects at her Sacramento, California, high school. Now a senior at UCLA, she’s discovered even more ways to find material not available elsewhere.

Simmons, whose mother and grandmother were both librarians, is an applied math major with a minor in film, television and digital media. As she looks up information about media figures or needs to find a rare film, she says the Internet Archive’s digital collection has been an invaluable resource.

“It’s really great to have access to information for anyone to use from their home computer,” Simmons says. “I don’t physically have to go into a library. If I’m working on something late at night, it’s convenient.”

When taking a class on American film history last year, she was assigned to research a famous actor; she chose Peter Lorre.

“I’m a big fan of classic horror films and he’s an icon whose legacy has continued long past his career,” she said. “I just wanted to learn more about him and what people thought of him at the time.”

To find those contemporary views of Lorre’s work, Simmons turned to the fan magazine collection in the Archive’s Media History Digital Library. There she found interviews with the actor and reviews of his movies from the 1930s. Although Lorre appeared as a mysterious figure on film, Simmons says she learned that the interviews present him as a conventional, regular guy. She gained even more insight through the published fan letters in the magazines. “I found it really interesting that I was reading these letters from almost one hundred years ago,” Simmons said.

For another UCLA course, Simmons tapped into the Internet Archive to view silent German films that were discussed in class. While she was studying, Simmons found herself stumbling onto trailers for other films, which led her to checking out similar movies for fun after her projects were complete. Many of the more obscure titles that interest her are not available on streaming services, she notes.

Simmons says she tells others about the resources available through the Internet Archive—including her family of librarians.
