Category Archives: Web & Data Services

Working to Advance Library Support for Web Archive Research 

This spring, the Internet Archive hosted two in-person workshops aimed at helping to advance library support for web archive research: Digital Scholarship & the Web and Art Resources on the Web. These one-day events were held at the Association of College & Research Libraries (ACRL) conference in Pittsburgh and the Art Libraries Society of North America (ARLIS) conference in Mexico City. The workshops brought together librarians, archivists, program officers, graduate students, and disciplinary researchers for full days of learning, discussion, and hands-on experience with web archive creation and computational analysis. The workshops were developed in collaboration with the New York Art Resources Consortium (NYARC) and are part of an ongoing series of workshops hosted by the Internet Archive through summer 2023.

Internet Archive Deputy Director of Archiving & Data Services Thomas Padilla discussing the potential of web archives as primary sources for computational research at Art Resources on the Web in Mexico City.

Designed in direct response to library community interest in supporting additional uses of web archive collections, the workshops had the following objectives: introduce participants to web archives as primary sources in the context of computational research questions; develop familiarity with research use cases that make use of web archives; and provide an opportunity to acquire hands-on experience creating web archive collections and computationally analyzing them using ARCH (Archives Research Compute Hub), a new service set to launch publicly in June 2023.

Internet Archive Community Programs Manager Lori Donovan walking workshop participants through a demonstration of Palladio using a dataset generated with ARCH at Digital Scholarship & the Web in Pittsburgh, PA.

In support of those objectives, Internet Archive staff walked participants through web archiving workflows, introduced a diverse set of web archiving tools and technologies, and offered hands-on experience building web archives. Participants were then introduced to Archives Research Compute Hub (ARCH). ARCH supports computational research with web archive collections at scale – e.g., text and data mining, data science, digital scholarship, machine learning, and more. ARCH does this by streamlining generation of and access to more than a dozen research-ready web archive datasets, in-browser visualization, dataset analysis, and open dataset publication. Participants further explored data generated with ARCH in Palladio, Voyant, and RAWGraphs.

Network visualization of the Occupy Web Archive collection, created using Palladio based on a Domain Graph Dataset generated by ARCH.

Gallery visualization of the CARTA Art Galleries collection, created using Palladio based on an Image Graph Dataset generated by ARCH.

At the close of the workshops, participants were eager to discuss web archive research ethics, research use cases, and a diverse set of approaches to scaling library support for researchers interested in working with web archive collections – truly vibrant discussions – and perhaps the beginnings of a community of interest!  We plan to host future workshops focused on computational research with web archives – please keep an eye on our Event Calendar.

How do you use the Internet Archive in your research?

Tell us about your research & how you use the Internet Archive to further it! We are gathering testimonials about how our library & collections are used in different research projects & settings.

From using our books to check citations to doing large-scale data analysis using our web archives, we want to hear from you!

Share your story now!

Internet Archive Welcomes Digital Humanists and Cultural Heritage Professionals to “Humanities and the Web: Introduction to Web Archive Data Analysis”

By The Community Programs Team

On November 14, 2022, the Internet Archive hosted Humanities and the Web: Introduction to Web Archive Data Analysis, a one-day introductory workshop for humanities scholars and cultural heritage professionals. The group included disciplinary scholars and information professionals with research interests ranging from Chinese feminist movements, to Indigenous language revitalization, to the effects of digital platforms on discourses of sexuality and more. The workshop was held at the Central Branch of the Los Angeles Public Library and coincided with the National Humanities Conference.

Attendees and Facilitators at Humanities and the Web: Introduction to Web Archive Data Analysis, November 14, 2022, Los Angeles Public Library

The goals of the workshop were to introduce web archives as primary sources and to provide a sampling of tools and methodologies that could support computational analysis of web archive collections. Internet Archive staff shared web archive research use cases and provided participants with hands-on experience building web archives and analyzing web archive collections as data.

Senior Program Manager, Lori Donovan, guiding attendees in using Voyant to analyze text datasets extracted from an Archive-It collection using ARCH.

The workshop’s central feature was an introduction to ARCH (Archives Research Compute Hub). ARCH transforms web archives into datasets tuned for computational research, allowing researchers to, for example, extract all text, spreadsheets, PDFs, images, audio, named entities and more from collections. During the workshop, participants worked directly with text, network, and image file datasets generated from web archive collections. With access to datasets derived from these collections, the group explored a range of analyses using Palladio, RAWGraphs, and Voyant.

Visualization of the image files contained in the Chicago Architecture Biennial collection, created using Palladio based on an Image File dataset extracted from the collection using ARCH.

The high level of interest and participation in this event is indicative of the appetite within the Humanities for workshops on computational research. Participants described how the workshop gave them concrete language to express the challenges of working with large-scale data, while also expressing how the event offered strategies they could apply to their own research or could use to support their research communities. For those who were not able to make it to Humanities and the Web, we will be hosting a series of virtual and in-person workshops in 2023. Keep your eye on this space for upcoming announcements.

Digital Library of Amateur Radio & Communications Surpasses 25,000 Items

In the six weeks since announcing that Internet Archive has begun gathering content for the Digital Library of Amateur Radio and Communications (DLARC), the project has quickly grown to more than 25,000 items, including ham radio newsletters, podcasts, videos, books, and catalogs. The project seeks additional contributions of material for the free online library.

You are welcome to explore the content currently in the library and watch the primary collection as it grows at https://archive.org/details/dlarc.

The new material includes historical and modern newsletters from diverse amateur radio groups including the National Radio Club (of Aurora, CO); the Telford & District Amateur Radio Society, based in the United Kingdom; the Malta Amateur Radio League; and the South African Radio League. The Tri-State Amateur Radio Society contributed more than 200 items of historical correspondence, newspaper clippings, ham festival flyers, and newsletters. Other publications include Selvamar Noticias, a multilingual digital ham radio magazine; and Florida Skip, an amateur radio newspaper published from 1957 through 1994. The library also includes the complete run of 73 Magazine, more than 500 issues, all of which are freely and openly available.

More than 300 radio related books are available in DLARC via controlled digital lending. These materials may be checked out by anyone with a free Internet Archive account for a period of one hour to two weeks. Radio and communications books donated to Internet Archive are scanned and added to the DLARC lending library.

Amateur radio podcasts and video channels are also among the first batch of material in the DLARC collection. These include Ham Nation, Foundations of Amateur Radio, and the ICQ Amateur/Ham Radio Podcast, with many more to come. Providing a mirror and archive for “born digital” content such as video and podcasts is one of the core goals of DLARC.

Additions to DLARC also include presentations recorded at radio communications conferences, including GRCon, the GNU Radio Conference; and the QSO Today Virtual Ham Expo. A growing reference library of past radio product catalogs includes catalogs from Ham Radio Outlet and C. Crane.

DLARC is growing to be a massive online library of materials and collections related to amateur radio and early digital communications. It is funded by a significant grant from Amateur Radio Digital Communications (ARDC) to create a digital library that documents, preserves, and provides open access to the history of this community. 

If you have material to contribute to the DLARC library, questions about the project, or an interest in similar digital library building projects for other professional communities, please contact:

Kay Savetz, K6KJN
Program Manager, Special Collections
kay@archive.org
Mastodon: dlarc@mastodon.radio

Internet Archive Seeks Donations of Materials to Build a Digital Library of Amateur Radio and Communications

Internet Archive has begun gathering content for the Digital Library of Amateur Radio and Communications (DLARC), which will be a massive online library of materials and collections related to amateur radio and early digital communications. The DLARC is funded by a significant grant from Amateur Radio Digital Communications (ARDC), a private foundation, to create a digital library that documents, preserves, and provides open access to the history of this community.

The library will be a free online resource that combines archived digitized print materials, born-digital content, websites, oral histories, personal collections, and other related records and publications. The goals of the DLARC are to document the history of amateur radio and to provide freely available educational resources for researchers, students, and the general public. This innovative project includes:

  • A program to digitize print materials, such as newsletters, journals, books, pamphlets, physical ephemera, and other records from institutions, groups, and individuals.
  • A digital archiving program to archive, curate, and provide access to “born-digital” materials, such as digital photos, websites, videos, and podcasts.
  • A personal archiving campaign to ensure the preservation and future access of both print and digital archives of notable individuals and stakeholders in the amateur radio community.
  • Oral history interviews with key members of the community.
  • Preservation of all physical and print collections donated to the Internet Archive.

The DLARC project is looking for partners and contributors with troves of ham radio, amateur radio, and early digital communications related books, magazines, documents, catalogs, manuals, videos, software, personal archives, and other historical records collections, no matter how big or small. In addition to physical material to digitize, we are looking for podcasts, newsletters, video channels, and other digital content that can enrich the DLARC collections. Internet Archive will work directly with groups, publishers, clubs, individuals, and others to ensure the archiving and perpetual access of contributed collections, their physical preservation, their digitization, and their online availability and promotion for use in research, education, and historical documentation. All collections in this digital library will be universally accessible to any user and there will be a customized access and discovery portal with special features for research and educational uses.

We are extremely grateful to ARDC for funding this project and are very excited to work with this community to explore a multi-format digital library that documents and ensures access to the history of a specific, noteworthy community. If you have material to contribute to the DLARC library, questions about the project, or an interest in similar digital library building projects for other professional communities, please contact:

Kay Savetz, K6KJN
Program Manager, Special Collections
kay@archive.org
Twitter: @KaySavetz 

Mapping Principles for a Better World: Ethics of the Decentralized Web & Uses in Humanitarian Work

“Technology embodies a set of values,” opened Internet Archive’s Director of Partnerships, Wendy Hanamura, in Ethics of the Decentralized Web & Uses in Humanitarian Work, the final session with METRO Library Council and Library Futures. “What values drive our current web?”

While most of the audience responded by discussing web monetization or opining about the lack of privacy, many still believe in the power of the internet for better sharing. As we build a new web, most would like it to be driven by a different set of values, particularly community, collaboration, freedom, sovereignty, democracy, and trust.

Beginning with Mai Ishikawa Sutton’s work on the five principles of the DWeb and ending with a demo of the Mapeo project, this session brought in designers, coders, policy professors, and ethicists building a new “web for the people” that would embody the above values, and much more.

In 2016 at a campout in California, Sutton and a community of technology enthusiasts came together to rethink the values embedded in the technology of the web we use and the web we could build. While technology was a major factor in the resulting work, ethical considerations and standing for better technology were just as crucial. They created a document that reflected the interests and values of their community with five principles:

  1. Technology for Human Agency
  2. Distributed Benefits
  3. Mutual Respect
  4. Humanity
  5. Ecological Awareness

The group hopes to revisit some of these principles this summer at DWeb Camp 2022 to better define the “web that we want.” In the Q&A with Hanamura, Sutton clarified the ways in which the DWeb addresses crucial aspects of power, control, and capital. Rather than staying static or solely basing itself in technological innovation, the DWeb community is a way to ensure that benefits “flow back into the community.”

Author and Professor Nathan Schneider followed Sutton to discuss how human rights can be encoded into the blockchain. Schneider’s presentation, “Policy Proposals for Less Dystopian Crypto Protocols,” began by recognizing the issues within blockchain; he said he wishes to explore how crypto can be “up there with libraries” in terms of building “true civic institutions.” While recognizing that crypto could perpetuate the harms of the current, dystopian web, Schneider suggested that it could also present a new form of economic democracy and pluriverses for all. For Schneider, “if code is law,” there are a number of policy proposals that can support a better crypto future. These include building sufficiently decentralized systems, transparent governance, labor over capital, taxation for public goods, reparations, provable zero-carbon, and human rights fail-safes. For his community, this is not about “catching up” to institutions as we know them, but instead doing the work to build a more humane world.

Following Schneider, Luandro Vieira of Digital Democracy demoed his project, Mapeo, a decentralized app built with and for communities. Mapeo is a mobile application that provides free and accessible geospatial technology that is translatable, designed for community, private, and available offline. Originally built for earth defenders, marginalized people on the front lines of defending their land around the world, Mapeo is highly customizable and used mainly by Indigenous people in 16 countries. It is used to map and monitor threats from invasions, mining, logging, and oil activities. Mapeo’s power was demonstrated through the #WaoraniResistance, which protected half a million acres in the Amazon and jeopardized a 7-million-acre oil auction.

Through the lens of these three activists and experts, the promise of the DWeb was clear. A new, highly democratic web is possible, but it will take all of us to build it.

Thank you to METRO Library Council, Internet Archive, and Library Futures for their work in hosting these sessions. All resources from the six DWeb sessions can be found at the METRO Library Council website.

Launching Legal Literacies for Text Data Mining – Cross Border (LLTDM-X)

We are excited to announce that the National Endowment for the Humanities (NEH) has awarded nearly $50,000 through its Digital Humanities Advancement Grant program to UC Berkeley Library and Internet Archive to study legal and ethical issues in cross-border text data mining research. NEH funding for the project, entitled Legal Literacies for Text Data Mining – Cross Border (LLTDM-X), will support research and analysis that addresses law and policy issues faced by U.S. digital humanities practitioners whose text data mining research and practice intersects with foreign-held or licensed content, or involves international research collaborations. LLTDM-X builds upon the Building Legal Literacies for Text Data Mining Institute (Building LLTDM), previously funded by NEH. UC Berkeley Library directed Building LLTDM, bringing together expert faculty from across the country to train 32 digital humanities researchers on how to navigate law, policy, ethics, and risk within text data mining projects (results and impacts are summarized in the project white paper).

Why is LLTDM-X needed?

Text data mining, or TDM, is an increasingly essential and widespread research approach. TDM relies on automated techniques and algorithms to extract revelatory information from large sets of unstructured or thinly-structured digital content. These methodologies allow scholars to identify and analyze critical social, scientific, and literary patterns, trends, and relationships across volumes of data that would otherwise be impossible to sift through. While TDM methodologies offer great potential, they also present scholars with nettlesome law and policy challenges that can prevent them from understanding how to move forward with their research. Building LLTDM trained TDM researchers and professionals on essential principles of licensing, privacy law, ethics, and other legal literacies, thereby helping them move forward with impactful digital humanities research. Further, digital humanities research in particular is marked by collaboration across institutions and geographical boundaries. Yet U.S. practitioners encounter increasingly complex cross-border problems and must accordingly consider how they work with internationally-held materials and international collaborators.

How will LLTDM-X help? 

Our long-term goal is to design instructional materials and institutes to support digital humanities TDM scholars facing cross-border issues. Through a series of virtual roundtable discussions, and accompanying legal research and analyses, LLTDM-X will surface these cross-border issues and begin to distill preliminary guidance to help scholars in navigating them. After the roundtables, we will work with the law and ethics experts to create instructive case studies that reflect the types of cross-border TDM issues practitioners encounter. Case studies, guidance, and recommendations will be widely disseminated via an open access report to be published at the completion of the project. Most importantly, these resources will be used to inform our future educational offerings.

The LLTDM-X team is eager to get started. The project is co-directed by Thomas Padilla, Deputy Director, Archiving and Data Services at Internet Archive and Rachael Samberg, who leads UC Berkeley Library’s Office of Scholarly Communication Services. Stacy Reardon, Literatures and Digital Humanities Librarian, and Timothy Vollmer, Scholarly Communication and Copyright Librarian, both at UC Berkeley Library, round out the team.

We would like to thank NEH’s Office of Digital Humanities again for funding this important work. The full press release is available at UC Berkeley Library’s website. We invite you to contact us with any questions.

Decentralized Apps, the Metaverse, and the “Next Big Thing”

In the fifth session of “Imagining a Better Online World: Exploring the Decentralized Web” – a joint series of events with Internet Archive, METRO Library Council, and Library Futures – “Decentralized Apps, the Metaverse, and the ‘Next Big Thing,’” Internet Archive Director of Partnerships Wendy Hanamura took a deep dive into the metaverse and NFTs through an exploration of virtual worlds with pioneering metaverse developer Jin.

In this engaging session, Hanamura and Jin explored the technologies that would transform the future and the world as we know it within Web 3.0: the immersive spaces and built communities of the metaverse. As indicated by participants, to some, NFTs and the metaverse mean “cyberspace on steroids,” or “Second Life,” while for others they hold a more negative connotation. From the “read-only” Web 1.0 to the forthcoming “read-write-trust verifiable” future of Web 3.0, the evolution of the web is leading to an enhancement of reality to create new and augmented realities.

An NFT, or an entry on a blockchain, can be anything from a document to even a virtual representation of a physical space like the Internet Archive. Jin, for example, is able to create a complete virtual desktop where their entire life and memory lives in 3D, and where they conducted the virtual reality interview with Hanamura. From hacker spaces to raves to the virtual representation of the Internet Archive they built as a central space to conduct their work, Jin’s life is mediated and defined through their virtual world building.

What makes Jin’s world unique is their commitment to building with other people in the open source community in an “interesting, collaborative, co-creation.”

Within these worlds, one of the key provisions is interoperability: the ability to carry these worlds between each other. For Jin, this is still a work in progress, with new modes of interoperability still being built. In addition, privacy is a major concern – Web 3.0 provides a new form of privacy through avatars and other obscuring technology, but Jin cautions that due diligence is still warranted, just like in the real world.

The conversation ended with a discussion of the democratizing aspects of NFT creation and independent artists. As an artist, Jin earned more money from their first NFT than they had ever made previously in their career. One of the most exciting aspects of this kind of creation is the way it removes the middle person from the art market: rather than creating for museums or other art markets, Jin is able to reach their audience directly.

Jin ended the session on a positive note: “In virtual reality, you have a lot more bandwidth for empathy. There’s a lot of nuance that is lost in text-based communication platforms. It’s more asynchronous. The sense of presence, of being there with other people, you experience a lot of genuine and good connections… there’s a lot of genuine appreciation of art. That gives me hope.”

Goodbye Facebook. Hello Decentralized Social Media?

The pending sale of Twitter to Elon Musk has generated a buzz about the future of social media and just who should control our data.

Wendy Hanamura, director of partnerships at the Internet Archive, moderated “Goodbye Facebook, Hello Decentralized Social Media?”, an online discussion held April 28 about the opportunities and dangers ahead. The webinar is part of a series of six workshops, “Imagining a Better Online World: Exploring the Decentralized Web.”

The session featured founders of some of the top decentralized social media networks, including Jay Graber, chief executive officer of R&D project Bluesky; Matthew Hodgson, technical co-founder of Matrix; and Andre Staltz, creator of Manyverse. Unlike Twitter, Facebook or Slack, Matrix and Manyverse have no central controlling entity. Instead, the peer-to-peer networks shift power to the users and protect privacy.

If Twitter is indeed bought and people are disappointed with the changes, the speakers expressed hope that the public will consider other social networks. “A crisis of this type means that people start installing Manyverse and other alternatives,” Staltz said. “The opportunity side is clear.” Still, if other platforms are not ready during the transition period, there is some risk that users will feel stuck and not switch, he added.

Hodgson said there are reasons to be both optimistic and pessimistic about Musk purchasing Twitter. The hope is that he will use his powers for good, making it available to everybody and empowering people to block the content they don’t want to see. The risk is that, with no moderation and without sufficient controls to filter content, people will be obnoxious to one another and the system will melt down, Hodgson said. “It’s certainly got potential to be an experiment. I’m cautiously optimistic on it,” he said.

People who work in decentralized tech recognize the risk that comes when one person can control a network and act for good or bad, Graber said. “This turn of events demonstrates that social networks that are centralized can change very quickly,” she said. “Those changes can potentially disrupt or drastically alter people’s identity, relationships, and the content that they put on there over the years. This highlights the necessity for transition to a protocol-based ecosystem.” 

When a platform is user-controlled, it is resilient to disruptive change, Graber said. Decentralization enables immutability so change is hard and is a slow process that requires a lot of people to agree, added Staltz.

The three leaders spoke about how decentralized networks provide a sustainable alternative and are gaining traction. Unlike major players that own user data and monetize personal information, decentralized networks are controlled by users and information lives in many different places.

“Society as a whole is facing a lot of crises,” Graber said. “We have the ability to, as a collective intelligence, to investigate a lot of directions at once. But we don’t actually have the free ability to fully do this in our current social architecture…if you decentralize, you get the ability to innovate and explore many more directions at once. And all the parts get more freedom and autonomy.”

Decentralized social media is structured to change the balance of power, added Hanamura: “In this moment, we want you to know that you have the power. You can take back the power, but you have to understand it and understand your responsibility.”

The webinar was co-sponsored by DWeb and Library Futures, and presented by the Metropolitan New York Library Council (METRO).

The next event in the series, Decentralized Apps, the Metaverse and the “Next Big Thing,” will be held Thursday, May 26, from 4 to 5 p.m. EST.

Library as Laboratory Recap: Analyzing Biodiversity Literature at Scale

At a recent webinar hosted by the Internet Archive, leaders from the Biodiversity Heritage Library (BHL) shared how its massive open access digital collection documenting life on the planet is an invaluable resource for scientists and ordinary citizens.

“The BHL is a global consortium of the leading natural history museums, botanical gardens, and research institutions – big and small – from all over the world. Working together and in partnership with the Internet Archive, these libraries have digitized more than 60 million pages of scientific literature available to the public,” said Chris Freeland, director of Open Libraries and moderator of the event.

Established in 2006 with a commitment to inspiring discovery through free access to biodiversity knowledge, BHL has 19 members and 22 affiliates, plus 100 worldwide partners contributing data. The BHL has content dating back nearly 600 years alongside current literature that, when liberated from the print page, holds immense promise for advancing science and solving today’s pressing problems of climate change and the loss of biodiversity.

Martin Kalfatovic, BHL program director and associate director of the Smithsonian Libraries and Archives, noted in his presentation that Charles Darwin and colleagues famously said “the cultivation of natural science cannot be efficiently carried on without reference to an extensive library.”

“Today, the Biodiversity Heritage Library is creating this global, accessible open library of literature that will help scientists, taxonomists, environmentalists – a host of people working with our planet – to actually have ready access to these collections,” Kalfatovic said. BHL’s mission is to improve research methodology by working with its partner libraries and the broader biodiversity and bioinformatics community. BHL draws about 142,000 visitors each month and has served 12 million users overall.

Most of the BHL’s materials are from collections in the global north, primarily in large, well-funded institutions. Digitizing these collections helps level the playing field, providing researchers in all parts of the world equal access to vital content.

The vast collection includes species descriptions, distribution records, climate records, history of scientific discovery, information on extinct species, and records of scientific distributions of where species live. To date, BHL has made over 176,000 titles and 281,000 volumes available. Through a partnership with the Global Names Architecture project, more than 243 million instances of taxonomic (Latin) names have been found in BHL content.

Kalfatovic underscored the value of BHL content in understanding the environment in the wake of recent troubling news from the Sixth Assessment Report (AR6) published by the Intergovernmental Panel on Climate Change about the impact of the earth’s warming.

Biodiversity Heritage Library by the numbers.

“The outlook for the planet is challenging,” he said. “By unlocking this historic data, we can find out where we’ve been over time to find out more about where we need to be in the future.”

JJ Dearborn, BHL data manager, discussed how digitization transforms physical books into digital objects that can be shared with “anyone, at any time, anywhere.” She described the Wikimedia ecosystem as “fertile ground for open access experimentation,” crediting the organization with giving BHL the ability to reach new audiences and transform its data into 5-star linked open data. “Dark data” locked up in legacy formats, JP2s, and OCR text is a source of valuable checklist, species occurrence, and event sampling data that the larger biodiversity community can use to improve humanity’s collective ability to monitor biodiversity loss and the destructive impacts of climate change at scale.

The majority of the world’s data today is siloed, unstructured, and unused, Dearborn explained. This “dark data” “represents an untapped resource that could really transform human understanding if it could be truly utilized,” she said. “It might represent a gestalt leap for humanity.” 

The event was the fifth in a series of six sessions highlighting how researchers in the humanities use the Internet Archive. The final session of the Library as Laboratory series will be a series of lightning talks on May 11 at 11am PT / 2pm ET—register now!