Category Archives: Event

August Book Talk: Dataraising and Digital Civil Society

Featuring the book How We Give Now by Lucy Bernholz. Published by MIT Press.

What is dataraising and why should nonprofits care? For millennia humans have given time and money to each other and to causes they care about. A few hundred years ago we invented nonprofit organizations and they’ve become a key mechanism in the donation of private resources for public benefit. Now, we can also donate digital data. Organizations such as iNaturalist use donated digital photographs to build communities of nature lovers and inform climate scientists. Other organizations are using donated data to build cultural archives, advocate for fair labor laws, protect consumers, and support medical research.

Join Lucy Bernholz, author of How We Give Now, Scott Loarie of iNaturalist, and Dr. Jasmine McNealy from the University of Florida for a discussion of the promises and perils of donating digital data and the implications for individuals, communities, and civil society.

Purchase your copy of How We Give Now from MIT Press.

August Book Talk: Dataraising and Digital Civil Society
Featuring Lucy Bernholz, author of How We Give Now, Scott Loarie of iNaturalist, and Dr. Jasmine McNealy from the University of Florida
August 10, 2022 @ 11am PT
Watch the session recording.

Internet Archive Hosts Community Webs Symposium in Washington, DC

On June 21st, the Community Webs program team hosted its 2022 US Symposium at the National Museum of the American Indian in Washington, DC. For this day-long meeting, we welcomed over 30 librarians and archivists from across the country for presentations, discussion, networking, and some much-needed catch-up following two years of entirely virtual events.

National Museum of the American Indian, Washington, DC

Community Webs is a community history web and digital archiving program operated by the Internet Archive. The program seeks to advance the capacity of community-focused memory organizations to build web and digital archives documenting local histories, with a particular focus on communities that have been underrepresented in the historical record. Community Webs provides its members with web and digital archiving tools, as well as training, technical support and access to a network of organizations doing similar work. The Community Webs program, including this event, is generously funded with support from the Institute of Museum and Library Services (IMLS) and the Mellon Foundation.

Jefferson Bailey, Director of Archiving & Data Services at the Internet Archive, describes the concepts that have underpinned the development of Community Webs since its inception

The day began with opening remarks and program updates from Internet Archive staff, including an overview of Community Webs and the significant growth the program has experienced since its launch in 2017. Staff provided a glimpse at what lies ahead both for Community Webs and the Internet Archive’s Archiving and Data Services team. This included plans to incorporate digitization, digital preservation and other forms of digital collecting into Community Webs, as well as projects and services either newly released or in development at IA.

Dr. Doretha Williams, Director of the Robert F. Smith Center for the Digitization and Curation of African American History at the National Museum of African American History and Culture

The first keynote speaker of the day was Dr. Doretha Williams, Director of the Robert F. Smith Center for the Digitization and Curation of African American History at the National Museum of African American History and Culture. Dr. Williams detailed her organization’s commitment to serving its communities via the Center’s Community Curation Program, Internships and Fellowships Program, Family History Center, and Great Migration Home Movie Project. Throughout her presentation, Dr. Williams stressed the importance of community input and partnerships to achieving the Center’s mission, echoing one of the central tenets of the Community Webs program.

National Gallery of Art Executive Librarian Roger Lawson discusses his organization’s involvement with the Collaborative ART Archive (CARTA)

Following this presentation, three speakers shared their experiences working on collaborative web archiving initiatives. Lori Donovan, Senior Program Manager for Community Programs at the Internet Archive, began with an overview of various collaborative web archiving initiatives the Internet Archive and its partners have participated in, including the Collaborative ART Archive (CARTA), a web archiving initiative aimed at capturing web-based art materials utilizing a collective approach. Roger Lawson, Executive Librarian at the National Gallery of Art, shared his institution’s perspective as a member of CARTA. Finally, Christie Moffatt, Digital Manuscripts Program Manager at the National Library of Medicine, described working with colleagues both across her organization and externally to capture health-related web content at a national scale. Each of these presentations emphasized the advantages in scale, resources, staffing and knowledge-sharing that can be achieved by pursuing web archiving via collaborative entities.

Our afternoon session kicked off with a second keynote presentation from Leslie Johnston, Director of Digital Preservation at the National Archives and Records Administration (NARA). Johnston detailed the challenges NARA faces while contending with digital preservation across the enterprise. These challenges include the heterogeneity of digital outputs and technologies, the complexity of digital objects and environments, the scale of the archivable digital universe, and the difficulties in ensuring equitable access. As an antidote to these challenges, Johnston recommended that archivists provide guidance to content creators, take a risk-based approach, prioritize basic levels of control, maintain scalable and flexible infrastructure, and engage in collaborations and partnerships. She also advocated for a people- rather than technology-centric approach to digital preservation, again mirroring the ethos of the Community Webs program.

Leslie Johnston, Director of Digital Preservation at NARA, outlines the challenges her institution is facing while contending with digital preservation

For our final speaker session of the afternoon, we welcomed Community Webs members up to the lectern to share their web archiving and digital goals and achievements. Librarian, archivist, PhD student, and creative polymath kYmberly Keeton discussed her work as founder of Art | Library Deco, an online archive of African American art. Keeton described working closely with the artists featured in the archive, reiterating the theme of collaboration espoused by other speakers at the event. Tricia Dean, Tech Services Manager at Wilmington Public Library (Illinois), argued for the importance of capturing the histories of small and rural communities through initiatives like Community Webs. Liz Paulus, Adult Services Librarian at Cedar Mill & Bethany Community Libraries, described her efforts to capture the online Cedar Mill News via web archiving, stressing how one successful project can play a significant role when advocating for future resources. Longtime Community Webs member Dylan Gaffney, Information Services Associate for Local History & Special Collections at Forbes Library, described his library’s participation in States of Incarceration, a traveling exhibition on mass incarceration, the Historic Northampton Enslaved People Project, and other initiatives. Gaffney credited Community Webs with paving the way for an equity-focused approach to digital projects such as these. Finally, Dana Hamlin, Archivist at Waltham Public Library, showcased her organization’s web archiving efforts, highlighting the library’s COVID-19 collections and its attempts to capture the online local newspaper, the Waltham News Tribune.

Throughout the day, attendees had opportunities to discuss digital initiatives at their organizations, to catch up informally after a long hiatus, and to browse the exhibitions on display at the National Museum of the American Indian. We’re so grateful to all of our Community Webs members who were able to attend the event and especially to those who shared their knowledge. Our next Community Webs Symposium will be held in Chattanooga this September 13 to coincide with the Association for Rural and Small Libraries Conference. We are looking forward to seeing more program members there!

Building a Better Internet: Internet Archive Convenes DC Workshop

Photo of workshop participants, by Caralee Adams

Thought leaders from libraries, academia, and civil society gathered at Georgetown Law Center in Washington, D.C., on June 23 to discuss how best to advance policies that improve the ease, affordability, and equity with which people access knowledge in the digital age.

Convened by the Internet Archive, this workshop was designed as a continuation of a conversation that the public interest community, including Internet Archive, Creative Commons, Public Knowledge, Library Futures, and the Wikimedia Foundation, started last summer around building a better internet centered on public interest values.

While U.S. lawmakers’ focus on internet policy has largely been directed at reining in the “Big Tech” commercial platforms, this workshop took a different approach. Rather than centering the challenges with today’s for-profit, commercial platforms, the workshop centered the barriers libraries face and potential opportunities for them to help solve challenges with our digital information ecosystem.

Our hope as organizers was that we could map the terrain, find common ground, and identify areas for further discussion. And even in a short amount of time, we were able to do that in spades. Here are a few of our key learnings, and what’s next.

Key Opportunities

Participants recognized that libraries could fill an important gap in the current online environment, as they have done for hundreds of years offline–indeed, providing free access to high quality, trusted information is libraries’ primary mission. As our information ecosystem becomes increasingly digital, the world often looks to libraries to do even more. For instance, scholar Joan Donovan has suggested that platform companies hire 10,000 librarians to help curate their services and support access to quality information. Others have suggested libraries could be doing fact checking, building and hosting social media networks, and more. One important way to combat misinformation is with better information provided by libraries; however, this is not without its challenges.

Key Barriers

Participants identified two significant barriers to full library participation in the digital information ecosystem: media consolidation and copyright overreach by powerful publishers. The group discussed a wide range of possible solutions to these challenges, including antitrust scrutiny, contract preemption, supporting a robust public domain, controlled digital lending, and digital ownership.


The group was motivated by a desire to serve the public over commercial interests, and expressed their commitment to making sure equity was woven through all proposals in a thoughtful and authentic manner. Libraries support access to information and creative empowerment for all. We understand that a better internet must work for everyone, including underserved and vulnerable populations.

As the organizers of this event, we are very grateful to all the participants for contributing their time and expertise to this effort. Up next, we will hold virtual workshops to include additional members of the community in these discussions, followed by the publication of a longer report with our findings and policy recommendations. Stay tuned for future updates as this effort moves forward.


July Book Talk: The Library: A Fragile History

“A comprehensive and fascinating deep dive into the evolution of libraries… Bibliophiles should consider this a must-read.”—Publishers Weekly

Perfect for book lovers, this is a fascinating exploration of the history of libraries and the people who built them, from the ancient world to the digital age.

Join historian Abby Smith Rumsey for a book talk & conversation with Andrew Pettegree and Arthur der Weduwen, authors of The Library: A Fragile History.

Many have decried the perilous state of the library in the 21st century, a situation that was made only worse when public libraries across the world were forced to shut their doors in the face of a global pandemic. But across centuries of existence, libraries have faced ruin from war, fire, neglect, and dispersal, only to be reborn.

In The Library, historians Andrew Pettegree and Arthur der Weduwen trace the extraordinary history of the institution, from the famed collections of the ancient world to the modern public resource of today. Along the way, they encounter the librarians, historians, readers, supporters and antagonists that have shaped the library and its offerings over centuries. Do libraries last? Register for our book talk to find out from the authors.

Purchase a copy from our local bookstore, The Booksmith.

July Book Talk: The Library: A Fragile History
Historian Abby Smith Rumsey in conversation with authors Andrew Pettegree and Arthur der Weduwen.
July 20 @ 9am PT
Watch the event recording

ABOUT THE SPEAKERS:

Abby Smith Rumsey is a writer and historian focusing on the creation, preservation, and use of the cultural record in all media. She writes and lectures widely on analog and digital preservation, online scholarship, the nature of evidence, the changing roles of libraries and archives, and the impact of new information technologies on perceptions of history, time, and identity. She is the author of When We Are No More: How Digital Memory is Shaping our Future (2016).

Andrew Pettegree is Professor of Modern History at St Andrews University, where he directs the Universal Short Title Catalogue, a database of information about all books published before 1650. A leading expert on the history of book and media transformations, Pettegree is the award-winning author of several books on the subject. He lives in Scotland. 

Arthur der Weduwen is a historian and postdoctoral fellow at St Andrews, where he serves as an associate editor of the Universal Short Title Catalogue. This is his fifth book. He lives in Scotland.

June Book Talk: The Catalogue of Shipwrecked Books

“Wilson-Lee’s pioneering study makes Hernando’s life every bit as compelling as his father’s. But that is not all: as we accompany Hernando on his various European journeys of compulsive acquisition, we are not only led through a richly evoked early modern world, but also prompted to reflect on our own data-saturated age.” —The Times Literary Supplement

The Internet Archive invites you to watch a book talk with Edward Wilson-Lee, author of The Catalogue of Shipwrecked Books: Christopher Columbus, His Son, and the Quest to Build the World’s Greatest Library, followed by a conversation with Brewster Kahle, founder of the Internet Archive.

Purchase your copy from The Booksmith, our local bookstore.

In The Catalogue of Shipwrecked Books, Edward Wilson-Lee tells the compelling story of Hernando Colón, who sailed with his father Christopher Columbus on his final voyage to the New World, a journey that ended in disaster, bloody mutiny, and shipwreck. After Columbus’s death in 1506, eighteen-year-old Hernando sought to continue—and surpass—his father’s campaign to explore the boundaries of the known world by building a library that would collect everything ever printed: a vast holding organized by summaries and catalogues, the first database for the exploding diversity of written matter as the printing press proliferated across Europe.

Hernando held the groundbreaking conviction that a library of universal knowledge should include “all books, in all languages and on all subjects,” even material often dismissed: ballads, erotica, news pamphlets, almanacs, popular images, romances, fables. The loss of part of his collection to another maritime disaster in 1522 set off the final scramble to complete this sublime project, a race against time to realize a vision of near-impossible perfection.

Book Talk: The Catalogue of Shipwrecked Books
Author Edward Wilson-Lee in conversation with Internet Archive’s Brewster Kahle.
June 28 @ 10am PT
Watch the recording from the virtual event

Edward Wilson-Lee is a Fellow in English at Sidney Sussex College, University of Cambridge, and a specialist in the literature and the history of the book in the early modern period. He is the author of The Catalogue of Shipwrecked Books, Shakespeare in Swahililand, and Translation and the Book Trade in Early Modern Europe.

Special Event: Universal Access to All Knowledge @ The New York Society Library

Saturday, June 4, 2:00 PM ET
The New York Society Library, 53 East 79th Street, Manhattan
or by livestream


Register now for the in-person session or the livestream.

Join Brewster Kahle of the Internet Archive for a special two-part presentation and discussion on using this massive resource and on the societal and policy issues affecting access to knowledge.

This is a great time to be an archivist and librarian—digital memory is ever more important and more difficult to manage.

Advances in computing and communications mean that we can cost-effectively store every book, sound recording, movie, software package, and public webpage ever created and provide access to these collections via the Internet to students and adults all over the world. By using mostly existing institutions and funding sources, we can build this, as well as compensate authors, within the current worldwide library budget. Technological advances, for the first time since the loss of the Library of Alexandria, may allow us to collect all published knowledge in a similar way. But now we can take the original goal another step further to make all the published works of humankind accessible to everyone, no matter where they are in the world.
 
Will we allow ourselves to re-invent our concept of libraries and archives to expand and to use the new technologies? This is fundamentally a societal and policy issue. These issues are reflected in our governments’ spending priorities, and in law.

This event takes place in two interrelated hours:
2:00-3:00 PM – The Internet Archive: What It Is and How to Use It
3:00-4:00 PM – Universal Access to All Knowledge: Technologies, Societies, Legalities

Goodbye Facebook. Hello Decentralized Social Media?

The pending sale of Twitter to Elon Musk has generated a buzz about the future of social media and just who should control our data.

Wendy Hanamura, director of partnerships at the Internet Archive, moderated an online discussion on April 28, “Goodbye Facebook, Hello Decentralized Social Media?”, about the opportunities and dangers ahead. The webinar is part of a series of six workshops, “Imagining a Better Online World: Exploring the Decentralized Web.”

The session featured founders of some of the top decentralized social media networks including Jay Graber, chief executive officer of R&D project Bluesky, Matthew Hodgson, technical co-founder of Matrix, and Andre Staltz, creator of Manyverse. Unlike Twitter, Facebook or Slack, Matrix and Manyverse have no central controlling entity. Instead the peer-to-peer networks shift power to the users and protect privacy. 

If Twitter is indeed bought and people are disappointed with the changes, the speakers expressed hope that the public will consider other social networks. “A crisis of this type means that people start installing Manyverse and other alternatives,” Staltz said. “The opportunity side is clear.” Still, if other platforms are not ready during the transition period, there is some risk that users will feel stuck and not switch, he added.

Hodgson said there are reasons to be both optimistic and pessimistic about Musk purchasing Twitter. The hope is that he will use his powers for good, making it available to everybody and empowering people to block the content they don’t want to see. The risk, Hodgson said, is that with no moderation and without sufficient controls to filter content, people will be obnoxious to one another and the system will melt down. “It’s certainly got potential to be an experiment. I’m cautiously optimistic on it,” he said.

People who work in decentralized tech recognize the risk that comes when one person can control a network and act for good or bad, Graber said. “This turn of events demonstrates that social networks that are centralized can change very quickly,” she said. “Those changes can potentially disrupt or drastically alter people’s identity, relationships, and the content that they put on there over the years. This highlights the necessity for transition to a protocol-based ecosystem.” 

When a platform is user-controlled, it is resilient to disruptive change, Graber said. Decentralization enables immutability, so change is hard: it is a slow process that requires many people to agree, Staltz added.

The three leaders spoke about how decentralized networks provide a sustainable alternative and are gaining traction. Unlike major players that own user data and monetize personal information, decentralized networks are controlled by users and information lives in many different places.

“Society as a whole is facing a lot of crises,” Graber said. “We have the ability to, as a collective intelligence, to investigate a lot of directions at once. But we don’t actually have the free ability to fully do this in our current social architecture…if you decentralize, you get the ability to innovate and explore many more directions at once. And all the parts get more freedom and autonomy.”

Decentralized social media is structured to change the balance of power, added Hanamura: “In this moment, we want you to know that you have the power. You can take back the power, but you have to understand it and understand your responsibility.”

The webinar was co-sponsored by DWeb and Library Futures, and presented by the Metropolitan New York Library Council (METRO).

The next event in the series, Decentralized Apps, the Metaverse and the “Next Big Thing,” will be held Thursday, May 26, from 4 to 5 p.m. EST. Register here.

Congressman Ro Khanna in conversation with Larry Lessig

Could Ro Khanna be the first Asian American President of the United States?

California Congressman Ro Khanna is a political rising star, one that some Democrats see as the future of the Party. Known both for his progressive leadership and his ability to work across the aisle, Khanna – who represents Silicon Valley – is one of the most important figures setting tech policy in our nation today.

The Internet Archive invites you to come hear Khanna speak about his vision for the future. In Dignity in the Digital Age: Making Tech Work for All of Us, Khanna offers a vision for democratizing digital innovation to build economically vibrant and inclusive communities. Instead of being subject to tech’s reshaping of our economy, Khanna argues, we must channel those powerful forces toward creating a healthier, more equal, and more democratic society.

On Tuesday, May 31st, 6pm PT/9pm ET, Representative Khanna will be interviewed by professor Larry Lessig, a digital access visionary and co-founder of Creative Commons and the Free Culture movement. Lessig himself ran for President in the Democratic primaries in 2016. The Internet Archive is honored to have these two great thinkers sharing our stage, for one night only! Please join us for this exciting political conversation either virtually or in-person at the Internet Archive, 300 Funston Ave, San Francisco. 

REGISTER NOW!

A note about safety for our in-person audience: The Internet Archive is taking COVID precautions very seriously. We will be requiring proof of vaccination and masks indoors. There will be no food or beverages served (though there will be a water station). We are limiting seating in our huge, thousand-seat Great Room to only 200 people. And of course we will have our large windows and doors open to ensure good airflow. We are working hard to make sure that this event is as safe as can be! Please reserve your seats ASAP.


Library as Laboratory: Lightning Talks

In this final session of the Internet Archive’s digital humanities expo, Library as Laboratory, attendees heard from scholars in a series of short presentations about their research and how they’re using collections and infrastructure from the Internet Archive for their work.

Speakers:

  • Forgotten Histories of the Mid-Century Coding Bootcamp, [watch] Kate Miltner (University of Edinburgh)
  • Japan As They Saw It, [watch] Tom Gally (University of Tokyo)
  • The Bibliography of Life, [watch] Rod Page (University of Glasgow)
  • Q&A #1 [watch]
  • More Than Words: Fed Chairs’ Communication During Congressional Testimonies, [watch] Michelle Alexopoulos (University of Toronto)
  • WARC Collection Summarization, [watch] Sawood Alam (Internet Archive)
  • Automatic scanning with an Internet Archive TT scanner, [watch] Art Rhyno (University of Windsor)
  • Q&A #2 [watch]
  • Automated Hashtag Hierarchy Generation Using Community Detection and the Shannon Diversity Index, [watch] Spencer Torene (Thomson Reuters Special Services, LLC)
  • My Internet Archive Enabled Journey As A Digital Humanities Citizen Scientist, [watch] Jim Salmons
  • Web and cities: (early internet) geographies through the lenses of the Internet Archive, [watch] Emmanouil Tranos (University of Bristol)
  • Forgotten Novels of the 19th Century, [watch] Tom Gally (University of Tokyo)
  • Q&A #3 [watch]

Links shared during the session are available in the series Resource Guide.


WARC Collection Summarization

Sawood Alam (Internet Archive)

Items in the Internet Archive’s Petabox collections of various media types (image, video, audio, book, etc.) have rich metadata, representative thumbnails, and interactive hero elements. However, web collections, which primarily contain WARC files and their corresponding CDX files, often look opaque. We created an open-source CLI tool called “CDX Summary” [1] to process sorted CDX files and generate reports. These summary reports give insights into various dimensions of CDX records/captures, such as the total number of mementos, the number of unique original resources, the distribution of media types and their HTTP status codes, path and query segment counts, temporal spread, and capture frequencies of top TLDs, hosts, and URIs. We also implemented a uniform sampling algorithm to select a given number of random memento URIs (i.e., URI-Ms) with 200 OK HTML responses that can be utilized for quality assurance purposes or as a representative sample for the collection of WARC files. Our tool can generate both comprehensive and brief reports in JSON format as well as in a human-readable textual representation. We ran our tool on a selected set of public web collections in Petabox, stored the resulting JSON files in their corresponding collections, and made them publicly accessible (with the hope that they might be useful for researchers). Furthermore, we implemented a custom Web Component that can load CDX Summary report JSON files and render them in interactive HTML representations. Finally, we integrated this Web Component into the collection/item views of the main site of the Internet Archive, so that patrons can access rich and interactive information when they visit a web collection/item in Petabox. We also found our tool useful for crawl operators, as it helped us identify numerous issues in some of our crawls that would otherwise have gone unnoticed.
[1] https://github.com/internetarchive/cdx-summary/ 
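To make the kind of report described above more concrete, here is a minimal Python sketch. It is not the CDX Summary tool's code or API. It assumes the common 11-field, space-separated CDX layout (urlkey, timestamp, original URL, MIME type, status code, and so on), builds URI-Ms in the Wayback Machine's `web.archive.org/web/<timestamp>/<url>` form, and uses reservoir sampling as one possible way to draw a uniform sample; the abstract does not say which sampling algorithm the tool uses. The function name `summarize_cdx` and the `SAMPLE_CDX` records are hypothetical.

```python
import random
from collections import Counter

# Toy records, assuming the common 11-field CDX layout:
# urlkey timestamp original mimetype status digest redirect meta length offset filename
SAMPLE_CDX = [
    " CDX N b a m s k r M S V g",
    "com,example)/ 20200101000000 http://example.com/ text/html 200 AAA - - 100 0 a.warc.gz",
    "com,example)/ 20210101000000 http://example.com/ text/html 200 BBB - - 100 100 a.warc.gz",
    "com,example)/img.png 20200101000001 http://example.com/img.png image/png 200 CCC - - 50 200 a.warc.gz",
    "com,example)/old 20200101000002 http://example.com/old text/html 404 DDD - - 20 250 a.warc.gz",
]


def summarize_cdx(lines, sample_size=5, seed=0):
    """Summarize CDX lines and sample URI-Ms of 200 OK HTML captures."""
    rng = random.Random(seed)
    mementos = 0
    originals = set()
    mime_status = Counter()
    years = Counter()
    sample = []      # reservoir of sampled URI-Ms
    eligible = 0     # number of 200 OK HTML captures seen so far

    for line in lines:
        fields = line.split()
        # Skip the header line and anything too short to parse.
        if len(fields) < 5 or fields[0] == "CDX":
            continue
        urlkey, timestamp, original, mime, status = fields[:5]
        mementos += 1
        originals.add(urlkey)
        mime_status[(mime, status)] += 1
        years[timestamp[:4]] += 1

        if mime == "text/html" and status == "200":
            eligible += 1
            urim = f"https://web.archive.org/web/{timestamp}/{original}"
            if len(sample) < sample_size:
                sample.append(urim)
            else:
                # Reservoir sampling: keep each eligible capture with
                # probability sample_size / eligible.
                j = rng.randrange(eligible)
                if j < sample_size:
                    sample[j] = urim

    return {
        "mementos": mementos,
        "unique_originals": len(originals),
        "mime_status": mime_status,
        "years": years,
        "sample": sample,
    }


report = summarize_cdx(SAMPLE_CDX)
```

Reservoir sampling fits here because it needs only one pass over the CDX file and does not require knowing the number of eligible captures in advance.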


More Than Words: Fed Chairs’ Communication During Congressional Testimonies

Michelle Alexopoulos (University of Toronto)

Economic policies enacted by the government and its agencies, especially those related to fiscal and monetary policy, have large impacts on the welfare of businesses and individuals. Communicating the details of the policies to the public is an important and complex undertaking. Policymakers tasked with the communication not only need to present complicated information in simple and relatable terms, but they also need to be credible and convincing, all while being at the center of the media’s spotlight. In this briefing, I will discuss recent research on the applications of AI to monetary policy communications, and lessons learned to date. In particular, I will report on my ongoing project with researchers at the Bank of Canada that analyzes the effects of emotional cues by the Chairs of the U.S. Federal Reserve on financial markets during congressional testimonies.

While most previous work has mainly focused on the effects of a central bank’s highly scripted messages about its rate decisions delivered by its leader, we use resources from the Internet Archive, C-SPAN and copies of testimony transcripts and apply a variety of tools and techniques to study both the messages and the messengers’ delivery of them. I will review how we apply recent advances in machine learning and big data to construct measures of the Federal Reserve Chair’s emotions, expressed via his or her words, voice, and face, as well as discuss challenges encountered and our findings to date. In all, our initial results highlight the salience of the Fed Chair’s emotional cues for shaping market responses to Fed communications. Understanding the effects of non-verbal communication and responses to verbal cues may help policy makers improve upon their communication strategies going forward.


Digging into the (Internet) Archive: Examining the NSFW Model Responsible for the 2018 Tumblr Purge

Renata Barreto (University of California Berkeley)

In December 2018, Tumblr took down massive amounts of LGBTQ content from its platform. Motivated in part by increasing pressures from financial institutions and a newly passed law (SESTA/FOSTA, which made companies liable for sex trafficking online), Tumblr implemented a strict “not safe for work” (NSFW) model, whose false positives included images of fully clothed women, handmade and digital art, and other innocuous objects, such as vases. The Archive Team, in conjunction with the Internet Archive, jumped into high gear and began to scrape self-tagged NSFW blogs in the two weeks between Tumblr’s announcement of its new policy and its algorithmic operationalization. At the time, Tumblr was considered a safe haven for the LGBTQ community; Yahoo! had bought it in 2013 for $1.1 billion. In the aftermath of the so-called “Tumblr purge,” Tumblr lost its main user base and, as of 2019, was valued at $3 million. This paper digs into a slice of the 90 TB of data saved by the Archive Team. This is a unique opportunity to peek under the hood of Yahoo’s open_nsfw model, which experts believe was used in the Tumblr purge, and examine the distribution of false positives on the Archive Team dataset. Specifically, we run the open_nsfw model on our dataset and use the t-SNE algorithm to project the similarities across images into 3D space.
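As a rough illustration of what "examining the distribution of false positives" can mean in practice, the Python sketch below tallies false positives for a score-based classifier at a fixed threshold. It does not call the open_nsfw model and does not reproduce the study's method or results: the scores, labels, categories, the 0.5 threshold, and the function name `false_positive_summary` are all made-up, hypothetical values chosen for illustration only.

```python
from collections import Counter

# Hypothetical records: (classifier score, is the image actually explicit?, content category).
# None of these values come from the Archive Team dataset.
TOY_RECORDS = [
    (0.92, True, "photo"),
    (0.81, False, "digital art"),
    (0.67, False, "vase"),
    (0.12, False, "photo"),
    (0.55, False, "digital art"),
    (0.30, True, "photo"),
]


def false_positive_summary(records, threshold=0.5):
    """Count benign items flagged at or above the threshold, grouped by category."""
    records = list(records)
    flagged = [r for r in records if r[0] >= threshold]
    # A false positive is a flagged item that is not actually explicit.
    false_pos = [r for r in flagged if not r[1]]
    benign_total = sum(1 for r in records if not r[1])
    rate = len(false_pos) / benign_total if benign_total else 0.0
    return {
        "flagged": len(flagged),
        "false_positives": len(false_pos),
        "false_positive_rate": rate,
        "by_category": Counter(category for _, _, category in false_pos),
    }


summary = false_positive_summary(TOY_RECORDS)
```

Grouping false positives by category is one simple way to see which kinds of innocuous content a model tends to flag; the 3D t-SNE projection mentioned in the abstract is a separate, visual way of exploring similarities across images.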


Japan As They Saw It (video)

Tom Gally (University of Tokyo)

“Japan As They Saw It” is a collection of descriptions of Japan by American and British visitors in the 1850s and later. Japan had been closed to outsiders for more than two centuries, and there was much curiosity in the West about this newly accessible country. The excerpts are grouped by category (Land, People, Culture, etc.), and each excerpt is linked to the book at the Internet Archive in which it first appeared. “Japan As They Saw It” can be read online, or it can be downloaded as a free ebook.


Forgotten Novels of the 19th Century (video)

Tom Gally (University of Tokyo)

Novels were the binge-watched television, the hit podcasts of the 19th century—immersive, addictive, commercial—and they were produced and consumed in huge numbers. But many novels of that era have slipped through the cracks of literary memory. “Forgotten Novels of the 19th Century” is a list of fifty of those neglected novels, all waiting to be discovered and read for free at the Internet Archive.


Forgotten Histories of the Mid-Century Coding Bootcamp

Kate Miltner (University of Edinburgh)

Over the past 10 years, Americans have been exhorted to “learn to code” in order to solve a series of entrenched social issues: the tech “skills gap”, the looming threat of AI and automation, social mobility, and the underrepresentation of women and people of color in the tech industry. In response to this widespread discourse, an entire industry of short-term intensive training courses, otherwise known as coding bootcamps, has sprung up across the US, bringing in hundreds of millions of dollars in revenue a year and training tens of thousands of people. Coding bootcamps have been framed as a novel kind of institution that is equipped to solve contemporary problems. However, materials from the Internet Archive show us that, in fact, a similar discourse about computer programming and similar organizations called EDP schools existed over 70 years ago. This talk will draw on materials from the Ted Nelson Archive and the Computerworld archive to show how lessons from the past can inform the present.


The Bibliography of Life

Roderic Page (University of Glasgow)

The “bibliography of life” is the aspiration of making all the taxonomic literature available so that for every species on the planet we can find its original description, as well as track how our knowledge of those species has changed over time. By combining content from the Internet Archive and the Wayback Machine with information in Wikidata we can make hundreds of thousands of taxonomic publications discoverable, and many of these can also be freely read via the Internet Archive. This presentation will outline this project, how it relates to efforts such as the Biodiversity Heritage Library, and highlight some tools such as Wikicite Search and ALEC to help export this content.


Automatic scanning with an Internet Archive TT scanner (video)

Art Rhyno (University of Windsor)

The University of Windsor has set up a mechanism for automatic scanning with an Internet Archive TT scanner, used for the library’s Major Papers collection.


Automated Hashtag Hierarchy Generation Using Community Detection and the Shannon Diversity Index

Spencer Torene (Thomson Reuters Special Services, LLC)

Developing semantic hierarchies from user-created hashtags in social media can provide useful organizational structure to large volumes of data. However, construction of these hierarchies is difficult using established ontologies (e.g. WordNet) due to the differences in the semantic and pragmatic use of words vs. hashtags in social media. While alternative construction methods based on hashtag frequency are relatively straightforward, these methods can be susceptible to the dynamic nature of social media, such as hashtags associated with surges in popularity. We drew inspiration from the ecologically-based Shannon Diversity Index (SDI) to create a more representative and resilient method of semantic hierarchy construction that relies upon graph-based community detection and a novel, entropy-based ensemble diversity index (EDI) score. The EDI quantifies the contextual diversity of each hashtag, resulting in thousands of semantically-related groups of hashtags organized along a general-to-specific spectrum. Through an application of EDI to social media data (Twitter) and a comparison of our results to prior approaches, we demonstrate our method’s ability to create semantically consistent hierarchies that can be flexibly applied and adapted to a range of use cases.
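
For readers unfamiliar with the Shannon Diversity Index mentioned in this abstract, the following is a minimal sketch of the classic index only, H = -sum(p_i * ln p_i), applied to hashtags. It is not the authors' EDI score, whose definition is not given here. The hashtags, the community labels, and the idea of counting the communities in which each hashtag appears are all hypothetical, chosen purely for illustration.

```python
import math
from collections import Counter


def shannon_diversity(counts):
    """Classic Shannon Diversity Index over a list of non-negative counts.

    Written as sum(p * ln(1/p)), which equals -sum(p * ln p).
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log(total / c) for c in counts if c > 0)


# Hypothetical data: for each hashtag, the community label of every post
# in which it appeared (labels are assumed to come from some prior
# community-detection step, which is not shown).
hashtag_contexts = {
    "#news": ["politics", "sports", "tech", "music"],
    "#nba": ["sports", "sports", "sports", "music"],
    "#lakers": ["sports", "sports", "sports", "sports"],
}

# Diversity of contexts per hashtag: spread across many communities
# gives a high score, concentration in one community gives zero.
scores = {
    tag: shannon_diversity(list(Counter(communities).values()))
    for tag, communities in hashtag_contexts.items()
}

# Sort from most diverse (general) to least diverse (specific).
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy data, a hashtag spread evenly across many communities scores highest and one confined to a single community scores zero, which is the general-to-specific ordering the abstract describes.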


Web and cities: (early internet) geographies through the lenses of the Internet Archive

Emmanouil Tranos (University of Bristol)

While geographers first turned their focus on the internet 25 years ago, the wealth of data that the Internet Archive preserves and offers remains largely unexplored, especially for projects that are large in scope and geographical scale. However, there is hardly any other data source that depicts the evolution of our interaction with the digital and, importantly, the spatial footprint of this interaction better than the Internet Archive. Therefore, over the last few years we have made extensive use of data from the Internet Archive in order to understand the geography and the evolution of the creation of online content and their interrelation with cities and spatial structure. Specifically, we have worked with The British Library and utilised the JISC UK Web Domain Dataset (1996-2013) [1] for a number of projects in order to (i) explore whether the availability of online content of local interest can attract individuals online, (ii) assess how the early engagement with web tools can affect future productivity, (iii) map the evolution of economic clusters, and (iv) predict interregional trade flows. The Internet Archive helps us not only to map the evolution and the geography of the engagement with the internet, especially at its early stages, and therefore draw important lessons regarding new future technologies, but also to understand economic activities that take place within and between cities.
[1] http://data.webarchive.org.uk/opendata/ukwa.ds.2/

Library as Laboratory Recap: Analyzing Biodiversity Literature at Scale

At a recent webinar hosted by the Internet Archive, leaders from the Biodiversity Heritage Library (BHL) shared how its massive open access digital collection documenting life on the planet is an invaluable resource for scientists and ordinary citizens.

“The BHL is a global consortium of the leading natural history museums, botanical gardens, and research institutions, big and small, from all over the world. Working together and in partnership with the Internet Archive, these libraries have digitized more than 60 million pages of scientific literature available to the public,” said Chris Freeland, director of Open Libraries and moderator of the event.

Watch session recording:

Established in 2006 with a commitment to inspiring discovery through free access to biodiversity knowledge, BHL has 19 members and 22 affiliates, plus 100 worldwide partners contributing data. The BHL has content dating back nearly 600 years alongside current literature that, when liberated from the print page, holds immense promise for advancing science and solving today’s pressing problems of climate change and the loss of biodiversity.

Martin Kalfatovic, BHL program director and associate director of the Smithsonian Libraries and Archives, noted in his presentation that Charles Darwin and colleagues famously said “the cultivation of natural science cannot be efficiently carried on without reference to an extensive library.”

“Today, the Biodiversity Heritage Library is creating this global, accessible open library of literature that will help scientists, taxonomists, environmentalists (a host of people working with our planet) to actually have ready access to these collections,” Kalfatovic said. BHL’s mission is to improve research methodology by working with its partner libraries and the broader biodiversity and bioinformatics community. BHL draws about 142,000 visitors each month and has served 12 million users overall.

Most of the BHL’s materials are from collections in the global north, primarily in large, well-funded institutions. Digitizing these collections helps level the playing field, providing researchers in all parts of the world equal access to vital content.

The vast collection includes species descriptions, distribution records, climate records, history of scientific discovery, information on extinct species, and records of where species live. To date, BHL has made over 176,000 titles and 281,000 volumes available. Through a partnership with the Global Names Architecture project, more than 243 million instances of taxonomic (Latin) names have been found in BHL content.

Kalfatovic underscored the value of BHL content in understanding the environment in the wake of recent troubling news from the Sixth Assessment Report (AR6) published by the Intergovernmental Panel on Climate Change about the impact of the earth’s warming.

Biodiversity Heritage Library by the numbers.

“The outlook for the planet is challenging,” he said. “By unlocking this historic data, we can find out where we’ve been over time to find out more about where we need to be in the future.”

JJ Dearborn, BHL data manager, discussed how digitization transforms physical books into digital objects that can be shared with “anyone, at any time, anywhere.” She described the Wikimedia ecosystem as “fertile ground for open access experimentation,” crediting the organization with giving BHL the ability to reach new audiences and transform its data into 5-star linked open data. “Dark data” locked up in legacy formats, JP2s, and OCR text is a source of valuable checklist, species occurrence, and event sampling data that the larger biodiversity community can use to improve humanity’s collective ability to monitor biodiversity loss and the destructive impacts of climate change at scale.

The majority of the world’s data today is siloed, unstructured, and unused, Dearborn explained. This “dark data,” she said, “represents an untapped resource that could really transform human understanding if it could be truly utilized. It might represent a gestalt leap for humanity.”

The event was the fifth in a series of six sessions highlighting how researchers in the humanities use the Internet Archive. The final session of the Library as Laboratory series will feature lightning talks on May 11 at 11am PT / 2pm ET. Register now!