Category Archives: Web & Data Services

Internet Archive Launches Collaborative, Web-Based Art Resources Preservation and Access Initiative

Much of the art gallery, artist, and arts organization material that was once published in print is now available primarily or solely on the web. These groups, like many in the cultural sector, have also been hit especially hard by the global pandemic, putting their web presences particularly at risk of being lost if they are not proactively collected and preserved. The creation of reference and research resources that promote streamlined access and enable new types of scholarly use will ensure that the art historical record of the 21st century, and especially of our current global pandemic, is readily accessible far into the future.

For this reason, the Internet Archive and the New York Art Resources Consortium (NYARC) are pleased to announce our project, Consortial Action to Preserve Born-Digital, Web-Based Art History & Culture. The project recently received a two-year, $305,343 Humanities Collections and Reference Resources grant from the Division of Preservation and Access at the National Endowment for the Humanities. This award will support the formation of a cooperative group of 30+ art and museum libraries from across the United States to collaborate on the preservation of, and access to, vital arts content from the web.

The Internet Archive has a long history of building and supporting collaborative communities and providing non-profit web, preservation, and access services to cultural heritage organizations. The multi-institutional initiative between Internet Archive, NYARC, and other arts and museum organizations will build on similar community-based archiving and professional cultivation projects in the Community Programs group, especially our Community Webs program, currently expanding nationally and internationally. Community Webs has received funding from The Andrew W. Mellon Foundation and IMLS to provide public libraries and cultural heritage organizations with services, training, and professional development opportunities to document their diverse local history. 

NYARC are pioneers in collaborative web archiving and shared services among art and museum libraries. NYARC’s robust web archive collections encompass art resources, artists’ websites, auction catalogs, catalogues raisonnés, and hundreds of New York City gallery websites. The Internet Archive and NYARC have partnered on work to build born-digital collecting capacity among arts organizations in the past, most recently in the IMLS-funded Advancing Art Libraries and Curated Web Archives National Forum and related events. Through discussions, workshops, and roadmapping sessions with leaders in art and museum libraries, the partners developed a strategy and plan for an inclusive, sustainable, cooperative approach to collecting and stewarding born-digital, locally focused art history collections, which forms the basis of this broader cooperative effort.

Members in the project’s preliminary group of art and museum libraries will select topics and specific web content that is relevant to their expertise, will provide metadata to facilitate access to archived content, and will participate in planning and evaluation meetings, all while curating a valuable reference resource that will enhance their traditional collecting areas. The Internet Archive will coordinate communications, facilitate governance and collective curatorial activities, provide technical digital library and archive services, and help enable members to build and maintain discovery and access platforms, as well as facilitate researcher use of the collections resulting from the group’s work.

If your art or museum library is interested in joining this collaborative effort, please fill out this participation form by July 31 to join us! 

Introducing 50+ New Public Library Members of the Internet Archive’s Community Webs Program

The Internet Archive’s Community Webs Program provides training and education, infrastructure and services, and professional community cultivation for public librarians across the country to document their local history and the lives of their patrons. Following our recent announcement of the program’s national expansion, with support from the Andrew W. Mellon Foundation, we are excited to welcome the first class of 50+ new public libraries to the program. This brings the current number of new and returning Community Webs participants to 90+ libraries from 33 states and 3 US territories. This diverse group of organizations includes multiple state libraries representing their regions, as well as a mix of large metropolitan library systems, small libraries in rural areas, and libraries like the Feleti Barstow Public Library in American Samoa. All will be working to document their communities, with a particular focus on archiving materials from traditionally underrepresented groups.

The new cohort class kicked off with virtual introductory events in mid-March, where participants met one another and shared stories about their communities and their goals for preserving and providing access to local history materials. Member libraries are currently receiving training in topics such as collection development and starting to build digital collections that reflect local diversity, events, and culture.

Program participant Kathleen Pickering, Director of the Belen Public Library and Harvey House Museum in Belen, New Mexico, notes that their library “is committed to free and open-source electronic resources for our patrons, especially given the low-income status of many of our residents,” and that Community Webs will help further that goal. Similarly, new cohort member Aaron Ramirez of Pueblo City-County Library District (PCCLD) found Community Webs to be a great fit for existing institutional goals and initiatives. “PCCLD’s five-year strategic plan directs us to embrace local cultures, to include individuals of all skill levels and physical abilities, and to enrich established partnerships and collaborations. The groups that have not seen themselves in our archives will find through this project PCCLD’s intention and means to listen and go forward as allies and as a resource of support, rather than an institution serving only the affluent.”


Makiba J. Foster, Manager of The African American Research Library and Cultural Center of Broward County, Florida, pointed out that “as content becomes increasingly digital, we need this opportunity to document the digital life and content of our community which includes a diverse representation of the Black Diaspora.” Makiba was a member of the original Community Webs cohort in a previous position at the Schomburg Center for Research in Black Culture at New York Public Library, and recently presented on her work archiving the Black Diaspora to a group of more than 200 attendees.

The Community Webs Program is continuing to grow towards the milestone of over 150 participating libraries across the United States and will soon announce another call for applicants for a U.S. cohort starting in late summer. The program is also beginning to expand internationally, starting in Canada; it is exploring the addition of other types of libraries and cultural heritage organizations and expanding the suite of training and services available to participants. Expect more news on these initiatives soon.

Welcome to our new cohort of Community Webs libraries! The full list of new members: 

  • Alamogordo Public Library (New Mexico)
  • Amelia Island Museum of History (Florida)
  • ART | library deco (Texas)
  • Asbury Park Public Library (New Jersey)
  • Atlanta History Center (Georgia)
  • Bartholomew County Public Library (Indiana)
  • Bedford Public Library System (Virginia)
  • Belen Public Library and Harvey House Museum (New Mexico)
  • Bensenville Community Public Library (Illinois)
  • Biblioteca Municipal Aurea M. Pérez (Puerto Rico)
  • Carbondale Public Library (Illinois)
  • Cedar Mill & Bethany Community Libraries (Oregon)
  • Charlotte Mecklenburg Library (North Carolina)
  • Chicago Public Library (Illinois)
  • City Archives & Special Collections, New Orleans Public Library (Louisiana)
  • Dayton Metro Library (Ohio)
  • Elba Public Library (Alabama)
  • Essex Library Association (Connecticut)
  • Everett Public Library (Washington)
  • Feleti Barstow Public Library (American Samoa)
  • Forsyth County Public Library (North Carolina)
  • Hartford History Center, Hartford Public Library (Connecticut)
  • Heritage Public Library (Virginia)
  • Huntsville-Madison County Public Library (Alabama)
  • James Blackstone Memorial Library (Connecticut)
  • Jefferson Parish Library (Louisiana)
  • Jefferson-Madison Regional Library (Virginia)
  • Laramie County Library System (Wyoming)
  • Lawrence Public Library (Massachusetts)
  • Los Angeles Public Library (California)
  • Mill Valley Public Library, Lucretia Little History Room (California)
  • Missoula Public Library (Montana)
  • Niagara Falls Public Library (New York)
  • Pueblo City-County Library District (Colorado)
  • Rochester Public Library (New York)
  • Santa Cruz Public Libraries (California)
  • South Pasadena Public Library (California)
  • State Library of Pennsylvania (Pennsylvania)
  • Tangipahoa Parish Library (Louisiana)
  • The African American Research Library and Cultural Center (Florida)
  • The Ferguson Library (Connecticut)
  • Three Rivers Public Library District (Illinois)
  • Virginia Beach Public Library (Virginia)
  • Waltham Public Library (Massachusetts)
  • Watsonville Public Library (California)
  • West Virginia Library Commission (West Virginia)
  • William B Harlan Memorial Library (Kentucky)
  • Worcester Public Library (Massachusetts)
  • Your Heritage Matters (North Carolina)

Early Web Datasets & Researcher Opportunities

In July, we announced our partnership with the Archives Unleashed project as part of our ongoing effort to make new services available for scholars and students to study the archived web. Joining the curatorial power of our Archive-It service, our work supporting text and data mining, and Archives Unleashed’s in-browser analysis tools will open up new opportunities for understanding the petabyte-scale volume of historical records in web archives.

As part of our partnership, we are releasing a series of publicly available datasets created from archived web collections. Alongside these efforts, the project is also launching a Cohort Program providing funding and technical support for research teams interested in studying web archive collections. These twin efforts aim to help build the infrastructure and services that allow more researchers to use web archives in their scholarly work. More details on the new public datasets and the Cohort Program are below.

Early Web Datasets

The first in our series of public datasets from the web collections is oriented around the theme of the early web. These are, of course, datasets intended for data mining and for researchers using computational tools to study large amounts of data, so they lack the informational or nostalgia value of looking at archived webpages in the Wayback Machine. If the latter is more your interest, here is an archived GeoCities page with unicorn GIFs.

GeoCities Collection (1994–2009)

As one of the first platforms for creating web pages without expertise, GeoCities lowered the barrier to entry for a new generation of website creators. There were at least 38 million pages displayed by GeoCities before it was terminated by Yahoo! in 2009. This dataset collection contains a number of individual datasets that include data such as domain counts, image graph and web graph data, and binary file information for a variety of file formats such as audio, video, text, and image files. A GraphML file is also available for the domain graph.

GeoCities Dataset Collection: https://archive.org/details/geocitiesdatasets
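For readers who want to explore a domain graph in GraphML form, the following is a minimal Python sketch using only the standard library. The sample graph, its node ids, and the helper name `in_degree_counts` are hypothetical and for illustration only; the actual layout and attributes of the GeoCities GraphML file may differ, so inspect the file before relying on this.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard GraphML XML namespace.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical stand-in for a domain graph; not taken from the real dataset.
SAMPLE_GRAPHML = """
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="domains" edgedefault="directed">
    <node id="geocities.com"/>
    <node id="yahoo.com"/>
    <node id="example.org"/>
    <edge source="geocities.com" target="yahoo.com"/>
    <edge source="example.org" target="yahoo.com"/>
    <edge source="geocities.com" target="example.org"/>
  </graph>
</graphml>
"""


def in_degree_counts(graphml_text):
    """Count how many edges point at each node (inbound links per domain)."""
    root = ET.fromstring(graphml_text.strip())
    counts = Counter()
    for edge in root.iterfind(".//g:edge", GRAPHML_NS):
        counts[edge.get("target")] += 1
    return counts


counts = in_degree_counts(SAMPLE_GRAPHML)
for domain, inbound in counts.most_common():
    print(domain, inbound)
```

For a large file, a streaming parser such as `ET.iterparse` or a dedicated graph library would likely be a better fit than loading the whole document into memory.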

Friendster (2003–2015)

Friendster was an early and widely used social networking site where users were able to establish and maintain layers of shared connections with other users. This dataset collection contains graph files that allow data-driven research to explore how certain pages within Friendster linked to each other. It also contains a dataset that provides some basic metadata about the individual files within the archival collection.

Friendster Dataset Collection: https://archive.org/details/friendsterdatasets

Early Web Language Datasets (1996–1999)

These two related datasets were generated from the Internet Archive’s global web archive collection. The first dataset, “Parallel Language Records of the Early Web (1996–1999),” provides a dataset of multilingual records, or URLs of websites that have the same text represented in multiple languages. Such multi-language text from websites is a rich source for parallel language corpora and can be valuable in machine translation. The second dataset, “Language Annotations of the Early Web (1996–1999),” is another metadata set that annotates the language of over four million websites using Compact Language Detector v3 (CLD3).

Early Web Language collection: https://archive.org/details/earlywebdatasets

Archives Unleashed Cohort Program

Applications are now being accepted from research teams interested in performing computational analysis of web archive data. Five cohort teams of up to five members each will be selected to participate in the program from July 2021 to June 2022. Teams will:

  • Participate in cohort events, training, and support, with a closing event held at the Internet Archive in San Francisco, California, USA, tentatively in May 2022. Prior events will be virtual or in-person, depending on COVID-19 restrictions.
  • Receive bi-monthly mentorship via support meetings with the Archives Unleashed team
  • Work in the Archive-It Research Cloud to generate custom datasets
  • Receive funding of $11,500 CAD to support project work. Additional support will be provided for travel to the Internet Archive event

Applications are due March 31, 2021. Please visit the Archives Unleashed Research Cohorts webpage for more details on the program and instructions on how to apply.

Search Scholarly Materials Preserved in the Internet Archive

Looking for a research paper but can’t find a copy in your library’s catalog or popular search engines? Give Internet Archive Scholar a try! We might have a PDF from a “vanished” Open Access publisher in our web archive, an author’s pre-publication manuscript from their archived faculty webpage, or a digitized microfilm version of an older publication.

We hope Internet Archive Scholar will aid researchers and librarians looking for specific open access papers that may not be otherwise available to them. Judith van Stegeren (@jd7g on Twitter), a PhD candidate in the Netherlands, encountered just such a situation recently when sharing a workshop paper on procedural generation in computer games: “Towards Qualitative Procedural Generation” by Mark R. Johnson, originally presented at the Computational Creativity & Games Workshop in 2016. The papers for this particular year of the workshop are not indexed in the usual bibliographic catalogs, and the original workshop website hosting the Open Access papers is no longer accessible. Fortunately, copies of all the 2016 workshop papers were captured in the Wayback Machine, and can be found today by searching IA Scholar by title or conference name.

As another example, dozens of papers from the Open Journal of Hematology are no longer resolvable via DOI. As mentioned in a previous blog post, the publisher’s website vanished and has been replaced with unrelated advertisements. But before that happened, the papers were captured in the Wayback Machine, indexed in our catalog, and can now be searched in full:

IA Scholar Search Results

IA Scholar is a simple, access-oriented interface to content identified across several Internet Archive collections, including web archives, archive.org files, and digitized print materials. The full text of articles is searchable for users who are hunting for particular phrases or keywords. This complements our existing full-text search index of millions of digitized books and other documents on archive.org.

The service builds on Fatcat, an open catalog we have developed to identify at-risk and web-published open scholarly outputs that can benefit from long-term preservation, additional metadata, and perpetual access. Fatcat includes resources that may be useful to librarians and archivists, such as bulk metadata dumps, a read/write API, command-line tool, and file-level archival metadata. If you are interested in collaborating with us, or are a researcher interested in text analysis applications, we have a public chat channel or can be contacted by email at info@archive.org.

IA Scholar marks a milestone in our work initiated in 2018 to leverage the automation and scale of web and API harvesting in providing open infrastructure for the preservation of and perpetual access to scholarly materials from the public web. We particularly want to thank the Mellon Foundation for their original and ongoing support of this work, our many current partners, and the other collaborators, contributors, and volunteers.

All of this is possible because of the incredible open research ecosystem built and collectively maintained by Open Access advocates. Thank you to the DOAJ and other groups for helping catalog open access journals, which has aided preservation. Thank you to the Biodiversity Heritage Library and its supporters for digitizing print journal literature. And thank you to the many other organizations we have worked with, integrated with, or whose services we have utilized, including open web indices (Unpaywall, CORE, CiteseerX, Microsoft Academic, Semantic Scholar), directories of open journals (DOAJ, ROAD, SHERPA/ROMEO, JURN, Wikidata), and open bibliographic catalogs (Crossref, Datacite, J-STAGE, Pubmed, dblp).

IA Scholar is built from open source software components, and is itself released as Free Software. The website has been translated into eight languages (so far!) by generous volunteers.

Seeking Public Library Participants for Community History Web Archiving Program

Local history collections are essential to understanding the life and culture of a community. As methods for sharing information have shifted towards the web, there are many more avenues for community members to document diverse experiences. Public libraries play a critical role in building community-oriented archives, and these collections are particularly important in recording the impact of unprecedented events on the lives of local citizens.

Last week, we announced a major national expansion of our Community Webs program providing infrastructure, services, and training to public librarians to archive local history as documented on the web… We now invite public libraries in the United States and cultural heritage organizations in U.S. territories to apply to join the Community Webs program. Participants in the program receive free web archiving and technical services, education, professional development, and funding to build community history web archives, especially collections documenting the lives of patrons and communities traditionally under-represented in the historical record.

If you are a public librarian interested in joining the Community Webs program, please review the full call for applications and the program FAQs. Online applications are being accepted through Sunday, January 31, 2021.

“Whether documenting the indie music scene of the 1990s, researching the history of local abolitionists and formerly enslaved peoples, or helping patrons research the early LGBT movement, I am frequently reminded of what was not saved or is not physically present in our collections. These gaps or silences often reflect subcultures in our community.” – Dylan Gaffney, Forbes Library, in Northampton, MA

The program is seeking public libraries to join a diverse network of 150+ organizations that are:

  • Documenting local history by saving web-published sites, stories and community engagement on the web.
  • Growing their professional skills and increasing institutional technical capacity by engaging in a supportive network of peer organizations pursuing this work.
  • Building a public understanding of web archiving as a practice and its importance to preserving 21st century community history and underrepresented voices.

Current Community Webs cohort members have created nearly 300 publicly available local history web archive collections on topics ranging from COVID-19, to local arts and culture, to 2020 local and U.S. elections. Collecting the web-published materials of local organizations, movements and individuals is often the primary way to document their presence for future historians.

“During the summer of 2016, Baton Rouge witnessed the shooting of Alton Sterling, the mass shooting of Baton Rouge law enforcement, and the Great Flood of 2016. While watching these events unfold from our smartphones and computers, we at the East Baton Rouge Parish Library realized this information might be in jeopardy of never being acquired and preserved due to a shift in the way information is being created and disseminated.” – Emily Ward, East Baton Rouge Parish Library

Benefits of participation in Community Webs include:

  • A three-year subscription to the Archive-It web archiving service.
  • Funding to support travel to a full-day Community Webs National Symposium (projected for 2021 and 2022) and other professional development opportunities.
  • Extensive training and educational resources provided by professional staff.
  • Membership in an active and diverse community of public librarians across the country. 
  • Options to increase access to, and discoverability of, program collections via hubs such as DPLA.
  • Funding to support local outreach, public programming, and community collaborations. 

Please feel free to email us with any questions and be sure to apply by Sunday, January 31, 2021.

Community Webs Program Receives $1,130,000 Andrew W. Mellon Foundation Award for a National Network of Public Libraries Building Local History Web Archives

More than ever, the lives of communities are documented online. The web remains a vital resource for traditionally under-represented groups to write and share about their lives and experiences. Preserving this web-published material, in turn, allows libraries to build more expansive, inclusive, and community-oriented archival collections.

In 2017, the Internet Archive’s Archive-It service launched the program, “Community Webs: Empowering Public Libraries to Create Community History Web Archives.” The program provides training, professional development, cohort building, and technical services for public librarians to curate community archives of websites, social media, and online material documenting the experiences of their patrons, especially those often underrepresented in traditional physical archives. Since its launch, the program has grown to include 40 public libraries in 21 states that have built almost 300 collections documenting local civic life, especially of marginalized groups, creating an archive totaling over 50 terabytes and tens of millions of individual digital documents, images, audio-video, and more. The program received additional funding in 2019 to continue its work and focus on strategic planning, partnering with the Educopia Institute to ensure the growth and sustainability of the program and the cohort.

We are excited to announce that Community Webs has received $1,130,000 in funding from The Andrew W. Mellon Foundation for “Community Webs: A National Network of Public Library Web Archives Documenting Local History & Underrepresented Groups,” a nationwide expansion of the program to include a minimum of 2 public libraries in each of the 50 United States, plus additional local history organizations in U.S. territories, for a total of 150-200 participating public libraries and heritage organizations. Participants will receive web archiving and access services, training and education, and funds to promote and pursue their community archiving. The Community Webs National Network will also make the resulting public library local history community web archives available to scholars through specialized access tools and datasets, partner with affiliated national discovery and digital collections platforms such as DPLA, and build partnerships and collaborations with state and regional groups advancing local history digital preservation efforts. We thank The Andrew W. Mellon Foundation for their generous support to grow this program nationwide and empower hundreds of public librarians to build archives that elevate the voices, lives, and events of their underrepresented communities and ensure this material is permanently available to patrons, students, scholars, and citizens.

Over the course of the Community Webs program, participating public libraries have created diverse collections on a wide range of topics, often in collaboration with members of their local communities. Examples include:

  • Community Webs members have created collections related to the COVID-19 pandemic, including Schomburg Center for Research in Black Culture’s “Novel Coronavirus COVID-19” collection which focuses on “the African diasporan experiences of COVID-19 including racial disparities in health outcomes and access, the impact on Black-owned businesses, and cultural production.” Athens Regional Library System created a collection of “Athens, Georgia Area COVID-19 Response” which focuses on the social, economic and health impacts of COVID-19 on the local community, with specific attention on community efforts to support frontline workers. A recent American Libraries article featured the COVID archiving work of public libraries.
  • Columbus Metropolitan Library’s archive of “Immigrant Experience”, a collection of websites on the activities, needs, and culture of immigrant communities in Central Ohio.
  • Sonoma County Public Library’s “North Bay Fires, 2017” collection documenting when “devastating firestorms swept through Sonoma, Napa, and Mendocino Counties” and part of their “Sonoma Responds: Community Memory Archive.”
  • Birmingham Public Library’s “LGBTQ in Alabama” collection “documenting the history and experiences of the LGBTQ community in Alabama.”

We look forward to expanding the Community Webs program nationwide in order to enable hundreds of public libraries to continue to build web collections documenting their communities, especially in these historic times.

We expect to put out a Call for Applications in early December for public libraries to join Community Webs. Please pass along this opportunity to your local public library. For more information on the program, check out our website or email us with questions.

Internet Archive Participates in DOAJ-Led Collaboration to Improve the Preservation of OA Journals

Since 2017, Internet Archive has pursued dedicated technical and partnership work to help preserve and provide perpetual access to open access scholarly literature and other outputs. See our original announcement related to this work and a recent update on progress. The below official press release announces an exciting new multi-institutional collaboration in this area.

The Directory of Open Access Journals (DOAJ), the CLOCKSS Archive, Internet Archive, Keepers Registry/ISSN International Centre and Public Knowledge Project (PKP) have agreed to partner to provide an alternative pathway for the preservation of small-scale, APC-free, Open Access journals.

The recent study authored by M. Laakso, L. Matthias, and N. Jahn has revived academia’s concern over the disappearance of the scholarly record disseminated in Open Access (OA) journals.

Their research focuses on OA journals as at risk of vanishing, and “especially small-scale and APC-free journals […] with limited financial resources” that often “opt for lightweight technical solutions” and “cannot afford to enroll in preservation schemes.” The authors used data available in the Directory of Open Access Journals to conclude that just under half of the journals indexed in DOAJ participate in preservation schemes. Their findings “suggest that current approaches to digital preservation are successful in archiving content from larger journals and established publishing houses but leave behind those that are more at risk.” They call for new preservation initiatives “to develop alternative pathways […] better suited for smaller journals that operate without the support of large, professional publishers.”

Answering that call, the joint initiative proposed by the five organisations aims to offer an affordable archiving option to OA journals with no author fees (“diamond” OA) registered with DOAJ, and to raise awareness among the editors and publishers of these journals about the importance of enrolling with a preservation solution. DOAJ will act as a single interface with CLOCKSS, PKP and Internet Archive and facilitate a connection to these services for interested journals. Lars Bjørnshauge, DOAJ Managing Editor, said: “That this group of organisations are coming together to find a solution to the problem of ‘vanishing’ journals is exciting. It comes as no surprise that journals with little to no funding are prone to disappearing. I am confident that we can make a real difference here.”

Reports regarding the effective preservation of the journals’ content will be aggregated by the ISSN International Centre (ISSN IC) and published in the Keepers Registry. Gaëlle Béquet, ISSN IC Director, commented: “As the operator of the Keepers Registry service, the ISSN International Centre receives inquiries from journal publishers looking for archiving solutions. This project is a new step in the development of our service to meet this need in a transparent and diverse way involving all our partners.”

About 50% of the journals identified by DOAJ as having no archiving solution in place use Open Journal Systems (OJS). Therefore, the initiative will also identify and encourage journals on PKP’s OJS platform to preserve their content in the PKP Preservation Network (PKP PN), or to use another supported solution if the OJS instance isn’t new enough to be compatible with the PN integration (OJS 3.1.2+).
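The version threshold mentioned above (OJS 3.1.2+) can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, it is not part of OJS or the PKP PN tooling, and it assumes version strings made of numeric components separated by dots or hyphens.

```python
import re

# Minimum OJS version stated in the text as compatible with the PN integration.
MIN_PN_VERSION = (3, 1, 2)


def parse_version(text):
    """Turn a version string such as '3.1.2-4' into a tuple of integers."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def supports_pn_integration(ojs_version):
    """Return True if the given OJS version is at least 3.1.2."""
    return parse_version(ojs_version) >= MIN_PN_VERSION


for version in ["2.4.8", "3.1.1", "3.1.2", "3.3.0-8"]:
    status = "PKP PN integration" if supports_pn_integration(version) else "another supported solution"
    print(version, "->", status)
```

Python compares tuples element by element, so "3.1" sorts below "3.1.2" and "3.3.0-8" sorts above it, which matches the intent of a minimum-version check.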

The partners will then follow up by assessing the success and viability of the initiative with an aim to open it up to new archiving agencies and other groups of journals indexed in DOAJ to consolidate preservation actions and ensure service diversity.

DOAJ will act as the central hub where publishers will indicate that they want to participate. Archiving services, provided by CLOCKSS, Internet Archive and PKP will expand their existing capacities. These agencies will report their metadata to the Keepers Registry to provide an overview of the archiving efforts. 

Project partners are currently exploring business and financial sustainability models and outlining areas for technical collaboration.


DOAJ is a community-curated list of peer-reviewed, open access journals and aims to be the starting point for all information searches for quality, peer reviewed open access material. DOAJ’s mission is to increase the visibility, accessibility, reputation, usage and impact of quality, peer-reviewed, open access scholarly research journals globally, regardless of discipline, geography or language. DOAJ will work with editors, publishers and journal owners to help them understand the value of best practice publishing and standards and apply those to their own operations. DOAJ is committed to being 100% independent and maintaining all of its services and metadata as free to use or reuse for everyone.

CLOCKSS is a not-for-profit joint venture among the world’s leading academic publishers and research libraries whose mission is to build a sustainable, international, and geographically distributed dark archive with which to ensure the long-term survival of Web-based scholarly publications for the benefit of the greater global research community. https://www.clockss.org.

Internet Archive is a non-profit digital library, a top 200 website at https://archive.org/, and an archive of over 60PB comprising millions of free books, movies, software, music, websites, and more. The Internet Archive partners with over 800 libraries, universities, governments, non-profits, scholarly communications, and open knowledge organizations around the world to advance the shared goal of “Universal Access to All Knowledge.” Since 2017, Internet Archive has pursued partnerships and technical work with a focus on preserving all publicly accessible research outputs, especially at-risk, open access journal literature and data, and providing mission-aligned, non-commercial open infrastructure for the preservation of scholarly knowledge.

Keepers Registry, hosted by the ISSN International Centre, an intergovernmental organisation under the auspices of UNESCO, is a global service that monitors the archiving arrangements for continuing resources including e-serials. A dozen archiving agencies all around the world currently report to Keepers Registry. The Registry has three main purposes: 1/ to enable librarians, publishers and policy makers to find out who is looking after what e-content, how, and with what terms of access; 2/ to highlight e-journals which are still “at risk of loss” and need to be archived; 3/ to showcase the archiving organizations around the world, i.e. the Keepers, which provide the digital shelves for access to content over the long term.

PKP is a multi-university and long-standing research project that develops (free) open source software to improve the quality and reach of scholarly publishing. For more than twenty years, PKP has played an important role in championing open access. Open Journal Systems (OJS) was released in 2002 to help reduce cost as a barrier to creating and consuming scholarship online. Today, it is the world’s most widely used open source platform for journal publishing: approximately 42% of the journals in the DOAJ identify OJS as their platform/host/aggregator. In 2014, PKP launched its own Private LOCKSS Network (now the PKP PN) to offer OJS journals unable to invest in digital preservation a free, open, and trustworthy service. 

For more information, contact: 

DOAJ: Dom Mitchell, dom@doaj.org

CLOCKSS: Craig Van Dyck, cvandyck@clockss.org

Internet Archive: Jefferson Bailey, jefferson@archive.org

Keepers Registry: Gaëlle Béquet, gaelle.bequet@issn.org

PKP: James MacGregor, jbm9@sfu.ca

How the Internet Archive is Ensuring Permanent Access to Open Access Journal Articles

Internet Archive has archived and identified 9 million open access journal articles; the next 5 million are getting harder

Open Access journals, such as New Theology Review (ISSN: 0896-4297) and Open Journal of Hematology (ISSN: 2075-907X), made their research articles available for free online for years. With a quick click or a simple query, students anywhere in the world could access their articles, and diligent Wikipedia editors could verify facts against original articles on vitamin deficiency and blood donation.  

But some journals, such as these titles, are no longer available from their publishers’ websites, and are only available through the Internet Archive’s Wayback Machine. Since 2017, the Internet Archive has joined others in concentrating on archiving all scholarly literature and making it permanently accessible.

The World Wide Web has made it easier than ever for scholars to collaborate, debate, and share their research. Unfortunately, the structure of today’s web means that content can disappear just as easily: as of today the official publisher websites and DOI redirects for both of the above journals go nowhere or have been replaced with unrelated content.


Wayback Machine captures of Open Access journals now “vanished” from publisher websites

Vigilant librarians saw this problem coming decades ago, when the print-to-digital migration was getting started. They insisted that commercial publishers work with contract digital preservation organizations (such as Portico, LOCKSS, and CLOCKSS) to ensure long-term access to expensive journal subscription content. Efforts have been made to preserve open articles as well, such as Public Knowledge Project’s Private LOCKSS Network for OJS journals and national hosting platforms like the SciELO network. But a portion of all scholarly articles continues to fall through the cracks.

Researchers found that 176 open access journals have already vanished from their publishers’ websites over the past two decades, according to a recent preprint article by Mikael Laakso, Lisa Matthias, and Najko Jahn. These periodicals were from all regions of the world and represented all major disciplines: sciences, humanities and social sciences. There are over 14,000 open access journals indexed by the Directory of Open Access Journals, and the paper suggests another 900 of those are inactive and at risk of disappearing. The preprint has struck a nerve, receiving news coverage in Nature and Science.

In 2017, with funding support from The Andrew W. Mellon Foundation and the Kahle/Austin Foundation, the Internet Archive launched a project focused on preserving all publicly accessible research documents, with a particular focus on open access materials. Our first job was to quantify the scale of the problem.

Monitoring independent preservation of Open Access journal articles published from 1996 through 2019. Categories are defined in the article text.

Of the 14.8 million known open access articles published since 1996, the Internet Archive has archived, identified, and made available through the Wayback Machine 9.1 million of them (“bright” green in the chart above). In the jargon of Open Access, we are counting only “gold” and “hybrid” articles which we expect to be available directly from the publisher, as opposed to preprints, such as in arxiv.org or institutional repositories. Another 3.2 million are believed to be preserved by one or more contracted preservation organizations, based on records kept by Keepers Registry (“dark” olive in the chart). These copies are not intended to be accessible to anybody unless the publisher becomes inaccessible, in which case they are “triggered” and become accessible.

This leaves at least 2.4 million Open Access articles at risk of vanishing from the web (“None”, red in the chart). While many of these are still on publishers’ websites, they have proven difficult to archive.
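The breakdown above can be tallied in a few lines. The following is a minimal sketch that uses only the rounded counts quoted in this post (in millions of articles); the variable and function names are illustrative, and the small remainder is presumably a rounding effect in the published figures rather than a separate category.

```python
# Rounded counts (millions of articles) as quoted in the text.
TOTAL_KNOWN = 14.8
categories = {
    "bright (Internet Archive)": 9.1,
    "dark (Keepers Registry)": 3.2,
    "none (at risk)": 2.4,
}


def shares(counts, total):
    """Return each category's share of the total as a percentage."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


percentages = shares(categories, TOTAL_KNOWN)
# Whatever the three categories do not cover, in millions.
unaccounted = round(TOTAL_KNOWN - sum(categories.values()), 1)

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
print(f"not covered by the three categories: {unaccounted} million")
```

On these figures, roughly three in five known open access articles fall in the “bright” category and about one in six have no known independent preservation.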

One of our goals is to archive as many of the articles on the open web as we can, and to keep up with the growing stream of new articles published every day. Another is to look back over the vast petabytes of web content in the Wayback Machine, back to 1996, and find any content that we might already have but that is not easily findable or discoverable. Both of these projects are amenable to software automation, but are made more difficult by the evolving nature of HTML and PDFs and their diverse character sets and encodings. To that end, we have approached this project not just as a technical one, but also as a collaborative one that aims to add another piece to the distributed infrastructure supporting open scholarship.

To expand our reach, we built an editable catalog (https://fatcat.wiki) with an open API to allow anybody to contribute. As the software is free and open source, as is the data, we invite others to reuse and link to the content we have archived. We have also indexed and made searchable much of the literature to help manage our work and to help others find out whether we have archived particular articles. We want to make scholarly material permanently available, and available in new ways, including via large datasets for analysis and “meta research.”
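As a rough illustration of what querying the catalog's open API can look like, the sketch below builds a lookup URL for a catalog record by DOI. The base URL `https://api.fatcat.wiki/v0` and the `release/lookup` path with a `doi` parameter are assumptions based on the public fatcat API and should be checked against its current documentation; the helper names and the example DOI are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the public fatcat API; verify before relying on it.
FATCAT_API = "https://api.fatcat.wiki/v0"


def release_lookup_url(doi: str) -> str:
    """Build a lookup URL for a DOI (the endpoint path is an assumption)."""
    # DOIs are case-insensitive, so normalize to lowercase before encoding.
    query = urllib.parse.urlencode({"doi": doi.strip().lower()})
    return f"{FATCAT_API}/release/lookup?{query}"


def fetch_release(doi: str, timeout: float = 10.0) -> dict:
    """Fetch and decode the JSON record for a DOI. Requires network access."""
    with urllib.request.urlopen(release_lookup_url(doi), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder DOI for illustration only; it does not identify a real article.
    print(release_lookup_url("10.1234/Example.5678"))
```

Only the URL construction runs by default; `fetch_release` is left for readers who want to try a real DOI against the live service.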

We also want to acknowledge the many partnerships and collaborations that have supported this work, many of which are key parts of the open scholarly infrastructure, including ISSN, DOAJ, LOCKSS, Unpaywall, Semantic Scholar, CiteSeerX, Crossref, DataCite, and many others. We also want to acknowledge the many Internet Archive staff and volunteers who have contributed to this work, including Bryan Newbold, Martin Czygan, Paul Baclace, Jefferson Bailey, Kenji Nagahashi, David Rosenthal, Victoria Reich, Ellen Spertus, and others.

If you would like to participate in this project, please contact the Internet Archive at webservices@archive.org.

Archive-It and Archives Unleashed Join Forces to Scale Research Use of Web Archives

Archived web data and collections are increasingly important to scholarly practice, especially to those scholars interested in data mining and computational approaches to analyzing large sets of data, text, and records from the web. For over a decade Internet Archive has worked to support computational use of its web collections through a variety of services, from making raw crawl data available to researchers and performing customized extraction and analytic services supporting network or language analysis, to hosting web data hackathons and offering dataset download features in Archive-It, our popular suite of web archiving services. Since 2016, we have also collaborated with the Archives Unleashed project to support their efforts to build tools, platforms, and learning materials for social science and humanities scholars to study web collections, including those curated by the 700+ institutions using Archive-It.

We are excited to announce a significant expansion of our partnership. With a generous award of $800,000 (USD) to the University of Waterloo from The Andrew W. Mellon Foundation, Archives Unleashed and Archive-It will broaden our collaboration and further integrate our services to provide easy-to-use, scalable tools to scholars, researchers, librarians, and archivists studying and stewarding web archives. Further integration of Archives Unleashed and Archive-It’s Research Services (and IA’s Web & Data Services more broadly) will make it simpler for scholars to analyze archived web data and give digital archivists and librarians expanded tools for making their collections available as data, as pre-packaged datasets, and as archives that can be analyzed computationally. It will also offer researchers a best-of-class, end-to-end service for collecting, preserving, and analyzing web-published materials.

The Archives Unleashed project brings together a team of co-investigators. Professor Ian Milligan, from the University of Waterloo’s Department of History, Jimmy Lin, Professor and Cheriton Chair at Waterloo’s Cheriton School of Computer Science, and Nick Ruest, Digital Assets Librarian in the Digital Scholarship Infrastructure department of York University Libraries, along with Jefferson Bailey, Director of Web Archiving & Data Services at the Internet Archive, will all serve as co-Principal Investigators on the “Integrating Archives Unleashed Cloud with Archive-It” project. This project represents a follow-on to the Archives Unleashed project that began in 2017, also funded by The Andrew W. Mellon Foundation.

“Our first stage of the Archives Unleashed Project,” explains Professor Milligan, “built a stand-alone service that turns web archive data into a format that scholars could easily use. We developed several tools, methods and cloud-based platforms that allow researchers to download a large web archive from which they can analyze all sorts of information, from text and network data to statistical information. The next logical step is to integrate our service with the Internet Archive, which will allow a scholar to run the full cycle of collecting and analyzing web archival content through one portal.”

“Researchers, from both the sciences and the humanities, are finally starting to realize the massive trove of archived web materials that can support a wide variety of computational research,” said Bailey. “We are excited to scale up our collaboration with Archives Unleashed to make the petabytes of web and data archives collected by Archive-It partners and other web archiving institutions around the world more useful for scholarly analysis.” 

The project begins in July 2020 and will start releasing public datasets as part of the integration later in the year. Upcoming and future work includes technical integration of Archives Unleashed and Archive-It, creation and release of new open-source tools, datasets, and code notebooks, and a series of in-person “datathons” supporting a cohort of scholars using archived web data and collections in their data-driven research and analysis. We are grateful to The Andrew W. Mellon Foundation for their support of this integration and collaboration, which strengthens critical infrastructure for computational scholarship and its use of the archived web.

Primary contacts:
IA – Jefferson Bailey, Director of Web Archiving & Data Services, jefferson [at] archive.org
AU – Ian Milligan, Professor of History, University of Waterloo, i2milligan [at] uwaterloo.ca

Archiving Online Local News with the News Measures Research Project

Over the past two years, Archive-It, Internet Archive’s web archiving service, has partnered with researchers at the Hubbard School of Journalism and Mass Communication at University of Minnesota and the DeWitt Wallace Center for Media and Democracy at Duke University in a project designed to evaluate the health of local media ecosystems as part of the News Measures Research Project, funded by the Democracy Fund. The project is led by Phil Napoli at Duke University and Matthew Weber at University of Minnesota. Project staff worked with Archive-It to crawl and archive the homepages of 663 local news websites representing 100 communities across the United States. Seven crawls were run on single days from July through September and captured over 2.2TB of unique data and 16 million URLs. Initial findings from the research detail how local communities cover core topics such as emergencies, politics and transportation. Additional findings look at the volume of local news produced by different media outlets, and show the importance of local newspapers in providing communities with relevant content.

The goal of the News Measures Research Project is to examine the health of local community news by analyzing the amount and type of local news coverage in a sample of communities. In order to generate a random and unbiased sample of communities, the team used US Census data. Prior research suggested that average income in a community is correlated with the amount of local news coverage; thus the team decided to focus on three different income brackets (high, medium and low), using the Census data to break up the communities into categories. Rural areas and major cities were eliminated from the sample in order to reduce the number of outliers; this left a list of 1,559 communities ranging in population from 20,000 to 300,000 and in average household income from $21,000 to $215,000. Next, a random sample of 100 communities was selected, and a rigorous search process was applied to build a list of 663 news outlets that cover local news in those communities (based on Web searches and established directories such as Cision).
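The selection steps described above can be sketched in code. This is a minimal illustration on synthetic data, not the research team's actual procedure. The record fields, the income cut-offs for the three brackets, the use of population bounds as a stand-in for removing rural areas and major cities, and the random seeds are all assumptions; only the population range and the sample size of 100 come from the text.

```python
import random

# Bounds and sample size taken from the text; everything else is illustrative.
MIN_POP, MAX_POP = 20_000, 300_000
SAMPLE_SIZE = 100


def income_bracket(income, low_cutoff=40_000, high_cutoff=80_000):
    """Assign a bracket using hypothetical cut-offs (not from the study)."""
    if income < low_cutoff:
        return "low"
    if income < high_cutoff:
        return "medium"
    return "high"


def select_communities(communities, k=SAMPLE_SIZE, seed=0):
    """Drop communities outside the population bounds, then sample k at random."""
    eligible = [c for c in communities if MIN_POP <= c["population"] <= MAX_POP]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    chosen = rng.sample(eligible, k)
    # Return copies annotated with a bracket, leaving the input untouched.
    return [{**c, "bracket": income_bracket(c["income"])} for c in chosen]


# Synthetic stand-in for Census records.
_rng = random.Random(42)
census = [
    {
        "name": f"community-{i}",
        "population": _rng.randint(1_000, 1_000_000),
        "income": _rng.randint(21_000, 215_000),
    }
    for i in range(5_000)
]

sample = select_communities(census)
counts = {b: sum(c["bracket"] == b for c in sample) for b in ("low", "medium", "high")}
print(len(sample), counts)
```

A simple random draw is used here because the text does not say whether the sample was stratified by income bracket; a stratified variant would sample within each bracket separately.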

The News Measures Research Project web captures provide a unique snapshot of local news in the United States. The work is focused on analyzing the nature of local news coverage at a local level, while also examining the broader nature of local community news. At the local level, the 100-community sample provides a way to look at the nature of local news coverage: a team of coders analyzed content on the archived web pages to assess what is being covered by a given news outlet. Often, the websites that serve a local community are simply aggregating content from other outlets, rather than providing unique content. The research team was most interested in understanding the degree to which local news outlets are actually reporting on topics that are pertinent to a given community (e.g. local politics). At the global level, the team looked at interaction between community news websites (e.g. sharing of content) as well as automated measures of the amount of coverage.

The primary data for the researchers was the archived local community news data, but in addition, the team worked with census data to aggregate other measures such as circulation data for newspapers. These data allowed the team to examine how the amount and type of local news change depending on the characteristics of the community. Because the team was using multiple datasets, the web data is just one part of the puzzle. The WAT data format proved particularly useful for the team in this regard. Using the WAT file format allowed the team to avoid digging deeply into the data; instead, the WAT data allowed the team to examine high-level structure without needing to examine the content of each and every WARC record. Down the road, the WARC data allows for a deeper dive, but the lighter metadata format of the WAT files has enabled early analysis.
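To illustrate why a metadata-only format is convenient, the sketch below pulls outbound links from a single WAT-style JSON payload without touching the page body. The sample record is hand-written and heavily simplified, and the key layout (`Envelope`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`) is an assumption based on typical WAT output; real files should be checked against the tool that produced them. The function name and example URLs are hypothetical.

```python
import json
from urllib.parse import urljoin, urlparse

# Hand-written, simplified stand-in for one WAT metadata payload.
SAMPLE_WAT = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "https://news.example.org/"},
        "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {
            "Head": {"Title": "Example Local News"},
            "Links": [
                {"path": "A@/href", "url": "/politics/council-vote"},
                {"path": "A@/href", "url": "https://wire.example.com/story/1"},
                {"path": "IMG@/src", "url": "/static/logo.png"},
            ],
        }}},
    }
})


def outbound_links(wat_json):
    """Return (page_url, hosts linked from anchor elements) for one record."""
    envelope = json.loads(wat_json)["Envelope"]
    page = envelope["WARC-Header-Metadata"]["WARC-Target-URI"]
    html = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    hosts = []
    for link in html.get("Links", []):
        # Keep only anchor links; skip images, scripts, and other resources.
        if link.get("path") == "A@/href" and "url" in link:
            # Resolve relative URLs against the page before taking the host.
            hosts.append(urlparse(urljoin(page, link["url"])).netloc)
    return page, hosts


page, hosts = outbound_links(SAMPLE_WAT)
print(page, hosts)
```

A host-level link list like this is one example of the high-level structure that can be examined from metadata alone, for instance when looking at how often one outlet's homepage points to another's.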

Stay tuned for more updates as research utilizing this data continues! The websites selected will continue to be archived and much of the data are publicly available.