Tag Archives: web archiving

Internet Archive Participates in DOAJ-Led Collaboration to Improve the Preservation of OA Journals

Since 2017, Internet Archive has pursued dedicated technical and partnership work to help preserve and provide perpetual access to open access scholarly literature and other outputs. See our original announcement related to this work and a recent update on progress. The official press release below announces an exciting new multi-institutional collaboration in this area.

The Directory of Open Access Journals (DOAJ), the CLOCKSS Archive, Internet Archive, Keepers Registry/ISSN International Centre and Public Knowledge Project (PKP) have agreed to partner to provide an alternative pathway for the preservation of small-scale, APC-free, Open Access journals.

The recent study authored by M. Laakso, L. Matthias, and N. Jahn has revived academia’s concern over the disappearance of the scholarly record disseminated in Open Access (OA) journals.

Their research focuses on OA journals at risk of vanishing, “especially small-scale and APC-free journals […] with limited financial resources” that often “opt for lightweight technical solutions” and “cannot afford to enroll in preservation schemes.” Using data available in the Directory of Open Access Journals, the authors conclude that just under half of the journals indexed in DOAJ participate in preservation schemes. Their findings “suggest that current approaches to digital preservation are successful in archiving content from larger journals and established publishing houses but leave behind those that are more at risk.” They call for new preservation initiatives “to develop alternative pathways […] better suited for smaller journals that operate without the support of large, professional publishers.”

Answering that call, the joint initiative proposed by the five organisations aims at offering an affordable archiving option to OA journals with no author fees (“diamond” OA) registered with DOAJ, as well as raising awareness among the editors and publishers of these journals about the importance of enrolling with a preservation solution. DOAJ will act as a single interface with CLOCKSS, PKP and Internet Archive and facilitate a connection to these services for interested journals. Lars Bjørnshauge, DOAJ Managing Editor, said: “That this group of organisations are coming together to find a solution to the problem of ‘vanishing’ journals is exciting. It comes as no surprise that journals with little to no funding are prone to disappearing. I am confident that we can make a real difference here.”

Reports regarding the effective preservation of the journals’ content will be aggregated by the ISSN International Centre (ISSN IC) and published in the Keepers Registry. Gaëlle Béquet, ISSN IC Director, commented: “As the operator of the Keepers Registry service, the ISSN International Centre receives inquiries from journal publishers looking for archiving solutions. This project is a new step in the development of our service to meet this need in a transparent and diverse way involving all our partners.”

About 50% of the journals identified by DOAJ as having no archiving solution in place use Open Journal Systems (OJS). Therefore, the initiative will also identify and encourage journals on PKP’s OJS platform to preserve their content in the PKP Preservation Network (PKP PN), or to use another supported solution if the OJS instance isn’t new enough to be compatible with the PN integration (OJS 3.1.2+).

The partners will then follow up by assessing the success and viability of the initiative with an aim to open it up to new archiving agencies and other groups of journals indexed in DOAJ to consolidate preservation actions and ensure service diversity.

DOAJ will act as the central hub where publishers will indicate that they want to participate. Archiving services provided by CLOCKSS, Internet Archive and PKP will expand their existing capacities. These agencies will report their metadata to the Keepers Registry to provide an overview of the archiving efforts.

Project partners are currently exploring business and financial sustainability models and outlining areas for technical collaboration.


DOAJ is a community-curated list of peer-reviewed, open access journals and aims to be the starting point for all information searches for quality, peer reviewed open access material. DOAJ’s mission is to increase the visibility, accessibility, reputation, usage and impact of quality, peer-reviewed, open access scholarly research journals globally, regardless of discipline, geography or language. DOAJ will work with editors, publishers and journal owners to help them understand the value of best practice publishing and standards and apply those to their own operations. DOAJ is committed to being 100% independent and maintaining all of its services and metadata as free to use or reuse for everyone.

CLOCKSS is a not-for-profit joint venture among the world’s leading academic publishers and research libraries whose mission is to build a sustainable, international, and geographically distributed dark archive with which to ensure the long-term survival of Web-based scholarly publications for the benefit of the greater global research community. https://www.clockss.org.

Internet Archive is a non-profit digital library, top 200 website at https://archive.org/, and archive of over 60PB of millions of free books, movies, software, music, websites, and more. The Internet Archive partners with over 800 libraries, universities, governments, non-profits, scholarly communications, and open knowledge organizations around the world to advance the shared goal of “Universal Access to All Knowledge.” Since 2017, Internet Archive has pursued partnerships and technical work with a focus on preserving all publicly accessible research outputs, especially at-risk, open access journal literature and data, and providing mission-aligned, non-commercial open infrastructure for the preservation of scholarly knowledge.

Keepers Registry, hosted by the ISSN International Centre, an intergovernmental organisation under the auspices of UNESCO, is a global service that monitors the archiving arrangements for continuing resources including e-serials. A dozen archiving agencies all around the world currently report to Keepers Registry. The Registry has three main purposes: 1/ to enable librarians, publishers and policy makers to find out who is looking after what e-content, how, and with what terms of access; 2/ to highlight e-journals which are still “at risk of loss” and need to be archived; 3/ to showcase the archiving organizations around the world, i.e. the Keepers, which provide the digital shelves for access to content over the long term.

PKP is a multi-university and long-standing research project that develops (free) open source software to improve the quality and reach of scholarly publishing. For more than twenty years, PKP has played an important role in championing open access. Open Journal Systems (OJS) was released in 2002 to help reduce cost as a barrier to creating and consuming scholarship online. Today, it is the world’s most widely used open source platform for journal publishing: approximately 42% of the journals in the DOAJ identify OJS as their platform/host/aggregator. In 2014, PKP launched its own Private LOCKSS Network (now the PKP PN) to offer OJS journals unable to invest in digital preservation a free, open, and trustworthy service. 

For more information, contact: 

DOAJ: Dom Mitchell, dom@doaj.org

CLOCKSS: Craig Van Dyck, cvandyck@clockss.org

Internet Archive: Jefferson Bailey, jefferson@archive.org

Keepers Registry: Gaëlle Béquet, gaelle.bequet@issn.org

PKP: James MacGregor, jbm9@sfu.ca

Archive-It and Archives Unleashed Join Forces to Scale Research Use of Web Archives

Archived web data and collections are increasingly important to scholarly practice, especially to those scholars interested in data mining and computational approaches to analyzing large sets of data, text, and records from the web. For over a decade Internet Archive has worked to support computational use of its web collections through a variety of services, from making raw crawl data available to researchers and performing customized extraction and analytic services supporting network or language analysis, to hosting web data hackathons and offering dataset download features in our popular suite of web archiving services in Archive-It. Since 2016, we have also collaborated with the Archives Unleashed project to support their efforts to build tools, platforms, and learning materials for social science and humanities scholars to study web collections, including those curated by the 700+ institutions using Archive-It.

We are excited to announce a significant expansion of our partnership. With a generous award of $800,000 (USD) to the University of Waterloo from The Andrew W. Mellon Foundation, Archives Unleashed and Archive-It will broaden our collaboration and further integrate our services to provide easy-to-use, scalable tools to scholars, researchers, librarians, and archivists studying and stewarding web archives.  Further integration of Archives Unleashed and Archive-It’s Research Services (and IA’s Web & Data Services more broadly) will simplify the ability of scholars to analyze archived web data and give digital archivists and librarians expanded tools for making their collections available as data, as pre-packaged datasets, and as archives that can be analyzed computationally. It will also offer researchers a best-of-class, end-to-end service for collecting, preserving, and analyzing web-published materials.

The project brings together a team of co-investigators. Professor Ian Milligan, from the University of Waterloo’s Department of History, Jimmy Lin, Professor and Cheriton Chair at Waterloo’s Cheriton School of Computer Science, and Nick Ruest, Digital Assets Librarian in the Digital Scholarship Infrastructure department of York University Libraries, along with Jefferson Bailey, Director of Web Archiving & Data Services at the Internet Archive, will all serve as co-Principal Investigators on the “Integrating Archives Unleashed Cloud with Archive-It” project. This project represents a follow-on to the Archives Unleashed project that began in 2017, also funded by The Andrew W. Mellon Foundation.

“Our first stage of the Archives Unleashed Project,” explains Professor Milligan, “built a stand-alone service that turns web archive data into a format that scholars could easily use. We developed several tools, methods and cloud-based platforms that allow researchers to download a large web archive from which they can analyze all sorts of information, from text and network data to statistical information. The next logical step is to integrate our service with the Internet Archive, which will allow a scholar to run the full cycle of collecting and analyzing web archival content through one portal.”

“Researchers, from both the sciences and the humanities, are finally starting to realize the massive trove of archived web materials that can support a wide variety of computational research,” said Bailey. “We are excited to scale up our collaboration with Archives Unleashed to make the petabytes of web and data archives collected by Archive-It partners and other web archiving institutions around the world more useful for scholarly analysis.” 

The project begins in July 2020 and will start releasing public datasets as part of the integration later in the year. Upcoming and future work includes technical integration of Archives Unleashed and Archive-It, creation and release of new open-source tools, datasets, and code notebooks, and a series of in-person “datathons” supporting a cohort of scholars using archived web data and collections in their data-driven research and analysis. We are grateful to The Andrew W. Mellon Foundation for their support of this integration and collaboration, which strengthens critical infrastructure for computational scholarship and its use of the archived web.

Primary contacts:
IA – Jefferson Bailey, Director of Web Archiving & Data Services, jefferson [at] archive.org
AU – Ian Milligan, Professor of History, University of Waterloo, i2milligan [at] uwaterloo.ca

“Community Webs” Receives Additional Funding to Further Public Library Local History Web Collecting

In 2017, our Archive-It service was awarded funding from the Institute of Museum and Library Services (IMLS) for the 2-year project “Community Webs: Empowering Public Librarians to Create Community History Web Archives.” The program has been providing training and technical infrastructure for a diverse group of librarians nationwide to develop expertise in creating collections of historically valuable web-published materials documenting their local communities and under-represented communities. In response to an unexpectedly large group of applicants, and with additional internal funding, we were able to expand the cohort to a total of 28 libraries from 16 states. The launch announcement and the dedicated website have further information about the program and its progress.

We are excited to announce that IMLS has recently provided additional supplementary funding to Community Webs! The additional funding will allow us to focus on program evaluation, expansion, and strategic planning. We are very pleased to be working with the Educopia Institute in support of this work and will benefit from their vast expertise in community cultivation and program facilitation.

Over the course of the original 2-year Community Webs program, the 28 participating libraries created hundreds of archived collections totaling more than 40 terabytes of data, gave dozens of professional presentations at local and national conferences, held many public programs and patron-facing events, and attended numerous meet-ups and cohort events. As well, the program created a suite of open educational resources, online courses, and other training materials supporting digital curation skills development, local history web collecting, and community formation. Some sample collections created as part of the program include:

#HashtagSyllabusMovement by Schomburg Center for Research in Black Culture
LGBTQ in Alabama by Birmingham Public Library
D.C. Punk (Web) Archive by DC Public Library, Special Collections
North Bay Fires, 2017 by Sonoma County Public Library
Food Culture by Athens (GA) Regional Library System
Movimiento Cosecha Grand Rapids by Grand Rapids Public Library

The program’s website has links to each participating institution’s collections page.

We are grateful to IMLS for the additional funding to continue this popular program, excited to work with Educopia on further community development, and encourage any public libraries interested in participating to contact us.

Archiving Information on the Novel Coronavirus (Covid-19)

The Internet Archive’s Archive-It service is collaborating with the International Internet Preservation Consortium’s (IIPC) Content Development Group (CDG) to archive web-published resources related to the ongoing Novel Coronavirus (Covid-19) outbreak. The IIPC Content Development Group consists of curators and professionals from dozens of libraries and archives from around the world that are preserving and providing access to the archived web. The Internet Archive is a co-founder and longtime member of the IIPC. The project will include both subject-expert curation by IIPC members as well as the inclusion of websites nominated by the public (see the nomination form link below).

Due to the urgency of the outbreak, archiving of nominated web content will commence immediately and continue as needed depending on the course of the outbreak and its containment. Web content from all countries and in any language is in scope. Possible topics to guide nominations and collections: 

  • Coronavirus origins 
  • Information about the spread of infection 
  • Regional or local containment efforts
  • Medical/Scientific aspects
  • Social aspects
  • Economic aspects
  • Political aspects

Members of the general public are welcome to nominate websites and web-published materials using the following web form: https://forms.gle/iAdvSyh6hyvv1wvx9. Archived information will also be available soon via the IIPC’s public collections in Archive-It. [March 23, 2020 edit: the public collection can now be found here, https://archive-it.org/collections/13529.]

Members of the general public can also take advantage of the ability to upload non-web digital resources directly to specific Internet Archive collections such as Community Video or Community Texts. For instance, see this collection of “Files pertaining to the 2019–20 Wuhan, China Coronavirus outbreak.” We recommend using a common subject tag, like coronavirus, to facilitate search and discovery. For more information on uploading materials to archive.org, see the Internet Archive Help Center.

A special thanks to Alex Thurman of Columbia University and Nicola Bingham of the British Library, the co-chairs of the IIPC CDG, and to other IIPC members participating in the project. Thanks as well to any and all public nominators assisting with identifying and archiving records about this significant global event.

Archiving Online Local News with the News Measures Research Project

Over the past two years Archive-It, Internet Archive’s web archiving service, has partnered with researchers at the Hubbard School of Journalism and Mass Communication at University of Minnesota and the Dewitt Wallace Center for Media and Democracy at Duke University in a project designed to evaluate the health of local media ecosystems as part of the News Measures Research Project, funded by the Democracy Fund. The project is led by Phil Napoli at Duke University and Matthew Weber at University of Minnesota. Project staff worked with Archive-It to crawl and archive the homepages of 663 local news websites representing 100 communities across the United States. Seven crawls were run on single days from July through September and captured over 2.2TB of unique data and 16 million URLs. Initial findings from the research detail how local communities cover core topics such as emergencies, politics and transportation. Additional findings look at the volume of local news produced by different media outlets, and show the importance of local newspapers in providing communities with relevant content. 

The goal of the News Measures Research Project is to examine the health of local community news by analyzing the amount and type of local news coverage in a sample of communities. In order to generate a random and unbiased sample of communities, the team used US Census data. Prior research suggested that average income in a community is correlated with the amount of local news coverage; thus the team decided to focus on three different income brackets (high, medium and low) using the Census data to break up the communities into categories. Rural areas and major cities were eliminated from the sample in order to reduce the number of outliers; this left a list of 1,559 communities ranging in population from 20,000 to 300,000 and in average household income from $21,000 to $215,000. Next, a random sample of 100 communities was selected, and a rigorous search process was applied to build a list of 663 news outlets that cover local news in those communities (based on Web searches and established directories such as Cision).
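The sampling steps described above can be sketched in a few lines of Python. This is only an illustration, not the research team's actual procedure: the `Community` record, the income cutoffs, and the roughly equal split of the sample across the three brackets are all assumptions, since the post does not say how the brackets were defined or how the 100 communities were allocated among them.

```python
import random
from dataclasses import dataclass

# Population bounds taken from the post; everything else here is hypothetical.
MIN_POP = 20_000
MAX_POP = 300_000


@dataclass(frozen=True)
class Community:
    name: str
    population: int
    avg_household_income: int


def income_bracket(income, low_cutoff, high_cutoff):
    """Label a community as low, medium, or high income (cutoffs are assumed)."""
    if income < low_cutoff:
        return "low"
    if income < high_cutoff:
        return "medium"
    return "high"


def stratified_sample(communities, total, low_cutoff, high_cutoff, seed=0):
    """Drop communities outside the population bounds, then draw a random
    sample split as evenly as possible across the three income brackets."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    strata = {"low": [], "medium": [], "high": []}
    for c in communities:
        if MIN_POP <= c.population <= MAX_POP:
            bracket = income_bracket(c.avg_household_income, low_cutoff, high_cutoff)
            strata[bracket].append(c)
    base, extra = divmod(total, len(strata))
    sample = []
    for i, members in enumerate(strata.values()):
        quota = base + (1 if i < extra else 0)
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample
```

Calling `stratified_sample(all_communities, total=100, low_cutoff=50_000, high_cutoff=100_000)` would return up to 100 communities. The cutoff values are placeholders to be replaced with whatever bracket definitions a given study uses.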

The News Measures Research Project web captures provide a unique snapshot of local news in the United States. The work is focused on analyzing the nature of local news coverage at a local level, while also examining the broader nature of local community news. At the local level, the 100 community sample provides a way to look at the nature of local news coverage. Next, a team of coders analyzed content on the archived web pages to assess what is being covered by a given news outlet. Often, the websites that serve a local community are simply aggregating content from other outlets, rather than providing unique content. The research team was most interested in understanding the degree to which local news outlets are actually reporting on topics that are pertinent to a given community (e.g. local politics). At the global level, the team looked at interaction between community news websites (e.g. sharing of content) as well as automated measures of the amount of coverage.

The primary data for the researchers was the archived local community news data, but in addition, the team worked with census data to aggregate other measures such as circulation data for newspapers. These data allowed the team to examine how the amount and type of local news changes depending on the characteristics of the community. Because the team was using multiple datasets, the Web data is just one part of the puzzle. The WAT data format proved particularly useful for the team in this regard. Using the WAT file format allowed the team to avoid digging deeply into the data; rather, the WAT data allowed the team to examine high-level structure without needing to examine the content of each and every WARC record. Down the road, the WARC data allows for a deeper dive, but the lighter metadata format of the WAT files has enabled early analysis.
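To give a sense of what working with WAT metadata looks like, here is a minimal Python sketch that pulls the page URL, title, and linked hosts out of the JSON payload of a single WAT record. The field names follow the envelope layout commonly seen in WAT files, but they should be checked against your own data. The function name is hypothetical, and this is not the research team's code.

```python
import json
from urllib.parse import urljoin, urlsplit


def outlinks_from_wat_payload(payload_json):
    """Return (page_url, title, sorted linked hosts) from one WAT JSON payload.

    Only anchor links (path "A@/href") are counted, so stylesheets, images,
    and scripts referenced by the page are ignored.
    """
    envelope = json.loads(payload_json).get("Envelope", {})
    page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    html_meta = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
    )
    title = html_meta.get("Head", {}).get("Title")
    hosts = set()
    for link in html_meta.get("Links", []):
        if link.get("path") != "A@/href" or "url" not in link:
            continue
        # Resolve relative links against the page URL before taking the host.
        host = urlsplit(urljoin(page_url or "", link["url"])).hostname
        if host:
            hosts.add(host)
    return page_url, title, sorted(hosts)
```

Because the record already lists each page's links, questions such as "which outlets link to which other outlets" can be answered from the metadata alone, without parsing the archived HTML in the corresponding WARC records.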

Stay tuned for more updates as research utilizing this data continues! The websites selected will continue to be archived and much of the data are publicly available.

The Whole Earth Web Archive

As part of the many releases and announcements for our October Annual Event, we created The Whole Earth Web Archive. The Whole Earth Web Archive (WEWA) is a proof-of-concept to explore ways to improve access to the archived websites of underrepresented nations around the world. Starting with a sample set of 50 small nations, we extracted their archived web content from the Internet Archive’s web archive, built special search and access features on top of this subcollection, and created a dedicated discovery portal for searching and browsing. Further work will focus on improving IA’s harvesting of the national webs of these and other underrepresented countries as well as expanding collaborations with libraries and heritage organizations within these countries, and via international organizations, to contribute technical capacity to local experts who can identify websites of value that document the lives and activities of their citizens.

[Image: Whole Earth Web Archive screenshot]

Archived materials from the web play an increasingly necessary role in representation, evidence, historical documentation, and accountability. However, the web’s scale is vast, it changes and disappears quickly, and it requires significant infrastructure and expertise to collect and make permanently accessible. Thus, the community of National Libraries and Governments preserving the web remains overwhelmingly represented by well-resourced institutions from Europe and North America. We hope the WEWA project helps provide enhanced access to archived material otherwise hard to find and browse in the massive 20+ petabytes of the Wayback Machine. More importantly, we hope the project provokes a broader reflection upon the lack of national diversity in institutions collecting the web and also spurs collective action towards diminishing the overrepresentation of “first world” nations and peoples in the overall global web archive.

As with prior special projects by the Web Archiving & Data Services team, such as GifCities (search engine for animated Gifs from the Geocities web collection) or Military Industrial Powerpoint Complex (ebooks of Powerpoints from the archive of the .mil (military) web domain), the project builds on our exploratory work to provide improved access to valuable subsets of the web archive. While our Archive-It service gives curators the tools to build special collections of the web, we also work to build unique collections from the pre-existing global web archive.

The preliminary set of countries in WEWA was determined by selecting the 50 “smallest” countries as measured by number of websites registered on their national web domain (aka ccTLD), a somewhat arbitrary measurement, we acknowledge. The underlying search index is based on internally-developed tools for search of both text and media. Indices are built from features like page titles or descriptive hyperlinks from other pages, with relevance ranking boosted by criteria such as number of inbound links and popularity, and include a temporal dimension to account for the historicity of web archives. Additional technical information on search engineering can be found in “Exploring Web Archives Through Temporal Anchor Texts.”
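As a rough illustration of the ranking idea described above, the toy Python sketch below scores pages by query terms found in their titles and in the anchor texts of links pointing to them, restricts anchor texts to a time window, and boosts the score by inbound link count. It is not the Internet Archive's search code: the `Page` record, the scoring formula, and the logarithmic boost are all assumptions made for this example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    title: str
    inbound_links: int
    # (year, anchor text) pairs for links from other pages that point here
    anchors: list = field(default_factory=list)


def score(page, query, year_from, year_to):
    """Count query-term matches in the title and in anchor texts dated within
    the window, then boost by inbound links (log-damped; an assumed formula)."""
    terms = query.lower().split()
    texts = [page.title] + [
        text for year, text in page.anchors if year_from <= year <= year_to
    ]
    matches = sum(
        text.lower().split().count(term) for text in texts for term in terms
    )
    return matches * (1 + math.log1p(page.inbound_links))


def search(pages, query, year_from, year_to):
    """Return URLs of matching pages, best score first."""
    scored = [(score(p, query, year_from, year_to), p.url) for p in pages]
    return [url for s, url in sorted(scored, reverse=True) if s > 0]
```

Restricting anchor texts to a year range is one simple way to model the temporal dimension: the same page can rank differently depending on how other sites described it during the period being searched.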

We intend both to do more targeted, high-quality archiving of these and other smaller national webs and to undertake active outreach to national and heritage institutions in these nations, and to related international organizations, to ensure this work is guided by broader community input. If you are interested in contributing to this effort or have any questions, feel free to email us at webservices [at] archive [dot] org. Thanks for browsing the WEWA!

“Make It Weird”: Building a collaborative public library web archive in an arts and counterculture community

This post is reposted from the Archive-It blog and written by guest author Dylan Gaffney of the Forbes Library, one of the public libraries participating in the Community Webs program.

Whether documenting the indie music scene of the 1990s, researching the history of local abolitionists and formerly enslaved peoples in the 1840s, or helping patrons research the early LGBT movement in the area, I am frequently reminded of what was not saved or is not physically present in our collections. These gaps or silences often reflect subcultures in our community, stories that were not told on the pages of the local newspaper, or which might not be reflected in the websites of city government or local institutions. In my first sit down with a fellow staff member to talk about the prospects for a web archive, we brainstormed how we could more completely capture the digital record of today’s community. We discussed including lesser known elements like video of music shows in house basements, the blog of a small queer farm commune in the hills, the Instagram account of the kid who photographs local graffiti, etc. My colleague Heather whispered to me excitedly: “We could make it weird!” I knew immediately I had found my biggest ally in building our collections.

The Forbes Library was one of a few public libraries chosen nationwide for the Community Webs cohort, a group of public libraries organized by the Internet Archive and funded by the Institute of Museum and Library Services to expand web archiving in local history collections. As a librarian in a small city of 28,000 people, who works in a public library with no full-time archivists, the challenge of trying to build a web archive from scratch that truly reflected our rich, varied and “weird” cultural community, the arts and music scenes, and the rich tradition of activism in Western Massachusetts was a daunting but exciting project to embark on.

We knew we would have to leverage our working relationships with media organizations, nonprofits, city departments, the arts and music community, and our staff if we truly hoped to build something which reflected our community as it is. Our advantage was that we had such relationships, and could pitch the idea not only through traditional means like press releases and social media, but by chatting after meetings typically spent coordinating film screenings, gallery walks, and lawn concerts. We knew if we became comfortable enough with the basic concepts of archiving the web, that we could pick the brains of activists planning events in our meeting rooms, friends at shows, the staff of our local media company who lend equipment to aspiring filmmakers, and the folks who sell crops from small family farms in the community at the Farmer’s Markets.

We started by training just a few Information Services staff in one-on-one sessions and shared Archive-It training videos. This helped to broaden the number of librarians familiar with the Archive-It software in general, but also got the wheels turning amongst our reference and circulation staffs–our front lines of communication with the public–in particular. We talked a great deal about what we wish we had in our current archive, about filling in gaps and having the archive more accurately reflect and represent our community.

In order to solicit ideas from the community for preservation, we put together a Google form to be posted online, which was almost entirely cribbed from my Community Webs cohort colleagues at East Baton Rouge Parish Library, Queens Public Library and others. We also set up in-person, one-on-one meetings with community partners and academic institutions that were already engaged in web archiving. We put out press releases and generally just talked to and at anyone who would listen. As a result, nearly all of our first web archival acquisitions come directly from recommendations by the public and our community partners.

For instance, one of the first websites that I knew I wanted to preserve was From Wicked to Wedded, a great site which preserves the history of the LGBTQ community in our area. It was gratifying when two of the first responses to our online outreach also mentioned the site and we had a great conversation with its creator, who researches at the library, and who, like all the content creators we’ve approached thus far, was excited to be included.

Creating an accurate and exciting overview of the lively arts scene in Northampton and the surrounding area seemed like a daunting task at first, but by crawling the websites of notable galleries, arts organizations, and Northampton’s monthly gallery walk, we found that we were quickly able to capture a really interesting cross-section of local artists’ work. We have subsequently begun working with the local arts organizations directly to identify artists who may have their own websites worthy of inclusion.

Similarly, Northampton has a rich music scene for a city of its small size. With the number of people already documenting live music these days, we weren’t sure how to contribute with our own selection and curation, and so asked several folks embedded in the scene to curate some of their own favorite content, then reached out to the bands themselves to get their thoughts. We are still early in this process, but the response has been encouraging and the benefits to the library in building relationships with folks who are documenting the music scene have already led to physical donations to the archive as well.

It was important to us from the beginning to also consult with Northampton Community Television (NCTV), which partners with the library on film programming, to preserve a record of all they do for the community: teaching filmmaking, lending equipment, and training and empowering citizen journalists. They, in turn, have pointed us to local filmmakers, and through our ongoing collaborations around film programming and the Northampton film festival, we have a platform for outreach in that community as well.

Staff members and local activists pointed us in the direction of other new local radio shows and citizen journalism websites, both of which give personal takes on local politics. One was a wonderful radio show called Out There by one of our bicycle trash pickup workers, Ruthie. In a single episode, Ruthie will talk to everybody from the mayor, environmental activists and farmers, to the random junior high kids that she runs into hanging out on the bike path under a bridge. The other recommendation was for a new citizen journalism site called Shoestring, which asks common sense questions of people in power in local government and places them in a national context. The folks from Shoestring stopped by the library’s Arts and Music desk to ask about our bi-weekly Zine Club meeting, which gave us an opportunity to talk about including their site in our web archive and led to a physical donation to the archive as well!

At numerous people’s suggestion, we are preserving the Instagram account of our gruff-looking former video store clerk turned City Council president, Bill Dwight. Bill has a great camera and a great eye, and he captures a wonderful cross-section of the community in his feed. Dann Vazquez has an Instagram feed dedicated to capturing oddball moments, new building developments and local graffiti (one of the more ephemeral of our community’s arts), which gives a unique day-to-day perspective of change on the streets of our city.

We are a community rich in activism, with a long tradition that, like our LGBTQ history, has not been properly reflected in our archives. For years, the personal and organizational archives of local activists have found homes at the larger colleges and universities in the Five College Area. Now, by including the websites of long-running and new nonprofits and activist organizations, we are able to create a richer archive for future generations to learn from their pioneering work.

We have tried to remain conscious of what communities are being left out of the collections we are developing, such as the non-English-speaking communities with whom we need to improve our outreach, and individuals and organizations that might not have a digital presence currently. As we have the ability to offer basic training at the library and through our community partners, we have recently been exploring the idea of creating a website or Instagram account designed to give individuals and organizations the opportunity to try out these technologies without the weight of a long-term commitment, but with the assurance that their content would be preserved among our web archives.

It still feels that we are in the earliest phases of this endeavour, but we have tried to build a collaborative system of curation which could be sustained going forward. By spreading the role of curation across the community, we can prevent staff burnout on the project and ensure that the perspectives represented in the archive are broader, more varied, and thus more reflective of our small city as it is.

Additional credits: IA staff Karl-Rainer Blumenthal, who edits the Archive-It blog, and Maria Praetzellis, who manages the Community Webs program.

Internet Archive and New York Art Resources Consortium Receive Grant for a National Forum to Advance Web Archiving in Art and Museum Libraries

We are pleased to announce that the Institute of Museum and Library Services (IMLS) has recently awarded a collaborative grant to the New York Art Resources Consortium and our Archive-It group to host a national forum event, along with associated workshops and stakeholder meetings, to catalyze collaboration among art libraries in the stewardship of historically valuable art-related materials published on the web. The New York Art Resources Consortium (NYARC) consists of the research libraries and archives of three leading art museums in New York City: The Brooklyn Museum, The Frick Collection, and The Museum of Modern Art. Archive-It is the web archiving service of the Internet Archive that works with hundreds of heritage organizations, including an international set of museums and art libraries, to preserve and provide access to web-published resources. Archive-It and NYARC will jointly run the project, Advancing Art Libraries and Curated Web Archives: A National Forum.

This National Leadership Grant in the Curating Collections program category to conduct a National Forum and affiliated meetings builds on NYARC’s and Archive-It’s work together expanding web archiving amongst art and museum libraries and archives, including through the ARLIS/NA Web Archiving Special Interest Group, as well as their individual efforts to advance born-digital collection building. In Reframing Collections for the Digital Age, NYARC focused on web archiving program development, including technical work to integrate Archive-It and its discovery services that can inform work in similar institutions. Archive-It, with its Community Webs program, is working with dozens of public libraries on cohort building, educational resources, and network development supporting community history web archiving — a model that can be adopted by the national art library community to scale out its coordinated efforts. In addition, Archive-It has led, and NYARC operationalized, collaborative efforts towards joint API-based systems integrations research and development to further joint services and interoperability. 

By mobilizing a broad effort through an invitational forum, the project aims to achieve national scale through network building and shared infrastructure planning that the project team will foster through a program of discussion, training, and strategic roadmapping. The project will include the contribution of a diverse group of members of the art library community, lead to published outputs on strategic directions and community-specific training materials, and launch a multi-institutional effort to scale the extent of web-published, born-digital materials preserved and accessible for art scholarship and research. Thank you to IMLS for their continued support of work advancing web archiving and the overall national digital platform initiative.

Andrew W. Mellon Foundation Awards Grant to the Internet Archive for Long Tail Journal Preservation

The Andrew W. Mellon Foundation has awarded a research and development grant to the Internet Archive to address the critical need to preserve the “long tail” of open access scholarly communications. The project, Ensuring the Persistent Access of Long Tail Open Access Journal Literature, builds on prototype work identifying at-risk content held in web archives by using data provided by identifier services and registries. Furthermore, the project expands on work acquiring missing open access articles via customized web harvesting, improving discovery and access to these materials from within extant web archives, and developing machine learning approaches, training sets, and cost models for advancing and scaling this project’s work.

The project will explore how adding automation to the already highly automated systems for archiving the web at scale can help address the need to preserve at-risk open access scholarly outputs. Instead of specialized curation and ingest systems, the project will work to identify the scholarly content already collected in general web collections, both those of the Internet Archive and collaborating partners, and implement automated systems to ensure at-risk scholarly outputs on the web are well-collected and are associated with the appropriate metadata. The proposal envisages two opposite but complementary approaches:

  • A top-down approach involves taking journal metadata and open data sets from identifier and registry sources such as ISSN, DOAJ, Unpaywall, CrossRef, and others and examining the content of large-scale web archives to ask “is this journal being collected and preserved and, if not, how can collection be improved?”
  • A bottom-up approach involves examining the content of general domain-scale and global-scale web archives to ask “is this content a journal and, if so, can it be associated with external identifier and metadata sources for enhanced discovery and access?”
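
As a rough illustration of how the two questions differ, here is a minimal sketch in Python. It is not the project's method: the announcement describes machine learning approaches, whereas this sketch swaps in plain hostname matching to keep the idea visible. The record layout, the function names (`host_of`, `top_down`, `bottom_up`), the ISSN values, and the example URLs are all hypothetical.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def top_down(registry, archived_urls):
    """Start from registry records: is each journal's host present in the archive?"""
    archived_hosts = {host_of(u) for u in archived_urls}
    return {j["issn"]: host_of(j["url"]) in archived_hosts for j in registry}


def bottom_up(registry, archived_urls):
    """Start from archived URLs: can each one be tied to a registry record?"""
    issn_by_host = {host_of(j["url"]): j["issn"] for j in registry}
    return {u: issn_by_host.get(host_of(u)) for u in archived_urls}


# Hypothetical registry records (placeholder ISSNs, example.org hosts).
registry = [
    {"issn": "0000-0001", "url": "https://journal-a.example.org/"},
    {"issn": "0000-0002", "url": "https://www.journal-b.example.org/index"},
]

# Hypothetical URLs already held in a general web archive.
archived_urls = [
    "http://www.journal-a.example.org/article/1",
    "https://blog.example.com/post",
]

coverage = top_down(registry, archived_urls)   # journals keyed by ISSN
matches = bottom_up(registry, archived_urls)   # archived URLs keyed by URL
```

In this toy data, `coverage` flags the second journal as not yet collected, which is the top-down signal to improve harvesting, while `matches` ties the first archived URL to a registry record and leaves the blog post unmatched. Matching on hostname alone would miss journals hosted on shared platforms, which is one reason a real system would need richer signals.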

The grant will fund work to use the output of these approaches to generate training sets and test them against smaller web collections in order to estimate how effective this approach would be at identifying the long-tail content, how expensive a full-scale effort would be, and what level of computing infrastructure is needed to perform such work. The project will also build a model for better understanding the costs for other web archiving institutions to do similar analysis on their own collections using the project’s algorithms and tools. Lastly, the project team, in the Web Archiving and Data Services group with Director Jefferson Bailey as Principal Investigator, will undertake a planning process to determine the resource requirements and work necessary to build a sustainable workflow that keeps the results up to date incrementally as publication continues.

In combination, these approaches will both improve the current state of preservation for long-tail journal materials and develop models for how this work can be automated and applied to existing corpora at scale. We thank the Mellon Foundation for their support of this work, and we look forward to sharing the project’s open-source tools and outcomes with a broad community of partners.

27 Public Libraries and the Internet Archive Launch “Community Webs” for Local History Web Archiving

The lives and activities of communities are increasingly documented online; local news, events, disasters, celebrations — the experiences of citizens are now largely shared via social media and web platforms. As these primary sources about community life move to the web, the need to archive these materials becomes an increasingly important activity of the stewards of community memory. And in many communities across the nation, public libraries, as one of their many responsibilities to their patrons, serve the vital role of stewards of local history. Yet public libraries have historically been a small fraction of the growing national and international web archiving community.

With generous support from the Institute of Museum and Library Services, as well as the Kahle/Austin Foundation and the Archive-It service, the Internet Archive and 27 public library partners representing 17 different states have launched a new program: Community Webs: Empowering Public Libraries to Create Community History Web Archives. The program will provide education, applied training, cohort network development, and web archiving services for a group of public librarians to develop expertise in web archiving for the purpose of local memory collecting. Additional partners in the program include OCLC’s WebJunction training and education service, and the public libraries of Queens, Cleveland and San Francisco will serve as “lead libraries” in the cohort. The program will result in dozens of terabytes of public library administered local history web archives, a range of open educational resources in the form of online courses, videos, and guides, and a nationwide network of public librarians with expertise in local history web archiving and the advocacy tools to build and expand the network. A full listing of the participating public libraries is below and on the program website.

In November 2017, the cohort gathered together at the Internet Archive for a kickoff meeting of brainstorming, socializing, and, of course, talking all things web archiving. Partners shared details on their existing local history programs and ideas for collection development around web materials. Attendees talked about building collections documenting their demographic diversity or focusing on local issues, such as housing availability or changes in community profile. As an example, Abbie Zeltzer from the Patagonia Public Library spoke about the changes in her community of 913 residents as the town redevelops a long-dormant mining industry. Zeltzer intends to develop a web archive documenting this transition and the related community reaction and changes.

Since the kickoff meeting, the Community Webs cohort has been actively building collections, from hyper-local media sites in Kansas City, to neighborhood blogs in Washington D.C., to Mardi Gras in East Baton Rouge. In addition, program staff, cohort members, and WebJunction have been building out an extensive online course space with educational materials for training on web archiving for local history. The full course space and all open educational resources will be released in early 2019 and a second full in-person meeting of the cohort will take place in Fall 2018.

For further information on the Community Webs program, contact Maria Praetzellis, Program Manager, Web Archiving [maria at archive.org] or Jefferson Bailey, Director, Web Archiving [jefferson at archive.org].

Public Library | City | State
Athens Regional Library System | Athens | GA
Birmingham Public Library | Birmingham | AL
Brooklyn Public Library – Brooklyn Collection | New York City | NY
Buffalo & Erie County Public Library | Buffalo | NY
Cleveland Public Library | Cleveland | OH
Columbus Metropolitan Library | Columbus | OH
County of Los Angeles Public Library | Los Angeles | CA
DC Public Library | Washington | DC
Denver Public Library – Western History and Genealogy Department and Blair-Caldwell African American Research Library | Denver | CO
East Baton Rouge Parish Library | East Baton Rouge | LA
Forbes Library | Northampton | MA
Grand Rapids Public Library | Grand Rapids | MI
Henderson District Public Libraries | Henderson | NV
Kansas City Public Library | Kansas City | MO
Lawrence Public Library | Lawrence | KS
Marshall Lyon County Library | Marshall | MN
New Brunswick Free Public Library | New Brunswick | NJ
Schomburg Center for Research in Black Culture (NYPL) | New York City | NY
Patagonia Library | Patagonia | AZ
Pollard Memorial Library | Lowell | MA
Queens Library | New York City | NY
San Diego Public Library | San Diego | CA
San Francisco Public Library | San Francisco | CA
Sonoma County Public Library | Santa Rosa | CA
The Urbana Free Library | Urbana | IL
West Hartford Public Library | West Hartford | CT
Westborough Public Library | Westborough | MA