Author Archives: bnewbold

Search Scholarly Materials Preserved in the Internet Archive

Looking for a research paper but can’t find a copy in your library’s catalog or popular search engines? Give Internet Archive Scholar a try! We might have a PDF from a “vanished” Open Access publisher in our web archive, an author’s pre-publication manuscript from their archived faculty webpage, or a digitized microfilm version of an older publication.

We hope Internet Archive Scholar will aid researchers and librarians looking for specific open access papers that may not be otherwise available to them. Judith van Stegeren (@jd7g on Twitter), a PhD candidate in the Netherlands, encountered just such a situation recently when sharing a workshop paper on procedural generation in computer games: “Towards Qualitative Procedural Generation” by Mark R. Johnson, originally presented at the Computational Creativity & Games Workshop in 2016. The papers for this particular year of the workshop are not indexed in the usual bibliographic catalogs, and the original workshop website hosting the Open Access papers is no longer accessible. Fortunately, copies of all the 2016 workshop papers were captured in the Wayback Machine, and can be found today by searching IA Scholar by title or conference name.

As another example, dozens of papers from the Open Journal of Hematology are no longer resolvable via DOI. As mentioned in a previous blog post, the publisher’s website vanished and has been replaced with unrelated advertisements. But before that happened, the papers were captured in the Wayback Machine, indexed in our catalog, and can now be searched in full:

IA Scholar Search Results

IA Scholar is a simple, access-oriented interface to content identified across several Internet Archive collections, including web archives, archive.org files, and digitized print materials. The full text of articles is searchable for users who are hunting for particular phrases or keywords. This complements our existing full-text search index of millions of digitized books and other documents on archive.org.

The service builds on Fatcat, an open catalog we have developed to identify at-risk and web-published open scholarly outputs that can benefit from long-term preservation, additional metadata, and perpetual access. Fatcat includes resources that may be useful to librarians and archivists, such as bulk metadata dumps, a read/write API, command-line tool, and file-level archival metadata. If you are interested in collaborating with us, or are a researcher interested in text analysis applications, we have a public chat channel or can be contacted by email at info@archive.org.
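As a rough illustration of what working with the catalog API might look like, here is a minimal Python sketch that builds a release lookup request for a DOI and picks archived copies out of the response. The base URL, the `release/lookup` path, the `expand=files` parameter, and the `files`/`urls` field names reflect our understanding of the Fatcat API and should be checked against its documentation; the helper names and the sample record are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL and endpoint path; verify against the Fatcat API docs.
FATCAT_API = "https://api.fatcat.wiki/v0"


def release_lookup_url(doi: str) -> str:
    """Build a release-lookup URL for a DOI, asking for file records too."""
    query = urllib.parse.urlencode({"doi": doi, "expand": "files"})
    return f"{FATCAT_API}/release/lookup?{query}"


def archived_urls(release: dict) -> list:
    """Collect file URLs hosted on archive.org or one of its subdomains."""
    found = []
    for file_entity in release.get("files") or []:
        for entry in file_entity.get("urls") or []:
            url = entry.get("url", "")
            host = urllib.parse.urlparse(url).netloc
            if host == "archive.org" or host.endswith(".archive.org"):
                found.append(url)
    return found


def fetch_release(doi: str) -> dict:
    """Perform the lookup over the network (not called in this sketch)."""
    with urllib.request.urlopen(release_lookup_url(doi), timeout=30) as resp:
        return json.load(resp)


# Hypothetical response fragment, shaped like the assumed schema.
sample = {
    "title": "An Example Paper",
    "files": [
        {
            "urls": [
                {"url": "https://example.org/paper.pdf", "rel": "web"},
                {
                    "url": "https://web.archive.org/web/2016/https://example.org/paper.pdf",
                    "rel": "webarchive",
                },
            ]
        }
    ],
}

print(release_lookup_url("10.1234/example"))
print(archived_urls(sample))
```

The network call is kept in its own function so the URL building and response filtering can be tried offline against a saved record.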

IA Scholar marks a milestone in our work initiated in 2018 to leverage the automation and scale of web and API harvesting in providing open infrastructure for the preservation of and perpetual access to scholarly materials from the public web. We particularly want to thank the Mellon Foundation for their original and ongoing support of this work, our many current partners, and the other collaborators, contributors, and volunteers.

All of this is possible because of the incredible open research ecosystem built and collectively maintained by Open Access advocates. Thank you to the DOAJ and other groups for helping catalog open access journals, which has aided preservation. Thank you to the Biodiversity Heritage Library and its supporters for digitizing print journal literature. And thank you to the many other organizations we have worked with, integrated with, or whose services we have used, including open web indices (Unpaywall, CORE, CiteSeerX, Microsoft Academic, Semantic Scholar), directories of open journals (DOAJ, ROAD, SHERPA/ROMEO, JURN, Wikidata), and open bibliographic catalogs (Crossref, DataCite, J-STAGE, PubMed, dblp).

IA Scholar is built from open source software components, and is itself released as Free Software. The website has been translated into eight languages (so far!) by generous volunteers.

How the Internet Archive is Ensuring Permanent Access to Open Access Journal Articles

Internet Archive has archived and identified 9 million open access journal articles; the next 5 million are getting harder

Open Access journals, such as New Theology Review (ISSN: 0896-4297) and Open Journal of Hematology (ISSN: 2075-907X), made their research articles available for free online for years. With a quick click or a simple query, students anywhere in the world could access their articles, and diligent Wikipedia editors could verify facts against original articles on vitamin deficiency and blood donation.  

But some journals, such as these titles, are no longer available from their publishers’ websites, and are only available through the Internet Archive’s Wayback Machine. Since 2017, the Internet Archive has joined others in concentrating on archiving all scholarly literature and making it permanently accessible.

The World Wide Web has made it easier than ever for scholars to collaborate, debate, and share their research. Unfortunately, the structure of today’s web means that content can disappear just as easily: as of today the official publisher websites and DOI redirects for both of the above journals go nowhere or have been replaced with unrelated content.


Wayback Machine captures of Open Access journals now “vanished” from publisher websites

Vigilant librarians saw this problem coming decades ago, when the print-to-digital migration was getting started. They insisted that commercial publishers work with contract digital preservation organizations (such as Portico, LOCKSS, and CLOCKSS) to ensure long-term access to expensive journal subscription content. Efforts have been made to preserve open articles as well, such as Public Knowledge Project’s Private LOCKSS Network for OJS journals and national hosting platforms like the SciELO network. But a portion of all scholarly articles continues to fall through the cracks.

Researchers have found that 176 open access journals vanished from their publishers’ websites over the past two decades, according to a recent preprint by Mikael Laakso, Lisa Matthias, and Najko Jahn. These periodicals came from all regions of the world and represented all major disciplines: sciences, humanities, and social sciences. More than 14,000 open access journals are indexed by the Directory of Open Access Journals, and the paper suggests that another 900 of those are inactive and at risk of disappearing. The preprint has struck a nerve, receiving news coverage in Nature and Science.

In 2017, with funding support from the Andrew W. Mellon Foundation and the Kahle/Austin Foundation, the Internet Archive launched a project focused on preserving all publicly accessible research documents, with a particular focus on open access materials. Our first job was to quantify the scale of the problem.

Monitoring independent preservation of Open Access journal articles published from 1996 through 2019. Categories are defined in the article text.

Of the 14.8 million known open access articles published since 1996, the Internet Archive has archived, identified, and made available through the Wayback Machine 9.1 million (“bright” green in the chart above). In the jargon of Open Access, we are counting only “gold” and “hybrid” articles, which we expect to be available directly from the publisher, as opposed to preprints such as those on arxiv.org or in institutional repositories. Another 3.2 million are believed to be preserved by one or more contracted preservation organizations, based on records kept by Keepers Registry (“dark” olive in the chart). These copies are not intended to be accessible to anybody unless the publisher’s copy becomes inaccessible, in which case they are “triggered” and become accessible.

This leaves at least 2.4 million Open Access articles at risk of vanishing from the web (“None”, red in the chart). While many of these are still on publishers’ websites, they have proven difficult to archive.
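To put the three categories in proportion, the figures quoted above can be tallied in a few lines of Python. This is only a back-of-the-envelope sketch using the rounded numbers from the text; the category labels and the `shares` helper are ours, not part of any project code.

```python
# Figures quoted in the text, in millions of articles.
TOTAL_KNOWN = 14.8
categories = {
    "Internet Archive (bright)": 9.1,
    "Keepers Registry (dark)": 3.2,
    "None (at risk)": 2.4,
}


def shares(counts: dict, total: float) -> dict:
    """Return each category as a percentage of the total, rounded to 0.1."""
    return {name: round(100 * value / total, 1) for name, value in counts.items()}


percentages = shares(categories, TOTAL_KNOWN)

# The rounded category figures do not sum exactly to the rounded total.
unaccounted = round(TOTAL_KNOWN - sum(categories.values()), 1)

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
print(f"Not covered by the rounded figures: {unaccounted} million")
```

With these inputs, roughly 61.5% of known articles are in the Wayback Machine, 21.6% are held only by contracted preservation organizations, and 16.2% have no known independent copy. The remaining 0.1 million is consistent with rounding in the published figures and with the “at least” in the at-risk estimate.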

One of our goals is to archive as many of the articles on the open web as we can, and to keep up with the growing stream of new articles published every day. Another is to look back over the vast petabytes of web content in the Wayback Machine, back to 1996, and find any content that we might already have but that is not easily findable or discoverable. Both of these projects are amenable to software automation, but are made more difficult by the evolving nature of HTML and PDFs and their diverse character sets and encodings. To that end, we have approached this project not just as a technical one, but also as a collaborative one that aims to add another piece to the distributed infrastructure supporting open scholarship.
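For readers who want to check whether a single article URL already has a capture, the public Wayback Machine availability endpoint gives a small taste of this kind of automation. The sketch below is not our harvesting pipeline, only an illustration: the function names and the sample response are hypothetical, and the response shape (`archived_snapshots`, `closest`, `available`, `url`) follows the availability API as we understand it.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_url(page_url: str, timestamp: str = "") -> str:
    """Build an availability query; timestamp (YYYYMMDD...) is optional."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp  # ask for the capture closest to this date
    return f"{AVAILABILITY_API}?{urllib.parse.urlencode(params)}"


def closest_capture(response: dict):
    """Return the closest capture URL, or None when nothing is archived."""
    closest = (response.get("archived_snapshots") or {}).get("closest") or {}
    if closest.get("available"):
        return closest.get("url")
    return None


def check(page_url: str, timestamp: str = ""):
    """Query the endpoint over the network (not called in this sketch)."""
    with urllib.request.urlopen(availability_url(page_url, timestamp), timeout=30) as resp:
        return closest_capture(json.load(resp))


# Hypothetical response for a page with one capture.
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20160701000000",
            "url": "http://web.archive.org/web/20160701000000/http://example.org/paper.pdf",
        }
    }
}

print(availability_url("http://example.org/paper.pdf", "2016"))
print(closest_capture(sample))
print(closest_capture({"archived_snapshots": {}}))
```

A per-URL check like this is fine for spot lookups; working through millions of articles calls for bulk approaches instead.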

To expand our reach, we built an editable catalog (https://fatcat.wiki) with an open API to allow anybody to contribute. As the software is free and open source, as is the data, we invite others to reuse and link to the content we have archived. We have also indexed and made searchable much of the literature, both to help manage our work and to help others find out whether we have archived particular articles. We want to make scholarly material permanently available, and available in new ways, including via large datasets for analysis and “meta research.”

We also want to acknowledge the many partnerships and collaborations that have supported this work, many of which are key parts of the open scholarly infrastructure, including ISSN, DOAJ, LOCKSS, Unpaywall, Semantic Scholar, CiteSeerX, Crossref, DataCite, and many others. We also want to acknowledge the many Internet Archive staff and volunteers who have contributed to this work, including Bryan Newbold, Martin Czygan, Paul Baclace, Jefferson Bailey, Kenji Nagahashi, David Rosenthal, Victoria Reich, Ellen Spertus, and others.

If you would like to participate in this project, please contact the Internet Archive at webservices@archive.org.