Internet Archive has archived and identified 9 million open access journal articles– the next 5 million is getting harder
Open Access journals, such as New Theology Review (ISSN: 0896-4297) and Open Journal of Hematology (ISSN: 2075-907X), made their research articles available for free online for years. With a quick click or a simple query, students anywhere in the world could access their articles, and diligent Wikipedia editors could verify facts against original articles on vitamin deficiency and blood donation.
But some journals, such as these titles, are no longer available from the publisher’s websites, and are only available through the Internet Archive’s Wayback Machine. Since 2017, the Internet Archive joined others in concentrating on archiving all scholarly literature and making it permanently accessible.
The World Wide Web has made it easier than ever for scholars to collaborate, debate, and share their research. Unfortunately, the structure of today’s web means that content can disappear just as easily: as of today the official publisher websites and DOI redirects for both of the above journals go nowhere or have been replaced with unrelated content.
Vigilant librarians saw this problem coming decades ago, when the print-to-digital migration was getting started. They insisted that commercial publishers work with contract digital preservation organizations (such as Portico, LOCKSS, and CLOCKSS) to ensure long-term access to expensive journal subscription content. Efforts have been made to preserve open articles as well, such as Public Knowledge Project’s Private LOCKSS Network for OJS journals and national hosting platforms like the SciELO network. But a portion of all scholarly articles continues to fall through the cracks.
Researchers found that 176 open access journals have already vanished from their publishers’ website over the past two decades, according to a recent preprint article by Mikael Laakso, Lisa Matthias, and Najko Jahn. These periodicals were from all regions of the world and represented all major disciplines — sciences, humanities and social sciences. There are over 14,000 open access journals indexed by the Directory of Open Access Journals and the paper suggests another 900 of those are inactive and at risk of disappearing. The pre-print has struck a nerve, receiving news coverage in Nature and Science.
In 2017, with funding support from the Andrew Mellon Foundation and the Kahle/Austin Foundation, the Internet Archive launched a project focused on preserving all publicly accessible research documents, with a particular focus on open access materials. Our first job was to quantify the scale of the problem.
Of the 14.8 million known open access articles published since 1996, the Internet Archive has archived, identified, and made available through the Wayback Machine 9.1 million of them (“bright” green in the chart above). In the jargon of Open Access, we are counting only “gold” and “hybrid” articles which we expect to be available directly from the publisher, as opposed to preprints, such as in arxiv.org or institutional repositories. Another 3.2 million are believed to be preserved by one or more contracted preservation organizations, based on records kept by Keepers Registry (“dark” olive in the chart). These copies are not intended to be accessible to anybody unless the publisher becomes inaccessible, in which case they are “triggered” and become accessible.
This leaves at least 2.4 million Open Access articles at risk of vanishing from the web (“None”, red in the chart). While many of these are still on publisher’s websites, these have proven difficult to archive.
One of our goals is to archive as many of the articles on the open web as we can, and to keep up with the growing stream of new articles published every day. Another is to look back over the vast petabytes of web content in the Wayback Machine, back to 1996, and find any content we might already have but is not easily findable or discoverable. Both of these projects are amenable to software automation, but made more difficult by the evolving nature of HTML and PDFs and their diverse character sets and encodings. To that end, we have approached this project not just as a technical one, but also as a collaborative one that aims to add another piece to the distributed infrastructure supporting open scholarship.
To expand our reach, we built an editable catalog (https://fatcat.wiki) with an open API to allow anybody to contribute. As the software is free and open source, as is the data, we invite others to reuse and link to the content we have archived. We have also indexed and made searchable much of the literature to help manage our work and help others find if we have archived particular articles. We want to make scholarly material permanently available, and available in new ways– including via large datasets for analysis and “meta research.”
We also want to acknowledge the many partnerships and collaborations that have supported this work, many of which are key parts of the open scholarly infrastructure, including ISSN, DOAJ, LOCKSS, Unpaywall, Semantic Scholar, CiteSeerX, Crossref, Datacite, and many others. We also want to acknowledge the many Internet Archive staff and volunteers that have contributed to this work, including Bryan Newbold, Martin Czygan, Paul Baclace, Jefferson Bailey, Kenji Nagahashi, David Rosenthal, Victoria Reich, Ellen Spertus, and others.
If you would like to participate in this project, please contact the Internet Archive at firstname.lastname@example.org.