Search Scholarly Materials Preserved in the Internet Archive

Looking for a research paper but can’t find a copy in your library’s catalog or popular search engines? Give Internet Archive Scholar a try! We might have a PDF from a “vanished” Open Access publisher in our web archive, an author’s pre-publication manuscript from their archived faculty webpage, or a digitized microfilm version of an older publication.

We hope Internet Archive Scholar will aid researchers and librarians looking for specific open access papers that may not be otherwise available to them. Judith van Stegeren (@jd7g on Twitter), a PhD candidate in the Netherlands, encountered just such a situation recently when sharing a workshop paper on procedural generation in computer games: “Towards Qualitative Procedural Generation” by Mark R. Johnson, originally presented at the Computational Creativity & Games Workshop in 2016. The papers for this particular year of the workshop are not indexed in the usual bibliographic catalogs, and the original workshop website hosting the Open Access papers is no longer accessible. Fortunately, copies of all the 2016 workshop papers were captured in the Wayback Machine, and can be found today by searching IA Scholar by title or conference name.

As another example, dozens of papers from the Open Journal of Hematology are no longer resolvable via DOI. As mentioned in a previous blog post, the publisher’s website vanished and has been replaced with unrelated advertisements. But before that happened, the papers were captured in the Wayback Machine, indexed in our catalog, and can now be searched in full:

IA Scholar Search Results

IA Scholar is a simple, access-oriented interface to content identified across several Internet Archive collections, including web archives, archive.org files, and digitized print materials. The full text of articles is searchable for users who are hunting for particular phrases or keywords. This complements our existing full-text search index of millions of digitized books and other documents on archive.org.

The service builds on Fatcat, an open catalog we have developed to identify at-risk and web-published open scholarly outputs that can benefit from long-term preservation, additional metadata, and perpetual access. Fatcat includes resources that may be useful to librarians and archivists, such as bulk metadata dumps, a read/write API, a command-line tool, and file-level archival metadata. If you are interested in collaborating with us, or are a researcher interested in text analysis applications, you can reach us on our public chat channel or by email at info@archive.org.

IA Scholar marks a milestone in our work initiated in 2018 to leverage the automation and scale of web and API harvesting in providing open infrastructure for the preservation of and perpetual access to scholarly materials from the public web. We particularly want to thank the Mellon Foundation for their original and ongoing support of this work, our many current partners, and the other collaborators, contributors, and volunteers.

All of this is possible because of the incredible open research ecosystem built and collectively maintained by Open Access advocates. Thank you to the DOAJ and other groups for helping catalog open access journals, which has aided preservation. Thank you to the Biodiversity Heritage Library and its supporters for digitizing print journal literature. And thank you to the many other organizations we have worked with or integrated with, or whose services we have used, including open web indices (Unpaywall, CORE, CiteseerX, Microsoft Academic, Semantic Scholar), directories of open journals (DOAJ, ROAD, SHERPA/ROMEO, JURN, Wikidata), and open bibliographic catalogs (Crossref, Datacite, J-STAGE, Pubmed, dblp).

IA Scholar is built from open source software components, and is itself released as Free Software. The website has been translated into eight languages (so far!) by generous volunteers.

7 thoughts on “Search Scholarly Materials Preserved in the Internet Archive”

  1. Colby

    It would be nice to allow searching by hash and/or some other form of fingerprint.

    Hypothes.is allows annotating local PDF documents and uses a few different methods of settling on a fingerprint for the URL-less PDFs, so you can still view annotations by others, even when you’re annotating a copy on your disk.

    https://youtu.be/oESJjiuxoiE?t=601

    Unfortunately, if you’re coming in from the other direction—being passed a link to a specific annotation and wanting to view the context for that annotation, you’re out of luck unless you happen to already know which file is being annotated; you just get a notice that “this annotation was made on a document that is not publicly available”. (Of course, that might not even be true—it might actually be publicly available, and it’s just that Hypothes.is doesn’t know about it, because the annotation was created by a person using a local copy.)

    If Scholar allowed searching by the same fingerprinting methods that Hypothes.is uses, then the Hypothes.is client could automatically query archive.org to check for availability.

    1. bnewbold Post author

      Hi Colby, thanks for your comment.

      You can actually do this lookup-by-hash in Fatcat, either on the website or via the API: https://fatcat.wiki/file/lookup

      For successful lookups, this will include a list of any known public access URLs.

      I’m aware of the Hypothesis annotation on local PDFs feature (https://web.hypothes.is/blog/annotating-pdfs-without-urls/) and it is pretty cool! We don’t currently include the robust fingerprint identifier in our catalog, but it would be feasible to bulk extract those and update the catalog at some point in the future. In the meantime we have MD5, SHA1, and SHA256.

  2. J. Peterson

    Extra credit project: Cross-correlate the papers in your archive with RetractionWatch.com, so papers discredited since publication are flagged as such.

  3. Michaela

    Will Google Scholar be crawling your collection, or will there be some unique content here that won’t show up in Google Scholar? Thank you for this resource!

  4. David C. Brock

    Is there a way to upload one’s own papers directly into Internet Archive Scholar?

    1. bnewbold Post author

      Hi David,

      There is not a way for anybody to upload files directly into IA Scholar. If a copy of the paper is already available in a legitimate form on the public web, the “Save Paper Now” feature on fatcat.wiki can be used to point our crawlers to the location, and it will be added after a review process.

      If there is not already a copy online, individual authors can use a service like https://shareyourpaper.org/ to make it easier to find an appropriate repository to deposit into.

  5. Xourx

    Many times when I refer to Wikipedia resources, I cannot find the article, but I can find it by entering the address of the article in the archive site.
