Searching Through Everything

With over 20 million items in the Internet Archive’s many collections, having a good way to search through them to find exactly what you want is crucial. It is equally important to be able to filter the data in flexible ways so that you see subsets of the data most relevant to you. We are pleased to offer two new features that might change everything about how you search.

Faceted Filtering

Once you’ve executed a site search, either from the search form at the top right of every page or by going to the search page directly, you’ll see a bunch of new checkboxes down the left-hand side, in addition to the search results. These checkboxes are grouped into categories, such as “Media Type” and “Topics & Subjects”.

Clicking any of the checkboxes adds the corresponding term to the search criteria, allowing you to more precisely define the filtered set of search results. Checkmarking more than one term within the same category causes items that match any of the selected terms to be displayed, whereas checkmarking items from two different categories means that only items matching both terms will be shown. Play around with it, and you’ll see how intuitive it is. Checking or unchecking new terms causes search results to be re-filtered on the fly.
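For readers curious about the logic behind the checkboxes, the rule described above (any selected term within a category, every category across categories) can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the code that runs on the site: the item list, the `matches` and `filter_items` helpers, and the facet values are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical sample data: (title, {category: set of terms}).
Facets = Dict[str, Set[str]]
ITEMS: List[Tuple[str, Facets]] = [
    ("Concert recording", {"Media Type": {"audio"}, "Topics & Subjects": {"music"}}),
    ("History of Rome", {"Media Type": {"texts"}, "Topics & Subjects": {"history"}}),
    ("Newsreel", {"Media Type": {"movies"}, "Topics & Subjects": {"history"}}),
    ("Cookbook", {"Media Type": {"texts"}, "Topics & Subjects": {"cooking"}}),
]


def matches(facets: Facets, selected: Facets) -> bool:
    """Return True if the item satisfies every category that has checked terms."""
    for category, terms in selected.items():
        if not terms:
            continue  # nothing checked in this category, so it does not filter
        # Within one category, matching ANY checked term is enough (OR).
        if not (facets.get(category, set()) & terms):
            return False  # across categories, every one must match (AND)
    return True


def filter_items(items: List[Tuple[str, Facets]], selected: Facets) -> List[str]:
    """Return the titles of the items that pass the selected facets."""
    return [title for title, facets in items if matches(facets, selected)]


# Two media types checked (OR), plus one topic (AND with the media types).
selected = {"Media Type": {"texts", "movies"}, "Topics & Subjects": {"history"}}
print(filter_items(ITEMS, selected))
```

With these sample items, checking both "texts" and "movies" under Media Type plus "history" under Topics & Subjects keeps "History of Rome" and "Newsreel", and drops the other two.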

We were looking for a way to provide a more powerful, visual approach to filtering search results. When we user-tested the faceted search interface, our testers loved it. It is a familiar interface, already in use throughout the Internet, that offers both simplicity and richness.

Full-Text Search (in Beta)

Every day, we see an average of 50,000 hits on our search pages, as you, our users, search for title, creator, and various other metadata about the items we’ve archived. But you have long asked when you would be able to search not only across all items but within them as well. For years you’ve been able to search within the text of a single book using our BookReader, but never before have you been able to search across and within all 9 million available text items at the Internet Archive in a single shot. Until now.

And here’s all you have to do: On the search page, after entering your search query in the text field, checkmark “Search full text of books” just underneath the text field, and then click or tap “GO”. That’s it! In seconds, you’ll have the results of searching through millions of texts. Note that the facets at the left work a little differently from non-full-text searches; just click or tap one to add it as a filter criterion.

At the moment, we’re still in beta. Suffice it to say, we’ve faced quite a number of challenges in configuring and populating our full-text search engine, from creating the Elasticsearch clusters to dealing with optical character recognition (OCR) issues related to strange fonts, running page headers, and language recognition. We are continuing to make improvements, and still have a ways to go.

But please use it! Try searching for some phrase that’s stuck in your head from a book long ago forgotten, and see what comes up. You now have the contents of 9 million texts at your fingertips.

9 thoughts on “Searching Through Everything”

  1. Mike Lichtenberg

    Can you talk a bit about how you are handling the “challenges in… dealing with optical character recognition (OCR) issues”?

    That would seem to be the biggest difficulty in getting worthwhile results from full-text searches. For example, you have to deal with books like https://archive.org/details/animalkingdomarr03cuvi, for which the OCR is basically illegible. Other books, like https://archive.org/details/America00Ogil, are much better, but still have plenty of errors.

    Are you evaluating the OCR output before indexing and trying to fix (if possible) or omit (if necessary) books with a large number of OCR errors? If so, what methods are you using to do that?

    Thanks for your time, and keep up the good work!

  2. Pingback: Open Access Resource Roundup | Authors Alliance

  3. Pingback: Internet Archive turns 20, gives birthday gifts to the world | D4mations.com

  4. Pingback: Searching the Internet Archive | Web Search Guide and Internet News

  5. Sheila Heathcote

    I am searching for titles of books, now out of print, that are recommended by Imogene Robertson, the author of A Paris Winter. I cannot find any of the books she lists on your website, even though she says that they are all available at your site. What am I doing wrong? I’ve typed in the exact titles of the books and the names of the authors, and I keep getting results saying that you do not have this book.
    Thank you for any help you can provide.
