The Internet Archive would appreciate help from a volunteer programmer to create software that determines whether a book cover is useful to our users as a thumbnail, or whether we should use the title page instead. Many of our older books have cloth covers that are not useful, for instance:

But others are useful:

Going by age alone is not enough, because even 1923 cloth covers are sometimes good indicators of what the book is about (and are nice looking):

We would like a piece of code that can help us determine if the cover is useful or not to display as the thumbnail of a book. It does not have to be exact, but it would be useful if it knew when it didn’t have a good determination so we could run it by a person.

To help any potential programmer volunteers, we have created folders of hundreds of examples in 3 categories: year 1923 books with not-very-useful covers, year 1923 books with useful covers, and year 2000 books with useful covers. Each image filename is the Internet Archive item identifier, which can be used to find the full item: 1922forniaminera00bradrich.jpg would come from https://archive.org/details/1922forniaminera00bradrich. We would like a program (hopefully fast, small, and free/open source) that would output useful or not-useful along with a confidence.
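As a rough illustration of the requested output (a label, a confidence, and a signal for when a person should take a look), here is a minimal Python sketch. It assumes some upstream classifier already produces a usefulness score between 0 and 1; that classifier, the names `Verdict` and `judge`, and the 0.6 review cutoff are all hypothetical and not part of any existing Archive code.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    label: str          # "useful" or "not-useful"
    confidence: float   # 0.0 (coin flip) to 1.0 (certain)
    needs_review: bool  # True when a person should look at the cover


def judge(score: float, review_below: float = 0.6) -> Verdict:
    """Turn a usefulness score in [0, 1] into a label plus confidence.

    `score` is assumed to come from some upstream classifier (hypothetical):
    0.5 means "no idea", values near 0 or 1 mean a strong opinion.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    label = "useful" if score >= 0.5 else "not-useful"
    # Distance from the 0.5 decision boundary, rescaled to the 0..1 range.
    confidence = abs(score - 0.5) * 2
    # Low-confidence answers are flagged so they can be run by a person.
    return Verdict(label, confidence, confidence < review_below)
```

With these illustrative settings, `judge(0.55)` is labelled useful but flagged for review, while `judge(0.1)` is a confident not-useful.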

Interested in helping? Brenton at archive.org is a good point of contact on this project. Thank you for considering this. We can use the help. You can also use the comments on this post for any questions.

FYI: To create these datasets, I ran these command lines, and then by hand pulled some of the 1923 covers into the “useful” folder.

bash-3.2$ ia search "date:1923 AND mediatype:texts AND NOT collection:opensource AND NOT collection:universallibrary AND scanningcenter:*" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/cloth/{}.jpg"

bash-3.2$ ia search "date:2000 AND mediatype:texts AND scanningcenter:cebu" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/picture/{}.jpg"

21 thoughts on “Helping us judge a book by its cover: software help request”

Chris Potter January 6, 2019 at 5:30 pm

“Usefulness” is a pretty subjective criterion. It may be hard to get a good solution without being more specific with the requirements (e.g. has text on it, has large areas with different colors).

Certainly a blank cover is not useful. But looking at an example from the image sets, why is 2ndreportclass1918harvuoft.jpg considered “useful” while 1922forniaminera00bradrich.jpg is considered “not useful”? To me, they both just look to have some bland text on them.

Larry Strickland January 16, 2019 at 1:04 am

I would add the title of the book, the author’s name, and the publication date, and give a precise definition of the book.
Also, it would probably be easier to run this with JavaScript, Kotlin, Ruby, or ASC11.
1. Brewster Kahle (Post author) January 16, 2019 at 1:47 am
  
  If you look at the filename, it contains an item ID. Each item lives at https://archive.org/details/ItemID, and you can get its metadata as JSON from https://archive.org/metadata/ItemID
  
  hope this helps.
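
  The filename-to-URL mapping described in this reply can be sketched in Python as follows. This is only an illustration using the standard library; the helper names `item_urls` and `fetch_metadata` are made up for the example, and the second function needs network access.

```python
import json
import os
import urllib.request


def item_urls(cover_filename: str) -> dict:
    """Map a cover filename such as 'abc.jpg' to its archive.org URLs."""
    # The item ID is the filename without its directory or extension.
    item_id = os.path.splitext(os.path.basename(cover_filename))[0]
    return {
        "details": f"https://archive.org/details/{item_id}",
        "metadata": f"https://archive.org/metadata/{item_id}",
    }


def fetch_metadata(cover_filename: str, timeout: float = 10.0) -> dict:
    """Download and parse the item's metadata JSON (needs network access)."""
    url = item_urls(cover_filename)["metadata"]
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

  For example, `item_urls("1922forniaminera00bradrich.jpg")["details"]` gives https://archive.org/details/1922forniaminera00bradrich. The exact layout of the returned JSON is not described in this thread, so inspect a response before relying on particular fields.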

Anonymous January 7, 2019 at 1:02 am

Couldn’t you utilize some piece of software that detects whether the image is mostly one solid color, or something along those lines.
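
One way to try the "mostly one solid color" idea is sketched below in Python. It is not taken from any existing Archive code: the function names, the 64x64 downscale, the tolerance of 24, and the 0.9 threshold are all illustrative guesses that would need tuning against the example folders. The core functions use only the standard library; the optional loader assumes the third-party Pillow package is installed.

```python
from statistics import median


def dominant_fraction(pixels, tolerance=24):
    """Fraction of pixels whose channels all lie within `tolerance` of the
    per-channel median colour. `pixels` is a sequence of (r, g, b) tuples."""
    pixels = list(pixels)
    if not pixels:
        raise ValueError("no pixels given")
    # Per-channel median acts as a cheap estimate of the background colour.
    centre = tuple(median(channel) for channel in zip(*pixels))
    close = sum(
        all(abs(value - mid) <= tolerance for value, mid in zip(pixel, centre))
        for pixel in pixels
    )
    return close / len(pixels)


def looks_plain(pixels, threshold=0.9, tolerance=24):
    """True when at least `threshold` of the image is near one colour."""
    return dominant_fraction(pixels, tolerance) >= threshold


def load_pixels(path, size=(64, 64)):
    """Read an image as RGB tuples. Assumes the third-party Pillow package."""
    from PIL import Image  # imported lazily so the functions above need no extras
    with Image.open(path) as image:
        data = image.convert("RGB").resize(size).tobytes()
    # RGB bytes are interleaved as r, g, b, r, g, b, ...
    return [tuple(data[i:i + 3]) for i in range(0, len(data), 3)]
```

A cover could then be checked with `looks_plain(load_pixels("cover.jpg"))`. Covers whose fraction lands close to the threshold are natural candidates for a human look.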

Yuriy Sosov January 7, 2019 at 4:08 am

What kind of “AI scripting” is allowed? I mean R or python?
I may try and see.

j January 7, 2019 at 2:12 pm

I’d love to help and am an avid reader, so if knowledge is power, help me fill my cup.

Akhil K A January 8, 2019 at 8:03 am

Hi,

Which programming language do you prefer?

Thanks!

Justin January 8, 2019 at 4:29 pm

I’d love to help and am an avid reader, so if knowledge is power, help me fill my cup.

Amaterasus stepmother January 8, 2019 at 9:02 pm

If the books are categorized by language it should be fairly easy:
just run tesseract-OCR on every image, parse the returned string and check if parts of it (if any) can be found in a dictionary of the corresponding language. Roughly 20 lines of Python code and all Open Source.
Yes, OCR is error-prone and there WILL be 3% errors or so, so of course that’s Q&D, but it will be a significant improvement over the current state.
If you get 97% correct, ask your viewers to flag the remaining faulty 3%.
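
A hedged Python sketch of this OCR-plus-dictionary idea follows. The word-matching functions use only the standard library; the `ocr_cover` helper assumes the third-party `pytesseract` and Pillow packages plus a local Tesseract install. The function names, the three-letter minimum, and the two-word threshold are illustrative choices, and the 3% error figure above is the commenter's estimate, not something this sketch measures.

```python
import re


def readable_words(text, dictionary):
    """Return OCR'd tokens of 3+ letters that appear in `dictionary`.

    `dictionary` is any set of lowercase words for the book's language.
    """
    # [^\W\d_] matches letters only, so digits and OCR punctuation are dropped.
    tokens = re.findall(r"[^\W\d_]{3,}", text.lower())
    return [token for token in tokens if token in dictionary]


def has_readable_text(text, dictionary, min_words=2):
    """True when the OCR output contains at least `min_words` real words."""
    return len(readable_words(text, dictionary)) >= min_words


def ocr_cover(path, lang="eng"):
    """Run Tesseract on a cover image. Assumes pytesseract and Pillow."""
    import pytesseract          # third-party wrapper around the tesseract CLI
    from PIL import Image
    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

Usage would look like `has_readable_text(ocr_cover("cover.jpg"), words)`, where `words` is a word list loaded by the caller. Requiring several dictionary hits, rather than one, helps guard against stray OCR noise on a blank cloth cover being read as a word.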

brenton January 9, 2019 at 5:04 am

Awesome to have help! Yes, “usefulness” is a pretty non-specific term. 🙂 The goal is to determine whether we should show the cover or the title page for the book’s thumbnail. One might argue that if a cloth cover has at least title and author displayed clearly, it’s worthy of being shown.

At the Archive, we mainly use Python and PHP. Here’s what I’ve come up with so far: https://github.com/bfalling/book-cover-analyzer

It’s a simple algorithm but works decently. I’d love it if someone came up with something better. The gauntlet is thrown!

Barinder January 9, 2019 at 6:00 am

Which programming language is used for the web and for compilation? What do you prefer?

j50s January 15, 2019 at 6:47 am

Perl, Python, Visual Basic (outdated, though)

Antani January 9, 2019 at 9:25 am

Will it apply to all text uploads by users too?
Currently, in generic uploads the thumbnail page is selected by some heuristics, but the results aren’t good. For a start, it would be better to remove these heuristics and just select the first page as the default title/thumbnail page. That is just fine most of the time.

Justin January 11, 2019 at 3:06 pm

What kind of “AI scripting” is allowed? I mean R or python?
I may try and see.

آهنگ January 14, 2019 at 5:51 pm

Couldn’t you utilize some piece of software that detects whether the image is mostly one solid color, or something along those lines.

Brewster Kahle (Post author) January 16, 2019 at 1:49 am

Brenton’s code basically does that, and it works quite well.
1. آهنگ January 17, 2019 at 3:26 pm
  
  Thanks,
  you are amazing

johnson January 15, 2019 at 7:50 am

Which programming language is used for the web and for compilation? What do you prefer? What kind of “AI scripting” is allowed? I mean R or Python?
I may try and see.

Brewster Kahle (Post author) January 16, 2019 at 1:48 am

Any language, but we prefer FOSS. If you are going to pull out those more serious tools, then we might be able to detect title pages, table of contents and other page types. We are trying to figure out how to automate all of these things.

-brewster

Gabriel January 16, 2019 at 11:57 am

Not related to the current topic:

I noticed in Item history that you scan every item for malware, a task that takes an enormous amount of time. But I see your virus definitions are 1 year old! Can’t you update them, or else disable the scan? It is pointless!

<————— BookOp VirusCheck (updated: 1 year ago, 2018-01-12 23:21:46 +0000) Starting PST: 2019-01-16 01:55:17 ——————

elli January 17, 2019 at 3:53 am

That’s just the date the virus check task was updated, I think, which if I understand correctly passes off the data to the VirusTotal service, which then sends back an assessment of the data. So it’s the date VirusTotal last updated their software that really matters, I think, rather than the date the wrapper tool was updated.

Comments are closed.

Internet Archive Blogs

A blog from the team at archive.org
