The Internet Archive would appreciate some help from a volunteer programmer to create software that determines whether a book cover is useful to our users as a thumbnail or whether we should use the title page instead. Many of our older books have cloth covers that are not useful, for instance:
But others are useful:
Going by age alone is not enough, because even 1923 cloth covers are sometimes good indicators of what the book is about (and are nice looking):
We would like a piece of code that can help us determine if the cover is useful or not to display as the thumbnail of a book. It does not have to be exact, but it would be useful if it knew when it didn’t have a good determination so we could run it by a person.
To help any potential programmer volunteers, we have created folders of hundreds of examples in 3 categories: year 1923 books with not-very-useful covers, year 1923 books with useful covers, and year 2000 books with useful covers. The filename of each image is the Internet Archive item identifier, which can be used to find the full item: 1922forniaminera00bradrich.jpg would come from https://archive.org/details/1922forniaminera00bradrich. We would like a program (hopefully fast, small, and free/open source) that would say useful or not-useful, along with a confidence.
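As a rough illustration of the kind of output described above, here is a minimal Python sketch that turns a usefulness score into "useful", "not-useful", or "needs-review" plus a confidence. Everything in it is an assumption made for illustration: the score (a number between 0 and 1 from some unspecified classifier), the names Verdict and triage, and the width of the review band are hypothetical and not part of any existing Archive code.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    label: str         # "useful", "not-useful", or "needs-review"
    confidence: float  # 0.0 (no idea) to 1.0 (certain)


def triage(score: float, review_band: float = 0.15) -> Verdict:
    """Map a hypothetical usefulness score in [0, 1] to a verdict.

    A score of 0.5 means the classifier is undecided. Scores closer
    than `review_band` to 0.5 are handed to a person for review.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    distance = abs(score - 0.5)
    confidence = distance * 2  # rescale 0..0.5 to 0..1
    if distance < review_band:
        return Verdict("needs-review", confidence)
    label = "useful" if score > 0.5 else "not-useful"
    return Verdict(label, confidence)
```

With the default band, a score of 0.55 would go to a person, while 0.9 would be reported as useful. The band width is a guess and would need tuning against the example folders.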
Interested in helping? Brenton at archive.org is a good point of contact on this project. Thank you for considering this. We can use the help. You can also use the comments on this post for any questions.
FYI: To create these datasets, I ran these command lines, and then by hand pulled some of the 1923 covers into the “useful” folder.
bash-3.2$ ia search "date:1923 AND mediatype:texts AND NOT collection:opensource AND NOT collection:universallibrary AND scanningcenter:*" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/cloth/{}.jpg"
bash-3.2$ ia search "date:2000 AND mediatype:texts AND scanningcenter:cebu" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/picture/{}.jpg"
“Usefulness” is a pretty subjective criterion. It may be hard to get a good solution without being more specific with the requirements (e.g. has text on it, has large areas with different colors).
Certainly a blank cover is not useful. But looking at an example from the image sets, why is 2ndreportclass1918harvuoft.jpg considered “useful” while 1922forniaminera00bradrich.jpg is considered “not useful”? To me, both just seem to have some bland text on them.
I would add the title of the book, the author’s name, and the publication date, and give a precise description of the book.
Also, it would probably be easier to run this with JavaScript, Kotlin, Ruby, or ASC11.
If you look at a filename, it contains the item ID. Each one maps to https://archive.org/details/ItemID, and you can get the metadata as JSON from https://archive.org/metadata/ItemID
Hope this helps.
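For anyone who wants to script this, a minimal Python sketch of the filename-to-URL mapping might look like the following. The two URL patterns are the ones given above; the function names are made up for illustration, and nothing is assumed here about the structure of the JSON that comes back.

```python
import json
from pathlib import Path
from urllib.request import urlopen


def item_id_from_filename(filename: str) -> str:
    """Strip the directory and the .jpg extension to get the item ID."""
    return Path(filename).stem


def details_url(item_id: str) -> str:
    return f"https://archive.org/details/{item_id}"


def metadata_url(item_id: str) -> str:
    return f"https://archive.org/metadata/{item_id}"


def fetch_metadata(item_id: str, timeout: float = 30.0) -> dict:
    """Download and parse the metadata JSON (requires network access)."""
    with urlopen(metadata_url(item_id), timeout=timeout) as response:
        return json.load(response)
```

For example, item_id_from_filename("1922forniaminera00bradrich.jpg") gives the identifier that can then be passed to fetch_metadata.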
Couldn’t you use some piece of software that detects whether the image is mostly one solid color, or something along those lines?
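A minimal Python sketch of that idea, under some assumptions: the image has already been decoded into a sequence of (r, g, b) tuples by an imaging library (not shown here), and the bucket size and threshold are guesses that would need tuning on the example folders. The function names are hypothetical.

```python
from collections import Counter


def dominant_color_fraction(pixels, bucket=32):
    """Return the share of pixels that fall into the most common color bucket.

    `pixels` is an iterable of (r, g, b) tuples with values 0..255.
    Each channel is quantised into steps of `bucket`, so slightly
    different shades of the same cloth count as one color.
    """
    counts = Counter((r // bucket, g // bucket, b // bucket) for r, g, b in pixels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no pixels given")
    return counts.most_common(1)[0][1] / total


def looks_blank(pixels, threshold=0.9, bucket=32):
    """Guess that a cover is plain if one color bucket dominates it."""
    return dominant_color_fraction(pixels, bucket) >= threshold
```

A cover where one bucket holds 90% or more of the pixels would be flagged as plain. Shades that straddle a bucket boundary get split, so a real implementation would probably want something smoother.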
What kind of “AI scripting” is allowed? I mean R or Python?
I may try and see.
I’d love to help and am an avid reader, so if knowledge is power, help me fill my cup.
Hi,
Which programming language do you prefer?
Thanks!
If the books are categorized by language it should be fairly easy:
Just run Tesseract OCR on every image, parse the returned string, and check whether parts of it (if any) can be found in a dictionary of the corresponding language. Roughly 20 lines of Python code, and all open source.
Yes, OCR is error-prone and there WILL be 3% errors or so, so of course that’s quick and dirty, but it would be a significant improvement over the current state.
If you get 97% correct, ask your viewers to flag the remaining faulty 3%.
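A minimal Python sketch of the dictionary check, assuming the OCR step has already been run and its output is available as a plain string (the Tesseract call itself is not shown). The function names, the minimum word length, and the thresholds are all assumptions made for illustration.

```python
import re


def candidate_words(ocr_text, min_length=3):
    """Lower-case the OCR output and keep runs of letters of a useful length."""
    words = re.findall(r"[^\W\d_]+", ocr_text.lower())
    return [w for w in words if len(w) >= min_length]


def dictionary_hit_ratio(ocr_text, dictionary):
    """Return the fraction of candidate words found in `dictionary` (a set)."""
    words = candidate_words(ocr_text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in dictionary)
    return hits / len(words)


def has_readable_text(ocr_text, dictionary, min_ratio=0.5, min_words=2):
    """Guess that the cover carries real text if enough OCR words are known."""
    words = candidate_words(ocr_text)
    return len(words) >= min_words and dictionary_hit_ratio(ocr_text, dictionary) >= min_ratio
```

The dictionary would be a set of lower-case words for the book’s language. Very short tokens are dropped because OCR noise on cloth texture tends to produce them.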
Awesome to have help! Yes, “usefulness” is a pretty non-specific term. 🙂 The goal is to determine whether we should show the cover or the title page for the book’s thumbnail. One might argue that if a cloth cover has at least title and author displayed clearly, it’s worthy of being shown.
At the Archive, we mainly use Python and PHP. Here’s what I’ve come up with so far: https://github.com/bfalling/book-cover-analyzer
It’s a simple algorithm but works decently. I’d love it if someone came up with something better. The gauntlet is thrown!
Which programming language do you prefer for this?
Perl, Python, Visual Basic (outdated, though)
Will it apply to all text uploads by users too?
Currently, in generic uploads the thumbnail page is selected by some heuristics, but the results aren’t good. For a start, it would be better to remove these heuristics and just select the first page as the default title/thumbnail page. That’s fine most of the time.
Couldn’t you use some piece of software that detects whether the image is mostly one solid color, or something along those lines?
Brenton’s code basically does that, and it works quite well.
Thanks,
you are amazing
Which programming language do you prefer for this? What kind of “AI scripting” is allowed? I mean R or Python?
I may try and see.
Any language, but we prefer FOSS. If you are going to pull out those more serious tools, then we might be able to detect title pages, table of contents and other page types. We are trying to figure out how to automate all of these things.
-brewster
Not related to the current topic:
I noticed in the item history that you scan every item for malware, a task that takes an enormous amount of time. But I see your virus definitions are 1 year old! Can’t you update them, or else disable the scan? As it stands, it is pointless!
<————— BookOp VirusCheck (updated: 1 year ago, 2018-01-12 23:21:46 +0000) Starting PST: 2019-01-16 01:55:17 ——————
That’s just the date the virus check task was updated, I think, which if I understand correctly passes off the data to the VirusTotal service, which then sends back an assessment of the data. So it’s the date VirusTotal last updated their software that really matters, I think, rather than the date the wrapper tool was updated.