FOSS wins again: Free and Open Source Communities comes through on 19th Century Newspapers (and Books and Periodicals…)

Posted on November 23, 2020 by Brewster Kahle

I have never been more encouraged by and thankful to Free and Open Source communities. Three months ago I posted a request for help with OCR’ing and processing 19th Century Newspapers and we got soooo many offers to help. Thank you, that was heartwarming and concretely helpful. Already, based on these suggestions, we are changing over our OCR and PDF software completely to FOSS, making big improvements, and building partnerships with FOSS developers in companies, universities, and as individuals that will propel the Internet Archive to have much better digitized texts. I am so grateful, thank you. So encouraging.

I posted a plea for help on the Internet Archive blog, “Can You Help us Make the 19th Century Searchable?”, and we got many offers on social media and over 50 comments on the post, maybe a record response rate.

We are already changing over our OCR to Tesseract/OCRopus and leveraging many PDF libraries to create compressed, accessible, and archival PDFs.

Several people suggested the German government-led initiative called OCR-D, which has made production-level tools for OCR’ing and segmenting complex and old materials such as newspapers in the old German script Fraktur, or blackletter. (The Internet Archive had never been able to process these, and now we are doing it at scale.) We are also able to OCR more Indian languages, which is fantastic. This government project is FOSS, and has money for outreach to make sure others use the tools; this is a step beyond most research grants.

Tesseract has made a major step forward in the last few years. When we last evaluated the accuracy it was not as good as the proprietary OCR, but that has changed– we have done evaluations and it is just as good, and can get better for our application because of its new architecture.
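The post does not say which metric was used for these evaluations. As an illustration only, one common way to compare two OCR engines is the character error rate (CER): the edit distance between the OCR output and a hand-corrected reference, divided by the length of the reference. The sketch below is a minimal pure-Python version; the function names and the sample strings are hypothetical and are not taken from the Internet Archive's pipeline.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """CER = edit distance / reference length (lower is better)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Hypothetical sample: a corrected line and two made-up engine outputs.
reference = "The North Star"
engine_a = "Tbe Nortb Star"   # two wrong characters
engine_b = "The North Star"   # perfect
cer_a = character_error_rate(reference, engine_a)
cer_b = character_error_rate(reference, engine_b)
```

Running the same reference pages through both engines and comparing the averaged CER gives a rough like-for-like accuracy number, though it says nothing about layout or column segmentation.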

Underlying the new Tesseract is an LSTM engine similar to the one developed for Ocropus2/ocropy, a project led by Tom Breuel (funded by Google, his former German university, and probably others; thank you!). He has continued working on this project even though he left academia. A machine-learning-based program is introducing us to GPU-based processing, which is an extra win. It can also be trained on corrected texts so it can get better.

Proprietary example from an Anti-Slavery newspaper from my blog post:

New one, based on free and open source software that is still faulty but better:

The time it takes on our cluster to compute is approximately the same, but if we add GPUs we should be able to speed up OCR and PDF creation, maybe 10 times, which would help a great deal since we are processing millions of pages a day.

The PDF generation is a balancing act: achieving small file size, rendering quickly in browser implementations, offering useful functionality (text search, page numbers, cut-and-paste of text), and complying with archival (PDF/A) and accessibility (PDF/UA) standards. At the heart of the new PDF generation is the “archive-pdf-tools” Python library, which performs Mixed Raster Content (MRC) compression, creates a hidden text layer using a modified Tesseract PDF renderer that can read hOCR files as input, and ensures the PDFs are compatible with archival standards (VeraPDF is used to verify every PDF that we generate against the archival PDF standards). The MRC compression decomposes each image into a background, foreground and foreground mask, heavily compressing (and sometimes downscaling) each layer separately. The mask is compressed losslessly, ensuring that the text and lines in an image do not suffer from compression artifacts and look clear. Using this method, we observe a 10x compression factor for most of our books.
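To make the MRC idea concrete, here is a toy sketch of the decomposition on a tiny grayscale image held as nested lists. It is not the archive-pdf-tools implementation: the fixed threshold used to build the mask, the hole-filling strategy, the helper names, and the downscale factor of 3 are all assumptions chosen for illustration, and a real pipeline would hand each layer to an image codec afterwards.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of grayscale values, 0 = black, 255 = white


def make_mask(image: Image, threshold: int = 128) -> Image:
    """1 where a pixel is dark enough to count as text (foreground), else 0.
    A fixed threshold is an assumption; real tools binarize more carefully."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def split_layers(image: Image, mask: Image) -> Tuple[Image, Image]:
    """Split into foreground and background, filling holes with layer averages."""
    fg = [px for row, mrow in zip(image, mask) for px, m in zip(row, mrow) if m]
    bg = [px for row, mrow in zip(image, mask) for px, m in zip(row, mrow) if not m]
    fg_fill = sum(fg) // len(fg) if fg else 0
    bg_fill = sum(bg) // len(bg) if bg else 255
    foreground = [[px if m else fg_fill for px, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    background = [[bg_fill if m else px for px, m in zip(row, mrow)]
                  for row, mrow in zip(image, mask)]
    return foreground, background


def downscale(image: Image, factor: int) -> Image:
    """Shrink by averaging factor x factor blocks (the lossy step)."""
    height, width = len(image), len(image[0])
    small = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [image[y][x]
                     for y in range(top, min(top + factor, height))
                     for x in range(left, min(left + factor, width))]
            row.append(sum(block) // len(block))
        small.append(row)
    return small


def recompose(mask: Image, foreground: Image, small_bg: Image, factor: int) -> Image:
    """Rebuild the page: the full-resolution mask picks foreground or upscaled background."""
    return [[foreground[y][x] if mask[y][x] else small_bg[y // factor][x // factor]
             for x in range(len(mask[0]))]
            for y in range(len(mask))]


# A 6x6 "page": light paper (200) with four dark text pixels (20).
page = [[200] * 6 for _ in range(6)]
for y, x in [(1, 1), (1, 2), (2, 1), (4, 4)]:
    page[y][x] = 20

mask = make_mask(page)
foreground, background = split_layers(page, mask)
small_background = downscale(background, 3)   # 6x6 -> 2x2
restored = recompose(mask, foreground, small_background, 3)
```

Because the mask stays at full resolution and is never degraded, text edges survive even though the background has been reduced to a handful of pixels; on a real scan the background would lose paper texture rather than round-trip exactly as it does in this uniform toy page.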

The PDFs themselves are created using the high-performance MuPDF library and its Python binding, PyMuPDF. Both projects were supportive and promptly fixed various bugs, which propelled our efforts forward.

And best of all, we have expanded our community to include people all over the world who are working together to make cultural materials more available. We now have a Slack channel for OCR researchers and implementers that you can join if you would like (to join, drop an email to merlijn@archive.org). We look to contribute software and data sets to these projects to help them improve (led by Merlijn Wajer and Derek Fukumori).

Next steps to fulfill the dream of Vannevar Bush’s Memex, Ted Nelson’s Xanadu, Michael Hart’s Project Gutenberg, Tim Berners-Lee’s World Wide Web, and Raj Reddy’s call for Universal Access to All Knowledge (now the Internet Archive’s mission statement):

Find articles in periodicals, and get the titles/authors/footnotes
Link footnote citations to other documents
OCR Balinese palm leaf manuscripts based on 17,000 hand-entered pages
Improve Tesseract page handling to improve OCR and segmentation
Improve epub creation, including images from pages
Improve OCRopus by creating training datasets

Any help here would be most appreciated.

Thank you, Free and Open Source Communities! We are glad to be part of such a sharing and open world.

39 thoughts on “FOSS wins again: Free and Open Source Communities comes through on 19th Century Newspapers (and Books and Periodicals…)”

Lars Aronsson November 24, 2020 at 4:05 am

You are now doing Fraktur OCR at scale, which is fine. But are you considering a proofreading feedback loop, where humans can point out OCR errors and thus help to improve the OCR engine?

Currently when PGDP (or Wikisource or Project Runeberg) proofreads a scanned text, we see the Internet Archive scan the next book or newspaper with the same old process, having the same kind of OCR errors. This means the proofreading effort only benefits the current book, but not other books. It should be possible to make better use of that volunteer effort. (It is like marking the errors in a child’s school paper, but not showing the marks to the child so it could learn from it.)
1. Brewster Kahle (Post author) November 24, 2020 at 6:03 am
  
  We have started doing this in the palm leaf project https://palmleaf.org/ but we have not done so archive.org-wide.
  
  I hope we get there with this round of upgrades to the texts. It would help greatly. It can be done by resubmitting a .txt file, for instance, but we would like to keep the position information so it can be used in PDFs, epubs, and such.
  
  It is a UI and project management issue for us, and we are not there yet.
  1. Alex Santos November 29, 2020 at 12:52 am
    
    I am reading https://archive.org/details/sim_frederick-douglass-paper_1847-12-03_1_1/page/n1/mode/2up.
    
    This is truly important work, to capture this material and disseminate it is one of the most honorable of chores. Thank you to all those who work diligently and tirelessly to preserve these vitally important documents.
    
    Thank you for highlighting this particular activity. Kudos!
Nick Schmeller November 24, 2020 at 4:44 am

How do you join the Internet Archive Slack? The OCR channel link is asking me for an email in the form of *@archive.org. Apologies if I’m missing something!
1. Brewster Kahle (Post author) November 24, 2020 at 6:00 am
  
  I submitted the request to the admins.
  1. Nick Schmeller November 24, 2020 at 5:56 pm
    
    Thank you!!
Nemo November 24, 2020 at 8:58 am

This is fantastic news! Especially as tesseract improvements can also benefit others, including Wikisource and other Wikimedia projects. I see archive-pdf-tools mentions plans to replace kakadu with OpenJPEG2000, which is also useful.

Is Internet Archive going to fully replace the proprietary OCR software, or is tesseract going to be used only on some collections for now?

Would you say that MRC compression brings back the benefits of DjVu (compression based on layers was its brand) while supporting the users whose limited software can only read PDF? I know it’s supposedly a standard, but last time I checked (~2015) not all PDF readers actually managed to read (certain) JPEG2000 PDFs; how broad is support nowadays?
1. Merlijn November 24, 2020 at 9:58 am
  
  It is likely that the new stack (with Tesseract) will become the default OCR software for most collections.
  
  On JPEG2000: I’m hoping to swiftly add a backend for OpenJPEG2000, and perhaps also for ordinary JPEG. While it is true that readers that only support a really old version of PDF might not support JPEG2000, we have done some user testing and it seemed to mostly work on newer reader software. The bigger problem with JPEG2000 is loading time: most software is quite slow to decode JPEG2000, which could be a big reason to go for JPEG. I need to do some research and gather the know-how on JPEG compression, so that we can use the right compression parameters. Once that’s all in place, we can evaluate whether the quality/size ratio is more or less the same, and switch to JPEG.
  1. James Doe November 26, 2020 at 9:44 pm
    
    I hope you decide to ditch the use of JPEG2000 in PDFs.
    
    The background layers your pipeline produces are usually so heavily downsampled that in the end all they provide is some kind of per-page background tint, at the cost of slowing down page loads by 5x. For that reason, I usually prefer the b/w version when one is available. When a b/w version isn’t available, I find the PDFs unusable, so in the past I’ve resorted to PDF surgery to eliminate the background layers, also because converting the background images from JPEG2000 to JPEG just takes too long, even when parallelized on multiple cores.
    
    JPEG2000 is simply a bad experience for readers.
    1. Brewster Kahle (Post author) November 27, 2020 at 5:27 pm
      
      Google books are usually displayed with a flat black on white, while the archive has gone for keeping the background color. Your input is appreciated: we may have made the wrong decision for many books from a readability point of view.
      
      I think there is some work on the online book reader to make a slider for contrast. I wonder if that would help.
      1. James Doe November 28, 2020 at 3:25 pm
        
        Since you are actively rewriting the pipeline for generating these MRC PDFs, this is the most appropriate time to revisit the use of jpeg2000.
        
        You might also consider offering a b/w version (the same PDF sans the background layer) as an additional download option as a matter of policy.
        
        Perhaps you could use download logs to quantify whether your readers prefer to download the color or the monochrome version of a book when both are available, to inform your decisions.
      2. Brewster Kahle (Post author) November 28, 2020 at 8:07 pm
        
        Good points. I use the online bookreader, mostly.
        
        Another version we are going to work on is the epub, which has many tradeoffs as well.
giso November 24, 2020 at 6:28 pm

You are now doing Fraktur OCR at scale, which is fine. But are you considering a proofreading feedback loop, where humans can point out OCR errors and thus help to improve the OCR engine?
1. Brewster Kahle (Post author) November 28, 2020 at 6:35 am
  
  We don’t have a good way to do this yet. Maybe you or others could help?
  
  What would be ideal for us is to go from hOCR.html -> hOCR.html that we could then use for all further processing. The next best would be djvu.xml -> djvu.xml
  
  That way we would keep the position information.
  
  -brewster
  1. giso November 28, 2020 at 6:53 am
    
    You are awesome!
    Thank you
Lars Aronsson November 25, 2020 at 9:32 am

Thank you
John Muccigrosso November 25, 2020 at 12:45 pm

This looks great. Thanks for sharing.

Is there any plan to optimize text-heavy documents for that text? In other words, get rid of, for example, the yellow page color and surface texture in some old books, leaving readable pages of text. Black on white. (Or maybe the compression techniques you’re using really reduce the overhead of those visual components.)

I can imagine how this may not be part of archive.org’s mission, but I suspect it would be useful to some users, especially academics. Maybe more a hathitrust or jstor kind of thing.
1. Merlijn November 25, 2020 at 11:59 pm
  
  Hi John,
  
  For the PDFs that we make of our books, the background image (the non-text part) is downscaled by a factor of three, which often gets rid of “surface textures” and does reduce the overhead of the visual components somewhat, but it will not get rid of the background (colour) completely.
  
  The mask part of the MRC is basically what you describe (except that colours are inverted): white text on black (due to the nature of mask – white for pixels that are part of the foreground, black for pixels that are not a part of the foreground).
  
  In the near future we also hope to produce epub files, which are based on the OCR results, so that might be closer to what you’re searching for – although you’d be dependent on the OCR accuracy.
موزیک ویک November 27, 2020 at 8:42 pm

Of course, there are these problems, but OCR can help a lot in getting PDF output. This requires a lot of effort, and good results have been achieved. Thanks to the author.
uwin November 28, 2020 at 1:24 am

Find articles in periodicals, and get the titles/authors/footnotes
Linking footnote citations to other documents
OCR Balinese palm leaf manuscripts based 17,000 hand entered pages.
Improve Tesseract page handling to improve OCR and segmentation
Improve epub creation, including images from pages
Improve OCRopus by creating training datasets
with great opinion, using standard PDF for compression
open source. using python.
Thanks and contributions.
آهنگ جدید November 28, 2020 at 8:34 pm

We are in the middle of a long road to improving OCR, and problems with artificial intelligence will be solved soon. However, defects and bugs in OCR are normal.
Thanks to the page admin.
iro November 29, 2020 at 9:52 am

Hi, how can I contact you guys?
1. Brewster Kahle (Post author) November 29, 2020 at 5:57 pm
  
  The general mailbox for the Internet Archive is info@archive.org. merlijn@archive.org is doing much of the text processing.
  1. آهنگ جدید November 29, 2020 at 8:51 pm
    
    Thank you. Can I ask some programming questions about HTML5 gaming, like SWF?
سیلا November 29, 2020 at 9:02 pm

I recently started a project to convert a lot of data to PDF, but in OCR some pages have font and background problems.
When I try to print the layout, some text is lost.
I am now trying to find an OCR alternative to convert 400 books into PDF. If someone can help, please share an email or phone number.
Special thanks for sharing this good post.
1. Mark Graham November 30, 2020 at 3:44 pm
  
  Please feel free to email any specific requests to info@archive.org
WewPet November 30, 2020 at 11:10 pm

I am doing my job to restore my old website 2 years ago. Do you have documentation. Please share for me. Thank you very much.
chickaDEE Magazine December 1, 2020 at 12:09 am

I need help with the Magazines
https://archive.org/details/pub_cousteau-kids
https://archive.org/details/pub_childrens-playmate-magazine
https://archive.org/details/pub_national-geographic-kids
they have nothing in them.
robu December 2, 2020 at 6:49 pm

In the new sample text you posted, it looks like the OCR text runs across the columns, not down them in page reading order based on the original page image. Is that a function of how the sample OCR text is displayed or is the page segmentation/layout analysis not working well for that specific page image?

Also, just flagging this paper from last month in case it’s of interest: http://ceur-ws.org/Vol-2723/long20.pdf
1. Merlijn Wajer December 4, 2020 at 1:54 pm
  
  Thank you for bringing this paper to our attention!
  
  Regarding the column analysis: for PDF documents specifically, the selection order of the text is viewer-dependent, but if you look at the hOCR files themselves, or a derivative format like our “_djvu.txt” file, those should generally honour text blocks as the OCR engine finds them, so if the text spans across columns on a single line there, it means that Tesseract was not able to automatically segment the document properly.
  
  I’ve found that generally the link that Brewster shared to the “Abendpost” seems to perform pretty well when it comes to column segmentation, so perhaps you can compare what you were seeing with the documents in here: https://archive.org/details/pub_abendpost-sonntagpost?sin=&and%5B%5D=hocr&and%5B%5D=year%3A%221896%22&and%5B%5D=year%3A%221895%22&and%5B%5D=year%3A%221894%22 and see if you’re seeing the same problems?
  1. robu December 4, 2020 at 9:23 pm
    
    Yes, the Abendpost example looks very good in comparison! The OCR .txt file is definitely in page reading order, and the columns are segmented appropriately.
    https://archive.org/stream/sim_abendpost-sonntagpost_1894-05-01_6_103/sim_abendpost-sonntagpost_1894-05-01_6_103_djvu.txt
    
    If it helps, in the Frederick Douglass example, I was specifically looking at the line that starts “We give Mr. Nell’s report . . .” and then it goes to column 2 ” replete with examples” and then on to column 3 “NEWARK, N. J.”, and so on. https://ia801704.us.archive.org/8/items/sim_frederick-douglass-paper_1847-12-03_1_1/sim_frederick-douglass-paper_1847-12-03_1_1_djvu.txt.
giso December 4, 2020 at 7:44 pm

I recently started a project to convert a lot of data to PDF, but in OCR some pages have font and background problems.
When I try to print the layout, some text is lost.
1. تابان موزیک December 4, 2020 at 9:22 pm
  
  Hi giso, please use the latest version of Adobe PDF reader and choose either to ignore the page size or to fit to paper by scaling all content.
chickaDEE Magazine December 5, 2020 at 2:27 pm

Why does https://archive.org/details/pub_childrens-playmate-magazine have 340 results then?
four December 6, 2020 at 10:10 am

I have made great efforts to recover information from my old website
Do you have any documents to help me? Please share for me. Thank you
1. chickaDEE Magazine December 6, 2020 at 6:01 pm
  
  I need https://archive.org/details/pub_childrens-playmate-magazine please