I have never been more encouraged and thankful to Free and Open Source communities. Three months ago I posted a request for help with OCR’ing and processing 19th Century Newspapers and we got soooo many offers to help. Thank you, that was heart warming and concretely helpful– already based on these suggestions we are changing over our OCR and PDF software completely to FOSS, making big improvements, and building partnerships with FOSS developers in companies, universities, and as individuals that will propel the Internet Archive to have much better digitized texts. I am so grateful, thank you. So encouraging.
I posted a plea for help on the Internet Archive blog: Can You Help us Make the 19th Century Searchable? and we got many social media offers and over 50 comments the post– maybe a record response rate.
We are already changing over our OCR to Tesseract/OCRopus and leveraging many PDF libraries to create compressed, accessible, and archival PDFs.
Several people suggested the German government-lead initiative called OCR-D that has made production level tools for helping OCR and segment complex and old materials such as newspapers in the old German script Fraktur, or black letter. (The Internet Archive had never been able to process these, and now we are doing it at scale). We are also able to OCR more Indian languages which is fantastic. This Government project is FOSS, and has money for outreach to make sure others use the tools– this is a step beyond most research grants.
Tesseract has made a major step forward in the last few years. When we last evaluated the accuracy it was not as good as the proprietary OCR, but that has changed– we have done evaluations and it is just as good, and can get better for our application because of its new architecture.
Underlying the new Tesseract is a LSTM engine similar to the one developed for Ocropus2/ocropy, which was a project led by Tom Breuel (funded by Google, his former German University, and probably others– thank you!). He has continued working on this project even though he left academia. A machine learning based program is introducing us to GPU based processing, which is an extra win. It can also be trained on corrected texts so it can get better.
New one, based on free and open source software that is still faulty but better:
The time it takes on our cluster to compute is approximately the same, but if we add GPU’s we should be able to speed up OCR and PDF creation, maybe 10 times, which would help a great deal since we are processing millions of pages a day.
The PDF generation is a balance trying to achieve small file size as well as rendering quickly in browser implementations, have useful functionality (text search, page numbers, cut-and-paste of text), and comply with archival (PDF/A) and accessibility standards (PDF/UA). At the heart of the new PDF generation is the “archive-pdf-tools” Python library, which performs Mixed Raster Content (MRC) compression, creates a hidden text layer using a modified Tesseract PDF renderer that can read hOCR files as input, and ensures the PDFs are compatible with archival standards (VeraPDF is used to verify every PDF that we generate against the archival PDF standards). The MRC compression decomposes each image into a background, foreground and foreground mask, heavily compressing (and sometimes downscaling) each layer separately. The mask is compressed losslessly, ensuring that the text and lines in an image do not suffer from compression artifacts and look clear. Using this method, we observe a 10x compression factor for most of our books.
And best of all, we have expanded our community to include people all over the world that are working together to make cultural materials more available. We have a slack channel for OCR researchers and implementers now, that you can join if you would like (to join, drop an email to email@example.com). We look to contribute software and data sets to these projects to help them improve (lead by Merlijn Wajer and Derek Fukumori).
Next steps to fulfill the dream of Vanevar Bush’s Memex, Ted Nelson’s Xanadu, Michael Hart’s Project Gutenberg, Tim Berners-Lee’s World Wide Web, Raj Ready’s call for Universal Access to All Knowledge (and now the Internet Archive’s mission statement):
- Find articles in periodicals, and get the titles/authors/footnotes
- Linking footnote citations to other documents
- OCR Balinese palm leaf manuscripts based 17,000 hand entered pages.
- Improve Tesseract page handling to improve OCR and segmentation
- Improve epub creation, including images from pages
- Improve OCRopus by creating training datasets
Any help here would be most appreciated.
Thank you, Free and Open Source Communities! We are glad to be part of such a sharing and open world.