Can You Help us Make the 19th Century Searchable?

In 1847, Frederick Douglass started a newspaper advocating the abolition of slavery; it ran until 1851. After the Civil War, there was the Freedmen’s Record, a newspaper for freed slaves. The Internet Archive is bringing these and many more works online for free public access. But there’s a problem:

Our Optical Character Recognition (OCR) software, while the best commercially available OCR technology, is not very good at identifying text in older documents.

Take, for example, this newspaper from 1847. The images are not that great, but a person can read them:

The problem is that our computers’ optical character recognition tech gets it wrong, and the columns get confused.

What we need is “Culture Tech” (a riff on fintech or biotech) and Culture Techies to work on important and useful projects – the things we need, but that are probably not going to get gushers of private equity interest to fund. There are thousands of professionals taking on similar challenges in the field of digital humanities, and we want to complement their work with industrial-scale tech that we can apply to cultural heritage materials.

One such project would be to work on technologies to bring 19th-century documents fully digital. We need to improve OCR to enable full-text search, but we also need help segmenting documents into columns and articles. The Internet Archive has lots of test materials, and thousands of people are uploading more documents all the time.

What we do not have is a good way to integrate work on these projects with the Internet Archive’s processing flow.  So we need help and ideas there as well.

Maybe we can host an “Archive Summer of CultureTech” or something… just ideas. Maybe we could work with a university department that would want to build programs and classes around Culture Tech… If you have ideas or skills to contribute, please post a comment here or send an email to info@archive.org with that information.

66 thoughts on “Can You Help us Make the 19th Century Searchable?”

  1. Kathleen Parsons

    I would fully support OCRing older fonts! 19th-century documents are definitely not the only ones with hard-to-read fonts, and I have essentially retyped many religious and engineering documents. As a former ‘image specialist’ I will be happy to assist wherever I can.

    1. James Doe

      Tesseract (free) and (though it has been many years since I checked) commercial offerings all allow you to train the recognizer on custom fonts. It’s a fairly straightforward process, if a little tedious (you need to provide labeled data/manually correct enough errors to get the accuracy up).

      I mentioned this previously, but I suspect that in the case of these older publications it’s image artifacts (not quality, because the resolution is certainly sufficient in the examples shown), and not the font, that is often the problem. So a major win would probably be to put together a good image pre-processing stage before feeding the OCR engine (which expects a relatively noise-free black-on-white image, or something it can easily convert to such a black-on-white image). That’s a distinct issue from figuring out the layout/columns/reading order and feeding the OCR engine its input in a form that lets it produce usable output.
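
      For illustration, a minimal sketch of such a pre-processing stage, using OpenCV for cleanup and pytesseract to drive Tesseract (the filename is a placeholder, and the median-blur and Otsu settings are assumptions that would need tuning per collection):

          import cv2
          import pytesseract

          def ocr_with_cleanup(path):
              # Load the scan and drop color information.
              gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
              # Light denoising to suppress paper speckle and bleed-through.
              gray = cv2.medianBlur(gray, 3)
              # Otsu binarization: give the engine the clean black-on-white image it expects.
              _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
              return pytesseract.image_to_string(bw)

          print(ocr_with_cleanup("douglass_page.jpg"))  # hypothetical scan of the 1847 page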

    2. John Doe

      By the way, a lot of the fonts commonly used today for books and newspapers (and therefore well handled by OCR) have a direct lineage to 19th-century (and earlier) typeset fonts: Caslon, Baskerville, Bodoni, etc. (so-called old-style fonts). Mechanical typesetting then wasn’t as high-quality as today’s printing presses or web typography, but essentially you’re dealing with the same fonts. You have to go quite a bit further back in time to meet fonts exotic enough to fool OCR engines.

  2. Peter Logan

    It would be a huge contribution to the digital humanities community if IA could develop a robust method to process the OCR from 19th-century texts! I have spent four years doing OCR on historical editions of the Encyclopedia Britannica (funded by the NEH), and while I have fine-tuned the process to get about 99.5% accuracy, there are still hundreds of thousands of errors to be caught in a volume of text (four separate editions) that no human can possibly read through.

    I have learned lessons from Distributed Proofreaders, whose experience at automated cleanup has been invaluable. But one of my goals is to upload the improved text to IA for the editions you have available (some of which I have OCR’d), if you are interested.

    1. Brewster Kahle Post author

      Wow! A cleaned-up Encyclopaedia Britannica – what a contribution. That is such an amazing book.

      Of course the Internet Archive would love it.

      But I dream of cool Borges-like ways of moving through it.

      And I wonder if there would be a way to use machine learning to learn from all of your work and improve how other texts are done.

  3. Linda Hamilton

    Distributed Proofreaders (pgdp.net) digitizes scans from The Internet Archive and other sources. We use crowd-sourced web-based technology to convert the scans into carefully reviewed searchable text. The workload is divided into individual pages, so that many volunteers can work on a book at the same time, which significantly speeds up the process.

    During proofreading, volunteers are presented with a scanned page image and the corresponding OCR text on a single web page. This allows the text to be easily compared to the image, proofread, and sent back to the site. A second volunteer is then presented with the first volunteer’s work and the same page image, verifies and corrects the work as necessary, and submits it back to the site. The book then similarly progresses through a third proofreading round and two formatting rounds using the same web interface.

    Once all the pages have completed these steps, a post-processor carefully assembles them into an e-book, optionally makes it available to interested parties for ‘smooth reading’, and submits it to the Project Gutenberg archive.

    1. Brewster Kahle Post author

      We love Distributed Proofreaders and Project Gutenberg (long live Michael Hart!) – what we have never gotten to work is merging the results back together with the scans.

      At least we should put a pointer into each of the volumes on the Internet Archive to the DP versions, and vice versa.

      1. James Doe

        Also note that the rate at which DP (a terrific project, nonetheless) processes texts is incommensurate with the rate at which IA digitizes texts. Relying wholly on humans is probably not going to work. More importantly, their goals are quite different from the goals you described in your post. Creating a faithful digital reproduction of a text sets a far higher bar than what you need to get something very useful if your goal is simply to make the text searchable.

        Just for comparison’s sake, I’d be interested to know whether you have stats for how many works Gutenberg/DP have put up over their many years of activity, compared with even the average rate at which works are added to IA.

  4. James

    How many images do you have to process? There must be a finite amount of material from the 19th century. Could a manual approach be feasible? Something like Project Gutenberg or Zooniverse or even reCAPTCHA (although I accept that they have used some automation as a first pass)?

    You could set up a transcription workflow, dish out work units to volunteers, maybe gamify it a little… and you’ll probably have to do this anyway to validate the output of any OCR software?

    1. Brewster Kahle Post author

      In a rough query, we have 2 million texts pre-1900, and a lot of that is not the most gripping material, or even really worth careful handwork. We are bringing online a large quantity of periodicals, which will be fun.

      But some certainly is worth hand work, and the distributed proofreading project is a win.

      https://archive.org/search.php?query=year%3A%5B0%20TO%201900%5D%20AND%20mediatype%3Atexts

      reCAPTCHA is a good idea, and it originally corrected IA books, but then it was sold to Google.

    2. Ravana

      I transcribe on Zooniverse and other sites all the time. I find it very relaxing. It is sad when I go back to a transcription project I’ve been working on and find that it has been finished.

    1. Ian Scales

      This was a good suggestion, but way too cryptic for North Americans.

      Brewster, what Ian is suggesting is that the Australians have mounted all their old newspapers online, in such a way that anyone is able to correct the text in a panel on the left of the screen while they see the scanned image on the right of the screen. The software is able to keep track of the placement of the corrected text within the scanned image.

      It has been a highly successful project, run by the National Library of Australia. Over the 14 or so years of operation, great swathes of metropolitan and rural newspaper articles have been text-corrected by thousands of people. Some simply text-correct family notices relevant to themselves and do no more; others have corrected millions of lines. Many articles have been worked on by multiple people. Nobody ‘allocates work’ or ‘orchestrates’ — it all depends on the personal interest of those involved. It is strangely addictive, actually, and has no burdensome obligations or deadlines attached. Naturally, the articles of most interest get the most attention, though lone researchers burrow into their own obscure tunnels of discovery as well.

      Overall, in the past 14 years, the history of Australia has opened up from a jumble of poor OCR to millions of accurate, text-searchable lines. New patterns and cultural discoveries are possible, spawning a whole new generation of published histories. ‘Trove’ is now a household name, at least among Australia’s educated households. It has been a highly successful model that works excellently.

      One day presumably AI will do all this for us – but that may be many years away – and the AI will probably train on the human input of Trove anyway. The bottom line is, if Internet Archive is searching for a better way, maybe look at Trove. It’s there and it works spectacularly well.

  5. wiki

    Wikisource seeks to correct all OCR errors in public domain or creative commons texts:
    https://en.wikisource.org/wiki/Main_Page

    It’s a great resource for finding text, and if you find a scanno you can fix it! (https://en.wiktionary.org/wiki/scanno) Maybe archive.org could have a revision-history feature with editors. That sounds kinda odd or nonsensical – just work on the thing at Wikisource. On the other hand, Wikisource might exclude things based on notability; in that case maybe it would be good to have a Wikisource-like workflow as an optional thing in archive.org.

  6. Miguel

    I have never worked with Optical Character Recognition; however, I have some experience with its superset, Computer Vision. This project seems feasible for a team of volunteers, and I would be happy to join if such a team is created.

    Best,
    Miguel

  7. Pingback: Can You Help us Make the 19th Century Searchable? | Newsvideo.eu

  8. Jeff

    How applicable would the technology built by the Library of Congress be here? Looking at their documentation they appear to be addressing the issues you’ve mentioned. It looks like they’ve been digitizing their collections of newspapers on a near industrial scale:

    https://news-navigator.labs.loc.gov/

  9. Pingback: Help Make the 19th Century Searchable – Hacker News Robot

  10. Pingback: Help Make the 19th Century Searchable – HackBase

  11. Ben W Brumfield

    We run an open-source crowdsourced OCR correction/manuscript transcription platform, and many of our materials were originally hosted (and OCRed) on the Internet Archive. We’d love to find a way to contribute corrected OCR back to the Archive, whether as training material to improve your OCR engine or as ways to improve/replace the OCR you host. I suspect that similar projects like Wikisource or Distributed Proofreaders would also be interested.

    There are several challenges, however. To take one example, is a correction made in plaintext an adequate contribution? Plaintext outputs are easiest for humans to produce, but they lose bounding box info useful for improving OCR engines or highlighting search terms for readers. That’s a challenge that’s independent of process integration, but illustrates the kinds of discussion needed.

    1. James Doe

      If that ends up being the issue, it shouldn’t be hard to correlate IA’s OCR text with the manually edited parallel text. You should be able to get a reasonable match at least at the line level, which would be good enough for search-and-highlight applications. That still leaves the issue of scale, but it might still be a good idea to initiate a public hackathon focused on a particularly significant collection or two, if only to gauge interest and to gather some data on usability and productivity improvement opportunities.
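
      For illustration, a minimal sketch of that line-level correlation using only the Python standard library (the similarity cutoff is an assumption, and repeated lines would need tie-breaking in practice):

          import difflib

          def align_corrected_lines(ocr_lines, corrected_lines, cutoff=0.6):
              # For each corrected line, find the most similar OCR line; the OCR line's
              # index is what lets us recover its position and bounding boxes later.
              pairs = []
              for corrected in corrected_lines:
                  match = difflib.get_close_matches(corrected, ocr_lines, n=1, cutoff=cutoff)
                  if match:
                      pairs.append((ocr_lines.index(match[0]), corrected))
              return pairs

          ocr = ["Tlie N0rth Star", "Frederiok Douglass, Editor"]     # noisy OCR output (toy example)
          fixed = ["The North Star", "Frederick Douglass, Editor"]    # hand-corrected text
          print(align_corrected_lines(ocr, fixed))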

  12. James Doe

    Not that long ago, you posted a similar request for assistance regarding the classification of cover images.
    Fifteen years ago, these would (maybe) have been research problems. Today, I think a practitioner would describe them as too easy to be interesting. I understand that you run a skeleton crew to better direct donations toward archival purposes. But really, an average engineer with a modest cloud budget should be able to stand up a pipeline for issues like these in very short order. It befuddles me why the IA considers these solved problems such deep technical hurdles. It suggests the organization needs, and lacks, a certain technical function which isn’t that hard to recruit these days, either for hire or even as volunteer work (I think your mission is inspiring and very marketable to volunteers).

    Here are a few suggestions for how to manage such problems now and in the future:
    1. This blog is not likely to reach your target audience. Try posting in relevant Reddit communities or on Hacker News, or reach out to AI tech workers at the big five (and others) through personal networks in search of volunteers.
    2. Your technical challenges, combined with endless data, are well positioned to attract AI hobbyists (there is such a thing). You might consider organizing a challenge with Kaggle or AICrowd. If successful, you’ll get submissions which do well, with code which you can adapt to run at scale in your pipeline. The OCR angle may not actually be a good fit for this, but other technical challenges you encounter might be. The cover classification problem would have been suitable (if slightly boring) for a challenge. But, in order to initiate a challenge competition, it’s important that…
    3. If you have a dataset – present it as one. There’s an established way of doing what you’re trying to do (presenting a challenge for a crowd to come up with a solution). Pointing people at an archive page and a bunch of images isn’t good enough. The way it works is: you (your org) gather a representative and respectable subset of the images from the corpus you want to “crack”. You provide labels (by staff work, crowd-sourcing, Mechanical Turk, etc.) for the images. You package these images and labels together with a description of what you’re looking for, and you do a reasonable job publicizing the availability of this dataset and the importance of its solution in the communication channels frequented by people competent to solve it (mentioned above). People try it, the better solutions float to the top, and source code is made available on GitHub. You either pay, tempt, or in-house someone to take that code and turn it into a pipeline on your infrastructure. Your previous post regarding cover classification was a perfect instance of doing that.
    4. The IA (partly for legal reasons, I understand) has no success stories I’m aware of in terms of crowd-sourcing. Not for metadata (as far as I know it’s still impossible to submit corrections, for which there are many). Crowd-sourcing is a double-edged sword, since managing it can take up significant resources, and online communities are in general unstable, erupting into open revolt sooner or later. Stack Exchange has seen a massive exodus of competent community managers and moderators. These people know something about how to do such things. Try tracking some down and reaching out for advice on organizing a collaborative volunteer force to help you with data enhancement.
    5. This is important because 100% automation is a tough target to achieve. What is more realistic is developing a software framework which can enable a small team to become very productive. Given the volume you deal with and your (comparatively) small resources, putting humans in the loop is not economically viable on a scale large enough to meet your intake rate. That’s why cultivating a volunteer community might be a good solution. Because you should be able to provide them with an interface which allows a small group to move a lot of volume.
    6. Although AI algorithms do downright magical things regularly these days, it’s also quite common in image-processing tasks to benefit from clustering the tasks and “solving” them individually. What I mean by that is that the problem of “page layout analysis” (that’s what it’s called in the OCR community) for arbitrary publications is much more difficult than doing the same for a particular publication, because a particular newspaper (for example) has consistencies to its layout which simplify the matter considerably. So, rather than betting on a magic bullet for *all* your archival material, if you have a substantial collection (say several years of a weekly) which has special historical significance, it’s a much easier task to “tailor” a pipeline for that collection and its specific idiosyncrasies than it is to write something that works for pamphlets, broadsheets, posters, books, census tables, etc. Ideally, you would have a flexible software solution with parameters which can be tweaked for each collection individually for best results.
    7. That said, AI these days is so amazing, I may be giving you bad advice. If you have enough labeled data, and a competent engineer, I really would be surprised if this is a “hard problem” compared to, say, the image processing for self-driving cars.
    8. Rather than trying to “improve OCR”, a more productive approach might be to work on software (a script, really) that preprocesses your images in order to make them more palatable to your existing OCR solution. Current OCR offerings are, generally, staggeringly good. You just need to feed them images that match their expectations. For example, the Frederick Douglass image you offered as being out of scope for your OCR solution, with some very basic image processing in GIMP, produces very good results with the free OCR program Tesseract, which is quite good but by no means on par with current state-of-the-art commercial offerings.
    If you take my advice and tailor the preprocessing to a collection, you can expect extremely good results.
    Image processing is a specialized skill, but honestly, a little goes a long way, and from what I’ve seen, it only takes some undergraduate-level techniques to get you most of the way there. Writing a page segmentation algorithm that does well enough on a newspaper should not be that difficult; the methods are known and widely available.
    Tesseract itself (open source) has code for layout analysis which could be customized for better results (a minimal sketch follows this list). The authors also published a paper on how to do table detection in documents. This is really standard stuff nowadays. There’s some engineering involved, but no real need for PhDs or geniuses.
    9. Amazon and Google both have extremely good computer vision services in their cloud offerings, tailored specifically for organizations wishing to scan reams of documents. It would be uneconomical for you to pay their price, and as an organization with strong values of freedom and privacy, I expect you’re not keen on partnering with them. That shouldn’t prevent you from using their tech on a sample of your data in order to gauge the performance. This can be done quickly and cheaply. If you find that their tech can get close to 100% for what you throw at it, you can be confident that you can get something close to that performance (though probably not quite as good) with open source and some elbow grease.
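
    As a concrete illustration of point 8 (and the layout question in point 6), a minimal sketch that asks Tesseract for its own page-segmentation output and groups the recognized words by block; on a multi-column page each column usually comes back as its own block (the filename and the --psm choice are assumptions to adjust per collection):

        import pytesseract
        from PIL import Image

        # --psm 1: fully automatic page segmentation with orientation and script detection.
        data = pytesseract.image_to_data(
            Image.open("newspaper_page.png"),      # placeholder scan
            config="--psm 1",
            output_type=pytesseract.Output.DICT,
        )

        # Collect words per Tesseract block; block boundaries are a first
        # approximation of the columns/articles that need segmenting.
        blocks = {}
        for i, word in enumerate(data["text"]):
            if word.strip():
                blocks.setdefault(data["block_num"][i], []).append(word)

        for block_id, words in sorted(blocks.items()):
            print(block_id, " ".join(words)[:80])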

    I apologize for posting such a long and overbearing rant in response, but the issues you seek help on really are so obviously solvable, it’s a little frustrating to hear that your org is finding it difficult to deal with them. Librarians are awesome, but techies are too. You should maybe get some on your staff. They can do a lot for your mission.

    Best of luck! The archive is totally awesome!

    1. Brewster Kahle Post author

      Not that long ago, you posted a similar request for assistance regarding the classification of cover images.
      Fifteen years ago, these would (maybe) have been research problems.

      >> agreed. I do not think these are NSF-fundable research projects, but I believe there are people who will find this worthwhile to do, and at least some funding is available. I will try to answer your long and thoughtful post inline.

      Today, I think a practitioner would describe them as too easy to be interesting. I understand that you run a skeleton crew to better direct donations toward archival purposes. But really, an average engineer with a modest cloud budget should be able to stand up a pipeline for issues like these in very short order. It befuddles me why the IA considers these solved problems such deep technical hurdles. It suggests the organization needs, and lacks, a certain technical function which isn’t that hard to recruit these days, either for hire or even as volunteer work (I think your mission is inspiring and very marketable to volunteers).

      >> I hope you are right. Our constraints rise, fall, and morph. Some things are getting easier. The hardest constraints are usually integration and the “last 90%.” For integration, we now have robust APIs, so that makes things easier at the technical level.

      Here are a few suggestions for how to manage such problems now and in the future:
      1. This blog is not likely to reach your target audience. Try posting in relevant Reddit communities or on Hacker News, or reach out to AI tech workers at the big five (and others) through personal networks in search of volunteers.

      >> good ideas. but so far, we are getting some responses here.

      2. Your technical challenges, combined with endless data, are well positioned to attract AI hobbyists (there is such a thing). You might consider organizing a challenge with Kaggle or AICrowd. If successful, you’ll get submissions which do well, with code which you can adapt to run at scale in your pipeline. The OCR angle may not actually be a good fit for this, but other technical challenges you encounter might be. The cover classification problem would have been suitable (if slightly boring) for a challenge. But, in order to initiate a challenge competition, it’s important that…

      >> Running challenges takes skills we would like to find, maybe externally. Then, when there is a winner, there is the integration and the long-tail debugging.

      3. If you have a dataset – present it as one. There’s an established way of doing what you’re trying to do (presenting a challenge for a crowd to come up with a solution). Pointing people at an archive page and a bunch of images isn’t good enough. The way it works is: you (your org) gather a representative and respectable subset of the images from the corpus you want to “crack”. You provide labels (by staff work, crowd-sourcing, Mechanical Turk, etc.) for the images. You package these images and labels together with a description of what you’re looking for, and you do a reasonable job publicizing the availability of this dataset and the importance of its solution in the communication channels frequented by people competent to solve it (mentioned above). People try it, the better solutions float to the top, and source code is made available on GitHub. You either pay, tempt, or in-house someone to take that code and turn it into a pipeline on your infrastructure. Your previous post regarding cover classification was a perfect instance of doing that.

      >> If you know folks who could be challenge leaders, this could work. We have some funding.

      4. The IA (partly for legal reasons, I understand) has no success stories I’m aware of in terms of crowd-sourcing. Not for metadata (as far as I know it’s still impossible to submit corrections, for which there are many). Crowd-sourcing is a double-edged sword, since managing it can take up significant resources, and online communities are in general unstable, erupting into open revolt sooner or later. Stack Exchange has seen a massive exodus of competent community managers and moderators. These people know something about how to do such things. Try tracking some down and reaching out for advice on organizing a collaborative volunteer force to help you with data enhancement.

      >> most of archive.org and openlibrary.org is crowd-sourced in different ways, but corrections are sometimes difficult to manage. Spam fighting is persistent – what a problem. We spend endless hours dealing with that crap.

      But we are staffing up on Patron Services which deals with end-users, and working through backlog. If people want to help on that type of work, we are carefully hiring.

      5. This is important because 100% automation is a tough target to achieve. What is more realistic is developing a software framework which can enable a small team to become very productive. Given the volume you deal with and your (comparatively) small resources, putting humans in the loop is not economically viable on a scale large enough to meet your intake rate. That’s why cultivating a volunteer community might be a good solution. Because you should be able to provide them with an interface which allows a small group to move a lot of volume.

      >> yes, some issues can be handled by non-technical people. This one will take technical solutions.

      6. Although AI algorithms do downright magical things regularly these days, it’s also quite common in image-processing tasks to benefit from clustering the tasks and “solving” them individually. What I mean by that is that the problem of “page layout analysis” (that’s what it’s called in the OCR community) for arbitrary publications is much more difficult than doing the same for a particular publication, because a particular newspaper (for example) has consistencies to its layout which simplify the matter considerably. So, rather than betting on a magic bullet for *all* your archival material, if you have a substantial collection (say several years of a weekly) which has special historical significance, it’s a much easier task to “tailor” a pipeline for that collection and its specific idiosyncrasies than it is to write something that works for pamphlets, broadsheets, posters, books, census tables, etc. Ideally, you would have a flexible software solution with parameters which can be tweaked for each collection individually for best results.

      >> I am glad you mentioned page analysis – this is coming up soon. We have periodicals for which we want to find article titles and authors. Agreed that working through sets of them, rather than all at once, makes sense.

      7. That said, AI these days is so amazing, I may be giving you bad advice. If you have enough labeled data, and a competent engineer, I really would be surprised if this is a “hard problem” compared to, say, the image processing for self-driving cars.

      >> I am hoping as well. Internet Archive engineers are starting to use some of these tools. Mostly we have been using others’.

      8. Rather than trying to “improve OCR”, a more productive approach might be to work on software (a script, really) that preprocesses your images in order to make them more palatable to your existing OCR solution. Current OCR offerings are, generally, staggeringly good. You just need to feed them images that match their expectations. For example, the Frederick Douglass image you offered as being out of scope for your OCR solution, with some very basic image processing in GIMP, produces very good results with the free OCR program Tesseract, which is quite good but by no means on par with current state-of-the-art commercial offerings.
      If you take my advice and tailor the preprocessing to a collection, you can expect extremely good results.
      Image processing is a specialized skill, but honestly, a little goes a long way, and from what I’ve seen, it only takes some undergraduate-level techniques to get you most of the way there. Writing a page segmentation algorithm that does well enough on a newspaper should not be that difficult; the methods are known and widely available.
      Tesseract itself (open source) has code for layout analysis which could be customized for better results. The authors also published a paper on how to do table detection in documents. This is really standard stuff nowadays. There’s some engineering involved, but no real need for PhDs or geniuses.

      >> agreed. I hoped my post would find folks who would be interested in trying these things.

      9. Amazon and Google both have extremely good computer vision services in their cloud offerings, tailored specifically for organizations wishing to scan reams of documents. It would be uneconomical for you to pay their price, and as an organization with strong values of freedom and privacy, I expect you’re not keen on partnering with them. That shouldn’t prevent you from using their tech on a sample of your data in order to gauge the performance. This can be done quickly and cheaply. If you find that their tech can get close to 100% for what you throw at it, you can be confident that you can get something close to that performance (though probably not quite as good) with open source and some elbow grease.

      >> Carl Malamud has been using Google’s OCR, with good results, for scripts we do not handle with ABBYY. It is price-prohibitive at retail prices. One hope is that they would give us an in-kind grant, as they did with Kalev’s work finding COVID information in our TV and radio collections.

      I apologize for posting such a long and overbearing rant in response, but the issues you seek help on really are so obviously solvable, it’s a little frustrating to hear that your org is finding it difficult to deal with them. Librarians are awesome, but techies are too. You should maybe get some on your staff. They can do a lot for your mission.

      >> we have both librarians and techies, but we need more help. We hope to hire well, but also to work with other organizations and with volunteers.

      Best of luck! The archive is totally awesome!

      >> thank you. stick with us. we are trying.

      1. Joe Smith

        > as they did with Kalev doing work finding covid information in our TV and Radio collections.

        Is there any chance you could point me towards more info about this? Sounds very interesting.

    2. Brian Isett

      Heavily agree with this comment. I came here to write a somewhat briefer version suggesting a Kaggle competition or similar. Specifically: DrivenData is a social-good challenge site similar to Kaggle, but it might have a more relevant audience for this type of project. You could reach out to them about whether they provide assistance with challenge leadership (I also know someone who works there, so I could ask if for some odd reason they didn’t respond to IA knocking lol). As mentioned above, the labeled representative data set is key.

      I would love to be involved in some capacity (not sure what it would be), but I work with a lot of data analysis and machine learning currently. Not sure if you have used NSF XSEDE supercomputing grants but they have an often underutilized social sciences division (see funding branch 5.8). I got a seed grant from them and it was very useful for my project (object tracking in 92 million video frames). I might be qualified to help with challenge leadership, but I haven’t done it before. Good luck!

  13. Trevor

    I’m not entirely sure what you guys are asking for in terms of help. It sounds like you guys need people to go through your OCR data in order to manually correct the columns and articles. If this is the case, I would love to volunteer some of my spare time in order to help.

  14. Brett Bobley

    Hi Brewster! Thanks for posting this. I agree 100%. Better OCR for historical materials is a major priority for my organization, the National Endowment for the Humanities, as it helps not only the general public but also researchers across nearly every humanities and scientific domain that uses digitized materials.

    Northeastern University wrote a great report (& Internet Archive contributed to it) that lists some of the technical and social “next steps” we need to take to improve OCR. You can read the report here: https://ocr.northeastern.edu/report/. We are keen to help fund research that improves OCR.

    Brett

  15. John Levin

    I’ve been working on the automatic correction of OCR’d text, especially for eighteenth-century volumes using fonts with ligatures and the ‘long s’. Many errors are predictable, and thus susceptible to automated correction; it’s as simple as correcting ‘againft’ or ‘againjt’ to ‘against’. This won’t produce perfect text, but it harvests a lot of low-hanging fruit.
    In particular I’ve been correcting volumes of British statutes:
    http://statutes.org.uk/site/
    for which I’ve drawn on volumes in the Internet Archive. I’d love to be able to feed corrections back upstream.
    I think there’s quite a lot of work on this going on in the librarian, archivist and digital humanities / digital history worlds – see the comment from Ben Brumfield above.
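
    For illustration, a minimal sketch of that kind of rule-based cleanup (the substitution table is a tiny illustrative sample; a real one would be built from observed long-s errors and checked against a word list):

        import re

        # A few common long-s misreads; a production list would be far larger.
        SUBSTITUTIONS = {
            "againft": "against",
            "againjt": "against",
            "firft": "first",
            "fuch": "such",
            "Houfe": "House",
        }

        PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, SUBSTITUTIONS)) + r")\b")

        def correct_long_s(text):
            # Whole-word replacement only, so legitimate words containing 'f' are left alone.
            return PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(1)], text)

        print(correct_long_s("It was firft enacted againft fuch practices."))
        # -> It was first enacted against such practices.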

    John Levin

    1. Brewster Kahle Post author

      We are getting better at taking feedback like this, but there are still issues. Our OCR comes out as abbyy.xml, which we then derive into other formats.

      Our full-text search uses the ABBYY output, so if that is not fixed, corrections will not show up there.

      But we would love to get your fixes in.

      -brewster

      1. E. Höffner

        There has been a discussion on twitter: https://twitter.com/jbaiter_/status/1293849532250423298?s=20

        The OCR reader the Internet Archive is using does not have the ability to detect and read Fraktur. Almost all texts in German up to around 1930 were printed in Fraktur (the Nazis switched to Antiqua because Fraktur was supposedly a Jewish font — according to the Nazis). Depending on the quality of the source, the output from Tesseract (which is free) is fair.

  16. Christopher K. Philippo

    Would love to help!

    Aside from OCR issues, enhancing search capabilities would also be welcome. If one wanted to search for content in a December issue of an 1851 journal, for example, and it is bound in a volume containing all issues from 1850-1855, a Google Books search could not be made that specific. It would return hits on search terms from issues published in other months and years within that bound volume. Getting that search to be more effective would seemingly have to involve not just fixing bad OCR but also somehow coding the months and years within a bound volume, so that a search could be narrowed to them.

    Somewhat related: I love fultonhistory.com’s fuzzy search and boolean search capabilities. Helps make up for bad OCR in many cases. I had also liked Genealogybank.com’s poem and song search, which while imperfect (it missed many and misidentified a number of advertisements and lists as poems or songs) was still useful sometimes.

    1. Brewster Kahle Post author

      For new periodicals going up, we break things up by issue, which should help your specific case. Our full-text search is getting revved, so not everything is indexed yet.

      how might you want to help? (we are always looking for help)

  17. Carol Schroeder

    The Smithsonian has a large volunteer group of people who type up handwritten and hard-to-read historic documents written by scientists, naturalists, adventurers, etc. Searching for “Smithsonian Transcription” in Google should take you to it. They can help you set up a “volunpeer” system, which also includes reviewing of others’ work, if needed. Anyone can help, and there is a chart of ongoing projects and the percent finished. This is a different task, but their way of organizing people and projects might help. Good luck.

  18. Ben W Brumfield

    What about developing a program that could merge corrected plaintext into raw abbyy XML, producing abbyy XML containing corrected text and (as far as possible) modified bounding boxes?

    If the raw OCR XML did not have significant layout errors (like mis-read columns or split lines due to image skew) and the quality of the OCR text were not terrible, it should be possible to do a fuzzy match between tokens in the corrected plaintext and the text contents of the raw XML elements, then replace those text nodes with the corrected tokens while retaining the XML bounding box information. More sophisticated approaches could merge XML nodes and boxes (in cases of incorrect OCR splitting), split nodes and boxes (in cases of incorrect merging), or compare lines in the plaintext against lines in the XML to handle lower-quality OCR less amenable to fuzzy text matching.
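
    For illustration, a minimal sketch of the one-for-one case, assuming the Internet Archive djvu.xml layout in which each word is a <WORD coords="..."> element (the merge/split handling described above is left out):

        import difflib
        import xml.etree.ElementTree as ET

        def merge_corrections(djvu_xml_path, corrected_text, out_path):
            tree = ET.parse(djvu_xml_path)
            words = tree.findall(".//WORD")             # word nodes carrying coords attributes
            ocr_tokens = [(w.text or "") for w in words]
            fixed_tokens = corrected_text.split()

            matcher = difflib.SequenceMatcher(None, ocr_tokens, fixed_tokens)
            for op, i1, i2, j1, j2 in matcher.get_opcodes():
                if op in ("equal", "replace") and (i2 - i1) == (j2 - j1):
                    # One-for-one alignment: swap in the corrected token, keep the box.
                    for node, token in zip(words[i1:i2], fixed_tokens[j1:j2]):
                        node.text = token
                # Insertions, deletions, and uneven replacements would need the
                # box-splitting/merging logic sketched above; skipped here.

            tree.write(out_path, encoding="utf-8")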

    Such a program would be generally useful to the community, as–in addition to supporting IA’s ingestion of corrected OCR into a format supporting your existing downstream processes–it would allow text correction platforms to produce more useful export formats. Such an approach might not be limited to the DjVu XML IA uses, but could be applied to ALTO, hOCR, or PAGE.

  19. Roger W Burdette

    Much of the “problem” with OCR of typed and typeset 19th-century and pre-IBM Selectric text is due to the obsolete definitions embodied in all “modern” OCR products. Most of the solution set is quite simple, although it takes longer to get to at least 5-sigma accuracy than OCR does with computer-generated or modern printed materials.

    If IA is interested, please contact me.

  20. Clemens Neudecker

    Here at the Berlin State Library (https://staatsbibliothek-berlin.de/en/, SBB), we are working hard on both:

    1. Qurator
    Project to improve digital curation for digitized cultural heritage using AI/ML methods – we have a highly competitive (albeit still under heavy development) pipeline for binarization -> layout analysis -> textline detection -> multi-model OCR with voting -> OCR post-correction -> Named Entity Recognition and Linking, all open source: https://github.com/qurator-spk. There are some slides (only in German, I’m afraid) here: https://www.slideshare.net/cneudecker/kuratieren-mit-knstlicher-intelligenz.

    2. OCR-D
    Already mentioned in an earlier comment, OCR-D is a fully open project to develop suitable components for OCR of historical printed materials. With good trained models, such as those listed at https://ocr-d.de/en/models, a lot is already possible now compared to what commercial OCR software can deliver. Again, all code and development is fully open at https://github.com/OCR-D, with an open chat at https://gitter.im/OCR-D/Lobby.

    Since historic newspapers are one of our core strategic targets at SBB, we put particular effort into making DIA/OCR tools that also work for historical newspapers.

    We would be very happy to explore any possible collaborations in this area!

  21. Art Rhyno

    There is a synergy here with the Impresso newspaper workshop that was held virtually in April, see the shared summary document that resulted:

    https://docs.google.com/document/d/1nYIyuKFLPjgL4VbkKjI338Hh7vy68raCTNoC_vCPT2I/edit#

    especially the opening statement: “What can we do/should we be focussing on over the next 5-10 years to improve OCR quality of existing newspaper archives whose OCR quality is not ‘good enough’ to be useful to historians?” I see the OCR-D project has already been mentioned; both the PAGE and ALTO formats open some doors for correction tools/projects like Aletheia, Transkribus and eScripta. I think the latest builds of Tesseract would give better recognition than what’s been posted, and Tesseract also now supports ALTO export.

  22. Katy Alexander

    Hi, I was deputy librarian at The Star newspaper for 5 years, many years ago. I investigated using OCR as a way to access the 100 years of information in that newspaper and others like it held by the City of Johannesburg in bound volumes and on microfilm, to no avail. The microfilm creators said they could rescan and OCR the newspapers from microfilm; apparently it is easier to OCR a negative? My suggestion is that a distributed model using the image, the OCRed text and human editing be organized by product, or type of artifact, rather than by distributed pages. This would probably get you a quicker result for things like Frederick Douglass’s newspaper. Every newspaper has a style and format which would allow people to access the different types of information at different rates of importance. Do the news, editorial and personals first, and then the Brylcreem adverts. Can I also suggest you reach out to the New York City Library? They have some pretty amazing people there who might have ideas. If we could get this right it would be epic.

  23. Lars Aronsson

    Brewster, you need to take the columns (segmentation, zoning) as a separate issue.

    Your existing OCR process (based on Finereader software) already outputs coordinates for each word (in …djvu.xml) which indicate where it has found the columns. If you reduce these word coordinates to column coordinates, a simple web application could display these as boxes laid over the scanned image. And then a user (with an archive.org login) could redraw the boxes and click “save”, to reshape/repair the interpretation of where the columns are. When redrawn boxes are saved, each box (cut out from the image) can be sent separately to OCR (Finereader or Tesseract) for reprocessing.
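
    For illustration, a rough sketch of reducing those word coordinates to column boxes (the coords order and the pixel gap threshold are assumptions; a real version would cluster more carefully and handle headlines that span columns):

        import xml.etree.ElementTree as ET

        def column_boxes(djvu_xml_path, gap=60):
            # Collect a box for every word; coords are assumed to be
            # "left,bottom,right,top[,baseline]" as in the djvu.xml output.
            words = []
            for w in ET.parse(djvu_xml_path).findall(".//WORD"):
                left, bottom, right, top = map(int, w.get("coords").split(",")[:4])
                words.append((left, top, right, bottom))

            # Group words whose left edges line up; each group's bounding box is a
            # candidate column that a reviewer could redraw in the web interface.
            columns = []
            for left, top, right, bottom in sorted(words):
                for col in columns:
                    if abs(col["left"] - left) < gap:
                        col["right"] = max(col["right"], right)
                        col["top"] = min(col["top"], top)
                        col["bottom"] = max(col["bottom"], bottom)
                        break
                else:
                    columns.append({"left": left, "top": top, "right": right, "bottom": bottom})
            return columns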

    I’m a little surprised that nobody has done this already. The Australian national library’s newspaper digitization project might have something similar, with crowdsourced identification of columns. (Or maybe it was only proofreading within given columns?)

    1. Ian Scales

      The NLA’s Trove newspaper digitisation project that you mention actually uses paid staff to define the page zones (e.g. to separate columns, to separate articles, and to separate image captions from column text).

      They also attach metadata to each zone (e.g., to link zones that belong to a single newspaper article, then attach an identifier to that article, add a title, and manually add the first four lines of text, to override OCR errors in the critical first lines so that the article has a better chance of search discovery). That is the most labour-intensive part of the operation, but it is still quite fast.

      Once that process is completed, the publication is released and the OCR is corrected by crowdsourcing.

  24. DAVID BERNARD WIRTZ

    I would like to help in some way. I have needed a project desperately since I stopped teaching. If you can use my skills, I will help; I even may be able to travel!

    I have a BA from Columbia University in NYC (‘90) in CompLit (with a minor in Creative Writing) and a Masters from Columbia University Teachers College (‘93) in “The Teaching of English Reading and Writing.” I have four years ESL/TESOL, Proofreading, and Ghost Writing experience overseas: one year at an international language school, then as needed at a clandestine government agency, followed by three wonderful years at university level (Sookmyung Women’s University, Seoul ‘93-‘97). I hired on for eight years at the University of Phoenix teaching English writing, reading, and grammar, and ten years as a high school English, Journalism, Yearbook and LitMag teacher.

    I was in on the beginning of personal computers (IBM’s PS2), and I remember my DOS; I now use Apple products, however, so my DOS is probably rusty. I do learn quickly, though, and I am highly computer literate as a user.

    Please contact me if there may be a slot you believe I could fill, no matter the level. Thanks!

  25. Jonathan Berry

    The fonts are clear enough and there’s a highway through the middle of the valley telling the OCR what is what. But the OCR result doesn’t look better than it did in the 1990s. Maybe the OCR world slightly resembles the OS world, with us guessing which cup the pea is under, rather than seeing steady improvement in recognition performance.

  26. Pingback: Reading List – Summer 2020 – tess van stekelenburg

  27. Liz

    I wonder if we can utilize these for keyboarding assignments – when students are doing repetitive training (not just learning/muscle exercises). These would be far more interesting to read than “the old dog ran down the hill” 50 times!

  28. Michael May

    Is this post related to the “Search Inside” feature not working on texts at the Internet Archive for the last few months? There are scant details on the forums: “yes, search inside indexing has been paused. i don’t have an estimate of when it might start again.” If you could provide more details here, it would be appreciated. If this is a long-term project, resuming OCR now, however faulty, would be much better than having none for an extended period. Thank you.

    1. Michael May

      And now it seems to be working again after several months. Thank you, if someone there restarted it, or even if it was just a coincidence. I am hoping you come up with creative ways to improve OCR, since it is so important to preserving history.

  29. Catherine Broué

    The READ Consortium, based in Innsbruck, has developed technology for manuscript and print recognition through its intelligent software TRANSKRIBUS. The consortium has numerous partners (researchers and libraries) around the world. You can try their software online for free.

  30. Eric Zhou

    I’m a current first-year Master’s student in Computer Science at Carnegie Mellon. I would be very willing to participate in an “Archive Summer of CultureTech.”

  31. Antonio Giron

    Have you considered Transkribus?

    If not, check out https://transkribus.eu/Transkribus/

    This is a project developed at the University of Innsbruck in Austria, based on work done at Rostock University in Germany and the University of Valencia in Spain. The software is actually meant for handwritten material, no less, so it should be able to do a good job of OCR’ing printed material, no matter what obscure script it may have been printed in.

    The concept is that you train the algorithm by hand-transcribing a number of words (five thousand are claimed to be sufficient for printed material, while much larger samples are obviously required for handwriting). The software does have tools for describing the layout of the document to the OCR engine. The web site supports trials, where you can upload your stuff and let the system perform its magic. My suggestion is that a very small group of volunteers could transcribe and prepare one sample of one collection, and then OCR the whole collection using that training.

    I am not aware of any charges for usage, as the project is funded by the European Union. I suspect they would be only too happy to work with the Internet Archive.

    I am not affiliated to this effort in any way, and have no direct experience with it, but I have heard good things about it.

  32. Karl Ochsner

    The old approach: let people decode it for you using “Are you a robot?” validation. Scan a sentence at a time… and have people decipher it by typing what they see.

  33. Ambrose

    This is a bit random, but maybe people could read the text out loud into their computers and you could recognise the speech instead of recognising the text?

  34. Michael Murdoch

    I think Lars has the right idea.

    BTW, @Lars Aronsson they have done this already. Newspapers.com off the top of my head. Ancestry.com owns them. Obviously, old documents matter for doing genealogy.

    I would just partner with newspapers.com or use them to power it. I’m familiar with the Brooklyn Daily Eagle and that’s OCR’d pretty well – same period. They use “clippings” which are OCR’d individually afaik.

    Anyway, failing that –

    The problem is user error (you’re using the wrong tools if you can’t scan columns; it’s basic location segmentation), the scans (they’re poor and off-center), and, to a much, much lesser extent, probably the fonts and occasionally moiré patterns (depending on whether they’re scans of the paper or of microfiche/film).

    I’ve done a project like this and I’ve scanned a number of older books with a text layer. Easiest thing is scanning using ABBYY and using human readers to fix this. AFAIK, you’re already using something like that, probably sub-optimally. I really need to know specs on what software you’re using now and what you’re scanning from.

    I could think of something more complicated for the workflow, if you can’t bear licensing costs, but to be frank, it might not be worth it. Partnering with newspapers.com or whoever may work better. And using humans probably works best.

    I’ve found digital archivists and people in library science and “Digital Humanities” are not generally serious about this or up on the tech and usually just do this for an article, publicity or fundraising $$$ and not to actually solve a problem. I really need to gauge your level of seriousness before I waste any significant amount of time here. Some of the comments here are by well meaning people who have no idea what they’re talking about.

    I know a lot of people this may be of interest to. We preserve books, generally anonymously. I think if you just post the newspapers and let us solve the problem – with a reward or recognition or something, like they do with bug bounties – it would really incentivize things. To be frank, many of the people commenting don’t even seem to know what ABBYY or segmentation is.

    A while ago, I spent a lot of time typing up a fairly complicated solution and workflow conversion process, using virtualization for an older version of a famous figure’s music, and called someone I know who had a studio and legal versions of rather old software, only for the archivist to not even bother returning my call. Surprise! It was fundraising season, and they got a story placed in the NYT Magazine about an “unsolvable” problem that took me a weekend and a day to solve. People prefer a mystery to a functioning archive, I guess, when it comes to raising $$$.

    Anyway, I don’t know your level of seriousness. If this is something where you’ll credit or compensate me in some way, I can probably help. If it’s for PR, I can’t.

  35. Twingsister

    I am sorry, I have no experience in OCR, and yet I’d like to contribute this idea. I have found that reCAPTCHA raises some concerns, and adopting a FOSS reCAPTCHA-style service would be appreciated by many aware internet users and developers.

    I think it would not be too difficult to develop an Internet Archive-based reCAPTCHA service that qualifies a client as non-bot on the basis of two pieces of text, one computer-generated and one extracted from the not-easily-OCRed scans.

    The Wikipedia page about reCAPTCHA reports that it has already been used to digitize the New York Times archive. There you can also find figures suggesting that “our” reCAPTCHA could enjoy up to a billion free transcriptions of a couple of words every day. From the Wikipedia page’s “Size in Volumes” figures one can learn that Britannica contains 44 million words, so even capturing a small fraction of the reCAPTCHA traffic could do a great job.
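
    To put rough numbers on the figures cited above, a quick back-of-the-envelope calculation (the one-new-word-per-challenge figure is a conservative assumption about how such a service would work):

        britannica_words = 44_000_000        # per the Wikipedia figure cited above
        daily_challenges = 1_000_000_000     # the "up to a billion a day" figure above
        new_words_per_challenge = 1          # conservative: one unknown word per challenge

        challenges_needed = britannica_words / new_words_per_challenge
        print(challenges_needed / daily_challenges)   # 0.044 -> under 5% of one day's traffic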

    Even just placing an Internet Archive-based reCAPTCHA on the Internet Archive itself, with a mention of the fact that users are helping to preserve cultural heritage, could be a bright start.

  36. Twitter from Slashdot

    Contact a bank and see if they will help you. I worked at a regional check processing center briefly about 15 years ago. They were able to read handwriting, front and back of the check, almost perfectly and they sorted it all into neat database fields. The group I was with looked for errors when the sums did not add up, but the software showed us all of the fields in all their privacy busting glory. I imagine their software is better today, making fewer errors and hiding the other stuff.

    It would be just as interesting to get a refusal as it would be to get assistance. If you are lucky, you might find your way back to the IBM people who set up banks. If not, you will understand better what the people who run our world really value.

    Happy Hacking.

  37. Christopher Kermorvant

    At TEKLIA, we have developed a platform for scanned-document processing (layout analysis, classification, OCR/HTR, named-entity extraction) at large scale. We are more specialized in handwriting recognition, but we would be happy to work with people from OCR-D to train OCR models on your data and integrate them into our platform. Let’s have a meeting to see how we can collaborate.

    Christopher

Comments are closed.