In 1847, Frederick Douglass started a newspaper advocating the abolition of slavery that ran until 1851. After the Civil War, there was a newspaper for freed slaves, the Freedmen’s Record. The Internet Archive is bringing these and many more works online for free public access. But there’s a problem:
Our optical character recognition (OCR) software, while the best commercially available, is not very good at identifying text in older documents.
Take, for example, this newspaper from 1847. The images are not that great, but a person can read them:
The problem is that our OCR software misreads the text and confuses the columns.
What we need is “Culture Tech” (a riff on fintech or biotech) and Culture Techies to work on important and useful projects: the things we need, but that are unlikely to attract gushers of private equity funding. There are thousands of professionals taking on similar challenges in the field of digital humanities, and we want to complement their work with industrial-scale tech that we can apply to cultural heritage materials.
One such project would be to work on technologies to make 19th-century documents fully digital. We need to improve OCR to enable full-text search, but we also need help segmenting documents into columns and articles. The Internet Archive has lots of test materials, and thousands of people are uploading more documents all the time.
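To give a flavor of the column problem, here is a minimal sketch of one classic approach, a vertical projection profile: count the ink in each pixel column of a page and treat wide blank runs as gutters between text columns. This is not how the Internet Archive processes scans. The function name `find_columns` and its parameters `min_gutter` and `max_ink` are hypothetical, and the sketch assumes a page that has already been binarized and deskewed, which real 19th-century scans rarely are.

```python
from typing import List, Tuple


def find_columns(
    page: List[List[int]],
    min_gutter: int = 2,   # assumed: blank run (in pixels) that counts as a gutter
    max_ink: int = 0,      # assumed: ink count at or below this is "blank"
) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges of text columns, end exclusive.

    `page` is a binarized image given as rows of pixels: 1 = ink, 0 = background.
    """
    if not page:
        return []
    width = len(page[0])

    # Vertical projection profile: number of ink pixels in each pixel column.
    profile = [sum(row[x] for row in page) for x in range(width)]

    columns: List[Tuple[int, int]] = []
    start = None     # x where the current text column began, if any
    blank_run = 0    # length of the current run of blank pixel columns

    for x, ink in enumerate(profile):
        if ink > max_ink:
            if start is None:
                start = x
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gutter:
                # The gutter is wide enough: close the column where the ink ended.
                columns.append((start, x - blank_run + 1))
                start = None
                blank_run = 0

    if start is not None:
        # The page ended inside a column; trim any trailing blank pixels.
        columns.append((start, width - blank_run))
    return columns
```

On a toy page whose rows are all `[1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]`, this returns `[(0, 4), (7, 11)]`: two columns separated by a three-pixel gutter. Faded ink, skewed scans, and ruled lines between columns break this simple version quickly, which is exactly where we need help.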
What we do not have is a good way to integrate work on these projects with the Internet Archive’s processing flow. So we need help and ideas there as well.
Maybe we can host an “Archive Summer of CultureTech” or something… Just ideas. Maybe we could work with a university department that wants to build programs and classes around Culture Tech… If you have ideas or skills to contribute, please post a comment here or send an email to firstname.lastname@example.org with some of this information.