Upcoming changes in epub generation

Epub is an ebook format used on book reader devices. It is mostly text, but can incorporate images. The Internet Archive offers epubs in two cases: when a user uploads them, and when they are created from other formats, such as scanned books or uploaded PDFs made up of images of pages.

The Internet Archive creates them from images of pages using “optical character recognition” (OCR) technology. The recognized text is then reformatted into the epub format (currently epub v2). These epubs are sometimes created “on-the-fly” and sometimes created as files and stored in our item directories. All “on-the-fly” epubs use the newest code, whereas stored ones use the code that was available at the time of generation.

Because of a change in the output format of our OCR engine last August, many of the epubs generated between then and last week were faulty. Newly generated epubs are now fixed, and we will soon go back and fix the faulty ones that were stored. We have also discovered that some of the older epubs are faulty, and it is difficult to know which.

To fix this, we are shifting to “on-the-fly” generation for all epubs, so that every epub gets the newest code. This is how we already generate daisy, mobi, and many zip files. To access the epub for a book we have scanned, use a URL of the form https://archive.org/download/ID/ID.epub, for instance https://archive.org/download/recordofpennsylv00linn/recordofpennsylv00linn.epub.
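
As a minimal sketch of the URL pattern above, the following Python builds the epub address from an item identifier. The function name epub_url and the constant DOWNLOAD_BASE are made up for illustration and are not part of any Internet Archive library; the only thing taken from this post is the https://archive.org/download/ID/ID.epub pattern.

```python
from urllib.parse import quote

# Base of the download URL pattern described in this post.
DOWNLOAD_BASE = "https://archive.org/download"


def epub_url(identifier: str) -> str:
    """Build the epub URL for a scanned-book item.

    Hypothetical helper: it only formats the pattern
    https://archive.org/download/ID/ID.epub and does not check
    that the item exists or that an epub can be generated for it.
    """
    # Percent-encode the identifier in case it contains unusual characters.
    safe_id = quote(identifier, safe="")
    return f"{DOWNLOAD_BASE}/{safe_id}/{safe_id}.epub"


if __name__ == "__main__":
    print(epub_url("recordofpennsylv00linn"))
```

Called with the identifier from the example above, this yields the same recordofpennsylv00linn URL shown in the text.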

More generally, an epub can be generated for an item when two conditions hold: the ocr field in the item’s meta.xml does not say “language not currently OCRable”, and the item contains an abbyy format file. For instance, if an item’s file list includes an abbyy file downloadable at http://archive.org/download/file_abbyy.gz, a corresponding epub file can be downloaded at http://archive.org/download/file.epub.
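
The rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code the Internet Archive runs: the function name epub_names is hypothetical, the caller is assumed to have already read the file list and the ocr field from meta.xml, and the match on the “language not currently OCRable” text is assumed to be a case-insensitive substring check.

```python
from typing import Iterable, List, Optional

# Suffix of the abbyy OCR file, as in the file_abbyy.gz example.
ABBYY_SUFFIX = "_abbyy.gz"
# Marker text in the meta.xml ocr field (compared in lower case; an assumption).
NOT_OCRABLE = "language not currently ocrable"


def epub_names(file_names: Iterable[str], ocr_field: Optional[str] = None) -> List[str]:
    """Return the epub file names that could be generated for an item.

    Hypothetical helper: file_names is the item's file list and
    ocr_field is the text of the ocr field in meta.xml, or None
    if the item has no such field.
    """
    # Items flagged as not OCRable have no epub to generate.
    if ocr_field and NOT_OCRABLE in ocr_field.lower():
        return []
    # Each abbyy file "X_abbyy.gz" corresponds to an epub "X.epub".
    return [
        name[: -len(ABBYY_SUFFIX)] + ".epub"
        for name in file_names
        if name.endswith(ABBYY_SUFFIX)
    ]


if __name__ == "__main__":
    print(epub_names(["file_abbyy.gz", "file.pdf"]))
    print(epub_names(["file_abbyy.gz"], "language not currently OCRable"))
```

The first call maps file_abbyy.gz to file.epub; the second returns an empty list because the ocr field marks the item as not OCRable.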