The old news

Until about a year ago, if you wanted to upload a set of individual page images and have them be recognized as a “book” so we’d create the usual derivative formats from them, you had to mimic the *_jp2.zip (“Single Page Processed JP2 ZIP”) archive files that are created automatically at our scanning centers. Making these from your own existing images is inconvenient and error-prone, due to the rigid expectations for how individual image files are named and organized into a directory structure. That route was also limited to JPEG2000 (“JP2”) image files.

Things changed with the introduction last year of our *_images.zip (“Generic Raw Book Zip”) format, which is much more flexible. If you provide a file whose name ends in _images.zip, we’ll make a *_jp2.zip from it: the *_images.zip file will be unpacked, its contents sorted alphabetically, and the set of images found within converted into a standard *_jp2.zip, which we’ll then process as usual.

In a bit more detail, the *_images.zip will be scanned for files it contains, at any directory level, whose names end with .jp2, .jpg, .jpeg, .tif, .tiff, .bmp or .png, matched case-insensitively; any other files (.xml, .txt, etc.) will be ignored. You can mix and match different image formats. All image files found will be sorted alphabetically (including any directory names, so that files originally in the same subdirectory stay together in the new sequence), converted to JPEG2000 if they’re not already, renamed the way our code expects, and packed into a new *_jp2.zip, leaving your *_images.zip in place as it was.
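If you'd like to check the page sequence before uploading, the selection and sorting described above can be approximated locally. The following is only a sketch: the function names are made up for illustration, and it assumes a plain code-point sort of the full path, whereas the exact collation our code uses (for instance, how upper and lower case compare) isn't spelled out here.

```python
import zipfile

# Extensions listed in the post; matched case-insensitively.
IMAGE_SUFFIXES = (".jp2", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".png")


def preview_page_order(names):
    """Return the image paths from `names` in alphabetical order.

    Hypothetical helper: sorting on the full path keeps files from the
    same subdirectory together. Python's default string sort is an
    assumption about the real collation.
    """
    images = [
        n for n in names
        if not n.endswith("/") and n.lower().endswith(IMAGE_SUFFIXES)
    ]
    return sorted(images)


def preview_zip(path):
    """List the would-be page order for a local *_images.zip file."""
    with zipfile.ZipFile(path) as zf:
        return preview_page_order(zf.namelist())
```

One thing such a preview makes visible: alphabetical order puts page10.jpg ahead of page2.jpg, so zero-padded names (page002.jpg, page010.jpg) are the safer choice when the sequence matters.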

For an example of how messy an *_images.zip we can deal with, see:

http://archive.org/download/hr100106/hr100106_images.zip/

The 589 image files found there were converted into:

http://archive.org/download/hr100106/hr100106_jp2.zip/

Note that the new *_jp2.zip, and the files it contains, are named according to the name of the original *_images.zip file (“hr100106”), regardless of how directories and files are named inside the *_images.zip. Those files and directories can be named any way you like; the names matter only in that they determine the sequence of the images in the new *_jp2.zip.

The new news

Now for what’s changed:

*_images.tar (“Generic Raw Book Tar”) is accepted as well as *_images.zip. Producing a tar file may be more convenient than producing a zip file for some uploaders, particularly if the file is going to be large. Older implementations of the zip format were limited to 4 GB, and some tools were known to produce files that we couldn’t read once the size exceeded 2 GB. Our advice in the past has been to use the 7-Zip tool for creating any zips larger than 2 GB. That still works, or you can now make a tar instead; the size of tar files is effectively unlimited.
Comic Book archive files are accepted. *.cbz (“Comic Book ZIP”) files are essentially zip archive files containing page images, typically as either JPEGs or PNGs. We now accept *.cbz files and treat them just like *_images.zip files. Similarly, *.cbr (“Comic Book RAR”) files are RAR archive files containing page images, and we now treat those just like *_images.zip files, too. So if you have any *.cbz or *.cbr files, just uploading them as is should result in having all the usual derivative formats created.
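For the tar route mentioned above, here is one way to pack a directory of page images using Python's standard tarfile module. Treat it as a rough sketch: make_images_tar and its parameters are hypothetical names, and writing an uncompressed tar is an assumption based on the plain .tar extension.

```python
import tarfile
from pathlib import Path


def make_images_tar(image_dir, identifier, out_dir="."):
    """Pack every file under image_dir into <identifier>_images.tar.

    Hypothetical helper. Paths inside the tar are stored relative to
    image_dir, so the subdirectory layout (and with it the alphabetical
    page order) is preserved. Returns the path of the new tar file.
    """
    image_dir = Path(image_dir)
    tar_path = Path(out_dir) / f"{identifier}_images.tar"
    # Mode "w" writes a plain, uncompressed tar (an assumption; see above).
    with tarfile.open(tar_path, "w") as tar:
        for f in sorted(p for p in image_dir.rglob("*") if p.is_file()):
            tar.add(f, arcname=f.relative_to(image_dir).as_posix())
    return tar_path
```

Write the tar somewhere outside image_dir so that it doesn't end up packing itself. Non-image files can be left in; as noted earlier, they're ignored.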

Internet Archive Blogs

A blog from the team at archive.org

Author Archives: Hank Bromley

Uploading images for text items (update on *_images.zip format)
