How Archive.org items are structured

What is an item?

An item is a logical “thing” that we present on one web page on archive.org. An item may be, for example, one video file along with scans of the DVD cover, one book, one audio file, or a set of audio files that represent a CD.

How do you know whether your files should be in one item or separate items?  You get one metadata file per item.  If the same metadata describes ALL of the files (like a CD), then that’s one item.  If the files are too different to have the same metadata (title, creator, description, etc.), they should be in different items.

How Items Are Structured

All archive.org items have a URL in this format:
http://archive.org/details/[identifier]
(where [identifier] is unique within our system).

Example: For this item
http://www.archive.org/details/popeye_taxi-turvey
the identifier is popeye_taxi-turvey
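
As a small illustration, the identifier can be pulled out of a details URL programmatically. This is only a sketch: the function name identifier_from_details_url is made up for this example and is not part of any archive.org tool.

```python
from urllib.parse import urlparse


def identifier_from_details_url(url: str) -> str:
    """Return the [identifier] part of an archive.org details URL."""
    parsed = urlparse(url)
    # Split the path into its non-empty segments, e.g. ["details", "popeye_taxi-turvey"].
    parts = [segment for segment in parsed.path.split("/") if segment]
    if len(parts) < 2 or parts[0] != "details":
        raise ValueError(f"not a details URL: {url}")
    return parts[1]


print(identifier_from_details_url("http://www.archive.org/details/popeye_taxi-turvey"))
# prints: popeye_taxi-turvey
```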

An item is just a directory or folder of files that includes the originally uploaded content file(s) – audio, video, text, etc. – along with any derivative files we create from the originals and the metadata that describes the item.  To see all files in an item, click the HTTP link in the upper left box on the item page (circled in red below).

That link takes you to a directory listing showing all original, derived, and metadata files for the item.

You can view information about every file in this directory by viewing the file ending in _files.xml (in this example, popeye_taxi-turvey_files.xml). Each file in the item is listed here, along with its source: “original” (uploaded by the user), “derivative” (derived by archive.org), or “metadata”.  You will also find a format designation, various checksums, and sometimes titles for the files.
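
If you want to read a _files.xml file with a script, something like the sketch below may help. It assumes a layout in which each file is a <file> element with name and source attributes and a <format> child; check this against a real _files.xml before relying on it. The sample XML is invented for illustration (the .mp4 derivative and the format names are hypothetical), and the function name is ours.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, NOT copied from the real popeye_taxi-turvey_files.xml.
SAMPLE_FILES_XML = """<files>
  <file name="popeye_taxi-turvey.mpg" source="original">
    <format>MPEG2</format>
  </file>
  <file name="popeye_taxi-turvey.mp4" source="derivative">
    <format>MPEG4</format>
  </file>
  <file name="popeye_taxi-turvey_meta.xml" source="metadata">
    <format>Metadata</format>
  </file>
</files>"""


def group_files_by_source(xml_text: str) -> dict:
    """Map each source value (original, derivative, metadata) to its file names."""
    root = ET.fromstring(xml_text)
    groups = {}
    for file_element in root.iter("file"):
        # Assumed attribute names: "source" and "name".
        source = file_element.get("source", "unknown")
        groups.setdefault(source, []).append(file_element.get("name"))
    return groups


for source, names in group_files_by_source(SAMPLE_FILES_XML).items():
    print(source, names)
```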

To see all of the metadata for the item, view the file ending in _meta.xml (in this example, popeye_taxi-turvey_meta.xml). This file should list all of the pertinent information about the item, such as title, creator, description, etc.  IA’s metadata schema is based on Dublin Core, but it is extremely flexible.  You can add any key=value pair to this file and we will store it and make it searchable in the IA search engine.  (However, it may not automatically show up on the item page.)
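
The key=value nature of the metadata can be sketched as follows. This assumes that _meta.xml has a single root element whose children are the metadata fields, one element per key; the sample content, including the key my_custom_key, is invented for illustration. Values are collected into lists in case a key appears more than once.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, NOT the real popeye_taxi-turvey_meta.xml.
SAMPLE_META_XML = """<metadata>
  <identifier>popeye_taxi-turvey</identifier>
  <title>Example title</title>
  <creator>Example creator</creator>
  <my_custom_key>any value</my_custom_key>
</metadata>"""


def read_metadata(xml_text: str) -> dict:
    """Return a dict mapping each metadata key to a list of its values."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        # Each child element is treated as one key=value pair.
        fields.setdefault(child.tag, []).append((child.text or "").strip())
    return fields


metadata = read_metadata(SAMPLE_META_XML)
print(metadata["identifier"])
print(metadata["my_custom_key"])
```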

Reviews, if there are any, are contained in the _reviews.xml file.

One thing to note: Many “display” characteristics on archive.org, among other things, work better if your item’s identifier matches your file name.  So if you’re uploading a file called popeye_taxi-turvey.mpg, it’s best to use the identifier popeye_taxi-turvey (just remove the file extension).  If you’re using the upload button on archive.org, put your desired identifier in the Title field of the upload form.  We turn that into the identifier automatically, and then after upload you can go back into the item and change the title to something more readable.
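
If you prepare uploads with a script, the matching identifier can be derived from the file name by dropping the extension. A minimal sketch (the function name is ours, and it does not check whether the identifier is still free on archive.org):

```python
from pathlib import PurePath


def identifier_from_filename(filename: str) -> str:
    """Return the file name without its final extension."""
    # Only the last suffix is removed, so "a.tar.gz" becomes "a.tar".
    return PurePath(filename).stem


print(identifier_from_filename("popeye_taxi-turvey.mpg"))
# prints: popeye_taxi-turvey
```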

Archival URLs

An item’s “details” page will always be available at
http://archive.org/details/[identifier]

The item directory is always available at
http://archive.org/download/[identifier]

A particular file can always be downloaded from
http://archive.org/download/[identifier]/[filename]
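
The three URL patterns above can be built from an identifier and a file name. The helper names below are made up for this sketch; only the URL patterns come from this post.

```python
from urllib.parse import quote

BASE = "http://archive.org"


def details_url(identifier: str) -> str:
    """URL of the item's details page."""
    return f"{BASE}/details/{quote(identifier)}"


def download_dir_url(identifier: str) -> str:
    """URL of the item's directory listing."""
    return f"{BASE}/download/{quote(identifier)}"


def file_url(identifier: str, filename: str) -> str:
    """URL of one file inside the item."""
    # quote() percent-encodes characters such as spaces in file names.
    return f"{BASE}/download/{quote(identifier)}/{quote(filename)}"


print(details_url("popeye_taxi-turvey"))
print(download_dir_url("popeye_taxi-turvey"))
print(file_url("popeye_taxi-turvey", "popeye_taxi-turvey_meta.xml"))
```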

Please Note: Archival URLs may redirect to an actual server that contains the content.  For example
http://www.archive.org/download/popeye_taxi-turvey
currently redirects to
http://ia600204.us.archive.org/14/items/popeye_taxi-turvey/
DO NOT LINK to any archive.org URL that begins with a numbered server name like this (here, ia600204.us.archive.org).  It refers to the particular machine that we’re serving the file from right now, but we move items to new servers all the time.  If you link to this sort of URL, instead of the archival URL, your link WILL break at some point.
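
A quick way to check links before publishing them is to test the host name and path. The sketch below is a heuristic of our own, not an official rule: it accepts only archive.org or www.archive.org with a /details/ or /download/ path, and treats everything else (including server-specific hosts) as non-archival.

```python
from urllib.parse import urlparse

# Hosts used by the archival URLs shown in this post.
ARCHIVAL_HOSTS = {"archive.org", "www.archive.org"}


def is_archival_url(url: str) -> bool:
    """Return True if the URL looks like a stable archival URL."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return host in ARCHIVAL_HOSTS and parsed.path.startswith(("/details/", "/download/"))


print(is_archival_url("http://www.archive.org/download/popeye_taxi-turvey"))
# prints: True
print(is_archival_url("http://ia600204.us.archive.org/14/items/popeye_taxi-turvey/"))
# prints: False
```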

7 thoughts on “How Archive.org items are structured”

  1. Lars Aronsson

    It’s sad that the Internet Archive doesn’t provide any structure for series or collections of items. Defining such a structure is a lot of work, and it’s sad that it can’t be shared with other visitors of the Internet Archive.

    One example is my identification of the series and volumes of the Transactions of the Swedish Academy, which now resides on Wikimedia Commons instead of the Internet Archive, http://commons.wikimedia.org/wiki/Category_talk:Svenska_Akademiens_handlingar

  2. Andreas K. Förster

    Feature request: There should be a way to link items with each other. For example alternative recordings, or book scans with the LibriVox audio-book and so on.
    The linked item should automatically be linked back.

    I hope this was the right place to post such ideas…

    1. internetarchive

      Hi Andreas,

      I agree, that’s a great idea! We have a way to link collection pages to other related collections, but we currently don’t have a way to link items to one another (other than the user including an html link in their description, of course). As our collections grow, I think having this feature will become increasingly important – even just being able to link all the different editions/languages of a book together would be nice. I’ll make sure the team is aware of this request, though I can’t make any promises about delivery.

      Alexis

  3. Pingback: Downloading in bulk using wget | Internet Archive Blogs

  4. Jason Henderson

    I am wondering: Is there a way to change out the main file (Say a PDF file) and then have Internet archive update the derivative files associated with that original file? I find I am needing to make corrections to previously created items, but in order to ensure the derivative files correspond I have had to completely delete the item and reupload.

    1. internetarchive

      Hi Jason,

      You can do this, but it’s a tad convoluted.
      1. Click “edit item” link on your item, click “change the files”
      2. Remove the original file and all the derived files
      3. Upload a replacement file and click “Update item”
      4. On item page, click “edit item” > “change the information” > “item manager”
      5. Click the “derive” button

      It will take a little while for the derive to finish, depending on the size of your original file, but that should do it!
