Category Archives: Technical

Site down some of Tuesday and Wednesday for Power Upgrade

[Update: The upgrade is done. We were offline twice, as we predicted (and are sorry about), but now we have twice the power.

New transformer for the Internet Archive Building.


Thank you PG&E, Ralf Muehlen, and the Archive engineers.]

This week, we are doubling the power coming into our primary data center so that we can archive and serve even more web pages, books, music and moving images. During those upgrades, there will be times when many of our web sites and services will not be available. Details below.

To keep the data safe, we will proactively shut down most of our services served from our primary data center. archive.org, openlibrary.org, iafcu.org and our blogs will be unavailable during the outages. The upgrades will happen over a two-day period. We anticipate two prolonged outages: the first from about 7am to 12noon PDT (14:00-19:00 UTC) on Tuesday, April 16, and another from 3pm to 7pm PDT (22:00-02:00 UTC) on Wednesday, April 17. Work might require additional outages between those two major ones.

During the outages, we’ll post updates to our @internetarchive twitter feed. Sorry for the inconvenience.

Update: To be on the safe side, we’ll expand Wednesday’s outage window from 2:15pm PDT to 7:15pm PDT (21:15-02:15 UTC). For some of our services, the actual outages might be shorter.

Celebrating 100 million tasks (uploading and modifying archive.org content)

Just over 8-1/2 years ago, I wrote a multi-process daemon in PHP that we refer to as “catalogd”.  It runs 24 hours a day, 7 days a week, no rest!

It is in charge of uploading all content to our archive.org servers, and all changes to uploaded files.

We recently passed the 100 millionth “task” (upload or edit to an archive “item”).

After starting with a modest 100 or so tasks/day, we currently run nearly 100,000 tasks/day.  We’ve done some minor scaling, but for the most part, the little daemon has become our little daemon that could!

Here’s to the next 100 million tasks at archive.org!

-tracey

new mp4 (h.264) derivative technique — simpler and easier!

Greetings video geeks!  😎

We’ve updated the way we create the .mp4 files that are shown on video pages on archive.org.

It’s a much cleaner/clearer process, namely:

  • We opted to ditch ffpreset files in favor of 100% equivalent command-line arguments.  It seems a bit easier for someone reading the task log of their item, trying to see what we did.
  • We no longer need the qt-faststart step and have dropped it.  We now use modern ffmpeg’s command-line option “-movflags faststart”.
  • The entire processing is now done 100% with ffmpeg, in the standard “2-pass” mode.
  • As before, this derivative plays in modern html5 video tag compatible browsers, plays in the flash plugin within browsers, and works on all iOS devices.  It also makes sure the “moov atom” is at the front of the file, so browsers can start playback before downloading the entire file.
Here is an example (you would tailor the “scale=640:480” to your source aspect ratio and desired output size; change or drop the “-r 20” option altogether (our source was 20 fps, so we make the destination 20 fps); and tailor the bitrate args to taste):
  • ffmpeg -y -i stairs.avi -vcodec libx264 -pix_fmt yuv420p -vf yadif,scale=640:480 -profile:v baseline -x264opts cabac=0:bframes=0:ref=1:weightp=0:level=30:bitrate=700:vbv_maxrate=768:vbv_bufsize=1400 -movflags faststart -ac 2 -b:a 128k -ar 44100 -r 20 -threads 2 -map_metadata -1,g:0,g -pass 1 -map 0:0 -map 0:1 -acodec aac -strict experimental stairs.mp4;
  • ffmpeg -y -i stairs.avi -vcodec libx264 -pix_fmt yuv420p -vf yadif,scale=640:480 -profile:v baseline -x264opts cabac=0:bframes=0:ref=1:weightp=0:level=30:bitrate=700:vbv_maxrate=768:vbv_bufsize=1400 -movflags faststart -ac 2 -b:a 128k -ar 44100 -r 20 -threads 2 -map_metadata -1,g:0,g -pass 2 -map 0:0 -map 0:1 -acodec aac -strict experimental -metadata title='"Stairs where i work" - lame test item, bear with us - http://archive.org/details/stairs' -metadata year='2004' -metadata comment=license:'http://creativecommons.org/licenses/publicdomain/' stairs.mp4;
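If you’re not sure of your source’s frame rate or dimensions (for the “-r 20” and “scale=640:480” choices above), here’s a hedged sketch using ffprobe (it ships with ffmpeg) to check a source file first:

  • ffprobe -v error -select_streams v:0 -show_entries stream=width,height,r_frame_rate -of default=noprint_wrappers=1 stairs.avi

That prints the video stream’s width, height, and frame rate, which you can then plug into the scale= and -r arguments.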

Happy hacking and creating!

PS: here is the way we compile ffmpeg (we use ubuntu linux, but it works on macosx, too).
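In case that link isn’t handy, here’s a rough sketch of the kind of configure flags the commands in these posts imply; it is not our exact recipe. libx264 requires --enable-gpl, and the theora/vorbis flags are only needed for the ogg derivatives described in a later post.

# rough sketch only -- not our exact build recipe
./configure --enable-gpl --enable-libx264 --enable-libtheora --enable-libvorbis
make
sudo make install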

new video and audio player — video multiple qualities, related videos, and more!

Many of you have already noticed that since the New Year, we have migrated our new “beta” player to be the primary/default player, then to be the only player.

We are excited about this new player!
It features the very latest release of jwplayer from longtailvideo.com.

Here are some new features/improvements worth mentioning:

  • html5 is now the default — flash is a fallback option.  A final fallback option for most items is a “file download” link from the “click to play” image
  • videos have a nice new “Related Videos” pane that shows at the end of playback
  • should be much more reliable — I had previously hacked up a lot of the JS and flash from the jwplayer release version to accommodate our various wants and looks — now we use mostly the stock player with minimal JS alterations/customizations around the player.
  • better HD video and other quality options — uploaders can now offer multiple video size and bitrate qualities.  If you know how to encode web-playable h.264 mp4 videos (see my next post!), you can upload different qualities of your source video and the viewer will have the option to pick any of them (see more on that below).
  • more consistent UI and look and feel.  The longtailvideo team *really* cleaned up and improved their UI, giving everything a clean, consistent, and aesthetically pleasing look.  Their default “skin” is also greatly improved, so we can use that now directly too
  • lots of under-the-hood performance cleanup, and more likely to play in more mobile, browser, and OS combinations.

Please give it a try!

-tracey

 

For those of you interested in trying multiple qualities, here’s a sample video showing it:

http://archive.org/details/kittehs

To make that work, I made sure that my original/source file was:

  • h.264 video
  • AAC audio
  • had the “moov atom” at the front (to allow instant playback without waiting to download entire file first) (search web for “qt-faststart” or ffmpeg’s “-movflags faststart” option, or see my next post for how we make our .mp4 here at archive.org)
  • has a > 480P style HD width/height
  • has filename ending with one of:   .HD.mov   .HD.mp4   .HD.mpeg4    .HD.m4v

When all of those are true, our system will automatically take:

  • filename.HD.mov

and create:

  • filename.mp4

that is our normal ~1000 kb/sec “derivative” video, as well as “filename.ogv”

The /details/ page will then see two playable mpeg-4 h.264 videos, and offer them both with the [HD] toggle button (seen once video is playing) allowing users to pick between the two quality levels.

If you wanted to offer a *third* quality, you could do that with another ending like above but with otherwise the same requirements.  So you could upload:

  • filename.HD.mp4       (as, say, a 960 x 540 resolution video)
  • filename.HD.mpeg4   (as, say, a 1920 x 1080 resolution video)

and the toggle would show the three options:   1080P, 540P, 480P

You can update existing items if you like, and re-derive your items, to get multiple qualities present.
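As a concrete illustration, here’s a hedged sketch of preparing such an .HD.mp4 with ffmpeg; the source filename and the 1280×720 size are just examples, and you’d tailor the bitrates to taste (the key parts are h.264 video, AAC audio, the faststart/moov-atom flag, and the .HD.mp4 filename ending):

ffmpeg -y -i kittehs-source.mov -vcodec libx264 -pix_fmt yuv420p -vf scale=1280:720 -b:v 2500k -acodec aac -strict experimental -b:a 128k -ar 44100 -movflags faststart kittehs.HD.mp4

Upload the resulting kittehs.HD.mp4 to your item and the deriver will take it from there.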

Happy hacking!


getting only certain formats in .zip files from items — new feature

Per some requests from our friends in the Live Music Archive community…

You can get any archive.org item downloaded to your local machine as a .zip file (something we’ve been doing for 5+ years!)
But whereas before it would contain all files/formats,
now you can be picky/selective about *just* certain formats.

We’ll put links up on audio item pages, minimally, but the url pattern is simple for any item.
It looks like this (where you replace IDENTIFIER with the identifier of your item, i.e. the part after archive.org/details/):

http://archive.org/compress/IDENTIFIER

for the entire item, and for just certain formats:

http://archive.org/compress/IDENTIFIER/formats=format1,format2,format3,….

Example:


wget -q -O - 'http://archive.org/compress/ellepurr/formats=Metadata,Checksums,Flac' > zip; unzip -l zip
Archive:  zip
  Length      Date    Time    Name
---------  ---------- -----   ----
  1107614  2012-10-30 19:49   elle.flac
       44  2012-10-30 19:49   ellepurr.md5
     3114  2012-10-30 19:49   ellepurr_files.xml
      693  2012-10-30 19:49   ellepurr_meta.xml
      602  2012-10-30 19:49   ellepurr_reviews.xml
---------                     -------
  1112067                     5 files
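And if you want the entire item (all files/formats) saved as a local zip file, a sketch along the same lines works (using the same example item):

wget 'http://archive.org/compress/ellepurr' -O ellepurr.zip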

Enjoy!!

Uploading images for text items (update on *_images.zip format)

The old news

Until about a year ago, if you wanted to upload a set of individual page images and have them be recognized as a “book” so we’d create the usual derivative formats from them, you had to mimic the *_jp2.zip (“Single Page Processed JP2 ZIP”) archive files that are created automatically at our scanning centers. Making these from your own existing images is inconvenient and error-prone, due to the rigid expectations for how individual image files are named and organized into a directory structure. That route was also limited to JPEG2000 (“JP2”) image files.

Things changed with the introduction last year of our *_images.zip (“Generic Raw Book Zip”) format, which is much more flexible.  If you provide a file whose name ends in _images.zip, we’ll make a *_jp2.zip from it:  the *_images.zip file will be unpacked, its contents sorted alphabetically, and the set of images found within converted into a standard *_jp2.zip, which we’ll then process as usual.

In a bit more detail, the *_images.zip will be scanned for files it contains, at any directory level, whose names end with .jp2, .jpg, .jpeg, .tif, .tiff, .bmp or .png, matched case-insensitively; any other files (.xml, .txt, etc.) will be ignored.  You can mix and match different image formats.  All image files found will be sorted alphabetically (including any directory names, so that files originally in the same subdirectory stay together in the new sequence), converted to JPEG2000 if they’re not already, renamed the way our code expects, and packed into a new *_jp2.zip, leaving your *_images.zip in place as it was.
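As an illustration, here’s a hedged sketch of packing a directory of page scans into an *_images.zip for a hypothetical item whose identifier is “mybook” (the directory and file names inside don’t matter beyond determining sort order):

# any mix of .jp2/.jpg/.tif/.png/etc. inside is fine; alphabetical order = page order
zip -r mybook_images.zip scans/

Upload mybook_images.zip into the item and the usual derivation will follow.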

For an example of how messy an *_images.zip we can deal with, see:

http://archive.org/download/hr100106/hr100106_images.zip/

The 589 image files found there were converted into:

http://archive.org/download/hr100106/hr100106_jp2.zip/

Note that the new *_jp2.zip, and the files it contains, are named according to the name of the original *_images.zip file (“hr100106”), regardless of how directories and files are named inside the *_images.zip.  Those files and directories can be named any way you like; the names matter only in that they determine the sequence of the images in the new *_jp2.zip.

The new news

Now for what’s changed:

  • *_images.tar (“Generic Raw Book Tar”) is accepted as well as *_images.zip. Producing a tar file may be more convenient than producing a zip file for some uploaders, particularly if the file is going to be large. Older implementations of the zip compression scheme were limited to 4 GB, and some tools were known to produce files that we couldn’t read if the size exceeded 2 GB. Our advice in the past has been to use the 7-Zip tool for creating any zips larger than 2 GB. That still works, or you can now make a tar instead; the size of tar files is effectively unlimited (see the sketch after this list).
  • Comic Book archive files are accepted. *.cbz (“Comic Book ZIP”) files are essentially zip archive files containing page images, typically as either JPEGs or PNGs. We now accept *.cbz files and treat them just like *_images.zip files. Similarly, *.cbr (“Comic Book RAR”) files are RAR archive files containing page images, and we now treat those just like *_images.zip files, too. So if you have any *.cbz or *.cbr files, just uploading them as is should result in having all the usual derivative formats created.
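For the tar and comic-book options above, a similar hedged sketch (same hypothetical “mybook” item; a .cbz is simply a zip of page images):

# tar has no practical size limit
tar -cf mybook_images.tar scans/
# a comic book zip is just page images in a zip with a .cbz extension
zip -r mybook.cbz scans/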

Improved theora/ogg video derivatives!

We’ve made our ogg video derivatives slightly better via:

  • minor bump up to “thusnelda” release
  • “upgrade” from 1-pass video encoding to 2-pass video encoding
  • direct ffmpeg creation of the video (you’ll need to re/compile ffmpeg minimally with "--enable-libtheora --enable-libvorbis" configure flags)

ffmpeg -y -i 'camels.avi' -qscale:v 3 -b:v 512k -vcodec libtheora -pix_fmt yuv420p -vf yadif,scale=400:300 -r 20 -threads 2 -map_metadata -1,g:0,g -pass 1 -an -f null /dev/null;

ffmpeg -y -i 'camels.avi' -qscale:v 3 -b:v 512k -vcodec libtheora -pix_fmt yuv420p -vf yadif,scale=400:300 -r 20 -threads 2 -map_metadata -1,g:0,g -pass 2 -map 0:0 -map 0:1 -acodec libvorbis -ac 2 -ab 128k -ar 44100 -metadata TITLE='Camels at a Zoo' -metadata LICENSE='http://creativecommons.org/licenses/by-nc/3.0/' -metadata DATE='2004' -metadata ORGANIZATION='Dumb Bunny Productions' -metadata LOCATION=http://archive.org/details/camels camels.ogv

some notes:

  • You’d want to adjust the “scale=WIDTH:HEIGHT” accordingly, as well as the “-r FRAMES-PER-SECOND” related args, to your source video.
  • I made a small patch to allow *both* bitrate target *and* quality level for theora in ffmpeg, after comparing the other popular tool “ffmpeg2theora” code with the libtheoraenc.c inside ffmpeg.  It may not be necessary, but I believe I saw *slightly* better quality coming out of theora/thusnelda ogg video.  For what it’s worth, my minor patch is here:  http://archive.org/~tracey/downloads/patches/ffmpeg-theora.patch
  • The way we compile ffmpeg (ubuntu/linux) is here.  (Alt MacOS version here.)
  • (Edited post above after I removed this step) It’s *quite* odd, I realize, to have ffmpeg transcode the audio/video together, only to split/demux them back out temporarily.  However, for some videos, the “oggz-comment” step would wipe out the first video keyframe and cause unplayability in chrome (and the expected visual artifacts for things that could play it).  So, we split, comment the audio track, then re-stitch it back together.

Downloading in bulk using wget

If you’ve ever wanted to download files from many different archive.org items in an automated way, here is one method to do it.

____________________________________________________________

Here’s an overview of what we’ll do:

1. Confirm or install a terminal emulator and wget
2. Create a list of archive.org item identifiers
3. Craft a wget command to download files from those identifiers
4. Run the wget command.

____________________________________________________________

Requirements

Required: a terminal emulator and wget installed on your computer. Below are instructions to determine if you already have these.
Recommended but not required: understanding of basic unix commands and archive.org items structure and terminology.

____________________________________________________________

Section 1. Determine if you have a terminal emulator and wget.
If not, they need to be installed (they’re free)

1. Check to see if you already have wget installed
If you already have a terminal emulator such as Terminal (Mac) or Cygwin (Windows), you can check whether wget is also installed. If you do not have both installed, skip ahead to step 2 below. Here's how to check for wget using your terminal emulator:

1. Open Terminal (Mac) or Cygwin (Windows)
2. Type “which wget” after the $ sign
3. If you have wget, the result will show which directory it's in, such as /usr/bin/wget. If you don't have it, there will be no output.
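For example, on a machine that already has wget, the check might look like this (the path will vary by system):

$ which wget
/usr/bin/wget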

2. To install a terminal emulator and/or wget:
Windows: To install a terminal emulator along with wget please read Installing Cygwin Tutorial. Be sure to choose the wget module option when prompted.

MacOSX: MacOSX comes with Terminal installed. You should find it in the Utilities folder (Applications > Utilities > Terminal). For wget, there are no official binaries available for Mac OS X. Instead, you must either build wget from source code or download an unofficial binary created elsewhere. The following links may be helpful for getting a working copy of wget on Mac OSX.
Prebuilt binary for Mac OSX Lion and Snow Leopard
wget for Mac OSX leopard

Building from source for MacOSX: Skip this step if you are able to install from the above links.
To build from source, you must first Install Xcode. Once Xcode is installed there are many tutorials online to guide you through building wget from source. Such as, How to install wget on your Mac.

____________________________________________________________

Section 2. Now you can use wget to download lots of files

The method for using wget to download files is:

  1. Create a folder (a directory) to hold the downloaded files
  2. Generate a list of archive.org item identifiers (the tail end of the url for an archive.org item page) from which you wish to grab files
  3. Construct your wget command to retrieve the desired files
  4. Run the command and wait for it to finish

Step 1: Create a folder (directory) for your downloaded files
1. Create a folder named “Files” on your computer Desktop. This is where the downloaded files will go. Create it the usual way by using either command-shift-n (Mac) or control-shift-n (Windows)

Step 2: Create a file with the list of identifiers
You’ll need a text file with the list of archive.org item identifiers from which you want to download files. This file will be used by wget to download the files.

If you already have a list of identifiers you can paste or type the identifiers into a file. There should be one identifier per line. The other option is to use the archive.org search engine to create a list based on a query.  To do this we will use advanced search to create the list and then download the list in a file.

First, determine your search query using the search engine.  In this example, I am looking for items in the Prelinger collection with the subject “Health and Hygiene.”  There are currently 41 items that match this query.  Once you’ve figured out your query:

1. Go to the advanced search page on archive.org. Use the “Advanced Search returning JSON, XML, and more.” section to create a query. Once you have a query that delivers the results you want, click the back button to go back to the advanced search page.
2. Select “identifier” from the “Fields to return” list.
3. Optionally sort the results (sorting by “identifier asc” is handy for arranging them in alphabetical order).
4. Enter a number into the “Number of results” box that matches (or is higher than) the number of results your query returns.
5. Choose the “CSV format” radio button.
This image shows what the advanced query would look like for our example:
[Advanced Search screenshot]

6. Click the search button (it may take a while depending on how many results you have). An alert box will ask if you want your results – click “OK” to proceed.  You’ll then see a prompt to download the “search.csv” file to your computer.  The downloaded file will be in your default download location (often your Desktop or your Downloads folder).
7. Rename the “search.csv” file “itemlist.txt” (no quotes).
8. Drag or move the itemlist.txt file into the “Files” folder that you previously created.
9. Open the file in a text program such as TextEdit (Mac) or Notepad (Windows). Delete the first line, which reads “identifier”. Be sure you delete the entire line and that the first line is not left blank. Now remove all the quote marks by doing a search and replace, replacing ” with nothing.
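If you’re comfortable in the terminal, an optional one-liner can do that same cleanup. It assumes search.csv is in your current directory, drops the first “identifier” line, and strips the quote marks:

tail -n +2 search.csv | tr -d '"' > itemlist.txt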

The contents of the itemlist.txt file should now look like this:

AboutFac1941
Attitude1949
BodyCare1948
Cancer_2
Careofth1949
Careofth1951
CityWate1941

…………………………………………………………………………………………………………………………
NOTE: You can use this advanced search method to create lists of thousands of identifiers, although we don’t recommend using it to retrieve more than 10,000 or so items at once (it will time out at a certain point).
………………………………………………………………………………………………………………………...

Step 3: Create a wget command
The wget command uses unix terminology. Each symbol, letter, or word represents a different option that wget will use.

Below are three typical wget commands for downloading from the identifiers listed in your itemlist.txt file.

To get all files from your identifier list:
wget -r -H -nc -np -nH --cut-dirs=1 -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'

If you want to download only certain file formats (in this example pdf and epub) you should include the -A option, which stands for “accept”. In this example we would download the pdf and epub files:
wget -r -H -nc -np -nH --cut-dirs=1 -A .pdf,.epub -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'

To download all files except specific formats (in this example tar and zip) you should include the -R option, which stands for “reject”. In this example we would download all files except tar and zip files:
wget -r -H -nc -np -nH --cut-dirs=1 -R .tar,.zip -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'

If you want to modify one of these or craft a new one you may find it easier to do it in a text editing program (TextEdit or NotePad) rather than doing it in the terminal emulator.

…………………………………………………………………………………………………………………………
NOTE: To craft a wget command for your specific needs you might need to understand the various options. It can get complicated, so try to get a thorough understanding before experimenting. You can learn more about unix commands at Basic unix commands

An explanation of each option used in our example wget command follows:

-r   recursive download; required in order to move from the item identifier down into its individual files

-H   enable spanning across hosts when doing recursive retrieving (the initial URL for the directory will be on archive.org, and the individual file locations will be on a specific datanode)

-nc   no clobber; if a local copy already exists of a file, don’t download it again (useful if you have to restart the wget at some point, as it avoids re-downloading all the files that were already done during the first pass)

-np   no parent; ensures that the recursion doesn’t climb back up the directory tree to other items (by, for instance, following the “../” link in the directory listing)

-nH   no host directories; when using -r, wget will create a directory tree to stick the local copies in, starting with the hostname ({datanode}.us.archive.org/), unless -nH is provided

--cut-dirs=1   completes what -nH started by skipping the hostname; when saving files on the local disk (from a URL like http://{datanode}.us.archive.org/{drive}/items/{identifier}/{identifier}.pdf), skip the /{drive}/items/ portion of the URL, too, so that all {identifier} directories appear together in the current directory, instead of being buried several levels down in multiple {drive}/items/ directories

-e robots=off   archive.org datanodes contain robots.txt files telling robotic crawlers not to traverse the directory structure; in order to recurse from the directory to the individual files, we need to tell wget to ignore the robots.txt directive

-i ./itemlist.txt   location of the input file listing all the URLs to use; “./itemlist.txt” means the list should be in the current directory (the “Files” folder from Step 1), in a file called “itemlist.txt” (you can call the file anything you want, so long as you specify its actual name after -i)

-B 'http://archive.org/download/'   base URL; gets prepended to the text read from the -i file (this is what allows us to have just the identifiers in the itemlist file, rather than the full URL on each line)

Additional options that may be needed sometimes:

-l depth --level=depth   Specify recursion maximum depth level depth. The default maximum depth is 5. This option is helpful when you are downloading items that contain external links or URLs in either the item’s metadata or other text files within the item. Here’s an example command to avoid downloading external links contained in an item’s metadata:
wget -r -H -nc -np -nH --cut-dirs=1 -l 1 -e robots=off -i ./itemlist.txt -B 'http://archive.org/download/'

-A  -R   accept-list and reject-list, either limiting the download to certain kinds of file, or excluding certain kinds of file; for instance, adding the following options to your wget command would download all files except those whose names end with _orig_jp2.tar or _jpg.pdf:
wget -r -H -nc -np -nH --cut-dirs=1 -R _orig_jp2.tar,_jpg.pdf -e robots=off -i ./itemlist.txt -B 'http://archive.org/download/'

And adding the following options would download all files containing zelazny in their names, except those ending with .ps:
wget -r -H -nc -np -nH --cut-dirs=1 -A "*zelazny*" -R .ps -e robots=off -i ./itemlist.txt -B 'http://archive.org/download/'

See http://www.gnu.org/software/wget/manual/html_node/Types-of-Files.html for a fuller explanation.
…………………………………………………………………………………………………………………………

Step 4: Run the command
1. Open your terminal emulator (Terminal or Cygwin)
2. In your terminal emulator window, move into your folder/directory. To do this:
For Mac: type cd Desktop/Files
For Windows (Cygwin): after the $, type cd /cygdrive/c/Users/archive/Desktop/Files
3. Hit return. You have now moved into the “Files” folder.
4. In your terminal emulator enter or paste your wget command. If you are using one of the commands on this page, be sure to copy the entire command, which may wrap onto two lines. On Mac you can simply copy and paste. For Cygwin, copy the command, click the Cygwin logo in the upper left corner, select Edit then select Paste.
5. Hit return to run the command.

You will see your progress on the screen.  If you have sorted your itemlist.txt alphabetically, you can estimate how far through the list you are based on the screen output. Depending on how many files you are downloading and their size, it may take quite some time for this command to finish running.

…………………………………………………………………………………………………………………………
NOTE: We strongly recommend trying this process with just ONE identifier first as a test to make sure you download the files you want before you try to download files from many items.
…………………………………………………………………………………………………………………………
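For example, a quick single-item test might look like this (using one identifier from the example list above, run in a fresh test folder):

mkdir wget-test && cd wget-test
echo "AboutFac1941" > testlist.txt
wget -r -H -nc -np -nH --cut-dirs=1 -e robots=off -l1 -i ./testlist.txt -B 'http://archive.org/download/'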

Tips:

  • You can terminate the command by pressing “control” and “c” on your keyboard simultaneously while in the terminal window.
  • If your command will take a while to complete, make sure your computer is set to never sleep and turn off automatic updates.
  • If you think you missed some items (e.g. due to machines being down), you can simply rerun the command after it finishes.  The “no clobber” option in the command will prevent already retrieved files from being overwritten, so only missed files will be retrieved.

IA forums now do “inline reply”

One thing that’s driven me a bit kooky is that every time I “reply to this post” in the Internet Archive forums, it drives the browser to a new form page with no context/content of the post I was just looking at!

Very pleased to finally (with some coffee) make it a javascript “inline reply” right at the post you are looking at, so you can reference everything you were just thinking about and how you were going to reply.

It gracefully degrades to prior behaviour for those without javascript enabled.

[The prior behaviour:
Ooh, nice post and points.
But I gotta get in there with some comments… OK, I have some ideas ready to type…
[reply button]
Say, is that Jane’s Addiction old stuff coming up on random play, I like this…
oh crap, what was I going to say again?
what post was I looking at again?
Say, is that a lower-than-normal plane flying outside?

]