Category Archives: Technical

Using Docker to Encapsulate a Complicated Program: a Success

The Internet Archive has been using Docker in a way that is a bit out of the mainstream: to package a command-line binary and its dependencies so we can deploy it on a cluster and use it the same way we would a static binary.

Columbia University’s Daniel Ellis created an audio fingerprinting program that was used in a competition. It was not packaged as a Debian package or distributed in any other standard way. It took a while for our staff to figure out how to install it and its many dependencies consistently on Ubuntu, and it seemed pretty heavy-handed to install all of that on our worker cluster. So we explored using Docker, and it has been a success. While old hat for some, I thought it might be interesting to explain what we did.

1) Created a Dockerfile to build a Docker image holding all of the code needed to run the system.

2) Worked with our systems group to figure out how to install Docker on our cluster with a security profile we felt comfortable with. This included running the binary in the container as the user nobody.

3) Ramped up slowly to test downloading and running the container. In general it took 10–25 minutes to download the image the first time; once cached on a worker node, it was very fast to start up. The cache persists across jobs, so this is efficient.

4) Used the container as we would a shell command, but passed files in and out by mounting a sub-filesystem for it to read and write. This also helped with signaling errors.

5) Starting production use now.
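The workflow above can be sketched with a minimal Dockerfile. The base image, package list, and program paths below are all assumptions for illustration; the post does not show the actual file.

```dockerfile
# Hypothetical sketch: bundle the fingerprinting code and its dependencies
# into one image, so it can be run like a static binary on any worker node.
FROM ubuntu:12.04

# Install the dependencies the code needs (this package list is a guess).
RUN apt-get update && apt-get install -y python-numpy python-scipy

# Copy the academic code into the image.
COPY fingerprinter/ /opt/fingerprinter/

# Run as the unprivileged user "nobody", per the security profile above.
USER nobody
ENTRYPOINT ["/opt/fingerprinter/match.sh"]
```

Files are then passed in and out by mounting a directory, e.g. `docker run -v /tmp/job:/data fingerprinter /data/query.wav`, so the container behaves like a shell command.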

We hope that docker can help us with other programs that require complicated or legacy environments to run.

Congratulations to Raj Kumar, Aaron Ximm, and Andy Bezella for the creative solution to a problem that could have made it difficult for us to use some complicated academic code in our production environment.

Go docker!

Job Posting: Web Application/Software Developer for Archive-It

The Internet Archive is looking for a smart, collaborative and resourceful engineer to lead and do the development of the next generation of the Archive-It service, a web-based application used by libraries and archives around the world. The Internet Archive is a digital public library founded in 1996. Archive-It is a self-sustaining, revenue-generating subscription service first launched in 2006.

Primary responsibilities would be to extend the success of Archive-It, which librarians and archivists use to create collections of digital content and then make them accessible to researchers, scholars and the general public.  Widely considered the market leader since its inception, Archive-It’s partner base has archived over five billion web pages and over 260 terabytes of data.  http://archive-it.org

Working for the Archive-It program’s director, this position has technical responsibility to evolve this service while keeping it straightforward enough to be operated by 300+ partner organizations and their users with minimal technical skills. Our current system is primarily Java-based and we are looking to build the next generation of Archive-It using the latest web technologies. The ideal candidate will possess a desire to work collaboratively with a small internal team and a large, vocal and active user community, demonstrating independence, creativity, initiative and technological savvy, in addition to being a great programmer/architect.

The ideal candidate will have:


  • 5+ years work experience in Java and Python web application development
  • Experience with Hadoop, specifically HBase and Pig
  • Experience developing web application database back-ends (SQL or NoSQL)
  • Good understanding of latest web framework technologies, both JVM and non-JVM based, and trade-offs between them.
  • Strong familiarity with all aspects of web technology and protocols, including: HTTP, HTML, and Javascript
  • Experience with a variety of web applications, machine clusters, distributed systems, and high-volume data services.
  • Flexibility and a sense of humor
  • BS Computer Science, or equivalent work experience

Bonus points for:

  • Experience with web crawlers and/or applications designed to display [archived] web content (especially server-side apps)
  • Open source practices experience
  • Experience and/or interest in user interface design and information architecture
  • Familiarity with Apache SOLR or similar facet-based search technologies
  • Experience with the building/architecture of social media sites
  • Experience building out a mobile platform

To apply:

Please send your resume and cover letter to kristine at archive dot org with the subject line “Web App Developer Archive-It”.

The Archive thanks all applicants for their interest, but advises that only those selected for an interview will be contacted. No phone calls please!

We are an equal opportunity employer.

How to use the Virtual Machine for Researchers

Some researchers that are working with the Internet Archive, such as those at University of Massachusetts, have wanted closer access to some of our collections. We are learning how to support this type of “on-campus” use of the collections. This post is to document how to use these machines.

Who can have access?

This is for joint projects with the Archive, usually an academic program, often funded by the NSF. So this is not a general offering, but more of a special case. Most researchers use the collections by downloading materials to their home machines. We have tools to help with this, and use GNU Parallel to make it go fast.

How to get an account?

Is there an agreement? Yes, there usually is. This is usually administered by Alexis Rossi. All in all, these are shared machines, so please be respectful of others’ data and use of the machines.

How do I get access to the VM? To get an account you will need to forward a public SSH key to Jake Johnson. Please follow the steps below for more details.

Generate your SSH keys.

These instructions assume you’re on a Unix-like operating system. If you’re using Windows please see Mike Lichtenberg’s blog post, Generating SSH Keys on Windows.

  1. If you don’t already have an ~/.ssh directory, you will need to create one to store your SSH configuration files and keys:
    $ mkdir -p ~/.ssh
  2. Move into the ~/.ssh directory:
    $ cd ~/.ssh
  3. Create your keys (replacing {username} with the username you would like to use to login to the VM):
    $ ssh-keygen -t rsa -b 2048 -C "{username}@researcher0.fnf.archive.org"
  4. You will be prompted to enter a filename to which your private SSH key will be saved. Use something like id_rsa.{username}@researcher0.fnf.archive.org, again replacing {username} with the username you will be using to log in to the VM:
    Enter file in which to save the key (~/.ssh/id_rsa): id_rsa.{username}@researcher0.fnf.archive.org
  5. You will be prompted again to enter a passphrase. Enter a passphrase, and continue.
    Enter passphrase (empty for no passphrase): [enter your passphrase]
    Enter same passphrase again: [enter your passphrase again]

You should now have two new files in your ~/.ssh directory, a private key and a public key. For example:

~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org
~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org.pub

Your public key is the key suffixed with “.pub”.

Adding your public key to the VM

Forward your public key to Jake Johnson. He will create a user for you and add your public key to the VM. Once you receive notification that your user has been created and your key successfully added to the VM, proceed to the next step.

Logging into the VM via SSH

You can now use your private key to log in to the VM with the following command:

$ ssh -i ~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org {username}@researcher0.fnf.archive.org
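If you will be logging in often, you can avoid retyping the key path by adding a host alias to your ~/.ssh/config (the alias name here is arbitrary, just a convenience; this sketch is not part of the official setup):

```
Host researcher0
    HostName researcher0.fnf.archive.org
    User {username}
    IdentityFile ~/.ssh/id_rsa.{username}@researcher0.fnf.archive.org
```

After that, `ssh researcher0` is equivalent to the full command above.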

How do I bulk download data from archive.org onto the VM?

We recommend using wget to download data from archive.org. Please see our blog post, Downloading in bulk using wget, for more details.

If you have privileges to an access-restricted collection, you can use your archive.org cookies to download data from this collection by adding the following --header flag to your wget command:

--header "Cookie: logged-in-user={email%40example.com}; logged-in-sig={private};"

(Note: replace {email%40example.com} with the email address associated with your archive.org account (encoding @ as %40), and {private} with the value of your logged-in-sig cookie.)
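Put together, the header looks like this (the email address and cookie value below are placeholders, not real credentials):

```shell
# Placeholder values: substitute your own account email (with @ encoded
# as %40) and your logged-in-sig cookie value.
EMAIL="user%40example.com"
SIG="0123456789abcdef"
HEADER="Cookie: logged-in-user=${EMAIL}; logged-in-sig=${SIG};"

# The header is then added to the usual bulk-download command, e.g.:
#   wget --header "$HEADER" "http://archive.org/download/IDENTIFIER/..."
echo "$HEADER"
```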

You can retrieve your logged-in-sig cookie using the following steps:

  1. In Firefox, go to archive.org and log in with your account
  2. Go to Firefox > Preferences
  3. Click on the Privacy tab
  4. Select “Use custom settings for History” in the drop-down menu in the History section
  5. Click the “Show cookies” button
  6. Find archive.org in the list of cookies and expand to show options
  7. Select the logged-in-sig cookie. The long string in the “Content:” field is the value of your logged-in-sig cookie. This is the value that you will need for your wget command (specifically, replacing {private} in the --header flag mentioned above).

How do I bulk download metadata from archive.org onto the VM?

You can download all of an item’s metadata via our Metadata API.
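The Metadata API serves each item’s metadata record as JSON at a predictable URL; a sketch (“stairs” is just an example identifier):

```shell
# Build the Metadata API URL for an item ("stairs" is an example identifier).
IDENTIFIER="stairs"
URL="http://archive.org/metadata/${IDENTIFIER}"

# Fetch the JSON record with, e.g.:  wget -qO- "$URL"
echo "$URL"
```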

How do I generate a list of identifiers for downloading data and metadata from collections in bulk?

You can use our advanced search engine. Please refer to the Create a file with the list of identifiers section in our Downloading in bulk using wget blog post.
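As a sketch of that approach, the advanced search engine can return just the identifier field for a whole collection as CSV (the collection name and row limit below are examples, not recommendations):

```shell
# Ask the advanced search engine for only the "identifier" field of every
# item in a collection, in CSV form ("etree" is an example collection).
COLLECTION="etree"
URL="http://archive.org/advancedsearch.php?q=collection%3A${COLLECTION}&fl%5B%5D=identifier&rows=10000&output=csv"

# Fetch the list with, e.g.:  wget -qO- "$URL" > itemlist.txt
echo "$URL"
```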

How can I monitor usage of the VM?

You can monitor usage of the VM via MRTG (Multi Router Traffic Grapher) here: http://researcher0.fnf.archive.org:8088/mrtg/

Site down some of Tuesday and Wednesday for Power Upgrade

[Update: The upgrade is done. We were offline twice, as we predicted (and are sorry about), but now we have twice the power.

New transformer for the Internet Archive Building.

Thank you PG&E, Ralf Muehlen, and the Archive engineers.]

This week, we are doubling the power coming into our primary data center so that we can archive and serve even more web pages, books, music and moving images. During those upgrades, there will be times when many of our web sites and services will not be available. Details below.

To keep the data safe, we will proactively shut down most of our services served from our primary data center. archive.org, openlibrary.org, iafcu.org and our blogs will be unavailable during the outages. The upgrades will happen over a two-day period. We anticipate two prolonged outages: the first from about 7am to 12 noon PDT (14:00–19:00 UTC) on Tuesday, April 16, and another from 3pm to 7pm PDT (22:00–02:00 UTC) on Wednesday, April 17. Work might require additional outages between those two major ones.

During the outages, we’ll post updates to our @internetarchive twitter feed. Sorry for the inconvenience.

Update: To be on the safe side, we’ll expand Wednesday’s outage window to 2:15pm–7:15pm PDT (21:15–02:15 UTC). For some of our services, the actual outages might be shorter.

Celebrating 100 million tasks (uploading and modifying archive.org content)

Just over 8-1/2 years ago, I wrote a multi-process daemon in PHP that we refer to as “catalogd”.  It runs 24 hours a day, 7 days a week, no rest!

It is in charge of uploading all content to our archive.org servers, and all changes to uploaded files.

We recently passed the 100 millionth “task” (upload or edit to an archive “item”).

After starting with a modest 100 or so tasks/day, we currently run nearly 100,000 tasks/day.  We’ve done some minor scaling, but for the most part, the little daemon has become our little daemon that could!

Here’s to the next 100 million tasks at archive.org!

-tracey

new mp4 (h.264) derivative technique — simpler and easy!

Greetings video geeks!  😎

We’ve updated the process and the way we create the .mp4 files that are shown on video pages on archive.org.

It’s a much cleaner/clearer process, namely:

  • We opted to ditch ffpreset files in favor of 100%-equivalent command-line arguments.  It seems a bit easier for someone reading the task log of their item to see what we did.
  • We no longer need the qt-faststart step and have dropped it; we use modern ffmpeg’s command-line “-movflags faststart” instead.
  • The entire processing is now done 100% with ffmpeg, in the standard “2-pass” mode
  • As before, this derivative plays in modern HTML5 video tag compatible browsers, plays in the Flash plugin within browsers, and works on all iOS devices.  It also makes sure the “moov atom” is at the front of the file, so browsers can start playback before downloading the entire file.
Here is an example. You would tailor especially the “scale=640:480” depending on source aspect ratio and desired output size; change or drop the “-r 20” option (the source was 20 fps, so we make the destination 20 fps); and tailor the bitrate arguments to taste:
  • ffmpeg -y -i stairs.avi -vcodec libx264 -pix_fmt yuv420p -vf yadif,scale=640:480 -profile:v baseline -x264opts cabac=0:bframes=0:ref=1:weightp=0:level=30:bitrate=700:vbv_maxrate=768:vbv_bufsize=1400 -movflags faststart -ac 2 -b:a 128k -ar 44100 -r 20 -threads 2 -map_metadata -1,g:0,g -pass 1 -map 0:0 -map 0:1 -acodec aac -strict experimental stairs.mp4;
  • ffmpeg -y -i stairs.avi -vcodec libx264 -pix_fmt yuv420p -vf yadif,scale=640:480 -profile:v baseline -x264opts cabac=0:bframes=0:ref=1:weightp=0:level=30:bitrate=700:vbv_maxrate=768:vbv_bufsize=1400 -movflags faststart -ac 2 -b:a 128k -ar 44100 -r 20 -threads 2 -map_metadata -1,g:0,g -pass 2 -map 0:0 -map 0:1 -acodec aac -strict experimental -metadata title='"Stairs where i work" - lame test item, bear with us - http://archive.org/details/stairs' -metadata year='2004' -metadata comment='license:http://creativecommons.org/licenses/publicdomain/' stairs.mp4;

Happy hacking and creating!

PS: here is the way we compile ffmpeg (we use Ubuntu Linux, but it works on Mac OS X, too).

new video and audio player — video multiple qualities, related videos, and more!

Many of you have already noticed that since the New Year, we have migrated our new “beta” player to be the primary/default player, and then to be the only player.

We are excited about this new player!
It features the very latest release of jwplayer from longtailvideo.com.

Here’s some new features/improvements worth mentioning:

  • html5 is now the default — flash is a fallback option.  a final fallback option for most items is a “file download” link from the “click to play” image
  • videos have a nice new “Related Videos” pane that shows at the end of playback
  • should be much more reliable — I had previously hacked up a lot of the JS and flash from the jwplayer release version to accommodate our various wants and looks — now we use mostly the stock player with minimal JS alterations/customizations around the player.
  • better HD video and other quality options — uploaders can now offer multiple video sizes and bitrate qualities.  If you know how to code web-playable h.264 mp4 videos especially (see my next post!), you can upload different qualities of your source video and the viewer will have the option to pick any of them (see more on that below).
  • more consistent UI and look and feel.  The longtailvideo team *really* cleaned up and improved their UI, giving everything a clean, consistent, and aesthetically pleasing look.  Their default “skin” is also greatly improved, so we can use that now directly too
  • lots of performance cleanup under the hood — more likely to play on more mobile, browser, and OS combinations.

Please give it a try!

-tracey

For those of you interested in trying multiple qualities, here’s a sample video showing it:

http://archive.org/details/kittehs

To make that work, I made sure that my original/source file was:

  • h.264 video
  • AAC audio
  • has the “moov atom” at the front (to allow instant playback without waiting to download the entire file first; search the web for “qt-faststart” or ffmpeg’s “-movflags faststart” option, or see my next post for how we make our .mp4 files here at archive.org)
  • has a >480P-style HD width/height
  • has a filename ending with one of:   .HD.mov   .HD.mp4   .HD.mpeg4    .HD.m4v

When all of those are true, our system will automatically take:

  • filename.HD.mov

and create:

  • filename.mp4

that is our normal ~1000 kb/sec “derivative” video, as well as “filename.ogv”
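The renaming scheme above can be sketched in shell (the filenames are just examples):

```shell
# The derive process strips the ".HD.<ext>" suffix to name the normal-quality
# derivatives (these filenames are examples).
src="kittehs.HD.mov"
base="${src%.HD.*}"               # strips ".HD.mov", leaving "kittehs"
echo "${base}.mp4 ${base}.ogv"    # the derivatives created alongside the HD source
```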

The /details/ page will then see two playable mpeg-4 h.264 videos, and offer them both with the [HD] toggle button (seen once video is playing) allowing users to pick between the two quality levels.

If you wanted to offer a *third* quality, you could do that with another ending like above but with otherwise the same requirements.  So you could upload:

  • filename.HD.mp4       (as, say, a 960 x 540 resolution video)
  • filename.HD.mpeg4   (as, say, a 1920 x 1080 resolution video)

and the toggle would show the three options:   1080P, 540P, 480P

You can update existing items if you like, and re-derive your items, to get multiple qualities present.

Happy hacking!


getting only certain formats in .zip files from items — new feature

Per some requests from our friends in the Live Music Archive community…

You can get any archive.org item downloaded to your local machine as a .zip file (something we’ve been doing for 5+ years!).
But whereas before it would contain all files/formats,
now you can be selective about *just* certain formats.

We’ll put links up on audio item pages, minimally, but the URL pattern is simple for any item.
It looks like this (replace IDENTIFIER with the identifier of your item, i.e. the part after archive.org/details/):

http://archive.org/compress/IDENTIFIER

for the entire item, and for just certain formats:

http://archive.org/compress/IDENTIFIER/formats=format1,format2,format3,….

Example:


wget -q -O - 'http://archive.org/compress/ellepurr/formats=Metadata,Checksums,Flac' > zip; unzip -l zip
Archive:  zip
  Length      Date    Time    Name
---------  ---------- -----   ----
  1107614  2012-10-30 19:49   elle.flac
       44  2012-10-30 19:49   ellepurr.md5
     3114  2012-10-30 19:49   ellepurr_files.xml
      693  2012-10-30 19:49   ellepurr_meta.xml
      602  2012-10-30 19:49   ellepurr_reviews.xml
---------                     -------
  1112067                     5 files

Enjoy!!