Category Archives: Technical

Two Thin Strands of Glass

There’s a tiny strand of glass inside that thick plastic coat.

Two thin strands of glass. When combined, these two strands of glass are so thin they still wouldn’t fill a drinking straw. That’s known in tech circles as a “fiber pair,” and these two thin strands of glass carry all the information of the world’s leading archive in and out of our data centers. When you think about it, it sounds kind of crazy that it works at all, but it does. Every day. Reliably.

Except this past Monday night, here in California…

On Monday, June 24, the real world had other ideas. As a result, the Internet Archive was down for 15 hours. For Californians, this was less of a big deal: those 15 hours stretched from mid-Monday evening (9:11pm on the US West coast), to 11:51am on Tuesday. Many Californians were asleep during several hours of that time. But in the Central European time zone (e.g. France, Germany, Italy, Poland, Tunisia), that fell on early Tuesday morning (06:11) to mid-Tuesday evening (20:51). And in the entire country of India, it was late Tuesday morning (09:41) to just after midnight on Wednesday (00:21).


New Views Stats for the New Year

We began developing a new system for counting views statistics on archive.org a few years ago. We had received feedback from our partners and users asking for more fine-grained information than the old system could provide. People wanted to know where their views were coming from geographically, and how many came from people vs. robots crawling the site.

The new system will debut in January 2019. Leading up to that, over the next couple of weeks, you may see some inconsistencies in view counts as the new numbers roll out across tens of millions of items.

With the new system you will see changes on both items and collections.

Item page changes

An “item” refers to a media item on archive.org – this is a page that features a book, a concert, a movie, etc. Here are some examples of items: Jerky Turkey, Emma, Gunsmoke.

On item pages the lifetime views will change to a new number.  This new number will be a sum of lifetime views from the legacy system through 2016, plus total views from the new system for the past two years (January 2017 through December 2018). Because we are replacing the 2017 and 2018 views numbers with data from the new system, the lifetime views number for that item may go down. I will explain why this occurs further down in this post where we discuss how the new system differs from the legacy system.

Collection page changes

Soon, on collection page About tabs (example), you will see two separate views graphs. One will be for the old legacy system views through the end of 2018. The other will contain two years of views data from the new system (2017 and 2018). Moving forward, only the graph representing the new system will be updated with views numbers. The legacy graph will “freeze” as of December 2018.

Both graphs will be on the page for a limited time, allowing you to compare your collection’s stats between the old and new systems. We will not delete the legacy system data, but it may eventually move to another page. The data from both systems is also available through the views API.
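For programmatic access, view counts can also be fetched from the views API over HTTP. As a rough sketch only (the endpoint path and response fields below are assumptions on my part; consult the views API documentation for the authoritative details), a Python query might look like:

import requests

identifier = "cruz-test"  # any archive.org item identifier
# Assumed endpoint path for per-item view counts; verify against the views API docs.
url = "https://archive.org/services/views/v1/short/" + identifier
response = requests.get(url, timeout=30)
response.raise_for_status()
print(response.json())  # view-count data for the item, as JSON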

People vs. Robots

The graph for new collection views will additionally contain information about whether the views came from known “robots” or “people.”  Known robots include crawlers from major search engines, like Google or Bing. It is important for these robots to crawl your items – search engines are a major source of traffic to all of the items on archive.org. The robots number here is your assurance that search engines know your items exist and can point users to them.  The robots numbers also include access from our own internal robots (which is generally a very small portion of robots traffic).

One note about robots: they like text-based files more than audio/visual files.  This means that text items on the archive that have a publicly accessible text file (the djvu.txt file) get more views from robots than other types of media in the archive. Search engines don’t just want the metadata about the book – they want the book itself.

“People” are a little harder to define. Our confidence about whether a view comes from a person varies – in some cases we are very sure, and in others it’s more fuzzy, but in all cases we know the view is not from a known robot. So we have chosen to class these all together as “people,” as they are likely to represent access by end users.

What counts as a view in the new system

  • Each media item in the archive has a views counter.
  • The view counter is increased by 1 when a user engages with the media file(s) in an item.
    • Media engagement includes experiencing the media through the player in the item page (pressing play on a video or audio player, flipping pages in the online bookreader, emulating software, etc.), downloading files, streaming files, or borrowing a book.
    • All types of engagements are treated in the same way – they are all views.
  • A single user can only increase the view count of a particular item once per day (see the code sketch after this list).
    • A user may view multiple media files in a single item, or view the same media file in a single item multiple times, but within one day that engagement will only count as 1 view.
  • Collection views are the sum of all the view counts of the items in the collection.
    • When an item is in more than one collection, the item’s view counts are added to each collection it is in. This includes “parent” collections if the item is in a subcollection.
    • When a user engages with a collection page (sorting, searching, browsing etc.), it does NOT count as a view of the collection.
    • Items sometimes move in or out of collections. The views number on a collection represents the sum of the views of the items that are in the collection at that time (e.g. the September 1, 2018 views number for the collection represents the sum of the views on items that were in the collection on September 1, 2018; if an item later moves out of that collection, the collection does not lose the views from September 1, 2018).
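To make the counting rules concrete, here is a small illustrative sketch in Python (not the Archive’s production code; the data structures and names are invented for this example). It applies the once-per-day rule to a log of media engagements, then sums item views into collections:

from collections import defaultdict

def count_views(engagements, item_collections):
    """engagements: iterable of (day, user_id, item_id) media engagements.
    item_collections: item_id -> set of collections the item belongs to
    (including parent collections)."""
    item_views = defaultdict(int)
    seen = set()
    for day, user_id, item_id in engagements:
        # A single user can only increase an item's view count once per day,
        # no matter how many files in that item they engage with.
        if (day, user_id, item_id) not in seen:
            seen.add((day, user_id, item_id))
            item_views[item_id] += 1

    # Collection views are the sum of the view counts of the items in them.
    collection_views = defaultdict(int)
    for item_id, views in item_views.items():
        for collection in item_collections.get(item_id, set()):
            collection_views[collection] += views

    return dict(item_views), dict(collection_views)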

How the new system differs from the legacy system

When we designed the new system, we implemented some changes in what counted as a “view,” added some functionality, and repaired some errors that were discovered.  

  • The legacy system updated item views once per day and collection views once per month. The new system will update both item and collection views once per day.
  • The legacy system updated item views ~24 hours after a view was recorded.  The new system will update the views count ~4 days after the view was recorded. This time delay in the new system will decrease to ~24 hours at some point in the future.
  • The legacy system had no information about geographic location of users. The new system has approximate geolocation for every view. This geographic information is based on obfuscated IP addresses. It is accurate at a general level, but does not represent an individual user’s specific location.
  • The legacy system had no information about how many views were caused by robots crawling the site. The new system shows us how well the site is crawled by breaking out media access by robots (vs. interactions from people).
  • The legacy system did not count all book reader interactions as views.  The new system counts bookreader engagements as a view after 2 interactions (like page flips).
  • On audio and video items, the legacy system sometimes counted views when users saw *any* media in the item (like thumbnail images). The new system only counts engagements with the actual audio or video files in those items.

In some cases, the differences above can lead to drastic changes in views numbers for both items and collections. While this may be disconcerting, we think the new system more accurately reflects end user behavior on archive.org.

If you have questions regarding the new stats system, you may email us at info@archive.org.

Identity in the Decentralized Web

In B. Traven’s The Death Ship, American sailor Gerard Gales finds himself stranded in post-World War I Antwerp after his freighter departs without him.  He’s arrested for the crime of being unable to produce a passport, sailor’s card, or birth certificate—he possesses no identification at all.  Unsure how to process him, the police dump Gales on a train leaving the country. From there Gales endures a Kafkaesque journey across Europe, escorted from one border to another by authorities who do not know what to do with a man lacking any identity.  “I was just a nobody,” Gales complains to the reader.

As The Death Ship demonstrates, the concept of verifiable identity is a cornerstone of modern life.   Today we know well the process of signing in to shopping websites, checking email, doing some banking, or browsing our social network.  Without some notion of identity, these basic tasks would be impossible.

That’s why at the Decentralized Web Summit earlier this year, questions of identity were a central topic.  Unlike the current environment, in a decentralized web users control their personal data and make it available to third-parties on a need-to-know basis.  This is sometimes referred to as self-sovereign identity: the user, not web services, owns their personal information.

The idea is that web sites will verify you much as a bartender checks your ID before pouring a drink.  The bar doesn’t store a copy of your card and the bartender doesn’t look at your name or address; only your age is pertinent to receive service.  The next time you enter the bar the bartender once again asks for proof of age, which you may or may not relinquish. That’s the promise of self-sovereign identity.

At the Decentralized Web Summit, questions and solutions were bounced around in the hopes of solving this fundamental problem.  Developers spearheading the next web hashed out the criteria for decentralized identity, including:

  • secure: to prevent fraud, maintain privacy, and ensure trust between all parties
  • self-sovereign: individual ownership of private information
  • consent: fine-tuned control over what information third-parties are privy to
  • directed identity: manage multiple identities for different contexts (for example, your doctor can access certain aspects while your insurance company accesses others)
  • and, of course, decentralized: no central authority or governing body holds private keys or generates identifiers

One challenge with decentralized identity is that these criteria often compete, pulling in opposite directions.

For example, while security seems like a no-brainer, with self-sovereign identity the end user is in control (and not Facebook, Google, or Twitter). It’s incumbent on the user to secure their own information. This raises questions of key management, data storage practices, and so on. Facebook, Google, and Twitter pay full-time engineers to do this job; handing that responsibility to end users shifts the burden to someone who may not be so technically savvy. The inconvenience of key management and such also creates more hurdles for widespread adoption of the decentralized web.

The good news is, there are many working proposals today attempting to solve the above problems.  One of the more promising is DID (Decentralized Identifier).

A DID is simply a URI, a familiar piece of text to most people nowadays.  Each DID references a record stored in a blockchain. DIDs are not tied to any particular blockchain, and so they’re interoperable with existing and future technologies.  DIDs are cryptographically secure as well.

DIDs require no central authority to produce or validate. If you want a DID, you can generate one yourself, or as many as you want. In fact, you should generate lots of them. Each unique DID gives the user fine-grained control over what personal information is revealed when interacting with a myriad of services and people.
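To make the format concrete: per the W3C specification, every DID follows the generic syntax did:<method>:<method-specific-id>. Here is a tiny sketch that splits a DID into its parts (the identifier below is the made-up example used in the spec, not a real record):

def parse_did(did):
    # A DID is a URI of the form did:<method>:<method-specific-id>.
    scheme, method, method_specific_id = did.split(":", 2)
    if scheme != "did":
        raise ValueError("not a DID")
    return method, method_specific_id

print(parse_did("did:example:123456789abcdefghi"))
# -> ('example', '123456789abcdefghi')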

If you’re interested in learning more, I recommend reading Michiel Mulders’ article on DIDs, “the Internet’s ‘missing identity layer’.” The DID working technical specification is being developed by the W3C. And for those looking for code and community, check out the Decentralized Identity Foundation.

(While DIDs are promising, they are a nascent technology. Other options are under development. I’m using DIDs as an example of how decentralized identity might work.)

What does the future hold for self-sovereign identification? From what I saw at the Decentralized Web Summit, I’m certain a solution will be found.

Audio / Video player updated – to jwplayer v8.2

We updated our audio/video (and TV) 3rd party JS-based player from v6.8 to v8.2 today.

We updated our code so the player keeps the same feature set as before, and adds some new features:

  • much nicer cosmetic/look updates
  • nice “rewind 10 seconds” button
  • controls are now in an updated control bar
  • (video) ‘Related Items’ now uses the same (better) recommendations from the bottom of an archive.org /details/ page
  • Airplay (Safari) and Chromecast basic casting controls in player
  • playback speed rate control now easier to use / set
  • keyboard playback control with the SPACE bar and the left/right and up/down arrow keys
  • (video) Web VTT (captions) has much better user interface and display
  • Flash is now only used to play audio/video if HTML5 doesn’t work (Flash no longer handles layout or controls)

Here are some before/after screenshots:

Archive video now supports WebVTT for captions

We now support .vtt files (Web Video Text Tracks) for captioning your videos, in addition to .srt (SubRip) files, which we have supported for years.

It’s as simple as uploading a “parallel filename” to your video file(s).

Examples:

  • myvid.mp4
  • myvid.srt
  • myvid.vtt

Multi-lang support:

  • myvid.webm
  • myvid.en.vtt
  • myvid.en.srt
  • myvid.es.vtt

Here’s a nice example item:
https://archive.org/details/cruz-test

VTT with caption picker (and upcoming A/V player too!)

(Have no fear: in a few days we will have an updated A/V player with a better “picker” for items with many language tracks. 😎)

Enjoy!

 

Using Kakadu JPEG2000 Compression to Meet FADGI Standards

The Internet Archive is grateful to the folks at Kakadu Software for contributing to Universal Access to Knowledge by providing the world’s leading implementation of the JPEG2000 standard, used in the Archive’s image processing systems.

Here at the Archive, we digitize over a thousand books a day. JPEG2000, an image coding system that uses compression techniques based on wavelet technology, is a preferred file format for storing these images efficiently, while also providing advantages for presentation quality and metadata richness. The Library of Congress has documented its adoption of the JPEG2000 file format for a number of digitization projects, including its text collections on archive.org.

Recently we started using their SDK to apply some color corrections to the images coming from our cameras. This has helped us achieve FADGI standards in our work with the Library of Congress.

Thank you, Kakadu, for helping make it possible for millions of books to be digitized, stored, and made available with high quality on archive.org!

If you are interested in finding out more about Kakadu Software’s powerful software toolkit for JPEG2000 developers, visit kakadusoftware.com or email info@kakadusoftware.com.

The Hidden Shifting Lens of Browsers


Some time ago, I wrote about the interesting situation we had with emulation and Version 51 of the Chrome browser – that is, our emulations stopped working in a very strange way and many people came to the Archive’s inboxes asking what had broken. The resulting fix took a lot of effort and collaboration with groups and volunteers to track down, but it was successful and ever since, every version of Chrome has worked as expected.

But besides the interesting situation with this bug (it actually made us perfectly emulate a broken machine!), it also brought into a very sharp focus the hidden, fundamental aspect of Browsers that can easily be forgotten: Each browser is an opinion, a lens of design and construction that allows its user a very specific facet of how to address the Internet and the Web. And these lenses are something that can shift and turn on a dime, and change the nature of this online world in doing so.

An eternal debate rages on what the Web is “for” and how the Internet should function in providing information and connectivity. For the now-quite-embedded millions of users around the world who have only known a world with this Internet and WWW-provided landscape, the nature of existence centers around the interconnected world we have, and the browsers that we use to communicate with it.


Avoiding too much of a history lesson at this point, let’s instead just say that when browsers entered the landscape of computer usage in a big way, circa 1995, after being one of several resource-intensive experimental programs, the effect on computing experience and acceptance was unparalleled since the plastic-and-dreams home computer revolution of the 1980s. Suddenly, in one program came basically all the functions of what a computer might possibly do for an end user, all of it linked and described and seemingly infinite. The more technically-oriented among us can point out the gaps in the dream and the real-world efforts behind the scenes to make things do what they promised, of course. But the fundamental message was: Get a Browser, Get the Universe. Throughout the late 1990s, access came in the form of mailed CD-ROMs, or built-in packaging, or Internet Service Providers sending along the details on how to get your machine connected, and get that browser up and running.

As I’ve hinted at, though, this shellac of a browser interface was the rectangular window to a very deep, almost Brazil-like series of ad-hoc infrastructure, clumsily-cobbled standards and almost-standards, and ever-shifting priorities in what this whole “WWW” experience could even possibly be. It’s absolutely great, but it’s also been absolutely arbitrary.

With web anniversaries aplenty now coming into the news, it’ll be very easy to forget how utterly arbitrary a lot of what we think the “Web” is, happens to be.

There’s no question that commercial interests have driven a lot of browser features – the ability to transact financially, and to guarantee the prices or offers you are being shown, is of primary interest to vendors. Encryption, password protection, multi-factor authentication and so on are sometimes given lip service for private communications, but they’ve historically been presented for the store to ensure the cash register works. From the early days of a small padlock icon being shown locked or unlocked to indicate “safe”, to official “badges” or “certifications” being part of a webpage, the browsers have frequently shifted their character to promise commercial continuity. (The addition of “black box” code to browsers to satisfy the ability to stream entertainment is a subject for another time.)

Flowing from this same thinking has been the overriding need for design control, where the visual or interactive aspects of webpages are the same for everyone, no matter what browser they happen to be using. Since this was fundamentally impossible in the early days (different browsers have different “looks” no matter what), the solutions became more and more involved:

  • Use very large image-based mapping to control every visual aspect
  • Add a variety of specific binary “plugins” or “runtimes” by third parties
  • Insist on adoption of a number of extra-web standards to control the look/action
  • Demand all users use the same browser to access the site

Evidence of all these methods pops up across the years, with varying success.

Some of the more well-adopted methods include the Flash runtime for visuals and interactivity, and the use of Java plugins for running programs within the confines of the browser’s rectangle. Others, such as the wide use of Rich Text Format (.RTF) for reading documents, or the Realaudio/video plugins, gained followers or critics along the way, and ultimately faded into obscurity.

And as for demanding all users use the same browser… well, that still happens, but not with the same panache as the old Netscape Now! buttons.


This puts the Internet Archive into a very interesting position.

With 20 years of the World Wide Web saved in the Wayback Machine, and URLs by the billions, we’ve seen the moving targets move, and how fast they move. Where a site previously might be a simple set of documents and instructions that could be arranged however one might like, there is now a whole family of sites with much more complicated inner workings than will be captured by any external party, in the same way you would capture a museum by photographing its paintings through a window from the courtyard.

When you visit the Wayback and pull up that old site and find things look different, or are rendered oddly, that’s a lot of what’s going on: weird internal requirements, experimental programming, or tricks and traps that only worked in one brand of browser and one version of that browser from 1998. The lens shifted; the mirror has cracked since then.

This is a lot of philosophy and stray thoughts, but what am I bringing this up for?

The browsers that we use today, the Firefoxes and the Chromes and the Edges and the Braves and the mobile white-label affairs, are ever-shifting in their own right, more than ever before, and should be recognized as such.

It was inevitable that constant-update paradigms would become dominant on the Web: you start a program and it does something and suddenly you’re using version 54.01 instead of version 53.85. If you’re lucky, there might be a “changes” list, but that luck might vary because many simply write “bug fixes”. In these updates are the closing of serious performance or security issues – and as someone who knows the days when you might have to mail in for a floppy disk to be sent in a few weeks to make your program work, I can totally get behind the new “we fixed it before you knew it was broken” world we live in. Everything does this: phones, game consoles, laptops, even routers and medical equipment.

But along with this shifting of versions comes the occasional fundamental change in what browsers do, along with making some aspect of the Web obsolete in a very hard-lined way.

Take, for example, Gopher, a (for lack of an easier description) proto-web that allowed machines to be “browsed” for information that would be easy for users to find. The ability to search, to grab files or writings, and to share your own pools of knowledge were all part of the “Gopherspace”. It was also rather non-graphical by nature and technically oriented at the time, and the graphical “WWW” utterly flattened it when the time came.

But since Gopher had been a not-insignificant part of the Internet when web browsers were new, many of them would wrap in support for Gopher as an option. You’d use the gopher:// URI, and much like the ftp:// or file:// URIs, it co-existed with http:// as a method for reaching the world.

Until it didn’t.

Microsoft, citing security concerns, dropped Gopher support out of its Internet Explorer browser in 2002. Mozilla, after a years-long debate, did so in 2010. Here’s the Mozilla Firefox debate that raged over Gopher Protocol removal. The functionality was later brought back externally in the form of a Gopher plugin. Chrome never had Gopher support. (Many other browsers have Gopher support, even today, but they have very, very small audiences.)

The Archive has an assembled collection of Gopherspace material here.  From this material, as well as other sources, there are web-enabled versions of Gopherspace (basically, http:// versions of the gopher:// experience) that bring back some aspects of Gopher, if only to allow for a nostalgic stroll. But nobody would dream of making something brand new in that protocol, except to prove a point or for the technical exercise. The lens has refocused.

In the present, Flash is beginning a slow, harsh exile into the web pages of history – browser support is dropping, and even Adobe is whittling away support and upkeep of all of Flash’s forward-facing projects. Flash was a very big deal in its heyday – animation, menu interfaces, games, and a whole other host of what we think of as “The Web” depended utterly on Flash, and even on specific versions and variations of Flash. As the sun sets on this technology, attempts to still be able to view it, like the Shumway project, will hopefully allow the lens a few more years of being able to see this body of work.

As we move forward in this business of “saving the web”, we’re going to experience “save the browsers”, “save the network”, and “save the experience” as well. Browsers themselves drop or add entire components or functions, and being able to touch older material becomes successively more difficult, especially when you might have to use an older browser with security issues. Our in-browser emulation might be a solution, or special “filters” on the Wayback for seeing items as they were back then, but it’s not an easy task at all – and it’s a lot of effort to see information that is just a decade or two old. It’s going to be very, very difficult.

But maybe recognizing these browsers for what they are, and coming up with ways to keep these lenses polished and flexible, is a good way to start.

Those Hilarious Times When Emulations Stop Working

Jason Scott, Software Curator and Your Emulation Buddy, writing in.

With tens of thousands of items in the archive.org stacks that are in some way running in-browser emulations, we’ve got a pretty strong library of computing history afoot, with many more joining in the future. On top of that, we have thousands of people playing these different programs, consoles, and arcade games from all over the world.

Therefore, if anything goes slightly amiss, we hear it from every angle: twitter, item reviews, e-mails, and even the occasional phone call. People expect to come to a software item on the Internet Archive and have it play in their browser! It’s great this expectation is now considered a critical aspect of computer and game history. But it also means we have to go hunting down what the problem might be when stuff goes awry.

Sometimes, it’s something nice and simple, like “I can’t figure out the keys or the commands” or “How do I find the magic sock in the village?”, which puts us in the position of a sort of 1980s Software Company Help Line. Other times, it’s helping fix situations where some emulated software is configured wrong and certain functions don’t work. (The emulation might run too fast, or show the wrong colors, or not work past a certain point in the game.)

But then sometimes it’s something like this:

[screenshot: an emulated DOS program reporting a “Runtime” error]

In this case, a set of programs were all working just fine a while ago, and then suddenly started sending out weird “Runtime” errors. Or this nostalgia-inducing error:

[screenshot: a nostalgia-inducing DOS-era error message]

Here’s the interesting thing: The emulated historic machine would continue to run. In other words, we had a still-functioning, emulated broken machine, as if you’d brought home a damaged 486 PC in 1993 from the store and realized it was made of cheaper parts than you expected.

To make things even more strange, this was only happening to emulated DOS programs in the Google Chrome browser. And only Google Chrome version 51.x. And only in the 32-bit version of Google Chrome 51.x. (A huge thanks to the growing number of people who helped this get tracked down.)

This is what people should have been seeing, which I think we can agree looks much better:

[screenshot: the same emulated program running correctly]

The short-term fix is to run Firefox instead of Chrome for the moment if you see a crash, but that’s not really a “fix” per se – Chrome has had the bug reported to them and they’re hard at work on it (and working on a bug can be a lot of work). And there’s no guarantee an update to Firefox (or the Edge Browser, or any of the other browsers working today) won’t cause other weird problems going down the line.

All this, then, can remind people how strange, how interlocking, and even fragile our web ecosystem is at the moment. The “Web” is a web of standards dancing with improvisations, hacks, best guesses and a radically moving target of what needs to be obeyed and discarded. With the automatic downloading of new versions of browsers from a small set of makers, we gain security, but more-obscure bugs might change the functioning of a website overnight. We make sure the newest standards are followed as quickly as possible, but we also wake up to finding out an old trusted standard was deemed no longer worthy of use.

Old standards or features (background music in web pages, the gopher protocol, Flash) give way to new plugins or processes, and the web must be expected, as best it can, to deal with the new and the old and fail gracefully when it can’t quite do it. Part of the work of the Decentralized Web Summit was to bring forward the strengths of this world (collaboration, transparency, reproducibility) while pulling back from the weaknesses of this shifting landscape (centralization, gatekeeping, utter and total loss of history), and it’s obvious a lot of people recognize this is an ongoing situation, needing vigilance and hard work.

In the meantime, we’ll do our best to keep track of how the latest and greatest browsers deal with the still-fresh world of in-browser emulation, and try to emulate hardware that did come working from the factory.

In the meantime, enjoy some Apple II programs. On us.

Distributed Preservation Made Simple

Library partners of the Internet Archive now have at their fingertips an easy way – from a Unix-like command line in a terminal window – to download digital collections for local preservation and access.

This post will show how to use an Internet Archive command-line tool (ia) to download all items in a collection stored on Archive.org, and to keep a local collection in sync with the Archive.org collection.

To use ia, the only requirement is to have Python 2 installed on a Unix-like operating system (e.g. Linux, Mac OS X). Python 2 is pre-installed on Mac OS X and most Linux systems, so there is nothing more that needs to be done, except to open up a terminal and follow these steps:

1.  Download the latest binary of the ia command-line tool by running the following command in your terminal:

curl -LO https://archive.org/download/ia-pex/ia

2. Make the binary executable:

chmod +x ia

3. Make sure you have the latest version of the binary, version 1.0.0:

./ia --version

4. Configure ia with your Archive.org credentials (this step is only needed if you require privileges to access certain items):

./ia configure

5. Download a collection:

./ia download --search 'collection:solarsystemcollection'

or

./ia download --search 'collection:JangoMonkey'

The above commands to “Download a collection” will download all files from all items from the band JangoMonkey or the NASA Solar System collection, respectively. If re-run, the command will by default skip over any files already downloaded, much as rsync does, which can help keep your local collection in sync with the collection on Archive.org.

If you would like to download only certain file types, you can use the --glob option. For example, if you only wanted to download JPEG files, you could use a command like:

./ia download --search 'collection:solarsystemcollection' --glob '*.jpeg|*.jpg'

Note that by default ia will download files into your current working directory. If you launch a terminal window without moving to a new directory, the files will be downloaded to your user directory. To download to a different directory, you can either cd into that directory or use the “--destdir” parameter like so:

mkdir solarsystemcollection-collection

./ia download --search 'collection:solarsystemcollection' --destdir solarsystemcollection-collection
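The ia binary is built from the internetarchive Python library, so the same download can also be scripted directly in Python. Here is a rough sketch, assuming the library is installed (pip install internetarchive); the parameter names are as I recall them from the library’s documentation, so double-check them against your installed version:

from internetarchive import search_items, download

# Download JPEG files from every item in the collection into a local directory,
# skipping files that already exist locally (similar to re-running ./ia download).
for result in search_items("collection:solarsystemcollection"):
    download(result["identifier"],
             glob_pattern="*.jpg",
             destdir="solarsystemcollection-collection",
             verbose=True)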

Downloading in Parallel

GNU Parallel is a powerful command-line tool for executing jobs in parallel. When used with ia, downloading items in parallel is as easy as:

./ia search 'collection:solarsystemcollection' --itemlist | parallel --no-notice -j4 './ia download {} --glob="*.jpg|*.jpeg"'

The -j option controls how many jobs run in parallel (i.e. how many files are downloaded at a time). Depending on the machine you are running the command on, you might get better performance by increasing or decreasing the number of simultaneous jobs. By default, GNU Parallel will run one job per CPU.

GNU Parallel can be installed with Homebrew on Mac OS X (i.e.: brew install parallel), or your favorite package manager on Linux (e.g. on Ubuntu: apt-get install parallel, on Arch Linux: pacman -S parallel, etc.). For more details, please refer to: https://www.gnu.org/software/parallel/

For more options and details, use the following command:

./ia download --help

Finally, to see what else the ia command-line tool can do:

./ia --help

Documentation of the ia command-line tool is available at: https://internetarchive.readthedocs.org/en/latest/cli.html

There you have it. Library partners, download and store your collections now using this command-line tool from the Internet Archive. If you have any questions or issues, please write to info (at) archive.org. We are trying to make distributed preservation simple and easy!

 

archive.org download counts of collections of items: updates and fixes

Every month, we look over the total download counts for all public items at archive.org.  We sum item counts into their collections.  At year end 2014, we found various source reliability issues, as well as overcounting for “top collections” and many other issues.

archive.org public items tracked over time

To address these problems, we:

  • Rebuilt the counting system to use our database (DB) for item download counts, instead of our less reliable (and more prone to “drift”) SOLR search engine (SE)
  • Changed monthly saved data from JSON and PHP serialized flatfiles to new DB table — much easier to use now!
  • Fixed overcounting issues for collections: texts, audio, etree, movies
  • Fixed various overcounting issues related to not unique-ing <collection> and <contributor> tags (more below)
  • Fixed character encoding issues on <contributor> tags

Bonus points!

  • We now track *all collections*.  Previously, we only tracked items tagged:
    • <mediatype> texts
    • <mediatype> etree
    • <mediatype> audio
    • <mediatype> movies
  • For items where we track <contributor> tags (texts items), we now have a “Contributor page” that shows a table of historical data.
  • Graphs are now “responsive” (scale in width based on browser/mobile width)

 

The Overcount Issue for top collection/mediatypes

  • In the below graph, mediatypes and collections are shown horizontally, with a sample “collection hierarchy” today.
  • For each collection/mediatype, we show one example item (A, B, C, and D), with a downloads/streams/views count next to it parenthetically. So these are four items, spanning four collections, that happen to be in a collection hierarchy (a single item can belong to multiple collections at archive.org).
  • The Old Way had a critical flaw — it summed all sub-collection counts — when really it should have just summed all *direct child* sub-collection counts (or gone with our New Way instead)

[diagram: example collection hierarchy illustrating the overcount issue]

So we now treat <mediatype> tags like <collection> tags for counting purposes, and de-duplicate all <collection> tags, so that items with minor non-ideal data tags don’t cause another kind of overcounting.
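Here is a small sketch of the “New Way” of counting (illustrative only; the item identifiers and counts are made up). Each item carries the full list of collections it belongs to, the collection tags are de-duplicated per item, and each item then adds its count to each of its collections exactly once:

from collections import defaultdict

# item -> (download count, collection/mediatype tags as found in the metadata,
#          possibly with duplicates)
items = {
    "item-A": (10, ["texts", "subcollection", "parent-collection", "subcollection"]),
    "item-B": (5,  ["texts", "parent-collection"]),
}

collection_counts = defaultdict(int)
for downloads, collections in items.values():
    for collection in set(collections):   # unique-ing the <collection> tags
        collection_counts[collection] += downloads

print(dict(collection_counts))
# e.g. {'texts': 15, 'subcollection': 10, 'parent-collection': 15} (key order may vary)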

 

… and one more update from Feb/1:

We graph the “difference” between the absolute download counts for the current month and the prior month, for each month we have data for. This gives us graphs that show downloads/month over time. However, values can easily go *negative* in various scenarios (which is *wickedly* confusing to our poor users!)

Here’s that situation:

A collection has a really *hot* item one month, racking up downloads for that collection. The next month, a DMCA takedown or some other event removes the item from being available (and thus from being counted in the future). The downloads for that collection can plummet in the next month’s run, when the counts are summed over that collection’s public items again. So that collection would have a negative (net) downloads count change for the next month!

Here’s our fix:

Use the current month’s collection “item membership” list for both the current month *and* the prior month. Sum the counts for all of those items for both months, and graph the difference between those two sums. In just about every situation that remains, the graphed monthly download counts will be nonnegative (zero or positive).
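In code, the fix looks roughly like this (an illustrative sketch, not the production job; the data structures are invented for the example):

def monthly_downloads_change(cumulative_counts, membership, month, prior_month):
    """cumulative_counts: month -> {item_id: cumulative download count}.
    membership: month -> set of item_ids publicly in the collection that month."""
    # Use the *current* month's membership list for both months, so an item
    # that was removed (e.g. via a takedown) cannot drag the difference negative.
    current_items = membership[month]
    current_total = sum(cumulative_counts[month].get(i, 0) for i in current_items)
    prior_total = sum(cumulative_counts[prior_month].get(i, 0) for i in current_items)
    return current_total - prior_total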