Category Archives: Image Archive

archive.org download counts of collections of items updates and fixes

Every month, we look over the total download counts for all public items at archive.org.  We sum item counts into their collections.  At year end 2014, we found various source reliability issues, as well as overcounting for “top collections” and many other issues.

archive.org public items tracked over time

To address the problems, we:

  • Built a new system that uses our database (DB) for item download counts, instead of our less reliable (and more prone to “drift”) SOLR search engine (SE).
  • Moved the saved monthly data from JSON and PHP serialized flat files to a new DB table, which is much easier to use!
  • Fixed overcounting issues for collections: texts, audio, etree, movies
  • Fixed various overcounting issues caused by not de-duplicating <collection> and <contributor> tags (more below)
  • Fixed character encoding issues in <contributor> tags

Bonus points!

  • We now track *all collections*.  Previously, we only tracked items tagged:
    • <mediatype> texts
    • <mediatype> etree
    • <mediatype> audio
    • <mediatype> movies
  • For items whose <contributor> tags we track (texts items), we now have a “Contributor page” that shows a table of historical data.
  • Graphs are now “responsive” (scale in width based on browser/mobile width)

 

The Overcount Issue for top collection/mediatypes

  • In the graph below, mediatypes and collections are shown horizontally, with a sample “collection hierarchy” as it exists today.
  • For each collection/mediatype, we show one example item (A, B, C, and D), with its downloads/streams/views count next to it in parentheses. So these are four items, spanning four collections, that happen to be in a collection hierarchy (a single item can belong to multiple collections at archive.org).
  • The Old Way had a critical flaw: it summed all sub-collection counts, when it should have summed only the *direct child* sub-collection counts (or gone with our New Way instead).

overcount

So we now treat <mediatype> tags like <collection> tags for counting purposes, and we de-duplicate all <collection> tags on an item, so that items with slightly non-ideal metadata (such as a repeated tag) do not cause another kind of overcounting.
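As a rough illustration of that counting rule, here is a minimal Python sketch. It is not our production code: the function name, the dictionary layout, and the sample items are all hypothetical, and it assumes each item record lists every collection it belongs to. Each item's count is added once to each distinct tag, with the mediatype treated as just another collection tag.

```python
from collections import defaultdict


def sum_downloads_by_collection(items):
    """Add each item's downloads once to every distinct collection/mediatype tag."""
    totals = defaultdict(int)
    for item in items:
        # Treat the mediatype like a collection tag, and use a set so that
        # a repeated tag on one item is only counted once.
        tags = set(item["collections"]) | {item["mediatype"]}
        for tag in tags:
            totals[tag] += item["downloads"]
    return dict(totals)


# Hypothetical sample data: item A carries a duplicated "americana" tag.
items = [
    {"id": "A", "mediatype": "texts",
     "collections": ["americana", "texts", "americana"], "downloads": 10},
    {"id": "B", "mediatype": "texts",
     "collections": ["americana"], "downloads": 5},
]

totals = sum_downloads_by_collection(items)
print(totals)
```

Because every item contributes directly to each collection it is tagged with, no parent collection ever re-adds the totals of its sub-collections, which is where the Old Way went wrong.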

 

… and one more update from Feb/1:

We graph the difference between the absolute download counts for the current month and the prior month, for each month we have data for.  This gives us graphs that show downloads/month over time.  However, values can easily go *negative* in various scenarios (which is *wickedly* confusing to our poor users!)

Here’s that situation:

A collection has a really *hot* item one month, racking up downloads for that collection.  The next month, a DMCA takedown (or something similar) makes the item unavailable (and thus no longer counted).  The collection’s total can then plummet in the next month’s run, when the counts are summed over its public items again.  So that collection would show a negative (net) download count change for that month!

Here’s our fix:

Use the current month’s collection “item membership” list for both the current month *and* the prior month.  Sum the counts for all those items in each month, and graph the difference between the two sums.  In just about every remaining situation, graphed monthly download counts will be nonnegative (positive or zero), because an individual item’s cumulative count normally only grows or stays the same.
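The fix can be sketched in a few lines of Python. This is only an illustration under assumed data shapes (a set of item identifiers and per-month dictionaries of cumulative counts); the names `monthly_downloads` and `naive_monthly_downloads` and the numbers are hypothetical, not taken from our system.

```python
def naive_monthly_downloads(current_counts, prior_counts):
    """Old approach: each month is summed over whatever was public that month."""
    return sum(current_counts.values()) - sum(prior_counts.values())


def monthly_downloads(current_members, current_counts, prior_counts):
    """New approach: sum both months over the current month's membership list."""
    current_total = sum(current_counts.get(item, 0) for item in current_members)
    # Items with no prior-month count (for example, newly added ones) count as 0.
    prior_total = sum(prior_counts.get(item, 0) for item in current_members)
    return current_total - prior_total


# Hypothetical cumulative counts: "hot" was taken down before this month's run.
prior_counts = {"hot": 1000, "a": 50, "b": 20}
current_counts = {"a": 60, "b": 25}
current_members = {"a", "b"}

naive = naive_monthly_downloads(current_counts, prior_counts)
fixed = monthly_downloads(current_members, current_counts, prior_counts)
print(naive, fixed)
```

With these sample numbers the naive difference is negative, because the removed item's 1000 downloads vanish from the current sum, while the fixed version only compares the items that are still in the collection.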

 

 

Archive-It Team Encourages Your Contributions To The “Occupy Movement” Collection

Since September 17th, 2011 when protesters descended on Wall Street, set up tents, and refused to move until their voices were heard, an impassioned plea for economic and social equality has manifested itself in similar protests and demonstrations around the world. Inspired by “Occupy Wall Street (OWS)”, these global protests and demonstrations are collectively now being referred to as the “Occupy Movement”.

In an effort to document these historic, and politically and socially charged, events as they unfold, IA’s Archive-It team has recently created an “Occupy Movement” collection to begin capturing information about the movement found online. With blogs communicating movement ideals and demands, social media used to coordinate demonstrations, and news-related websites portraying the movement from a dizzying variety of angles, the presence and representation of the Occupy Movement online is both hugely valuable to our understanding of the movement as a whole and constantly in flux and at risk.

The value of the collection hinges on the diversity, depth, and breadth of our seeds and websites we crawl. We are asking and encouraging anyone with websites they feel are important to archive, sites that tell a story about the movement, to pass them along and we will add them to the Occupy Movement collection. These might include movement-wide or city-specific websites, sites with images, blogs, YouTube videos, even Twitter accounts of individuals or organizations involved with the movement. No ideas or additions are too small or too large; perhaps your ideas or suggestions will be a unique part of the movement not yet represented in our collection. IA Archive-It friends and partners are already sending in seeds, which we greatly appreciate.

The web content captured in this collection will be included in the General Archive collection, which has been actively collecting materials on the Occupy Movement for a few months: http://www.archive.org/details/occupywallstreet

Please send any seed suggestions, questions, or comments to Graham at graham@archive.org.

The Awesomeness of Yosemite

Just back from a stay in Yosemite Valley. Just awesome…as it always has been.

So of course I came back and had to check on some of the history and other interesting information about the valley at the Archive. There’s a wealth of stuff to be found by simply searching “yosemite”.


This one from 1905 is one of the earliest with photos.  Lots of changes in the man-made aspects of the valley but not to the natural landforms that are so familiar and, well, awesome in the real sense of the word.

http://www.archive.org/details/discoveryofyosem01bunn
This 3rd edition from 1897 gives an account of the Indian Wars that led to the discovery of Yo-Semite.

http://www.archive.org/details/yosemiteitshisto00lest
This one from 1873 might be the oldest book we have on Yosemite.

http://www.archive.org/details/yosemite00unkngoog
And of course there is “The Yosemite” by John Muir from 1912.

A great trip as always. The valley may be more crowded than a century ago but the experience is still inspiring and …awesome.

-Jeff Kaplan

NASA on The Commons

From nasaimages.org, a service of Internet Archive:

Internet Archive, NASA, and Flickr are together launching NASA on The Commons, a new way to view and interact with photos from NASA. NASA on The Commons invites the public to contribute information and knowledge to curated photo sets provided by nasaimages.org.  Visitors will be able to add tags, keywords, and annotations to three compilations of images curated by the New Media Innovation Team at NASA Ames and NASA photography and history experts across the Agency. The three collections, spanning more than half a century of NASA history, include: Launch and Takeoff, Building NASA, and Center Namesakes.

“NASA’s long-standing partnership with Internet Archive and this new one with Yahoo!’s Flickr provides an opportunity for the public to participate in the process of discovery,” said Debbie Rivera, lead for the NASA Images project at the agency’s headquarters in Washington. “In addition, the public can help the agency capture historical knowledge about missions and programs through this new resource and make it available for future generations.”

NASA on The Commons will make the NASA Images collection accessible to a wider audience while improving the information that accompanies these images with the help of the public.
Read more about NASA on The Commons.

New NASA Images Guest Showcase: June Lockhart

From NASA Images:

NASA Images is proud to welcome June Lockhart to the Guest Showcase line-up in June Lockhart: The NASA Diaries.

June has been involved with NASA for years.  She has attended shuttle launches, opened the Kids Space Museum at the Johnson Space Center, and helped NASA celebrate the 40th anniversary of Apollo. She has been a long time supporter of all things NASA. June takes this opportunity to share some of her dearest memories of her relationship with the space agency:

“’…There’s a new sunrise in space every hour and a half – so the song would be very appropriate.’ It was astronaut Ken Reightler speaking in response to my suggestion that my father’s song ‘The World is Waiting for the Sunrise’ would be a good wake-up tune for the astronauts on the upcoming shuttle flight Columbia. We played the Les Paul and Mary Ford version of the song just after 2 a.m. on October 27, 1992. I was there in the mission control viewing room and listened to the music fill the sky. My father would have loved it. My eyes filled with tears…”

“…Bill McArthur and I shared some phone calls during his time on the ISS in 2005. On December 16, I went to JPL for a video conference. We sent films and photos and a poster of me in my space suit from ‘Lost In Space’ which he had posted on the wall of the ISS. He said I was the first pin-up in outer space!”

Check out June Lockhart: The NASA Diaries at nasaimages.org

NASA Images selected as one of MARS Best Free Reference Web Sites of 2010

From NASA Images blog:

NASA Images has been selected as one of the MARS Best Free Reference Web Sites of 2010, an annual series initiated under the auspices of the Machine-Assisted Reference Section (MARS) of the Reference and User Services Association (RUSA) of the American Library Association (ALA) to recognize outstanding reference sites on the World Wide Web. This year’s list consists of 30 sites recognized by MARS as outstanding for reference information. View the list here.

Kudos to the NASA Images team: Jon Hornstein, Jake Johnson, Greg Williamson and Samantha O’Connell.

-Jeff Kaplan

"Houston, we’ve had a problem"

These now-famous words were spoken by Jim Lovell in 1970 during the ill-fated Apollo 13 flight. There was a reunion of astronauts and control crew to celebrate the 40th anniversary. NASAimages has many great photos and videos from the flight. Here are a few of my favorites.

Video:
The news bulletin.
The duct tape fix!
Re-entry and recovery!

Images:
Tense ground control.
Success celebrated on the ground!

Check out more at NASAimages.org.

-Jeff Kaplan

NASA partners with Internet Archive to archive digital imagery

From Jon Hornstein at Internet Archive’s NASA images:

NASA gave a nice shout-out to the Internet Archive for helping them address their Open Government Initiative requirements. http://www.nasa.gov/open/plan/records-management.html

Here’s a couple of choice quotes . . .

“. . . (the Internet Archive) serves as custodian of much of NASA’s current and legacy digital imagery records. In addition, IA will help digitize NASA’s historically significant, analog images for inclusion on the Web site, enabling digital archiving with the National Archives and greater public access to these records via the IA Website.”

“Strictly on its own initiative, IA recently began to capture NASA’s publicly posted social media content. NASA is considering exploration of how this activity might be leveraged for records management purposes.”

There’s always cool stuff to be discovered at NASA images: http://nasaimages.org

-Jeff Kaplan