Monthly Archives: October 2012

News from the Archive 0005: BBC Visit, Rocketship X-M, and Alice

No. 5, 31 October 2012

A BBC film crew visited the Internet Archive; here’s their story.

In addition, the San Francisco Chronicle did a nice profile of our work:

http://www.sfgate.com/default/article/Brewster-Kahle-s-Internet-Archive-3946898.php

From the Archive’s Mailbox

I’ve just downloaded an image file (various galaxies in their vast array) from your NASA Images pages to use on the jacket of my new SF novel for preteens, The Calling.

http://archive.org/details/nasa

I appreciate your open policy of not copyrighting these images but allowing people to use them with a simple acknowledgement (which I have added).

—John Peace

We’re glad to help, but the availability of NASA imagery is determined by the space agency.

http://nasaimages.org/Terms.html

Selected Collection: Crap from the Past

This is a pop music radio show for people who already know plenty about pop music. Hosted by Ron “Boogiemonster” Gerber, it’s broadcast Friday nights from 10:30 to midnight on KFAI, Minneapolis. This collection of over twelve hundred recordings goes back two decades, a millennium, or “since the days of DOS,” depending on how you slice it.

http://archive.org/details/crapfromthepast

Other Picks from the Archive

Rocketship X-M (1950)

Rocketship X-M landed on the red planet over sixty years before NASA’s Mars Curiosity rover touched down there recently. Hollywood years, that is. Rocketship X-M is the story of five astronauts (played by Lloyd Bridges, Osa Massen, John Emery, Noah Beery, Jr., and Hugh O’Brian) who blast off to explore the moon but end up on Mars instead. Stay tuned for the ending …

http://archive.org/details/RocketshipXM 

— recommended by Emilio Conseco

Through The Looking-Glass (and what Alice found there), Lewis Carroll

This is a first edition “Presentation Copy” of the follow-up to Alice In Wonderland. Not only is this a personal favorite that blew my mind when I first read it some years ago, but this is a first edition copy in excellent condition with fifty of the original illustrations by John Tenniel. I don’t need to describe the impact this book had on literature, but what makes this copy so fascinating to me is that inside the front cover is a note in the author’s own hand, “Emma Vine, with the author’s kind regards. Christmas 1871.” There is also a penciled-in note saying that Emma Vine was Lewis Carroll’s nursemaid. This was very exciting for me to discover, and I can’t believe I was able to see something like this with my own eyes, a real literary treasure.

http://archive.org/details/throughlookinggl01carr

— recommended by Gemma Waterston

Music That’s Better Than It Sounds

This collection of thirty-four pieces (songs?) by Forty0ne really is better than it sounds.

And the liner notes aren’t bad either!

http://archive.org/details/csr041

— recommended by Helen Temnesen


What are your Archive favorites? Please suggest a link or two and a few words about why you appreciate your recommendation to:

bestof@archive.org

—David Glenn Rinehart

/ / / / /

To subscribe to this list, please visit:

http://archive.org/account/login.changepw.php

If you don’t already have a free Internet Archive library card, you may get yours here:

http://archive.org/account/login.createaccount.php

There, enter your password into the “Change Your Account Settings” option, then click on the “Verify” button. That will bring you to your account settings page, where you may change your subscription status in the “Change Announcement Settings” section.

If the above URL is inoperable, make sure that you have copied the entire address. Some mail readers will wrap a long URL, breaking the link.

If you’re still having trouble, please contact the list owner at:

info@archive.org

/ / / / / / /

David Glenn Rinehart is an artist in residence at the Internet Archive as well as a cartoonist, composer, filmmaker, musician, and writer. His work is at http://stare.com/ and elsewhere.

getting only certain formats in .zip files from items — new feature

Per some requests from our friends in the Live Music Archive community…

You can get any archive.org item downloaded to your local machine as a .zip file (we’ve been doing that for 5+ years!).
Before, the .zip would contain all files and formats;
now you can be selective and pick *just* certain formats.

We’ll put links up on audio item pages, at a minimum, but the URL pattern is simple for any item.
It looks like this (replace IDENTIFIER with the identifier of your item, e.g. the part after archive.org/details/):

http://archive.org/compress/IDENTIFIER

for the entire item, and for just certain formats:

http://archive.org/compress/IDENTIFIER/formats=format1,format2,format3,….

Example:


wget -q -O - 'http://archive.org/compress/ellepurr/formats=Metadata,Checksums,Flac' > zip; unzip -l zip
Archive:  zip
  Length      Date    Time    Name
---------  ---------- -----   ----
  1107614  2012-10-30 19:49   elle.flac
       44  2012-10-30 19:49   ellepurr.md5
     3114  2012-10-30 19:49   ellepurr_files.xml
      693  2012-10-30 19:49   ellepurr_meta.xml
      602  2012-10-30 19:49   ellepurr_reviews.xml
---------                     -------
  1112067                     5 files
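
If you build these URLs often, a small shell helper can assemble them for you. This is only a sketch of the pattern described above: the function name compress_url is made up for illustration, and it assumes format names can be joined with commas exactly as in the example (names containing spaces or other special characters would likely need URL-encoding first).

```shell
#!/bin/sh
# Hypothetical helper (not part of archive.org): print the "compress" URL
# for an item, optionally restricted to the given formats.
# Usage: compress_url IDENTIFIER [FORMAT ...]
compress_url() {
    identifier=$1
    shift
    url="http://archive.org/compress/${identifier}"
    if [ "$#" -gt 0 ]; then
        # Join the remaining arguments with commas (runs in a subshell,
        # so the caller's IFS is left untouched).
        formats=$(IFS=,; printf '%s' "$*")
        url="${url}/formats=${formats}"
    fi
    printf '%s\n' "$url"
}
```

With that in place, the example above could be written as: wget -q -O zip "$(compress_url ellepurr Metadata Checksums Flac)"; unzip -l zip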

Enjoy!!

Internet Archive joins Open Wireless Movement

We are excited to join the Electronic Frontier Foundation and other open-minded organizations in the Open Wireless Movement. We have long believed that there should be many low-cost options for getting access to the Internet. Individuals and organizations sharing their WiFi networks with their neighbors can be one such option. The Open Wireless Movement shows how to do that safely and legally.

The Internet Archive has offered free open outdoor unrestricted WiFi since 1998 using 3 generations of equipment.   Currently we serve users in San Francisco libraries and about 5,000 families in housing projects as well as our neighbors in Richmond, California and San Francisco.

Fast and Free.

10,000,000,000,000,000 bytes archived!

Ten Petabytes (10,000,000,000,000,000 bytes) of cultural material saved!

On Thursday, 25 October, hundreds of Internet Archive supporters, volunteers, and staff celebrated the addition of the 10,000,000,000,000,000th byte to the Archive’s massive collections.

We also announced

Computer Science legend Don Knuth played the Archive’s organ to open the program.

The only thing missing was electricity; the building lost all power just as the presentation was to begin. Thanks to the creativity of the Archive’s engineers and a couple of ridiculously long extension cords that reached a nearby house, the show went on.

Video of the show thanks to Jonathan Minard:

 

80 terabytes of archived web crawl data available for research

Internet Archive crawls and saves web pages and makes them available for viewing through the Wayback Machine because we believe in the importance of archiving digital artifacts for future generations to learn from. In the process, of course, we accumulate a lot of data.

We are interested in exploring how others might be able to interact with or learn from this content if we make it available in bulk.  To that end, we would like to experiment with offering access to one of our crawls from 2011 with about 80 terabytes of WARC files containing captures of about 2.7 billion URIs.  The files contain text content and any media that we were able to capture, including images, flash, videos, etc.

What’s in the data set:

  • Crawl start date: 09 March, 2011
  • Crawl end date: 23 December, 2011
  • Number of captures: 2,713,676,341
  • Number of unique URLs: 2,273,840,159
  • Number of hosts: 29,032,069
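
For a rough sense of scale, the figures above can be turned into a few ratios. This is a back-of-the-envelope sketch only: the function name crawl_ratios is made up for illustration, and it assumes that "about 80 terabytes" means 80 × 10^12 bytes, so the per-capture size is an approximation.

```shell
#!/bin/sh
# Figures copied from the data set summary above.
captures=2713676341
unique_urls=2273840159
hosts=29032069
# Assumption: "about 80 terabytes" is taken as exactly 80 * 10^12 bytes.
total_bytes=80000000000000

# Print simple ratios derived from the summary figures.
crawl_ratios() {
    awk -v c="$captures" -v u="$unique_urls" -v h="$hosts" -v b="$total_bytes" 'BEGIN {
        printf "captures per unique URL: %.2f\n", c / u
        printf "captures per host: %.1f\n", c / h
        printf "average kilobytes per capture: %.1f\n", b / c / 1000
    }'
}

crawl_ratios
```

Under those assumptions this works out to roughly 1.19 captures per unique URL, about 93.5 captures per host, and about 29.5 kilobytes per capture.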

The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date. We used Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives. The scope of the crawl was not limited except for a few manually excluded sites. However, this was a somewhat experimental crawl for us, as we were using newly minted software to feed URLs to the crawlers, and we know there were some operational issues with it. For example, in many cases we may not have crawled all of the embedded and linked objects in a page, since the URLs for these resources were added into queues that quickly grew bigger than the intended size of the crawl (and therefore we never got to them). We also included repeated crawls of some Argentinian government sites, so results broken down by country will be somewhat skewed. We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available “warts and all” for people to experiment with. We have also done some further analysis of the content.

Hosts Crawled pie chart

If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it.  We may not be able to say “yes” to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.

 

Siteless Website Possible? If bittorrent is a fileserver without a server, what about a website without a site?

 

Bittorrent is a system that makes a fileserver that does not have a server.

I find the idea of calling for a file, given a handle, from the community of Internet users pretty interesting. It allows a community to build a collection of materials without needing a central organization. The Internet Archive is now leveraging the bittorrent technology to build distributed access, but the next step is to build distributed preservation. To support the library application, we think the technology needs to be tweaked, but that seems doable.

What about a website without a site?

A next step would be to build a website that did not have a singular home computer: a system, maybe, that had the functionality of WordPress (pages, searchability, updating, users, etc.) but did not exist on a single computer.

What would this be?   A siteless website?

Then this site could be supported by a number of people, and by a shifting set of people over time.

It might remove some of the fragility of the current web:  when the originating host is taken down, then a whole community loses out.

The Internet Archive takes snapshots for the Wayback Machine, but this is different and not nearly as good.

Is this impossible or just hard?

-brewster

brewster@archive.org

 

 

Launch of the DigiBaeck Project

DigiBaeck

The Internet Archive, working with the Leo Baeck Institute, is pleased to be a part of the Oct 16, 2012 launch of their DigiBaeck project, a massive (formerly print) archival collection of history pertaining to German-speaking Jewry.

Robert Miller, Global Director of Books for the Internet Archive, states that “digitizing over 4,000 linear feet of material whose scope ran the gamut of post cards from Berlin to letters from Auschwitz was both empowering and humbling at the same time.” He continues, “The family of one of my staff, who worked on the collection, was from Poland and suffered terribly during the Holocaust. Being able to assist in putting these original documents online was cathartic for her.”

The Leo Baeck Institute helped teach Miller’s teams in Princeton, NJ, and San Francisco, CA, how to work with and handle unique and high-value archival materials. And he and his staff helped teach Leo Baeck how to move from print to on-line pixels. It was a true partnership in every sense of the word.

Brewster Kahle, founder of the Internet Archive, states, “it is collections going public like Leo Baeck’s that remind us of the adage that collections that remain private or not digital are for all intents and purposes extinct. I applaud Leo Baeck for the direction they have taken.”

Links to the Internet Archive’s copy of the Leo Baeck material may be found at archive.org/details/LeoBaeckInstitute and details about the Leo Baeck collection may be found on their site at www.lbi.org/digibaeck.

The New York Times piece may be found at http://artsbeat.blogs.nytimes.com/2012/10/09/archive-of-jewish-life-in-central-europe-going-online/.

Our Ten Petabyte Party: Live Streamed or In Person! Thurs Oct 25th 6-7:30PT

Please join us for a free reception and short presentations, Thursday, October 25th from 6 to 7:30pm, in person, or live streamed at http://toc.oreilly.com/:

  • Television News Broadcasts are now Searchable (350,000 of them!)
  • All of Balinese Literature now online and more books in the Lending Library
  • Digital Archive of Japan 2011 disaster
  • Hundreds of newly digitized Home movies and other ephemeral films

and, drum roll,

  • Ten Petabytes (10,000,000,000,000,000 bytes) of cultural material saved!

** this just in**  Don Knuth will be playing the organ as we start the event!

This will be a fun party that celebrates the community that is building and supporting this astonishing library.

Let’s bring millions of books, music, movies, software and web pages online to over 2 million people every day and celebrate the 10,000,000,000,000,000th byte being added to the Archive.

Invite anyone and everyone.

Thursday, October 25th
Cocktail Reception at 6PM
Presentations 6:30-7:15PM

Location: Internet Archive
300 Funston Ave, San Francisco, CA 94118
415.561.6767

Please RSVP to June at RSVP@archive.org
