Browser distribution for those using Archive.org

Ok, this is a geeky post.  :)

For web designers it is important to watch the distribution of users’ browsers and browser/operating system combinations. Just for those who care, here is yesterday’s breakdown for us. There are many other sites that display these, and over time, so this is not likely to be much more than a curiosity. Thank you to Sam Stoller for making this page for us.

The rise of Chrome to the most popular (almost 1/3 of our users) is remarkable to me, and a growing number of users are mobile which indicates we should do more to help those users.

Below are 2 tables: browsers and browser/OS combinations for one day of human users of archive.org (we try to filter out bots, and the counts do not include most Wayback Machine use).

 

Time window starting 2013-02-07 16:00:00
counted 2290115 pageviews in 86400 seconds
% of pageviews (actual count) browser
% 32.51 (556525) Chrome 24.0.1312
% 20.28 (347118) Firefox 18.0
% 15.86 (271526) IE 9.0
% 3.13 (53513) IE 8.0
% 3.04 (51971) Mobile Safari 6.1 iPhone
% 3.04 (51953) Safari 6.0.2
% 2.26 (38717) Mobile Safari 6.0 iPad
% 2.22 (37931) Safari 5.1.7
% 2.19 (37418) Mobile Safari 6.0 iPhone
% 1.05 (18038) IE 10.0
% 0.98 (16832) Mobile Safari 5.1 iPad
% 0.92 (15803) Mobile Safari 6.1 iPod
% 0.89 (15208) Mobile Safari 6.1 iPad
% 0.83 (14201) UNKNOWN
% 0.80 (13679) Firefox 16.0
% 0.77 (13161) Safari 5.0.6
% 0.63 (10859) Mobile Safari 5.1 iPhone
% 0.62 (10665) Firefox 17.0
% 0.57 (9804) Chrome 23.0.1271
% 0.56 (9550) Firefox 12.0
% 0.43 (7418) Firefox 14.0.1
% 0.40 (6849) Opera 12.14
% 0.38 (6517) Firefox 19.0
% 0.34 (5841) Firefox 15.0.1
% 0.31 (5264) Chrome 22.0.1229
% 0.29 (4945) Opera 12.12
% 0.23 (4009) Firefox 11.0
% 0.23 (3997) Safari
% 0.23 (3949) IE 7.0
% 0.23 (3882) Mobile Safari 5.1 iPod
% 0.21 (3526) Safari 6.0
% 0.20 (3458) Opera 12.13
% 0.20 (3438) Firefox 13.0.1
% 0.17 (2962) Mobile Safari 6.2 iPhone
% 0.17 (2944) Safari 4.1.3
% 0.16 (2772) Mobile Safari 6.0 iPod
% 0.16 (2747) Opera Mini 7.5
% 0.15 (2510) Safari 5.0.5
% 0.13 (2179) Chrome 18.0.1025 Nexus 7
% 0.12 (2087) Safari 6.0.1
% 0.12 (2086) Firefox 10.0
% 0.11 (1823) Chrome Mobile iOS 23.0.1271.100 iPad
% 0.11 (1805) Chrome 21.0.1180
% 0.10 (1744) Silk 1.0.22 Kindle Fire
% 0.10 (1724) Silk 2.2 Kindle Fire
% 0.09 (1623) Firefox 13.0
% 0.08 (1446) Mobile Safari 4.1 iPod
% 0.08 (1330) Firefox 7.0
% 0.07 (1215) Safari 5.1.2
% 0.07 (1158) Chrome 25.0.1364
% 0.07 (1133) Firefox 7.0.1
% 0.07 (1128) Android 4.1.1 SCH-I535
% 0.07 (1116) Safari 5.0
% 0.06 (1082) IE 6.0
% 0.06 (1076) Android 4.1.2 GT-I9300
% 0.06 (1062) Android 4.0.4 DROID RAZR 4G
% 0.06 (1059) Firefox 10.0.2
% 0.06 (1030) Mobile Safari 4.0.4 iPad
% 0.06 (1017) Android 4.1.1 SPH-L710
% 0.05 (925) Silk 2.7 Kindle Fire
% 0.05 (903) Android 4.0.4 SPH-D710
% 0.05 (895) Firefox 9.0.1
% 0.05 (858) Chrome 11.0.696
% 0.05 (828) Safari 5.1.3
% 0.05 (827) Android 4.1.1 SGH-T999
% 0.05 (816) Safari 5.1.6
% 0.04 (752) Firefox 8.0.1
% 0.04 (733) Blackberry WebKit 2.1.0 Blackberry Playbook
% 0.04 (730) Chrome 24.0.1309
% 0.04 (682) Safari 5.1.1
% 0.04 (673) Mobile Safari 5.0.2 iPad
% 0.04 (661) Opera 12.01
% of pageviews (actual count) browser and OS
% 18.90 (323495) Chrome 24.0.1312/Windows 7
% 13.89 (237766) IE 9.0/Windows 7
% 9.98 (170746) Firefox 18.0/Windows 7
% 7.05 (120645) Chrome 24.0.1312/Windows XP
% 5.52 (94450) Firefox 18.0/Windows XP
% 2.44 (41733) Chrome 24.0.1312/Windows Vista
% 2.22 (37935) IE 8.0/Windows XP
% 2.19 (37472) Mobile Safari 6.1/iOS 6.1 iPhone
% 1.97 (33760) IE 9.0/Windows Vista
% 1.97 (33708) Safari 5.1.7/Mac OS X 10.6.8
% 1.66 (28334) Safari 6.0.2/Mac OS X 10.8.2
% 1.56 (26704) Firefox 18.0/Windows Vista
% 1.44 (24631) Mobile Safari 6.0/iOS 6.1 iPad
% 1.38 (23619) Safari 6.0.2/Mac OS X 10.7.5
% 1.26 (21629) Chrome 24.0.1312/Windows 8
% 1.24 (21282) Mobile Safari 6.0/iOS 6.1 iPhone
% 1.00 (17065) Chrome 24.0.1312/Mac OS X 10.8.2
% 0.93 (15858) IE 10.0/Windows 8
% 0.93 (15842) Chrome 24.0.1312/Mac OS X 10.6.8
% 0.91 (15582) Firefox 18.0/Mac OS X 10.6
% 0.91 (15578) IE 8.0/Windows 7
% 0.85 (14499) Mobile Safari 6.1/iOS 6.0.1 iPhone
% 0.83 (14276) Mobile Safari 5.1/iOS 5.1.1 iPad
% 0.83 (14201) UNKNOWN
% 0.77 (13161) Safari 5.0.6/Mac OS X 10.5.8
% 0.72 (12285) Chrome 24.0.1312/Mac OS X 10.7.5
% 0.67 (11426) Mobile Safari 6.1/iOS 6.1 iPad
% 0.64 (10939) Mobile Safari 6.0/iOS 6.0.1 iPad
% 0.63 (10851) Mobile Safari 6.1/iOS 6.1 iPod
% 0.62 (10673) Firefox 18.0/Windows 8
% 0.62 (10547) Firefox 18.0/Ubuntu
% 0.59 (10133) Mobile Safari 6.0/iOS 6.0.1 iPhone
% 0.57 (9684) Firefox 18.0/Mac OS X 10.7
% 0.43 (7308) Mobile Safari 5.1/iOS 5.1.1 iPhone
% 0.40 (6801) Firefox 18.0/Mac OS X 10.8
% 0.31 (5264) Chrome 22.0.1229/Linux
% 0.30 (5156) Firefox 12.0/Windows XP
% 0.30 (5109) Firefox 17.0/Windows 7
% 0.29 (4952) Mobile Safari 6.1/iOS 6.0.1 iPod
% 0.27 (4645) Firefox 16.0/Windows 7
% 0.26 (4530) Mobile Safari 6.0/iOS 6.0 iPhone
% 0.26 (4394) Firefox 12.0/Windows 7
% 0.25 (4354) Firefox 17.0/Windows XP
% 0.25 (4345) Firefox 19.0/Windows 7
% 0.25 (4251) Firefox 14.0.1/Windows 7
% 0.25 (4195) Opera 12.14/Windows 7
% 0.24 (4035) Firefox 16.0/Mac OS X 10.5
% 0.22 (3806) Chrome 23.0.1271/Windows XP
% 0.22 (3782) Mobile Safari 6.1/iOS 6.0.1 iPad
% 0.22 (3702) Firefox 15.0.1/Windows 7
% 0.21 (3615) Firefox 16.0/Windows XP
% 0.20 (3460) Chrome 23.0.1271/Windows 7
% 0.19 (3284) IE 7.0/Windows 7
% 0.19 (3190) Mobile Safari 5.1/iOS 5.1.1 iPod
% 0.19 (3167) Firefox 14.0.1/Windows XP
% 0.18 (3147) Mobile Safari 6.0/iOS 6.0 iPad
% 0.17 (2962) Mobile Safari 6.2/iOS 6.0.2 iPhone
% 0.17 (2944) Safari 4.1.3/Mac OS X 10.4.11
% 0.16 (2747) Opera Mini 7.5/Android
% 0.16 (2654) Opera 12.14/Windows XP
% 0.15 (2543) Opera 12.12/Windows XP
% 0.15 (2538) Chrome 24.0.1312/Linux
% 0.15 (2510) Safari 5.0.5/Mac OS X 10.6.8
% 0.14 (2402) Opera 12.12/Windows 7
% 0.14 (2399) Safari 5.1.7/Mac OS X 10.7.5
% 0.13 (2180) IE 10.0/Windows 7
% 0.13 (2179) Chrome 18.0.1025/Android 4.2.1 Nexus 7
% 0.13 (2172) Firefox 19.0/Windows XP
% 0.12 (2139) Firefox 15.0.1/Windows XP
% 0.12 (2111) Firefox 11.0/Windows 7
% 0.12 (2087) Safari 6.0.1/Mac OS X 10.8.2
% 0.12 (2086) Firefox 10.0/Windows 7
% 0.12 (2037) Opera 12.13/Windows 7
% 0.11 (1931) Firefox 18.0/Linux
% 0.11 (1907) Firefox 13.0.1/Windows 7
% 0.11 (1898) Firefox 11.0/Windows XP
% 0.11 (1857) Chrome 23.0.1271/Chrome OS 2913.331.0
% 0.11 (1817) Mobile Safari 5.1/iOS 5.0 iPhone
% 0.11 (1805) Chrome 21.0.1180/Mac OS X 10.5.8
% 0.10 (1744) Silk 1.0.22/Mac OS X 10.6.3 Kindle Fire
% 0.10 (1734) Mobile Safari 5.1/iOS 5.0.1 iPhone
% 0.10 (1724) Silk 2.2/Linux Kindle Fire
% 0.09 (1534) Safari/Mac OS X 10.8.2
% 0.09 (1531) Firefox 13.0.1/Windows XP
% 0.09 (1514) Mobile Safari 5.1/iOS 5.1 iPad
% 0.09 (1481) Safari/Mac OS X 10.6.8
% 0.09 (1473) Mobile Safari 6.0/iOS 6.0.2 iPhone
% 0.08 (1446) Mobile Safari 4.1/iOS 4.2.1 iPod
% 0.08 (1421) Opera 12.13/Windows XP
% 0.08 (1384) Firefox 16.0/Ubuntu
% 0.08 (1330) Firefox 7.0/Windows XP
% 0.08 (1293) Chrome 24.0.1312/Mac OS X 10.7.4
% 0.07 (1277) Mobile Safari 6.0/iOS 6.1 iPod
% 0.07 (1248) Safari 6.0/Mac OS X 10.7.4
% 0.07 (1215) Safari 5.1.2/Mac OS X 10.6.8
% 0.07 (1210) Safari 6.0/Mac OS X 10.7.5
% 0.07 (1202) Firefox 17.0/Windows Vista
% 0.07 (1158) Chrome 25.0.1364/Windows 7
% 0.07 (1133) Firefox 7.0.1/Windows XP
% 0.07 (1128) Android 4.1.1/Android 4.1.1 SCH-I535
% 0.07 (1125) Safari 5.1.7/Windows 7
% 0.07 (1116) Safari 5.0/Mac OS X 10.6.3
% 0.06 (1082) IE 6.0/Windows XP
% 0.06 (1076) Android 4.1.2/Android 4.1.2 GT-I9300
% 0.06 (1068) Safari 6.0/Mac OS X 10.8
% 0.06 (1062) Android 4.0.4/Android 4.0.4 DROID RAZR 4G
% 0.06 (1059) Firefox 10.0.2/Windows XP
% 0.06 (1042) Chrome Mobile iOS 23.0.1271.100/iOS 6.1 iPad
% 0.06 (1042) Mobile Safari 5.1/iOS 5.0.1 iPad
% 0.06 (1030) Mobile Safari 4.0.4/iOS 3.2 iPad
% 0.06 (1017) Android 4.1.1/Android 4.1.1 SPH-L710
% 0.06 (982) Safari/Mac OS X 10.7.5
% 0.05 (925) Silk 2.7/Linux Kindle Fire
% 0.05 (903) Android 4.0.4/Android 4.0.4 SPH-D710
% 0.05 (899) Firefox 13.0/Windows 7
% 0.05 (895) Firefox 9.0.1/Windows XP
% 0.05 (858) Chrome 11.0.696/Linux
% 0.05 (828) Safari 5.1.3/Mac OS X 10.7.3
% 0.05 (827) Android 4.1.1/Android 4.1.1 SGH-T999
% 0.05 (816) Safari 5.1.6/Mac OS X 10.7.4
% 0.05 (785) Mobile Safari 6.0/iOS 6.0.1 iPod
% 0.05 (781) Chrome Mobile iOS 23.0.1271.100/iOS 6.0.1 iPad
% 0.04 (752) Firefox 8.0.1/Windows XP
% 0.04 (733) Blackberry WebKit 2.1.0/BlackBerry Tablet OS 2.1.0 Blackberry Playbook
% 0.04 (730) Chrome 24.0.1309/Mac OS X 10.8.2
% 0.04 (724) Firefox 13.0/Windows XP
% 0.04 (710) Mobile Safari 6.0/iOS 6.0 iPod
% 0.04 (699) Safari 5.1.7/Mac OS X 10.7.4
% 0.04 (692) Mobile Safari 5.1/iOS 5.0.1 iPod
% 0.04 (682) Safari 5.1.1/Mac OS X 10.7.2
% 0.04 (681) Chrome 23.0.1271/Windows Vista
% 0.04 (673) Mobile Safari 5.0.2/iOS 4.3.5 iPad
% 0.04 (665) IE 7.0/Windows XP
% 0.04 (661) Opera 12.01/Windows XP
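
For anyone who wants to play with these numbers, here is a minimal Python sketch that parses rows in the format of the tables above and totals the pageview counts per browser family. The function names and the rule for deriving a "family" from a label (everything before the first version number) are assumptions made for illustration; this is not the script that produced the page. The sample is just the top five rows of the first table.

```python
import re
from collections import defaultdict

# One table row looks like: "% 32.51 (556525) Chrome 24.0.1312"
ROW = re.compile(r"^%\s+(\d+\.\d+)\s+\((\d+)\)\s+(.+)$")

def parse_line(line):
    """Return (percent, count, label) for one row, or None if it is not a row."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2)), m.group(3)

def family(label):
    """Assumed rule: the family is the text before the first version number."""
    words = []
    for word in label.split():
        if word[0].isdigit():
            break
        words.append(word)
    return " ".join(words) or label

def family_totals(lines):
    """Sum pageview counts per browser family, skipping non-row lines."""
    totals = defaultdict(int)
    for line in lines:
        row = parse_line(line)
        if row is not None:
            _, count, label = row
            totals[family(label)] += count
    return dict(totals)

SAMPLE = """\
% of pageviews (actual count) browser
% 32.51 (556525) Chrome 24.0.1312
% 20.28 (347118) Firefox 18.0
% 15.86 (271526) IE 9.0
% 3.13 (53513) IE 8.0
% 3.04 (51971) Mobile Safari 6.1 iPhone
""".splitlines()

totals = family_totals(SAMPLE)
print(totals)
```

Grouping by family this way folds IE 9.0 and IE 8.0 into a single "IE" bucket, which makes it easier to compare browsers than the per-version rows do.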
Posted in News | 3 Comments

Presetting metadata with the new Beta Uploader

We have been testing out a new uploader, currently in beta. The new uploader allows our users to preset metadata for their items, which is useful if you are uploading many items.

When using the new uploader, the metadata editor will appear after you initially choose some files to upload.

You will not see the metadata editor until you choose at least one file. You can add query arguments to the standard upload URL in order to preset metadata for the item. When the metadata editor appears, it will be populated with the metadata you supplied. You can preset standard metadata fields like title and description, as well as arbitrary key/value pairs.  The general form of the url will look like this:

http://archive.org/upload?key1=value1&key2=value2

An advantage of specifying metadata in this way is that it can then be searched on archive.org by entering key1:value1 AND key2:value2, or key1:"Multiple word value" for a value that contains spaces. The metadata is stored in the item’s metadata XML file, located at http://archive.org/download/ID/ID_meta.xml.
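
As a rough illustration, a search string in that form can be assembled from key/value pairs like this. The `build_query` helper is hypothetical, and the quoting rule (wrap any value containing whitespace in double quotes) is an assumption based on the example above, not a full description of the search syntax.

```python
def build_query(fields):
    """Join key/value pairs into a `key:value AND key:value` search string."""
    terms = []
    for key, value in fields.items():
        # Assumption: values containing spaces need double quotes around them.
        if any(ch.isspace() for ch in value):
            value = '"' + value + '"'
        terms.append(f"{key}:{value}")
    return " AND ".join(terms)

query = build_query({"key1": "value1", "key2": "Multiple word value"})
print(query)
```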

Here are some examples:

Title

This url will preset the title for you:

http://archive.org/upload/?title=My%20Item

Note that the space character in the title is encoded as “%20”.

Description

http://archive.org/upload/?description=This%20is%20my%20description

Subjects

Subjects are separated by commas:

http://archive.org/upload/?subject=dogs,cats

Page URL / Item Identifier

Presetting the page url is tricky, since you must pick a unique identifier for your item. If you provide an identifier that already exists, the uploader will allow you to add more files to that item, assuming you have the correct permissions to upload to that item.

http://archive.org/upload/?identifier=this_item_does_not_exist_yet

Limiting identifier length

If you need to limit the length of the identifier, you can set max_id_length. This is useful if you need the identifier to be short enough to fit on a barcode.

http://archive.org/upload/?max_id_length=25

Collection

Presetting the collection will only be useful to curators who have permissions to upload to those collections:

http://archive.org/upload/?collection=americana,test_collection

Here, two collections are supplied, separated by a comma. The primary collection is listed first, and will appear in the “Collection” section of the metadata editor. Additional collections will appear at the bottom of the editor, in the “More Options” section.

Arbitrary Metadata

You can supply any additional metadata you would like to add to your item. If the key/value pair you supply is not one of the standard ones, it will appear at the bottom of the metadata editor, in the “More Options” section:

http://archive.org/upload/?foo=bar
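
If you are generating many of these links, a short script can do the encoding for you. This is only a sketch: the `preset_upload_url` function is a made-up helper, and it assumes that multi-valued fields such as subjects and collections are joined with commas, as in the examples above. It uses Python's standard `urllib.parse` module.

```python
from urllib.parse import quote, urlencode

UPLOAD_URL = "http://archive.org/upload/"

def preset_upload_url(metadata):
    """Build an upload URL whose query string presets the given metadata."""
    flat = {}
    for key, value in metadata.items():
        # Subjects and collections take several values separated by commas.
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        flat[key] = str(value)
    # quote_via=quote encodes spaces as %20 rather than "+";
    # safe="," leaves the separating commas unescaped.
    return UPLOAD_URL + "?" + urlencode(flat, quote_via=quote, safe=",")

url = preset_upload_url({
    "title": "My Item",
    "subject": ["dogs", "cats"],
    "max_id_length": 25,
    "foo": "bar",
})
print(url)
```

One caveat with this approach: a single subject that itself contains a comma would be indistinguishable from two subjects, so the sketch is only suitable for values without commas.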

Please let us know if you have any comments about the new uploader!

 

Posted in News | 11 Comments

new archive.org uploader: html5, for big big files, and easier (but not for IE)

We have been working on a new HTML5 version of the uploader, and it is now available in beta if you’d like to try it out: http://archive.org/create/ (look for the Beta Uploader button).

The new uploader is capable of uploading much larger files than the old one, and you can add a wider variety of metadata if you need it. Please note that the beta uploader does NOT work in Internet Explorer due to the limitations of that browser. We recommend Chrome or Firefox for the best experience.

If you have comments on the beta uploader, please reply in this forum thread.

Thanks!

Alexis – IA

Posted in News | 2 Comments

Bulk Downloading, Aaron Swartz, and Terms of Service

[Aaron Swartz worked for and with the Internet Archive for years.]

Aaron was threatened with 35 years in prison for being accused of something my library actively encourages: bulk downloading of library collections. Some are calling it “hacking”, which is a problematic distortion of the term in the first place.(1) It might be time to break down some of what is currently going on in scholarly research as it relates to datamining, bulk downloading, and terms of service. It makes me very sad and mad because this confusion may have led a library (JSTOR) to track down a user, led MIT to call the police (and not try to call them off later), and led a US Attorney to mistake this for a crime, all of which combined to help lead to the death of a rising star in the Internet community.

Libraries: All libraries, including JSTOR and the Internet Archive, contain materials from lots of different people and places– some copyrighted, some not. Jim Gray called libraries “Engines of Research.” Research, by definition, is searching– searching for new patterns and new ideas. Libraries provide raw materials for researchers. Fortunately, in the digital world, bulk access to materials does not hurt our preservation function as rifling through pages in the past might have.

Academic publishing is changing: Traditionally, academic publications mostly came from non-profit scholarly associations and university presses. Some organizations, such as Elsevier, Wiley-Blackwell, and JSTOR, started to acquire and aggregate many journals into databases. These databases were funded by academic institutions and only available to those subscribing institutions. Beyond this, academic publishing is going more “open.” New publishers, such as the Public Library of Science, are being created to explicitly allow open access and bulk access. Their open access journals end up being cited more often, and this openness explicitly allows research using “datamining” techniques. Many universities have made all future articles by their professors open access, except when specifically requested not to.

Datamining academic research as academic research: Datamining academic publications is popular now because modern computers make it easy and the results are novel and publishable. This involves collecting masses of journal articles so that they can be analyzed by computer programs to find statistical patterns. This is different from individuals reading one paper at a time. Biology and medicine are especially helped by this, but it is now going on in humanities and law research as well. Larry Lessig wrote:

While at Stanford, Swartz had worked with a law student to download all the law review articles in the Westlaw database, to map funders of research with research conclusions. The result of that research was published in the Stanford Law Review, and showed a troubling connection between funders and their conclusions. At the time of Aaron’s alleged “crime,” he was a fellow at my Center at Harvard. The work of the Center? Studying the corruption of academic research (among other institutions) caused by money.

Bulk downloading or “crawling”: Bulk downloading is now done for various reasons, and those libraries with large collections take various positions on it and express these positions in Terms of Service and robot exclusions. “Robots” or “crawlers” in this context are computer programs that do repetitive actions like downloading many documents from a website. Some such users are search engines, some are backing up materials, some are doing new research such as visualizing data, some are building different interfaces to the full dataset (like Freebase’s reuse of Wikipedia), and some are even enabling others to more easily download in bulk. Most datasets have some sort of license involved, so there is some nervousness on the part of the providers about explicitly allowing all bulk downloading (for instance of Amazon.com’s book catalog data, which is licensed from many players), but in general people are becoming more comfortable with the re-purposing of their data as it becomes more common.

The Internet Archive is regularly crawled.   We try to make our systems strong enough to serve these loads, and sometimes try to get robots to slow down.   We get hit with spam all the time, and occasional denial of service attacks.    But we haven’t called the police– we deal with it.   As a library we try to serve as many users as we can and some of those users are robots.

Open Data is a rising trend supported by government agencies and libraries. Open Data is bulk data that is specifically licensed for datamining, graphing, and linking to other open data. This is the minority of databases, but it is growing in importance. I bring this up because it shows a trend towards openness and datamining.

Terms of Service and Robots.txt files: These are mostly invisible “agreements” that are often defensive documents to protect the organization from users and suppliers. They are regularly trodden on, sometimes resulting in the providers instituting technological measures to slow down mass downloaders. I think of most Terms of Service as like an old joke about the Soviet Union: everything is illegal except when it is not. It is important to note that the specifics of many Terms of Service and robot exclusion files are regularly ignored by millions of people, and enforcement is ignored by millions of organizations. Enforcement is often very selectively applied.
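
For readers who have never looked at one, here is a small sketch of how a crawler might consult a robot exclusion file, using Python's standard `urllib.robotparser` module. The robots.txt content, the user-agent name, and the URLs are all made up for illustration; nothing here describes the actual rules of any particular site.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration; not the file of any real site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

rules = load_rules(EXAMPLE_ROBOTS_TXT)
agent = "example-crawler"  # hypothetical user-agent name

# May this crawler fetch these (hypothetical) URLs?
allowed_public = rules.can_fetch(agent, "http://example.org/details/item")
allowed_private = rules.can_fetch(agent, "http://example.org/private/notes")

# How many seconds does the site ask crawlers to wait between requests?
delay = rules.crawl_delay(agent)

print(allowed_public, allowed_private, delay)
```

Note that the file only expresses a request. Whether a crawler checks it, and whether it honors the delay, is entirely up to the person who wrote the crawler.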

Bulk Downloading, Aaron Swartz, and Terms of Service: Putting this all together means that mass downloading is often not discouraged as long as it is done slowly enough; what most concerns providers, in our experience, is what is done with the materials after they are downloaded. Terms of Service documents are generally “CYA” documents in which it is difficult to communicate nuance– but we should recognize that violating them may not be “right”, but is common practice. Opening up library databases, including but not limited to public domain materials, to new types of research is important, especially in academia. Most organizations are adapting to these new types of computational research opportunities, but some will try to stop them. All in all, we do not have a good way to draw lines of what is acceptable practice yet– it is all evolving. What I know of Aaron’s downloading old journal articles for later use is not outside of what many people do. What is unusual are the reactions on the part of JSTOR, MIT, and the US prosecutors.

What I am suggesting is we need a bit more slack in the system.   We need to be able to talk things through before we turn to police and courts.   We need to leave room for a new generation of people and ideas that may alter how our institutions work.    No, more than that, we should welcome and encourage people and ideas that will alter how our institutions work.

Aaron helped many of us adapt our institutions’ services to the digital opportunities– let’s continue this important work.

(1) From: the Hacker’s Dictionary
HACKER [originally, someone who makes furniture with an axe] n. 1. A person who enjoys learning the details of programming systems and how to stretch their capabilities, as opposed to most users who prefer to learn only the minimum necessary. 2. One who programs enthusiastically, or who enjoys programming rather than just theorizing about programming. 3. A person capable of appreciating hack value (q.v.). 4. A person who is good at programming quickly. Not everything a hacker produces is a hack. 5. An expert at a particular program, or one who frequently does work using it or on it; example: “A SAIL hacker”. (Definitions 1 to 5 are correlated, and people who fit them congregate.) 6. A malicious or inquisitive meddler who tries to discover information by poking around. Hence “password hacker”, “network hacker”.

Posted in News | 5 Comments

Memorial for Aaron Swartz in SF at Internet Archive Thurs 7pm

 

Dear Friends,

Please join us as we gather to remember Aaron Swartz on the evening of Thursday, January 24th.

Reception at 7:00pm
Memorial at 8:00pm
at the Internet Archive
300 Funston Avenue
San Francisco 94118

Speakers will include Danny O’Brien, Lisa Rein, Peter Eckersley, Molly Shaffer Van Houweling, Cindy Cohn, Brewster Kahle, Tim O’Reilly, Elliot Peters, Alex Stamos, and Carl Malamud; there will be an opportunity for brief remembrances.

Please consider RSVPing so that we know how many people to expect. If you are unable to join us, you can watch a live stream of the event.

From Aaron’s friends at: Creative Commons, Electronic Frontier Foundation, Noisebridge, Internet Archive, Wikimedia Foundation, Stanford Center for Internet and Society, O’Reilly and Blurryedge.

Posted in News | 22 Comments

Aaron Swartz, hero of the open world, dies

Aaron Swartz Memorial Thursday, January 24th at the Internet Archive.

Downloadable version, and links to speakers.
Aaron’s girlfriend Taren’s and Open Access activist Carl Malamud’s gripping calls to action.


 

Eulogy by Brewster Kahle, written January 12, 2013:

Aaron Swartz, champion of the open world, committed suicide yesterday.

Working at the Internet Archive, Aaron was the architect and first coder of OpenLibrary.org, a site to open the world of books to the Internet generation. As a user of the site, he helped put public domain books on the site that had been locked up. Public access to the Public Domain, while it seems obvious, is not the position of many institutions, and this caused friction for Aaron.

As a volunteer, he helped make the RECAP system to offer free public access to public domain government court documents. He took the bold step of seeding this system by going to a public library to download public domain documents and then uploading them to the Internet Archive– this got him in trouble with the FBI. Now many millions of public domain documents have been used by over six million people for free, including researchers who could never have afforded the high fees to gain access.

If there is a sin in the open world it is locking up the public domain.  Aaron took selfless action.

When he was downloading a large number of old journal articles, he was arrested at MIT. I was shocked by this. When I was at MIT, if someone went to hack the system, say by downloading databases to play with them, they might be called a hero, get a degree, and start a company– but they called the cops on him. Cops. MIT used to protect us when we transgressed the traditional. Despite many of us supporting the lawyers for Aaron, he was still hounded by prosecutors. (I hope JSTOR.org and MIT will act differently in the future.)

Aaron was steadfast in his dedication to building a better and open world.   Selfless.   Willing to cause change.

He is among the best spirits of the Internet generation.    I am crushed by his loss, but will continue to be enlightened by his work and dedication.

To mourn, I just watched this video with my son.   May I suggest you seek out your children and do the same.

May a hero and founder of our open world rest in peace.

-brewster

Founder, Digital Librarian of the Internet Archive

 

Other helpful reading:

Cory Doctorow: http://boingboing.net/2013/01/12/rip-aaron-swartz.html

Larry Lessig:  http://lessig.tumblr.com/post/40347463044/prosecutor-as-bully

Expert Witness in his case:  http://unhandled.com/2013/01/12/the-truth-about-aaron-swartzs-crime/

Posted in Announcements | 96 Comments

Wayback Machine: Now with 240,000,000,000 URLs

Today we updated the Wayback Machine with much more data and some code improvements. Now we cover from late 1996 to December 9, 2012, so you can surf the web as it was up until a month ago. Also, we have gone from having 150,000,000,000 URLs to having 240,000,000,000 URLs, a total of about 5 petabytes of data. (Want a humorous description of a petabyte? Start at 28:55.) This database is queried over 1,000 times a second by over 500,000 people a day, helping make archive.org the 250th most popular website.

Over the past year we archived tons of pages about the United States 2012 presidential election. You can revisit the New York Times live coverage page from election day, the campaign sites of Republican hopefuls like Newt Gingrich and Ron Paul, and mini-scandals like Romney’s car elevator or using aspirin as contraceptives. The Wayback record of the 2008 election was recently used by the Sunlight Foundation to contrast how Obama’s team dealt with disclosing inauguration donors then vs. now, so hopefully the 2012 election content will prove just as useful in the future.

The prolific volunteers of Archive Team spent a lot of time this year archiving web sites on the verge of disappearing and then contributing those records to the Internet Archive. City of Heroes (including the boards with years of posts), Fortune City and Splinder were all saved from the proverbial wood chipper.

The updated version does have at least one known issue – there is a small amount of older content missing from the index, and it will take us another month or two to sort out that problem. In the meantime, you can still visit the previous version of the Wayback with that content.

We would like to thank the following for all their efforts in making the updated Wayback Machine:

  • Andy Bezella
  • Aaron Binns
  • Hank Bromley
  • Kris Carpenter
  • Dominic Dela Cruz
  • Vinay Goel
  • Jake Johnson
  • Brewster Kahle
  • Jeff Kaplan
  • Ilya Kreymer
  • Raj Kumar
  • John Lekashman
  • Noah Levitt
  • Adam Miller
  • Gordon Mohr
  • Ralf Muehlen
  • Kenji Nagahashi
  • Alexis Rossi
  • Jim Shankland
  • Sam Stoller
  • Brad Tofel
  • Travis Wellman

Posted in Announcements | 32 Comments

Thanks a Million!

Thanks to the generous support of our users we raised $250,000 in donations during the month of December, and with the 3-to-1 match from one of our donors that gives us $1,000,000!  We raised enough to purchase 4 petabytes of storage, which helps us towards the 10 we estimate for next year.  Beyond that, this will help us archive books, music, video and web sites.   If you haven’t donated yet, please help keep the archive open!

We brought an unprecedented amount of information into the archive in 2012:

  • 50,000,000,000 web pages
  • 1,000,000 hours of television
  • 370,000 new audio/music items
  • 100,000 new videos

We launched the TV News Search & Borrow service, which makes almost 400,000 television news programs searchable and borrowable.  We made all of Balinese literature available online.  And you can play with a new, beta Wayback that has a much more up to date index.

We look forward to archiving even more great material in 2013.   Thank you for helping to support the goal of Universal Access to All Knowledge.

Posted in Announcements | 11 Comments

My adventure in donating bitcoins to the Internet Archive

A Bitcoin Adventure in Four Parts
—by Brewster Kahle, Digital Librarian

Part One: The Deposit

I am proud to say I succeeded in donating BitCoins to the Internet Archive, but it took some doing.   For your entertainment, here is my adventure in changing $100 into bitcoins, transferring them to the “wallet” that lives on my laptop, and then contributing them to the Internet Archive.

The first trick was to buy some bitcoins. After poking around, I found that I could use a wire transfer via the MtGox website. But, now, uh, I don’t recommend this approach. This is what happened:

To transfer $100, I asked my big bank to wire a hundred dollars to the bank MtGox suggested, which is in Japan. Well, my bank cannot send dollars to Japan, only yen. And since I requested dollars, they had to first transfer the money to JPMorgan, a bank that can transfer dollars. So far, so complicated.

I then waited the five days that MtGox said to allow for the transfer to conclude, but nope, nothing. I then entered the weird world of MtGox customer support.

After I asked what happened to my money, they determined that I was a risk and said that they needed to see a scanned passport and/or driver’s license to confirm the money came from an account of the same name. Even if I had given the scanned ID, it would not have matched the bank account that I had chosen. But the instructions had said nothing about a scanned ID, so I worried this was a scam. This went back and forth several times, and my alarm bells started to go off.

They declined my request to speak with a manager, and repeated that they needed scanned identity documents. That’s when I requested that they return the money. Sure, no problem, as long as I first sent them my scanned passport and/or driver’s license. Creep factor: high and rising. When I asked how long they would sit on a transfer that never made it into any account before they automatically returned the money, they asked for … wait for it! … my full identification.

After wasting all too much time on what should have been a simple deposit, I received a terse message from the “MtGox.com Team”:

“The transfer in question is confirmed and credited to your account. In future if you are not willing to provide ID proof then please never send us any deposit again because we do not accept deposits without proof that it is actually from the bank account owner.”

My lesson: avoid MtGox.

 

Part Two: Installing the Software

I downloaded the bitcoin-qt application to my mac laptop thinking I would install it and create an ID. But this process takes days– and can fail.  And did.

The bitcoin system is very cool—cryptographically secure, peer-to-peer, anonymous, and such—but it means that your computer is a first-class member of the system and that requires quite a bit of computing horsepower. Hours into the installation process, a friend advised me that it could take a day or so to complete.

There’s a small icon on the interface that has a tool-tip to monitor progress. My progress was stymied by an error after a day, one with no recommended solution. Searching, I found a forum post suggesting that I delete some files from the computer’s application directory and start over again. Another day of processing, and Ready!

 

Part Three: Getting My Bitcoins into My Computer

I think I could have transferred the bitcoins directly from MtGox to the Internet Archive, but that would have been cheating. I wanted to have the coins in my virtual pocket and then donate them– it seemed more “real”.

The MtGox FAQ had an on-point entry, “How do I withdraw Bitcoin to my own computer?” After following the instructions, the bitcoin application on my computer said withdrawal was in process, but needed to be confirmed. Confirmed?  Since bitcoin is this magic of deep math, it uses other computers in the world to confirm that you have the coins you claim to have on deposit. In an hour or two, my bitcoins were confirmed to be in my bitcoin client on my computer.  Cool.

I wonder what happens when my machine crashes or is stolen, but I was on a roll, so there’s no looking back now.

 

Part Four: Making a Donation to the Internet Archive

The Internet Archive’s Donate page features a magic number, the bitcoin address needed to transfer the bitfunds. I cut and pasted that into the “Send Coins” tab in my bitcoin client program, labeled it for future reference as the Internet Archive, and pressed “Send.” I was hoping for that whooshing sound that iPhones make when sending mail, but nope, just silence.   Not sure if I should celebrate, I stayed cool.

The next day, I asked June Goldsmith—the Director of Administration at the Archive who runs the Archive’s bitcoin client—if she had received it, and indeed she had. My donation made it!

Last year, we received a few thousand dollars in bitcoin contributions. So far this year, Internet Archive supporters have donated 186 bitcoins worth U.S. $2,400 at the current exchange rate.

I am rather proud of succeeding and now kind of like the adventure. I feel like I am a member of a club and want to go buy something or donate some more.

If you find yourself similarly inclined, please visit the Internet Archive’s Donate page to support the world’s fastest growing library with bitcoins, dollars, time, books, and anything else.

Thanks, in advance, for your continued support.

—brewster

 

 

Posted in News | 19 Comments

Funds for 1 petabyte raised, 3 to go! (please help)

One down, three to go!

With help from a generous, anonymous donor who’s matching other donations three to one through the end of the year, we now have enough funding to buy a new Petabox! We now have only seventeen days left to get the three more we’ll need in 2013.

These massive servers are the backbone of the Archive, and critical to our continued growth. To all of you who’ve contributed to our fundraising drive, thanks from all of us here at the Internet Archive. If you can help us reach our goal by making a tax-deductible donation, we’d be grateful.

https://archive.org/donate/

Thanks for your support!

 

Posted in News | 16 Comments

News from the Archive 0006: New Petabox, Decoding, and Balloons

No. 6, 14 December 2012

One down, three to go!

With help from a generous, anonymous donor who’s matching other donations three to one through the end of the year, we now have enough funding to buy a new Petabox! We now have only seventeen days left to get the three more we’ll need in 2013.

These massive servers are the backbone of the Archive, and critical to our continued growth. To all of you who’ve contributed to our fundraising drive, thanks from all of us here at the Internet Archive. If you can help us reach our goal by making a tax-deductible donation, we’d be grateful.

https://archive.org/donate/

Thanks for your support!

Books in Browsers presentations now online

In October, the Internet Archive hosted the Books in Browsers conference, which covered achievements in moving books to the web, vending and lending, the design and effective deployment of ebooks and reading experiences for web environments, the portability of books and bookshelves, reader application interoperability, storage and transmission security (including encryption and caching), the legal and user consequences of book licensing versus purchase, and ramifications for user privacy and data protection.

Peter Brantley, director of Bookserver at the Internet Archive, provided an insightful summary of the two-day event: the new publishing doesn’t care about formats, it cares about story-telling. It is neutral about content-types, because all content-types can be manipulated on the web. That may seem prosaic, but it is actually revolutionary. We’re used to seeing tools that add video to textual narratives, or synchronize audio-based playback. But when you invent tools for the web, you can manipulate a vast array of content within the browser, and an author’s ability to integrate the reader into the experience of the story has few constraints. Indeed, one can expect those constraints to continue to yield under the pressure of increasingly flexible representations. Once technology liberates vision, it is only a matter of imagination becoming real.

If you’d like to learn more, the presentations are online:

http://archive.org/details/BooksInBrowsers2012Videos

From the Archive’s Mailbox

I have been watching the Pathé films with tears in my eyes, and here is why. You don’t show the film credits, but if you look at them you’ll see the film editor in chief is Leonard C. Hein, my dad. I believe he was president of his local union #707 in New York City. He worked for Pathé news for over 25 years, right up until the bankruptcy auction which he attended. Thank you for preserving this work!
http://archive.org/search.php?query=creator%3A%22Pathe%20News%22

—Donald Hein

Picks from the Archive

Decoded: An essay towards the reconciling of differences among Christians

A while back, I scanned a book with a ton of shorthand notes, thought to be written by Roger Williams, the founder of Rhode Island.

http://archive.org/stream/essaytowardsreco00will#page/n13/mode/2up

It has since been confirmed that the handwriting is his, a discovery made possible through the availability of our high-quality online version. The book itself is pretty fragile and would not stand up to constant reading, whereas the digital images are easy to zoom in on for further study, which matters because most of the code is quite small. The Internet Archive’s scan of the book provided researchers the raw data needed to examine and ultimately decipher Williams’ code.

http://bostonglobe.com/metro/2012/12/04/brown-university-students-crack-roger-williams-code/6n1B9sLy812OyfOwWdIHvM/story.html

I find it so very cool to have been the one to put this book online and to have contributed in some small part to a better understanding of my state’s, and indeed the nation’s, history. A hurrah for the studious use of a book’s digital version.

— recommended by Xephyr Inkpen

Balloons

This unimaginatively titled, low-quality film documents Joseph Kittinger’s parachute jump from space in 1959.

https://archive.org/details/gov.archives.li.111-dd-301-59

The record held for decades until Felix Baumgartner recently broke it. Wow.

— recommended by Jilly Dybka


What are your Archive favorites? Please send a link or two, along with a few words about why you appreciate your recommendation, to:

bestof@archive.org

—David Glenn Rinehart

/ / / / /

To subscribe to this list, please visit:

http://archive.org/account/login.changepw.php

If you don’t already have a free Internet Archive library card, you may get yours here:

http://archive.org/account/login.createaccount.php

There, enter your password into the “Change Your Account Settings” option, then click on the “Verify” button. That will bring you to your account settings page, where you may change your subscription status in the “Change Announcement Settings” section.

If the above URL is inoperable, make sure that you have copied the entire address. Some mail readers will wrap a long URL, breaking the link.

If you’re still having trouble, please contact the list owner at:

info@archive.org

/ / / / / / /

David Glenn Rinehart is an artist in residence at the Internet Archive as well as a cartoonist, composer, filmmaker, musician, and writer. His work is at http://stare.com/ and elsewhere.

Posted in News | 6 Comments

Internet Archive & EFF successfully block Washington State law

Earlier this year the Internet Archive with EFF’s help joined a suit to challenge the enforcement of a new Washington state law, SB 6251. While the law was intended to curb advertising for underage sex workers, the language was overly broad and made online service providers and libraries criminally liable for providing access to third parties’ offensive materials, which is in conflict with federal law.

We have learned today that the challenge was successful: the law is permanently blocked. You can read more about the case on EFF’s site.

Associated Press article on this.

Posted in Announcements, News | 10 Comments

3-for-1 Match for Internet Archive Donations: Please Help

Dear Friends,

The Internet Archive has received a generous offer this holiday season. For every dollar we raise before December 31st, one of our supporters will match that money three to one. Please consider donating now.

Every day three million people around the world use our collections. We have archived over ten petabytes (that’s 10,000,000,000,000,000 bytes!) of information, including everything ever written in Balinese. This year we also launched our groundbreaking TV News Search and Borrow service, which former FCC Chairman Newton Minow said “offers citizens exceptional opportunities” to easily do their own fact checking and “to hold powerful public institutions accountable.”

Our constantly expanding collections require a lot of storage space, and if we can raise $150,000 by the end of the year, the 3-for-1 match will give us an additional $450,000. Together that’s enough to buy four more petabytes of storage.

Please help us keep the library free for millions of people by making a tax-deductible donation today. On behalf of all of us at the Internet Archive, we wish you a happy holiday.

Thank you,

Brewster Kahle
Founder, Digital Librarian
Internet Archive

Posted in Announcements | 11 Comments

Call for an Open Stack: Securing an open and competitive information environment for the next decade

[letter to the Open Internet Preservation Society]

The “Stack” of technical layers that has delivered text and video from around the world is now embattled. An Open Stack encourages competition at each layer, where a closed stack does not. Unfortunately, the layers of the Internet stack are closing because of business and government interests, but this can be corrected. Let’s fix this. And as we do, let’s give openness some teeth.

Let us call for an “Open Stack” that enforces an open information environment which encourages competition at every layer and prevents closing of any layer.

The fantastic rise of the wired Internet, with personal computers, open source software, the World Wide Web, and search engines, led to decades of company formation, new services, access to information, and productivity gains. This open and competitive environment allowed new entrants to create new services at every level.

Now we are entering a period that is quite different: an environment of cellular networks, apps, locked-down devices, dominant phone companies, walled gardens, and strengthened copyright and patent laws. Every level of the open and competitive landscape of the last couple of decades is under threat.

Let me say that again:

The openness that has characterized the Internet environment is being locked down and closed, which is making competition and new services more difficult or impossible. If this is allowed to continue, we will lose the benefits we struggled so hard to build. Although the locked-down environment is often sold as “secure,” it may well be the opposite.

First the bad news: Some of the approaches we used to secure the open Internet have been undermined, and need to be rethought and renewed.

  • Open source software was protected with the GNU General Public License, but the move from distributed software to web services has made this sharing license structure less effective. Efforts to modernize it were compromised.
  • The multi-stakeholder structure of Internet standards, led by the Internet Society, is at risk of being replaced by a government-driven regulatory group called the ITU. This could lead to more national-level firewalls, regulations, and fees.
  • Device locking that prevents even running open source operating systems is being built into personal computers to support the new Windows operating system.
  • Apple’s new personal computer operating system does not allow users to download software from the Internet without finding the magic keystroke to get beyond the prohibition (this is called a “cheat” in the gaming world which indicates how legitimate this feels).
  • Running open source software on Apple cellphones is now called “jail breaking” and voids warranties.
  • Home Internet connection choices have become fewer and fewer over the years in the wired world, and even fewer choices exist in the wireless world. With only 2 or 3 choices left in most locales, these communications companies can favor, and are starting to favor, some services over others based on their own business interests rather than the interests of their customers. Network Neutrality regulations have not been adopted.
  • Content-level monopoly licensing schemes are close at hand. One example is the “extended collective license” proposed by the libraries and Google as part of a book scanning project, which a court rejected as monopolistic. This approach may be reborn as part of the Digital Public Library of America project working with the US Copyright Office. This would change the Web’s allowed-until-told-to-take-it-down approach that has made user-created content flower.
  • Bills to enable the copyright industries to take down whole websites because of some offending materials were only stopped by grassroots protests at the last minute. But new bills are being crafted.
  • Internet telephony and new chat protocols are largely proprietary, unlike the email and netnews protocols before them. Open protocols invite more innovation and competition.
  • Cables for the iPhone 5 and some HP network switches have digital rights management features that prevent interoperability and competition.
  • BitTorrent, a protocol that encourages open source implementations, is discriminated against by network providers and excluded from the iPad/iPhone app universe by Apple.
  • Cloud storage and computing vendors are centralizing many services that were originally on leaf-level computers and then moved to central datacenters.   This can be seen as a mall compared to a town square– interactions ruled by contract and private security guards.
  • A competitive and open landscape is not built by accident and does not survive without vigilance and regulation.    While we may understand the importance, we may not see how we must work together to keep the rules fair and open.

Now the better news:  There are many ideas, often in isolation, that might be combined to build an Open Stack.  What if we had…

  • open source operating systems with the R&D budgets of proprietary vendors
  • open cell phones that work over open wireless networks and with open source software
  • wireless networks built on mass participation
  • backhaul conduits that can support dark fiber deployments running fibers for anybody
  • siteless websites that offer services in a distributed way, just as BitTorrent is a fileserver without servers
  • data caching, data storage, and computing services that are based on open standards and interoperate
  • non-profit competitive infrastructure. When a technology becomes infrastructure (think of railroads and roads), it is difficult to leave it in private hands because of the difficulty for regulators. We can try, but maybe infrastructure should be non-commercial and competitive.
  • high-tech non-profits to add to the flowering we have seen over the past 20 years: Wikipedia, Mozilla, EFF, PLoS, Internet Archive, Free Software Foundation, Creative Commons, Linux Foundation, Internet Software Consortium, One Laptop per Child, PublicResource.org, Public Knowledge…. Let’s use them and build new ones to build an open and competitive infrastructure.

I call on those of us who see the benefit of openness and competition to create an “Open Stack”: a complete layer cake of openness, a network-to-device-to-application-to-content system that is open and competitive. Then let’s enforce the openness: let’s create systems and regulations to ensure it stays open, because control of any one layer can lead to the closing of the whole system.

Together we can ensure an Open Stack: software vendors and open source developers, device makers, network operators, creators, publishers, libraries, lawmakers and lawyers.

Together we can build and secure our information technology environment to offer every opportunity to the next generation to shape their world.

Posted in News | 4 Comments

News from the Archive 0005: BBC Visit, Rocketship X-M, and Alice

No. 5, 31 October 2012

A BBC film crew visited the Internet Archive; here’s their story.

In addition, the San Francisco Chronicle did a nice profile of our work:

http://www.sfgate.com/default/article/Brewster-Kahle-s-Internet-Archive-3946898.php

From the Archive’s Mailbox

I’ve just downloaded an image file (various galaxies in their vast array) from your NASA Images pages to use on the jacket of my new SF novel for preteens, The Calling.

http://archive.org/details/nasa

I appreciate your open policy of not copyrighting these images but allowing people to use them with a simple acknowledgement (which I have added).

—John Peace

We’re glad to help, but the availability of NASA imagery is determined by the space agency.

http://nasaimages.org/Terms.html

Selected Collection: Crap from the Past

This is a pop music radio show for people who already know plenty about pop music. Hosted by Ron “Boogiemonster” Gerber, it’s broadcast Friday nights from 10:30 to midnight on KFAI, Minneapolis. This collection of over twelve-hundred recordings goes back two decades, a millennium, or “since the days of DOS,” depending on how you slice it.

http://archive.org/details/crapfromthepast

Other Picks from the Archive

Rocketship X-M (1950)

Rocketship X-M landed on the red planet over sixty years before NASA’s Mars Curiosity rover touched down there recently. Hollywood years, that is. Rocketship X-M is the story of five astronauts (played by Lloyd Bridges, Osa Massen, John Emery, Noah Beery, Jr., and Hugh O’Brian) who blast off to explore the moon but end up on Mars instead. Stay tuned for the ending …

http://archive.org/details/RocketshipXM 

— recommended by Emilio Conseco

Through The Looking-Glass (and what Alice found there), Lewis Carroll

This is a first edition “Presentation Copy” of the follow-up to Alice In Wonderland. Not only is this a personal favorite that blew my mind when I first read it some years ago, but this is a first edition copy in excellent condition with fifty of the original illustrations by John Tenniel. I don’t need to describe the impact this book had on literature, but what makes this copy so fascinating to me is that inside the front cover is a note in the author’s own hand, “Emma Vine, with the author’s kind regards. Christmas 1871.” There is also a penciled-in note saying that Emma Vine was Lewis Carroll’s nursemaid. This was very exciting for me to discover, and I can’t believe I was able to see something like this with my own eyes, a real literary treasure.

http://archive.org/details/throughlookinggl01carr

— recommended by Gemma Waterston

Music That’s Better Than It Sounds

This collection of thirty-four pieces (songs?) by Forty0ne really is better than it sounds.

And the liner notes aren’t bad either!

http://archive.org/details/csr041

— recommended by Helen Temnesen


What are your Archive favorites? Please send a link or two, along with a few words about why you appreciate your recommendation, to:

bestof@archive.org

—David Glenn Rinehart


Posted in News | Leave a comment

getting only certain formats in .zip files from items — new feature

Per some requests from our friends in the Live Music Archive community…

You can get any archive.org item downloaded to your local machine as a .zip file (we’ve been doing that for 5+ years!).
But whereas before the .zip would contain all files/formats,
now you can be selective and pick *just* certain formats.

We’ll put links up on audio item pages, at a minimum, but the URL pattern is simple for any item.
It looks like this (replace IDENTIFIER with the identifier of your item, i.e. the part after archive.org/details/):

http://archive.org/compress/IDENTIFIER

for the entire item, and for just certain formats:

http://archive.org/compress/IDENTIFIER/formats=format1,format2,format3,….

Example:


wget -q -O - 'http://archive.org/compress/ellepurr/formats=Metadata,Checksums,Flac' > zip; unzip -l zip
Archive:  zip
  Length      Date    Time    Name
---------  ---------- -----   ----
  1107614  2012-10-30 19:49   elle.flac
       44  2012-10-30 19:49   ellepurr.md5
     3114  2012-10-30 19:49   ellepurr_files.xml
      693  2012-10-30 19:49   ellepurr_meta.xml
      602  2012-10-30 19:49   ellepurr_reviews.xml
---------                     -------
  1112067                     5 files
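If you want to script this, the URL pattern above can be assembled programmatically. Here is a minimal sketch in Python; the helper name `compress_url` is hypothetical, and percent-encoding the format names is an assumption (the pattern above only shows plain comma-separated names).

```python
from urllib.parse import quote

# URL prefix taken from the pattern described in this post.
BASE = "http://archive.org/compress"


def compress_url(identifier, formats=None):
    """Build the .zip download URL for an item (hypothetical helper)."""
    url = f"{BASE}/{quote(identifier)}"
    if formats:
        # Assumption: format names with spaces or other special characters
        # should be percent-encoded. Commas stay literal because they
        # separate the format names in the URL pattern.
        url += "/formats=" + ",".join(quote(name, safe="") for name in formats)
    return url


# Whole item, then just the three formats used in the wget example.
print(compress_url("ellepurr"))
print(compress_url("ellepurr", ["Metadata", "Checksums", "Flac"]))
```

The second call produces the same URL as the wget example, so the result can be handed to wget, curl, or any HTTP client.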

Enjoy!!

Posted in Audio Archive, Live Music Archive, Technical | Leave a comment

Internet Archive joins Open Wireless Movement

We are excited to join the Electronic Frontier Foundation and other open-minded organizations in the Open Wireless Movement. We have long believed that there should be many low-cost options for getting access to the Internet. Individuals and organizations sharing their WiFi networks with their neighbors can be one such option. The Open Wireless Movement shows how to do that safely and legally.

The Internet Archive has offered free open outdoor unrestricted WiFi since 1998 using 3 generations of equipment.   Currently we serve users in San Francisco libraries and about 5,000 families in housing projects as well as our neighbors in Richmond, California and San Francisco.

Fast and Free.

Posted in News | Leave a comment

10,000,000,000,000,000 bytes archived!

Ten Petabytes (10,000,000,000,000,000 bytes) of cultural material saved!

On Thursday, 25 October, hundreds of Internet Archive supporters, volunteers, and staff celebrated the addition of the 10,000,000,000,000,000th byte to the Archive’s massive collections.

We also announced

Computer Science legend Don Knuth played the Archive’s organ to open the program.

The only thing missing was electricity; the building lost all power just as the presentation was to begin. Thanks to the creativity of the Archive’s engineers and a couple of ridiculously long extension cords that reached a nearby house, the show went on.

Video of the show thanks to Jonathan Minard:

 

Posted in Announcements, News | 32 Comments

80 terabytes of archived web crawl data available for research

Internet Archive crawls and saves web pages and makes them available for viewing through the Wayback Machine because we believe in the importance of archiving digital artifacts for future generations to learn from.  In the process, of course, we accumulate a lot of data.

We are interested in exploring how others might be able to interact with or learn from this content if we make it available in bulk.  To that end, we would like to experiment with offering access to one of our crawls from 2011 with about 80 terabytes of WARC files containing captures of about 2.7 billion URIs.  The files contain text content and any media that we were able to capture, including images, flash, videos, etc.

What’s in the data set:

  • Crawl start date: 09 March, 2011
  • Crawl end date: 23 December, 2011
  • Number of captures: 2,713,676,341
  • Number of unique URLs: 2,273,840,159
  • Number of hosts: 29,032,069
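For a rough sense of scale, a few derived ratios can be computed from these figures. The sketch below is only back-of-the-envelope arithmetic: the function name `summarize` is hypothetical, and treating "about 80 terabytes" as exactly 80 × 10^12 bytes is an assumption.

```python
# Figures copied from the list above.
CAPTURES = 2_713_676_341
UNIQUE_URLS = 2_273_840_159
HOSTS = 29_032_069

# Assumption: "about 80 terabytes" is taken as 80 * 10**12 bytes.
TOTAL_BYTES = 80 * 10**12


def summarize(captures, unique_urls, hosts, total_bytes):
    """Return derived ratios for a crawl (hypothetical helper)."""
    return {
        "captures per unique URL": captures / unique_urls,
        "unique URLs per host": unique_urls / hosts,
        "average bytes per capture": total_bytes / captures,
    }


stats = summarize(CAPTURES, UNIQUE_URLS, HOSTS, TOTAL_BYTES)
for name, value in stats.items():
    print(f"{name}: {value:,.2f}")
```

Under those assumptions, this works out to about 1.2 captures per unique URL, a little under 80 unique URLs per host, and somewhere around 30 KB of WARC data per capture on average.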

The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date.  We used Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives.  The scope of the crawl was not limited except for a few manually excluded sites.  However, this was a somewhat experimental crawl for us, as we were using newly minted software to feed URLs to the crawlers, and we know there were some operational issues with it.  For example, in many cases we may not have crawled all of the embedded and linked objects in a page, since the URLs for these resources were added into queues that quickly grew bigger than the intended size of the crawl (and therefore we never got to them).  We also included repeated crawls of some Argentinian government sites, so results broken down by country will be somewhat skewed.  We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available “warts and all” for people to experiment with.  We have also done some further analysis of the content.

Hosts Crawled pie chart

If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it.  We may not be able to say “yes” to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.

 

Posted in News, Wayback Machine | 35 Comments