archive.org download counts of collections of items updates and fixes

Every month, we look over the total download counts for all public items at archive.org.  We sum item counts into their collections.  At year end 2014, we found various source reliability issues, as well as overcounting for “top collections” and many other issues.

archive.org public items tracked over time


To address the problems, we:

  • Rebuilt a new system to use our database (DB) for item download counts, instead of our less reliable (and more prone to “drift”) SOLR search engine (SE).
  • Changed monthly saved data from JSON and PHP serialized flatfiles to new DB table — much easier to use now!
  • Fixed overcounting issues for collections: texts, audio, etree, movies
  • Fixed various overcounting issues related to not unique-ing <collection> and <contributor> tags (more below)
  • Fixed character encoding issues on <contributor> tags

Bonus points!

  • We now track *all collections*.  Previously, we only tracked items tagged:
    • <mediatype> texts
    • <mediatype> etree
    • <mediatype> audio
    • <mediatype> movies
  • For texts items, whose <contributor> tags we track, we now have a “Contributor page” that shows a table of historical data.
  • Graphs are now “responsive” (scale in width based on browser/mobile width)

 

The Overcount Issue for top collection/mediatypes

  • In the below graph, mediatypes and collections are shown horizontally, with a sample “collection hierarchy” today.
  • For each collection/mediatype, we show one example item (A, B, C and D), with a downloads/streams/views count next to it in parentheses. So these are four items, spanning four collections, that happen to be in a collection hierarchy (a single item can belong to multiple collections at archive.org).
  • The Old Way had a critical flaw — it summed all sub-collection counts — when really it should have just summed all *direct child* sub-collection counts (or gone with our New Way instead)

overcount

So we now treat <mediatype> tags like <collection> tags for counting purposes, and we de-duplicate (“unique”) all <collection> tags on each item. This avoids another kind of overcounting, where items with slightly non-ideal tag data were counted more than once.
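To make the difference concrete, here is a minimal Python sketch of the two counting approaches. It is not our production code: the hierarchy, the collection names (`coll_b`, `coll_c`, `coll_d`), the download numbers and all function names are made up for illustration. It also assumes that an item in a sub-collection carries the tags of every ancestor, listed from broadest to most specific.

```python
from collections import Counter

# Hypothetical hierarchy (child -> parent); not real archive.org collections.
PARENT = {"coll_b": "texts", "coll_c": "coll_b", "coll_d": "coll_c"}

# Hypothetical items: (identifier, downloads, tags).
# Assumption: tags run from broadest to most specific, and an item in a
# sub-collection also carries the tags of every ancestor.
ITEMS = [
    ("A", 10, ["texts"]),
    ("B", 20, ["texts", "coll_b", "coll_b"]),  # duplicated tag
    ("C", 30, ["texts", "coll_b", "coll_c"]),
    ("D", 40, ["texts", "coll_b", "coll_c", "coll_d"]),
]


def new_way(items):
    """Add each item's downloads once to every distinct tag it carries."""
    totals = Counter()
    for _identifier, downloads, tags in items:
        for tag in set(tags):  # "unique" the tags first
            totals[tag] += downloads
    return totals


def descendants(collection):
    """Every collection below `collection`, at any depth."""
    found = []
    for child, parent in PARENT.items():
        if parent == collection:
            found.append(child)
            found.extend(descendants(child))
    return found


def direct_downloads(items):
    """Downloads of items whose most specific (last) tag is each collection."""
    totals = Counter()
    for _identifier, downloads, tags in items:
        totals[tags[-1]] += downloads
    return totals


def old_way(collection, items):
    """Flawed roll-up: adds the totals of ALL sub-collections, at any depth."""
    rolled_up = new_way(items)
    below = descendants(collection)
    return direct_downloads(items)[collection] + sum(rolled_up[c] for c in below)


def direct_children_way(collection, items):
    """Corrected roll-up: adds only the totals of direct child collections."""
    rolled_up = new_way(items)
    children = [c for c, p in PARENT.items() if p == collection]
    return direct_downloads(items)[collection] + sum(rolled_up[c] for c in children)


print(new_way(ITEMS)["texts"])              # 100
print(old_way("texts", ITEMS))              # 210
print(direct_children_way("texts", ITEMS))  # 100
```

In this toy data the four items have 100 downloads in total, which is what the per-item count and the direct-children roll-up both report for `texts`. The all-descendants roll-up reports 210, because the totals for `coll_c` and `coll_d` are already included in the total for `coll_b`.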

 

 

 

Posted in Audio Archive, Books Archive, Education Archive, Image Archive, Live Music Archive, Movie Archive, Music, Software Archive, Technical, Video Archive | Leave a comment

Community Wireless

The Internet Archive’s mission is universal access to knowledge. For us, that access happens over the Internet. In many places, there are two or fewer providers of fast Internet access, which tends to lead to high prices and bad service, and makes censorship too easy. We would like to see more options and are doing something where we can: in places where we own buildings, the Internet Archive provides free and fast Internet access. Currently, we cover parts of San Francisco and Richmond, California with Community Wireless. Our most recent community project is with Atchison Village, in Richmond.

There are two layers to this, an access layer that anyone can connect to with WiFi devices, and a backbone layer that connects the access layer to the Internet at large. The backbone layer is built and operated by the Internet Archive. We monitor its performance and upgrade parts as needed.

The access layer is largely built in a crowd-sourced manner by willing participants. Anybody can connect with their own WiFi devices. The Internet Archive recommends specific devices that we know work well, but access is not limited to those. We also recommend connecting rooftop-to-rooftop; while rooftop-to-couch might work for some people, best results are achieved with devices mounted outdoors with line-of-sight to the closest access point.

Participants will be responsible for their own devices, including purchasing them, mounting them, pointing them and keeping them powered. For recommended devices, the Internet Archive can provide initial configurations. If such a device’s configuration is changed, it is the participant’s responsibility to make it work.

There are a few caveats: Both layers operate in unlicensed frequency bands where interference is common and expected. The network is also a shared resource. Thus, experienced bandwidth and latency can and do vary. The Internet Archive will make a best effort to keep the backbone running well, but we cannot guarantee specific performance metrics. Also, expectations of what counts as an acceptable speed tend to go up over time. For this reason, we recommend upgrading devices about every three years, just like computers and phones.

Posted in News | Leave a comment

The New Yorker: The Cobweb–Can the Internet be archived?

Harvard history professor and New Yorker staff writer Jill Lepore has crafted a remarkable history of Web archiving, and of the role of our own Brewster Kahle and the Wayback Machine.


From the January 26, 2015 edition of The New Yorker.

My favorite passage:

Where is the Internet’s memory, the history of our time?

“It’s right here!” Kahle cries.

The machine hums and is muffled. It is sacred and profane. It is eradicable and unbearable. And it glows, against the dark. 

It’s well worth a read here.

 

Posted in Announcements, News | 1 Comment

University of California Libraries to partner with Archive-It

This week, the University of California’s California Digital Library (CDL) and the UC Libraries announced a partnership with the Internet Archive’s Archive-It service.

In the coming year, CDL’s Web Archiving Service (WAS) collections and all core infrastructure activities (crawling, indexing, search, display, and storage) will be transferred to Archive-It. WAS partners have captured close to 80 terabytes of archived content, most of which will be added to the 450 terabytes Archive-It partners have collected.

We are excited to work with CDL as we transition the UC (and other) libraries over to the Archive-It service. These UC libraries have unique and compelling collections (some dating back to 2006), including their Grateful Dead Web Archive: http://webarchives.cdlib.orggdarchive/a/gratefuldead which of course fits in quite nicely with the Internet Archive’s large collection of downloadable and streamed Grateful Dead shows in our Live Music Archive.

By collaborating with CDL, Archive-It can continue to expand the core functionalities of web archiving and work with CDL and other colleagues to develop new tools to advance the use of web archives. Such collaboration is sorely needed at this juncture and we welcome the opportunity to expand the capabilities of web archiving. By working together as a community we can create useful and sustainable web archives and ensure growth in the field of web archiving.

Be sure to check out some of the CDL collections:

Archiving the LGBT Web: Eastern Europe and Eurasia - UCB: http://webarchives.cdlib.org/a/lgbtwebeasterneurope
Federal Regional Agencies in California Web Archive - UC Davis: http://webarchives.cdlib.org/a/uscalagencies
Salvadoran Presidential Election March 2009 – Web Archive - UC Irvine: http://webarchives.cdlib.org/a/salvador
2009 H1N1 Influenza A (Swine Flu) Outbreak - UC San Diego: http://webarchives.cdlib.org/a/h1n1
California Tobacco Control Web Archive - UCSF: http://webarchives.cdlib.org/a/caltobaccocontrol

Posted in Announcements, Archive-It, News | Leave a comment

Mirroring the Stone Oakvalley Music Collection


The Internet Archive has begun mirroring a fantastic collection of music called the “Stone Oakvalley Music Collection”. When you visit one of their websites, the archive.org mirror is one of the choices for download. Going forward, the Archive will offer a full backup of the entire site (over a terabyte) for permanent storage.

Why the Stone Oakvalley Collection is important

Manufactured from the early 1980s to the mid 1990s, the Commodore 64 computer was a revolutionary piece of hardware and a critical introduction to programming for generations. It also had, within its design, a very well-regarded sound chip: the 6581/8580 SID (Sound Interface Device), whose unique properties in wave generation and effects gave a special sound in the hands of the right developers and musicians.


 

This successful piece of hardware was manufactured in the millions across the life of the C64, and in the late 1980s, the introduction of the Commodore Amiga computer brought to life an improved chipset for generating sound: the 8364, or PAULA. With a range of improvements to the sounds and music that could come out of this chip, the Amiga soared with capabilities that took years to match in other machines.

The Archive hosts many examples of music generated by these chips: our C64 Games Archive has hundreds of videos of games played on a Commodore 64, and searching for terms like “Amiga Music”, “Chiptunes” and “C64 Music” will yield a good amount of sound to enjoy.

But nothing comes close to the Stone Oakvalley Collection in terms of breadth, dedication, and craft in ensuring the unique sound of these chips can be enjoyed in the future.


The process, which is documented here, involved setting up a large amount of Commodore hardware connected to servers which would reboot the machines, over and over, playing thousands of pieces of music in different configurations, and automatically cataloging and saving the resulting waveforms. Considerations for modifications of the chipset over the years, of stereo versus mono recordings, and verification of the resulting 400,000 files have provided the highest quality of snapshots of this period.

Browsing the Collection

Currently, there are two websites for Stone Oakvalley’s collection – one based around the C64, and the other based around the Amiga.  Impeccable work has been done to catalog the music, so if there are songs or games you remember, they are likely to be saved on the site (and powered from Archive.org’s servers). Otherwise, browse the stacks of the sites and enjoy a soundscape of computer history.

The Internet Archive strives to provide universal access to the world’s knowledge. Through mirroring, hosting and gathering of data, our mission allows millions to gain ad-free, fast access to information and materials. Be sure to check our many collections on our main site.

Posted in Cool items, Software Archive | Comments Off

Update to Terms of Use

Internet Archive’s terms of use were written in March of 2001, and they haven’t changed once – until today. The terms were written before the Wayback Machine was launched (in October 2001) when we had 4 billion web pages with no public access and 360 Prelinger Archive movies in the archive. Now we have 435 billion web pages and more than 15 million public audio, video and text items. Times have changed, and we have made a small change to our terms to reflect this.

In the interest of transparency, we want to show you exactly what the change is.

We have made small changes in paragraphs two and three of the terms.  The previous version of these sections is in red below:

“…You agree not to interfere with the work of other users or Archive personnel, servers, or resources. Further, you agree not to recirculate your password to other people or organizations or to copy offsite any part of the Collections without written permission. Please report any unauthorized use of your password promptly to info@archive.org…

“…You agree to abide by all applicable laws and regulations, including intellectual property laws, in connection with your use of the Archive. In particular, you certify that your use of any part of the Archive’s Collections will be noncommercial and will be limited to noninfringing or fair use under copyright law. In using the Archive’s site, Collections, and/or services…”

This is the new version with the changed portion in green type:

“…You agree not to interfere with the work of other users or Archive personnel, servers, or resources. Further, you agree not to recirculate your password to other people or organizations. Please report any unauthorized use of your password promptly to info@archive.org…

“…You agree to abide by all applicable laws and regulations, including intellectual property laws, in connection with your use of the Archive. In particular, you certify that your use of any part of the Archive’s Collections will be limited to noninfringing or fair use under copyright law. If a Creative Commons or other license has been declared for particular material on the Archive, to the extent you trust the declaration and declarer (which is rarely the Internet Archive), you may use the content according to the terms and conditions of the applicable license. In using the Archive’s site, Collections, and/or services…”

Thank you for continuing to use the amazing resources housed in the Internet Archive.

UPDATE 12/31/14:  The change on 12/30 applied to the language in the third paragraph of the terms.  On 12/31 we made an additional small change to the language in the second paragraph, and modified the text of this post to reflect both changes.

Posted in News | 2 Comments

Burning Brewster’s Bitcoin

[Guest post, hope you enjoy. -brewster]

First Installment – Coinbase offers a service that is contrary to everything the company professes to hold dear
Internet Archive
Morgen E. Peck

This fall, Brewster reached out to me with a proposition. He wanted to know more about what it’s like moving between bitcoin and fiat currencies—where the trades are happening, which ones are scams and which ones are legit, how long they take to go through, how much of my privacy I have to forfeit, and especially what kinds of fees traders are skimming from each individual transaction. In short, what’s it like for people who have no bitcoin and want to get in? And once they do get in, what options do they have?

To get the answers, Brewster sent me on my way with one bitcoin. He told me to sell it and buy it again in as many ways as possible, and not to come back until I had whittled his money down to nothing.

So. This is the mission. Find out how many licks it takes to get to the center of a bitcoin, or lose it all to thievery and grift (crrrruuunch!!!). We’ll be running updates on my progress through this blog with the hope of informing casual bitcoin users and digital currency gurus alike.

First Stop: Coinbase

Coinbase is a bitcoin “wallet” (I’ll explain in a minute why I put this word in quotes) merged together with an exchange platform. Most of the people I know who are playing with Bitcoin as a whimsical investment arrived at Coinbase as the first point of entry. I suspect this is because Coinbase accounts link up with external bank accounts, thereby offering an intuitive and familiar interface to the financial infrastructure with which we’re all so well acquainted.

After Brewster sent one bitcoin to my address, I opened a Coinbase account and used the blockchain.info browser-based wallet to dump my funds into it.

Before we even get started, I’d like to note that using blockchain.info is the best experience I’ve encountered so far in this little experiment and I want to hold this transaction up as the ideal that we can use to judge all future stunts. The only better option would be to handle my transactions with a full Bitcoin client.

What I like is that the guys at Blockchain.info have done everything they can to keep their software true to the heart of Bitcoin. I can set up a wallet without giving them my name or email address. The private keys are in my sole possession. Basically, it’s all on me. If I lose the information that I need to access my account or I let it leak into the hands of a thief, well then I’m flat out of luck and I’ll probably learn to be more careful in the future.

This is what Bitcoin looks like without her makeup on, when she’s dragging herself off the couch to open the door for a package. And, the way I see it, she now has two options. She can either gussy herself up for people or she can try to teach people to accept her for who she is. I advocate the latter (and not merely because I’m wearing sweatpants as I type). I think that the best services will be the ones that leave most of the risk with the users while simultaneously taking pains to tutor them on how to manage key pairs, use cold storage, etc. In other words, part of what’s required in getting this whole Bitcoin thing to work is giving people a new way to understand digital ownership and, in general, just making people smarter. That’s not a bad thing.

Which brings me to Coinbase. As an exchange, Coinbase has functionalities, and therefore responsibilities, that surpass my blockchain.info wallet. It has to operate in conjunction with a world of passwords, bank account numbers and identity verification protocols, many of which are determined by federal regulations. But I still think it’s fair and instructive to ask whether or not the service retains any of the features that Bitcoin the network brings to the table.

What are these features? Coinbase lays out three of the most important ones right on the homepage of its own website. It touts Bitcoin as an open, global network, one which is “not controlled by any company or country,” (that’s #1) with transactions that are secure, “fast and cheap,” (that’s #2) which are processed without the need for collecting sensitive details about the user. “There is no need to give companies extra information or a blank check to bill you” (that’s #3).

Unfortunately, transactions made through Coinbase retain none of these properties. Not a single one. Unlike Bitcoin, Coinbase is a company and when you move your bitcoins to a Coinbase account, you give the company complete control over them. This is because, as I hinted at before, a Coinbase wallet is not a real wallet.

I know that Bitcoin has only been around for 5 years and the community is still in a tug-of-war over semantics. So maybe the term “wallet” is a work in progress. But it shouldn’t be. To me, it’s very clear what this word means. When we talk about wallets in the physical world, we’re talking about something we use to carry our cash around (and all the cursed things that accumulate in a billfold). The important thing about a wallet is that we have access to its contents. At any time we can reach in and pull out the money.

In Bitcoin, the proper analog for cash is the private keys that are used to sign transactions on the Bitcoin blockchain. Private keys are the only thing you really own in Bitcoin, and therefore, any real wallet should give you complete access to them.

Query Coinbase as to how to get your private keys and you will be directed to this message:

As Coinbase is a hosted wallet, we do not provide users with their private keys; doing so would prevent us from taking advantage of our secure cold-storage technology to protect your bitcoin funds.

Instead, you can submit transactions and sign messages using our web-based interface, bypassing the need for control of the private keys.

That pretty much does away with feature number one. Trust Coinbase with your bitcoins and you must trust them completely, because they give you no direct control. This is not a gateway to Bitcoin. It is a surrogate.

On to number two. Transactions processed through the Bitcoin network are fast and cheap. The transaction fees are mere pennies and the transactions themselves usually clear within an hour.

The same is true of a Coinbase transaction if all you are doing is moving money from one Bitcoin address to another. But buying and selling bitcoins is another matter completely. Hooking my Coinbase wallet up with my credit union account took days. Once that was settled, I sold my bitcoin across the Coinbase online exchange and waited for the money to land in my checking account. This took another four days, which is longer than I’ve waited when using other services like PayPal or Chase’s QuickPay bank transfer.

The fee was actually not too bad. I sold my bitcoin at $372.62. Of that, Coinbase took $3.88, which is just about one percent.

So, on to number three. Bitcoin is a payment network that eliminates the need for users to divulge sensitive information about themselves. Ownership is verified through strong cryptography that references pseudonyms rather than real-world identities.

This one you can definitely say goodbye to if you start trading on Coinbase, and even if you just use their wallet. As I mentioned, the company now knows my name, my bank account number (which they also have the ability to dip in and out of), and my email address. In addition to that, I’ve given Coinbase my phone number in order to set up two-factor authentication. And because they possess the private keys to all of the bitcoins I store in my Coinbase wallet, the company can associate my identity with any transactions they process.

Everything that was attractive about the Bitcoin protocol has been sacrificed to make the Coinbase service user friendly in a way that simulates modern banking and that indulges the dangerous, but well-engrained notion that we are better off trusting professionals to secure our digital information than we would be if we took control of it ourselves.

I’m only picking on Coinbase because it’s the first online exchange I’ve used. I hope to take a look at more of them in the coming weeks and I expect to find these strategies to be endemic.

But if I were to offer an opinion, I would recommend that anyone who has any admiration for Bitcoin, and for what this technology is doing to disrupt traditional payment processors, go ahead and use Coinbase to exchange between Bitcoin and fiat currencies, but get in and out as quickly as possible. The fees are pretty low compared to what else is available. But once you start using Coinbase to process transactions on the blockchain, you’re throwing everything beautiful about Bitcoin out the window.

Next up, I hit the Bitcoin ATMs in New York City and the open air trading nights at the Bitcoin Center near Wall Street.

Posted in Announcements, News | 3 Comments

Crusading librarian for openness passes: Cathy Norton

A live wire in the library field, and a firebrand for openness, Cathy Norton helped keep libraries free and open during this current digitization wave.

She was fun and opinionated, and we learned that she had the background and evidence to back up her bold statements: keep library materials free and open.

Cathy played a very important role in the development of our Book Digitization project in its early years. These were years when the future of book digitization’s growth and its public access was not certain. She stood up to the biggest tech companies; she took on publishers; she badgered research libraries to be broader than their local agendas; and, at the end of the day, she made a difference. Cathy remained contemporary, relevant and vocal up to the very end.

I (brewster) was grateful that when I would sail into Woods Hole and show up with bags of laundry and a salty demeanor, she would be welcoming and helpful. Always up for an adventure, she had a firm idea of the world she was trying to build.

On behalf of Digital Readers everywhere, the Internet Archive would like to raise a digital book to Cathy Norton, a champion of open knowledge, a positive force for collaboration and just a truly fun person who was up to take on any challenge related to moving libraries and public access forward. Thank you, Cathy, for what you helped create!

With celebration and sadness,

The Internet Archive, Brewster Kahle, Robert Miller, and the Open World

Here is the obituary that appeared for Cathy.

With sadness, the MBL notes the passing of former Library Director, Catherine N. Norton, who died peacefully at home after a battle with cancer. Cathy graduated from Sacred Hearts Academy, Fairhaven, MA, Regis College, Weston MA, and taught psychology at Chamberlane Jr. College while at Boston College graduate school more than fifty years ago.   She and her husband Thomas J. Norton moved to Falmouth for the “summer” but never left. She is survived by her 4 children whom she idolized and were with her when she passed, Dr. Margaret Molly Norton, Michael Norton, Kerrie Norton Marzot, and Thomas “Packy” Norton; and her grandchildren: Buddy Norton Estes, Toby Marzot, Drew Norton, Kate Norton, Hailey Norton, Roberto Marzot, and Julietta Marzot.

Cathy was active in community affairs. She served on the Falmouth school committee in the eighties and early nineties as chair and vice chair, was a town meeting member, and most recently represented Falmouth on the Steamship Authority board. She was instrumental in naming the new vessel “Woods Hole” that will be serving the islands from the Mainland.

Cathy lived for her family, friends, fun, faith and flowers. She remained long time friends with classmates from grammar school all the way through graduate school and showed how much she valued their friendship. In her professional life at the Marine Biological Laboratory she helped build international networks that spread digital information freely to countries that needed it from South America, to Africa, to Europe, and all the countries in between. A proponent of open access, she loved to travel to these countries and spread the word about the Biodiversity Heritage Library Project. As President of the Boston Library Consortium she helped form a group of libraries that worked with the Internet Archive to digitize open access books and journals, making them available to anyone with an internet connection.

Cathy had a flair for life, and her tremendous energy and can-do attitude guided her more than 30-year career at the MBL. Cathy came to the MBL in 1980 as a member of the MBLWHOI Library staff and earned a Masters in Information Science from Simmons College in 1984. In 1991, as the electronic frontier began to enhance information access, Cathy embraced change to become the MBL’s first Director of Information Systems. In 1994, she was appointed Library Director and became a leader in promoting the digital library and open access.

During her tenure she spearheaded the development of uBio, a digital biodiversity database that served as a foundation for the Encyclopedia of Life project. She helped develop an innovative Biomedical Informatics course sponsored by the National Library of Medicine designed to enable biomedical researchers and practitioners to embrace the power of technology. Cathy was also a founding member and served as Chairman of the Biodiversity Heritage Library, a worldwide collaboration of libraries and museums making biodiversity literature freely available. In 2011 Cathy retired as MBLWHOI Library Director and was named Library Scholar.

Beyond the MBL, Cathy was a Justice of the Peace for 39 years, marrying many happy couples on the beaches and back porches of Cape Cod.

Everyone who knew her has a “Cathy story” – how she inspired them with a project, connected them with another collaborator, worked her “magic” to make the seemingly impossible a reality, or made them laugh, especially with stories of weddings she presided over as a Justice of the Peace.

The MBL has established an endowed fund in Cathy’s honor, and its flag will be lowered in her memory. The family has requested that in lieu of flowers, please make donations to the Catherine N. Norton Endowed Fellowship at the MBL, www.mbl.edu/research/norton-fellowship.

A memorial service will be held on Saturday, December 27 at 11 AM at St. Patricks church on Main Street in Falmouth.

Posted in Announcements, Books Archive, News | 6 Comments

Lost Landscapes of San Francisco: Fundraiser Benefitting Internet Archive — Friday, December 19, 2014

Rick Prelinger’s Lost Landscapes of San Francisco is back for one final performance this year! Now you can catch this perennially sold-out show and your ticket donation will benefit the Internet Archive, a nonprofit digital library which hosts the Prelinger Collection. Please give generously to support the effort.


Friday, December 19, 2014
6 pm Reception
7:30 pm Film

300 Funston Ave.
San Francisco, CA 94118

Get tickets here!


This year’s LOST LANDSCAPES brings together familiar and unseen archival film clips showing San Francisco as it was and is no more. Blanketing the 20th-century city from the Bay to Ocean Beach and the Presidio to Bayview, this screening includes San Franciscans at work and play; early hippies in the Haight; a highly privileged walk on the unfinished Golden Gate Bridge; newly-discovered images of Playland and the waterfront; families living and playing in their neighborhoods; detail-rich streetscapes of the late 1960s; peace rallies in Golden Gate Park; 1930s color images of a busy Market Street; a selected reprise of greatest hits from years 1-8; and much, much more.

As usual, the viewers make the soundtrack — audience members are asked to identify places and events, ask questions, share their thoughts, and create an unruly interactive symphony of speculation about the city we’ve lost and the city we’d like to live in.

The film begins at 7:30 pm and is preceded by an informal reception that begins at 6:00 pm.

Posted in Announcements, News | 2 Comments

Declaration to be ‘Defensive’ for the Defensive Patent License

The Internet Archive hereby declares itself ‘Defensive’ by committing to offer a Defensive Patent License, version 1.1 or any later version, for any of its patents, to any DPL User.   The Internet Archive does not have any patents at this time.

Our contact address is:  info@archive.org

-brewster

Founder, Digital Librarian
Internet Archive

 Birthday and Announcement about DPL.

Posted in News | 1 Comment

Defensive Patent License: Troll Proofed. Innovation Protected.

Today the Defensive Patent License is officially released.   It is designed to bring free software ideas to the patent arena by encouraging patent owners to declare themselves “defensive,” and share their patents with others that have declared themselves defensive.


This way a large number of patents can be used to help create new products and services without fear of being sued. As more organizations join in becoming defensive, the set of patents gets larger and the incentive to become defensive grows.

The Internet Archive hosted the “birthday party” as the license was refined, and declared itself defensive.  Brewster Kahle helped spur this generation of the idea by collaborating with lawyers who worked for years to get this to happen.

In celebration of this release, today John Gilmore is dedicating an important portfolio of patents from Pixel Qi to be defensive.   Pixel Qi was a company run by Mary Lou Jepsen of OLPC fame, and partially funded by Brewster Kahle and John Gilmore.

Please consider joining in by declaring your organization defensive, whether you have patents or not.  The Internet Archive has declared itself defensive to support this effort.

 

 

 

Posted in Announcements, News | 3 Comments

430 Billion Web Pages Saved….Help Us Do More!

Dear Friends,

Today we launch our End-of-Year Campaign.  Once a year, I ask all of you to keep the Internet Archive going and growing stronger.   Please help us reach our goal of raising $1.5 million by the end of the year.  Your support will help pay for servers, bandwidth and our dedicated staff.

I founded the Internet Archive as a non-profit with a huge goal:  to give everyone access to all knowledge—the books, web pages, audio, television and software of our shared human culture. Forever.

Book Scanning with Table Top Scribe

Lan Zhu, a scanner at Internet Archive, at the Table Top Scribe. Zhu can scan a 300-page book in thirty minutes. Since 2005, the Internet Archive has digitized over 2.4 million books.

Together we are building the digital library of the future. A place where we can all go to learn and explore.

At the Internet Archive, we’ve preserved 430 billion web pages. People download 20 million books on our site each month. We get more visitors in a year than most libraries do in a lifetime. The key is to keep improving—and to keep it free. That’s where you can help us.

For the cost of buying a book, you can make a book permanently available for the next generation. Please consider donating $10, $25, $50 or whatever you can afford to support the Internet Archive before the end of this year. It’s a small amount to inform millions. Help us do more. I promise you, it’s money well spent.

Thank you,

Brewster Kahle
Founder, Digital Librarian
Internet Archive

Photos by David Rinehart/Internet Archive

Posted in Announcements, News | 14 Comments

Partnership Promotes Jobs and Builds Free Global Library

As part of its Building Libraries Together initiative, the Internet Archive is testing a new socially-responsible jobs model with Bay Area Rescue Mission (BARM) of Richmond, California.

The Internet Archive has been digitizing books for nearly 10 years, but needed help reaching a goal of 10 million eBooks. “We had so much high value content that needed to be digitized, but not enough staff to do the work,” explains Robert Miller, Director of Digital Books and Media. “We wondered how we could make our problem someone else’s solution.” BARM offers a ‘Healthy Living’ addiction recovery program, where over 350 men and women work in a residential setting designed to move them towards self-sufficiency and independent living. The challenge for the staff at BARM is that most of their graduating clients lack the job skills and professional résumé required for securing a job. Internet Archive can offer job skills and a work history. A conversation between Miller and Tim Hammack, Vice-President of the Bay Area Rescue Mission, ensued, and the Work Transition Program was born.

Candidates for the Internet Archive Work Transition Program are men and women from BARM who have completed a 12-month sober living, drug counseling or domestic abuse crisis program and are ready to re-enter the job market. This group often lacks relevant job skills, recent work experience, interpersonal and work relationship skills, self-confidence, and a résumé that a national or local employer would find compelling enough to grant an interview. The curriculum for the Internet Archive Work Transition Program lasts 9 months and focuses on ‘Learning-to-Work’. This three-phase program was based on lessons learned from the 600+ staff that the Archive has hired over the past 8 years. From these lessons, a program of progressive responsibility, constant feedback and a merit badging system was built to meet this challenge. Miller notes that this is not a make-work program. The work is substantive and needs to be completed to help get content online to share with the global community. “The Internet Archive Texts collections have over 20 million downloads each month and the material digitized by the team maintains our high standard of quality.”

To ‘grease the skids’ for the Work Transition Program graduates, Hammack and Miller contacted local companies, explaining that the program was not a handout and they weren’t looking for charity. They simply asked for a commitment from employers to grant the graduate an interview. Upon reviewing the program goals and expectations, local businesses including UPS, San Francisco Public Library, Costco and others signed on. The first class graduates in February 2015, but already two of the candidates have secured part-time employment.

Hammack is thrilled with the program, adding that “We take people on the worst day of their lives and help them achieve dignity, learn healthy living habits, while getting clean and sober. The Work Transition Program continues this path to recovery by helping them earn a job; a huge accomplishment!”

Special thanks to the teams at Internet Archive: Jesse Bell, Digitization Coordinator, and Antoine McGrath, Work Transition Supervisor; and at Bay Area Rescue Mission, headed by Tim Hammack, Vice-President of Operations. For more information about the program, contact Robert Miller.

Posted in Announcements, Books Archive, News | Comments Off

Music Analysis Beginnings

As mentioned in our recent Building Music Libraries post, we are working with researchers at Columbia University and UPF in Barcelona to run their code on the music collection to help their research and to provide new analyses that could help with exploration and understanding.

We are doing some pilot runs to generate files which some close observers may see in the music item directories on archive.org.  Audio fingerprints from audfprint are .afpt and music attributes from Essentia are in _esslow.json.gz (download sample) and _esshigh.json.gz.
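For anyone who wants to poke at these files, here is a minimal Python sketch of how they might be listed and read once downloaded. It assumes only what is stated above: the three file suffixes, and that the Essentia output is gzip-compressed JSON. The function names and the idea of a local item directory are hypothetical, and nothing here assumes what keys the JSON contains.

```python
import gzip
import json
from pathlib import Path

# Suffixes taken from the post; everything else below is illustrative.
ANALYSIS_SUFFIXES = (".afpt", "_esslow.json.gz", "_esshigh.json.gz")


def load_analysis(path):
    """Read one gzip-compressed JSON file and return the parsed object."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def analysis_files(item_dir):
    """Group the analysis files found in a local item directory by suffix."""
    found = {suffix: [] for suffix in ANALYSIS_SUFFIXES}
    for entry in sorted(Path(item_dir).iterdir()):
        for suffix in ANALYSIS_SUFFIXES:
            if entry.name.endswith(suffix):
                found[suffix].append(entry.name)
    return found
```

The fingerprint files (.afpt) are only listed, not parsed, since their internal format is not described here.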

Spectrogram of a Grateful Dead track

We are also creating image files showing the audio spectrum used.  We hope this is useful for those that want to see if files have been compressed in the past (even if they are posted as flac files now).  There is also a .png for each audio file of a basic waveform that is being used in the archive’s beta site as eye candy.

More as it happens, but we wanted you to know there is some progress and you will see some new files.  If you have proposed other analyses that would benefit from being run over a large corpus, please let us know by contacting info at archive dot org.

Thank you to the researchers and the Archive programmers who are working together to make this happen.

Posted in Audio Archive, Live Music Archive, Music | Comments Off

Using Docker to Encapsulate Complicated Program is Successful

The Internet Archive has been using docker in a useful way that is a bit out of the mainstream: to package a command-line binary and its dependencies so we can deploy it on a cluster and use it in the same way we would a static binary.

Columbia University’s Daniel Ellis created an audio fingerprinting program that was used in a competition.   It was not packaged as a Debian package or in any other distribution format.   It took a while for our staff to work out how to install it and its many dependencies consistently on Ubuntu, and even then it seemed pretty heavy-handed to install all of that on our worker cluster.    So we explored using docker and it has been successful.   While this is old hat for some, I thought it might be interesting to explain what we did.

1) Created a docker file to make a docker container that held all of the code needed to run the system.

2) Worked with our systems group to figure out how to install docker on our cluster with a security profile we felt comfortable with.   This included running the binary in the container as user nobody.

3) Ramped up slowly to test the downloading and running of this container.   In general it would take 10-25 minutes to download the container the first time. Once cached on a worker node, it was very fast to start up.    This cache is persistent between many jobs, so this is efficient.

4) Used the container as we would a shell command, passing files into the container by mounting a sub filesystem for it to read from and write to.   This also helped with signaling errors.

5) Starting production use now.
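To make steps 2 and 4 a bit more concrete, here is a rough Python sketch of wrapping `docker run` so a containerized tool behaves like an ordinary command. This is not our production code. The image name, the `/work` mount point, and the function names are placeholders, and it assumes the `docker` CLI is installed and that a `nobody` user exists inside the image.

```python
import subprocess
from pathlib import Path


def docker_command(image, work_dir, tool_args, user="nobody"):
    """Build the argv for running a containerized tool like a shell command."""
    host_dir = Path(work_dir).resolve()
    return [
        "docker", "run",
        "--rm",                     # remove the container once it exits
        "--user", user,             # run the binary as an unprivileged user
        "-v", f"{host_dir}:/work",  # mount a host directory to read from and write to
        "-w", "/work",              # start the tool inside the mounted directory
        image,
        *tool_args,
    ]


def run_tool(image, work_dir, tool_args):
    """Run the container; a non-zero exit status raises CalledProcessError."""
    return subprocess.run(docker_command(image, work_dir, tool_args), check=True)


if __name__ == "__main__":
    # Placeholder image and arguments, shown only to illustrate the call shape.
    print(docker_command("example/fingerprinter", ".", ["input.flac"]))
```

Because `docker run` normally exits with the status of the command it ran, the calling job can detect failures the same way it would for a static binary.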

We hope that docker can help us with other programs that require complicated or legacy environments to run.

Congratulations to Raj Kumar, Aaron Ximm, and Andy Bezella for the creative solution to a problem that could have made it difficult for us to use some complicated academic code in our production environment.

Go docker!

Posted in Music, Technical | 3 Comments

SEEKING: Visual Studies PostDoc for an Exciting New Opportunity at Internet Archive!

Council on Library and Information Resources

Today, the Internet Archive and the Council on Library and Information Resources (CLIR) announced a new position:

Visual Data Curation Fellow

Do you know a recent Ph.D. in Visual Studies (film, photography, information sciences, fine art) who would like to work at the Internet Archive? We’re looking for a talented post-doc to come work with our growing Film Archive. This two-year position is based at the Internet Archive offices in San Francisco and runs from July 1, 2015 through June 30, 2017. We want to thank CLIR and the Andrew W. Mellon Foundation for a generous grant to support this position. For more information visit CLIR.  Applications are open here through December 29.

Posted in News | Comments Off

Lost Landscapes of San Francisco: Fundraiser Benefitting Internet Archive — Friday, December 19, 2014

Rick Prelinger’s Lost Landscapes of San Francisco is back for one final performance this year!   Now you can catch this perennially sold-out show and your ticket donation will benefit the Internet Archive, a nonprofit digital library which hosts the Prelinger Collection. Please give generously to support the effort.


Friday, December 19, 2014
6 pm Reception
7:30 pm Film

300 Funston Ave.
San Francisco, CA 94118

Get tickets here!


This year’s LOST LANDSCAPES brings together familiar and unseen archival film clips showing San Francisco as it was and is no more. Blanketing the 20th-century city from the Bay to Ocean Beach and the Presidio to Bayview, this screening includes San Franciscans at work and play; early hippies in the Haight; a highly privileged walk on the unfinished Golden Gate Bridge; newly-discovered images of Playland and the waterfront; families living and playing in their neighborhoods; detail-rich streetscapes of the late 1960s; peace rallies in Golden Gate Park; 1930s color images of a busy Market Street; a selected reprise of greatest hits from years 1-8; and much, much more.

As usual, the viewers make the soundtrack — audience members are asked to identify places and events, ask questions, share their thoughts, and create an unruly interactive symphony of speculation about the city we’ve lost and the city we’d like to live in.

The film begins at 7:30 pm and is preceded by an informal reception that begins at 6:00 pm.

Posted in Announcements, Event, News | 2 Comments

Inviting the Internet Over to Play

At our Annual Event last week, the Archive announced a variety of new projects and plans, including our new beta interface, our compact book scanner, and our progress in tracking political ads on television. The event (full video is here) went very well, with lots of activities and social gathering before and afterwards, and included the first public unveiling of our newest project, the Internet Arcade.

Photo by Kyle Way

It was obvious we were on to something – the smallish room with the two stations set up to play emulated arcade games from the collection was constantly packed. Players young and old tried out classic video games, including parents showing their children games they’d played in their own teenage years. All of it was running off the Archive’s own web pages through standard web browsers, with no special plug-ins – and it held up well. We even tracked high scores.

The party, of course, was just the beginning – over the weekend, we quietly announced that the Internet Arcade was available through the main site. With over 900 arcade machines in the collection, most every major machine released between 1976 and 1988 was included. (The emulation system we use, JSMESS, is a JavaScript port of a long-running emulation project called MESS/MAME, which has had hundreds of contributors over the years – we salute them.)

After an initial tweet or two, the Arcade’s existence went from a mention by Waxy and Laughing Squid, to sites like Hacker News and Mashable, and from there it hit larger and larger audiences. Within a few hours news had spread to a whole range of sites, including Joystiq, The Verge, Engadget, CNN, PC World, Gizmodo, Ars Technica… and, well, let’s just say a very large number of sites were reporting on this story.

And that’s when the world showed up.

We’re still counting, but we know hundreds of thousands of people came, many of them all at once, to play.

And as these thousands of curious visitors and first-time callers came to the Archive to try out our collection, minor inefficiencies became showstoppers and the site was temporarily crushed. Our brave administration team persevered, repairs were made, and the site settled in for the new reality:

That's a lot of new visitors!

Everything’s fine and normal… then we crash and fix things… and WOW that’s a lot of new visitors!

This crush of new visitors is coming to the Internet Archive, possibly for the first time ever, and we welcome them with open arms. After all, that’s what we were founded for –  our stated purpose is to function as the Internet’s Library, with stored websites, digitized texts, music, movies and software.  It’s our mission as a non-profit library: make as much of culture and information available to as many people as possible. You can lose a workday or a whole winter in our virtual stacks, and our users often do.

Meanwhile, the story continues to have legs, appearing in newspapers, on radio shows, video podcasts, and message boards around the world.

And then we made it to TV news.

So now that we have (apparently) the world’s attention… ahem ahem..

Even we don’t know where this story is going to lead. But one thing is sure – video games and software are as important a part of history and culture as books, movies and music have been in the past.  And we’re dedicated to bringing all of this to you, the Internet. Sure, it can be a bit surprising when the entire internet comes over to play, but we wouldn’t have put out the welcome mat if we didn’t want you to visit.

As a non-profit, we depend heavily on user donations to stay afloat – we even take Bitcoin and subscriptions. Keeping 20 petabytes of information flowing, fast and free, is what we’re working on day and night and the positive messages and feedback we’ve gotten this past week (and over the years) tell us we’re doing the right thing.

The JSMESS emulation project is one of many open-source projects the Internet Archive is involved with, and while a lot of it is fun and games, we’ve got a serious side too, gathering up disappearing web resources and important historical events into our archives to preserve for future generations. We hope that after you relive your childhood or live out a second new one, you’ll stick around and see what else we have here. It’s quite a place.

Game on!

Posted in Announcements, News, Software Archive | 3 Comments

Redesigning Archive.org

Last week we announced a new beta version of the archive.org site.  The beta is the first step toward inviting people to participate in building libraries together.

archive.org screenshots: 1997, 2000, 2001, 2002, 2005, 2014, and the 2014 beta site

Why redesign the site?

The Wayback Machine was launched in 2001, and the current look of the site debuted in 2002 when we added movies, texts, software, and music.  There have been minor design changes and we’ve added features over the years to make the library materials more usable, but the current interface has just accumulated over time.  We have not “rethought” the site in a holistic way in the past 12 years.

A lot has changed since 2002, for the Internet Archive and on the web.  In 2002 the archive contained 5,000 non-Wayback items, about half movies from the Prelinger Archive and half live music concerts from the Etree.org community with a few books and pieces of software sprinkled in. Those 5,000 files added up to about 3 terabytes of data.  Today we have more than 20 million media items that add up to about 10,000 terabytes of data (that’s not including 435 billion saved web pages that take up an additional 10,000 terabytes of space).

As we added more stuff to the archive, people came to visit.  We ended 2002 with about 9,000 registered users.  Today we have just a hair under 2 million registered users, and around 2.5 million individuals use the library materials every day.

Having thousands of movies available on the Internet in 2002 was actually pretty rare (remember, YouTube didn’t exist until 2005). Those 5,000 media items couldn’t be played on our site – you had to download them to your own computer to watch or listen. It was very difficult to add your own files to the Internet Archive – and who would have had the bandwidth to do it anyway?  In 2002 only 21% of U.S. homes had “high speed” internet connections.  High speed back then meant 200 kb per second. [1]

And of course, we can’t forget mobile. About 20-30% of our users today are on mobile devices, and the current web site is not serving them well.

Over the years the archive has grown immensely in terms of material and patrons. Our mission is Universal Access to All Knowledge.  And we think we can do better both with Access and with gathering All Knowledge if we have new tools and a better interface for the site.

Why this interface?

We started talking about the redesign in January of this year.  (Well, honestly we’ve been talking about it since 2006, but this was the first serious, archive-wide project.)

First we found a wonderful Creative Director, David Merkoski, and hired a great designer, Kristen Schlott.  We interviewed people, both users of the archive and people who had never heard of us, and asked them questions about how they use media. We examined how our site was being used, and talked about the intricacies and complications that come with archiving 20 million disparate things. We researched how other sites deal with large amounts of media. We used our current collections and use cases to understand how different designs would perform. Our lead developer, Tracey Jaquith, built prototypes and we user tested them. We talked to some of our power users and partners about our plans and showed them the prototype to get feedback. We had a LOT of meetings.

Idea clustering after user interviews

During this process we realized that we needed to find a way to open the archive up to more participation.  The Internet Archive has built some important and useful collections, both with partners and on our own.  We digitize 1,000 books per day.  We archive 1 billion URLs every week.  We capture television 24 hours per day, every single day.  But there is a lot of media out there in the world, and we can’t save all of it for the future without the help of experts.

Who are the experts?  You!  There are some amazing collections of media in the archive, out on the web, and sitting around on shelves and in basements that have been created by the people who know and care the most about saving those things and making sure their collections are complete and well described.  We want to create a place for those people to build communities around their interests where they can safely store these amazing collections and show them to as many people as possible.  If we all work together, we can create the most useful library the world has ever seen.

WHEN!?

Today the beta has the same basic functions as the current site, with some great additions: more visual cues to help you find things, facets on collections to quickly get you where you want to go, easy searching within collections, user pages, and many more.  We think it’s already an improvement over the current site – otherwise, we wouldn’t be showing it to you yet!

But the tools that will allow you to create your own collections and collaborate with others are still being built.  These features will be released in stages so that we can test them out in the beta and see how they work for people.  We will use feedback from patrons – both what you tell us, and the usage logs for the beta – to make decisions about how things will evolve. (Don’t worry, we aren’t keeping IP addresses — the beta respects user privacy.) When you’re in the beta, you’re going to run into things that might not work quite the way you expected, or that have suddenly changed since you used them yesterday. Sometimes it will be slow or you’ll find bugs. New things will appear, and other things may disappear. New tools will suddenly start working. We hope that for our intrepid beta users, this will be part of the fun. (Because we certainly think it’s fun!)

What new things are coming?

To some extent, this remains to be seen.  We will in part make decisions based on how the beta is used, so please use it!

Our current ideas include: speeding up the site; allowing patrons to create their own collections; improving accessibility for the print disabled; and adding ways for patrons to collaborate around collections and items.

There’s a lot more to come.  We hope you will explore all of these new options with us, and help us build the library.  If you would like to give us feedback, please write to us at info at archive dot org, or leave comments here.

Posted in News | 6 Comments

New York Times: The Internet Archive, Trying to Encompass All Creation

Thanks to the New York Times for doing a great write-up of our annual celebration.  Check it out!

Posted in News | Comments Off