Web Archiving with National Libraries

National Library of Australia

After the Internet Archive started web archiving in the late 1990s, national libraries also took their first steps towards systematic preservation of the web. Over 30 national libraries currently have a web archiving programme. Many of them archive the web under a legal mandate: an extension of the Legal Deposit system that covers non-print publications and enables heritage institutions such as a national library to collect copies of online publications within a country or state.

The Internet Archive has a long tradition of working with national libraries. As a key provider of web archiving technologies and services, the Internet Archive has made available open-source software for crawling and access, enabling national bodies to undertake web archiving locally. The Internet Archive also runs a global web archiving service for the general public, a tailored broad crawling service for national libraries, and Archive-It, a subscription service for creating, managing, accessing and storing web archive collections. Many national libraries are partners of these services.

The Internet Archive conducted a stakeholders’ consultation exercise between November 2015 and March 2016, with the aim of understanding current practices, reviewing the Internet Archive’s services in that light, and exploring new services for national libraries. Thirty organizations and individuals were consulted, representing national libraries, archives, researchers, independent consultants and web archiving service providers.

The main findings of the consultation, summarized below, give an overview of current web archiving practice at national libraries, a general impression of the field’s progress, and specific feedback on the Internet Archive’s role and services.

  • Strategy and organization
    Web archiving has become increasingly important in national libraries’ strategy. Many want to own the activity and develop the capability in-house, which requires integrating web archives with the library’s other collections and with traditional collection-development practice. Budget cuts and a lack of resources were observed at many national libraries, making it difficult to sustain the ongoing development of web archiving tools.
  • Quality and comprehensiveness of collection
    There is general frustration about content gaps in the web archives. National libraries also have a strong desire to collect the portions of Twitter, YouTube, Facebook and other social media that are considered part of their respective national domains. They would also like to leverage web archiving as a complementary tool for collecting digital objects that live on the web and end up in web archives, such as eBooks, eJournals, music and maps.
  • Access and research use
    National web archives are, in general, poorly used due to access restrictions. Many national libraries wish to support research use of their web archives, by engaging with researchers to understand requirements and eventually embedding web archive collections into the research process.
  • Reflection on 20 years of web archiving
    While there is recognition of the progress in web archiving, there is also a general feeling that the community is stuck in a certain way of doing things, has made no significant technological progress in the last ten years, and is being outpaced by the fast-evolving web.
  • Perception and expectation of Internet Archive’s services
    Aspects of the Internet Archive’s current services are unknown or misperceived. Stakeholders wish for services that complement what national libraries undertake locally and help them put in place better web archives. There is a strong expectation for the Internet Archive to lead the ongoing collaborative development of key software, especially Heritrix and the Wayback software. A number of national libraries have expressed the need for a service supporting the use of such software, including maintenance, support and new features. There is also clear interest in services that can help libraries collect advanced content such as social media and embedded videos.

The Internet Archive would like to thank the participants again for being open with us and providing us with valuable input which will inform the development and improvement of our services.

The full consultation report can be accessed at https://archive.org/details/InternetArchiveStakeholdersConsultationFindingsPublic.

Posted in Announcements, News | 2 Comments

IA + ARC + Cuba

CubaMusicWeek

Cuba Music Week is a live and online effort – both crowd sourced and curated – to highlight the importance and beauty of Cuban Music. One goal is to introduce people to ideas and music from this vibrant culture.

In the past we have created “weeks” on Muslim music, Brazil and India. To do this we contact artists, academic institutions, bloggers, broadcasters, venues and collectors, asking them to send essays, activities and events that can be coordinated with our event. Sometimes the response is great, sometimes not.

Cuba is our fourth attempt, and we have partnered with Cubadiscos, a Cuban government organization that hosts a weeklong music festival and a symposium on the music in Havana. Internet access in Cuba is limited, so the festival has no website; we have posted a list of their activities on our site, from a list that we only received the day before the festival began!

Just for fun, have a look at the galleries of record covers – cha cha, maybe? Our galleries are one of the best features we create. The ARC doesn’t scan images of other people’s holdings or borrow materials for the site – we own everything pictured. A few of the recordings are taken from the joint ARC and Internet Archive collection stored out in the Richmond warehouses. Here are two sweet ‘almost’ Cuban, afro-Cuban recordings from this collection, donated by the family of Jerry Adams.

Dizzy Gillespie, Afro

Herbie Mann, Flautista!

Mr. Adams was a radio DJ who became a major voice in promoting the Monterey Jazz Festival and helped Clint Eastwood build his collection. So there is some very nice stuff here – and a good reason why the Internet Archive is, and should be, going after audio collections of this quality with us.

One of the best features of the site is the databases: listings of the Cuban recordings here at the ARC, plus glossaries of genres and instruments – many hundreds of styles and instruments briefly described. It’s information that is only available here. Soon everything will be stolen by Wikipedia, but for now this is probably the only easy-to-find source for much of it. For audio fun we have worked with the Peabody Award-winning radio show Afropop Worldwide to bring everyone 18 hours on Cuban music. Soon all of their 25+ years of audio will be available on the Internet Archive.
An important outgrowth of this project is our work – both the Internet Archive’s and the Archive of Contemporary Music’s – with the Cuban National Library José Martí. Last year I met with Pedro Urra, who was working on a project to convert the library’s old typed and handwritten index cards describing the recordings in their collection into OCR-readable form. For us they rushed this project forward, and now more than 30,000 cards have been scanned, making this data available online to scholars for the first time. The catalog is available here.

Our Cuba Music Week site will remain active as an online resource, making this culturally significant body of work readily available to people around the globe for study and enjoyment.

Do have a look at Cuba Music Week and spread the word.

Thanks,  B. George,

Director, The ARChive of Contemporary Music, NYC.

Sound Curator, The Internet Archive, San Francisco

 

Posted in News | Leave a comment

Join us for the first Decentralized Web Summit — June 8-9, in SF

Decentralized Web Summit: Locking The Web Open at the Internet Archive

The first Decentralized Web Summit is a call for dreamers and builders who believe we can lock the Web open for good. The goal of the Summit (June 8) and the Meetup featuring lightning talks and workshops (June 9) is to spark collaboration and take concrete steps to create a better Web.

Together we can build a more reliable, more dynamic, and more private Web on top of the existing web infrastructure.

At the Summit on June 8, the “father of the Internet,” Vint Cerf, will share with us his “Lessons from the Internet,” the things he’s learned in his 40+ years that may help us create a new, more secure, private and robust Web. EFF’s Cory Doctorow, such a fine weaver of digital dystopias in his science fiction, will share what has gone awry with the current Web and what kind of values we need to build into the code this time.

Current builders of decentralized technologies will be on hand to share their visions of how we can build a fully decentralized Web. The founders and builders of IPFS, the Dat Project, WebTorrent, Tahoe-LAFS, zcash, Zeronet.io, BitTorrent, Ethereum, BigChainDB, Blockstack, Interledger, Mediachain, MaidSafe, Storj and others will present their technologies and answer questions. If you have a project or workshop to share on June 9, we’d love to hear from you at Dwebsummit@archive.org.

You can join the conversation in our Decentralized Web Slack channel, or — as a decentralized option — you can join the Slack as a guest through Matrix.

It will take the passion and expertise of many to lock the Web open. As Internet Archive founder Brewster Kahle wrote last year:

We can make openness irrevocable.
We can build this.

We can do it together.

On June 8-9, let’s collaborate to get there.

For more information and official schedule, go to decentralizedweb.net.

Event Info:

Wednesday, June 8, 2016 at 8:00 AM – Thursday, June 9, 2016 at 8:00 PM

Internet Archive, 300 Funston Avenue, San Francisco, CA 94118

Please register on our Eventbrite (limit 250 participants on June 8).

Posted in Announcements, Event | 8 Comments

The tech powering the Political TV Ad Archive

Ever wonder how we built the Political TV Ad Archive? This post explains what happens backstage — how we use advanced technology to count how many times a particular ad has aired on television, where, and when, in the markets that we track.

There are three pieces to the Political TV Ad Archive:

  • The Internet Archive collects, prepares, and serves the TV content in markets where we have feeds. Collection of TV is part of a much larger effort to meet the organization’s mission of providing “Universal Access to All Knowledge.” The Internet Archive is the online home to millions of free books, movies, software, music, images, web pages and more.
  • The Duplitron 5000 is our whimsical name for an open source system responsible for taking video and creating unique, compressed versions of the audio tracks. These are known as audio fingerprints. We create an audio fingerprint for each political ad that we discover, which we then match against our incoming stream of broadcast television to find each new copy, or airing, of that ad. These results are reported back to the Internet Archive.
  • The Political TV Ad Archive is a WordPress site that presents our data and videos to the rest of the world. On this website, for the sake of posterity, we also archive copies of political ads that may be airing in markets we don’t track, or exclusively on social media. But for the ads that show up in areas where we’re collecting TV, we are able to present the added information about airings.

 

Step 1: recording television

We have a whole bunch of hardware spread around the country to record television. That content is then pieced together to form the programs that get stored on the Internet Archive’s servers. We have a few ways to collect TV content. In some cases, such as the San Francisco market, we own and manage the hardware that records local cable. In other cases, such as markets in Ohio and Iowa, the content is provided to us by third party services.

Regardless of how we get the data, the pipeline takes it to the same place. We record in minute-long chunks of video and stitch them together into programs based on what we know about the station’s schedule. This results in video segments of anywhere from 30 minutes to 12 hours. Those programs are then turned into a variety of file formats for archival purposes.
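To make the stitching step concrete, here is a minimal sketch of how minute-long chunks might be concatenated with ffmpeg’s concat demuxer. The directory layout and file naming are assumptions for illustration, not our actual pipeline:

    # stitch_chunks.py - a sketch: concatenate minute-long recordings
    # into one program file using ffmpeg's concat demuxer.
    import subprocess
    from pathlib import Path

    def stitch(chunk_dir: str, start: str, end: str, out_path: str) -> None:
        """Join chunks named like '20160301-0500.ts' (lexicographic == chronological)."""
        chunks = sorted(p for p in Path(chunk_dir).glob("*.ts")
                        if start <= p.stem <= end)
        list_file = Path(chunk_dir) / "concat.txt"
        list_file.write_text("".join(f"file '{p.resolve()}'\n" for p in chunks))
        # Copy streams without re-encoding; keeps the archival quality intact.
        subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0",
                        "-i", str(list_file), "-c", "copy", out_path], check=True)

    # stitch("/recordings/KQED", "20160301-0500", "20160301-0659", "program.ts")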

The ad counts we publish are based on actual airings, as opposed to reported airings. This means that we are not estimating counts by analyzing Federal Election Commission (FEC) reports on spending by campaigns. Nor are we digitizing reports filed by broadcasting stations with the Federal Communications Commission (FCC) about political ads, though that is a worthy goal. Instead we generate counts by looking at what actually has been broadcast to the public.

Because we are working from the source, we know we aren’t being misled. On the flip side, this means that we can only report counts for the channels we actively track and record. In the first phase of our project, we tracked more than 20 markets in 11 key primary states (details here). We’re now in the process of planning which markets we’ll track for the general elections. Our main constraint is simple: money. Capturing TV comes at a cost.

A lot can go wrong here. Storms can affect reception, and packets can be lost or corrupted before they reach our servers. The result can be time shifts or missing content. But most of the time the data winds up sitting comfortably on our hard drives unscathed.

Step 2: searching television

Video is terrible when you’re trying to look for a specific piece of it. It’s slow, it’s heavy, and it is far better suited for watching than for working with. But sometimes you need to find a way.

There are a few things to try. One is transcription: if you have a time-coded transcript you can do almost anything, like create a text editor for video, or search for key phrases such as “I approve this message.”

The problem is that most television is not precisely transcribed. Closed captions are required for most U.S. TV programs, but not for advertisements. Shockingly, most political ads are not captioned. There are a few open source tools out there for automated transcript generation, but the results leave much to be desired.
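Where a time-coded transcript does exist, phrase search is easy. Here is a toy sketch that scans SRT-style captions for a phrase and reports when it was spoken (illustrative only; real broadcast captions need more careful parsing):

    # caption_search.py - find when a phrase was said in SRT-style captions.
    import re

    SRT_BLOCK = re.compile(
        r"(\d+)\s+(\d\d:\d\d:\d\d),\d+ --> .*?\n(.*?)(?:\n\n|\Z)", re.S)

    def find_phrase(srt_text, phrase):
        """Yield (start time, caption text) for blocks containing the phrase."""
        for _, start, text in SRT_BLOCK.findall(srt_text):
            flat = " ".join(text.split())
            if phrase.lower() in flat.lower():
                yield start, flat

    # for t, line in find_phrase(open("debate.srt").read(), "I approve this message"):
    #     print(t, line)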

Introducing audio fingerprinting

We use a free and open tool called audfprint to convert our audio files into audio fingerprints.

An audio fingerprint is a summarized version of an audio file, one that has removed everything except the most interesting pieces of every few milliseconds. The trick is that the summaries are formed in a way that makes it easy to compare them, and because they are summaries, the resulting fingerprint is a lot smaller and faster to work with than the original.

The audio fingerprints we use are based on frequency. Sounds are made up of waves, and each wave repeats (oscillates) at a different rate. Faster oscillations make higher sounds; slower oscillations make lower sounds.

An audio file contains instructions that tell a computer how to generate these waves. Audfprint breaks the audio files into tiny chunks (around 20 chunks per second) and runs a mathematical function on each fragment to identify the most prominent waves and their corresponding frequencies.

The rest is thrown out, the summaries are stored, and the result is an audio fingerprint.

If the same sound exists in two files, a common set of dominant frequencies will be seen in both fingerprints. Audfprint makes it possible to compare the chunks between two sound files, counting how many they have in common and how many appear at roughly the same distance from one another.
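To make the mechanics concrete, here is a deliberately simplified sketch of those two steps: keep one dominant frequency per chunk, then count shared chunks at each alignment. The real audfprint algorithm is more sophisticated (it hashes pairs of spectral peaks), so treat this as an illustration of the idea, not the implementation:

    # toy_fingerprint.py - a simplified chunk-based audio fingerprint.
    import numpy as np

    RATE = 11025          # samples per second
    CHUNK = RATE // 20    # ~20 chunks per second, as described above

    def fingerprint(samples):
        """Return the dominant frequency bin of each ~50 ms chunk."""
        n = len(samples) // CHUNK
        chunks = samples[: n * CHUNK].reshape(n, CHUNK)
        spectra = np.abs(np.fft.rfft(chunks, axis=1))
        return spectra.argmax(axis=1)      # one "landmark" per chunk

    def best_match(fp_a, fp_b):
        """Count matching chunks at each alignment; return (offset, count)."""
        best = (0, 0)
        for offset in range(-len(fp_b) + 1, len(fp_a)):
            a = fp_a[max(offset, 0): offset + len(fp_b)]
            b = fp_b[max(-offset, 0): max(-offset, 0) + len(a)]
            count = int(np.sum(a == b))
            if count > best[1]:
                best = (offset, count)
        return best

If two files contain the same ad, best_match finds an offset where the match count spikes far above chance; that spike is what tells us an ad aired.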

This is what we use to find copies of political ads.
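In day-to-day use we don’t reimplement any of this; audfprint is driven from the command line. Per its documentation, building a fingerprint database and matching a query look roughly like this sketch (file names invented for illustration; check the README of your version for exact flags):

    # A sketch of driving audfprint from Python. The subcommands ('new',
    # 'match') and the --dbase flag follow audfprint's documented CLI.
    import subprocess

    # Build a fingerprint database from a reference recording (an ad).
    subprocess.run(["python", "audfprint.py", "new",
                    "--dbase", "ads.pklz", "ads/example_ad.wav"], check=True)

    # Match an hour of broadcast audio against the database.
    subprocess.run(["python", "audfprint.py", "match",
                    "--dbase", "ads.pklz", "broadcast/station_hour05.wav"], check=True)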

Step 3: cataloguing political ads

When we discover a new political ad the first thing we do is register it on the Internet Archive, kicking off the ingestion process. The person who found it types in some basic information such as who the ad mentions, who paid for it, and what topics are discussed.

The ad is then sent to the system we built to manage our fingerprinting workflow, which we whimsically call the Duplitron 5000, or “DT5k.” It uses audfprint to generate fingerprints, organizes how the fingerprints are stored, processes the comparison results, and allows us to scale across millions of minutes of television.

DT5k generates a fingerprint for the ad, stores it, and then compares that fingerprint with hundreds of thousands of existing fingerprints for the shows that had been previously ingested into the system. It takes a few hours for all of the results to come in. When they do, the Duplitron makes sense of the numbers and tells the archive which programs contain copies of the ad and what time the ad aired.

These results end up being fairly accurate, but not perfect. The matches are based on audio, not video, which means we run into trouble when a political ad uses the same soundtrack as, for instance, an infomercial.

We are working on improving the system to filter out these kinds of false positives, but even with no changes these fingerprints have provided solid data across the markets we track.


The Duplitron 5000, counting political ads. Credit: Lyla Duey.

Step 4: enjoying the results

And now you understand a little bit more about our system. You can download our data and watch the ads at the Political TV Ad Archive. (For more on our metadata – what’s in it, and what you can do with it – read here.)

Over the coming months we are working to make the system more accurate. We are also exploring ways to identify newly released political ads without any need for manual entry.

P.S. We’re also working to make it as easy as possible for any researchers to download all of our fingerprints to use in their own local copies of the Duplitron 5000. Would you like to experiment with this capability? If so, contact me on Twitter at @slifty.

Posted in Announcements, News, Television Archive | Comments Off on The tech powering the Political TV Ad Archive

Discover Books Donates Large Numbers of Books

Internet Archive is proud to partner with Discover Books, a major used book seller, to help let the stories in books live on. Discover Books is donating books that the Internet Archive does not yet own and that would otherwise have gone to a landfill. Through this process the Internet Archive has more books to digitize and preserve.

Together we are giving books the longest life possible both in print and online.

Thank you to discoverbooks.com.

Posted in Announcements, Books Archive, News | 3 Comments

Reflections on From Clay to the Cloud: The Internet Archive and Our Digital Legacy, a.k.a. The Internet Archive – The Exhibition!


Photograph by Jason Scott

By Carolyn Peter

It started with a visit to Nuala Creed’s ceramics studio in Petaluma in the spring of 2014. My interest was piqued as she described “a commission of sculptures for the Internet Archive” that was ever-growing. She heavily encouraged me to stop by the Archive to experience the famous Friday lunch and to see her work. I’m so glad she did.

While enjoying a tasty lunch of sausage and salad, I listened to Archive staff members talk about their week and curious visitors who shared their inspiration for coming to the Archive for a meal and a tour. I did not understand all the technical vocabulary, but I was struck by all the individuals who were working together on a project which, prior to this day, I had only experienced as a website on my computer screen.

Of course, I fell in love with Nuala’s sculptures as soon as we stepped inside the Great Room, where 100+ colorful figures stood facing the stage as if waiting for a performance or lecture to begin. The odd objects in their hands, their personally fashioned clothing, and their quirky expressions reinforced the idea that the Internet Archive was the shared, creative effort of a huge number of individuals. The technical was becoming human to me. By the time Brewster had brought his visitors back down into the common workspace, my mind was racing with ideas and questions.

As a museum professional who has spent her career making choices about what works of art to acquire and preserve for future generations and as someone who takes great joy in handling and caring for objects, I wondered what threads ran from this digital archive through to more traditional archives and libraries. If I had sleepless nights wondering how to best protect a work of art for posterity, how was the Internet Archive going to ensure that its vast data was going to survive for millennia to come?

Before I knew what I was doing, I heard myself telling Brewster that I would love to do an exhibition about the Internet Archive. I don’t think he or I fully registered what I was saying. That would take more time.

This curatorial challenge brewed in my mind. The more I thought about it, the more I thought an exploration of the past, present and future of archives and libraries and the basic human desire to preserve knowledge for future generations would be a perfect topic for an exhibition in my university art gallery. I knew Nuala’s series could serve as the core artistic and humanizing element for such a show, but I wondered how I would be able to convey these ideas and questions in an accessible and interesting way, how to make this invisible digital world visible? And turning the tables—if Brewster had brought art into the world of technology with his commission of the Internet Archivists series, how could I bring technology into the artistic realm?

When I approached Brewster for a second time about a year later with a proposal, he thought I was crazy. He has said it was as if I had told him I wanted to do “The Internet Archive, The Musical.” A few conversations and months later, he agreed to let me run with the idea.

I have to admit, at times, I too wondered if I was crazy. I wrestled with devising ways to visibly convey the Archive’s unfathomable vastness while also trying to spotlight the diverse aspects of the Archive through hands-on displays.


Photograph by Jason Scott.

While some big dreams had to be let go, I was able to achieve most of the goals I set out for the exhibition. Transporting thirty-two of Nuala’s fragile sculptures to Los Angeles required two days of careful packing and a fine art shipping truck committed solely to this special load. Along with film editor Chris Jones and cameraman Scott Oller, I also created a film that documents the story of the Internet Archivists sculpture series through interviews with Nuala, Brewster and a number of the archivists who have had their sculptural portraits made.

When visitors entered the gallery, they were greeted by three of the Internet Archivist figures and a full-scale shipping container (a trompe-l’oeil work of art by Makayla Blanchard) that conveyed Brewster’s often-repeated claim that he had fit the entire World Wide Web inside a shipping container. The exhibition was filled with juxtapositions of the old and new. To the right of the three archivists was a case filled with a dozen ancient clay cuneiform tablets and pieces of Egyptian papyrus, introducing very early forms of archiving. A china hutch displayed out-of-fashion media formats that the Internet Archive has been converting into digital form, such as record albums, cassette tapes, slides, and VHS tapes. I partnered with LMU’s librarians to bring the mystery of archiving out into the light. Using one of the Archive’s Tabletop Scribes, the librarians scanned and digitized numerous rare books from their collection. The exhibition also included displays and computer monitors so visitors could explore the Wayback Machine, listen to music from the archive’s collections, play vintage video games and test out the Oculus Rift.


Photograph by Brian Forrest.

In the end, I think the exhibition asked a lot more questions than it answered. Nevertheless, I hope this first exhibition will spark others to think of ways to make the abstract ideas and invisible aspects of digital archives more tangible. Who knows, maybe, a musical is in the Internet Archive’s future.

I was sad to pack up the clay archivists and say goodbye to their smiling faces. I’m sure they are happy to be back with the rest of their friends in the Great Room on Funston Avenue, but oh, the stories they have to tell of their travels to a gallery in Los Angeles.

Carolyn Peter is the director and curator of the Laband Art Gallery at Loyola Marymount University. She curated From Clay to the Cloud: The Internet Archive and Our Digital Legacy, which was on view from January 23-March 20, 2016 at the Laband.

Posted in Announcements, Event | Comments Off on Reflections on From Clay to the Cloud: The Internet Archive and Our Digital Legacy, a.k.a. The Internet Archive – The Exhibition!

Google Library Project Legal: Let the Robots Read!

Guardian of Law by James Earle Fraser, US Supreme Court

The decade-long legal battle over Google’s massive book scanning project is finally over, and it’s a huge win for libraries and fair use. On Monday, the Supreme Court declined to hear an appeal by the Authors Guild, which had argued that Google’s scanning of millions of books was an infringement of copyright on a grand scale. The Supreme Court’s decision means that the Second Circuit ruling, which held that Google’s creation of a database of millions of digital books is fair use, still stands. The appeals court explained how its fair use rationale aligns with the very purpose of copyright law: “[W]hile authors are undoubtedly important intended beneficiaries of copyright, the ultimate, primary intended beneficiary is the public, whose access to knowledge copyright seeks to advance by providing rewards for authorship.”

Google Books gives readers and internet users the world over access to millions of works that had previously been hidden away in the archives of our most elite universities. As a Google representative said in a statement, “The product acts like a card catalog for the digital age by giving people a new way to find and buy books while at the same time advancing the interests of authors.”

Google began scanning books in partnership with a group of university libraries in 2004. In 2005, author and publisher groups filed a class action lawsuit to put a stop to the project. The parties agreed to settle the lawsuit in a manner that would have forever changed the legal landscape around book rights. The District Court judge rejected the settlement in 2011, based on concerns about competition, access, and fairness, and so litigation over the core question of fair use resumed.

Judge Chin, Judge Leval, and the Supreme Court all made the right decisions along the long and winding path to Google’s victory. Libraries around the country are now free to rely on fair use as they determine how to manage their own digitization projects–encouraging innovation and increasing our access to human knowledge.

Posted in Announcements, Books Archive, News | Comments Off on Google Library Project Legal: Let the Robots Read!

Truck and Back Again: The Internet Archive Truck Takes a Detour

When one of our employees came out of his home over the weekend, he saw an empty parking space. Granted, in San Francisco, that’s a pretty precious thing, but since this empty parking space had held the Internet Archive Truck for the previous two days, he was not feeling particularly lucky.

A staff conversation then ensued, the city was called to see if the truck had been towed, and after a short time, it became obvious that no, somebody had stolen the Truck.

This in itself is not news: thousands of vehicles are stolen in the Bay Area every year. What made this unusual was the nature of the vehicle stolen: the Truck is a pretty unique-looking vehicle.


Once the report was filed with the police and a few more checks were made to ensure that the truck was absolutely, positively missing and presumed stolen, the theft was announced on Twitter, which garnered tens of thousands of views and spread the news far and wide. Thanks to everyone who got the word out.

What was not expected, besides the initial theft, was that a lot of people wondered why the Internet Archive, essentially a website, would have a truck. So, here’s a little bit about why.

Besides providing older websites, books, movies, music, software and other materials to millions of visitors a day, the Internet Archive also has buildings for physical storage located in Richmond, just outside San Francisco. In these buildings, we hold copies of books we’ve scanned, audio recordings, software boxes, films, and a variety of other materials that we are either turning digital or holding for the future. It turns out you can’t be a 100% online experience – physical life just gets in the way. We also have multiple data centers and the need to transport equipment between them.

Therefore, we’ve had a hard-working vehicle for getting these materials around: a 2003 GMC Savana Cutaway G3500, often parked out front of the Archive’s 300 Funston Avenue address and making up to several trips a week between our various locations.

In a touch of whimsy, the truck has had a unique paint job for most of its life with the Archive. Notably, this isn’t even the first mural it had on its sides; here is a shot with the previous mural:


We’re not sure of the motivation in stealing this rather unique and noticeable vehicle, and there seems to be some evidence it was driven around the city for a while after it was taken. But yesterday, we were contacted by the San Francisco Police Department with really great news:

The Truck has been recovered!

Left abandoned by the side of the road, the truck was found and is about to be returned to the Archive. With luck it will soon be back in service, helping us prepare and transport materials related to our mission: to bring the world’s knowledge to everyone.

Again, thanks to everyone who put out the original call for the truck’s return, and to the SFPD for recovering the truck so quickly after it was gone.

Posted in Announcements, Cool items | Comments Off on Truck and Back Again: The Internet Archive Truck Takes a Detour

Join us for “How Digital Memory is Shaping our Future” with Abby Smith Rumsey– April 26

Abby Smith Rumsey. Photo by Cindi de Channes.

What is the future of human memory? What will people know about us when we are gone?

Abby Smith Rumsey, historian and author, has explored these important questions and more in her new book When We Are No More: How Digital Memory is Shaping Our Future.

On the evening of Tuesday, April 26 at 7 p.m., the Internet Archive hosts Abby Smith Rumsey as she takes us on a journey of human memory from prehistoric times to the present, highlighting the turning points in technology that have allowed us to understand more about the history of the world around us.

Each step along the way – from paintings on cave walls to cuneiform on clay tablets, from the Gutenberg printing press to the recent technological advances of digital storage – shows how humans have adapted to the increasing need for new methods to share knowledge with a widening community. In addition to these milestones of human communication, the development of machinery in the industrial age helped unlock the geological record of the physical world around us, changing how our societies think about time and change to the natural environment on a grand scale.

Examining the past helps us understand where the future might lead us. Yet with our current methods of digital storage, what will still be accessible, and what steps can we take to make sure knowledge persists? Out of the vast amounts of data that we are capable of saving, what will be considered important? Only time will tell, and it will be when “we are no more.” The Internet Archive, under the leadership of Brewster Kahle, is one organization playing an important role in bringing our civilization’s record of knowledge into the future. Smith Rumsey will share her insights into how we can leave a legacy for those in the future to best understand our lives, our struggles, our passions – our very humanity.

We hope you’ll join us for an enlightening evening with this thought-provoking author, historian and librarian.

Event Info:
How Digital Memory is Shaping Our Future:  A Conversation with Abby Smith Rumsey
Tuesday, April 26, 2016
Internet Archive, 300 Funston Avenue, San Francisco, CA 94118

Doors open at 6:30 PM; talk begins at 7:00 PM
Reception and book signing to follow presentation

This event is free and open to the public.  Please RSVP to our Eventbrite at:
http://www.eventbrite.com/e/abby-smith-rumsey-how-digital-memory-is-shaping-our-future-tickets-22473471759

For more information about Abby Smith Rumsey and her book, please visit her website at www.rumseywrites.com.

Posted in News | 3 Comments

Upcoming changes in epub generation

Epub is a format for ebooks used on book reader devices. It is often mostly text, but can incorporate images. The Internet Archive offers epubs in two cases: when a user uploads them, and when they are created from other formats, such as scanned books or uploaded PDFs made up of page images.

The Internet Archive creates them from images of pages using “optical character recognition” (OCR) technology, then reformats the result into the epub format (currently epub v2). These files are sometimes created “on-the-fly” and sometimes created as files and stored in our item directories. All “on-the-fly” epubs use the newest code, whereas stored ones use the code available at the time of generation.

Because of a change in the output format of our OCR engine last August, many of the epubs generated between then and last week have been faulty. Newly generated epubs are now fixed, and we will soon go back and fix the faulty ones that were stored. We have also discovered that some of the older epubs are faulty, and it is difficult to know which ones.

To fix this we are shifting to “on-the-fly” generation for all epubs, so that every epub gets the newest code. This is how we already generate daisy, mobi, and many zip files. To access the epub for a book we have scanned, the URL is https://archive.org/download/ID/ID.epub, for instance https://archive.org/download/recordofpennsylv00linn/recordofpennsylv00linn.epub.

More generally, an epub can be generated for any item whose ocr field in meta.xml does not say “language not currently OCRable” and which contains an Abbyy-format OCR file. For instance, in an item’s file list, the presence of an abbyy file downloadable at http://archive.org/download/file_abbyy.gz means a corresponding epub can be downloaded at http://archive.org/download/file.epub.
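As a practical illustration (not official documentation), here is a sketch that applies this rule using the Archive’s public metadata endpoint at https://archive.org/metadata/<identifier>:

    # epub_check.py - a sketch: decide whether an item's epub can be
    # generated, per the rule described above.
    import requests

    def epub_url(identifier):
        meta = requests.get("https://archive.org/metadata/" + identifier).json()
        ocr = meta.get("metadata", {}).get("ocr", "")
        if "language not currently OCRable" in ocr:
            return None
        # Look for an Abbyy OCR file among the item's files.
        for f in meta.get("files", []):
            if f["name"].endswith("_abbyy.gz"):
                base = f["name"][: -len("_abbyy.gz")]
                return "https://archive.org/download/%s/%s.epub" % (identifier, base)
        return None

    # print(epub_url("recordofpennsylv00linn"))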

Posted in News | Comments Off on Upcoming changes in epub generation

New video shows rich resources available at Political TV Ad Archive

Since our launch on January 22, the Political TV Ad Archive has archived more than 1,080 ads with more than 155,000 airings. We’ve trained hundreds of journalists, students, and other interested members of the public with face-to-face trainings. But much as we would like to, we can’t talk to each of you individually. That’s why we created this video.

Watch the video for an overview of the project, the wealth of information it provides, and how fact checkers and journalists have been using it to enrich their reporting. It is a great introduction for educators to use with students, for civic groups to engage their membership in the political process, and for reporters who want to get the basics on how to use the site.

And remember: we want to hear from you about how you are using the Political TV Ad Archive. Please drop us an email at politicalad@archive.org or tweet us @PolitAdArchive. Over the week ahead, we’ll be highlighting examples of how educators have used the project in their classrooms. We’d love to feature examples of how other members of the public are using this collection to enhance deeper understanding of the 2016 elections.

Going forward, we are tracking ads in the New York City, Philadelphia, San Francisco, and Washington, DC markets. These markets will provide a window on political ads appearing in several upcoming primary states: California, Maryland, New Jersey, New York, and Pennsylvania. 

Enjoy!

Posted in Announcements, News | Comments Off on New video shows rich resources available at Political TV Ad Archive

Getting back to “View Source” on the Web: the Movable Web / Decentralized Web

Web 1.0 moved so fast partly because you could “View Source” on a webpage you liked and then modify and re-use it to make your own webpages. This even worked with pages containing JavaScript programs: you could see how they worked, then modify and re-use them. The Web jumped forward.

Then came Web 2.0, where the big thing was interaction with “APIs,” or application programming interfaces. This meant that the guts of a website lived on the server: you only got to ask approved questions and get approved answers, or have a webpage specially formatted for you with your answer on it. The plus side was that websites became more dynamic, but learning from how others did things became harder.

Power to the People went to Power to the Server.

Can we get both? I believe we can, with a new Web built on top of the existing Web. A “decentralized web” or “movable web” has many privacy and archivability features, but another feature could be knowledge reuse. In this model, the set of files that make up a website—text/HTML, programs, and data—is available to the user if they want to see it.

The decentralized Web works through p2p distribution of the files that make up a website; the website then runs in your browser. Being completely portable, the website carries all the pieces it needs: text, programs, and data. It can all be versioned, archived, and examined.
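The mechanism underneath is content addressing: every file is named by a hash of its contents, so it can be fetched from any peer, or through a public HTTP gateway. A minimal sketch (the content identifier below is a placeholder, not a real published site):

    # fetch_site.py - a sketch of fetching a "movable" website's files
    # by content hash through a public IPFS HTTP gateway.
    import requests

    GATEWAY = "https://ipfs.io/ipfs/"
    SITE_CID = "Qm..."  # hypothetical content identifier of the site root

    def fetch(path):
        """Fetch one file of the site (a page, a script, or a search index)."""
        r = requests.get(GATEWAY + SITE_CID + "/" + path)
        r.raise_for_status()
        return r.content

    # index_html = fetch("index.html")
    # search_index = fetch("search-index.json")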

[Upcoming Summit on the Decentralized Web at the Internet Archive June 8th, 2016]

For instance, this demo has the pages of a blog in a peer-to-peer file system called IPFS, along with the site’s search engine, written in JavaScript, which runs locally in the browser. The browser downloads the pages, the JavaScript, and the search-engine index from many places on the net and then displays the site. The complete website, including its search engine and index, is therefore downloadable and inspectable.

This new Web could be a way to distribute datasets, because the data would move together with programs that can make use of it, helping document the dataset. This use of the decentralized Web became clear to me by talking with Karissa McKelvey and Max Ogden of the Dat Data project, who are working on distributing scientific datasets.

What if scientific papers evolved to become movable websites (or call them “distributed websites” or “decentralized websites”)?  That way, the text of the paper, the code, and the data would all move around together documenting itself.  It could be archived, shared, and examined.

Now that would be “View Source” we could all live with and learn from.

Posted in News | Comments Off on Getting back to “View Source” on the Web: the Movable Web / Decentralized Web

The Internet Archive, ALA, and SAA Brief Filed in TV News Fair Use Case

The Internet Archive, joined by the American Library Association, the Association of College and Research Libraries, the Association of Research Libraries, and the Society of American Archivists filed an amicus brief in Fox v. TVEyes on March 23, 2016. In the brief, the Internet Archive and its partners urge the court to issue a decision that will support rather than hinder the development of comprehensive archives of television broadcasts.

The case involves a copyright dispute between Fox News and TVEyes, a service that records all content broadcast by more than 1,400 television and radio stations and transforms the content into a searchable database for its subscribers. Fox News sued TVEyes in 2013, alleging that the service violates its copyright. TVEyes asserted that its use of Fox News content is protected by fair use.

Drawing on the Internet Archive’s experience with its TV News Archive and Political TV Ad Archive, the friend-of-the-court brief highlights the public benefits that flow from archiving and making television content available for public access. “The TV News Archive allows the public to view previously aired broadcasts–as they actually went out over the air–to evaluate and understand statements made by public officials, members of the news media, advertising sponsors, and others, encouraging public discourse and political accountability,” said Roger Macdonald, Director of the TV Archive.

Moreover, creating digital databases of television content allows aggregated information about the broadcasts themselves to come to light, unlocking researchers’ ability to process, mine, and analyze media content as data. “Like library collections of books and newspapers, television archives like the TV News Archive and the Political TV Ad Archive allow anyone to thoughtfully assess content from these influential media, enhancing the work of journalists, scholars, teachers, librarians, civic organizations, and other engaged citizens,” said Tomasz Barczyk, a Berkeley Law student from the Samuelson Law, Technology & Public Policy Clinic who helped author the brief.

The brief also explains the importance of fostering a robust community of archiving organizations. Because television broadcasts are ephemeral, content is easily lost if efforts are not made to preserve it systematically.  In fact, a number of historically and culturally significant broadcasts have already been lost, from BBC news coverage of 9/11 to early episodes of Doctor Who. Archiving services prevent this disappearance by collecting, indexing, and preserving broadcast content for future public access.

A decision in this case against fair use would chill these services and could result in the loss of significant cultural resources. “This is an important case for the future of digital archives,” explained William Binkley, the other student attorney who worked on the brief. “If the court rules against TVEyes, there’s a real risk it could discourage efforts by non-profits to create searchable databases of television clips. That would deprive researchers and the general public of a tremendously valuable source of knowledge.”

The Internet Archive would like to thank Tomasz Barczyk, William Binkley, and Brianna Schofield from the Samuelson Law, Technology & Public Policy Clinic at Berkeley Law for helping to introduce an important library perspective as the Second Circuit court considers this case with important cultural implications.

Posted in Announcements, News | Comments Off on The Internet Archive, ALA, and SAA Brief Filed in TV News Fair Use Case

Three takeaways after logging 1,032 political ads in the primaries

The Political TV Ad Archive launched on January 22, 2016, with the goal of archiving airings of political ads across 20 local broadcast markets in nine key primary states and embedding fact checks and source checks of those ads by our journalism partners. We’re now wrapping up this first phase of the project, and are preparing for the second, where we’ll fundraise so we can apply the same approach to political ads in key 2016 general election battleground states.

But first: here are some takeaways from our collection after logging 1,032 ads. Of those ads, we captured 263 airing at least 100 times apiece, for a total of more than 145,000 airings.

1. Only a small number of ads earned “Pants on Fire!” or “Four Pinocchio” fact checking ratings. Just four ads received the worst ratings possible from our fact-checking partners.

Donald Trump’s campaign won the only “Pants on Fire” rating awarded by fact checking partner PolitiFact for a campaign ad: “Trump’s television ad purports to show Mexicans swarming over ‘our southern border.’ However, the footage used to support this point actually shows African migrants streaming over a border fence between Morocco and the Spanish enclave of Melilla, more than 5,000 miles away,” wrote PolitiFact reporters C. Eugene Emery Jr. and Louis Jacobson in early January, when Trump released the ad, his very first paid ad of the campaign. The ad aired more than 1,800 times, most heavily in the early primary states of Iowa and New Hampshire.

Trump also won a “Four Pinocchio” rating from the Washington Post’s Fact Checker for this ad, which accuses John Kasich of helping “Wall Street predator Lehman Brothers destroy the world economy.” “[I]t’s preposterous and simply not credible to say Kasich, as one managing director out of 700, in a firm of 25,000, ‘helped’ the firm ‘destroy the world economy,’” wrote reporter Michelle Ye Hee Lee.

Two other ads received the “four Pinocchio” rating from the Washington Post’s Fact Checker. This one, from Ted Cruz’s campaign, claims that Marco Rubio supported an immigration plan that would have given President Obama the authority to admit Syrian refugees, including ISIS terrorists. “[T]his statement is simply bizarre,” wrote Glenn Kessler. “With or without the Senate immigration bill, Obama had the authority to admit refugees, from any country, under the Refugee Act of 1980, as long as they are refugees and are admissible….What does ISIS have to do with it? Nothing. Terrorists are not admissible under the laws of the United States.”

This one, from Conservative Solutions PAC, the super PAC supporting Rubio, claims that there was only one Republican hopeful who had “actually done something” to dismantle the Affordable Care Act, by inserting a provision that denied insurance companies protection from losses if they didn’t accurately estimate premiums in the first three years of the law. “Rubio goes way too far in claiming credit here,” wrote Kessler. “He raised initial concerns about the risk-corridor provision, but the winning legislative strategy was executed by other lawmakers.”

Overall, our fact-checking and journalism partners—the Center for Responsive Politics, the Center for Public Integrity, FactCheck.org, PolitiFact, and the Washington Post’s Fact Checker—wrote 57 fact- and source-checks of 50 ads sponsored by presidential campaigns and outside groups. (The American Press Institute and the Duke Reporters’ Lab, also partners, provided training and tools for journalists fact-checking ads.)

Of the 25 fact checks done by PolitiFact, 60 percent of the ads earned “Half True,” “Mostly True,” and “True” ratings, with the remainder earning “Mostly False,” “False,” and “Pants on Fire” ratings. The Washington Post’s Fact Checker, the other fact-checking group that uses ratings, fact-checked 11 ads. Of these, seven earned ratings of three or four Pinocchios. A series of ads featuring former employees and students denouncing Trump University, from a “dark money” group that doesn’t disclose its donors, earned the coveted “Geppetto Checkmark” for accuracy. Those ads aired widely in Florida and Ohio leading up to the primaries there.

The ad that produced the most fact checks and source checks was this one from the very same group, the American Future Fund: an attack ad on John Kasich. Robert Farley of FactCheck.org wrote, “An ad from a conservative group attacks Ohio Gov. John Kasich as an ‘Obama Republican,’ and misleadingly claims his budget ‘raised taxes by billions, hitting businesses hard and the middle class even harder.’” PolitiFact Ohio reporter Nadia Pflaum gave the ad a “False” rating; Michelle Ye Hee Lee of the Washington Post’s Fact Checker awarded it “Three Pinocchios.” The Center for Public Integrity described the American Future Fund as “a conservative nonprofit linked to the billionaire brothers Charles and David Koch that since 2010 has inundated federal and state races with tens of millions of dollars.”

This ad from Donald Trump’s campaign earned a “Pants on Fire” rating from PolitiFact.

2. Super Campaign Dodger, and other creative ways to experience and analyze political ads. Journalists did some serious digging into the downloadable metadata the Political TV Ad Archive provides here to analyze trends in presidential ad campaigns.

The Economist mashed up data about airings in Iowa and New Hampshire with polling data and asked: does political advertising work? The answer—“a bit of MEH” (the “minimal-effects hypothesis”)—in other words, voters are persuaded, but only a little bit.

Farai Chideya of FiveThirtyEight and Kate Stohr of Fusion delved into data on anti-Trump ads airing ahead of the Florida primary—which Trump went on to win handily, despite the onslaught.

Nick Niedzwiadek plumbed the collection when writing about political ad gaffes for The Wall Street Journal. Nadja Popovich of The Guardian graphed Bernie Sanders’s surge in ad airings in Nevada, ahead of the contest there.

William La Jeunesse of Fox News reported on negative ads here. Philip Bump of The Washington Post used gifs to illustrate just how painful it was to be a TV-watching voter in South Carolina in the lead up to the primary there.

And in what was the most interactive use of the project’s metadata, Andrew McGill, a senior associate editor for The Atlantic, created an old-style video game, where the viewer uses the space key on a computer keyboard to try to dodge all the ads that aired on Iowa airwaves ahead of the caucuses there. For links to other journalists’ uses of the Political TV Ad Archive, click here.
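If you want to dig into the numbers yourself, the downloadable airings metadata is a flat table that loads straight into standard analysis tools. A sketch with pandas; the file and column names here (“sponsor”, “market”) are illustrative assumptions, so check the header of the actual download:

    # airings_by_sponsor.py - a sketch: top sponsor/market pairs by airings.
    import pandas as pd

    airings = pd.read_csv("political-ad-airings.csv")
    top = (airings.groupby(["sponsor", "market"])
                  .size()
                  .sort_values(ascending=False)
                  .head(20))
    print(top)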

3. Candidates’ campaigns dominated; super PACs favored candidates who failed. In our collection, candidates’ official campaigns sponsored the most ad airings—63 percent. Super PACs accounted for another 27 percent, and nonprofit groups, often called “dark money” groups because they do not disclose their donors, accounted for nine percent of ad airings.

Bernie Sanders’ and Hillary Clinton’s campaigns had the most ad airings—29,347 and 26,891 respectively. Of the GOP candidates, who faced a more divided competition, Marco Rubio’s campaign had the most airings—11,798—and Donald Trump was second, with 9,590. However, in the Republican field, super PACs played a much bigger role, particularly those advocating for candidates who have since pulled out of the race. Conservative Solutions PAC, the super PAC that supported Marco Rubio, had 12,851 airings; Right to Rise, which supported Jeb Bush, had 12,543.

This pair of issue ads sponsored by AARP (the American Association of Retired Persons) aired at least 9,653 times; the ads focus on Social Security and have been broadcast across the markets monitored by the Political TV Ad Archive.

The non-news shows that featured the most political ads were “Jeopardy!,” “Live With Kelly and Michael,” and “Wheel of Fortune.” Fusion did an analysis of the entertainment shows most heavily targeted by presidential candidates, mashing the data up with Nielsen viewership numbers. For example, Bernie Sanders’ campaign favored “Jimmy Kimmel Live,” while Hillary Clinton’s campaign liked “The Ellen DeGeneres Show.”

 


The Political TV Ad Archive–which is a project of the Internet Archive’s TV News Archive–is now conducting a thorough review of this project, which was funded by a grant from the Knight News Challenge, an initiative of the John S. and James L. Knight Foundation. The Challenge is a joint effort of the Rita Allen Foundation, the Democracy Fund, and the Hewlett Foundation.

Stay tuned for news of the Political TV Ad Archive’s plans for covering future primaries in California, New York, and Pennsylvania, and for the second phase of this project: fundraising to track ads in key battleground states in the general elections.

This post is cross posted at the Political TV Ad Archive.

Posted in Announcements, News | 1 Comment

Save our Safe Harbor: Submission to Copyright Office on the DMCA Safe Harbor for User Contributions

The United States Copyright Office is seeking feedback on how the “notice and takedown” system created by the Digital Millennium Copyright Act, also known as the “DMCA Safe Harbors,” is working. Congress decided that in this country, users of the Internet should be allowed to share their ideas with the world via Internet platforms. In order to facilitate this broad goal, Congress established a system that protects platforms from liability for the copyright infringement of their users, as long as the platforms remove material when a copyright holder complains. The DMCA also allows users to challenge improper takedowns.

We filed comments this week, explaining that the DMCA is generally working as Congress intended it to. These provisions allow platforms like the Internet Archive to provide services such as hosting and making available user-generated content without the risk of getting embroiled in lawsuit after lawsuit. We also offered some thoughts on ways the DMCA could work better for nonprofits and libraries, for example, by deterring copyright holders from using the notice and takedown process to silence legitimate commentary or criticism.

The DMCA Safe Harbors, while imperfect, have been essential to the growth of the Internet as an engine for innovation and free expression. We are happy to provide our perspective on this important issue to the Copyright Office.

Posted in Announcements, News | Comments Off on Save our Safe Harbor: Submission to Copyright Office on the DMCA Safe Harbor for User Contributions

Guess what we find in books? A look inside our Midwest Regional Digitization Center – by Jeff Sharpe

The history of a book isn’t captured merely by the background of the author or its publishing date or its written content. Most books were purchased and read by someone; they are from a specific time and place. That too is part of each book’s history. Sometimes in digitizing books we find pressed flowers, or a single leaf, or pieces of paper that were used as bookmarks and then forgotten. We even found a desiccated chameleon in one book. When we find something like that at the Internet Archive’s digitization centers, we digitize the object, because it is part of the history of that book. We see our mission as archiving each book exactly as it was found, so that when you flip through it, you are seeing it as if you had the physical copy in your hands, not just black text on a white page.

Take for example this book from the Lincoln Financial Foundation Collection:  The Life and Speeches of Henry Clay. In the chapter on Clay’s speeches, you can see what Abraham Lincoln highlighted, points he thought worthy of noting.

Lincoln’s notations in The Life and Speeches of Henry Clay

In fact, by seeing what Lincoln underscored as he read this book and by reading his notes, you get a glimpse into what may have shaped his ideas, and how he might have used certain concepts to express his thoughts and policies about slavery and its abolition. The history of this book, which was held and read and annotated by Abraham Lincoln, had a direct effect on the history of this nation. A historic book that also has a history of its own.

We’ve digitized over 125,000 items here at the Midwest Regional Digitization Center at the Allen County Public Library in Fort Wayne, Indiana. In several books we digitized for the University of Pittsburgh’s Darlington Collection, we found some treasures. In one, we found a note by William Henry Harrison, then governor of the Indiana Territory, from 1803. (Scroll down the pages to see the letters in situ.)

In another we found a promissory note by Aaron Burr from 1796 for a large sum of money. Burr was a controversial person to say the least. He was not only a Revolutionary War hero, Thomas Jefferson’s Vice President and a presidential candidate himself, but also the man who shot and killed Alexander Hamilton in a duel.

Once, someone at the University of Pittsburgh contacted me regarding an item a digital reader had made them aware of: a previously unknown, original survey report written by none other than Daniel Boone! He asked me if I knew anything about it. I verified that we had found and digitized it, along with the note by Aaron Burr and the letter by William Henry Harrison. I got a shocked reply: “Where??” Apparently digitizing not only opened up access to these books, it also rediscovered long-lost manuscripts stuck between the pages, penned by important figures in American history.

The history of these books turned out to contain the history of this country, highlighted in a very personal way. Whether it is someone pressing a violet between the pages, Abe Lincoln researching abolition, or a forgotten survey report by Daniel Boone, sometimes the material we digitize can bring our past alive. What will you discover lodged between the pages of our three million digital books?

Take a tour of the Midwest Regional Digitization Center with Jeff Sharpe in this recent video.


Jeff Sharpe is Senior Digitization Manager for the Midwest Region.

Jeff’s work experience in administration and research led him to the Internet Archive’s digitization center in the Allen County Public Library in Fort Wayne, Indiana. He’s proud of his role in helping to bring well over a hundred thousand books online for universal access, including more than fifteen thousand items digitized by volunteers at the Midwest Center. Jeff is a voracious reader and loves books. He has a passion for history and archaeology, particularly the Maya civilization, which has led him to travel extensively to Mayan ruins. He enjoys, among other things, bicycle riding, gardening, and hanging out with his wife, two kids, and their two dogs.


CASH BOX Music Magazine to Come Online

The Swem Library at the College of William & Mary in Virginia has received a grant from the Council on Library and Information Resources (CLIR) to digitize its entire run of Cash Box, a music trade magazine published from 1942 to 1996. Swem Library is partnering with the Internet Archive to scan all 190,000 pages of the 163-volume collection and create an online portal for reading and downloading the digital images.

“We are overjoyed to be able to unleash decades of music industry information to the public,” said Dean of University Libraries Carrie Cooper. “Swem Library has been gearing up for a greater emphasis on the digitization of unique and rare collections that are of interest to the public and scholars. We are grateful to have partners like CLIR to support our efforts to expose the hidden treasures of our library.”

The grant is part of CLIR’s Digitizing Hidden Special Collections and Archives awards program, a national competition that funds the digitization of rare and unique content held by libraries and institutions that would otherwise be unavailable to the public. The program is funded by the Andrew W. Mellon Foundation.

An alternative to Billboard Magazine, Cash Box included regional chart data; hit songs by city, radio station, and record sales; popularity by jukebox; and charts by genre including country and R&B. It also featured stories on artists, news of tours, insider gossip, album summaries and photographs found nowhere else. Later issues included sections relating to the music industry in Canada, Europe, Japan and Mexico.

“We are very excited to make this important and internationally significant resource for the study of music history and popular culture more widely accessible,” said Jay Gaidmore, director of the library’s Special Collections Research Center. “Since acquiring these issues in 2010, we have received more requests for copies and information from Cash Box than from any other individual collection held in Special Collections.”

Filling requests for copies of Cash Box materials has been difficult, Gaidmore said, due to the library’s lack of resources. Researchers who need immediate access to the collection typically must travel to Williamsburg. Making the collection available online will put this resource into the hands of researchers across the globe.

Philip Gentry, assistant professor of music history at the University of Delaware, is one of those researchers. As a scholar and teacher of American music in the post-war era, Gentry believes Cash Box provides a crucial alternative to Billboard, which primarily focused on mainstream music.

“[Cash Box’s] formula relied more heavily upon jukebox ‘plays,’ and thus [is] often a much more reliable window into trends of more subcultural markets such as African American-dominated rhythm and blues or white working-class country,” he said.

Gentry is currently working on a project documenting anti-communist blacklisting in popular music during the McCarthy era. He has found very little discussion on the topic in Billboard, but has seen hints that it was more openly discussed in Cash Box.

Not only is Gentry excited to see Cash Box digitized for his own scholarship, he sees an impact on his teaching as well.

“Digitization makes possible a whole world of classroom assignments,” he said. “Unlike with older primary sources, very few institutions have undertaken the commitment to properly archive and make accessible collections of the recent past. And yet, teaching research skills and the tools of critical reading is no less important for students engaging with popular culture of the American twentieth century.”

The project will begin in February and is expected to be completed by December 2016. The collection will be made freely and publicly available through the Swem Library website and here at the Internet Archive.

This article was republished by permission of our partners at the Swem Library. It first appeared in January 2016.


Saving 500 Apple II Programs from Oblivion

Among the tens of thousands of computer programs now emulated in the browser at the Internet Archive, a long-growing special collection has hit a milestone: the 4am Collection now offers more than 500 Apple II programs preserved for the first time.

[Playable screenshot]

To understand this achievement, it’s best to explain what 4am (an anonymous person or persons) has described as their motivation: to track down Apple II programs, especially ones that have never been duplicated or widely distributed, and remove the copy protection that prevents them from being digitized. After this, the now-playable floppy disk is uploaded to the Internet Archive along with extensive documentation about what was done to the original program to make it bootable. Finally, the Internet Archive’s play-in-a-browser emulator, called JSMESS (a JavaScript port of the MAME/MESS emulator), allows users to click on the screenshot and begin experiencing the Apple II programs immediately, without requiring installation of emulators or the original software.

In fact, all the screenshots in this entry link to playable programs!

[Playable screenshot]

If you’re not familiar with the Apple II software library as it has existed over the past few decades, a very common situation is that, for even the most groundbreaking and famous programs produced for this early home computer, only the “cracked” versions persist. Off the shelf, the programs would include copy protection routines that went so far as to modify the performance of the floppy drive, or force the Apple II’s operating system to rewrite itself to behave in strange ways.

Because hackers (in the “hyper-talented computer programmers” sense) would take the time to walk through the acquired floppy disks and remove copy protection, those programs are still available to use and transfer, play and learn from.

One side effect, however, was that these hackers, young or proud of the work they’d done, would modify the graphics of the programs to announce the effort they’d put behind it, or cleave away particularly troublesome or thorny routines that they couldn’t easily decode, meaning that modern access to these programs was through incomplete or modified versions. For examples of the many ways these “crack screens” might appear, I created an extensive gallery of them a number of years ago. (Note that there are both monochrome and color versions of the same screen, and these are just screen captures, not playable versions.) These hackers also focused almost exclusively on games, especially arcade games, meaning that any program which didn’t fall into the “arcade entertainment” section of the spectrum of Apple II software was left by the wayside entirely.

With an agnostic approach to the disks being preserved, 4am has brought to light many programs that fall almost into the realm of lore and legend, only existing as advertisements in old computer magazines or in catalog listings of computer stores long past.

[Playable screenshot]

It gets better.

Easily missed if you’re not looking for them are the brilliant and humorous write-ups done by 4am to explain, completely, the process of removing the copy protection routines. The techniques used by software companies to prevent an Apple II floppy drive from making a duplicate while still allowing the program to boot itself were extensive, challenging, and intense. Some examples of these write-ups include this one for “Cause and Effect”, a 1988 education program, as well as this excellent one for “The Quarter Mile”, another educational program. (To find the write-up for a given 4am item in the collection, click on the “TEXT” link on the right side of the item’s web page.)

These extensive write-ups shine a light on one of the core situations about these restored computer programs.

As 4am has wryly said over the years, “Copy Protection Works!” – if the copy protection of a floppy disk-based Apple II program was strong and the program did not attract the attention of obsessed fans or fall into the hands of collectors, its disappearance and loss were almost guaranteed. Because many educational and productivity programs were specialized and not as intensely pursued as “games” in all their forms, those less-popular genres suffer from huge gaps in recovered history. Sold in small numbers, these floppy disks are subject to bit rot, neglect, and being tossed out with the inevitable turning of the wheels of time.

This collection upends that situation: by focusing on acquiring as many different unduplicated Apple II programs as possible, 4am are using their skills to ensure an extended life and documented reference materials for what would otherwise disappear.

[Screenshot: Classifying Animals with Backbones title screen]

Already, the collection has garnered some attention – the “Classifying Animals With Backbones” educational program linked above has a guest review from one of its creators describing how the application came to life. And the write-up on a particularly thorny copy protection scheme on a 1982 Burger Time game went viral (in a good way) and was read 25,000 times when it was uploaded to the Archive.

In a few cases, the effort behind the copy protection schemes and the concerted engineering involved in removing them are epics in themselves.

[Screenshot: Speed Reader II main menu]

As an example, the educational program Speed Reader II contains extensive copy protection routines, using tricks and traps to resist any attempt to understand its inner workings and to mislead any party trying to duplicate it. 4am do their best to walk the user through what’s going on, and even if you don’t understand the exact code and engineering involved, the write-up leaves the reader smarter for having browsed through it.

This project has been underway for years and has now passed the 500 newly-preserved-program mark – that’s 500 different obscure programs preserved for the first time, which you can play and experience on the Archive.

Get cracking!

[Screenshot: Algernon title screen]

(The usual notes: the “Play in Browser” technology used at the Internet Archive is still relatively new, and works best on modern machines running the newest versions of browsers, especially Firefox, Chrome, and Brave. JavaScript (not Java) needs to be enabled for playback to work; by default in all browsers, it is. Manuals for many of the programs are not directly available, so some experimentation may be required, although educational programs were often designed to be understood by their audiences without any manuals. Thanks to 4am for housing their collection at the Internet Archive, and to the many individuals on the MAME and JSMESS teams who have made this emulation possible.)


Distributed Preservation Made Simple

Library partners of the Internet Archive now have at their fingertips an easy way – from a Unix-like command line in a terminal window – to download digital collections for local preservation and access.

This post will show how to use an Internet Archive command-line tool (ia) to download all items in a collection stored on Archive.org, and how to keep a local copy in sync with the Archive.org collection.

To use ia, the only requirement is to have Python 2 installed on a Unix-like operating system (e.g. Linux, Mac OS X). Python 2 is pre-installed on Mac OS X and most Linux systems, so there is nothing more that needs to be done, except to open a terminal and follow these steps:
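If you want to confirm that a suitable Python is available before you begin, a quick check is:

python --version

This should report a 2.x release; if it does not, install Python 2 through your operating system’s package manager.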

1.  Download the latest binary of the ia command-line tool by running the following command in your terminal:

curl -LO https://archive.org/download/ia-pex/ia

2. Make the binary executable:

chmod +x ia

3. Make sure you have the latest version of the binary, version 1.0.0:

./ia --version

4. Configure ia with your Archive.org credentials (this step is only needed if you require privileged access to certain items):

./ia configure

5. Download a collection:

./ia download --search 'collection:solarsystemcollection'

or

./ia download --search 'collection:JangoMonkey'

The “Download a collection” commands above will download all files from all items in the NASA Solar System collection or from the band JangoMonkey, respectively. If re-run, by default ia will skip over any files already downloaded, as rsync does, which can help keep your local collection in sync with the collection on Archive.org.
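Because re-runs skip files that are already present, the download command can be wrapped in a small script and run on a schedule (with cron, for example) to keep a local mirror current. Here is a minimal sketch along those lines – the collection identifiers and the /data paths are placeholders to adapt, not recommendations:

#!/bin/sh
# sync-collections.sh: re-fetch anything new or missing for each
# collection preserved locally (illustrative sketch only).
for coll in solarsystemcollection JangoMonkey; do
    mkdir -p "/data/$coll"                                        # make sure the target directory exists
    ./ia download --search "collection:$coll" --destdir "/data/$coll"
done

Each pass transfers only files that are new or missing, so the script is cheap to run frequently.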

If you would like to download only certain file types, you can use the --glob option. For example, if you only wanted to download JPEG files, you could use a command like:

./ia download --search 'collection:solarsystemcollection' --glob '*.jpeg|*.jpg'

Note that by default ia will download files into your current working directory. If you launch a terminal window and do not change directories, files will be downloaded into your home directory. To download to a different directory, you can either cd into that directory or use the --destdir parameter, like so:

mkdir solarsystemcollection-collection

./ia download --search 'collection:solarsystemcollection' --destdir solarsystemcollection-collection

Downloading in Parallel

GNU Parallel is a powerful command-line tool for executing jobs in parallel. When used with ia, downloading items in parallel is as easy as:

./ia search 'collection:solarsystemcollection' --itemlist | parallel --no-notice -j4 './ia download {} --glob="*.jpg|*.jpeg"'

The -j option controls how many jobs run in parallel (i.e. how many items are downloaded at a time). Depending on the machine you are running the command on, you might get better performance by increasing or decreasing the number of simultaneous jobs. By default, GNU Parallel will run one job per CPU core.
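For example, if your connection and disk can keep up, you might try raising the job count – the -j8 below is purely illustrative, so tune it for your own machine:

./ia search 'collection:solarsystemcollection' --itemlist | parallel --no-notice -j8 './ia download {} --glob="*.jpg|*.jpeg"'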

GNU Parallel can be installed with Homebrew on Mac OS X (i.e.: brew install parallel), or your favorite package manager on Linux (e.g. on Ubuntu: apt-get install parallel, on Arch Linux: pacman -S parallel, etc.). For more details, please refer to: https://www.gnu.org/software/parallel/

For more options and details, use the following command:

./ia download --help

Finally, to see what else the ia command-line tool can do:

./ia --help

Documentation of the ia command-line tool is available at: https://internetarchive.readthedocs.org/en/latest/cli.html

There you have it. Library partners, download and store your collections now using this command-line tool from the Internet Archive. If you have any questions or issues, please write to info (at) archive.org. We are trying to make distributed preservation simple and easy!

Next Librarian of Congress: Carla Hayden

[Photo: Carla Hayden]

The President has nominated Carla Hayden to be the next Librarian of Congress. I have met her through the Institute of Museum and Library Services (IMLS) and support her for this position.

As a public librarian, she can bring an access and public service orientation to a position that has traditionally been focused on Congress’ needs and collecting valuable materials.

The Library of Congress is both a powerful symbol and a fabulous organization. Its collections are unbelievable – there are employees in Cairo and Delhi collecting the best that humanity has produced. The Library has high collecting standards and has resisted restrictions being placed on access.

For instance, the Library of Congress has actively pursued web archiving since 2000 and made these collections more available than almost any other institution. As the home of the US Copyright Office, the Library can keep the constitutional balance in mind as copyright laws evolve.

All of these features of the Library play into the strengths of Carla Hayden who can help shape a potent institution for our new century.

-brewster