Reflections on the birth of QUANTOPIA, a New Work of Art

Silently, the performers took their places, tucked behind the thick black curtain. The members of the San Francisco Girls Chorus, led by director Valérie Sainte-Agathe, quickly flowed into two neat rows in the wings of the Yerba Buena Center for the Arts Theater. The stage crew snapped into their positions at the table covered with laptops, monitors, tape and cords. DJ Spooky, aka Paul D. Miller, appeared relaxed, a total picture of cool. Annie, our stage manager, ran down the cues. “Take a moment to let your eyes adjust to the darkness… look for the mark at center stage. Oh, and please be very careful not to knock this very important stand of equipment that has been carefully set up by the tech crew.”

Andi Wong with artists Greg Niemeyer and Paul D. Miller at the beginning of their year-long creative process.

After a year of dreaming, scheming and aligning the stars with the artistic team, the big moment had finally arrived. As the Internet Archive’s project manager and coordinator of educational outreach, I had the amazing opportunity to work with DJ Spooky’s creative team for over a year to help put together this multimedia experience. The Internet Archive and DJ Spooky’s QUANTOPIA: The Evolution of the Internet was ready to be presented before an audience for the very first time.

The world premiere performance of QUANTOPIA, the first of ten music commissions to receive major support from a Hewlett 50 Arts Commissions grant, would actually happen seven hours later. But, for the record, the very first audience to receive this new work was an audience of public school students and their teachers along with the families of the San Francisco Girls Chorus.

Funding from the Hewlett grant helped the Internet Archive to secure the caravan of school buses and vans that transported students from Visitacion Valley Middle School and Willie Brown Middle School in San Francisco to the Yerba Buena Center for the Arts Theater for this special school day performance. The field trip was the first of the year for these students. Some children got off the bus wide-eyed, sharing that they had never ever been to the theater or even this part of town. The visit was a unique opportunity to experience a work of art created by a talented and diverse team of collaborators, artists working together with technologists, drawing from the Internet Archive’s vast repository of knowledge.

The arts, with their unique role in the shaping of personal identity and celebration of diversity, have had an important role in representing the collective culture of the Bay Area. The Internet Archive is committed to the monumental task of recording and preserving who we are and who we are becoming through text, image, sound and code. With this project, DJ Spooky and company learned firsthand that the creative arts and technology can come together as equal partners in the pursuit of greater knowledge, with representation and justice for all. The performing arts call for the audience to complete a circle of dialogue. Could we hold the door open and invite more people in?

“STAY TUNED”

In his program notes, Paul D. Miller, aka DJ Spooky points out:

Today, according to the International Telecommunication Union, around 55.1% of the world’s population, more than 4.1 billion people, have access to the Internet. More will be joining in huge numbers over the next couple of years. As we move further into a world that is defined by information and how it shapes and molds all aspects of modern society, the Internet and its ancillary effects have resulted in the most complex systems architecture humanity has ever made.

After Dr. Leonard Kleinrock’s video introduction from the very spot that gave birth to the Internet when UCLA and Stanford first connected in 1969, DJ Spooky’s symphony opened with Movement I: Mimesis – De Revolutionibus. Layered over the bright rippling of electronic sounds, the San Francisco Girls Chorus, accompanied by the Classical Revolution string quartet, began to sing of the equal protections granted to all, text inspired by Article 19 of the UN Declaration of Human Rights.

Everyone has the right to freedom of opinion and expression; this right includes freedom to hold opinions without interference and to seek, receive and impart information and ideas through any media and regardless of frontiers.

A college-age usher later shared that when the music began, he noticed that students seemed a little confused. They leaned forward, listening intently. A look of surprise crossed their faces when they realized the ethereal sounds emanating from the stage were coming from the young artists of the SF Girls Chorus. I wonder if children are now more familiar with the ubiquitous sounds of technology than with the diverse expressive range of the natural human voice.

A moving timeline of media gleaned from the Internet Archive flashed on the giant screen behind the chorus. Movement I of QUANTOPIA presents the Internet as “a mirror of existing social, commercial and political structures.” Greg Niemeyer’s barcode forest invites the audience to consider the evolution of the Internet through fifty moments of technological, social, political and cultural importance. Imagine learning that the Internet has a history! For middle school students, the history of the Internet is a fait accompli. Most sixth graders were born in 2007, the year that Steve Jobs introduced the iPhone and changed how we access music.

Movement II: Diegesis: Roots, Routes, Rights resurrected the ancient sounds of the dial-up modem (which elicited a delighted laugh of recognition from the evening crowd). This time, Article 19 was presented in binary code, zeros and ones sung by mezzo-soprano Eve Orenstein. The stage at YBCA lit up with awe-inspiring imagery of effortless complexity made possible by technology, a freestyle tour-de-force performed live from a laptop behind the curtain by Roger Antonsen. The computer scientist from the University of Oslo casually described his role as “playing around with worlds,” using nodes and edges to render the birth and destruction of networks.

Movement III: Elpis – Polygon Chaos highlighted VR technology created by the Colorado production company, Medium Labs, as DJ Spooky painted the theater with waves of light and sound.

Later that evening, Internet Archive Founder and Digital Librarian Brewster Kahle and Emiko Ono of the Hewlett Foundation welcomed a sold-out house to the World Premiere performance of QUANTOPIA. The lights went down once again. This time the audience was a Bay Area blend of DJ Spooky super fans, students, technologists, YBCA subscribers, city officials, groups from Facebook and Lyft, representatives from Creative Commons, Hewlett 50 artists, and employees of the Internet Archive, all ready and willing to experience something new.

One audience member told me, “There was a moment when the visuals, the music, especially the voices and melodies produced by the Girls’ Chorus and the quartet, produced a truly mystical/spiritual experience within the architectural environment.” Equally transporting—the African American spirituals that DJ Spooky weaves through his composition:

Go tell it on the mountain
Over the hills and everywhere
Go tell it on the mountain
To let my people go…

After the performance that evening, Wendy Hanamura, the Internet Archive’s Director of Partnerships, led an audience Q&A session with the QUANTOPIA creative team: Roger Antonsen, Paul D. Miller, Greg Niemeyer, Valérie Sainte-Agathe, and Pax of Medium Labs. “If you could embed a value into the code, what would that be?” Hanamura asked. “Justice,” replied Niemeyer.

FINDING A CONNECTION: THE CREATION OF NETWORKS

In the lobby of the theater, we handed out Hewlett 50 Arts Commission Passports, which included information on all ten of the Class of 2017 grantees in musical composition. We encouraged audience members to stamp their passports with the Internet Archive’s logo to record their attendance. We created this passport to be more than just a souvenir of a world premiere. My hope is that the passport will encourage our tech-centric audience to explore the nine other Hewlett-commissioned works. Each work of art presents an exciting world of new ideas and perspectives. You can also enjoy the fun challenge of collecting all of the stamps from each of the Hewlett 50 arts organizations. Gotta get ’em all!

Brewster Kahle and Paul D. Miller aka DJ Spooky

At the Internet Archive, we’ve created an enduring online archive for QUANTOPIA to openly share and provide access to the work “across all frontiers.” It’s a model we’d like to offer to archive all works of art, which so often include text, webpages, video, audio and other ephemera. We see this as an exciting opportunity to develop new models of collaboration between the arts organizations, artists and audiences, as new works are created and continue evolving into the future.

“LO” a sketch by Greg Niemeyer for the QUANTOPIA timeline.

WHAT MAKES ME LOVE MUSIC SO MUCH?

The audience quieted when the tall young student, KyShawn, stepped up to the mike to ask the final question at the school day performance.  He asked DJ Spooky, “When did you know that you liked music?”

The artist took a deep breath before sharing his thoughtful response. He spoke of his parents who were both professors, and his childhood growing up in Washington, D.C., a place filled with the sounds of so many different styles of music. For Paul, music is not just music. Music is information.

We offered our thanks to the students for being such a great first audience and, one by one, we left the stage. Behind the curtain, Paul smiled and noted how smart the students’ questions were.

After the Q&A, the children spilled outside into the bright light of day. Some voiced surprise at seeing so many people out and about, having lunch in the gardens or enjoying a stroll. The young people gathered at the edge of the sparkling reflecting pool, their voices ringing with excitement like grace notes over the roaring waters of “Revelation,” the nearby memorial dedicated to the memory of Dr. Martin Luther King, Jr. Back at school, teachers assigned KyShawn and his classmates to write their reflections on the day. Perhaps KyShawn’s essay captures what we all hoped to achieve over this year: music that inspires; music that creates empathy; music that heals.

WHAT MAKES ME LOVE MUSIC SO MUCH (by KyShawn)

What makes me love music so much is music inspires me.

Even though music exists in all cultures. To me, I love music. It helps me relax and think about others…

Listening preferences are probably the sum of many variables, developing and changing throughout Life. Understanding the sources of individual differences related to musical enjoyment more deeply would also help design more effective therapeutic procedures involving music.

Internet Archive helps make books accessible for students with disabilities

The Internet Archive will be part of a team that is working to address a key challenge for students with disabilities: getting books in accessible formats. This participation aligns with an existing Internet Archive program to make materials available and accessible to readers with disabilities.

The number of students with disabilities at colleges and universities has grown over the past few decades. Many of those students have print disabilities, including the largest subgroup, those with learning differences. Students with print disabilities require text to be reformatted for screen readers, text-to-speech software, or other forms of audio delivery, often with human intervention. Universities are required to perform this reformatting on request but are rarely staffed to do that work at scale, and this type of reformatting and remediation can cost hundreds or even thousands of dollars. Once the work has been done for a student at one university, the reformatted book is almost never made available for use by students with disabilities at other universities. Without collaboration and coordination across campuses, efforts are wasted and students with disabilities often wait weeks to get texts in a form they can access and use.

A newly funded pilot project, “Federated Repositories of Accessible Materials for Higher Education,” aims to address this problem. This is a two-year pilot program that has recently been funded by a $1,000,000 grant from The Andrew W. Mellon Foundation to the University of Virginia (as principal investigator) with a primary goal of reducing the duplication of remediation activity across the seven (7) universities participating in the pilot. It will also support the cumulative improvement of accessible texts and decrease the turnaround time for delivering those texts to students and faculty.

Within this program, the Internet Archive will participate as one of several repositories of digitized books, both to provide initial digital copies (for remediation) and to receive and hold remediated book files. Those improved books can then be shared with other schools and organizations that provide services to people with disabilities. They may also be used as a starting point for further conversion into additional formats (such as Braille) that may be needed to support specific reader needs.

The Internet Archive’s role in this pilot project dovetails with our existing program to make materials available and accessible to readers with disabilities. Our current program allows any organization that is already working with people with disabilities, known as Qualifying Authorities, to access the digital files of over 1.8 million books (about 900,000 of which are otherwise unavailable). Those Qualifying Authorities, especially Disability Student Service teams at colleges and universities, are then able to streamline their preparation and remediation of these digital books for people with print disabilities. Because they work directly with individual readers, Qualifying Authorities are also able to enable existing (and qualified) Internet Archive users for an account with disability access. With that access, these users can enjoy expanded and immediate access to the Internet Archive’s full collection of books (through archive.org or OpenLibrary).

We are excited to participate in and support the wider community of teams working to make books accessible for all.

A Love Letter to the People Who Build the Internet Archive

When you visit a public library, you get to meet the librarians and others who build and care for those collections. You know there are people who empty the garbage cans, who put back the borrowed books, who maintain the computers, and who determine what ends up on the shelf.

A digital library, on the other hand, is “just” a web site.  You don’t really see the people who build it – we are often anonymous. But the Internet Archive wasn’t built by computers and algorithms.

From its inception, the Internet Archive has been built by thousands of people who understand that we have an opportunity to use the Internet to give everyone access to knowledge. Every person on the planet should have the opportunity to learn and to make a contribution.

This goal – Universal Access to All Knowledge – inspires the people who have built the Internet Archive over the past 23 years.  

People clean and repair the buildings that we occupy. People do payroll, choose our health plans, answer the phones, plan our events, reply to user emails, clean up spam, and pay our bills. People design and build the computers that hold the collections. People construct the network that carries data to every corner of the world. People write software that processes, backs up, and delivers files. People design and test and build interfaces. People digitize analog media and type in metadata. People curate collections, establish collaborations, and manage projects.

There’s no way I can mention all of these people by name. Even if I listed every employee from the past 23 years, I would still be missing the volunteers, the people from other organizations who worked on joint projects with us, the pro bono lawyers, the delightfully compulsive collectors, the funding organizations, the idea generators, our sounding boards for crazy ideas, the individuals who have donated money or materials, and the hundreds of thousands of people who have uploaded media into the archive.

Libraries are built by people, for people. Thank you so much to all of the people who have contributed to building the Internet Archive, whether they were employees or part of our huge group of friends and family. We would not be here without you, and we hope you will continue to help bring universal access to all knowledge in the future.

Happy Valentine’s Day!


ZIP is Broken, Except it’s Not, Except it Is

With many thousands of software items up at the archive, we’re both very useful and also very intimidating, depending on how precisely you know what you’re looking for. While it’s great when your search query gives you exactly what you need (like, say, a manual for the greatest elevator simulator of all time or a lovely flip-album of floppy disk sleeves), it’s not so great when it doesn’t.

Our rather expansive approach to acquisition of items means that if you have a long-hazy memory of something you want to see again, or want to run a generalized query like “show me all the shooters that came out for this platform”, you’ve got a lot of digging ahead of you. I’ve had many lovely conversations with people who are looking for something specific, software- or game-wise, that have ended with my being able to point them to an emulated version of it. Other times, I have to hand them a way to look inside a CD-ROM image from nearly 30 years ago, like this URL inside a GIF CD-ROM from 1992, which was a lovely rendered image of the Apple Logo and semi-transparent balls.

Here’s the image, which is just nice to look at:

Beyond the findability problem, there’s also the deeper problem that computer history has a lot of buried bodies. There were conflicts and issues related to interoperability, who ran what standards, and which programs actually did what they were supposed to. These problems persist in the modern world, but they have rapidly moved several abstraction layers away from the user: “my PlayStation 4 doesn’t play every PlayStation 3 game”, or “I can’t paste this image into my Twitter post with a simple copy-paste, I have to put it in a paint program and copy-paste that.”

It used to be a lot, lot worse.

Which brings us to .ZIP.

A SHORT (COMPRESSED) HISTORY TO COMPRESSION

Since computers have come onto the scene, connections between them (and to the user) have always suffered for lack of bandwidth. Sending text, data, images and sounds between different locations has always been some level of slow or undependable. There have been lots of innovations across the decades to deal with it; one of them is compression.

This is where the computer takes a file or sets of files, combines them, finds similar parts, and replaces those similar parts with one-off references to them. The algorithms to do these have become more complicated over time and require more computing power on the compressing end, and in some cases the decompressing end.
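To make that a little more concrete, here is a minimal Python sketch using the standard library’s zlib module, which implements DEFLATE (the method most commonly found inside .ZIP files). The function name `compression_ratio` and the sample data are made up for this illustration; it is not how any particular utility mentioned in this essay works internally.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    packed = zlib.compress(data)
    # Decompressing must give back exactly the bytes we started with.
    assert zlib.decompress(packed) == data
    return len(packed) / len(data)


# Highly repetitive input: the repeats get replaced by short back-references.
repetitive = b"floppy disk sleeve " * 500
# Random input has no similar parts to find, so it barely shrinks, if at all.
random_bytes = os.urandom(len(repetitive))

print(f"repetitive: {compression_ratio(repetitive):.3f}")
print(f"random:     {compression_ratio(random_bytes):.3f}")
```

The repetitive input shrinks to a small fraction of its size, while the random input stays about the same size, which is why how well a file compresses depends so heavily on what is in it.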

And here’s the thing: There have been a lot of file compression formats.

So many of them, in fact, that there’s some legitimate concern that there are compressed files out there for which no decompression program exists anymore. That’s certainly the case for a lot of proprietary file storage formats that were meant to run with one specific program (think a game data file, or a word processing program), but we’re sticking to generalized “File Compression Utility” formats in this essay.

Just in the IBM/DOS world, here are some file compression format extensions that have been created for a variety of reasons and which have been considered as in use:

ARJ, LZH, PAK, ARC, ZOO, SQZ, HYP, ARCE, ARC128, ARC286, .PAK, UC2, LHA, LBR, SFX, HAP, HA, DWC, LAR, SQZ, PIT, SIT, ICE

Some of these were made for other machines, but were made available via utility to the DOS world. They’ve got great names, reflected in the filename but just barely; names like Hamnersoft HAP/ Knowledge Dynamics, Voof, Zoo, Novosielski, ShrinkIt, and ReeveSoft Freeze. Pretty much all have fallen by the wayside in general use (as has DOS itself), so we don’t generally see new versions of these show up.

Except .ZIP. ZIP won the battle, and is the dominant compression scheme for “files” (as opposed to video/audio compression).

But what is .ZIP?

ZIP is ZIP, except Not ZIP

Co-created by Phil Katz and Gary Conway in 1989, .ZIP was a reaction to a lawsuit. In the growing realm of file compression utilities, one format, .ARC, created by System Enhancement Associates, had started to rise, and PKWARE (Katz’s company) made a competing product, PKARC, that used the original .ARC source code but rewrote it with faster routines, making it speedier. System Enhancement Associates sued PKWARE and won in a settlement, which led PKWARE to abandon .ARC and create a new format. The bad blood and publicity from the lawsuit helped drive adoption/conversion to the replacement format, .ZIP.

(I actually made a documentary about this part of the story.)

ZIP’s wide adoption and easy, clear documentation of the format meant support for it started expanding over time. Besides compressing the files themselves, a format like .ZIP preserves timestamps, has integrity checks, and maintains directory structure. (Many others do this as well.) If you uncompress a .ZIP file from 1992, you’ll be able to see when it was created and compressed, and other important data from a historical perspective. Also, if the file is from the early 1990s, chances of unpacking these .ZIP files successfully with any of a large range of current methods are really, really high. Drag it to your Windows, OSX or *nix environment, and chances are you’ll do fine.
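For the curious, those preserved details are easy to look at. The following is a small sketch using Python’s standard zipfile module; the helper name `describe_zip`, the member name and the 1992 timestamp are all invented for the example rather than taken from a real archive item.

```python
import io
import zipfile
import zlib


def describe_zip(source) -> list:
    """List (name, timestamp, CRC-32, size) for every member of a ZIP archive.

    `source` may be a file path or a file-like object opened in binary mode.
    """
    with zipfile.ZipFile(source) as archive:
        # testzip() re-reads every member and checks it against its stored
        # CRC-32; it returns the first bad member name, or None if all pass.
        damaged = archive.testzip()
        if damaged is not None:
            raise zipfile.BadZipFile(f"integrity check failed for {damaged}")
        return [
            (info.filename, info.date_time, info.CRC, info.file_size)
            for info in archive.infolist()
        ]


# Build a tiny archive in memory with a 1992 timestamp (made-up example data).
buffer = io.BytesIO()
member = zipfile.ZipInfo("DOCS/README.TXT", date_time=(1992, 3, 14, 9, 26, 52))
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(member, b"hello from 1992")

for name, stamp, crc, size in describe_zip(buffer):
    # The stored CRC-32 should match one computed independently.
    print(name, stamp, size, crc == zlib.crc32(b"hello from 1992"))
```

The directory part of the name, the date, and the checksum all travel inside the archive, which is what makes an old .ZIP file useful as a historical record and not only as a container.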

The closer you get to now, though, the more problems arise.

The most damning issue is that different operating system versions approach .ZIP slightly differently, which mostly works, and lets you even treat a .ZIP file like a little disk drive or folder, adding and removing files within it while preserving the compression. Why unpack 800 megabytes of files when you only need this single 5 megabyte one? Similarly, you can construct a new .ZIP file on your desktop, adjust a bunch of parameters within it, and poof, a .ZIP file you can attach to e-mail or pass along via other ways.
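Treating an archive like a little folder looks roughly like this in code. This is a sketch with Python’s standard zipfile module; `read_one_member`, `add_member` and the file names are hypothetical, chosen only to mirror the “one small file out of a big archive” situation described above.

```python
import io
import zipfile


def read_one_member(source, member_name: str) -> bytes:
    """Decompress a single member without unpacking the rest of the archive."""
    with zipfile.ZipFile(source) as archive:
        return archive.read(member_name)


def add_member(source, member_name: str, payload: bytes) -> None:
    """Append a new compressed member to an existing archive in place."""
    with zipfile.ZipFile(source, "a", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, payload)


# A stand-in for a large archive: one bulky member and one small one.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("big/filler.bin", b"\0" * 1_000_000)
    archive.writestr("small/notes.txt", b"the one file we need")

add_member(buffer, "small/extra.txt", b"added later")
print(read_one_member(buffer, "small/notes.txt"))
```

Only the requested member is decompressed; the bulky one is never unpacked.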

But between 1989 and now, with ZIP being 30 years old, there have been expansions to the format, small changes that keep it backwards compatible, but with nothing to easily tell a user that they’re using an out-of-date or different uncompression program.

The current cross-platform king is Info-ZIP, which has a homepage that credits the many people who have worked on it and provides access to the versions from over the years. It has been continually maintained to handle new issues, and is generally excellent at backwards compatibility. It’s probably your best bet for getting the information back out of a .ZIP file.

But that’s not what everyone uses.

“It Doesn’t Work”

On dozens of software items at the Internet Archive are reviews where a strange phenomenon happens:

  • Some reviews indicate the contents were just what they were looking for.
  • Some declare it broken, and terrible and truncated.

They’re both right.

One of the most problematic technical issues on a day-to-day basis with computers is bit limits. When you hear discussions of “8-bit”, “16-bit”, “32-bit” and “64-bit”, it usually reflects some resource within the system (graphics, filesystem, pipeline) being limited to a certain amount of addressing. If your daily job is computer development, this is probably old news to you; but not everyone’s daily job is computer development.

In general, a modern system will be some amount of 64-bit, with some 32-bit addressing thrown in a few corners simply because it’s not thought there’ll be a use for more. 32-bit is, very roughly, about 4 gigabytes of information.

This means that when someone on the Archive uploads a .ZIP file that is larger than 4 gigabytes, there’s a fairly good chance that a patron who downloads that file will not have the ability to uncompress/unpack that file using the tools on their specific desktop. If they use the internal tools (or a downloaded tool) to go through that .ZIP, the program (or even the operating system itself) won’t know what to do with this very large file, and will begin throwing out errors.

However, since the nature of .zip files is to be somewhat resilient, some files will make it out. It’ll start to unpack them, then declare a corruption or a bug and stop working. So it looks like some of it’s there, but not what the user was expecting or needed.
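Here is a rough way to see where that ceiling sits and to check an archive against it. The sketch below uses Python’s standard zipfile module; the helper name `needs_64bit_support` is invented for the example. The classic .ZIP layout stores sizes and offsets in 32-bit fields, and the format’s later 64-bit extension (usually called ZIP64) is what an unpacking tool has to understand to get past that point.

```python
import io
import zipfile

# Largest value an unsigned 32-bit field can hold: 4,294,967,295 bytes.
MAX_32BIT = 2**32 - 1


def needs_64bit_support(source) -> bool:
    """True if any member's size or offset overflows a 32-bit field."""
    with zipfile.ZipFile(source) as archive:
        return any(
            info.file_size > MAX_32BIT
            or info.compress_size > MAX_32BIT
            or info.header_offset > MAX_32BIT
            for info in archive.infolist()
        )


print(f"32-bit ceiling: {MAX_32BIT:,} bytes, about {MAX_32BIT / 2**30:.1f} GiB")

# A tiny in-memory archive, comfortably below the limit.
small = io.BytesIO()
with zipfile.ZipFile(small, "w") as archive:
    archive.writestr("tiny.txt", b"well under the limit")
print(needs_64bit_support(small))
```

A check like this only tells you that an archive is likely to trip up older tools; it does not repair anything. Opening the file with a maintained, 64-bit-aware unpacker is still the fix.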

What Is The Lesson Here?

As the Internet Archive continues growing in acquiring software and files, our propensity for easily searchable and accessible programs means that people will rush in, encounter a file like a .ZIP file, and not know about this 30 year+ history with that format and issues that could arise. How could they be expected to?

In earlier eras of computer history, the user was expected to be able to build and pilot the ship as comfortably as ride in it as a passenger. Thankfully, those days are mostly behind us and picking up a piece of technology and using it runs into issues like placement of buttons or lacking a headphone jack, instead of concerns of header information or data formats.

But under this surface of ease and frictionless experience is the occasional roiling current of decisions, movements and changes. It reflects how truly unsettled our computer world is, and how, every once in a while, we get a glimpse into it in ways that are not obvious.

It’s a privilege to be able to hold and present these vintage programs and documents from technology and time long past. But these items lived in an environment and support structure now truly gone, and it is sometimes a period of rediscovery for researchers professional, academic and hobbyist to re-learn what we’ve forgotten.

Hopefully the archive can help remember that too.

Want to read like a celebrity?

Apparently you’re not alone. I ran across a list of celebs’ favorite books and thought you might like to check out a few. (See what I did there? Librarian pun.) Happy reading!

Anna Kendrick
All Quiet on the Western Front by Erich Maria Remarque
Slaughterhouse-Five by Kurt Vonnegut
The Things They Carried by Tim O’Brien

Bill Murray
Huckleberry Finn by Mark Twain
A Story Like the Wind by Laurens Van Der Post
A Far Off Place by Laurens Van Der Post
The Plague by Albert Camus

Bill Murray
(photo by Georges Biard, CC BY-SA 3.0, from Wikimedia Commons)

Emma Watson
Le Petit Prince by Antoine de Saint-Exupéry

Olivia Munn
Replay by Ken Grimwood

Michelle Obama
Song of Solomon by Toni Morrison

Kit Harington
1984 by George Orwell

Dolly Parton
The Little Engine That Could by Watty Piper
(And check out Dolly Parton’s Imagination Library, which gives free books to kids!)

Dolly Parton
(photo by Josef Just [CC BY-SA 3.0], from Wikimedia Commons)

Robin Williams
Foundation trilogy by Isaac Asimov (or individually at 1, 2, 3)

Daniel Radcliffe
The Master and Margarita by Mikhail Bulgakov

Rachel McAdams
When You Are Engulfed in Flames by David Sedaris

Zooey Deschanel
A Supposedly Fun Thing I’ll Never Do Again by David Foster Wallace

Donald Glover
The Curious Incident of the Dog in the Night-Time by Mark Haddon
Extremely Loud And Incredibly Close by Jonathan Safran Foer

Donald Glover
(photo by NASA/Bill Ingalls [Public domain], via Wikimedia Commons)

Alec Baldwin
The Phantom Tollbooth by Norton Juster

Hillary Clinton
The Brothers Karamazov by Fyodor Dostoyevsky
Runaway by Alice Munro

Jessica Biel
Tender Is the Night by F. Scott Fitzgerald

Chelsea Handler
Mawson’s Will by Lennard Bickel
One Thousand White Women by Jim Fergus
Anna Karenina by Leo Tolstoy

Keira Knightley
The Passion by Jeanette Winterson

J. K. Rowling
The Woman Who Walked Into Doors by Roddy Doyle

Halle Berry
Some Love, Some Pain, Sometime by J. California Cooper

Jamie Chung
The Orphan Master’s Son by Adam Johnson

Jamie Chung
(photo by David Shankbone [CC BY 3.0], from Wikimedia Commons)

Jennifer Lawrence
Catcher in the Rye by J. D. Salinger
Raise High the Roof Beam, Carpenters; and Seymour by J. D. Salinger

Lady Gaga
Letters to a Young Poet by Rainer Maria Rilke

Jon Hamm
Arcadia by Tom Stoppard

Cher
Music for Chameleons by Truman Capote
Stranger in a Strange Land by Robert A. Heinlein

Kesha
Still Life with Woodpecker by Tom Robbins

Anne Hathaway
The Secret Garden by Frances Hodgson Burnett

Zoe Saldana
Shawshank Redemption by Stephen King

Zoe Saldana
(photo by Gage Skidmore [CC BY-SA 3.0], from Wikimedia Commons)


George R. R. Martin
Lord of the Rings by J. R. R. Tolkien

Matt Damon
A People’s History of the United States by Howard Zinn

Nas
Convictions by Richard Pryor

Natalie Portman
Cloud Atlas by David Mitchell

Bill Gates
Better Angels of our Nature by Steven Pinker

Joan Didion
Victory by Joseph Conrad

Making Out-of-Print Pre-1942 books available with “Last 20” provision

About a year and a half ago, the Internet Archive launched a collection of older books that were determined to qualify for the “Last 20” provision in Copyright Law, also known as Section 108(h) for the lawyers. As I understand this provision, it states that published works in the last twenty years of their copyright term may be digitized and distributed by libraries, archives and museums under certain circumstances. At the time, the small number of books that went into the collection were hand-researched by a team of legal interns. As you can imagine, this is a process that would be difficult to perform one-by-one for a large and ever-growing corpus of works.

So we set out to automate it. Amazon has an API with book information, so I figured with a little data massaging it shouldn’t be too hard to build a piece of software to do that job for us. Pull the metadata from our MARC* metadata records, send it to Amazon, and presto!

I was wrong. It was hard.

Library Catalog Names are different from Book Seller’s Names

Library-generated metadata is often very detailed, which leads to problems when we try to match the metadata provided by librarians to the metadata used on consumer-oriented web sites. For example, an author listed in a MARC record might appear as 

Purucker, G. de (Gottfried), 1874-1942

But when you look on Amazon, that same author appears as 

G. de Purucker

If we search the full author from the MARC on Amazon (including full name and birth and death dates), we may miss potential matches. And this is just one simple example.  We have to transform every author field we get from MARC using a set of rules that may continue to expand as we find new problems to solve.  Here are the current rules just for transforming this one field:

General rules for transforming MARC author to Amazon author:

  • Maintain all accented or non-Roman characters as-is
  • If there are no commas, semicolons or parentheses in the string, use the whole string as-is
  • If there are no commas in the string, but there are semicolon and/or parentheses, use anything before semicolon or parentheses as the entire author string
  • If there are commas in the string:
    • Everything before the first comma should be used as the author’s last name
    • Everything after the first comma but BEFORE any of these should be used as the author’s first name:
      • comma [ , ],
      • semicolon [ ; ],
      • open parentheses [ ( ]
      • any number [0-9]
      • end of string
    • Remaining information should be discarded
  • Period [ . ] and apostrophe [ ‘ ] and other symbols should not be used to delimit any name and should be maintained as-is in the transformed string.
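
To make the rules above concrete, here is a minimal sketch in Python. It is not the Archive's production code: the function name is made up for illustration, and joining the result as "first name, then last name" is an assumption based on the Amazon example above, since the rules themselves do not say how the two parts are reassembled.

```python
import re

# Once a comma has been seen, the first name ends at the next comma,
# semicolon, open parenthesis, or digit (or at the end of the string).
_FIRST_NAME_STOP = re.compile(r"[,;(0-9]")
# With no comma present, the author ends at a semicolon or parenthesis.
_NO_COMMA_STOP = re.compile(r"[;()]")


def marc_author_to_search_author(marc_author: str) -> str:
    """Reduce a MARC author string to a shorter, bookseller-style name.

    Hypothetical helper: the name and the "first last" output order are
    assumptions, not something the original rules specify.
    """
    text = marc_author.strip()
    if "," not in text:
        # Covers both no-comma rules: with no semicolon or parenthesis
        # the split returns the whole string unchanged.
        return _NO_COMMA_STOP.split(text, maxsplit=1)[0].strip()

    # Everything before the first comma is the last name.
    last_name, rest = text.split(",", 1)
    # Everything after it, up to the first stop character, is the first name.
    first_name = _FIRST_NAME_STOP.split(rest, maxsplit=1)[0].strip()
    # Periods, apostrophes and accented characters are left untouched.
    return f"{first_name} {last_name.strip()}".strip()


print(marc_author_to_search_author("Purucker, G. de (Gottfried), 1874-1942"))
```

Run on the MARC example above, this prints the same form Amazon uses for that author. New edge cases would most likely mean extending the two stop patterns rather than restructuring the function.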

An Account of the Saga of the Never-ending Title: as told to the author by three blah blah blahs…

Some older books have really long titles. The MARC record contains the entire title, of course! Why wouldn’t it?! But consumer-oriented sites like Amazon often carry these books with shortened or modified titles.  

For example, here’s the title of a real page-turner:

American authors, 1600 – 1900 a biographical dictionary of American literature ; compl. in 1 vol. with 1300 biographies and 400 portraits

But on Amazon that title is:

American Authors 1600-1900: A Biographical Dictionary of American Literature (Wilson Authors)

As you can imagine, it’s far more difficult to reliably match books with longer titles. A human can look at those two titles and think “yeah, that’s probably the same book,” but software doesn’t work quite that well.
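
One way to approximate that human judgment is to normalize both titles and compare them word by word. The sketch below is only an illustration of the idea, not the matching logic actually used: the function names are hypothetical, and comparing only as many words as the shorter title has is an assumption that suits long MARC titles but will over-match very short ones.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lower-case a title and strip punctuation and parenthetical notes."""
    # Drop parenthetical additions such as "(Wilson Authors)".
    title = re.sub(r"\([^)]*\)", " ", title)
    title = unicodedata.normalize("NFKC", title).casefold()
    # Replace punctuation, including dashes, with spaces; keep letters
    # (accented ones too) and digits.
    title = re.sub(r"[^\w\s]", " ", title)
    return " ".join(title.split())


def title_similarity(marc_title: str, seller_title: str) -> float:
    """Return a 0..1 score comparing the leading words of both titles."""
    a = normalize_title(marc_title).split()
    b = normalize_title(seller_title).split()
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    # Compare word sequences, truncated to the shorter title's length,
    # so a long MARC subtitle does not drag the score down.
    return SequenceMatcher(None, a[:n], b[:n]).ratio()


marc = ("American authors, 1600 – 1900 a biographical dictionary of American "
        "literature ; compl. in 1 vol. with 1300 biographies and 400 portraits")
seller = ("American Authors 1600-1900: A Biographical Dictionary of "
          "American Literature (Wilson Authors)")
print(title_similarity(marc, seller))
```

A score near 1 suggests the same book, but any cutoff would have to be tuned against hand-checked matches.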

*$%!@$* Serials

Now that the librarians have had a laugh, let’s explain that for everybody else! Think back to the days of yore when you went to the library and looked things up in a physical card catalog. If you wanted to know where a serial or periodical was located within the library collections, you really just needed one card to tell you that. It’s on this shelf in this area and the collection contains these years.

Great! Except when you’re looking at digital versions of these serials, they are distinct entities – they have different dates, different topics, different authors sometimes, etc. And yet they often still have just one MARC record – the digital equivalent of that one card in the catalog.

And that means that the publication dates pulled from the MARC records are sometimes very wrong.

For example, we have several items from the annual series The Book of Knowledge – 1947, 1957, 1958, 1959, 1974…  The date provided in the MARC file for all of these is 1940.

As you can imagine, when we are filtering texts by year for various purposes, serials are a consistent issue.

Even when we have a correct date, Amazon does not match very well on volume and other serial or periodical-based information.  For example, when we search for a particular month of a magazine, we are likely to match an entirely different month of that same magazine.

Not All Metadata is Good Metadata

Unbelievably, librarians do make mistakes. Sometimes the data we have from MARC records has typos, or a MARC record for a different publication date was attached to the book. For example, we have an author named Fkorence A Huxley, but her name is really Florence.  Not according to the MARC record, though! Fat finger errors don’t just happen on phones. Another example: we scanned a book originally published in 1924, and *republished* in 1971. We have the 1971 version.  But the MARC record tells us it’s from 1924.

Essentially, our search is only as good as our metadata. If there are typos, or the wrong MARC record, or wrong data, our search and/or filtering will not be accurate.

Commercial APIs Are Not Built to Solve Library Problems

Amazon’s API is built to sell books to end users. Yes, it helps you find a particular book, but the other data the API contains about availability, formats and pricing is less accurate. Because the Section 108(h) exemption for libraries (read more here) involves knowing whether copies are being sold at reasonable prices, we need to know about these aspects of the book to determine whether they qualify. But Amazon’s API is incomplete in this area. So we found ourselves needing to use the API to find a match for the title and author, and then go to the page and scrape it to actually get accurate availability and pricing information.

This increased the complexity of the programming required to use Amazon as a source for information, and greatly lengthened the process of building tools for this purpose.

Everything changes

We are making a determination about whether a book meets the qualifications for Section 108(h) at a particular point in time. Even with all of the issues discussed here, the accuracy of the data we can now pull about book availability and price is high. But it’s only accurate for the moment that we pull the data, because Amazon’s marketplace is constantly changing.  If we don’t find a book on Amazon today, that doesn’t mean it won’t appear on the site tomorrow. 

Because of this, when we make an item available to the public via Section 108(h), we write into the item’s metadata the date on which the determination was made. 
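
As a rough illustration of recording a point-in-time decision, the snippet below builds the kind of metadata that could accompany an item. The field names are invented for this example; the actual keys used in Internet Archive item metadata are not given in this post.

```python
from datetime import date


def determination_metadata(qualifies: bool, checked_on: date) -> dict:
    """Build metadata recording when a Section 108(h) check was made.

    Both keys below are hypothetical placeholders, not real field names.
    """
    return {
        "section108h-determination": "qualifies" if qualifies else "does-not-qualify",
        # ISO 8601 date, so later readers know how stale the check is.
        "section108h-determination-date": checked_on.isoformat(),
    }


print(determination_metadata(True, date.today()))
```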

Who Wants In!?

Since I’ve made this process sound SO appealing, I would imagine that any number of other library institutions are going to line up around the block wanting to try it out for themselves. Or not. But here’s the good news! If we digitize your books, the Internet Archive may be able to do the Section 108(h) determination on your behalf. Please contact us if you would like to participate.

*A MARC record is a MAchine-Readable Cataloging record. Essentially, it is the digital equivalent of the physical card from a card catalog. 

Posted in Announcements, Books Archive, News | 3 Comments

The World As They Saw It

Guest blog post by professor Tom Gally

As international travel becomes cheaper and easier, many of the tourists who now swamp Venice, Barcelona, San Francisco, and Hong Kong are visiting a foreign country for the first time. Surprised, fascinated, and sometimes repulsed by what they see, they eagerly post to social media their photos and impressions. Such reports are the source of much of what we believe, consciously or unconsciously, about places we haven’t visited yet.

Centuries ago, too, travelers were eager to tell their stories to people back home, and those stories helped to create the images and stereotypes that were formed about other lands and people. Many of those stories can be found in the thousands of travel books that are available in the text collections of the Internet Archive.

Here is a description, from a book published in London in 1701, of an Englishman’s first impressions of Paris:

Having enter’d this famous City, we were set down near the Louvre, and drop’d in first at a paltry House where the Fellow call’d himself in his Sign Le grand Voyageru, (or great Traveller) and pretended to Speak all Languages, but could scarce speak his own. Finding here but indifferent Accommodation, our Man provided us a Lodging in a House, where liv’d no less than two and twenty Families; thither we were carried in Sedans with Wheels, drag’d along by one Man, no Hackney-Coaches being then to be had. This was on a Sunday, and I was not a little surpriz’d to see Violins about the Streets, and People singing and dancing every where, as if they had been mad.

Though the language is archaic, the sentiments—bragging about visiting a famous city, complaining about accommodation and transportation, frowning at the local customs—would not be out of place in a tourist’s Facebook post today.

“View of the suburbs of a Chinese city”

In the early 1790s, King George III sent an envoy to the Emperor of China. Though the diplomatic mission was unsuccessful in its main purpose—to obtain trade concessions for Britain similar to those granted to the Portuguese and Dutch—it yielded a three-volume official report, by George Staunton, that contains a fascinating account of the long voyage halfway around the world (volume 1) and of the Chinese empire as seen through British eyes (volume 2). The report also includes many carefully engraved illustrations of sights in China—the Instagram posts of the era (volume 3).

“Descending the rapids of the Madeira”

Other travelers’ accounts I’ve dipped into include Travels from St. Petersburg, in Russia, to Diverse Parts of Asia by John Bell (1763) (volume 1, volume 2), Travels in America by George Howard (1851) (here), and a large compendium titled Cyclopædia of Modern Travel by Bayard Taylor (1856) (here).

Lately, I’ve also been exploring the Internet Archive’s rich collection of books written by British and American visitors to Japan in the 19th and early 20th centuries. Until the 1850s, Japan had been shut off nearly completely from the rest of the world for more than two hundred years, and people elsewhere were eager to learn about the mysterious country. Many sailors, traders, diplomats, missionaries, journalists, and individual travelers who were able to visit Japan wrote later about their experiences, and I’ve compiled a list of more than 240 of their books.

I myself moved to Japan in 1983 and have lived here ever since. As I read now the accounts of Westerners who arrived at Nagasaki or Yokohama in 1858 or 1869 or 1880 or 1905, I recall my own vivid first impressions of the country 36 years ago. While there are many differences—they rode rickshas, I took commuter trains; those Victorians were shocked by the casual nudity, this Californian was surprised by how formally people dressed—our experiences were also similar in many ways. And those who, as I did, stayed for more than a year or two and learned the language gradually came to see how their initial assessments had also been incomplete and sometimes biased.

“Tokio”

Several times a week, I pass through the bustling Shibuya crossing in Tokyo, and in recent years I’ve noticed more and more foreign tourists taking pictures of that famous location. After reading travelers’ accounts from more than a century ago, I increasingly wonder how tourists today are perceiving this country that is now my home, and I speculate how people elsewhere, seeing those photos posted to Instagram and Twitter and Weibo, will come to view that intersection and this country. I never would have thought deeply about this, and I certainly wouldn’t be contrasting our experiences with those of 19th-century visitors, if it weren’t for the great collections of books that the Internet Archive makes available for anyone in the world to read.

Tom Gally was born in Pasadena, California, in 1957. Since moving to Japan, he has worked as a translator, teacher, lexicographer, and writer. He is now a professor in the Graduate School of Arts and Sciences at the University of Tokyo and is compiling a book of excerpts from travelers’ accounts to be titled Japan As They Saw It. Samples can be read at the book’s website.

Posted in Announcements, News | 12 Comments

QUEER.ARCHIVE.WORK 2, 1923 INTERNET ARCHIVE EDITION

By Paul Soulellis

We usually think about archives as places of abundance. Deep, rich sites that house a multitude of perspectives. This can certainly be true, but archives are also sites of erasure, allowing some voices or perspectives to be minimized and excluded when they don’t fit into normative narratives.

Traditionally, stories involving people of color, queer people, and other historically-marginalized voices have been left out of archives, or diminished, because of ignorance, homophobia, and racism. Histories aren’t “discovered” in archives; rather, we use archives to actively construct versions of history, stories that accommodate our own subjective positions and ideologies. All too frequently, these stories favor the familiar structures of oppressive power—whiteness, patriarchy, and capitalism.

Likewise, the public domain is a remarkable construction that allows us to define who is or isn’t included in normative narratives. The public domain proclaims certain material as property owned by no one; cultural material in the public domain, theoretically, belongs to everyone. As copyright law enables new content to enter the public domain each year, it’s important to look closely at which voices are amplified in the celebration of open culture. There is no actual public domain. There is no site or territory or designation that reflects an authentic condition of “making public.” 

Rather, it’s a complex, evolving structure defined by the institutions that serve as portals to cultural material—museums, libraries, courts, and archives like this one. They carry a responsibility to give (or deny) access to materials that traverse in and out of the public domain. But as an institutional construct, the public domain can easily fail to reflect any true nature of “the public;” without careful consideration, access to the public domain ends up repeating and perpetuating, in a highly predictable way, the same oppressive structures that govern society and culture.     

What can be done? It’s crucial that we carefully examine our archives and search for lost voices, stories of failure, non-linear trajectories, and other non-conventional perspectives. We must refuse to accept traditional timelines at face value, and work to amplify marginalized material that has otherwise gone unnoticed, or erased. When confronting an archive or any presentation of historic cultural material, it’s irresponsible not to ask urgent questions like: What forces shaped this? Who was excluded? Who else should be included here in order to better understand the material at hand? Once engaged, we can actively work to change the shape of history, giving it dimension and depth and greater representation for all who were involved. This is what I’ve been calling queer archive work.  

I’m really grateful to the Internet Archive for inviting me to help shape their effort to present newly available material in the public domain. During my residency here, for the last 3 weeks, I’ve been searching archive.org for forgotten material — in particular, evidence of African-American culture, Native American culture, early LGBTQ voices, and other artifacts from 1923 that in the past would have been forgotten or actively left out of celebrations of open access culture. If something seemed to be missing, I tried to find it elsewhere and upload it to archive.org. Remarkably, I found the first openly lesbian book of poetry ever published in North America, On A Grey Thread, by the Bay-area poet Elsa Gidlow, from 1923. It had never been digitized, but a PDF from the author’s estate was sent to me for this project and is now online, as of a few days ago.

The result is QUEER.ARCHIVE.WORK 2, 1923 INTERNET ARCHIVE EDITION. It’s an edition of 100 copies that I edited, designed, and printed myself at a small press in Berkeley, and it features 15 lesser-known historical artifacts. All of it is now available on archive.org. I’m very proud that the Internet Archive enabled me to create this project. By bringing these items together in a loose assemblage, in the form of a publication, my hope is to create a place for forgotten voices to co-mingle. I think by doing more of this work, we can challenge what we think or assume we know about the early years of the 20th century, and imagine other kinds of histories.

For more see:
http://soulellis.com
http://queer.archive.work
https://queer.archive.work/2/index.html
https://archive.org/details/soulellis


Posted in Announcements, News | 5 Comments

Helping us judge a book by its cover: software help request

The Internet Archive would appreciate some help from a volunteer programmer to create software that would help determine if a book cover is useful to our users as a thumbnail or if we should use the title page instead. Many of our older books have cloth covers that are not useful, for instance:

But others are useful:

Judging by age alone is not enough, because even 1923 cloth covers are sometimes good indicators of what the book is about (and are nice looking):

We would like a piece of code that can help us determine if the cover is useful or not to display as the thumbnail of a book. It does not have to be exact, but it would be useful if it knew when it didn’t have a good determination so we could run it by a person.
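
To show the shape of the answer we have in mind (a label, a confidence, and an explicit "unsure" for cases that should go to a person), here is a toy sketch. It is not a proposed solution: it assumes the cover has already been decoded into grayscale brightness values, it guesses that plain cloth covers show little brightness variation, and both thresholds are arbitrary numbers that would need tuning against the example folders.

```python
from statistics import pstdev


def judge_cover(gray_pixels, low=12.0, high=40.0):
    """Return (label, confidence) for a cover given 0-255 brightness values.

    Toy heuristic under stated assumptions: low contrast suggests a plain
    cloth cover, high contrast suggests printed artwork or lettering.
    The `low` and `high` thresholds are placeholders, not tuned values.
    """
    values = list(gray_pixels)
    if not values:
        return ("unsure", 0.0)

    spread = pstdev(values)
    # Map the spread onto a 0..1 "looks useful" score.
    score = min(1.0, max(0.0, (spread - low) / (high - low)))

    if score >= 0.75:
        return ("useful", score)
    if score <= 0.25:
        return ("not-useful", 1.0 - score)
    # In-between cases are flagged so a person can take a look.
    return ("unsure", max(score, 1.0 - score))


print(judge_cover([120] * 100))     # a perfectly flat, featureless cover
print(judge_cover([0, 255] * 50))   # a very high-contrast cover
```

Real scans have fabric texture, stains and lighting gradients, so a practical version would probably need a trained image classifier. The three-way return value is the part worth keeping.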

To help any potential programmer volunteers, we have created folders of hundreds of examples in 3 categories: year 1923 books with not-very-useful covers, year 1923 books with useful covers, and year 2000 books with useful covers. The filename of each image is the Internet Archive item identifier, which can be used to find the full item: 1922forniaminera00bradrich.jpg would come from https://archive.org/details/1922forniaminera00bradrich. We would like a program (hopefully fast, small, and free/open source) that would say useful or not-useful and give a confidence.
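
For anyone scripting against the example folders, the filename-to-item mapping described above can be sketched like this (the function name is just for illustration):

```python
from pathlib import PurePosixPath


def details_url(image_filename: str) -> str:
    """Turn an example image filename into its archive.org item URL."""
    # The filename without its .jpg extension is the item identifier.
    identifier = PurePosixPath(image_filename).stem
    return f"https://archive.org/details/{identifier}"


print(details_url("1922forniaminera00bradrich.jpg"))
```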

Interested in helping? Brenton at archive.org is a good point of contact on this project.   Thank you for considering this. We can use the help. You can also use the comments on this post for any questions.

FYI: To create these datasets, I ran these command lines, and then by hand pulled some of the 1923 covers into the “useful” folder.

bash-3.2$ ia search "date:1923 AND mediatype:texts AND NOT collection:opensource AND NOT collection:universallibrary AND scanningcenter:*" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/cloth/{}.jpg"

bash-3.2$ ia search "date:2000 AND mediatype:texts AND scanningcenter:cebu" --itemlist --sort=downloads\ desc | head -1000 | parallel --will-cite -j10 "curl -Ls https://archive.org/download/{}/page/cover_.jpg?fail=soon.jpg\&cnt=0 >> ~/tmp/picture/{}.jpg"

Posted in Announcements, News | 21 Comments

A Public Peek into 1923

Commercial radio broadcasting began in the 1920s, bringing entertainment, news and music into people’s homes. Now, instead of needing to play a 78rpm disc on your phonograph, you could just tune in to listen to popular songs.

And in 1923 that meant you would have been listening to one of the many versions of “Yes! We Have No Bananas” written by Frank Silver and Irving Cohn.

You could listen to the Billy Jones version (play below), the Billy Murray version, a Yiddish version, or an Italian version, among others.

Yes! We Have No Bananas by Billy Jones from the 78rpm collection

Then you could have moved on to dancing the Charleston, popularized by the song of the same name from the 1923 musical “Runnin’ Wild.”   And with the explosion of recordings by African American musicians, you could also enjoy “Baby Won’t You Please Come Home” by Bessie Smith and “Dipper Mouth Blues” by Louis Armstrong.

Autogyro (1934)

In the news of the day you saw the first flight of an autogyro (the precursor to the helicopter).

Jack Dempsey defended his World Heavyweight Championship title against Tommy Gibbons and Luis Firpo.

And Howard Carter’s team finally entered the burial chamber of King Tutankhamen, as covered in books, sheet music and song.

But why are we focusing on 1923? Because for the first time in 20 years, new works are entering the public domain in the United States (read more: 1, 2, 3). And those works were all published in, you guessed it, 1923.

Settle in with a Reese’s Peanut Butter Cup, a Butterfinger, or a refreshing Popsicle (all invented in 1923!) while you watch Cecil B. DeMille’s The Ten Commandments, The White Sister starring Lillian Gish, or The Hunchback of Notre Dame starring Lon Chaney. Or any one of 50 other films available on archive.org from that year.

After your movie marathon, you can turn to your “new” reading materials to learn about sewing the latest women’s fashions, try an old recipe from a cook book (we recommend the Marshmallow Loaf), learn about theatrical lighting, construct yourself a bungalow (um, check the latest building codes first), grab some sheet music, read up on Benito Mussolini, and learn “How You Can Keep Fit” from Rudolph Valentino (!).

Finally, settle in to read some Robert Frost, Virginia Woolf, Edith Wharton, or Kahlil Gibran. And while you’re here, take a look at the 20,000 other texts we have available from 1923. 

We look forward to introducing you to 1924 NEXT January!

Posted in 78rpm, Announcements, Audio Archive, Books Archive, Cool items, Movie Archive, Music, News | 3 Comments