The final vote on the Copyright Directive in the European Parliament is expected between 26 and 28 March. As we explained previously, one particular provision, known as Article 13, would lead to upload filters being required on most Internet services. The proposed law has only gotten worse over the months of debate, and many in the EU and across the globe are concerned that this will lead to censorship even of legal content. The #SaveYourInternet fight has one last chance to prevent this law from taking effect. If you are an EU citizen, the most effective thing you can do is to call your MEP and ask them to vote against Article 13. Real-world peaceful protests are also planned throughout Europe. Go to savetheinternet.info/demos to find out where your nearest demonstration is. Those of us outside the EU can support this effort on social media using the #SaveYourInternet hashtag.
Growing up, my father worked night and day on a massive project which he called The Encyclopedia of Folk Music. Dad’s desk was in the middle of whatever small place we rented, if indeed rent was ever paid. His desk was the center of our household universe, piled high with papers, a Corona typewriter, stacks of reference books and sheet music, his ashtray overflowing with cigar butts.
As a kid, I believed The Encyclopedia was going to make Dad famous, and he told me we’d have loads of money when it was published. He said he was listing thousands of folk songs in detailed entries. There would be nothing like it, or so I was told; the most complete collection of folk songs in existence, and he was the person to put it all together, having been a singer and songwriter back in the day, long before I was born in 1965.
Because my father couldn’t read music, a musician named Joe Tansman spent countless hours creating sheet music for Dad; he was at our house so often he became family. Unfortunately, as far as I can tell, Joe was never given attribution for his work.
Several years into the project, things went haywire. My mother said Dad sold The Encyclopedia to Billboard Magazine, but he couldn’t bring himself to give them the work. He kept the advance, using the money to rush us out of town.
When I was twelve, a man called the house, branded my father a crook, and said he’d invested his life savings and wanted to be paid. That same year, more bad news: Dad said somehow his index for The Encyclopedia had been mysteriously burned and he’d have to start again, although there was no fire in or around our house. There was always a wacky reason why The Encyclopedia wouldn’t be published anytime soon.
Fast forward: I’m an adult trying to piece it together. My father died in 2009, and his encyclopedia isn’t with the rest of his papers. I began looking into the things I was told: songs he said he wrote, and his many pseudonyms. Recently I tracked down an older copy of The Encyclopedia that a couple had invested in back in 1970. By this time I knew the work would never be published. I loved my dad and we were close, but I became obsessed with fact-checking him, which started while he was still alive and continues to this day.
Last summer, I pitched this complex story to NPR’s “Hidden Brain” podcast, and an episode was created based on decades of my research. They interviewed my contacts including a half-sister I met in 2011 (one of many children my father had abandoned). While “Hidden Brain” was in production, I purchased the 1970s copy of The Encyclopedia which became part of the story the show crafted.
Being featured on a popular podcast gave me a lucky break: I was contacted by the Internet Archive. David Fox, Development Director, and Jeff Kaplan, Collections Manager, had heard the show, and Jeff reached out to me, wondering if I’d be interested in having the work scanned. Brewster Kahle, the founder, approved the pro bono scanning of the work.
On my 54th birthday, 10 years after my father’s death, I took my copy of The Encyclopedia to the Internet Archive and gave it to Jeff and Brewster. It’s hard to put into words the closure this gave me, knowing that, after all the twists, turns and broken promises, Dad’s early copy will be online for people to use at no cost. Jeff Kaplan told me he’d already found an obscure song in The Encyclopedia and performed it with his duo. I wish I could have been there to hear it!
Then there’s the last version of The Encyclopedia, which has mysteriously vanished. The boxes full of my father’s work were supposed to go to The Buck Owens Museum, but may have ended up in some unknown person’s storage. I’ve yet to track it down. That missing copy has more entries, and would take months to scan. But for now, I’m going to pause to enjoy the memory of my best birthday ever. Thank you to the Internet Archive and all the wonderful people who made this happen.
Silently, the performers took their places, tucked behind the thick black curtain. The members of the San Francisco Girls Chorus, led by director Valérie Sainte-Agathe, quickly flowed into two neat rows in the wings of the Yerba Buena Center for the Arts Theater. The stage crew snapped into their positions at the table covered with laptops, monitors, tape and cords. DJ Spooky, aka Paul D. Miller, appeared relaxed, a total picture of cool. Annie, our stage manager, ran down the cues. “Take a moment to let your eyes adjust to the darkness… look for the mark at center stage. Oh, and please be very careful not to knock this very important stand of equipment that has been carefully set up by the tech crew.”
Andi Wong, artists Greg Niemeyer and Paul K. Miller at the beginning of their year-long creative process.
After a year of dreaming, scheming and aligning the stars with the artistic team, the big moment had finally arrived. As the Internet Archive’s project manager and coordinator of educational outreach, I had the amazing opportunity to work with DJ Spooky’s creative team for over a year to help put together this multimedia experience. The Internet Archive and DJ Spooky’s QUANTOPIA: The Evolution of the Internet was ready to be presented before an audience for the very first time.
The world premiere performance of QUANTOPIA, the first of ten music commissions to receive major support from a Hewlett 50 Arts Commissions grant, would actually happen seven hours later. But, for the record, the very first audience to receive this new work was an audience of public school students and their teachers along with the families of the San Francisco Girls Chorus.
Funding from the Hewlett grant helped the Internet Archive to secure the caravan of school buses and vans that transported students from Visitacion Valley Middle School and Willie Brown Middle School in San Francisco to the Yerba Buena Center for the Arts Theater for this special school day performance. The field trip was the first of the year for these students. Some children got off the bus wide-eyed, sharing that they had never ever been to the theater or even this part of town. The visit was a unique opportunity to experience a work of art created by a talented and diverse team of collaborators, artists working together with technologists, drawing from the Internet Archive’s vast repository of knowledge.
The arts, with their unique role in shaping personal identity and celebrating diversity, have had an important role in representing the collective culture of the Bay Area. The Internet Archive is committed to the monumental task of recording and preserving who we are and who we are becoming through text, image, sound and code. With this project, DJ Spooky and company learned first hand that the creative arts and technology can come together as equal partners in the pursuit of greater knowledge, with representation and justice for all. The performing arts call for the audience to complete a circle of dialogue. Could we hold the door open and invite more people in?
In his program notes, Paul D. Miller, aka DJ Spooky points out:
Today, according to the International Telecommunication Union, around 55.1% of the world’s population, more than 3.1 billion people, have access to the Internet. More will be joining in huge numbers over the next couple of years. As we move further into a world that is defined by information and how it shapes and molds all aspects of modern society, the Internet and its ancillary effects have resulted in the most complex systems architecture humanity has ever made.
Everyone has the right to freedom of opinion and expression; this right includes freedom to hold opinions without interference and to seek, receive and impart information and ideas through any media and regardless of frontiers.
A college-age usher later shared that when the music began, he noticed that students seemed a little confused. They leaned forward, listening intently. A look of surprise crossed their faces when they realized the ethereal sounds emanating from the stage were coming from the young artists of the SF Girls Chorus. I wonder if children are now more familiar with the ubiquitous sounds of technology than with the diverse expressive range of the natural human voice.
A moving timeline of media gleaned from the Internet Archive flashed on the giant screen behind the chorus. Movement I of Quantopia presents the Internet as “a mirror of existing social, commercial and political structures.” Greg Niemeyer’s barcode forest invites the audience to consider the evolution of the Internet through fifty moments of technological, social, political and cultural importance. Imagine learning that the Internet has a history! For middle school students, the history of the Internet is a fait accompli. Most sixth graders were born in 2007—the year that Steve Jobs introduced the iPhone and changed how we access music.
Movement II: Diegesis: Roots, Routes, Rights resurrected the ancient sounds of the dial-up modem (which elicited delighted laughs of recognition from the evening crowd). This time, Article 19 was presented in binary code, zeros and ones sung by mezzo-soprano Eve Orenstein. The stage at YBCA lit up with awe-inspiring imagery of effortless complexity made possible by technology, a freestyle tour-de-force performed live from a laptop behind the curtain by Roger Antonsen. The computer scientist from the University of Oslo casually described his role as “playing around with worlds,” using nodes and edges to render the birth and destruction of networks.
Later that evening, Internet Archive Founder and Digital Librarian Brewster Kahle and Emiko Ono of the Hewlett Foundation welcomed a sold-out house to the World Premiere performance of QUANTOPIA. The lights went down once again. This time the audience was a Bay Area blend of DJ Spooky super fans, students, technologists, YBCA subscribers, city officials, groups from Facebook and Lyft, representatives from Creative Commons, Hewlett 50 artists, and employees of the Internet Archive, all ready and willing to experience something new.
One audience member told me, “There was a moment when the visuals, the music, especially the voices and melodies produced by the Girls’ Chorus and the quartet, produced a truly mystical/spiritual experience within the architectural environment.” Equally transporting—the African American spirituals that DJ Spooky weaves through his composition:
Go tell it on the mountain Over the hills and everywhere Go tell it on the mountain To let my people go…
After the performance that evening, Wendy Hanamura, the Internet Archive’s Director of Partnerships, led an audience Q&A session with the QUANTOPIA creative team: Roger Antonsen, Paul D. Miller, Greg Niemeyer, Valérie Sainte-Agathe, Pax of Medium Labs. “If you could embed a value into the code, what would that be?” Hanamura asked. “Justice,” replied Niemeyer.
FINDING A CONNECTION: THE CREATION OF NETWORKS
In the lobby of the theater, we handed out Hewlett 50 Arts Commission Passports which included information on all ten of the Class of 2017 grantees in musical composition. We encouraged audience members to stamp their passports with the Internet Archive’s logo to record their attendance. We created this passport to be more than just a souvenir of a world premiere. My hope is that the passport will encourage our tech-centric audience to explore the nine other Hewlett-commissioned works. Each work of art presents an exciting world of new ideas and perspectives. You can also enjoy the fun challenge of collecting all of the stamps from each of the Hewlett 50 arts organizations. Gotta get ’em all!
Roger Antonsen stamps his Hewlett 50 passport.
Patricia Kristof Moy of Kohl Mansion will present Jake Heggie’s dramatic vocal-chamber work, based on the true stories of The Violins of Hope, a set of more than 60 instruments, originally played by prisoners in concentration camps and ghetto residents during World War II.
Katherine Bates and Charlton Lee of the Del Sol String Quartet. Their collaboration with composer Huang Ruo, The Angel Island Oratorio, will premiere in October 2020 on Angel Island.
Brewster Kahle and Paul D. Miller aka DJ Spooky
At the Internet Archive, we’ve created an enduring online archive for QUANTOPIA to openly share and provide access to the work “across all frontiers.” It’s a model we’d like to offer to archive all works of art, which so often include text, webpages, video, audio and other ephemera. We see this as an exciting opportunity to develop new models of collaboration between the arts organizations, artists and audiences, as new works are created and continue evolving into the future.
“LO” a sketch by Greg Niemeyer for the QUANTOPIA timeline.
WHAT MAKES ME LOVE MUSIC SO MUCH?
The audience quieted when the tall young student, KyShawn, stepped up to the mike to ask the final question at the school day performance. He asked DJ Spooky, “When did you know that you liked music?”
The artist took a deep breath before sharing his thoughtful response. He spoke of his parents who were both professors, and his childhood growing up in Washington, D.C., a place filled with the sounds of so many different styles of music. For Paul, music is not just music. Music is information.
We offered our thanks to the students for being such a great first audience and one by one, we left the stage. Behind the curtain, Paul smiled and noted how smart the students’ questions were.
After the Q&A, the children spilled outside into the bright light of day. Some voiced surprise at seeing so many people out and about, having lunch in the gardens or enjoying a stroll. The young people gathered at the edge of the sparkling reflecting pool, their voices ringing with excitement like grace notes over the roaring waters of “Revelation,” the nearby memorial dedicated to the memory of Dr. Martin Luther King, Jr. Back at school, teachers assigned KyShawn and his classmates to write their reflections on the day. Perhaps KyShawn’s essay captures what we all hoped to achieve over this year: music that inspires; music that creates empathy; music that heals.
WHAT MAKES ME LOVE MUSIC SO MUCH (by KyShawn)
What makes me love music so much is music inspires me.
Even though music exists in all cultures. To me, I love music. It helps me relax and think about others…
Listening preferences are probably the sum of many variables, developing and changing throughout life. Understanding the sources of individual differences related to musical enjoyment more deeply would also help design more effective therapeutic procedures involving music.
The Internet Archive will be part of a team that is working to address a key challenge for students with disabilities: getting books in accessible formats. This participation aligns with an existing Internet Archive program to make materials available and accessible to readers with disabilities.
The number of students with disabilities at colleges and universities has grown over the past few decades. Many of those students have print disabilities, including the largest subgroup, those with learning differences. Students with print disabilities require text to be reformatted for screen readers, text-to-speech software, or other forms of audio delivery, often with human intervention. Universities are required to perform this reformatting on request but are rarely staffed to do that work at scale, and this type of reformatting and remediation can cost hundreds or even thousands of dollars. Once the work has been done for a student at one university, the reformatted book is almost never made available for use by students with disabilities at other universities. Without collaboration and coordination across campuses, efforts are wasted and students with disabilities often wait weeks to get texts in a form they can access and use.
A newly-funded pilot project, “Federated Repositories of Accessible Materials for Higher Education,” aims to address this problem. This two-year pilot program has recently been funded by a $1,000,000 grant from The Andrew W. Mellon Foundation to the University of Virginia (as principal investigator), with a primary goal of reducing duplicated remediation activity across the seven universities participating in the pilot. It will also support the cumulative improvement of accessible texts and decrease the turnaround time for delivering those texts to students and faculty.
Within this program, the Internet Archive will participate as one of several repositories of digitized books, both to provide initial digital copies (for remediation) and to receive and hold remediated book files. Those improved books can then be shared with other schools and organizations that provide services to people with disabilities. They may also be used as a starting point for further conversion into additional formats (such as Braille) that may be needed to support specific reader needs.
The Internet Archive’s role in this pilot project dovetails with our existing program to make materials available and accessible to readers with disabilities. Our current program allows any organization that is already working with people with disabilities, known as a Qualifying Authority, to access the digital files of over 1.8 million books (about 900,000 of which are otherwise unavailable). Those Qualifying Authorities, especially Disability Student Service teams at colleges and universities, are then able to streamline their preparation and remediation of these digital books for people with print disabilities. Because they work directly with individual readers, Qualifying Authorities are also able to upgrade existing (and qualified) Internet Archive users to accounts with disability access. With that access, these users can enjoy expanded and immediate access to the Internet Archive’s full collection of books (through archive.org or OpenLibrary).
We are excited to participate in and support the wider community of teams working to make books accessible for all.
When you visit a public library, you get to meet the librarians and others who build and care for those collections. You know there are people who empty the garbage cans, who put back the borrowed books, who maintain the computers, and who determine what ends up on the shelf.
A digital library, on the other hand, is “just” a web site. You don’t really see the people who build it – we are often anonymous. But the Internet Archive wasn’t built by computers and algorithms.
From its inception, the Internet Archive has been built by thousands of people who understand that we have an opportunity to use the Internet to give everyone access to knowledge. Every person on the planet should have the opportunity to learn and to make a contribution.
This goal – Universal Access to All Knowledge – inspires the people who have built the Internet Archive over the past 23 years.
People clean and repair the buildings that we occupy. People do payroll, choose our health plans, answer the phones, plan our events, reply to user emails, clean up spam, and pay our bills. People design and build the computers that hold the collections. People construct the network that carries data to every corner of the world. People write software that processes, backs up, and delivers files. People design and test and build interfaces. People digitize analog media and type in metadata. People curate collections, establish collaborations, and manage projects.
There’s no way I can mention all of these people by name. Even if I listed every employee from the past 23 years, I would still be missing the volunteers, the people from other organizations who worked on joint projects with us, the pro bono lawyers, the delightfully compulsive collectors, the funding organizations, the idea generators, our sounding boards for crazy ideas, the individuals who have donated money or materials, and the hundreds of thousands of people who have uploaded media into the archive.
Libraries are built by people, for people. Thank you so much to all of the people who have contributed to building the Internet Archive, whether they were employees or part of our huge group of friends and family. We would not be here without you, and we hope you will continue to help bring universal access to all knowledge in the future.
Our rather expansive approach to acquiring items means that if you have a long-hazy memory of something you want to see again, or want to run a generalized query like “show me all the shooters that came out for this platform,” you’ve got a lot of digging ahead of you. I’ve had many lovely conversations with people looking for a specific piece of software or game that have ended with me pointing them to an emulated version of it. Other times, I have to hand them a way to look inside a CD-ROM image from nearly 30 years ago, like this URL inside a GIF CD-ROM from 1992, which was a lovely rendered image of the Apple logo and semi-transparent balls.
Here’s the image, which is just nice to look at:
Beyond the findability problem, there’s also the deeper problem that computer history has a lot of buried bodies. There were conflicts and issues related to interoperability, who ran what standards, and which programs actually did what they were supposed to. These problems persist in the modern world, but they have been pushed several abstraction layers away: “my Playstation 4 doesn’t play every Playstation 3 game,” or “I can’t paste this image into my twitter post with a simple copy-paste, I have to put it in a paint program and copy-paste that.”
It used to be a lot, lot worse.
Which brings us to .ZIP.
A SHORT (COMPRESSED) HISTORY OF COMPRESSION
Since computers came onto the scene, connections between them (and to the user) have always suffered from a lack of bandwidth. Sending text, data, images and sounds between different locations has always been slow or undependable to some degree. There have been lots of innovations across the decades to deal with it; one of them is compression.
This is where the computer takes a file or set of files, combines them, finds repeated parts, and replaces those repeats with short references back to a single copy. The algorithms to do this have become more complicated over time and require more computing power on the compressing end, and in some cases the decompressing end.
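The effect is easy to see in a few lines of Python using the standard library’s zlib module, which implements DEFLATE, the same algorithm most .ZIP files use. (This is just an illustration of the general idea, not the specific routines any one archiver used.)

```python
import zlib

# Highly repetitive input shrinks dramatically, because the algorithm
# replaces repeated runs with short references back to earlier copies.
repetitive = b"folk song " * 1000          # 10,000 bytes of repetition
compressed = zlib.compress(repetitive)

print(len(repetitive), "->", len(compressed))

# Decompression restores the original bytes exactly.
assert zlib.decompress(compressed) == repetitive
```

Random or already-compressed data, by contrast, has few repeats to exploit and barely shrinks at all.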
And here’s the thing: There have been a lot of file compression formats.
So many of them, in fact, that there’s some legitimate concern that there are compressed files out there for which no decompression program exists anymore. That’s certainly the case for a lot of proprietary file storage formats that were meant to run with one specific program (think a game data file, or a word processing program), but we’re sticking to generalized “File Compression Utility” formats in this essay.
Just in the IBM/DOS world, here are some file compression format extensions that were created for a variety of reasons and saw real use:
ARJ, LZH, PAK, ARC, ZOO, SQZ, HYP, ARCE, ARC128, ARC286, UC2, LHA, LBR, SFX, HAP, HA, DWC, LAR, PIT, SIT, ICE
Some of these were made for other machines but were made available to the DOS world via utilities. They’ve got great names, only barely reflected in the file extensions: Hamarsoft HAP, Knowledge Dynamics, Voof, Zoo, Novosielski, ShrinkIt, and ReeveSoft Freeze. Pretty much all have fallen by the wayside (as has DOS itself), so we don’t generally see new versions of these show up.
Except .ZIP. ZIP won the battle, and is the dominant compression scheme for “files” (as opposed to video/audio compression).
But what is .ZIP?
ZIP is ZIP, except Not ZIP
Co-created by Phil Katz and Gary Conway in 1989, .ZIP was a reaction to a lawsuit. In the growing realm of file compression utilities, one format, .ARC, created by System Enhancement Associates, had started to rise, and PKWARE (Katz’s company) made a competing product, PKARC, that used the original .ARC source code but rewrote the routines to be faster. System Enhancement Associates sued PKWARE and won a settlement, resulting in PKWARE abandoning .ARC and creating a new format. The bad blood and publicity from the lawsuit helped drive adoption of the replacement format, .ZIP.
ZIP’s wide adoption and easy, clear documentation of the format meant support for it started expanding over time. Besides compressing the files themselves, a format like .ZIP preserves timestamps, has integrity checks, and maintains directory structure. (Many other formats do this as well.) If you uncompress a .ZIP file from 1992, you’ll be able to see when it was created and compressed, and other important data from a historical perspective. Also, if the file is from the early 1990s, the chances of unpacking it successfully with any of a large range of current methods are really, really high. Drag it to your Windows, OSX or *nix environment, and chances are you’ll do fine.
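Those preserved details are visible to any modern tool. Here is a small sketch using Python’s standard zipfile module, building a throwaway archive in memory and reading back the metadata the format stores alongside each compressed file:

```python
import io
import zipfile

# Build a small archive in memory, then inspect the metadata that
# the .ZIP format preserves alongside the compressed contents.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("docs/readme.txt", "hello from the archive")

with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("docs/readme.txt")
    print(info.filename)    # directory structure is preserved
    print(info.date_time)   # timestamp stored in the archive
    print(info.CRC)         # integrity-check value for the contents
```

The same attributes are available for a real .ZIP from 1992: the stored timestamp and CRC travel with the file wherever it goes.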
The closer you get to now, though, the more problems arise.
The most damning issue is that different operating system versions approach .ZIP slightly differently. It mostly works, and even lets you treat a .ZIP file like a little disk drive or folder, adding and removing files within it while preserving the compression. Why unpack 800 megabytes of files when you only need this single 5 megabyte one? Similarly, you can construct a new .ZIP file on your desktop, adjust a bunch of parameters within it, and poof, a .ZIP file you can attach to e-mail or pass along in other ways.
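This drive-like behavior is possible because each member of a .ZIP archive is compressed independently, so tools can read or append one entry without touching the rest. A rough sketch with Python’s zipfile module (an in-memory stand-in for the desktop behavior described above):

```python
import io
import zipfile

# Create an archive with a large member and a small one.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("big/movie.dat", "x" * 1000)
    zf.writestr("small/notes.txt", "just this one, please")

# Read a single member without unpacking the rest, then append a
# new file without rebuilding the archive ("a" = append mode).
with zipfile.ZipFile(buf, "a") as zf:
    one_file = zf.read("small/notes.txt")
    zf.writestr("small/extra.txt", "added later")

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
print(one_file)
print(names)
```

Operating system “compressed folder” views do essentially the same thing behind the scenes, which is why they can show an archive’s contents instantly.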
But between 1989 and now, with ZIP being 30 years old, there have been expansions to the format: small changes that keep it backwards compatible, but with nothing to easily tell users that they’re using an out-of-date or different decompression program.
The current cross-platform king is Info-ZIP, which has a homepage that credits the many people who have worked on it and provides access to versions from over the years. It has been continually maintained to handle new issues, and is generally excellent at backwards compatibility. It’s probably your best bet for getting the information back out of a .ZIP file.
But that’s not what everyone uses.
“It Doesn’t Work”
Dozens of software items at the Internet Archive have reviews where a strange phenomenon appears:
Some reviews indicate the contents were just what they were looking for.
Some declare it broken, terrible and truncated.
They’re both right.
One of the most problematic technical issues on a day-to-day basis with computers is bit limits. When you hear discussions of “8-bit”, “16-bit”, “32-bit” and “64-bit”, it usually reflects some resource within the system (graphics, filesystem, pipeline) being limited to a certain amount of addressing. If your daily job is computer development, this is probably old news to you; but not everyone’s daily job is computer development.
In general, a modern system will be some amount of 64-bit, with some 32-bit addressing thrown in a few corners simply because it’s not thought there’ll be a use for more. 32-bit addressing tops out at roughly 4 gigabytes of information (2 gigabytes if a signed value is used).
This means that when someone uploads a .ZIP file to the Archive that is larger than those limits, there’s a good chance that a patron who downloads that file will not have the ability to uncompress/unpack it using the tools on their specific desktop. If they use the internal tools (or a downloaded tool) to go through that .ZIP, the program (or even the operating system itself) won’t know what to do with this very large file, and will begin throwing out errors.
However, since .zip files are somewhat resilient by design, some files will make it out. The tool will start to unpack them, then declare a corruption or a bug and stop working. So it looks like some of the contents are there, but not what the user was expecting or needed.
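The modern fix for these size limits is the ZIP64 extension, which widens the format’s internal size fields; tools that predate or lack ZIP64 support are exactly the ones that choke partway through. A hedged sketch with Python’s zipfile module, which can be told to write ZIP64 records even for a tiny member (so the example stays runnable without a multi-gigabyte file):

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # force_zip64 writes the ZIP64 extension records even for this
    # small member, mimicking what any file over 4 GB requires.
    with zf.open("huge.bin", "w", force_zip64=True) as member:
        member.write(b"pretend this is enormous")

# A ZIP64-aware reader handles it fine; an old 32-bit-era unzip
# tool reading a genuinely huge archive would error out instead.
with zipfile.ZipFile(buf) as zf:
    recovered = zf.read("huge.bin")
print(recovered)
```

Which is why two reviewers of the same download can both be right: one of them has a ZIP64-capable unpacker, and the other does not.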
What Is The Lesson Here?
As the Internet Archive continues to acquire software and files, our propensity for easily searchable and accessible programs means that people will rush in, encounter a file like a .ZIP, and know nothing about this format’s 30-plus-year history and the issues that can arise. How could they be expected to?
In earlier eras of computer history, the user was expected to be able to build and pilot the ship as comfortably as ride in it as a passenger. Thankfully, those days are mostly behind us: picking up a piece of technology and using it now runs into issues like the placement of buttons or a missing headphone jack, instead of concerns about header information or data formats.
But under this surface of ease and frictionless experience is the occasional roiling current of decisions, movements and changes. It reflects how truly unsettled our computer world is, and how, every once in a while, we get a glimpse into it in ways that are not obvious.
It’s a privilege to be able to hold and present these vintage programs and documents from technology and time long past. But these items lived in an environment and support structure now truly gone, and it is sometimes a period of rediscovery for researchers, whether professional, academic or hobbyist, to re-learn what we’ve forgotten.
About a year and a half ago, the Internet Archive launched a collection of older books that were determined to qualify for the “Last 20” provision in Copyright Law, also known as Section 108(h) for the lawyers. As I understand this provision, it states that published works in the last twenty years of their copyright term may be digitized and distributed by libraries, archives and museums under certain circumstances. At the time, the small number of books that went into the collection were hand-researched by a team of legal interns. As you can imagine, this is a process that would be difficult to perform one-by-one for a large and ever-growing corpus of works.
So we set out to automate it. Amazon has an API with book information, so I figured with a little data massaging it shouldn’t be too hard to build a piece of software to do that job for us. Pull the metadata from our MARC* metadata records, send it to Amazon, and presto!
I was wrong. It was hard.
Library Catalog Names are different from Book Seller’s Names
Library-generated metadata is often very detailed, which leads to problems when we try to match the metadata provided by librarians to the metadata used on consumer-oriented web sites. For example, an author listed in a MARC record might appear as
Purucker, G. de (Gottfried), 1874-1942
But when you look on Amazon, that same author appears as
G. de Purucker
If we search the full author from the MARC on Amazon (including full name and birth and death dates), we may miss potential matches. And this is just one simple example. We have to transform every author field we get from MARC using a set of rules that may continue to expand as we find new problems to solve. Here are the current rules just for transforming this one field:
General rules for transforming MARC author to Amazon author:
Maintain all accented or non-Roman characters as-is
If there are no commas, semicolons or parentheses in the string, use the whole string as-is
If there are no commas in the string, but there are semicolons and/or parentheses, use everything before the first semicolon or parenthesis as the entire author string
If there are commas in the string:
Everything before the first comma should be used as the author’s last name
Everything after the first comma but BEFORE any of these should be used as the author’s first name:
comma [ , ]
semicolon [ ; ],
open parentheses [ ( ]
any number [0-9]
end of string
Remaining information should be discarded
Period [ . ] and apostrophe [ ‘ ] and other symbols should not be used to delimit any name and should be maintained as-is in the transformed string.
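As a sketch, the rules above might be implemented like this in Python. This is my own simplified illustration, not the Archive’s actual code; the function name and structure are assumptions:

```python
import re

def marc_author_to_display(marc_author: str) -> str:
    """Transform a MARC-style author string into a consumer-site style name.

    Implements only the rules listed above; accented and non-Roman
    characters, periods, and apostrophes pass through unchanged.
    """
    s = marc_author.strip()
    if ',' not in s:
        if ';' in s or '(' in s:
            # No commas, but semicolons/parentheses: keep everything
            # before the first semicolon or open parenthesis.
            return re.split(r'[;(]', s, maxsplit=1)[0].strip()
        return s  # no delimiters at all: use the whole string as-is
    # Commas present: last name is everything before the first comma.
    last, rest = s.split(',', 1)
    # First name: everything after that comma, up to the next comma,
    # semicolon, open parenthesis, digit, or end of string.
    first = re.split(r'[,;(0-9]', rest, maxsplit=1)[0].strip()
    # Remaining information (dates, parentheticals) is discarded.
    return f"{first} {last.strip()}".strip()
```

Applied to the example above, `marc_author_to_display("Purucker, G. de (Gottfried), 1874-1942")` yields `"G. de Purucker"`, the form Amazon uses.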
An Account of the Saga of the Never-ending Title: as told to the author by three blah blah blahs…
Some older books have really long titles. The MARC record contains the entire title, of course! Why wouldn’t it?! But consumer-oriented sites like Amazon often carry these books with shortened or modified titles.
For example, here’s the title of a real page-turner:
American authors, 1600 – 1900 a biographical dictionary of American literature ; compl. in 1 vol. with 1300 biographies and 400 portraits
But on Amazon that title is:
American Authors 1600-1900: A Biographical Dictionary of American Literature (Wilson Authors)
As you can imagine, it’s far more difficult to reliably match books with longer titles. A human can look at those two titles and think “yeah, that’s probably the same book,” but software doesn’t work quite that well.
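One common way to approximate that human judgment (a general technique, not necessarily what the Archive’s tool does) is to normalize both titles and compare them with a similarity ratio, also accepting a prefix match since consumer sites often truncate long catalog titles. A sketch, with names and the threshold chosen for illustration:

```python
import re
from difflib import SequenceMatcher

def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    title = re.sub(r'[^\w\s]', ' ', title.lower())
    return re.sub(r'\s+', ' ', title).strip()

def titles_probably_match(a: str, b: str, threshold: float = 0.6) -> bool:
    """Heuristic match on normalized titles.

    Accepts the pair if one normalized title is a prefix of the other
    (truncated consumer listing), or if their similarity ratio clears
    an arbitrary threshold.
    """
    na, nb = normalize_title(a), normalize_title(b)
    shorter, longer = sorted((na, nb), key=len)
    if longer.startswith(shorter):
        return True
    return SequenceMatcher(None, na, nb).ratio() >= threshold
```

On the two “American authors” titles above, normalization strips the punctuation differences and the long shared prefix pushes the ratio well over the threshold, so they match; thresholds like this always trade false positives against false negatives.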
If you’re a librarian, you can already see where this is going; let’s explain it for everybody else. Think back to the days of yore when you went to the library and looked things up in a physical card catalog. If you wanted to know where a serial or periodical was located within the library collections, you really just needed one card to tell you that: it’s on this shelf in this area, and the collection contains these years.
Great! Except when you’re looking at digital versions of these serials, they are distinct entities – they have different dates, different topics, different authors sometimes, etc. And yet they often still have just one MARC record – the digital equivalent of that one card in the catalog.
And that means that the publication dates pulled from the MARC records are sometimes very wrong.
For example, we have several items from the annual series The Book of Knowledge – 1947, 1957, 1958, 1959, 1974… The date provided in the MARC file for all of these is 1940.
As you can imagine, when we are filtering texts by year for various purposes, serials are a consistent issue.
Even when we have a correct date, Amazon does not match very well on volume and other serial or periodical-based information. For example, when we search for a particular month of a magazine, we are likely to match an entirely different month of that same magazine.
Not All Metadata is Good Metadata
Unbelievably, librarians do make mistakes. Sometimes the data we have from MARC records has typos, or a MARC record for a different publication date was attached to the book. For example, we have an author named Fkorence A Huxley, but her name is really Florence. Not according to the MARC record, though! Fat finger errors don’t just happen on phones. Another example: we scanned a book originally published in 1924, and *republished* in 1971. We have the 1971 version. But the MARC record tells us it’s from 1924.
Essentially, our search is only as good as our metadata. If there are typos, or the wrong MARC record, or wrong data, our search and/or filtering will not be accurate.
Commercial APIs Are Not Built to Solve Library Problems
Amazon’s API is built to sell books to end users. Yes, it helps you find a particular book, but the other data the API contains about availability, formats, and pricing is less accurate. Because the Section 108(h) exemption for libraries involves knowing whether copies are being sold at reasonable prices, we need that information to determine whether a book qualifies. But Amazon’s API is incomplete in this area, so we found ourselves using the API to find a match on title and author, and then scraping the book’s product page to get accurate availability and pricing information.
This increased the complexity of the programming required to use Amazon as a source of information, and greatly lengthened the process of building tools for this purpose.
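Once availability and pricing are in hand, the core “Last 20” eligibility check itself is simple to sketch. The following is my own simplified illustration, assuming a 95-year term from publication; the actual statute has further conditions (such as notice filed by the rights holder), and real determinations require legal review:

```python
def qualifies_for_last_20(pub_year: int, current_year: int,
                          commercially_exploited: bool,
                          obtainable_at_reasonable_price: bool,
                          term_years: int = 95) -> bool:
    """Illustrative sketch of a Section 108(h) 'Last 20' check.

    Assumes a fixed copyright term of `term_years` from publication,
    which is a simplification; this is not legal advice and not the
    Archive's actual decision logic.
    """
    years_remaining = pub_year + term_years - current_year
    in_last_20 = 0 < years_remaining <= 20  # within the final 20 years of term
    return (in_last_20
            and not commercially_exploited
            and not obtainable_at_reasonable_price)
```

For example, under this simplification a 1941 book checked in 2019 has 17 years of term remaining, so it qualifies if no copies are being commercially exploited or sold at a reasonable price.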
We are making a determination about whether a book meets the qualifications for Section 108(h) at a particular point in time. Even with all of the issues discussed here, the accuracy of the data we can now pull about book availability and price is high. But it’s only accurate for the moment that we pull the data, because Amazon’s marketplace is constantly changing. If we don’t find a book on Amazon today, that doesn’t mean it won’t appear on the site tomorrow.
Because of this, when we make an item available to the public via Section 108(h), we write into the item’s metadata the date on which the determination was made.
Who Wants In!?
Since I’ve made this process sound SO appealing, I would imagine that any number of other library institutions are going to line up around the block wanting to try it out for themselves. Or not. But here’s the good news! If we digitize your books, the Internet Archive may be able to do the Section 108(h) determination on your behalf. Please contact us if you would like to participate.
*A MARC record is a MAchine-Readable Cataloging record. Essentially, it is the digital equivalent of the physical card from a card catalog.
As international travel becomes cheaper and easier, many of the tourists
who now swamp Venice, Barcelona, San Francisco, and Hong Kong are
visiting a foreign country for the first time. Surprised, fascinated,
and sometimes repulsed by what they see, they eagerly post to social
media their photos and impressions. Such reports are the source of much
of what we believe, consciously or unconsciously, about places we
haven’t visited yet.
Centuries ago, too, travelers were eager to tell their stories to people
back home, and those stories helped to create the images and
stereotypes that were formed about other lands and people. Many of those
stories can be found in the thousands of travel books that are
available in the text collections of the Internet Archive.
Here is a description, from a book published in London in 1701, of an Englishman’s first impressions of Paris:
Having enter’d this famous City, we were set down near the Louvre, and drop’d in first at a paltry House where the Fellow call’d himself in his Sign Le grand Voyageur,
(or great Traveller) and pretended to Speak all Languages, but could
scarce speak his own. Finding here but indifferent Accommodation, our
Man provided us a Lodging in a House, where liv’d no less than two and
twenty Families; thither we were carried in Sedans with Wheels, drag’d
along by one Man, no Hackney-Coaches being then to be had. This was on a
Sunday, and I was not a little surpriz’d to see Violins about
the Streets, and People singing and dancing every where, as if they had …
Though the language is archaic, the sentiments—bragging about visiting a
famous city, complaining about accommodation and transportation,
frowning at the local customs—would not be out of place in a tourist’s
Facebook post today.
In the early 1790s, King George III sent an envoy to the Emperor of
China. Though the diplomatic mission was unsuccessful in its main
purpose—to obtain trade concessions for Britain similar to those granted
to the Portuguese and Dutch—it yielded a three-volume official report,
by George Staunton, that contains a fascinating account of the long
voyage halfway around the world (volume 1) and of the Chinese empire as seen through British eyes (volume 2). The report also includes many carefully engraved illustrations of sights in China—the Instagram posts of the era (volume 3).
Other travelers’ accounts I’ve dipped into include Travels from St. Petersburg, in Russia, to Diverse Parts of Asia by John Bell (1763), Travels in America by George Howard (1851), and a large compendium titled Cyclopædia of Modern Travel by Bayard Taylor (1856).
Lately, I’ve also been exploring the Internet Archive’s rich collection
of books written by British and American visitors to Japan in the 19th
and early 20th centuries. Until the 1850s, Japan had been shut off
nearly completely from the rest of the world for more than two hundred
years, and people elsewhere were eager to learn about the mysterious
country. Many sailors, traders, diplomats, missionaries, journalists,
and individual travelers who were able to visit Japan wrote later about
their experiences, and I’ve compiled a list of more than 240 of their books.
I myself moved to Japan in 1983 and have lived here ever since. As I
read now the accounts of Westerners who arrived at Nagasaki or Yokohama
in 1858 or 1869 or 1880 or 1905, I recall my own vivid first impressions
of the country 36 years ago. While there are many differences—they rode
rickshas, I took commuter trains; those Victorians were shocked by the
casual nudity, this Californian was surprised by how formally people
dressed—our experiences were also similar in many ways. And those who,
as I did, stayed for more than a year or two and learned the language
gradually came to see how their initial assessments had also been
incomplete and sometimes biased.
Several times a week, I pass through the bustling Shibuya crossing in
Tokyo, and in recent years I’ve noticed more and more foreign tourists
taking pictures of that famous location. After reading travelers’
accounts from more than a century ago, I increasingly wonder how
tourists today are perceiving this country that is now my home, and I
speculate how people elsewhere, seeing those photos posted to
Twitter and Weibo,
will come to view that intersection and this country. I never would have
thought deeply about this, and I certainly wouldn’t be contrasting our
experiences with those of 19th-century visitors, if it weren’t for the
great collections of books that the Internet Archive makes available for
anyone in the world to read.
Tom Gally was born in Pasadena, California, in 1957. Since
moving to Japan, he has worked as a translator, teacher, lexicographer,
and writer. He is now a professor in the Graduate School of Arts and
Sciences at the University of Tokyo and is compiling a book of excerpts
from travelers’ accounts to be titled Japan As They Saw It. Samples can be read at the book’s website.
We usually think about archives as places of abundance. Deep, rich sites that house a multitude of perspectives. This can certainly be true, but archives are also sites of erasure, allowing some voices or perspectives to be minimized and excluded when they don’t fit into normative narratives.
Traditionally, stories involving people of color, queer people, and other historically marginalized voices have been left out of archives, or diminished, because of ignorance, homophobia, and racism. Histories aren’t “discovered” in archives; rather, we use archives to actively construct versions of history, stories that accommodate our own subjective positions and ideologies. All too frequently, these stories favor the familiar structures of oppressive power—whiteness, patriarchy, and capitalism.
Likewise, the public domain is a remarkable construction that allows us to define who is or isn’t included in normative narratives. The public domain proclaims certain material as property owned by no one; cultural material in the public domain, theoretically, belongs to everyone. As copyright law enables new content to enter the public domain each year, it’s important to look closely at which voices are amplified in the celebration of open culture. There is no actual public domain. There is no site or territory or designation that reflects an authentic condition of “making public.”
Rather, it’s a complex, evolving structure defined by the institutions that serve as portals to cultural material—museums, libraries, courts, and archives like this one. They carry a responsibility to give (or deny) access to materials that traverse in and out of the public domain. But as an institutional construct, the public domain can easily fail to reflect any true nature of “the public”; without careful consideration, access to the public domain ends up repeating and perpetuating, in a highly predictable way, the same oppressive structures that govern society and culture.
What can be done? It’s crucial that we carefully examine our archives and search for lost voices, stories of failure, non-linear trajectories, and other non-conventional perspectives. We must refuse to accept traditional timelines at face value, and work to amplify marginalized material that has otherwise gone unnoticed, or erased. When confronting an archive or any presentation of historic cultural material, it’s irresponsible not to ask urgent questions like: What forces shaped this? Who was excluded? Who else should be included here in order to better understand the material at hand? Once engaged, we can actively work to change the shape of history, giving it dimension and depth and greater representation for all who were involved. This is what I’ve been calling queer archive work.
I’m really grateful to the Internet Archive for inviting me to help shape their effort to present newly available material in the public domain. During my residency here over the last three weeks, I’ve been searching archive.org for forgotten material — in particular, evidence of African-American culture, Native American culture, early LGBTQ voices, and other artifacts from 1923 that in the past would have been forgotten or actively left out of celebrations of open access culture. If something seemed to be missing, I tried to find it elsewhere and upload it to archive.org. Remarkably, I found the first openly lesbian book of poetry ever published in North America, On A Grey Thread, by the Bay Area poet Elsa Gidlow, from 1923. It had never been digitized, but a PDF from the author’s estate was sent to me for this project and is now online, as of a few days ago.
The result is QUEER.ARCHIVE.WORK 2, 1923 INTERNET ARCHIVE EDITION. It’s an edition of 100 copies that I edited, designed, and printed myself at a small press in Berkeley, and it features 15 lesser-known historical artifacts. All of it is now available on archive.org. I’m very proud that the Internet Archive enabled me to create this project. By bringing these items together in a loose assemblage, in the form of a publication, my hope is to create a place for forgotten voices to co-mingle. I think by doing more of this work, we can challenge what we think or assume we know about the early years of the 20th century, and imagine other kinds of histories.