You could listen to multiple people recite the first 50 digits of pi in various styles, including to the tune of the Battle Hymn of the Republic (my personal favorite), in the voice of Bullwinkle, as an infomercial, in Latin, while laughing, in Morse Code, and while eating actual pie.
When you visit a public library, you get to meet the librarians and others who build and care for those collections. You know there are people who empty the garbage cans, who put back the borrowed books, who maintain the computers, and who determine what ends up on the shelf.
A digital library, on the other hand, is “just” a web site. You don’t really see the people who build it – we are often anonymous. But the Internet Archive wasn’t built by computers and algorithms.
From its inception, the Internet Archive has been built by thousands of people who understand that we have an opportunity to use the Internet to give everyone access to knowledge. Every person on the planet should have the opportunity to learn and to make a contribution.
This goal – Universal Access to All Knowledge – inspires the people who have built the Internet Archive over the past 23 years.
People clean and repair the buildings that we occupy. People do payroll, choose our health plans, answer the phones, plan our events, reply to user emails, clean up spam, and pay our bills. People design and build the computers that hold the collections. People construct the network that carries data to every corner of the world. People write software that processes, backs up, and delivers files. People design and test and build interfaces. People digitize analog media and type in metadata. People curate collections, establish collaborations, and manage projects.
There’s no way I can mention all of these people by name. Even if I listed every employee from the past 23 years, I would still be missing the volunteers, the people from other organizations who worked on joint projects with us, the pro bono lawyers, the delightfully compulsive collectors, the funding organizations, the idea generators, our sounding boards for crazy ideas, the individuals who have donated money or materials, and the hundreds of thousands of people who have uploaded media into the archive.
Libraries are built by people, for people. Thank you so much to all of the people who have contributed to building the Internet Archive, whether they were employees or part of our huge group of friends and family. We would not be here without you, and we hope you will continue to help bring universal access to all knowledge in the future.
About a year and a half ago, the Internet Archive launched a collection of older books that were determined to qualify for the “Last 20” provision in Copyright Law, also known as Section 108(h) for the lawyers. As I understand this provision, it states that published works in the last twenty years of their copyright term may be digitized and distributed by libraries, archives and museums under certain circumstances. At the time, the small number of books that went into the collection were hand-researched by a team of legal interns. As you can imagine, this is a process that would be difficult to perform one-by-one for a large and ever-growing corpus of works.
So we set out to automate it. Amazon has an API with book information, so I figured with a little data massaging it shouldn’t be too hard to build a piece of software to do that job for us. Pull the metadata from our MARC* metadata records, send it to Amazon, and presto!
I was wrong. It was hard.
Library Catalog Names Are Different from Booksellers’ Names
Library-generated metadata is often very detailed, which leads to problems when we try to match the metadata provided by librarians to the metadata used on consumer-oriented web sites. For example, an author listed in a MARC record might appear as
Purucker, G. de (Gottfried), 1874-1942
But when you look on Amazon, that same author appears as
G. de Purucker
If we search the full author from the MARC on Amazon (including full name and birth and death dates), we may miss potential matches. And this is just one simple example. We have to transform every author field we get from MARC using a set of rules that may continue to expand as we find new problems to solve. Here are the current rules just for transforming this one field:
General rules for transforming MARC author to Amazon author:
Maintain all accented or non-Roman characters as-is
If there are no commas, semicolons or parentheses in the string, use the whole string as-is
If there are no commas in the string, but there are semicolons and/or parentheses, use everything before the first semicolon or parenthesis as the entire author string
If there are commas in the string:
Everything before the first comma should be used as the author’s last name
Everything after the first comma but BEFORE any of these should be used as the author’s first name:
comma [ , ]
semicolon [ ; ]
open parenthesis [ ( ]
any number [0-9]
end of string
Remaining information should be discarded
Period [ . ] and apostrophe [ ‘ ] and other symbols should not be used to delimit any name and should be maintained as-is in the transformed string.
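As a sketch, these rules fit in a short Python function. This is an illustration of the rules listed above, not the Archive's actual production code:

```python
import re

def marc_author_to_amazon(author: str) -> str:
    """Transform a MARC author field into a consumer-style name,
    following the rules above. Accented and non-Roman characters,
    periods, and apostrophes pass through unchanged."""
    author = author.strip()
    if "," not in author:
        if ";" in author or "(" in author:
            # Keep everything before the first semicolon or parenthesis.
            return re.split(r"[;(]", author, maxsplit=1)[0].strip()
        return author  # no delimiters at all: use the whole string as-is
    # Everything before the first comma is the last name.
    last, rest = author.split(",", 1)
    # The first name runs until a comma, semicolon, open
    # parenthesis, digit, or the end of the string.
    first = re.split(r"[,;(0-9]", rest, maxsplit=1)[0].strip()
    return f"{first} {last.strip()}".strip()

marc_author_to_amazon("Purucker, G. de (Gottfried), 1874-1942")
# → "G. de Purucker"
```

Even this small sketch shows why the rule set keeps growing: every new MARC record is an opportunity to discover a delimiter convention the rules don't yet handle.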
An Account of the Saga of the Never-ending Title: as told to the author by three blah blah blahs…
Some older books have really long titles. The MARC record contains the entire title, of course! Why wouldn’t it?! But consumer-oriented sites like Amazon often carry these books with shortened or modified titles.
For example, here’s the title of a real page-turner:
American authors, 1600 – 1900 a biographical dictionary of American literature ; compl. in 1 vol. with 1300 biographies and 400 portraits
But on Amazon that title is:
American Authors 1600-1900: A Biographical Dictionary of American Literature (Wilson Authors)
As you can imagine, it’s far more difficult to reliably match books with longer titles. A human can look at those two titles and think “yeah, that’s probably the same book,” but software doesn’t work quite that well.
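One common workaround, and only a rough approximation of what a human does, is to normalize both titles and compare them with a string-similarity ratio. This sketch uses Python's standard-library `difflib`; the threshold you'd pick in practice is a judgment call, and this is not necessarily how the Archive's matching tool works:

```python
import re
from difflib import SequenceMatcher

def normalize(title: str) -> str:
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    title = re.sub(r"[^\w\s]", " ", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def title_similarity(a: str, b: str) -> float:
    """Similarity in [0.0, 1.0] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

marc_title = ("American authors, 1600 - 1900 a biographical dictionary of "
              "American literature ; compl. in 1 vol. with 1300 biographies "
              "and 400 portraits")
amazon_title = ("American Authors 1600-1900: A Biographical Dictionary of "
                "American Literature (Wilson Authors)")
title_similarity(marc_title, amazon_title)  # well above an unrelated pair
```

The catch is that any fixed threshold that accepts this pair will also accept some false matches elsewhere, which is exactly why long-title matching stays hard.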
Now that the librarians have had a laugh, let’s explain that for everybody else! Think back to the days of yore when you went to the library and looked things up in a physical card catalog. If you wanted to know where a serial or periodical was located within the library collections, you really just needed one card to tell you that. It’s on this shelf in this area and the collection contains these years.
Great! Except when you’re looking at digital versions of these serials, they are distinct entities – they have different dates, different topics, different authors sometimes, etc. And yet they often still have just one MARC record – the digital equivalent of that one card in the catalog.
And that means that the publication dates pulled from the MARC records are sometimes very wrong.
For example, we have several items from the annual series The Book of Knowledge – 1947, 1957, 1958, 1959, 1974… The date provided in the MARC file for all of these is 1940.
As you can imagine, when we are filtering texts by year for various purposes, serials are a consistent issue.
Even when we have a correct date, Amazon does not match very well on volume and other serial or periodical-based information. For example, when we search for a particular month of a magazine, we are likely to match an entirely different month of that same magazine.
Not All Metadata is Good Metadata
Unbelievably, librarians do make mistakes. Sometimes the data we have from MARC records has typos, or a MARC record for a different publication date was attached to the book. For example, we have an author named Fkorence A Huxley, but her name is really Florence. Not according to the MARC record, though! Fat finger errors don’t just happen on phones. Another example: we scanned a book originally published in 1924, and *republished* in 1971. We have the 1971 version. But the MARC record tells us it’s from 1924.
Essentially, our search is only as good as our metadata. If there are typos, or the wrong MARC record, or wrong data, our search and/or filtering will not be accurate.
Commercial APIs Are Not Built to Solve Library Problems
Amazon’s API is built to sell books to end users. Yes, it helps you find a particular book, but the other data the API contains about availability, formats, and pricing is less accurate. Because the Section 108(h) exemption for libraries turns on whether copies are being sold at a reasonable price, we need accurate availability and pricing information to determine whether a book qualifies. But Amazon’s API is incomplete in this area. So we found ourselves using the API to find a match on title and author, and then scraping the product page to get accurate availability and pricing information.
This increased the complexity of the programming required to use Amazon as a source of information, and greatly lengthened the process of building tools for this purpose.
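The two-step flow, match via the API and then scrape the page for trustworthy data, looks roughly like this in Python. The lookup and scraper are injected as hypothetical stand-ins (the post doesn't detail the real endpoints), and the price threshold is a placeholder for illustration, not the Archive's actual definition of "reasonable":

```python
def appears_unavailable_at_reasonable_price(book, api_lookup, scrape_page):
    """Sketch of the two-step check: match the book via the product API,
    then scrape the matched page for accurate availability and price.
    `api_lookup` and `scrape_page` are hypothetical stand-in callables."""
    match_url = api_lookup(book["title"], book["author"])
    if match_url is None:
        # No copy found on sale today; the marketplace changes, so this
        # determination is only valid for the date it was made.
        return True
    # Accurate availability/pricing lives on the page, not in the API.
    listing = scrape_page(match_url)
    REASONABLE_PRICE = 100.00  # placeholder threshold, illustration only
    return (not listing["available"]) or listing["price"] > REASONABLE_PRICE
```

Injecting the two calls as parameters also makes the logic testable without touching a live marketplace, which matters when the underlying data shifts daily.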
We are making a determination about whether a book meets the qualifications for Section 108(h) at a particular point in time. Even with all of the issues discussed here, the accuracy of the data we can now pull about book availability and price is high. But it’s only accurate for the moment that we pull the data, because Amazon’s marketplace is constantly changing. If we don’t find a book on Amazon today, that doesn’t mean it won’t appear on the site tomorrow.
Because of this, when we make an item available to the public via Section 108(h), we write into the item’s metadata the date on which the determination was made.
Who Wants In!?
Since I’ve made this process sound SO appealing, I would imagine that any number of other library institutions are going to line up around the block wanting to try it out for themselves. Or not. But here’s the good news! If we digitize your books, the Internet Archive may be able to do the Section 108(h) determination on your behalf. Please contact us if you would like to participate.
*A MARC record is a MAchine-Readable Cataloging record. Essentially, it is the digital equivalent of the physical card from a card catalog.
Commercial radio broadcasting began in the 1920s, bringing entertainment, news and music into people’s homes. Now, instead of needing to play a 78rpm disc on your phonograph, you could just tune in to listen to popular songs.
But why are we focusing on 1923? Because for the first time in 20 years, new works are entering the public domain in the United States (read more: 1, 2, 3). And those works were all published in, you guessed it, 1923.
We began developing a new system for counting view statistics on archive.org a few years ago. We had received feedback from our partners and users asking for more fine-grained information than the old system could provide. People wanted to know where their views were coming from geographically, and how many came from people versus robots crawling the site.
The new system will debut in January 2019. Leading up to that, over the next couple of weeks you may see some inconsistencies in view counts as the new numbers roll out across tens of millions of items.
With the new system you will see changes on both items and collections.
Item page changes
An “item” refers to a media item on archive.org – this is a page that features a book, a concert, a movie, etc. Here are some examples of items: Jerky Turkey, Emma, Gunsmoke.
On item pages the lifetime views will change to a new number. This new number will be a sum of lifetime views from the legacy system through 2016, plus total views from the new system for the past two years (January 2017 through December 2018). Because we are replacing the 2017 and 2018 views numbers with data from the new system, the lifetime views number for that item may go down. I will explain why this occurs further down in this post where we discuss how the new system differs from the legacy system.
Collection page changes
Soon on collection page About tabs (example) you will see 2 separate views graphs. One will be for the old legacy system views through the end of 2018. The other will contain 2 years of views data from the new system (2017 and 2018). Moving forward, only the graph representing the new system will be updated with views numbers. The legacy graph will “freeze” as of December 2018.
Both graphs will be on the page for a limited time, allowing you to compare your collections stats between the old and new systems. We will not delete the legacy system data, but it may eventually move to another page. The data from both systems is also available through the views API.
People vs. Robots
The graph for new collection views will additionally contain information about whether the views came from known “robots” or “people.” Known robots include crawlers from major search engines, like Google or Bing. It is important for these robots to crawl your items – search engines are a major source of traffic to all of the items on archive.org. The robots number here is your assurance that search engines know your items exist and can point users to them. The robots numbers also include access from our own internal robots (which is generally a very small portion of robots traffic).
One note about robots: they like text-based files more than audio/visual files. This means that text items on the archive that have a publicly accessible text file (the djvu.txt file) get more views from robots than other types of media in the archive. Search engines don’t just want the metadata about the book – they want the book itself.
“People” are a little harder to define. Our confidence about whether a view comes from a person varies – in some cases we are very sure, and in others it’s more fuzzy, but in all cases we know the view is not from a known robot. So we have chosen to class these all together as “people,” as they are likely to represent access by end users.
What counts as a view in the new system
Each media item in the archive has a views counter.
The view counter is increased by 1 when a user engages with the media file(s) in an item.
Media engagement includes experiencing the media through the player in the item page (pressing play on a video or audio player, flipping pages in the online bookreader, emulating software, etc.), downloading files, streaming files, or borrowing a book.
All types of engagements are treated in the same way – they are all views.
A single user can only increase the view count of a particular item once per day.
A user may view multiple media files in a single item, or view the same media file in a single item multiple times, but within one day that engagement will only count as 1 view.
Collection views are the sum of all the view counts of the items in the collection.
When an item is in more than one collection, the item’s view counts are added to each collection it is in. This includes “parent” collections if the item is in a subcollection.
When a user engages with a collection page (sorting, searching, browsing etc.), it does NOT count as a view of the collection.
Items sometimes move in or out of collections. The views number on a collection represents the sum of the views of the items that are in the collection at that time (e.g. the September 1, 2018 views number for the collection represents the sum of the views on items that were in the collection on September 1, 2018. If an item moves out of that collection, the collection does not lose the views from September 1, 2018.).
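The counting rules above can be sketched in a few lines of Python. This is an illustration of the rules as described, not the Archive's actual implementation:

```python
from collections import defaultdict

class ViewCounter:
    """Any engagement with an item's media counts as a view, but a given
    user can raise a given item's count at most once per day."""

    def __init__(self):
        self.views = defaultdict(int)  # item id -> total view count
        self._seen = set()             # (user, item, day) triples already counted

    def record_engagement(self, user, item, day):
        # Multiple engagements by the same user with the same item on
        # the same day collapse into a single view.
        if (user, item, day) not in self._seen:
            self._seen.add((user, item, day))
            self.views[item] += 1

    def collection_views(self, items):
        # A collection's count is the sum over its current members;
        # engaging with the collection page itself is never counted.
        return sum(self.views[i] for i in items)

counter = ViewCounter()
counter.record_engagement("user-a", "jerky-turkey", "2018-09-01")
counter.record_engagement("user-a", "jerky-turkey", "2018-09-01")  # same day: ignored
counter.record_engagement("user-a", "jerky-turkey", "2018-09-02")  # new day: counts
counter.views["jerky-turkey"]  # → 2
```

Note that moving an item between collections simply changes which item IDs are passed to `collection_views`; past per-item counts travel with the item.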
How the new system differs from the legacy system
When we designed the new system, we implemented some changes in what counted as a “view,” added some functionality, and repaired some errors that were discovered.
The legacy system updated item views once per day and collection views once per month. The new system will update both item and collection views once per day.
The legacy system updated item views ~24 hours after a view was recorded. The new system will update the views count ~4 days after the view was recorded. This time delay in the new system will decrease to ~24 hours at some point in the future.
The legacy system had no information about geographic location of users. The new system has approximate geolocation for every view. This geographic information is based on obfuscated IP addresses. It is accurate at a general level, but does not represent an individual user’s specific location.
The legacy system had no information about how many views were caused by robots crawling the site. The new system shows us how well the site is crawled by breaking out media access by robots (vs. interactions from people).
The legacy system did not count all book reader interactions as views. The new system counts bookreader engagements as a view after 2 interactions (like page flips).
On audio and video items, the legacy system sometimes counted views when users saw *any* media in the item (like thumbnail images). The new system only counts engagements with the audio or video media files in an item in those media types, respectively.
In some cases, the differences above can lead to drastic changes in views numbers for both items and collections. While this may be disconcerting, we think the new system more accurately reflects end user behavior on archive.org.
If you have questions regarding the new stats system, you may email us at firstname.lastname@example.org.
In honor of World Day for Audiovisual Heritage (October 27) we’d like to take you on a brief tour through seven decades of digitized music and audio recordings from 1900 through 1970. We’ve been working to digitize 78rpm discs for the Great 78 Project to preserve the heritage of the first half of the 20th century, and now we’re turning our eyes toward vinyl LPs that have fallen out of print in the Unlocked Recordings collection.
1905 – A Picnic For Two
1906 – Talmage on Infidelity (very judgy)
1912 – Till the Sands of the Desert Grow Cold
1916 – I’ll Take you Home Again, Kathleen
1920 – I Want a Jazzy Kiss (as opposed to a bluesy kiss)
1937 – A Cowboy Honeymoon (hint: includes yodeling)
1939 – The Red Army Chorus of the U.S.S.R. (when we were pals)
1945 – Don’t You Worry ’Bout That Mule (spoiler alert – he ain’t goin’ blind)
1947 – Everything is Cool (so sayeth Bab’s 3 Bips & a Bop)
1950 – When both accordions and Hi-Fi were hip
1950 – “They’re all dressed up to go swinging and, Man, they’re a gas!” (Sonny Burke from the back cover)
1957 – Amongst fierce competition, this gem wins Most Nightmare Inducing Cover Image
1958 – Dance music from Israel
1959 – This intensely sleepy version of “Makin’ Whoopee” will send you to sleep in the lounge.
1960 – My next story is a little risque (and so is the one after that)
1961 – Recorded live at the Second City Cabaret Theatre, Chicago, Ill.
1961 – Easy winner for the worst song opening we’ve ever heard, enjoy Tiger Rag from The Percussive Twenties.
1962 – Significant improvement on the Tiger Rag from the Doowackadoodlers
1963 – “Adults only” saucy comedy
1966 – Organ-ized wins best pun, as well as having “Popular songs arranged for organ” by “Brazil’s #1 Organist”
1966 – The music stylings of Mrs. Miller are not to be missed – personal favorites are “Hard day’s night” and “These boots are made for walkin'”
1966 – The “You Don’t Have to be Jewish” Players are falling in love
The Buddhist Digital Resource Center (BDRC) and Internet Archive (IA) announced today that they are making a large corpus of Buddhist literature available via the Internet Archive. This collection represents the most complete record of the words of the Buddha available in any language, plus many millions of pages of related commentaries, teachings and works such as medicine, history, and philosophy.
BDRC founder E. Gene Smith sits at the computer with Buddhist monks and others
BDRC’s founder, E. Gene Smith, spent decades collecting and preserving Tibetan texts in India before starting the organization in 1999. Since then, as a neutral organization they have been able to work on both sides of the Himalayas in search of rare texts.
Several months ago in a remote monastery in Northeast Tibet, a BDRC employee photographed an old work and sent it in to their library. It was a text that the tradition has always known about, but which was long considered to have been lost. Its very existence was unknown to anyone outside of the caretakers of the monastery that had safeguarded it for centuries.
The Kadampa school, active in the 11th and 12th centuries, was known to scholars – they knew who had started the tradition and where it fit in the history of Buddhism – but most of the writings from that period had not survived the centuries. And yet suddenly here was a lost classic of this tradition, the only surviving manuscript of the work: The exposition on the graduated path by Kadam Master Sharawa Yontan Drak (1070-1141). Dozens of pithy sayings are attributed to Sharawa in later works, but this writing of his is never directly cited in the classics of the genre that date back to the fifteenth century and before.
The exposition on the graduated path by Kadam Master Sharawa Yontan Drak (1070-1141).
BDRC’s digitizers never know what they will find when they arrive at a new location, but their work has uncovered missing links, beautiful woodblock versions of known texts, writings of previously unknown authors, and texts by famous people that they thought had been lost to time. While the manuscript above is an amazing find, it is by no means the only one their work has unearthed.
Children holding a manuscript in its box
This work highlights the importance of preserving cultures before they disappear or are too dispersed to gather together. In its efforts to make all of Buddhist literature available, BDRC is also digitizing fragile palm leaf manuscripts in Thailand, Sanskrit texts in Nepal, and the entire Tibetan collection of the National Library of Mongolia. Brewster Kahle, founder of Internet Archive, said, “In 2011 we announced that we had digitized every historic work in Balinese, and this year we are making Tibetan literature available. We hope that this is a trend that will see the literatures of many more cultures become openly available.”
Children studying Buddhist teachings
This is not an academic pursuit. Many Tibetans have left their homeland, spreading to India and around the world. Younger generations who have been displaced and raised in other societies may not have the opportunity to grow up with these traditional teachings. The work of the BDRC is to make those teachings available to everyone.
Jeff Wallman, Executive Director Emeritus of BDRC and Jann Ronis, Executive Director of BDRC, addressed their reasons for making this information available on the Internet Archive: “The founding mission of BDRC is to make the treasures of Buddhist literature available to all on the Internet. We recognize that you cannot preserve culture; you can only create the right conditions for culture to preserve itself. We hope that by making these texts available via the Internet Archive, we can spur a new generation of usage. Openness ensures preservation.”
The BDRC’s extensive collection is used by laypeople and monks alike. Karmapa Ogyen Trinley Dorje is a frequent user of their collection. He and other traveling teachers call on the BDRC’s library for references and works when they are away from their libraries, or whenever they need a rare text that they could not otherwise access.
Chokyi Nyima Rinpoche, the Abbot of Ka-Nying Shedrub Ling Monastery in Nepal, and a well regarded teacher of Tibetan Buddhism around the world, is gratified that the teachings of Buddha have been made available. “We can share the entire body of literature with every Tibetan who can use it. These texts are sacred, and should be free.”
BDRC’s home office is in Cambridge, Massachusetts, with additional offices and digitization centers in Hangzhou, China; Bangkok, Thailand; Kathmandu, Nepal; and at the National Library of Mongolia in Ulaanbaatar where it is establishing a project in collaboration with the Asian Classics Input Project (ACIP).
Internet Archive and BDRC are both delighted to join forces on sharing the Buddhist literary tradition for the benefit of humanity.
About Buddhist Digital Resource Center
BDRC is a 501(c)(3) nonprofit dedicated to seeking out, preserving, organizing, and disseminating Buddhist literature. Joining digital technology with scholarship, BDRC ensures that the treasures of the Buddhist literary tradition are not lost, but are made available for future generations. BDRC would like every monastery, every Buddhist master, every scholar, every translator, and every interested reader to have access to the complete range of Buddhist literature, regardless of social, political, or economic circumstances. BDRC is headquartered in Harvard Square in Cambridge, Massachusetts.
About Internet Archive
The Internet Archive is a 501(c)(3) nonprofit digital library based in San Francisco that specializes in offering broad public access to digitized and born-digital books, music, movies and Web pages.
Jann Ronis, BDRC, email@example.com
Jeff Wallman, BDRC firstname.lastname@example.org
Brewster Kahle, Internet Archive, email@example.com
Afghan Media Resource Center’s correspondent interviewing a Muj Commander, 1991
Journalists and others risk their lives to keep the public informed in times of conflict. War imagery provides us with important information in the moment, and creates a trove of invaluable archival content for the future.
Please be aware that this collection contains some disturbing photos of violence and its aftermath (though we have not included any in this blog post).
The Afghan Media Resource Center (AMRC) was founded in Peshawar, Pakistan, in 1987, by a team of media trainers working under contract to Boston University. The goal of the project was to assist Afghans in producing and distributing accurate and reliable accounts of the Afghan war to news agencies and television networks throughout the world. Beginning in the early 1980s, amidst a news blackout imposed by the Soviet-backed Kabul government, foreign journalists had become targets to be captured or killed. The AMRC was an effort to overcome the substantial obstacles encountered by media representatives in bringing events surrounding the Afghan-Soviet war to world attention.
An armed Muj posing for the camera, 1988
Beginning in 1987, a series of six-week training sessions was conducted at the AMRC’s original home in University Town, Peshawar, Pakistan. Qualified Afghans were recruited from all major political parties, all major ethnic groups, and all regions of Afghanistan to receive professional training in print journalism, photojournalism, and video news production. Haji Sayed Daud, a former television producer and journalist at Kabul TV before the Soviet invasion, was named AMRC Director.
After the completion of their training, 3-person teams were dispatched on specific stories throughout Afghanistan’s 27 provinces, with 35mm cameras, video cameras, notebooks, and audio tape recorders. Photo materials were distributed internationally through SYGMA and Agence France-Presse (AFP). Video material was syndicated and broadcast by VisNews (now Reuters), with 150 broadcasters in 87 countries, Euronews and London-based WTN (now Associated Press), Thames Television, ITN, and Swedish, French, Pakistani, and other regional networks.
A young girl carrying clean drinking water, 1989
In 2000 AMRC began publishing a popular and influential newspaper in Kabul: ERADA (Intention). With one interruption, ERADA publication continued until 2012.
Beyond the archive itself, the AMRC conducted dozens of training programs and workshops for writers and radio journalists, including training programs for Refugee Women in Development (REFWID). The AMRC also established radio and TV studios in the provincial capital, Jalalabad, and produced radio and TV programs, including educational radio dramas, for a variety of international organizations. AMRC also conducted public opinion polls in Afghanistan, including an extensive Media Use Survey financed by InterMedia, a Washington, D.C.-based group.
Armed Muj pulling out an unexploded missile, 1989
The AMRC collection spans a critical period in Afghanistan’s history (1987–1994) and includes 76,000 photographs, 1,175 hours of video material, 356 hours of audio material, and many stories from print media.
An Afghan weaving a carpet, 1990
In 2012 AMRC received a grant to digitize the entire AMRC archive and preserve the collection at the U.S. Library of Congress. AMRC senior media advisors Stephen Olsson and Nick Mills were trained in the digitization processes by the Library of Congress, then spent two weeks in Kabul training the AMRC staff. The digitization and metadata sheets (in English, Dari, and Pashto) were completed in 2016, and were welcomed into the Library of Congress with a formal ceremony. We are now making the entire AMRC collection available through our online partner, the Internet Archive.
Now the entire collection is readily available to scholars, researchers and publishers. All royalties for commercial use of the photo images and video material will continue to support the non-profit work of the AMRC.