Category Archives: Technical

Learning from Cyberattacks

The Wayback Machine, Archive.org, Archive-it.org, and OpenLibrary.org came back up in stages over the week after the cyberattacks, with some of the contributor features returning over the last couple of weeks. A few are still to go. Much of the development during this time has focused on securing the services so they can keep running while attacks continue.

The Internet Archive is adapting to a more hostile world, where DDoS attacks recur periodically (as they did yesterday and today) and more severe attacks might happen. Our response has been to harden our services and learn from friends. This note shares some high-level findings, without being so detailed as to help those who are still attacking archive.org.

By tightening firewall technologies, we have changed how data flows through our systems to improve monitoring and control. The downside is that these upgrades have forced changes to software, some of it quite old.

The bright side is that this is forcing upgrades we have long planned or hoped for. We are greatly helped by the free and open source community's improving tools, which large corporations and non-profit libraries alike can use because they are freely available.

Also, some commercial companies have offered assistance that would generally be prohibitively expensive.  We are grateful for the support.

Where the Internet Archive has always focused on building collections and preserving them, we have been starkly reminded how important reliable access is to researchers, journalists, and readers. This is leading us to install technical defenses and increase staff to improve service availability.

Libraries in general, and the Internet Archive in particular, have been under attack for many years now. For us it started with the book publishers suing (over lending books), and now the recording industry (over 78rpm records), which is a drain on our staff and financial resources. Now recurring DDoS attacks distract us from the goals of preserving, and providing access to, our digital heritage.

We don’t know why these attacks have started recently, or whether they are coordinated, but we are building defenses.

We are grateful for the support from our patrons, through social media, through donations, and through offers of help, which, frankly, makes it worthwhile to keep building a library for all of us.

– Brewster Kahle

In Memoriam of Python 2

Generated by Bing Image Creator

Today, on the Day of the Dead 2023, we at the Internet Archive honor the death of Python 2. Having mostly emerged from one of the greatest software upgrade SNAFUs in history—the migration from Python 2 to Python 3—we now shed a tear for that old version that served us so well.

When Python 3 was launched in 2008, it contained a number of significant improvements that nevertheless broke compatibility with the previous version of Python at the syntax, string-handling, and library levels. As terrible as this sounds, breaking changes are fairly normal for a major software upgrade.

Rather, the chaos that followed was rooted in the fact that unlike most software transitions of this sort, it could not be done incrementally. Instead of being offered a way to gradually upgrade, remaining compatible with both versions and spreading the incremental costs over time, developers were given a risky all-or-nothing choice. The result has been a reluctant, glacial, expensive migration that continues to plague the world.
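To make the incompatibility concrete, here is a tiny, illustrative sample of what changed (shown as running Python 3, with the old Python 2 spellings in comments; this is nowhere near an exhaustive list):

```python
# In Python 2:  print "hello"        (print was a statement)
#               7 / 2  gave  3       (integer division by default)
#               "text" was a byte string; u"text" was Unicode
print("hello")                     # print is now a function
assert 7 / 2 == 3.5                # / is true division
assert 7 // 2 == 3                 # // is floor division
assert isinstance("text", str)     # string literals are Unicode by default
assert isinstance(b"text", bytes)  # bytes are now a distinct type
```

Code that leaned on any of these old behaviors simply would not run under Python 3, which is why every file had to be audited rather than upgraded in place.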

At the Internet Archive, we did not begin our migration in earnest until 2021, starting with Open Library and then this year focusing on Archive.org and its underlying services. However, we are now happy to declare the migration of our core storage service, S3, which underlies all of the millions of items stored in the Archive, complete. We are grateful for the intensive efforts over many months by Chris, Scott, and Tracey, and everyone who supported them!

There are just a few more projects to go, but we are nearly there. And come our next OS upgrade, Python 2 will be but the whisper of a memory, preserved in the Archive and honored on a day like today. Rest in peace, Python 2. And please stay dead.

IMLS National Leadership Grant Supports Expansion of the ARCH Computational Research Platform

In June, we announced the official launch of Archives Research Compute Hub (ARCH), our platform for supporting computational research with digital collections. The Archiving & Data Services group at IA has long provided computational research services via collaborations, dataset services, product features, and other partnerships and software development. In 2020, in partnership with our close collaborators at the Archives Unleashed project, and with funding from the Mellon Foundation, we pursued cooperative technical and community work to make text and data mining services available to any institution building, or researcher using, archival web collections. This led to the release of ARCH, with more than 35 libraries and 60 researchers and curators participating in beta testing and early product pilots. Additional work supported expanding the community of scholars doing computational research using contemporary web collections by providing technical and research support to multi-institutional research teams.

We are pleased to announce that ARCH recently received funding from the Institute of Museum and Library Services (IMLS), via their National Leadership Grants program, supporting ARCH expansion. The project, “Expanding ARCH: Equitable Access to Text and Data Mining Services,” entails two broad areas of work. First, the project will create user-informed workflows and conduct software development that enables a diverse set of partner libraries, archives, and museums to add digital collections of any format (e.g., image collections, text collections) to ARCH for users to study via computational analysis. Working with these partners will help ensure that ARCH can support the needs of organizations of any size that aim to make their digital collections available in new ways. Second, the project will work with librarians and scholars to expand the number and types of data analysis jobs and resulting datasets and data visualizations that can be created using ARCH, including allowing users to build custom research collections that are aggregated from the digital collections of multiple institutions. Expanding the ability for scholars to create aggregated collections and run new data analysis jobs, potentially including artificial intelligence tools, will enable ARCH to significantly increase the type, diversity, scope, and scale of research it supports.

Collaborators on the Expanding ARCH project include a set of institutional partners that will be closely involved in guiding functional requirements, testing designs, and using the newly-built features intended to augment researcher support. Primary institutional partners include University of Denver, University of North Carolina at Chapel Hill, Williams College Museum of Art, and Indianapolis Museum of Art, with additional institutional partners joining in the project’s second year.

Thousands of libraries, archives, museums, and memory organizations work with Internet Archive to build and make openly accessible digitized and born-digital collections. Making these collections available to as many users in as many ways as possible is critical to providing access to knowledge. We are thankful to IMLS for providing the financial support that allows us to expand the ARCH platform to empower new and emerging types of access and research.

National Library Week 2023: Brenton, user experience

To celebrate National Library Week 2023, we are introducing readers to four staff members who work behind the scenes at the Internet Archive, helping connect patrons with our collections, services and programs.

Brenton Cheng learned to program in BASIC on an Apple II Plus at age 9. His mother was one of the earliest computer programmers and his dad was a marketing consultant for technology products in Portola Valley, California. By age 12, Cheng had written a series of animated games that he put together in a hand-assembled software package. It sold about four copies.

Now, Cheng is a senior engineer at the Internet Archive, where he leads the user experience (UX) team. “Our goal is to give our patrons a great experience on the Archive.org website while making sure that under the hood, our technologies are as simple, robust and maintainable as possible,” said Cheng, who has been at the organization for seven years.

Despite his early computer exposure, Cheng wanted to study something more tangible in college. He pursued mechanical engineering and earned a bachelor’s degree from Princeton University and a master’s from Stanford University. Along the way, he developed a love of contemporary dance and improvisation. Inspired by the creativity of movement, he veered toward biomechanical engineering in graduate school. 

Entering the job market, Cheng said he wanted a flexible schedule so he would be able to take workshops and occasionally go on tour with dance companies. He was a freelance computer programmer for about a decade, then worked at Astrology.com and NBCUniversal for another 10 years. 

In 2016, Cheng said he was drawn to the Internet Archive by its mission, reputation and people. “Being in the dance world, I was constantly surrounded with all kinds of eclectic, eccentric, fascinating, brilliant people,” he said. “There were certain common elements in the way the Archive embraces and benefits from diversity. I found many artists and engineers working in novel ways. That felt very much at home.”

From his experience working with improvisation in dance, Cheng said he loves trying to create the conditions within which people contribute their best work and feel good about what they’re doing. His team is focused on fighting for users and constantly making the website better for the public. “I also serve the digital librarians who are collecting and providing content for our patrons,” Cheng said. “I am giving them the tools, platform and environment to do their magic.” 

Tell us something about your role at the Internet Archive that most people wouldn’t know about.
Simultaneously with supporting the Archive’s mission and helping our patrons, I am always holding in the back of my mind the subtext of a “small team, long term.” These ideas guide choices around process, technologies and architecture. We regularly discard choices that would entail too much complexity or require too much on-going, hands-on maintenance. And we try to resist rushing features out the door that will only add to our technical debt later.

What is the most interesting project you’ve worked on at the Internet Archive?
I set up a wiki to allow scholars to submit transcriptions of scanned Balinese palm leaves.

What has been your greatest achievement (so far) at the Internet Archive?
Creating a team that likes working together, is resilient through conflicts and pushes each other to keep getting better.

What are you reading?
The Sense of Style by Steven Pinker. It’s a contemporary writing style manual that incorporates cognitive science and linguistics and acknowledges the evolving nature of language.

Thank you for helping us increase our bandwidth

Last week the Internet Archive upped our bandwidth capacity 30%, based on increased usage and increased financial support.  Thank you.

This is our outbound bandwidth graph that has several stories to tell…

A year ago, usage was 30Gbits/sec. At the beginning of this year, we were at 40Gbits/sec, and we were handling it. That is 13 petabytes of downloads per month. This has served millions of users accessing materials in the Wayback Machine, listening to 78rpm records, browsing digitized books, streaming from the TV archive, etc. We were about the 250th most popular website according to Alexa Internet.
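As a sanity check, the 40Gbits/sec and 13 petabytes/month figures are consistent with each other (a rough back-of-the-envelope calculation, assuming a 30-day month of sustained traffic):

```python
# Rough check: does 40 Gbit/s sustained work out to ~13 PB/month?
gbits_per_sec = 40
bytes_per_sec = gbits_per_sec * 1e9 / 8       # 5 GB/s
seconds_per_month = 30 * 24 * 3600            # 30-day month
petabytes_per_month = bytes_per_sec * seconds_per_month / 1e15
print(round(petabytes_per_month, 1))  # -> 13.0
```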

Then Covid-19 hit and demand rocketed to 50Gbits/sec, overrunning our network infrastructure’s ability to handle it. So much so that our network statistics probes had difficulty collecting data (hence the white spots in the graphs).

We bought a second router with new line cards and got it installed and running (none of which is easy during a pandemic), increasing our capacity from a 47Gbits/sec peak to a 62Gbits/sec peak. We are handling the load better, but the new capacity is still being fully consumed.

Alexa Internet now says we are about the 160th most popular website.

So now we are looking at the next steps up, which will take more equipment and is more wizardry, but we are working on it.

Thank you again for the support, and if you would like to donate more, please know it is going to build collections to serve millions.  https://archive.org/donate

Farewell to IE11

At the end of the movie “Titanic,” from her makeshift raft Rose Calvert promises Jack Dawson, “I will never let go,” but then, well, a floating board is only so big…

On June 1, we will gently release Internet Explorer, version 11, from the list of browsers supported on our website Archive.org into the oceanic depths of the obsolete. To give you an idea of what this means to us, a member of the UX team composed this little remembrance:

We hate you. Good-bye.

Why the ichor? Why the bile? No doubt one too many sleepless nights struggling to make our website layout work with this venerable browser, released in 2013, which lacks support for so many features that are now standard in today’s browsers: module imports, web components, CSS Grid, ES6, the list goes on. Like its ancestor IE6, version 11 has clung to life far longer than it should have.

Though Microsoft support for it will not officially end until 2025, Microsoft cybersecurity expert Chris Jackson recently recommended in a blog post that people stop using IE11 as their default browser. It is considered a “compatibility solution,” something you should only use for services that require it. Our analytics indicate that a mere 0.8% of our users use IE11 to browse the site. (Even worldwide usage is at 1-3%, the bulk of it from a country in which we are blocked.)

Plus, maintaining compatibility with IE11 — with its need for polyfills, transpilation, and other workarounds — gets expensive, especially for a small team such as ours. Generously supported by donations from people like you, we are committed to doing the greatest good with the resources we have, making the world’s knowledge available to as many people as possible. IE11 is a distraction, with a diminished and ever diminishing return on our efforts.

So farewell, IE11. We will sleep better and rise with a little more spring in our step, knowing that your time with us has reached its conclusion.

Two Thin Strands of Glass

There’s a tiny strand of glass inside that thick plastic coat.

Two thin strands of glass. When combined, these two strands of glass are so thin they still wouldn’t fill a drinking straw. That’s known in tech circles as a “fiber pair,” and these two thin strands of glass carry all the information of the world’s leading archive in and out of our data centers. When you think about it, it sounds kind of crazy that it works at all, but it does. Every day. Reliably.

Except this past Monday night, here in California…

On Monday, June 24, the real world had other ideas. As a result, the Internet Archive was down for 15 hours. For Californians, this was less of a big deal: those 15 hours stretched from mid-Monday evening (9:11pm on the US West coast) to 11:51am on Tuesday, and many Californians were asleep for several hours of that time. But in the Central European time zone (e.g. France, Germany, Italy, Poland, Tunisia), that fell on early Tuesday morning (06:11) to mid-Tuesday evening (20:51). And in the entire country of India, it was late Tuesday morning (09:41) to just after midnight on Wednesday (00:21).
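The time-zone arithmetic above can be reproduced with Python’s standard zoneinfo module (a sketch assuming the outage ran from 9:11pm June 24 to 11:51am June 25, 2019, Pacific time):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

pacific = ZoneInfo("America/Los_Angeles")
start = datetime(2019, 6, 24, 21, 11, tzinfo=pacific)  # 9:11pm Monday
end = datetime(2019, 6, 25, 11, 51, tzinfo=pacific)    # 11:51am Tuesday

# The same instant, seen from other time zones:
print(start.astimezone(ZoneInfo("Europe/Paris")))   # early Tuesday, 06:11 CEST
print(start.astimezone(ZoneInfo("Asia/Kolkata")))   # late Tuesday morning, 09:41 IST
print(end - start)                                  # 14:40:00, roughly 15 hours
```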


New Views Stats for the New Year

We began developing a new system for counting views statistics on archive.org a few years ago. We had received feedback from our partners and users asking for more fine-grained information than the old system could provide. People wanted to know where their views were coming from geographically, and how many came from people vs. robots crawling the site.

The new system will debut in January 2019. Leading up to that in the next couple of weeks you may see some inconsistencies in view counts as the new numbers roll out across tens of millions of items.  

With the new system you will see changes on both items and collections.

Item page changes

An “item” refers to a media item on archive.org – this is a page that features a book, a concert, a movie, etc. Here are some examples of items: Jerky Turkey, Emma, Gunsmoke.

On item pages the lifetime views will change to a new number.  This new number will be a sum of lifetime views from the legacy system through 2016, plus total views from the new system for the past two years (January 2017 through December 2018). Because we are replacing the 2017 and 2018 views numbers with data from the new system, the lifetime views number for that item may go down. I will explain why this occurs further down in this post where we discuss how the new system differs from the legacy system.

Collection page changes

Soon on collection page About tabs (example) you will see 2 separate views graphs. One will be for the old legacy system views through the end of 2018. The other will contain 2 years of views data from the new system (2017 and 2018). Moving forward, only the graph representing the new system will be updated with views numbers. The legacy graph will “freeze” as of December 2018.

Both graphs will be on the page for a limited time, allowing you to compare your collections stats between the old and new systems.  We will not delete the legacy system data, but it may eventually move to another page. The data from both systems is also available through the views API.

People vs. Robots

The graph for new collection views will additionally contain information about whether the views came from known “robots” or “people.”  Known robots include crawlers from major search engines, like Google or Bing. It is important for these robots to crawl your items – search engines are a major source of traffic to all of the items on archive.org. The robots number here is your assurance that search engines know your items exist and can point users to them.  The robots numbers also include access from our own internal robots (which is generally a very small portion of robots traffic).

One note about robots: they like text-based files more than audio/visual files.  This means that text items on the archive that have a publicly accessible text file (the djvu.txt file) get more views from robots than other types of media in the archive. Search engines don’t just want the metadata about the book – they want the book itself.

“People” are a little harder to define. Our confidence about whether a view comes from a person varies – in some cases we are very sure, and in others it’s more fuzzy, but in all cases we know the view is not from a known robot. So we have chosen to class these all together as “people,” as they are likely to represent access by end users.

What counts as a view in the new system

  • Each media item in the archive has a views counter.
  • The view counter is increased by 1 when a user engages with the media file(s) in an item.
    • Media engagement includes experiencing the media through the player in the item page (pressing play on a video or audio player, flipping pages in the online bookreader, emulating software, etc.), downloading files, streaming files, or borrowing a book.
    • All types of engagements are treated in the same way – they are all views.
  • A single user can only increase the view count of a particular item once per day.
    • A user may view multiple media files in a single item, or view the same media file in a single item multiple times, but within one day that engagement will only count as 1 view.
  • Collection views are the sum of all the view counts of the items in the collection.
    • When an item is in more than one collection, the item’s view counts are added to each collection it is in. This includes “parent” collections if the item is in a subcollection.
    • When a user engages with a collection page (sorting, searching, browsing etc.), it does NOT count as a view of the collection.
    • Items sometimes move in or out of collections. The views number on a collection represents the sum of the views of the items that are in the collection at that time (e.g. the September 1, 2018 views number for the collection represents the sum of the views on items that were in the collection on September 1, 2018. If an item moves out of that collection, the collection does not lose the views from September 1, 2018.).
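In code, the counting rules above might look something like this (a simplified sketch, not the Archive’s actual implementation; the class and method names are illustrative):

```python
from collections import defaultdict

class ViewCounter:
    """Simplified sketch of the view-counting rules described above."""

    def __init__(self):
        self.item_views = defaultdict(int)
        self._seen = set()  # (user, item, day) combinations already counted

    def record_engagement(self, user_id, item_id, day):
        # Any engagement (play, page flip, download, borrow, ...) counts
        # the same, but a user can only increase an item's count once per day.
        key = (user_id, item_id, day)
        if key not in self._seen:
            self._seen.add(key)
            self.item_views[item_id] += 1

    def collection_views(self, item_ids):
        # A collection's count is the sum of its *current* items' counts.
        return sum(self.item_views[i] for i in item_ids)
```

So two plays by the same user on the same day count as one view, the same engagement on a different day counts again, and an item’s views roll up into every collection that currently contains it.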

How the new system differs from the legacy system

When we designed the new system, we implemented some changes in what counted as a “view,” added some functionality, and repaired some errors that were discovered.  

  • The legacy system updated item views once per day and collection views once per month. The new system will update both item and collection views once per day.
  • The legacy system updated item views ~24 hours after a view was recorded.  The new system will update the views count ~4 days after the view was recorded. This time delay in the new system will decrease to ~24 hours at some point in the future.
  • The legacy system had no information about geographic location of users. The new system has approximate geolocation for every view. This geographic information is based on obfuscated IP addresses. It is accurate at a general level, but does not represent an individual user’s specific location.
  • The legacy system had no information about how many views were caused by robots crawling the site. The new system shows us how well the site is crawled by breaking out media access by robots (vs. interactions from people).
  • The legacy system did not count all book reader interactions as views.  The new system counts bookreader engagements as a view after 2 interactions (like page flips).
  • On audio and video items, the legacy system sometimes counted views when users saw *any* media in the item (like thumbnail images). The new system only counts engagements with the audio or video media files in an item in those media types, respectively.

In some cases, the differences above can lead to drastic changes in views numbers for both items and collections. While this may be disconcerting, we think the new system more accurately reflects end user behavior on archive.org.

If you have questions regarding the new stats system, you may email us at info@archive.org.

Identity in the Decentralized Web

In B. Traven’s The Death Ship, American sailor Gerard Gales finds himself stranded in post-World War I Antwerp after his freighter departs without him.  He’s arrested for the crime of being unable to produce a passport, sailor’s card, or birth certificate—he possesses no identification at all.  Unsure how to process him, the police dump Gales on a train leaving the country. From there Gales endures a Kafkaesque journey across Europe, escorted from one border to another by authorities who do not know what to do with a man lacking any identity.  “I was just a nobody,” Gales complains to the reader.

As The Death Ship demonstrates, the concept of verifiable identity is a cornerstone of modern life.   Today we know well the process of signing in to shopping websites, checking email, doing some banking, or browsing our social network.  Without some notion of identity, these basic tasks would be impossible.

That’s why at the Decentralized Web Summit earlier this year, questions of identity were a central topic.  Unlike the current environment, in a decentralized web users control their personal data and make it available to third-parties on a need-to-know basis.  This is sometimes referred to as self-sovereign identity: the user, not web services, owns their personal information.

The idea is that web sites will verify you much as a bartender checks your ID before pouring a drink.  The bar doesn’t store a copy of your card and the bartender doesn’t look at your name or address; only your age is pertinent to receive service.  The next time you enter the bar the bartender once again asks for proof of age, which you may or may not relinquish. That’s the promise of self-sovereign identity.

At the Decentralized Web Summit, questions and solutions were bounced around in the hopes of solving this fundamental problem.  Developers spearheading the next web hashed out the criteria for decentralized identity, including:

  • secure: to prevent fraud, maintain privacy, and ensure trust between all parties
  • self-sovereign: individual ownership of private information
  • consent: fine-tuned control over what information third-parties are privy to
  • directed identity: manage multiple identities for different contexts (for example, your doctor can access certain aspects while your insurance company accesses others)
  • and, of course, decentralized: no central authority or governing body holds private keys or generates identifiers

One problem with decentralized identity is that these criteria often compete, pulling in opposite directions.

For example, while security seems like a no-brainer, with self-sovereign identity the end-user is in control (and not Facebook, Google, or Twitter).  It’s incumbent on them to secure their information. This raises questions of key management, data storage practices, and so on. Facebook, Google, and Twitter pay full-time engineers to do this job; handing that responsibility to end-users shifts the burden to someone who may not be so technically savvy.  The inconvenience of key management and such also creates more hurdles for widespread adoption of the decentralized web.

The good news is, there are many working proposals today attempting to solve the above problems.  One of the more promising is DID (Decentralized Identifier).

A DID is simply a URI, a familiar piece of text to most people nowadays.  Each DID references a record stored in a blockchain. DIDs are not tied to any particular blockchain, and so they’re interoperable with existing and future technologies.  DIDs are cryptographically secure as well.

DIDs require no central authority to produce or validate.  If you want a DID, you can generate one yourself, or as many as you want.  In fact, you should generate lots of them.  Each unique DID gives the user fine-grained control over what personal information is revealed when interacting with a myriad of services and people.
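To make that concrete, here is a toy sketch of minting DIDs (the did:example method name and hex encoding are placeholders of my own; real methods such as did:key or did:web define their own identifier encodings and resolution rules):

```python
import secrets

def new_did(method: str = "example") -> str:
    # DID syntax is did:<method>:<method-specific-id>; here the id is
    # just 16 random bytes, hex-encoded, purely for illustration.
    return f"did:{method}:{secrets.token_hex(16)}"

# Generate a handful of pairwise-distinct identifiers, e.g. one per
# relationship, so no single DID links all your interactions together.
dids = [new_did() for _ in range(3)]
assert all(d.startswith("did:example:") for d in dids)
assert len(set(dids)) == 3  # random ids make collisions vanishingly unlikely
```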

If you’re interested in learning more, I recommend reading Michiel Mulders’ article on DIDs, “the Internet’s ‘missing identity layer’.”  The DID technical specification is being developed by the W3C.  And for those looking for code and community, check out the Decentralized Identity Foundation.

(While DIDs are promising, they are a nascent technology.  Other options are under development.  I’m using DIDs as an example of how decentralized identity might work.)

What does the future hold for self-sovereign identification?  From what I saw at the Decentralized Web Summit, I’m certain a solution will be found.