Let us serve you, but don’t bring us down

What just happened on archive.org today, as best we know:

Tens of thousands of requests per second for our public domain OCR files were launched from 64 virtual hosts on Amazon’s AWS services. (Even by web standards, tens of thousands of requests per second is a lot.)

This activity brought archive.org down for all users for about an hour.

We are thankful to our engineers who could scramble on a Sunday afternoon on a holiday weekend to work on this.

We got the service back up by blocking those IP addresses.

But another 64 addresses started the same type of activity a couple of hours later.

We figured out how to block this new set, but again only after an outage of about an hour.

—- 

How this could have gone better for us:

Those wanting to use our materials in bulk should start slowly, and ramp up. 
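As an illustration of "start slowly, and ramp up", here is a minimal Python sketch of a polite bulk downloader. It is not an official Internet Archive client. The function name `ramped_fetch`, the default rates, and the status-code handling are all assumptions made for the example: 429 and 503 are treated as "slow down", and 403 is treated as "you may be blocked, stop and get in touch". The `fetch` argument is any function you supply that requests one URL and returns its HTTP status code.

```python
import time
from typing import Callable, Iterable, List


def ramped_fetch(
    urls: Iterable[str],
    fetch: Callable[[str], int],          # hypothetical: returns an HTTP status code
    start_rps: float = 1.0,               # illustrative numbers, not recommended limits
    max_rps: float = 20.0,
    growth: float = 1.5,
    ramp_every: int = 50,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch URLs one at a time, starting slowly and ramping up.

    The request rate grows by `growth` after every `ramp_every`
    consecutive successes, and is halved on a 429 or 503.
    A 403 is assumed to mean "blocked": the loop stops rather
    than retrying, so a human can reach out instead.
    Returns the URLs that were fetched successfully.
    """
    rate = start_rps
    ok_streak = 0
    done: List[str] = []
    for url in urls:
        status = fetch(url)
        if status == 403:
            # Assumed "blocked" signal: do not just start again.
            break
        if status in (429, 503):
            # Server is struggling: back off, never below the starting rate.
            rate = max(start_rps, rate / 2)
            ok_streak = 0
        else:
            done.append(url)
            ok_streak += 1
            if ok_streak >= ramp_every:
                # Sustained success: speed up, never above the ceiling.
                rate = min(max_rps, rate * growth)
                ok_streak = 0
        sleep(1.0 / rate)  # pause between requests at the current rate
    return done
```

This sketch issues requests sequentially from one client and does not retry a URL that drew a 429 or 503; a real project would likely queue those for later. The point is only that the rate rises gradually and falls quickly at the first sign of trouble.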

Also, if you are starting a large project, please contact us at info@archive.org. We are here to help.

If you find yourself blocked, please don’t just start again; reach out.

Again, please use the Internet Archive, but don’t bring us down in the process.

46 thoughts on “Let us serve you, but don’t bring us down”

  1. Hasford Albrecht

    [quote]
    Tens of thousands of requests per second for our public domain OCR files were launched from 64 virtual hosts on amazon’s AWS services. (Even by web standards,10’s of thousands of requests per second is a lot.)
    [/quote]

    Thank you for taking the time to let your users know what happened.

    I was using IA earlier today and got a “server error 502” screen. I was so surprised at seeing this I felt compelled to take a screenshot of it, if only to convince myself that what I was seeing was real. And while getting a 502 error from IA probably wasn’t the End of the World, I admit I did trot over to the window to make sure the sun was still in the sky.

    (You know, just in case.)

    Admittedly, getting an error screen does help one to better appreciate when a website becomes available again, so it wasn’t all bad. But it *was* surprising – and I feel all the better for knowing what happened.

  2. Brewster Kahle (post author)

    I guess the good news for us is that this does not happen very often or you would not doubt the sun in the sky.

    Sorry that we went down. We appreciate your message here.

    -brewster

    1. Upset Librarian

      I bet your engineers can work on fixing the new search engine too, because it’s a complete MESS. The changes in your search engine go beyond aesthetics: the old one wasn’t the cutest, but it was working properly. Now you can’t get significant results using the same keywords with pretty simple queries, and this message pops up constantly: “Your search did not match any items in the Archive. Try different keywords or a more general search.”

      New users won’t notice this at all; if they were never active users of the old one, they will just think that there’s no information about what they’re looking for!

      1. Morphism

        What simple queries used to work but don’t anymore?

        Is it a bug or is it deliberate?

  3. nobody

    Pretty ironic that you’d complain about this, considering all the websites that you forced to hide behind Cloudflare or go down with your moronic scraping behavior (Pixiv being one of the examples that neutered their api because of you).

  4. Clint F.

    Any info on the zealots yet? Maybe a possible partnership or service!? (Hoping it’ll be abandoned Macintosh software based) I’m all for internet history at the couple wags of a finger, but geez; that’s at least a good 1 grand or more in AWS services! A good metaphor might be, “the vacuum of space trying to suck the moon through a bendy straw.” lmao…

  5. PRW

    It is customary, when asking not to be brought down, to address the counter-party as ‘Bruce’.

  6. Tim H

    “please contact us at info@archive.org, we are here to help.” – I wish you’d reply to my many, many requests for help in removing a website of mine I didn’t wish indexed by Wayback (it was my fault it was, my robots.txt wasn’t accessible for a period of time)

    Hard to have any sympathy when you won’t act yourselves to help others.

    I love you guys, I do. I’d hate to see you go down/away. But please, have the decency to reply to people’s emails to you.

    Tim

  7. Mowgli

    Absolute noob here, but is this the same as a DDoS attack? If so, can using something like that Cloudflare thingy (no idea what they do, though they have cool lava lamps) help mitigate it without affecting users?

    1. Kyle

      In effect, yes, in reality, it’s presumably not technically an “attack” since it seems like whoever did it was using the internet archive for a legitimate purpose, just way too much. But yeah, that’s functionally the same as a DDOS attack

  8. Duccio Dogheria

    Glad it’s all sorted out! I was scared because this afternoon I have to present our collection on the Internet Archive at a summer school on digital libraries and without access it would have been a bit difficult! 🙂 Thanks for all you are doing

  9. Ben

    Good job getting back up in such a short time, especially on a weekend. Huge congrats to the team. Also, thank you for telling us what happened! Much better than some other big companies. I took a bit of time, and found out that the IA actually has an API. Never knew about this!

    In my (not very good) opinion, I think that if the API had a dedicated tab on the main archive.org site, even if it was within the “More” section, it could help with situations like these. I assume the API is rate limited to a rate that the servers can reasonably keep up with. But again, my opinion probably has many flaws. Probably why I don’t have a website haha.

    Again, well done on getting the site back up in such short time!

  10. Muhammad Usama

    The amount of help I got from internet archive cannot be repaid by anything. Thanks for everything IA.

  11. macstrat

    Is this also impacting the Adobe Digital Editions servers? They have been kicking back E_STREAM_ERRORs on almost everything.

    1. Bro

      I am having the same issue here.
      I always get this “E_STREAM_ERROR” on Adobe Digital Editions.

  12. Stephen Skinner

    Hello!
    I was just wondering if something is affecting the borrow books area – the 14-day acsm and book download are running very slowly again.
    Thank you.

  13. Richard Reynolds

    There’s really no reason to allow traffic from AWS, Azure or Google Cloud. You should block all those networks. Most traffic that originates from them is scraping and often malicious.

    1. Kyle

      This is a terrible idea and fundamentally misrepresents what kind of traffic comes from the sources, on average. I guess I’d need to see a breakdown of overall traffic flowing to IA to be sure, but I’m guessing the vast majority is legitimate and not abusive.

    2. masterX244

      Doesn’t work in the IA context; there is legit traffic from hosting providers. Almost all data under the archiveteam collection, for example, got delivered via datacenters.

  14. Futur3Sn0w

    Transparency like this is why we need IA! Thank you for taking the time to update us, and for being so patient with those using your service!
    Much love and respect to the team at archive.org!

  15. John

    You guys are amazing! You do great work, and we greatly appreciate all of your efforts! These “train the AI with public data” projects are getting old. How hard would it have been for the trainers to distribute those downloads over the course of several days, as well as donate a good chunk to this great service? Not hard at all, I’ll tell you that for free.

  16. yuki

    I believe that you should implement a request limit per device. What is considered an abnormal number of requests from a single device?

  17. JM

    Do you think it was a DDoS attack against the IA? Do you have a back-up server or a storage facility just in case? I would’ve flipped out yesterday had I been trying to access IA and seeing that 502 error.

  18. Doug Roberts

    Is there any way that something like Cloudflare or a CDN would protect against this? I would think that the request would be made to the CDN, and then the CDN would make a single request to your server, resulting in much lower load. But if they were trying to download tens of thousands of individual files, perhaps this won’t work. The only other thing I can think of is some sort of rate throttling or download limit with a timeout. I know that sucks, but you can let a single IP address download, let’s say, five files, and then give them a cool-off of 30 minutes before they can download another one.

  19. Aaron Read

    Forgive me for perhaps sounding paranoid, but this seems like an awfully easy thing for bad-faith actors to do in order to take down archive.org at a critical time. Like, for example, the 48 hours before a national election (in the USA or elsewhere). Just set up 50 groups of 64 IP addresses and start sequentially slamming the servers with requests. Is this something that Archive.org can better defend against in the future? Is there something others can do to help?

  20. Wingedream

    Your clarification makes me feel happy that you guys are always in the trenches fighting. With that, we must be full of hope.
    Thanks

  21. G

    Are you experiencing the same problem again? I just tried to borrow a book, but even though the request went through, none of the pages would load past the ones already available in the preview. Just blank space and the swirling circle. Is this bc the server is overloaded again?

    Thank you.

  22. Tom McCanna

    Is this why recordings loaded as long ago as 25th May still lack the PLAY buttons?

  23. Simon Ephraim

    Glad you keep us informed. Happy to keep donating. Much appreciating the efforts of the IA team.
    Not that I get a cardiac arrest when you are down but I start to feel less free when the greatest public library gets closed.

  24. TM

    Using this site to register copyrights will solve the problem/personally eat syndicated media and process it into intellectual property

  25. ZV6

    Borrow for 1 hour? I really appreciate your realistic explanation.
    – Your site only showed me the first five pages of the book I selected.
    – Then I duly registered on your site as you stated, to be able to access the book for 1 hour.
    – After that, your site showed me the same, only the first five pages of the selected book.
    – I really don’t see the point of registering on your site when I can’t access the book, only the first five pages.
    Really disappointing.
    Regards

  26. Matthew McRae

    Love the updates and transparency. Thanks for the note and clear instructions on the issue at hand.
