Show HN: AWS S3 outage test from across the world

  • Is there really any way to design your application to handle S3 failures like this? S3's SLA has 99.99% availability, but is there a way to handle the remaining 0.01% so your application is not affected? Options I can think of:

      1. Using a CDN to serve files can help in some cases
      2. On-prem systems may be able to use gateway-cached volumes and read from the local disk cache instead of S3
    
    Other ideas?
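
    To make the options above concrete, here is a rough sketch of the general pattern they share: try the primary store first, then fall back to a secondary copy (a CDN, a local cache, a replica). Everything here is hypothetical: `fetch_with_fallback`, `AllSourcesFailed`, and the fetch callables are illustrative names, not part of any AWS SDK, and a real version would wrap actual S3/CDN/cache reads plus timeouts and retries.

```python
from typing import Callable, Iterable, List, Tuple


class AllSourcesFailed(Exception):
    """Raised when every configured source failed for a key."""


def fetch_with_fallback(
    key: str,
    sources: Iterable[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each (name, fetch) pair in order and return the first success.

    Returns the name of the source that answered along with the data,
    so callers can log how often the fallback path is being used.
    """
    errors: List[str] = []
    for name, fetch in sources:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure moves on to the next source
            errors.append(f"{name}: {exc}")
    raise AllSourcesFailed("; ".join(errors))


# Hypothetical stand-ins for real backends (assumptions for illustration).
def broken_primary(key: str) -> bytes:
    raise ConnectionError("primary store unavailable")


def local_cache(key: str) -> bytes:
    cache = {"logo.png": b"cached-bytes"}
    return cache[key]  # raises KeyError on a cache miss


if __name__ == "__main__":
    source, data = fetch_with_fallback(
        "logo.png",
        [("primary", broken_primary), ("cache", local_cache)],
    )
    print(source, data)
```

    This only helps for reads of objects that already exist somewhere else; writes during an outage need a different answer (queueing, or writing to a second region).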

  • "Oh no, our server made a boo boo. Please try again."

  • I just re-ran the test and got an error rate of 1.23% (vs. 41.98%): https://pulse.turbobytes.com/results/55c88a0fecbe400bf800073...

    edit: S3 seems to be back up and running according to the AWS status page.

  • Yeah, I got a 2.53% error rate, but that's nothing to worry about: with 79 servers, that's exactly 2 errors, both of them in China, which kinda makes it feel less like Amazon's fault than the Great Firewall's. Maybe there should be a more meaningful error measure than the raw failure percentage.
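
    One possible "more meaningful" measure is to break the failures down by region instead of reporting a single global percentage. The sketch below is only an illustration: the function names are made up, and the number of probes located in China (4) is an assumption, since the thread only gives 79 servers and 2 failures.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Probe = Tuple[str, bool]  # (region, succeeded)


def overall_error_rate(results: Iterable[Probe]) -> float:
    """Fraction of all probes that failed, regardless of location."""
    probes: List[Probe] = list(results)
    failed = sum(1 for _, ok in probes if not ok)
    return failed / len(probes)


def error_rate_by_region(results: Iterable[Probe]) -> Dict[str, float]:
    """Fraction of failed probes within each region."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for region, ok in results:
        totals[region] += 1
        if not ok:
            failures[region] += 1
    return {region: failures[region] / totals[region] for region in totals}


# 79 probes, 2 failures, both in China (from the comment above).
# The split of 4 China probes vs. 75 elsewhere is assumed for illustration.
sample: List[Probe] = (
    [("CN", False)] * 2 + [("CN", True)] * 2 + [("other", True)] * 75
)

if __name__ == "__main__":
    print(f"overall: {overall_error_rate(sample):.2%}")
    for region, rate in error_rate_by_region(sample).items():
        print(f"{region}: {rate:.2%}")
```

    With that assumed split, the global figure stays at 2/79, while the per-region view shows every failure concentrated in one place, which is the distinction the raw percentage hides.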

  • Got some alerts from a couple services that rely on S3 this morning. Perhaps this is related, but everything is back up for now.