Don’t scale: 99.999% uptime is for Wal-Mart

  • I wish I could delete the first two words from this article's title. Uptime's related to scalability, but it's not the same thing - scalability encompasses more than that.

    In my experience, you can get away with lower uptime but your users will crucify you for poor performance. Site down for a half hour here and there? Fine. Site responds slowly and your clients' data takes a while to update? Big, big problem.

  • Relatedly, one piece of advice from 37Signals' Getting Real book that really helped me: you can delay building many systems until after launch.

    For example, BCC has a substantial amount of functionality in the back-end interface so that I can handle common support tasks. AR has virtually nothing -- a single page which lists customer email addresses, trial statuses, and upcoming subscription renewal dates. I could have spent 2 weeks on building out a decent amount of functionality for CS and more advanced statistical navel gazing, but a) I might not pick the right stuff and b) it would mean that the release of the next feature that actually sells software would be on 3/15 instead of 3/1.

    BCC has organically grown its backend over the years, as I get so frustrated with fixing the same issue manually that I make a one-button way to do it.

  • The people at my office who actually use Highrise and have to deal with 37s' frequent bouts of downtime would beg to differ.

  • Sounds like they are trying to fluff their reliability reputation.

    I develop a web application that is used by schools and just can't entertain the notion of anything other than 100% uptime. I take the reliability of my product very, very seriously. If one of my customers had a fire at their school and couldn't access our system for registers, that would be us and them up the proverbial creek without a paddle.

    I've built up a company (over 7 years now) with a very good reputation for reliability and uptime. Don't assume that just because something is web based it doesn't require 100% uptime.

  • 98% uptime is down roughly:

    * 1 minute every hour, or

    * 3h20 every week, or

    * 1 week every year

    I know about hyperbole for making a point, but does Basecamp really total anything like a week's downtime per year? If so, why? I'm pretty sure I've never had anything like that bad a number and equally that I wouldn't be happy using a service that did.

    The general thrust is right: high reliability is expensive and you need to look at cost/benefit not chest-beating. But let's be honest about what we're actually aiming at.
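    The downtime figures above are easy to sanity-check; a quick sketch (plain arithmetic, nothing specific to Basecamp):

```python
# Allowed downtime per period at a given uptime percentage.
def downtime_hours(uptime_pct, period_hours):
    """Hours of permitted downtime per period at the given uptime %."""
    return period_hours * (1 - uptime_pct / 100)

# 98% uptime leaves 2% of every period as downtime budget:
print(downtime_hours(98, 1) * 3600)       # ~72 seconds per hour
print(downtime_hours(98, 24 * 7))         # ~3.36 hours per week
print(downtime_hours(98, 24 * 365) / 24)  # ~7.3 days per year
```

    So the round numbers in the list are in the right ballpark: 98% is closer to 72 seconds an hour than 60, and "a week every year" is almost exactly right at 7.3 days.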

  • As others have pointed out, this post is from 2005. In the past 6 years, the cost of scalability has dropped sharply, and shooting for three 9's should be the minimum for most sites. It doesn't cost thousands to go from 98% to 99% any more, and to 99.9% is still pretty cheap.

    Sure, five and six 9's do get expensive, and whether they're worth it depends on your cost of downtime (i.e., lost sales, etc.).

  • Indeed, this article doesn't talk about scaling but about uptime... But although the topic is still open for discussion in 2011, I don't think an article written in 2005 should be posted on Hacker _News_.

  • This doesn't take into account startups that have SLAs because they sell a B2B product. We have both B2C and B2B customers, and as a result we can't be down for our business customers or we have to credit them. Honestly, 99.9% uptime is not hard to manage. Pick the right colo facility with a history of good uptime. Have more than one machine, and put them on redundant power supplies (on separate PDUs). Voila: unless you screw up deploys, you have 99.9% uptime. This doesn't take a huge amount of money.
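    A back-of-envelope check on the redundancy point, assuming failures are independent (an optimistic assumption, since a shared colo, a network outage, or a bad deploy takes every machine down at once):

```python
# Availability of N redundant machines under the (optimistic)
# assumption that failures are independent; correlated failure
# modes like shared power or a botched deploy break this model.
def combined_availability(per_machine, n):
    return 1 - (1 - per_machine) ** n

print(combined_availability(0.99, 1))  # two 9's with one machine
print(combined_availability(0.99, 2))  # ~four 9's with a second machine
```

    Under that model a second 99% machine already clears 99.9% with room to spare, which is roughly the commenter's point about why the extra nine doesn't have to cost a fortune.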

  • Alistair Cockburn wrote an awesome book for small teams based on the "Crystal Clear Method". It has some great info.

    http://www.amazon.com/exec/obidos/ASIN/0201699478/ref=ase_al...

  • The criticality of your average “Web 2.0” application is “loss of comfort”: that's the worst that happens when something goes wrong.

    Which is also why your average Web 2.0 application can't charge very much: without it, your comfort level slightly drops. No big deal.

  • It's amazing how much the economics of uptime have changed since this article was written because of services like AWS.

    While it may not be technically or fiscally trivial yet, it's far easier and cheaper than it's ever been, and far more so than in 2005.

  • > To go from 98% to 99% can cost thousands of dollars. To go from 99% to 99.9% tens of thousands more.

    Somewhere in there is the pickle problem:

    You have 1000kg of pickles in your basement. Now, pickles are mostly water. In fact, your pickles are 99% water, the rest is cellulose. Cellulose has negligible mass, so we can say that all the mass comes from the water. You leave your pickles in the basement for a year, and when you come back, they've dried out a certain amount, so they're now 98% water. What's their new mass?
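    For what it's worth, the puzzle resolves cleanly once you track the dry mass instead of the water; a sketch:

```python
# The trick: the cellulose (dry mass) never changes, only the water does.
total = 1000.0               # kg of pickles
dry = total * (1 - 0.99)     # 10 kg of cellulose when pickles are 99% water
new_total = dry / (1 - 0.98) # that same 10 kg must now be 2% of the total
print(new_total)             # -> 500 kg: the pickles have halved in mass
```

    The parallel to the uptime quote is the point: 99% to 98% looks like a one-point change, but the "solid" fraction (here cellulose, there downtime) has doubled.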

  • Off topic, a sentence that really stood out for me:

    "Now what if Delicious, Feedster, or Technorati goes down for 30 minutes?"

    The article was written just six years ago, and these were the examples of popular sites that came to the author's mind. Gives you some idea of how transient this field is.

  • WoW goes down for nearly a full day every single Tuesday, and they seem to do OK.

    I agree with the post, but I don't like that it encourages settling for less than the best.

  • This would be good advice if I weren't the guy who gets the 1% downtime constantly....

  • The encouragement of frugality and pragmatism when it comes to spending on business systems is at odds with the author's publicized, flamboyant spending on frivolous ultra-luxuries. Apparently, purchasing $900,000 sports cars and houses in Italy is a higher priority for David than his customers always having access to what they're paying for.

    That's his decision obviously, but if I were a 37 Signals customer ever inconvenienced by problems with their infrastructure, I'd think of this article.