Failure is an Option

  • Fantastic.

    One of the biggest shifts in my thinking about deployment has been optimizing for low MTTR (Mean Time to Recovery) rather than high MTBF (Mean Time Between Failures). It really helps me push back against all of those appealing-but-harmful "solutions" where the theory is that if we all just think a little harder we can be perfect.

    I would still prefer zero errors. But when there's a tradeoff between that and low error impact, I'll almost always take the latter.