Lessons learned writing highly available code
Many good points there, but I think a lot of people could be misled by the advice about timeouts and retries. In my experience, which is a lot more than the author's two years, having each component retry and eventually fail on its own schedule is a terrible way to produce an available and debuggable system. Simple retries are fine and even necessary, but there should only be one part of the system making the often-complex decisions about when to declare something dead and give up. That way you don't end up with things that are half-dead or half-connected, which is a real pain in the butt.
If you are interested in highly available systems, I highly recommend the Pragmatic Programmers book "Release It!: Design and Deploy Production-Ready Software". The author includes some amusing but educational horror stories, such as debugging deadlocks in Oracle's JDBC drivers when "can't happen" Java exceptions thrown by connection close() were silently eaten.
https://pragprog.com/book/mnee/release-it
Another clever tip I read (not from this book) was to use prime numbers for recurring events and retry wait times. The idea was that prime numbers would help avoid unwanted system stampedes or synchronization.
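For illustration only (the periods below are made-up numbers): two recurring jobs only collide when the least common multiple of their periods comes around, and for distinct primes that's the product of the periods.

    # Jobs with prime-numbered periods (in seconds) coincide only every
    # lcm(p, q) seconds; for distinct primes that's p * q.
    from math import lcm

    print(lcm(30, 60))   # 60   -> 30s and 60s jobs collide every minute
    print(lcm(29, 61))   # 1769 -> prime periods collide roughly every half hour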
> 6. Prefer battle-tested tools over the “new hotness”.
Well, that won't go over well with the HN crowd, I bet! It's also the most important point in my opinion. Just because a new tool solves an old problem doesn't mean it won't have all new problems of its own. Can probably go even further:
Any popular tool or framework less than a year old has critical problems that aren't known yet. If it's not a popular tool, it will always have critical problems that aren't known.
Exponential backoff is a good idea, but to make it even better you'll want to add some random jitter to the delay period. Especially in the case of deadlocks, this helps avoid another deadlock when two things would otherwise have retried at exactly the same time.
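A minimal sketch of exponential backoff with full jitter; the base delay, cap, and attempt count are arbitrary illustrative numbers:

    import random
    import time

    def retry_with_jitter(op, attempts=5, base=0.1, cap=10.0):
        """Retry op() with exponential backoff plus full jitter."""
        for attempt in range(attempts):
            try:
                return op()
            except Exception:
                if attempt == attempts - 1:
                    raise
                # Exponential ceiling, then a uniformly random delay below it,
                # so concurrent failures don't retry in lockstep.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))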
I'm guessing MySQL doesn't support this (hence the need for a cron job), but Postgres lets you set a statement_timeout on the connection. It will force-kill queries that go beyond that timeout. I worked on an app not too long ago that occasionally had some queries go off the rails and start blocking everything. We set up Postgres to automatically kill off anything taking over 30s, and then were able to root out the issues without worrying about everything blocking on these broken queries and taking down our systems.
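A minimal sketch of what that looks like, assuming psycopg2; the connection details and query are placeholders, but statement_timeout is the real Postgres setting:

    import psycopg2

    conn = psycopg2.connect("dbname=app user=app")  # hypothetical connection details
    with conn.cursor() as cur:
        # The server itself kills any statement on this session that runs
        # longer than 30 seconds -- no external cron job required.
        cur.execute("SET statement_timeout = '30s'")
        cur.execute("SELECT count(*) FROM big_table")  # hypothetical long query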
That is good advice, born of battle-hardened experience.
I would like to add a few points.
1. Design the program as if it crashed the last time it ran. Always start up by recovering from the last crash state; a normal shutdown is simply the degenerate case. This makes restarting and retrying the program much simpler (see the sketch after this list).
2. Do retry with RANDOM exponential backoff to spread out the load. You don't want all the retries to end up on the same intervals.
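A rough sketch of the first point, assuming a made-up JSON checkpoint file and state fields:

    import json
    import os

    CHECKPOINT = "state.json"  # hypothetical checkpoint location

    def load_state():
        # Always start by recovering; after a clean shutdown the checkpoint
        # is simply fresh, so recovery is the normal startup path.
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)
        return {"last_processed_id": 0}

    def save_state(state):
        # Write-then-rename so a crash mid-write can't corrupt the checkpoint.
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, CHECKPOINT)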
At Amazon (AWS) I had started to refer to some of our control plane components as "eventually reliable"; somewhat of a pun with a good dash of truth.
How about (automated) testing? Seems the author missed an important one.
Those are nice. The exponential backoff is the exception: it was the cause of much degradation in TCP; it just left unnecessary holes in service. Amazon ditched it in exchange for jitter for the same reason but a different problem. If using backoff, I suggest doing non-exponential backoff where you tie it to likely disruption times with some error margin.
Another commenter mentioned testing. That's a good idea. Here's two articles on that:
https://queue.acm.org/detail.cfm?id=2800697
http://www.drdobbs.com/testing/testing-complex-systems/24016...
Fixed-interval + jitter can work, and exponential + jitter can work too. Be super careful with retries that don't involve jitter; depending on your system architecture you may end up with highly correlated bursts of retry load. You also need to be careful with intermediate levels in your system when retrying, and be able to reason about what's going to happen under common failure modes (you won't be able to anticipate the rare ones or freak occurrences, but you can defend in depth as best as you can).
If a low-level dependency starts to flake out (bad code/config/data push, unreliable host or network, load spike causing your storage to melt), retrying at intermediate stages can amplify the load and turn a small or medium-sized problem into a cascading failure. You can partially mitigate this by retrying 0 or 1 times and then just giving up, especially if you control all your callers or can afford that flakiness in your error budget.
Otherwise, depending on your needs, you could maintain a retry budget (don't unless you really know you need it): instead of a flat K retries before giving up, enforce a process-wide, per-operation cap on your retry rate. When the system is healthy, the overall retry rate will be low, so you have "room" to issue all K retries if needed. But if your load spikes, or your downstream dependencies start flaking out for some reason (your "fault" or otherwise), you'll avoid a positive feedback loop of death by quickly hitting your retry rate cap and shedding load that would only have made things worse and was destined to fail anyway.
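One rough way to sketch a retry budget is as a token bucket; the class name, ratio, and bucket size below are made up for illustration:

    import threading

    class RetryBudget:
        def __init__(self, ratio=0.1, max_tokens=100):
            self.ratio = ratio            # each normal request earns a fraction of a retry token
            self.tokens = max_tokens
            self.max_tokens = max_tokens
            self.lock = threading.Lock()

        def record_request(self):
            with self.lock:
                self.tokens = min(self.max_tokens, self.tokens + self.ratio)

        def can_retry(self):
            with self.lock:
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True
                return False  # budget exhausted: shed the retry instead of piling on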
EDIT: A few other remarks:
* OP's general point about putting limits on everything is really important, and closely related to the general wisdom in distributed systems development: everything can and will fail, so make the failure modes obvious and explicit. You get general robustness by avoiding runaway queries, but you also force discussions at design/development time like "what happens when this queue fills up" or "what happens if you're never able to take that lock" (though you should try to avoid needing to take locks ;)
* Instrument the hell out of the paths through your system and add lots of metrics. Logging is great, do it wherever it makes sense and you can, but a lot of the time you'll get more insight (or the same amount of insight with less time spent doing ad-hoc log analytics) from something simple like counters whose values you can track over time and plot on a graph somewhere. Examples: classify downstream dependency latency into buckets, count the times you hit a code path (maybe some users require a more complex DB query than others, or you're coding defensively and never expect a case to happen but want to know if it does), cache hit rate, top sources of traffic in your system. Eventually you'll want to optimize bottlenecks, and without data to identify them and prove they're fixed, you're flying blind.
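A small sketch of that idea, assuming prometheus_client; the metric names, buckets, and the cache/db objects are illustrative:

    from prometheus_client import Counter, Histogram

    CACHE_HITS = Counter("cache_hits_total", "Cache lookups that hit")
    CACHE_MISSES = Counter("cache_misses_total", "Cache lookups that missed")
    DB_LATENCY = Histogram(
        "db_query_seconds", "Downstream DB query latency",
        buckets=[0.01, 0.05, 0.1, 0.5, 1, 5],
    )

    def lookup(key, cache, db):
        if key in cache:
            CACHE_HITS.inc()
            return cache[key]
        CACHE_MISSES.inc()
        with DB_LATENCY.time():       # classifies query latency into the buckets above
            value = db.fetch(key)     # hypothetical DB client
        cache[key] = value
        return value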
Another rule of mine is to avoid languages/frameworks with large, complicated run-times. They introduce a huge cognitive load and go against the rule of keeping things simple.
Imgur is the only site I regularly see offline or overloaded.
I wouldn't call imgur highly available. More like highly unavailable.
Stopped reading after the story about a cron script killing long-running DB queries - it means the system is a huge mess, where killing the request is a simpler solution than finding which part of the code generates it. I've seen such systems, and I'm sure it's a hopeless state; I can only read "thankful to architecture principles" as sarcasm here.