Feb 27 2023 GCP Outage Incident Report

  • > Google's automation systems mitigated this failure by pushing a complete topology snapshot during the next programming cycle. The proper sites were restored and the network converged by 05:05 US/Pacific.

    I think this is the most understated part of the whole report. The bad thing happened due to "automated clever thing" and then the system "automagically" mitigated it in ~7 minutes. Likely before a human had even figured out what had gone wrong.

  • > During a routine update to the critical elements snapshot data, an incomplete snapshot was inadvertently shared which removed several sites from the topology map.

    I wish this went into more detail about how an incomplete snapshot was created and how the incomplete snapshot was valid-enough to sort-of work.

    I'm supposing that whatever interchange format was in use does not have any "END" delimiters (e.g. closing quotes/braces), nor any checksumming to ensure the entire message was delivered. I'm mildly surprised that there wasn't a failsafe to prevent automatically replacing a currently-in-use snapshot with one that lacks many of the services. (Adding a "type 'yes, I mean this'" user interaction widget is my preferred approach to avoid this class of problem in admin interfaces.)
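
    Purely to illustrate the kind of failsafe being suggested here (everything in this sketch is hypothetical: the wrapper format, field names, and drop threshold are made up, not taken from Google's report), a guard in front of the snapshot swap might look like:

    ```python
    import hashlib
    import json

    def validate_snapshot(raw: bytes, current_sites: set[str],
                          max_site_drop_fraction: float = 0.1) -> dict:
        """Reject a topology snapshot that is truncated, corrupt, or
        suspiciously smaller than the snapshot currently in use."""
        # Hypothetical wrapper: {"checksum": "<sha256 of payload>", "payload": "<json>"}
        wrapper = json.loads(raw)   # a truncated message fails loudly here
        payload = wrapper["payload"]

        # End-to-end integrity check: did the whole payload actually arrive?
        if hashlib.sha256(payload.encode()).hexdigest() != wrapper["checksum"]:
            raise ValueError("checksum mismatch: snapshot incomplete or corrupt")

        topology = json.loads(payload)
        new_sites = set(topology["sites"])

        # Sanity check against the snapshot being replaced: refuse to silently
        # drop a large fraction of currently serving sites.
        dropped = current_sites - new_sites
        if len(dropped) > max_site_drop_fraction * len(current_sites):
            raise ValueError(f"snapshot drops {len(dropped)} sites; "
                             "manual 'yes, I mean this' confirmation required")

        return topology
    ```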

  • I wish literally everywhere had (mandated?) detailed public RFOs like this. My residential ISP down for 20 minutes? Tell me more about the cable headend or your bad firmware push, please!

  • At Google's scale and reliability target, I would hope they have multiple independent worldwide networks.

    Each network would have its own config plane and data plane. Changes would only be made to one at a time. Perhaps even different teams would manage them, so that one rogue or hacked employee can't take down the whole lot.

    Then, if someone screws up and pushes a totally nuts config, it will only impact one network. User traffic would flow just fine over the other networks.

    Obviously there would need to be some thought about which network data gets routed over, failover logic between one network and another, etc. And that failover logic would all be site specific, and rolled out site by site, so there is again no single point of global failure (rough sketch below).
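
    To make that concrete (this is only a sketch of the commenter's proposal, with made-up network and site names, not anything described in the report): each site could carry its own ordered preference of independent backbones and fail over locally, so no single config push governs every site at once.

    ```python
    from dataclasses import dataclass

    @dataclass
    class NetworkStatus:
        name: str        # e.g. "net-a", "net-b": hypothetical independent backbones
        healthy: bool

    def pick_network(site: str,
                     per_site_preference: dict[str, list[str]],
                     status: dict[str, NetworkStatus]) -> str:
        """Choose which backbone carries this site's traffic. The preference
        order is configured and rolled out per site, so a bad change to one
        site's list (or one backbone's control plane) can't hit every site."""
        for net in per_site_preference[site]:
            if status[net].healthy:
                return net
        raise RuntimeError(f"no healthy backbone available for {site}")

    # Two sites with opposite preference orders spread traffic across both
    # networks in normal operation; losing one backbone only shifts traffic.
    prefs = {"us-east-site": ["net-a", "net-b"],
             "eu-west-site": ["net-b", "net-a"]}
    status = {"net-a": NetworkStatus("net-a", healthy=True),
              "net-b": NetworkStatus("net-b", healthy=False)}
    print(pick_network("eu-west-site", prefs, status))  # prints "net-a"
    ```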

  • Maybe the folks at Fly shouldn't feel so alone.

  • Anyone want to share what a programming cycle is? (The complete snapshot was restored with the next programming cycle)

  • To update a critical file atomically we used to first create an updated copy on the same filesystem and then rename it. Is this not possible on GCP?
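
    For anyone who hasn't seen the pattern the parent describes, a minimal sketch (assuming POSIX rename semantics, where os.replace within the same filesystem is atomic):

    ```python
    import os
    import tempfile

    def atomic_write(path: str, data: bytes) -> None:
        """Replace `path` atomically: readers see either the old file or the
        new one in full, never a partially written snapshot."""
        dir_name = os.path.dirname(path) or "."
        # Write to a temp file on the same filesystem so the rename is atomic.
        fd, tmp_path = tempfile.mkstemp(dir=dir_name)
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(data)
                tmp.flush()
                os.fsync(tmp.fileno())   # make sure the bytes hit disk first
            os.replace(tmp_path, path)   # atomic rename over the old file
        except BaseException:
            os.unlink(tmp_path)
            raise
    ```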

  • I wonder how much people in the know really believe in singling out a single root cause for these HA system failures.

  • "Automated clever thing wasn't as clever as it needed to be"

  • Oxford comma as well, how controversial can it get

  • I always think about how impossible it will be for GCP to compete with AWS. The work culture at AWS has been brutal for a decade: high standards for the work, and insane amounts of oncall and ops-reduction grind. That burn-and-churn machine is what created AWS. Google is a laid-back company with great technology, but not the culture to grind out every detail of getting cloud to really work. Microsoft is another story altogether, as it already has a ton of corporate relationships to bring in clients.