The 2020 Google Outage (Detailed Analysis)
The Backend Engineering Show with Hussein Nasser - A podcast by Hussein Nasser
0:00 Intro
1:00 Summary of the Outage
4:00 Detailed Analysis of the Incident Report

On Dec 14, 2020, Google suffered a global outage that lasted 45 minutes, during which nobody could access most Google services. Google has released a detailed incident report discussing the outage, what caused it, technical details of their internal service architecture, and what they did to mitigate it and prevent it from happening again in the future. In this video, I want to take a few minutes to summarize the report and then go into a detailed analysis. You can use the YouTube chapters to jump to the parts of the video that interest you. Pick your favorite drink, sit back, relax, and enjoy. Let's get started.

Let's start with an overview of how the Google User ID Service works. The client connects to Google's authentication service to get authenticated or to retrieve account information. Account information is stored in a distributed manner across the service's nodes for redundancy. When an update is made to an account on the leader node, the existing data on all nodes is marked as outdated; this is done for security reasons. Say you updated your credit card info, made your profile private, or deleted a comment: it would be extremely dangerous to serve that outdated information. This was the key to the outage. The updated account data is then replicated using the Paxos consensus protocol.

The User ID Service has a storage quota controlled by an automated quota management system. As the service's storage usage changes, the quota is adjusted accordingly, either reduced or increased based on demand.

So what exactly happened that caused the outage? In October 2020, Google migrated their quota management to a new system and registered the User ID Service with it. However, some parts of the old system remained hooked up, specifically the parts that read the service's usage. And because the service was now registered with the new system, the old quota system reported 0 usage, as it should. So when the new quota management system asked for the service's usage, it was incorrectly reported as 0. Nothing happened for a while because of a grace period, but that period expired in December. That's when the new quota system kicked in, saw the User ID Service with 0 usage, and started reducing its quota: you are not using it, so why waste it? The quota kept shrinking until the service had no space left. This caused updates to the leader node to fail, which caused the data on all nodes to go out of date, which in turn escalated globally into the outage we all saw.

Resource
https://status.cloud.google.com/incident/zall/20013
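To make the failure mode concrete, here is a minimal Python sketch of the feedback loop described above. This is not Google's actual code; the names (IdService, enforce_quota, reported_usage) and the byte values are made up for illustration. It only shows the idea: an automated quota manager trusts a usage reading that is wrongly 0, shrinks the quota to match, and writes to the identity store start failing.

from dataclasses import dataclass

@dataclass
class IdService:
    quota_bytes: int          # storage the service is allowed to use
    actual_usage_bytes: int   # storage it really uses

    def reported_usage(self) -> int:
        # Hypothetical stand-in for the old quota system's usage read:
        # since the service was registered with the new system, the old
        # path reported 0 usage, which was the root of the outage.
        return 0

    def write_update(self, size: int) -> bool:
        # Updates on the leader fail once there is no quota left.
        if self.actual_usage_bytes + size > self.quota_bytes:
            return False
        self.actual_usage_bytes += size
        return True

def enforce_quota(service: IdService, grace_period_over: bool) -> None:
    # Automated quota management: shrink the quota toward observed usage.
    if not grace_period_over:
        return  # nothing happened until the grace period expired in December
    usage = service.reported_usage()       # incorrectly 0
    service.quota_bytes = max(usage, 0)    # "you are not using it, why waste it?"

svc = IdService(quota_bytes=10_000, actual_usage_bytes=8_000)
enforce_quota(svc, grace_period_over=True)
print(svc.write_update(100))  # False: leader writes fail, account data everywhere is marked outdated

Once write_update starts returning False in this sketch, the analogue in the real incident is that account updates on the leader could no longer land, existing data on all nodes was treated as outdated for safety, and authentication requests failed globally.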