Oliver Leaver-Smith - On how "just a monitoring change" took down the entire site and resilience engineering - #5

Software Misadventures - A podcast by Ronak Nathani, Guang Yang - Tuesdays

Oliver Leaver-Smith, better known as Ols, is a Senior DevOps Engineer at Sky Betting and Gaming. In this episode, we discuss how a seemingly simple monitoring change ended up taking down the entire site. We also talk about chaos and resilience engineering, including how the team at Sky Betting and Gaming conducts fire drills (chaos engineering exercises) that test the resilience of both their software systems and their people systems. We walk through a recent example of a fire drill, how the drills have evolved over the past few years, and the lessons learned in the process.
