Not all production incidents are created equal. Some are routine: check the alert, review the dashboard, and follow the runbook. But others test your experience, problem-solving skills, and perseverance.

In this episode, guest Michael Abed, a former colleague of host Amin Astaneh, shares how he resolved a complex incident at Meta that stumped his team for over three days.

They also explore the connection between safety and reliability, as well as the advantages of feature flagging.

Michael, a software engineer specializing in service management and infrastructure, currently works at Datadog on release infrastructure. He can be reached at michaelabed@gmail.com.

Books referenced:

• Thinking in Systems: A Primer by Donella Meadows
• Engineering a Safer World: Systems Thinking Applied to Safety by Nancy G. Leveson