Learning from incidents

How to critically examine systems and processes to continually improve

Learning from Incidents in Software

Incidents are costly. Unless we spend time analyzing the conditions that allow an incident to take place, we won’t learn how to remove or recover from those conditions in the future. Let’s help each other learn.

Agile Retrospective Wiki

This is a resource for sharing retrospective plans, tips & tricks, tools and ideas to help us get the most out of our retrospectives. Retrospectives play a crucial role in software teams. They are time set aside specifically to reflect on how the team is performing and what can be done to improve.

Moving Past Shallow Incident Data

I believe that the confidence we have in the value of shallow data (TTR/TTD, etc.) stems from a desire to make what is actually very “messy” (the real-world evolution and handling of these events) into neater and more orderly (read: simpler to understand) categories, buckets, and signals.

Mean Time to Sleep: Quantifying the On-Call Experience

Starting an on-call rotation can be like opening a door into the unknown. You don’t know if it will be a bad week or if it will be an especially bad week. You don’t know what to expect. Thinking that historical information from past on-call rotations might yield useful insights, Etsy’s Operations team set out to quantify the on-call experience, identify what made it difficult, and use those data to reduce the incidence of pain points in an attempt to make being on call more bearable.

3 lessons learned from an Elasticsearch game day

We ran a game day to manually trigger failures in one of our Elasticsearch clusters—here’s what happened.