Murali Suriar
Mar 29, 2018, 40 tweets
Next up, following neatly from incident response: @wcgallego on "Architecting a Technical Postmortem"
#srecon
I'm a Systems Engineer at Etsy. I run many postmortems.
- Databases falling over
- Bad deploys
- The time everyone got sick
- Coffeemakers overflowed

Everything had something to learn. #srecon
Questions:
- who has never done a postmortem before?
- why do we have postmortems? (think about this through the talk)

Ask these before every meeting. They are our story times. #srecon
Story time.
- Engineer joins Etsy, does the bootcamp rotation.
- Find a deprecated file.
- Test it on their VM.
- Push to prod.
- Broke it.
- Rolled back.
- Still broken!
- Had to roll back Apache everywhere.

#srecon
Was this engineer wrong to push this?

[ed: no. Engineer did all the right things].

If we fire this person, we don't learn anything from the outage, and we throw away the expensive experience the engineer has just gained.
#srecon
Blameless postmortems.

Blameless is good - but I prefer "blame aware".

"Blameless" puts a weight on people's shoulders, because they have to be very careful about how they relate their memories of events.

#srecon
Blame aware: be aware of blame, aware it's undesirable, but don't beat yourself up if you make a mistake in how you speak. #srecon
Postmortem: "The application of a learning culture through shared discussion of our beliefs on what transpired over an agreed upon limited number of events".

#srecon
Learning culture: the main goal here is learning. Blame is a barrier for learning, but participants need to be willing to share stories and learn from them.

Predicated on the belief that we can get better. (Fixed mindset vs growth mindset).
#srecon
Shared discussion of our beliefs: the incident challenged our previous beliefs. We will never know with perfect clarity what happened. There's always more to dig in to, and our memories are faulty.

#srecon
stella.report

"As the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly." #srecon
Agreed upon, limited number of events: we have a finite amount of time to dig in to any given event. Given infinite time, we still can't understand everything. Even assuming we could, the world is continually changing underneath us. #srecon
OK, practical tips.

1. Facilitation isn't solo.
- You can get help.
- Have a scribe
- Co-facilitator.
- Let other people shadow, so you get more facilitators.

#srecon
Note: you don't need to be a system expert to facilitate a postmortem. [ed: cf. incident commander vs ops lead in incident command, see Brent Chapman's workshop]

#srecon
Next: Postmortems should be open invite.

Everyone should be invited (particularly actors), but no one should be forced to attend.

#srecon
Build the timeline

- Talk to actors
- Maybe interview people individually
- Schedule interviews within 2 days of incident
- Schedule review within 2 weeks (sooner if you can, but scheduling is hard)
- Else people stop caring.

#srecon
Timeboxing
- One hour for a postmortem review. (Scheduling is hard, and people can't focus for longer).
- 5 minute intro - ask the two questions (First PM? Why are we here?). This allows for people who wander in late.
#srecon
Timeboxing
- 35-40m for the timeline.
- Don't tell the story yourself, let the actors work through it themselves
- Makes it more participatory.

Know the inflection points of the timeline, so you can ask probing questions/prompt people to dig in. #srecon
(Inflection points in timeline are good for rough timekeeping too.)
#srecon
Allow 10 minutes for follow up, Q&A, etc.

#srecon
Any remaining time for (optional) remediation items. This is a nice side effect of postmortem review, but not the main goal. That's learning. #srecon
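[ed: a quick sanity check of the agenda arithmetic, as a minimal sketch. The slot lengths are the ones quoted above; the constant and function names are mine, not from the talk.]

```python
# Agenda for a one-hour postmortem review, using the figures quoted in the talk.
# Names are illustrative assumptions, not anything the speaker showed.
TOTAL_MINUTES = 60
INTRO_MINUTES = 5
FOLLOW_UP_MINUTES = 10


def remaining_for_remediation(timeline_minutes: int) -> int:
    """Minutes left for optional remediation items after the fixed slots."""
    used = INTRO_MINUTES + timeline_minutes + FOLLOW_UP_MINUTES
    return TOTAL_MINUTES - used


# The timeline slot was given as 35-40 minutes; print both ends of the range.
for timeline in (35, 40):
    print(f"timeline={timeline}m, remediation={remaining_for_remediation(timeline)}m")
```

So the timeline length decides how much time, if any, is left for remediation items, which fits the point that they are a side effect rather than the goal.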
As a facilitator: your main goal is to get people to open up and share lessons/knowledge. #srecon
Digging deeper
- What assumptions did we have, and how were they invalidated?
- Acting (or not acting) was believed to be the right decision at the time.
- Sources of truth - people.

Chat logs and graphs are raw data - turning that data into information needs experts to interpret it. #srecon
Documentation, alerts, graphs: when are they useful, when do we discard them?

Get knowledgeable people to say out loud what they think is common knowledge.

#srecon
Root cause is a fallacy

[ed: citizen kane applause from the front row]

Complex systems never fail due to one specific thing. Highly connected graph. #srecon
Multiple root causes?

Roots are singular. Triggers are interdependent.

Saying "root cause" leaves information on the table. #srecon
Avoid counterfactuals
- "If only..."
- "They didn't..."
- "They should have..."

You don't learn anything from counterfactuals. You're building a timeline for a world that didn't come to pass. #srecon
Defusing strong emotions

Review the timeline up front. See if there are indications of people who may feel uncomfortable in the discussion. #srecon
Talk to them ahead of time, and acknowledge times where communication in the incident was suboptimal.

Give people time to reflect and de-escalate themselves.

#srecon
Internalise Local Rationality
- People do the best they can with the information they have at the time.
#srecon
Retributive vs. Restorative culture
- Retribution is about punishment. Wrong action deserves punishment.
- Restoration is about learning, and making the system (and our understanding) better

#srecon
Back to our story.
- New senior engineer
- Deleted a file... a CSS file
- Took down Etsy.

More understandable.

[ed: ran the tests, wouldn't matter if it were a PHP file]
#srecon
- We have a builder which ships files.
- We don't ship the deleted files.
- The 404 page depends on the CSS
- 404 page generates another 404
- Cascading failure.

#srecon
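[ed: a toy model of that loop, as I understood it. This is a sketch under my own assumptions: the paths, the `serve` function and the hop limit are hypothetical and are not Etsy's actual serving stack.]

```python
# Toy model only: paths and request logic are hypothetical illustrations.
ERROR_CSS = "/error.css"  # assumed name for the CSS file the 404 page needs
MAX_HOPS = 5              # artificial cut-off so the toy example terminates


def serve(path: str, shipped: set[str], hops: int = 0) -> list[str]:
    """Return the chain of responses triggered by requesting `path`."""
    if hops >= MAX_HOPS:
        return [f"{path} -> gave up after {MAX_HOPS} hops"]
    if path in shipped:
        return [f"{path} -> 200"]
    # Missing file: the 404 page is rendered, and it requests the CSS file.
    return [f"{path} -> 404"] + serve(ERROR_CSS, shipped, hops + 1)


# CSS shipped: one 404, then the error page's CSS loads and the chain ends.
healthy = serve("/missing", {ERROR_CSS})

# CSS deleted and therefore not shipped: every 404 page triggers another 404.
cascading = serve("/missing", set())
```

With the CSS present the chain stops after two responses; without it, the chain only ends because of the artificial hop limit.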
Takeaways:
Postmortems are not a bubble. Need blame awareness everywhere, not just in the room.
#srecon
Takeaways:
You're going to be biased. This is natural. Learn from your biases; call them out in non-blameful ways. Get better when you run across them.
#srecon
Your systems are constantly failing. This is a universal truth.
#srecon
"Failure is not an option" --> correct, it's not optional. It will happen. #srecon
Key takeaway:
- Incidents can *always* be worse

#srecon
Question: how do you make employees believe in a blame free culture?

Answer: try to start with small circles. Start with your team, people you trust. People you're comfortable talking to. Then widen it. #srecon
Question: you said remediation is not a necessary part of the postmortem. Talk more about it? How do you talk about prevention? When does the remediation happen if not as part of the postmortem?

Answer: Different framing. Lack of remediation items isn't a failed postmortem. #srecon
