Murali Suriar
Mar 29, 2018
And lunch is done. 3 more talks before the closing plenaries.

Kicking off track 2 this afternoon, @jpaulreed on "Whispers in Chaos: Searching for Weak Signals in Incidents" #srecon
"Chaos?!"

(Incidents)
#srecon
How do you know an incident is going on?

[ed: I get paged!]
#srecon
How do you know what to do during an incident? That's what we're going to be talking about.

MSc candidate: human factors and systems safety.

#srecon
I'm from a build and test background: so SRE = Super Release Engineer. :)

#srecon
So, incident. What do?

Look at how our brains work.
#srecon
Two brain systems

System 1:
- Automatic/quick
- No effort
- Not deliberate/voluntary

System 2:
- "Effortful"
- Complex computation
- Agency, choice, concentration.

From "Thinking, Fast and Slow" by Daniel Kahneman. #srecon
System 1:
- Look to a sudden sound
- 2+2
- Finding a strong move in chess (if you're a chess master)

System 2:
- Focus on a voice in a crowd
- Filling tax forms

System 2: needs focused, continued attention. #srecon
It's more complicated than that, however. Let's look at @allspaw's thesis and his findings.

Looking at "the incident" on December 4th 2014. (Black Friday at Etsy.)

#srecon
Timeline. Dekker does a good job of putting different data on a timeline. #srecon
Allspaw looked at a bunch of data (internal system logs, IRC logs, etc etc).

#srecon
Timeline of "IRC utterances".

#srecon
Found 3 heuristics that engineers use:
1. Changes. (What's changed since the last good state?)

Everyone did that, but no-one had pushed a change (because Black Friday). So the answer was no. (But typically it yields a good solution.)

OK, what next?
#srecon
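A minimal sketch of the "what's changed since the last good state?" heuristic, assuming a single merged log of changes is available. The `Change` record, the `changes_since` helper, and the sample entries are all hypothetical illustrations, not anything shown in the talk. The log deliberately holds more than deploys, since content edits are changes too.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    """Hypothetical change record: deploys, config pushes, content edits."""
    when: datetime
    kind: str
    summary: str


def changes_since(changes, last_good):
    """Return changes made after the last known-good time, newest first."""
    recent = [c for c in changes if c.when > last_good]
    return sorted(recent, key=lambda c: c.when, reverse=True)


# Made-up sample data, for illustration only.
log = [
    Change(datetime(2020, 1, 2, 9, 0), "deploy", "code push"),
    Change(datetime(2020, 1, 3, 10, 30), "content", "blog post published"),
]

suspects = changes_since(log, last_good=datetime(2020, 1, 3, 8, 0))
for c in suspects:
    print(c.when.isoformat(), c.kind, c.summary)
```

An empty result does not clear every change: it only clears the kinds of change the log records.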
2. "Go wide"

Widen the search to any potential contributor.

Once we find a plausible hypothesis, we go very deep to validate. Then we pop back up and go wide again.

[ed: this is why coming up with testable hypotheses is valuable during outages]

#srecon
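The go-wide, go-deep, pop-back-up loop can be sketched as a simple search. This is an illustration under stated assumptions: `diagnose`, `is_plausible`, `validate`, and the candidate names are hypothetical, and real responders work in parallel rather than in one ordered loop.

```python
def diagnose(candidates, is_plausible, validate):
    """Scan candidates broadly; investigate deeply only the plausible ones."""
    for candidate in candidates:          # go wide
        if not is_plausible(candidate):   # cheap check, skip quickly
            continue
        if validate(candidate):           # go deep: expensive validation
            return candidate
        # Hypothesis disproved: pop back up and keep going wide.
    return None


# Made-up candidates, for illustration only.
candidates = ["recent deploy", "database load", "cache misses", "network"]
plausible = {"database load", "cache misses"}
confirmed = {"cache misses"}
checked_deeply = []


def validate(candidate):
    """Record which hypotheses cost a deep dive, then test them."""
    checked_deeply.append(candidate)
    return candidate in confirmed


result = diagnose(candidates, lambda c: c in plausible, validate)
print(result, checked_deeply)
```

Only the plausible hypotheses cost a deep dive, which is why a cheap, testable plausibility check matters.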
3. Convergent searching

Confirm/disqualify
- A specific and past diagnosis (a really painful incident memory)
- A general but recent diagnosis (an incident still in your L1 cache, from the last 1-2 months)

#srecon
The incident:
- Page load time increase
- .. CDN cache misses ..
- because of HTTP 400 status in an API ...
- From a "closed" store ...
- Referenced by a blogpost in the sidebar.

#srecon
So there was no infrastructure configuration change - a user (an Etsy employee) pushed user content (a blog post). Not a "traditional" change.

#srecon
Interesting observations:

- Bob asks "is this a frozen shop?"
- Chases this down on their own.
- Bob had a "frozen shop" outage when they first started.

- Alice: Varnish queuing?
- In the previous 4-6 weeks, Etsy had a lot of Varnish cache incidents.

#srecon
Bonus heuristic: testing the fix

Would you always wait for the test to pass? Depends.

[ed: rollbacks should always be safe]

#srecon
How do you get better at detecting incidents?

Monitor things better.

How do you get better at responding to incidents? That's harder.

#srecon
Elements of "expertise"

Experts use knowledge:
- Recognise "typical"
- Make fine discriminations
- Use mental simulation (e.g. firefighters at a burning building)

Knowledge base used to apply higher level rules: know when to break rules.

(Research, Hoffman & Klein)
#srecon
"Seeing the invisible"

"Experts are able to see what is not there."

Seeing the Invisible: Perceptual-Cognitive Aspects of Expertise, Klein & Hoffman #srecon
"The Role of Deliberate Practice in the Acquisition of Expert Performance" -- K. Anders Ericsson

What makes people experts? Effortful, deliberate practice. "flow".

#srecon
Note: Ericsson's research was about relatively stable systems. Sport, music, chess, etc.

This is why chess becomes system 1.

Are the things we work on stable? #srecon
Look at other experts. Sully Sullenberger.

- Why start the APU? (Wasn't on checklist, just knew)
- Took control of the aircraft. He had logged a lot of glider time, and 5k hours in that aircraft. 1st officer on first A320 ride.
- Don't land at LGA.

#srecon
Expertise in ops:

It's not DNS
There's no way it's DNS
It was DNS

#srecon
Expertise: good enough to mentor someone else.

- Personal experience: oncall
- Directed: training/code review
- Manufactured: Game days, wheel of misfortune
- Vicarious: "I remember this one time it was a DNS outage..."

#srecon
Exploring discretionary spaces. Another take on the Rasmussen Triangle.

Pressure gradients:
- Cheaper/better/faster
- Maximum work for least effort

Pushing us towards the unsafe edge. #srecon
Good incident responders get good at operating in the blue area above. #srecon
This is why monitoring and incident response are at the bottom of the SRE hierarchy of needs. #srecon
Why postmortems (Etsy):
- Did at least one person learn a way to avoid the thing?
- Will half of people come to another postmortem debrief?

#srecon
- Practice makes better.
- Expertise takes time and space.
- It's just us out here.

#srecon
[ed: comment - rollbacks are different from fix forwards, in terms of how much testing you need] #srecon
Question: how do you correlate with longer ago changes?

Answer: it's more difficult. You need to foster people's hunches and heuristics.
#srecon
