Liz Fong-Jones (方禮真)

@lizthegrey

Jun 29, 2018 • 48 tweets • 19 min read • Read on X

Closing out #QConNYC in the chaos track is @otterbook on "turning it off and on again". He's promised to make my life "interesting" as a livetweeter so let's see how this goes.

.@otterbook also has the best pre-talk dramatic/heroic amp-up music, which I'm digging. #QConNYC

This is going to be a very high audience participation talk, and is an experimental talk, says @otterbook. #QConNYC

Premise of the talk: we have a group of people who are operating computers and turning them into a resilient system. How we do this affects the resiliency of the system. #QConNYC

Right now, @otterbook giving a talk onstage is his production environment for shipping the talk. #QConNYC

The best metaphor @otterbook has for production is that... we're manually flipping switches, only for the machines to flip them back to the previous state. Is this resilience? [ed: well, it depends whether you think flipped or unflipped is the correct state?] #QConNYC

and now he cuts to Q&A! bold. Q: why do you think there's a high chance of failure? A: no idea how it will fail a priori. #QConNYC

[ed: I asked my question from a few tweets up. He said, "on is better than off" but... what's on? off?] #QConNYC

Q: is this toy available? A: yes. Q: can we introduce redundancy? A: Sure, feel free to join in giving the talk. Okay, end of questions. Soliciting 5 volunteers to get their phones out and get a timer app to go off between now and 35 minutes from now. #QConNYC

Ground rules: (1) interactive exercises can be opted out. (2) SREs should not jump to answer questions if called on. #QConNYC

Doing a level set about SRE. It's an engineering discipline to help bring your services to a desired level of reliability. Why reliability? Because your software can have amazing features, but if it's not up, it's not useful to anyone. #QConNYC

.@otterbook will talk later about why "desired level" and not "perfect".

It's a discipline to engineer failure out of our systems. It solves the same problem DevOps does: the tug of war between features and reliability. #QConNYC

And there are three books about it. It's not just Google; more than a dozen brands you've heard of have SREs (or people like SREs e.g. Production Engineers, Production Infrastructure Engineers) #QConNYC

Some people present the idea that SRE is the next evolutionary step from sysadmin->devops->SRE "and now oh boy we can use tools". But @otterbook hopes nobody here will think that way. #QConNYC

Instead, think that they both evolved in similar environments to solve the same problems. If you were to, say, go to the @GCPcloud blog, you might find your lovely editor [ed: hi!] talking about DevOps vs SRE. #QConNYC

So @otterbook founded #SREcon. And Ben Treynor was there in 2014 presenting the "Keys to SRE" with a slide with a dozen points. #QConNYC

The most important three points from that: Have an SLA, measure the SLA, and gate launches on the error budget. #QConNYC

What's an error budget? We need to have an SLI defining what's up or not, and a target (say, being up 80% of the time). That pair is your SLO. We monitor the SLOs and SLIs. #QConNYC

And now we decide based on how we're doing with the SLO. If we're up 90% and our target is 80%, we can perturb the system however we want. But if it's only been up 60%, then we need to slow down and understand how to improve it. #QConNYC
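As a rough illustration of the decision described above, here is a minimal Python sketch using the talk's example numbers (80% target, 90% or 60% measured). The function names `error_budget_remaining` and `can_perturb` are hypothetical, chosen for this sketch, and not from the talk or any SRE tooling.

```python
def error_budget_remaining(slo_target: float, measured: float) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target and measured are fractions of "up" per the SLI (0.80 = 80%).
    A negative result means the budget is overspent.
    """
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be in [0, 1); a 100% target leaves no budget")
    budget = 1.0 - slo_target   # unreliability we are allowed to spend
    spent = 1.0 - measured      # unreliability actually observed
    return (budget - spent) / budget


def can_perturb(slo_target: float, measured: float) -> bool:
    """Hypothetical launch gate: allow risky changes only while budget remains."""
    return error_budget_remaining(slo_target, measured) > 0


if __name__ == "__main__":
    for measured in (0.90, 0.60):
        left = error_budget_remaining(0.80, measured)
        verdict = "ship" if can_perturb(0.80, measured) else "slow down"
        print(f"measured={measured:.0%} budget_left={left:+.0%} -> {verdict}")
```

With an 80% target, being up 90% leaves about half the budget unspent, while being up 60% overspends it, which is the "slow down and understand how to improve it" case.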

We have a budget from which we can draw unreliability. Here's what's nice about SRE: it sets up virtuous and reinforcing feedback loops. #QConNYC

For instance, we need to have blameless postmortems. Not who to blame, but what process failures made it easy to accidentally nuke the entire North American fleet? #QConNYC

[ed: alarm going off.] @otterbook says it's time to ship, opens up a box and finds... a child's toy. [ed: another alarm goes off] #QConNYC

[ed: He forces a shape into the wrong slot with a hammer, and continues on with the talk.] "You can't fire your way to reliable." #QConNYC

So now getting to the actual talk: which operations practices make the system more resilient? It's not a complete list, but a few ideas to start off. #QConNYC

(1) the nature of the work. @tmu said, "Stop feeding blood to the machines." Things like transactional systems administration involving tickets don't lead towards resilience. #QConNYC

We need some intermediation around the complexities of the system that helps us not just pass along the burden we already have. #QConNYC

[ed: another alarm goes off. volunteer is now being asked to copy digits of pi off a website with pen and paper until told they can stop] #QConNYC

The notion of toil: work that doesn't add value once it's completed, no matter how many times we do it. Using the volunteer as an example. #QConNYC

You don't get resiliency if your system requires toil to operate. It occupies your humans with things they don't need to do, and burns them out. It disrupts your org and misuses your resources. #QConNYC

We need to limit non-project work to less than 50% of our time. "Resilience is a long game that we have to work at." --@otterbook #QConNYC

[ed: @otterbook's volunteer stopped, and the next person is now being told to check the first volunteer's work and then resume copying digits of pi.] #QConNYC

We need to consider the impact upon people and not burn out our people. Resilience and the cult of heroism around operations are in conflict with each other. "This hero worship has got to stop." --@otterbook #QConNYC

Point 2: interfaces. APIs are crucial, and APIs between people are critical. The error budget *is* an interface between people. We define "up" in advance, rather than coming into conflict over it. #QConNYC

[ed: another alarm goes off. @otterbook asks two rows to switch seats] "faster, we're losing money, we have to migrate these jobs..." #QConNYC

Inclusion is important as another part of resilience. We need to have the right people in the room. #QConNYC

Point 3: data is crucial to resilience. It's a tool for cartography. Where are the pitfalls? How can we get from A to B? Are we where we want to be now? #QConNYC

Data's value isn't for us to expose to Cambridge Analytica, it's to help us figure out the lay of the land. #QConNYC

Point 4: There's room for error. Some people think that error is bad and we should get it all out of the system. But to SREs, errors help us learn about the system. #QConNYC

They're an expected part of the system.

Point 5: ambiguity. There's a fine line between resilience and ambiguity. Some think that it's important to understand as much as we can... #QConNYC

[ed: a phone goes off, and @otterbook asks the rows to swap back to where they were before] #QConNYC

... but we need to embrace ambiguity. We can treasure it and figure our way through things instead of having perfect clarity. #QConNYC

Taking audience suggestions for other principles -- someone suggests "removing human error" and the audience groans. #QConNYC

Other suggestions: "removing computer error", and "anti-fragility". @otterbook writes them down but notes he will disagree vigorously with all of them. #QConNYC

"Providing feedback" is named and agreed with, and "observability" and "logical" are added too. #QConNYC

(1) and (3) are impossible, (2) is unlikely, and (5) is hard to achieve, so that just leaves (4). #QConNYC

Did people enjoy this? Verbal yesses [ed: except maybe not the person writing digits of pi.] We have choices about how we operate systems, and should be thoughtful. #QConNYC

The intention of the exercises was to have people physically, viscerally feel the experience of being oncall. "did you stop writing down numbers? I'm so sorry." #QConNYC

[fin] [ed: And that's the end of #QConNYC. This is your editor, out.]
