Closing out #QConNYC in the chaos track is @otterbook on "turning it off and on again". He's promised to make my life "interesting" as a livetweeter so let's see how this goes.
This is going to be a very high audience participation talk, and is an experimental talk, says @otterbook. #QConNYC
Premise of the talk: we have a group of people who are operating computers and turning them into a resilient system. How we do this affects the resiliency of the system. #QConNYC
Right now, @otterbook giving a talk onstage is his production environment for shipping the talk. #QConNYC
The best metaphor @otterbook has for production is that... we're manually flipping switches, only for the machines to flip them back to the previous state. Is this resilience? [ed: well, it depends whether you think flipped or unflipped is the correct state?] #QConNYC
and now he cuts to Q&A! bold. Q: why do you think there's a high chance of failure? A: no idea how it will fail a priori. #QConNYC
[ed: I asked my question from a few tweets up. He said, "on is better than off" but... what's on? off?] #QConNYC
Q: is this toy available? A: yes. Q: can we introduce redundancy? A: Sure, feel free to join in giving the talk. Okay, end of questions. Soliciting 5 volunteers to get their phones out and get a timer app to go off between now and 35 minutes from now. #QConNYC
Ground rules: (1) interactive exercises can be opted out. (2) SREs should not jump to answer questions if called on. #QConNYC
Doing a level set about SRE. It's an engineering discipline to help bring your services to a desired level of reliability. Why reliability? Because your software can have amazing features but if it's not up, it's not useful to anyone. #QConNYC
.@otterbook will talk later about why it's "desired level" and not "perfect".
It's a discipline to engineer failure out of our systems. It solves the same problem DevOps does: the tug of war between features and reliability. #QConNYC
And there are three books about it. It's not just Google; more than a dozen brands you've heard of have SREs (or people like SREs e.g. Production Engineers, Production Infrastructure Engineers) #QConNYC
Some people present the idea that SRE is the next evolutionary step from sysadmin->devops->SRE "and now oh boy we can use tools". But @otterbook hopes nobody here will think that way. #QConNYC
Instead, think that they both evolved in similar environments to solve the same problems. If you were to, say, go to the @GCPcloud blog, you might find your lovely editor [ed: hi!] talking about DevOps vs SRE. #QConNYC
So @otterbook founded #SREcon. And Ben Treynor was there in 2014 presenting the "Keys to SRE" with a slide with a dozen points. #QConNYC
The most important three points from that: Have an SLA, measure the SLA, and gate launches on the error budget. #QConNYC
What's an error budget? We need to have an SLI defining what's up or not, and a target (say, being up 80% of the time). That pair is your SLO. We monitor the SLOs and SLIs. #QConNYC
And now we decide based on how we're doing with the SLO. If we're up 90% and our target is 80%, we can perturb the system however we want. But if it's only been up 60%, then we need to slow down and understand how to improve it. #QConNYC
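[ed: a minimal sketch of that decision in Python. The names `SLO`, `error_budget_remaining`, and `launch_gate` are mine, not from the talk, and the 80%/90%/60% figures are just the illustrative numbers from the tweets, not real targets.]

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """An SLI (what counts as 'up') paired with a target."""
    sli_name: str   # hypothetical label for the indicator being measured
    target: float   # desired fraction of time up, e.g. 0.80


def error_budget_remaining(slo: SLO, measured: float) -> float:
    """Unreliability still available to spend; negative means overspent."""
    budget = 1.0 - slo.target   # unreliability the target allows
    spent = 1.0 - measured      # unreliability actually observed
    return budget - spent


def launch_gate(slo: SLO, measured: float) -> str:
    """Gate launches on the error budget."""
    if error_budget_remaining(slo, measured) > 0:
        return "ship"
    return "slow down"


slo = SLO(sli_name="up", target=0.80)
print(launch_gate(slo, 0.90))  # budget left over
print(launch_gate(slo, 0.60))  # budget overspent
```

[ed: the point of the sketch is that the gate is a comparison agreed in advance, not a negotiation at launch time.]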
We have a budget from which we can draw unreliability. Here's what's nice about SRE: it sets up virtuous and reinforcing feedback loops. #QConNYC
For instance, we need to have blameless postmortems. Not who to blame, but what process failures made it easy to accidentally nuke the entire North American fleet? #QConNYC
[ed: alarm going off.] @otterbook says it's time to ship, opens up a box and finds... a child's toy. [ed: another alarm goes off] #QConNYC
[ed: He forces a shape into the wrong slot with a hammer, and continues on with the talk.] "You can't fire your way to reliable." #QConNYC
So now getting to the actual talk: what operations practices make the system more resilient? It's not a complete list, but a few ideas to start off. #QConNYC
(1) the nature of the work. @tmu said, "Stop feeding blood to the machines." Things like transactional systems administration involving tickets don't lead towards resilience. #QConNYC
We need some intermediation around the complexities of the system that helps us not just pass along the burden we already have. #QConNYC
[ed: another alarm goes off. volunteer is now being asked to copy digits of pi off a website with pen and paper until told they can stop] #QConNYC
The notion of toil: work that doesn't add value once it's completed, no matter how many times we do it. Using the volunteer as an example. #QConNYC
You don't get resiliency if your system requires toil to operate. It will occupy your humans with things they don't need to do, and burn them out. It disrupts your org and misuses your resources. #QConNYC
We need to limit non-project work to less than 50% of our time. "Resilience is a long game that we have to work at." --@otterbook #QConNYC
[ed: @otterbook's volunteer stopped, and the next person is now being told to check the first volunteer's work and then resume copying digits of pi.] #QConNYC
We need to consider the impact upon people and not burn out our people. Resilience and the cult of heroism around operations are in conflict with each other. "This hero worship has got to stop." --@otterbook #QConNYC
Point 2: interfaces. APIs are crucial, and APIs between people are critical. The error budget *is* an interface between people. We define up in advance, rather than coming into conflict over it. #QConNYC
[ed: another alarm goes off. @otterbook asks two rows to switch seats] "faster, we're losing money, we have to migrate these jobs..." #QConNYC
Inclusion is important as another part of resilience. We need to have the right people in the room. #QConNYC
Point 3: data is crucial to resilience. It's a tool for cartography. Where are the pitfalls? How can we get from A to B? Are we where we want to be now? #QConNYC
Data's value isn't for us to expose to Cambridge Analytica, it's to help us figure out the lay of the land. #QConNYC
Point 4: There's room for error. Some people think that error is bad and we should get it all out of the system. But to SREs, errors help us learn about the system. #QConNYC
They're an expected part of the system.
Point 5: ambiguity. There's a fine line between resilience and ambiguity. Some think that it's important to understand as much as we can... #QConNYC
[ed: a phone goes off, and @otterbook asks the rows to swap back to where they were before] #QConNYC
... but we need to embrace ambiguity. We can treasure it and figure our way through things instead of having perfect clarity. #QConNYC
Taking audience suggestions for other principles -- someone suggests "removing human error" and the audience groans. #QConNYC
Other suggestions: "removing computer error", and "anti-fragility". @otterbook writes them down but notes he will disagree vigorously with all of them. #QConNYC
"Providing feedback" is named and agreed with, and "observability" and "logical" are added too. #QConNYC
(1) and (3) are impossible, (2) is unlikely, and (5) is hard to achieve, so that just leaves (4). #QConNYC
Did people enjoy this? Verbal yesses [ed: except maybe not the person writing digits of pi.] We have choices about how we operate systems, and should be thoughtful. #QConNYC
The intention of the exercises was to have people physically, viscerally feel the experience of being oncall. "did you stop writing down numbers? I'm so sorry." #QConNYC
[fin] [ed: And that's the end of #QConNYC. This is your editor, out.]
Final talk I'll be getting to at #VelocityConf before I dash to Toronto: @IanColdwater on improving container security on k8s.
@IanColdwater She focuses on hardening her employer's cloud container infrastructure, including doing work on k8s.
She also was an ethical hacker before she went into DevOps and DevSecOps. #VelocityConf
She travels around doing competitive hacking with CTFs. It's important to think like an attacker rather than assuming good intents and nice user personas that use our features in the way the devs intended things to be used. #VelocityConf
My colleague @sethvargo on microservice security at #VelocityConf: traditionally we've thought of security as all-or-nothing -- that you put the biggest possible padlock on your perimeter, and you have a secure zone and untrusted zone.
@sethvargo We know that monoliths don't actually work, so we're moving towards microservices. But how does this change your security model?
You might have a loadbalancer that has software-defined rules. And you have a variety of compartmentalized networks. #VelocityConf
You might also be communicating with managed services such as Cloud SQL that are outside of your security perimeter.
You no longer have one resource, firewall, loadbalancer, and security team. You have many. Including "Chris." #VelocityConf
The problems we're solving: (1) why are monoliths harder to migrate? (2) Should you? (3) How do I start? (4) Best practices #VelocityConf
.@krisnova is a Gaypher (gay gopher), is a k8s maintainer, and is involved in two k8s SIGs (cluster lifecycle & aws, but she likes all the clouds, depending upon the day). And she did SRE before becoming a Dev Advocate! #VelocityConf
"just collect data and figure out later how you'll use it" doesn't work any more. #VelocityConf
We used to be optimistic before we ruined everything.
Mozilla also used to not collect data, and only had data on number of downloads, but its market share went down because they weren't measuring user satisfaction and actual usage. #VelocityConf