Next up in Track 1 at #SREcon is @damonedwards on brownfield SRE in enterprises!
"You may think, 'I don't work in an enterprise', but you will eventually when your company becomes successful enough." --@damonedwards
You'll have multiple business lines, acquisitions, generations of tech debt... #SREcon
We see a lot of companies saying they're doing every buzzword. But when you look behind the scenes and talk to the ops folks, they're squeezed between the "DevOps/digital transformation" groups pushing to move faster and the audit/security groups pushing to lock things down. #SREcon
Why are the digital transformations failing? Well, why don't we try SRE to transform ops?
So we take Jane Doe and s/SysAdmin/SRE/ aaand.... #SREcon
All we're doing is telling people "use code and you'll be better at ops." So now we have false SRE where we're still overloaded and firefighting.
But now everyone has an SRE job title so they're getting headhunted on LinkedIn and leaving! [ed: guffaws from audience] #SREcon
Quoting @Jerub: the key principle of SRE is that we need service level objectives with consequences.
In the enterprise world, we're used to thinking about SLAs as being punitive for operations, rather than an agreement of shared responsibility as with SLOs. #SREcon
The second principle, according to @Jerub, is having time to make tomorrow better than today (e.g. toil budgets). And principle 3 is to empower SREs to regulate workload.
Both are foreign concepts in the enterprise. #SREcon
So @damonedwards's four "horsemen of the enterprise apocalypse" that undermine SRE principles: silos, queues, excessive toil, and low trust. #SREcon
On silos: we start having context breaks, process mismatches, different tools, and teams optimizing for different things rather than working together. #SREcon
The silos interfere with our feedback loops, and also cause people in the silos to become interchangeable and tactical, doing semi-manual fixes in response to tickets rather than actually fixing long-term issues. #SREcon
Disjointed silos make it hard to share responsibility and have meaningful SLOs; there's no time left after overhead and toil to do long-term projects.
So we need to fix the cross-silo problems, and we try putting in a ticket queue, only to make it worse. #SREcon
We increase risk with delay, introduce overhead, and make people feel less motivated due to lack of connection with impact -- nobody sees the totality of what they're building. #SREcon
Ticket work also becomes a creator of one-off snowflake configs en masse. This makes all your future automation efforts harder. #SREcon
Tickets reinforce silos, obfuscate value, create more work, and disconnect the pushback against excessive workload since it's all in the infinite ticket queue. #SREcon
Recapping the definition of toil from @srebook: manual, repetitive, automatable, tactical, break/fix, and O(N) with service growth. #SREcon
Contrast that with creative engineering work that builds enduring value, as opposed to tactical, toil-y break/fix work. #SREcon
Excessive toil results in "Engineering Bankruptcy": there's no capacity left to reduce the toil teams are buried in or to improve the business. #SREcon
Quoting @allspaw: all our work is contextual, and the answer is "it depends" to "is this safe to run?"
Yet in an enterprise world, the people with the context aren't the ones making the decisions; it's people 4 degrees or more removed. #SREcon
Low trust and an approval system result in an illusion of control.
How many approvers are *actually* adding value, as opposed to the CYAs, the "just FYIs", or the "I guess this LGTM"s? #SREcon
Low trust environments fail at shared responsibility, fail at actually fixing the real problems to make tomorrow better, and fail at letting people self-regulate. #SREcon
How do we get out of this situation? We need to study our processes with lean methodology, and not just delivery, but incidents too. #SREcon
"Often the challenge is convincing executives to fix process rather than just employees working harder." --@damonedwards #SREcon
Problems can be solved by getting rid of silos and context breaks; horizontal shared responsibilities rather than everyone doing everything.
You can choose either to do cross-functional teams, or have distinct teams with clearly communicated shared responsibilities. #SREcon
Don't bounce things from ticket queue to ticket queue; once an item is pulled from the backlog, just get it done.
Remember OODA loops apply to us too - instrumentation for observing, tools for investigating/orienting, then empower deciding & acting. #SREcon
How do we deal with handoffs between teams and between teams and specialists?
Give people automated on-demand, audited access to privileged environments rather than having to go through a human or ticket. [ed: I think?] #SREcon
Single place to interact with our services (where everyone can watch), to avoid people dogpiling/freelancing into situations and making them worse.
Make sure our operational actions are pre-defined and can be changed to reflect changes in the environment [ed: ah I get it] #SREcon
[ed: the key aspect is that you're decoupling "the person who knows how to do X" and "the person who presses the button to cause X to happen" -- by making buttons that do X on demand and whose actions can be changed if how you do X changes] #SREcon
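[ed: a minimal sketch of that decoupling, not anything shown in the talk. The `OperationsCatalog` class, the `restart_service` action, and the user name are all hypothetical, made up for illustration. The person who knows how to do X defines the action; anyone allowed to press the button runs it by name; every press lands in an audit log.]

```python
import datetime
from typing import Callable, Dict, List


class OperationsCatalog:
    """Hypothetical catalog of pre-defined operational actions (illustrative only)."""

    def __init__(self) -> None:
        self._actions: Dict[str, Callable[..., str]] = {}
        self.audit_log: List[dict] = []

    def define(self, name: str, action: Callable[..., str]) -> None:
        # The expert defines (or later redefines) how X is done.
        self._actions[name] = action

    def run(self, name: str, user: str, **params) -> str:
        # Anyone with access presses the button; every press is recorded.
        if name not in self._actions:
            raise KeyError(f"no pre-defined operation named {name!r}")
        result = self._actions[name](**params)
        self.audit_log.append({
            "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "user": user,
            "operation": name,
            "params": params,
            "result": result,
        })
        return result


catalog = OperationsCatalog()
catalog.define("restart_service", lambda service: f"restarted {service}")
first = catalog.run("restart_service", user="jane", service="billing")

# The procedure changes; the button and its callers stay the same.
catalog.define("restart_service", lambda service: f"drained and restarted {service}")
second = catalog.run("restart_service", user="jane", service="billing")
```

[ed: the caller's code is identical before and after the redefinition, which is the point: how X is done can change without the button-presser filing a ticket or learning the new procedure.]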
Ticket tracking should be for actual work rather than rote work.
You shift the compliance work into the operations as a service actuation framework. #SREcon
How does this work with ITIL? Complicated. They say it's compatible, but either way, we're trying to accomplish getting work done better. #SREcon
Shift left your decisionmaking.
Takeaways: reduce your toil -- track toil, set limits, and fund efforts to reduce toil.
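[ed: a rough sketch of what "track toil, set limits" could look like, assuming you log hours per task and flag them as toil or not. The `ToilTracker` class and the 50% default limit are my own illustrative assumptions, not something prescribed in the talk.]

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToilTracker:
    """Hypothetical per-team toil log with a configurable limit (illustrative only)."""

    toil_limit: float = 0.5  # assumed cap on the fraction of hours spent on toil
    entries: List[Tuple[str, float, bool]] = field(default_factory=list)

    def log(self, description: str, hours: float, is_toil: bool) -> None:
        # Record one piece of work and whether it counted as toil.
        self.entries.append((description, hours, is_toil))

    def toil_fraction(self) -> float:
        # Fraction of all logged hours that were toil; 0.0 if nothing is logged.
        total = sum(hours for _, hours, _ in self.entries)
        if total == 0:
            return 0.0
        toil = sum(hours for _, hours, is_toil in self.entries if is_toil)
        return toil / total

    def over_limit(self) -> bool:
        # True when toil exceeds the agreed limit, i.e. time to fund reduction work.
        return self.toil_fraction() > self.toil_limit


tracker = ToilTracker(toil_limit=0.5)
tracker.log("manual restarts from tickets", hours=6, is_toil=True)
tracker.log("automation project", hours=4, is_toil=False)
fraction = tracker.toil_fraction()
needs_funding = tracker.over_limit()
```

[ed: the number itself matters less than having one; a tracked fraction and an agreed limit give the team something concrete to point at when asking for funding.]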
Start a book club. Make sure people actually understand they're doing SRE when they're doing SRE. [fin] #SREcon
Final talk I'll be getting to at #VelocityConf before I dash to Toronto: @IanColdwater on improving container security on k8s.
@IanColdwater She focuses on hardening her employer's cloud container infrastructure, including doing work on k8s.
She also was an ethical hacker before she went into DevOps and DevSecOps. #VelocityConf
She travels around doing competitive hacking with CTFs. It's important to think like an attacker rather than assuming good intents and nice user personas that use our features in the way the devs intended things to be used. #VelocityConf
My colleague @sethvargo on microservice security at #VelocityConf: we've traditionally thought of security as all-or-nothing -- that you put the biggest possible padlock on your perimeter, and you have a secure zone and untrusted zone.
@sethvargo We know that monoliths don't actually work, so we're moving towards microservices. But how does this change your security model?
You might have a loadbalancer that has software-defined rules. And you have a variety of compartmentalized networks. #VelocityConf
You might also be communicating with managed services such as Cloud SQL that are outside of your security perimeter.
You no longer have one resource, firewall, loadbalancer, and security team. You have many. Including "Chris." #VelocityConf
The problems we're solving: (1) why are monoliths harder to migrate? (2) Should you? (3) How do I start? (4) Best practices #VelocityConf
.@krisnova is a Gaypher (gay gopher), is a k8s maintainer, and is involved in two k8s SIGs (cluster lifecycle & aws, but she likes all the clouds, depending upon the day). And she did SRE before becoming a Dev Advocate! #VelocityConf
"just collect data and figure out later how you'll use it" doesn't work any more. #VelocityConf
We used to be optimistic before we ruined everything.
Mozilla also used to not collect data, and only had data on number of downloads, but its market share went down because they weren't measuring user satisfaction and actual usage. #VelocityConf