Liz Fong-Jones (方禮真)

@lizthegrey

Aug 30, 2018 • 30 tweets • 12 min read • Read on X

Next up in Track 1 at #SREcon is @damonedwards on brownfield SRE in enterprises!

"You may think, 'I don't work in an enterprise', but you will eventually when your company becomes successful enough." --@damonedwards

You'll have multiple business lines, acquisitions, generations of tech debt... #SREcon

We see a lot of companies saying they're doing every buzzword. But when you look behind the scenes and talk to the ops folks, they're squeezed between the "DevOps/digital transformation" groups pushing to move faster, vs. locking things down from audit/security. #SREcon

Why are the digital transformations failing? Well, why don't we try SRE to transform ops?

So we take Jane Doe and s/SysAdmin/SRE/ aaand.... #SREcon

All we're doing is telling people "use code and you'll be better at ops." So now we have false SRE where we're still overloaded and firefighting.

But now everyone has an SRE job title so they're getting headhunted on LinkedIn and leaving! [ed: guffaws from audience] #SREcon

Quoting @Jerub: the key principle of SRE is that we need service level objectives with consequences.

In the enterprise world, we're used to thinking about SLAs as being punitive for operations, rather than an agreement of shared responsibility as with SLOs. #SREcon
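
One common way to attach consequences to an SLO is an error budget. The following is a minimal sketch of that idea, not something shown in the talk: the `SLO` class, the `consequence` policy, and its 25% threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """An objective such as "99.9% of requests succeed" over some window."""
    target: float  # required fraction of good events, e.g. 0.999

    def error_budget(self, total_events: int) -> float:
        """Number of bad events the objective tolerates at this volume."""
        return (1.0 - self.target) * total_events

    def budget_remaining(self, total_events: int, bad_events: int) -> float:
        """Fraction of the error budget still unspent (negative if overspent)."""
        budget = self.error_budget(total_events)
        if budget == 0:
            return 0.0 if bad_events == 0 else float("-inf")
        return 1.0 - bad_events / budget


def consequence(slo: SLO, total_events: int, bad_events: int) -> str:
    """Hypothetical policy that dev and ops agree on ahead of time."""
    remaining = slo.budget_remaining(total_events, bad_events)
    if remaining <= 0:
        return "freeze feature launches; prioritize reliability work"
    if remaining < 0.25:  # assumed threshold, purely illustrative
        return "slow down risky changes"
    return "ship normally"
```

The point of the sketch is that the consequence applies to everyone who shares the objective, instead of penalizing operations alone as a punitive SLA would.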

The second principle, according to @Jerub, is having time to make tomorrow better than today (e.g. toil budgets), and principle 3 is to empower SREs to regulate workload.

Both are foreign concepts in the enterprise. #SREcon

So @damonedwards's four "horsemen of the enterprise apocalypse" that undermine SRE principles: silos, queues, excessive toil, and low trust. #SREcon

On silos: we start having context breaks, process mismatches, different tools, and teams optimizing for different things rather than working together. #SREcon

The silos interfere with our feedback loops, and also cause people in the silos to become interchangeable and tactical, doing semi-manual fixes in response to tickets rather than actually fixing long-term issues. #SREcon

Disjointed silos make it hard to share responsibility and have meaningful SLOs; there's no time left after overhead and toil to do long-term projects.

So we need to fix the cross-silo problems, and we try putting in a ticket queue, only to make it worse. #SREcon

We increase risk with delay, introduce overhead, and make people feel less motivated due to lack of connection with impact -- nobody sees the totality of what they're building. #SREcon

Ticket work also becomes a creator of one-off snowflake configs en masse. This makes all your future automation efforts harder. #SREcon

Tickets reinforce silos, obfuscate value, create more work, and disconnect the pushback against excessive workload since it's all in the infinite ticket queue. #SREcon

Recapping the definition of toil from @srebook: manual, repetitive, automatable, tactical, break/fix, and O(N) with service growth. #SREcon

The contrast is creative engineering work that builds enduring value, as opposed to tactical, toil-y break/fix work. #SREcon

Excessive toil results in "Engineering Bankruptcy": teams buried in toil have no capacity left to reduce it and improve the business. #SREcon

Quoting @allspaw: all our work is contextual, and the answer is "it depends" to "is this safe to run?"

Yet in an enterprise world, the people with the context aren't making the decisions; it's people 4 degrees or more removed. #SREcon

Low trust and an approval system result in an illusion of control.

How many approvers are *actually* adding value, as opposed to the CYAs, the "just FYIs", or the "I guess this LGTM"s? #SREcon

Low-trust environments fail at shared responsibility, fail at actually fixing the real problems to make tomorrow better, and fail at letting people self-regulate. #SREcon

How do we get out of this situation? We need to study our processes with lean methodology, and not just delivery, but incidents too. #SREcon

"Often the challenge is convincing executives to fix process rather than just employees working harder." --@damonedwards #SREcon

Problems can be solved by getting rid of silos and context breaks; horizontal shared responsibilities rather than everyone doing everything.

You can choose either to do cross-functional teams, or have distinct teams with clearly communicated shared responsibilities. #SREcon

Don't bounce things from ticket queue to ticket queue; once an item is pulled from the backlog, just get it done.

Remember OODA loops apply to us too - instrumentation for observing, tools for investigating/orienting, then empower deciding & acting. #SREcon

How do we deal with handoffs between teams and between teams and specialists?

Give people automated on-demand, audited access to privileged environments rather than having to go through a human or ticket. [ed: I think?] #SREcon

Single place to interact with our services (where everyone can watch), to avoid people dogpiling/freelancing into situations and making them worse.

Make sure our operation actions are pre-defined and can be changed to reflect changes in the environment [ed: ah I get it] #SREcon

[ed: the key aspect is that you're decoupling "the person who knows how to do X" and "the person who presses the button to cause X to happen" -- by making buttons that do X on demand and whose actions can be changed if how you do X changes] #SREcon
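
The decoupling described above can be sketched as a small registry of pre-defined actions with an audit trail. This is an illustrative assumption about the shape of such a framework, not the tooling discussed in the talk; `OPERATIONS`, `AUDIT_LOG`, `operation`, `run`, and the `restart-service` action are all hypothetical names.

```python
import datetime
from typing import Callable, Dict, List

# Hypothetical in-memory registry and audit log; a real system would persist both.
OPERATIONS: Dict[str, Callable[..., str]] = {}
AUDIT_LOG: List[dict] = []


def operation(name: str):
    """Register a pre-defined action under a stable name."""
    def register(func):
        OPERATIONS[name] = func
        return func
    return register


@operation("restart-service")
def restart_service(service: str) -> str:
    # Placeholder body: the team that knows how to do this safely owns it
    # and can change it when the environment changes.
    return f"restarted {service}"


def run(name: str, requested_by: str, **params) -> str:
    """Run a registered action on demand and record who did what, and when."""
    if name not in OPERATIONS:
        raise KeyError(f"no pre-defined operation named {name!r}")
    result = OPERATIONS[name](**params)
    AUDIT_LOG.append({
        "operation": name,
        "requested_by": requested_by,
        "params": params,
        "result": result,
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return result
```

Calling `run("restart-service", requested_by="alice", service="billing")` presses the button without a ticket or a human handoff, while the audit entry keeps the compliance record. If the way to restart a service changes, only the registered function body changes and callers are unaffected.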

Ticket tracking should be for actual work rather than rote work.

You shift the compliance work into the operations-as-a-service actuation framework. #SREcon

How does this work with ITIL? Complicated. They say it's compatible, but either way, we're trying to accomplish getting work done better. #SREcon

Shift left your decision-making.

Takeaways: reduce your toil -- track toil, set limits, and fund efforts to reduce toil.
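
Tracking toil against a limit could look roughly like the sketch below. The entry format, the `toil_report` function, and the `TOIL_LIMIT` name are assumptions made for illustration; the 50% default mirrors the cap described in the SRE book and should be adjusted to whatever limit a team actually agrees on.

```python
from collections import defaultdict

TOIL_LIMIT = 0.5  # assumed cap on the share of each person's time spent on toil


def toil_report(entries, limit=TOIL_LIMIT):
    """Summarize toil per person.

    entries: iterable of (person, hours, is_toil) tuples, e.g. from timesheets
    or ticket labels (a hypothetical input format).
    """
    total = defaultdict(float)
    toil = defaultdict(float)
    for person, hours, is_toil in entries:
        total[person] += hours
        if is_toil:
            toil[person] += hours

    report = {}
    for person, hours in total.items():
        share = toil[person] / hours if hours else 0.0
        report[person] = {"toil_share": share, "over_limit": share > limit}
    return report
```

Anyone flagged `over_limit` is a signal to fund toil-reduction work, not to ask that person to work harder.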

Start a book club. Make sure people actually understand they're doing SRE when they're doing SRE. [fin] #SREcon
