I'll be livetweeting @mipsytipsy's talk after lunch on monitoring vs observability and more. #SREcon
@mipsytipsy "I don't get my speaker's notes because I arrived a bit late, but I do get a unicorn instead, and I'm happy with that." -- @mipsytipsy #SREcon
"In the beginning, people wrote software for users, and we were motivated to fix it, because they would complain to us." #SREcon
But then things changed, and engineers fled from production to staging, and from staging to dev environments...

We've been trying to get back to ownership of code since the start of DevOps/SRE. #SREcon
And we do it because it makes for better code and happier humans who have autonomy.

And we want to do it while keeping all of these nice things associated with sophisticated environments. #SREcon
What does "software ownership" mean? Those who write the code can deploy their own code to production and support it in production. #SREcon
.@mipsytipsy says that she realized that observability is tied to the success of software engineers on call. [and then drops an f-bomb about having edited out the slide she needed just then, because she's Charity and swears a lot ;)] #SREcon
"Operations is where the 'beautiful' field of computing meets reality." --@mipsytipsy #SREcon
When healthy teams (with good culture & values) fail to adopt software ownership, it's because of an observability gap.

But even then, there's some variance in stories and how they wound up. #SREcon
.@mipsytipsy thinks that observability is not just instrumentation (logs/monitoring/tracing). There's more meaning behind it -- she uses the control theory definition -- how well can you infer the internal state from external outputs. #SREcon
We ask questions using our instrumentation. Can we debug our code just from our instrumentation? And can we answer new questions without shipping new code? #SREcon
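[ed: a minimal sketch of what "answer new questions without shipping new code" can look like if every request is kept as a rich structured event -- the field names (duration_ms, app_version, region) are my illustrative assumptions, not anything shown in the talk:]

```python
# Ad-hoc question over raw events, no new instrumentation required.
def slow_requests_for(events, **criteria):
    """Return events slower than 500 ms matching arbitrary field=value criteria."""
    return [
        e for e in events
        if e["duration_ms"] > 500 and all(e.get(k) == v for k, v in criteria.items())
    ]

# e.g. a question nobody anticipated when the code shipped:
# slow_requests_for(events, app_version="4.2.1", region="eu-west-1")
```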
No prior knowledge should be required for teams to track down new problems.

Monitoring is typically aggregated reporting from the outside in (e.g. mostly blackbox); observability reports from the inside out.

"The software explains itself from the inside." #SREcon
Known unknowns (writing checks for things you think will fail) vs unknown unknowns (not needing to know what will fail in order to debug it.) #SREcon
We've been trying to build observability tooling for 20+ years. And if you're giving people a bunch of arcane stuff they have to understand to be oncall, then yes, you're giving them two jobs.

The answer isn't to let them stay walled off from prod, it's better tools. #SREcon
It doesn't have to be a second job, or require having a lot of scar tissue from operations, to be able to operate your own code.

We don't need to worry about scarcity and the cost of storing/aggregating individual metrics any more. #SREcon
But not only that, our systems now have too much complexity for us to be able to use monitoring any longer.

"Our intuition and reasoning in our head no longer work." --@mipsytipsy #SREcon
We now need a lot more context about our systems to run them successfully, which means self ownership.

But our tools are still designed for the wall dashboard and the LAMP stack. [ed: f-bomb count is now 2!] #SREcon
We have to get used to putting the knowledge in our tools rather than in our brains, and this also lets other people benefit from it! #SREcon
Some problems can only be seen by zooming out and looking at scale (e.g. systemic patterns caused by a specific kind/version of component), rather than zooming in. #SREcon
You need to know the relationships between your metrics and the context they come with. Fucking cardinality [ed: count is 3!] #SREcon
People feel like they're bad at debugging and get terrified.

But the problem is not them, it's our tools. We're misusing them for purposes they weren't designed for.

Our problems are now multitenant, and the problem is finding *where* not *what* the fault is. #SREcon
Our systems are in constant chaos. They are never "completely up".

And there's an infinitely long list of long-tail, low-probability failure scenarios, one of which will inevitably occur. But not in your staging environment. #SREcon
We need to be able to test in production AND also test before production.

We need to invest in making it safe to test in production. Think canarying, safeguards, etc. #SREcon
Don't test without the ability to understand what's going on. Don't just ship code and wait to get paged. Look at it and make sure what you expected to happen is happening. [ed: although if your SLOs aren't detecting problems, your SLOs are wrong] #SREcon
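[ed: a hedged sketch of that "make sure what you expected to happen is happening" step -- comparing a canary build's error rate against the baseline over recent events before widening the rollout; the threshold and field names are my assumptions, not a prescribed method:]

```python
# Gate a canary on observed behaviour, not just on "no pages yet".
def canary_looks_healthy(recent_events, canary_build, baseline_build, max_ratio=1.5):
    def error_rate(build):
        evts = [e for e in recent_events if e["build_id"] == build]
        if not evts:
            return None
        return sum(1 for e in evts if e["status"] >= 500) / len(evts)

    canary, baseline = error_rate(canary_build), error_rate(baseline_build)
    if canary is None or baseline is None:
        return False  # no data is not a pass
    return canary <= max(baseline * max_ratio, 0.01)
```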
In a LAMP world we're used to bisecting through the stack to figure out whether the UI tier, database tier, etc. is slow, and then probing some possible causes. #SREcon
Runbooks help in a monitoring world but not in an observability world -- because they only help you with problems you know to anticipate. #SREcon
Charity argues that SLOs can help you with large outages, but not with debugging long tail events (e.g. when individual customers have issues, echoing the New Relic talk yesterday about keeping large customers happy) #SREcon
e.g. that intermittent/sporadic failures can't be detected by SLOs, and that we do need the ability to debug those interactions when they are raised by support. #SREcon
The overall health of the system doesn't matter if certain critical individual requests aren't working correctly. You still need to be able to debug those individual requests. [ed: curious what the process of defining an error budget for long tail problems should be] #SREcon
So you need things to be well-instrumented, with support for high cardinality/dimensionality, structured events, sampling, and live data from prod. (Which is a thing @honeycombio does, but others are doing it too.) #SREcon
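[ed: my hypothetical sketch of a "wide" structured event with high-cardinality fields plus a simple sampling decision -- field names and the stdout sink are illustrative, not any vendor's API:]

```python
import json
import random
import sys
import time

SAMPLE_RATE = 20  # keep roughly 1 in 20 events; record the rate so counts can be reweighted


def emit_wide_event(request, response, started_at):
    event = {
        "timestamp": started_at,
        "duration_ms": (time.time() - started_at) * 1000,
        "endpoint": request["endpoint"],
        "status": response["status"],
        # High-cardinality fields: exactly what pre-aggregated metrics throw
        # away, and exactly what lets you find *which* tenant or build is sad.
        "user_id": request["user_id"],
        "build_id": request["build_id"],
        "trace_id": request["trace_id"],
    }
    if random.randrange(SAMPLE_RATE) == 0:
        event["sample_rate"] = SAMPLE_RATE
        # In practice this goes to an event pipeline; stdout stands in here.
        sys.stdout.write(json.dumps(event) + "\n")
```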
So we need to instrument things correctly -- make sure our software explains itself to you.

You'll need *both* events and metrics [ed: yes]. Make sure that you have the context to debug. Charity doesn't like exemplars and thinks wide events/tracing is essential #SREcon
[ed: although I disagree with her diss of exemplars, I think exemplars *are* the sweet spot of being able to keep the right amount of context and have metrics and traces together, but we'll continue to argue about that ;)] #SREcon
High cardinality will save your ass. If you don't have the fields from your full dataset, you'll be unable to find problems and translate them from engineering to support problems [ed: amen.] #SREcon
You need to be able to ask new questions, which means you can't have aggregated away the context you need. And events give us the stories that we need to know. #SREcon
Dashboards are nice for some things, but work against you when you're debugging.
They cause you to mono-focus on the one thing you've debugged before, and not your current failure. [ed: a-fucking-men, see also my "stale dashboards are technical debt".] #SREcon
It's lazy to glance at dashboards and look for the squiggly line. It does not make you a good engineer. [ed: also a-fucking-men.] You need to follow the breadcrumb, not scan the drywall. #SREcon
[ed: ah, there's the nuance.] Don't aggregate at write time. Set a time threshold after which you aggregate, but don't throw away the data before you have a chance to do anything with it. #SREcon
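[ed: a minimal sketch of "aggregate after a threshold, not at write time" -- keep raw events queryable for a retention window, then roll the old ones up; the two-week window and field names are my assumptions:]

```python
import time
from collections import defaultdict

RAW_RETENTION_SECONDS = 14 * 24 * 3600  # keep two weeks of raw, full-context events


def roll_up_old_events(raw_events, now=None):
    now = now or time.time()
    keep, rollup = [], defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for event in raw_events:
        if now - event["timestamp"] < RAW_RETENTION_SECONDS:
            keep.append(event)  # still queryable with every field intact
        else:
            agg = rollup[event["endpoint"]]  # aggregate only once it ages out
            agg["count"] += 1
            agg["total_ms"] += event["duration_ms"]
    return keep, dict(rollup)
```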
"What do the 100 slowest have in common?" Can we reward exploration and curiosity rather than "yeah, there are some errors but we can't explain them." #SREcon
Users don't care about the system health, they only care if their individual requests are succeeding. And we need to make our systems accessible to generalist engineers. Tighten feedback loops, make better systems and happier people. [fin] #SREcon
