I'll be livetweeting @mipsytipsy's talk after lunch on monitoring vs observability and more. #SREcon
@mipsytipsy "I don't get my speaker's notes because I arrived a bit late, but I do get a unicorn instead, and I'm happy with that." -- @mipsytipsy #SREcon
"In the beginning, people wrote software for users, and we were motivated to fix it, because they would complain to us." #SREcon
But then things changed, and engineers fled from production to staging, and from staging to dev environments...

We've been trying to get back to ownership of code since the start of DevOps/SRE. #SREcon
And we do it because it makes for better code and happier humans who have autonomy.

And we want to do it while keeping all of these nice things associated with sophisticated environments. #SREcon
What does "software ownership" mean? Those who write the code can deploy their own code to production and support it in production. #SREcon
.@mipsytipsy says that she realized that observability is tied to the success of software engineers on call. [and then drops an f-bomb about having edited out the slide she needed just then, because she's Charity and swears a lot ;)] #SREcon
"Operations is where the 'beautiful' field of computing meets reality." --@mipsytipsy #SREcon
When healthy teams (with good culture & values) fail to adopt software ownership, it's because of an observability gap.

But even then, there's some variance in stories and how they wound up. #SREcon
.@mipsytipsy thinks that observability is not just instrumentation (logs/monitoring/tracing). There's more meaning behind it -- she uses the control theory definition -- how well can you infer the internal state from external outputs. #SREcon
We ask questions using our instrumentation. Can we debug our code just from our instrumentation? And can we answer new questions without shipping new code? #SREcon
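[ed: a minimal sketch of what "answer new questions without shipping new code" can look like if every request is kept as a rich structured event -- the field names (duration_ms, app_version, region) are my illustrative assumptions, not anything shown in the talk:]

```python
# Ad-hoc question over raw events, no new instrumentation required.
def slow_requests_for(events, **criteria):
    """Return events slower than 500 ms matching arbitrary field=value criteria."""
    return [
        e for e in events
        if e["duration_ms"] > 500 and all(e.get(k) == v for k, v in criteria.items())
    ]

# e.g. a question nobody anticipated when the code shipped:
# slow_requests_for(events, app_version="4.2.1", region="eu-west-1")
```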
No prior knowledge should be required for teams to track down new problems.

Monitoring is typically aggregated reporting from the outside in (e.g. mostly blackbox); observability reports from the inside out.

"The software explains itself from the inside." #SREcon
Known unknowns (writing checks for things you think will fail) vs unknown unknowns (not needing to know what will fail in order to debug it.) #SREcon
We've been trying to build observability tooling for 20+ years. And if you're giving people a bunch of arcane stuff they have to understand to be oncall, then yes, you're giving them two jobs.

The answer isn't to let them stay walled off from prod, it's better tools. #SREcon
It doesn't have to be a second job, or require having a lot of scar tissue from operations, to be able to operate your own code.

We don't need to worry about scarcity and the cost of storing/aggregating individual metrics any more. #SREcon
But not only that, our systems now have too much complexity for us to be able to use monitoring any longer.

"Our intuition and reasoning in our head no longer work." --@mipsytipsy #SREcon
We now need a lot more context about our systems to run them successfully, which means self ownership.

But our tools are still designed for the wall dashboard and the LAMP stack. [ed: f-bomb count is now 2!] #SREcon
We have to get used to putting the knowledge in our tools rather than in our brains, and this also lets other people benefit from it! #SREcon
Some problems can only be seen by zooming out and looking at scale (e.g. systemic patterns caused by a specific kind/version of component), rather than zooming in. #SREcon
You need to know the relationships between your metrics and the context they come with. Fucking cardinality [ed: count is 3!] #SREcon
People feel like they're bad at debugging and get terrified.

But the problem is not them, it's our tools. We're misusing them for purposes they weren't designed for.

Our problems are now multitenant, and the problem is finding *where* not *what* the fault is. #SREcon
Our systems are in constant chaos. They are never "completely up".

And there's an infinitely long list of long-tail, low-probability failure scenarios, one of which will inevitably occur. But not in your staging environment. #SREcon
We need to be able to test in production AND also test before production.

We need to invest in making it safe to test in production. Think canarying, safeguards, etc. #SREcon
Don't test without the ability to understand what's going on. Don't just ship code and wait to get paged. Look at it and make sure what you expected to happen is happening. [ed: although if your SLOs aren't detecting problems, your SLOs are wrong] #SREcon
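[ed: a hedged sketch of that "make sure what you expected to happen is happening" step -- comparing a canary build's error rate against the baseline over recent events before widening the rollout; the threshold and field names are my assumptions, not a prescribed method:]

```python
# Gate a canary on observed behaviour, not just on "no pages yet".
def canary_looks_healthy(recent_events, canary_build, baseline_build, max_ratio=1.5):
    def error_rate(build):
        evts = [e for e in recent_events if e["build_id"] == build]
        if not evts:
            return None
        return sum(1 for e in evts if e["status"] >= 500) / len(evts)

    canary, baseline = error_rate(canary_build), error_rate(baseline_build)
    if canary is None or baseline is None:
        return False  # no data is not a pass
    return canary <= max(baseline * max_ratio, 0.01)
```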
In a LAMP world we're used to bisecting through the stack to figure out whether the UI tier, database tier, etc. is slow, and then probing some possible causes. #SREcon
Runbooks help in a monitoring world but not in an observability world -- because they only help you with problems you know to anticipate. #SREcon
Charity argues that SLOs can help you with large outages, but not with debugging long tail events (e.g. when individual customers have issues, echoing the New Relic talk yesterday about keeping large customers happy) #SREcon
e.g. that intermittent/sporadic failures can't be detected by SLOs, and that we do need the ability to debug those interactions when they are raised by support. #SREcon
The overall health of the system doesn't matter if certain critical individual requests aren't working correctly. You still need to be able to debug those individual requests. [ed: curious what the process of defining an error budget for long tail problems should be] #SREcon
So you need things to be well-instrumented, with support for high cardinality/dimensionality, structured events, sampling, and live data from prod. (Which is a thing @honeycombio does, but others are doing it too.) #SREcon
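[ed: my hypothetical sketch of a "wide" structured event with high-cardinality fields plus a simple sampling decision -- field names and the stdout sink are illustrative, not any vendor's API:]

```python
import json
import random
import sys
import time

SAMPLE_RATE = 20  # keep roughly 1 in 20 events; record the rate so counts can be reweighted


def emit_wide_event(request, response, started_at):
    event = {
        "timestamp": started_at,
        "duration_ms": (time.time() - started_at) * 1000,
        "endpoint": request["endpoint"],
        "status": response["status"],
        # High-cardinality fields: exactly what pre-aggregated metrics throw
        # away, and exactly what lets you find *which* tenant or build is sad.
        "user_id": request["user_id"],
        "build_id": request["build_id"],
        "trace_id": request["trace_id"],
    }
    if random.randrange(SAMPLE_RATE) == 0:
        event["sample_rate"] = SAMPLE_RATE
        # In practice this goes to an event pipeline; stdout stands in here.
        sys.stdout.write(json.dumps(event) + "\n")
```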
So we need to instrument things correctly -- make sure our software explains itself to you.

You'll need *both* events and metrics [ed: yes]. Make sure that you have the context to debug. Charity doesn't like exemplars and thinks wide events/tracing is essential #SREcon
[ed: although I disagree with her diss of exemplars, I think exemplars *are* the sweet spot of being able to keep the right amount of context and have metrics and traces together, but we'll continue to argue about that ;)] #SREcon
High cardinality will save your ass. If you don't have the fields from your full dataset, you'll be unable to find problems and translate them from engineering to support problems [ed: amen.] #SREcon
You need to be able to ask new questions, which means you can't have aggregated away the context you need. And events give us the stories that we need to know. #SREcon
Dashboards are nice for some things, but work against you when you're debugging.
They cause you to mono-focus on the one thing you've debugged before, and not your current failure. [ed: a-fucking-men, see also my "stale dashboards are technical debt".] #SREcon
It's lazy to glance at dashboards and look for the squiggly line. It does not make you a good engineer. [ed: also a-fucking-men.] You need to follow the breadcrumb, not scan the drywall. #SREcon
[ed: ah, there's the nuance.] Don't aggregate at write time. Set a time threshold after which you aggregate, but don't throw away the data before you have a chance to do anything with it. #SREcon
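[ed: a minimal sketch of "aggregate after a threshold, not at write time" -- keep raw events queryable for a retention window, then roll the old ones up; the two-week window and field names are my assumptions:]

```python
import time
from collections import defaultdict

RAW_RETENTION_SECONDS = 14 * 24 * 3600  # keep two weeks of raw, full-context events


def roll_up_old_events(raw_events, now=None):
    now = now or time.time()
    keep, rollup = [], defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for event in raw_events:
        if now - event["timestamp"] < RAW_RETENTION_SECONDS:
            keep.append(event)  # still queryable with every field intact
        else:
            agg = rollup[event["endpoint"]]  # aggregate only once it ages out
            agg["count"] += 1
            agg["total_ms"] += event["duration_ms"]
    return keep, dict(rollup)
```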
"What do the 100 slowest have in common?" Can we reward exploration and curiosity rather than "yeah, there are some errors but we can't explain them." #SREcon
Users don't care about the system health, they only care if their individual requests are succeeding. And we need to make our systems accessible to generalist engineers. Tighten feedback loops, make better systems and happier people. [fin] #SREcon
