Liz Fong-Jones (方禮真)

@lizthegrey

Jul 26, 2018 • 33 tweets • 14 min read

I'll be livetweeting @btreynor, @markcartertm, Morgan, & Lingyun Ruan of @Snap's spotlight session at #GoogleNext18 on SRE in hybrid and cloud-native environments, starting in 15 minutes!

Leading off: @markcartertm. For the past 15 years we've been building planet-scale services with performance, reliability and availability using Site Reliability Engineering.

[ed: omfg thank fuck there is CART here, I've been dying all week without it] #GoogleNext18

And here's @btreynor to talk about the Transparent Cloud API, alongside Lingyun from @Snap who used it successfully during the FIFA World Cup to debug performance during traffic spikes. #GoogleNext18

We need good monitoring to understand our products, and also the things that used to be black boxes from the cloud customer's perspective. #GoogleNext18

We have service level indicators instrumented early in the stack to get end to end performance and alerts when service quality degrades. But during an outage, you need to be able to figure out what's broken inside the system. #GoogleNext18

But what about instrumentation of your cloud products? Sometimes people add client-side metrics on calls to their dependent services, but it still won't tell you where the issue lies -- it could either be on caller or dependent service side. #GoogleNext18

By comparing client-side metrics against server-side backend metrics, we can tell which side the problem is on. Google Cloud is offering server-side backend metrics for the first time. #GoogleNext18

How does this help in production? Example: the story service, which depends upon Google Cloud Spanner for its metadata. #GoogleNext18

So Snap has service level indicators for the Stories API -- latency, error rate, and so forth. But the service is mostly a wrapper around Spanner, and before transparent SLIs they couldn't be sure whether an issue was on Snap's side or on Google's side. #GoogleNext18

But now they have a single pane of glass to see from Google's side alongside Snap's internal metrics how well Spanner is handling requests from Snap on a per-project basis.

For example, Spanner resource metrics. #GoogleNext18

And a pager goes off... we can look at the dashboard of shared service level indicators and see increases in error ratio on Snap's side (from 0% to 1%) and in latency from 30ms to 1800ms at p99. But the Google metrics are unaffected. #GoogleNext18

So they can keep debugging on their side rather than filing a support ticket. It turns out that they have a hotspot in query distribution in their GCE VMs that is pegging the CPU of one VM, so the problem is only on the client side and not with Spanner. #GoogleNext18
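[ed: a minimal sketch of the triage logic in this scenario, not Snap's or Google's actual tooling. The `Sli` shape, the `degraded` thresholds, and the server-side numbers are assumptions made up for illustration; only the client-side figures (0% to 1% errors, 30ms to 1800ms at p99) come from the talk.]

```python
from dataclasses import dataclass


@dataclass
class Sli:
    """One side's view of the same calls (hypothetical shape)."""
    error_ratio: float     # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def degraded(baseline: Sli, current: Sli,
             error_delta: float = 0.005, latency_factor: float = 2.0) -> bool:
    """True if errors rose by more than error_delta or p99 grew past latency_factor x baseline.

    Both thresholds are illustrative assumptions, not values from the talk.
    """
    return (current.error_ratio - baseline.error_ratio > error_delta
            or current.p99_latency_ms > baseline.p99_latency_ms * latency_factor)


def triage(client_base: Sli, client_now: Sli,
           server_base: Sli, server_now: Sli) -> str:
    """Compare client-side SLIs with the provider's server-side SLIs."""
    client_bad = degraded(client_base, client_now)
    server_bad = degraded(server_base, server_now)
    if server_bad:
        return "provider-side"   # share these metrics with support
    if client_bad:
        return "client-side"     # keep debugging your own stack
    return "healthy"


# Client-side numbers from the talk; server-side numbers are placeholders
# standing in for "the Google metrics are unaffected".
verdict = triage(
    client_base=Sli(error_ratio=0.0, p99_latency_ms=30.0),
    client_now=Sli(error_ratio=0.01, p99_latency_ms=1800.0),
    server_base=Sli(error_ratio=0.0, p99_latency_ms=10.0),
    server_now=Sli(error_ratio=0.0, p99_latency_ms=11.0),
)
print(verdict)
```

With the provider's view flat and the caller's view degraded, the sketch points at the client side, which matches where the hotspot turned out to be.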

It saves Snap a lot of time and a lot of false alarm support tickets.

They get better visibility, more efficient debugging, and clear metrics to share with support when there is a problem on Google's side. #GoogleNext18

Back to @btreynor: this scenario happens every day in production. When the klaxon goes off, we have to figure out what happened. Historically people would have to file tickets and hope, all while users are suffering. Lowering your time to recover matters. #GoogleNext18

Transparent SLIs are available for all GCP services, and use the same SRE golden signals of latency, availability, etc.

You can triage your outages faster, and integrate with your dashboards [ed: and this works even if you're not a Stackdriver user e.g. w/ Datadog!]. #GoogleNext18

You can get a copy of the @SREWorkbook at the Fifth Nine lounge until midday! [ed: and you can of course view more info on how to practically use transparent SLIs at my talk with Marie: https://twitter.com/lizthegrey/status/1022261861532590080] #GoogleNext18

And now we're hearing from Morgan about how to debug poor application production performance with Stackdriver APM/Trace, with a live demo involving the Hipster Shop and pulling up the waterfall graph of spans. #GoogleNext18

He's showing how having the slow spans from tracing only partially helps, and that you need profiling to dive deep into specific spans without having to specifically annotate points in the span. #GoogleNext18

You'd want to start with a top-level metrics comparison, but if that doesn't show what you expected, being able to compare traces and profiles across versions lets you find out what got slower or faster with your changes. #GoogleNext18

[ed: we use profiling *all* the time within Google for thorny perf issues, and having it be more accessible is hugely valuable]

There's also a new Stackdriver Debug functionality that allows running live execution and debugging in prod. #GoogleNext18

To use these tools, you can just link in libraries and redeploy; it's open source, Apache-licensed, and on GitHub.

Shoutout to OpenCensus! #GoogleNext18

OpenCensus exports to multiple different services; you don't just have to use Stackdriver, you can use Datadog, Prometheus, Zipkin, and so forth. At the same time. #GoogleNext18
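[ed: a rough sketch of the fan-out idea behind "at the same time". This is deliberately NOT the OpenCensus API; `FanOutExporter`, the span dictionary, and the in-memory stand-in backends are all hypothetical names used only to show the pattern of handing each span to several exporters.]

```python
from typing import Callable, Dict, List

Span = Dict[str, object]
Exporter = Callable[[Span], None]


class FanOutExporter:
    """Hands every finished span to each registered backend exporter."""

    def __init__(self, exporters: List[Exporter]) -> None:
        self._exporters = list(exporters)

    def export(self, span: Span) -> int:
        """Send the span to all backends; return how many accepted it."""
        delivered = 0
        for exporter in self._exporters:
            try:
                exporter(span)
                delivered += 1
            except Exception:
                # One failing backend should not block the others.
                continue
        return delivered


# Stand-in backends that just collect spans in memory (assumed for the demo).
stackdriver_spans: List[Span] = []
zipkin_spans: List[Span] = []

fan_out = FanOutExporter([stackdriver_spans.append, zipkin_spans.append])
fan_out.export({"name": "checkout", "duration_ms": 42})
```

The point of the design is that instrumentation happens once in the application, and the choice of backends is a configuration detail that can change without touching the instrumented code.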

Now onto Q&A, with an extra incentive of @SREWorkbook copies to distribute to each person who asks a question, and snap spectacles for the best question. [livetweet ends here so I can help out & such] #GoogleNext18

okay, I'll tweet this one: question about how to start culturally transforming with SRE. When you have devs/ops w/ different goals, things fall through because collaboration stops under deadlines/stress

The best way to fix this is common incentives & error budgets. #GoogleNext18

Q: does this work in hybrid cloud? A: Stackdriver APM works regardless of environment but some of the logging stuff is dependent upon running in tagged VMs. #GoogleNext18

.@markcartertm hints that automated service dashboards are prevalent at Google when you use a common framework, and that there will be news by NEXT London on bringing this to the world. [ed: and people using Envoy and Istio often get this today] #GoogleNext18

Q: what are some of the common challenges people have with adopting SRE? A: some industries struggle with separation of roles in SOX etc. and not allowing people to modify and deploy code; other issues involve rules for data locality but tooling can help. #GoogleNext18

but a lot of the issues we see aren't specific to SRE and are generic to *any* operational team structure. #GoogleNext18

An open cloud question: will Stackdriver's backend be opensourced? A: no, because unless you have a few hundred thousand servers lying around it doesn't make sense; but the client libraries are all open source and interoperate so you can use same API w/ prometheus #GoogleNext18

.@btreynor emphasizing in response to a question about performance that APM is cheap compared to engineer time debugging outages. The impact in prod is unnoticeable, but even if it were 0.3% of CPU fleet-wide, it would still be worth it. #GoogleNext18

[ed: come talk to me at the Fifth Nine about my experiences turning on exemplars in combination with traces fleet-wide at Google]

Audience question about error budgets and tradeoffs involved; @btreynor discussing setting the right SLO number of nines. #GoogleNext18

It's okay to have a one-nine (90%) SLO, for instance, for pre-production testing.
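[ed: to make the "number of nines" tradeoff concrete, here is a small sketch that turns an SLO into an error budget. It assumes a 30-day window and a time-based budget, which is a simplification; real budgets are often counted in failed requests instead. The function name is made up for this example.]

```python
def error_budget_minutes(nines: int, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an SLO with the given number of nines.

    One nine is 90%, two nines is 99%, three nines is 99.9%, and so on.
    The 30-day default window is an assumption for illustration.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    allowed_failure_fraction = 10.0 ** (-nines)
    return allowed_failure_fraction * window_days * 24 * 60


for n in range(1, 6):
    print(f"{n} nine(s): {error_budget_minutes(n):.2f} minutes per 30 days")
```

Over 30 days, one nine leaves roughly 72 hours of budget while three nines leaves about 43 minutes, which is why a loose SLO is reasonable for pre-production and each extra nine is a serious commitment.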

Audience question about SRE training and certification: is google doing it? [ed: waving my hands saying YES!!! come talk to me in the red jacket or go find us at the Fifth Nine] #GoogleNext18

It is a non-trivial scaling exercise and demand for CRE very much outstrips supply. [fin] #GoogleNext18
