Liz Fong-Jones (方禮真)
Jul 26, 2018 · 33 tweets · 14 min read
I'll be livetweeting @btreynor, @markcartertm, Morgan, & Lingyun Ruan of @Snap's spotlight session at #GoogleNext18 on SRE in hybrid and cloud-native environments, starting in 15 minutes!
Leading off: @markcartertm. For past 15 years we've been building planet-scale services with performance, reliability and availability using Site Reliability Engineering.

[ed: omfg thank fuck there is CART here, I've been dying all week without it] #GoogleNext18
And here's @btreynor to talk about the Transparent Cloud API, alongside Lingyun from @Snap who used it successfully during the FIFA World Cup to debug performance during traffic spikes. #GoogleNext18
We need good monitoring to understand our products, and also the things that used to be black boxes from the cloud customer's perspective. #GoogleNext18
We have service level indicators instrumented early in the stack to get end to end performance and alerts when service quality degrades. But during an outage, you need to be able to figure out what's broken inside the system. #GoogleNext18
But what about instrumentation of your cloud products? Sometimes people add client-side metrics on calls to their dependent services, but it still won't tell you where the issue lies -- it could either be on caller or dependent service side. #GoogleNext18
By comparing client-side metrics against server-side backend metrics, we can tell which side of the call the problem is on. Google Cloud is offering server-side backend metrics for the first time. #GoogleNext18
How does this help in production? Example: the story service, which depends upon Google Cloud Spanner for its metadata. #GoogleNext18
So Snap has service level indicators for the Stories API -- latency, error rate, and so forth. The service is mostly a wrapper around Spanner, but before transparent SLIs, they couldn't be sure whether an issue was on Snap's side or on Google's side. #GoogleNext18
But now they have a single pane of glass showing, alongside Snap's internal metrics, how well Spanner is handling Snap's requests as seen from Google's side, on a per-project basis.

For example, Spanner resource metrics. #GoogleNext18
And a pager goes off... we can look at the dashboard of shared service level indicators and see increases in error ratio on Snap's side (from 0% to 1%) and in latency from 30ms to 1800ms at p99. But the Google metrics are unaffected. #GoogleNext18
and they can keep debugging on their side rather than filing a support ticket. It turns out that they have a hotspot in query distribution in their GCE VMs that is pegging the CPU of one VM, so the problem is only on the client side and not with Spanner. #GoogleNext18
It saves Snap a lot of time and a lot of false alarm support tickets.

They get better visibility, more efficient debugging, and clear metrics to share with support when there is a problem on Google's side. #GoogleNext18
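
A minimal sketch of the triage logic in the pager scenario above, assuming you already have client-side and provider-side SLIs for the same calls. The `Sli` type, the thresholds, and the server-side numbers are hypothetical illustrations, not part of any Google Cloud API; the client-side numbers are the ones from the talk.

```python
from dataclasses import dataclass


@dataclass
class Sli:
    """One side's view of the same calls (hypothetical record type)."""
    error_ratio: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float   # 99th percentile latency in milliseconds


def degraded(baseline: Sli, current: Sli,
             max_error_increase: float = 0.005,
             max_latency_factor: float = 2.0) -> bool:
    """True if errors or p99 latency moved past the assumed thresholds."""
    return (current.error_ratio - baseline.error_ratio > max_error_increase
            or current.p99_latency_ms > baseline.p99_latency_ms * max_latency_factor)


def triage(client_base: Sli, client_now: Sli,
           server_base: Sli, server_now: Sli) -> str:
    """Decide which side of the call to investigate first."""
    client_bad = degraded(client_base, client_now)
    server_bad = degraded(server_base, server_now)
    if client_bad and not server_bad:
        return "client side: keep debugging your own stack"
    if server_bad:
        return "provider side: share these metrics with support"
    return "no degradation detected"


# Client-side numbers from the talk: errors 0% -> 1%, p99 30ms -> 1800ms.
client_before = Sli(error_ratio=0.0, p99_latency_ms=30.0)
client_during = Sli(error_ratio=0.01, p99_latency_ms=1800.0)
# Provider-side numbers are made up; the talk only says they were unaffected.
server_before = Sli(error_ratio=0.0, p99_latency_ms=10.0)
server_during = Sli(error_ratio=0.0, p99_latency_ms=10.0)

verdict = triage(client_before, client_during, server_before, server_during)
print(verdict)
```

With the provider-side metrics flat, the sketch points at the caller, which matches the outcome described in the talk.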
Back to @btreynor: this scenario happens every day in production. When the klaxon goes off, we have to figure out what happened. Historically people would have to file tickets and hope, all while users are suffering. Lowering your time to recover matters. #GoogleNext18
Transparent SLIs are available for all GCP services, and use the same SRE golden signals of latency, availability, etc.

You can triage your outages faster, and integrate with your dashboards [ed: and this works even if you're not a Stackdriver user e.g. w/ Datadog!]. #GoogleNext18
You can get a copy of the @SREWorkbook at the Fifth Nine lounge until midday! [ed: and you can of course view more info on how to practically use transparent SLIs at my talk with Marie ] #GoogleNext18
And now we're hearing from Morgan about how to debug poor application production performance with Stackdriver APM/Trace, with a live demo involving the Hipster Shop and pulling up the waterfall graph of spans. #GoogleNext18
He's showing how having the slow spans from tracing only partially helps, and that you need profiling to dive deep into specific spans without having to specifically annotate points in the span. #GoogleNext18
You'd want to start with a top-level metrics comparison, but if that doesn't show what you expected, comparing traces and profiles across versions lets you find out what got slower or faster with your changes. #GoogleNext18
[ed: we use profiling *all* the time within Google for thorny perf issues, and having it be more accessible is hugely valuable]
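
A rough sketch of what comparing profiles across versions amounts to, assuming each profile has been reduced to a mapping of function name to CPU milliseconds. The function names and numbers are invented for illustration, and this is not how Stackdriver Profiler represents its data.

```python
from __future__ import annotations


def profile_diff(before: dict[str, float],
                 after: dict[str, float]) -> list[tuple[str, float]]:
    """Return (function, delta_ms) pairs, largest slowdown first."""
    names = set(before) | set(after)
    # A function missing from one profile counts as 0 ms in that version.
    deltas = [(name, after.get(name, 0.0) - before.get(name, 0.0))
              for name in names]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


# Hypothetical per-function CPU time for two releases of the same service.
v1 = {"parse_cart": 12.0, "render": 40.0, "convert_currency": 8.0}
v2 = {"parse_cart": 12.5, "render": 38.0, "convert_currency": 95.0}

ranked = profile_diff(v1, v2)
for name, delta_ms in ranked:
    print(f"{name}: {delta_ms:+.1f} ms")
```

The top entry of `ranked` is the function that regressed most between versions, which is the starting point a slow span alone would not give you.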

There's also a new Stackdriver Debug functionality that allows running live execution and debugging in prod. #GoogleNext18
To use these tools, you can just link in the libraries and redeploy; they're open source, Apache-licensed & on GitHub.

Shoutout to OpenCensus! #GoogleNext18
OpenCensus exports to multiple different services; you don't just have to use Stackdriver, you can use Datadog, Prometheus, Zipkin, and so forth. At the same time. #GoogleNext18
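
A minimal sketch of the "instrument once, export to several backends at the same time" idea, using only the standard library. `FanOutRecorder` and the list-backed exporters are hypothetical stand-ins; this is not the OpenCensus API, whose real exporters are separate per-backend packages.

```python
from typing import Callable, Dict, List

Span = Dict[str, object]            # hypothetical span record
Exporter = Callable[[Span], None]   # anything that accepts a span


class FanOutRecorder:
    """Sends every recorded span to all registered exporters."""

    def __init__(self) -> None:
        self._exporters: List[Exporter] = []

    def add_exporter(self, exporter: Exporter) -> None:
        self._exporters.append(exporter)

    def record(self, span: Span) -> None:
        # The application records once; each backend gets its own copy.
        for exporter in self._exporters:
            exporter(dict(span))


# Two in-memory lists stand in for, say, a tracing backend and a metrics backend.
backend_a: List[Span] = []
backend_b: List[Span] = []

recorder = FanOutRecorder()
recorder.add_exporter(backend_a.append)
recorder.add_exporter(backend_b.append)
recorder.record({"name": "checkout", "duration_ms": 42})

print(len(backend_a), len(backend_b))
```

The application code only ever calls `record`, so adding or swapping a backend is a change to the exporter list rather than to the instrumentation.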
Now onto Q&A, with an extra incentive of @SREWorkbook copies to distribute to each person who asks a question, and Snap Spectacles for the best question. [livetweet ends here so I can help out & such] #GoogleNext18
okay, I'll tweet this one: question about how to start culturally transforming with SRE. When you have devs/ops w/ different goals, things fall through because collaboration stops under deadlines/stress

The best way to fix this is common incentives & error budgets. #GoogleNext18
Q: does this work in hybrid cloud? A: Stackdriver APM works regardless of environment but some of the logging stuff is dependent upon running in tagged VMs. #GoogleNext18
.@markcartertm hints that automated service dashboards are prevalent at Google when you use a common framework, and that there will be news by NEXT London on bringing this to the world. [ed: and people using Envoy and Istio often get this today] #GoogleNext18
Q: what are some of the common challenges people have with adopting SRE? A: some industries struggle with separation of roles in SOX etc. and not allowing people to modify and deploy code; other issues involve rules for data locality but tooling can help. #GoogleNext18
but a lot of the issues we see aren't specific to SRE and are generic to *any* operational team structure. #GoogleNext18
An open cloud question: will Stackdriver's backend be open-sourced? A: no, because unless you have a few hundred thousand servers lying around it doesn't make sense; but the client libraries are all open source and interoperate, so you can use the same API w/ Prometheus #GoogleNext18
.@btreynor emphasizing in response to a question about performance that APM is cheap compared to engineer time debugging outages. The impact in prod is unnoticeable, but even if it were 0.3% of CPU fleet-wide, it would still be worth it. #GoogleNext18
[ed: come talk to me at the Fifth Nine about my experiences turning on exemplars in combination with traces fleet-wide at Google]

Audience question about error budgets and tradeoffs involved; @btreynor discussing setting the right SLO number of nines. #GoogleNext18
It's okay to have a one-nine (90%) SLO, for instance, for pre-production testing.
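
A small worked example of how the number of nines translates into an error budget. The 30-day window is an assumption for illustration; the talk did not specify one.

```python
def availability_target(nines: int) -> float:
    """SLO target as a fraction, e.g. 3 nines -> 0.999."""
    return 1.0 - 1.0 / 10 ** nines


def error_budget_minutes(nines: int, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO allows per window."""
    window_minutes = window_days * 24 * 60
    return window_minutes / 10 ** nines


for nines in (1, 2, 3, 4):
    target = availability_target(nines)
    budget = error_budget_minutes(nines)
    print(f"{nines} nine(s): target {target:.4%}, budget {budget:.2f} min / 30 days")
```

Over 30 days, one nine allows about 4320 minutes (three days) of downtime, while three nines allows about 43 minutes, which is why the right number depends on what the service is for.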

Audience question about SRE training and certification: is Google doing it? [ed: waving my hands saying YES!!! come talk to me in the red jacket or go find us at the Fifth Nine] #GoogleNext18
It is a non-trivial scaling exercise and demand for CRE very much outstrips supply. [fin] #GoogleNext18
