Liz Fong-Jones (方禮真)
Jun 29, 2018 · 45 tweets
Post-lunch, I'm still in the Chaos track, now with @tammybutow presenting on Chaos and Resiliency! #QConNYC
@tammybutow How you apply chaos engineering depends upon the scale of your infrastructure. #QConNYC
It's like riding a bicycle; you can't just hop on and ride at full speed.

"The hello world of chaos engineering is a CPU attack." --@tammybutow #QConNYC
And once you can ride a bicycle, then you can drive a car, and perhaps drive an F1 car as you get more sophisticated at operating wheeled vehicles. It is a journey that could take multiple years. #QConNYC
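[ed: as a rough sketch of that "hello world" CPU attack -- my own hand-rolled illustration, not Gremlin's tooling -- you can just pin every core for a bounded time and check whether your dashboards and alerts notice:]

```python
# Minimal CPU "attack" sketch: burn all cores for a bounded duration.
# Hand-rolled illustration of the idea, not any vendor's implementation.
import multiprocessing
import time

def burn_cpu(seconds: float) -> None:
    """Busy-loop on one core until the deadline passes."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass  # pure spin; keeps one core pegged at ~100%

if __name__ == "__main__":
    duration = 60  # keep the blast radius small: one minute, one host
    workers = [
        multiprocessing.Process(target=burn_cpu, args=(duration,))
        for _ in range(multiprocessing.cpu_count())
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("CPU attack finished; check that your monitoring saw the spike.")
```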
Top 5 most popular targets for chaos engineering: k8s, Kafka, ECS, Cassandra... #QConNYC
(and Elasticsearch!)

Advanced use: build chaos experiments into your CI/CD pipeline so they run every day, rather than only as scheduled exercises. #QConNYC
[ed: I mentioned graceful degradation during the Google Microservices AMA yesterday] "Google is expert at designing services which you won't notice when there is downtime -- there will be slightly less accurate results or missing oneboxes..." #QConNYC
... and Gremlin's CEO asserts that chaos engineering is about testing that those graceful degradations work. #QConNYC
We have to be able to gracefully omit parts of our site that we aren't able to serve instead of leaving holes in our UI. It's a cross-functional effort involving product managers and UX, not just infrastructure engineering. #QConNYC
What are the implications of our services not working correctly? Sometimes it's small, but if you work in finance you could cost someone a mortgage and their dream home (and get your company fined by regulators). #QConNYC
"If you never get paged, you won't know what to do when a real failure happens or be able to train engineers." so @tammybutow used Chaos Engineering at dropbox to inject faults for people to train on. #QConNYC
Always be careful about affecting real customers if you can while doing chaos engineering. #QConNYC
Gremlin provides chaos engineering as a service, allowing simulations of packet loss, host shutdown, etc. with a local agent #QConNYC
Laying foundations: defining resiliency.

Resilient systems are highly available and durable. They can maintain acceptable service and weather the storm even with failures. #QConNYC
We need to know what results we want to achieve. Do thoughtful planned experiments to reveal weaknesses in our system. More like vaccines -- controlled chaos. #QConNYC
It's not yet widely practiced outside large companies, but holding game days or "failure Fridays" -- dedicated time to focus on chaos engineering and identify weaknesses in systems -- is a best practice. #QConNYC
Why do we need chaos for distributed systems? Unusual failures are common and hard to debug; as systems and orgs scale, chaos engineering helps us keep learning. #QConNYC
We can inject chaos at any layer -- API (e.g. ratelimiting, throttling, handling error codes...), app, ui, cache (e.g. empty cache -> hammered database), database, OS, host, network, power etc. #QConNYC
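[ed: for example, the cache-layer experiment from that list -- empty the cache and see whether the database absorbs the cold-cache stampede -- could be as simple as this sketch, assuming a Redis cache and the redis-py client; host/port/db are placeholders:]

```python
# Cache chaos sketch: flush the cache and observe the backing database.
# Assumes a Redis cache reachable on localhost; adjust host/port/db for your setup.
import time
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)

print("keys before flush:", cache.dbsize())
cache.flushdb()          # empty the cache -> reads now fall through to the database
print("keys after flush:", cache.dbsize())

# Watch database QPS, latency, and error-rate dashboards for the next few minutes
# to verify the backing store (and your alerting) handles the cold-cache stampede.
time.sleep(300)
```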
So why run these experiments? Are we confident that our metrics and alerting are as good as they should be? "Alert and dashboard auditing aren't that common but should be practiced more." [ed: yes.] #QConNYC
Do we know that customers are getting good experiences? Can we see customer pain? How is our collaboration with support teams? #QConNYC
Are we losing money due to downtime, broken features, and churn? #QConNYC
How do we run experiments? Need to form a hypothesis, consider the blast radius, run the experiment, measure results, then find/fix issues and repeat at larger scale. #QConNYC
Don't forget to have baseline metrics before you start experimenting. Don't run before you can walk; it's okay to start slow. Three key prerequisites: (1) monitoring & observability (e.g. 4 different systems :( :( ) #QConNYC
(2) Oncall and incident management. If you don't have any alerting and are manually watching dashboards, that's bad. You need a triage and incident management protocol to avoid treating all outages with the same severity. #QConNYC
(3) Know the cost of downtime per hour. [ed: or have clear Service Level Objectives so the acceptable budget is defined by/for you!] #QConNYC
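[ed: quick illustration of how an SLO turns into a concrete downtime budget; the target and period here are made-up numbers:]

```python
# Error-budget arithmetic sketch: how much failure does an SLO leave you per month?
slo = 0.999                      # hypothetical availability target
period_minutes = 30 * 24 * 60    # a 30-day month

allowed_downtime = (1 - slo) * period_minutes
print(f"{slo:.3%} SLO allows ~{allowed_downtime:.1f} minutes of downtime per month")
# 99.900% SLO allows ~43.2 minutes of downtime per month
```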
Tools that @tammybutow recommends: @datadoghq, @getsentry, and old fashioned Wireshark. #QConNYC
The most critical thing is having an IMOC rotation, says @tammybutow [ed: although a good end goal is empowering *every* engineer to become an incident commander]. #QConNYC
How do we choose what experiments to run? Identify your top 5 critical systems and pick one! Draw the system diagram out. Choose something to attack and determine the scope. #QConNYC
Things to measure in advance: availability/errors, KPIs like latency or throughput, system metrics, and customer complaints. We need to verify we can capture failures. Does our monitoring actually work? #QConNYC
gremlin.com/gameday is a toolkit for running your own gameday. Example: a chart of how many hosts we can affect and how much latency we're going to add to each. #QConNYC
Make sure you have a switch for turning off all chaos experiments in case of emergency. #QConNYC
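[ed: that kill switch can be as simple as a shared flag that every attack script checks before doing anything destructive -- a minimal sketch, with a hypothetical flag path:]

```python
# Kill-switch sketch: every chaos script checks a shared flag before doing harm.
# The flag location is an assumption; a feature flag or config service works too.
import os
import sys

KILL_SWITCH = "/etc/chaos/HALT_ALL_EXPERIMENTS"  # hypothetical path

def chaos_allowed() -> bool:
    """Abort if the emergency halt flag exists or the env override is set."""
    return not (os.path.exists(KILL_SWITCH) or os.environ.get("CHAOS_HALT") == "1")

if not chaos_allowed():
    print("Kill switch engaged -- aborting experiment.")
    sys.exit(1)

# ... proceed with the attack only past this point ...
```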
Think about what attacks you can run -- both on individual nodes, as well as on the edges between the nodes, says @tammybutow. #QConNYC
Verify that your k8s clusters are as self-healing as you think they are -- will they spin back up correctly if restarted? #QConNYC
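[ed: a sketch of that check using the official Kubernetes Python client; the namespace and label selector are placeholders, and you'd scope it to a non-critical workload first:]

```python
# Pod-restart chaos sketch using the official Kubernetes Python client.
# Namespace and label selector are placeholders; aim at a non-critical workload.
import random
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod("default", label_selector="app=my-service").items
victim = random.choice(pods)          # assumes at least one matching pod exists
print("deleting", victim.metadata.name)
v1.delete_namespaced_pod(victim.metadata.name, "default")

time.sleep(60)  # give the controller time to reschedule

survivors = v1.list_namespaced_pod("default", label_selector="app=my-service").items
ready = [p for p in survivors if p.status.phase == "Running"]
print(f"{len(ready)} running pods after deletion -- did the cluster heal itself?")
```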
Resource chaos is also important. Increase consumption of CPU, disk, I/O, and memory to ensure monitoring can catch problems. Make sure that you find limitations before you have to turn away customers. #QConNYC
github.com/tammybutow/cha… is a known-known experiment that tests situations we can anticipate and is a bicycle for learning. #QConNYC
Disk chaos -- issues like logs backing up. We can fill up the log partition on a replica or primary and make sure the system can recover. #QConNYC
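[ed: a disk-fill sketch along those lines; the target path and size are assumptions, and it should point at a test replica, never a primary you care about:]

```python
# Disk chaos sketch: fill a partition with junk data, then clean up.
# Target path and size are assumptions -- aim at a test replica's log volume.
import os

TARGET = "/var/log/chaos_filler.bin"   # hypothetical file on the log partition
CHUNK = b"\0" * (1024 * 1024)          # 1 MiB of zeroes
FILL_MIB = 512                         # how much space to consume

try:
    with open(TARGET, "wb") as f:
        for _ in range(FILL_MIB):
            f.write(CHUNK)
            f.flush()
    input(f"Filled {FILL_MIB} MiB -- check alerts, then press Enter to clean up.")
finally:
    if os.path.exists(TARGET):
        os.remove(TARGET)   # always release the space, even if interrupted
```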
"Use your experience of past outages to prevent future engineers from being burned in the same way." --@tammybutow #QConNYC
Memory chaos: what if we run out of memory? What if it's across all the fleet?
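[ed: a memory-pressure sketch; the amount and hold time are made up, and should start far below the host's real limit:]

```python
# Memory chaos sketch: allocate a fixed amount of RAM and hold it for a while.
# Size and hold time are assumptions; start far below the host's actual limit.
import time

GIB_TO_ALLOCATE = 2
HOLD_SECONDS = 120

hog = []
for _ in range(GIB_TO_ALLOCATE * 1024):
    hog.append(bytearray(1024 * 1024))   # grab memory 1 MiB at a time

print(f"Holding ~{GIB_TO_ALLOCATE} GiB -- watch for OOM kills, swap, and alerts.")
time.sleep(HOLD_SECONDS)
del hog   # release on exit
```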

Process chaos: kill or crashloop a process, forkbomb... #QConNYC
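[ed: a process-kill sketch using psutil; the process name is a placeholder, and the point is to verify that the supervisor restarts it and the alerting notices:]

```python
# Process chaos sketch: kill a named process and see whether it gets restarted.
# "my-worker" is a placeholder process name.
import psutil

TARGET_NAME = "my-worker"

for proc in psutil.process_iter(["pid", "name"]):
    if proc.info["name"] == TARGET_NAME:
        print(f"killing {TARGET_NAME} (pid {proc.info['pid']})")
        proc.kill()   # SIGKILL: no chance to clean up, just like a real crash

# Afterwards: did systemd/supervisord/k8s bring it back? Did anyone get paged?
```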
Shutdown chaos: turn off servers, or turn them off after a set lifetime. #QConNYC
k8s pods are a natural target for shutdowns and restarts. Or simulate a noisy-neighbor container that kills the other containers on its host. #QConNYC
The average lifetime of a container in prod is 2.5 days, and they die in many different ways. #QConNYC
Time chaos and clock skew: simulate time drift and different times. (And @tammybutow points out this could have been used for Y2K tests.)
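[ed: one low-risk way to explore clock skew is with a fake clock in tests, e.g. via the freezegun library; the far-future dates and the toy expiry check are my own assumptions:]

```python
# Time chaos sketch: use a fake clock in tests to simulate skew and far-future dates.
# freezegun and the toy expiry check are assumptions for illustration.
import datetime
from freezegun import freeze_time

def certificate_is_valid(not_after: datetime.datetime) -> bool:
    """Toy example of time-dependent logic worth testing under skew."""
    return datetime.datetime.utcnow() < not_after

expiry = datetime.datetime(2030, 1, 1)

with freeze_time("2029-12-31"):
    assert certificate_is_valid(expiry)          # still fine just before expiry

with freeze_time("2038-01-19"):                  # hello, year-2038 problem
    assert not certificate_is_valid(expiry)      # how does the system behave now?

print("time-skew checks passed")
```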

Network chaos: blackhole services, take down DNS. #QConNYC
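[ed: a crude in-process stand-in for DNS blackholing -- monkeypatch the resolver in a test and see how your code copes; a sketch, not real network-level fault injection:]

```python
# Network chaos sketch: simulate a DNS outage inside one process by breaking resolution.
# A crude in-process stand-in for real network-level fault injection.
import socket

_real_getaddrinfo = socket.getaddrinfo

def _blackholed_getaddrinfo(*args, **kwargs):
    raise socket.gaierror("chaos experiment: DNS is blackholed")

socket.getaddrinfo = _blackholed_getaddrinfo
try:
    socket.getaddrinfo("example.com", 443)   # any lookup now fails
except socket.gaierror as e:
    print("resolution failed as expected:", e)
    # Exercise your code path here: does it retry, fall back to a cache, or page someone?
finally:
    socket.getaddrinfo = _real_getaddrinfo   # always restore the real resolver
```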
Reproducing outages on demand lets us be confident we can handle them in the future. #QConNYC
What were the motivations for chaos engineering? For one, Dropbox and Uber's worst outages ever (both involving databases).

Resources: the gremlin community and chaosconf.io. [fin] #QConNYC
(slack link: gremlin.com/slack) #QConNYC

