Post-lunch, I'm still in the Chaos track, now with @tammybutow presenting on Chaos and Resiliency! #QConNYC
@tammybutow How you apply chaos engineering depends upon the scale of your infrastructure. #QConNYC
It's like riding a bicycle; you can't just hop on and ride at full speed.
"The hello world of chaos engineering is a CPU attack." --@tammybutow#QConNYC
And once you can ride a bicycle, then you can drive a car, and perhaps drive an F1 car as you get more sophisticated at operating wheeled vehicles. It is a journey that could take multiple years. #QConNYC
Top 5 most popular targets for chaos engineering: k8s, kafka, ECS, and cassandra #QConNYC
(and elasticsearch!)
Advanced use: build chaos experiments into your CI/CD pipeline so they run every day, rather than as scheduled exercises. #QConNYC
[ed: I mentioned graceful degradation during the Google Microservices AMA yesterday] "Google is expert at designing services which you won't notice when there is downtime -- there will be slightly less accurate results or missing oneboxes..." #QConNYC
... and Gremlin's CEO asserts that chaos engineering is about testing that those graceful degradations work. #QConNYC
We have to be able to gracefully omit parts of our site that we aren't able to serve instead of leaving holes in our UI. It's a cross-functional effort involving product managers and UX, not just infrastructure engineering. #QConNYC
What are the implications of our services not working correctly? Sometimes they're small, but if you work in finance you could cost someone the mortgage on their dream home (and get fined by regulators)! #QConNYC
"If you never get paged, you won't know what to do when a real failure happens or be able to train engineers." so @tammybutow used Chaos Engineering at dropbox to inject faults for people to train on. #QConNYC
Always be careful to avoid affecting real customers while doing chaos engineering. #QConNYC
Gremlin provides chaos engineering as a service, allowing simulations of packet loss, host shutdown, etc. with a local agent #QConNYC
Laying foundations: defining resiliency.
Resilient systems are highly available and durable. They can maintain acceptable service and weather the storm even with failures. #QConNYC
We need to know what results we want to achieve. Do thoughtful planned experiments to reveal weaknesses in our system. More like vaccines -- controlled chaos. #QConNYC
It's not widely practiced yet outside large companies, but setting aside dedicated time for game days or "failure Fridays" to focus on chaos engineering and identify weaknesses in systems is a best practice. #QConNYC
Why do we need chaos for distributed systems? Unusual failures are common and hard to debug; systems and orgs scale and chaos engineering helps us learn. #QConNYC
We can inject chaos at any layer -- API (e.g. rate limiting, throttling, handling error codes...), app, UI, cache (e.g. empty cache -> hammered database), database, OS, host, network, power, etc. #QConNYC
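[ed: cache chaos, for example, can be as blunt as emptying the cache and watching whether the database survives the resulting miss storm. A hypothetical sketch against a *test* Redis instance (host name is a placeholder), using the redis-py client:]

```python
# Hypothetical cache-chaos experiment: flush a TEST Redis instance and
# observe whether the database behind it copes with the extra load.
import redis

r = redis.Redis(host="cache.test.internal", port=6379)  # assumed test host
r.flushall()  # every subsequent read is now a cache miss -> hits the database
```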
So why run these experiments? Are we confident that our metrics and alerting are as good as they should be? "Alert and dashboard auditing aren't that common but should be practiced more." [ed: yes.] #QConNYC
Do we know that customers are getting good experiences? Can we see customer pain? How is our collaboration with support teams? #QConNYC
Are we losing money due to downtime, broken features, and churn? #QConNYC
How do we run experiments? Need to form a hypothesis, consider the blast radius, run the experiment, measure results, then find/fix issues and repeat at larger scale. #QConNYC
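[ed: that loop, written out as a runbook skeleton. The names and structure are my own framing of the talk, not a Gremlin API.]

```python
# Skeleton of a chaos experiment following the loop from the talk:
# hypothesis -> blast radius -> run -> measure -> find/fix -> repeat at larger scale.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Experiment:
    hypothesis: str               # e.g. "p99 latency stays under 500ms with 100ms added"
    blast_radius: str             # e.g. "one canary host in us-east-1"
    attack: Callable[[], None]    # the fault to inject
    measure: Callable[[], Dict]   # returns the metrics we care about

def run(exp: Experiment) -> Dict:
    baseline = exp.measure()      # capture baseline metrics before attacking
    exp.attack()                  # inject the fault
    observed = exp.measure()      # compare against the hypothesis afterwards
    return {"baseline": baseline, "observed": observed}
```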
Don't forget to capture baseline metrics before you start experimenting. Don't run before you can walk; it's okay to start slow. Three key prerequisites: (1) monitoring & observability (having e.g. 4 different monitoring systems is painful :( ) #QConNYC
(2) Oncall and incident management. If you don't have any type of alerting and are manually watching dashboards, that's bad. You need a triage and incident management protocol to avoid treating all outages with the same severity. #QConNYC
(3) Know the cost of downtime per hour. [ed: or have clear Service Level Objectives so the acceptable budget is defined by/for you!] #QConNYC
The most critical thing is having an IMOC (incident manager on-call) rotation, says @tammybutow [ed: although a good end goal is empowering *every* engineer to become an incident commander]. #QConNYC
How do we choose what experiments to run? Identify your top 5 critical systems and pick one! Draw the system diagram out. Choose something to attack and determine the scope. #QConNYC
Things to measure in advance: availability/errors, KPIs like latency or throughput, system metrics, and customer complaints. We need to verify we can capture failures. Does our monitoring actually work? #QConNYC
gremlin.com/gameday is a toolkit for running your own gameday. Example: a chart of how many hosts we can affect and how much latency we're going to add to each. #QConNYC
Make sure you have a switch for turning off all chaos experiments in case of emergency. #QConNYC
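[ed: one simple way to implement that -- a hypothetical pattern, not a specific product feature -- is to gate every attack behind a flag you can flip globally:]

```python
# Hypothetical global kill switch: every attack checks this flag before running,
# so flipping one environment variable (or feature flag) halts all chaos.
import os

def chaos_allowed() -> bool:
    """Chaos runs only while the kill switch is NOT engaged."""
    return os.environ.get("CHAOS_KILL_SWITCH", "off") != "on"

def run_attack(attack) -> None:
    if not chaos_allowed():
        print("Kill switch engaged -- skipping attack.")
        return
    attack()
```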
Think about what attacks you can run -- both on individual nodes, as well as on the edges between the nodes, says @tammybutow. #QConNYC
Verify that your k8s clusters are as self-healing as you think they are -- will they spin back up correctly if restarted? #QConNYC
Resource chaos is also important. Increase consumption of CPU, disk, I/O, and memory to ensure monitoring can catch problems. Make sure that you find limitations before you have to turn away customers. #QConNYC
Disk chaos -- issues like logs backing up. We can fill up the log partition on a replica or primary and make sure the system can recover. #QConNYC
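[ed: a sketch of that disk attack, assuming a disposable test host and a made-up partition path: fill the log partition with junk, then check that alerts fire and the system recovers.]

```python
# Illustrative disk-chaos attack: fill a (test) log partition with junk data
# until it crosses a usage threshold, then verify alerting and recovery.
import os
import shutil

MOUNT = "/var/log"                         # assumed log partition on a TEST replica
JUNK = os.path.join(MOUNT, "chaos_filler.bin")
TARGET_USAGE = 0.95                        # stop at 95% full rather than 100%

with open(JUNK, "wb") as f:
    while shutil.disk_usage(MOUNT).used / shutil.disk_usage(MOUNT).total < TARGET_USAGE:
        f.write(b"\0" * 1024 * 1024)       # write 1 MiB of zeros at a time
        f.flush()                          # make the usage numbers reflect reality
# Clean up afterwards: os.remove(JUNK)
```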
"Use your experience of past outages to prevent future engineers from being burned in the same way." --@tammybutow#QConNYC
Memory chaos: what if we run out of memory? What if it's across all the fleet?
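[ed: an illustrative memory attack, not anyone's production tooling: hold a fixed chunk of RAM for a while and watch whether monitoring and OOM handling behave the way you expect.]

```python
# Illustrative memory-chaos attack: allocate and hold a chunk of RAM for a
# fixed duration, then release it.
import time

HOLD_BYTES = 512 * 1024 * 1024   # 512 MiB -- keep the blast radius modest
HOLD_SECONDS = 120

ballast = bytearray(HOLD_BYTES)  # allocate ~512 MiB of zeroed memory
time.sleep(HOLD_SECONDS)         # hold it while you watch the dashboards
del ballast
```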
Process chaos: kill or crashloop a process, forkbomb... #QConNYC
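[ed: process chaos can be as simple as killing a target process and confirming its supervisor restarts it. A sketch using pgrep and os.kill; the process name is an assumption.]

```python
# Illustrative process-chaos attack: kill a target process and then verify
# that its supervisor (systemd, k8s, runit, ...) restarts it.
import os
import signal
import subprocess

# Find the PID of the (assumed) target process via pgrep (-o = oldest match).
pid = int(subprocess.check_output(["pgrep", "-o", "my-service"]).decode().split()[0])
os.kill(pid, signal.SIGKILL)  # the ungraceful variant; try SIGTERM first in practice
```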
Shutdown chaos: turn off servers, or turn them off after a set lifetime. #QConNYC
k8s pods are a natural target for shutdowns and restarts. Or simulate a noisy-neighbor container that kills the other containers on its host. #QConNYC
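[ed: a sketch of pod-level shutdown chaos with the official Kubernetes Python client; the namespace and label selector are assumptions, and you'd point this at a test cluster.]

```python
# Illustrative pod-shutdown chaos: delete one randomly chosen pod matching a
# label selector and verify that the deployment reschedules it.
import random
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig pointing at a test cluster
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod("staging", label_selector="app=my-service").items
victim = random.choice(pods)
v1.delete_namespaced_pod(victim.metadata.name, "staging")
```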
The average lifetime of a container in prod is 2.5 days, and they die in many different ways. #QConNYC
Time chaos and clock skew: simulate time drift and different times. (And @tammybutow points out this could have been used for Y2K tests.)
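[ed: a hedged sketch of time chaos at the application level -- a hypothetical wrapper of my own, not a system-clock change: feed the code under test a skewed clock and see what breaks (token TTLs, certificate expiry, scheduled jobs...).]

```python
# Illustrative time-chaos helper: serve application code a skewed clock so you
# can test how it behaves under drift.
import time
from datetime import datetime, timedelta

SKEW = timedelta(days=365)  # pretend we're a year in the future

def skewed_now() -> datetime:
    """Drop-in replacement for datetime.now() in the code under test."""
    return datetime.now() + SKEW

def skewed_time() -> float:
    """Drop-in replacement for time.time() in the code under test."""
    return time.time() + SKEW.total_seconds()
```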
Network chaos: blackhole services, take down DNS. #QConNYC
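[ed: a sketch of a blackhole attack using iptables via subprocess; Linux only, root required, and the target CIDR is a placeholder.]

```python
# Illustrative network-chaos attack: blackhole outbound traffic to a dependency
# by dropping packets with iptables.
import subprocess

TARGET = "10.0.42.0/24"  # placeholder CIDR for the dependency to blackhole

subprocess.run(["iptables", "-A", "OUTPUT", "-d", TARGET, "-j", "DROP"], check=True)
# ...run the experiment, then remove the rule:
# subprocess.run(["iptables", "-D", "OUTPUT", "-d", TARGET, "-j", "DROP"], check=True)
```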
Reproducing outages on demand lets us be confident we can handle them in the future. #QConNYC
What were the motivations for chaos engineering? For one, Dropbox and Uber's worst outages ever (both involving databases).
Final talk I'll be getting to at #VelocityConf before I dash to Toronto: @IanColdwater on improving container security on k8s.
@IanColdwater She focuses on hardening her employer's cloud container infrastructure, including doing work on k8s.
She also was an ethical hacker before she went into DevOps and DevSecOps. #VelocityConf
She travels around doing competitive hacking with CTFs. It's important to think like an attacker rather than assuming good intent and nice user personas who use our features the way the devs intended. #VelocityConf
My colleague @sethvargo on microservice security at #VelocityConf: traditionally we've thought of security as all-or-nothing -- you put the biggest possible padlock on your perimeter, and you have a secure zone and an untrusted zone.
@sethvargo We know that monoliths don't actually work, so we're moving towards microservices. But how does this change your security model?
You might have a loadbalancer that has software-defined rules. And you have a variety of compartmentalized networks. #VelocityConf
You might also be communicating with managed services such as Cloud SQL that are outside of your security perimeter.
You no longer have one resource, firewall, loadbalancer, and security team. You have many. Including "Chris." #VelocityConf
The problems we're solving: (1) why are monoliths harder to migrate? (2) Should you? (3) How do I start? (4) Best practices #VelocityConf
.@krisnova is a Gaypher (gay gopher), a k8s maintainer, and involved in two k8s SIGs (cluster lifecycle & AWS, though she likes all the clouds, depending upon the day). And she did SRE before becoming a Dev Advocate! #VelocityConf
"just collect data and figure out later how you'll use it" doesn't work any more. #VelocityConf
We used to be optimistic before we ruined everything.
Mozilla also used to avoid collecting data and only knew the number of downloads; its market share went down because it wasn't measuring user satisfaction and actual usage. #VelocityConf