Next up is @micheletitolo on how to successfully operate microservices, continuing the theme @adam7mck and I started on microservice observability. #QConNYC
"What I noticed is that nobody really defined what a microservice is, yet we've been hearing about distributed systems... microservices are a distributed system." -- @micheletitolo#QConNYC
We do it for speed, safety, and to cut costs, even though there are sometimes costs associated with getting started. #QConNYC
We all start off with pretty, straight-looking lines, but it winds up getting messy over time. The more pieces and connections you add, the more likely something will go wrong. #QConNYC
How do we figure out when something goes wrong? And once you've figured it out, how do you fix the problems? We have new challenges we didn't have even in a monolithic world. #QConNYC
They're about more than the size of the application. You need an ecosystem. We need to adapt our applications and our infrastructure. #QConNYC
If your infra, tooling, and deploys aren't there, you'll always be playing catch-up -- like having your deploys take hours. #QConNYC
Three key areas to create that foundation: Deployment, Scaling, and Debugging. #QConNYC
What are the deployment best practices? Small changes, frequent releases, and consistent releases that are standardized and have supported tooling (no manual releases). #QConNYC
Limit your number of special snowflakes. Two quickly leads to three or four. People see what you allow to happen inside your system. #QConNYC
Invest in your deployment tooling. Automate as much as possible, don't have humans touching your deploys. #QConNYC
Do staged deployments, so that you can feel confident that things are going right. Canarying requires having a routing service upstream, and deploying to an instance serving a small percentage of traffic first. #QConNYC
Then you roll out to more and more servers until the old version is gone. You know the new version can handle your production traffic/ecosystem and everything is good. #QConNYC
The alternative is blue/green (or red/black), which runs two parallel deployments of the application, duplicating the entire environment. It costs twice as much to do. #QConNYC
We do a cutover between the old and new systems.
Both of these techniques allow doing automatic rollbacks. How do we know whether a deployment is successful? #QConNYC
We need to have features and tools to validate the success of our system -- not just the one service. Robust unit/integration tests are needed.
Frequent deployments mean frequent testing means good tooling.
You need standardized healthchecks. Everything should use the same technique (port, url, contents) #QConNYC
Consume your dependencies and secrets without recompiling, so that you're able to push the same binaries without modification. #QConNYC
Onto the system: we need to be able to aggregate healthchecks. You shouldn't need to log into a bunch of servers to figure out if they're working. #QConNYC
You need your system to be proactive about alerting you when things are broken. You still can get a coffee, but you might be paged to let you know an automatic rollback happened and you need to investigate. #QConNYC
Everyone needs to be able to see the status of deployments of their own team and of other teams. #QConNYC
More advanced technique: bots. To recap: deployments require changing the application and the infrastructure. Once we've deployed, we're done, right? No. #QConNYC
Distributed systems are constantly changing. Scaling happens when our application's load varies. We need health tracking to see our load. #QConNYC
Look at things like RAM, CPU, and Latency. But you also need custom metrics such as the queue length.
How do we scale? Automation. Please don't hand-scale your systems. #QConNYC
Smart systems enable us to scale up when under pressure, and down when the resources are no longer needed. #QConNYC
Your metrics can trigger automatic scaling. Scaling up and down are different cases -- for up, we're just deploying more instances, waiting for them to be healthy, and sending traffic to them. #QConNYC
For scaling down, detect when instances aren't being fully used to save money. Stop routing traffic to the server and gracefully shut down. #QConNYC
Report progress to the scaling automation to tell it that your individual application is ready to be terminated. 0% CPU is not a good indication. #QConNYC
What do we mean by routing traffic? Loadbalancer or service discovery. All public cloud providers have loadbalancing available. Make sure you know how to use it; it's easier than running your own. #QConNYC
You can also attach scaling groups with the public cloud's products if you want. #QConNYC
For service discovery, we route via convention; we can get from an unknown state of errors that we can't interpret (network?) to a known state of *why* the error happened (e.g. the target service was unavailable) #QConNYC
Scaling only solves so many problems. Onto troubleshooting. First we need to know there's a problem. #QConNYC
We need to know what qualifies as a problem. Not every exception or timeout matters. It may be unactionable. #QConNYC
Nuisance pages suck. What to alert on varies per application: RAM, CPU, latency [ed: :( to RAM/CPU], but you may also want less obvious things, e.g. if your service has scaled 7 times in the past hour, you might have a memory leak. #QConNYC
You can also alert on your Key Performance Indicators or SLAs [ed: yup, this is what I advocated]. Alerts are for known issues that we can think of in advance. #QConNYC
You also need dashboards to be able to see in aggregate what is going on in your system. Humans make better connections when they visualize data. #QConNYC
Standard debugging 101 -- look at the application causing the page, and pretend SSH doesn't exist. Use your logs (you did set up collection/aggregation, right?), and potentially increase log levels on the fly. #QConNYC
If it's going to take time to fix, escalate. Avoid cascading failures. The failure of one application shouldn't bring everything down. #QConNYC
Identification: figure out what parts of your system depend upon each other. You need request tracing, so that we can follow requests through the system. #QConNYC
Works best as an overlay. Envoy or OpenTracing. Even adding the header manually is better than nothing. #QConNYC
Isolation: we need circuit breaking to drop queries if latencies or errors increase. It isolates services to return them to a known state. #QConNYC
Circuit breakers let our application recover while people are debugging, rather than continuing to hammer it with peak traffic. #QConNYC
So now we think we know what's gone wrong, and can deploy a fix. Slowly ramp your traffic. #QConNYC
Scaling back up can ideally be done with automation. Built-in loadbalancers will do it for you, or you can do it manually if necessary. #QConNYC
What happens if the problem isn't one of our apps and instead is an external dependency?
External: can't see logs, or can't see source, or we can't deploy a fix on our own. #QConNYC
If you can see everyone's source code and can make a PR, but can't commit or deploy, even within the same company, it's the external case. #QConNYC
Check the status page or monitor it; but you may need to raise an issue if nothing is posted yet.
But we can also mitigate by figuring out who to talk to with request tracing to find the failure, or circuit breaking/degrading gracefully. #QConNYC
But sometimes everything breaks under your infrastructure (e.g. AWS S3 Outage in 2017). Hopefully you can mitigate, but you should have at least some degree of error handling. "S3 always works"... until it doesn't. #QConNYC
To recap, debugging internal apps requires logging, tracing, and circuit breaking. For external apps, trace and circuitbreak. #QConNYC
All of these things share in common the issue of visibility. It's harder to see what's going on in large distributed systems and you can't observe what you don't see. #QConNYC
You need healthchecks, circuit breakers, logging, alerting, and the ability to shut down gracefully.
Your infra should consume healthchecks, do circuit breaking, loadbalancing, log aggregation [and specific log reading], and automated deploys/rollbacks. #QConNYC
And have lots of dashboards [ed: or an interactive querying system instead of too many dashboards] #QConNYC
Running microservices successfully requires smarter infrastructure. Ending with a @krisnova quote: Infrastructure won't evolve on its own. [fin] #QConNYC
Answering an audience question, @micheletitolo says that change can be incremental -- identify gaps and figure out what to automate or measure first based on your pain points. #QConNYC
Another audience question: how to monitor your circuit breakers. @micheletitolo says that open source tools like Hystrix can show you a control plane and dashboards for all of your circuits. #QConNYC
On the subject of memory leaks: if you can roll back, roll back, otherwise you can either spend a ton of money or go down until you can get a fix into place [or do rolling restarts every few hours if it's a slow leak]. #QConNYC
Final talk I'll be getting to at #VelocityConf before I dash to Toronto: @IanColdwater on improving container security on k8s.
@IanColdwater She focuses on hardening her employer's cloud container infrastructure, including doing work on k8s.
She also was an ethical hacker before she went into DevOps and DevSecOps. #VelocityConf
She travels around doing competitive hacking with CTFs. It's important to think like an attacker rather than assuming good intents and nice user personas that use our features in the way the devs intended things to be used. #VelocityConf
My colleague @sethvargo on microservice security at #VelocityConf: traditionally we've thought of security as all-or-nothing -- that you put the biggest possible padlock on your perimeter, and you have a secure zone and untrusted zone.
@sethvargo We know that monoliths don't actually work, so we're moving towards microservices. But how does this change your security model?
You might have a loadbalancer that has software-defined rules. And you have a variety of compartmentalized networks. #VelocityConf
You might also be communicating with managed services such as Cloud SQL that are outside of your security perimeter.
You no longer have one resource, firewall, loadbalancer, and security team. You have many. Including "Chris." #VelocityConf
The problems we're solving: (1) why are monoliths harder to migrate? (2) Should you? (3) How do I start? (4) Best practices #VelocityConf
.@krisnova is a Gaypher (gay gopher), is a k8s maintainer, and is involved in two k8s SIGs (cluster lifecycle & aws, but she likes all the clouds. depending upon the day). And she did SRE before becoming a Dev Advocate! #VelocityConf
"just collect data and figure out later how you'll use it" doesn't work any more. #VelocityConf
We used to be optimistic before we ruined everything.
Mozilla also used to not collect data, and only had data on number of downloads, but its market share went down because they weren't measuring user satisfaction and actual usage. #VelocityConf