Liz Fong-Jones (方禮真)
Jun 28, 2018
Next up is @micheletitolo on how to successfully operate microservices, continuing the theme @adam7mck and I started on microservice observability. #QConNYC
"What I noticed is that nobody really defined what a microservice is, yet we've been hearing about distributed systems... microservices are a distributed system." -- @micheletitolo #QConNYC
We do it for speed, for safety, and to cut costs, even though getting started sometimes has costs of its own. #QConNYC
We all start off with pretty, straight-looking lines, but it winds up getting messy over time. The more pieces and connections you add, the more likely something will go wrong. #QConNYC
How do we figure out when something goes wrong? And once you've figured it out, how do you fix the problem? We have new challenges we didn't have in a monolithic world. #QConNYC
Microservices are about more than the size of the application. You need an ecosystem. We need to adapt our applications and/or infrastructure. #QConNYC
If your infra, tooling, and deploys aren't there, you'll always be playing catch-up -- like having your deploys take hours. #QConNYC
Three key areas to create that foundation: Deployment, Scaling, and Debugging. #QConNYC
What are the deployment best practices? Small changes, frequent releases, and consistent releases that are standardized and have supported tooling (no manual releases). #QConNYC
Limit your number of special snowflakes. Two quickly leads to three or four. People see what you allow to happen inside your system. #QConNYC
Invest in your deployment tooling. Automate as much as possible, don't have humans touching your deploys. #QConNYC
Do staged deployments, so that you can feel confident that things are going right. Canarying requires having a routing service upstream, and deploying to an instance serving a small percentage of traffic first. #QConNYC
And then you roll out to more and more servers, and then the old version is gone. You know that it can handle your production traffic/ecosystem and everything is good. #QConNYC
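[ed: to make the canary mechanics concrete, here's a minimal Go sketch of percentage-based routing in front of two versions; the backend addresses and the 5% split are my illustrative assumptions, not from the talk.]

```go
// Hypothetical canary router: send a small slice of traffic to the new
// version, the rest to the stable one.
package main

import (
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	stable, _ := url.Parse("http://app-stable:8080") // current version
	canary, _ := url.Parse("http://app-canary:8080") // new version
	stableProxy := httputil.NewSingleHostReverseProxy(stable)
	canaryProxy := httputil.NewSingleHostReverseProxy(canary)

	canaryPercent := 5 // ramp this up as the canary proves healthy

	http.ListenAndServe(":80", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if rand.Intn(100) < canaryPercent {
			canaryProxy.ServeHTTP(w, r)
			return
		}
		stableProxy.ServeHTTP(w, r)
	}))
}
```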
The other alternative is blue/green or red/black, which has two parallel deployments of the application and duplicates the entire environment. It costs twice as much to do. #QConNYC
We do a cutover between the old and new systems.

Both of these techniques allow doing automatic rollbacks. How do we know whether a deployment is successful? #QConNYC
We need to have features and tools to validate the success of our system -- not just the one service. Robust unit/integration tests are needed.

"You need to test the space between." -- @micheletitolo #QConNYC
Frequent deployments mean frequent testing means good tooling.

You need standardized healthchecks. Everything should use the same convention (port, URL, contents). #QConNYC
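[ed: a minimal sketch of what a standardized healthcheck could look like in Go; the /healthz path and port are a common convention I'm assuming, not something the talk prescribed.]

```go
// Every service exposes the same endpoint on the same port so tooling can
// check any of them uniformly.
package main

import (
	"encoding/json"
	"net/http"
)

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// A real check would verify dependencies (DB, queue) before
		// reporting healthy.
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
	})
	http.ListenAndServe(":8081", nil)
}
```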
Consume your dependencies and secrets without recompiling, so that you're able to push the same binaries without modification. #QConNYC
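[ed: e.g., reading config from the environment at startup, so one compiled binary serves every environment; the variable name is illustrative.]

```go
// Same artifact in staging and prod: only the environment differs.
package main

import (
	"log"
	"os"
)

func main() {
	dbURL := os.Getenv("DATABASE_URL") // hypothetical variable name
	if dbURL == "" {
		log.Fatal("DATABASE_URL must be set") // fail fast on missing config
	}
	// ... connect using dbURL; no recompile needed per environment.
	_ = dbURL
}
```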
Onto the system: we need to be able to aggregate healthchecks. You shouldn't need to log into a bunch of servers to figure out if they're working. #QConNYC
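[ed: a sketch of that aggregation idea: fan out to every service's healthcheck concurrently instead of logging into boxes. The service names assume the /healthz convention above.]

```go
// Poll all services' healthchecks in parallel and report one rollup.
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

func main() {
	services := []string{ // hypothetical service addresses
		"http://users:8081", "http://orders:8081", "http://billing:8081",
	}
	client := &http.Client{Timeout: 2 * time.Second}

	var wg sync.WaitGroup
	for _, svc := range services {
		wg.Add(1)
		go func(svc string) {
			defer wg.Done()
			resp, err := client.Get(svc + "/healthz")
			if err != nil {
				fmt.Printf("%s: UNREACHABLE (%v)\n", svc, err)
				return
			}
			defer resp.Body.Close()
			if resp.StatusCode != http.StatusOK {
				fmt.Printf("%s: UNHEALTHY (status %d)\n", svc, resp.StatusCode)
				return
			}
			fmt.Printf("%s: ok\n", svc)
		}(svc)
	}
	wg.Wait()
}
```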
You need your system to be proactive about alerting you when things are broken. You can still get a coffee, but you might be paged to let you know an automatic rollback happened and you need to investigate. #QConNYC
Everyone needs to be able to see the status of deployments of their own team and of other teams. #QConNYC
More advanced technique: bots. To recap: deployments require changing the application and the infrastructure. Once we've deployed, we're done, right? No. #QConNYC
Distributed systems are constantly changing. Scaling happens when our application's load varies. We need health tracking to see our load. #QConNYC
Look at things like RAM, CPU, and latency. But you also need custom metrics, such as queue length.
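[ed: a sketch of exposing a custom queue-depth gauge with the Prometheus Go client -- my choice of library and metric name; the talk didn't prescribe one.]

```go
// Expose queue depth alongside the usual RAM/CPU/latency signals.
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var queueDepth = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "work_queue_depth", // hypothetical metric name
	Help: "Jobs currently waiting in the work queue.",
})

func main() {
	prometheus.MustRegister(queueDepth)
	queueDepth.Set(42) // in real code: queueDepth.Set(float64(len(jobs)))
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}
```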

How do we scale? Automation. Please don't hand-scale your systems. #QConNYC
Smart systems enable us to scale up when under pressure, and down when the resources are no longer needed. #QConNYC
Your metrics can trigger automatic scaling. Scaling up and down are different cases -- for up, we're just deploying more instances, waiting for them to be healthy, and sending traffic to them. #QConNYC
For scaling down, detect when instances aren't being fully used to save money. Stop routing traffic to the server and gracefully shut down. #QConNYC
Report progress to the scaling automation to tell it that your individual application instance is ready to be terminated; 0% CPU alone is not a good indicator. #QConNYC
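[ed: the graceful-shutdown half in Go might look like this minimal sketch: stop taking new requests, drain in-flight ones, then exit so the scaler can reclaim the instance.]

```go
// Drain on SIGTERM instead of dropping in-flight requests.
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	go srv.ListenAndServe()

	// Wait for the orchestrator/scaler to ask us to stop.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Finish active requests, bounded by a deadline.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	srv.Shutdown(ctx) // returns once connections drain or the deadline hits
}
```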
What do we mean by routing traffic? Loadbalancer or service discovery. All public cloud providers have loadbalancing available. Make sure you know how to use it; it's easier than running your own. #QConNYC
You can also attach scaling groups with the public cloud's products if you want. #QConNYC
For service discovery, we route via convention; this takes us from an unknown state of errors we can't interpret (network?) to a known state of *why* the error happened (e.g. the target service was unavailable). #QConNYC
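[ed: one concrete form of convention-based discovery is a DNS SRV lookup, e.g. against Consul; the service name here is hypothetical.]

```go
// Resolve "where is the users service?" by convention, not hardcoded IPs.
package main

import (
	"fmt"
	"net"
)

func main() {
	_, addrs, err := net.LookupSRV("users", "tcp", "service.consul")
	if err != nil {
		// Now a known state: the target service is unregistered or
		// unavailable, not an uninterpretable network error.
		fmt.Println("users service unavailable:", err)
		return
	}
	for _, a := range addrs {
		fmt.Printf("users endpoint: %s:%d\n", a.Target, a.Port)
	}
}
```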
Scaling only solves so many problems. Onto troubleshooting. First we need to know there's a problem. #QConNYC
We need to know what qualifies as a problem. Not every exception or timeout matters. It may be unactionable. #QConNYC
Nuisance pages suck. It varies per application. RAM, CPU, Latency [ed: :( to RAM/CPU], but you may want less obvious things e.g. if your service has scaled 7 times in the past hour, you might have a memory leak. #QConNYC
You can also alert on your Key Performance Indicators or SLAs [ed: yup, this is what I advocated]. Alerts are for known issues that we can think of in advance. #QConNYC
You also need dashboards to be able to see in aggregate what is going on in your system. Humans make better connections when they visualize data. #QConNYC
Standard debugging 101 -- look at the application causing the page, and pretend SSH doesn't exist. Use your logs (you did set up collection/aggregation, right?), and potentially increase log levels on the fly. #QConNYC
If it's going to take time to fix, escalate. Avoid cascading failures. The failure of one application shouldn't bring everything down. #QConNYC
Identification: figure out what parts of your system depend upon each other. You need request tracing, so that we can follow requests through the system. #QConNYC
Works best as an overlay. Envoy or OpenTracing. Even adding the header manually is better than nothing. #QConNYC
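[ed: "adding the header manually" could look like this sketch: middleware that assigns a request ID when missing so a request can be followed across services. X-Request-ID is a common convention, and the uuid dependency is my assumption.]

```go
// Attach/propagate a request ID so log lines across services can be joined.
package main

import (
	"net/http"

	"github.com/google/uuid" // assumed dependency for ID generation
)

func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-ID")
		if id == "" {
			id = uuid.NewString() // first hop: mint an ID
		}
		r.Header.Set("X-Request-ID", id) // visible to downstream handlers
		w.Header().Set("X-Request-ID", id)
		// Any outbound call made while serving r should copy this header:
		//   outReq.Header.Set("X-Request-ID", id)
		next.ServeHTTP(w, r)
	})
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello, " + r.Header.Get("X-Request-ID")))
	})
	http.ListenAndServe(":8080", withRequestID(http.DefaultServeMux))
}
```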
Isolation: we need circuit breaking to drop queries if latencies or errors increase. It isolates services to return them to a known state. #QConNYC
Circuit breakers let our application recover while people are debugging, rather than hammering people with peak traffic. #QConNYC
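[ed: a hand-rolled sketch of the core circuit-breaker idea -- real deployments would use Hystrix, Envoy, or similar: after N consecutive failures, fail fast until a cooldown elapses.]

```go
// After `threshold` consecutive failures the breaker opens and calls fail
// fast until `cooldown` elapses, giving the dependency room to recover.
package main

import (
	"errors"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit open: failing fast")

type Breaker struct {
	mu        sync.Mutex
	failures  int
	threshold int
	cooldown  time.Duration
	openUntil time.Time
}

func NewBreaker(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if time.Now().Before(b.openUntil) {
		b.mu.Unlock()
		return ErrOpen // don't hammer a struggling dependency
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openUntil = time.Now().Add(b.cooldown) // open the circuit
			b.failures = 0
		}
		return err
	}
	b.failures = 0 // a success resets the count
	return nil
}

func main() {
	b := NewBreaker(5, 30*time.Second)
	_ = b.Call(func() error { return nil }) // wrap each downstream call
}
```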
So now we think we know what's gone wrong, and can deploy a fix. Slowly ramp your traffic. #QConNYC
Scaling back up can ideally be done automatically. Built-in loadbalancers will do it for you, or you can do it manually if necessary. #QConNYC
What happens if the problem isn't one of our apps and instead is an external dependency?

External means we can't see logs, can't see source, or can't deploy a fix on our own. #QConNYC
If you can see everyone's source code and can make a PR, but can't commit or deploy, even within the same company, it's the external case. #QConNYC
Check the status page or monitor it; but you may need to raise an issue if nothing is posted yet.

But we can also mitigate: use request tracing to find the failure and figure out who to talk to, or circuit-break and degrade gracefully. #QConNYC
"Do as much as you can to keep as much as you can working." -- @micheletitolo #QConNYC
But sometimes everything breaks under your infrastructure (e.g. AWS S3 Outage in 2017). Hopefully you can mitigate, but you should have at least some degree of error handling. "S3 always works"... until it doesn't. #QConNYC
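[ed: one shape of "keep as much as you can working": fall back to a stale cached copy when the external store errors. The Store/Cache types here are hypothetical stand-ins, not a real SDK.]

```go
// Serve stale data rather than failing outright when the backend is down.
package main

import (
	"fmt"
	"sync"
)

// Store stands in for an external dependency like S3.
type Store interface {
	Get(key string) ([]byte, error)
}

// Cache keeps the last good copy of each object.
type Cache struct {
	mu sync.RWMutex
	m  map[string][]byte
}

func (c *Cache) Get(key string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[key]
	return v, ok
}

func (c *Cache) Set(key string, v []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.m == nil {
		c.m = map[string][]byte{}
	}
	c.m[key] = v
}

// fetchAsset degrades gracefully: on a store error, serve the last good copy.
func fetchAsset(store Store, cache *Cache, key string) ([]byte, error) {
	data, err := store.Get(key)
	if err == nil {
		cache.Set(key, data) // refresh the fallback copy
		return data, nil
	}
	if stale, ok := cache.Get(key); ok {
		return stale, nil // degraded but still working
	}
	return nil, fmt.Errorf("no cached copy for %s: %w", key, err)
}

type flakyStore struct{}

func (flakyStore) Get(key string) ([]byte, error) {
	return nil, fmt.Errorf("store unavailable")
}

func main() {
	cache := &Cache{}
	cache.Set("logo.png", []byte("cached bytes"))
	data, err := fetchAsset(flakyStore{}, cache, "logo.png")
	fmt.Println(string(data), err) // serves the cached copy: degraded, not down
}
```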
To recap, debugging internal apps requires logging, tracing, and circuit breaking. For external apps, trace and circuit-break. #QConNYC
All of these things share in common the issue of visibility. It's harder to see what's going on in large distributed systems and you can't observe what you don't see. #QConNYC
"Observability is not free." -- @micheletitolo #QConNYC
You need healthchecks, circuit breakers, logging, alerting, and the ability to shut down gracefully.

Your infra should consume healthchecks, do circuit breaking, loadbalancing, log aggregation [and specific log reading], and automated deploys/rollbacks. #QConNYC
And have lots of dashboards [ed: or an interactive querying system instead of too many dashboards] #QConNYC
Running microservices successfully requires smarter infrastructure. Ending with a @krisnova quote: Infrastructure won't evolve on its own. [fin] #QConNYC
Answering an audience question, @micheletitolo says that change can be incremental -- identify gaps and figure out what to automate or measure first based on your pain points. #QConNYC
Another audience question: how to monitor your circuit breakers. @micheletitolo says that open source tools like Hystrix can show you a control plane and dashboards for all of your circuits. #QConNYC
On the subject of memory leaks: if you can roll back, roll back, otherwise you can either spend a ton of money or go down until you can get a fix into place [or do rolling restarts every few hours if it's a slow leak]. #QConNYC


