Liz Fong-Jones (方禮真)

@lizthegrey

Jun 29, 2018 • 59 tweets • 22 min read

Onto the Chaos Engineering track with me, where I'll be livetweeting @rynchantress on Resilience and Human Interventions. #QConNYC

Finally, @rynchantress gets to tell their side of the story of nearly breaking etsy.com and winning the three-armed sweater. #QConNYC

Once upon a time a few years ago, they were doing server provisioning for Etsy, who were operating their own datacenters and had a "lovingly hand-crafted set of tools to transform a server from a newly racked server into a webserver or database server." #QConNYC

There was a yak involved, shockingly, and it involved Apache versions. Chef comes along and says that it needs to install Apache. But the version Chef wanted to install was older than the one in the current yum mirror. #QConNYC

They sighed, because this happened every three weeks anyway. Unfortunately, the yum repository would update itself and delete the older versions, and Chef was told not to downgrade, so the provisioning would fail. #QConNYC

"I just wanted to test this thing that had nothing to do with chef and apache." So why not just update the version pinned in Chef? They'd done it a million times before. #QConNYC

Tested it locally, it was one point release newer, smallest possible change. Great, let's roll it out in Chef everywhere, everyone's going to have a good time. [ed: this is starting to make my SRE spidey-sense tingle...] #QConNYC

The goal: new servers would get the new version of Apache, old servers would have a no-op. Let's log into a webserver and make sure it does nothing. Whoops, it upgraded Apache, which it wasn't supposed to do, and the upgrade broke Apache. #QConNYC

Chef ran every 10 minutes across THE ENTIRE INFRASTRUCTURE. [ed: XD XD XD]. This change, which we had just verified did the wrong thing, was about to roll out everywhere. #QConNYC

"So you're about to get paged," said @rynchantress to their friend next to them who was oncall. "Sorry about that." #QConNYC

Apache was in fact not running on... all of the webservers. Oh no, this is as bad as we thought. #QConNYC

[ed: these graphics of cats hammering on keyboards are so on point]. There was nothing we could do to stop Chef from self-"healing" everywhere to the wrong version. #QConNYC

So we used Slack to let people know things had gone "slightly awry". So we tried to figure out if we could roll back. #QConNYC

But the Yum repository was configured to keep only the newest version, and the old version had already been deleted from Yum. #QConNYC

So @rynchantress and their oncall buddy tried to go forward instead. Some more people jumped in. [ed: and now we have four cats banging on keyboards] The graphs needed Apache, so um... #QConNYC

How bad is the customer-facing issue? We went to etsy.com in a browser and discovered that somehow the site was still up. Really slow, but still up. #QConNYC

"People could still buy hand-made tea cozies very very slowly." --@rynchantress #QConNYC

PagerDuty, meanwhile, was still telling us that everything was on fire. Someone happened to figure out that Chef plus Apache = fire, but re-running Chef resulted in things working again. Why? #QConNYC

The site was barely hanging on, so... we decided to try doing the second chef run everywhere, and everything went back to normal! #QConNYC

"Sorry everyone, that was my bad, I was just trying to test this internal provisioning tool... well, we made some wonderful friendships along the way." --@rynchantress #QConNYC

So we did a postmortem and we all lived happily ever after. But what actually happened with all of that, and what can we learn from it? #QConNYC

It won a three-armed sweater not because it resulted in so many cat gifs, but because we learned the most from it. It wasn't the highest-impact incident of the year (only an hour of being slow). #QConNYC

Other examples of worse outages: making the site say "hi I'm nick" or have a giant cart icon or go hard down. #QConNYC

So how did the site stay up? The site should have gone down. There were exactly 7 php7 servers that had decided they didn't want to run Chef successfully, nobody was going to tell them what to do. #QConNYC

We had these rogue servers that never got around to installing the upgrade that did the wrong thing. Lesson 1: "keep 7 servers out of config... nope, consider fallbacks for automation." #QConNYC

Distrust your automation. How will you detect problems and test the automation? Can you turn it off? Can you do it manually? #QConNYC

Persistently failing Chef on developer VMs was noise that swamped the 7 production servers that were legitimately failing. #QConNYC
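One way to keep that kind of noise from hiding real failures is to group failing runs by environment before anyone looks at them. The following is a minimal sketch, not Etsy's tooling: the `RunReport` record, the environment labels, and the host names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunReport:
    """One configuration-management run on one host (hypothetical record)."""
    host: str
    environment: str  # assumed labels, e.g. "production" or "dev-vm"
    succeeded: bool


def failing_hosts_by_environment(reports):
    """Group hosts whose run failed by the environment they belong to."""
    failures = {}
    for report in reports:
        if not report.succeeded:
            failures.setdefault(report.environment, []).append(report.host)
    return failures


# Illustrative data only: many noisy dev VMs, one production failure.
reports = [
    RunReport("dev-vm-01", "dev-vm", False),
    RunReport("dev-vm-02", "dev-vm", False),
    RunReport("dev-vm-03", "dev-vm", False),
    RunReport("web-07", "production", False),
    RunReport("web-08", "production", True),
]

failures = failing_hosts_by_environment(reports)
production_failures = failures.get("production", [])
```

With the failures split this way, the single production host stands out instead of being one more line among the developer VMs.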

Did we not have a staging environment? Yes we did. But it was an involved process and people didn't really do it if they didn't have to. #QConNYC

People tested on their private instances but didn't test in the exact same configuration in staging. "If you have a staging environment that's a pain to use, you may as well not have one." --@rynchantress #QConNYC

If a bad chef change went out, we had no way of stopping it, only waiting for everything to re-run. #QConNYC

If you have things that happen automatically, can you make them not do that if things start going wrong? And, if you have automated processes, can you still do them manually with the automation off? #QConNYC
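Those two questions can be sketched as a guard that sits in front of an automated rollout: a manual off switch plus an automatic stop once too many hosts fail. This is an illustrative sketch under assumed names and thresholds (`RolloutGuard`, `max_failure_rate`, `min_samples`); it is not a Chef feature and not what Etsy ran.

```python
from dataclasses import dataclass


@dataclass
class RolloutGuard:
    max_failure_rate: float = 0.2  # assumed threshold, tune for your fleet
    min_samples: int = 5           # do not judge the rate on too few hosts
    enabled: bool = True           # manual kill switch for the automation
    successes: int = 0
    failures: int = 0

    def record(self, ok: bool) -> None:
        """Record the outcome of applying the change to one host."""
        if ok:
            self.successes += 1
        else:
            self.failures += 1

    def may_proceed(self) -> bool:
        """False when switched off or when the failure rate is too high."""
        if not self.enabled:
            return False
        total = self.successes + self.failures
        if total < self.min_samples:
            return True
        return self.failures / total <= self.max_failure_rate


def roll_out(hosts, apply_change, guard):
    """Apply a change host by host and stop as soon as the guard trips.

    Returns the list of hosts the change was attempted on.
    """
    attempted = []
    for host in hosts:
        if not guard.may_proceed():
            break
        attempted.append(host)
        guard.record(apply_change(host))
    return attempted


hosts = [f"web-{i:02d}" for i in range(20)]
# A change that breaks every host it touches.
attempted = roll_out(hosts, lambda host: False, RolloutGuard())
# The same rollout with the automation switched off by hand.
skipped = roll_out(hosts, lambda host: False, RolloutGuard(enabled=False))
```

In this sketch a change that breaks every host stops after `min_samples` hosts rather than reaching all twenty, and flipping `enabled` off stops it before the first one.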

It was good to have responsive alerts and monitors. But we were lucky that the humans responded so quickly and were at their desks sitting next to each other, at work, and that other people jumped in right away. #QConNYC

Maintain adaptive capacity within your organization, so that people can drop everything in event of emergency. Can people ask each other for help without friction? #QConNYC

If you have to ask your manager to talk to someone else's manager to get someone to help you, you don't have enough adaptive capacity. #QConNYC

People that are doing some degree of interrupt-driven work are better able to accommodate incidents.

What happens after work gets rearranged for an incident? Do people get penalized for jumping in to help? #QConNYC

Onwards -- what couldn't we see? Remember that we couldn't see graphs because of the circular dependency. Apache was down, so we couldn't see nagios, deployinator, graphite... #QConNYC

Nobody was really having a good time. You need to understand the dependencies in your tooling. A lot of people do it for their main site, but fewer people do it for the stuff "only the poor ops folks care about". #QConNYC
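A small dependency map makes this kind of exposure visible before an incident does. The sketch below assumes a hand-maintained map of which tool needs which component; the Apache edges mirror what the talk describes, while the `dashboards` entry and the function name are hypothetical.

```python
def affected_by(outage, depends_on):
    """Return every tool that depends, directly or transitively, on `outage`."""
    affected = set()
    changed = True
    while changed:  # repeat until no new tool is added
        changed = False
        for tool, deps in depends_on.items():
            if tool in affected:
                continue
            if any(dep == outage or dep in affected for dep in deps):
                affected.add(tool)
                changed = True
    return affected


depends_on = {
    "nagios": ["apache"],
    "deployinator": ["apache"],
    "graphite": ["apache"],
    "dashboards": ["graphite"],  # hypothetical tool built on the graphs
    "status-page": [],           # hosted externally, no internal dependency
}

lost = affected_by("apache", depends_on)
```

Running the same query for each shared component shows which operational tools disappear together, and which ones, like an externally hosted status page, survive.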

How do we monitor the monitoring [ed: aka metamonitoring]. Do you have confidence you'd find out fast enough? The last time you want to discover nagios is down is during an incident. #QConNYC
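One common approach to metamonitoring is a heartbeat: the monitoring system checks in regularly, and something independent of it raises an alarm when the check-ins stop. A minimal sketch of that staleness test follows; the five-minute limit and the function name are assumptions, and how the heartbeat is recorded or alerted on is left out.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_stale(last_heartbeat, now, max_age=timedelta(minutes=5)):
    """True when the monitoring system has not checked in recently enough."""
    return now - last_heartbeat > max_age


# Fixed timestamps so the example is deterministic.
now = datetime(2018, 6, 29, 12, 0, tzinfo=timezone.utc)
fresh = heartbeat_is_stale(now - timedelta(minutes=1), now)
stale = heartbeat_is_stale(now - timedelta(minutes=30), now)
```

The check only helps if it runs somewhere that does not share the monitoring system's own dependencies.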

How do you communicate internally and externally? The status page was hosted on an external wordpress blog, so it was still up! yay! #QConNYC

Do you have a backup for if there's a slack outage? Having IRC available as a backup is a good idea. #QConNYC

So what actually went wrong with chef? "We weren't expecting this to happen." There was a weird part of the recipe that deleted unwanted configuration files. #QConNYC

So on re-run, it deleted the bad config file and everything returned to normal. We still didn't understand why the accidental upgrade happened. #QConNYC

As a mitigation, put a comment in saying "this is dangerous, and will upgrade everything. test it thoroughly." Always label your dragons and decide what yaks you're shaving when. #QConNYC

"Engineering is about tradeoffs. You won't have capacity in your team to dig into every weird issue." --@rynchantress #QConNYC

The team didn't have enough slack to work on this one weird thing, especially given the workaround. What are the tradeoffs and opportunity costs? Know who has the yak razors, and what's the best use of their time. #QConNYC

Make sure you have inter-team relationships, and know where you have single points of failure of deep understanding. #QConNYC

"So that person goes on vacation and that's when the system decides to break..." It's not cat-astrophically bad if you have a few areas like that. But make sure you surface and share the knowledge. #QConNYC

At Etsy, everyone was open about sharing things. The "ops" channel was very popular. Anything infrastructure/production-related would go into the channel where everyone hung out. #QConNYC

Having documentation and being able to share information is really important to resilience. How does information persist? A Slack channel's history disappears into the backscroll forever. What about people who joined after the incident? #QConNYC

What behaviors do you reward? Information hoarding is an anti-pattern. In the old-school ops view, people got promoted by hoarding all the information and being "expert dragons who sat on treasure" who could never be fired. #QConNYC

It leads to fragility within the organization. Instead, you want to reward leveling up the people around you and sharing information with them. #QConNYC

What happened afterwards? Stella report, three armed sweater that weighs 100 pounds, "I broke the site, I'm a good engineer", said @rynchantress to their incoming CEO... #QConNYC

How can we build a culture where we don't incorrectly ascribe problems to "human error"? Ensure people have a culture of wanting to help each other. #QConNYC

You don't want a culture of "sucks to be you, I'm going to lunch." Even if your peers are responding positively, management needs to be onboard too. Don't have managers firing people for outages. #QConNYC

Have a learning culture. Are people able to take the time to do a postmortem? How long are your band-aid solutions persisting? Do people have the time and resources to properly fix things, or do you bandaid more and more over time? #QConNYC

It's important for engineers to be able to prioritize their own remediation items, and for the remediation items to be valued as real work, e.g. P1s with a 30-day window to finish, not unimportant work. #QConNYC

Technology can be robust, but only humans can be resilient. @rynchantress has heard all about hardware and software ways of making systems more resilient, but they can only deal with known problems. Humans have a unique capacity to learn. [ed: 👏🏼👏🏼👏🏼] #QConNYC

Quoting @allspaw: Resilience is about people and their ability to adapt to unforeseen circumstances. #QConNYC

Five takeaways: Understand your automation. Maintain adaptive capacity. Know your dependencies. Build cross-team relationships. Always be learning. [fin] #QConNYC
