Once upon a time a few years ago, they were doing server provisioning for Etsy, who were operating their own datacenters and had a "lovingly hand-crafted set of tools to transform a server from a newly racked server into a webserver or database server." #QConNYC
There was a yak involved, shockingly, around Apache versions. Chef comes along and says it needs to install Apache, but the version Chef wanted to install was older than what was on the current yum mirror. #QConNYC
They sighed, because this happened every three weeks anyway. The yum repository would update itself and delete the older versions, and Chef was told not to downgrade, so the provisioning would fail. #QConNYC
"I just wanted to test this thing that had nothing to do with chef and apache." So why not just update the version pinned in Chef? They'd done it a million times before. #QConNYC
Tested it locally, it was one point release newer, smallest possible change. Great, let's roll it out in Chef everywhere, everyone's going to have a good time. [ed: this is starting to make my SRE spidey-sense tingle...] #QConNYC
The goal: new servers would get the new version of Apache, and on old servers it would be a no-op. Let's log into a webserver and make sure it does nothing. Whoops, it upgraded Apache, which it wasn't supposed to do, and it broke Apache. #QConNYC
Chef ran every 10 minutes across THE ENTIRE INFRASTRUCTURE. [ed: XD XD XD] This change, which we had just verified did the wrong thing, was about to roll out everywhere. #QConNYC
"So you're about to get paged," said @rynchantress to their friend next to them who was oncall. "Sorry about that." #QConNYC
Apache was in fact not running on... all of the webservers. Oh no, this is as bad as we thought. #QConNYC
[ed: these graphics of cats hammering on keyboards are so on point]. There was nothing we could do to stop Chef from self-"healing" everywhere to the wrong version. #QConNYC
We used Slack to let people know things had gone "slightly awry". Then we tried to figure out if we could roll back. #QConNYC
But the Yum repository was configured to keep only the newest version, and the old version had already been deleted from Yum. #QConNYC
So @rynchantress and their oncall buddy tried to roll forward instead, and some more people jumped in. [ed: and now we have four cats banging on keyboards] The graphs needed Apache too, so um... #QConNYC
How bad is the customer-facing issue? We went to etsy.com in a browser and discovered that somehow the site was still up. Really slow, but still up. #QConNYC
"People could still buy hand-made tea cozies very very slowly." --@rynchantress#QConNYC
PagerDuty, meanwhile, was still telling us that everything was on fire. Someone happened to figure out that Chef plus Apache = fire, but that re-running Chef resulted in things working again. Why? #QConNYC
The site was barely hanging on, so... we decided to try doing a second Chef run everywhere, and everything went back to normal! #QConNYC
"Sorry everyone, that was my bad, I was just trying to test this internal provisioning tool... well, we made some wonderful friendships along the way." --@rynchantress#QConNYC
So we did a postmortem and we all lived happily ever after. But what actually happened with all of that, and what can we learn from it? #QConNYC
It won a three-armed sweater not because it resulted in so many cat gifs, but because we learned the most from it. It wasn't the highest-impact incident of the year (only an hour of being slow). #QConNYC
Other examples of worse outages: making the site say "hi I'm nick" or have a giant cart icon or go hard down. #QConNYC
So how did the site stay up? The site should have gone down. There were exactly 7 php7 servers that had decided they didn't want to run Chef successfully; nobody was going to tell them what to do. #QConNYC
We had these rogue servers that never got around to installing the upgrade that did the wrong thing. Lesson 1: "keep 7 servers out of config... nope, consider fallbacks for automation." #QConNYC
Distrust your automation. How will you detect problems? How do you test the automation? Can you turn it off? Can you do it manually? #QConNYC
Persistently failing Chef runs on developer VMs were noise that swamped the 7 production servers that were legitimately failing. #QConNYC
Did we not have a staging environment? Yes, we did. But using it was an involved process, and people didn't really do it if they didn't have to. #QConNYC
People tested on their private instances but didn't test in the exact same configuration in staging. "If you have a staging environment that's a pain to use, you may as well not have one." --@rynchantress #QConNYC
If a bad chef change went out, we had no way of stopping it, only waiting for everything to re-run. #QConNYC
If you have things that happen automatically, can you make them not do that if things start going wrong? And, if you have automated processes, can you still do them manually with the automation off? #QConNYC
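A minimal sketch of what such a kill switch could look like, assuming a hypothetical stop-file path and a plain chef-client invocation (not Etsy's actual tooling): the scheduler calls a thin wrapper instead of the automation directly, and an operator can halt converges mid-incident by dropping a file, while still being able to run the converge by hand on a specific host.

```python
#!/usr/bin/env python3
"""Hypothetical kill-switch wrapper for a scheduled converge.

Sketch only: the stop-file path and the bare `chef-client` call are
assumptions for illustration, not the tooling described in the talk.
"""
import os
import subprocess
import sys

STOP_FILE = "/etc/automation/STOP"  # hypothetical path an operator can touch


def main() -> int:
    # If the stop file exists, skip the automated run entirely so a bad
    # change can't keep re-applying itself every 10 minutes.
    if os.path.exists(STOP_FILE):
        print("automation disabled by stop file; skipping converge")
        return 0

    # Otherwise run the normal converge and pass the exit code through,
    # so existing "failed run" monitoring keeps working.
    return subprocess.run(["chef-client"]).returncode


if __name__ == "__main__":
    sys.exit(main())
```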
It was good to have responsive alerts and monitors. But we were lucky that the humans responded so quickly, were at their desks sitting next to each other at work, and that other people jumped in right away. #QConNYC
Maintain adaptive capacity within your organization, so that people can drop everything in the event of an emergency. Can people ask each other for help without friction? #QConNYC
If you have to ask your manager to talk to someone else's manager to get someone to help you, you don't have enough adaptive capacity. #QConNYC
People that are doing some degree of interrupt-driven work are better able to accommodate incidents.
What happens after work gets rearranged for an incident? Do people get penalized for jumping in to help? #QConNYC
Onwards -- what couldn't we see? Remember that we couldn't see graphs because of the circular dependency. Apache was down, so we couldn't see Nagios, Deployinator, Graphite... #QConNYC
Nobody was really having a good time. You need to understand the dependencies in your tooling. A lot of people do it for their main site, but fewer people do it for the stuff "only the poor ops folks care about". #QConNYC
How do we monitor the monitoring? [ed: aka metamonitoring] Do you have confidence you'd find out fast enough? The last time you want to discover Nagios is down is during an incident. #QConNYC
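A toy sketch of that kind of meta-monitoring, assuming a made-up Nagios URL and leaving the "page someone" step as a placeholder: a tiny probe run from cron on a host that sits outside the main Apache/Graphite dependency chain, failing loudly when the monitoring UI itself stops answering.

```python
#!/usr/bin/env python3
"""Toy meta-monitoring probe: check that the monitoring system itself is up,
from a host that does not depend on it.

Sketch only: the URL is a made-up placeholder, and the alert step is a
print; in practice it would page through an independent channel.
"""
import sys
import urllib.error
import urllib.request

MONITORING_URL = "https://nagios.internal.example.com/"  # hypothetical


def monitoring_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the monitoring UI answers with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    if monitoring_is_up(MONITORING_URL):
        sys.exit(0)
    # Alert through something that does not share the broken dependencies.
    print("ALERT: monitoring endpoint unreachable", file=sys.stderr)
    sys.exit(2)
```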
How do you communicate internally and externally? The status page was hosted on an external wordpress blog, so it was still up! yay! #QConNYC
Do you have a backup for if there's a slack outage? Having IRC available as a backup is a good idea. #QConNYC
So what actually went wrong with Chef? "We weren't expecting this to happen." There was a weird part of the recipe that deleted unwanted configuration files. #QConNYC
So on re-run, it deleted the bad config file and everything returned to normal. We still didn't understand why the accidental upgrade happened. #QConNYC
As a mitigation, they put in a comment saying "this is dangerous, and will upgrade everything. Test it thoroughly." Always label your dragons and decide which yaks you're shaving when. #QConNYC
"Engineering is about tradeoffs. You won't have capacity in your team to dig into every weird issue." --@rynchantress#QConNYC
The team didn't have enough slack to work on this one weird thing, especially given the workaround. What are the tradeoffs and opportunity costs? Know who has the yak razors, and what's the best use of their time. #QConNYC
Make sure you have inter-team relationships, and know where you have single points of failure of deep understanding. #QConNYC
"So that person goes on vacation and that's when the system decides to break..." It's not cat-astrophically bad if you have a few areas like that. But make sure you surface and share the knowledge. #QConNYC
At Etsy, everyone was open about sharing things. The "ops" channel was very popular. Anything infrastructure/production-related would go into the channel where everyone hung out. #QConNYC
Having documentation and being able to share information is really important to resilience. How does information persist? Slack channel disappears into the backscroll forever. What about people who joined after the incident? #QConNYC
What behaviors do you reward? Information hoarding is an anti-pattern. In the old-school ops view, people got promoted by hoarding all the information and being "expert dragons who sat on treasure" who could never be fired. #QConNYC
It leads to fragility within the organization. Instead, you want to reward leveling up the people around you and sharing information with them. #QConNYC
What happened afterwards? The Stella Report, a three-armed sweater that weighs 100 pounds, and "I broke the site, I'm a good engineer," said @rynchantress to their incoming CEO... #QConNYC
How can we build a culture where we don't incorrectly ascribe problems to "human error"? Ensure people have a culture of wanting to help each other. #QConNYC
You don't want a culture of "sucks to be you, I'm going to lunch." Even if your peers are responding positively, management needs to be onboard too. Don't have managers firing people for outages. #QConNYC
Have a learning culture. Are people able to take the time to do a postmortem? How long are your band-aid solutions persisting? Do people have the time and resources to properly fix things, or do you bandaid more and more over time? #QConNYC
It's important for engineers to be able to prioritize their own remediation items, and for the remediation items to be valued as real work (e.g. P1s with a 30-day window to finish), not treated as unimportant work. #QConNYC
Technology can be robust, but only humans can be resilient. @rynchantress has heard all about hardware and software ways of making systems more robust, but those can only deal with known problems. Humans have a unique capacity to learn. [ed: 👏🏼👏🏼👏🏼] #QConNYC
Quoting @allspaw: Resilience is about people and their ability to adapt to unforeseen circumstances.
Five takeaways: Understand your automation. Maintain adaptive capacity. Know your dependencies. Build cross-team relationships. Always be learning. [fin] #QConNYC
Final talk I'll be getting to at #VelocityConf before I dash to Toronto: @IanColdwater on improving container security on k8s.
She focuses on hardening her employer's cloud container infrastructure, including doing work on k8s.
She also was an ethical hacker before she went into DevOps and DevSecOps. #VelocityConf
She travels around doing competitive hacking with CTFs. It's important to think like an attacker rather than assuming good intent and nice user personas who use our features the way the devs intended. #VelocityConf
My colleague @sethvargo on microservice security at #VelocityConf: traditionally we've thought of security as all-or-nothing -- you put the biggest possible padlock on your perimeter, and you have a secure zone and an untrusted zone.
We know that monoliths don't actually work, so we're moving towards microservices. But how does this change your security model?
You might have a loadbalancer that has software-defined rules. And you have a variety of compartmentalized networks. #VelocityConf
You might also be communicating with managed services such as Cloud SQL that are outside of your security perimeter.
You no longer have one resource, firewall, loadbalancer, and security team. You have many. Including "Chris." #VelocityConf
The problems we're solving: (1) why are monoliths harder to migrate? (2) Should you? (3) How do I start? (4) Best practices #VelocityConf
.@krisnova is a Gaypher (gay gopher), a k8s maintainer, and is involved in two k8s SIGs (cluster lifecycle & AWS, though she likes all the clouds, depending on the day). And she did SRE before becoming a Dev Advocate! #VelocityConf
"just collect data and figure out later how you'll use it" doesn't work any more. #VelocityConf
We used to be optimistic before we ruined everything.
Mozilla also used to not collect data and only had numbers for downloads, but their market share went down because they weren't measuring user satisfaction and actual usage. #VelocityConf