First plenary talk: @nicolefv and @jezhumble on measurement. "If you don't know where you're going, it doesn't matter how fast you get there." #SREcon
Outline of the talk: (1) where am I going, (2) why do we care, (3) improve performance/quality, (4) measure performance, (5) culture & how to measure. #SREcon
Maturity models are for chumps, says @nicolefv. Everyone has one, you're supposed to get to 5. Level caps in World of Warcraft as an example of level creep. [ed: this is a really interesting thing the CRE team at Google needs to consider in prod maturity assessments] #SREcon
The landscape has changed, and expectations have changed. Getting to level 40 or 60 or 80 is no longer good enough if you can get to level 110 now and there's extra land/tooling/technologies/gear to take advantage. #SREcon
Customers expect a lot more stuff. Docker didn't even exist 10 years ago, says @nicolefv.
The problem and the challenge. If maturity models point us to a shifting destination, what is right? Directions & continuous improvement instead. #SRECon
But what direction or single metric should we pick? "LOLNO". We do have some things that come close.
@jezhumble kicks in here. State of DevOps report. 27,000 data points from all over the world. #SREcon
IT performance metrics that held true throughout. Tempo category: lead time for changes (VCS commit to prod) and release frequency. Stability category: time to restore service, and percentage of changes that fail and have to be rolled back or fixed forward. #SREcon
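[ed: a minimal sketch of how the four metrics above could be computed from a log of deploys. The `Deploy` record shape, its field names, and the choice of medians are my assumptions for illustration, not something from the talk or the report.]

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    # Hypothetical record shape; none of these field names come from the talk.
    committed_at: datetime                  # VCS commit time
    deployed_at: datetime                   # time the change reached prod
    failed: bool = False                    # needed a rollback or fix-forward
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def tempo_and_stability(deploys: list[Deploy], window_days: int) -> dict:
    """Summarize tempo (first two keys) and stability (last two) for a window."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deploy and a positive window")
    # Lead time: VCS commit to prod, in hours.
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
    ]
    failures = [d for d in deploys if d.failed]
    # Time to restore: deploy of the bad change to restoration, in minutes.
    restore_minutes = [
        (d.restored_at - d.deployed_at).total_seconds() / 60
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "median_lead_time_hours": median(lead_hours),
        "deploys_per_day": len(deploys) / window_days,
        "change_fail_pct": 100.0 * len(failures) / len(deploys),
        "median_restore_minutes": median(restore_minutes) if restore_minutes else None,
    }
```

Reporting all four numbers together is the point: a team that only tracks tempo or only tracks stability can't tell whether it is trading one for the other.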
High performers do better at both of these categories rather than only one of them, says @jezhumble. They deploy on demand, at least once a day AND they also are able to restore service within an hour in event of breakage. #SREcon
Low performers deploy less frequently AND have bad time to restore. Why? Because you batch up work, and your fixes don't make it out. Technical debt accumulates. "Big bang release over the weekend, it'll all be sad." #SREcon
Emergency change processes tend to bypass testing. If you have reliable and fast CI/CD, you are going to be able to use normal change process to push emergency fixes. #SREcon
Both profit measures and customer satisfaction (esp. for nonprofits, where profit doesn't matter) are better in companies that have better practices. #SREcon
"There's no definition of DevOps, it's still evolving" says @jezhumble. But SRE and DevOps are solving practically the same thing. How can we develop, evolve, and operate secure, resilient systems at scale. So we're at #SREcon but talking #DevOps. And it's natural!
Now back to @nicolefv. How do we improve these metrics? CD, lean management, lean product development. Needs a *base* of transformational leadership. [ed: and CRE agrees -- without strong executive support, SRE has a much harder time taking root]. #SREcon
Decrease of burnout and increases to job satisfaction. "The more streamlined we make our processes, the better it makes our lives and culture". --@nicolefv #SREcon
"How do you change culture? By doing things differently. And changing culture affects your ability to deliver with speed and stability and drive organizational goals." --@jezhumble #SREcon
So onto quality. Balance of new work, unplanned work/rework, and other work. This is looking awfully familiar. c.f. the work by @sethvargo and me on "Overhead vs. Toil vs. Projects". #SREcon
How should we be measuring performance? Outputs vs. outcomes. What is going to be different, rather than what did you do? [ed: see *good* Key Results/Objectives, not lists of projects] #SREcon
Don't measure lines of code as an output/productivity metric. It's easy but deceptive because it's an output measure, not an outcome measure, says @nicolefv. Rewarding more lines leads to bloated software with higher maintenance cost. #SREcon
"Code is a liability rather than an asset" on your balance sheet, says @jezhumble. But optimize for readability, not for compactness. #SREcon
Common mistakes about velocity: story points are a capacity planning tool, not a productivity tool. Don't say "I scored 100 points this sprint and your team only scored 50!" #SREcon
Don't wind up letting your feet get stuck in concrete. Zero-sum gamification results in creating silos and inhibiting collaboration, says @jezhumble. #SREcon
Focus on global productivity rather than local metrics.
Another fallacy: utilization. How much of your time is spent working. CFOs apparently love this. But you *need* slack for unplanned work. #SREcon
Approaching full utilization results in lead times approaching infinity. Lack of resiliency against unplanned work or misestimations. #SREcon
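[ed: one way to see why lead times blow up near full utilization. This is a sketch under my own assumption of a textbook M/M/1 queue (one worker, random arrivals, random job sizes); the speakers did not name a specific model. In that model the mean time a piece of work spends in the system is the bare work time divided by (1 - utilization).]

```python
def mean_lead_time(service_time: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho).

    service_time is the average hands-on time per item (S) and
    utilization is the fraction of time the worker is busy (rho).
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # The multiplier grows without bound as utilization approaches 100%.
    for rho in (0.5, 0.8, 0.9, 0.99):
        multiplier = mean_lead_time(1.0, rho)
        print(f"{rho:.0%} utilized: lead time is about {multiplier:.0f}x the work itself")
```

Real teams are not M/M/1 queues, but the shape of the curve is the takeaway: the last few percent of utilization cost far more lead time than the first fifty.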
Now, onto culture. What kind of culture are we talking about? High trust culture, learning from mistakes, accepting novelty. Generative/Bureaucratic/Pathological models of cultures by sociologist Ron Westrum. #SREcon
Touching specifically on failure. Scapegoating vs. justice vs. open enquiry. #SREcon
How can you measure this? Survey people in an organization. Likert scale asking about blame, sharing responsibility and information, and about failures. #SREcon
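[ed: a minimal sketch of scoring such a survey. The item names, the 7-point scale, and the plain averaging are all my assumptions for illustration; the actual survey instrument was not shown in enough detail to reproduce here.]

```python
from statistics import mean

SCALE_MAX = 7  # assumed 7-point Likert scale: 1 = strongly disagree, 7 = strongly agree

# Hypothetical survey items. True marks a negatively worded item that is
# reverse-scored so that a higher number always means a healthier culture.
ITEMS = {
    "information_is_actively_shared": False,
    "responsibilities_are_shared": False,
    "failure_leads_to_inquiry": False,
    "messengers_are_blamed": True,
}


def culture_score(responses: dict[str, int]) -> float:
    """Average one respondent's answers after reverse-scoring negative items."""
    scored = []
    for item, reverse in ITEMS.items():
        raw = responses[item]
        if not 1 <= raw <= SCALE_MAX:
            raise ValueError(f"{item}: {raw} is outside 1..{SCALE_MAX}")
        scored.append(SCALE_MAX + 1 - raw if reverse else raw)
    return mean(scored)


def team_score(all_responses: list[dict[str, int]]) -> float:
    """Average the per-respondent scores across a team or organization."""
    return mean(culture_score(r) for r in all_responses)
```

Tracking the score over repeated surveys says more than any single reading, since the direction of change is what the talk argues for measuring.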
c.f. Project Aristotle research by Google. Individual skills on the team weren't significant; instead, psych safety came first, then dependability, structure/clarity, meaning, and impact. c.f. Google's public re:Work site. #SREcon
[ed: be wary of the psych safety trap in teams undergoing diversity/composition changes. I disagree with some of how Google attempts to measure and metric-ify psych safety for this reason: plus.google.com/+lizthegrey/po…] #SREcon
Example from Etsy of old: @rynchantress was given an award for causing the largest outage rather than shamed for it. Everyone asking "How can I help?" in Slack #SREcon
Citing Kripa Krishnan (keynote speaker at #SREcon 2016): practice to make sure people interact with each other across teams *before* real outages. #SREcon
Final talk I'll be getting to at #VelocityConf before I dash to Toronto: @IanColdwater on improving container security on k8s.
@IanColdwater She focuses on hardening her employer's cloud container infrastructure, including doing work on k8s.
She also was an ethical hacker before she went into DevOps and DevSecOps. #VelocityConf
She travels around doing competitive hacking with CTFs. It's important to think like an attacker rather than assuming good intents and nice user personas that use our features in the way the devs intended things to be used. #VelocityConf
My colleague @sethvargo on microservice security at #VelocityConf: traditionally we've thought of security as all-or-nothing -- that you put the biggest possible padlock on your perimeter, and you have a secure zone and an untrusted zone.
@sethvargo We know that monoliths don't actually work, so we're moving towards microservices. But how does this change your security model?
You might have a loadbalancer that has software-defined rules. And you have a variety of compartmentalized networks. #VelocityConf
You might also be communicating with managed services such as Cloud SQL that are outside of your security perimeter.
You no longer have one resource, firewall, loadbalancer, and security team. You have many. Including "Chris." #VelocityConf
The problems we're solving: (1) why are monoliths harder to migrate? (2) Should you? (3) How do I start? (4) Best practices #VelocityConf
.@krisnova is a Gaypher (gay gopher), is a k8s maintainer, and is involved in two k8s SIGs (cluster lifecycle & aws, but she likes all the clouds, depending upon the day). And she did SRE before becoming a Dev Advocate! #VelocityConf
"just collect data and figure out later how you'll use it" doesn't work any more. #VelocityConf
We used to be optimistic before we ruined everything.
Mozilla also used to not collect data, and only had data on number of downloads, but its market share went down because they weren't measuring user satisfaction and actual usage. #VelocityConf