After lunch, I'll be livetweeting an #SREcon talk on productionizing machine learning services by Google SREs @salim and @villaviejac, in their professional capacities... [ed: and I saw some 🔥🔥🔥 slides about ML privacy & ethics in their dress rehearsal so this gonna be good]
They are SREs but not necessarily ML scientists/researchers.
Data about what can go wrong has been gathered from 40+ interviews with various teams. #SREcon
Myths: "machine learning is a black box." "you rarely have to roll back." "ML based monitoring is like other alerting." All nope. #SREcon
ML is good for everything except... if you don't have a fallback plan, don't have enough labeled data, or require microsecond rather than millisecond latency. #SREcon
Some uses of ML today in Google production: predicting user clicks on ads, prefetching next memory/file accesses, scheduling jobs and capacity planning, speech recognition, fraud detection, smart responses, and machine vision. #SREcon
One specific important example: YouTube machine learning for recommendations.
You need to continuously train to make sure that you reflect current trends. #SREcon
YouTube's recommendation algorithm is revenue-critical to Google because it influences watch time and therefore ads clicked.
But what can go wrong? What if we trained our model on data from the wrong day, e.g. a day when everyone's watching a sports event? #SREcon
But it's not that easy of a problem in production. Freshness matters, as do device type and view time, and spamminess/badness can pollute your training set.
Continuous deployment and continuous training is challenging to keep going. #SREcon
It's up to SRE to productionize these models so they can become reliable production pipelines.
This talk is designed to illuminate some of the ML productionization best practices. #SREcon
What's ML look like in prod? "Don't worry, it's just another data pipeline"
10% of the effort is offline training -- transforming prod data, training on TPUs, validating, and producing a trained model.
90% of the cycles are spent on serving: answering queries with the trained model. Right?
Well, not quite... there are some nuances. #SREcon
Five key areas: training and data quality, hardware management, qualification, backwards compatibility, and privacy/ethics. #SREcon
Training is no longer a concrete, point-in-time thing that we develop offline and release.
Production data changes fast. Model loss increases with time at a constant rate -- think of how Carlos's face changes as he grows a beard. #SREcon
We need to be able to continuously evolve our models in production, fast. You can never stop training. #SREcon
Data quality matters. It's generally true that more data is better, but often repeated data needs to be filtered, and missing data needs to be imputed.
You may not be able to train on *all* your terabytes of data. #SREcon
It's important to save some data for validation, rather than using it all to train and overfit. Never let your model see your validation data during training. No one size fits all -- some people say 70/30, other estimates vary. Qualification is separate. #SREcon
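[ed: a minimal sketch from me, not the speakers, of holding out a validation split the trainer never sees; the 70/30 ratio and scikit-learn usage are just illustrative]

```python
# Illustrative only: hold out validation data the model never sees in training.
from sklearn.model_selection import train_test_split

def split_examples(examples, labels):
    # 70/30 is one common choice; the right ratio depends on your data volume.
    x_train, x_val, y_train, y_val = train_test_split(
        examples, labels, test_size=0.3, shuffle=True, random_state=42)
    return (x_train, y_train), (x_val, y_val)
```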
We may need to shard and train different models for different use cases.
We can snapshot our models periodically to make incremental progress instead of starting off from scratch every time. #SREcon
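[ed: my own sketch of the snapshotting idea using TensorFlow checkpoint APIs; the path and setup are hypothetical, not what Google actually runs]

```python
# Hypothetical warm-start: resume continuous training from the latest snapshot
# instead of retraining from scratch.
import tensorflow as tf

def restore_or_init(model, optimizer, ckpt_dir="/models/recs/checkpoints"):
    ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
    manager = tf.train.CheckpointManager(ckpt, ckpt_dir, max_to_keep=5)
    if manager.latest_checkpoint:
        ckpt.restore(manager.latest_checkpoint)  # make incremental progress
    return manager  # call manager.save() periodically to snapshot the model
```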
Data quality issues: you need to be completely consistent about how you represent values.
Make sure your models never receive unexpected inputs. Make sure your data isn't biased by one specific source. #SREcon
You may need to stop training if some of your inputs go missing, or a day is anomalous.
You may need to be prepared to add more fields to old data and re-train if you incorporate new signals. #SREcon
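[ed: an illustrative guard of mine, not their code: refuse to kick off training when required inputs are missing or the day's volume looks anomalous; thresholds are made up]

```python
# Illustrative pre-training guard with made-up thresholds.
def safe_to_train(batch_stats, required_fields, baseline_rows, tolerance=0.5):
    missing = [f for f in required_fields
               if batch_stats["null_fraction"].get(f, 1.0) > 0.05]
    if missing:
        return False, f"inputs missing or mostly empty: {missing}"
    ratio = batch_stats["row_count"] / baseline_rows
    if abs(ratio - 1.0) > tolerance:  # e.g. a big sports event skews the day
        return False, f"anomalous volume: {ratio:.2f}x baseline"
    return True, "ok"
```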
What changes does this require to our pipelines?
We perform data imputation and filtering prior to training, then train/validate multiple copies in parallel. #SREcon
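[ed: my paraphrase of that pipeline shape in Python, with hypothetical helpers: impute and filter first, then fan training out to several candidate configs in parallel]

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from statistics import median

def prepare(examples):
    """Filter spammy rows, then impute missing watch_time before training."""
    kept = [e for e in examples if not e.get("is_spam", False)]
    fallback = median(e["watch_time"] for e in kept if "watch_time" in e)
    for e in kept:
        e.setdefault("watch_time", fallback)
    return kept

def train_one(examples, config):
    """Stand-in trainer; the real thing would run on TPUs."""
    return {"config": config, "examples_seen": len(examples)}

def train_candidates(examples, configs):
    clean = prepare(examples)
    with ProcessPoolExecutor() as pool:  # multiple copies in parallel
        return list(pool.map(partial(train_one, clean), configs))
```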
Onto topic (2): resource management. Hardware is such a bottleneck that companies like Google and Microsoft are building custom hardware.
32 GPU-days == less than 6 hours of a TPU-pod; the cost is dropping over time. #SREcon
The cost of training resources winds up growing faster than the cost of production serving/inference resources.
You need to make sure each part has appropriate resources it needs, rather than competing for resources. #SREcon
There's no good multitenancy for GPUs right now, although there are projects underway.
We need to qualify new models against production binaries, and/or some kind of A/B testing. Canarying matters to verify that behavior is similar and it doesn't break. #SREcon
To prevent mistakes, only allow signed qualified models in production that have their use cases and tested inputs specified. #SREcon
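[ed: not Google's actual mechanism -- just a hedged illustration of refusing to load a model unless it carries a valid signature over its bytes plus a manifest declaring its use case and tested inputs]

```python
import hashlib
import hmac
import json

def load_manifest_if_signed(model_bytes, manifest_json, signature_hex, signing_key):
    payload = model_bytes + manifest_json  # manifest declares use case + tested inputs
    expected = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature_hex):
        raise ValueError("model is unsigned or tampered with; refusing to serve")
    return json.loads(manifest_json)  # e.g. {"use_case": "recs", "tested_inputs": [...]}
```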
Models may add features that aren't backward compatible. You may not be able to reuse old models, so you need a fallback heuristic if you're unable to find a viable model to use. #SREcon
Actively deprecate old models if your system can't use them. Have explicit versioning and deployment strategies. #SREcon
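[ed: a toy version of that fallback logic, assuming a dict-style model registry keyed by version -- names are mine]

```python
def load_model_or_fallback(registry, compatible_versions):
    # Prefer the newest model version our binary can actually use.
    for version in sorted(compatible_versions, reverse=True):
        model = registry.get(version)
        if model is not None:
            return model
    # No viable model: fall back to a non-ML heuristic rather than failing.
    return lambda features: 0.5
```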
Remember that API versions can drift, as well as binary and model versions.
!!! remember you can't add a new feature to an old model without re-training. !!! #SREcon
Rollbacks of ML models have subtleties that require human judgment. But this means we can't do them quickly. So future work, maybe? [ed: I hope I got that right.] #SREcon
Okay, so now we have a change to our pipeline structure... we need to qualify by tee-ing user data against the same binary running two different models. Compare the champion model to the others.
Remember that shipping no new model = user-visible degradation. #SREcon
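[ed: a minimal sketch of the tee-ing idea with names I made up: replay the same requests through champion and challenger, serve only the champion, and compare the answers]

```python
def tee_and_compare(requests, champion, challenger, tolerance=0.05):
    served, disagreements = [], 0
    for req in requests:
        a, b = champion(req), challenger(req)
        served.append(a)  # users only ever see the champion's answers
        if abs(a - b) > tolerance:
            disagreements += 1
    rate = disagreements / max(len(requests), 1)
    return served, rate  # qualify or reject the challenger based on this rate
```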
What kind of alerting do we need? Need to measure how long training takes, end-user serving latency [including overhead from APIs and frameworks], age of models... #SREcon
If you don't have alerting, then you might miss pushing a bad model. You need domain-specific alerting [ed: e.g. have specific user journeys and SLOs in mind]. #SREcon
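[ed: one way to wire up those metrics, assuming prometheus_client; the metric names and the freshness-SLO comment are mine, not theirs]

```python
import time
from prometheus_client import Gauge, Histogram

TRAINING_DURATION = Gauge("ml_training_duration_seconds", "Wall time of the last training run")
MODEL_AGE = Gauge("ml_model_age_seconds", "Seconds since the serving model was trained")
SERVING_LATENCY = Histogram("ml_serving_latency_seconds", "End-to-end inference latency")

def record_training_run(started_at, finished_at):
    TRAINING_DURATION.set(finished_at - started_at)

def serve(request, model, model_trained_at):
    MODEL_AGE.set(time.time() - model_trained_at)  # alert when this blows the freshness SLO
    with SERVING_LATENCY.time():  # includes API/framework overhead
        return model(request)
```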
Onto topic (5): privacy. ML tends to involve user data.
Make sure that you are anonymizing user data. Users shouldn't be identifiable from outcomes.
You must be able to delete user data. Models should not contain user data, or must be promptly retrained when that data is deleted. #SREcon
Know what your SLOs are for completing that deletion process.
Ethics in ML: in a nutshell, SREs have a distinct responsibility for the quality of the predictions and the data we are processing. We as SREs must be able to make decisions. #SREcon
We need oversight and guidelines, potentially from outside groups. See the AI Now institute's work.
There needs to be followup on the recommendations and transparency, not just directing feedback to /dev/null. #SREcon
It's easier to add ML to products today than before, and we need to ensure that it's done in an ethical and fair way. #SREcon
What are the takeaways? You need to be prepared to turn off your ML features and fall back for a variety of reasons.
Training is part of your production system, not off to the side. You need to monitor your serving latency and your freshness. #SREcon
Data changes are tricky to get right, so monitor carefully and be prepared to compensate for problems.
Make sure unqualified models can't be pushed to prod. #SREcon
Predictions for the future: increasing use of open source pre-anonymized training data sets.
Need to spread load across multiple different models in parallel rather than only one big model. #SREcon
Models as a service will increase in importance -- use a central provider to do text-to-speech or image recognition rather than training everything yourself from scratch. #SREcon
Thanks to Alphabet collaborators including Google, DeepMind, YouTube, and outside contributors such as Clarifai and Thoughtworks. [fin] #SREcon
To capture a nuance I didn't write down fast enough: SREs need to understand who is using the model and how they may be affected by changes. And they need to be empowered to hit the 'stop' button at any time.
Final talk I'll be getting to at #VelocityConf before I dash to Toronto: @IanColdwater on improving container security on k8s.
@IanColdwater She focuses on hardening her employer's cloud container infrastructure, including doing work on k8s.
She also was an ethical hacker before she went into DevOps and DevSecOps. #VelocityConf
She travels around doing competitive hacking with CTFs. It's important to think like an attacker rather than assuming good intent and nice user personas who use our features the way the devs intended. #VelocityConf
My colleague @sethvargo on microservice security at #VelocityConf: traditionally we've thought of security as all-or-nothing -- you put the biggest possible padlock on your perimeter, and you have a secure zone and an untrusted zone.
@sethvargo We know that monoliths don't actually work, so we're moving towards microservices. But how does this change your security model?
You might have a loadbalancer that has software-defined rules. And you have a variety of compartmentalized networks. #VelocityConf
You might also be communicating with managed services such as Cloud SQL that are outside of your security perimeter.
You no longer have one resource, firewall, loadbalancer, and security team. You have many. Including "Chris." #VelocityConf
The problems we're solving: (1) why are monoliths harder to migrate? (2) Should you? (3) How do I start? (4) Best practices #VelocityConf
.@krisnova is a Gaypher (gay gopher), is a k8s maintainer, and is involved in two k8s SIGs (cluster lifecycle & AWS -- but she likes all the clouds, depending upon the day). And she did SRE before becoming a Dev Advocate! #VelocityConf
"just collect data and figure out later how you'll use it" doesn't work any more. #VelocityConf
We used to be optimistic before we ruined everything.
Mozilla also used to not collect data, and only had data on number of downloads, but its market share went down because they weren't measuring user satisfaction and actual usage. #VelocityConf