Liz Fong-Jones (方禮真)
Jun 7, 2018
After lunch, I'll be livetweeting an #SREcon talk on productionizing machine learning services by Google SREs @salim and @villaviejac, in their professional capacities... [ed: and I saw some 🔥🔥🔥 slides about ML privacy & ethics in their dress rehearsal so this gonna be good]
They are SREs but not necessarily ML scientists/researchers.

Data about what can go wrong has been gathered from 40+ interviews with varying teams. #SREcon
Myths: "machine learning is a black box." "you rarely have to roll back." "ML-based monitoring is like other alerting." All nope. #SREcon
ML is good for everything except... if you don't have a fallback plan, don't have enough labeled data, or require microsecond rather than millisecond latency. #SREcon
Some uses of ML today in Google production: predicting user clicks on ads, prefetching next memory/file accesses, scheduling jobs and capacity planning, speech recognition, fraud detection, smart responses, and machine vision. #SREcon
One specific important example: YouTube machine learning for recommendations.

You need to continuously train to make sure that you reflect current trends. #SREcon
YouTube's recommendation algorithm is revenue-critical to Google because it influences watch time and therefore ads clicked.

But what can go wrong? What if we trained on data from the wrong day, e.g. a day when everyone's watching a sports event? #SREcon
But it's not that easy a problem in production. Freshness matters, the type of device matters, as does view time, as does spamminess/badness polluting your training set.

Continuous deployment and continuous training is challenging to keep going. #SREcon
It's up to SRE to productionize these models so they can become reliable production pipelines.

This talk is designed to illuminate some of the ML productionization best practices. #SREcon
What's ML look like in prod? "Don't worry, it's just another data pipeline"

10% of the effort is offline training -- transforming prod data, training on TPUs, validating, and producing a trained model.

We then push the trained model to prod. #SREcon
90% of the cycles are spent on serving: answering queries using the trained model. Right?

Well, not quite... there are some nuances. #SREcon
Five key areas: training and data quality, hardware management, qualification, backwards compatibility, and privacy/ethics. #SREcon
Training is no longer a concrete, point-in-time thing that we develop offline and release.

Production data changes fast, and model loss increases with time at a constant rate. Think of, say, the appearance of Carlos's face and beard changing over time. #SREcon
We need to be able to continuously evolve our models in production, fast. You can never stop training. #SREcon
Data quality matters. It's generally true that more data is better, but often repeated data needs to be filtered, and missing data needs to be imputed.

You may not be able to train on *all* your terabytes of data. #SREcon
It's important to save some data for validation, rather than using it all to train and overfit. Don't ever allow your model to have seen its validation data during training. No one size fits all -- some people say 70/30, other estimates vary. Qualification is separate. #SREcon
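[ed: a minimal sketch of keeping validation data out of training, in Python with scikit-learn; the synthetic data, model, and 70/30 split are my own invention for illustration, not from the talk:]

```python
# Illustration only: hold out a validation set the model never trains on, so
# validation loss measures generalization rather than memorization.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                        # stand-in for real features
y = (X[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

# 70/30 is just one common choice; the right split varies by problem.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)     # training sees only X_train
print("held-out accuracy:", model.score(X_val, y_val)) # validation stays unseen
```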
We may need to shard and train different models for different use cases.

We can snapshot our models periodically to make incremental progress instead of starting off from scratch every time. #SREcon
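[ed: a sketch of what periodic snapshotting with warm-starting could look like; framework-agnostic, with a made-up snapshot directory and a toy dict standing in for a real model:]

```python
# Illustration only: resume training from the newest snapshot instead of from
# scratch, then write a fresh snapshot after the incremental pass.
import glob
import os
import pickle
import time

SNAPSHOT_DIR = "/tmp/model_snapshots"                  # hypothetical location

def load_latest_snapshot():
    snaps = sorted(glob.glob(os.path.join(SNAPSHOT_DIR, "model-*.pkl")))
    if not snaps:
        return None
    with open(snaps[-1], "rb") as f:
        return pickle.load(f)

def save_snapshot(model):
    os.makedirs(SNAPSHOT_DIR, exist_ok=True)
    path = os.path.join(SNAPSHOT_DIR, f"model-{int(time.time())}.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path

model = load_latest_snapshot() or {"weights": [0.0] * 10}   # toy stand-in model
# ... run an incremental training pass over fresh data here ...
save_snapshot(model)
```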
Data quality issues: you need to be completely consistent about how you represent values.

Make sure your models never receive unexpected inputs. Make sure your data isn't biased by one specific source. #SREcon
You may need to stop training if some of your inputs go missing, or a day is anomalous.

You may need to be prepared to add more fields to old data and re-train if you incorporate new signals. #SREcon
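[ed: a sketch of the kind of pre-training guard rails this implies; the field names and thresholds are invented for illustration:]

```python
# Illustration only: refuse to train if required fields are missing too often or
# if a day's volume looks anomalous compared to a recent baseline.
REQUIRED_FIELDS = {"user_country", "device_type", "watch_time"}   # hypothetical

def check_training_batch(rows, baseline_daily_count, max_skew=0.5):
    """rows: list of dicts, one per training example for the day."""
    if not rows:
        raise ValueError("no training data for this window; halting training")
    for field in REQUIRED_FIELDS:
        missing = sum(1 for r in rows if r.get(field) is None)
        if missing / len(rows) > 0.01:             # >1% missing: bail out
            raise ValueError(f"field {field!r} missing too often; halting training")
    skew = abs(len(rows) - baseline_daily_count) / baseline_daily_count
    if skew > max_skew:                            # e.g. a big sports event day
        raise ValueError("daily volume anomalous; skipping this day's training")
```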
What changes does this require to our pipelines?

We perform data imputation and filtering prior to training, then train/validate multiple copies in parallel. #SREcon
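[ed: a sketch of imputation and filtering ahead of training, assuming pandas; the column names and imputation choices are illustrative only:]

```python
# Illustration only: dedupe, impute, and filter a frame before training on it.
import pandas as pd

def prepare_training_frame(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates().copy()               # repeated data gets filtered out
    # Impute missing values: numeric columns get the median, categoricals a
    # sentinel so the model can learn "unknown" explicitly.
    df["watch_time_seconds"] = df["watch_time_seconds"].fillna(
        df["watch_time_seconds"].median())
    df["device_type"] = df["device_type"].fillna("UNKNOWN")
    # Drop rows that are still obviously bad (e.g. negative watch time).
    return df[df["watch_time_seconds"] >= 0]
```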
Onto topic (2): resource management. Hardware is such a bottleneck that companies like Google and Microsoft are building custom hardware.

32 GPU-days == less than 6 hours of a TPU-pod; the cost is dropping over time. #SREcon
The cost of training resources winds up growing faster than the cost of production serving/inference resources.

You need to make sure each part has appropriate resources it needs, rather than competing for resources. #SREcon
There's no good multitenancy for GPUs right now, although there are projects underway.

Onto part (3): qualification. #SREcon

We need to qualify new models against production binaries, and/or some kind of A/B testing. Canarying matters to verify that behavior is similar and it doesn't break. #SREcon
To prevent mistakes, only allow signed qualified models in production that have their use cases and tested inputs specified. #SREcon
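[ed: a sketch of what refusing unqualified models could look like; a real system would use cryptographic signing, this just checks a checksum and the declared use cases recorded by a hypothetical qualification step:]

```python
# Illustration only: refuse to load a model unless its artifact matches its
# qualification record and the record lists our use case.
import hashlib
import json

def load_qualified_model(model_path, metadata_path, expected_use_case):
    with open(metadata_path) as f:
        meta = json.load(f)                        # written by the qualification step
    with open(model_path, "rb") as f:
        artifact = f.read()
    if meta.get("sha256") != hashlib.sha256(artifact).hexdigest():
        raise RuntimeError("model artifact does not match its qualification record")
    if expected_use_case not in meta.get("qualified_use_cases", []):
        raise RuntimeError(f"model not qualified for {expected_use_case!r}")
    return artifact                                # hand off to the real model loader
```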
Models may add features that aren't backward compatible. You may not be able to reuse old models, so you need a fallback heuristic if you're unable to find a viable model to use. #SREcon
Actively deprecate old models if your system can't use them. Have explicit versioning and deployment strategies. #SREcon
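[ed: a sketch of the versioning-plus-fallback idea from the last two tweets; the version set, model records, and heuristic are all invented for illustration:]

```python
# Illustration only: serve from the newest model version this binary supports,
# and fall back to a simple heuristic if no compatible model exists.
SUPPORTED_MODEL_VERSIONS = {3, 4}      # hypothetical; bumped with each binary release

def choose_recommendations(available_models, user, heuristic):
    """available_models: list of dicts with schema_version, trained_at, predict."""
    compatible = [m for m in available_models
                  if m["schema_version"] in SUPPORTED_MODEL_VERSIONS]
    if not compatible:
        return heuristic(user)                     # e.g. most-popular-videos fallback
    newest = max(compatible, key=lambda m: m["trained_at"])
    return newest["predict"](user)
```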
Remember that API versions can drift, as well as binary and model versions.

!!! remember you can't add a new feature to an old model without re-training. !!! #SREcon
Rollbacks of ML models have subtleties that require human judgment. But this means we can't do them quickly. So future work maybe? [ed: I hope I got that right.] #SREcon
Okay, so now we have a change to our pipeline structure... we need to qualify by tee-ing user data against the same binary and two different models. Compare the champion model to other models.

Remember that not releasing a fresh model = user-visible degradation. #SREcon
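[ed: a sketch of the tee-ing idea: same binary, same traffic, two models; the champion answers users, the challenger is only logged. The function names are mine:]

```python
# Illustration only: tee each request through champion and challenger models;
# serve the champion's answer, record both for offline comparison.
import logging

def serve_with_tee(request, champion, challenger):
    champion_answer = champion(request)
    try:
        challenger_answer = challenger(request)
        logging.info("qualification sample: request=%r champion=%r challenger=%r",
                     request, champion_answer, challenger_answer)
    except Exception:
        logging.exception("challenger failed; champion path unaffected")
    return champion_answer                         # users only ever see the champion
```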
What kind of alerting do we need? Need to measure how long training takes, end user serving latency [including overhead from APIs, frameworks], age of models... #SREcon
If you don't have alerting, then you might miss pushing a bad model. You need domain-specific alerting [ed: e.g. have specific user journeys and SLOs in mind]. #SREcon
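[ed: a sketch of domain-specific checks on model age and serving latency; both thresholds are invented, real SLOs belong to your product:]

```python
# Illustration only: page if the serving model gets too stale or if end-to-end
# serving latency blows past its budget.
import time

MAX_MODEL_AGE_SECONDS = 6 * 3600       # hypothetical freshness SLO
LATENCY_SLO_SECONDS = 0.050            # hypothetical 50 ms serving budget

def check_model_health(model_trained_at, recent_p99_latency_s, alert):
    age = time.time() - model_trained_at
    if age > MAX_MODEL_AGE_SECONDS:
        alert(f"serving model is {age / 3600:.1f}h old; training may be stuck")
    if recent_p99_latency_s > LATENCY_SLO_SECONDS:
        alert(f"p99 serving latency {recent_p99_latency_s * 1000:.0f}ms exceeds SLO")
```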
Onto topic (5): privacy. ML tends to involve user data.

Make sure that you are anonymizing user data. Users shouldn't be identifiable from outcomes.

You must be able to delete user data. Models should not contain user data, or must be promptly retrained when that data is deleted. #SREcon
Know how long your SLOs allow for completing that deletion process.
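[ed: a tiny sketch of one piece of this: keeping raw user IDs out of the training pipeline with keyed pseudonyms. Hashing alone is not anonymization, and deletion still requires retraining; this is illustration only:]

```python
# Illustration only: replace user identifiers with keyed pseudonyms before data
# ever reaches training, so raw IDs never enter the training set.
import hashlib
import hmac

PSEUDONYM_KEY = b"rotate-me-and-store-me-in-a-secret-manager"   # hypothetical

def pseudonymize(user_id: str) -> str:
    return hmac.new(PSEUDONYM_KEY, user_id.encode(), hashlib.sha256).hexdigest()
```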

Ethics in ML: in a nutshell, SREs have a distinct responsibility for the quality of the predictions and the data we're processing. We as SREs must be able to make decisions. #SREcon
We need oversight and guidelines, potentially from outside groups. See the AI Now Institute's work.

There needs to be followup on the recommendations and transparency, not just directing feedback to /dev/null. #SREcon
It's easier to add ML to products today than before, and we need to ensure that it's done in an ethical and fair way. #SREcon
What are the takeaways? You need to be prepared to turn off your ML features and fall back for a variety of reasons.

Training is part of your production system, not off to the side. You need to monitor your serving latency and your freshness. #SREcon
Data changes are tricky to get right, so monitor carefully and be prepared to compensate for problems.

Make sure unqualified models can't be pushed to prod. #SREcon
Predictions for the future: increasing use of open source pre-anonymized training data sets.

Need to spread load across multiple different models in parallel rather than only one big model. #SREcon
Models as a service to increase in importance -- use a central provider to do text to speech or image recognition rather than training everything yourself from scratch. #SREcon
Thanks to Alphabet collaborators including Google, DeepMind, YouTube, and outside contributors such as Clarifai and Thoughtworks. [fin] #SREcon
To capture a nuance I didn't write down fast enough: SREs need to understand who is using the model and how they may be affected by changes. And they need to be empowered to hit the 'stop' button at any time.
And there will be a video posted of this in a few weeks in case I didn't quite transcribe anything correctly.

