After lunch, I'll be livetweeting an #SREcon talk on productionizing machine learning services by Google SREs @salim and @villaviejac, in their professional capacities... [ed: and I saw some 🔥🔥🔥 slides about ML privacy & ethics in their dress rehearsal so this gonna be good]
They are SREs but not necessarily ML scientists/researchers.
Data about what can go wrong has been gathered from 40+ interviews with various teams. #SREcon
Myths: "machine learning is a black box." "you rarely have to roll back." "ML based monitoring is like other alerting." All nope. #SREcon
ML is good for everything except... if you don't have a fallback plan, don't have enough labeled data, or require microsecond rather than millisecond latency. #SREcon
Some uses of ML today in Google production: predicting user clicks on ads, prefetching next memory/file accesses, scheduling jobs and capacity planning, speech recognition, fraud detection, smart responses, and machine vision. #SREcon
One specific important example: YouTube machine learning for recommendations.
You need to continuously train to make sure that you reflect current trends. #SREcon
YouTube's recommendation algorithm is revenue-critical to Google because it influences watch time and therefore ads clicked.
But what can go wrong? What if we trained our model on data from the wrong day, e.g. a day when everyone's watching a sports event? #SREcon
But it's not that easy of a problem in production. Freshness matters, as do device type and view time, and spamminess/badness can pollute your training set.
Continuous deployment and continuous training is challenging to keep going. #SREcon
It's up to SRE to productionize these models so they can become reliable production pipelines.
This talk is designed to illuminate some of the ML productionization best practices. #SREcon
What's ML look like in prod? "Don't worry, it's just another data pipeline"
10% of the effort is offline training -- transforming prod data, training on TPUs, validating, and producing a trained model.
90% of the cycles are spent on serving: answering queries with the trained model. Right?
Well, not quite... there are some nuances. #SREcon
Five key areas: training and data quality, hardware management, qualification, backwards compatibility, and privacy/ethics. #SREcon
Training is no longer a concrete, point-in-time thing that we develop offline and release.
Production data changes fast. Model loss increases with time at a constant rate -- think of how Carlos's face changes as he grows a beard. #SREcon
We need to be able to continuously evolve our models in production, fast. You can never stop training. #SREcon
Data quality matters. It's generally true that more data is better, but often repeated data needs to be filtered, and missing data needs to be imputed.
You may not be able to train on *all* your terabytes of data. #SREcon
It's important to save some data for validation, rather than using it all to train and overfit. Never let your model see your validation data during training. No one size fits all -- some people say 70/30, other estimates vary. Qualification is separate. #SREcon
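[ed: a minimal sketch from me, not the speakers, of holding out a validation split the trainer never sees; the 70/30 ratio and scikit-learn usage are just illustrative]

```python
# Illustrative only: hold out validation data the model never sees in training.
from sklearn.model_selection import train_test_split

def split_examples(examples, labels):
    # 70/30 is one common choice; the right ratio depends on your data volume.
    x_train, x_val, y_train, y_val = train_test_split(
        examples, labels, test_size=0.3, shuffle=True, random_state=42)
    return (x_train, y_train), (x_val, y_val)
```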
We may need to shard and train different models for different use cases.
We can snapshot our models periodically to make incremental progress instead of starting off from scratch every time. #SREcon
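[ed: my own sketch of the snapshotting idea using TensorFlow checkpoint APIs; the path and setup are hypothetical, not what Google actually runs]

```python
# Hypothetical warm-start: resume continuous training from the latest snapshot
# instead of retraining from scratch.
import tensorflow as tf

def restore_or_init(model, optimizer, ckpt_dir="/models/recs/checkpoints"):
    ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
    manager = tf.train.CheckpointManager(ckpt, ckpt_dir, max_to_keep=5)
    if manager.latest_checkpoint:
        ckpt.restore(manager.latest_checkpoint)  # make incremental progress
    return manager  # call manager.save() periodically to snapshot the model
```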
Data quality issues: you need to be completely consistent about how you represent values.
Make sure your models never receive unexpected inputs. Make sure your data isn't biased by one specific source. #SREcon
You may need to stop training if some of your inputs go missing, or a day is anomalous.
You may need to be prepared to add more fields to old data and re-train if you incorporate new signals. #SREcon
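[ed: an illustrative guard of mine, not their code: refuse to kick off training when required inputs are missing or the day's volume looks anomalous; thresholds are made up]

```python
# Illustrative pre-training guard with made-up thresholds.
def safe_to_train(batch_stats, required_fields, baseline_rows, tolerance=0.5):
    missing = [f for f in required_fields
               if batch_stats["null_fraction"].get(f, 1.0) > 0.05]
    if missing:
        return False, f"inputs missing or mostly empty: {missing}"
    ratio = batch_stats["row_count"] / baseline_rows
    if abs(ratio - 1.0) > tolerance:  # e.g. a big sports event skews the day
        return False, f"anomalous volume: {ratio:.2f}x baseline"
    return True, "ok"
```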
What changes does this require to our pipelines?
We perform data imputation and filtering prior to training, then train/validate multiple copies in parallel. #SREcon
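[ed: my paraphrase of that pipeline shape in Python, with hypothetical helpers: impute and filter first, then fan training out to several candidate configs in parallel]

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from statistics import median

def prepare(examples):
    """Filter spammy rows, then impute missing watch_time before training."""
    kept = [e for e in examples if not e.get("is_spam", False)]
    fallback = median(e["watch_time"] for e in kept if "watch_time" in e)
    for e in kept:
        e.setdefault("watch_time", fallback)
    return kept

def train_one(examples, config):
    """Stand-in trainer; the real thing would run on TPUs."""
    return {"config": config, "examples_seen": len(examples)}

def train_candidates(examples, configs):
    clean = prepare(examples)
    with ProcessPoolExecutor() as pool:  # multiple copies in parallel
        return list(pool.map(partial(train_one, clean), configs))
```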
Onto topic (2): resource management. Hardware is such a bottleneck that companies like Google and Microsoft are building custom hardware.
32 GPU-days == less than 6 hours of a TPU-pod; the cost is dropping over time. #SREcon
The cost of training resources winds up growing faster than the cost of production serving/inference resources.
You need to make sure each part has appropriate resources it needs, rather than competing for resources. #SREcon
There's no good multitenancy for GPUs right now, although there are projects underway.
We need to qualify new models against production binaries, and/or some kind of A/B testing. Canarying matters to verify that behavior is similar and it doesn't break. #SREcon
To prevent mistakes, only allow signed qualified models in production that have their use cases and tested inputs specified. #SREcon
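[ed: not Google's actual mechanism -- just a hedged illustration of refusing to load a model unless it carries a valid signature over its bytes plus a manifest declaring its use case and tested inputs]

```python
import hashlib
import hmac
import json

def load_manifest_if_signed(model_bytes, manifest_json, signature_hex, signing_key):
    payload = model_bytes + manifest_json  # manifest declares use case + tested inputs
    expected = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature_hex):
        raise ValueError("model is unsigned or tampered with; refusing to serve")
    return json.loads(manifest_json)  # e.g. {"use_case": "recs", "tested_inputs": [...]}
```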
Models may add features that aren't backward compatible. You may not be able to reuse old models, so you need a fallback heuristic if you're unable to find a viable model to use. #SREcon
Actively deprecate old models if your system can't use them. Have explicit versioning and deployment strategies. #SREcon
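[ed: a toy version of that fallback logic, assuming a dict-style model registry keyed by version -- names are mine]

```python
def load_model_or_fallback(registry, compatible_versions):
    # Prefer the newest model version our binary can actually use.
    for version in sorted(compatible_versions, reverse=True):
        model = registry.get(version)
        if model is not None:
            return model
    # No viable model: fall back to a non-ML heuristic rather than failing.
    return lambda features: 0.5
```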
Remember that API versions can drift, as well as binary and model versions.
!!! remember you can't add a new feature to an old model without re-training. !!! #SREcon
Rollbacks of ML models have subtleties that require human judgment. But this means we can't do them quickly. So future work, maybe? [ed: I hope I got that right.] #SREcon
Okay, so now we have a change to our pipeline structure... we need to qualify by tee-ing user data against the same binary running two different models. Compare the champion model to the others.
Remember that shipping no new model = user-visible degradation. #SREcon
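[ed: a minimal sketch of the tee-ing idea with names I made up: replay the same requests through champion and challenger, serve only the champion, and compare the answers]

```python
def tee_and_compare(requests, champion, challenger, tolerance=0.05):
    served, disagreements = [], 0
    for req in requests:
        a, b = champion(req), challenger(req)
        served.append(a)  # users only ever see the champion's answers
        if abs(a - b) > tolerance:
            disagreements += 1
    rate = disagreements / max(len(requests), 1)
    return served, rate  # qualify or reject the challenger based on this rate
```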
What kind of alerting do we need? Need to measure how long training takes, end-user serving latency [including overhead from APIs and frameworks], age of models... #SREcon
If you don't have alerting, then you might miss pushing a bad model. You need domain-specific alerting [ed: e.g. have specific user journeys and SLOs in mind]. #SREcon
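[ed: one way to wire up those metrics, assuming prometheus_client; the metric names and the freshness-SLO comment are mine, not theirs]

```python
import time
from prometheus_client import Gauge, Histogram

TRAINING_DURATION = Gauge("ml_training_duration_seconds", "Wall time of the last training run")
MODEL_AGE = Gauge("ml_model_age_seconds", "Seconds since the serving model was trained")
SERVING_LATENCY = Histogram("ml_serving_latency_seconds", "End-to-end inference latency")

def record_training_run(started_at, finished_at):
    TRAINING_DURATION.set(finished_at - started_at)

def serve(request, model, model_trained_at):
    MODEL_AGE.set(time.time() - model_trained_at)  # alert when this blows the freshness SLO
    with SERVING_LATENCY.time():  # includes API/framework overhead
        return model(request)
```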
Onto topic (5): privacy. ML tends to involve user data.
Make sure that you are anonymizing user data. Users shouldn't be identifiable from outcomes.
You must be able to delete user data. Models should not contain user data, or must be promptly retrained when that data is deleted. #SREcon
Know what your SLOs are for completing that deletion process.
Ethics in ML: in a nutshell, SREs have a distinct responsibility for the quality of the predictions and the data we are processing. We as SREs must be able to make decisions. #SREcon
We need oversight and guidelines, potentially from outside groups. See the AI Now institute's work.
There needs to be followup on the recommendations and transparency, not just directing feedback to /dev/null. #SREcon
It's easier to add ML to products today than before, and we need to ensure that it's done in an ethical and fair way. #SREcon
What are the takeaways? You need to be prepared to turn off your ML features and fall back for a variety of reasons.
Training is part of your production system, not off to the side. You need to monitor your serving latency and your freshness. #SREcon
Data changes are tricky to get right, so monitor carefully and be prepared to compensate for problems.
Make sure unqualified models can't be pushed to prod. #SREcon
Predictions for the future: increasing use of open source pre-anonymized training data sets.
Need to spread load across multiple different models in parallel rather than only one big model. #SREcon
Models as a service will increase in importance -- use a central provider to do text-to-speech or image recognition rather than training everything yourself from scratch. #SREcon
Thanks to Alphabet collaborators including Google, DeepMind, YouTube, and outside contributors such as Clarifai and Thoughtworks. [fin] #SREcon
To capture a nuance I didn't write down fast enough: SREs need to understand who is using the model and how they may be affected by changes. And they need to be empowered to hit the 'stop' button at any time.
Final talk I'll be getting to at #VelocityConf before I dash to Toronto: @IanColdwater on improving container security on k8s.
@IanColdwater She focuses on hardening her employer's cloud container infrastructure, including doing work on k8s.
She also was an ethical hacker before she went into DevOps and DevSecOps. #VelocityConf
She travels around doing competitive hacking with CTFs. It's important to think like an attacker rather than assuming good intent and nice user personas who use our features the way the devs intended. #VelocityConf
My colleague @sethvargo on microservice security at #VelocityConf: traditionally we've thought of security as all-or-nothing -- you put the biggest possible padlock on your perimeter, and you have a secure zone and an untrusted zone.
@sethvargo We know that monoliths don't actually work, so we're moving towards microservices. But how does this change your security model?
You might have a loadbalancer that has software-defined rules. And you have a variety of compartmentalized networks. #VelocityConf
You might also be communicating with managed services such as Cloud SQL that are outside of your security perimeter.
You no longer have one resource, firewall, loadbalancer, and security team. You have many. Including "Chris." #VelocityConf
The problems we're solving: (1) why are monoliths harder to migrate? (2) Should you? (3) How do I start? (4) Best practices #VelocityConf
.@krisnova is a Gaypher (gay gopher), is a k8s maintainer, and is involved in two k8s SIGs (cluster lifecycle & AWS -- but she likes all the clouds, depending upon the day). And she did SRE before becoming a Dev Advocate! #VelocityConf
"just collect data and figure out later how you'll use it" doesn't work any more. #VelocityConf
We used to be optimistic before we ruined everything.
Mozilla also used to not collect data, and only had data on number of downloads, but its market share went down because they weren't measuring user satisfaction and actual usage. #VelocityConf