Unit testing ML pipelines is challenging because the data, features, and models keep changing. Shifting inputs and outputs make it hard to write fixed unit tests.

As a hacky workaround, I liberally use assert statements in scheduled tasks. They have saved me so many times. Thread: (1/11)
In ETL, whenever I do a join to produce a features table, I assert that all my primary keys are unique. The last time this failed, there was an issue in data ingestion. Without the assertion, I would have had duplicate predictions for some primary keys, and rankings would have been screwed. (2/11)
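A minimal sketch of that uniqueness guard in pandas; the table and column names (`orders`, `users`, `user_id`) are placeholders, not the actual pipeline:

```python
import pandas as pd

def build_features(orders: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    """Join raw tables into a features table keyed by user_id."""
    features = orders.merge(users, on="user_id", how="left")
    # Guard: the join must not fan out the primary key. If ingestion wrote
    # duplicate user rows, catch it here before predictions get duplicated.
    n_dupes = features["user_id"].duplicated().sum()
    assert features["user_id"].is_unique, f"{n_dupes} duplicate primary keys after join"
    return features
```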
Sometimes ML people (myself included) take fault tolerance for granted. When running ML systems on top of distributed systems, we need to make sure that all upstream tasks for a given timestamp have succeeded and their transactions have committed. (3/11)
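A minimal sketch of such a readiness check, assuming a `_SUCCESS`-marker convention and a per-timestamp directory layout (both placeholders, not the actual setup):

```python
import datetime as dt
from pathlib import Path

def assert_upstream_ready(data_root: Path, ts: dt.date, tasks: list[str]) -> None:
    """Fail fast if any upstream task has not committed output for this timestamp."""
    for task in tasks:
        marker = data_root / task / ts.isoformat() / "_SUCCESS"
        assert marker.exists(), f"upstream task {task!r} has no committed output for {ts}"
```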
In inference, I assert that the snapshot of data (assume window size w) being fed to the model is not significantly different from random snapshots of size w sampled from the train set. Unfortunately, “significantly different” means something different for each prediction task. (4/11)
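A minimal sketch of one version of this check on a single feature, using a two-sample KS test as a stand-in for “significantly different”; the test choice and threshold are assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

def assert_window_matches_train(window: np.ndarray, train: np.ndarray,
                                n_draws: int = 20, alpha: float = 0.01) -> None:
    """Compare the inference window against random train windows of the same size."""
    w = len(window)
    rng = np.random.default_rng(0)
    for _ in range(n_draws):
        sample = train[rng.choice(len(train), size=w, replace=False)]
        _, p_value = ks_2samp(window, sample)
        # A low p-value on any draw suggests the inference window has drifted.
        assert p_value > alpha, f"inference window differs from train sample (p={p_value:.4g})"
```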
Once, when the “difference” assertion failed, it was because I had accidentally promoted a model I trained 4 months ago to production. I guess this means my promotion process could use some work, but when working with time series data, small differences matter! (5/11)
In inference, I assert that the dates from the snapshot of data being fed to the model *do not overlap* with the dates of the train set. Once, this failed because of a typo in the dates specified in a DAG. Glad I caught this before showing prototype results to a customer. (6/11)
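A minimal sketch of that overlap guard, assuming the train window comes from DAG config and `inference_dates` is a datetime series; all names are placeholders:

```python
import pandas as pd

def assert_no_date_overlap(inference_dates: pd.Series, train_start: str, train_end: str) -> None:
    """Fail if any inference timestamp falls inside the training window (leakage)."""
    in_train = inference_dates.between(pd.Timestamp(train_start), pd.Timestamp(train_end))
    assert not in_train.any(), (
        f"{int(in_train.sum())} inference rows overlap the train window [{train_start}, {train_end}]"
    )
```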
In the product (an API that returns predictions), I assert that we do not lose or gain any prediction rows after joining the inference output/predictions onto other metadata. Once, this failed because of my own Spark incompetence / bugs in my code. (7/11)
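A toy PySpark version of that row-count guard; the tables, columns, and local session are made up for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("prediction-join-check").getOrCreate()

predictions = spark.createDataFrame([(1, 0.9), (2, 0.4), (3, 0.7)], ["item_id", "score"])
metadata = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["item_id", "category"])

# Guard: enriching predictions with metadata must neither drop nor duplicate rows.
before = predictions.count()
enriched = predictions.join(metadata, on="item_id", how="left")
after = enriched.count()
assert before == after, f"prediction count changed after metadata join: {before} -> {after}"
```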
In training, I assert that the “minimum viable metric value” is achieved on all train & val sets. A model gets trained on the “production” training window only if the metric value is achieved on all sets. Once, this failed because of a typo in the dates in the DAG. (8/11)
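A minimal sketch of that gate; the metric, threshold, and split names are placeholders:

```python
from typing import Mapping

MIN_VIABLE_METRIC = 0.75  # placeholder threshold; the real bar is task-specific

def assert_minimum_viable_metric(metric_by_split: Mapping[str, float],
                                 threshold: float = MIN_VIABLE_METRIC) -> None:
    """Only allow training on the production window if every split clears the bar."""
    failing = {split: value for split, value in metric_by_split.items() if value < threshold}
    assert not failing, f"metric below {threshold} on splits: {failing}"

# Example usage with made-up backtest splits:
assert_minimum_viable_metric({"train_2021Q1": 0.82, "val_2021Q2": 0.79})
```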
In training (for tree-based models), I assert that the “feature importance” of the most “important” feature is < some threshold. Once, this assertion failed because I had accidentally whitelisted a proxy for the label as a feature. That proxy had a very high feature importance. (9/11)
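A minimal sketch of that guard for a scikit-learn tree ensemble; the threshold and feature names are placeholders:

```python
from sklearn.ensemble import RandomForestClassifier

MAX_TOP_IMPORTANCE = 0.5  # placeholder; a single feature explaining more is suspicious

def assert_no_dominant_feature(model: RandomForestClassifier, feature_names: list[str],
                               threshold: float = MAX_TOP_IMPORTANCE) -> None:
    """Flag a single feature that explains too much -- often a leaked proxy for the label."""
    importances = model.feature_importances_
    top = int(importances.argmax())
    assert importances[top] < threshold, (
        f"feature {feature_names[top]!r} has importance {importances[top]:.2f}; "
        "check whether it is a proxy for the label"
    )
```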
These are just some of the assertions I write, and catching errors at runtime is by no means ideal. It would be better to have actual unit tests. We have some for code/syntax errors, but there is lots of room to improve in testing ML “logic” errors. (10/11)
I’m curious how others decide what to test, what others actually test, and how you test. Please do not only paste links to <random MLOps tool>. Looking forward to learning more 😊 (11/11)
You can follow @sh_reya.