In good software practices, you version code. Use Git. Track changes. Code in master is ground truth.

In ML, code alone isn't ground truth. I can run the same SQL query today and tomorrow and get different results. How do you replicate this good software practice for ML? (1/7)
Versioning the data is key, but you also need to version the model and artifacts. If an ML API returns different results when called the same way twice, there are many possible culprits: different data, a different scaler, a different model, etc. (2/7)
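To make that concrete, here is a minimal sketch (the file paths are hypothetical) that content-hashes every input to a prediction, so a change to the data, the scaler, or the model yields a new version id, not just a change to the code:

```python
import hashlib


def content_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes so any change to the artifact yields a new version id."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


# Hypothetical paths: version every input to the prediction, not only the code.
run_manifest = {
    "data": content_hash("data/train.parquet"),
    "scaler": content_hash("artifacts/scaler.pkl"),
    "model": content_hash("artifacts/model.pkl"),
}
print(run_manifest)
```

If two calls to the API disagree, comparing their manifests at least tells you which input changed.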
“Versioning” is not enough. How do you diff your versions? For code, you can visually inspect the diff on GitHub. But the size of data and artifacts >> the size of a company’s codebase. You can't easily inspect everything by eye. (3/7)
Diffing versions isn’t as simple as computing a diff in bytes. Of course the bytes of data or model can change in the next iteration. For each piece of data or artifact, you need to articulate what a diff means. Ex: your database of mammal pics now contains dog pics. (4/7)
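One way to articulate a semantic diff rather than a byte diff, assuming you can list the labels present in each dataset version (the label lists below are hypothetical):

```python
# Hypothetical label inventories for two dataset versions.
labels_v1 = ["cat", "horse", "whale"]
labels_v2 = ["cat", "horse", "whale", "dog"]

new_classes = set(labels_v2) - set(labels_v1)
removed_classes = set(labels_v1) - set(labels_v2)

if new_classes or removed_classes:
    # The bytes always change between versions; this is the change that matters.
    print(f"Semantic diff: +{sorted(new_classes)} -{sorted(removed_classes)}")
```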
My biggest criticism of MLOps tools: they can set up Postgres tables for you to log things to and inspect, but they don’t tell you what to diff or how to compute a diff. For data, maybe it’s a high Jensen-Shannon divergence. For a model, maybe it’s a change in accuracy. (5/7)
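A rough sketch of the data-diff idea, using SciPy's Jensen-Shannon distance on histograms of one feature drawn from two hypothetical data versions (the drift threshold is made up, not a standard value):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
# Hypothetical samples of one feature from two data versions.
feature_v1 = rng.normal(0.0, 1.0, size=10_000)
feature_v2 = rng.normal(0.3, 1.0, size=10_000)

# Bin both versions on a shared grid so the histograms are comparable.
bins = np.histogram_bin_edges(np.concatenate([feature_v1, feature_v2]), bins=50)
p, _ = np.histogram(feature_v1, bins=bins, density=True)
q, _ = np.histogram(feature_v2, bins=bins, density=True)

# SciPy returns the Jensen-Shannon distance; square it to get the divergence.
js_divergence = jensenshannon(p, q) ** 2
print(f"JS divergence between data versions: {js_divergence:.4f}")

# Hypothetical threshold: flag the diff only when the shift is large.
DATA_DRIFT_THRESHOLD = 0.05
if js_divergence > DATA_DRIFT_THRESHOLD:
    print("Data diff is significant, investigate before retraining.")
```

The model diff is analogous: compute accuracy (or whatever metric you care about) for both model versions on the same held-out set and compare the delta to a threshold you agree on.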
It takes companies many iterations of data science projects to figure out what they need to track, and how to track it over time. Ingested data. Cleaning properties. Features. Models. Outputs. Deviations from baselines. Hardware specs. The list is seemingly infinite. (6/7)
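A minimal sketch of what a tracked run record could look like; the field names and values here are hypothetical, just mirroring the list above:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json
import platform


@dataclass
class RunRecord:
    """One record per training run: data, cleaning, features, model, outputs, baseline, hardware."""
    data_version: str
    cleaning_steps: list[str]
    feature_names: list[str]
    model_version: str
    metrics: dict[str, float]
    baseline_metrics: dict[str, float]
    hardware: str = field(default_factory=platform.processor)
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


record = RunRecord(
    data_version="sha256:<data hash>",
    cleaning_steps=["drop_nulls", "clip_outliers"],
    feature_names=["age", "income"],
    model_version="sha256:<model hash>",
    metrics={"accuracy": 0.91},
    baseline_metrics={"accuracy": 0.93},
)
print(json.dumps(asdict(record), indent=2))
```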
So a big part of “good ML practices” is to communicate that you need to version more than the code, and work with your collaborators to align on what needs to be versioned. Any stakeholder should be able to inspect the diffs. (7/7)