Most benchmark datasets in RecSys are collected from an existing, already deployed recommendation system (MovieLens, Netflix, …). In this setting, users only provide feedback (r) on the items exposed to them (e) and not on others, aka "Closed Loop Feedback" [1/7]
Therefore, the deployed model (e.g. the RecSys model behind MovieLens) directly influences the collected feedback dataset (i.e. the MovieLens dataset). As a result, the deployed model acts as a confounding factor for any other model evaluated on such closed loop datasets [2/7]
Stratification based on the deployed model's characteristics (estimated via propensities) reveals that offline evaluation on closed loop datasets suffers from Simpson's paradox [3/7]
For example, model A performs "significantly" better than model B on the vast majority of the feedback in the dataset (99% of the dataset, the Q1 stratum), but this trend reverses when it is merged with the remaining 1% of the feedback (the Q2 stratum) [4/7]
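To make the reversal concrete, here is a tiny Python illustration with invented numbers (they are not taken from the paper): A wins inside the large Q1 stratum, yet B wins once the strata are pooled, because B's big margin on the tiny Q2 stratum outweighs A's small margin on Q1.

```python
# Hypothetical (hits, trials) per stratum for models A and B on the same held-out feedback.
# Q1 covers ~99% of the feedback, Q2 the remaining ~1%.  Numbers are made up for illustration.
def rate(hits, trials):
    return hits / trials

strata = {
    "Q1": {"A": (5_000, 99_000), "B": (4_500, 99_000)},  # A wins inside Q1
    "Q2": {"A": (10, 1_000),     "B": (600, 1_000)},     # B wins inside Q2
}

# Compare the models within each stratum.
for name, models in strata.items():
    a, b = rate(*models["A"]), rate(*models["B"])
    print(f"{name}: A={a:.4f}  B={b:.4f}  ->  {'A' if a > b else 'B'} wins")

# Pooling the strata (what a standard, unstratified offline evaluation does) flips the winner.
pooled = {
    m: rate(sum(strata[s][m][0] for s in strata),
            sum(strata[s][m][1] for s in strata))
    for m in ("A", "B")
}
print(f"Pooled: A={pooled['A']:.4f}  B={pooled['B']:.4f}  "
      f"->  {'A' if pooled['A'] > pooled['B'] else 'B'} wins")
```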
We presented several examples of Simpson's paradox across different models and datasets. In addition, we proposed a "stratified evaluation method" to account for the role of the deployed model in the offline evaluation of RecSys models [5/7] https://arxiv.org/abs/2104.08912
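For intuition only, here is a minimal sketch of what propensity-stratified reporting could look like. It is not the paper's exact procedure: propensities are approximated by item popularity, the Q1/Q2 split uses an arbitrary 1% quantile cut-off, and the function names are hypothetical.

```python
# Rough sketch of propensity-stratified offline evaluation (not the paper's exact method).
# Assumptions: propensity ~ item popularity; Q1/Q2 split at an arbitrary 1% quantile.
from collections import Counter
import numpy as np

def stratified_report(test_pairs, metric_per_pair):
    """test_pairs: list of held-out (user, item) interactions.
    metric_per_pair: dict mapping (user, item) -> metric value (e.g. hit=1 / miss=0).
    Returns the mean metric per stratum instead of a single pooled number."""
    # 1. Crude propensity estimate: how likely each item was to be exposed by the
    #    deployed system, proxied here by its popularity in the logged feedback.
    pop = Counter(item for _, item in test_pairs)
    total = sum(pop.values())
    propensity = np.array([pop[item] / total for _, item in test_pairs])

    # 2. Split feedback into Q1 (likely exposed by the deployed model, closed-loop
    #    dominated) and Q2 (the rest).  The cut-off is a free choice.
    threshold = np.quantile(propensity, 0.01)          # hypothetical 1% cut-off
    stratum = np.where(propensity > threshold, "Q1", "Q2")

    # 3. Report the metric per stratum; compare with the pooled number for contrast.
    scores = np.array([metric_per_pair[p] for p in test_pairs])
    return {s: float(scores[stratum == s].mean()) for s in ("Q1", "Q2")}
```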
Key message: think about the "data collection process" before applying any ML model, especially when using the collected dataset for "evaluation" purposes!
- Think twice: did the collected dataset pass through any kind of filter? [6/7]
Thanks to the Closed Loop Data Science project for shedding light on closed loop feedback across different domains, including RecSys [7/7] https://www.gla.ac.uk/schools/computing/research/researchsections/ida-section/closedloop/