“Why Are Deep Learning Models Not Consistently Winning Recommender Systems Competitions Yet?”

https://dl.acm.org/doi/abs/10.1145/3415959.3416001

My take is that we haven’t had the right model architectures. Here’s why I think that...
Going way back to the Netflix prize, multiplicative interactions have been a key component of successful modeling strategies. Matrix factorization did well on the Netflix data and became a classic approach to making recommendations.
Many further iterations on the key concept of factorizing matrices into low-rank approximations with vector embeddings per user/item/attribute have also been successful.
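As a minimal sketch of that core idea (the sizes and names below are illustrative, not from any particular system), matrix factorization scores a user-item pair with nothing more than a dot product between learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items, k = 1000, 500, 32  # hypothetical sizes

# One k-dimensional embedding per user and per item.
user_factors = rng.normal(scale=0.1, size=(n_users, k))
item_factors = rng.normal(scale=0.1, size=(n_items, k))

def predict(u, i):
    """Predicted affinity is a multiplicative interaction:
    the dot product of the user and item embeddings."""
    return user_factors[u] @ item_factors[i]

print(predict(42, 7))
```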
Drawing inspiration from word2vec, the CoFactor paper showed that you can improve the performance of MF by jointly factorizing an item-item mutual information matrix. Makes sense: there’s information in which pairs of items users like together that’s hard for MF to extract.
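Roughly, the CoFactor idea looks like this (an illustrative loss shape only, not the paper's exact weighted formulation; parameter names are mine). The item embeddings V are shared between the user-item term and the item-item term:

```python
import numpy as np

def cofactor_loss(R, M, U, V, W, c, lam=0.01):
    """Illustrative joint objective in the spirit of CoFactor.

    R: user-item interaction matrix, shape (n_users, n_items)
    M: item-item co-occurrence / PMI-style matrix, shape (n_items, n_items)
    U: user embeddings (n_users, k); V: item embeddings (n_items, k), shared
    W: context-item embeddings (n_items, k); c: per-item context biases (n_items,)
    """
    mf_term = np.sum((R - U @ V.T) ** 2)                 # classic MF reconstruction
    cooc_term = np.sum((M - V @ W.T - c[None, :]) ** 2)  # item-item factorization, reusing V
    reg = lam * (np.sum(U**2) + np.sum(V**2) + np.sum(W**2))
    return mf_term + cooc_term + reg
```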
Factorization machines demonstrated that you can extend the concept of multiplicative interactions between vector embeddings to tabular data, and get better cold-start performance and resilience to sparsity by incorporating metadata as side information.
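For reference, here's the standard second-order FM score, written with the usual O(nk) identity for the pairwise term (shapes and names are placeholders):

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Second-order factorization machine.

    x:  feature vector (n,) -- can mix one-hot user/item IDs and metadata
    w0: global bias; w: linear weights (n,)
    V:  one k-dimensional embedding per feature, shape (n, k)
    """
    linear = w0 + w @ x
    # Pairwise multiplicative interactions sum_{i<j} <V_i, V_j> x_i x_j,
    # computed as 0.5 * (||V^T x||^2 - sum_i ||V_i||^2 x_i^2).
    xv = V.T @ x                                          # (k,)
    pairwise = 0.5 * (xv @ xv - np.sum((V**2).T @ (x**2)))
    return linear + pairwise
```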
Field-aware FMs have been successful in many competitions, and they supercharge that idea by giving each feature a separate embedding for every field it interacts with. That increases model capacity, but with a structured inductive bias rather than just bigger embedding dimensions.
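A naive sketch of the field-aware interaction term (quadratic in the number of active features; shapes and names are illustrative):

```python
import numpy as np

def ffm_score(active, V):
    """Naive field-aware FM interaction term.

    active: list of (feature_index, field_index, value) for the active features
    V:      embeddings of shape (n_features, n_fields, k) --
            each feature keeps a separate embedding per field
    """
    score = 0.0
    for a in range(len(active)):
        for b in range(a + 1, len(active)):
            j1, f1, x1 = active[a]
            j2, f2, x2 = active[b]
            # Feature j1 uses its embedding for field f2, and vice versa.
            score += (V[j1, f2] @ V[j2, f1]) * x1 * x2
    return score
```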
What all these have in common is that the models explicitly incorporate pairwise interactions directly into their structure.
Early attempts to apply deep learning to recommendations largely did away with explicit pairwise multiplicative interactions, deciding instead to focus on data compression with autoencoders and MLPs.
Compressing a bunch of information into a low-dimensional embedding was part of the magic of matrix factorization, so it seemed like deep architectures should be able to do it better. And they can, but that wasn’t the only important element of previously successful models.
People tried concatenating user and item embeddings and feeding them through MLPs in so many different ways (Neural Matrix Factorization, Wide & Deep, DeepFM).
What we know now is that—despite theoretically being universal approximators—MLPs are horrendously inefficient at approximating multiplicative interactions.

(See the 2018 Latent Cross and 2020 NCF vs MF papers for details.)
So that approach hasn’t really panned out.
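To make the structural difference concrete, here's a toy contrast in PyTorch (my own naming, not any paper's exact architecture) between an explicit dot-product score and a concatenate-then-MLP score. The MLP has to spend capacity learning to multiply; the dot product gets multiplication for free:

```python
import torch
import torch.nn as nn

k = 32

def dot_score(u_emb, i_emb):
    # Explicit multiplicative interaction: built into the model structure.
    return (u_emb * i_emb).sum(dim=-1)

concat_mlp = nn.Sequential(      # NeuMF / Wide & Deep-style scoring head
    nn.Linear(2 * k, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

def mlp_score(u_emb, i_emb):
    # No multiplication anywhere: the MLP has to approximate it from
    # additive combinations, which it does very inefficiently.
    return concat_mlp(torch.cat([u_emb, i_emb], dim=-1)).squeeze(-1)
```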

What we’ve seen instead is success applying RNNs—which have multiplicative interactions—to sequential recommendations, and a shift toward two-tower networks with explicit multiplicative interactions (dot products) between the towers.
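A minimal two-tower sketch (PyTorch again; layer sizes and feature dimensions are placeholders): each tower compresses its side's features into an embedding, and the only place the towers meet is an explicit dot product:

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    def __init__(self, user_feat_dim, item_feat_dim, k=64):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Linear(user_feat_dim, 128), nn.ReLU(), nn.Linear(128, k))
        self.item_tower = nn.Sequential(
            nn.Linear(item_feat_dim, 128), nn.ReLU(), nn.Linear(128, k))

    def forward(self, user_feats, item_feats):
        u = self.user_tower(user_feats)
        v = self.item_tower(item_feats)
        # The towers interact only through an explicit multiplicative
        # interaction: a dot product between their output embeddings.
        return (u * v).sum(dim=-1)
```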
What that suggests to me is that the key component of factorization-based approaches wasn’t data compression, it was multiplicative interactions.
When you look at models like FMs and FFMs, they use low-rank approximations with vector embeddings but make no attempt to compress all user/item info to one vector. They do almost the exact opposite and go wild with lots and lots of vectors!
Looking around the current DL for RecSys landscape, the only model family I know of that takes that general approach is Deep & Cross Networks, which concatenate a bunch of feature embeddings and then form pairwise and higher-order interactions by repeatedly crossing the running representation with the original input.
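Here's a sketch of a DCN-style cross network (following the published cross-layer formula as I understand it; parameter names and initialization are mine). Every layer multiplies the running representation against the original input, so a stack of l layers builds interactions up to order l+1:

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One cross layer: x_{l+1} = x_0 * (x_l . w) + b + x_l."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, x0, xl):
        # Multiplying by x0 at every layer bakes explicit pairwise and
        # higher-order feature interactions into the model structure.
        return x0 * (xl @ self.w).unsqueeze(-1) + self.b + xl

class CrossNetwork(nn.Module):
    def __init__(self, dim, depth=3):
        super().__init__()
        self.layers = nn.ModuleList(CrossLayer(dim) for _ in range(depth))

    def forward(self, x0):
        x = x0
        for layer in self.layers:
            x = layer(x0, x)
        return x
```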
That architecture actually makes sense, given the history of successful recommender models.

The early hype didn’t really pan out, and a lot of naive applications of DL to RecSys haven’t really worked, but I’m optimistic about deep models that build on approaches we know work.
And what do we know works?

Explicit multiplicative interactions incorporated directly into the structure of the model.