[1/6] Our new preprint is now available on arXiv. We revisit baselines in policy gradient methods and show that they have a much bigger role than simply variance reduction! With
Wesley Chung, Valentin Thomas, and @le_roux_nicolas.
https://arxiv.org/pdf/2008.13773.pdf
[2/6] We show, for example, that two different baselines that lead to the *same* variance can induce different learning dynamics. It is not about variance but about the direction of the gradient, which is affected by the baseline! We have both empirical and theoretical results on this.
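To make this concrete, here is a minimal numpy sketch, assuming a toy 2-armed bandit with a softmax policy and deterministic rewards I picked for illustration (this is not the paper's experiment): two baselines placed symmetrically around the variance-minimizing one give the same gradient variance, yet the single-sample updates they produce are different.

```python
# Toy 2-armed bandit with a softmax policy (my own parameters, not the
# paper's setup). Baselines symmetric around the variance-minimizing one
# give the same gradient variance, but the individual stochastic updates
# they produce differ, which is what shapes the learning dynamics.
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.0])      # deterministic reward per arm
theta = np.zeros(2)                 # uniform softmax policy

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(a, theta):
    # d/dtheta log pi(a) for a softmax policy over two arms
    g = -softmax(theta)
    g[a] += 1.0
    return g

def sample_grad(a, b, theta):
    # single-sample REINFORCE gradient with baseline b, given sampled action a
    return (rewards[a] - b) * grad_log_pi(a, theta)

def grad_variance(b, theta, n=100_000):
    # Monte Carlo estimate of the total variance of the gradient estimator
    pi = softmax(theta)
    acts = rng.choice(2, size=n, p=pi)
    gs = np.array([sample_grad(a, b, theta) for a in acts])
    return gs.var(axis=0).sum()

# For a uniform policy and rewards {1, 0}, the minimum-variance scalar
# baseline is 0.5; baselines 0.1 and 0.9 are symmetric around it.
for b in (0.1, 0.9):
    print(f"b={b}: variance ~ {grad_variance(b, theta):.3f}, "
          f"update if arm 0 sampled: {sample_grad(0, b, theta)}, "
          f"update if arm 1 sampled: {sample_grad(1, b, theta)}")
```

Both baselines leave the expected gradient and its variance unchanged here, but which arm you happen to sample determines how big a step you take toward it, and that asymmetry is what drives the different dynamics.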
[3/6] In fact, the baseline can impact which point the algorithm converges to, even though it doesn't change the expectation of the gradients! We show this both empirically and theoretically. To do so theoretically, we had to analyze the stochastic estimates themselves, not the expected updates.
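For context, the standard argument for why the baseline leaves the expected gradient untouched (a textbook identity, not something new in the paper; writing R(a) for the reward of action a in a one-step/bandit view):

\[
\mathbb{E}_{a \sim \pi_\theta}\big[(R(a) - b)\,\nabla_\theta \log \pi_\theta(a)\big]
= \mathbb{E}_{a \sim \pi_\theta}\big[R(a)\,\nabla_\theta \log \pi_\theta(a)\big]
- b\,\nabla_\theta \sum_a \pi_\theta(a)
= \mathbb{E}_{a \sim \pi_\theta}\big[R(a)\,\nabla_\theta \log \pi_\theta(a)\big],
\]

since \(\sum_a \pi_\theta(a) = 1\). So any effect of \(b\) on where learning ends up has to come from the individual stochastic updates \((R(a) - b)\,\nabla_\theta \log \pi_\theta(a)\), which do depend on \(b\).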
[4/6] We also discuss a different way to speed up learning while ensuring convergence: importance sampling. However, we are talking about *designing* the sampling distribution instead of just correcting for trajectories someone else gave you. This opens up so many possibilities!
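To sketch the idea (not the paper's algorithm; a toy 3-armed bandit with names and parameters I made up): draw actions from a behaviour distribution mu that you design, say uniform over arms, and reweight by pi/mu so the estimate still targets the on-policy gradient.

```python
# Toy 3-armed bandit (my own example). Instead of sampling actions from
# the current policy pi, we *design* a behaviour distribution mu (here,
# uniform over arms) and correct with importance weights pi/mu, so every
# arm keeps contributing to the gradient estimate even when pi has
# nearly collapsed onto one arm.
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.8, 0.0])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(a, theta):
    g = -softmax(theta)
    g[a] += 1.0
    return g

theta = np.array([4.0, 0.0, 0.0])   # pi is almost committed to arm 0
pi = softmax(theta)
mu = np.full(3, 1.0 / 3.0)          # designed sampling distribution

# Importance-sampled policy-gradient estimate from samples drawn under mu.
n = 20_000
acts = rng.choice(3, size=n, p=mu)
is_grads = [(pi[a] / mu[a]) * rewards[a] * grad_log_pi(a, theta) for a in acts]
print("IS estimate of the gradient:", np.mean(is_grads, axis=0))

# Exact on-policy gradient for comparison (same expectation, different sampling).
exact = sum(pi[a] * rewards[a] * grad_log_pi(a, theta) for a in range(3))
print("exact gradient             :", exact)
```

The estimator stays unbiased for any mu with full support; the design question is which mu makes the stochastic updates better behaved.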
[5/6] I learned a lot while working with Wesley, Valentin, and Nicolas on this. I was often surprised by how the "folk knowledge" I had about baselines in PG methods was wrong. Carefully analyzing these things was very rewarding. Assumptions in theoretical results do matter!
[6/6] There's more to be done. In a sense, we're starting a conversation. In optimization, we often talk about curvature and variance, but in RL it is more complicated than that 😅. I'm particularly excited about the consequences this can have on how we think about exploration.
[7/6] I hadn't realized @wes_chung had a Twitter account. My bad 🙄. I'm tagging him in this thread.