I want to highlight something important that's mentioned in the latest OpenAI release, but has been said before, and that stands out to me as a key motif in human feedback and alignment:

*You can't just freeze a reward model and maximize it*

1/
If your plan for “the system does what people want” is anything like “learn or write down what people want, then optimize that”,

you need to do two things differently from the naive version (a toy sketch of both follows below):


1. The reward model should be *dynamic*
2. The optimization should be *weak*

2/
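To make those two points concrete, here is a minimal toy sketch. Everything in it (the 1-D action, the Gaussian policy, the polynomial proxy, all the constants) is made up for illustration and is not taken from the OpenAI release: the proxy reward model is refit each round on fresh ratings gathered near the current policy (dynamic), and each round the policy takes only a few small, KL-penalized steps on that proxy (weak).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_reward(a):
    # Hidden "what people actually want": peaks at a = 2.
    return np.exp(-0.5 * (a - 2.0) ** 2)

# Policy: a Gaussian over a 1-D action; for simplicity we only adapt its mean.
mu, sigma = 0.0, 0.5
kl_strength = 0.3   # "weak" optimization: penalize drifting far from the old policy

for rnd in range(20):
    # 1. Dynamic reward model: collect fresh noisy ratings *near the current
    #    policy* and refit the proxy every round, rather than fitting it once.
    actions = rng.normal(mu, sigma, size=50)
    ratings = true_reward(actions) + rng.normal(0.0, 0.05, size=50)
    proxy = np.poly1d(np.polyfit(actions, ratings, deg=2))

    # 2. Weak optimization: a few small gradient steps on the proxy, with a
    #    KL-style penalty (Gaussian policy, fixed sigma) pulling back toward
    #    the policy the ratings were collected under.
    old_mu = mu
    for _ in range(10):
        grad_proxy = np.polyder(proxy)(mu)
        grad_kl = (mu - old_mu) / sigma ** 2
        mu += 0.05 * (grad_proxy - kl_strength * grad_kl)

    print(f"round {rnd:2d}  mu={mu:5.2f}  "
          f"proxy={proxy(mu):5.2f}  true={true_reward(mu):5.2f}")
```

Run it and mu should drift toward the true optimum at a = 2, with the proxy and true scores staying roughly in agreement, because the proxy is only ever asked about the region it was just fit on.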
(If this is already obvious to you, I hope this thread just provides you with more pointers to places people have already said this. And feel free to chime in with more examples of where this point has been made)

3/
This comes up any time there’s a simplified model standing in for “what's better/worse”, and another process optimizing against it.

Most obviously in “reward learning” alignment agendas, but it also shows up in other areas that I think share the same underlying structure.

4/
Ibarz et al. pointed this out when training Atari agents from human feedback:

https://arxiv.org/pdf/1811.06521.pdf

(red curve: predicted reward; blue curve: actual reward)

At some point the system learns to "hack"/"game" the learned reward model: the predicted reward keeps climbing while the actual reward falls off. (A toy illustration of this failure mode is sketched right after this tweet.)

5/
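Here is a toy version of that failure mode (my sketch, not the paper's actual setup or numbers): fit a small proxy reward model once, on ratings gathered near the starting behaviour, freeze it, and then hill-climb on it hard. The proxy score keeps climbing while the true reward collapses as soon as the optimizer leaves the region the proxy was fit on.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_reward(a):
    # Same hidden objective as in the earlier sketch: peaks at a = 2.
    return np.exp(-0.5 * (a - 2.0) ** 2)

# Fit the proxy ONCE, on noisy ratings of actions near a = 0, then freeze it.
actions = rng.normal(0.0, 1.0, size=200)
ratings = true_reward(actions) + rng.normal(0.0, 0.05, size=200)
proxy = np.poly1d(np.polyfit(actions, ratings, deg=2))

# Naive strong optimization: keep hill-climbing on the frozen proxy.
a = 0.0
for step in range(201):
    a += 0.2 * np.polyder(proxy)(a)
    if step % 50 == 0:
        print(f"step {step:3d}  a={a:8.2f}  "
              f"proxy={proxy(a):8.2f}  true={true_reward(a):.3f}")
```

The numbers are meaningless; the shape is the point: the frozen proxy extrapolates confidently into regions it was never fit on, and a strong optimizer goes straight there.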
GAN training also has this structure: a learned discriminator stands in for “better/worse” and guides the generator toward better outputs.

The D from iteration 10,000 isn't necessarily "very astute" in some absolute sense; it's just a good coach for *that iteration's* G.

You can't freeze D. (Minimal training loop sketched below.)

6/
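A minimal GAN loop on 1-D toy data (a generic sketch, not tied to any particular paper) makes that structure visible: D is re-trained at every step against the current G, and G only ever optimizes against that freshly updated D; neither side is fit once and frozen.

```python
import torch
import torch.nn as nn

def real_data(n):
    # "Real" samples the generator should learn to imitate: N(2, 0.5).
    return torch.randn(n, 1) * 0.5 + 2.0

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))  # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))  # discriminator

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # Update D against the *current* G.
    real = real_data(64)
    fake = G(torch.randn(64, 8)).detach()
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Update G against the *current*, just-updated D.
    fake = G(torch.randn(64, 8))
    g_loss = bce(D(fake), torch.ones(64, 1))  # G wants D to label its samples "real"
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    if step % 500 == 0:
        with torch.no_grad():
            samples = G(torch.randn(1000, 8))
        print(f"step {step}: fake mean={samples.mean().item():.2f}, "
              f"std={samples.std().item():.2f}")
```

Freeze D after the first few steps and G just learns whatever fools that one snapshot, which is the GAN version of hacking a frozen reward model.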