A very rare bit of research that is directly, straight-up relevant to real alignment problems! They trained a reward function on human preferences AND THEN measured how hard you could optimize against the trained function before the results actually got worse. https://twitter.com/OpenAI/status/1301914879721234432
Tl;dr (he said with deliberate irony) you can ask for results as good as the best 99th percentile of rated stuff in the training data (a la Jessica Taylor's quantilization idea). Ask for things the trained reward function rates as "better" than that, and it starts to find...
..."loopholes" as seen from outside the system; places where the trained reward function poorly matches your real preferences, instead of places where your real preferences would rate high reward. ("Goodhart's Curse", the combination of Optimizer's Curse plus Goodhart's Law.)
That is: they had to impose a (new) quantitative form of "conservatism" in my terminology, producing only results similar (low KL divergence) to things already seen, in order to get human-valued output. They didn't directly optimize for the learned reward function!
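A minimal sketch of that kind of KL-style conservatism, assuming you have a learned reward model plus log-probabilities under the current and reference policies (the names reward_model, policy_logprob, ref_logprob, and beta are hypothetical illustrations, not OpenAI's actual code):

```python
# Hypothetical sketch: score candidates by the learned reward minus a
# per-sample KL-style penalty (log-prob under current policy minus log-prob
# under a reference policy), so the chosen outputs stay close to things
# already seen rather than purely maximizing the learned reward.

def kl_penalized_score(candidate, reward_model, policy_logprob, ref_logprob, beta=0.1):
    """Learned reward minus beta times a per-sample KL estimate."""
    r = reward_model(candidate)
    kl_term = policy_logprob(candidate) - ref_logprob(candidate)
    return r - beta * kl_term

def pick_best(candidates, reward_model, policy_logprob, ref_logprob, beta=0.1):
    """Choose the candidate with the highest penalized score."""
    return max(
        candidates,
        key=lambda c: kl_penalized_score(c, reward_model, policy_logprob, ref_logprob, beta),
    )
```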
Some alignment ideas to which this work seems relevant:
https://arbital.com/p/goodharts_curse/
https://intelligence.org/files/QuantilizersSaferAlternative.pdf
https://arbital.com/p/soft_optimizer/
https://arbital.com/p/conservative_concept/
Why this doesn't solve the whole problem: with powerful AGI, you're not limited by how far you can optimize a learned reward function before the learned reward function stops well-predicting human feedback; you're limited by how hard the AI can optimize before human raters break.
Not to undersell the research: https://twitter.com/Hernandez_Danny/status/1301958297335951360?s=20
To be explicit about precedents: this is not "learning a conservative concept" as I proposed that, nor "expected utility quantilization" as Jessica proposed that. OpenAI did a new thing, which you could see as simultaneously "mildly optimizing" and "conservative".
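For contrast, a minimal sketch of what expected utility quantilization would look like, assuming a learned reward model and a pool of candidates drawn from some base distribution (quantilize, reward_model, and q are hypothetical names; this is Jessica Taylor's idea in outline, not what OpenAI did):

```python
# Hypothetical sketch of a quantilizer: instead of taking the argmax of the
# learned reward, sample uniformly from the top-q fraction of candidates, so
# you never push far beyond the best-rated training-like outputs.
import random

def quantilize(candidates, reward_model, q=0.01):
    """Sample uniformly from the top-q fraction of candidates by learned reward."""
    ranked = sorted(candidates, key=reward_model, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return random.choice(ranked[:cutoff])
```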