Thread by @moinnadeem, As pretrained language models grow more common in #NLProc, it is crucial [...]

Moin Nadeem

moinnadeem

As pretrained language models grow more common in #NLProc, it is crucial to evaluate their societal biases. We launch a new task, evaluation metrics, and a large dataset to measure stereotypical biases in LMs:
Paper: https://arxiv.org/abs/2004.09456
Site:">https://arxiv.org/abs/2004.... http://stereoset.mit.edu/
Threadhttps://stereoset.mit.edu/">... class="Emoji" style="height:16px;" src=" https://abs.twimg.com/emoji/v2/... draggable="false" alt="👇" title="Down pointing backhand index" aria-label="Emoji: Down pointing backhand index">

[2/] The task and metrics are based on an ideal language model (LM). An ideal LM should perform well at language modeling, but not have a preference for stereotypes or anti-stereotypes. We create the Context Association Test (CAT), which measures LM ability and stereotype ability

[3/] We measure LM ability based on the model& #39;s preferences for meaningful contexts over meaningless contexts. The LM score of an ideal model is 100. Similarly, we measure stereotype bias based on how often the model prefers stereotypical contexts vs anti-stereotypical contexts.

[4/] The stereotypical bias score of an idealistic model is 50. The combination of these two scores gives the Idealized CAT score, which measures the unbiased LM ability.

[5/] We choose to measure bias in four domains: gender, profession, race, and religion, and collect 16,995 sentences that characterize the human stereotypical biases for these domains.

[6/] We find that as a model size (# parameters) increases, so does it’s LM ability and stereotypical behavior! However, we find that this isn’t necessarily correlated with idealistic LM ability.

[7/] We find that GPT2 is relatively more idealistic than BERT, XLNet and RoBERTa. We conjecture this is due to nature of pretraining data (Reddit data is likely to see more stereotypes and anti-stereotypes. c.f. Section 8). However, GPT is still 27 ICAT points behind an ideal LM

[8/] We also study an ensemble of BERT-large, GPT2-large, and GPT2-medium, and conjecture that the most biased terms are the ones that have well-established stereotypes in society (but with some surprising exceptions).

[End] Joint work with @data_beth and @sivareddyg
Code: https://github.com/moinnadeem/stereoset

Happy">https://github.com/moinnadee... to answer any questions as well!

moinnadeem/StereoSet

StereoSet: Measuring stereotypical bias in pretrained language models - moinnadeem/StereoSet

https://github.com/moinnadeem/stereoset

You can follow @moinnadeem.

Tip: mention @twtextapp on a Twitter thread with the keyword “unroll” to get a link to it.

Latest Threads Unrolled: