Language model prompting is all the rage lately in #NLProc

Let me break down what it is, why researchers are studying it, and how it enables surprising zero/few-shot performance across a variety of benchmarks.

1/n 🧵👇
Language models estimate the likelihood of a particular word occurring given the text that came before it, known as the "context".

They do this after being trained on a ridiculous amount of example text from books and the web.
So when you feed a given text to an LM, you're sorta asking it, "if you came across this text, what would you expect the next word to be?"
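You can pull that next-word distribution out of any causal LM directly. A minimal sketch with GPT-2 via 🤗 transformers (the checkpoint and example text here are just illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# "If you came across this text, what would the next word be?"
inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab_size)

# Distribution over the next token, given the context
next_token_probs = logits[0, -1].softmax(dim=-1)
top = next_token_probs.topk(5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id)!r}: {prob.item():.3f}")
```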

We can hack this for a given downstream task by formulating the task in natural language and evaluating the probability of different answers.
For example, let's say we want to translate a piece of text (prompt format below, from the GPT-3 paper)

We feed in the text:

"Translate English to French:
cheese =>"

and ask the model to predict the next word.

When a large enough model is trained on enough text, it will output "fromage"!
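In code, that's just asking a causal LM to continue the prompt. GPT-3 itself is API-only, so this sketch stands in GPT-2 (which, per the point above, is too small to do this reliably):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Translate English to French:\ncheese =>"
out = generator(prompt, max_new_tokens=3, do_sample=False)
print(out[0]["generated_text"])
```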
We can do something similar with classification tasks.

Below, I feed a dummy review to BERT followed by the text "This review is [MASK]." and ask the model whether the word "positive" or "negative" is more likely.

This works as a simple zero-shot sentiment classifier.
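Here's a minimal version of that classifier (bert-base-uncased and this particular template are just one reasonable choice; both matter in practice):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

review = "I loved every minute of this movie."
text = f"{review} This review is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

# Find the [MASK] position and take the model's distribution there
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    probs = model(**inputs).logits[0, mask_pos].softmax(dim=-1)

for word in ["positive", "negative"]:
    word_id = tokenizer.convert_tokens_to_ids(word)
    print(f"{word}: {probs[word_id].item():.3f}")
```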
@_philschmid put together a great little widget for experimenting with few-shot priming with GPT Neo.

https://tinyurl.com/gptneo 

(you can find your API key under Settings if you're signed in on http://huggingface.co )
This type of technique has several names. @OpenAI calls it "in-context learning", but it's also simply called "prompting" or "natural instructions".

There's also "priming" which means example text in the context that you want to imitate, such as in GPT-3's few-shot learning.
The thinking here is that during LM training, models learn to perform a large number of tasks implicitly in order to reduce perplexity.

Prompting can then be thought of as a way of locating the task of interest within that implicit task space, i.e. "prompting" the model to perform the task you wish.
You can also think of it as a kind of domain generalization.

Source domain: raw internet text
Target domain: some task instance formulated as natural text

The line between a "domain" and "task" gets blurry here. Call it "task generalization/adaptation" instead if you prefer 😊
This technique does surprisingly well on a number of benchmarks with few or no training examples.

The most valuable contribution of the GPT-3 paper IMO is the observation that the zero/few-shot prompting ability grows as the number of LM parameters grows.
@yoavgo did a great thread a while back investigating few-shot priming with GPT-3 on all kinds of linguistic tasks https://twitter.com/yoavgo/status/1284322876619882503
The method is far from flawless.

For one, the models that perform best with prompting are impractically large.

You can do some things at GPT-2 scale, but the true emergent zero-shot abilities require 10s or 100s of billions of parameters.
Another problem is that models seem to be very sensitive to the way the prompt is written.

The magic of prompt writing is reminiscent of the magic of DNN architecture tuning back in the day.
It seems to do well on some tasks but poorly on others. It struggles with sequence-pair tasks (NLI in particular) and others that appear to just be too hard or funky for large LMs.

See "The Turking Test" for some examples: https://arxiv.org/abs/2010.11982 
Nevertheless, the fact that it works as it does is interesting, and IMO has great practical potential as well.

And as I hinted at above, a lot of research is being done on this line of method 👇
Rep'ing @huggingface, @Fluke_Ellington and @srush_nlp did great work numerically quantifying the value of prompts, showing that a prompt is often worth the equivalent of fine-tuning on 100s of data points.

https://arxiv.org/abs/2103.08493 
@timo_schick has been doing great work for quite some time on reformulating tasks as cloze questions so that MLMs can be adapted to them in a semi-supervised/unsupervised way

https://arxiv.org/abs/2001.07676 
https://arxiv.org/abs/2009.07118 
Other works focus on the evaluation technique.

Zhao et al. recommend calibrating probabilities before use https://arxiv.org/abs/2102.09690 

Others show improvements from using a form of PMI rather than raw scores https://peterwestuw.github.io/surface-form-competition-project/surface_form_competition.pdf https://www.aclweb.org/anthology/D19-1109/
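Roughly, the PMI-style rescoring idea: score each candidate answer by its log-probability given the prompt, then subtract its log-probability under a task-neutral context, so generically frequent surface forms don't win by default. A sketch (the null prompt and example are mine, not the papers' exact recipes):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum of log P(answer tokens | prompt, preceding answer tokens).
    Assumes the prompt's tokenization is a prefix of prompt + answer."""
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(dim=-1)
    return sum(
        logprobs[0, pos - 1, ids[0, pos]].item()
        for pos in range(n_prompt, ids.shape[1])
    )

# PMI-style score: how much does the prompt raise the answer's
# probability over a task-neutral context?
raw = answer_logprob("Terrible film, a total waste of time. Sentiment:", " negative")
null = answer_logprob("Sentiment:", " negative")
print(raw - null)
```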
Lester et al. recently introduced a method for tuning prompts with T5 rather than fine-tuning model parameters https://arxiv.org/abs/2104.08691 

Shin et al. have similar work called "AutoPrompt", using gradient-guided search to tune prompts from data https://arxiv.org/abs/2010.15980
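The gist of the prompt-tuning line, as a toy sketch (shapes and names are mine, not the papers' code): freeze the LM and train only a few "soft prompt" vectors prepended to the input embeddings.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """k trainable prompt vectors prepended to the input embeddings."""
    def __init__(self, k: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(k, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        batch_size = input_embeds.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# A training loop would update only SoftPrompt.parameters();
# the LM's own weights stay frozen.
```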
Then, of course, there's the GPT-3 paper: https://arxiv.org/abs/2005.14165 

Prompting is also something we're looking closely at as part of the ongoing @BigscienceW 🌾 workshop / research project.

http://bigscience.huggingface.co 
This is obviously not meant to be a comprehensive lit review but feel free to respond with any other comments or references you find relevant!