. @Openai GPT-3 Thoughts and Takeaways

Demos are fun, but let's discuss the details.

This thread talks about sentence completion, trade-offs, few-shot learning, fine-tuning, technical takeaways, industry impacts, ethics, fun facts, and open questions.

cc @gdb

(1/13)
🤓 Short Intro

There are 4⃣ language models of increasing quality at the cost of increased latency: Ada, Babbage, Curie, and Davinci.

There are 2⃣ API endpoints: Completion and Search. We’ll talk mostly about Completion because it’s the main endpoint.
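For context, here's a minimal sketch of what calling both endpoints looked like with the 2020-era openai Python client (the example documents and values are my own, and method names may differ in newer SDK versions):

```python
import openai  # legacy 2020-era client; newer SDK versions expose a different interface

openai.api_key = "YOUR_API_KEY"

# Completion: generate text that continues the prompt.
completion = openai.Completion.create(
    engine="davinci",           # Ada, Babbage, Curie, or Davinci
    prompt="Once upon a time",
    max_tokens=20,
)
print(completion.choices[0].text)

# Search: rank documents by semantic relevance to a query.
search = openai.Engine("davinci").search(
    documents=["White House", "hospital", "school"],
    query="the president",
)
print(search["data"])
```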

(2/13)
Completion Parameters

🥇 Prompt - Input text.
🥈 Max_tokens - Maximum number of tokens to generate.
🌡️ Temperature - ⬇️ = less random + more deterministic. ⬆️ = more “creative.”
4⃣ Top_p - Diversity via nucleus sampling.
5⃣ Frequency_Penalty - ⬆️ = ⬇️ repetition.

(3/13)
6⃣ Presence_penalty - ⬆️ = ⬆️ new topics.
7⃣ N - How many completions to generate per prompt.
🎱 Stream - Whether to stream back partial progress.
9⃣ Logprobs - Return log probabilities of the most likely tokens; higher logprob = model is more confident.
🛑 Stop - Where API will stop generating further tokens.
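Putting those parameters together, here's a hedged sketch of a single Completion call (parameter names follow the API docs; the prompt and values are just illustrative):

```python
import openai

openai.api_key = "YOUR_API_KEY"

response = openai.Completion.create(
    engine="davinci",
    prompt="Write a one-sentence summary of nucleus sampling:",
    max_tokens=64,          # cap on output length
    temperature=0.7,        # lower = more deterministic, higher = more "creative"
    top_p=1.0,              # nucleus sampling; tune this OR temperature, not both
    frequency_penalty=0.5,  # discourage verbatim repetition
    presence_penalty=0.5,   # encourage new topics
    n=1,                    # number of completions to generate
    stream=False,           # set True to stream partial results
    logprobs=5,             # return log probs of the 5 most likely tokens
    stop=["\n"],            # stop generating at a newline
)
print(response.choices[0].text)
```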

(4/13)
👶 Example using "Playground": Q and A

GPT-3 thinks it's 10 years old and wants to be a doctor when it grows up because it wants to help people.

The playground is a fun toy, but the API makes running GPT-3 easier than running a linear regression using @scikit_learn.
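For reference, a Q&A-style interaction through the API could look roughly like this (my reconstruction, not the exact Playground preset):

```python
import openai

openai.api_key = "YOUR_API_KEY"

# A Q&A-style prompt: the model continues after the final "A:".
prompt = "Q: How old are you?\nA:"

response = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    max_tokens=30,
    temperature=0.7,
    stop=["\n"],  # stop at the end of the answer line
)
print(response.choices[0].text.strip())
```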

(5/13)
🤨 Example using API: GPT-3 Cracking the GRE

I discovered an interesting trade-off between random creativity and reproducible logic while experimenting with the GRE multiple-choice sentence completion task. Increasing the temperature (i.e., creativity) decreased the accuracy.
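Here's a hedged sketch of how such a temperature sweep could be run (the GRE-style question and scoring loop are hypothetical):

```python
import openai

openai.api_key = "YOUR_API_KEY"

# Hypothetical GRE-style items: a sentence with a blank plus answer choices.
gre_questions = [
    {
        "prompt": "Fill in the blank with the best choice.\n"
                  "Sentence: Her argument was so ___ that no one could refute it.\n"
                  "Choices: A) cogent B) vague C) timid\n"
                  "Answer:",
        "answer": "A",
    },
    # ... more questions ...
]

# Sweep temperature and measure multiple-choice accuracy at each setting.
for temperature in [0.0, 0.3, 0.7, 1.0]:
    correct = 0
    for q in gre_questions:
        response = openai.Completion.create(
            engine="davinci",
            prompt=q["prompt"],
            max_tokens=1,
            temperature=temperature,
        )
        if response.choices[0].text.strip().startswith(q["answer"]):
            correct += 1
    print(f"temperature={temperature}: accuracy={correct / len(gre_questions):.2f}")
```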

(6/13)
I wonder if GPT-3 would struggle to write a math book, because you'd want the prose to be creative but the math to be logical and repeatable. You probably wouldn't want a math book that was creatively written and then claimed 2+2=5.

(7/13)
😯 Few shot > Fine-tune

Back in the day (a few months ago), you needed to fine-tune a pre-trained model on a task-specific supervised dataset.

Today, you get similar results by simply prepending a few task-specific examples to the prompt during inference using GPT-3.
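A minimal sketch of that few-shot pattern, using sentiment classification as an illustrative (hypothetical) task:

```python
import openai

openai.api_key = "YOUR_API_KEY"

# Few-shot learning: prepend a handful of labeled examples to the prompt,
# then let the model complete the label for a new input. No fine-tuning needed.
few_shot_prompt = (
    "Tweet: I loved the new movie!\n"
    "Sentiment: positive\n\n"
    "Tweet: The service was terribly slow.\n"
    "Sentiment: negative\n\n"
    "Tweet: This thread about GPT-3 is super helpful.\n"
    "Sentiment:"
)

response = openai.Completion.create(
    engine="davinci",
    prompt=few_shot_prompt,
    max_tokens=1,
    temperature=0.0,  # deterministic output for a classification-style task
)
print(response.choices[0].text.strip())  # expected: "positive"
```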

(8/13)
🤯 Technical Takeaways

Zero-shot performance improves steadily with model size.
Few-shot performance increases more rapidly.
Larger models are better at in-context learning.
Graph from paper: https://arxiv.org/pdf/2005.14165.pdf

(9/13)
🤑 Industry Impacts

@OpenAI will be competing with AI-as-an-API startups, like @rev, and big tech companies with ML solutions, like @googlecloud.

Bigger models need better hardware, so companies will need to upgrade their ML serving infrastructure to keep up.

(10/13)
🧐 AI Ethics

The paper discussed social impact and potential misuse. @openai enabled the “Flag Toxicity” filter by default and lets us send feedback about “unsafe” content. They’re also working on a semantically deep toxicity filter built on the API.

(11/13)
🥳 Fun Facts

GPT - June 2018 release date, 117M parameters, ~5GB training set.
GPT-2 - February 2019, 1.5B, 40GB.
GPT-3 - June 2020, 175B, 570GB.
GPT-4 - June 2021, 1.5T, 5.7TB.

GPT-4 predicted by GPT-3.

(12/13)
🤔 My Personal Open Questions

How deep is the model's understanding?
How do we optimize the parameters? Random search?
How do we evaluate the model generally and specifically to priming?
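On the parameter question, one possibility is a simple random search over generation parameters against a small labeled set. A hedged sketch (the eval_set and score function are hypothetical):

```python
import random
import openai

openai.api_key = "YOUR_API_KEY"

# Hypothetical labeled examples: prompt plus the answer we expect in the completion.
eval_set = [
    {"prompt": "2 + 2 =", "expected": "4"},
    {"prompt": "The capital of France is", "expected": "Paris"},
]

def score(params):
    """Fraction of eval examples whose completion contains the expected answer."""
    correct = 0
    for ex in eval_set:
        response = openai.Completion.create(
            engine="davinci",
            prompt=ex["prompt"],
            max_tokens=5,
            temperature=params["temperature"],
            top_p=params["top_p"],
        )
        if ex["expected"] in response.choices[0].text:
            correct += 1
    return correct / len(eval_set)

# Random search: sample parameter settings and keep the best-scoring one.
best_params, best_acc = None, -1.0
for _ in range(10):
    params = {
        "temperature": random.uniform(0.0, 1.0),
        "top_p": random.uniform(0.5, 1.0),
    }
    acc = score(params)
    if acc > best_acc:
        best_acc, best_params = acc, params

print("best accuracy:", best_acc, "with params:", best_params)
```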

If anyone has any ideas, please feel free to reply.

(13/13)