What happens when we take a transformer pretrained on language data and fine-tune it on image data? And what if those weights are randomly initialized instead? This paper addresses exactly these questions. A summary thread 👇

paper: https://arxiv.org/abs/2103.05247  #MachineLearning #TransferLearning #GPT2
The authors take a pretrained GPT-2 and freeze all the transformer layers (the self-attention and feed-forward blocks). Only the input layer, the positional embeddings, the output layer, and the layer-norm parameters are fine-tuned, which the figure doesn't properly show! The model is called the Frozen Pretrained Transformer (FPT).
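Here is a minimal PyTorch sketch of that setup, assuming the HuggingFace transformers GPT-2. The class name, the linear input projection, and the mean-pooled classification head are my assumptions for illustration, not the paper's exact code.

```python
import torch.nn as nn
from transformers import GPT2Model


class FrozenPretrainedTransformer(nn.Module):
    """Sketch of the FPT idea: pretrained GPT-2 blocks are frozen;
    only input projection, positional embeddings, layer norms, and
    the output head are trained."""

    def __init__(self, input_dim: int, num_classes: int, model_name: str = "gpt2"):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained(model_name)
        hidden = self.gpt2.config.n_embd

        # Freeze everything, then re-enable the groups the paper fine-tunes:
        # layer norms ("ln_1", "ln_2", "ln_f") and positional embeddings ("wpe").
        for name, param in self.gpt2.named_parameters():
            param.requires_grad = ("ln_" in name) or ("wpe" in name)

        # New input projection and output head, trained from scratch.
        self.input_proj = nn.Linear(input_dim, hidden)
        self.output_head = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, input_dim), e.g. flattened image patches for CIFAR-10.
        h = self.input_proj(x)
        h = self.gpt2(inputs_embeds=h).last_hidden_state
        # Pooling choice is an assumption; the paper's output layer is task-specific.
        return self.output_head(h.mean(dim=1))
```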
The authors show that FPT performs as well as a full transformer (fully trained on the dataset) on many tasks. But it is not really an apples-to-apples comparison, because FPT has 12 transformer layers while the full transformer has only 3 in the CIFAR-10 case.
As we can see, the randomly initialized (frozen) model also does well. This table honestly sets back the authors' claim a bit, since the differences in accuracy are not that large. And, intuitively, the pretrained vision transformer does better on image data (except on MNIST).
These results show ablations of the FPT and the random frozen transformer, respectively. In both cases, the impact of the layer-norm parameters is significant.
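For the random-frozen baseline compared above, the only change to the sketch would be starting GPT-2 from its default random initialization instead of the pretrained checkpoint (again an assumption about how one might reproduce it):

```python
from transformers import GPT2Config, GPT2Model

# Random-frozen baseline (assumed setup): same architecture and same freezing
# scheme, but weights come from the default initialization rather than
# language pretraining.
random_gpt2 = GPT2Model(GPT2Config())  # instead of GPT2Model.from_pretrained("gpt2")
```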
You can follow @MLsummaries.