1/ I've taken positional embeddings in transformers for granted, but now looking at it: is the main reason for the unnatural sin/cos formulation the "nice" autocovariance structure? NLP twitter, help me out! @srush_nlp @colinraffel @ilyasut
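For reference, a minimal NumPy sketch of the standard sin/cos formulation and the autocovariance property in question (the dot product of two embeddings depends only on their offset):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Standard transformer positional embeddings (Vaswani et al., 2017):
    PE[t, 2i]   = sin(t / 10000^(2i/d_model))
    PE[t, 2i+1] = cos(t / 10000^(2i/d_model))
    """
    t = np.arange(max_len)[:, None]                                # positions, shape (max_len, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))  # one frequency per sin/cos pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(t * freqs)
    pe[:, 1::2] = np.cos(t * freqs)
    return pe

# The "nice" autocovariance: PE[t] . PE[t+k] depends only on the offset k,
# since sin(a)sin(b) + cos(a)cos(b) = cos(a - b) for each frequency pair.
pe = sinusoidal_pe(512, 64)
print(pe[3] @ pe[8], pe[100] @ pe[105])  # approximately equal
```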
2/ I could also sample each coordinate slice (total of d_model slices) from a 1D Gaussian process w/ kernel, say K(x, y) = e^{-(x-y)^2}, so that the ith positional embedding vector is the ith position across all d_model GP samples. This would give a nicer autocovariance. Anyone tried this?
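A minimal sketch of this GP construction, assuming exact Cholesky sampling over a fixed maximum length (the function name and details below are illustrative, not an existing implementation):

```python
import numpy as np

def gp_positional_embeddings(max_len, d_model, seed=0):
    """Draw d_model independent samples from a 1D GP with RBF kernel
    K(x, y) = exp(-(x - y)^2) over positions 0..max_len-1; the ith
    embedding vector is the ith position across all d_model samples."""
    rng = np.random.default_rng(seed)
    x = np.arange(max_len, dtype=float)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2)       # (max_len, max_len) RBF kernel
    K += 1e-6 * np.eye(max_len)                       # jitter for numerical stability
    L = np.linalg.cholesky(K)
    # Each column of the result is one GP sample over positions;
    # each row is the embedding for one position.
    return L @ rng.standard_normal((max_len, d_model))

pe = gp_positional_embeddings(256, 64)
# The empirical autocovariance decays with offset k like the kernel:
# (1/d_model) * pe[t] . pe[t+k] ≈ exp(-k^2), up to sampling noise.
print((pe[10] @ pe[11]) / 64, np.exp(-1.0))
```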
3/ Also, if using sin and cos, why not something more natural like sin(i t / d_model) as the ith coordinate of the embedding for position t? The autocovariance looks perhaps not as good, but still not bad. Anyone tried?
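A quick sketch of that alternative, to eyeball why its autocovariance is only approximately offset-dependent (again just an illustration of the formula in the tweet):

```python
import numpy as np

def linear_freq_pe(max_len, d_model):
    """Alternative from the tweet: the ith coordinate of the embedding
    for position t is sin(i * t / d_model)."""
    t = np.arange(max_len)[:, None]   # positions, shape (max_len, 1)
    i = np.arange(d_model)[None, :]   # coordinates, shape (1, d_model)
    return np.sin(i * t / d_model)

# Inner products depend mostly on the offset, but less cleanly than with the
# sin/cos pairing: sin(a)sin(b) = (cos(a-b) - cos(a+b)) / 2, so a
# position-dependent cos(a+b) term remains and only partially cancels.
pe = linear_freq_pe(512, 64)
print(pe[1] @ pe[6], pe[100] @ pe[105])  # roughly similar, not exactly equal
```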