Let's talk about the geometry of Jeffreys priors! A thread

1/n
Suppose we have a family of distributions parametrized continuously. Each distribution is a point on a statistical manifold, and we can choose a set of coordinates to parameterize it. e.g. the family Normal(μ, σ) is a 2D manifold with coordinates (μ, σ).

2/n
Because distributions are points on a manifold, we can equip it with a Riemannian metric that gives us a notion of distance.
On statistical manifolds, there's a special metric G called the Fisher-Rao metric, whose matrix is the Fisher information.
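A quick sketch of what that metric looks like in practice (assuming numpy/scipy; the Monte Carlo setup here is illustrative, not part of the thread): the Fisher information is the expected outer product of the score, and for Normal(μ, σ) it has the known closed form diag(1/σ², 2/σ²).

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo estimate of the Fisher information (Fisher-Rao metric)
# for Normal(mu, sigma), compared against the closed form
# G = diag(1/sigma^2, 2/sigma^2).
rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

def score(x, mu, sigma, eps=1e-5):
    """Gradient of the log-density wrt (mu, sigma), by central differences."""
    d_mu = (norm.logpdf(x, mu + eps, sigma) - norm.logpdf(x, mu - eps, sigma)) / (2 * eps)
    d_sigma = (norm.logpdf(x, mu, sigma + eps) - norm.logpdf(x, mu, sigma - eps)) / (2 * eps)
    return np.stack([d_mu, d_sigma])

s = score(x, mu, sigma)     # shape (2, n)
G = (s @ s.T) / x.shape[0]  # E[score scoreᵀ] ≈ Fisher information
print(G)  # ≈ [[0.25, 0], [0, 0.5]] for sigma = 2
```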

3/n
With the Fisher-Rao metric, we can measure a distance between two distributions.
And that distance will be symmetric and invariant to the choice of coordinates.
For example, the distance between two Normals is the same whether we parameterize by variance or precision.
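Here's a minimal check of that invariance (numbers mine, not from the thread): the Fisher metric transforms like any Riemannian metric under a change of coordinates, G_v = J G_τ J with J the Jacobian, so infinitesimal distances ds² = dθᵀ G dθ agree in both parameterizations.

```python
import numpy as np

# Normal with fixed mean, parameterized by variance v or precision tau = 1/v.
# Scalar Fisher informations (closed form):
#   G_v(v)     = 1 / (2 v^2)
#   G_tau(tau) = 1 / (2 tau^2)
v = 1.7
tau = 1.0 / v
J = -1.0 / v**2  # dtau/dv, Jacobian of the reparameterization
G_v = 1.0 / (2.0 * v**2)
G_tau = 1.0 / (2.0 * tau**2)

# Metric transformation rule: the metric in v-coordinates equals the
# metric in tau-coordinates conjugated by the Jacobian.
print(np.isclose(G_v, J * G_tau * J))  # True
```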

4/n
But we can also define a measure.
A distribution is just a probability measure (one whose total amount is 1).
On real numbers, we usually just assume the Lebesgue measure, but on manifolds, we can choose between a variety of measures.

5/n
We can use the metric to define a measure called the volume (Hausdorff) measure.
On the real numbers with the Euclidean metric, the Lebesgue measure *is* the volume measure.

6/n
Probability densities are weird.
They are formally the Radon-Nikodym derivatives of a probability measure wrt a base measure. That is, densities are always relative to something else, unlike distributions, which stand alone.

7/n
The density of the volume measure wrt the Lebesgue measure is the square root of the determinant of the Fisher-Rao metric, √det G, by the *area formula*.
But this is just the density of the Jeffreys prior!
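A concrete instance (a sketch assuming scipy; the function names are mine): for Bernoulli(p), the Fisher information is 1/(p(1−p)), so √det G ∝ p^(−1/2)(1−p)^(−1/2), which normalizes to the Beta(1/2, 1/2) density, i.e. the Jeffreys prior for p.

```python
import numpy as np
from scipy.special import beta as beta_fn
from scipy.stats import beta as beta_dist

def sqrt_fisher(p):
    # Fisher information of Bernoulli(p) is 1/(p(1-p)),
    # so sqrt(det G) = p^(-1/2) (1-p)^(-1/2)
    return 1.0 / np.sqrt(p * (1.0 - p))

Z = beta_fn(0.5, 0.5)  # normalizing constant B(1/2, 1/2) = pi
p = 0.3
print(np.isclose(sqrt_fisher(p) / Z, beta_dist.pdf(p, 0.5, 0.5)))  # True
```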

8/n
So the Jeffreys prior says, "Hey, don't know anything about your parameters? That's okay, here's a density on your parameters that is 'uniform' over the distributions, by the Fisher-Rao metric/volume measure notion of uniformity".

9/n
There's a lot going on here, and from a geometric perspective, Jeffreys priors are cool. But you still probably shouldn't use them. First, they're "distributions" on distributions, and if you're anything like me, it's far from intuitive what that even means.

10/n
You could draw random distributions and then random samples, except Jeffreys priors are often improper, meaning their total measure is infinite! That means you can't sample from your prior, so such prior predictive checks are out.
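To see the impropriety concretely (my example, not from the thread): the Jeffreys prior for the scale σ of a Normal has density ∝ 1/σ, and its integral over (ε, B) is log(B) − log(ε), which grows without bound.

```python
import numpy as np

# Jeffreys prior for the scale sigma of Normal(0, sigma): density ∝ 1/sigma.
# Its integral from eps to B is log(B) - log(eps), which diverges as
# B -> infinity, so the prior is improper: no normalizing constant exists.
eps = 1e-6
for B in [1e2, 1e4, 1e6, 1e12]:
    mass = np.log(B) - np.log(eps)
    print(B, mass)  # keeps growing: the total measure is infinite
```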

11/n
Second, they're uniform. Uniform priors toss out a ton of information, no matter how fancy they are! That's information that your model will need to learn from your data. Even weakly informative priors go a long way toward improving inferences.

12/n
Worse yet, I sometimes see Jeffreys priors for one distribution applied to parameters of another distribution, as though there's something intrinsically special about Jeffreys priors of any distribution. Yikes! Please don't do that!

13/n
Teaser: It's commonly said that the uniform distribution on the probability simplex is Dirichlet(1). But there's another uniform distribution: Dirichlet(1/2). More on this later.
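For the curious, a small sketch of the two "uniforms" (assuming numpy; Dirichlet(1/2) is the Jeffreys prior for the categorical, uniform in the Fisher-Rao volume sense, while Dirichlet(1) is uniform in the Lebesgue sense):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two notions of "uniform" on the probability simplex:
#   Dirichlet(1, 1, 1): uniform wrt the Lebesgue (flat) measure.
#   Dirichlet(1/2, 1/2, 1/2): Jeffreys prior for the categorical,
#   uniform wrt the Fisher-Rao volume measure.
flat = rng.dirichlet([1.0, 1.0, 1.0], size=1000)
jeffreys = rng.dirichlet([0.5, 0.5, 0.5], size=1000)
print(flat.sum(axis=1)[:3])      # every draw lies on the simplex
print(jeffreys.sum(axis=1)[:3])  # so does every Jeffreys draw
```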

14/n
Feel free to add your coolest factoid about Jeffreys priors!

15/n, n=15
You can follow @sethaxen.