📊 Two days ago, I asked a question. Way more people answered than expected, and... well, this week's weekly #quiz will be slightly ≠: a long thread on uniformity testing, trickling down all day long.

Be careful what you wish for :) #statistics

1/n https://twitter.com/ccanonne_/status/1283237083260137474
So.... uniformity testing. You have n i.i.d. samples from some unknown distribution over [k]={1,2,...,k} and want to know: is it *the* uniform distribution? Or is it statistically far from it, say, at total variation distance ε?

2/n
So, before delving deeper, let's recall what total variation distance is, and formalize the question. The former is basically a measure of how distinguishable two distributions are given a single sample.

https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures

The latter... is below.

3/n
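In symbols (standard definitions; the 2/3 success probability below is the usual convention rather than something specific to this thread):

```latex
% Total variation distance between p and q over [k]:
\mathrm{TV}(p, q) \;=\; \sup_{S \subseteq [k]} \big(p(S) - q(S)\big)
\;=\; \frac{1}{2}\sum_{i=1}^{k} |p(i) - q(i)|.

% Uniformity testing: given n i.i.d. samples from an unknown p over [k],
% distinguish, with probability at least 2/3, between
%   (i)  p = u_k   (the uniform distribution on [k]), and
%   (ii) \mathrm{TV}(p, u_k) > \varepsilon.
```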
First question before I go to sleep, wake up, and continue unleashing tweets in this thread tomorrow. Just so that we're all on the same page: what is the optimal sample complexity of uniformity testing, for the whole range of k and ε (up to constant factors)?

4/n
That being said, how do we *perform* those tests? And are all tests created equal? 🧐
(Answer: no 🍪)

For instance, I listed below a few criteria one may have in mind; there are others! Tomorrow, I'll start discussing the 7 ≠ algorithms I know...

5/n
So, let's resume with a table: here are the algorithms in question, with references. (Spoiler alert: the second column gives the answer to the poll above)

It'll be sprinkled through the day. Think of it as a taste of testing, if you will 🥁.

6/n
So, a key idea that is central to many of those is... to "forget" about TV distance and use ℓ₂ distance as proxy. If p is ε-far from uniform (in TV), it'll be ε/√k-far in ℓ₂ distance by Cauchy–Schwarz. And the advantage of ℓ₂ is that it has a very nice interpretation!

7/n
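Concretely, writing uₖ for the uniform distribution, the Cauchy–Schwarz step looks like this (constants as I compute them; everything here is up to constants anyway):

```latex
\|p - u_k\|_1 \;\le\; \sqrt{k}\,\|p - u_k\|_2,
\qquad
\|p - u_k\|_2^2 \;=\; \|p\|_2^2 - \tfrac{1}{k}.

% So: p = u_k                      =>  \|p\|_2^2 = 1/k,  while
%     \mathrm{TV}(p,u_k) > \varepsilon  =>  \|p - u_k\|_1 > 2\varepsilon
%                                      =>  \|p\|_2^2 > (1 + 4\varepsilon^2)/k.
```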
That's the idea that 4 of the 7 algorithms above rely on. Want to test uniformity of p in TV? It's enough to estimate ||p||₂² to ±ε²/k. (Estimating the squared norm is much nicer than the norm itself, as we'll see.)

So... how do we do that with only √k/ε² samples?

8/n
First algorithm: collisions! Since we are looking for a natural unbiased estimator for this squared ℓ₂ norm, it's a good time to remember this nice fact.

If I take two independent samples x,y from p, the probability that x=y (a "collision") is, you guessed it...

9/n
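In symbols:

```latex
\Pr_{x,\,y \sim p}\big[\,x = y\,\big] \;=\; \sum_{i=1}^{k} p(i)^2 \;=\; \|p\|_2^2.
```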
This gives our first algorithm: take n samples x₁,...,xₙ. For each of the {n choose 2} pairs, check if a collision occurs. Count those collisions, normalize by {n choose 2}, and use the result (call it Z₁) as an unbiased estimator for ||p||₂²; threshold appropriately.

Simple ☑️. Fast ☑️. Intuitive ☑️. Elegant ☑️.

10/n
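A minimal Python sketch of this collision tester, assuming samples in {0,...,k-1}; the acceptance threshold (and its constant) is an illustrative placeholder, not the one from the references:

```python
import numpy as np

def collision_tester(samples, k, eps):
    """Algo #1, collisions: sketch only (placeholder threshold constant)."""
    samples = np.asarray(samples)
    n = len(samples)
    counts = np.bincount(samples, minlength=k)
    # Number of colliding pairs among the C(n,2) pairs, via per-symbol counts.
    collisions = np.sum(counts * (counts - 1) // 2)
    # Z1: normalized collision count, an unbiased estimator of ||p||_2^2.
    z1 = collisions / (n * (n - 1) / 2)
    # Uniform: E[Z1] = 1/k.  eps-far in TV: E[Z1] >= (1 + 4 eps^2)/k.
    return z1 <= (1 + 2 * eps**2) / k    # True = "looks uniform"
```

(The bincount trick counts all C(Nᵢ,2) within-symbol pairs at once, instead of looping over the {n choose 2} pairs.)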
But how data-efficient is it? To analyze that, we need to know how big n must be for Z₁ to "concentrate" around its expectation ||p||₂². By Chebyshev, we need to bound the variance. Good: getting n≍√k/ε⁴ is not too hard.

Getting √k/ε²... *much* harder. "Simplicity" ☐

11/n
It *is* possible to bound the variance tightly enough to get the optimal √k/ε². It's just not something I would recommend doing for fun.

Which is a uniquely good segue for our *second* algorithm, based on a dual-ish idea: counting the unique elements instead! #nocollision

12/n
(short break now, I am out of coffee)
Back to our algorithm #2: instead of collisions, what about counting the *non-collisions*, i.e., the number of elements appearing exactly once among the n samples? That'll be maximized for the uniform distribution, so... good? Also, easy to implement.

Call that estimator Z₂...

13/n
... the expectation is a bit unwieldy, but making a bunch of approximations it's roughly 𝔼[Z₂]≈n-n²||p||₂². Good news! We can again use that as an estimator, provided we can also bound its variance...

A few tricks can help us: e.g., Efron–Stein: https://en.wikipedia.org/wiki/Concentration_inequality#Efron%E2%80%93Stein_inequality

14/??
... and that leads us to the optimal n≍√k/ε² sample complexity too! With a caveat, though: it only works for ε≫1/k¼.

Why? 🤔

15/n
Spoiler: not the 🦄. We count the number of unique elements, and there can't be more than k of them if the domain size is k. If we take n too large, we start having to ignore many samples.

So we need n ≪ k, or things start to break. Since n≍√k/ε², that gives ε≫1/k¼...

16/n
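A matching sketch for this "unique elements" tester (same caveats: my placeholder constants, samples in {0,...,k-1}, and it only makes sense in the n ≪ k regime discussed above):

```python
import numpy as np

def unique_elements_tester(samples, k, eps):
    """Algo #2, unique elements: sketch only (placeholder threshold constant).
    Only meaningful when n << k, i.e. eps >> 1/k^(1/4)."""
    samples = np.asarray(samples)
    n = len(samples)
    counts = np.bincount(samples, minlength=k)
    z2 = np.count_nonzero(counts == 1)   # elements appearing exactly once
    # Roughly E[Z2] ~ n - n^2 ||p||_2^2: largest (~ n - n^2/k) under uniform,
    # smaller by ~ 4 n^2 eps^2 / k when p is eps-far in TV.
    return z2 >= n - (n**2 / k) * (1 + 2 * eps**2)   # True = "looks uniform"
```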
To summarize the "unique elements algorithm", Algo #2:

Data efficient ☑️
Time efficient ☑️
Simple + "simple" ☑️
Elegant ☑️
... but has a restriction on the parameters ☐

17/n
(Cool, I'm done with algo 2 out of 7 and it's only been 17 tweets so far. I may wrap up before n=100. Time for a ☕ break!)
(Since I've been asked: no, I am not drinking a coffee per hour. I made coffee earlier, but it's a slow process: it's done now, time to drink it.)
Now onto algorithm #3. If you are a statistician, or just took #Stats 101, or just got lost on @Wikipedia at some point and ended up on the wrong page, you may know of Pearson’s χ² test for goodness-of-fit: let Nᵢ be the # of times i appears in the sample. Compute:

18/n
Then sit back and relax. Bad news: that does not actually lead to the optimal sample complexity: the variance of this thing can be too big, due to elements we only see zero or one time (so... most of them). The good news is that a simple "corrected" variant *will* work!

19/n
Why is that? When computing the variance, many Nᵢ terms will cancel if you only see an element i zero or one time, so the variance goes down. For the analysis, it’s helpful to think of taking Poisson(n) samples 🐟 instead of exactly n, as it simplifies things (and...

20/n
... changes ≈ nothing else here). Then the Nᵢs become independent, with Nᵢ ∼ Poisson(npᵢ) (that's not magic, it's 🐟). The expectation is exactly what we'd hope for, and bounding the variance is not insane either thanks to Poissonization: so Chebyshev, here we come!

21/n
(As an exercise, though, compute the variance *without* that "-Nᵢ" term. Things don't go so well.) Great, we get the optimal n≍√k/ε² here too. But... it's not really beautiful, is it?

Data efficient ☑️
Simple ☑️
"Simple" ☑️
Fast ☑️
Intuitive ☐
Elegant ☐

22/n
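For concreteness, here's a sketch of that corrected χ² statistic in Python; again, the threshold constant is an illustrative placeholder, and I'm skipping the Poissonization (drawing Poisson(n) samples instead of exactly n) that makes the analysis clean:

```python
import numpy as np

def corrected_chi2_tester(samples, k, eps):
    """Algo #3, corrected chi-square: sketch only (placeholder threshold)."""
    samples = np.asarray(samples)
    n = len(samples)
    counts = np.bincount(samples, minlength=k)   # N_i for each i in [k]
    # Sum_i [(N_i - n/k)^2 - N_i]: under Poissonization its expectation is
    # n^2 ||p - u_k||_2^2, i.e. 0 for uniform and >= ~4 n^2 eps^2 / k if eps-far.
    stat = np.sum((counts - n / k) ** 2 - counts)
    return stat <= 2 * n**2 * eps**2 / k         # True = "looks uniform"
```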
Time for our 4th algorithm! Let's take a break from ℓ₂ as proxy for TV and consider another, very natural thing: the *plugin estimator*. Given n samples from p, we can compute the empirical distribution, call that p̂. Now, recall we want to test the thing below:

23/n
Why not simply plug in p̂ instead of p in the distance, compute that, and hope for the best?

A reason might be: *this sounds like a terrible idea!* Unless n=Ω(k) (much more than what we want), the empirical p̂ will be at TV distance 1-o(1) from uₖ, even if p *is* uₖ.

24/n
That’s the thing, though: hell is in the o(1). Sure, 𝔼[Z₄] ≈1 whether p is uniform or far from it unless n=Ω(k). But this "≈" will be different in the two cases!

Carefully analyzing this tiny gap in 𝔼, and showing that Z₄ concentrates well enough to preserve it... works!

25/n
That it works at all (let alone results in the optimal n≍√k/ε²) is a real source of wonder to me. It also has nice properties (e.g., low "sensitivity")!
The analysis is not really simple, though...

Data efficient ☑️
Simple ☑️
"Simple" ☐
Fast ☑️
Intuitive ☐
Elegant ☑️

26/n
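A sketch of this plug-in tester. One honest caveat: I don't know a clean closed-form threshold for Z₄, so this version calibrates it by simulating Z₄ under the uniform distribution; that shortcut is mine, not the papers'. The point of the tweets above is precisely that the tiny gap survives once n≍√k/ε².

```python
import numpy as np

def empirical_tv_tester(samples, k, trials=200, rng=None):
    """Algo #4, plug-in empirical TV distance: sketch only.
    Threshold calibrated by simulation under uniform (my shortcut);
    eps enters only through how large n must be (~ sqrt(k)/eps^2)."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.asarray(samples)
    n = len(samples)

    def z4(x):
        # TV distance between the empirical distribution of x and u_k.
        counts = np.bincount(x, minlength=k)
        return 0.5 * np.sum(np.abs(counts / n - 1.0 / k))

    # Use a high quantile of Z4 under the uniform distribution as threshold.
    sims = [z4(rng.integers(k, size=n)) for _ in range(trials)]
    return z4(samples) <= np.quantile(sims, 0.9)   # True = "looks uniform"
```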
Now a tester that is *not* sample-optimal (but has other advantages, and is quite cute): Algo #5, "random binary hashing."

💡If there is one thing we know how to do, it's estimating the bias of a coin. We don't have a coin here, we have a glorious (k-1)-dimensional object.

27/n
Hell, let's just randomly make it a coin, shall we? Pick your favorite random hash function h: [k]→{0,1}, which partitions the domain into two sets S₀, S₁.

Hash all the n samples you got: NOW we have a random coin!

28/n
Let's estimate its bias then: we know exactly what this should be under uniform: uₖ(S₀). If only we could argue that p(S₀)≉uₖ(S₀) (with high proba over the choice of the hash function) whenever TV(p,uₖ)>ε, we'd be good.

Turns out... it is the case. Hurray :)

29/n
So we can just do exactly this: we need to estimate the bias p(S₀) up to ±ε/√k. This can be done with n≍k/ε² samples (coin flips), as desired.

Not optimal, but... pretty fun.

Data efficient ☐
Memory efficient ☑️
Simple ☑️
"Simple" ☑️
Fast ☑️
Intuitive ☑️
Elegant ☑️

30/n
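A sketch of this algorithm #5, "random binary hashing"; the ε/√k separation is the one claimed in the tweets above, and the factor 2 in the threshold is a placeholder of mine:

```python
import numpy as np

def random_hashing_tester(samples, k, eps, rng=None):
    """Algo #5, random binary hashing: sketch only (placeholder constants).
    Needs n ~ k/eps^2 samples (not optimal), but very little memory."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.asarray(samples)
    h = rng.integers(0, 2, size=k)           # random hash h: [k] -> {0,1}
    u_S0 = np.count_nonzero(h == 0) / k      # u_k(S0), known exactly
    p_S0_hat = np.mean(h[samples] == 0)      # empirical estimate of p(S0)
    # If TV(p, u_k) > eps then, with high probability over the choice of h,
    # |p(S0) - u_k(S0)| = Omega(eps / sqrt(k)); threshold accordingly.
    return abs(p_S0_hat - u_S0) <= eps / (2 * np.sqrt(k))   # True = "looks uniform"
```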
(Five algorithms out of seven so far, time for a break. I'm still missing Sleepy and Happy—more when I return!)

In the meantime: what do you care more about?
Let's continue! Algorithm #6: "Bipartite collision tester."

In our first, "collision-based" algorithm, recall that we took a multiset S of n samples from p and looked at the number of "collisions" in S to define our statistic Z₁. That gave us an estimator for ||p||₂²...

31/n
... which is fine, but required keeping in memory all the samples observed so far. Expensive.
One related idea would be to instead take *two* multisets S₁, S₂ of n₁ and n₂ samples, and only count "bipartite collisions", between samples in S₁ and samples in S₂:

32/n
That still has the right expectation 𝔼[Z₅]=||p||₂². Back to ℓ₂ as proxy! Compared to the "vanilla" collision-based test, this is more flexible: S₁, S₂ can be of ≠ sizes, which lends itself to settings where a tradeoff between n₁ and n₂ is desirable. We need:

33/n
As well as some other (annoying) technical condition (maybe not necessary?). For the case n₁=n₂, we get back the optimal n≍√k/ε², as long as [technical annoying condition] ε≫1/k^(1/10).

Not bad!

Data efficient ☑️
Simple ☑️
"Simple" ☐
Fast ☑️
Intuitive ☐
Elegant ☐

34/n
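A sketch of this bipartite collision tester, with the same disclaimers as before (placeholder threshold constant, and none of the technical conditions are checked):

```python
import numpy as np

def bipartite_collision_tester(s1, s2, k, eps):
    """Algo #6, bipartite collisions: sketch only (placeholder threshold)."""
    s1, s2 = np.asarray(s1), np.asarray(s2)
    c1 = np.bincount(s1, minlength=k)
    c2 = np.bincount(s2, minlength=k)
    # Number of colliding pairs (x, y) with x from S1 and y from S2, normalized:
    # an unbiased estimator of ||p||_2^2, like Z1, but S1 can be kept as a
    # histogram before the samples of S2 even arrive.
    z5 = np.dot(c1, c2) / (len(s1) * len(s2))
    return z5 <= (1 + 2 * eps**2) / k        # True = "looks uniform"
```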
Stay tuned for the very last algorithm. Spoiler: it is adaptive!
Here we go! Last algo: "empirical subset weighting." That one I really like. It's adaptive, it's weird, and (I think) it's new (h/t @AcharyaJayadev, @hstyagi, and @SZiteng)

Fix s < n. Take n samples from p, and consider the set S (not multiset) of elements appearing among the first s samples.

35/n
Let's consider the random variable p(S). Its expectation (see below) is (making a bunch of approximations) 𝔼[p(S)]≈s||p||₂².

Great: we have a new estimator for (roughly) the squared ℓ₂ norm! Assuming things went well and p(S) concentrates around its expectation, ...

36/n
... at the end of this first stage we have S s.t. p(S) is either ≈s/k or ≈s(1+Ω(ε²))/k. Neat... a biased coin problem!

Let's do a second stage then. Take the next n-s samples, check which ones fall in S, use that to estimate p(S) to ±sε²/k.

You can do that as long as:

37/n
(sanity check: why?) So in particular, for s=n/2 we get the optimal n≍√k/ε². Hurray!

OK, there is the same slight bummer as in the "unique elements" algorithm: we need s ≪ k in the first stage (can you see why?), so overall this requires ε ≫ 1/k¼. Oh, well.

38/n
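And a sketch of this last, "empirical subset weighting" tester; the s=n/2 split and the threshold constant are illustrative, and the second-stage sample condition in the docstring is my back-of-the-envelope reading of the tweets above:

```python
import numpy as np

def subset_weighting_tester(samples, k, eps):
    """Algo #7, empirical subset weighting: sketch only (placeholder constants).
    Needs s << k (hence eps >> 1/k^(1/4)), and roughly n - s >~ k/(s eps^4)
    so the second stage can estimate p(S) to within ~ s eps^2 / k."""
    samples = np.asarray(samples)
    n = len(samples)
    s = n // 2
    S = np.unique(samples[:s])            # the *set* seen in the first s samples
    # Second stage: fraction of the remaining samples that land in S.
    p_S_hat = np.mean(np.isin(samples[s:], S))
    # Uniform: p(S) = |S|/k exactly.  eps-far: p(S) ~ s(1 + Omega(eps^2))/k.
    return p_S_hat <= (len(S) / k) * (1 + 2 * eps**2)   # True = "looks uniform"
```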
To summarize our last, "empirical subset weighting" algorithm:

Data efficient ☑️
Memory efficient ☑️
Simple ☑️
"Simple" ☐ (jury's still out)
Fast ☑️
Intuitive ☑️
Elegant ☑️

Yes, I'm a bit biased, I know.

39/n
Anyways, this concludes this giant thread! Hope you enjoyed it—feel free to comment or ask questions below. ↴

I'll put up a 📝 summary soon. In the meantime, here's a subset of the algorithms mentioned: which one do *you* prefer? 🙋

40/end