Before PCA (i.e., SVD), I preprocess with three principles:
1) sqrt any features that are counts. log any feature with a heavy tail.
2) localization is noise. *regularize* when you normalize.
3) and my favorite rule, the Cheshire cat rule

explanations in 🧵... https://twitter.com/seanjtaylor/status/1297706506196905985
1)

count data and data with heavy tails can mess up the PCA.

PCA prefers things that are "homoscedastic" (which is my favorite word to ASMR and I literally do it in class)

sqrt and log are "variance stabilizing transformations". They typically fix it!
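
A minimal sketch of what I mean, in Python (the data and column roles here are made-up stand-ins, not anyone's real pipeline):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# toy stand-ins: a Poisson count feature and a heavy-tailed feature
counts = rng.poisson(lam=5, size=(500, 1)).astype(float)
heavy = rng.lognormal(mean=0.0, sigma=2.0, size=(500, 1))
other = rng.normal(size=(500, 3))

X = np.hstack([
    np.sqrt(counts),   # sqrt stabilizes the variance of counts
    np.log1p(heavy),   # log1p tames the heavy tail (and handles zeros)
    other,
])
scores = PCA(n_components=2).fit_transform(X)
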
2) localization

if you make a histogram of a component (or loading) vector and it has really big outliers, that is localization. It's bad. It means the vector is noise.

Here is a better diagnostic that my lab uses: https://github.com/karlrohe/LocalizationDiagnostic
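
A quick version of the plain histogram check (not the repo's diagnostic), assuming numpy/scipy/matplotlib and a random stand-in matrix:

import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.poisson(1, size=(300, 200)).astype(float)  # stand-in for your matrix

U, s, Vt = svds(A, k=5)  # leading singular vectors
# a localized vector has a few entries far outside the bulk of its histogram
for j in range(U.shape[1]):
    plt.hist(U[:, j], bins=50, histtype="step", label=f"u{j + 1}")
plt.legend()
plt.xlabel("singular vector entries")
plt.show()
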
To address localization, I would suggest normalizing by *regularized* row/column sums. This works like fucking magic. Not even kidding.

Before learning this trick from @kamalikac, I had given up on spectral techniques.
Let A be your matrix, let rs be the vector of row sums, and let cs be the vector of column sums. Define

D_r = Diagonal(1 / sqrt(rs + mean(rs)))
D_c = Diagonal(1 / sqrt(cs + mean(cs)))

Do SVD on

D_r A D_c

The use of mean(rs) and mean(cs) is what makes it regularized.
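
Putting the recipe together, a minimal sketch in Python (assuming scipy/numpy; A here is a sparse random stand-in for your matrix):

import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = csr_matrix(rng.poisson(0.05, size=(1000, 800)).astype(float))

rs = np.asarray(A.sum(axis=1)).ravel()  # row sums
cs = np.asarray(A.sum(axis=0)).ravel()  # column sums

# the "+ mean" terms are the regularization; they also keep rows/columns
# with tiny sums from getting enormous weights
D_r = diags(1.0 / np.sqrt(rs + rs.mean()))
D_c = diags(1.0 / np.sqrt(cs + cs.mean()))

U, s, Vt = svds(D_r @ A @ D_c, k=10)

Without the mean terms this is just the usual degree normalization, where rows/columns with tiny sums get huge weights -- which is exactly the kind of thing that drives localization.
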
If you want to know why it works so well... this is my best shot:

paper: https://papers.nips.cc/paper/8262-understanding-regularized-spectral-clustering-via-graph-conductance.pdf

thread on the paper: https://twitter.com/karlrohe/status/1011269017582137346?s=20

YouTube summary of the paper:

Again, the diagnostic to assess localization: https://github.com/karlrohe/LocalizationDiagnostic
3) the Cheshire cat rule.

“One day Alice came to a fork in the road and saw a Cheshire cat in a tree. ‘Which road do I take?’ she asked. ‘Where do you want to go?’ was his response. ‘I don’t know,’ Alice answered. ‘Then,’ said the cat, ‘it doesn’t matter.’”
In unsupervised learning, we often don't quite know where we are going.

So, is it ok to down-weight, discard, or interact the features? Try it out and see where it takes you!