Before PCA (i.e., SVD), I preprocess with three principles:
1) sqrt any features that are counts. log any feature with a heavy tail.
2) localization is noise. *regularize* when you normalize.
3) and my favorite rule, the Cheshire cat rule
explanations in
... https://twitter.com/seanjtaylor/status/1297706506196905985

1)
count data and data with heavy tails can mess up the PCA.
PCA prefers things that are "homoscedastic" (which is my favorite word to ASMR and I literally do it in class)
sqrt and log are "variance stabilizing transformations". They typically fix it!
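If you want it concrete, here is a rough sketch in Python/numpy (the feature names and data are made up, just to show the pattern):

import numpy as np
import pandas as pd

# toy data, purely illustrative
df = pd.DataFrame({
    "n_clicks": np.random.poisson(5, size=1000),        # a count feature
    "income":   np.random.lognormal(10, 1, size=1000),  # a heavy-tailed feature
})

X = pd.DataFrame({
    "n_clicks": np.sqrt(df["n_clicks"]),  # sqrt stabilizes the variance of counts
    "income":   np.log1p(df["income"]),   # log (log1p if you have zeros) tames heavy tails
})
# now run PCA / SVD on X instead of df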
2) localization
if you make a histogram of a component (or loading) vector and it has really big outliers, that is localization. It's bad. It means the vector is noise.
Here is a better diagnostic that my lab uses: https://github.com/karlrohe/LocalizationDiagnostic
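For a quick-and-dirty version of the histogram check (this is not the lab's diagnostic from the repo above, just numpy on a toy matrix):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(2000, 300)).astype(float)  # toy count matrix, stand-in for your data

U, s, Vt = np.linalg.svd(A, full_matrices=False)      # thin SVD

v = Vt[1]             # a loading vector (row of Vt); U[:, 1] would be the component side
plt.hist(v, bins=50)  # a few entries way out in the tails = localization = noise
plt.show()

# same idea as a single number: biggest entry relative to the typical entry
print(np.abs(v).max() / np.sqrt((v ** 2).mean()))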
To address localization, I would suggest normalizing by *regularized* row/column sums. This works like fucking magic. Not even kidding.
Before learning this trick from @kamalikac I had given up on spectral techniques.
Let A be your matrix, define rs to contain the row sums, and cs to contain the column sums. define
D_r = Diagonal(1 / sqrt(rs + mean(rs)))
D_c = Diagonal(1 / sqrt(cs + mean(cs)))
Do SVD on
D_r A D_c
The use of mean(rs) and mean(cs) is what makes it regularized.
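In Python/numpy that recipe looks roughly like this (toy matrix again; for big sparse matrices you'd want scipy.sparse and a truncated SVD like scipy.sparse.linalg.svds instead of the dense SVD):

import numpy as np

rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(2000, 300)).astype(float)  # stand-in for your data matrix

rs = A.sum(axis=1)  # row sums
cs = A.sum(axis=0)  # column sums

# adding mean(rs) / mean(cs) before the 1/sqrt is the regularization:
# it keeps rows/columns with tiny sums from getting huge weights
d_r = 1.0 / np.sqrt(rs + rs.mean())
d_c = 1.0 / np.sqrt(cs + cs.mean())

L = d_r[:, None] * A * d_c[None, :]  # this is D_r A D_c without building the diagonal matrices

U, s, Vt = np.linalg.svd(L, full_matrices=False)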
If you want to know why it works so well... this is my best shot:
paper: https://papers.nips.cc/paper/8262-understanding-regularized-spectral-clustering-via-graph-conductance.pdf
thread on the paper: https://twitter.com/karlrohe/status/1011269017582137346?s=20
YouTube summary of the paper:
Again, the diagnostic to assess localization: https://github.com/karlrohe/LocalizationDiagnostic
3) the Cheshire cat rule.
"One day Alice came to a fork in the road and saw a Cheshire cat in a tree. 'Which road do I take?' she asked. 'Where do you want to go?' was his response. 'I don't know,' Alice answered. 'Then,' said the cat, 'it doesn't matter.'"
In unsupervised learning, we often don't quite know where we are going.
So, is it ok to down-weight, discard, or interact the features? Try it out and see where it takes you!