a tweetorial on #broom and linear models with categorical features in #rstats based on a super common confusion and related feature request we get

suppose you estimate a linear regression where a categorical predictor is of primary interest, something like this:
by default, stats::lm() uses treatment coding, so the intercept represents the mean of the base level of cyl, in this case cyl=4. if you have other covariates, you'd get the mean conditional on cyl=4 after partialing out those covariates, more or less
when the regression is like this people often want to compare means between groups like you would with an ANOVA

(note that we have estimated four parameters: the intercept, cyl6, cyl8, and a variance parameter)
now, *in this simple case*, we have:

E[mpg|cyl=4] = 26.7
E[mpg|cyl=6] = 26.7 - 6.9
E[mpg|cyl=8] = 26.7 - 11.6

and we can easily recover the group means from the parameters themselves and plot them
there is a very common confusion that pops up here. because we are used to backing out the conditional means from the parameters in the regression, people often conflate the *parameters* with the *conditional means* themselves
(it is important to note that the map between the parameters and the conditional mean depends on the contrast we use to turn a categorical variable into numeric variables. there are lots of ways to do this, each with a different interpretation!)
in particular, people expect broom to give them conditional means. so a common feature request, for the regression above, is to rename the (Intercept) term to something nice and related to the corresponding conditional mean (`cyl:4` for example or something like that)
this is a very hard thing to do for several reasons!

the first reason is that to understand how R generates the names of the terms in the regression you pretty much have to read the model.matrix() C code! there isn't nice documentation for this 😱
the second reason is that the request doesn't generalize to the case when there are additional terms in the regression. what should happen when there are two categorical variables? what if there's an additional continuous variable?
if you try to extend tidy.lm() for this example it will not be fun. at some point you run into a fundamental incompatibility between what broom::tidy() does (put parameters in a tibble) and what you want it to do (compare means across different levels of cyl)
there is still a very natural way to compare the data points with cyl = 4, cyl = 6, and cyl = 8 that we get from this regression. we *marginalize* over hp and transmission. that is we compute

E[mpg|cyl=4] = E[E[mpg|cyl=y, transmission, hp]]
i.e. use all the parameters to estimate the "finest level" conditional mean for each data point, and then average these for data with cyl=4, cyl=6, cyl=8 and so on

doing this correctly requires patience, linear algebra, and *lots of care with respect to contrasts*
this is called calculating a marginal mean in the general case and this is not something that broom does! this is far outside the scope of what broom does, even though it feels very natural to expect broom to do this!
there are other packages that do calculate marginal means, however! my favorite is emmeans, by which i mean to say that emmeans is the only one that i've ever used and it's served me well and i trust the package author
emmeans can automatically do all sorts of wonderful things for you like get SEs, multiple comparison adjusted p-values and CIs, and on and on. it has great vignettes documenting how it works. even better, broom can tidy emmeans output!

https://github.com/rvlenth/emmeans 
so anyway, this is a PSA

- it is very natural to confuse parameters and marginal means (sometimes called conditional means too)

- broom can help in both cases, but if you need marginal means, start with emmeans and tidy() the emmeans() output, not the model!
You can follow @alexpghayes.
Tip: mention @twtextapp on a Twitter thread with the keyword “unroll” to get a link to it.

Latest Threads Unrolled: