Our (@BreskinEpi's and my) paper on double cross-fitting is now officially published in the current issue of @EpidemiologyLWW
Paper is here https://journals.lww.com/epidem/Citation/2021/05000/Machine_Learning_for_Causal_Inference__On_the_Use.12.aspx
After identification, the challenge is to estimate the quantity of interest (in our case, the average causal effect). However, when the confounder(s) Z are high-dimensional, we can't use the nonparametric g-formula
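To see why, here's a minimal sketch (my illustration, not from the paper) of the nonparametric g-formula with a single binary confounder: take stratum-specific means of Y within levels of Z and A, then standardize over P(Z). All names and the simulated data are mine.

```python
# Minimal sketch: nonparametric g-formula with one binary confounder
# (simulated data for illustration only)
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)
n = 1000
z = rng.binomial(1, 0.5, n)                 # binary confounder
a = rng.binomial(1, 0.3 + 0.4 * z)          # exposure depends on Z
y = rng.normal(1.0 * a + 0.5 * z, 1.0)      # outcome; true ACE = 1.0
df = pd.DataFrame({"Z": z, "A": a, "Y": y})

# E[Y | A, Z] from stratum-specific means, standardized over P(Z)
means = df.groupby(["Z", "A"])["Y"].mean()
pz = df["Z"].value_counts(normalize=True)
ace = sum(pz[zv] * (means[(zv, 1)] - means[(zv, 0)]) for zv in pz.index)
print(f"Nonparametric g-formula ACE: {ace:.3f}")
```

With one binary Z there are only four strata to fill; with, say, 20 binary confounders there are over a million, and most strata will be empty at any realistic sample size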
Instead, models are used to estimate the exposure and outcome (nuisance) functions. While parametric models are great, they are limited in the set of possible nuisance functions they can capture, and they rely on the researcher to correctly specify them (which is hard)
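For concreteness, a sketch of the usual parametric route on the same simulated data: a logistic model for the exposure (propensity score) and a linear model for the outcome. The model forms here are assumptions; if they're wrong, so are the estimates.

```python
# Parametric nuisance models on simulated data (illustration only)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2021)
n = 1000
z = rng.binomial(1, 0.5, n)
a = rng.binomial(1, 0.3 + 0.4 * z)
y = rng.normal(1.0 * a + 0.5 * z, 1.0)
df = pd.DataFrame({"Z": z, "A": a, "Y": y})

# Exposure nuisance model: logistic regression for P(A=1 | Z)
pi_hat = smf.glm("A ~ Z", df, family=sm.families.Binomial()).fit().predict(df)

# Outcome nuisance model: linear regression for E[Y | A, Z],
# predicted under both exposure levels
m = smf.ols("Y ~ A + Z", df).fit()
m1, m0 = m.predict(df.assign(A=1)), m.predict(df.assign(A=0))
print(f"Parametric g-formula ACE: {(m1 - m0).mean():.3f}")
```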
This is where machine learning can help! However, there are two issues when using ML for causal inference that need to be addressed: (1) convergence and (2) complexity
Convergence refers to how fast an estimator's error decreases with sample size. When the nuisance model estimators converge slower than root-n, we can end up with invalid inference
Doubly-robust estimators (AIPW, TMLE) only require the nuisance estimators to converge at quarter-root-n rates (under the assumption that BOTH nuisance models are consistent), since the second-order bias is a product of the two approximation errors: n^(-1/4) × n^(-1/4) = n^(-1/2)
So doubly-robust estimators address our first issue
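Here's a minimal AIPW sketch (again my illustration, not the paper's code) that combines the two nuisance models from above. The influence-function form is what gives the product-of-errors second-order bias.

```python
# Minimal AIPW estimator of the average causal effect (illustration only)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2021)
n = 1000
z = rng.binomial(1, 0.5, n)
a = rng.binomial(1, 0.3 + 0.4 * z)
y = rng.normal(1.0 * a + 0.5 * z, 1.0)
df = pd.DataFrame({"Z": z, "A": a, "Y": y})

# Nuisance models, as before
pi_hat = smf.glm("A ~ Z", df, family=sm.families.Binomial()).fit().predict(df)
m = smf.ols("Y ~ A + Z", df).fit()
m1, m0 = m.predict(df.assign(A=1)), m.predict(df.assign(A=0))

# AIPW estimates of E[Y^1] and E[Y^0] via the efficient influence function
psi1 = df["A"] * (df["Y"] - m1) / pi_hat + m1
psi0 = (1 - df["A"]) * (df["Y"] - m0) / (1 - pi_hat) + m0
ace = (psi1 - psi0).mean()
se = np.std(psi1 - psi0, ddof=1) / np.sqrt(n)     # influence-function SE
print(f"AIPW ACE: {ace:.3f} (SE {se:.3f})")
```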
The 2nd issue is complexity. Restrictions on complexity allow information to be borrowed more effectively across observations. Often, nuisance model estimators are required to fall in a Donsker class
However, restricting to Donsker classes limits the flexibility of the ML we want to take advantage of. This is where cross-fitting comes in: cross-fitting allows us to avoid the restriction to Donsker classes
Single cross-fitting, by estimating the nuisance models in one split and generating predictions in the other, prevents correlation between the nuisance estimators and the data they are applied to
Double cross-fitting further prevents correlation by using different splits for each nuisance model
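Below is a minimal double cross-fit AIPW sketch, with simulated data and random forests as the ML nuisance estimators (my choices for illustration, not the paper's): the data are split in three, each split's predictions come from a propensity model and an outcome model fit on the two *different* remaining splits, and then the roles rotate.

```python
# Double cross-fit AIPW with three splits (illustration under assumptions)
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(2021)
n = 999
Z = rng.normal(size=(n, 5))                         # confounders
A = rng.binomial(1, 1 / (1 + np.exp(-Z[:, 0])))     # exposure depends on Z
Y = rng.normal(1.0 * A + Z[:, 0], 1.0)              # outcome; true ACE = 1.0

splits = np.array_split(rng.permutation(n), 3)
psi = np.zeros(n)
for k in range(3):
    s_pred = splits[k]              # split receiving predictions
    s_pi = splits[(k + 1) % 3]      # split used to fit the propensity model
    s_m = splits[(k + 2) % 3]       # split used to fit the outcome models

    pi_fit = RandomForestClassifier(n_estimators=200, min_samples_leaf=20,
                                    random_state=0).fit(Z[s_pi], A[s_pi])
    # bound propensity scores away from 0/1 for stable weighting
    pi_hat = np.clip(pi_fit.predict_proba(Z[s_pred])[:, 1], 0.025, 0.975)

    m1_fit = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                                   random_state=0).fit(Z[s_m][A[s_m] == 1],
                                                       Y[s_m][A[s_m] == 1])
    m0_fit = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                                   random_state=0).fit(Z[s_m][A[s_m] == 0],
                                                       Y[s_m][A[s_m] == 0])
    m1, m0 = m1_fit.predict(Z[s_pred]), m0_fit.predict(Z[s_pred])

    # AIPW contributions for the held-out split
    psi[s_pred] = (A[s_pred] * (Y[s_pred] - m1) / pi_hat + m1
                   - (1 - A[s_pred]) * (Y[s_pred] - m0) / (1 - pi_hat) - m0)

ace, se = psi.mean(), psi.std(ddof=1) / np.sqrt(n)
print(f"Double cross-fit AIPW ACE: {ace:.3f} (SE {se:.3f})")
```

In practice you'd repeat this over many different random partitions and summarize across them; the pseudo-code in the paper lays out the full procedure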
In summary, using both doubly-robust estimators and cross-fitting allows for the use of ML to estimate causal effects while addressing both of the above hurdles
To demonstrate this, we designed a simulation study
We found that ML with AIPW or TMLE and cross-fitting performed similarly to when the correct parametric model form was used to estimate the nuisance models
When ML was used with AIPW/TMLE *without* cross-fitting, confidence interval coverage was below expected levels
So if you want to use highly flexible ML to estimate a causal effect, you should use a doubly-robust estimator with cross-fitting
Using ML frees up brain space from model specification. Instead you can devote brain power to issues of selection bias, missing data, etc.
In the paper, we also provide pseudo-code to implement the double cross-fitting approach (so go read it!). Code is at
https://github.com/pzivich/publications-code
and finally, I added single and double cross-fit estimators to my Python library https://github.com/pzivich/zEpid