Our (@BreskinEpi's and my) paper on double cross-fitting is now officially published in the current issue of @EpidemiologyLWW
Paper is here https://journals.lww.com/epidem/Citation/2021/05000/Machine_Learning_for_Causal_Inference__On_the_Use.12.aspx
After identification, the challenge is to estimate the quantity of interest (in our case, the average causal effect). However, when the confounder(s) Z are high-dimensional, we can't use the nonparametric g-formula
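To see why, here's a minimal sketch (my illustration, not from the paper) of the nonparametric g-formula with a single binary confounder: take stratum-specific means of Y within levels of Z and A, then standardize over P(Z). All names and the simulated data are mine.

```python
# Minimal sketch: nonparametric g-formula with one binary confounder
# (simulated data for illustration only)
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)
n = 1000
z = rng.binomial(1, 0.5, n)                 # binary confounder
a = rng.binomial(1, 0.3 + 0.4 * z)          # exposure depends on Z
y = rng.normal(1.0 * a + 0.5 * z, 1.0)      # outcome; true ACE = 1.0
df = pd.DataFrame({"Z": z, "A": a, "Y": y})

# E[Y | A, Z] from stratum-specific means, standardized over P(Z)
means = df.groupby(["Z", "A"])["Y"].mean()
pz = df["Z"].value_counts(normalize=True)
ace = sum(pz[zv] * (means[(zv, 1)] - means[(zv, 0)]) for zv in pz.index)
print(f"Nonparametric g-formula ACE: {ace:.3f}")
```

With one binary Z there are only four strata to fill; with, say, 20 binary confounders there are over a million, and most strata will be empty at any realistic sample size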
Instead, models are used to estimate the exposure and outcome (nuisance) functions. While parametric models are great, they are limited in the set of possible nuisance functions they can capture, and they rely on the researcher to correctly specify them (which is hard)
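For concreteness, a sketch of the usual parametric route on the same simulated data: a logistic model for the exposure (propensity score) and a linear model for the outcome. The model forms here are assumptions; if they're wrong, so are the estimates.

```python
# Parametric nuisance models on simulated data (illustration only)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2021)
n = 1000
z = rng.binomial(1, 0.5, n)
a = rng.binomial(1, 0.3 + 0.4 * z)
y = rng.normal(1.0 * a + 0.5 * z, 1.0)
df = pd.DataFrame({"Z": z, "A": a, "Y": y})

# Exposure nuisance model: logistic regression for P(A=1 | Z)
pi_hat = smf.glm("A ~ Z", df, family=sm.families.Binomial()).fit().predict(df)

# Outcome nuisance model: linear regression for E[Y | A, Z],
# predicted under both exposure levels
m = smf.ols("Y ~ A + Z", df).fit()
m1, m0 = m.predict(df.assign(A=1)), m.predict(df.assign(A=0))
print(f"Parametric g-formula ACE: {(m1 - m0).mean():.3f}")
```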
This is where machine learning can help! However, there are two issues when using ML for causal inference that need to be addressed: (1) convergence and (2) complexity
Convergence refers to how fast an estimator's error decreases with sample size. When the nuisance model estimators converge slower than root-n, we can end up with invalid inference
Doubly-robust estimators (AIPW, TMLE) only require the nuisance estimators to converge at quarter-root-n rates (under the assumption that BOTH nuisance models are consistent), since the second-order bias is a product of the two approximation errors: n^(-1/4) × n^(-1/4) = n^(-1/2)
So doubly-robust estimators address our first issue
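Here's a minimal AIPW sketch (again my illustration, not the paper's code) that combines the two nuisance models from above. The influence-function form is what gives the product-of-errors second-order bias.

```python
# Minimal AIPW estimator of the average causal effect (illustration only)
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2021)
n = 1000
z = rng.binomial(1, 0.5, n)
a = rng.binomial(1, 0.3 + 0.4 * z)
y = rng.normal(1.0 * a + 0.5 * z, 1.0)
df = pd.DataFrame({"Z": z, "A": a, "Y": y})

# Nuisance models, as before
pi_hat = smf.glm("A ~ Z", df, family=sm.families.Binomial()).fit().predict(df)
m = smf.ols("Y ~ A + Z", df).fit()
m1, m0 = m.predict(df.assign(A=1)), m.predict(df.assign(A=0))

# AIPW estimates of E[Y^1] and E[Y^0] via the efficient influence function
psi1 = df["A"] * (df["Y"] - m1) / pi_hat + m1
psi0 = (1 - df["A"]) * (df["Y"] - m0) / (1 - pi_hat) + m0
ace = (psi1 - psi0).mean()
se = np.std(psi1 - psi0, ddof=1) / np.sqrt(n)     # influence-function SE
print(f"AIPW ACE: {ace:.3f} (SE {se:.3f})")
```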
The 2nd issue is complexity. Restrictions on complexity allow information to be borrowed more effectively across observations. Often, nuisance model estimators are required to fall in a Donsker class
However, restricting to Donsker classes limits the flexibility of the ML we want to take advantage of. This is where cross-fitting comes in: cross-fitting allows us to avoid the restriction to Donsker classes
Single cross-fitting, by estimating the nuisance models in one split and generating predictions in the other, prevents correlation between the nuisance estimators and the data they are applied to
Double cross-fitting further prevents correlation by using different splits for each nuisance model
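Below is a minimal double cross-fit AIPW sketch, with simulated data and random forests as the ML nuisance estimators (my choices for illustration, not the paper's): the data are split in three, each split's predictions come from a propensity model and an outcome model fit on the two *different* remaining splits, and then the roles rotate.

```python
# Double cross-fit AIPW with three splits (illustration under assumptions)
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(2021)
n = 999
Z = rng.normal(size=(n, 5))                         # confounders
A = rng.binomial(1, 1 / (1 + np.exp(-Z[:, 0])))     # exposure depends on Z
Y = rng.normal(1.0 * A + Z[:, 0], 1.0)              # outcome; true ACE = 1.0

splits = np.array_split(rng.permutation(n), 3)
psi = np.zeros(n)
for k in range(3):
    s_pred = splits[k]              # split receiving predictions
    s_pi = splits[(k + 1) % 3]      # split used to fit the propensity model
    s_m = splits[(k + 2) % 3]       # split used to fit the outcome models

    pi_fit = RandomForestClassifier(n_estimators=200, min_samples_leaf=20,
                                    random_state=0).fit(Z[s_pi], A[s_pi])
    # bound propensity scores away from 0/1 for stable weighting
    pi_hat = np.clip(pi_fit.predict_proba(Z[s_pred])[:, 1], 0.025, 0.975)

    m1_fit = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                                   random_state=0).fit(Z[s_m][A[s_m] == 1],
                                                       Y[s_m][A[s_m] == 1])
    m0_fit = RandomForestRegressor(n_estimators=200, min_samples_leaf=20,
                                   random_state=0).fit(Z[s_m][A[s_m] == 0],
                                                       Y[s_m][A[s_m] == 0])
    m1, m0 = m1_fit.predict(Z[s_pred]), m0_fit.predict(Z[s_pred])

    # AIPW contributions for the held-out split
    psi[s_pred] = (A[s_pred] * (Y[s_pred] - m1) / pi_hat + m1
                   - (1 - A[s_pred]) * (Y[s_pred] - m0) / (1 - pi_hat) - m0)

ace, se = psi.mean(), psi.std(ddof=1) / np.sqrt(n)
print(f"Double cross-fit AIPW ACE: {ace:.3f} (SE {se:.3f})")
```

In practice you'd repeat this over many different random partitions and summarize across them; the pseudo-code in the paper lays out the full procedure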
In summary, using both doubly-robust estimators and cross-fitting allows for the use of ML to estimate causal effects while addressing both of the above hurdles
To demonstrate this, we designed a simulation study
We found that ML with AIPW or TMLE and cross-fitting performed similarly to when the correct parametric model form was used to estimate the nuisance models
When ML was used with AIPW/TMLE *without* cross-fitting, confidence interval coverage was below expected levels
So if you want to use highly flexible ML to estimate a causal effect, you should use a doubly-robust estimator with cross-fitting
Using ML frees up brain space from model specification. Instead you can devote brain power to issues of selection bias, missing data, etc.
In the paper, we also provide pseudo-code to implement the double cross-fitting approach (so go read it!). Code is at
https://github.com/pzivich/publications-code
and finally, I added single and double cross-fit estimators to my Python library https://github.com/pzivich/zEpid