1. For the last few years we've been hearing a lot about *evidence-based teaching* practices.

In principle, I'm a big fan of using evidence to improve my teaching.

In practice, the majority of the evidence we are presented with is deeply problematic.
2. Studying classroom interventions and other teaching practices is basically a form of social psychology research—but that field has been rocked by a reproducibility crisis that is seldom acknowledged in work on evidence-based teaching.
3. The reproducibility crisis has been attributed to numerous causes: publication bias, p-hacking, HARKing, garden-of-forking-paths analysis, maybe a bit of outcome switching, and so forth.

But if anything, an even bigger issue is that of *generalizability*.
4. Generalizability seems particularly important in education research.

To what degree will the effects of your methods, used by you in your course with your students at your institution, generalize to me in my course with my students at my institution?
5. And to what degree will the results depend on the assessment methods? A recent paper by @lizbarnes222, @brownell_sara, and colleagues looks at this question for evolution acceptance across three student populations with two different instruments.

https://evolution-outreach.biomedcentral.com/articles/10.1186/s12052-019-0096-z
6. It seems to me that if we are going to get anything out of efforts in education research, we need to start taking reproducibility very seriously—and generalizability even more so.

But how do we do that? Is there a way to think about these issues in a rigorous formal context?
7. I spent the afternoon reading a new working paper from @talyarkoni on exactly this issue. His applications are from psychology, but I *beg* anyone interested in education research to read it closely, esp. the first 12pp.

https://psyarxiv.com/jqw35 

Tal doesn't pull his punches.
8. One might object that demanding generalizability sets the bar unreachably high.

I disagree. First of all, not all education research has to be generalizable. It is fine, IMO, to document what works for you in your classroom, and suggest others explore it.
9. But when we get to evidence-based teaching (EBT) practice, we move into the realm of prescription.

We are exhorted to use EBT methods. When we conduct peer evaluations of our colleagues, EBT use is often an explicit criterion. When we hire, we look for experience with EBT.
10. Once EBT becomes prescriptive, there is an absolute obligation for generalizability. It is the height of folly to prescribe something that won't work for the person you prescribe it to in the time and place you prescribe it.

Imagine if evidence-based medicine did this.
11. The sad truth is we seldom have cause for confidence that EBT prescriptions will generalize. Many EBT researchers use careful statistical methods to control for differences across years, class sections, etc.—but draw conclusions that implicitly assume no other differences.
12. No differences across institutions. None across student populations. None across course content. Often none across instructors. And so on. Unless we believe that the variation resulting from these differences is small compared to the factors we do include in our analyses...
13. ...we must concede that our precision, significance testing, and confidence limits are entirely illusory when used to justify prescribing our practices to others in other situations.

@talyarkoni makes a powerful argument for this in his paper.
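To make the point concrete, here is a minimal simulation sketch. It is not from Tal's paper; the two-instructors-per-arm setup and every parameter value are illustrative assumptions. When student scores vary both within sections and between instructors, analyzing the students as if they were independent draws ignores the instructor-level variation, and a nominal 5% test fires far more often than 5% even when the intervention does nothing.

```python
# Illustrative simulation: ignoring instructor-level variation makes
# significance tests wildly overconfident. All parameter values are
# made up for this sketch.
import numpy as np

rng = np.random.default_rng(0)
n_students = 30         # students per course section
sigma_student = 1.0     # student-to-student noise within a section
sigma_instructor = 1.0  # instructor-level "bump" shared by a whole section
true_effect = 0.0       # the intervention does nothing in this simulation

false_positives = 0
n_sims = 2000
for _ in range(n_sims):
    # Two treated sections and two control sections, each with its own
    # instructor-level bump shared by every student in that section.
    bumps = rng.normal(0.0, sigma_instructor, size=4)
    treated = np.concatenate(
        [rng.normal(true_effect + b, sigma_student, n_students) for b in bumps[:2]])
    control = np.concatenate(
        [rng.normal(b, sigma_student, n_students) for b in bumps[2:]])

    diff = treated.mean() - control.mean()
    # Naive standard error that treats all students as independent draws,
    # as if the only variation were at the student level.
    se = np.sqrt(treated.var(ddof=1) / treated.size
                 + control.var(ddof=1) / control.size)
    if abs(diff) > 1.96 * se:
        false_positives += 1

print(f"Nominal 5% test rejected in {false_positives / n_sims:.0%} of runs")
```

With only two instructors per arm, the instructor-level variance dominates the comparison, so the naive interval is far too narrow and "significant" results are mostly instructor luck. Modeling the instructor (or institution) as a random effect would yield an honest, much wider interval, which is the unmodeled-variance point above in miniature.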
14. The problem with Twitter threads is that if they get too long, you feel a need to sum them up. I hate that.

But to sum up anyway: any good educational initiative needs an acronym. I'd like to advocate for RIGOR: Research on Instruction that is Generalizable, Open, and Reproducible.