Here's a bunch of @Stata tips for handling large datasets (millions of observations).

I wish I had known them when I was starting out with admin data analysis...
1) Memory is often an issue, so store your data efficiently!

- use the 'compress' command to recast your variables into the smallest suitable data types

- declare the correct data types when generating variables
('gen byte var1 = 1')
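A minimal sketch (the filename and variable names are hypothetical):

use mydata.dta, clear
compress                // recasts every variable to the smallest suitable type
gen byte treated = 1    // 1 byte per observation instead of the 4 bytes of the default float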
2) Merging does not need to take forever!

- be aware that the 'merge' command sorts the 'master' and 'using' data on the matching variables

- you can save a lot of time by running 'merge' on datasets that are already sorted!
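A hedged sketch (the filenames and the key 'persid' are hypothetical):

use persons.dta, clear
sort persid                 // pre-sort once on the matching variable
save persons.dta, replace
use spells.dta, clear
sort persid
merge m:1 persid using persons.dta   // both datasets already sorted, so merge's sorting step is cheap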
3) The 'joinby' command does what 'merge m:m' should have been doing all along

- that is, it forms all pairwise combinations of 'master' and 'using' observations within matching groups
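For example (hypothetical family-level key 'famid' and filenames):

use parents.dta, clear
joinby famid using kids.dta   // one row per parent-kid pair within each famid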
4) If you're working with spell-level data, learn to use Stata's native date & time functions

- these will allow you to store the time data efficiently and avoid transformation/approximation errors
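A small sketch, assuming spell dates arrive as strings in year-month-day order (variable names are hypothetical):

gen int start = daily(start_str, "YMD")   // daily dates fit in an int, not a string
format start %td                          // display as human-readable dates
gen spell_len = end - start + 1           // durations become simple arithmetic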
5) Factor variables can help you overcome memory constraints in regressions
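For example, i.-notation builds the indicator variables on the fly inside the estimation command, so you never have to hold hundreds of dummy columns in memory:

reg y x i.year i.region   // no need to tabulate year/region into dummies first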
6) When estimating regression models with 'if' conditions, it is often faster to drop the irrelevant observations first:
preserve
keep if var1 ==1
reg y x, robust
restore
I'll add more when I think of something relevant.
Feel free to chime in too 🙏
You can follow @JanKabatek.