I'd say the most important lesson I've learned working in industry is that applied data science depends almost entirely on logging.

All of the most interesting problems are modelling, ML, and causal inference, but you don't get to any of that stuff without data.
As much as folks have proclaimed the 21st century as the era of "big data", the reality is that there's a lot of stuff that *could be* data but isn't. There is stuff happening but that stuff isn't automatically data. That's the case in industry, government, everywhere.
Data engineering turns what information we do have into something usable. That's what a lot of methodological work in academia is about: turning newspapers into data, turning voting records into data, turning survey responses into data, turning chemical reactions into data, etc.
But in industry, you can't do data engineering until you have logging. Logging is what turns activity into raw data: the unprocessed stuff that allows you to eventually do something useful.

Academic projects tend to assume some crappy raw data & then engineer it into something.
But in tech, you don't even have crappy raw data until you have logging. Did anyone click this thing? Did anyone open this?
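
To make "logging turns activity into raw data" concrete, here's a minimal sketch in Python. Everything in it is hypothetical: the `log_event` and `count_actions` helpers, the field names, and the in-memory buffer standing in for a real log file or event pipeline. It only illustrates that "did anyone click this thing?" can't be answered until something writes the click down.

```python
import io
import json
from datetime import datetime, timezone


def log_event(sink, user_id, action, target):
    """Append one activity record to `sink` as a JSON line and return it.

    Hypothetical helper: the field names are illustrative assumptions.
    """
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "action": action,   # e.g. "click" or "open"
        "target": target,   # the thing that was clicked or opened
    }
    sink.write(json.dumps(record) + "\n")
    return record


def count_actions(log_text, action):
    """Count logged events of one kind: the simplest question raw logs answer."""
    return sum(
        1
        for line in log_text.splitlines()
        if json.loads(line)["action"] == action
    )


# An in-memory buffer stands in for a real log file or event pipeline.
log = io.StringIO()
log_event(log, "u1", "click", "signup_button")
log_event(log, "u2", "open", "welcome_email")
log_event(log, "u1", "click", "signup_button")

clicks = count_actions(log.getvalue(), "click")
print(clicks)
```

Without the `log_event` calls the activity still happened, but there's nothing for `count_actions` (or any model downstream of it) to work with.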

We can't get to "why?", "can we predict that?", or "can we improve that?" without someone else - typically not a data scientist - doing the work to log it.
Which then gets us back to academia. A common problem is that academics simply don't have the data they need. Maybe they can do some data engineering on crappy raw data but often they just don't have anything to start from.

And that suggests a lesson to be learned from industry:
Part of being a successful data scientist is having the skills to make a case for why we need particular logging. And not because the logging is interesting per se but because it will enable - through lots of other work - better, more useful insights into fundamental problems.
Academia hardly rewards data engineering work so it's no surprise it doesn't reward advocacy for logging. But maybe that's actually the most important thing a quantitatively-minded academic can do: advocate for more logging in the places that are central to their field of study.
You can follow @thosjleeper.