I'd say the most important lesson I've learned working in industry is that applied data science depends almost entirely on logging.
All of the most interesting problems are modelling, ML, and causal inference, but you don't get to any of that stuff without data.
As much as folks have proclaimed the 21st century as the era of "big data", the reality is that there's a lot of stuff that *could be* data but isn't. There is stuff happening but that stuff isn't automatically data. That's the case in industry, government, everywhere.
Data engineering turns what information we do have into something usable. That's what a lot of methodological work in academia is about: turning newspapers into data, turning voting records into data, turning survey responses into data, turning chemical reactions into data, etc.
But in industry, you can't do data engineering until you have logging. Logging is what turns activity into raw data: the unprocessed stuff that allows you to eventually do something useful.
Academic projects tend to assume some crappy raw data & then engineer it into something.
But in tech, you don't even have crappy raw data until you have logging. Did anyone click this thing? Did anyone open this?
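To make that concrete, here's a minimal sketch of what "logging" might mean at its simplest: writing one structured record every time something happens. Everything in it is made up for illustration (the `log_event` and `count_events` helpers, the field names, the `signup_button` and `welcome_email` targets), and a real system would write to a file or an event pipeline rather than a Python list.

```python
import json
import time
import uuid


def log_event(log, user_id, event_type, target):
    """Append one structured event record (as a JSON line) to `log`."""
    record = {
        "event_id": str(uuid.uuid4()),   # hypothetical field names throughout
        "timestamp": time.time(),
        "user_id": user_id,
        "event_type": event_type,        # e.g. "click" or "open"
        "target": target,
    }
    log.append(json.dumps(record))
    return record


def count_events(log, event_type, target):
    """Answer "did anyone click this thing?" from the raw log lines."""
    count = 0
    for line in log:
        record = json.loads(line)
        if record["event_type"] == event_type and record["target"] == target:
            count += 1
    return count


# An in-memory list stands in for wherever the raw events would really go.
events = []
log_event(events, "u1", "click", "signup_button")
log_event(events, "u2", "open", "welcome_email")
log_event(events, "u3", "click", "signup_button")

clicks = count_events(events, "click", "signup_button")
```

Without the `log_event` calls there is nothing for `count_events` to read. The activity still happened, but it never became data.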
We can't get to "why?", "can we predict that?", or "can we improve that?" without someone else - typically not a data scientist - doing the work to log it.
Which then gets us back to academia. A common problem is that academics simply don't have the data they need. Maybe they can do some data engineering on crappy raw data but often they just don't have anything to start from.
And that suggests a lesson to be learned from industry:
Part of being a successful data scientist is having the skills to make a case for why we need particular logging. And not because the logging is interesting per se but because it will enable - through lots of other work - better, more useful insights into fundamental problems.