Thread by @EmilyGorcenski, It’s easier to collect new data sets than it is to process [...]

Emily G (not a newspaper)

EmilyGorcenski

It’s easier to collect new data sets than it is to process existing data sets. Data volume grows linearly modulo your usage patterns. Data sets, however, often grow superlinearly.

If you have say, quadratic user growth, your usage and therefore your data volume will grow in proportion to that quadratic growth. But if you’re adding new data, this has a further increase above and beyond that.

Teams, however, cannot reliably sustain superlinear growth. For a short term, yes, but pretty soon a team’s internal communication structure essentially becomes a matter of combinatorial explosion.

This essentially means that any centralized data processing functions will eventually stop being able to keep up with the data demands put on them, and adding new people to the team won’t help

The only way to manage this is to decentralize the data handling to migrate away from “water” metaphors and instead operate in a producer-consumer mindset, where data is considered as a product or a service owned by a team, backed by SLOs

Your data lake, which you built to speed you up, will eventually become your biggest blocker

You can follow @EmilyGorcenski.

Tip: mention @twtextapp on a Twitter thread with the keyword “unroll” to get a link to it.

Latest Threads Unrolled: