Awesome, @PrivacyPros published the second companion piece about trying to keep/delete your data appropriately (aka data retention) in a distributed system, so let me give you the rundown.

First off, this is harder than it looks.

https://iapp.org/news/a/data-retention-in-a-distributed-system/
https://iapp.org/news/a/setting-data-retention-timelines/
1. There can be only one primary data source for every piece of data. Know where it is.

With copies and processed versions of data scattered all over, you need one true home for each piece. And you need to know where it is, even as the system changes and people touch it.
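To make "know where it is" concrete, here's a minimal sketch (all store names hypothetical) of a registry that records the single primary home for each kind of data, so deletion and audit jobs never have to guess:

```python
# A minimal sketch, with hypothetical store names: one registry recording
# the single primary store for each kind of data.

PRIMARY_STORES = {
    "user_profile": "profiles-db",
    "user_photos": "photo-blobstore",
    "billing_records": "billing-db",
}

def primary_store_for(data_kind: str) -> str:
    """Look up the one true home for a kind of data; fail loudly if nobody
    has registered one, rather than letting callers guess."""
    if data_kind not in PRIMARY_STORES:
        raise KeyError(f"no primary store registered for {data_kind!r}")
    return PRIMARY_STORES[data_kind]
```

The point is less the dict than the discipline: anything that writes data registers its primary home, and everything else treats that registry as the source of truth.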
2. Every other copy of a piece of data is a secondary datastore. Know where and how stale they are.

There are lots of these: caches, intermediate pipeline results, and other kinds of derived data. All of them are stale compared to the primary datastore; take this into account.
Figuring out how stale a particular secondary datastore is can be complicated. For example, what if your ML model is computed using copies of two different data stores? And then results from it are stored in the logs? And then you compute your model from those logs?
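One way to keep a handle on this is to compute worst-case staleness explicitly: a derived store can't be fresher than its stalest input, plus however long its own pipeline takes to rerun. A rough sketch with made-up numbers:

```python
# Made-up numbers, hypothetical pipelines: staleness compounds hop by hop.

from datetime import timedelta

def derived_staleness(input_stalenesses, pipeline_delay):
    """Worst case: a derived store is as stale as its stalest input,
    plus however long its own pipeline takes to rerun."""
    return max(input_stalenesses) + pipeline_delay

# Two copied datastores feed an ML model...
copy_a = timedelta(hours=1)  # replication lag of the copy of store A
copy_b = timedelta(days=2)   # store B is batch-copied every two days
model = derived_staleness([copy_a, copy_b], timedelta(days=1))

# ...the model's results land in logs, and a model is then recomputed
# from those logs. Each hop inherits all the staleness before it.
logs = derived_staleness([model], timedelta(hours=6))
next_model = derived_staleness([logs], timedelta(days=1))
print(next_model)  # 4 days, 6:00:00
```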
3. Data items to be deleted have to be removed from both the primary and secondary datastores within the allowed window.

I know it sounds obvious but a lot of y'all aren't doing this. This may require doing all sorts of things like flushing caches.
Data deletion also has to be done reliably. This means that, for example, a publish-subscribe feed isn't good enough. You need to be able to re-run the deletion pipelines when things go wrong. Because they will.
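Here's a minimal sketch of what "re-runnable" can look like, assuming hypothetical datastore and request objects: requests live in durable storage, every pass sweeps every store, deletes are no-ops for already-deleted items, and a request is only retired once every store succeeds. Re-running after a failure is then always safe.

```python
# A sketch only: `store.delete` and `request.mark_done` are hypothetical
# APIs standing in for your real datastores and a durable request queue.

def run_deletion_pass(pending_requests, all_datastores):
    """Apply every pending deletion to every datastore. Idempotent: safe
    to re-run the whole pass after a partial failure."""
    for request in pending_requests:
        fully_deleted = True
        for store in all_datastores:
            try:
                # Deleting something already gone must be a no-op.
                store.delete(request.user_id, request.data_kind)
            except Exception:
                fully_deleted = False  # leave pending; the next pass retries
        if fully_deleted:
            request.mark_done()  # retire only once every store succeeded
```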
4. If you're not monitoring, you don't know when things go wrong. And they will.

It's a computer system. Seriously, monitor it. Don't just monitor that the deletion system is running, monitor whether the stuff that's supposed to be deleted is deleted.
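A sketch of what monitoring the outcome (not just the job) might look like, again with hypothetical store APIs: independently scan or sample the datastores for anything past its deletion deadline, and alarm on any hit.

```python
# A sketch with hypothetical store APIs: verify deletion end to end,
# independently of whether the deletion pipeline claims success.

from datetime import datetime, timezone

def find_overdue_data(datastores, deletion_deadlines):
    """Return (store name, key) pairs that are past their deletion deadline."""
    now = datetime.now(timezone.utc)
    overdue = []
    for store in datastores:
        for item in store.scan():  # in practice, sample; full scans are expensive
            deadline = deletion_deadlines.get(item.key)
            if deadline is not None and deadline < now:
                overdue.append((store.name, item.key))
    return overdue

# Page someone on any hit: overdue data means deletion is broken somewhere,
# even if the pipeline's own dashboards look green.
```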
There's a lot more detail in the article: https://iapp.org/news/a/data-retention-in-a-distributed-system/

But all of that presupposes that you know how long you should keep data. What about that?
First off, I'm not your lawyer. I'm not anybody's lawyer. But what I am is a privacy engineer and I've seen ... a lot of things data-wise. *stares off into distance*

There are good reasons, from a systems perspective, to keep data, and reasonable lengths of time to keep it.
1. Your user asked you to keep the data.

It's impolite to do things like delete their baby photos when they didn't ask you to. Users have many feelings about that. Enough said.
2. You need the data for security/privacy/anti-abuse.

Please, dear goodness, don't trash the data you need here. Hackers are less privacy-respectful than you are.

For this, segregate this data from your other datastores so you don't accidentally use it for "normal" purposes.
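As a hedged sketch of what that segregation can look like (names invented): put security-retained data in its own store and make every read declare a purpose, refusing anything that isn't on the security/anti-abuse list.

```python
# A sketch, names invented: a purpose check on reads keeps security-retained
# data from quietly flowing into "normal" product features.

ALLOWED_PURPOSES = {"security", "anti-abuse", "incident-response"}

class SecurityRetainedStore:
    def __init__(self):
        self._records = {}

    def read(self, key, purpose: str):
        if purpose not in ALLOWED_PURPOSES:
            raise PermissionError(
                f"purpose {purpose!r} may not read security-retained data")
        return self._records.get(key)
```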
3. Data you need to run the system.

Debugging is necessary. Without debugging logs, you are not going to have a functioning system. Load-testing is necessary. Planning for traffic changes is necessary. You need data to do all of these things.
4. (IF YOU HAVE PERMISSION) Data used to improve the system.

There's a lot that goes into this bucket, like training your spell-checker model; so much, in fact, that I'm not even going to try to get into it in this thread.
And a reminder: if data is anonymized (and I mean *anonymized*, not just de-identified by filing off the IDs and calling it good), it's not user data any more. That can fill some of these gaps, like planning for traffic growth.
OK, so there are all these reasons to keep data for systems reasons, but how long should you keep it? Here are some heuristics.
Five weeks is a good default timeframe for data kept for analysis, system maintenance, and debugging. A lot of the analyses listed above need to be able to compare what is happening today to what happened a month ago.
A more limited set of analyses need data over a longer timeframe. For those, I would default to 13 months. Thirteen months gives enough data for a year-over-year analysis, with enough wiggle room to work around holidays and system failures.
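For the curious, the back-of-the-envelope arithmetic behind those two defaults (a sketch, not policy advice):

```python
# A sketch of the arithmetic behind the five-week and 13-month defaults.
from datetime import timedelta

month_over_month = timedelta(weeks=4) + timedelta(weeks=1)   # comparison window + slack
year_over_year   = timedelta(days=365) + timedelta(days=30)  # a year + ~a month of wiggle room

print(month_over_month)  # 35 days -> the five-week default
print(year_over_year)    # 395 days -> roughly 13 months
```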
Most systems don't need data for 13 months! Some systems are *extremely* year-oriented, which might make it necessary, but this is mostly for security and anti-abuse purposes -- things like ML models to distinguish attacking traffic and examples of that traffic.
Why do security and anti-abuse need longer data retention? Because the attackers keep their data, too. Attackers will specifically take advantage of limited data retention by recycling old attacks periodically.
A huge factor in setting data retention timelines is how long your system takes to actually delete data completely, which has to be added to the retention windows chosen above. It's almost never immediate and, in a complex system, can take a while.
Make sure you're taking everything into account: backups, datacenter maintenance (both planned and surprise), how long it takes to rebuild your ML models, how long your distributed storage system takes to delete (it's not instant!), etc.
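To make that concrete, a back-of-the-envelope sketch with invented numbers, showing the gap between the retention window you pick and when data is actually, completely gone:

```python
# Illustrative numbers only. The promise you can make to users is the
# retention window *plus* however long deletion takes to propagate.

from datetime import timedelta

retention_window  = timedelta(weeks=5)   # chosen window for debug logs
deletion_pipeline = timedelta(days=3)    # deletion jobs sweep all stores
ml_model_rebuild  = timedelta(days=7)    # retrain models without the deleted data
backup_expiry     = timedelta(days=14)   # oldest backup holding the data ages out

# max() assumes these happen in parallel once the window ends; if they're
# sequential in your system, sum them instead.
total = retention_window + max(deletion_pipeline, ml_model_rebuild, backup_expiry)
print(total)  # 49 days, 0:00:00 -- two weeks longer than the 35-day window
```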
More details here: https://iapp.org/news/a/setting-data-retention-timelines/

This is a huge topic and this thread just scratches the surface. But at the very least: please remember that doing this right is harder than it looks.