"Wanting it badly is not enough" could be the title of a postmortem on the century& #39;s tech-policy battles. Think of the crypto wars: yeah, it would be super cool if we had ciphers that worked perfectly except when "bad guys" used them, but that& #39;s not ever going to happen.

1/
Another area is anonymization of large data-sets. There are undeniably cool implications for a system that allows us to gather and analyze lots of data on how people interact with each other and their environments without compromising their privacy.

2/
But "cool" isn't the same as "possible" because wanting it badly is not enough. In the mid-2010s, privacy legislation started to gain real momentum, and privacy regulators found themselves called upon to craft compromises to pass important new privacy laws.

3/
Those compromises took the form of "anonymized data" carve-outs, leading to the passage of laws like the #GDPR, which strictly regulated processing "personally identifying information" but was a virtual free-for-all for "de-identified" data that had been "anonymized."

4/
There was just one teensy problem with this compromise: de-identifying data is REALLY hard, and it only gets harder over time. Say the NHS releases prescribing data: date, doctor, prescription, and a random identifier. That's a super-useful data-set for medical research.

5/
And say the next year, Addison-Lee or another large minicab company suffers a breach (no human language contains the phrase "as secure as minicab IT") that exposes many of the patients' journeys to the appointments where those prescriptions were written.

6/
Merge those two data-sets and you re-identify many of the patients in the data. Subsequent releases and breaches compound the problem, and there's nothing the NHS can do to either predict or prevent a breach by a minicab company.
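
A minimal sketch of that merge, using entirely made-up records. The field names, the rider names, and the assumption that a cab drop-off can be matched to a doctor are all illustrative; nothing here comes from a real NHS release or a real breach.

```python
from datetime import date

# Hypothetical "anonymized" prescribing release: date, doctor, prescription,
# and a random identifier. No names anywhere.
prescriptions = [
    {"patient_id": "a91f", "date": date(2021, 3, 1), "doctor": "Dr Q", "drug": "drug A"},
    {"patient_id": "c07b", "date": date(2021, 3, 1), "doctor": "Dr Q", "drug": "drug B"},
    {"patient_id": "a91f", "date": date(2021, 3, 15), "doctor": "Dr Q", "drug": "drug A"},
]

# Hypothetical breached minicab journeys: names attached. Assumes the drop-off
# address has already been mapped to the doctor practising there.
journeys = [
    {"rider": "Alex Example", "date": date(2021, 3, 1), "dropoff": "Dr Q"},
    {"rider": "Alex Example", "date": date(2021, 3, 15), "dropoff": "Dr Q"},
    {"rider": "Sam Sample", "date": date(2021, 3, 1), "dropoff": "Dr Q"},
]


def link(prescriptions, journeys):
    """For each random ID, list the riders whose journeys cover every visit."""
    visits = {}
    for p in prescriptions:
        visits.setdefault(p["patient_id"], set()).add((p["date"], p["doctor"]))

    trips = {}
    for j in journeys:
        trips.setdefault(j["rider"], set()).add((j["date"], j["dropoff"]))

    # A rider is a candidate for an ID if the ID's visits are a subset of
    # that rider's trips.
    return {
        pid: sorted(rider for rider, t in trips.items() if seen <= t)
        for pid, seen in visits.items()
    }


candidates = link(prescriptions, journeys)
for pid, riders in candidates.items():
    print(pid, riders)
```

In this toy data, the ID with two visits narrows to a single rider, while the ID with one visit still has two candidates. Each additional release or breach adds more rows to join on, which is why the problem compounds.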

7/
Even if the NHS is confident in its anonymization, it can never be confident in the sturdiness of that anonymity over time.

Worse: the NHS really CAN'T be confident in its anonymization. Time and again, academics have shown that anonymized data can be re-identified from the start.

8/
Re-identification attacks are subtle, varied, and very, very hard to defend against:

https://www.cs.princeton.edu/~arvindn/publications/precautionary.pdf

Worse, they're highly automatable:

https://www.nature.com/articles/s41467-019-10933-3

And it's true in practice as well as in theory:

https://www.nytimes.com/interactive/2019/12/19/opinion/location-tracking-cell-phone.html

9/
When this was pointed out to the (admittedly hard-working and torn) privacy regulators, they largely shrugged their shoulders and expressed a groundless faith that somehow this would be fixed in the future. Privacy should not be a faith-based initiative.

https://memex.craphound.com/2014/07/09/big-data-should-not-be-a-faith-based-initiative/

10/
Today, we continue to see the planned releases of large datasets with assurances that they have been anonymized. It's common for terms of service to include your "consent" to have your data shared once it has been de-identified. This is a meaningless proposition.

11/
To show just how easy re-identification can be, researchers at Imperial College and the Université catholique de Louvain have released The Observatory of Anonymity, a web-app that shows you how easily you can be identified in a data-set.

https://cpg.doc.ic.ac.uk/observatory/ 

12/
Feed the app your country and region, birthdate, gender, employment and education status and it tells you how many people share those characteristics. For example, my identifiers boil down to a 1-in-3 chance of being identified.
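
The arithmetic behind a figure like "1-in-3" can be sketched by simple counting. This is NOT the Observatory's actual method (its internals aren't described here); it's a naive illustration over a made-up population, and every record and attribute value below is hypothetical.

```python
from collections import Counter

# Made-up records: (region, birth_year, gender, employment).
population = [
    ("London", 1971, "M", "self-employed"),
    ("London", 1971, "M", "self-employed"),
    ("London", 1971, "M", "self-employed"),
    ("London", 1971, "F", "employed"),
    ("Leeds", 1985, "F", "student"),
]


def anonymity_set_sizes(records):
    """Count how many people share each exact combination of attributes."""
    return Counter(records)


sizes = anonymity_set_sizes(population)

me = ("London", 1971, "M", "self-employed")
k = sizes[me]    # people indistinguishable from "me" on these attributes
chance = 1 / k   # odds that a matching record is really mine

# Anyone whose combination appears exactly once is uniquely identifiable.
unique = [record for record, n in sizes.items() if n == 1]

print(f"{k} people share these attributes: a 1-in-{k} chance")
print(f"{len(unique)} people are unique on these attributes alone")
```

Every extra attribute in a release splits these groups further, so the counts only shrink.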

13/
(Don't worry: all these calculations are done in your browser and the Observatory doesn't send any of your data to a server)

If anything, The Observatory is generous to anonymization proponents. "Anonymized" data often include identifiers like the first half of a post-code.

14/
ETA - If you'd like an unrolled version of this thread to read or share, here's a link to it on pluralistic.net, my surveillance-free, ad-free, tracker-free blog:

https://pluralistic.net/2021/04/21/re-identification/#pseudonymity