"Wanting it badly is not enough" could be the title of a postmortem on the century& #39;s tech-policy battles. Think of the crypto wars: yeah, it would be super cool if we had ciphers that worked perfectly except when "bad guys" used them, but that& #39;s not ever going to happen.
1/
Another area is anonymization of large data-sets. There are undeniably cool implications for a system that allows us to gather and analyze lots of data on how people interact with each other and their environments without compromising their privacy.
2/
But "cool" isn't the same as "possible" because wanting it badly is not enough. In the mid-2010s, privacy legislation started to gain real momentum, and privacy regulators found themselves called upon to craft compromises to pass important new privacy laws.
3/
Those compromises took the form of "anonymized data" carve-outs, leading to the passage of laws like the #GDPR, which strictly regulated processing "personally identifying information" but was a virtual free-for-all for "de-identified" data that had been "anonymized."
4/
There was just one teensy problem with this compromise: de-identifying data is REALLY hard, and it only gets harder over time. Say the NHS releases prescribing data: date, doctor, prescription, and a random identifier. That's a super-useful data-set for medical research.
5/
And say the next year, Addison-Lee or another large minicab company suffers a breach (no human language contains the phrase "as secure as minicab IT") that contains many of the patients' journeys that resulted in that prescription-writing.
6/
Merge those two data-sets and you re-identify many of the patients in the data. Subsequent releases and breaches compound the problem, and there's nothing the NHS can do to either predict or prevent a breach by a minicab company.
7/
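Here's a toy sketch of that linkage attack in Python/pandas. Everything in it is invented for illustration (the field names, the records, the match keys); real attacks work on messier data with fuzzier matching, but the shape is the same: join on whatever the two releases share.

```python
# Hypothetical illustration of a linkage attack: neither data-set names the
# patient, but joining them on the fields they share does the job.
import pandas as pd

# "Anonymized" prescribing release: no names, just a random identifier.
prescriptions = pd.DataFrame([
    {"patient_id": "a91f", "date": "2021-03-02", "clinic": "Camden GP", "drug": "antiretroviral"},
    {"patient_id": "c44d", "date": "2021-03-02", "clinic": "Hackney GP", "drug": "statin"},
])

# Breached minicab records: names and journeys, no medical data at all.
journeys = pd.DataFrame([
    {"passenger": "Alice Example", "date": "2021-03-02", "dropoff": "Camden GP"},
    {"passenger": "Bob Example",   "date": "2021-03-02", "dropoff": "Hackney GP"},
])

# Join on the overlap (same day, same clinic) and the "anonymous" patient
# identifier is now attached to a real name and a prescription.
reidentified = prescriptions.merge(
    journeys, left_on=["date", "clinic"], right_on=["date", "dropoff"]
)
print(reidentified[["passenger", "drug"]])
```

The NHS can vet its own release as carefully as it likes; it can't vet every data-set that will ever leak and happen to share a join key with it.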
Even if the NHS is confident in its anonymization, it can never be confident in the sturdiness of that anonymity over time.
Worse: the NHS really CAN'T be confident in its anonymization. Time and again, academics have shown that "anonymized" data wasn't really anonymous from the start.
8/
Re-identification attacks are subtle, varied, and very, very hard to defend against:
https://www.cs.princeton.edu/~arvindn/publications/precautionary.pdf
Worse, they're highly automatable:
https://www.nature.com/articles/s41467-019-10933-3
And it's true in practice as well as in theory:
https://www.nytimes.com/interactive/2019/12/19/opinion/location-tracking-cell-phone.html
9/
When this was pointed out to the (admittedly hard-working and torn) privacy regulators, they largely shrugged their shoulders and expressed a groundless faith that somehow this would be fixed in the future. Privacy should not be a faith-based initiative.
https://memex.craphound.com/2014/07/09/big-data-should-not-be-a-faith-based-initiative/
10/
Today, we continue to see the planned releases of large datasets with assurances that they have been anonymized. It's common for terms of service to include your "consent" to have your data shared once it has been de-identified. This is a meaningless proposition.
11/
To show just how easy re-identification can be, researchers at Imperial College and the Université catholique de Louvain have released The Observatory of Anonymity, a web-app that shows you how easily you can be identified in a data-set.
https://cpg.doc.ic.ac.uk/observatory/
12/
Feed the app your country and region, birthdate, gender, employment and education status and it tells you how many people share those characteristics. For example, my identifiers boil down to a 1-in-3 chance of being identified.
13/
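Under the hood, the question the Observatory answers is simple to state: how big is the crowd that matches you on every attribute? Here's a minimal Python sketch with a made-up, four-person "population" (the real tool estimates these counts with generative models rather than looking them up in a table):

```python
# Toy version of the Observatory's question: how many people share a given
# combination of quasi-identifiers? (Invented data; the real tool estimates
# this statistically rather than from an exhaustive population table.)
population = [
    {"region": "London", "birth_year": 1985, "gender": "M", "education": "degree"},
    {"region": "London", "birth_year": 1985, "gender": "M", "education": "degree"},
    {"region": "London", "birth_year": 1985, "gender": "M", "education": "none"},
    {"region": "Leeds",  "birth_year": 1985, "gender": "M", "education": "degree"},
]

me = {"region": "London", "birth_year": 1985, "gender": "M", "education": "degree"}

# The "anonymity set": everyone who matches on every attribute.
k = sum(all(person[attr] == val for attr, val in me.items()) for person in population)
print(f"{k} people share these attributes -> a 1-in-{k} chance of being singled out")
```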
(Don't worry: all these calculations are done in your browser and the Observatory doesn't send any of your data to a server)
If anything, The Observatory is generous to anonymization proponents. "Anonymized" data often include identifiers like the first half of a post-code.
14/
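To see why even half a post-code matters, a back-of-envelope estimate helps. The figures below are rough, hand-waved assumptions (about 3,000 postcode "outward codes" in the UK, attributes treated as independent and evenly spread, which they aren't), but the order of magnitude is the point:

```python
# Rough estimate of how fast quasi-identifiers shrink an anonymity set.
# All figures are ballpark assumptions for illustration, not official stats.
uk_population = 67_000_000
outward_codes = 3_000      # rough count of UK postcode "first halves"
birthdates = 365 * 80      # plausible distinct dates of birth in the population
genders = 2

anonymity_set = uk_population / outward_codes   # ~22,000 people per outward code
anonymity_set /= birthdates                     # now well under one person
anonymity_set /= genders

print(f"Expected people sharing half-postcode + birthdate + gender: {anonymity_set:.2f}")
# Anything below 1 means that combination is, on average, unique to you.
```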
You can read more about The Observatory's methods in the accompanying @nature paper, "Estimating the success of re-identifications in incomplete datasets using generative models."
https://www.nature.com/articles/s41467-019-10933-3
eof/
ETA - If you'd like an unrolled version of this thread to read or share, here's a link to it on http://pluralistic.net, my surveillance-free, ad-free, tracker-free blog:
https://pluralistic.net/2021/04/21/re-identification/#pseudonymity