Earlier last week, data began circulating that was allegedly ~250M records of US citizens (often referred to as "250807711", that being the number of rows). It's *massively* extensive and a heap of people got in touch with me about it. Here's what I've found:
Firstly, here's a Gist of column headings from one of the 1,255 CSV files across 229GB of data. Not all files have the same headings; this one has 279 of them: https://gist.githubusercontent.com/troyhunt/371d819a9c393e8e89c8ee826cc97db8/raw/f6b5869dd92ae5aa31a1a7908bdfaf031993f281/gistfile1.txt
Each record has different attributes completed, usually with most of them blank. Fields like name, address, geoloc, dob are almost always present. I also regex'd out 100,057,148 unique email addresses.
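For anyone wanting to try the same kind of extraction, here's a minimal sketch in Python. The pattern, the function names and the idea of a single directory of `*.csv` files are all assumptions for illustration; this is not the expression or tooling actually used on this data, and a deliberately loose pattern like this will both miss some valid addresses and accept some junk.

```python
import re
from pathlib import Path

# Deliberately loose pattern (an assumption, not the regex used in the thread).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(lines):
    """Return the unique, lower-cased email addresses found in an iterable of text lines."""
    found = set()
    for line in lines:
        for match in EMAIL_RE.findall(line):
            # Lower-case so "Bob@Example.com" and "bob@example.com" count once.
            found.add(match.lower())
    return found


def extract_emails_from_dir(directory):
    """Scan every *.csv file in a (hypothetical) directory and pool the unique addresses."""
    found = set()
    for path in sorted(Path(directory).glob("*.csv")):
        # errors="replace" keeps the scan going past badly encoded rows.
        with open(path, encoding="utf-8", errors="replace") as handle:
            found |= extract_emails(handle)
    return found


sample = [
    "a,Bob@Example.com,x",
    "bob@example.com;alice@test.org",
    "no address on this row",
]
unique = extract_emails(sample)
print(len(unique))
```

Treating each file as raw text rather than parsing the CSV sidesteps the fact that not all files share the same headings. Holding every address in one in-memory set is fine for a sample, but at ~100M addresses it would need a lot of RAM, so a real run would more likely write matches out and de-duplicate on disk.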
There are 117,736 @haveibeenpwned subscribers in this data set so call it 0.12% of the addresses in the breach. That's really low; in the Jefit breach earlier this week, 0.81% of addresses were my subscribers. It makes me question the legitimacy of the addresses.
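The arithmetic behind that comparison is just a ratio. Here's a quick sketch in Python using the figures quoted above; the function name is made up for illustration.

```python
def subscriber_share(subscribers_in_breach, unique_addresses):
    """Percentage of the breach's unique addresses that belong to subscribers."""
    return 100 * subscribers_in_breach / unique_addresses


# Figures quoted in the thread for this data set.
share = subscriber_share(117_736, 100_057_148)
print(f"{share:.2f}%")  # 0.12%

# The Jefit figure quoted in the thread, for comparison.
jefit_share = 0.81
ratio = jefit_share / share
print(f"Jefit's subscriber share is about {ratio:.1f}x higher")
```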
I've seen some pretty way-out attribution claims, including that this is "the SolarWinds data" and that it's from "the Equifax breach". No evidence whatsoever was presented to support either claim.
So... is it real? There's no clear attribution to a source and frankly, it "feels" 50:50 in terms of whether it's original data collected from users or something cobbled together. So I started asking impacted @haveibeenpwned subscribers to help. Here's what they told me:
This guy went on to say that the data is about a decade out of date:
This one wasn't data about the individual whose email address appeared on the record, but it *was* data that had been previously misattributed to them:
This person (a non-US citizen) had someone else's email address on the same row as his and found the address data was consistent with that on a public people search directory:
This person had their data appear alongside that of their relatives:
So where does that leave us? There's some legit data but also a bunch of data that, at best, is very old and, at worst, is completely misattributed. Arguably, the data quality is poor. The question now is: what should be done with the data?
Should this breach go into @haveibeenpwned where it would be flagged as unverified and from an unattributable source?
You can follow @troyhunt.