It's hard to find a good criminal these days. I mean a really trustworthy one you can be confident won't lead you up the garden path with false promises of data breaches. Like this guy yesterday:
For my international friends, JB Hi-Fi is a massive electronics retailer down under, and they have my data! By design, I mean, because I've bought a bunch of stuff from them. So I was curious not just about my own data, but also because a breach of 12 million plus people would be massive in a country of not much more than double that. So, I dropped the guy a message and asked if he'd be willing to help me verify the incident by sharing my own record. I didn't want to post any public commentary about this incident until I had a reasonable degree of confidence it was legit, especially given how much impact it could have in my very own backyard.
Now, I wouldn't normally share a private conversation with another party, but when someone sets out to scam people, that rule goes out the window as far as I'm concerned. So here's where the conversation got interesting:
He guaranteed it for me! Sounds legit. But hey, everyone gets the benefit of the doubt until proven otherwise, so I started looking at the data. It turns out my own info wasn't in the full set, but he was happy to provide a few thousand sample records with 14 columns:
Pretty standard stuff, could be legit, let's check. I have a little PowerShell script I run against the HIBP API when a new alleged breach comes in and I want to get a really good sense of how unique it is. It simply loops through all the email addresses in a file, checks which breaches they've been in and keeps track of the percentage that have been seen before. A unique breach will have anywhere from about 40% to 80% previously seen addresses, but this one had, well, more:
Spot the trend? Every single address has one breach in common. Hmmm... wonder what the guy has to say about that?
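The original script is PowerShell, but the logic is simple enough to sketch in Python. The HIBP endpoint and header below are the real v3 API; the lookup is split out so the rate calculation is independent of the network call, and names like `seen_before_rate` are my own, not HIBP's:

```python
import json
import urllib.error
import urllib.parse
import urllib.request

API = "https://haveibeenpwned.com/api/v3/breachedaccount/{}"

def hibp_breaches(email, api_key):
    """Names of the breaches a single address appears in (empty list if none)."""
    req = urllib.request.Request(
        API.format(urllib.parse.quote(email)),
        headers={"hibp-api-key": api_key, "user-agent": "breach-verifier"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return [b["Name"] for b in json.load(resp)]
    except urllib.error.HTTPError as e:
        if e.code == 404:  # HIBP returns 404 for an address in no known breach
            return []
        raise

def seen_before_rate(emails, lookup):
    """Percentage of addresses the lookup reports as previously breached."""
    seen = sum(1 for e in emails if lookup(e))
    return 100 * seen / len(emails)
```

In practice you'd pass `lambda e: hibp_breaches(e, key)` as the lookup and sleep between calls to respect the API's rate limit; the tell-tale sign of recycled data is the rate landing way above that 40-80% band.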
But he was in the server! And he grabbed it from the dashboard of Shopify! Must be legit, unless... what if I compared it to the actual full breach of Dymocks? That's a local Aussie bookseller (so it would have a lot of Aussie-looking email addresses in it, just like JB Hi-Fi would), and their breach dated back to mid-2023. I keep breaches like that on hand for just such occasions, let's compare the two:
Wow! What are the chances?! He's going to be so interested when he hears about this!
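Comparing an alleged new breach against an old corpus you have on hand is nothing fancier than a set intersection over normalised addresses; a minimal sketch (the function name is mine, and in practice both inputs would be streamed from files):

```python
def overlap(claimed, known):
    """Fraction of addresses in the alleged breach already present in a known one."""
    claimed_set = {e.strip().lower() for e in claimed}
    known_set = {e.strip().lower() for e in known}
    return len(claimed_set & known_set) / len(claimed_set)
```

A genuinely new breach overlaps an unrelated old one only incidentally; a figure approaching 100% against a single prior corpus is the smoking gun.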
And that was it. The chat went silent and very shortly after, the listing was gone:
It looks like the bloke has also since been booted off the forum where he tried to run the scam so yeah, this one didn't work out great for him. That $16k would have been so tasty too!
I wrote this short post to highlight how important verification of data breach claims is. Obviously, I've seen loads of legitimate ones, but I've also seen a lot of rubbish. It's not usually this blatant, where the party contacting me is making such demonstrably false claims about their own exploits, but it happens very regularly with people who obtain something from another party and repeat the lie they've been told. This example also highlights how useful data from previous breaches is, even after the email addresses have been extracted and loaded into HIBP. Data is so often recycled and shipped around as something new; this was a textbook case of using a previous incident to disprove a new claim. Plus, it's kinda fun poking holes in a scamming criminal's claims 😊
The conundrum I refer to in the title of this post is the one faced by a breached organisation: disclose or suppress? And let me be even more specific: should they disclose to impacted individuals, or simply never let them know? I'm writing this after many recent such discussions with breached organisations where I've found myself wishing I had this blog post to point them to, so, here it is.
Let's start with tackling what is often a fundamental misunderstanding about disclosure obligations, and that is the legal necessity to disclose. Now, as soon as we start talking about legal things, we run into the problem of it being different all over the world, so I'll pick a few examples to illustrate the point. As it relates to the UK GDPR, there are two essential concepts to understand, and they're the first two bulleted items in their personal data breaches guide:
The UK GDPR introduces a duty on all organisations to report certain personal data breaches to the relevant supervisory authority. You must do this within 72 hours of becoming aware of the breach, where feasible.
If the breach is likely to result in a high risk of adversely affecting individuals’ rights and freedoms, you must also inform those individuals without undue delay.
On the first point, "certain" data breaches must be reported to "the relevant supervisory authority" within 72 hours of learning about it. When we talk about disclosure, often (not just under GDPR), that term refers to the responsibility to report it to the regulator, not the individuals. And even then, read down a bit, and you'll see the carveout of the incident needing to expose personal data that is likely to present a "risk to people’s rights and freedoms".
This brings me to the second point that has this massive carveout as it relates to disclosing to the individuals, namely that the breach has to present "a high risk of adversely affecting individuals’ rights and freedoms". We have a similar carveout in Australia where the obligation to report to individuals is predicated on the likelihood of causing "serious harm".
This leaves us with the fact that in many data breach cases, organisations may decide they don't need to notify individuals whose personal information they've inadvertently disclosed. Let me give you an example from smack bang in the middle of GDPR territory: Deezer, the French streaming media service that went into HIBP early January last year:
New breach: Deezer had 229M unique email addresses breached from a 2019 backup and shared online in late 2022. Data included names, IPs, DoBs, genders and customer location. 49% were already in @haveibeenpwned. Read more: https://t.co/1ngqDNYf6k
— Have I Been Pwned (@haveibeenpwned) January 2, 2023
229M records is a substantial incident, and there's no argument about the personally identifiable nature of attributes such as email address, name, IP address, and date of birth. However, at least initially (more on that soon), Deezer chose not to disclose to impacted individuals:
Chatting to @Scott_Helme, he never received a breach notification from them. They disclosed publicly via an announcement in November, did they never actually email impacted individuals? Did *anyone* who got an HIBP email get a notification from Deezer? https://t.co/dnRw8tkgLl https://t.co/jKvmhVCwlM
— Troy Hunt (@troyhunt) January 2, 2023
No, nothing … but then I’ve not used Deezer for years .. I did get this👇from FireFox Monitor (provided by your good selves) pic.twitter.com/JSCxB1XBil
— Andy H (@WH_Y) January 2, 2023
Yes, same situation. I got the breach notification from HaveIBeenPwned, I emailed customer service to get an export of my data, got this message in response: pic.twitter.com/w4maPwX0Qe
— Giulio Montagner (@Giu1io) January 2, 2023
This situation understandably upset many people, with many cries of "but GDPR!" quickly following. And they did know way before I loaded it into HIBP too, almost two months earlier, in fact (courtesy of archive.org):
This information came to light November 8 2022 as a result of our ongoing efforts to ensure the security and integrity of our users’ personal information
They knew, yet they chose not to contact impacted people. And they're also confident that position didn't violate any data protection regulations (current version of the same page):
Deezer has not violated any data protection regulations
And based on the carveouts discussed earlier, I can see how they drew that conclusion. Was the disclosed data likely to lead to "a high risk of adversely affecting individuals’ rights and freedoms"? You can imagine lawyers arguing that it wouldn't. Regardless, people were pissed, and if you read through those respective Twitter threads, you'll get a good sense of the public reaction to their handling of the incident. HIBP sent 445k notifications to our own individual subscribers and another 39k to those monitoring domains with email addresses in the breach, and if I were to hazard a guess, that may have been what led to this:
Is this *finally* the @Deezer disclosure notice to individuals, a month and a half later? It doesn’t look like a new incident to me, anyone else get this? https://t.co/RrWlczItLm
— Troy Hunt (@troyhunt) February 20, 2023
So, they know about the breach in Nov, and they told people in Feb. It took them a quarter of a year to tell their customers they'd been breached, and if my understanding of their position and the regulations they were adhering to is correct, they never needed to send the notice at all.
I appreciate that's a very long-winded introduction to this post, but it sets the scene and illustrates the conundrum perfectly: an organisation may not need to disclose to individuals, but if they don't, they risk a backlash that may eventually force their hand.
In my past dealings with organisations that were reticent to disclose to their customers, their position was often that the data was relatively benign: email addresses, names, and some other identifiers of minimal consequence. It's often clear that the organisation is leaning towards the "uh, maybe we just don't say anything" angle, and if it's not already obvious, that's not a position I'd encourage. Let's go through all the reasons:
I ask this question because the defence I've often heard from organisations choosing the non-disclosure path is that the data is theirs - the company's. I have a fundamental issue with this, and it's not one with any legal basis (but I can imagine it being argued by lawyers in favour of that position), rather the commonsense position that someone's email address, for example, is theirs. If my email address appears in a data breach, then that's my email address and I entrusted the organisation in question to look after it. Whether there's a legal basis for the argument or not, the assertion that personally identifiable attributes become the property of another party will buy you absolutely no favours with the individual who provided them to you when you don't let them know you've leaked it.
Picking those terms from earlier on, if my gender, sexuality, ethnicity, and, in my case, even my entire medical history were to be made public, I would suffer no serious harm. You'd learn nothing of any consequence that you don't already know about me, and personally, I would not feel that I suffered as a result. However...
For some people, simply the association of their email address to their name may have a tangible impact on their life and, using the term from above, jeopardises their rights and freedoms. Some people choose to keep their IRL identities completely detached from their email address, only providing the two together to a handful of trusted parties. If you're handling a data breach for your organisation, do you know if any of your impacted customers are in that boat? No, of course not; how could you?
Further, let's imagine there is nothing more than email addresses and passwords exposed on a cat forum. Is that likely to cause harm to people? Well, it's just cats; how bad could it be? Now, ask that question - how bad could it be? - with the prevalence of password reuse in mind. This isn't just a cat forum; it is a repository of credentials that will unlock social media, email, and financial services. Of course, it's not the fault of the breached service that people reuse their passwords, but their breach could lead to serious harm via the compromise of accounts on totally unrelated services.
Let's make it even more benign: what if it's just email addresses? Nothing else, just addresses and, of course, the association to the breached service. Firstly, the victims of that breach may not want their association with the service to be publicly known. Granted, there's a spectrum and weaponising someone's presence in Ashley Madison is a very different story from pointing out that they're a LinkedIn user. But conversely, the association is enormously useful phishing material; it helps scammers build a more convincing narrative when they can construct their messages by repeating accurate facts about their victim: "Hey, it's Acme Corp here, we know you're a loyal user, and we'd like to make you a special offer". You get the idea.
I'll start this one in the complete opposite direction to what it sounds like it should be because this is what I've previously heard from breached organisations:
We don't want to disclose in order to protect our customers
Uh, you sure about that? And yes, you did read that paraphrasing correctly. In fact, here's a copy-paste from a recent discussion about disclosure where there was an argument against any public discussion of the incident:
Our concern is that your public notification would direct bad actors to search for the file, which can potentially do harm to both the business and our mutual users.
The fundamental issue of this clearly being an attempt to suppress news of the incident aside, in this particular case, the data was already on a popular clear web hacking forum, and the incident had appeared in multiple tweets viewed by thousands of people. The argument makes no sense whatsoever: the bad guys - lots of them - already have the data, and the good guys (the customers) don't know about it.
I'll quote precisely from another company who took a similar approach around non-disclosure:
[company name] is taking steps to notify regulators and data subjects where it is legally required to do so, based on advice from external legal counsel.
By now, I don't think I need to emphasise the caveat that they inevitably relied on to suppress the incident, but just to be clear: "where it is legally required to do so". I can say with a very high degree of confidence that they never notified the 8-figure number of customers exposed in this incident because they didn't have to. (I hear about it pretty quickly when disclosure notices are sent out, and I regularly share these via my X feed).
Non-disclosure is intended to protect the brand and by extension, the shareholders, not the customers.
Usually, after being sent a data breach, the first thing I do is search for "[company name] data breach". Often, the only results I get are for a listing on a popular hacking forum (again, on the clear web) where their data was made available for download, complete with a description of the incident. Often, that description is wrong (turns out hackers like to embellish their accomplishments). Incorrect conclusions are drawn and publicised, and they're the ones people find when searching for the incident.
When a company doesn't have a public position on a breach, the vacuum it creates is filled by others. Obviously, those with nefarious intent, but also by journalists, and many of those don't have the facts right either. Public disclosure allows the breached organisation to set the narrative, assuming they're forthcoming and transparent and don't water it down such that there's no substance in the disclosure, of course.
All the way back in 2017, I wrote about The 5 Stages of Data Breach Grief as I watched The AA in the UK dig themselves into an ever-deepening hole. They were doubling down on bullshit, and there was simply no way the truth wasn't going to come out. It was such a predictable pattern that, just like with Kübler-Ross' stages of personal grief, it was very clear how this was going to play out.
If you choose not to disclose a breach - for whatever reason - how long will it be until your "truth" comes out? Tomorrow? Next month? Years from now?! You'll be looking over your shoulder until it happens, and if it does one day go public, how will you be judged? Which brings me to the next point:
I can't put any precise measure on it, but I feel we reached a turning point in 2017. I even remember where I was when it dawned on me, sitting in a car on the way to the airport to testify before US Congress on the impact of data breaches. News had recently broken that Uber had attempted to cover up its breach of the year before by passing it off as a bug bounty and, of course, not notifying impacted customers. What dawned on me at that moment of reflection was that by now, there had been so many data breaches that we were judging organisations not by whether they'd been breached but how they'd handled the breach. Uber was getting raked over the coals not for the breach itself but because they tried to conceal it. (Their CTO was also later convicted of federal charges for some of the shenanigans pulled under his watch.)
This is going to feel like I'm talking to my kids after they've done something wrong, but here goes anyway: If people entrusted you with their data and you "lost" it (had it disclosed to unauthorised parties), the only decent thing to do is own up and acknowledge it. It doesn't matter if it was your organisation directly or, as with the Deezer situation, a third party you entrusted with the data; you are the coalface to your customers, and you're the one who is accountable for their data.
I am yet to see any valid reasons not to disclose that are in the best interests of the impacted customers (the delay in the AT&T breach announcement at the request of the FBI due to national security interests is the closest I can come to justifying non-disclosure). It's undoubtedly the customers' expectation, and increasingly, it's the governments' expectations too; I'll leave you with a quote from our previous Cyber Security Minister Clare O'Neil in a recent interview:
But the real people who feel pain here are Australians when their information that they gave in good faith to that company is breached in a cyber incident, and the focus is not on those customers from the very first moment. The people whose data has been stolen are the real victims here. And if you focus on them and put their interests first every single day, you will get good outcomes. Your customers and your clients will be respectful of it, and the Australian government will applaud you for it.
I'm presently on a whirlwind North America tour, visiting government and law enforcement agencies to understand more about their challenges and where we can assist with HIBP. As I spend more time with these agencies around the world, I keep hearing that data breach victim notification is an essential piece of the cybersecurity story, and I'm making damn sure to highlight the deficiencies I've written about here. We're going to keep pushing for all data breach victims to be notified when their data is exposed, and my hope in writing this is that when it's read in future by other organisations I've disclosed to, they respect their customers and disclose promptly. Check out Data breach disclosure 101: How to succeed after you've failed for guidance on how to do this.
Edit (a couple of days later): I'm adding an addendum to this post given how relevant it is. I just saw the following from Ruben van Well of the Dutch Police, someone who has invested a lot of effort in victim notification and we had the pleasure of spending time with last year in Rotterdam:
To translate the key section:
Reporting and transparency around incidents is important. Of the companies that fall victim, between 8 and 10% report this, whether or not out of fear of reputational damage. I assume that your image will be more damaged if you do not report an incident and it does come out later.
It echoes my sentiments from above precisely, and I hope that message has an impact on anyone considering whether or not to disclose.
I decided to write this post because there's no concise way to explain the nuances of what's being described as one of the largest data breaches ever. Usually, it's easy to articulate a data breach; a service people provide their information to had someone snag it through an act of unauthorised access and publish a discrete corpus of information that can be attributed back to that source. But in the case of National Public Data, we're talking about a data aggregator most people had never heard of where a "threat actor" has published various partial sets of data with no clear way to attribute it back to the source. And they're already the subject of a class action, to add yet another variable into the mix. I've been collating information related to this incident over the last couple of months, so let me talk about what's known about the incident, what data is circulating and what remains a bit of a mystery.
Let's start with the easy bit - who is National Public Data (NPD)? They're what we refer to as a "data aggregator", that is they provide services based on the large volumes of personal information they hold. From the front page of their website:
Criminal Records, Background Checks and more. Our services are currently used by investigators, background check websites, data resellers, mobile apps, applications and more.
There are many legally operating data aggregators out there... and there are many that end up with their data in Have I Been Pwned (HIBP). For example, Master Deeds, Exactis and Adapt, to name but a few. In April, we started seeing news of National Public Data and billions of breached records, with one of the first references coming from the Dark Web Intelligence account:
USDoD Allegedly Breached National Public Data Database, Selling 2.9 Billion Records https://t.co/emQIZ0lgsn pic.twitter.com/Tt8UNppPSu
— Dark Web Intelligence (@DailyDarkWeb) April 8, 2024
Back then, the breach was attributed to "USDoD", a name to remember as you'll see it throughout this post. The embedded image is the first reference of the 2.9B number we've subsequently seen flashed all over the press, and it's right there alongside the request of $3.5M for the data. Clearly, there is a financial motive involved here, so keep that in mind as we dig further into the story. That image also refers to 200GB of compressed data that expands out to 4TB when uncompressed, but that's not what initially caught my eye. Instead, something quite obvious in the embedded image doesn't add up: if this data is "the entire population of USA, CA and UK" (which is ~450M people in total), what's the 2.9B number we keep seeing? Because that doesn't reconcile with reports about "nearly 3 billion people" with social security numbers exposed. Further, SSNs are a rather American construct, with Canada having SINs (Social Insurance Numbers) and the UK having NI (National Insurance) numbers as probably the closest equivalent. This is the constant theme you'll read about in this post: stuff just being a bit... off. But hyperbole is often a theme with incidents like this, so let's take the headlines with a grain of salt and see what the data tells us.
I was first sent data allegedly sourced from NPD in early June. The corpus I received reconciled with what vx-underground reported on around the same time (note their reference to the 8th of April, which also lines up with the previous tweet):
April 8th, 2024, a Threat Actor operating under the moniker "USDoD" placed a large database up for sale on Breached titled: "National Public Data". They claimed it contained 2,900,000,000 records on United States citizens. They put the data up for sale for $3,500,000.
— vx-underground (@vxunderground) June 1, 2024
National…
In their message, they refer to having received data totalling 277.1GB uncompressed, which aligns with the sum total of the 2 files I received:
They also mentioned the data contains first and last names, addresses and SSNs, all of which appear in the first file above (among other fields):
These first rows also line up precisely with the post Dark Web Intelligence included in the earlier tweet. And in case you're looking at it and thinking "that's the same SSN repeated across multiple rows with different names", those records are all the same person, just with the names represented in different orders and with different addresses (all in the same city). In other words, those 6 rows only represent one person, which got me thinking about the ratio of rows to distinct numbers. Curious, I took 100M samples and found that only 31% of the rows had unique SSNs, so extrapolating that out, 2.9B rows would be more like 899M people. This is something to always be conscious of when you read headline numbers: "2.9B" doesn't necessarily mean 2.9B people, it often means rows of data. Speaking of which, those 2 files contain 1,698,302,004 and 997,379,506 rows respectively, for a combined total of 2.696B. Is this where the headline number comes from? Perhaps; it's close, and it's also precisely the same as Bleeping Computer reported a few days ago.
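The extrapolation itself is simple arithmetic: measure the distinct-SSN ratio in a sample, then scale it up to the full row count. A sketch (integer maths to avoid floating-point noise at these magnitudes):

```python
def estimate_distinct(total_rows, sample):
    """Scale the distinct-value ratio observed in a sample up to the full row count."""
    return total_rows * len(set(sample)) // len(sample)
```

With a 31% distinct ratio, 2.9B rows collapses to roughly 899M people; still a huge number, just nowhere near the headline.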
At this point in the story, there's no question that there is legitimate data in there. From the aforementioned Bleeping Computer story:
numerous people have confirmed to us that it included their and family members' legitimate information, including those who are deceased
And in vx-underground's tweet, they mention that:
It also allowed us to find their parents, and nearest siblings. We were able to identify someones parents, deceased relatives, Uncles, Aunts, and Cousins. Additionally, we can confirm this database also contains informed on individuals who are deceased. Some individuals located had been deceased for nearly 2 decades.
A quick tangential observation in the same tweet:
The database DOES NOT contain information from individuals who use data opt-out services. Every person who used some sort of data opt-out service was not present.
Which is what you'd expect from a legally operating data aggregator service. It's a minor point, but it does support the claim that the data came from NPD.
Important: None of the data discussed so far contains email addresses. That doesn't necessarily make it any less impactful for those involved, but it's an important point I'll come back to later as it relates to HIBP.
So, this data appeared in limited circulation as early as 3 months ago. It contains a huge amount of personal information (even if it isn't "2.9B people"), and then to make matters worse, it was posted publicly last week:
National Public Data, a service by Jerico Pictures Inc., suffered #databreach. Hacker “Fenice” leaked 2.9b records with personal details, including full names, addresses, & SSNs in plain text. https://t.co/fXY3SXEiKe
— Wolf Technology Group (@WolfTech) August 6, 2024
Who knows who "Fenice" is and what role they play, but clearly multiple parties had access to this data well in advance of last week. I've reviewed what they posted, and it aligns with what I was sent 2 months ago, which is bad. But on the flip side, at least it has allowed services designed to protect data breach victims to get notices out to them:
Twice this week I was alerted my SSN was found on the web thanks to a data breach at National Public Data. Cool. Thanks guys. pic.twitter.com/FAlfNmXUqm
— MrsNineTales (@MrsNineTales) August 8, 2024
Inevitably, breaches of this nature result in legal action, which, as I mentioned in the opening paragraph, began a couple of weeks ago. It looks like a tip-off from a data protection service was enough for someone to bring a case against NPD:
Named plaintiff Christopher Hofmann, a California resident, said he received a notification from his identity-theft protection service provider on July 24, notifying him that his data was exposed in a breach and leaked on the dark web.
Up until this point, pretty much everything lines up, but for one thing: Where is the 4TB of data? And this is where it gets messy as we're now into the territory of "partial" data. For example, this corpus from last month was posted to a popular hacking forum:
National Public Database Allegedly Partially Leaked
— Dark Web Intelligence (@DailyDarkWeb) July 23, 2024
It is stated that nearly 80 GB of sensitive data from the National Public Data is available.
The post contains different credits for the leakage and the alleged breach was credited to a threat actor “Sxul” and stressed that it… https://t.co/v8uq0o88NS pic.twitter.com/a6dn3MvYkf
That's 80GB, and whilst it's not clear whether that's the size of the compressed or extracted archive, either way, it's still a long way short of the full alleged 4TB. Do take note of the file name in the embedded image, though - "people_data-935660398-959524741.csv" - as this will come up again later on.
Earlier this month, a 27-part corpus of data alleged to have come from NPD was posted to Telegram, this image representing the first 10 parts at 4GB each:
The compressed archive files totalled 104GB and contained what feels like a fairly random collection of data:
Many of these files are archives themselves, with many of those then containing yet more archives. I went through and recursively extracted everything, which resulted in a total corpus of 642GB of uncompressed data across more than 1k files. If this is "partial", what was the story with the 80GB "partial" from last month? Who knows, but in those files were 134M unique email addresses.
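Recursively unpacking archives-within-archives is tedious by hand; the approach is just to keep extracting until no new archives turn up. A rough sketch (zip-only for brevity; the real corpus mixed formats, and this function is my own illustration, not a tool from the post):

```python
import pathlib
import zipfile

def extract_nested(root):
    """Keep extracting until no unextracted .zip files remain under root."""
    done = set()
    while True:
        zips = [p for p in pathlib.Path(root).rglob("*.zip") if p not in done]
        if not zips:
            break
        for z in zips:
            target = z.with_suffix("")  # extract alongside the archive, minus ".zip"
            with zipfile.ZipFile(z) as archive:
                archive.extractall(target)
            done.add(z)
```

Extracting each archive into a sibling directory named after it keeps the provenance of every file visible, which matters when you later need to say which "partial" a given record came from.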
Just to take stock of where we're at, we've got the first set of SSN data which is legitimate and contains no email addresses yet is allegedly only a small part of the total NPD corpus. Then we've got this second set of data which is larger and has tens of millions of email addresses yet is pretty random in appearance. The burning question I was trying to answer is "is it legit?"
The problem with verifying breaches sourced from data aggregators is that nobody willingly - knowingly - provides their data to them, so I can't do my usual trick of just asking impacted HIBP subscribers if they'd used NPD before. Nor can I usually look at a data aggregator breach and find pointers in the data that tie it back to the company in question, because unlike a breach of a consumer service, there are rarely any references to the aggregator itself. In part, that's because this data is just so damn generic. Take the earlier screenshot with the SSN data; how many different places have your first and last name, address, SSN, etc.? Attributing a source when there's only generic data to go by is extremely difficult.
The kludge of different file types and naming conventions in the image above worried me. Is this actually all from NPD? Usually, you'd see some sort of continuity, for example, a heap of .json files with similar names or a swathe of .sql files with each one representing a dumped table. The presence of "people_data-935660398-959524741.csv" ties this corpus together with the one from the earlier tweet, but then there's stuff like "Accuitty_10_1_2022.zip"; could that refer to Acuity (single "c", single "t") which I wrote about in November? HIBP isn't returning hits for email addresses in that folder against the Acuity I loaded last year, so no, it's a different corpus. But that archive alone ended up having over 250GB of data with almost 100M unique email addresses, so it forms a substantial part of the overall corpus of data.
The 3,608,086KB "criminal_export.csv.zip" file caught my eye, in part because criminal record checks are a key component of NPD's services, but also because it was only a few months ago we saw another breach containing 70M rows from a US criminal database. And see who that breach was attributed to? USDoD, the same party whose name is all over the NPD breach. I did actually receive that data but filed it away and didn't load it into HIBP, as there were no email addresses in it. I wonder if the data from that story lines up with the file in the image above? Let's check the archives:
Different file name, but hey, it's a 3,608,086KB file! Given the NPD breach initially occurred in April and the criminal data hit the news in May, it's entirely possible the latter was obtained from the former, but I couldn't find any mention of this correlation anywhere. (Side note: this is a perfect example of why I retain breaches in offline storage after processing because they're so often helpful when assessing the origin and legitimacy of new breaches).
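Linking a new file back to something in offline storage doesn't need anything fancier than comparing sizes and hashes, so renamed copies still match. A sketch (function names and paths are my own, for illustration):

```python
import hashlib

def fingerprint(path, chunk=1 << 20):
    """(size, SHA-256) of a file, read in 1MB chunks so huge dumps fit in memory."""
    h = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
            size += len(block)
    return size, h.hexdigest()

def same_file(a, b):
    return fingerprint(a) == fingerprint(b)
```

An identical byte count is a strong hint; an identical hash settles it, regardless of what either party named the file.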
Continuing the search for oddities, I decided to see if I myself was in there. On many occasions now, I've loaded a breach, started the notification process running, walked away from the PC, then received an email from myself about being in the breach 🤦‍♂️ I'm continually surprised by the places I find myself in, including this one:
Dammit! It's an email address of mine, yet clearly, none of the other data is mine. Not my name, not my address, and the obfuscated numbers definitely aren't familiar to me (I don't believe they're SSNs or other sensitive identifiers, but because I can't be sure, I've obfuscated them). I suspect one of those numbers is a serialised date of birth, but of the total 28 rows with my email address on them, the two unique DoBs put "me" as being born in either 1936 or 1967. Both are a long way from the truth.
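If those numbers are indeed date serials, decoding them is trivial once you guess the epoch. Here's a sketch assuming an Excel-style serial (days counted from 1899-12-30); that epoch is purely an assumption, as other systems count from different dates entirely:

```python
from datetime import date, timedelta

# Assumed epoch: Excel-style day serials. Unix timestamps, SQL
# Server day counts and others all use different starting points.
EXCEL_EPOCH = date(1899, 12, 30)


def serial_to_date(serial):
    """Interpret an integer as an Excel-style day count."""
    return EXCEL_EPOCH + timedelta(days=serial)
```

Under that assumption, a serial in the low 13,000s lands in the mid-1930s and one in the mid-24,000s in the late 1960s, which is at least consistent with the two implausible "birth years" above.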
A cursory review of the other data in this corpus revealed a wide array of different personal attributes. One file contained information such as height, weight, eye colour, and ethnicity. The "uk.txt" file in the image above merely contained a business directory with public information. I could have dug deeper, but by now, there was no point. There's clearly some degree of invalid data in here, there's definitely data we've seen appear separately as a discrete breach, and there are many different versions of "partial" NPD data (although the 27-part archive discussed here is the largest I saw and the one I was most consistently directed to by other people). The more I searched, the more bits and pieces attributed back to NPD I found:
If I were to take a guess, there are two likely explanations for what we're seeing:
Both of these are purely speculative, though, and the only parties that know the truth are the anonymous threat actors passing the data around and the data aggregator that's now being sued in a class action, so yeah, we're not going to see any reliable clarification any time soon. Instead, we're left with 134M email addresses in public circulation and no clear origin or accountability. I sat on the fence about what to do with this data for days, not sure whether I should load it and, if I did, whether I should write about it. Eventually, I decided it deserved a place in HIBP as an unverified breach, and per the opening sentence, this blog post was the only way I could properly explain the nuances of what I discovered. This way, impacted people will know if their data is floating around in this corpus, and if they find this information unactionable, then they can do precisely what they would have done had I not loaded it - nothing.
Lastly, I want to re-emphasise a point I made earlier on: there were no email addresses in the social security number files. If you find yourself in this data breach via HIBP, there's no evidence your SSN was leaked, and if you're in the same boat as me, the data next to your record may not even be correct. And no, I don't have a mechanism to load additional attributes beyond email address into HIBP, nor can I point people in the direction of the source data (some of you will have received a reminder about why I don't do that just a few days ago). And I'm definitely not equipped to be your personal lookup service, manually trawling through the data and pulling out individual records for you! So, treat this as informational only, an intriguing story that doesn't require any further action.
Last week, I wrote about The State of Data Breaches and got loads of feedback. It was predominantly sympathetic to the position I find myself in running HIBP, and that post was mostly one of frustration: lack of disclosure, standoffish organisations, downplaying breaches and the individual breach victims themselves making it worse by going to town on the corporate victims. But the other angle that's been milling around in my brain is the one represented by the image here:
Running HIBP has become a constant balancing act between three parties: hackers, corporate victims and law enforcement. Let me explain:
This is where most data breaches begin, with someone illegally accessing a protected system and snagging the data. That's a high-level generalisation, of course, but whether it's exploiting software vulnerabilities, downloading exposed database backups or phishing admin credentials and then grabbing the data, it's all in the same realm of taking something that isn't theirs. And sometimes, they contact me.
This is a hard position to find myself in, primarily because I need to weigh the potentially competing objectives of notifying impacted HIBP subscribers whilst simultaneously not pandering to the perverse incentives of likely criminals. Sometimes, it's easy: when someone reports exposed data or a security vulnerability, the advice is to contact the company involved and not turn it into a data breach. But when they already have the data, by definition it's now a breach and inevitably a bunch of my subscribers are in there. It's awkward, talking to the first party responsible for the breach.
There are all sorts of circumstances that may make it even more awkward, for example if the hacker is actively trying to shake the company down for money. Perhaps they're selling the data on the breach market. Maybe they also still have access to the corporate system. Having a discussion with someone in that position is delicate, and throughout it all, I'm conscious that they may very well end up in custody and every discussion we've had will be seen by law enforcement. Every single word I write is predicated on that assumption. And eventually, being caught is a very likely outcome; just as we say that as defenders we need to get it right every single time and the hacker only needs to get it right once, as hackers, they need to get their opsec right every single time and it only takes that one little mistake to bring them undone. A dropped VPN connection. An email address, handle or password used somewhere else that links to their identity. An incorrect assumption about the anonymity of cryptocurrency. One. Little. Mistake.
However, I also need to treat these discussions as confidential. The expectation when people reach out is that they can confide in me, and that's due to the trust I've built over more than a decade of running this service. Relaying those conversations without their permission could destroy that reputation in a heartbeat. So, I often find myself brokering conversations between the three parties mentioned here, providing contact details back and forth or relaying messages with the consent of each party.
This sort of communication gets messy: you've got the hacker (who's often suspicious of big corp) trying to draw attention to an issue, but they're trying to communicate with a party who's also naturally suspicious of anonymous characters who've accessed their data! And law enforcement is, of course, interested in the hacker because that's their job, but they're also respectful of the role I play and the confidence with which data is shared with me. Meanwhile, law enforcement is also often engaged by the corporate victim and now we've got all players conversing with each other and me in the middle.
I say this not to be grandiose about what I do, but to explain the delicate balance with which many of these data breaches need to be handled. All of that is then wrapped up with the observations from the previous post about lack of urgency, etc.
I choose to use this term because it's all too easy for people to point at a company that's suffered a data breach and level blame at them. Depending on the circumstances, some blame is likely warranted, but make no mistake: breached companies are usually the target of intentional, malicious, criminal activity. And when I say "companies", we're ultimately talking about individuals who are usually doing the best they can at their jobs and, during a period of incident response, are often having the worst time of their careers. I've heard the pain in their voices and seen the stress on their faces on so many prior occasions, and I want to make sure that the human element of this isn't lost amidst the chants of angry customers.
The way in which corporate victims engage with hackers is particularly delicate. They're understandably angry, but they're also walking the tightrope of trying to learn as much as they can about the incident (the vector by which data was obtained often isn't known in the early stages), whilst listening to often exorbitant demands and not losing their cool. It's very easy for the party who has always worked on the basis of anonymity to simply "go dark" and disappear altogether, and then what? We can see this balancing act in many of the communications later released by hackers, often after they've failed to secure the expected ransom payment; you have extremely polite corporations... who you know want nothing more than to have the guy thrown into prison!
The law enforcement angle, or perhaps, to put it more broadly, the interactions with government authorities in general, is an interesting one. Beyond the obvious engagements around the criminal activity of hackers, the corporate victims themselves have legal responsibilities. This is obviously highly dependent on jurisdiction and regulatory controls, but it may mean reporting the breach to the appropriate government entity, for example. It may even mean reporting to many government entities (i.e. state-based) depending on where they are in the world. Then there's the question of their own culpability and whether the actions they took (or didn't take) both pre and post-breach may result in punitive measures being taken. I had a headline in the previous post that included the term "covering their arses" and this doesn't just mean from customer or shareholder backlash, but increasingly, from massive corporate fines.
I suspect, based on many previous experiences, that corporations have a love-hate relationship with law enforcement. They obviously want their support when it comes to dealing with the criminals, but they're extraordinarily cautious about what they disclose lest it later contribute to the basis on which penalties are levelled against them. Imagine the balancing act involved when the corporate victim suspects the breach occurred due to some massive oversights on their part and they approach law enforcement for support: "So, how do you think they got in? Uh..."
Like I've already said so many times in this post: "delicate".
This is the most multidimensional player of the three, interfacing back and forth with each party in various ways. Most obviously, they're there to bring criminals to justice, and that clearly puts hackers well within their remit. I've often referred to "the FBI and friends" or similar terms that illustrate how much of a partnership international law enforcement efforts are, as is regularly evidenced by the takedown notices on cybercrime initiatives:
The hackers themselves are often all too eager to engage with law enforcement too. Sometimes to taunt, other times to outright target, often at a very individual level such as naming specific agents. It should be said also that "hacker" is a very broad term that, at its worst, is outright criminal activity intended to be destructive for their own financial gain. But at the other end of the scale is a much more nuanced space where folks who may be labelled with this title aren't necessarily malicious in their intent but to paraphrase: "I was poking around and I found something, can you help me report it to the authorities".
The engagement between law enforcement and corporate victims often begins with the latter reporting an incident. We see this all the time in disclosure statements: "we've notified the authorities". And that's a very natural outcome following a criminal act. It's not just the hacking itself; a breach is often accompanied by a ransom demand, which piles on yet another criminal activity that needs to be referred to the authorities. Conversely, law enforcement regularly sees early indications of compromise before the corporate victim does and is able to communicate that directly. Increasingly, we're seeing formal government entities issue much broader infosec advice, for example, as our Australian Signals Directorate regularly does.
I often end up finding myself in a variety of different roles with law enforcement agencies. For example, providing a pipeline for the FBI to feed breached passwords into, supporting the Estonian Central Criminal Police by making data impacting their citizens searchable, spending time with the Dutch police on victim notification, and even testifying in front of US Congress. And, of course, supporting three dozen national CERTs around the world with open access to exposure of their federal domains in HIBP. Many of these agencies also have a natural interest in the folks who contact me, especially from that first category listed above. That said, I've always found law enforcement to be respectful of the confidence with which hackers share information with me; they understand the importance of the trust I mentioned earlier on, and its significance in playing the role I do.
A decade on, I still find this to be an odd space to occupy, sitting on the fringe and sometimes right in the middle of the interactions between these three parties. It's unpredictable, fascinating, exciting, stressful, and I hope you found this interesting reading 🙂
I've been harbouring some thoughts about the state of data breaches over recent months, and I feel they've finally manifested themselves into a cohesive enough story to write down. Parts of this story relate to very sensitive incidents and parts to criminal activity, not just on behalf of those executing data breaches but also very likely on behalf of some organisations handling them. As such, I'm not going to refer to any specific incidents or company names; rather, I'm going to speak more generally about what I'm seeing in the industry.
Generally, when I disclose a breach to an impacted company, it's already out there in circulation and for all I know, the company is already aware of it. Or not. And that's the problem: a data breach circulating broadly on a popular clear web hacking forum doesn't mean the incident is known by the corporate victim. Now, if I can find press about the incident, then I have a pretty high degree of confidence that someone has at least tried to notify the company involved (journos generally reach out for comment when writing about a breach), but often that's non-existent. So, too, are any public statements from the company, and I very often haven't seen any breach notifications sent to impacted individuals either (I usually have a slew of these forwarded to me after they're sent out). So, I attempt to get in touch, and this is where the pain begins.
I've written before on many occasions about how hard it can be to contact a company and disclose a breach to them. Often, contact details aren't easily discoverable; if they are, they may be for sales, customer support, or some other capacity that's used to getting bombarded with spam. Is it any wonder, then, that so many breach disclosures that I (and others) attempt to make end up going to the spam folder? I've heard this so many times before after a breach ends up in the headlines - "we did have someone try to reach out to us, but we thought it was junk" - which then often results in news of the incident going public before the company has had an opportunity to respond. That's not good for anyone; the breached firm is caught off-guard, they may very well direct their ire at the reporter, and it may also be that the underlying flaw remains unpatched, and now you've got a bunch more people looking for it.
An approach like security.txt is meant to fix this, and I'm enormously supportive of it, but in my experience, there are usually two problems:
That one instance was so exceptional that, honestly, I hadn't even looked for the file before asking the public for a security contact at the firm. Shame on me for that, but is it any wonder?
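For reference, a valid security.txt lives at /.well-known/security.txt and, per RFC 9116, needs little more than a contact and an expiry date; the values below are placeholders:

```txt
Contact: mailto:security@example.com
Expires: 2026-12-31T23:59:59.000Z
Preferred-Languages: en
```

Two lines of config is all it takes to give people like me a reliable route in, which makes its near-total absence all the more frustrating.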
Once I do manage to make contact, I'd say about half the time, the organisation is good to deal with. They often already know of HIBP and are already using it themselves for domain searches. We've joked before (the company and I) that they're grateful for the service but never wanted to hear from me!
The other half of the time, the response borders on open hostility. In one case that comes to mind, I got an email from their lawyer after finally tracking down a C-suite tech exec via LinkedIn and sending them a message. It wasn't threatening, but I had to go through a series of to-and-fro explaining what HIBP was, why I had their data and how the process usually unfolded. When in these positions, I find myself having to try and talk up the legitimacy of my service without sounding conceited, especially as it relates to publicly documented relationships with law enforcement agencies. It's laborious.
My approach during disclosure usually involves laying out the facts, pointing out where data has been published, and offering to provide the data to the impacted organisation if they can't obtain it themselves. I then ask about their timelines for notifying impacted customers and welcome their commentary to be included in the HIBP notifications sent to our subscribers. This last point is where things get more interesting, so let's talk about breach notifications.
This is perhaps one of my greatest bugbears right now and whilst the title will give you a pretty good sense of where I'm going, the nuances make this particularly interesting.
I suggest that most of us believe that if your personal information is compromised in a data breach, you'll be notified following this discovery by the organisation responsible for the service. Whether it's one day, one week, or even a month later isn't really the issue; frankly, any of these time frames would be a good step forward from where we frequently find ourselves. But constantly, I'm finding that companies are taking the position of consciously not notifying individuals at all. Let me give you a handful of examples:
During the disclosure process of a recent breach, it turned out the organisation was already aware of the incident and had taken "appropriate measures" (their term was something akin to that, vague enough to avoid saying what had actually been done, but, uh, "something" had been done). When pressed for a breach notice that would go to their customers, they advised they wouldn't be sending one as the incident had occurred more than 6 months ago. That stunned me - the outright admission that they wouldn't be communicating this incident - and in case you're thinking "this would never be allowed under GDPR", the company was HQ'd well within that scope, being based in a major European city.
Another one that I need to be especially vague about (for reasons that will soon become obvious), involved a sizeable breach of customer data with the folks exposed inhabiting every corner of the globe. During my disclosure to them, I pushed them on a timeline for notifying victims and found their responses to be indirect but almost certainly indicating they'd never speak publicly about it. Statements to the effect of "we'll send notifications where we deem we're legally obligated to", which clearly left it up to them to make the determination. I later learned from a contact close to the incident that this particular organisation had an impending earnings call and didn't want the market to react negatively to news of a breach. "Uh, you know that's a whole different thing if they deliberately cover that up, right?"
An important point to make here, though, is that when it comes to companies themselves disclosing they've been breached, disclosure to individuals is often not what people think it is. In the various regulatory regimes we have across the globe, the legal requirement often stops at notifying the regulator and does not extend to notifying the individual victims. This surprises many people, and I constantly hear the rant of "But I'm in [insert your country here], and we have laws that demand I'm notified!" No, you almost certainly don't... but you should. We all should.
You can see further evidence by looking at recent Form 8-K SEC filings in the US. There are many examples of filings from companies that never notified the individuals themselves, yet here, you'll clearly see disclosure to the regulator. The breach is known, it's been reported in the public domain, but good luck ever getting an email about it yourself.
During one disclosure, I had the good fortune of a very close friend of mine working for the company involved in an infosec capacity. They were clearly stalling, being well over a week from my disclosure yet no public statements or notices to impacted individuals. I had a quiet chat with my contact, who explained it as follows:
Mate, it's a room full of lawyers working out how to spin this
Meanwhile, millions of records of customer data were in the hands of criminals, and every hour that went by was another hour victims went without any knowledge whatsoever that their personal info had been exposed. And as much as it pains me to say this, I get it: the company's priority is the company or, more specifically, the shareholders. That's who the board is accountable to, and maintaining the corporate reputation and profitability of the firm is their number one priority.
I see this all the time in post-breach communication too. One incident that comes to mind was the result of some egregiously stupid technical decisions. Once that breach hit the press, the CEO immediately went on the offence. Blame was laid firstly at those who obtained the data, then at me for my reporting of the incident (my own disclosure was absolutely "by the book").
I'm talking about class actions. I wrote about my views on this a few years ago and nothing has changed, other than it getting worse. I regularly hear from data breach victims who want compensation for the impact a breach has had on them, yet when pushed, most struggle to explain why. We've had multiple recent incidents in Australia where drivers' licences have been exposed and required reissuing, which is usually a process of going to a local transport office and waiting in a queue. "Are you looking to be compensated for your time?", I asked one person. We have to renew our licences every 5 years anyway, so would you pro-rata that time based on the hourly value of your time and when you were due to be back in there anyway? And if there has been identity theft, was it from the breach you're now seeking compensation for? Or the other ones (both known and unknown) from which your data was taken?
Lawyers are a big part of the problem, and I still regularly hear from them seeking product placement on HIBP. What a time and a place to cash in if you could get your class action pitch right there in front of people at the moment they learn they were in a breach!
Frankly, I don't care too much about individuals getting a few bucks in compensation (and it's only ever a few), and I also don't even care about lawyers doing lawyer things. But I do care about the adverse consequences it has on the corporate victims, as it makes my job a hell of a lot harder when I'm talking to a company that's getting ready to get sued because of the information I've just disclosed to them.
These are all intertwined problems without single answers. But there are some clear paths forward:
Firstly, and this seems so obvious that it's frankly ridiculous I need to write it, but there should always be disclosure to individual victims. This may not need to be with the same degree of expeditiousness as disclosure to the regulator, but it has to happen. It is a harder problem for businesses; submitting a form to a gov body can be infinitely easier than emailing potentially hundreds of millions of breached customers. However, it is, without any doubt, the right thing to do and there should be legal constructs that mandate it.
Simultaneously providing protection from frivolous lawsuits where no material harm can be demonstrated, and throwing the book at firms who deliberately conceal breaches, also seems reasonable. No company is ever immune from a breach, and so frequently, it occurs not due to malicious behaviour by the organisation but a series of often unfortunate events. Ambitious lawyers shouldn't be in a position where they can make hell for a company at their worst possible hour unless there is significant harm and negligence that can be clearly attributed back to the incident.
And then there's all the periphery stuff that pours fuel on the current dumpster fire. The aforementioned beg bounties that cause companies to be suspicious of even the most genuine disclosures, for example. On the other hand, the standoffish behaviour of many organisations receiving reports from folks who just want to see incidents disclosed. And on the flip side again, the number of people occupying that grey zone of "security researcher / extortionist" who provoke exactly those behaviours. It's a mess, and writing it down like this makes it abundantly apparent how many competing objectives there are.
I don't see anything changing any time soon, and anecdotally, it's worse now than it was 5 or 10 years ago. In part, I suspect that's due to how all those undesirable behaviours I described above have evolved over time, and in part I also believe the increasing complexity of external dependencies is driving this. How many breaches have we seen in just the last year that can be attributed to "a third party"? I quote that term because it's often used by organisations who've been breached as though it somehow absolves them of some responsibility; "it wasn't us who was breached, it was those guys over there". Of course, it doesn't work that way, and more external dependencies lead to more points of failure, all of which you're still accountable for even if you've done everything else right.
Ah well, as I often end up lamenting, it's a fascinating time to be in the industry 🤷♂️
I hate having to use that word - "alleged" - because it's so inconclusive and I know it will leave people with many unanswered questions. (Edit: 12 days after publishing this blog post, it looks like the "alleged" caveat can be dropped, see the addition at the end of the post for more.) But sometimes, "alleged" is just where we need to begin and over the course of time, proper attribution is made and the dots are joined. We're here at "alleged" for two very simple reasons: one is that AT&T is saying "the data didn't come from us", and the other is that I have no way of proving otherwise. But I have proven, with sufficient confidence, that the data is real and the impact is significant. Let me explain:
Firstly, just as a primer if you're new to this story, read BleepingComputer's piece on the incident. What it boils down to is in August 2021, someone with a proven history of breaching large organisations posted what they claimed were 70 million AT&T records to a popular hacking forum and asked for a very large amount of money should anyone wish to purchase the data. From that story:
From the samples shared by the threat actor, the database contains customers' names, addresses, phone numbers, Social Security numbers, and date of birth.
Fast forward two and a half years, and the successor to this forum saw a post this week claiming to contain the entire corpus of data. Except that rather than put it up for sale, someone decided to just dump it all publicly and make it easily accessible to the masses. This isn't unusual: "fresh" data has much greater commercial value and is often tightly held for a long period before being released into the public domain. The Dropbox and LinkedIn breaches, for example, occurred in 2012 but weren't broadly distributed until 2016, and just like those incidents, the alleged AT&T data is now in very broad circulation. It is undoubtedly in the hands of thousands of internet randos.
AT&T's position on this is pretty simple:
AT&T continues to tell BleepingComputer today that they still see no evidence of a breach in their systems and still believe that this data did not originate from them.
The old adage of "absence of evidence is not evidence of absence" comes to mind (just because they can't find evidence of it doesn't mean it didn't happen), but as I said earlier on, I (and others) have so far been unable to prove otherwise. So, let's focus on what we can prove, starting with the accuracy of the data.
The linked article talks about the author verifying the data with various people he knows, as well as other well-known infosec identities verifying its accuracy. For my part, I've got 4.8M Have I Been Pwned (HIBP) subscribers I can lean on to assist with verification, and it turns out that 153k of them are in this data set. What I'll typically do in a scenario like this is reach out to the 30 newest subscribers (people who will hopefully recall the nature of HIBP from their recent memory) and ask them if they're willing to assist. I linked to the story from the beginning of this blog post and got a handful of willing respondents, to whom I sent their data and asked two simple questions:
The first reply I received was simple, but emphatic:
This individual had their name, phone number, home address and, most importantly, their social security number exposed. Per the linked story, social security numbers and dates of birth exist on most rows of the data in encrypted format, but two supplemental files expose these in plain text. Taken at face value, it looks like whoever snagged this data also obtained the private encryption key and simply decrypted the vast bulk (but not all) of the protected values.
The above example simply didn't have plain text entries for the encrypted data. Just by way of raw numbers, the file that aligns with the "70M" headline actually has 73,481,539 lines with 49,102,176 unique email addresses. The file with decrypted SSNs has 43,989,217 lines and the decrypted dates of birth file only has 43,524 rows. (Edit: the reason for this later became clear - there is only one entry per date of birth which is then referenced from multiple records.) The last file, for example, has rows that look just like this:
.encrypted_value='*0g91F1wJvGV03zUGm6mBWSg==' .decrypted_value='1996-07-18'
That encrypted value is precisely what appears in the large file hence providing an easy way of matching all the data together. But those numbers also obviously mean that not every impacted individual had their SSN exposed, and most individuals didn't have their date of birth leaked. (Edit: per above, the same entries in the DoB file are referenced by multiple source records so whilst not every record had a DoB recorded, the difference isn't as stark as I originally reported.)
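The matching just described amounts to a simple key join: load the supplemental file into a dictionary keyed on the encrypted token, then look each main-file record up in it. A sketch, with the line format simplified from the actual dump:

```python
import re

# Each supplemental line looks roughly like:
# .encrypted_value='*0g91F1wJvGV03zUGm6mBWSg==' .decrypted_value='1996-07-18'
LINE = re.compile(r"\.encrypted_value='([^']*)' \.decrypted_value='([^']*)'")


def build_decryption_map(lines):
    """Map encrypted token -> decrypted plain-text value."""
    mapping = {}
    for line in lines:
        m = LINE.search(line)
        if m:
            mapping[m.group(1)] = m.group(2)
    return mapping


def resolve(record_token, mapping):
    """Return the plain-text value for a record's encrypted field,
    or None if it was never decrypted (not every record was)."""
    return mapping.get(record_token)
```

Because multiple records reference the same encrypted token, one entry in the supplemental file can resolve many rows in the main file, which is exactly why the raw line counts of the two files differ so much.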
As I'm fond of saying, there's only one thing worse than your data appearing on the dark web: it's appearing on the clear web. And that's precisely where it is; the forum this was posted to isn't within the shady underbelly of a Tor hidden service, it's out there in plain sight on a public forum easily accessed by a normal web browser. And the data is real.
That last response is where most people impacted by this will now find themselves - "what do I do?" Usually I'd tell them to get in touch with the impacted organisation and request a copy of their data from the breach, but if AT&T's position is that it didn't come from them, then they may not be much help. (Although if you are a current or previous customer, you can certainly request a copy of your personal information regardless of this incident.) I've personally also used identity theft protection services since as far back as the '90s, simply to know when actions such as credit enquiries appear against my name. In the US, this is what services like Aura do, and it's become common practice for breached organisations to provide identity protection subscriptions to impacted customers (full disclosure: Aura is a previous sponsor of this blog, although we have no ongoing or upcoming commercial relationship).
What I can't do is send you your breached data, or an indication of what fields you had exposed. Whilst I did this in that handful of aforementioned cases as part of the breach verification process, it's something that happens entirely manually and is infeasible en masse. HIBP only ever stores email addresses and never the additional fields of personal information that appear in data breaches. In case you're wondering why that is, we got a solid reminder only a couple of months ago when a service making this sort of data available to the masses had an incident that exposed tens of billions of rows of personal information. That's just an unacceptable risk for which the old adage of "you cannot lose what you do not have" provides the best possible fix.
As I said in the intro, this is not the conclusive end I wanted for this blog post... yet. As impacted HIBP subscribers receive their notifications, and particularly as those monitoring domains learn of the aliases in the breach (many domain owners use unique aliases per service they sign up to), we may see a more conclusive outcome to this incident. That may not necessarily be confirmation that the data did indeed originate from AT&T; it could be that it came from a third party processor they use, or from another entity altogether. The truth is somewhere there in the data; I'll add any relevant updates to this blog post if and when it comes out.
As of now, all 49M impacted email addresses are searchable within HIBP.
Edit (31 March): AT&T have just released a short statement making 2 important points:
AT&T data-specific fields were contained in a data set
it is not yet known whether the data in those fields originated from AT&T or one of its vendors
They've also been mass-resetting account passcodes after TechCrunch apparently alerted AT&T to the presence of these in the data set. That article also includes the following statement from AT&T:
Based on our preliminary analysis, the data set appears to be from 2019 or earlier, impacting approximately 7.6 million current AT&T account holders and approximately 65.4 million former account holders
Between originally publishing this blog post and AT&T's announcements today, there have been dozens of comments left below that attribute the source of the breach to AT&T in ways that made it increasingly unlikely the data could have been sourced from anywhere else. I know that many journos (myself included) reached out to folks in AT&T to draw their attention to this, so I'm happy to now end this blog post by quoting myself from the opening para 😊
But sometimes, "alleged" is just where we need to begin and over the course of time, proper attribution is made and the dots are joined.
Ever hear one of those stories where as it unravels, you lean in ever closer and mutter “No way! No way! NO WAY!” This one, as far as infosec stories go, had me leaning and muttering like never before. Here goes:
Last week, someone reached out to me with what they claimed was a Spoutible data breach obtained by exploiting an enumerable API. Just your classic case of putting someone else's username in the URL and getting back data about them, which at first glance I assumed was another scraping situation like we recently saw with Trello. They sent me a file with 207k scraped records and a URL that looked like this:
https://spoutible.com/sptbl_system_api/main/user_profile_box?username=troyhunt
But they didn't send me my account, in fact I didn't even have an account at the time and if I'm honest, I had to go and look up exactly what Spoutible was. The penny dropped as I read into it: Spoutible emerged in the wake of Elon taking over Twitter, which left a bunch of folks unhappy with their new social overlord so they sought out alternate platforms. Mastodon and Bluesky were popular options, Spoutible was another which was clearly intended to be an alternative to the incumbent.
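To make "enumerable" concrete: the only key to anyone's record was their public username, substituted straight into the URL shown above. A minimal sketch (the request loop is illustrative; no authentication of any kind was required):

```python
# Sketch of what an "enumerable" API means: swap in any username, get their record.
# The base URL matches the endpoint shown above; the rest is illustrative.
from urllib.parse import quote

BASE = "https://spoutible.com/sptbl_system_api/main/user_profile_box"

def profile_url(username):
    # The only "key" to someone else's record is their public username.
    return f"{BASE}?username={quote(username)}"

# An attacker simply iterates over a word list of usernames - no auth token needed.
for name in ["troyhunt", "rosetta"]:
    url = profile_url(name)
    # data = requests.get(url).json()  # each call returned the full private record
```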
In order to unravel this saga in increasing increments of "no way!" reactions, let's just start with the basics of what that API endpoint was returning:
{
err_code: 0,
status: 200,
user: {
id: 735525,
username: "troyhunt",
fname: "Troy",
lname: "Hunt",
about: "Creator of Have I Been Pwned. Microsoft Regional Director. Pluralsight author. Online security, technology and “The Cloud”. Australian.",
Pretty standard stuff and I'd expect any of the major social platforms to do exactly the same thing. Name, username, bio and ID are all the sorts of data attributes you'd expect to find publicly available via an API or rendered into the HTML of the website. These fields, however, are quite different:
email: "[redacted]",
ip_address: "[redacted]",
verified_phone: "[redacted]",
gender: "M",
Ok, that's now a "no way!" because I had no expectation at all of any of that data being publicly available (note: phone number is optional, I chose to add mine). It's certainly not indicated on the pages where I entered it:
But it's also not that different to previous scraping incidents; the aforementioned Trello scrape exposed the association of email addresses to usernames and the Facebook scrape of a few years ago did the same thing with phone numbers. That's not unprecedented, but this is:
password: "$2y$10$B0EhY/bQsa5zUYXQ6J.NkunGvUfYeVOH8JM1nZwHyLPBagbVzpEM2",
No way! Is it... real? Is that genuinely a bcrypt hash of my own password? Yep, that's exactly what it is:
The Spoutible API enabled any user to retrieve the bcrypt hash of any other user's password.
I had to check, double check then triple check to make sure this was the case because I can only think of one other time I've ever seen an API do this...
<TangentialStory>
During my 14 years at Pfizer, I once reviewed an iOS app built for us by a low-cost off-shored development shop. I proxied the app through Fiddler, watched the requests and found an API that was returning every user record in the system and for each user, their corresponding password in plain text. When quizzing the developers about this design decision, their response was - and I kid you not, this isn't made up - "don't worry, our users don't use Fiddler" 🤦♂️
</TangentialStory>
I cannot think of any reason ever to return any user's hashed password to any interface, including an appropriately auth'd one where only the user themselves would receive it. There is never a good reason to do this. And even though bcrypt is the accepted algorithm of choice for storing passwords these days, it's far from uncrackable as I showed 7 years ago now after the Cloudpets breach. Here I used a small dictionary of weak, predictable passwords and easily cracked a bunch of the hashes. Weak passwords like... "spoutible". Wondering just how crazy things would get, I checked the change password page and found I could easily create a password of 6 or more characters (so long as it didn't exceed 20 characters) with no checks on strength whatsoever:
Strong hashing algorithms like bcrypt are weakened when poor password choices are allowed and strong ones (such as passwords longer than 20 characters) are blocked. For exactly the same reason breached services advise customers to change their passwords even when hashed with a strong algorithm, all Spoutible users are now in the same boat - change your password!
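The dictionary check described above is about as simple as attacks get. Here's a sketch of the idea; note that SHA-256 stands in for bcrypt purely so the example runs with only the standard library (real cracking uses bcrypt, where each guess costs on the order of tens of milliseconds at cost factor 10 - which is exactly why weak candidates like "spoutible" still fall quickly while strong passwords don't):

```python
# Sketch of a dictionary attack against leaked password hashes.
# SHA-256 is a stand-in for bcrypt here so the example is self-contained;
# the principle (hash each guess, compare to the leak) is identical.
import hashlib

def hash_password(pw):
    return hashlib.sha256(pw.encode()).hexdigest()

def crack(target_hash, wordlist):
    """Try each candidate; return the first one whose hash matches, else None."""
    for candidate in wordlist:
        if hash_password(candidate) == target_hash:
            return candidate
    return None

leaked = hash_password("spoutible")  # pretend this came out of the API
print(crack(leaked, ["123456", "password", "spoutible"]))  # spoutible
```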
But fortunately these days many people make use of 2 factor authentication to protect against account takeover attacks where the adversary knows the password. Which brings us to the next piece of data the API returned:
2fa_secret: "7GIVXLSNKM47AM4R",
2fa_enabled_at: "2024-02-03 02:26:11",
2fa_backup_code: "$2y$10$6vQRDRDHVjyZdndGUEKLM.gmIIZVDq.E5NWTWti18.nZNQcqsEYki",
Oh wow! Why?! Let's break this down and explore both the first and last line. The 2FA secret is the seed that's used to generate the one time password to be used as the second factor. If you - as an attacker - know this value then 2FA is rendered useless. To test that this was what it looked like, I asked Stefán to retrieve my data from the public API, take the 2FA secret and send me the OTP:
It was a match. If Stefán could have cracked my bcrypted password hash (and he's a smart guy so "spoutible" would have definitely been in his word list), he could have then passed the second factor challenge. And the 2FA backup code? Thinking that would also be exactly what it looked like, I'd screen grabbed it when enabling 2FA:
Now, using the same bcrypt hash checker as I did for the password, here's what I found:
What I just don't get is if you're going to return the 2FA secret anyway, why bother bcrypting the backup code? And further, it's only a 6 digit number, do you know how long it takes to crack a bcrypted 6 digit number? Let's find out:
570075, 2m59s
— Martin Sundhaug (@sundhaug92@mastodon.social) (@sundhaug92) February 4, 2024
Many other people worked it out in single-digit minutes as well, but Martin did it fastest at the time of writing so he gets the shout-out 😊
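And to make concrete just how little stands between an exposed 2FA seed and a valid one-time password, here's a minimal RFC 6238 TOTP implementation using nothing but the standard library (this is a sketch of the standard algorithm, not Spoutible's code):

```python
# Minimal RFC 6238 TOTP: HMAC-SHA1 over the current 30-second time counter.
import base64
import hmac
import struct
import time

def totp(secret_b32, for_time=None, digits=6, step=30):
    """Generate a TOTP code from a base32 seed like the one the API leaked."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((time.time() if for_time is None else for_time) // step)
    digest = hmac.new(key, struct.pack(">Q", counter), "sha1").digest()
    offset = digest[-1] & 0x0F  # dynamic truncation per RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# The leaked seed from the API response is all that's needed for a valid OTP.
print(totp("7GIVXLSNKM47AM4R"))
```

That's it: a dozen lines, and the second factor is gone.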
You know how I said you'd keep leaning in further and further? Yeah, we're not done yet because then I found this:
em_code: "c62fcf3563dc3ab38d52ba9ddb37f9b1577d1986"
Maybe I've just seen too many data breaches before, but as vague as this looks I had a really good immediate hunch of what it was but just to be sure, I logged out and went to the password reset page:
Leaning in far enough now, anticipating what's going to happen next? Yep, it's exactly what you thought:
NO WAY! Exposed password reset tokens meant that anyone could immediately takeover anyone else's account 🤯
After changing the password, no notification email was sent to the account holder. So, just to make things even worse, if someone's account was taken over using this technique, they'd have absolutely no idea until they either realised their original password no longer worked or their account started spouting weird messages. There's also no way to see if there are other active sessions, for example the way Twitter shows them:
Further, changing the password doesn't invalidate existing sessions so as best as I can tell, if someone has successfully accessed someone else's Spoutible account there's no way to know and no way to boot them out again. That's going to make recovering from this problematic unless Spoutible has another mechanism to invalidate all active sessions.
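For what it's worth, one common mitigation is to version sessions against the credential that created them, so a password change implicitly kills everything issued beforehand. A rough sketch of the idea (all names here are hypothetical, not Spoutible's implementation):

```python
# Sketch: invalidate all existing sessions whenever the password changes,
# by comparing each session's issue time to the last password change.
import time

class Account:
    def __init__(self):
        self.password_changed_at = time.time()

class Session:
    def __init__(self, account):
        self.account = account
        self.issued_at = time.time()

    def is_valid(self):
        # Any session issued before the last password change is dead on arrival.
        return self.issued_at >= self.account.password_changed_at

acct = Account()
old_session = Session(acct)
time.sleep(0.01)
acct.password_changed_at = time.time()  # victim resets their password
print(old_session.is_valid())  # False
```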
The one saving grace is that the token was rotated after reset so you can't use the one in the image above, but of course the new one was now publicly exposed in the API! And there's no 2FA challenge on password reset either but of course even if there was, well, you already read this far so you know how that could have been easily circumvented.
There's just one more "oh wow!" remaining, and it's the ease with which the vulnerable API was found. Spoutible has a feature called Pods and when you browse to that page, people listening to the pod are displayed with the ability to hover over their profile and display further information. For example, here's Rosetta and if we watch the request that's made in the dev tools...
By design, all the personal information including email and IP address, phone number, gender, bcrypt hashed password, 2FA secret and backup code and the code that can be immediately used to reset the password is returned to every single person that uses this feature. How many times has this API spouted troves of personal data out to people without them even knowing? Who knows, but I do know it wasn't the only API doing that because the one that listed the pods also did it:
Because the vulnerable APIs were requested organically as a natural part of using the service as intended, Spoutible almost certainly won't be able to fully identify abuse of them. To use the definition of the infamous Missouri governor who recently attempted to prosecute a journalist for pressing F12, everyone who used those features inadvertently became a hacker.
Just one last finding, and I've not been able to personally validate it, so let's keep it out of "oh wow!" scope: the individual that sent me the data and details of the vulnerability said that the exposed data includes access tokens for other platforms. A couple of months ago, Spoutible announced cross-posting to Mastodon and Bluesky, and my own data does have a "cross_posting_auth" node, albeit set to null. I couldn't see anywhere within the UI to enable this feature, but there are profiles with values in there. During the disclosure process (more on that soon), Spoutible did say that those values were encrypted and, without evidence of a private key compromise, they believe they're safe.
Here's my full record as it was originally returned by the vulnerable API:
To be as charitable as possible to Spoutible, you could argue that this is largely just one vulnerability: the inadvertent exposure of internal data via a public API. This is data that has a legitimate purpose in their system and it may simply be a case of a framework automatically picking up all entity attributes from the data tier and returning them via the API. But it's the circumstances that allowed this to happen, and then exacerbated the problem when it did, that concern me more: clearly there was no security review around this feature because it was so easily discoverable (at least not whilst it was live), nor has any thought been put into notifying people of potential account takeovers or providing them with the means to invalidate other sessions. Then there are peripheral issues such as very weak password rules that make cracking bcrypt so much easier, weak 2FA backup codes and pointless bcrypting of them. Not major issues in and of themselves, but they amplify the problems the exposed data presents.
Clearly this required disclosure before publication; unfortunately, Spoutible doesn't publish a security.txt file, so I went directly to the founder Christopher Bouzy on both Twitter and email (obviously I could have reached out on Spoutible, but he's very active on Twitter and my profile has more credibility there than a brand new Spoutible account would). Here's the timeline, all AEST:
To give credit where it's due, Spoutible's response time was excellent. In the space of only about 4 hours, the data returned by the API had a huge number of attributes trimmed off it and now aligns with what I'd expect to see (although the 207k previously scraped records obviously still contain all the data). I'll also add that Christopher's communication with me was commendable; he's clearly genuinely passionate about the platform and was dismayed to learn of the vulnerability. I've dealt with many founders whose projects have suffered data breaches and it's especially personal for them, having poured so much of themselves into it.
Here's their disclosure in its entirety:
The revised API is now returning over 80% less data and looks like this:
If you're a detail person, yes, the forward slashes are no longer escaped and the remaining fields are ordered slightly differently so it looks like the JSON encoder has changed. In case you're interested, here's a link to a diff between the two with a little bit of manipulation to make it easier to see precisely what's changed.
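The escaped-slash difference is cosmetic; both forms decode to the same string, which a quick check confirms (Python's json module is used here just for illustration; it happens to be an encoder that emits bare slashes, whereas PHP's json_encode escapes them by default unless JSON_UNESCAPED_SLASHES is set, which would be consistent with a change of encoder):

```python
import json

# "\/" and "/" decode to the same string; escaping the slash is optional in JSON.
escaped = json.loads(r'"https:\/\/spoutible.com"')
bare = json.loads('"https://spoutible.com"')
print(escaped == bare)  # True

# Python's encoder emits bare slashes:
print(json.dumps("https://spoutible.com"))  # "https://spoutible.com"
```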
As to my own advice to Spoutible users, here are the actions I'd recommend:
The 207k exposed email addresses that were sent to me are now searchable in Have I Been Pwned and my impacted subscribers have received email notifications.
Imagine you wanted to buy some shit on the internet. Not the metaphorical kind in terms of "I bought some random shit online", but literal shit. Turds. Faeces. The kind of thing you never would have thought possible to buy online until... Shitexpress came along. Here's a service that enables you to send an actual piece of smelly shit to "An irritating colleague. School teacher. Your ex-wife. Filthy boss. Jealous neighbour. That successful former classmate. Or all those pesky haters." But it would be weird if the intended recipient of the aforementioned shit knew it came from you, so, Shitexpress makes a bold commitment:
100% anonymous! Not 90%, not 95% but the full whack 100%! And perhaps they really did deliver on that promise, at least until one day last year:
New sensitive breach: Faeces delivery service Shitexpress had 24k email addresses breached last week. Data also included IP and physical addresses, names, and messages accompanying the posted shit. 76% were already in @haveibeenpwned. Read more: https://t.co/7R7vdi1ftZ
— Have I Been Pwned (@haveibeenpwned) August 16, 2022
When you think about it now, the simple mechanics of purchasing either metaphorical or literal shit online dictates collecting information that, if disclosed, leaves you anything but anonymous. At the very least, you're probably going to provide your own email address, your IP will be logged somewhere and payment info will be provided that links back to you (Bitcoin was one of many payment options and is still frequently traceable to an identity). Then of course if it's a physical good, there's a delivery address although in the case above, that's inevitably not going to be the address of the purchaser (sending yourself shit would also just be weird). Which is why following the Shitexpress data breach, we can now easily piece together information such as this:
Here we have an individual who one day last year, went on an absolute (literal) shit-posting bender posting off half a dozen boxes of excrement to heavy hitters in the US justice system. For 42 minutes, this bright soul (whose IP address was logged with each transaction), sent abusive messages from their iPhone (the user agent is also in the logs) to some of the most powerful people in the land. Did they only do this on the assumption of being "100% anonymous"? Possibly; it certainly doesn't seem like the sort of activity you'd want to put your actual identity to, but hey, here we are. Who knows if there were any precautions taken by this individual to use an IP that wasn't easily traceable back to them, but that's not really the point; an attribute that will very likely be tied back to a specific individual if required was captured, stored and then leaked. IP not enough to identify someone? Hmmm... I wonder what other information might be captured during a purchase...
Uh, yeah, that's all pretty personally identifiable! And there are nearly 10k records in the "invoices_stripe.csv" file that include invoice IDs so if you paid by credit card, good luck not having that traced back to you (KYC obligations ain't real compatible with anonymously posting shit).
Now, where have we heard all this before? The promise of anonymity and data protection? Hmmm...
"Anonymous". "Discreet". That was July 2015, and we all know what happened next. It wasn't just the 30M+ members of the adultery website that were exposed in the breach, it was also the troves of folks who joined the service, thought better of it, paid to have their data deleted and then realised the "full delete" service, well, didn't. Why did they think their data would actually be deleted? Because the website told them it would be.
Vastaamo, the Finnish service referred to as "the McDonald's of psychotherapy", was very clear about the privacy of the data they collected:
Until a few years ago when the worst conceivable scenario was realised:
A security flaw in the company’s IT systems had exposed its entire patient database to the open internet—not just email addresses and social security numbers, but the actual written notes that therapists had taken.
What made the Vastaamo incident particularly insidious was that after failing to extract the ransom demand from the company itself, the perpetrator (for whom things haven't worked out so well this year), then proceeded to ransom the individuals:
If we do not receive this payment within 24 hours, you still have another 48 hours to acquire and send us 500 euros worth of Bitcoins. If we still don't receive our money after this, your information will be published: your address, phone number, social security number, and your exact patient report, which includes e.g. transcriptions of your conversations with the Receptionist's therapist/psychiatrist.
And then it was all dumped publicly anyway.
Here's what I'm getting at with all this:
Assurances of safety, security and anonymity aren't statements of fact, they're objectives, and they may not be achieved
I've written this post as I have so many others so that it may serve as a reference in the future. Time and time again, I see the same promises as above as though somehow words on a webpage are sufficient to ensure data security. You can trust those words just about as much as you can trust the promise of being able to choose the animal the excrement is sourced from, which turns out to be total horseshit 🐎
I want to try something new - bear with me here:
Data breach processing is hard and the hardest part of all is getting in touch with organisations and disclosing the incident before I load anything into Have I Been Pwned (HIBP). It's also something I do almost entirely in isolation, sitting here on my own trying to put the pieces together and work out what happened. I don't want to just chuck data into HIBP so that the first an organisation knows about it is angry customers smashing out their inbox; there's got to be a reasonable attempt from my side to get in touch, disclose and then coordinate on communication to impacted parties and the public at large. Very frequently, I end up reaching out publicly and asking for a security contact at the impacted company. I dislike doing this because it's a very public broadcast; regular followers easily read between the lines and draw precisely the correct conclusion before the organisation has had a chance to respond. And the vast majority of the time, nobody has a contact anyway, so a small handful of people trawl through the site and find obscure email addresses or look up employees on LinkedIn or similar. There has to be a better way.
Yesterday, I posted this tweet:
After I shared this, multiple people said "ah, but at least we have GDPR", as though that somehow fixes the problem. No, it doesn't, at least not in any absolute sense. Case in point: I'm now going through the disclosure process after someone sent me data from a company HQ'd well… https://t.co/yMYIlFXkCU
— Troy Hunt (@troyhunt) April 18, 2023
And around the same time I got to thinking about Twitter Subscriptions as a channel for communication with a much more carefully curated subset of the 214k people that follow my public feed. Tweets within a subscription are visible only to subscribers so the public broadcast problem goes away. (Of course, you'd always work on the assumption that a subscriber could take a tweet and share it more broadly, but the intention is to make content visible to a much smaller, more dedicated audience.) Issues around where to find contact details, verification of the breach, what's in it or all sorts of other discussions I'd rather not have with the masses prior to loading into HIBP can be had with a much more curated audience.
I don't know how well this will work and it's something I've come up with on a whim (hey, I'm nothing if not honest about it!). But that's also how HIBP started, and sometimes the best ideas just emerge out of gut feel. So, I set up the subscription and of the 3 pricing options Twitter suggested ($3, $5 or $10 per month), I went middle of the road and made it 5 bucks (that's American bucks, YMMV). You can sign up directly from the big "Subscribe" button on my Twitter profile or follow the link behind this text. Just one suggestion from Twitter's "welcome on board" email if you do:
Encourage your followers to Subscribe on the web. Web Subscriptions go through Stripe, which takes a 3% fee from each purchase, compared to the 30% fee that Apple and Google currently take. Meaning web Subscriptions may potentially lead to more money in your pocket.
My hope is that this subscription helps me have much more candid discussions about data breaches with people that are invested in following them than the masses that see my other tweets. I also hope it helps me go through this process feeling a little less isolated from the world and with the support of some of the great people I regularly engage with more publicly. If that's you, then give it a go and if it isn't floating your boat, cancel the subscription. I think there's something in this and I'd appreciate all the support I can get to help make it a worthwhile exercise.