We're now eyeball-deep into the HIBP rebrand and UX work, totally overhauling the image of the service as we know it. That said, a guiding principle has been to ensure the new looks is immediately recognisable and over months of work, I think we've achieved that. I'm holding off sharing anything until we're far enough down the road that we're confident in the direction we're heading, and then I want to invite the masses to contribute as we head towards a (re)launch.
Whilst I didn't talk about it in this week's video, let me just recap on why we're doing this: the decisions made for a pet project nearly 12 years ago now are very different to the decisions made for a mainstream service with so many dependencies on it today. We're at a point where we need more professionalism and cohesion and that's across everything from the website design and content, the branding on our formal documentation, the stickers I hand out all over the place, the swag we want to make and even the signatures on our emails. Our task is to keep the heart and soul of a humble community-first project whilst simultaneously making sure it actually looks like we know what we're doing 🙂
I think what's really scratching an itch for me with the home theatre thing is that it's this whole geeky world of stuff that I always knew was out there, but I'd just never really understood. For example, I mentioned waveforming in the video, and I'd never even heard of that let alone understood that there may be science where sound waves are smashed into each other in opposing directions in order to cancel each other out. And I'm sure I've got that completely wrong, but that's what's so fun about this! Anyway, that's all just part of the next adventure, and I hope you enjoy hearing about it and sending over your thoughts because I'm pretty sure there's a gazillion things I don't know yet 🙂
It's IoT time! We're embarking on a very major home project (more detail of which is in the video), and some pretty big decisions need to be made about a very simple device: the light switch. I love having just about every light in our connected... when it works. The house has just the right light early each morning, it transitions into daytime mode right at the perfect time based on the amount of solar radiation in the sky, into evening time courtesy of the same device and then blacks out when we go to bed. And some lights come on with movement based on motion sensors in fans (Big Ass fans have occupancy sensors), cameras (Ubiquiti camera raise motion events), and tiny dedicated Zigbee sensors. But getting the right physical switches in combination with the right IoT relays has been a bit more challenging. Listen to this week's show let me know if you have any "bright" ideas 🙂
We're heading back to London! And making a trip to Reykjavik. And Dublin. I talked about us considering this in the video yesterday, and just before publishing this post, we pulled the trigger and booked the tickets. The plan is to pretty much repeat the US and Canada trip we did in September and spend the time meeting up with some of the law enforcement agencies and various other organisations we've been working with over the years. As I say in the video, if you're in one of these locations and are in a position to stand up a meetup or user group session, I'd love to hear from you. Europe is a hell of a long way to go so we do want to make the most of the travel, stand by for more plans as they emerge.
It's hard to find a good criminal these days. I mean a really trustworthy one you can be confident won't lead you up the garden path with false promises of data breaches. Like this guy yesterday:
For my international friends, JB Hi-Fi is a massive electronics retailer down under and they have my data! I mean by design because I've bought a bunch of stuff from them, so I was curious not just about my own data but because a breach of 12 million plus people would be massive in a country of not much more than double that. So, I dropped the guy a message and asked if he'd be willing to help me verify the incident by sharing my own record. I didn't want to post any public commentary about this incident until I had a reasonable degree of confidence it was legit, not given how much impact it could have in my very own backyard.
Now, I wouldn't normally share a private conversation with another party, but when someone sets out to scam people, that rule goes out the window as far as I'm concerned. So here's where the conversation got interesting:
He guaranteed it for me! Sounds legit. But hey, everyone gets the benefit of the doubt until proven otherwise, so I started looking at the data. It turns out my own info wasn't in the full set, but he was happy to provide a few thousand sample records with 14 columns:
Pretty standard stuff, could be legit, let's check. I have a little Powershell script I run against the HIBP API when a new alleged breach comes in and I want to get a really good sense of how unique it is. It simply loops through all the email addresses in a file, checks which breaches they've been in and keeps track of the percentage that have been seen before. A unique breach will have anywhere from about 40% to 80% previously seen addresses, but this one had, well, more:
Spot the trend? Every single address has one breach in common. Hmmm... wonder what the guy has to say about that?
But he was in the server! And he grabbed it from the dashboard of Shopify! Must be legit, unless... what if I compared it to the actual full breach of Dymocks? That's a local Aussie bookseller (so it would have a lot of Aussie-looking email addresses in it, just like JB Hi-Fi would), and their breach dated back to mid-2023. I keep breaches like that on hand for just such occasions, let's compare the two:
Wow! What are the chances?! He's going to be so interested when he hears about this!
And that was it. The chat went silent and very shortly after, the listing was gone:
It looks like the bloke has also since been booted off the forum where he tried to run the scam so yeah, this one didn't work out great for him. That $16k would have been so tasty too!
I wrote this short post to highlight how important verification of data breach claims is. Obviously, I've seen loads of legitimate ones but I've also seen a lot of rubbish. Not usually this blatant where the party contacting me is making such demonstrably false claims about their own exploits, but very regularly from people who obtain something from another party and repeat the lie they've been told. This example also highlights how useful data from previous breaches is, even after the email addresses have been extracted and loaded into HIBP. Data is so often recycled and shipped around as something new, this was just a textbook perfect case of making use of a previous incident to disprove a new claim. Plus, it's kinda fun poking holes in a scamming criminal's claims 😊
If I'm honest, I was in two minds about adding additional stealer logs to HIBP. Even with the new feature to include the domains an email address appears against in the logs, my concern was that I'd get a barrage of "that's useless information" messages like I normally do when I load stealer logs! Instead, the feedback was resoundingly positive. This week I'm talking more about the logic behind this, some of the challenges we faced with it and what we might see in the future. Stay tuned, because I think we're going to be seeing a lot more of this in HIBP.
TL;DR — Email addresses in stealer logs can now be queried in HIBP to discover which websites they've had credentials exposed against. Individuals can see this by verifying their address using the notification service and organisations monitoring domains can pull a list back via a new API.
Nasty stuff, stealer logs. I've written about them and loaded them into Have I Been Pwned (HIBP) before but just as a recap, we're talking about the logs created by malware running on infected machines. You know that game cheat you downloaded? Or that crack for the pirated software product? Or the video of your colleague doing something that sounded crazy but you thought you'd better download and run that executable program showing it just to be sure? That's just a few different ways you end up with malware on your machine that then watches what you're doing and logs it, just like this:
These logs all came from the same person and each time the poor bloke visited a website and logged in, the malware snared the URL, his email address and his password. It's akin to a criminal looking over his shoulder and writing down the credentials for every service he's using, except rather than it being one shoulder-surfing bad guy, it's somewhat larger than that. We're talking about billions of records of stealer logs floating around, often published via Telegram where they're easily accessible to the masses. Check out Bitsight's piece titled Exfiltration over Telegram Bots: Skidding Infostealer Logs if you'd like to get into the weeds of how and why this happens. Or, for a really quick snapshot, here's an example that popped up on Telegram as I was writing this post:
As it relates to HIBP, stealer logs have always presented a bit of a paradox: they contain huge troves of personal information that by any reasonable measure constitute a data breach that victims would like to know about, but then what can they actually do about it? What are the websites listed against their email address? And what password was used? Reading the comments from the blog post in the first para, you can sense the frustration; people want more info and merely saying "your email address appeared in stealer logs" has left many feeling more frustrated than informed. I've been giving that a lot of thought over recent months and today, we're going to take a big step towards addressing that concern:
The domains an email address appears next to in stealer logs can now be returned to authorised users.
This means the guy with the Gmail address from the screen grab above can now see that his address has appeared against Amazon, Facebook and H&R Block. Further, his password is also searchable in Pwned Passwords so every piece of info we have from the stealer log is now accessible to him. Let me explain the mechanics of this:
Firstly, the volumes of data we're talking about are immense. In the case of the most recent corpus of data I was sent, there are hundreds of text files with well over 100GB of data and billions of rows. Filtering it all down, we ended up with 220 million unique rows of email address and domain pairs covering 69 million of the total 71 million email addresses in the data. The gap is explained by a combination of email addresses that appeared against invalidly formed domains and in some cases, addresses that only appeared with a password and not a domain. Criminals aren't exactly renowned for dumping perfectly formed data sets we can seamlessly work with, and I hope folks that fall into that few percent gap understand this limitation.
So, we now have 220 million records of email addresses against domains, how do we surface that information? Keeping in mind that "experimental" caveat in the title, the first decision we made is that it should only be accessible to the following parties:
At face value it might look like that first point deviates from the current model of just entering an email address on the front page of the site and getting back a result (and there are very good reasons why the service works this way). There are some important differences though, the first of which is that whilst your classic email address search on HIBP returns verified breaches of specific services, stealer logs contain a list of services that have never have been breached. It means we're talking about much larger numbers that build up far richer profiles; instead of a few breached services someone used, we're talking about potentially hundreds of them. Secondly, many of the services that appear next to email addresses in the stealer logs are precisely the sort of thing we flag as sensitive and hide from public view. There's a heap of Pornhub. There are health-related services. Religious one. Political websites. There are a lot of services there that merely by association constitute sensitive information, and we just don't want to take the risk of showing that info to the masses.
The second point means that companies doing domain searches (for which they already need to prove control of the domain), can pull back the list of the websites people in their organisation have email addresses next to. When the company controls the domain, they also control the email addresses on that domain and by extension, have the technical ability to view messages sent to their mailbox. Whether they have policies prohibiting this is a different story but remember, your work email address is your work's email address! They can already see the services sending emails to their people, and in the case of stealer logs, this is likely to be enormously useful information as it relates to protecting the organisation. I ran a few big names through the data, and even I was shocked at the prevalence of corporate email addresses against services you wouldn't expect to be used in the workplace (then again, using the corp email address in places you definitely shouldn't be isn't exactly anything new). That in itself is an issue, then there's the question of whether these logs came from an infected corporate machine or from someone entering their work email address into their personal device.
I started thinking more about what you can learn about an organisation's exposure in these logs, so I grabbed a well-known brand in the Fortune 500. Here are some of the highlights:
That said, let me emphasise a critical point:
This data is prepared and sold by criminals who provide zero guarantees as to its accuracy. The only guarantee is that the presence of an email address next to a domain is precisely what's in the stealer log; the owner of the address may never have actually visited the indicated website.
Stealer logs are not like typical data breaches where it's a discrete incident leading to the dumping of customers of a specific service. I know that the presence of my personal email address in the LinkedIn and Dropbox data breaches, for example, is a near-ironclad indication that those services exposed my data. Stealer logs don't provide that guarantee, so please understand this when reviewing the data.
The way we've decided to implement these two use cases differs:
We'll make the individual searches cleaner in the near future as part of the rebrand I've recently been talking about. For now, here's what it looks like:
Because of the recirculation of many stealer logs, we're not tracking which domains appeared against which breaches in HIBP. Depending on how this experiment with stealer logs goes, we'll likely add more in the future (and fill in the domain data for existing stealer logs in HIBP), but additional domains will only appear in the screen above if they haven't already been seen.
We've done the searches by domain owners via API as we're talking about potentially huge volumes of data that really don't scale well to the browser experience. Imagine a company with tens or hundreds of thousands of breached addresses and then a whole heap of those addresses have a bunch of stealer log entries against them. Further, by putting this behind a per-email address API rather than automatically showing it on domain search means it's easy for an org to not see these results, which I suspect some will elect to do for privacy reasons. The API approach was easiest while we explore this service then we can build on that based on feedback. I mentioned this was experimental, right? For now, it looks like this:
Lastly, there's another opportunity altogether that loading stealer logs in this fashion opens up, and the penny dropped when I loaded that last one mentioned earlier. I was contacted by a couple of different organisations that explained how around the time the data I'd loaded was circulating, they were seeing an uptick in account takeovers "and the attackers were getting the password right first go every time!" Using HIBP to try and understand where impacted customers might have been exposed, they posited that it was possible the same stealer logs I had were being used by criminals to extract every account that had logged onto their service. So, we started delving into the data and sure enough, all the other email addresses against their domain aligned with customers who were suffering from account takeover. We now have that data in HIBP, and it would be technically feasible to provide this to domain owners so that they can get an early heads up on which of their customers they probably have to rotate credentials for. I love the idea as it's a great preventative measure, perhaps that will be our next experiment.
Onto the passwords and as mentioned earlier, these have all been extracted and added to the existing Pwned Passwords service. This service remains totally free and open source (both code and data), has a really cool anonymity model allowing you to hit the API without disclosing the password being searched for, and has become absolutely MASSIVE!
I thought that doing more than 10 billion requests a month was cool, but look at that data transfer - more than a quarter of a petabyte just last month! And it's in use at some pretty big name sites as well:
That's just where the API is implemented client-side, and we can identify the source of the requests via the referrer header. Most implementations are done server-side, and by design, we have absolutely no idea who those folks are. Shoutout to Cloudflare while we're here for continuing to provide the service behind this for free to help make a more secure web.
In terms of the passwords in this latest stealer log corpus, we found 167 million unique ones of which only 61 million were already in HIBP. That's a massive number, so we did some checks, and whilst there's always a bit of junk in these data sets (remember - criminals and formatting!) there's also a heap of new stuff. For example:
And about 106M other non-kangaroo themed passwords. Admittedly, we did start to get a bit preoccupied looking at some of the creative ways people were creating previously unseen passwords:
And here's something especially ironic: check out these stealer log entries:
People have been checking these passwords on HIBP's service whilst infected with malware that logged the search! None of those passwords were in HIBP... but they all are now 🙂
Want to see something equally ironic? People using my Hack Yourself First website to learn about secure coding practices have also been infected with malware and ended up in stealer logs:
So, that's the experiment we're trying with stealer logs, and that's how to see the websites exposed against an email address. Just one final comment as it comes up every single time we load data like this:
We cannot manually provide data on a per-individual basis.
Hopefully, there's less need to now given the new feature outlined above, and I hope the massive burden of looking up individual records when there are 71 million people impacted is evident. Do leave your comments below and help us improve this feature to become as useful as we can possibly make it.
This week I'm giving a little teaser as to what's coming with stealer logs in HIBP and in about 24 hours from the time of writing, you'll be able to see the whole thing in action. This has been a huge amount of work trawling through vast volumes of data and trying to make it usable by the masses, but I think what we're launchung tomorrow will be awesome. Along with a new feature around these stealer logs, we've also added a huge number of new passwords to Pwned Passwords not previously seen before. Now, for the first time ever, "fuckkangaroos" will be flagged by any websites using the service 😮 More awesome examples coming in tomorrow's blog post, stay tuned!
It sounds easy - "just verify people's age before they access the service" - but whether we're talking about porn in the US or Australia's incoming social media laws, the reality is way more complex than that. There's no unified approach across jurisdictions and even within a single country like Australia, the closest we've got to that is a government scheme usually intended for accessing public services. And even if there was a technically workable model, who wants to get either the gov or some big tech firm involved in their use of Instagram or Pornhub?! There's a social acceptance to be considered and not only that, circumvention of age controls is very easy when you can simply VPN into another jurisdiction and access the same website blocked in your locale. Or in the case of the adult material, I'm told (🤷♂️) there are many other legally operating websites in other parts of the world that are less inclined to block individuals in specific states from foreign countries. There'll be no easy solutions for this one, but it'll make for an entertaining year 😊
There's a certain irony to the Bluesky situation where people are pushing back when I include links to X. Now, where have we seen this sort of behaviour before? 🤔 When I'm relying on content that only appears on that platform to add context to a data breach in HIBP and that content is freely accessible from within the native Bluesky app (without needing an X account), we're out of reasonable excuses for the negativity. And if "because Elon" is the sole reason and someone is firm enough in their convictions on that, there's a very easy solution 🙂
I fell waaay behind the normal video cadence this week, and I couldn't care less 😊 I mean c'mon, would you rather be working or sitting here looking at this view after snowboarding through Christmas?!
Christmas Day awesomeness in Norway 🇳🇴 Have a great one friends, wherever you are 🧑🎄 pic.twitter.com/F2FtcJYzRC
— Troy Hunt (@troyhunt) December 25, 2024
That said, Scott and I did carve out some time to chat about the, uh, "colourful" feedback he's had after finally putting a price on some Report URI features he'd been giving away free for years. And there's more data breaches, of course, including a couple I loaded over the previous week that I think were particularly interesting. Enjoy this week's video, next week's will be a 2024 wrap-up from somewhere much, much sunnier 😎
I'm back in Oslo! Writing this the day after recording, it feels like I couldn't be further from Dubai; the temperature starts with a minus, it's snowing and there's not a supercar in sight.
Back on business, this week I'm talking about the challenge of loading breaches and managing costs. A breach load immediately takes us from a very high percentage cache hit ratio on Cloudflare to zero. Consequently, our SQL costs skyrocket as the DB scales to support the load. Approximately 28 hours after loading the two breaches I mention in this week's update, we're still running a DB scale that's 350% larger than once we have a high cache hit ratio, and that directly hits my wallet. We need to work on this more because as I say in the video, I really don't like financial incentives that influence how breaches are handled, such as delaying them and bulking them together to reduce the impact of cache flush events like this. We'll give that more thought, I think there are a few ways to tackle this. For now, here's this week's video and some of the challenges we're facing:
A super quick intro today as I rush off to do the next very Dubai thing: drive a Lambo through the desert to go dirt bike riding before jumping in a Can-Am off-roader and then heading to the kart track for a couple of afternoon sessions. I post lots of pics to my Facebook account, and if none of that is interesting, here's this week's video on more infosec-related topics:
Nearly four years ago now, I set out to write a book with Charlotte and RobIt was the stories behind the stories, the things that drove me to write my most important blog posts, and then the things that happened afterwards. It's almost like a collection of meta posts, each one adding behind-the-scenes commentary that most people reading my material didn't know about at the time.
It was a strange time for all of us back then. I didn't leave the country for the first time in over a decade. I barely even left the state. I had time to toil on the passion project that became this book. As I wrote about years later, there were also other things occupying my mind at the time. Writing this book was cathartic, providing me the opportunity to express some of the emotions I was feeling at the time and to reflect on life.
Speaking of reflecting, this week was Have I Been Pwned's 11th birthday. Reaching this milestone, getting back to travel (I'm writing this poolside with a beer at a beautiful hotel in Dubai), life settling down (while sitting next to my amazing wife), and it now being 2 years since we launched the book, I decided we should just give it away for free. I mean really free, not "give me all your personal details, then here's a download link" I mean, here are the direct download links:
I hope you enjoy the book. It's the culmination of so many things I worked so hard to create over the preceding decade and a half, and I'm really happy to just be giving it away now. Enjoy the book 😊
Today, we're happy to welcome the 37th government to have full and free access to domain searches of their gov domains in Have I Been Pwned, Armenia. Armenia's National Computer Incident Response Team AM-CERT now joins three dozen other national counterparts in gaining visibility into how data breaches impact their national interests.
As we expand the reach of governments and organisations into HIBP, we hope to give defenders better insights into the impact of data breaches on their people so that the impact and value to attackers diminish.
I wouldn't say this is a list of my favourite breaches from this year as that's a bit of a disingenuous term, but oh boy were there some memorable ones. So many of the incidents I deal with are relatively benign in terms of either the data they expose or the nature of the service, but some of them this year were absolute zingers. This week, I'm talking about the ones that really stuck out to me for one reason or another, here's the top 5:
I was going to write about how much I've enjoyed "tinkering" with the HIBP API, but somehow, that term doesn't really seem appropriate any more for a service of this scale. On the contrary, we're putting in huge amounts of effort to get this thing fast, stable, and sustainable. We could do the first two very easily just by throwing money at the cloud, but that makes the last one a bit hard. Besides, both Stefán and I do enjoy the challenge of optimising an increasingly large system to run on a shoestring and even though the days of "a coffee a day of running costs" are well behind us, arguably the cost per request (or some other usage-based metric) is better than ever. I hope you enjoy this chat between the two of us and as I say in the video, do please chime in with your thoughts and suggestions.
I've spent more than a decade now writing about how to make Have I Been Pwned (HIBP) fast. Really fast. Fast to the extent that sometimes, it was even too fast:
The response from each search was coming back so quickly that the user wasn’t sure if it was legitimately checking subsequent addresses they entered or if there was a glitch.
Over the years, the service has evolved to use emerging new techniques to not just make things fast, but make them scale more under load, increase availability and sometimes, even drive down cost. For example, 8 years ago now I started rolling the most important services to Azure Functions, "serverless" code that was no longer bound by logical machines and would just scale out to whatever volume of requests was thrown at it. And just last year, I turned on Cloudflare cache reserve to ensure that all cachable objects remained cached, even under conditions where they previously would have been evicted.
And now, the pièce de résistance, the coolest performance thing we've done to date (and it is now "we", thank you Stefán): just caching the whole lot at Cloudflare. Everything. Every search you do... almost. Let me explain, firstly by way of some background:
When you hit any of the services on HIBP, the first place the traffic goes from your browser is to one of Cloudflare's 330 "edge nodes":
As I sit here writing this on the Gold Coast on Australia's most eastern seaboard, any request I make to HIBP hits that edge node on the far right of the Aussie continent which is just up the road in Brisbane. The capital city of our great state of Queensland is just a short jet ski away, about 80km as the crow flies. Before now, every single time I searched HIBP from home, my request bytes would travel up the wire to Brisbane and then take a giant 12,000km trip to Seattle where the Azure Function in the West US Azure data would query the database before sending the response 12,000km back west to Cloudflare's edge node, then the final 80km down to my Surfers Paradise home. But what if it didn't have to be that way? What if that data was already sitting on the Cloudflare edge node in Brisbane? And the one in Paris, and the one in well, I'm not even sure where all those blue dots are, but what if it was everywhere? Several awesome things would happen:
In short, pushing data and processing "closer to the edge" benefits both our customers and ourselves. But how do you do that for 5 billion unique email addresses? (Note: As of today, HIBP reports over 14 billion breached accounts, the number of unique email addresses is lower as on average, each breached address has appeared in multiple breaches.) To answer this question, let's recap on how the data is queried:
Let's delve into that last point further because it's the secret sauce to how this whole caching model works. In order to provide subscribers of this service with complete anonymity over the email addresses being searched for, the only data passed to the API is the first six characters of the SHA-1 hash of the full email address. If this sounds odd, read the blog post linked to in that last bullet point for full details. The important thing for now, though, is that it means there are a total of 16^6 different possible requests that can be made to the API, which is just over 16 million. Further, we can transform the first two use cases above into k-anonymity searches on the server side as it simply involved hashing the email address and taking those first six characters.
In summary, this means we can boil the entire searchable database of email addresses down to the following:
That's a large albeit finite list, and that's what we're now caching. So, here's what a search via email address looks like:
K-anonymity searches obviously go straight to step four, skipping the first few steps as we already know the hash prefix. All of this happens in a Cloudflare worker, so it's "code on the edge" creating hashes, checking cache then retrieving from the origin where necessary. That code also takes care of handling parameters that transform queries, for example, filtering by domain or truncating the response. It's a beautiful, simple model that's all self-contained within a worker and a very simple origin API. But there's a catch - what happens when the data changes?
There are two events that can change cached data, one is simple and one is major:
The second point is kind of frustrating as we've built up this beautiful collection of data all sitting close to the consumer where it's super fast to query, and then we nuke it all and go from scratch. The problem is it's either that or we selectively purge what could be many millions of individual hash prefixes, which you can't do:
For Zones on Enterprise plan, you may purge up to 500 URLs in one API call.
And:
Cache-Tag, host, and prefix purging each have a rate limit of 30,000 purge API calls in every 24 hour period.
We're giving all this further thought, but it's a non-trivial problem and a full cache flush is both easy and (near) instantaneous.
Enough words, let's get to some pictures! Here's a typical week of queries to the enterprise k-anonymity API:
This is a very predictable pattern, largely due to one particular subscriber regularly querying their entire customer base each day. (Sidenote: most of our enterprise level subscribers use callbacks such that we push updates to them via webhook when a new breach impacts their customers.) That's the total volume of inbound requests, but the really interesting bit is the requests that hit the origin (blue) versus those served directly by Cloudflare (orange):
Let's take the lowest blue data point towards the end of the graph as an example:
At that time, 96% of requests were served from Cloudflare's edge. Awesome! But look at it only a little bit later:
That's when I flushed cache for the Finsure breach, and 100% of traffic started being directed to the origin. (We're still seeing 14.24k hits via Cloudflare as, inevitably, some requests in that 1-hour block were to the same hash range and were served from cache.) It then took a whole 20 hours for the cache to repopulate to the extent that the hit:miss ratio returned to about 50:50:
Look back towards the start of the graph and you can see the same pattern from when I loaded the DemandScience breach. This all does pretty funky things to our origin API:
That last sudden increase is more than a 30x traffic increase in an instant! If we hadn't been careful about how we managed the origin infrastructure, we would have built a literal DDoS machine. Stefán will write later about how we manage the underlying database to ensure this doesn't happen, but even still, whilst we're dealing with the cyclical support patterns seen in that first graph above, I know that the best time to load a breach is later in the Aussie afternoon when the traffic is a third of what it is first thing in the morning. This helps smooth out the rate of requests to the origin such that by the time the traffic is ramping up, more of the content can be returned directly from Cloudflare. You can see that in the graphs above; that big peaky block towards the end of the last graph is pretty steady, even though the inbound traffic the first graph over the same period of time increases quite significantly. It's like we're trying to race the increasing inbound traffic by building ourselves up a bugger in cache.
Here's another angle to this whole thing: now more than ever, loading a data breach costs us money. For example, by the end of the graphs above, we were cruising along at a 50% cache hit ratio, which meant we were only paying for half as many of the Azure Function executions, egress bandwidth, and underlying SQL database as we would have been otherwise. Flushing cache and suddenly sending all the traffic to the origin doubles our cost. Waiting until we're back at 90% cache it ratio literally increases those costs 10x when we flush. If I were to be completely financially ruthless about it, I would need to either load fewer breaches or bulk them together such that a cache flush is only ejecting a small amount of data anyway, but clearly, that's not what I've been doing 😄
There's just one remaining fly in the ointment...
Of those three methods of querying email addresses, the first is a no-brainer: searches from the front page of the website hit a Cloudflare Worker where it validates the Turnstile token and returns a result. Easy. However, the second two models (the public and enterprise APIs) have the added burden of validating the API key against Azure API Management (APIM), and the only place that exists is in the West US origin service. What this means for those endpoints is that before we can return search results from a location that may be just a short jet ski ride away, we need to go all the way to the other side of the world to validate the key and ensure the request is within the rate limit. We do this in the lightest possible way with barely any data transiting the request to check the key, plus we do it in async with pulling the data back from the origin service if it isn't already in cache. In other words, we're as efficient as humanly possible, but we still cop a massive latency burden.
Doing API management at the origin is super frustrating, but there are really only two alternatives. The first is to distribute our APIM instance to other Azure data centres, and the problem with that is we need a Premium instance of the product. We presently run on a Basic instance, which means we're talking about a 19x increase in price just to unlock that ability. But that's just to go Premium; we then need at least one more instance somewhere else for this to make sense, which means we're talking about a 28x increase. And every region we add amplifies that even further. It's a financial non-starter.
The second option is for Cloudflare to build an API management product. This is the killer piece of this puzzle, as it would put all the checks and balances within the one edge node. It's a suggestion I've put forward on many occasions now, and who knows, maybe it's already in the works, but it's a suggestion I make out of a love of what the company does and a desire to go all-in on having them control the flow of our traffic. I did get a suggestion this week about rolling what is effectively a "poor man's API management" within workers, and it's a really cool suggestion, but it gets hard when people change plans or when we want to apply quotas to APIs rather than rate limits. So c'mon Cloudflare, let's make this happen!
Finally, just one more stat on how powerful serving content directly from the edge is: I shared this stat last month for Pwned Passwords which serves well over 99% of requests from Cloudflare's cache reserve:
There it is - we’ve now passed 10,000,000,000 requests to Pwned Password in 30 days 😮 This is made possible with @Cloudflare’s support, massively edge caching the data to make it super fast and highly available for everyone. pic.twitter.com/kw3C9gsHmB
— Troy Hunt (@troyhunt) October 5, 2024
That's about 3,900 requests per second, on average, non-stop for 30 days. It's obviously way more than that at peak; just a quick glance through the last month and it looks like about 17k requests per second in a one-minute period a few weeks ago:
But it doesn't matter how high it is, because I never even think about it. I set up the worker, I turned on cache reserve, and that's it 😎
I hope you've enjoyed this post, Stefán and I will be doing a live stream on this topic at 06:00 AEST Friday morning for this week's regular video update, and it'll be available for replay immediately after. It's also embedded here for convenience:
I have absolutely no problem at all talking about the code I've screwed up. Perhaps that's partly because after 3 decades of writing software (and doing some meaningful stuff along the way), I'm not particularly concerned about showing my weaknesses. And this week, I screwed up a bunch of stuff; database queries that weren't resilient to SQL database scale changes, partially completed breach notifications I didn't notice until it was too late to easily fix, and some queries that performed so badly they crashed the entire breach notification process after loading the massive DemandScience incident. Fortunately, none of them had any impact of note, we fixed them all and re-ran processes, and now we're more resilient than ever 😄
Oh - and if you like this style of content, this coming Friday, Stefan and I will do a joint live stream on all sorts of other bits about how now HIBP runs.
Apparently, before a child reaches the age of 13, advertisers will have gathered more 72 million data points on them. I knew I'd seen a metric about this sometime recently, so I went looking for "7,000", which perfectly illustrates how unaware we are of the extent of data collection on all of us. I started Have I Been Pwned (HIBP) in the first place because I was surprised at where my data had turned up in breaches. 11 years and 14 billion breached records later, I'm still surprised!
Jason (not his real name) was also recently surprised at where his data had appeared. He found it in a breach of a service called "Pure Incubation", a company whose records had appeared on a popular hacking forum earlier this year:
#DataLeak Alert ⚠️⚠️⚠️
— HackManac (@H4ckManac) February 28, 2024
🚨Over 183 Million Pure Incubation Ventures Records for Sale 🚨
183,754,481 records belonging to Pure Incubation Ventures (https://t.co/m3sjzAMlXN) have been put up for sale on a hacking forum for $6,000 negotiable.
Additionally, the threat actor with… pic.twitter.com/tqsyb8plPG
When Jason found his email address and other info in this corpus, he had the same question so many others do when their data turns up in a place they've never heard of before - how? Why?! So, he asked them:
I seem to have found my email in your data breach. I am interested in finding how my information ended up in your database.
To their credit, he got a very comprehensive answer, which I've included below:
Well, that answers the "how" part of the equation; they've aggregated data from public sources. And the "why" part? It's the old "data is the new oil" analogy that recognises how valuable our info is, and as such, there's a market for it. There are lots of terms used to describe what DemandScience does, including "B2B demand generation", "buyer intelligence solutions provider", "empowering technology companies to accelerate ROI", "supercharging pipelines" and "account intelligence". Or, to put it in a more lay-person-friendly fashion, they sell data on people.
DemandScience is what we refer to as a "data aggregator" in that they combine identity data from multiple locations, bundle it up, and then sell it. Occasionally, data aggregators end up having sizeable data breaches; before today, HIBP already contained Adapt (9M records), Data & Leads (44M records), Exactis (132M records), Factual (2M records), and You've Been Scraped (66M records). According to DemandScience, "none of our current operational systems were exploited", yet simultaneously, "the leaked data originated from a system that has been decommissioned". So, it's a breach of an old system.
Does it matter? I mean, if it's just public data, should people care? Jason cared, at least enough to make the original enquiry and for DemandScience to look him up and realise he's not in their current database. Still, he existed in the breached one (I later sent Jason his record from the breach, and he confirmed the accuracy). As I often do in these cases, I reached out to a bunch of recent HIBP subscribers in the breach and asked them three simple questions:
The answers were all the same: the data is accurate, it's already in the public domain, and people aren't too concerned about it appearing in this breach. Well that was easy 🙂 However...
There are two nuances that aren't captured here, and the first one is that this is valuable data, that's why DemandScience sells it! It comes back to that "new oil" analogy and if you have enough of it, you can charge good money for it. Companies typically use data such as this to do precisely the sort of catchphrasey stuff the company talks about, primarily around maximising revenue from their customers by understanding them better.
The second nuance is that whilst this data may already be in the public domain, did the owners of it expect it to be used in this fashion? For example, if you publish your details in a business directory, is your expectation that this info may then be sold to other companies to help them upsell you on their products? Probably not. And if, like many of the records in the data, someone's row is accompanied by their LinkedIn profile, would they expect that data to matched and sold? I suggest the responses would likely be split here, and that in itself is an important observation: how we view the sensitivity of our data and the impact of it being exposed (whether personal or business) is extremely personal. Some people take the view of "I have nothing to hide", whilst others become irate if even just their email address is exposed.
Whilst considering how to add more insights to this blog post, I thought I'd do a quick check on just one more email address:
"54543060",,"0","TROY","HUNT","PO BOX 57",,"WEST RYDE",,,"AU","61298503333",,,,"troy.hunt@pfizer.com","pfizer.com","PFIZER INC",,"250-499","$50 - 99 Million","Healthcare, Pharmaceuticals and Biotech","VICE PRESIDENT OF INFORMATION TECHNOLOGY","VP Level","2834",,"Senior Management (SVP/GM/Director)","IT",,"1","GemsTarget INTL","GEMSTARGET_INTL_648K_10.17.18",,,,,,,,,"18/10/2018 05:12:39","5/10/2021 16:47:56","PFIZER.COM",,,,,"IT Management General","Information Technology"
I'll be entirely transparent and honest here - my exact words after finding this were "motherfucker!" True story, told uncensored here because I want to impress on the audience how I feel when my data turns up somewhere publicly. And I do feel like it's "my" data; it's certainly my name and even though it's my old Pfizer email address I've not used for almost a decade now, that also has my name in it. My job title is also there... and it's completely wrong! I never had a VP-level role, even though the other data around my tech role is at least in the vicinity of being correct. But other than the initial shock of finding myself in yet another data breach, personally, I'm in the same boat as the HIBP subscribers I contacted, and this doesn't bother me too much. But I also agree with the following responses I received to my third question:
I think it is useful to be notified of such breaches, even if it is just to confirm no sensitive data has been compromised. As I said, our IT department recently notified me that some of my data was leaked and a pre-emptive password reset was enforced as they didn't know what was leaked.
It would be good to see it as an informational notification in case there's an increase in attack attempts against my email address.
I would like to opt-out of here to reduce the SPAM and Phishing emails.
That last one seems perfectly reasonable, and fortunately, DemandScience does have a link on their website to Do Not Sell My Information:
Dammit! If, like me, you're part of the 99.5% of the world that doesn't live in California, then apparently this form isn't for you. However, they do list dataprivacy@demandscience.com on that page, which is the same address Jason was communicating with above. Chances are, if you want to remove your data then that's where to start.
There were almost 122M unique email addresses in this corpus and those have now been added to HIBP. Treat this as informational; I suspect that for most people, it won't bother them, whilst others will ask for their data not to be sold (regardless of where they live in the world). But in all likelihood, there will be more than a handful of domain subscribers who take issue with that volume of people data sitting there in one corpus easily downloadable via a clear web hacking forum. For example, mine was just one of many tens of thousands of Pfizer email addresses, and that sort of thing is going to raise the ire of some folks in corporate infosec capacities.
One last comment: there was a story published earlier this year titled Our Investigation of the Pure Incubation Ventures Leak and in there they refer to "encrypted passwords" being present in the data. Many of the files do contain a column with bcrypt hashes (which is definitely not encryption), but given the way in which this data was collated, I can see no evidence whatsoever that these are password hashes. As such, I haven't listed "Passwords" as one of the compromised data classes in HIBP and you find yourself in this breach, I wouldn't be at all worried about this.
This was a much longer than usual update, largely due to the amount of time spent discussing the Earth 2 incident. As I said in the video (many times!), the amount of attention this has garnered from both Earth 2 users and the company itself is incommensurate with the impact of the incident itself. It's a nothing-burger. Email addresses and usernames, that's it, and of course, their association with the service, which may lead to some very targeted spam or phishing attempts. It's still a breach by any reasonable definition of the term, but it should have been succinctly summarised and disclosed to impacted parties with everyone moving on with more important things in life a few moments later. And that's exactly what I'm going to do right now 😊
I have really clear memories of listening to the Stack Overflow podcast in the late 2000's and hearing Jeff and Joel talk about the various challenges they were facing and the things they did to overcome them. I just suddenly thought of that when realising how long this week's video went for with no real plan other than to talk about our HIBP backlog. People seem to love this in the same way I loved listening to the guys a decade and a half ago. I'll do one of these with Stefan as well over the course of this month, let us know what you'd like to hear about 😊
Firstly, my apologies for the minute and a bit of echo at the start of this video, OBS had somehow magically decided to start recording both the primary mic and the one built into my camera. Easy fix, moving on...
During the livestream, I was perplexed as to why the HIBP DB was suddenly maxing out. Turns out that this aligned with dropping a constraint on the table of domains which appears to have caused the table to reindex and massively slow down the queries for breached email addresses. Further, we simultaneously started having problems related to MAXDOP (the maximum degree of parallelism for the stored procedure running the query), which was only resolved after we forced it to not run on multiple CPUs by setting it to 1 (weirdly, 2 is also fine but 3 or higher completely killed perf). Fun times, running a service like this.
Apparently, Stefan and I trying to work stuff out in real time about how to build more efficient features in HIBP is entertaining watching! If I was to guess, I think it's just seeing people work through the logic of how things work and how we might be able to approach things differently, and doing it in real time very candidly. I'm totally happy doing that, and the comments from the audience did give us more good food for thought too. I'll try and line up a session just like that before the end of the year, we've certainly got no shortage of material!
It wasn't easy talking about the Muah.AI data breach. It's not just the rampant child sexual abuse material throughout the system (or at least requests for the AI to generate images of it), it's the reactions of people to it. The tweets justifying it on the basis of there being noo "actual" abuse, the characterisation of this being akin to "merely thoughts in someone's head", and following my recording of this video, the backlash from their users about any attempts to curb creating sexual image of young children being "too much":
Which is making customers unhappy - "any censorship is too much": pic.twitter.com/fzfrFdKL8w
— Troy Hunt (@troyhunt) October 12, 2024
The law will catch up with this (and anyone in that breach creating this sort of material should be feel very bloody nervous right now), and the writing is already on the wall for people generating CSAM via AI:
This bill would expand the scope of certain of these provisions to include matter that is digitally altered or generated by the use of artificial intelligence, as such matter is defined.
The bill can't pass soon enough.
Ok, the scenery here is amazing, but the real story is data breach victim notification. Charlotte and I wanted to do this one together today and chat about some of the things we'd been hearing from government and law enforcement on our travels, and the victim notification angle featured heavily. She reminded me of the trouble even the police have when reaching out to organisations about security issues, often being confronted by lawyers or other company representatives worried about legal reprisals. It's nuts, and if it's hard for the law to get someone's attention, what hope is there for us?!
It's not a green screen! It's just a weird a weird hotel room in Pittsburgh, but it did make for a cool backdrop for this week's video. We were there visiting our FBI friends after coming from Washington DC and a visit to CISA, the "America's Cyber Defence Agency". This week, I'm talking about those visits, some really cool new Cloudflare features, and our ongoing effort to push more and more of HIBP's data to Cloudflare's edges. Enjoy!
The conundrum I refer to in the title of this post is the one faced by a breached organisation: disclose or suppress? And let me be even more specific: should they disclose to impacted individuals, or simply never let them know? I'm writing this after many recent such discussions with breached organisations where I've found myself wishing I had this blog post to point them to, so, here it is.
Let's start with tackling what is often a fundamental misunderstanding about disclosure obligations, and that is the legal necessity to disclose. Now, as soon as we start talking about legal things, we run into the problem of it being different all over the world, so I'll pick a few examples to illustrate the point. As it relates to the UK GDPR, there are two essential concepts to understand, and they're the first two bulleted items in their personal data breaches guide:
The UK GDPR introduces a duty on all organisations to report certain personal data breaches to the relevant supervisory authority. You must do this within 72 hours of becoming aware of the breach, where feasible.
If the breach is likely to result in a high risk of adversely affecting individuals’ rights and freedoms, you must also inform those individuals without undue delay.
On the first point, "certain" data breaches must be reported to "the relevant supervisory authority" within 72 hours of learning about it. When we talk about disclosure, often (not just under GDPR), that term refers to the responsibility to report it to the regulator, not the individuals. And even then, read down a bit, and you'll see the carveout of the incident needing to expose personal data that is likely to present a "risk to people’s rights and freedoms".
This brings me to the second point that has this massive carveout as it relates to disclosing to the individuals, namely that the breach has to present "a high risk of adversely affecting individuals’ rights and freedoms". We have a similar carveout in Australia where the obligation to report to individuals is predicated on the likelihood of causing "serious harm".
This leaves us with the fact that in many data breach cases, organisations may decide they don't need to notify individuals whose personal information they've inadvertently disclosed. Let me give you an example from smack bang in the middle of GDPR territory: Deezer, the French streaming media service that went into HIBP early January last year:
New breach: Deezer had 229M unique email addresses breached from a 2019 backup and shared online in late 2022. Data included names, IPs, DoBs, genders and customer location. 49% were already in @haveibeenpwned. Read more: https://t.co/1ngqDNYf6k
— Have I Been Pwned (@haveibeenpwned) January 2, 2023
229M records is a substantial incident, and there's no argument about the personally identifiable nature of attributes such as email address, name, IP address, and date of birth. However, at least initially (more on that soon), Deezer chose not to disclose to impacted individuals:
Chatting to @Scott_Helme, he never received a breach notification from them. They disclosed publicly via an announcement in November, did they never actually email impacted individuals? Did *anyone* who got an HIBP email get a notification from Deezer? https://t.co/dnRw8tkgLl https://t.co/jKvmhVCwlM
— Troy Hunt (@troyhunt) January 2, 2023
No, nothing … but then I’ve not used Deezer for years .. I did get this👇from FireFox Monitor (provided by your good selves) pic.twitter.com/JSCxB1XBil
— Andy H (@WH_Y) January 2, 2023
Yes, same situation. I got the breach notification from HaveIBeenPwned, I emailed customer service to get an export of my data, got this message in response: pic.twitter.com/w4maPwX0Qe
— Giulio Montagner (@Giu1io) January 2, 2023
This situation understandably upset many people, with many cries of "but GDPR!" quickly following. And they did know way before I loaded it into HIBP too, almost two months earlier, in fact (courtesy of archive.org):
This information came to light November 8 2022 as a result of our ongoing efforts to ensure the security and integrity of our users’ personal information
They knew, yet they chose not to contact impacted people. And they're also confident that position didn't violate any data protection regulations (current version of the same page):
Deezer has not violated any data protection regulations
And based on the carveouts discussed earlier, I can see how they drew that conclusion. Was the disclosed data likely to lead to "a high risk of adversely affecting individuals’ rights and freedoms"? You can imagine lawyers arguing that it wouldn't. Regardless, people were pissed, and if you read through those respective Twitter threads, you'll get a good sense of the public reaction to their handling of the incident. HIBP sent 445k notifications to our own individual subscribers and another 39k to those monitoring domains with email addresses in the breach, and if I were to hazard a guess, that may have been what led to this:
Is this *finally* the @Deezer disclosure notice to individuals, a month and a half later? It doesn’t look like a new incident to me, anyone else get this? https://t.co/RrWlczItLm
— Troy Hunt (@troyhunt) February 20, 2023
So, they know about the breach in Nov, and they told people in Feb. It took them a quarter of a year to tell their customers they'd been breached, and if my understanding of their position and the regulations they were adhering to is correct, they never needed to send the notice at all.
I appreciate that's a very long-winded introduction to this post, but it sets the scene and illustrates the conundrum perfectly: an organisation may not need to disclose to individuals, but if they don't, they risk a backlash that may eventually force their hand.
In my past dealing with organisations that were reticent to disclose to their customers, their positions were often that the data was relatively benign. Email addresses, names, and some other identifiers of minimal consequence. It's often clear that the organisation is leaning towards the "uh, maybe we just don't say anything" angle, and if it's not already obvious, that's not a position I'd encourage. Let's go through all the reasons:
I ask this question because the defence I've often heard from organisations choosing the non-disclosure path is that the data is theirs - the company's. I have a fundamental issue with this, and it's not one with any legal basis (but I can imagine it being argued by lawyers in favour of that position), rather the commonsense position that someone's email address, for example, is theirs. If my email address appears in a data breach, then that's my email address and I entrusted the organisation in question to look after it. Whether there's a legal basis for the argument or not, the assertion that personally identifiable attributes become the property of another party will buy you absolutely no favours with the individual who provided them to you when you don't let them know you've leaked it.
Picking those terms from earlier on, if my gender, sexuality, ethnicity, and, in my case, even my entire medical history were to be made public, I would suffer no serious harm. You'd learn nothing of any consequence that you don't already know about me, and personally, I would not feel that I suffered as a result. However...
For some people, simply the association of their email address to their name may have a tangible impact on their life, and using the term from above jeopardises their rights and freedoms. Some people choose to keep their IRL identities completely detached from their email address, only providing the two together to a handful of trusted parties. If you're handling a data breach for your organisation, do you know if any of your impacted customers are in that boat? No, of course not; how could you?
Further, let's imagine there is nothing more than email addresses and passwords exposed on a cat forum. Is that likely to cause harm to people? Well, it's just cats; how bad could it be? Now, ask that question - how bad could it be? - with the prevalence of password reuse in mind. This isn't just a cat forum; it is a repository of credentials that will unlock social media, email, and financial services. Of course, it's not the fault of the breached service that people reuse their passwords, but their breach could lead to serious harm via the compromise of accounts on totally unrelated services.
Let's make it even more benign: what if it's just email addresses? Nothing else, just addresses and, of course, the association to the breached service. Firstly, the victims of that breach may not want their association with the service to be publicly known. Granted, there's a spectrum and weaponising someone's presence in Ashley Madison is a very different story from pointing out that they're a LinkedIn user. But conversely, the association is enormously useful phishing material; it helps scammers build a more convincing narrative when they can construct their messages by repeating accurate facts about their victim: "Hey, it's Acme Corp here, we know you're a loyal user, and we'd like to make you a special offer". You get the idea.
I'll start this one in the complete opposite direction to what it sounds like it should be because this is what I've previously heard from breached organisations:
We don't want to disclose in order to protect our customers
Uh, you sure about that? And yes, you did read that paraphrasing correctly. In fact, here's a copy paste from a recent discussion about disclosure where there was an argument against any public discussion of the incident:
Our concern is that your public notification would direct bad actors to search for the file, which can potentially do harm to both the business and our mutual users.
The fundamental issue of this clearly being an attempt to suppress news of the incident aside, in this particular case, the data was already on a popular clear web hacking forum, and the incident has appeared in multiple tweets viewed by thousands of people. The argument makes no sense whatsoever; the bad guys - lots of them - already have the data. And the good guys (the customers) don't know about it.
I'll quote precisely from another company who took a similar approach around non-disclosure:
[company name] is taking steps to notify regulators and data subjects where it is legally required to do so, based on advice from external legal counsel.
By now, I don't think I need to emphasise the caveat that they inevitably relied on to suppress the incident, but just to be clear: "where it is legally required to do so". I can say with a very high degree of confidence that they never notified the 8-figure number of customers exposed in this incident because they didn't have to. (I hear about it pretty quickly when disclosure notices are sent out, and I regularly share these via my X feed).
Non-disclosure is intended to protect the brand and by extension, the shareholders, not the customers.
Usually, after being sent a data breach, the first thing I do is search for "[company name] data breach". Often, the only results I get are for a listing on a popular hacking forum (again, on the clear web) where their data was made available for download, complete with a description of the incident. Often, that description is wrong (turns out hackers like to embellish their accomplishments). Incorrect conclusions are drawn and publicised, and they're the ones people find when searching for the incident.
When a company doesn't have a public position on a breach, the vacuum it creates is filled by others. Obviously, those with nefarious intent, but also by journalists, and many of those don't have the facts right either. Public disclosure allows the breached organisation to set the narrative, assuming they're forthcoming and transparent and don't water it down such that there's no substance in the disclosure, of course.
All the way back in 2017, I wrote about The 5 Stages of Data Breach Grief as I watched The AA in the UK dig themselves into an ever-deepening hole. They were doubling down on bullshit, and there was simply no way the truth wasn't going to come out. It was such a predictable pattern that, just like with Kübler-Ross' stages of personal grief, it was very clear how this was going to play out.
If you choose not to disclose a breach - for whatever reason - how long will it be until your "truth" comes out? Tomorrow? Next month? Years from now?! You'll be looking over your shoulder until it happens, and if it does one day go public, how will you be judged? Which brings me to the next point:
I can't put any precise measure on it, but I feel we reached a turning point in 2017. I even remember where I was when it dawned on me, sitting in a car on the way to the airport to testify before US Congress on the impact of data breaches. News had recently broken that Uber had attempted to cover up its breach of the year before by passing it off as a bug bounty and, of course, not notifying impacted customers. What dawned on me at that moment of reflection was that by now, there had been so many data breaches that we were judging organisations not by whether they'd been breached but how they'd handled the breach. Uber was getting raked over the coals not for the breach itself but because they tried to conceal it. (Their CTO was also later convicted of federal charges for some of the shenanigans pulled under his watch.)
This is going to feel like I'm talking to my kids after they've done something wrong, but here goes anyway: If people entrusted you with your data and you "lost" it (had it disclosed to unauthorised parties), the only decent thing to do is own up and acknowledge it. It doesn't matter if it was your organisation directly or, as with the Deezer situation, a third party you entrusted with the data; you are the coalface to your customers, and you're the one who is accountable for their data.
I am yet to see any valid reasons not to disclose that are in the best interests of the impacted customers (the delay in the AT&T breach announcement at the request of the FBI due to national security interests is the closest I can come to justifying non-disclosure). It's undoubtedly the customers' expectation, and increasingly, it's the governments' expectations too; I'll leave you with a quote from our previous Cyber Security Minister Clare O'Neil in a recent interview:
But the real people who feel pain here are Australians when their information that they gave in good faith to that company is breached in a cyber incident, and the focus is not on those customers from the very first moment. The people whose data has been stolen are the real victims here. And if you focus on them and put their interests first every single day, you will get good outcomes. Your customers and your clients will be respectful of it, and the Australian government will applaud you for it.
I'm presently on a whirlwind North America tour, visiting government and law enforcement agencies to understand more about their challenges and where we can assist with HIBP. As I spend more time with these agencies around the world, I keep hearing that data breach victim notification is an essential piece of the cybersecurity story, and I'm making damn sure to highlight the deficiencies I've written about here. We're going to keep pushing for all data breach victims to be notified when their data is exposed, and my hope in writing this is that when it's read in future by other organisations I've disclosed to, they respect their customers and disclose promptly. Check out Data breach disclosure 101: How to succeed after you've failed for guidance and how to do this.
Edit (a couple of days later): I'm adding an addendum to this post given how relevant it is. I just saw the following from Ruben van Well of the Dutch Police, someone who has invested a lot of effort in victim notification and we had the pleasure of spending time with last year in Rotterdam:
To translate the key section:
Reporting and transparency around incidents is important. Of the companies that fall victim, between 8 and 10% report this, whether or not out of fear of reputational damage. I assume that your image will be more damaged if you do not report an incident and it does come out later.
It echos my sentiments from above precisely, and I hope that message has an impact on anyone considering whether or not to disclose.
Just watching back through bits of this week's video, the thing that's really getting at me is the same thing I've come back to in so many past videos: lack of organisational disclosure after a breach. Lack of disclosure to impacted customers, lack of disclosure to the public, and a general apathy towards the transparency with which we expect organisations to behave post-breach. This is a topic I'm increasingly pushing in front of governments and law enforcement agencies, and it'll be front of mind during my visits to the US and Canada this coming week and next. I have a longer form blog post in draft I'll try and wrap up before those meetings, hopefully that'll be one to talk about in next week's update. For now, see what you think of how I've framed the issue here:
I was in my mid-30s before I felt comfortable standing up in front of an audience and talking about technology. Come to think of it, "comfortable" isn't really the right word, as, frankly, it was nerve-racking. This, with my obvious bias as her father, makes it all the more remarkable that Elle was able to do it at NDC Oslo when she was just 11 years old. That she was able to do that and teach a room full of hundreds of technology professionals things they almost certainly hadn't seen before makes it all the more remarkable, and I'm very happy to now share the full video from that event in June with you all:
If you watch nothing else in this video, fast forward through to the 55-minute mark and watch Elle with our 3D printed catapult launching projectiles generated from a Chat GPT prompt into the audience during the Q&A. That kid is having the time of her life 😊
Today was all about this whole idea of how we index and track data breaches. Not as HIBP, but rather as an industry; we simply don't have a canonical reference of breaches and their associated attributes. When they happened, how many people were impacted, any press on the incident, the official disclosure messaging and so on and so forth. As someone in the video today said, "what about the Airtel data breach?" Yeah, whatever happened to that?! A quick Google reminds me that this was a few months ago, but did they ever acknowledge it? Send disclosure notices? Did the data go public? I began talking about all this after someone mentioned a breach during the week and for the life of me, I had no idea whether I'd heard about it before, looked into it, or even seen the data. Surely, with so many incidents floating around that have so much impact, we should have a way of cataloguing it all? Have a listen to this week's video and see what you think.
It's been a while since I've just gone all "AMA" on a weekly update, but this was just one of those weeks that flew by with my head mostly in the code and not doing much else. There's a bit of discussion about that this week, but it's mostly around the ongoing pain of resellers and all the various issues supporting them then creates as a result. I think we just need to get on with writing the code to automate everything they do so I just don't need to think about them any more 😭
I still find the reactions to the Telegram situation with Durov's arrest odd. There are no doubt all sorts of politics surrounding it, but even putting all that aside for a moment, the assertion that a platform provider should not be held accountable for moderating content on the platform is just nuts. As I say in this week's video, there's lots of content that you can put in the "grey" bucket (free speech versus hate speech, for example) and there are valid arguments to be had there. But there's also a bunch of content on Telegram that's not even close to grey, it's the outright illegal recalcitrant stuff that there must be accountability for when you're running the platform that allows people to publish this content. This goes well beyond direct interpersonal communication on genuine E2E encrypted platforms like Signal (or the terrible analogy of Telegram somehow being "just like AT&T"), and the current situation in France really shouldn't be that surprising. More in this week's video:
It was 2019 that I was last in North America, spending time in San Francisco, Los Angeles, Vegas, Denver, Minnesota, New York and Seattle. The year before, it was Montreal and Vancouver and since then, well, things got a bit weird for a while. It's a shame it's been this long because North America is such an important part of the world for so many of the things we (including Charlotte in this too) do; it's the lion's share of the audience for my content, the companies whose services we rely on, and the customers for the various things we do to make a living. In fact, North America is so important, that we're coming back next month 😄
The reason I'm writing this post is because in the past, when I've announced travel plans a whole bunch of really interesting people at cool place have reached out and said "hey, come and say g'day while you're here". So, here's the plan over late September and early October:
Things have panned out that way to align with where various partners we want to visit in government, law enforcement and other services are located. I'll wait until I'm there to share more about most those visits, but one I can share in advance is this one:
We’re proud to announce that 1Password is a global partner of the 2024 @PresidentsCup.🏆 ⛳
— 1Password (@1Password) July 9, 2024
Sponsoring the #PresidentsCup, watched by tens of millions of fans around the world, puts a global spotlight on the cybersecurity challenges that businesses, individuals, and families… pic.twitter.com/kNUnKkZCFt
I'm really happy to be joining my friends from 1Password at the Presidents Cup and I'm looking forward to giving them a hand to get the good word out to those tens of millions of people around the world.
On one of my last visits to San Francisco before the world went strange, I spent several days visiting companies in the city and in Silicon Valley. It was very much just that "say g'day" sort of casual visit, and that's where we'll have the most availability on that trip. Time is always finite, of course, but if you're on the path of our travels and would like a visit, please get in touch.
Oh, and just so we don't make it all hard work, Charlotte and I might take a few days out between Canada and San Franciso. If you have any suggestions for epic locations along the way, drop us a comment below.
This is such a significant week for us, to finally have Stefan join us as a proper employee at HIBP. When you start out as a pet project, you never really consider yourself a "proper" employee because, well, it's just you mucking around. And then when Charlotte started "officially" working for HIBP a few years ago, well, that's my wife helping me out. To have someone whose sole purpose it is to write code that makes this thing tick and build all sorts of amazing new features expands our capacity to actually produce stuff many times over. I use that term "actually produce stuff" because it was precious little time I had to do this, given all the things involved in running HIBP. It's an exciting time for all three of us now 😊
It should be so simple: you're a customer who wants to purchase something so you whip out the credit card and buy it. I must have done this thousands of times, and it's easy! I've bought stuff with plastic credit cards, stuff with Apple Pay on my phone and watch and, like all of us, loads of stuff simply by entering credit card details into a website. A lot of that has been business expenses for which I've obtained a receipt and then claimed back, either in my joyful life of independence or in a decade and a half at one of the world's largest companies.
Don't get me wrong, credit cards have their faults too. I've had them defrauded many times now, but that's usually just a minor inconvenience. Getting one can be painful if you're young and have no credit record, but a debit card is much more easily obtainable and still gives you all the same purchasing power so long as you've got cash in the account. And, well, that's about it. Frankly, I find it increasingly weird when a credit card can't be used. I travel a lot to different corners of the world, and it's extraordinarily rare not to pay for pretty much everything with "plastic". Increasingly, that's the only possible way with it not just being the defacto payment standard in most countries I visit but increasingly being the only form of payment accepted by many stores. Not using a card is becoming increasingly antiquated; today, I stood behind a woman buying a carton of milk with a $20 note, and geez, it was slow compared to the 5 seconds it took me to buy a few bucks of potatoes with my watch. Which makes me all the more frustrated by what I'm about to write:
Five years ago, we started charging a few bucks to use the Have I Been Pwned (HIBP) API. It was primarily to stop abuse, and it worked beautifully! Turns out that once you have to stump up a credit card tied your IRL identity, you're a lot less inclined to do nasty things. Implementing this via Stripe was a breeze, and as the years progressed, that model gradually evolved. A few years later, we added higher rate limits, and with that came annual billing as well. The following year, we put a price on searches of the largest domains and coincidental to this blog post, that was about a year before the time of writing. This led to a lot of new subscriptions, which means now is when those that were taken out annually are starting to renew and subsequently, now is when the problems are really hitting.
Let's scroll back a bit and talk about "Neville". Neville works in the procurement department of a large organisation, and his job is to take that dead simple credit card purchase process and turn it into the most convoluted, laborious, and mind-numbingly stupid exercise possible. That all begins with wanting a quote for the product. Now, you might reasonably look at this request and think "Neville, why on earth do you want a quote when the price is already on the website?!" These services start at $3.95; does creating a quote that merely reproduces the publicly published price and puts Neville's company name on it actually achieve anything? No! Occasionally, Neville will just ask for an invoice, which again, just reproduces publicly available information in a PDF just for him. But that's "protocol", and Neville is a stickler for doing things by the book. Oh - and no credit cards. WTF Neville?! Yep, electronic funds transfer only and that'll be on 60 day terms in arrears too, thankyou very much. That's just Neville's way.
I often liken this to when I'd travel to Hong Kong for work. It was always a multi-day trip and inevitably, I'd need to eat. So, I'd find a nice noodle bar somewhere and order my fill, probably paying about the same amount as that $3.95 HIBP entry price. Now, imagine a Neville driving my noodle purchasing such that as I rocked up for lunch, ordered my noodles and was presented with the EFTPOS machine I said "No, I'm going to need a quote first". For noodles. And further, once I accept that quote, I'll need an invoice which I'll pay within 60 days (which really means somewhere between 59 and 60 days, because "Neville"), and I'll need your banking details too please, because we don't do credit cards. I never tried that, but I'm confident that if I did, I'd get a stern Cantonese version of GTFO. And rightly so.
But that's where we quite literally are with HIBP; the Nevilles of the world eschewing the most simple, universal form of payment in favour of, well, the exact opposite. Every organisation is different, of course, but these are the patterns that play out over and over again. The "we can't purchase anything by credit card" line in particular is pervasive, and in all seriousness, I do wonder: WTF do you do when you travel?! Surely there are people within your organisation that go far enough away from HQ to legitimately justify a meal somewhere, and clearly they're playing out the scenario described in the previous para. So, you have a way, it's just selective. But Neville persists, and we make it possible to do quotes and invoices for sufficiently sized subscriptions at the expense of our own time. Neville is now happy - almost.
"We can't pay by credit card". Over and over again, we get this line. Often it's not even that blunt, it's more akin to "can we have an invoice" which usually turns out to be code speak for "can we have an invoice and pay by EFT". Now, I don't particularly mind electronically transferring funds around account to account, but it's messy. We've got BSBs identifying the bank here in Australia but then it's typically SWIFT codes elsewhere. No problem, we can do that, but then we have the next problem: who just paid us? I mean, there's a sum of money that has appeared and it's identical to all the other sums because we're providing the same service to many different parties, so... who just paid? You can always reference an invoice number in the description of the payment, but that's up to the sender of the money and very frequently, that just doesn't happen. The sender's name doesn't necessarily make sense either, and even if it does, we're now burning time doing manual reconciliation. Further, it's not instantaneous like a credit card so not only are we waiting around for a payment before being able to do reconciliation and then turn on the service, the purchaser is also waiting around; they've just been notified a bunch of their staff are caught in the latest data breach that could have serious impact on the org and they need to... wait. And all of that doesn't matter anyway because we have another major problem: Stripe.
You know how earlier I said that Stripe was a breeze? Let me clarify that by saying that Stripe can be a breeze, and it can be an absolute nightmare. If you've implemented their service before and read the previous paragraph you may be thinking "why aren't you just using their bank transfer payments feature", and you click on that link and realise why:
No Australia ☹️ Which is a real shame because they have these cool virtual bank account numbers which take care of the reconciliation problem I mentioned earlier. I've prompted them directly on this over the years and the best solution I've had is to go and jump on Stripe Atlas which involves incorporating an entirely new company in the US:
I have absolutely no desire to have yet another corporate entity anywhere, let alone on the other side of the world. The current accounting burden I have each year is already a bit nuts, the last thing I want to do is make it even worse because Neville is being a stickler for an arbitrary internal protocol.
But there's another solution, one that opens up a whole new can of worms: resellers. I'm struggling a bit to even know where to start here, but imagine a digital product like an HIBP API key but instead of purchasing it direct from us, you take the time to go to someone else (the reseller) who purchases it on your behalf, adds a hefty markup and then charges you for the exact same thing you could have bought yourself with a lot less mucking around. Genius!
I blame Neville. It was inevitably a Neville in my previous place of work that demanded all purchases be routed through a single supplier. That meant that every time I found a great online service or product, I had to stop, think of Neville, and then contact the reseller of choice. That, of course, added overhead in terms of both time and the inevitable markup on the price of the product. I'm sure there's a reason why this makes sense in Neville's world. It probably has something to do with having fewer vendors the organisation deals with directly or delaying payment and doing it in bulk or something like "consolidated accounting"; who knows? The point is it's more time, more money and more pain. But they do solve one problem: the credit card situation.
We ended up acquiescing and allowing resellers to purchase on behalf of our customers primarily because the reseller is happy to pay with a card and then invoice the actual customer and do the whole EFT thing. That solves the problem where Neville has decreed that all purchases must go through the reseller because "reasons". What we don't do is entertain the request for a "reseller discount", a request we receive on a pretty frequent basis. It's nonsensical: instead of the end customer just going to the website, chucking in their Visa and being done with the whole thing in a couple of minutes without any assistance from us, instead, we have to do the legwork of interfacing back and forth with the reseller to create the appropriate documents, and after spending our precious time doing that, they also want a discount! Not gonna happen. But given the reseller is in the money-making business, they whack a hefty markup on the top to the detriment of the customer who decided to use them. It's nuts.
This brings me to today's problem and the (other) problem with Stripe. As annual renewals approach, resellers are reaching out wanting invoices for their customers' subscriptions. No problem (this is my naive brain response), both those constructs exist in Stripe, and we often use them for new subscriptions. However, there's a massive blind spot in Stripe's implementation: an invoice can only be raised when a subscription starts, creating a huge problem when resellers reach out three months in advance. I blew the better part of all of yesterday trying to work out how to do this and concluded the following:
Charlotte went backwards and forwards with Stripe support on this. Here's their final position:
There is no option to make an advance payment for a subscription or renewal yet. I've passed it on to our product team. Unfortunately, we can't guarantee that this feature will be implemented at the moment, but we’re rapidly evolving our product suites, so be sure to stay close to hearing about new products and features as we announce them.
To summarise everything you've read thus far, because Neville won't approve credit card purchases and Stripe won't allow EFT payments, we're stuck with resellers who now want renewal invoices, which we can't raise because Stripe won't let you do that in advance of a subscription starting. Look, it's the finance industry, so there are lots of regulatory hurdles that I can see making things like EFT payments to virtual bank account numbers in Australia painful. I think Stripe's implementation is beautiful in so many ways, and I have massive respect for their technical implementation; the API docs are top notch, their CLI is really cool, the webhooks on events work beautifully, and their portal is lovely to use. Kudos all round to the Stripe devs. But why something as simple as raising an invoice in advance of the service starting isn't possible just doesn't make sense for a digital product like so many of the ones we buy online. I mean, what's the answer: wait until the subscription cycle renews, at which point the service is discontinued until the newly issued invoice is paid? That's where we're headed because so many orgs (including resellers) simply won't pay until there's an invoice. (In case you're wondering, they often use credit cards with short-lived expiry so they can't just be automatically charged on renewal.)
By writing this, I know I will get a flood of either "don't use Stripe" responses or "support multiple payment providers" suggestions. Both of these are a nightmare as Stripe is so closely integrated into our service. If someone wants a historical invoice or receipt, we bounce them off to the Stripe portal. When a Stripe payment is successful, that's the event that calls a webhook on HIBP and enables (or extends) the subscription. And yes, we could build a heap of additional code to support other providers, but that's masses of overhead just because Neville doesn't like credit cards.
Curious, I ran the numbers on resellers and came up with 0.6% of total current subscriptions. In other words, if you're in an org that insists on EFT and / or a reseller, there are 160 other people just like you who are working in a Neville-free environment. I feel your pain as I've walked in your shoes, but I also know that credit cards are an arbitrary barrier a sufficiently motivated organisation can overcome.
If you've read this and have suggestions, please chime in via the comments below.
Whilst there definitely weren't 2.x billion people in the National Public Data breach, it is bad. It really is fascinating how much data can be collected and monetised in this fashion and as we've seen many times before, data breaches do often follow. The NPD incident has received a huge amount of exposure this week and as is often the case, there are some interesting turns; partial data sets, an actor turned data broker, a disclosure notice (almost) nobody can load and bad actors peddling partial sets of data. See what you make of this one, I'm sure there'll be insights come to light on this yet.
I decided to write this post because there's no concise way to explain the nuances of what's being described as one of the largest data breaches ever. Usually, it's easy to articulate a data breach; a service people provide their information to had someone snag it through an act of unauthorised access and publish a discrete corpus of information that can be attributed back to that source. But in the case of National Public Data, we're talking about a data aggregator most people had never heard of where a "threat actor" has published various partial sets of data with no clear way to attribute it back to the source. And they're already the subject of a class action, to add yet another variable into the mix. I've been collating information related to this incident over the last couple of months, so let me talk about what's known about the incident, what data is circulating and what remains a bit of a mystery.
Let's start with the easy bit - who is National Public Data (NPD)? They're what we refer to as a "data aggregator", that is they provide services based on the large volumes of personal information they hold. From the front page of their website:
Criminal Records, Background Checks and more. Our services are currently used by investigators, background check websites, data resellers, mobile apps, applications and more.
There are many legally operating data aggregators out there... and there are many that end up with their data in Have I Been Pwned (HIBP). For example, Master Deeds, Exactis and Adapt, to name but a few. In April, we started seeing news of National Public Data and billions of breached records, with one of the first references coming from the Dark Web Intelligence account:
USDoD Allegedly Breached National Public Data Database, Selling 2.9 Billion Records https://t.co/emQIZ0lgsn pic.twitter.com/Tt8UNppPSu
— Dark Web Intelligence (@DailyDarkWeb) April 8, 2024
Back then, the breach was attributed to "USDoD", a name to remember as you'll see that throughout this post. The embedded image is the first reference of the 2.9B number we've subsequently seen flashed all over the press, and it's right there alongside the request of $3.5M for the data. Clearly, there is a financial motive involved here, so keep that in mind as we dig further into the story. That image also refers to 200GB of compressed data that expands out to 4TB when uncompressed, but that's not what initially caught my eye. Instead, something quite obvious in the embedded image doesn't add up: if this data is "the entire population of USA, CA and UK" (which is ~450M people in total), what's the 2.9B number we keep seeing? Because that doesn't reconcile with reports about "nearly 3 billion people" with social security numbers exposed. Further, SSNs are a rather American construct with Canada having SINs (Social Insurance Number) and the UK having, well, NI (National Insurance) numbers are probably the closestequivalent. This is the constant theme you'll read about in this post, stuff just being a bit... off. But hyperbole is often a theme with incidents like this, so let's take the headlines with a grain of salt and see what the data tells us.
I was first sent data allegedly sourced from NPD in early June. The corpus I received reconciled with what vx-underground reported on around the same time (note their reference to the 8th of April, which also lines up with the previous tweet):
April 8th, 2024, a Threat Actor operating under the moniker "USDoD" placed a large database up for sale on Breached titled: "National Public Data". They claimed it contained 2,900,000,000 records on United States citizens. They put the data up for sale for $3,500,000.
— vx-underground (@vxunderground) June 1, 2024
National…
In their message, they refer to having received data totalling 277.1GB uncompressed, which aligns with the sum total of the 2 files I received:
They also mentioned the data contains first and last names, addresses and SSNs, all of which appear in the first file above (among other fields):
These first rows also line up precisely with the post Dark Web Intelligence included in the earlier tweet. And in case you're looking at it and thinking "that's the same SSN repeated across multiple rows with different names", those records are all the same people, just with the names represented in different orders and with different addresses (all in the same city). In other words, those 6 rows only represent one person, which got me thinking about the ratio of rows to distinct numbers. Curious, I took 100M samples and found that only 31% of the rows had unique SSNs, so extrapolating that out, 2.9B would be more like 899M. This is something to always be conscious of when you read headline numbers: "2.9B" doesn't necessarily mean 2.9B people, it often means rows of data. Speaking of which, those 2 files contain 1,698,302,004 and 997,379,506 rows respectively for a combined total of 2.696B. Is this where the headline number comes from? Perhaps, it's close, and it's also precisely the same as Bleeping Computer reported a few days ago.
At this point in the story, there's no question that there is legitimate data in there. From the aforementioned Bleeping Computer story:
numerous people have confirmed to us that it included their and family members' legitimate information, including those who are deceased
And in vx-underground's tweet, they mention that:
It also allowed us to find their parents, and nearest siblings. We were able to identify someones parents, deceased relatives, Uncles, Aunts, and Cousins. Additionally, we can confirm this database also contains informed on individuals who are deceased. Some individuals located had been deceased for nearly 2 decades.
A quick tangential observation in the same tweet:
The database DOES NOT contain information from individuals who use data opt-out services. Every person who used some sort of data opt-out service was not present.
Which is what you'd expect from a legally operating data aggregator service. It's a minor point, but it does support the claim that the data came from NPD.
Important: None of the data discussed so far contains email addresses. That doesn't necessarily make it any less impactful for those involved, but it's an important point I'll come back to later as it relates to HIBP.
So, this data appeared in limited circulation as early as 3 months ago. It contains a huge amount of personal information (even if it isn't "2.9B people"), and then to make matters worse, it was posted publicly last week:
National Public Data, a service by Jerico Pictures Inc., suffered #databreach. Hacker “Fenice” leaked 2.9b records with personal details, including full names, addresses, & SSNs in plain text. https://t.co/fXY3SXEiKe
— Wolf Technology Group (@WolfTech) August 6, 2024
Who knows who "Fenice" is and what role they play, but clearly multiple parties had access to this data well in advance of last week. I've reviewed what they posted, and it aligns with what I was sent 2 months ago, which is bad. But on the flip side, at least it has allowed services designed to protect data breach victims to get notices out to them:
Twice this week I was alerted my SSN was found on the web thanks to a data breach at National Public Data. Cool. Thanks guys. pic.twitter.com/FAlfNmXUqm
— MrsNineTales (@MrsNineTales) August 8, 2024
Inevitably, breaches of this nature result in legal action, which, as I mentioned in the opening paragraph, began a couple of weeks ago. It looks like a tip-off from a data protection service was enough for someone to bring a case against NPD:
Named plaintiff Christopher Hofmann, a California resident, said he received a notification from his identity-theft protection service provider on July 24, notifying him that his data was exposed in a breach and leaked on the dark web.
Up until this point, pretty much everything lines up, but for one thing: Where is the 4TB of data? And this is where it gets messy as we're now into the territory of "partial" data. For example, this corpus from last month was posted to a popular hacking forum:
National Public Database Allegedly Partially Leaked
— Dark Web Intelligence (@DailyDarkWeb) July 23, 2024
It is stated that nearly 80 GB of sensitive data from the National Public Data is available.
The post contains different credits for the leakage and the alleged breach was credited to a threat actor “Sxul” and stressed that it… https://t.co/v8uq0o88NS pic.twitter.com/a6dn3MvYkf
That's 80GB, and whilst it's not clear whether that's the size of the compressed or extracted archive, either way, it's still a long way short of the full alleged 4TB. Do take note of the file name in the embedded image, though - "people_data-935660398-959524741.csv" - as this will come up again later on.
Earlier this month, a 27-part corpus of data alleged to have come from NPD was posted to Telegram, this image representing the first 10 parts at 4GB each:
The compressed archive files totalled 104GB and contained what feels like a fairly random collection of data:
Many of these files are archives themselves, with many of those then containing yet more archives. I went through and recursively extracted everything which resulted in a total corpus of 642GB of uncompressed data across more than 1k files. If this is "partial", what was the story with the 80GB "partial" from last month? Who knows, but in the in those files above were 134M unique email addresses.
Just to take stock of where we're at, we've got the first set of SSN data which is legitimate and contains no email addresses yet is allegedly only a small part of the total NPD corpus. Then we've got this second set of data which is larger and has tens of millions of email addresses yet is pretty random in appearance. The burning question I was trying to answer is "is it legit?"
The problem with verifying breaches sourced from data aggregators is that nobody willingly - knowingly - provides their data to them, so I can't do my usual trick of just asking impacted HIBP subscribers if they'd used NPD before. Usually, I also can't just look at a data aggregator breach and find pointers that tie it back to the company in question due to references in the data mentioning their service. In part, that's because this data is just so damn generic. Take the earlier screenshot with the SSN data; how many different places have your first and last name, address, SSN, etc? Attributing a source when there's only generic data to go by is extremely difficult.
The kludge of different file types and naming conventions in the image above worried me. Is this actually all from NPD? Usually, you'd see some sort of continuity, for example, a heap of .json files with similar names or a swathe of .sql files with each one representing a dumped table. The presence of "people_data-935660398-959524741.csv" ties this corpus together with the one from the earlier tweet, but then there's stuff like "Accuitty_10_1_2022.zip"; could that refer to Acuity (single "c", single "t") which I wrote about in November? HIBP isn't returning hits for email addresses in that folder against the Acuity I loaded last year, so no, it's a different corpus. But that archive alone ended up having over 250GB of data with almost 100M unique email addresses, so it forms a substantial part of the overall corpus of data.
The 3,608,086KB "criminal_export.csv.zip" file caught my eye, in part because criminal record checks are a key component NPD's services, but also because it was only a few months ago we saw another breach containing 70M rows from a US criminal database. And see who that breach was attributed to? USDoD, the same party whose name is all over the NPD breach. I did actually receive that data but filed it away and didn't load it into HIBP as there were no email addresses in it. I wonder if the data from that story lines up with the file in the image above? Let's check the archives:
Different file name, but hey, it's a 3,608,086KB file! Given the NPD breach initially occurred in April and the criminal data hit the news in May, it's entirely possible the latter was obtained from the former, but I couldn't find any mention of this correlation anywhere. (Side note: this is a perfect example of why I retain breaches in offline storage after processing because they're so often helpful when assessing the origin and legitimacy of new breaches).
Continuing the search for oddities, I decided to see if I myself was in there. On many occasions now, I've loaded a breach, started the notification process running, walked away from the PC then received an email from myself about being in the breach 🤦♂️ I'm continually surprised by the places I find myself in, including this one:
Dammit! It's an email address of mine, yet clearly, none of the other data is mine. Not my name, not my address, and the obfuscated numbers definitely aren't familiar to me (I don't believe they're SSNs or other sensitive identifiers, but because I can't be sure, I've obfuscated them). I suspect one of those numbers is a serialised date of birth, but of the total 28 rows with my email address on them, the two unique DoBs put "me" as being born in either 1936 or 1967. Both are a long way from the truth.
A cursory review of the other data in this corpus revealed a wide array of different personal attributes. One file contained information such as height, weight, eye colour, and ethnicity. The "uk.txt" file in the image above merely contained a business directory with public information. I could have dug deeper, but by now, there was no point. There's clearly some degree of invalid data in here, there's definitely data we've seen appear separately as a discrete breach, and there are many different versions of "partial" NPD data (although the 27-part archive discussed here is the largest I saw and the one I was most consistently directed to by other people). The more I searched, the more bits and pieces attributed back to NPD I found:
If I were to take a guess, there are two likely explanations for what we're seeing:
Both of these are purely speculative, though, and the only parties that know the truth are the anonymous threat actors passing the data around and the data aggregator that's now being sued in a class action, so yeah, we're not going to see any reliable clarification any time soon. Instead, we're left with 134M email addresses in public circulation and no clear origin or accountability. I sat on the fence about what to do with this data for days, not sure whether I should load it and, if I did, whether I should write about it. Eventually, I decided it deserved a place in HIBP as an unverified breach, and per the opening sentence, this blog post was the only way I could properly explain the nuances of what I discovered. This way, impacted people will know if their data is floating around in this corpus, and if they find this information unactionable, then they can do precisely what they would have done had I not loaded it - nothing.
Lastly, I want to re-emphasise a point I made earlier on: there were no email addresses in the social security number files. If you find yourself in this data breach via HIBP, there's no evidence your SSN was leaked, and if you're in the same boat as me, the data next to your record may not even be correct. And no, I don't have a mechanism to load additional attributes beyond email address into HIBP nor point people in the direction of the source data (some of you will have received a reminder about why I don't do that just a few days ago). And I'm definitely not equipped to be your personal lookup service, manually trawling through the data and pulling out individual records for you! So, treat this as informational only, an intriguing story that doesn't require any further action.
When is a breach a breach? If it's been breached then re-breached, is the second incident still a breach? Here's what the masses said when I asked if they'd want to know when something like this happened to their data:
If you're in a breach and your data is aggregated by a third party, then *they* have a breach that discloses your data (again), would you want to know? Should this constitute a notifiable breach?
— Troy Hunt (@troyhunt) August 5, 2024
And what if that second incident wasn't a breach per se, but rather a legitimate service being abused to locate where the re-breached data was? That seems to be the situation with SOCRadar, but regardless of the precise mechanics, there's now another 282M breached records in HIBP. Full story in this week's video:
The ongoing scourge that is spyware (or, as it is commonly known, "stalkerware"), and the subsequent breaches that so often befall them continue to amaze me. More specifically, it's the way they tackle the non-consensual spying aspect of the service which, on the one hand is represented as a big "no-no" but on the others hand, the likes of Spytech in this week's update literally have a dedicated page for! Ok, so they say "get consent first" on the page, but only after pre-positioning the service as a way to catch cheating spouses! And further, the testimonials page has multiple references to people doing precisely this! Do you think the cheating spouses were aware that spyware was installed before using that very device to carry out extramarital affairs?! The mind boggles... 🤯
TL;DR — Tens of millions of credentials obtained from info stealer logs populated by malware were posted to Telegram channels last month and used to shake down companies for bug bounties under the misrepresentation the data originated from their service.
How many attempted scams do you get each day? I woke up to yet another "redeem your points" SMS this morning, I'll probably receive a phone call from "my bank" today (edit: I was close, it was "Amazon Prime" 🤷♂️) and don't even get me started on my inbox. We're bombarded to the point of desensitisation, which itself is dangerous because it creates the risk of inadvertently dismissing something that really does require your attention. Which brings me to the email Scott Helme from Report URI (disclosure: a service I've long partnered with and advised) received yesterday titled "Bug bounty Program - PII leak Credentials more than 170". It began as follows:
Through open-source intelligence gathering, I discovered a significant amount of "report-uri.com" user credentials and sensitive documents have been leaked and are publicly accessible.
The sender then attached a text file with 197 lines of email addresses and passwords belonging to users of Scott's pride and joy. The first lines looked like this (url:email:password):
Imagine the heart-in-mouth moment he had when first seeing that; had someone compromised his service? Was this the data of his customers who had entrusted it to him and it was now floating around the internet? Isn't he the guy who's meant to be teaching others about application security?! The email went on:
The impact of this vulnerability is severe, potentially resulting in:
Mass account takeovers by malicious actors.
Exposure of sensitive user data including names, emails, addresses, and documents.
Unauthorized transactions or malicious activities using compromised accounts.
Further compromise of organizational infrastructure through account abuse.
Financial and reputational damage due to security breaches.
Just to avoid any semblance of doubt as to the motive of the sender, the subject began by flagging the desire for a bug bounty (Report URI does not advertise a bounty program, but clearly a reward was being sought), followed by an email body stating it related to leaked Report URI credentials and then highlighted that "this vulnerability is severe". And then there's that last line about financial and reputation damage. It looked bad. However, cooler heads prevailed, and we started looking closer at the email addresses in the "breach" by checking them against Have I Been Pwned. Very quickly, a pattern emerged:
Most of the addresses we checked had appeared in the lists posted to Telegram I'd loaded into HIBP a couple of months ago. These were stealer logs, not a breach of Report URI! To validate that assertion, I pulled the original data source and parsed out every line containing "report-uri.com". Sure enough, the lines from the file sent to Scott were usually contained in the stealer log files. So, let's talk about how this works:
Take the URL you saw at the beginning of each line earlier on, the one being for the registration page. Here's what it looks like:
Now, imagine you're filling out this form and your machine is infected with malware that can observe the data entered into each field. It takes that data, "steals" it and logs it at the attacker's server, hence the term "info stealer logs". There is absolutely nothing Scott can do to prevent this; the user's machine is compromised, not Report URI.
To illustrate the point, I grabbed the first email address in the file Scott was sent and pulled the rows just for that address rather than solely the Report URI rows. This would show us all the other services this person's credentials were snared from, and there were dozens. Here are just the first ten:
Google. Apple. Twitter. Most with the same password too, because a normal person obviously owns this email address. So, has each of these organisations also received a beg bounty? No, that's not a typo, this is classic behaviour where unsophisticated and self-proclaimed "security researchers" use automated tooling to identify largely benign security configurations that could be construed as vulnerabilities. For example, they'll send through a report that an SPF record is too permissive (they probably can't even spell "SPF", let alone understand the nuances of sender policies), then try to shake people like Scott down for money under the guise of a "bug bounty". This isn't Scott's problem, nor is it Google's or Apple's or Twitter's, it's something only the malware infected victim's can address.
In this post, I referred to "most" of the addresses already being in HIBP and the lines from the file he was sent "usually" occurring in the logs I had. But there were gaps. For example, whilst there were 197 rows in "his" file, I only found 161 in the data I'd previously loaded. But I had a hunch on how to fill that gap and make up the difference...
Two weeks ago, I was sent a further 22GB of stealer logs found in Telegram channels. Unlike the previous corpus of data, this set contained only stealer logs (no credential stuffing lists) and had a total of 26,105,473 unique email addresses. That's significant, as it implies that every single one of those addresses belongs to someone infected with malware that's stealing their creds. Of the total count, 89.7% had been seen in previous data breaches already in HIBP which is a high crossover, but it also meant that 2,679,550 addresses were all new. I'd been considering whether or not it made sense to load this data given corpuses such as this create frustration when people don't know which site their record was snared from nor which password was impacted. One particular frustration you'll read in comments on the previous post was that people weren't sure whether their email address was in a stealer log or a credential stuffing list; did they have a machine infected with malware or was it merely recycled credentials from an old data breach? But given the way in which this new corpus of data is being used (to attempt to scam Scott and, one would assume, many others), the 7-figure number of previously unseen addresses and the fact that this time, they can all emphatically be tied back to malware campaigns, this is now searchable in HIBP as "Stealer Logs Posted to Telegram".
Ultimately, this is just scam on top of scam: the victims in the logs have had their credentials scammed, and the person who emailed Scott attempted to use that to scam him out of a bounty. Making data like this searchable in HIBP helps people do exactly what I did as soon as Scott forwarded me over the email: validate the origin and as Scott will now do, send a terse reply encouraging the guy to show some decency and stop with the beg bounties.
Lastly, I'm increasingly conscious of how useful the information contained in stealer logs is to organisations like Report URI, and after loading that previous corpus posted from Telegram, I did help out a few companies who thought they might have been hit by it. The position they were coming from was "we keep seeing account takeovers by what looks like credential stuffing attacks, but the attackers are getting the credentials right on the first go". When I pulled the data for their domain as I later did for Report URI, the email addresses were precisely the ones being targeted for account takeover. I want to address this via HIBP, but it's non-trivial for a variety of reasons, especially those related to privacy. In order for this data to be useful to companies like Report URI, I'd need to give them other people's email addresses (the password wouldn't be necessary) based on the assumption they were customers of the service. I'm working out how to do this in way that makes sense for everyone (well, everyone except for the bad guys), stay tuned for more and please do chime in via the comments if you have ideas on how to turn this into a useful service.
Who would have thought that just a few hours after recording the previous week's video, the world would descend into what has undoubtedly become the largest IT outage we've ever seen:
I don’t think it’s too early to call it: this will be the largest IT outage in history
— Troy Hunt (@troyhunt) July 19, 2024
By virtue of the CrowdStrike incident occurring in friendly office hours for my corner of the world, I was able to get a thread on it going pretty early on. That tweet above has been seen 315k times at the time of writing and ended up being splashed across the global media. Unsurprisingly, I then spent the better part of the next three days doing endless interviews... and very little else. The question that constantly came up was "could this happen again?" to which the answer is obviously "yes". However, it was unprecedented despite the huge number of previous updates CloudStrike had sent out, not to mention all the times Windows itself has updated without a calamity anywhere near the scape of this one. But the mind does wander - "what if?" - and you think about just how bad this could be if things went wrong at just the right point 🤔
Just over 13 years ago, Microsoft gave me my first "Most Valuable Professional" award. Out of the blue, as far as I was concerned. It wasn't something I'd planned for and it certainly wasn't something I'd expected, but it has become a cornerstone of my professional identity.
Indulge me while I go off on a bit of a tangent here: like the other things in my professional life that have turned into a success, the things I did to earn that first MVP award were things I was going to do anyway. Things like speaking at events, writing blog posts, and, of course, running Have I Been Pwned. It's an enormous privilege that these activities are things I do out of passion, and they've given me a successful career. That's wild, and I can still hardly believe it 😊
As I've said each year when this award has landed in my inbox, and I'll say again now after the 14th one just landed, thank you for giving me the platform on which I've been able to build this career. It takes people like you reading this now to turn up to my talks, consume the posts I write and use HIBP to do useful things after data breaches happen to make me successful at what I do.
So, cheers to everyone who has contributed to my 14th Microsoft MVP award 🍻
It feels weird to be writing anything right now that isn't somehow related to Friday's CrowdStrike incident, but given I recorded this video just a few hours before all hell broke loose, it'll have to wait until next week. This week, the issue that really has me worked up is data breach victim notification or more specifically, lack thereof. Following my time in Melbourne and Canberra during the week where I spent a bunch of time with smart people close to the legal, political and law enforcement aspects of infosec, it really hit home how aligned most of us are on protecting the individual victims. Most, but not all; the corporate victims (and yes, companies who suffer data breaches are still victims themselves), rarely set individual victim notification as a priority. That sucks, and it's at direct odds with the messaging we're now hearing loud and clear from our own government. I'm giving a lot of thought to how we bridge that gap so stay tuned, this area has to get better. Much better.
I get the frustration and anger those working at organisations that have been breached feel, and I've seen it firsthand in my communications with them on so many prior occasions. They're the victim of a criminal act and they're rightly outraged. However... thinking back to similar examples to The Heritage Foundation situation this week, I can't think of a single case where losing your mind and becoming abusive has ever worked out well. In fact, it usually just has the effect of losing the victim sympathy whilst an engrossed audience watches a slow-motion trainwreck get worse and worse. That it came from a spokesperson at an organisation that prides itself on religious righteousness makes the whole situation all the more perplexing. Perplexing, but admittedly, entertaining to watch.
It's a long one this week, in part due to the constant flood of new breaches and disclosures I discuss. I regularly have disclosure notices forwarded to me by followers who find themselves in new breaches, and it's always fascinating to hear how they're worded. You get a real sense of how much personal ownership a company is taking, how much blame they're putting back on the hackers and increasingly, how much they've been written by lawyers. That last one, in particular, seems to have a knack for diluting all the useful information into high-level generic statements that tell you very little about what's actually happened. See if you can spot those in this week's disclosure notices. Once you see the patterns, you'll be spotting them all over the place in the future.
Last week, I wrote about The State of Data Breaches and got loads of feedback. It was predominantly sympathetic to the position I find myself in running HIBP, and that post was mostly one of frustration: lack of disclosure, standoffish organisations, downplaying breaches and the individual breach victims themselves making it worse by going to town on the corporate victims. But the other angle that's been milling around in my brain is the one represented by the image here:
Running HIBP has become a constant balancing act between a trilogy of three parties: hackers, corporate victims and law enforcement. Let me explain:
This is where most data breaches begin, with someone illegally accessing a protected system and snagging the data. That's a high-level generalisation, of course, but whether it's exploiting software vulnerabilities, downloading exposed database backups or phishing admin credentials and then grabbing the data, it's all in the same realm of taking something that isn't theirs. And sometimes, they contact me.
This is a hard position to find myself in, primarily because I need to weigh the potentially competing objectives of notifying impacted HIBP subscribers whilst simultaneously not pandering to the perverse incentives of likely criminals. Sometimes, it's easy: when someone reports exposed data or a security vulnerability, the advice is to contact the company involved and not turn it into a data breach. But when they already have the data, by definition it's now a breach and inevitably a bunch of my subscribers are in there. It's awkward, talking to the first party responsible for the breach.
There are all sorts of circumstances that may make it even more awkward, for example if the hacker is actively trying to shake the company down for money. Perhaps they're selling the data on the breach market. Maybe they also still have access to the corporate system. Having a discussion with someone in that position is delicate, and throughout it all, I'm conscious that they may very well end up in custody and every discussion we've had will be seen by law enforcement. Every single word I write is predicated on that assumption. And eventually, being caught is a very likely outcome; just as we say that as defenders we need to get it right every single time and the hacker only needs to get it right once, as hackers, they need to get their opsec right every single time and it only takes that one little mistake to bring them undone. A dropped VPN connection. An email address, handle or password used somewhere else that links to their identity. An incorrect assumption about the anonymity of cryptocurrency. One. Little. Mistake.
However, I also need to treat these discussions as confidential. The expectation when people reach out is that they can confide in me, and that's due to the trust I've built over more than a decade of running this service. Relaying those conversations without their permission could destroy that reputation in a heartbeat. So, I often find myself brokering conversations between the three parties mentioned here, providing contact details back and forth or relaying messages with the consent of each party.
This sort of communication gets messy: you've got the hacker (who's often suspicious of big corp) trying to draw attention to an issue, but they're trying to communicate with a party who's also naturally suspicious of anonymous characters who've accessed their data! And law enforcement is, of course, interested in the hacker because that's their job, but they're also respectful of the role I play and the confidence with which data is shared with me. Meanwhile, law enforcement is also often engaged by the corporate victim and now we've got all players conversing with each other and me in the middle.
I say this not to be grandiose about what I do, but to explain the delicate balance with which many of these data breaches need to be handled. Then, that's all wrapped in with the observations from the previous post about lack of urgency etc.
I choose to use this term because it's all too easy for people to point at a company that's suffered a data breach and level blame at them. Depending on the circumstances, some blame is likely warranted, but make no mistake: breached companies are usually the target of intentional, malicious, criminal activity. And when I say "companies", we're ultimately talking about individuals who are usually doing the best they can at their jobs and, during a period of incident response, are often having the worst time of their careers. I've heard the pain in their voices and seen the stress on their faces on so many prior occasions, and I want to make sure that the human element of this isn't lost amidst the chants of angry customers.
The way in which corporate victims engage with hackers is particularly delicate. They're understandably angry, but they're also walking the tightrope of trying to learn as much as they can about the incident (the vector by which data was obtained often isn't known in the early stages), whilst listening to often exorbitant demands and not losing their cool. It's very easy for the party who has always worked on the basis of anonymity to simply "go dark" and disappear altogether, and then what? We can see this balancing act in many of the communications later released by hackers, often after they've failed to secure the expected ransom payment; you have extremely polite corporations... who you know want nothing more than to have the guy thrown into prison!
The law enforcement angle, or perhaps, to put it more broadly, the interactions with government authorities in general, is an interesting one. Beyond the obvious engagements around the criminal activity of hackers, the corporate victims themselves have legal responsibilities. This is obviously highly dependent on jurisdiction and regulatory controls, but it may mean reporting the breach to the appropriate government entity, for example. It may even mean reporting to many government entities (i.e. state-based) depending on where they are in the world. Then there's the question of their own culpability and whether the actions they took (or didn't take) both pre and post-breach may result in punitive measures being taken. I had a headline in the previous post that included the term "covering their arses" and this doesn't just mean from customer or shareholder backlash, but increasingly, from massive corporate fines.
I suspect, based on many previous experiences, that corporations have a love-hate relationship with law enforcement. They obviously want their support when it comes to dealing with the criminals, but they're extraordinarily cautious about what they disclose lest it later contribute to the basis on which penalties are levelled against them. Imagine the balancing act involved when the corporate victims suspects the breach occurred due to some massive oversights on their behalf and they approach law enforcement for support: "So, how do you think they got in? Uh..."
Like I've already said so many times in this post: "delicate".
This is the most multidimensional player in the trilogy, interfacing backwards and forwards with each party in various ways. Most obviously, they're there to bring criminals to justice, and that clearly puts hackers well within their remit. I've often referred to "the FBI and friends" or similar terms that illustrate how much of a partnership international law enforcement efforts are, as is regularly evidenced by the takedown notices on cybercrime initiatives:
The hackers themselves are often all too eager to engage with law enforcement too. Sometimes to taunt, other times to outright target, often at a very individual level such as naming specific agents. It should be said also that "hacker" is a very broad term that, at its worst, is outright criminal activity intended to be destructive for their own financial gain. But at the other end of the scale is a much more nuanced space where folks who may be labelled with this title aren't necessarily malicious in their intent but to paraphrase: "I was poking around and I found something, can you help me report it to the authorities".
The engagement between law enforcement and corporate victims often begins with the latter reporting an incident. We see this all the time in disclosure statements "we've notified the authorities", and that's a very natural outcome following a criminal act. It's not just the hacking itself, this is often accompanied by a ransom demand which piles on yet another criminal activity that needs to be referred to the authorities. Conversely, law enforcement regularly sees early indications of compromise before the corporate victim does and is able to communicate that directly. Increasingly, we're seeing formal government entities issue much broader infosec advice, for example, as our Australian Signals Directorate regularly does.
I often end up finding myself in a variety of different roles with law enforcement agencies. For example, providing a pipeline for the FBI to feed breached passwords into, supporting the Estonian Central Criminal Police by making data impacting their citizens searchable, spending time with the Dutch police on victim notification, and even testifying in front of US Congress. And, of course, supporting three dozen national CERTs around the world with open access to exposure of their federal domains in HIBP. Many of these agencies also have a natural interest in the folks who contact me, especially from that first category listed above. That said, I've always found law enforcement to be respectful of the confidence with which hackers share information with me; they understand the importance of the trust I mentioned earlier on, and it's significance in playing the role I do.
A decade on, I still find this to be an odd space to occupy, sitting on the fringe and sometimes right in the middle of the interactions between these three parties. It's unpredictable, fascinating, exciting, stressful, and I hope you found this interesting reading 🙂
Why does it need to be a crazy data breach week right when I'm struggling with jet lag?! I came home from Europe just as a bunch of the Snowflake-sourced breaches started being publicly dumped, and things went a little crazy. Lots of data to review, lots of media enquiries and many discussions with impacted individuals, breached companies, incident response folks and law enforcement agencies. This situation is wreaking absolute havoc, and I suspect it has a way to run yet with only a small slice of the data from the apparent 165 impacted orgs appearing online so far. Looks like another interesting week ahead.
I've been harbouring some thoughts about the state of data breaches over recent months, and I feel they've finally manifested themselves into a cohesive enough story to write down. Parts of this story relate to very sensitive incidents and parts to criminal activity, not just on behalf of those executing data breaches but also very likely on behalf of some organisations handling them. As such, I'm not going to refer to any specific incidents or company names, rather I'm going to speak more generally to what I'm seeing in the industry.
Generally, when I disclose a breach to an impacted company, it's already out there in circulation and for all I know, the company is already aware of it. Or not. And that's the problem: a data breach circulating broadly on a popular clear web hacking forum doesn't mean the incident is known by the corporate victim. Now, if I can find press about the incident, then I have a pretty high degree of confidence that someone has at least tried to notify the company involved (journos generally reach out for comment when writing about a breach), but often that's non-existent. So, too, are any public statements from the company, and I very often haven't seen any breach notifications sent to impacted individuals either (I usually have a slew of these forwarded to me after they're sent out). So, I attempt to get in touch, and this is where the pain begins.
I've written before on many occasions about how hard it can be to contact a company and disclose a breach to them. Often, contact details aren't easily discoverable; if they are, they may be for sales, customer support, or some other capacity that's used to getting bombarded with spam. Is it any wonder, then, that so many breach disclosures that I (and others) attempt to make end up going to the spam folder? I've heard this so many times before after a breach ends up in the headlines - "we did have someone try to reach out to us, but we thought it was junk" - which then often results in news of the incident going public before the company has had an opportunity to respond. That's not good for anyone; the breached firm is caught off-guard, they may very well direct their ire at the reporter, and it may also be that the underlying flaw remains unpatched, and now you've got a bunch more people looking for it.
An approach like security.txt is meant to fix this, and I'm enormously supportive of this, but in my experience, there are usually two problems:
That one instance was so exceptional that, honestly, I hadn't even looked for the file before asking the public for a security contact at the firm. Shame on me for that, but is it any wonder?
Once I do manage to make contact, I'd say about half the time, the organisation is good to deal with. They often already know of HIBP and are already using it themselves for domain searches. We've joked before (the company and I) that they're grateful for the service but never wanted to hear from me!
The other half of the time, the response borders on open hostility. In one case that comes to mind, I got an email from their lawyer after finally tracking down a C-suite tech exec via LinkedIn and sending them a message. It wasn't threatening, but I had to go through a series of to-and-fro explaining what HIBP was, why I had their data and how the process usually unfolded. When in these positions, I find myself having to try and talk up the legitimacy of my service without sounding conceited, especially as it relates to publicly documented relationships with law enforcement agencies. It's laborious.
My approach during disclosure usually involves laying out the facts, pointing out where data has been published, and offering to provide the data to the impacted organisation if they can't obtain it themselves. I then ask about their timelines for notifying impacted customers and welcome their commentary to be included in the HIBP notifications sent to our subscribers. This last point is where things get more interesting, so let's talk about breach notifications.
This is perhaps one of my greatest bugbears right now and whilst the title will give you a pretty good sense of where I'm going, the nuances make this particularly interesting.
I suggest that most of us believe that if your personal information is compromised in a data breach, you'll be notified following this discovery by the organisation responsible for the service. Whether it's one day, one week, or even a month later isn't really the issue; frankly, any of these time frames would be a good step forward from where we frequently find ourselves. But constantly, I'm finding that companies are taking the position of consciously not notifying individuals at all. Let me give you a handful of examples:
During the disclosure process of a recent breach, it turned out the organisation was already aware of the incident and had taken "appropriate measures" (their term was something akin to that being vague enough to avoid saying what had been done, but, uh, "something" had been done). When pressed for a breach notice that would go to their customers, they advised they wouldn't be sending one as the incident had occurred more than 6 months ago. That stunned me - the outright admission that they wouldn't be communicating this incident - and in case you're thinking "this would never be allowed under GDPR", the company was HQ'd well within that scope being based in a major European city.
Another one that I need to be especially vague about (for reasons that will soon become obvious), involved a sizeable breach of customer data with the folks exposed inhabiting every corner of the globe. During my disclosure to them, I pushed them on a timeline for notifying victims and found their responses to be indirect but almost certainly indicating they'd never speak publicly about it. Statements to the effect of "we'll send notifications where we deem we're legally obligated to", which clearly left it up to them to make the determination. I later learned from a contact close to the incident that this particular organisation had an impending earnings call and didn't want the market to react negatively to news of a breach. "Uh, you know that's a whole different thing if they deliberately cover that up, right?"
An important point to make here, though, is that when it comes to companies themselves disclosing they've been breached, disclosure to individuals is often not what people think it is. In the various regulatory regimes we have across the globe, the legal requirement often stops at notifying the regulator and does not extend to notifying the individual victims. This surprises many people, and I constantly hear the rant of "But I'm in [insert your country here], and we have laws that demand I'm notified!" No, you almost certainly don't... but you should. We all should.
You can see further evidence by looking at recent Form 8-K SEC filings in the US. There are many examples of filings from companies that never notified the individuals themselves, yet here, you'll clearly see disclosure to the regulator. The breach is known, it's been reported in the public domain, but good luck ever getting an email about it yourself.
During one disclosure, I had the good fortune of a very close friend of mine working for the company involved in an infosec capacity. They were clearly stalling, being well over a week from my disclosure yet no public statements or notices to impacted individuals. I had a quiet chat with my contact, who explained it as follows:
Mate, it's a room full of lawyers working out how to spin this
Meanwhile, millions of records of customer data were in the hands of criminals, and every hour that went by was another hour victims went without any knowledge whatsoever that their personal info had been exposed. And as much as it pains me to say this, I get it: the company's priority is the company or, more specifically, the shareholders. That's who the board is accountable to, and maintaining the corporate reputation and profitability of the firm is their number one priority.
I see this all the time in post-breach communication too. One incident that comes to mind was the result of some egregiously stupid technical decisions. Once that breach hit the press, the CEO immediately went on the offence. Blame was laid firstly at those who obtained the data, then at me for my reporting of the incident (my own disclosure was absolutely "by the book").
I'm talking about class actions. I wrote about my views on this a few years ago and nothing has changed, other than it getting worse. I regularly hear from data breach victims about them wanting compensation for the impact a breach has had on them yet when pushed, most struggle to explain why. We've had multiple recent incidents in Australia where drivers' licences have been exposed and required reissuing, which is usually a process of going to a local transport office and waiting in a queue. "Are you looking for your time to be compensated for?", I asked one person. We have to rotate our licenses every 5 years anyway, so would you pro-rata that time based on the hourly value of your time and when you were due to be back in there anyway? And if there has been identity theft, was it from the breach you're now seeking compensation for? Or the other ones (both known and unknown) from which your data was taken?
Lawyers are a big part of the problem, and I still regularly hear from them seeking product placement on HIBP. What a time and a place to cash in if you could get your class action pitch right there in front of people at the moment they learn they were in a breach!
Frankly, I don't care too much about individuals getting a few bucks in compensation (and it's only ever a few), and I also don't even care about lawyers doing lawyer things. But I do care about the adverse consequences it has on the corporate victims, as it makes my job a hell of a lot harder when I'm talking to a company that's getting ready to get sued because of the information I've just disclosed to them.
These are all intertwined problems without single answers. But there are some clear paths forward:
Firstly, and this seems so obvious that it's frankly ridiculous I need to write it, but there should always be disclosure to individual victims. This may not need to be with the same degree of expeditiousness as disclosure to the regulator, but it has to happen. It is a harder problem for businesses; submitting a form to a gov body can be infinitely easier than emailing potentially hundreds of millions of breached customers. However, it is, without any doubt, the right thing to do and there should be legal constructs that mandate it.
Simultaneously providing protection from frivolous lawsuits where no material harm can be demonstrated and throwing the book at firms who deliberately conceal breaches also seems reasonable. No company is ever immune from a breach, and so frequently, it occurs not due to malicious behaviour by the organisation but a series of often unfortunate events. Ambitious lawyers shouldn't be in a position where they can make hell for a company at their worst possible hour unless there there is significant harm and negligence that can be clearly attributed back to the incident.
And then there's all the periphery stuff that pours fuel on the current dumpster fire. The aforementioned beg bounties that cause companies to be suspicious of even the most genuine disclosures, for example. On the other hand, the standoff-ish behaviour of many organisations receiving reports from folks who just want to see incidents disclosed. Flip side again is the number of people occupying that periphery of "security researcher / extortionist" who cause the aforementioned behaviours described in this paragraph. It's a mess, and writing it down like this makes it so abundantly apparent how many competing objectives there are.
I don't see anything changing any time soon, and anecdotally, it's worse now than it was 5 or 10 years ago. In part, I suspect that's due to how all those undesirable behaviours I described above have evolved over time, and in part I also believe the increasingly complexity of external dependencies is driving this. How many breaches have we seen in just the last year that can be attributed to "a third party"? I quote that term because it's often used by organisations who've been breached as though it somehow absolves them of some responsibility; "it wasn't us who was breached, it was those guys over there". Of course, it doesn't work that way, and more external dependencies leads to more points of failure, all of which you're still accountable for even if you've done everything else right.
Ah well, as I often end up lamenting, it's a fascinating time to be in the industry 🤷♂️
Ah, sunshine! As much as I love being back in Norway, the word "summer" is used very loosely there. Not as much in Greece, however, which is just spectacular:
Finally escaped the bitterly cold Norwegian summer for something… warmer 🇬🇷 pic.twitter.com/jk9knZvJar
— Troy Hunt (@troyhunt) June 17, 2024
3 nights in Mykonos, 2 in Santorini and I'm pushing this post out just before our second night in Athens before embarking on the long journey home. It's been an experience, between the NDC talks in Oslo and the downtime in Greece, but it's time to get home to our gorgeous Gold Coast winter weather ☀️
What a week! The NDC opening keynote and 3D printing talk both went off beautifully, the latter being the first time for 11-year old Elle on stage:
And the pro shots are really cool 😎 pic.twitter.com/ud7ad0pF1x
— Troy Hunt (@troyhunt) June 15, 2024
Videos of both will be available in the coming weeks so stay tuned for them. For now, we're at the end of a mostly cold and rainy Norwegian summer trip, heading to the sunny Greek isles for next week's update 😎
I just watched back a little segment from this week's video and somehow landed at exactly the point where I said "I am starting to lose my patience with repeating the same thing over and over again" (about 46 mins if you want to skip to it), which is precisely how I wanted to start this post. In running HIBP for the last 10 and a bit years, there have been so many breaches where people have asked for the data within them beyond just the email address to be made available. As I say in the video, I understand the reasons for the interest in the data, my frustration is when there's an unwillingness to understand why that isn't feasible, and for so many good reasons.
There's a very simple course of action available for anyone that feels strongly enough about this to be critical of my not providing additional data: do exactly what you would have done had I not loaded anything about this incident into HIBP. Of course, this simply then amounts to "ignorance is bliss" whereby your data is out there but you choose not to know about it, which can also be achieved by unsubscribing from the HIBP notification service. But complaining because I'm unwilling to take on huge amounts of additional overhead and risks whilst running a service on a shoestring that the vast majority of people use for free is just not cool. Alrighty, that feels better, here's the video 🙂
Last week, a security researcher sent me 122GB of data scraped out of thousands of Telegram channels. It contained 1.7k files with 2B lines and 361M unique email addresses of which 151M had never been seen in HIBP before. Alongside those addresses were passwords and, in many cases, the website the data pertains to. I've loaded it into Have I Been Pwned (HIBP) today because there's a huge amount of previously unseen email addresses and based on all the checks I've done, it's legitimate data. That's the high-level overview, now here are the details:
Telegram is a popular messaging platform that makes it easy to stand up a "channel" and share information to those who wish to visit it. As Telegram describes the service, it's simple, private and secure and as such, has become very popular with those wishing to share content anonymously, including content related to data breaches. Many of the breaches I've previously loaded into HIBP have been distributed via Telegram as it's simple to publish this class of data to the platform. Here's what data posted to Telegram often looks like:
These are referred to as "combolists", that is they're combinations of email addresses or usernames and passwords. The combination of these is obviously what's used to authenticate to various services, and we often see attackers using these to mount "credential stuffing" attacks where they use the lists to attempt to access accounts en mass. The list above is simply breaking the combos into their respective email service providers. For example, that last Gmail example contains over a quarter of a million rows like this:
That's only one of many files across many different Telegram channels. The data that was sent to me last week was sourced from 518 different channels and amounted to 1,748 separate files similar to the one above. Some of the files have literally no data (0kb), others are many gigabytes with many tens of millions of rows. For example, the largest file starts like this:
That looks very much like the result of info stealer malware that has obtained credentials as they were entered into websites on compromised machines. For example, the first record appears to have been snared when someone attempted to login to Nike. There's an easy way to get a sense of the accuracy of this data, just head over to the Nike homepage and click the login link which presents the following screen:
They serve the same page to both existing subscribers and new ones but then serve different pages depending on whether the email address already has an account (a classic enumeration vector). Mash the keyboard to create a fake email address and you'll be shown a registration form, but enter the address in the stealer log and, well, you get something different:
The email address has an account, hence the prompt for a password. I'm not going to test the password because that would constitute unauthorised access, but I also don't need to as the goal has already been achieved: I've demonstrated that the address has an account on Nike. (Also note that if the password didn't work it wouldn't necessarily mean it wasn't valid at some point in time at the past, it would simply mean it isn't valid now.)
Footlocker tries to be a bit more clever in avoiding enumeration on password reset, but they'll happily tell you via the registration page if the email address you've entered already exists:
Even the Italian tyre retailer happily confirmed the existence of the tested account:
Time and time again, each service I tested confirmed the presence of the email address in the stealer log. But are (or were) the passwords correct? Again, I'm not going to test those myself, but I have nearly 5M subscribers in HIBP and there's always a handful of them in any new breach that are happy to help out. So, I emailed some of the most recent ones, asked if they could help with verification and upon confirmation, sent them their data.
In reaching out to existing subscribers, I expected some repetition in terms of them already appearing in existing data breaches. For one person already in 13 different breaches in HIBP, this was their response:
Thanks Troy. These details were leaked in previous data breaches.
So accurate, but not new, and several of the breaches for this one were of a similar structure to the one we're talking about today in terms of them being combolists used for credential stuffing attacks. Same with another subscriber who was in 7 prior breaches:
Yes that’s familiar. Most likely would have used those credentials on the previous data breaches.
That one was more interesting as of the 7 prior breaches, only 6 had passwords exposed and none of them were combolists. Instead, it was incidents including MyFitnessPal, 8fit, FlexBooker, Jefit, MyHeritage and ShopBack; have passwords been cracked out of those (most were hashed) and used to create new lists? Very possibly. (Sidenote: this unfortunate person is obviously a bit of a fitness buff and has managed to end up in 3 different "fit" breaches.)
Another subscriber had an entry in the following format, similar to what we saw earlier on in the stealer log:
https://accounts.epicgames.com/login:[email]:[password]
They responded to my queries with the following:
I think that epic games account was for my daughter a couple of years ago but I cancelled it last year from memory. That sds like a password she may have chosen so I'll check with her in an hour or two when I see her again.
And then, a little bit later
My daughter doesn't remember if that was her password as it was 4-5 years ago when she was only 8-9 years old. However it does sound like something she would have chosen so in all probability, I would say that is a legitimate link. We believe it was used when she played a game called Fortnite which she did infrequently at that time hence her memory is sketchy.
I realised that whilst each of these responses confirmed the legitimacy of the data, they really weren't giving me much insight into the factor that made it worth loading into HIBP: the unseen addresses. So, I went through the same process of contacting HIBP subscribers again but this time, only the ones that I'd never seen in a breach before. This would then rule out all the repurposed prior incidents and give me a much better idea of how impactful this data really was. And that's when things got really interesting.
Let's start with the most interesting one and what you're about to see is two hundred rows of stealer logs:
https://steuer.check24.de/customer-center/aff/check24/authentication:[email]:[password]
https://www.disneyplus.com/de-de/reset-password:[email]:[password]
https://auth.rtl.de/auth/realms/rtlplus/protocol/openid-connect/auth:[email]:[password]
https://www.tink.de/checkout/login:[email]:[password]
https://signin.ebay.de/ws/eBayISAPI.dll:[email]:[password]
https://vrr-db-ticketshop.de/authentication/login:[email]:[password]
https://www.planet-sports.de/checkout/register:[email]:[password]
https://www.bstn.com/eu_de/checkout/:[email]:[password]
https://www.lico-nature.de/index.php:[email]:[password]
https://ticketshop.mobil.nrw/authentication/register:[email]:[password]
https://softwareindustrie24.de/checkout/confirm/as/customer:[email]:[password]
https://www.zurbrueggen.de/checkout/register:[email]:[password]
https://www.hertz247.de/ikeage/de-de/SignUp/Profile:[email]:[password]
https://www.bluemovement.com/de-de/checkout2:[email]:[password]
android://pfDvxsQIIXYFer6DxBcqXjgyr9X3z0_f4GlJfpZMErP2oGHX74fUnXpWA29CNgnCyZ_phC8IyV0exIV6hg3iyQ==@com.sixt.reservation/:[email]:[password]
https://members.persil-service.de/login/:[email]:[password]
https://www.nicotel.de/index.php:[email]:[password]
https://www.hellofresh.de/login:[email]:[password]
https://login.live.com/login.srf:[email]:[password]
https://accounts.login.idm.telekom.com/factorx:[email]:[password]
https://grillhaus-bei-reimann.order.dish.co/register:[email]:[password]
https://signup.sipgateteam.de/:[email]:[password]
https://www.baur.de/kasse/registrieren:[email]:[password]
https://buchung.carlundcarla.de/28572879/schritt-3:[email]:[password]
https://www.qvc.de/checkout/your-information.html:[email]:[password]
https://de.omio.com/app/search-frontend/booking/96720342-e20e-4de7-8b21-ddefc0fa44bd/passenger-details:[email]:[password]
https://www.shop-apotheke.com/nx/login/:[email]:[password]
https://druckmittel.de/checkout/confirm:[email]:[password]
https://www.global-carpet.de/checkout/confirm:[email]:[password]
https://software-hero.de/checkout/confirm:[email]:[password]
https://myenergykey.com/login:[email]:[password]
https://www.sixt.de/:[email]:[password]
https://www.wlan-shop24.de/Bestellvorgang:[email]:[password]
https://www.cyberport.de/checkout/anmelden.html:[email]:[password]
https://waschmal.de/registerCustomer:[email]:[password]
https://www.wgv.de/app/moped201802/rechner/abschluss/moped:[email]:[password]
https://www.persil-service.de/signup:[email]:[password]
https://nicotel.de/:[email]:[password]
https://temial.vorwerk.de/register/checkout:[email]:[password]
https://accounts.bahn.de/auth/realms/db/login-actions/required-action:[email]:[password].
https://www.petsdeli.de/login:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://login.live.com/login.srf:[email]:[password]
https://accounts.login.idm.telekom.com/factorx:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://www.zoll-portal.de/registrierung/benutzerkonto/daten:[email]:[password]
https://v3.account.samsung.com/iam/passwords/register:[email]:[password]
https://www.amazon.pl/ap/signin:[email]:[password]
https://www.amazon.de/:[email]:[password]
https://meinkonto.telekom-dienste.de/wiederherstellung/passwort/web-pw-setzen.xhtml:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://steuer.check24.de/customer-center/aff/check24/authentication [email]:[password]
https://www.disneyplus.com/de-de/reset-password [email]:[password]
https://auth.rtl.de/auth/realms/rtlplus/protocol/openid-connect/auth [email]:[password]
https://www.tink.de/checkout/login [email]:[password]
https://signin.ebay.de/ws/eBayISAPI.dll [email]:[password]
https://vrr-db-ticketshop.de/authentication/login [email]:[password]
https://www.planet-sports.de/checkout/register [email]:[password]
https://www.bstn.com/eu_de/checkout/ [email]:[password]
https://www.lico-nature.de/index.php [email]:[password]
https://ticketshop.mobil.nrw/authentication/register [email]:[password]
https://softwareindustrie24.de/checkout/confirm/as/customer [email]:[password]
https://www.zurbrueggen.de/checkout/register [email]:[password]
https://www.hertz247.de/ikeage/de-de/SignUp/Profile [email]:[password]
https://www.bluemovement.com/de-de/checkout2 [email]:[password]
android://pfDvxsQIIXYFer6DxBcqXjgyr9X3z0_f4GlJfpZMErP2oGHX74fUnXpWA29CNgnCyZ_phC8IyV0exIV6hg3iyQ==@com.sixt.reservation/[email]:[password]
https://members.persil-service.de/login/ [email]:[password]
https://www.nicotel.de/index.php [email]:[password]
https://www.hellofresh.de/login [email]:[password]
https://login.live.com/login.srf [email]:[password]
https://accounts.login.idm.telekom.com/factorx [email]:[password]
https://grillhaus-bei-reimann.order.dish.co/register [email]:[password]
https://signup.sipgateteam.de/ [email]:[password]
https://www.baur.de/kasse/registrieren [email]:[password]
https://buchung.carlundcarla.de/28572879/schritt-3 [email]:[password]
https://www.qvc.de/checkout/your-information.html [email]:[password]
https://de.omio.com/app/search-frontend/booking/96720342-e20e-4de7-8b21-ddefc0fa44bd/passenger-details [email]:[password]
https://www.shop-apotheke.com/nx/login/ [email]:[password]
https://druckmittel.de/checkout/confirm [email]:[password]
https://www.global-carpet.de/checkout/confirm [email]:[password]
https://software-hero.de/checkout/confirm [email]:[password]
https://myenergykey.com/login [email]:[password]
https://www.sixt.de/ [email]:[password]
https://www.wlan-shop24.de/Bestellvorgang [email]:[password]
https://www.cyberport.de/checkout/anmelden.html [email]:[password]
https://waschmal.de/registerCustomer [email]:[password]
https://www.wgv.de/app/moped201802/rechner/abschluss/moped [email]:[password]
https://www.persil-service.de/signup [email]:[password]
https://nicotel.de/ [email]:[password]
https://temial.vorwerk.de/register/checkout [email]:[password]
https://accounts.bahn.de/auth/realms/db/login-actions/required-action [email]:[password].
https://www.petsdeli.de/login [email]:[password]
https://www.netflix.com/de/login [email]:[password]
https://login.live.com/login.srf [email]:[password]
https://accounts.login.idm.telekom.com/factorx [email]:[password]
https://www.netflix.com/de/login [email]:[password]
https://www.zoll-portal.de/registrierung/benutzerkonto/daten [email]:[password]
https://v3.account.samsung.com/iam/passwords/register [email]:[password]
https://www.amazon.pl/ap/signin [email]:[password]
https://www.amazon.de/ [email]:[password]
https://meinkonto.telekom-dienste.de/wiederherstellung/passwort/web-pw-setzen.xhtml [email]:[password]
https://www.netflix.com/de/login [email]:[password]
https://steuer.check24.de/customer-center/aff/check24/authentication:[email]:[password]
https://www.disneyplus.com/de-de/reset-password:[email]:[password]
https://auth.rtl.de/auth/realms/rtlplus/protocol/openid-connect/auth:[email]:[password]
https://www.tink.de/checkout/login:[email]:[password]
https://signin.ebay.de/ws/eBayISAPI.dll:[email]:[password]
https://vrr-db-ticketshop.de/authentication/login:[email]:[password]
https://www.planet-sports.de/checkout/register:[email]:[password]
https://www.bstn.com/eu_de/checkout/:[email]:[password]
https://www.lico-nature.de/index.php:[email]:[password]
https://ticketshop.mobil.nrw/authentication/register:[email]:[password]
https://softwareindustrie24.de/checkout/confirm/as/customer:[email]:[password]
https://www.zurbrueggen.de/checkout/register:[email]:[password]
https://www.hertz247.de/ikeage/de-de/SignUp/Profile:[email]:[password]
https://www.bluemovement.com/de-de/checkout2:[email]:[password]
android://pfDvxsQIIXYFer6DxBcqXjgyr9X3z0_f4GlJfpZMErP2oGHX74fUnXpWA29CNgnCyZ_phC8IyV0exIV6hg3iyQ==@com.sixt.reservation/:[email]:[password]
https://members.persil-service.de/login/:[email]:[password]
https://www.nicotel.de/index.php:[email]:[password]
https://www.hellofresh.de/login:[email]:[password]
https://login.live.com/login.srf:[email]:[password]
https://accounts.login.idm.telekom.com/factorx:[email]:[password]
https://grillhaus-bei-reimann.order.dish.co/register:[email]:[password]
https://signup.sipgateteam.de/:[email]:[password]
https://www.baur.de/kasse/registrieren:[email]:[password]
https://buchung.carlundcarla.de/28572879/schritt-3:[email]:[password]
https://www.qvc.de/checkout/your-information.html:[email]:[password]
https://de.omio.com/app/search-frontend/booking/96720342-e20e-4de7-8b21-ddefc0fa44bd/passenger-details:[email]:[password]
https://www.shop-apotheke.com/nx/login/:[email]:[password]
https://druckmittel.de/checkout/confirm:[email]:[password]
https://www.global-carpet.de/checkout/confirm:[email]:[password]
https://software-hero.de/checkout/confirm:[email]:[password]
https://myenergykey.com/login:[email]:[password]
https://www.sixt.de/:[email]:[password]
https://www.wlan-shop24.de/Bestellvorgang:[email]:[password]
https://www.cyberport.de/checkout/anmelden.html:[email]:[password]
https://waschmal.de/registerCustomer:[email]:[password]
https://www.wgv.de/app/moped201802/rechner/abschluss/moped:[email]:[password]
https://www.persil-service.de/signup:[email]:[password]
https://nicotel.de/:[email]:[password]
https://temial.vorwerk.de/register/checkout:[email]:[password]
https://accounts.bahn.de/auth/realms/db/login-actions/required-action:[email]:[password].
https://www.petsdeli.de/login:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://login.live.com/login.srf:[email]:[password]
https://accounts.login.idm.telekom.com/factorx:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://www.zoll-portal.de/registrierung/benutzerkonto/daten:[email]:[password]
https://v3.account.samsung.com/iam/passwords/register:[email]:[password]
https://www.amazon.pl/ap/signin:[email]:[password]
https://www.amazon.de/:[email]:[password]
https://meinkonto.telekom-dienste.de/wiederherstellung/passwort/web-pw-setzen.xhtml:[email]:[password]
steuer.check24.de/customer-center/aff/check24/authentication:[email]:[password]
www.disneyplus.com/de-de/reset-password:[email]:[password]
auth.rtl.de/auth/realms/rtlplus/protocol/openid-connect/auth:[email]:[password]
www.tink.de/checkout/login:[email]:[password]
signin.ebay.de/ws/eBayISAPI.dll:[email]:[password]
vrr-db-ticketshop.de/authentication/login:[email]:[password]
www.planet-sports.de/checkout/register:[email]:[password]
www.bstn.com/eu_de/checkout/:[email]:[password]
www.lico-nature.de/index.php:[email]:[password]
ticketshop.mobil.nrw/authentication/register:[email]:[password]
softwareindustrie24.de/checkout/confirm/as/customer:[email]:[password]
www.zurbrueggen.de/checkout/register:[email]:[password]
www.hertz247.de/ikeage/de-de/SignUp/Profile:[email]:[password]
www.bluemovement.com/de-de/checkout2:[email]:[password]
members.persil-service.de/login/:[email]:[password]
www.nicotel.de/index.php:[email]:[password]
www.hellofresh.de/login:[email]:[password]
login.live.com/login.srf:[email]:[password]
accounts.login.idm.telekom.com/factorx:[email]:[password]
grillhaus-bei-reimann.order.dish.co/register:[email]:[password]
signup.sipgateteam.de/:[email]:[password]
www.baur.de/kasse/registrieren:[email]:[password]
buchung.carlundcarla.de/28572879/schritt-3:[email]:[password]
www.qvc.de/checkout/your-information.html:[email]:[password]
de.omio.com/app/search-frontend/booking/96720342-e20e-4de7-8b21-ddefc0fa44bd/passenger-details:[email]:[password]
www.shop-apotheke.com/nx/login/:[email]:[password]
druckmittel.de/checkout/confirm:[email]:[password]
www.global-carpet.de/checkout/confirm:[email]:[password]
software-hero.de/checkout/confirm:[email]:[password]
myenergykey.com/login:[email]:[password]
www.sixt.de/:[email]:[password]
www.wlan-shop24.de/Bestellvorgang:[email]:[password]
www.cyberport.de/checkout/anmelden.html:[email]:[password]
waschmal.de/registerCustomer:[email]:[password]
www.wgv.de/app/moped201802/rechner/abschluss/moped:[email]:[password]
www.persil-service.de/signup:[email]:[password]
nicotel.de/:[email]:[password]
temial.vorwerk.de/register/checkout:[email]:[password]
accounts.bahn.de/auth/realms/db/login-actions/required-action:[email]:[password].
www.petsdeli.de/login:[email]:[password]
login.live.com/login.srf:[email]:[password]
accounts.login.idm.telekom.com/factorx:[email]:[password]
www.netflix.com/de/login:[email]:[password]
www.zoll-portal.de/registrierung/benutzerkonto/daten:[email]:[password]
v3.account.samsung.com/iam/passwords/register:[email]:[password]
www.amazon.pl/ap/signin:[email]:[password]
www.amazon.de/:[email]:[password]
meinkonto.telekom-dienste.de/wiederherstellung/passwort/web-pw-setzen.xhtml:[email]:[password]
Even without seeing the email address and password, the commonality is clear: German websites. Whilst the email address is common, the passwords are not... at least not always. In 168 instances they were near identical with only a handful of them deviating by a character or two. There's some duplication across the lines (9 different rows of Netflix, 4 of Disney Plus, etc), but clearly this remains a significant volume of data. But is it real? Let's find out:
The data seems accurate so far. I have already changed some of the passwords as I was notified by the provider that my account was hacked. It is strange that the Telekom password was already generated and should not be guessable. I store my passwords in Firefox, so is it possible that they were stolen from there?
It's legit. Stealer malware explains both the Telekom password and why passwords in Firefox were obtained; there's not necessarily anything wrong with either service, but if a machine is infected with software that can grab passwords straight out of the fields they've been entered into in the browser, it's game over.
We started having some to-and-fro as I gathered more info, especially as it related to the timeframe:
It started about a month ago, maximum 6 weeks. I use a Macbook and an iPhone, only a Windows PC at work, maybe it happened there? About a week ago there was an extreme spam attack on my Gmail account, and several expensive items were ordered with my accounts in the same period, which fortunately could be canceled.
We had the usual discussion about password managers and of course before that, tracking down which device is infected and siphoning off secrets. This was obviously distressing for her to see all her accounts laid out like this, not to mention learning that they were being exchanged in channels frequented by criminals. But from the perspective of verifying both the legitimacy and uniqueness of the data (not to mention the freshness), this was an enormously valuable exchange.
Next up was another subscriber who'd previously dodged all the data breaches in HIBP yet somehow managed to end up with 53 rows of data in the corpus:
[email]:Gru[redacted password]
[email]:fux[redacted password]
[email]:zWi[redacted password]
[email]:6ii[redacted password]
[email]:qTM[redacted password]
[email]:Pre[redacted password]
[email]:i8$[redacted password]
[email]:9cr[redacted password]
[email]:fuc[redacted password]
[email]:kuM[redacted password]
[email]:Fuc[redacted password]
[email]:Pre[redacted password]
[email]:Vxt[redacted password]
[email]:%3r[redacted password]
[email]:But[redacted password]
[email]:1qH[redacted password]
[email]:^VS[redacted password]
[email]:But[redacted password]
[email]:Nbs[redacted password]
[email]:*W2[redacted password]
[email]:$aM[redacted password]
[email]:DA^[redacted password]
[email]:vPE[redacted password]
[email]:Z8u[redacted password]
[email]:But[redacted password]
[email]:aXi[redacted password]
[email]:rPe[redacted password]
[email]:b4F[redacted password]
[email]:2u&[redacted password]
[email]:5%f[redacted password]
[email]:Lmt[redacted password]
[email]:p
[email]:Tem[redacted password]
[email]:fuc[redacted password]
[email]:*e@[redacted password]
[email]:(k+[redacted password]
[email]:Ste[redacted password]
[email]:^@f[redacted password]
[email]:XT$[redacted password]
[email]:25@[redacted password]
[email]:Jav[redacted password]
[email]:U8![redacted password]
[email]:LsZ[redacted password]
[email]:But[redacted password]
[email]:g$V[redacted password]
[email]:M9@[redacted password]
[email]:!6D[redacted password]
[email]:Fac[redacted password]
[email]:but[redacted password]
[email]:Why[redacted password]
[email]:h45[redacted password]
[email]:blo[redacted password]
[email]:azT[redacted password]
I've redacted everything after the first three characters of the password so you can get a sense of the breadth of different ones here. In this instance, there was no accompanying website, but the data checked out:
Oh damn a lot of those do seem pretty accurate. Some are quite old and outdated too. I tend to use that gmail account for inconsequential shit so I'm not too fussed, but I'll defintely get stuck in and change all those passwords ASAP. This actually explains a lot because I've noticed some pretty suspicious activity with a couple of different accounts lately.
Another with 35 records of website, email and password triplets responded as follows (I'll stop pasting in the source data, you know what that looks like by now):
Thank you very much for the information, although I already knew about this (I think it was due to a breach in LastPass) and I already changed the passwords, your information is much more complete and clear. It helped me find some pages where I haven't changed the password.
The final one of note really struck a chord with me, not because of the thrirteen rows of records similar to the ones above, but because of what he told me in his reply:
Thank you for your kindness. Most of these I have been able to change the passwords of and they do look familiar. The passwords on there have been changed. Is there a way we both can fix this problem as seeing I am only 14?
That's my son's age and predictably, all the websites listed were gaming sites. The kid had obviously installed something nasty and had signed up to HIBP notifications only a week earlier. He explained he'd recently received an email attempting to extort him for $1.3k worth of Bitcoin and shared the message. It was clearly a mass-mailed, indiscriminate shakedown and I advised him that it in no way targeted him directly. Concerned, he countered with a second extortion email he'd received, this time it was your classic "we caught you watching porn and masturbating" scam, and this one really had him worried:
I have been stressed and scared about these scams (even though I shouldn’t be). I have been very stressed and scared today because of another one of those emails.
Imagine being a young teenage boy and receiving that?! That's the sort of thing criminals frequenting Telegram channels such as the ones in question are using this data for, and it's reprehensible. I gave him some tips (I see the sorts of things my son's friends randomly install!) and hopefully, that'll set him on the right course.
They were the most noteworthy responses, the others that were often just a single email address and password pair just simply reinforced the same message:
Yes, this is an old password that I have used in the past, and matches the password of my accounts that had been logged into recently.
And:
Yes that password is familiar and accurate. I used to practice password re-use with this password across many services 5+ years ago.This makes it impossible to correlate it to a particular service or breach. It is known to me to be out there already, I've received crypto extortion emails containing it.
I know that many people who find themselves in this incident will be confused; which breach is it? I've never used Telegram before, why am I there? Those questions came through during my verification process and I know from loading previous similar breaches, they'll come up over and over again in the coming days and I hope that the overview above sufficiently answers these.
The questions that are harder to answer (and again, I know these will come up based on prior experience), are what the password is that was exposed, what the website it appeared next to was and, indeed, if it appeared next to a website at all or just alongside an email address. Right at the beginning of this project more than a decade ago, I made the decision not to load the data that would answer these questions due to the risk it posed to individuals and by extension, the risk to my ability to continue running HIBP. We were reminded of how important this decision was earlier in the year when a service aggregating data breaches left the whole thing exposed and put everyone in there at even more risk.
So, if you're in here, what do you do? It's a repeat of the same old advice we've been giving in this industry for decades now, namely keeping devices patched and updated, running security software appropriate for your device (I use Microsoft Defender on my PCs), using strong and unique passwords (get a password manager!) and enabling 2FA wherever possible. Each HIBP subscriber I contacted wasn't doing at least one of these things, which was evident in their password selection. Time and time again, passwords consisted of highly predictable patterns and often included their name, year of birth (I assume) and common character substitutions, usually within a dozen characters of length too. It's the absolute basics that are going wrong here.
To the point one of the HIBP subscribers made above, loading this data will help many people explain why they've been seeing unusual behaviour on their accounts. It's also the wakeup call to lift everyone's security game per the previous paragraph. But this also isn't the end of it, and more combolists have been posted in more Telegram channels since loading this incident. Whilst I'm still of the view from years ago that I'm not going to continuously load endless lists, I do hope people recognise that their security posture is an ongoing concern and not just something you think about after appearing in a breach.
The data is now searchable in Have I Been Pwned.
What a week! It was Ticketmaster that consumed the bulk of my time this week with the media getting themselves into a bit of a frenzy over a data breach that at the time of recording, still hadn't even been confirmed. But as predicted in the video, confirmation came late on a Friday arvo and since that time we've learned a lot more about just how bad the situation is. I'm going to save that discussion for Weekly Update 403 and between the time of writing this and going live, I'm sure we'll learn a lot more about the whole Snowflake situation as well.
Today we loaded 16.5M email addresses and 13.5M unique passwords provided by law enforcement agencies into Have I Been Pwned (HIBP) following botnet takedowns in a campaign they've coined Operation Endgame. That link provides an excellent overview so start there then come back to this blog post which adds some insight into the data and explains how HIBP fits into the picture.
Since 2013 when I kicked off HIBP as a pet project, it has become an increasingly important part of the security posture of individuals, organisations, governments and law enforcement agencies. Gradually and organically, it has found a fit where it's able to provide a useful service to the good guys after the bad guys have done evil cyber things. The phrase I've been fond of this last decade is that HIBP is there to do good things with data after bad things happen. The reputation and reach the service has gained in this time has led to partnerships such as the one you're reading about here today. So, with that in mind, let's get into the mechanics of the data:
In terms of the email addresses, there were 16.5M in total with 4.5M of them not having been seen in previous data breaches already in HIBP. We found 25k of our own individual subscribers in the corpus of data, plus another 20k domain subscribers which is usually organisations monitoring the exposure of their customers (all of these subscribers have now been sent notification emails). As the data was provided to us by law enforcement for the public good, the breach is flagged as subscription free which means any organisation that can prove control of the domain can search it irrespective of the subscription model we launched for large domains in August last year.
The only data we've been provided with is email addresses and disassociated password hashes, that is they don't appear alongside a corresponding address. This is the bare minimum we need to make that data searchable and useful to those impacted. So, let's talk about those standalone passwords:
There are 13.5 million unique passwords of which 8.9M were already in Pwned Passwords. Those passwords have had their prevalence counts updated accordingly (we received counts for each password with many appearing in the takedown multiple times over), so if you're using Pwned Passwords already, you'll see new numbers next to some entries. That also means there are 4.6M passwords we've never seen before which you can freely download using our open source tool. Or even better, if you're querying Pwned Passwords on demand you don't need to do anything as the new entries are automatically added to the result set. All this is made possible by feeding the data into the law enforcement pipeline we built for the FBI and NCA a few years ago.
A quick geek-out moment on Pwned Passwords: at present, we're serving almost 8 billion requests per month to this service:
Taking just last week as an example, we're a rounding error off 100% of requests being served directly from Cloudflare's cache:
That's over 99.99% of all requests during that period that were served from one of Cloudflare's edge nodes that sit in 320 cities globally. What that means for consumers of the service is massively fast response times due to the low latency of serving content from a nearby location and huge confidence in availability as there's only about a one-in-ten-thousand chance of the request being served by our origin service. If you'd like to know more about how we achieved this, check out my post from a year ago on using Cloudflare Cache Reserve.
After pushing out the new passwords today, all but 5 hash prefixes were modified (read more about how we use hashes to enable anonymous password searches) so we did a complete Cloudflare cache flush. By the time you read this, almost the entire 16^5 possible hash ranges have been completely repopulated into cache due to the volume of requests the service receives:
Lastly, when we talk about passwords in HIBP, the inputs we receive from law enforcement consist of 3 parts:
The rationale for this is explained in the links above but in a nutshell, the SHA-1 format ensures any badly parsed data that may inadvertently include PII is protected and it aligns with the underlying data structure that drives the k-anonymity searches. We have NTLM hashes as well because many orgs use them to check passwords in their own Active Directory instances.
So, what can you do if you find your data in this incident? It's a similar story to the Emotet malware provided by the FBI and NHTCU a few years ago in that the sage old advice applies: get a password manager and make them all strong and unique, turn on 2FA everywhere, keep machines patched, etc. If you find your password in the data (the HIBP password search feature anonymises it before searching, or password managers like 1Password can scan all of your passwords in one go), obviously change it everywhere you've used it.
This operation will be significant in terms of the impact on cybercrime, and I'm glad we've been able to put this little project to good use by supporting our friends in law enforcement who are doing their best to support all of us as online citizens.
Ah, episode 401, the unauthorised one! Ok, that was terrible, but what's not terrible is finally getting some serious dev resources behind HIBP. I touch on it in the blog post but imagine all the different stuff I have to spread myself across to run this thing, and how much time is left for actual coding. By welcoming Stefan to the team we're not doubling or tripling or even quadrupling the potential dev hours, it's genuinely getting close to 10x. That's exciting, and it will result in both foundational improvements and new features that'll be highly visible to everyone. Stay tuned 😊
We often do that in this industry, the whole "1.0" thing, but it seems apt here. I started Have I Been Pwned (HIBP) in 2013 as a pet project that scratched an itch, so I never really thought of myself as an "employee". Over time, it grew (and I tell you what, nobody is more surprised by that than me!) and over the last few years, my wife Charlotte got more and more involved. Technically, we're both employees and we work on HIBP things but we're like, well, beta versions.
Today, I'm very happy to announce our first full-time, production-ready employee: Stefán Jökull Sigurðarson. This is both a massive commitment on Charlotte's and my part and a leap of faith on Stefán's and deserves some background:
I suffer somewhat from what I'll call the "founder's paradox", that is I find myself having built something genuinely useful and wanting to see it grow and mature yet also not wanting to let go. I want to be involved in everything, but I also want to go on holidays sometimes and tune out. I like making decisions on every aspect of how the service runs, but I want it to outlive me. Bringing any outside party into any business can be hard to come to terms with, but especially in the case of HIBP where it's become so critical to so many people and deals with so much sensitive data. Which is why I have to trust people like Stefán because if I don't, I'm one shark / snake / croc incident away from disappointing a lot of people.
Trust is the cornerstone of why Stefán is joining us now. Not just trust in his technical skills, but trust in him as a person. I've known Stefán for many years now, initially when he came to one of my Hack Yourself First workshops in Oslo back in 2018, then as a blogger writing about how he was implementing Pwned Passwords at EVE Online, then as conference speaker himself, a Microsoft MVP, and in 2021, as the person who selflessly gave up his own time to support the open source Pwned Passwords. What we never made any formal announcements about is that we did hire Stefán on a part-time basis beginning earlier last year to help out with the coding when he had free cycles amidst his full-time work. That went great and he obviously enjoyed working at HIBP so earlier this month, Stefán handed in his resignation and will shortly be a full-time employee.
I'm really happy with the timing of this and how it's all worked out. We're in a position to make the financial commitment largely because of finally putting a price on searches for large domains last year. What this has allowed us to do is shift money from companies who see value in the service (more than half the Fortune 500 use the domain search feature), and reinvest it into making HIBP more sustainable. Getting Stefán onboard is the manifestation of that investment and you'll very shortly see his work begin to translate into highly visible new features. But what you won't see is the stuff that's even more important, especially as it relates to running a more sustainable service that no longer has me as a single point of failure.
So, welcome Stefán, and thank you for your commitment 😊
Oh - just one more thing: I was looking around for a great hero image for this blog post and I found this awesome video of Stefán swimming through a semi-frozen Norwegian fjord before riding an iceberg. For real, this it perhaps the most Nordic thing I've ever seen (Stefán being from Iceland and all), but unfortunately videos don't really lend themselves to hero images, so I went switch a stylised AI-generated rendition of the event.
This is the 400th time I've sat down in front of the camera and done one of these videos. Every single week since the 23rd of September in 2016 regardless of location, health, stress and all sorts of other crazy things that have gone on in my life for nearly the last 8 years now, I've done a video. As with so many of the things I create, these are as much for me as they are for you; doing these videos every week has given me a regular cadence amidst some pretty crazy times. I've written before about dealing with stress and I honestly cannot tell you how many times I was having the worst time of my life right up until the point where I went live... and then my entire mindset changed. I had to focus on what I was talking about and just like that, I had a reprieve from the stress.
So, thank you for tuning in, for engaging and commenting, and for giving me a platform not just to talk about tech (and coffee and beer), but to help keep me sane 😊
The Post Millennial breach in this week's video is an interesting one, most notably because of the presence of the mailing lists. Now, as I've said in every piece of communication I've put out on this incident, the lists are what whoever defaced the site said TPM had and they certainly posted that data in the defacement message, but we're yet to hear a statement from the company itself. Taking it at face value, where does their responsibility lie as it relates to individuals in this data set? I mean, let's say you signed a petition aligned to your political ideals many years ago and agreed to the terms and conditions (which you didn't read, because you're a normal human) then your data pops up somewhere like TPM. Is it their responsibility to let you know? Or the service that sold your data to them? Or... something else? It's messy, real messy, and the only thing I'm confident in saying is that the most likely thing to happen is the same as every other time we see this pattern: nothing.
How many different angles can you have on one data breach? Facial recognition (which probably isn't actual biometrics), gambling, offshore developers, unpaid bills, extortion, sloppy password practices and now, an arrest. On pondering it more after today's livestream, it's the unfathomable stupidity of publishing this data publicly that really strikes me. By all means, have contractual disputes, get lawyers involved and showdown in the courts if you need to, but take data in this fashion and chuck it up online and you're well into criminal territory. It's just nuts, and I suspect there's a lot more yet to play out in this saga.