TL;DR — Email addresses in stealer logs can now be queried in HIBP to discover which websites they've had credentials exposed against. Individuals can see this by verifying their address using the notification service and organisations monitoring domains can pull a list back via a new API.
Nasty stuff, stealer logs. I've written about them and loaded them into Have I Been Pwned (HIBP) before but just as a recap, we're talking about the logs created by malware running on infected machines. You know that game cheat you downloaded? Or that crack for the pirated software product? Or the video of your colleague doing something that sounded crazy but you thought you'd better download and run that executable program showing it just to be sure? That's just a few different ways you end up with malware on your machine that then watches what you're doing and logs it, just like this:
These logs all came from the same person and each time the poor bloke visited a website and logged in, the malware snared the URL, his email address and his password. It's akin to a criminal looking over his shoulder and writing down the credentials for every service he's using, except rather than it being one shoulder-surfing bad guy, it's somewhat larger than that. We're talking about billions of records of stealer logs floating around, often published via Telegram where they're easily accessible to the masses. Check out Bitsight's piece titled Exfiltration over Telegram Bots: Skidding Infostealer Logs if you'd like to get into the weeds of how and why this happens. Or, for a really quick snapshot, here's an example that popped up on Telegram as I was writing this post:
As it relates to HIBP, stealer logs have always presented a bit of a paradox: they contain huge troves of personal information that by any reasonable measure constitute a data breach that victims would like to know about, but then what can they actually do about it? What are the websites listed against their email address? And what password was used? Reading the comments from the blog post in the first para, you can sense the frustration; people want more info and merely saying "your email address appeared in stealer logs" has left many feeling more frustrated than informed. I've been giving that a lot of thought over recent months and today, we're going to take a big step towards addressing that concern:
The domains an email address appears next to in stealer logs can now be returned to authorised users.
This means the guy with the Gmail address from the screen grab above can now see that his address has appeared against Amazon, Facebook and H&R Block. Further, his password is also searchable in Pwned Passwords so every piece of info we have from the stealer log is now accessible to him. Let me explain the mechanics of this:
Firstly, the volumes of data we're talking about are immense. In the case of the most recent corpus of data I was sent, there are hundreds of text files with well over 100GB of data and billions of rows. Filtering it all down, we ended up with 220 million unique rows of email address and domain pairs covering 69 million of the total 71 million email addresses in the data. The gap is explained by a combination of email addresses that appeared against invalidly formed domains and in some cases, addresses that only appeared with a password and not a domain. Criminals aren't exactly renowned for dumping perfectly formed data sets we can seamlessly work with, and I hope folks that fall into that few percent gap understand this limitation.
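To make that filtering step a little more concrete, here's a minimal sketch of boiling raw rows down to unique email address and domain pairs. It's an illustration only, not the actual HIBP pipeline: the whitespace-separated `URL email password` layout, the loose validation patterns and the names `parse_line` and `unique_pairs` are all assumptions, and real stealer logs are far messier.

```python
import re
from urllib.parse import urlsplit

# Deliberately loose patterns (assumptions): they only weed out obvious junk.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
HOST_RE = re.compile(r"(?:[a-z0-9-]{1,63}\.)+[a-z]{2,63}")


def parse_line(line):
    """Return (email, host) for one log row, or None if the row is unusable.

    Assumes a hypothetical "<url> <email> <password>" layout; the password
    is the remainder of the line, so it may itself contain spaces.
    """
    parts = line.split(None, 2)
    if len(parts) != 3:
        return None  # e.g. an address that only appears with a password
    url, email, _password = parts
    email = email.lower()
    if not EMAIL_RE.fullmatch(email):
        return None
    try:
        host = (urlsplit(url).hostname or "").lower()
    except ValueError:
        return None  # the URL was too malformed to parse
    if not HOST_RE.fullmatch(host):
        return None  # invalidly formed domain
    return email, host


def unique_pairs(lines):
    """Collapse an iterable of raw rows into a set of (email, host) pairs."""
    pairs = set()
    for line in lines:
        pair = parse_line(line)
        if pair is not None:
            pairs.add(pair)
    return pairs
```

This keeps the full hostname rather than reducing it to a registrable domain, as doing that properly needs a public suffix list. Rows that fail either check are simply dropped, which is the kind of thing that produces a gap between the total number of addresses and those with a domain against them.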
So, we now have 220 million records of email addresses against domains, how do we surface that information? Keeping in mind that "experimental" caveat in the title, the first decision we made is that it should only be accessible to the following parties:
At face value it might look like that first point deviates from the current model of just entering an email address on the front page of the site and getting back a result (and there are very good reasons why the service works this way). There are some important differences though, the first of which is that whilst your classic email address search on HIBP returns verified breaches of specific services, stealer logs contain a list of services that have never been breached. It means we're talking about much larger numbers that build up far richer profiles; instead of a few breached services someone used, we're talking about potentially hundreds of them. Secondly, many of the services that appear next to email addresses in the stealer logs are precisely the sort of thing we flag as sensitive and hide from public view. There's a heap of Pornhub. There are health-related services. Religious ones. Political websites. There are a lot of services there that merely by association constitute sensitive information, and we just don't want to take the risk of showing that info to the masses.
The second point means that companies doing domain searches (for which they already need to prove control of the domain), can pull back the list of the websites people in their organisation have email addresses next to. When the company controls the domain, they also control the email addresses on that domain and by extension, have the technical ability to view messages sent to their mailbox. Whether they have policies prohibiting this is a different story but remember, your work email address is your work's email address! They can already see the services sending emails to their people, and in the case of stealer logs, this is likely to be enormously useful information as it relates to protecting the organisation. I ran a few big names through the data, and even I was shocked at the prevalence of corporate email addresses against services you wouldn't expect to be used in the workplace (then again, using the corp email address in places you definitely shouldn't be isn't exactly anything new). That in itself is an issue, then there's the question of whether these logs came from an infected corporate machine or from someone entering their work email address into their personal device.
I started thinking more about what you can learn about an organisation's exposure in these logs, so I grabbed a well-known brand in the Fortune 500. Here are some of the highlights:
That said, let me emphasise a critical point:
This data is prepared and sold by criminals who provide zero guarantees as to its accuracy. The only guarantee is that the presence of an email address next to a domain is precisely what's in the stealer log; the owner of the address may never have actually visited the indicated website.
Stealer logs are not like typical data breaches where it's a discrete incident leading to the dumping of customers of a specific service. I know that the presence of my personal email address in the LinkedIn and Dropbox data breaches, for example, is a near-ironclad indication that those services exposed my data. Stealer logs don't provide that guarantee, so please understand this when reviewing the data.
The way we've decided to implement these two use cases differs:
We'll make the individual searches cleaner in the near future as part of the rebrand I've recently been talking about. For now, here's what it looks like:
Because of the recirculation of many stealer logs, we're not tracking which domains appeared against which breaches in HIBP. Depending on how this experiment with stealer logs goes, we'll likely add more in the future (and fill in the domain data for existing stealer logs in HIBP), but additional domains will only appear in the screen above if they haven't already been seen.
We've done the searches by domain owners via API as we're talking about potentially huge volumes of data that really don't scale well to the browser experience. Imagine a company with tens or hundreds of thousands of breached addresses and then a whole heap of those addresses have a bunch of stealer log entries against them. Further, putting this behind a per-email address API rather than automatically showing it on domain search means it's easy for an org to not see these results, which I suspect some will elect to do for privacy reasons. The API approach was easiest while we explore this service, then we can build on that based on feedback. I mentioned this was experimental, right? For now, it looks like this:
Lastly, there's another opportunity altogether that loading stealer logs in this fashion opens up, and the penny dropped when I loaded that last one mentioned earlier. I was contacted by a couple of different organisations that explained how around the time the data I'd loaded was circulating, they were seeing an uptick in account takeovers "and the attackers were getting the password right first go every time!" Using HIBP to try and understand where impacted customers might have been exposed, they posited that it was possible the same stealer logs I had were being used by criminals to extract every account that had logged onto their service. So, we started delving into the data and sure enough, all the other email addresses against their domain aligned with customers who were suffering from account takeover. We now have that data in HIBP, and it would be technically feasible to provide this to domain owners so that they can get an early heads up on which of their customers they probably have to rotate credentials for. I love the idea as it's a great preventative measure, perhaps that will be our next experiment.
Onto the passwords and as mentioned earlier, these have all been extracted and added to the existing Pwned Passwords service. This service remains totally free and open source (both code and data), has a really cool anonymity model allowing you to hit the API without disclosing the password being searched for, and has become absolutely MASSIVE!
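For anyone who hasn't seen the anonymity model before, here's a minimal sketch of the client side of it. It relies on the publicly documented behaviour of the Pwned Passwords range API: only the first five characters of the password's SHA-1 hash are sent, and the response is a list of `SUFFIX:COUNT` lines to match against locally. The function names are my own, and the network call itself is left out so the block stays self-contained; a real client would fetch the body from `https://api.pwnedpasswords.com/range/{prefix}`.

```python
import hashlib


def split_hash(password):
    """Return (prefix, suffix) of the uppercase SHA-1 hex digest.

    Only the 5-character prefix ever needs to leave the machine.
    """
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range(range_body, suffix):
    """Find a hash suffix in a range response made of "SUFFIX:COUNT" lines.

    Returns the prevalence count, or 0 if the suffix isn't present.
    """
    for line in range_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0
```

The comparison happens entirely on the client, so the service only ever learns a prefix that is shared by many different passwords.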
I thought that doing more than 10 billion requests a month was cool, but look at that data transfer - more than a quarter of a petabyte just last month! And it's in use at some pretty big name sites as well:
That's just where the API is implemented client-side, and we can identify the source of the requests via the referrer header. Most implementations are done server-side, and by design, we have absolutely no idea who those folks are. Shoutout to Cloudflare while we're here for continuing to provide the service behind this for free to help make a more secure web.
In terms of the passwords in this latest stealer log corpus, we found 167 million unique ones of which only 61 million were already in HIBP. That's a massive number, so we did some checks, and whilst there's always a bit of junk in these data sets (remember - criminals and formatting!) there's also a heap of new stuff. For example:
And about 106M other non-kangaroo themed passwords. Admittedly, we did start to get a bit preoccupied looking at some of the creative ways people were creating previously unseen passwords:
And here's something especially ironic: check out these stealer log entries:
People have been checking these passwords on HIBP's service whilst infected with malware that logged the search! None of those passwords were in HIBP... but they all are now 🙂
Want to see something equally ironic? People using my Hack Yourself First website to learn about secure coding practices have also been infected with malware and ended up in stealer logs:
So, that's the experiment we're trying with stealer logs, and that's how to see the websites exposed against an email address. Just one final comment as it comes up every single time we load data like this:
We cannot manually provide data on a per-individual basis.
Hopefully, there's less need to now given the new feature outlined above, and I hope the massive burden of looking up individual records when there are 71 million people impacted is evident. Do leave your comments below and help us improve this feature to become as useful as we can possibly make it.
Nearly four years ago now, I set out to write a book with Charlotte and Rob. It was the stories behind the stories, the things that drove me to write my most important blog posts, and then the things that happened afterwards. It's almost like a collection of meta posts, each one adding behind-the-scenes commentary that most people reading my material didn't know about at the time.
It was a strange time for all of us back then. I didn't leave the country for the first time in over a decade. I barely even left the state. I had time to toil on the passion project that became this book. As I wrote about years later, there were also other things occupying my mind at the time. Writing this book was cathartic, providing me the opportunity to express some of the emotions I was feeling at the time and to reflect on life.
Speaking of reflecting, this week was Have I Been Pwned's 11th birthday. Reaching this milestone, getting back to travel (I'm writing this poolside with a beer at a beautiful hotel in Dubai), life settling down (while sitting next to my amazing wife), and it now being 2 years since we launched the book, I decided we should just give it away for free. I mean really free, not "give me all your personal details, then here's a download link" I mean, here are the direct download links:
I hope you enjoy the book. It's the culmination of so many things I worked so hard to create over the preceding decade and a half, and I'm really happy to just be giving it away now. Enjoy the book 😊
I've spent more than a decade now writing about how to make Have I Been Pwned (HIBP) fast. Really fast. Fast to the extent that sometimes, it was even too fast:
The response from each search was coming back so quickly that the user wasn’t sure if it was legitimately checking subsequent addresses they entered or if there was a glitch.
Over the years, the service has evolved to use emerging new techniques to not just make things fast, but make them scale more under load, increase availability and sometimes, even drive down cost. For example, 8 years ago now I started rolling the most important services to Azure Functions, "serverless" code that was no longer bound by logical machines and would just scale out to whatever volume of requests was thrown at it. And just last year, I turned on Cloudflare cache reserve to ensure that all cachable objects remained cached, even under conditions where they previously would have been evicted.
And now, the pièce de résistance, the coolest performance thing we've done to date (and it is now "we", thank you Stefán): just caching the whole lot at Cloudflare. Everything. Every search you do... almost. Let me explain, firstly by way of some background:
When you hit any of the services on HIBP, the first place the traffic goes from your browser is to one of Cloudflare's 330 "edge nodes":
As I sit here writing this on the Gold Coast on Australia's most eastern seaboard, any request I make to HIBP hits that edge node on the far right of the Aussie continent which is just up the road in Brisbane. The capital city of our great state of Queensland is just a short jet ski away, about 80km as the crow flies. Before now, every single time I searched HIBP from home, my request bytes would travel up the wire to Brisbane and then take a giant 12,000km trip to Seattle where the Azure Function in the West US Azure data centre would query the database before sending the response 12,000km back west to Cloudflare's edge node, then the final 80km down to my Surfers Paradise home. But what if it didn't have to be that way? What if that data was already sitting on the Cloudflare edge node in Brisbane? And the one in Paris, and the one in... well, I'm not even sure where all those blue dots are, but what if it was everywhere? Several awesome things would happen:
In short, pushing data and processing "closer to the edge" benefits both our customers and ourselves. But how do you do that for 5 billion unique email addresses? (Note: As of today, HIBP reports over 14 billion breached accounts, the number of unique email addresses is lower as on average, each breached address has appeared in multiple breaches.) To answer this question, let's recap on how the data is queried:
Let's delve into that last point further because it's the secret sauce to how this whole caching model works. In order to provide subscribers of this service with complete anonymity over the email addresses being searched for, the only data passed to the API is the first six characters of the SHA-1 hash of the full email address. If this sounds odd, read the blog post linked to in that last bullet point for full details. The important thing for now, though, is that it means there are a total of 16^6 different possible requests that can be made to the API, which is just over 16 million. Further, we can transform the first two use cases above into k-anonymity searches on the server side as it simply involves hashing the email address and taking those first six characters.
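As a quick illustration of that transformation, here's a minimal sketch of turning an email address into its six-character hash prefix, along with the arithmetic behind the "just over 16 million" figure. Trimming and lower-casing the address before hashing is an assumption on my part (the post doesn't say how addresses are normalised), and `email_hash_prefix` is a made-up name.

```python
import hashlib

PREFIX_LENGTH = 6  # hex characters of the SHA-1 hash passed to the API


def email_hash_prefix(email):
    """Return the first six hex characters of the SHA-1 of an email address.

    Normalising with strip() and lower() is an assumption, not something
    the post specifies.
    """
    normalised = email.strip().lower()
    digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest().upper()
    return digest[:PREFIX_LENGTH]


# Each hex character has 16 possible values, so six of them give
# 16 ** 6 = 16,777,216 distinct prefixes: the full set of cacheable requests.
TOTAL_PREFIXES = 16 ** PREFIX_LENGTH
```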
In summary, this means we can boil the entire searchable database of email addresses down to the following:
That's a large albeit finite list, and that's what we're now caching. So, here's what a search via email address looks like:
K-anonymity searches obviously go straight to step four, skipping the first few steps as we already know the hash prefix. All of this happens in a Cloudflare worker, so it's "code on the edge" creating hashes, checking cache then retrieving from the origin where necessary. That code also takes care of handling parameters that transform queries, for example, filtering by domain or truncating the response. It's a beautiful, simple model that's all self-contained within a worker and a very simple origin API. But there's a catch - what happens when the data changes?
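The worker flow described above (create the hash, check cache, go to the origin only where necessary) can be sketched in a few lines. This is a plain Python stand-in rather than actual Cloudflare Worker code: the in-memory dictionary plays the role of the edge cache, `fetch_from_origin` is a hypothetical callable standing in for the origin API, and the response shape (hash suffix mapped to a list of breach names) is an assumption made purely for illustration.

```python
import hashlib


class EdgeCache:
    """Cache-aside lookup keyed on the six-character hash prefix."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # hypothetical origin API call
        self._store = {}                 # stand-in for the edge cache
        self.hits = 0
        self.misses = 0

    def get_range(self, prefix):
        """Return the data for one hash prefix, hitting the origin on a miss."""
        if prefix in self._store:
            self.hits += 1
            return self._store[prefix]
        self.misses += 1
        body = self._fetch(prefix)
        self._store[prefix] = body
        return body

    def search_email(self, email):
        """Hash the address, fetch its prefix range, then match the suffix."""
        normalised = email.strip().lower()  # normalisation is an assumption
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest().upper()
        entries = self.get_range(digest[:6])
        return entries.get(digest[6:], [])
```

A k-anonymity search would call `get_range` directly as it already has the prefix, whereas a search by email address goes through `search_email` first. Either way, the origin is only consulted once per prefix until the cache is emptied.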
There are two events that can change cached data, one is simple and one is major:
The second point is kind of frustrating as we've built up this beautiful collection of data all sitting close to the consumer where it's super fast to query, and then we nuke it all and go from scratch. The problem is it's either that or we selectively purge what could be many millions of individual hash prefixes, which you can't do:
For Zones on Enterprise plan, you may purge up to 500 URLs in one API call.
And:
Cache-Tag, host, and prefix purging each have a rate limit of 30,000 purge API calls in every 24 hour period.
We're giving all this further thought, but it's a non-trivial problem and a full cache flush is both easy and (near) instantaneous.
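To put a rough number on why selective purging is unattractive, here's a small worked calculation using the 500 URLs per call figure quoted above. It assumes exactly one cached URL per hash prefix, which is a simplification (query parameters could multiply that), and `purge_calls_needed` is a hypothetical helper rather than anything from Cloudflare's API.

```python
URLS_PER_PURGE_CALL = 500  # from the quoted Enterprise plan limit


def purge_calls_needed(prefix_count, urls_per_call=URLS_PER_PURGE_CALL):
    """Number of purge-by-URL calls needed to evict prefix_count prefixes.

    Assumes one cached URL per hash prefix (a simplification).
    """
    return -(-prefix_count // urls_per_call)  # integer ceiling division


# A breach touching 5 million prefixes would need 10,000 calls, and evicting
# every one of the 16 ** 6 possible prefixes would need 33,555.
```

Compare that with a full flush, which is a single operation regardless of how many prefixes changed.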
Enough words, let's get to some pictures! Here's a typical week of queries to the enterprise k-anonymity API:
This is a very predictable pattern, largely due to one particular subscriber regularly querying their entire customer base each day. (Sidenote: most of our enterprise level subscribers use callbacks such that we push updates to them via webhook when a new breach impacts their customers.) That's the total volume of inbound requests, but the really interesting bit is the requests that hit the origin (blue) versus those served directly by Cloudflare (orange):
Let's take the lowest blue data point towards the end of the graph as an example:
At that time, 96% of requests were served from Cloudflare's edge. Awesome! But look at it only a little bit later:
That's when I flushed cache for the Finsure breach, and 100% of traffic started being directed to the origin. (We're still seeing 14.24k hits via Cloudflare as, inevitably, some requests in that 1-hour block were to the same hash range and were served from cache.) It then took a whole 20 hours for the cache to repopulate to the extent that the hit:miss ratio returned to about 50:50:
Look back towards the start of the graph and you can see the same pattern from when I loaded the DemandScience breach. This all does pretty funky things to our origin API:
That last sudden increase is more than a 30x traffic increase in an instant! If we hadn't been careful about how we managed the origin infrastructure, we would have built a literal DDoS machine. Stefán will write later about how we manage the underlying database to ensure this doesn't happen, but even still, whilst we're dealing with the cyclical support patterns seen in that first graph above, I know that the best time to load a breach is later in the Aussie afternoon when the traffic is a third of what it is first thing in the morning. This helps smooth out the rate of requests to the origin such that by the time the traffic is ramping up, more of the content can be returned directly from Cloudflare. You can see that in the graphs above; that big peaky block towards the end of the last graph is pretty steady, even though the inbound traffic in the first graph over the same period of time increases quite significantly. It's like we're trying to race the increasing inbound traffic by building ourselves up a buffer in cache.
Here's another angle to this whole thing: now more than ever, loading a data breach costs us money. For example, by the end of the graphs above, we were cruising along at a 50% cache hit ratio, which meant we were only paying for half as many of the Azure Function executions, egress bandwidth, and underlying SQL database as we would have been otherwise. Flushing cache and suddenly sending all the traffic to the origin doubles our cost. Waiting until we're back at a 90% cache hit ratio literally increases those costs 10x when we flush. If I were to be completely financially ruthless about it, I would need to either load fewer breaches or bulk them together such that a cache flush is only ejecting a small amount of data anyway, but clearly, that's not what I've been doing 😄
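The arithmetic behind those multipliers is simple enough to sketch. It assumes origin cost scales linearly with the number of requests reaching the origin, which is a simplification, and `flush_multiplier` is a name invented for the example.

```python
def flush_multiplier(hit_pct):
    """Factor by which origin traffic jumps when the cache is flushed.

    Before a flush, (100 - hit_pct)% of requests reach the origin;
    immediately afterwards, 100% do.
    """
    if not 0 <= hit_pct < 100:
        raise ValueError("hit_pct must be at least 0 and below 100")
    return 100 / (100 - hit_pct)


# flush_multiplier(50) is 2.0: a 50% hit ratio means a flush doubles the load.
# flush_multiplier(90) is 10.0: a 90% hit ratio means a flush is a 10x jump.
```

The better the cache is performing, the more expensive it is to throw it away.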
There's just one remaining fly in the ointment...
Of those three methods of querying email addresses, the first is a no-brainer: searches from the front page of the website hit a Cloudflare Worker where it validates the Turnstile token and returns a result. Easy. However, the second two models (the public and enterprise APIs) have the added burden of validating the API key against Azure API Management (APIM), and the only place that exists is in the West US origin service. What this means for those endpoints is that before we can return search results from a location that may be just a short jet ski ride away, we need to go all the way to the other side of the world to validate the key and ensure the request is within the rate limit. We do this in the lightest possible way with barely any data transiting the request to check the key, plus we do it in async with pulling the data back from the origin service if it isn't already in cache. In other words, we're as efficient as humanly possible, but we still cop a massive latency burden.
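Doing the key check "in async" with the data retrieval can be sketched as two concurrent tasks. This is a Python `asyncio` illustration of the idea rather than the actual worker code: `fake_validate_key` and `fake_fetch_origin` are hypothetical stand-ins for the round trips to the origin, the valid key value is invented, and the status codes are assumptions.

```python
import asyncio


async def fake_validate_key(api_key):
    """Hypothetical stand-in for the key and rate limit check at the origin."""
    await asyncio.sleep(0.01)  # simulated round trip
    return api_key == "valid-key"


async def fake_fetch_origin(prefix):
    """Hypothetical stand-in for pulling a hash range from the origin."""
    await asyncio.sleep(0.01)  # simulated round trip
    return {"prefix": prefix}


async def handle_request(api_key, prefix, validate_key, cache, fetch_origin):
    """Validate the key and obtain the data concurrently, then decide."""

    async def get_data():
        if prefix in cache:
            return cache[prefix]  # served locally, no origin fetch needed
        body = await fetch_origin(prefix)
        cache[prefix] = body
        return body

    # Both coroutines run at the same time, so total latency is roughly the
    # slower of the two rather than their sum.
    key_ok, body = await asyncio.gather(validate_key(api_key), get_data())
    if not key_ok:
        return 401, None
    return 200, body
```

One trade-off in this sketch is that a request with a bad key can still trigger an origin fetch and populate the cache; the response is withheld, but the work has been done. The key check remains on the critical path even when the data is already cached, which is the latency burden described above.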
Doing API management at the origin is super frustrating, but there are really only two alternatives. The first is to distribute our APIM instance to other Azure data centres, and the problem with that is we need a Premium instance of the product. We presently run on a Basic instance, which means we're talking about a 19x increase in price just to unlock that ability. But that's just to go Premium; we then need at least one more instance somewhere else for this to make sense, which means we're talking about a 28x increase. And every region we add amplifies that even further. It's a financial non-starter.
The second option is for Cloudflare to build an API management product. This is the killer piece of this puzzle, as it would put all the checks and balances within the one edge node. It's a suggestion I've put forward on many occasions now, and who knows, maybe it's already in the works, but it's a suggestion I make out of a love of what the company does and a desire to go all-in on having them control the flow of our traffic. I did get a suggestion this week about rolling what is effectively a "poor man's API management" within workers, and it's a really cool suggestion, but it gets hard when people change plans or when we want to apply quotas to APIs rather than rate limits. So c'mon Cloudflare, let's make this happen!
Finally, just one more stat on how powerful serving content directly from the edge is: I shared this stat last month for Pwned Passwords which serves well over 99% of requests from Cloudflare's cache reserve:
There it is - we’ve now passed 10,000,000,000 requests to Pwned Password in 30 days 😮 This is made possible with @Cloudflare’s support, massively edge caching the data to make it super fast and highly available for everyone. pic.twitter.com/kw3C9gsHmB
— Troy Hunt (@troyhunt) October 5, 2024
That's about 3,900 requests per second, on average, non-stop for 30 days. It's obviously way more than that at peak; just a quick glance through the last month and it looks like about 17k requests per second in a one-minute period a few weeks ago:
But it doesn't matter how high it is, because I never even think about it. I set up the worker, I turned on cache reserve, and that's it 😎
I hope you've enjoyed this post, Stefán and I will be doing a live stream on this topic at 06:00 AEST Friday morning for this week's regular video update, and it'll be available for replay immediately after. It's also embedded here for convenience:
Apparently, before a child reaches the age of 13, advertisers will have gathered more than 72 million data points on them. I knew I'd seen a metric about this sometime recently, so I went looking for "7,000", which perfectly illustrates how unaware we are of the extent of data collection on all of us. I started Have I Been Pwned (HIBP) in the first place because I was surprised at where my data had turned up in breaches. 11 years and 14 billion breached records later, I'm still surprised!
Jason (not his real name) was also recently surprised at where his data had appeared. He found it in a breach of a service called "Pure Incubation", a company whose records had appeared on a popular hacking forum earlier this year:
#DataLeak Alert ⚠️⚠️⚠️
🚨Over 183 Million Pure Incubation Ventures Records for Sale 🚨
183,754,481 records belonging to Pure Incubation Ventures (https://t.co/m3sjzAMlXN) have been put up for sale on a hacking forum for $6,000 negotiable.
Additionally, the threat actor with… pic.twitter.com/tqsyb8plPG
— HackManac (@H4ckManac) February 28, 2024
When Jason found his email address and other info in this corpus, he had the same question so many others do when their data turns up in a place they've never heard of before - how? Why?! So, he asked them:
I seem to have found my email in your data breach. I am interested in finding how my information ended up in your database.
To their credit, he got a very comprehensive answer, which I've included below:
Well, that answers the "how" part of the equation; they've aggregated data from public sources. And the "why" part? It's the old "data is the new oil" analogy that recognises how valuable our info is, and as such, there's a market for it. There are lots of terms used to describe what DemandScience does, including "B2B demand generation", "buyer intelligence solutions provider", "empowering technology companies to accelerate ROI", "supercharging pipelines" and "account intelligence". Or, to put it in a more lay-person-friendly fashion, they sell data on people.
DemandScience is what we refer to as a "data aggregator" in that they combine identity data from multiple locations, bundle it up, and then sell it. Occasionally, data aggregators end up having sizeable data breaches; before today, HIBP already contained Adapt (9M records), Data & Leads (44M records), Exactis (132M records), Factual (2M records), and You've Been Scraped (66M records). According to DemandScience, "none of our current operational systems were exploited", yet simultaneously, "the leaked data originated from a system that has been decommissioned". So, it's a breach of an old system.
Does it matter? I mean, if it's just public data, should people care? Jason cared, at least enough to make the original enquiry and for DemandScience to look him up and realise he's not in their current database. Still, he existed in the breached one (I later sent Jason his record from the breach, and he confirmed the accuracy). As I often do in these cases, I reached out to a bunch of recent HIBP subscribers in the breach and asked them three simple questions:
The answers were all the same: the data is accurate, it's already in the public domain, and people aren't too concerned about it appearing in this breach. Well that was easy 🙂 However...
There are two nuances that aren't captured here, and the first one is that this is valuable data, that's why DemandScience sells it! It comes back to that "new oil" analogy and if you have enough of it, you can charge good money for it. Companies typically use data such as this to do precisely the sort of catchphrasey stuff the company talks about, primarily around maximising revenue from their customers by understanding them better.
The second nuance is that whilst this data may already be in the public domain, did the owners of it expect it to be used in this fashion? For example, if you publish your details in a business directory, is your expectation that this info may then be sold to other companies to help them upsell you on their products? Probably not. And if, like many of the records in the data, someone's row is accompanied by their LinkedIn profile, would they expect that data to be matched and sold? I suggest the responses would likely be split here, and that in itself is an important observation: how we view the sensitivity of our data and the impact of it being exposed (whether personal or business) is extremely personal. Some people take the view of "I have nothing to hide", whilst others become irate if even just their email address is exposed.
Whilst considering how to add more insights to this blog post, I thought I'd do a quick check on just one more email address:
"54543060",,"0","TROY","HUNT","PO BOX 57",,"WEST RYDE",,,"AU","61298503333",,,,"troy.hunt@pfizer.com","pfizer.com","PFIZER INC",,"250-499","$50 - 99 Million","Healthcare, Pharmaceuticals and Biotech","VICE PRESIDENT OF INFORMATION TECHNOLOGY","VP Level","2834",,"Senior Management (SVP/GM/Director)","IT",,"1","GemsTarget INTL","GEMSTARGET_INTL_648K_10.17.18",,,,,,,,,"18/10/2018 05:12:39","5/10/2021 16:47:56","PFIZER.COM",,,,,"IT Management General","Information Technology"
I'll be entirely transparent and honest here - my exact words after finding this were "motherfucker!" True story, told uncensored here because I want to impress on the audience how I feel when my data turns up somewhere publicly. And I do feel like it's "my" data; it's certainly my name and even though it's my old Pfizer email address I've not used for almost a decade now, that also has my name in it. My job title is also there... and it's completely wrong! I never had a VP-level role, even though the other data around my tech role is at least in the vicinity of being correct. But other than the initial shock of finding myself in yet another data breach, personally, I'm in the same boat as the HIBP subscribers I contacted, and this doesn't bother me too much. But I also agree with the following responses I received to my third question:
I think it is useful to be notified of such breaches, even if it is just to confirm no sensitive data has been compromised. As I said, our IT department recently notified me that some of my data was leaked and a pre-emptive password reset was enforced as they didn't know what was leaked.
It would be good to see it as an informational notification in case there's an increase in attack attempts against my email address.
I would like to opt-out of here to reduce the SPAM and Phishing emails.
That last one seems perfectly reasonable, and fortunately, DemandScience does have a link on their website to Do Not Sell My Information:
Dammit! If, like me, you're part of the 99.5% of the world that doesn't live in California, then apparently this form isn't for you. However, they do list dataprivacy@demandscience.com on that page, which is the same address Jason was communicating with above. Chances are, if you want to remove your data then that's where to start.
There were almost 122M unique email addresses in this corpus and those have now been added to HIBP. Treat this as informational; I suspect that for most people, it won't bother them, whilst others will ask for their data not to be sold (regardless of where they live in the world). But in all likelihood, there will be more than a handful of domain subscribers who take issue with that volume of people data sitting there in one corpus easily downloadable via a clear web hacking forum. For example, mine was just one of many tens of thousands of Pfizer email addresses, and that sort of thing is going to raise the ire of some folks in corporate infosec capacities.
One last comment: there was a story published earlier this year titled Our Investigation of the Pure Incubation Ventures Leak and in there they refer to "encrypted passwords" being present in the data. Many of the files do contain a column with bcrypt hashes (which is definitely not encryption), but given the way in which this data was collated, I can see no evidence whatsoever that these are password hashes. As such, I haven't listed "Passwords" as one of the compromised data classes in HIBP and if you find yourself in this breach, I wouldn't be at all worried about this.
The conundrum I refer to in the title of this post is the one faced by a breached organisation: disclose or suppress? And let me be even more specific: should they disclose to impacted individuals, or simply never let them know? I'm writing this after many recent such discussions with breached organisations where I've found myself wishing I had this blog post to point them to, so, here it is.
Let's start with tackling what is often a fundamental misunderstanding about disclosure obligations, and that is the legal necessity to disclose. Now, as soon as we start talking about legal things, we run into the problem of it being different all over the world, so I'll pick a few examples to illustrate the point. As it relates to the UK GDPR, there are two essential concepts to understand, and they're the first two bulleted items in their personal data breaches guide:
The UK GDPR introduces a duty on all organisations to report certain personal data breaches to the relevant supervisory authority. You must do this within 72 hours of becoming aware of the breach, where feasible.
If the breach is likely to result in a high risk of adversely affecting individuals’ rights and freedoms, you must also inform those individuals without undue delay.
On the first point, "certain" data breaches must be reported to "the relevant supervisory authority" within 72 hours of learning about it. When we talk about disclosure, often (not just under GDPR), that term refers to the responsibility to report it to the regulator, not the individuals. And even then, read down a bit, and you'll see the carveout of the incident needing to expose personal data that is likely to present a "risk to people’s rights and freedoms".
This brings me to the second point that has this massive carveout as it relates to disclosing to the individuals, namely that the breach has to present "a high risk of adversely affecting individuals’ rights and freedoms". We have a similar carveout in Australia where the obligation to report to individuals is predicated on the likelihood of causing "serious harm".
This leaves us with the fact that in many data breach cases, organisations may decide they don't need to notify individuals whose personal information they've inadvertently disclosed. Let me give you an example from smack bang in the middle of GDPR territory: Deezer, the French streaming media service that went into HIBP early January last year:
New breach: Deezer had 229M unique email addresses breached from a 2019 backup and shared online in late 2022. Data included names, IPs, DoBs, genders and customer location. 49% were already in @haveibeenpwned. Read more: https://t.co/1ngqDNYf6k
— Have I Been Pwned (@haveibeenpwned) January 2, 2023
229M records is a substantial incident, and there's no argument about the personally identifiable nature of attributes such as email address, name, IP address, and date of birth. However, at least initially (more on that soon), Deezer chose not to disclose to impacted individuals:
Chatting to @Scott_Helme, he never received a breach notification from them. They disclosed publicly via an announcement in November, did they never actually email impacted individuals? Did *anyone* who got an HIBP email get a notification from Deezer? https://t.co/dnRw8tkgLl https://t.co/jKvmhVCwlM
— Troy Hunt (@troyhunt) January 2, 2023
No, nothing … but then I’ve not used Deezer for years .. I did get this👇from FireFox Monitor (provided by your good selves) pic.twitter.com/JSCxB1XBil
— Andy H (@WH_Y) January 2, 2023
Yes, same situation. I got the breach notification from HaveIBeenPwned, I emailed customer service to get an export of my data, got this message in response: pic.twitter.com/w4maPwX0Qe
— Giulio Montagner (@Giu1io) January 2, 2023
This situation understandably upset many people, with many cries of "but GDPR!" quickly following. And they did know way before I loaded it into HIBP too, almost two months earlier, in fact (courtesy of archive.org):
This information came to light November 8 2022 as a result of our ongoing efforts to ensure the security and integrity of our users’ personal information
They knew, yet they chose not to contact impacted people. And they're also confident that position didn't violate any data protection regulations (current version of the same page):
Deezer has not violated any data protection regulations
And based on the carveouts discussed earlier, I can see how they drew that conclusion. Was the disclosed data likely to lead to "a high risk of adversely affecting individuals’ rights and freedoms"? You can imagine lawyers arguing that it wouldn't. Regardless, people were pissed, and if you read through those respective Twitter threads, you'll get a good sense of the public reaction to their handling of the incident. HIBP sent 445k notifications to our own individual subscribers and another 39k to those monitoring domains with email addresses in the breach, and if I were to hazard a guess, that may have been what led to this:
Is this *finally* the @Deezer disclosure notice to individuals, a month and a half later? It doesn’t look like a new incident to me, anyone else get this? https://t.co/RrWlczItLm
— Troy Hunt (@troyhunt) February 20, 2023
So, they knew about the breach in Nov, and they told people in Feb. It took them a quarter of a year to tell their customers they'd been breached, and if my understanding of their position and the regulations they were adhering to is correct, they never needed to send the notice at all.
I appreciate that's a very long-winded introduction to this post, but it sets the scene and illustrates the conundrum perfectly: an organisation may not need to disclose to individuals, but if they don't, they risk a backlash that may eventually force their hand.
In my past dealings with organisations that were reticent to disclose to their customers, their positions were often that the data was relatively benign. Email addresses, names, and some other identifiers of minimal consequence. It's often clear that the organisation is leaning towards the "uh, maybe we just don't say anything" angle, and if it's not already obvious, that's not a position I'd encourage. Let's go through all the reasons:
I ask this question because the defence I've often heard from organisations choosing the non-disclosure path is that the data is theirs - the company's. I have a fundamental issue with this, and it's not one with any legal basis (but I can imagine it being argued by lawyers in favour of that position), rather the commonsense position that someone's email address, for example, is theirs. If my email address appears in a data breach, then that's my email address and I entrusted the organisation in question to look after it. Whether there's a legal basis for the argument or not, the assertion that personally identifiable attributes become the property of another party will buy you absolutely no favours with the individual who provided them to you when you don't let them know you've leaked it.
Picking those terms from earlier on, if my gender, sexuality, ethnicity, and, in my case, even my entire medical history were to be made public, I would suffer no serious harm. You'd learn nothing of any consequence that you don't already know about me, and personally, I would not feel that I suffered as a result. However...
For some people, simply the association of their email address to their name may have a tangible impact on their life and, using the term from above, jeopardise their rights and freedoms. Some people choose to keep their IRL identities completely detached from their email address, only providing the two together to a handful of trusted parties. If you're handling a data breach for your organisation, do you know if any of your impacted customers are in that boat? No, of course not; how could you?
Further, let's imagine there is nothing more than email addresses and passwords exposed on a cat forum. Is that likely to cause harm to people? Well, it's just cats; how bad could it be? Now, ask that question - how bad could it be? - with the prevalence of password reuse in mind. This isn't just a cat forum; it is a repository of credentials that will unlock social media, email, and financial services. Of course, it's not the fault of the breached service that people reuse their passwords, but their breach could lead to serious harm via the compromise of accounts on totally unrelated services.
Let's make it even more benign: what if it's just email addresses? Nothing else, just addresses and, of course, the association to the breached service. Firstly, the victims of that breach may not want their association with the service to be publicly known. Granted, there's a spectrum and weaponising someone's presence in Ashley Madison is a very different story from pointing out that they're a LinkedIn user. But conversely, the association is enormously useful phishing material; it helps scammers build a more convincing narrative when they can construct their messages by repeating accurate facts about their victim: "Hey, it's Acme Corp here, we know you're a loyal user, and we'd like to make you a special offer". You get the idea.
I'll start this one in the complete opposite direction to what it sounds like it should be because this is what I've previously heard from breached organisations:
We don't want to disclose in order to protect our customers
Uh, you sure about that? And yes, you did read that paraphrasing correctly. In fact, here's a copy paste from a recent discussion about disclosure where there was an argument against any public discussion of the incident:
Our concern is that your public notification would direct bad actors to search for the file, which can potentially do harm to both the business and our mutual users.
The fundamental issue of this clearly being an attempt to suppress news of the incident aside, in this particular case, the data was already on a popular clear web hacking forum, and the incident has appeared in multiple tweets viewed by thousands of people. The argument makes no sense whatsoever; the bad guys - lots of them - already have the data. And the good guys (the customers) don't know about it.
I'll quote precisely from another company who took a similar approach around non-disclosure:
[company name] is taking steps to notify regulators and data subjects where it is legally required to do so, based on advice from external legal counsel.
By now, I don't think I need to emphasise the caveat that they inevitably relied on to suppress the incident, but just to be clear: "where it is legally required to do so". I can say with a very high degree of confidence that they never notified the 8-figure number of customers exposed in this incident because they didn't have to. (I hear about it pretty quickly when disclosure notices are sent out, and I regularly share these via my X feed).
Non-disclosure is intended to protect the brand and by extension, the shareholders, not the customers.
Usually, after being sent a data breach, the first thing I do is search for "[company name] data breach". Often, the only results I get are for a listing on a popular hacking forum (again, on the clear web) where their data was made available for download, complete with a description of the incident. Often, that description is wrong (turns out hackers like to embellish their accomplishments). Incorrect conclusions are drawn and publicised, and they're the ones people find when searching for the incident.
When a company doesn't have a public position on a breach, the vacuum it creates is filled by others. Obviously, those with nefarious intent, but also by journalists, and many of those don't have the facts right either. Public disclosure allows the breached organisation to set the narrative, assuming they're forthcoming and transparent and don't water it down such that there's no substance in the disclosure, of course.
All the way back in 2017, I wrote about The 5 Stages of Data Breach Grief as I watched The AA in the UK dig themselves into an ever-deepening hole. They were doubling down on bullshit, and there was simply no way the truth wasn't going to come out. It was such a predictable pattern that, just like with Kübler-Ross' stages of personal grief, it was very clear how this was going to play out.
If you choose not to disclose a breach - for whatever reason - how long will it be until your "truth" comes out? Tomorrow? Next month? Years from now?! You'll be looking over your shoulder until it happens, and if it does one day go public, how will you be judged? Which brings me to the next point:
I can't put any precise measure on it, but I feel we reached a turning point in 2017. I even remember where I was when it dawned on me, sitting in a car on the way to the airport to testify before US Congress on the impact of data breaches. News had recently broken that Uber had attempted to cover up its breach of the year before by passing it off as a bug bounty and, of course, not notifying impacted customers. What dawned on me at that moment of reflection was that by now, there had been so many data breaches that we were judging organisations not by whether they'd been breached but how they'd handled the breach. Uber was getting raked over the coals not for the breach itself but because they tried to conceal it. (Their chief security officer was also later convicted of federal charges for some of the shenanigans pulled under his watch.)
This is going to feel like I'm talking to my kids after they've done something wrong, but here goes anyway: If people entrusted you with their data and you "lost" it (had it disclosed to unauthorised parties), the only decent thing to do is own up and acknowledge it. It doesn't matter if it was your organisation directly or, as with the Deezer situation, a third party you entrusted with the data; you are the coalface to your customers, and you're the one who is accountable for their data.
I am yet to see any valid reasons not to disclose that are in the best interests of the impacted customers (the delay in the AT&T breach announcement at the request of the FBI due to national security interests is the closest I can come to justifying non-disclosure). It's undoubtedly the customers' expectation, and increasingly, it's the governments' expectations too; I'll leave you with a quote from our previous Cyber Security Minister Clare O'Neil in a recent interview:
But the real people who feel pain here are Australians when their information that they gave in good faith to that company is breached in a cyber incident, and the focus is not on those customers from the very first moment. The people whose data has been stolen are the real victims here. And if you focus on them and put their interests first every single day, you will get good outcomes. Your customers and your clients will be respectful of it, and the Australian government will applaud you for it.
I'm presently on a whirlwind North America tour, visiting government and law enforcement agencies to understand more about their challenges and where we can assist with HIBP. As I spend more time with these agencies around the world, I keep hearing that data breach victim notification is an essential piece of the cybersecurity story, and I'm making damn sure to highlight the deficiencies I've written about here. We're going to keep pushing for all data breach victims to be notified when their data is exposed, and my hope in writing this is that when it's read in future by other organisations I've disclosed to, they respect their customers and disclose promptly. Check out Data breach disclosure 101: How to succeed after you've failed for guidance on how to do this.
Edit (a couple of days later): I'm adding an addendum to this post given how relevant it is. I just saw the following from Ruben van Well of the Dutch Police, someone who has invested a lot of effort in victim notification and we had the pleasure of spending time with last year in Rotterdam:
To translate the key section:
Reporting and transparency around incidents is important. Of the companies that fall victim, between 8 and 10% report this, whether or not out of fear of reputational damage. I assume that your image will be more damaged if you do not report an incident and it does come out later.
It echoes my sentiments from above precisely, and I hope that message has an impact on anyone considering whether or not to disclose.
TL;DR — Tens of millions of credentials obtained from info stealer logs populated by malware were posted to Telegram channels last month and used to shake down companies for bug bounties under the misrepresentation the data originated from their service.
How many attempted scams do you get each day? I woke up to yet another "redeem your points" SMS this morning, I'll probably receive a phone call from "my bank" today (edit: I was close, it was "Amazon Prime" 🤷♂️) and don't even get me started on my inbox. We're bombarded to the point of desensitisation, which itself is dangerous because it creates the risk of inadvertently dismissing something that really does require your attention. Which brings me to the email Scott Helme from Report URI (disclosure: a service I've long partnered with and advised) received yesterday titled "Bug bounty Program - PII leak Credentials more than 170". It began as follows:
Through open-source intelligence gathering, I discovered a significant amount of "report-uri.com" user credentials and sensitive documents have been leaked and are publicly accessible.
The sender then attached a text file with 197 lines of email addresses and passwords belonging to users of Scott's pride and joy. The first lines looked like this (url:email:password):
Imagine the heart-in-mouth moment he had when first seeing that; had someone compromised his service? Was this the data of his customers who had entrusted it to him and it was now floating around the internet? Isn't he the guy who's meant to be teaching others about application security?! The email went on:
The impact of this vulnerability is severe, potentially resulting in:
Mass account takeovers by malicious actors.
Exposure of sensitive user data including names, emails, addresses, and documents.
Unauthorized transactions or malicious activities using compromised accounts.
Further compromise of organizational infrastructure through account abuse.
Financial and reputational damage due to security breaches.
Just to avoid any semblance of doubt as to the motive of the sender, the subject began by flagging the desire for a bug bounty (Report URI does not advertise a bounty program, but clearly a reward was being sought), followed by an email body stating it related to leaked Report URI credentials and then highlighted that "this vulnerability is severe". And then there's that last line about financial and reputation damage. It looked bad. However, cooler heads prevailed, and we started looking closer at the email addresses in the "breach" by checking them against Have I Been Pwned. Very quickly, a pattern emerged:
Most of the addresses we checked had appeared in the lists posted to Telegram I'd loaded into HIBP a couple of months ago. These were stealer logs, not a breach of Report URI! To validate that assertion, I pulled the original data source and parsed out every line containing "report-uri.com". Sure enough, the lines from the file sent to Scott were usually contained in the stealer log files. So, let's talk about how this works:
Take the URL you saw at the beginning of each line earlier on, the one being for the registration page. Here's what it looks like:
Now, imagine you're filling out this form and your machine is infected with malware that can observe the data entered into each field. It takes that data, "steals" it and logs it at the attacker's server, hence the term "info stealer logs". There is absolutely nothing Scott can do to prevent this; the user's machine is compromised, not Report URI.
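The validation step described earlier (parsing out every line containing "report-uri.com" and checking whether the rows Scott was sent also appear in the stealer logs) can be sketched roughly as follows. This is a minimal illustration, not the tooling actually used: the function names are hypothetical, the sample rows are made up, and it assumes the `url:email:password` layout with no colon inside the password.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class StealerRow(NamedTuple):
    url: str
    email: str
    password: str


def parse_row(line: str) -> Optional[StealerRow]:
    """Split a 'url:email:password' line into its three fields.

    The URL itself contains colons ("https://..."), so split from the right.
    Assumption: the password contains no colon; rows that break this
    assumption would be mis-parsed and need manual review.
    """
    parts = line.strip().rsplit(":", 2)
    if len(parts) != 3 or "@" not in parts[1]:
        return None
    return StealerRow(parts[0], parts[1].lower(), parts[2])


def host_of(url: str) -> str:
    """Return the lower-cased host of a URL, or '' if it cannot be parsed."""
    try:
        return (urlsplit(url).hostname or "").lower()
    except ValueError:
        return ""


def rows_for_domain(lines, domain):
    """Keep only parsed rows whose URL host is `domain` or a subdomain of it."""
    matches = []
    for line in lines:
        row = parse_row(line)
        if row is None:
            continue
        host = host_of(row.url)
        if host == domain or host.endswith("." + domain):
            matches.append(row)
    return matches


def split_by_presence(submitted_lines, log_lines, domain):
    """Return (found, missing) rows from the submitted file, judged by whether
    the same (email, password) pair appears in the logs for `domain`."""
    known = {(r.email, r.password) for r in rows_for_domain(log_lines, domain)}
    found, missing = [], []
    for line in submitted_lines:
        row = parse_row(line)
        if row is None:
            continue
        (found if (row.email, row.password) in known else missing).append(row)
    return found, missing


# Made-up sample data purely for illustration
log_lines = [
    "https://report-uri.com/register/:alice@example.com:hunter2",
    "https://accounts.google.com/signin:alice@example.com:hunter2",
    "https://report-uri.com/login/:bob@example.com:p@ss",
]
submitted_lines = [
    "https://report-uri.com/register/:alice@example.com:hunter2",
    "https://report-uri.com/login/:carol@example.com:secret",
]
found, missing = split_by_presence(submitted_lines, log_lines, "report-uri.com")
print(len(found), "found in logs;", len(missing), "not found")
```

Rows that land in the "missing" bucket are not evidence of a breach on their own; they may simply come from a stealer log corpus that hasn't been collected yet.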
To illustrate the point, I grabbed the first email address in the file Scott was sent and pulled the rows just for that address rather than solely the Report URI rows. This would show us all the other services this person's credentials were snared from, and there were dozens. Here are just the first ten:
Google. Apple. Twitter. Most with the same password too, because a normal person obviously owns this email address. So, has each of these organisations also received a beg bounty? No, that's not a typo, this is classic behaviour where unsophisticated and self-proclaimed "security researchers" use automated tooling to identify largely benign security configurations that could be construed as vulnerabilities. For example, they'll send through a report that an SPF record is too permissive (they probably can't even spell "SPF", let alone understand the nuances of sender policies), then try to shake people like Scott down for money under the guise of a "bug bounty". This isn't Scott's problem, nor is it Google's or Apple's or Twitter's, it's something only the malware-infected victims can address.
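Pulling every row for a single address, rather than only the Report URI rows, is what exposes both the spread of affected services and the password reuse. A rough sketch of that lookup is below; the `profile` function and the sample rows are hypothetical, and it again assumes `url:email:password` lines with no colon in the password.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def profile(lines, email):
    """Return {password: sorted hosts} for one address, showing password reuse."""
    by_password = defaultdict(set)
    for line in lines:
        # Split from the right because the URL itself contains colons
        parts = line.strip().rsplit(":", 2)
        if len(parts) != 3 or parts[1].lower() != email.lower():
            continue
        url, _email, password = parts
        try:
            host = urlsplit(url).hostname or url
        except ValueError:
            host = url  # fall back to the raw value if the URL is malformed
        by_password[password].add(host)
    return {pw: sorted(hosts) for pw, hosts in by_password.items()}


# Made-up sample data purely for illustration
sample = [
    "https://accounts.google.com/signin:alice@example.com:hunter2",
    "https://appleid.apple.com/sign-in:alice@example.com:hunter2",
    "https://twitter.com/login:alice@example.com:other",
    "https://report-uri.com/login/:bob@example.com:x",
]
for password, hosts in profile(sample, "alice@example.com").items():
    print(len(hosts), "service(s) share one password:", ", ".join(hosts))
```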
In this post, I referred to "most" of the addresses already being in HIBP and the lines from the file he was sent "usually" occurring in the logs I had. But there were gaps. For example, whilst there were 197 rows in "his" file, I only found 161 in the data I'd previously loaded. But I had a hunch on how to fill that gap and make up the difference...
Two weeks ago, I was sent a further 22GB of stealer logs found in Telegram channels. Unlike the previous corpus of data, this set contained only stealer logs (no credential stuffing lists) and had a total of 26,105,473 unique email addresses. That's significant, as it implies that every single one of those addresses belongs to someone infected with malware that's stealing their creds. Of the total count, 89.7% had been seen in previous data breaches already in HIBP which is a high crossover, but it also meant that 2,679,550 addresses were all new. I'd been considering whether or not it made sense to load this data given corpuses such as this create frustration when people don't know which site their record was snared from nor which password was impacted. One particular frustration you'll read in comments on the previous post was that people weren't sure whether their email address was in a stealer log or a credential stuffing list; did they have a machine infected with malware or was it merely recycled credentials from an old data breach? But given the way in which this new corpus of data is being used (to attempt to scam Scott and, one would assume, many others), the 7-figure number of previously unseen addresses and the fact that this time, they can all emphatically be tied back to malware campaigns, this is now searchable in HIBP as "Stealer Logs Posted to Telegram".
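As a quick sanity check, the crossover percentage follows directly from the two counts quoted above. The variable names are just for illustration:

```python
total_unique = 26_105_473      # unique email addresses in the new corpus (from the post)
previously_unseen = 2_679_550  # addresses not already in HIBP (from the post)

already_seen = total_unique - previously_unseen
crossover_pct = round(100 * already_seen / total_unique, 1)

print(f"{already_seen:,} already in HIBP ({crossover_pct}%)")
# prints: 23,425,923 already in HIBP (89.7%)
```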
Ultimately, this is just scam on top of scam: the victims in the logs have had their credentials scammed, and the person who emailed Scott attempted to use that to scam him out of a bounty. Making data like this searchable in HIBP helps people do exactly what I did as soon as Scott forwarded me the email: validate the origin and, as Scott will now do, send a terse reply encouraging the guy to show some decency and stop with the beg bounties.
Lastly, I'm increasingly conscious of how useful the information contained in stealer logs is to organisations like Report URI, and after loading that previous corpus posted from Telegram, I did help out a few companies who thought they might have been hit by it. The position they were coming from was "we keep seeing account takeovers by what looks like credential stuffing attacks, but the attackers are getting the credentials right on the first go". When I pulled the data for their domain as I later did for Report URI, the email addresses were precisely the ones being targeted for account takeover. I want to address this via HIBP, but it's non-trivial for a variety of reasons, especially those related to privacy. In order for this data to be useful to companies like Report URI, I'd need to give them other people's email addresses (the password wouldn't be necessary) based on the assumption they were customers of the service. I'm working out how to do this in a way that makes sense for everyone (well, everyone except for the bad guys), stay tuned for more and please do chime in via the comments if you have ideas on how to turn this into a useful service.
Last week, a security researcher sent me 122GB of data scraped out of thousands of Telegram channels. It contained 1.7k files with 2B lines and 361M unique email addresses of which 151M had never been seen in HIBP before. Alongside those addresses were passwords and, in many cases, the website the data pertains to. I've loaded it into Have I Been Pwned (HIBP) today because there's a huge amount of previously unseen email addresses and based on all the checks I've done, it's legitimate data. That's the high-level overview, now here are the details:
Telegram is a popular messaging platform that makes it easy to stand up a "channel" and share information to those who wish to visit it. As Telegram describes the service, it's simple, private and secure and as such, has become very popular with those wishing to share content anonymously, including content related to data breaches. Many of the breaches I've previously loaded into HIBP have been distributed via Telegram as it's simple to publish this class of data to the platform. Here's what data posted to Telegram often looks like:
These are referred to as "combolists", that is they're combinations of email addresses or usernames and passwords. The combination of these is obviously what's used to authenticate to various services, and we often see attackers using these to mount "credential stuffing" attacks where they use the lists to attempt to access accounts en masse. The list above is simply breaking the combos into their respective email service providers. For example, that last Gmail example contains over a quarter of a million rows like this:
That's only one of many files across many different Telegram channels. The data that was sent to me last week was sourced from 518 different channels and amounted to 1,748 separate files similar to the one above. Some of the files have literally no data (0kb), others are many gigabytes with many tens of millions of rows. For example, the largest file starts like this:
That looks very much like the result of info stealer malware that has obtained credentials as they were entered into websites on compromised machines. For example, the first record appears to have been snared when someone attempted to login to Nike. There's an easy way to get a sense of the accuracy of this data, just head over to the Nike homepage and click the login link which presents the following screen:
They serve the same page to both existing subscribers and new ones but then serve different pages depending on whether the email address already has an account (a classic enumeration vector). Mash the keyboard to create a fake email address and you'll be shown a registration form, but enter the address in the stealer log and, well, you get something different:
The email address has an account, hence the prompt for a password. I'm not going to test the password because that would constitute unauthorised access, but I also don't need to as the goal has already been achieved: I've demonstrated that the address has an account on Nike. (Also note that if the password didn't work it wouldn't necessarily mean it wasn't valid at some point in time in the past, it would simply mean it isn't valid now.)
Footlocker tries to be a bit more clever in avoiding enumeration on password reset, but they'll happily tell you via the registration page if the email address you've entered already exists:
Even the Italian tyre retailer happily confirmed the existence of the tested account:
Time and time again, each service I tested confirmed the presence of the email address in the stealer log. But are (or were) the passwords correct? Again, I'm not going to test those myself, but I have nearly 5M subscribers in HIBP and there's always a handful of them in any new breach that are happy to help out. So, I emailed some of the most recent ones, asked if they could help with verification and upon confirmation, sent them their data.
In reaching out to existing subscribers, I expected some repetition in terms of them already appearing in existing data breaches. For one person already in 13 different breaches in HIBP, this was their response:
Thanks Troy. These details were leaked in previous data breaches.
So accurate, but not new, and several of the breaches for this one were of a similar structure to the one we're talking about today in terms of them being combolists used for credential stuffing attacks. Same with another subscriber who was in 7 prior breaches:
Yes that’s familiar. Most likely would have used those credentials on the previous data breaches.
That one was more interesting as of the 7 prior breaches, only 6 had passwords exposed and none of them were combolists. Instead, it was incidents including MyFitnessPal, 8fit, FlexBooker, Jefit, MyHeritage and ShopBack; have passwords been cracked out of those (most were hashed) and used to create new lists? Very possibly. (Sidenote: this unfortunate person is obviously a bit of a fitness buff and has managed to end up in 3 different "fit" breaches.)
Another subscriber had an entry in the following format, similar to what we saw earlier on in the stealer log:
https://accounts.epicgames.com/login:[email]:[password]
They responded to my queries with the following:
I think that epic games account was for my daughter a couple of years ago but I cancelled it last year from memory. That sds like a password she may have chosen so I'll check with her in an hour or two when I see her again.
And then, a little bit later:
My daughter doesn't remember if that was her password as it was 4-5 years ago when she was only 8-9 years old. However it does sound like something she would have chosen so in all probability, I would say that is a legitimate link. We believe it was used when she played a game called Fortnite which she did infrequently at that time hence her memory is sketchy.
I realised that whilst each of these responses confirmed the legitimacy of the data, they really weren't giving me much insight into the factor that made it worth loading into HIBP: the unseen addresses. So, I went through the same process of contacting HIBP subscribers again but this time, only the ones that I'd never seen in a breach before. This would then rule out all the repurposed prior incidents and give me a much better idea of how impactful this data really was. And that's when things got really interesting.
Let's start with the most interesting one and what you're about to see is two hundred rows of stealer logs:
https://steuer.check24.de/customer-center/aff/check24/authentication:[email]:[password]
https://www.disneyplus.com/de-de/reset-password:[email]:[password]
https://auth.rtl.de/auth/realms/rtlplus/protocol/openid-connect/auth:[email]:[password]
https://www.tink.de/checkout/login:[email]:[password]
https://signin.ebay.de/ws/eBayISAPI.dll:[email]:[password]
https://vrr-db-ticketshop.de/authentication/login:[email]:[password]
https://www.planet-sports.de/checkout/register:[email]:[password]
https://www.bstn.com/eu_de/checkout/:[email]:[password]
https://www.lico-nature.de/index.php:[email]:[password]
https://ticketshop.mobil.nrw/authentication/register:[email]:[password]
https://softwareindustrie24.de/checkout/confirm/as/customer:[email]:[password]
https://www.zurbrueggen.de/checkout/register:[email]:[password]
https://www.hertz247.de/ikeage/de-de/SignUp/Profile:[email]:[password]
https://www.bluemovement.com/de-de/checkout2:[email]:[password]
android://pfDvxsQIIXYFer6DxBcqXjgyr9X3z0_f4GlJfpZMErP2oGHX74fUnXpWA29CNgnCyZ_phC8IyV0exIV6hg3iyQ==@com.sixt.reservation/:[email]:[password]
https://members.persil-service.de/login/:[email]:[password]
https://www.nicotel.de/index.php:[email]:[password]
https://www.hellofresh.de/login:[email]:[password]
https://login.live.com/login.srf:[email]:[password]
https://accounts.login.idm.telekom.com/factorx:[email]:[password]
https://grillhaus-bei-reimann.order.dish.co/register:[email]:[password]
https://signup.sipgateteam.de/:[email]:[password]
https://www.baur.de/kasse/registrieren:[email]:[password]
https://buchung.carlundcarla.de/28572879/schritt-3:[email]:[password]
https://www.qvc.de/checkout/your-information.html:[email]:[password]
https://de.omio.com/app/search-frontend/booking/96720342-e20e-4de7-8b21-ddefc0fa44bd/passenger-details:[email]:[password]
https://www.shop-apotheke.com/nx/login/:[email]:[password]
https://druckmittel.de/checkout/confirm:[email]:[password]
https://www.global-carpet.de/checkout/confirm:[email]:[password]
https://software-hero.de/checkout/confirm:[email]:[password]
https://myenergykey.com/login:[email]:[password]
https://www.sixt.de/:[email]:[password]
https://www.wlan-shop24.de/Bestellvorgang:[email]:[password]
https://www.cyberport.de/checkout/anmelden.html:[email]:[password]
https://waschmal.de/registerCustomer:[email]:[password]
https://www.wgv.de/app/moped201802/rechner/abschluss/moped:[email]:[password]
https://www.persil-service.de/signup:[email]:[password]
https://nicotel.de/:[email]:[password]
https://temial.vorwerk.de/register/checkout:[email]:[password]
https://accounts.bahn.de/auth/realms/db/login-actions/required-action:[email]:[password].
https://www.petsdeli.de/login:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://login.live.com/login.srf:[email]:[password]
https://accounts.login.idm.telekom.com/factorx:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://www.zoll-portal.de/registrierung/benutzerkonto/daten:[email]:[password]
https://v3.account.samsung.com/iam/passwords/register:[email]:[password]
https://www.amazon.pl/ap/signin:[email]:[password]
https://www.amazon.de/:[email]:[password]
https://meinkonto.telekom-dienste.de/wiederherstellung/passwort/web-pw-setzen.xhtml:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://steuer.check24.de/customer-center/aff/check24/authentication [email]:[password]
https://www.disneyplus.com/de-de/reset-password [email]:[password]
https://auth.rtl.de/auth/realms/rtlplus/protocol/openid-connect/auth [email]:[password]
https://www.tink.de/checkout/login [email]:[password]
https://signin.ebay.de/ws/eBayISAPI.dll [email]:[password]
https://vrr-db-ticketshop.de/authentication/login [email]:[password]
https://www.planet-sports.de/checkout/register [email]:[password]
https://www.bstn.com/eu_de/checkout/ [email]:[password]
https://www.lico-nature.de/index.php [email]:[password]
https://ticketshop.mobil.nrw/authentication/register [email]:[password]
https://softwareindustrie24.de/checkout/confirm/as/customer [email]:[password]
https://www.zurbrueggen.de/checkout/register [email]:[password]
https://www.hertz247.de/ikeage/de-de/SignUp/Profile [email]:[password]
https://www.bluemovement.com/de-de/checkout2 [email]:[password]
android://pfDvxsQIIXYFer6DxBcqXjgyr9X3z0_f4GlJfpZMErP2oGHX74fUnXpWA29CNgnCyZ_phC8IyV0exIV6hg3iyQ==@com.sixt.reservation/[email]:[password]
https://members.persil-service.de/login/ [email]:[password]
https://www.nicotel.de/index.php [email]:[password]
https://www.hellofresh.de/login [email]:[password]
https://login.live.com/login.srf [email]:[password]
https://accounts.login.idm.telekom.com/factorx [email]:[password]
https://grillhaus-bei-reimann.order.dish.co/register [email]:[password]
https://signup.sipgateteam.de/ [email]:[password]
https://www.baur.de/kasse/registrieren [email]:[password]
https://buchung.carlundcarla.de/28572879/schritt-3 [email]:[password]
https://www.qvc.de/checkout/your-information.html [email]:[password]
https://de.omio.com/app/search-frontend/booking/96720342-e20e-4de7-8b21-ddefc0fa44bd/passenger-details [email]:[password]
https://www.shop-apotheke.com/nx/login/ [email]:[password]
https://druckmittel.de/checkout/confirm [email]:[password]
https://www.global-carpet.de/checkout/confirm [email]:[password]
https://software-hero.de/checkout/confirm [email]:[password]
https://myenergykey.com/login [email]:[password]
https://www.sixt.de/ [email]:[password]
https://www.wlan-shop24.de/Bestellvorgang [email]:[password]
https://www.cyberport.de/checkout/anmelden.html [email]:[password]
https://waschmal.de/registerCustomer [email]:[password]
https://www.wgv.de/app/moped201802/rechner/abschluss/moped [email]:[password]
https://www.persil-service.de/signup [email]:[password]
https://nicotel.de/ [email]:[password]
https://temial.vorwerk.de/register/checkout [email]:[password]
https://accounts.bahn.de/auth/realms/db/login-actions/required-action [email]:[password].
https://www.petsdeli.de/login [email]:[password]
https://www.netflix.com/de/login [email]:[password]
https://login.live.com/login.srf [email]:[password]
https://accounts.login.idm.telekom.com/factorx [email]:[password]
https://www.netflix.com/de/login [email]:[password]
https://www.zoll-portal.de/registrierung/benutzerkonto/daten [email]:[password]
https://v3.account.samsung.com/iam/passwords/register [email]:[password]
https://www.amazon.pl/ap/signin [email]:[password]
https://www.amazon.de/ [email]:[password]
https://meinkonto.telekom-dienste.de/wiederherstellung/passwort/web-pw-setzen.xhtml [email]:[password]
https://www.netflix.com/de/login [email]:[password]
https://steuer.check24.de/customer-center/aff/check24/authentication:[email]:[password]
https://www.disneyplus.com/de-de/reset-password:[email]:[password]
https://auth.rtl.de/auth/realms/rtlplus/protocol/openid-connect/auth:[email]:[password]
https://www.tink.de/checkout/login:[email]:[password]
https://signin.ebay.de/ws/eBayISAPI.dll:[email]:[password]
https://vrr-db-ticketshop.de/authentication/login:[email]:[password]
https://www.planet-sports.de/checkout/register:[email]:[password]
https://www.bstn.com/eu_de/checkout/:[email]:[password]
https://www.lico-nature.de/index.php:[email]:[password]
https://ticketshop.mobil.nrw/authentication/register:[email]:[password]
https://softwareindustrie24.de/checkout/confirm/as/customer:[email]:[password]
https://www.zurbrueggen.de/checkout/register:[email]:[password]
https://www.hertz247.de/ikeage/de-de/SignUp/Profile:[email]:[password]
https://www.bluemovement.com/de-de/checkout2:[email]:[password]
android://pfDvxsQIIXYFer6DxBcqXjgyr9X3z0_f4GlJfpZMErP2oGHX74fUnXpWA29CNgnCyZ_phC8IyV0exIV6hg3iyQ==@com.sixt.reservation/:[email]:[password]
https://members.persil-service.de/login/:[email]:[password]
https://www.nicotel.de/index.php:[email]:[password]
https://www.hellofresh.de/login:[email]:[password]
https://login.live.com/login.srf:[email]:[password]
https://accounts.login.idm.telekom.com/factorx:[email]:[password]
https://grillhaus-bei-reimann.order.dish.co/register:[email]:[password]
https://signup.sipgateteam.de/:[email]:[password]
https://www.baur.de/kasse/registrieren:[email]:[password]
https://buchung.carlundcarla.de/28572879/schritt-3:[email]:[password]
https://www.qvc.de/checkout/your-information.html:[email]:[password]
https://de.omio.com/app/search-frontend/booking/96720342-e20e-4de7-8b21-ddefc0fa44bd/passenger-details:[email]:[password]
https://www.shop-apotheke.com/nx/login/:[email]:[password]
https://druckmittel.de/checkout/confirm:[email]:[password]
https://www.global-carpet.de/checkout/confirm:[email]:[password]
https://software-hero.de/checkout/confirm:[email]:[password]
https://myenergykey.com/login:[email]:[password]
https://www.sixt.de/:[email]:[password]
https://www.wlan-shop24.de/Bestellvorgang:[email]:[password]
https://www.cyberport.de/checkout/anmelden.html:[email]:[password]
https://waschmal.de/registerCustomer:[email]:[password]
https://www.wgv.de/app/moped201802/rechner/abschluss/moped:[email]:[password]
https://www.persil-service.de/signup:[email]:[password]
https://nicotel.de/:[email]:[password]
https://temial.vorwerk.de/register/checkout:[email]:[password]
https://accounts.bahn.de/auth/realms/db/login-actions/required-action:[email]:[password].
https://www.petsdeli.de/login:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://login.live.com/login.srf:[email]:[password]
https://accounts.login.idm.telekom.com/factorx:[email]:[password]
https://www.netflix.com/de/login:[email]:[password]
https://www.zoll-portal.de/registrierung/benutzerkonto/daten:[email]:[password]
https://v3.account.samsung.com/iam/passwords/register:[email]:[password]
https://www.amazon.pl/ap/signin:[email]:[password]
https://www.amazon.de/:[email]:[password]
https://meinkonto.telekom-dienste.de/wiederherstellung/passwort/web-pw-setzen.xhtml:[email]:[password]
steuer.check24.de/customer-center/aff/check24/authentication:[email]:[password]
www.disneyplus.com/de-de/reset-password:[email]:[password]
auth.rtl.de/auth/realms/rtlplus/protocol/openid-connect/auth:[email]:[password]
www.tink.de/checkout/login:[email]:[password]
signin.ebay.de/ws/eBayISAPI.dll:[email]:[password]
vrr-db-ticketshop.de/authentication/login:[email]:[password]
www.planet-sports.de/checkout/register:[email]:[password]
www.bstn.com/eu_de/checkout/:[email]:[password]
www.lico-nature.de/index.php:[email]:[password]
ticketshop.mobil.nrw/authentication/register:[email]:[password]
softwareindustrie24.de/checkout/confirm/as/customer:[email]:[password]
www.zurbrueggen.de/checkout/register:[email]:[password]
www.hertz247.de/ikeage/de-de/SignUp/Profile:[email]:[password]
www.bluemovement.com/de-de/checkout2:[email]:[password]
members.persil-service.de/login/:[email]:[password]
www.nicotel.de/index.php:[email]:[password]
www.hellofresh.de/login:[email]:[password]
login.live.com/login.srf:[email]:[password]
accounts.login.idm.telekom.com/factorx:[email]:[password]
grillhaus-bei-reimann.order.dish.co/register:[email]:[password]
signup.sipgateteam.de/:[email]:[password]
www.baur.de/kasse/registrieren:[email]:[password]
buchung.carlundcarla.de/28572879/schritt-3:[email]:[password]
www.qvc.de/checkout/your-information.html:[email]:[password]
de.omio.com/app/search-frontend/booking/96720342-e20e-4de7-8b21-ddefc0fa44bd/passenger-details:[email]:[password]
www.shop-apotheke.com/nx/login/:[email]:[password]
druckmittel.de/checkout/confirm:[email]:[password]
www.global-carpet.de/checkout/confirm:[email]:[password]
software-hero.de/checkout/confirm:[email]:[password]
myenergykey.com/login:[email]:[password]
www.sixt.de/:[email]:[password]
www.wlan-shop24.de/Bestellvorgang:[email]:[password]
www.cyberport.de/checkout/anmelden.html:[email]:[password]
waschmal.de/registerCustomer:[email]:[password]
www.wgv.de/app/moped201802/rechner/abschluss/moped:[email]:[password]
www.persil-service.de/signup:[email]:[password]
nicotel.de/:[email]:[password]
temial.vorwerk.de/register/checkout:[email]:[password]
accounts.bahn.de/auth/realms/db/login-actions/required-action:[email]:[password].
www.petsdeli.de/login:[email]:[password]
login.live.com/login.srf:[email]:[password]
accounts.login.idm.telekom.com/factorx:[email]:[password]
www.netflix.com/de/login:[email]:[password]
www.zoll-portal.de/registrierung/benutzerkonto/daten:[email]:[password]
v3.account.samsung.com/iam/passwords/register:[email]:[password]
www.amazon.pl/ap/signin:[email]:[password]
www.amazon.de/:[email]:[password]
meinkonto.telekom-dienste.de/wiederherstellung/passwort/web-pw-setzen.xhtml:[email]:[password]
Even without seeing the email address and password, the commonality is clear: German websites. Whilst the email address is common, the passwords are not... at least not always. In 168 instances they were near identical with only a handful of them deviating by a character or two. There's some duplication across the lines (9 different rows of Netflix, 4 of Disney Plus, etc), but clearly this remains a significant volume of data. But is it real? Let's find out:
The data seems accurate so far. I have already changed some of the passwords as I was notified by the provider that my account was hacked. It is strange that the Telekom password was already generated and should not be guessable. I store my passwords in Firefox, so is it possible that they were stolen from there?
It's legit. Stealer malware explains both the Telekom password and why passwords in Firefox were obtained; there's not necessarily anything wrong with either service, but if a machine is infected with software that can grab passwords straight out of the fields they've been entered into in the browser, it's game over.
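The rows above arrive in a few slightly different shapes: colon-delimited, space-delimited, with or without a scheme, and with the odd `android://` app identifier. For anyone wanting to tally rows like these per website, here is a minimal sketch in Python. The function names (`parse_row`, `count_hosts`) and the sample address are made up for illustration, and this is not how HIBP processes the data. It assumes the email address is the last address-shaped token followed by a colon, so an unusual password that itself looks like `name@host.tld:` would be split in the wrong place.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Optional URL, then ":" or whitespace, then an email address, then ":" and the password.
# The greedy URL group means the LAST address-shaped token is treated as the email,
# which keeps "android://...==@com.sixt.reservation/" from being mistaken for one.
ROW = re.compile(
    r"^(?:(?P<url>.*)[:\s])?"
    r"(?P<email>[^\s:@/]+@[^\s:@/]+\.[^\s:@/]+)"
    r":(?P<password>.*)$"
)


def parse_row(line):
    """Return (host, email, password), or None if the row does not fit the pattern."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    url = (match.group("url") or "").rstrip(": ")
    host = None
    if url:
        # Rows without a scheme still need one for urlsplit to find the host.
        if "://" not in url:
            url = "https://" + url
        host = urlsplit(url).hostname
    return host, match.group("email"), match.group("password")


def count_hosts(lines):
    """Count how many rows were captured against each host name."""
    hosts = Counter()
    for line in lines:
        parsed = parse_row(line)
        if parsed is not None and parsed[0] is not None:
            hosts[parsed[0]] += 1
    return hosts


if __name__ == "__main__":
    sample = [
        "https://www.netflix.com/de/login:user@example.com:hunter2",
        "https://www.netflix.com/de/login user@example.com:hunter2",
        "www.netflix.com/de/login:user@example.com:hunter2",
        "user@example.com:hunter2",
    ]
    for host, rows in count_hosts(sample).most_common():
        print(host, rows)
```

Rows with no accompanying website still parse, but they are left out of the per-host tally because there is no host to count them against.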
We started having some to-and-fro as I gathered more info, especially as it related to the timeframe:
It started about a month ago, maximum 6 weeks. I use a Macbook and an iPhone, only a Windows PC at work, maybe it happened there? About a week ago there was an extreme spam attack on my Gmail account, and several expensive items were ordered with my accounts in the same period, which fortunately could be canceled.
We had the usual discussion about password managers and, before that, about tracking down which device is infected and siphoning off secrets. It was obviously distressing for her to see all her accounts laid out like this, not to mention learning that they were being exchanged in channels frequented by criminals. But from the perspective of verifying both the legitimacy and uniqueness of the data (not to mention the freshness), this was an enormously valuable exchange.
Next up was another subscriber who'd previously dodged all the data breaches in HIBP yet somehow managed to end up with 53 rows of data in the corpus:
[email]:Gru[redacted password]
[email]:fux[redacted password]
[email]:zWi[redacted password]
[email]:6ii[redacted password]
[email]:qTM[redacted password]
[email]:Pre[redacted password]
[email]:i8$[redacted password]
[email]:9cr[redacted password]
[email]:fuc[redacted password]
[email]:kuM[redacted password]
[email]:Fuc[redacted password]
[email]:Pre[redacted password]
[email]:Vxt[redacted password]
[email]:%3r[redacted password]
[email]:But[redacted password]
[email]:1qH[redacted password]
[email]:^VS[redacted password]
[email]:But[redacted password]
[email]:Nbs[redacted password]
[email]:*W2[redacted password]
[email]:$aM[redacted password]
[email]:DA^[redacted password]
[email]:vPE[redacted password]
[email]:Z8u[redacted password]
[email]:But[redacted password]
[email]:aXi[redacted password]
[email]:rPe[redacted password]
[email]:b4F[redacted password]
[email]:2u&[redacted password]
[email]:5%f[redacted password]
[email]:Lmt[redacted password]
[email]:p
[email]:Tem[redacted password]
[email]:fuc[redacted password]
[email]:*e@[redacted password]
[email]:(k+[redacted password]
[email]:Ste[redacted password]
[email]:^@f[redacted password]
[email]:XT$[redacted password]
[email]:25@[redacted password]
[email]:Jav[redacted password]
[email]:U8![redacted password]
[email]:LsZ[redacted password]
[email]:But[redacted password]
[email]:g$V[redacted password]
[email]:M9@[redacted password]
[email]:!6D[redacted password]
[email]:Fac[redacted password]
[email]:but[redacted password]
[email]:Why[redacted password]
[email]:h45[redacted password]
[email]:blo[redacted password]
[email]:azT[redacted password]
I've redacted everything after the first three characters of the password so you can get a sense of the breadth of different ones here. In this instance, there was no accompanying website, but the data checked out:
Oh damn a lot of those do seem pretty accurate. Some are quite old and outdated too. I tend to use that gmail account for inconsequential shit so I'm not too fussed, but I'll definitely get stuck in and change all those passwords ASAP. This actually explains a lot because I've noticed some pretty suspicious activity with a couple of different accounts lately.
Another with 35 records of website, email and password triplets responded as follows (I'll stop pasting in the source data, you know what that looks like by now):
Thank you very much for the information, although I already knew about this (I think it was due to a breach in LastPass) and I already changed the passwords, your information is much more complete and clear. It helped me find some pages where I haven't changed the password.
The final one of note really struck a chord with me, not because of the thirteen rows of records similar to the ones above, but because of what he told me in his reply:
Thank you for your kindness. Most of these I have been able to change the passwords of and they do look familiar. The passwords on there have been changed. Is there a way we both can fix this problem as seeing I am only 14?
That's my son's age and predictably, all the websites listed were gaming sites. The kid had obviously installed something nasty and had signed up to HIBP notifications only a week earlier. He explained he'd recently received an email attempting to extort him for $1.3k worth of Bitcoin and shared the message. It was clearly a mass-mailed, indiscriminate shakedown and I advised him that it in no way targeted him directly. Concerned, he countered with a second extortion email he'd received, this time it was your classic "we caught you watching porn and masturbating" scam, and this one really had him worried:
I have been stressed and scared about these scams (even though I shouldn’t be). I have been very stressed and scared today because of another one of those emails.
Imagine being a young teenage boy and receiving that?! That's the sort of thing criminals frequenting Telegram channels such as the ones in question are using this data for, and it's reprehensible. I gave him some tips (I see the sorts of things my son's friends randomly install!) and hopefully, that'll set him on the right course.
Those were the most noteworthy responses; the others, often just a single email address and password pair, simply reinforced the same message:
Yes, this is an old password that I have used in the past, and matches the password of my accounts that had been logged into recently.
And:
Yes that password is familiar and accurate. I used to practice password re-use with this password across many services 5+ years ago. This makes it impossible to correlate it to a particular service or breach. It is known to me to be out there already, I've received crypto extortion emails containing it.
I know that many people who find themselves in this incident will be confused; which breach is it? I've never used Telegram before, why am I there? Those questions came through during my verification process and I know from loading previous similar breaches, they'll come up over and over again in the coming days and I hope that the overview above sufficiently answers these.
The questions that are harder to answer (and again, I know these will come up based on prior experience), are what the password is that was exposed, what the website it appeared next to was and, indeed, if it appeared next to a website at all or just alongside an email address. Right at the beginning of this project more than a decade ago, I made the decision not to load the data that would answer these questions due to the risk it posed to individuals and by extension, the risk to my ability to continue running HIBP. We were reminded of how important this decision was earlier in the year when a service aggregating data breaches left the whole thing exposed and put everyone in there at even more risk.
So, if you're in here, what do you do? It's a repeat of the same old advice we've been giving in this industry for decades now, namely keeping devices patched and updated, running security software appropriate for your device (I use Microsoft Defender on my PCs), using strong and unique passwords (get a password manager!) and enabling 2FA wherever possible. Each HIBP subscriber I contacted wasn't doing at least one of these things, which was evident in their password selection. Time and time again, passwords consisted of highly predictable patterns and often included their name, year of birth (I assume) and common character substitutions, usually within a dozen characters of length too. It's the absolute basics that are going wrong here.
To the point one of the HIBP subscribers made above, loading this data will help many people explain why they've been seeing unusual behaviour on their accounts. It's also the wakeup call to lift everyone's security game per the previous paragraph. But this also isn't the end of it, and more combolists have been posted in more Telegram channels since loading this incident. Whilst I'm still of the view from years ago that I'm not going to continuously load endless lists, I do hope people recognise that their security posture is an ongoing concern and not just something you think about after appearing in a breach.
The data is now searchable in Have I Been Pwned.
Today we loaded 16.5M email addresses and 13.5M unique passwords provided by law enforcement agencies into Have I Been Pwned (HIBP) following botnet takedowns in a campaign they've coined Operation Endgame. That link provides an excellent overview so start there then come back to this blog post which adds some insight into the data and explains how HIBP fits into the picture.
Since 2013 when I kicked off HIBP as a pet project, it has become an increasingly important part of the security posture of individuals, organisations, governments and law enforcement agencies. Gradually and organically, it has found a fit where it's able to provide a useful service to the good guys after the bad guys have done evil cyber things. The phrase I've been fond of this last decade is that HIBP is there to do good things with data after bad things happen. The reputation and reach the service has gained in this time has led to partnerships such as the one you're reading about here today. So, with that in mind, let's get into the mechanics of the data:
In terms of the email addresses, there were 16.5M in total with 4.5M of them not having been seen in previous data breaches already in HIBP. We found 25k of our own individual subscribers in the corpus of data, plus another 20k domain subscribers which are usually organisations monitoring the exposure of their customers (all of these subscribers have now been sent notification emails). As the data was provided to us by law enforcement for the public good, the breach is flagged as subscription free which means any organisation that can prove control of the domain can search it irrespective of the subscription model we launched for large domains in August last year.
The only data we've been provided with is email addresses and disassociated password hashes, that is they don't appear alongside a corresponding address. This is the bare minimum we need to make that data searchable and useful to those impacted. So, let's talk about those standalone passwords:
There are 13.5 million unique passwords of which 8.9M were already in Pwned Passwords. Those passwords have had their prevalence counts updated accordingly (we received counts for each password with many appearing in the takedown multiple times over), so if you're using Pwned Passwords already, you'll see new numbers next to some entries. That also means there are 4.6M passwords we've never seen before which you can freely download using our open source tool. Or even better, if you're querying Pwned Passwords on demand you don't need to do anything as the new entries are automatically added to the result set. All this is made possible by feeding the data into the law enforcement pipeline we built for the FBI and NCA a few years ago.
A quick geek-out moment on Pwned Passwords: at present, we're serving almost 8 billion requests per month to this service:
Taking just last week as an example, we're a rounding error off 100% of requests being served directly from Cloudflare's cache:
That's over 99.99% of all requests during that period that were served from one of Cloudflare's edge nodes that sit in 320 cities globally. What that means for consumers of the service is massively fast response times due to the low latency of serving content from a nearby location and huge confidence in availability as there's only about a one-in-ten-thousand chance of the request being served by our origin service. If you'd like to know more about how we achieved this, check out my post from a year ago on using Cloudflare Cache Reserve.
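To put that ratio in absolute terms, here is a small back-of-the-envelope sketch. It uses the rounded figures quoted above ("almost 8 billion" requests per month and "about a one-in-ten-thousand chance" of reaching the origin) rather than exact traffic numbers, so the result is an approximate ceiling and not a measurement. The function name is made up for illustration.

```python
def origin_requests_upper_bound(monthly_requests, one_in):
    """Approximate ceiling on requests reaching the origin if at most 1 in `one_in` miss the cache."""
    return monthly_requests // one_in


if __name__ == "__main__":
    # Assumed round numbers taken from the prose, not exact traffic figures.
    monthly_requests = 8_000_000_000  # "almost 8 billion requests per month"
    one_in = 10_000                   # "about a one-in-ten-thousand chance"
    print(origin_requests_upper_bound(monthly_requests, one_in))
```

On those assumptions, the origin handles somewhere under a million requests a month while the edge absorbs the other several billion.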
After pushing out the new passwords today, all but 5 hash prefixes were modified (read more about how we use hashes to enable anonymous password searches) so we did a complete Cloudflare cache flush. By the time you read this, almost the entire 16^5 possible hash ranges have been completely repopulated into cache due to the volume of requests the service receives:
Lastly, when we talk about passwords in HIBP, the inputs we receive from law enforcement consist of 3 parts:
The rationale for this is explained in the links above but in a nutshell, the SHA-1 format ensures any badly parsed data that may inadvertently include PII is protected and it aligns with the underlying data structure that drives the k-anonymity searches. We have NTLM hashes as well because many orgs use them to check passwords in their own Active Directory instances.
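For readers who haven't seen the k-anonymity model before, here is a minimal sketch of the client side in Python. The password is hashed locally with SHA-1, only the first 5 hex characters are sent to the range endpoint (`https://api.pwnedpasswords.com/range/{prefix}`), and the suffixes that come back as `SUFFIX:COUNT` lines are compared locally. The function names are made up for illustration, the HTTP request itself is left out so the block runs offline, and the counts in the sample response are invented.

```python
import hashlib

# 5 hex characters of prefix gives 16^5 possible hash ranges.
RANGE_COUNT = 16 ** 5


def sha1_prefix_suffix(password):
    """Hash the password locally and split it into the 5-character prefix and the remainder."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_from_range_response(body, suffix):
    """Find our suffix in a 'SUFFIX:COUNT' response body; 0 means it was not in that range."""
    for line in body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


if __name__ == "__main__":
    prefix, suffix = sha1_prefix_suffix("password")
    # Only `prefix` would ever leave the machine; the full hash and the password stay local.
    print(prefix, RANGE_COUNT)

    # Invented response body for illustration; a real one holds every suffix in the range.
    sample_body = (
        "0018A45C4D1DEF81644B54AB7F969B88D65:3\r\n"
        + suffix + ":42\r\n"
    )
    print(count_from_range_response(sample_body, suffix))
```

Because the lookup is keyed on the prefix alone, each of those ranges is a small, static, cacheable response.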
So, what can you do if you find your data in this incident? It's a similar story to the Emotet malware provided by the FBI and NHTCU a few years ago in that the sage old advice applies: get a password manager and make them all strong and unique, turn on 2FA everywhere, keep machines patched, etc. If you find your password in the data (the HIBP password search feature anonymises it before searching, or password managers like 1Password can scan all of your passwords in one go), obviously change it everywhere you've used it.
This operation will be significant in terms of the impact on cybercrime, and I'm glad we've been able to put this little project to good use by supporting our friends in law enforcement who are doing their best to support all of us as online citizens.
We often do that in this industry, the whole "1.0" thing, but it seems apt here. I started Have I Been Pwned (HIBP) in 2013 as a pet project that scratched an itch, so I never really thought of myself as an "employee". Over time, it grew (and I tell you what, nobody is more surprised by that than me!) and over the last few years, my wife Charlotte got more and more involved. Technically, we're both employees and we work on HIBP things but we're like, well, beta versions.
Today, I'm very happy to announce our first full-time, production-ready employee: Stefán Jökull Sigurðarson. This is both a massive commitment on Charlotte's and my part and a leap of faith on Stefán's and deserves some background:
I suffer somewhat from what I'll call the "founder's paradox", that is I find myself having built something genuinely useful and wanting to see it grow and mature yet also not wanting to let go. I want to be involved in everything, but I also want to go on holidays sometimes and tune out. I like making decisions on every aspect of how the service runs, but I want it to outlive me. Bringing any outside party into any business can be hard to come to terms with, but especially in the case of HIBP where it's become so critical to so many people and deals with so much sensitive data. Which is why I have to trust people like Stefán because if I don't, I'm one shark / snake / croc incident away from disappointing a lot of people.
Trust is the cornerstone of why Stefán is joining us now. Not just trust in his technical skills, but trust in him as a person. I've known Stefán for many years now, initially when he came to one of my Hack Yourself First workshops in Oslo back in 2018, then as a blogger writing about how he was implementing Pwned Passwords at EVE Online, then as a conference speaker himself, a Microsoft MVP, and in 2021, as the person who selflessly gave up his own time to support the open source Pwned Passwords. What we never made any formal announcements about is that we did hire Stefán on a part-time basis beginning earlier last year to help out with the coding when he had free cycles amidst his full-time work. That went great and he obviously enjoyed working at HIBP so earlier this month, Stefán handed in his resignation and will shortly be a full-time employee.
I'm really happy with the timing of this and how it's all worked out. We're in a position to make the financial commitment largely because of finally putting a price on searches for large domains last year. What this has allowed us to do is shift money from companies who see value in the service (more than half the Fortune 500 use the domain search feature), and reinvest it into making HIBP more sustainable. Getting Stefán onboard is the manifestation of that investment and you'll very shortly see his work begin to translate into highly visible new features. But what you won't see is the stuff that's even more important, especially as it relates to running a more sustainable service that no longer has me as a single point of failure.
So, welcome Stefán, and thank you for your commitment 😊
Oh - just one more thing: I was looking around for a great hero image for this blog post and I found this awesome video of Stefán swimming through a semi-frozen Norwegian fjord before riding an iceberg. For real, this is perhaps the most Nordic thing I've ever seen (Stefán being from Iceland and all), but unfortunately videos don't really lend themselves to hero images, so I went with a stylised AI-generated rendition of the event.
I hate having to use that word - "alleged" - because it's so inconclusive and I know it will leave people with many unanswered questions. (Edit: 12 days after publishing this blog post, it looks like the "alleged" caveat can be dropped, see the addition at the end of the post for more.) But sometimes, "alleged" is just where we need to begin and over the course of time, proper attribution is made and the dots are joined. We're here at "alleged" for two very simple reasons: one is that AT&T is saying "the data didn't come from us", and the other is that I have no way of proving otherwise. But I have proven, with sufficient confidence, that the data is real and the impact is significant. Let me explain:
Firstly, just as a primer if you're new to this story, read BleepingComputer's piece on the incident. What it boils down to is in August 2021, someone with a proven history of breaching large organisations posted what they claimed were 70 million AT&T records to a popular hacking forum and asked for a very large amount of money should anyone wish to purchase the data. From that story:
From the samples shared by the threat actor, the database contains customers' names, addresses, phone numbers, Social Security numbers, and date of birth.
Fast forward two and a half years and the successor to this forum saw a post this week alleging to contain the entire corpus of data. Except that rather than put it up for sale, someone has decided to just dump it all publicly and make it easily accessible to the masses. This isn't unusual: "fresh" data has much greater commercial value and is often tightly held for a long period before being released into the public domain. The Dropbox and LinkedIn breaches, for example, occurred in 2012 before being broadly distributed in 2016 and just like those incidents, the alleged AT&T data is now in very broad circulation. It is undoubtedly in the hands of thousands of internet randos.
AT&T's position on this is pretty simple:
AT&T continues to tell BleepingComputer today that they still see no evidence of a breach in their systems and still believe that this data did not originate from them.
The old adage of "absence of evidence is not evidence of absence" comes to mind (just because they can't find evidence of it doesn't mean it didn't happen), but as I said earlier on, I (and others) have so far been unable to prove otherwise. So, let's focus on what we can prove, starting with the accuracy of the data.
The linked article talks about the author verifying the data with various people he knows, as well as other well-known infosec identities verifying its accuracy. For my part, I've got 4.8M Have I Been Pwned (HIBP) subscribers I can lean on to assist with verification, and it turns out that 153k of them are in this data set. What I'll typically do in a scenario like this is reach out to the 30 newest subscribers (people who will hopefully recall the nature of HIBP from their recent memory), and ask them if they're willing to assist. I linked to the story from the beginning of this blog post and got a handful of willing respondents to whom I sent their data and asked two simple questions:
The first reply I received was simple, but emphatic:
This individual had their name, phone number, home address and most importantly, their social security number exposed. Per the linked story, social security numbers and dates of birth exist on most rows of the data in encrypted format, but two supplemental files expose these in plain text. Taken at face value, it looks like whoever snagged this data also obtained the private encryption key and simply decrypted the vast bulk (but not all of) the protected values.
The above example simply didn't have plain text entries for the encrypted data. Just by way of raw numbers, the file that aligns with the "70M" headline actually has 73,481,539 lines with 49,102,176 unique email addresses. The file with decrypted SSNs has 43,989,217 lines and the decrypted dates of birth file only has 43,524 rows. (Edit: the reason for this later became clear - there is only one entry per date of birth which is then referenced from multiple records.) The last file, for example, has rows that look just like this:
.encrypted_value='*0g91F1wJvGV03zUGm6mBWSg==' .decrypted_value='1996-07-18'
That encrypted value is precisely what appears in the large file hence providing an easy way of matching all the data together. But those numbers also obviously mean that not every impacted individual had their SSN exposed, and most individuals didn't have their date of birth leaked. (Edit: per above, the same entries in the DoB file are referenced by multiple source records so whilst not every record had a DoB recorded, the difference isn't as stark as I originally reported.)
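To make that matching a bit more concrete, here's a minimal sketch of how a lookup file like this could be joined back to the large file. It assumes every lookup row follows the format shown above, and the shape of the main records (a dict with a `dob_encrypted` key) is purely hypothetical rather than the actual layout of the data:

```python
import re

# Matches lookup rows in the format shown above; that format is the only
# detail taken from the data itself, everything else here is illustrative.
ROW_PATTERN = re.compile(r"\.encrypted_value='([^']*)'\s+\.decrypted_value='([^']*)'")


def build_lookup(lines):
    """Map each encrypted value to its decrypted value (one entry per distinct value)."""
    lookup = {}
    for line in lines:
        match = ROW_PATTERN.search(line)
        if match:
            lookup[match.group(1)] = match.group(2)
    return lookup


def resolve(records, lookup, field="dob_encrypted"):
    """Yield (record, decrypted value or None); `field` is a hypothetical key name."""
    for record in records:
        yield record, lookup.get(record.get(field))


lookup_rows = [".encrypted_value='*0g91F1wJvGV03zUGm6mBWSg==' .decrypted_value='1996-07-18'"]
lookup = build_lookup(lookup_rows)

# Made-up records: two share one encrypted date of birth, one has no match.
records = [
    {"id": 1, "dob_encrypted": "*0g91F1wJvGV03zUGm6mBWSg=="},
    {"id": 2, "dob_encrypted": "*0g91F1wJvGV03zUGm6mBWSg=="},
    {"id": 3, "dob_encrypted": "not-in-lookup"},
]
resolved = [(record["id"], dob) for record, dob in resolve(records, lookup)]
```

Because many records can point at the same encrypted value, a small lookup file can still cover a much larger number of rows in the main file, which is why the row counts alone understate how many records have a date of birth attached.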
As I'm fond of saying, there's only one thing worse than your data appearing on the dark web: it's appearing on the clear web. And that's precisely where it is; the forum this was posted to isn't within the shady underbelly of a Tor hidden service, it's out there in plain sight on a public forum easily accessed by a normal web browser. And the data is real.
That last response is where most people impacted by this will now find themselves - "what do I do?" Usually I'd tell them to get in touch with the impacted organisation and request a copy of their data from the breach, but if AT&T's position is that it didn't come from them then they may not be much help. (Although if you are a current or previous customer, you can certainly request a copy of your personal information regardless of this incident.) I've personally also used identity theft protection services since as far back as the 90's now, simply to know when actions such as credit enquiries appear against my name. In the US, this is what services like Aura do and it's become common practice for breached organisations to provide identity protection subscriptions to impacted customers (full disclosure: Aura is a previous sponsor of this blog, although we have no ongoing or upcoming commercial relationship).
What I can't do is send you your breached data, or an indication of what fields you had exposed. Whilst I did this in that handful of aforementioned cases as part of the breach verification process, this is something that happens entirely manually and is infeasible en masse. HIBP only ever stores email addresses and never the additional fields of personal information that appear in data breaches. In case you're wondering why that is, we got a solid reminder only a couple of months ago when a service making this sort of data available to the masses had an incident that exposed tens of billions of rows of personal information. That's just an unacceptable risk for which the old adage of "you cannot lose what you do not have" provides the best possible fix.
As I said in the intro, this is not the conclusive end I wanted for this blog post... yet. As impacted HIBP subscribers receive their notifications and particularly as those monitoring domains learn of the aliases in the breach (many domain owners use unique aliases per service they sign up to), we may see a more conclusive outcome to this incident. That may not necessarily be confirmation that the data did indeed originate from AT&T, it could be that it came from a third party processor they use or from another entity altogether that's entirely unrelated. The truth is somewhere there in the data, I'll add any relevant updates to this blog post if and when it comes out.
As of now, all 49M impacted email addresses are searchable within HIBP.
Edit (31 March): AT&T have just released a short statement making 2 important points:
AT&T data-specific fields were contained in a data set
it is not yet known whether the data in those fields originated from AT&T or one of its vendors
They've also been mass-resetting account passcodes after TechCrunch apparently alerted AT&T to the presence of these in the data set. That article also includes the following statement from AT&T:
Based on our preliminary analysis, the data set appears to be from 2019 or earlier, impacting approximately 7.6 million current AT&T account holders and approximately 65.4 million former account holders
Between originally publishing this blog post and AT&T's announcements today, there have been dozens of comments left below that attribute the source of the breach to AT&T in ways that made it increasingly unlikely that the data could have been sourced from anywhere else. I know that many journos (and myself) reached out to folks in AT&T to draw their attention to this, I'm happy to now end this blog post by quoting myself from the opening para 😊
But sometimes, "alleged" is just where we need to begin and over the course of time, proper attribution is made and the dots are joined.
I've always thought of it a bit like baseball cards; a kid has a card of this one player that another kid is keen on, and that kid has a card the first one wants so they make a trade. They both have a bunch of cards they've collected over time and by virtue of existing in the same social circles, trades are frequent, and cards flow back and forth on a regular basis. That's the analogy I often use to describe the data breach "personal stash" ecosystem, but with one key difference: if you trade a baseball card then you no longer have the original card, but if you trade a data breach which is merely a digital file, it replicates.
There are personal stashes of data breaches all over the place and they're usually presented like this one:
You'll recognise many of those names because they're noteworthy incidents that received a bunch of press. My Space. Adobe. LinkedIn. Ashley Madison.
The same incidents appear here:
And so on and so forth. Stashes of breaches like this are all over the place and they fuel an exchange ecosystem that replicates billions of records of personal data over and over again. Your data. My data. The data of a significant portion of the global internet-using population, just freely flowing backwards and forwards not just in the shady corners of "the dark web" but traded out there in the clear on mainstream websites. Until inevitably:
Diogo Santos Coelho was 14 when he started RaidForums, and was 21 by the time he was arrested for running the service 2 years ago. A kid, exchanging data without the maturity to understand the consequences of his actions. RaidForums left a void that was quickly filled by BreachForums:
Conor Fitzpatrick was 20 years old when he was finally picked up for running the service last year. Still just a kid, at least in the colloquial fashion in which we refer to youngsters when we get a bit older, but surely still legally a minor when he chose to begin collecting data breaches.
Websites like these are taken down for a simple reason:
The ecosystem of personal stashes exchanged with other parties fuels crime.
For example, data breaches seed services set up with the express intent of monetising a broad range of personal attributes to the detriment of people who are already victims of a breach. Call them shady versions of Have I Been Pwned if you will, and this talk I gave at AusCERT a couple of years ago is a great explainer (deep-linked to the start of that segment):
The first service I spoke about in that segment was We Leak Info and it was run by two 22 year old guys. The website first appeared 3 years earlier - only a year after the creators had left childhood - and it allowed anyone with the money to access anyone else's personal data including:
names, email addresses, usernames, phone numbers, and passwords
One of the duo was later sentenced to 2 years in prison for his role, and when you read the sorts of conversations they were having, you can't help but think they behaved exactly like you'd expect a couple of young guys who thought they were anonymous would:
In the video, I mentioned Jordan Bloom in relation to LeakedSource, a veritable older gentleman of this class of crime being 24 when the site first appeared.
The company operating LeakedSource, Defiant Tech Inc, which was founded by Jordan Bloom, eventually entered a guilty plea to charges that included trafficking in identity information and when you read what that involved, you can see why this would attract the ire of law enforcement agencies:
However, unlike other breach notification services, such as Have I Been Pwned, LeakedSource also gave subscribers access to usernames, passwords (including in clear text), email addresses and IP addresses. LeakedSource services were often advertised on hacking forums and there was suspicion that its operators were actively looking to hack organizations whose data they could add to their database.
In 2016, a well-wisher purchased my own data from LeakedSource and sent over a dozen different records similar to this one:
Not mentioned in my talk but running in the same era was Leakbase, yet another service that collated huge volumes of sensitive data and sold it to absolutely anyone:
And just like all the other ones, the same data appeared over and over again:
It went dark at the end of 2017 amidst speculation the disappearance was tied to the takedown of the Hansa dark web market. If that was the case, why did we never hear of charges being laid as we did with We Leak Info and LeakedSource? Could it be that the operator of Leakbase was only ever so slightly younger than the other guys mentioned above and not having yet reached adulthood, managed to dodge charges? It would certainly be consistent with the demographic pattern of those with personal stashes of data breaches.
Speaking of patterns: We Leak Info, LeakedSource, Leakbase - it's like there's a theme of shady services attached to the word. As I say in the video, there's also a theme of attempting to remain anonymous (which clearly hasn't worked very well!), and a theme of attempting to eschew legal responsibility for how the data is used by merely putting words in the terms of service. For example, here's Jordan's go at deflecting his role in the ecosystem and yes, this was the entire terms of service:
I particularly like this clause:
You may only use this tool for your own personal security and data research. You may only search information about yourself, or those you are authorized in writing to do so.
That's not going to keep you out of trouble! Time and time again, I see this sort of wording on services used as if it's going to make a difference when the law comes asking hard questions; "Hey we literally told people to play nice with the data!"
We Leak Info used similar entertaining wording with some of the highlights including:
That last one in particular is an absolute zinger! But again, remember, we're talking about guys who stood this service up as teenagers and literally worked on the assumption of "as [l]ong as we cooperate they [the FBI] won't fuck with us" 🤦♂️ The ignorance of that attitude whilst advertising services on criminal forums is just mind-blowing, even for kids.
All of which brings me to the inspiration for this blog post:
Interesting find by @MayhemDayOne, wonder if it was from a shady breach search service (we’ve seen a bunch shut down over the years)? Either way, collecting and storing this data is now trivial so not a big surprise to see someone screw up their permissions and (re)leak it all. https://t.co/DM7udeUcRk
— Troy Hunt (@troyhunt) January 22, 2024
It's like I've seen it all before! No, really, because only a couple of days later someone running a service popped up and claimed responsibility for having exposed the data due to "a firewall misconfiguration". I'm not going to name or link the service, but I will describe a few key features:
I could write predictions about the future of this service but if you've read this far and paid attention to the precedents, you can reliably form your own conclusion. The outcome is easily predictable and indeed it was the predictability of the whole situation when I started getting bombarded with queries about the "Mother of all Breaches" that frustrated me; of course it was someone's personal stash, because we've seen it all before and we live in an era where it's dead easy to build services like this. Cloud is ubiquitous and storage is cheap, you can stand up great looking websites in next to no time courtesy of freely available templates, and the whole data breach trading ecosystem I referred to earlier can easily seed services like this.
Maybe the young guy running this service (assuming the previously observed patterns apply) will learn from history and quietly exit while the getting is good, I don't know, time will tell. At the very least, if he reads this and takes nothing else away, don't go driving around in a bright green Lamborghini!
Edit: In the original version of this blog post, it was incorrectly implied that Jordan Bloom may have been the person who pled guilty to charges when in fact it was the company that ran LeakedSource, Defiant Tech Inc, that the plea was entered under. To the extent that the blog contained words to the effect of, or otherwise implied or contained innuendo that Mr Bloom engaged in criminal or otherwise illegal conduct, or pled guilty to trafficking identity information, I apologise and unreservedly retract such statements and this blog has been edited to ensure that the facts involved in this matter are accurately portrayed.
It feels like not a week goes by without someone sending me yet another credential stuffing list. It's usually something to the effect of "hey, have you seen the Spotify breach", to which I politely reply with a link to my old No, Spotify Wasn't Hacked blog post (it's just the output of a small set of credentials successfully tested against their service), and we all move on. Occasionally though, the corpus of data is of much greater significance, most notably the Collection #1 incident of early 2019. But even then, the rapid appearance of Collections #2 through #5 (and more) quickly became, as I phrased it in that blog post, "a race to the bottom" I did not want to take further part in.
Until the Naz.API list appeared. Here's the back story: this week I was contacted by a well-known tech company that had received a bug bounty submission based on a credential stuffing list posted to a popular hacking forum:
Whilst this post dates back almost 4 months, it hadn't come across my radar until now and inevitably, also hadn't been sent to the aforementioned tech company. They took it seriously enough to take appropriate action against their (very sizeable) user base which gave me enough cause to investigate it further than your average cred stuffing list. Here's what I found:
That last number was the real kicker; when a third of the email addresses have never been seen before, that's statistically significant. This isn't just the usual collection of repurposed lists wrapped up with a brand-new bow on it and passed off as the next big thing; it's a significant volume of new data. When you look at the above forum post the data accompanied, the reason why becomes clear: it's from "stealer logs" or in other words, malware that has grabbed credentials from compromised machines. Apparently, this was sourced from the now defunct illicit.services website which (in)famously provided search results for other people's data along these lines:
I was aware of this service because, well, just look at the first example query 🤦♂️
So, what does a stealer log look like? Website, username and password:
That's just the first 20 rows out of 5 million in that particular file, but it gives you a good sense of the data. Is it legit? Whilst I won't test a username and password pair on a service (that's way too far into the grey for my comfort), I regularly use enumeration vectors on websites to validate whether an account actually exists or not. For example, take that last entry for racedepartment.com, head to the password reset feature and mash the keyboard to generate a (quasi) random alias @hotmail.com:
And now, with the actual Hotmail address from that last line:
The email address exists.
The VideoScribe service on line 9:
Exists.
And even the service on the very first line:
From a verification perspective, this gives me a high degree of confidence in the legitimacy of the data. The question of how valid the accompanying passwords remain aside, time and time again the email addresses in the stealer logs checked out on the services they appeared alongside.
Another technique I regularly use for validation is to reach out to impacted HIBP subscribers and simply ask them: "are you willing to help verify the legitimacy of a breach and if so, can you confirm if your data looks accurate?" I usually get pretty prompt responses:
Yes, it does. This is one of the old passwords I used for some online services.
When I asked them to date when they might have last used that password, they believed it was either 2020 or 2021.
And another whose details appear alongside a Webex URL:
Yes, it does. but that was very old password and i used it for webex cuz i didnt care and didnt use good pass because of the fear of leaking
And another:
Yes these are passwords I have used in the past.
Which got me wondering: is my own data in there? Yep, turns out it is and with a very old password I'd genuinely used pre-2011 when I rolled over to 1Password for all my things. So that sucks, but it does help me put the incident in more context and draw an important conclusion: this corpus of data isn't just stealer logs, it also contains your classic credential stuffing username and password pairs too. In fact, the largest file in the collection is just that: 312 million rows of email addresses and passwords.
Speaking of passwords, given the significance of this data set we've made sure to roll every single one of them into Pwned Passwords. Stefán has been working tirelessly the last couple of days to trawl through this massive corpus and get all the data in so that anyone hitting the k-anonymity API is already benefiting from those new passwords. And there's a lot of them: it's a rounding error off 100 million unique passwords that appeared 1.3 billion times across the corpus of data 😲 Now, what does that tell you about the general public's password practices? To be fair, there are instances of duplicated rows, but there's also a massive prevalence of people using the same password across multiple different services and completely different people using the same password (there are a finite set of dog names and years of birth out there...) And now more than ever, the impact of this service is absolutely huge!
When we weren't looking, @haveibeenpwned's Pwned Passwords rocketed past 7 *billion* requests in a month 😲 pic.twitter.com/hVDxWp3oQG
— Troy Hunt (@troyhunt) January 16, 2024
Pwned Passwords remains totally free and completely open source for both code and data so do please make use of it to the fullest extent possible. This is such an easy thing to implement, and it has a profound impact on credential stuffing attacks so if you're running any sort of online auth service and you're worried about the impact of Naz.API, this now completely kills any attack using that data. Password reuse remains rampant so attacks of this type prosper (23andMe's recent incident comes immediately to mind), definitely get out in front of this one as early as you can.
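For anyone wondering what "easy to implement" might look like, here's a minimal sketch of a k-anonymity range search in Python. The password is SHA-1 hashed, only the first 5 characters of that hash are sent to the range endpoint, and the suffixes that come back are compared locally. The function names and the user agent string are just illustrative, and things like error handling and retries are left out:

```python
import hashlib
import urllib.request

RANGE_URL = "https://api.pwnedpasswords.com/range/"


def split_hash(password):
    """Return the 5-character prefix and 35-character suffix of the SHA-1 hash."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_response(suffix, body):
    """Find `suffix` in a range response made of SUFFIX:COUNT lines; 0 if absent."""
    for line in body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


def pwned_count(password):
    """Query the range API; only the 5-character hash prefix leaves the machine."""
    prefix, suffix = split_hash(password)
    request = urllib.request.Request(
        RANGE_URL + prefix,
        headers={"User-Agent": "pwned-passwords-sketch"},  # illustrative value
    )
    with urllib.request.urlopen(request) as response:
        return count_in_response(suffix, response.read().decode("utf-8"))
```

A call like `pwned_count(candidate_password)` at registration, login or password change returns how many times that password has been seen, and what you do with anything above zero (block it, warn the user, force a change) is a decision for each service.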
So that's the story with the Naz.API data. All the email addresses are now in HIBP and searchable either individually or via domain and all those passwords are in Pwned Passwords. There are inevitably going to be queries along the lines of "can you show me the actual password" or "which website did my record appear against" and as always, this just isn't information we store or return in queries. That said, if you're following the age-old guidance of using a password manager, creating strong and unique ones and turning 2FA on for all your things, this incident should be a non-event. If you're not and you find yourself in this data, maybe this is the prompt you finally needed to go ahead and do those things right now 🙂
Edit: A few clarifications based on comments:
A decade ago to the day, I published a tweet launching what would surely become yet another pet project that scratched an itch, was kinda useful to a few people but other than that, would shortly fade away into the same obscurity as all the other ones I'd launched over the previous couple of decades:
It's alive! "Have I been pwned?" by @troyhunt is now up and running. Search for your account across multiple breaches http://t.co/U0QyHZxP6k
— Have I Been Pwned (@haveibeenpwned) December 4, 2013
And then, as they say, things kinda escalated quickly. The very next day I published a blog post about how I made it so fast to search through 154M records and thus began a now 185-post epic where I began detailing the minutiae of how I built this thing, the decisions I made about how to run it and commentary on all sorts of different breaches. And now, a 10th birthday blog post about what really sticks out a decade later. And that's precisely what this 185th blog post tagging HIBP is - the noteworthy things of the years past, including a few things I've never discussed publicly before.
You know why it's called "Have I Been Pwned"? Try coming up with almost any conceivable normal sounding English name and getting a .com domain for it. Good luck! That was certainly part of it, but another part of the name choice was simply that I honestly didn't expect this thing to go anywhere. It's like I said in the intro of this post where I fully expected this to be another failed project, so why does the name matter?
But it's weird how "pwned" has stuck and increasingly, become synonymous with HIBP. For many people, the first time they ever hear the word is in the context of "Have I Been..." with an ensuing discussion often explaining the origins of the term as it relates to gaming culture. And if you do go and look for a definition of the term online, you'll come across resources such as How “PWNED” went from hacker slang to the internet’s favourite taunt:
Then in 2013, when various web services and sites saw an uptick in personal data breaches, security expert Troy Hunt created the website “Have I Been Pwned?” Anyone can type in an email address into the site to check if their personal data has been compromised in a security breach.
And somehow, this little project is now referenced in the definition of the name it emerged from. Weird.
But, because it's such an odd name that has so frequently been mispronounced or mistyped, I've ended up with a whole raft of bizarre domain names including haveibeenpaened.com, haveibeenpwnded.com, haveibeenporned.com and my personal favourite, haveibeenprawned.com (because a journo literally pronounced it that way in a major news segment 🤦♂️). Not to mention all the other weird variations including haveibeenburned.com, haveigotpwned.com, haveibeenrekt.com and after someone made the suggestion following the revelation that PornHub follows me, haveibeenfucked.com 🤷♂️
It's difficult to even know where to start here. How does the little site with the weird name end up in the press? Inevitably, "because data breaches", and it's nuts just how much exposure this project has had because of them. These are often mainstream news events and what reporters often want to impart to people is along the lines of "Here's what you should do if you've been impacted", which often boils down to checking HIBP.
Press is great for raising awareness of the project, but it has also quite literally DDoS'd the service with the Martin Lewis Money Show in the UK knocking it offline in 2016. Cool! No, for real, I learned some really valuable lessons from that experience which, of course, I shared in a blog post. And then ensured it could never happen again.
Back in 2018, Gizmodo reckoned HIBP was one of the top 100 websites that shaped the internet as we knew it, alongside the likes of Wikipedia, Google, Amazon and Goatse (don't Google it). Only the year after it launched, TIME magazine reckon'd it was one of the 50 best websites of the year. And every time I do a Google search for a major news outlet, I find this little website. The Wall Street Journal. The Standard (nice headline!) USA Today. Toronto Star. De Telegraaf. VG. Le Monde. Corriere della Sera. It's wild - I just kept Googling for the largest newspapers in various parts of the world and kept getting hits!
The point is that it's had impact, and nobody is more surprised about that than me.
How on earth did I end up here?!
6 years and a few days ago now, I found myself in a place I'd only ever seen before in the movies: Congress. American Congress. Saying "pwned"!
For reasons I still struggle to completely grasp, the folks there thought it would be a good idea if I flew to the other side of the world and talked about the impact of data breaches on identity verification. "You know they're just trying to get you to DC so they can arrest you for all that stolen data you have, right?! 🤣", the internet quipped. But instead, I had one of the most memorable moments of my career as I read my testimony (these are public hearings so it's all recorded and available to watch), responded to questions from congressmen and congresswomen and rounded out the trip staring down at where they inaugurate presidents:
Today, that photo adorns the wall outside my office and dozens of times a day I look at it and ask the same question - how did it all lead to this?!
The potential sale of HIBP was a very painful, very expensive chapter of life, announced in a blog post from June 2019. For the most part, I was as transparent and honest as I could be about the reasons behind the decision, including the stress:
To be completely honest, it's been an enormously stressful year dealing with it all.
More than one year later, I finally wrote about the source of so much of that stress: divorce. Relationship circumstances had put a huge amount of pressure on me and I needed a relief valve which at the time, I thought would be the sale of the project I loved so much but was becoming increasingly demanding. Ultimately, Project Svalbard (the code name for the sale of HIBP), had the opposite effect as years of bitter legal battles with my ex ensued, in part due to the perceived value that would have been realised had it been sold and some big tech company owned my arse for years to come. The project I built out of a passion to do community good was now being used as a tool to extract as much money out of me as possible. There's a wild story to be told there one day but whilst that saga is now well and truly behind me, the scars are still raw.
There were many times throughout Project Svalbard where I felt like I was living out an episode of Silicon Valley, especially as I hopped between interviews at the who's-who of tech firms in San Francisco to meet potential acquirers. But there was one moment in particular that I knew at the time would form an indelible memory, so I took a photo of it:
I'm sitting in a rental car in Yosemite whilst driving from the aforementioned meetings in SF and onto Vegas for the annual big cyber-events. I had a scheduled call with a big tech firm who was a potential acquirer and should that deal go through, the guy I was speaking to would be my new boss. I'd done that dozens of times by now and I don't know if it was because I was especially tired or emotional or if there was something in the way he phrased the question, but this triggered something deep inside me:
So Troy, what would your perfect day in the office look like?
I didn't say it this directly, but I kid you not this is exactly what popped into my mind:
I get on my jet ski and I do whatever the fuck I want
My potential new overlord had somehow managed to find exactly the raw nerve to touch that made me realise how valuable independence had become to me. 6 months later, Project Svalbard was dead after a deal I'd struck fell through. I still can't talk about the precise circumstances due to being NDA'd up to the wazoo, but the term we chose to use was "a change of business circumstances on behalf of the purchaser". With the benefit of hindsight, I've never been so happy to have lost so much 😊
10 years ago, I certainly didn't see this on the cards:
This is so cool, thanks @FBI 😊 pic.twitter.com/aqMi3as91O
— Troy Hunt (@troyhunt) June 28, 2023
Nor did I expect them to be actively feeding data into HIBP. Or the UK's NCA to be feeding data in. Or various other law enforcement agencies the world over. And I never envisioned a time where dozens of national governments would be happy to talk about using the service.
A couple of months ago, the ABC wrote a long piece on how this whole thing is, to use their term, a strange sign of the times.
He’s just “a dude on the web”, but Troy Hunt has ended up playing an oddly central role in global cybersecurity.
It's strange until you look at it through the lens of aligned objectives: the whole idea of HIBP was "to do good things after bad things happen" which is well aligned with the mandates of law enforcement agencies. You could call it... common ground:
This is something I suspect a lot of people don't understand - that law enforcement agencies often work in conjunction with private enterprise to further their goals of protecting people just like you and me. It's something I certainly didn't understand 10 years ago, and I still remember the initial surprise when agencies started reaching out. Many years on, these have become really productive relationships with a bunch of top notch people, a number of whom I now count as friends and make an effort to spend time with on my travels.
This was never on the cards originally. In fact, I'd always been adamant that there should never be passwords in HIBP although in my defence, the sentiment was that they should never appear next to the username they originally accompanied. But looking at passwords through the lens of how breach data can be used to do good things, a list of known compromised passwords disassociated from any form of PII made a lot of sense. So, in 2017, Pwned Passwords was born. You know what I was saying earlier about things escalating quickly? Yeah:
Setting all new records for Pwned Passwords this week: biggest day ever yesterday at 282M requests and biggest rolling 30 days ever, now passing the 6 *billion* requests mark! pic.twitter.com/dQiuQim3da
— Troy Hunt (@troyhunt) September 12, 2023
As if to make the point, I just checked the latest stats and last week we did 301.6M requests in a single day. 100% of those requests - and that's not a rounded number either, it's 100.0000000000% - were served from Cloudflare's cache 🤯
There's so much I love about this service. I love that it's free, there's no auth, it's entirely open source (both code and data), the FBI feeds data into it and perhaps most importantly, it has real impact on security. It's such a simple thing, but every time you see a headline such as "Big online website hit with credential stuffing attack", a significant portion of the accounts being taken over have passwords that could easily have been blocked.
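As a rough illustration of how a site might block known compromised passwords at signup, here's a minimal sketch against the Pwned Passwords range API. It hashes the password with SHA-1, sends only the first 5 hex characters of the hash to the service and matches the remaining 35 characters locally. The function names and the User-Agent value are made up for this example, so treat it as a sketch rather than a reference client:

```python
import hashlib
import urllib.request

RANGE_URL = "https://api.pwnedpasswords.com/range/"  # public endpoint, no auth


def split_hash(password: str) -> tuple[str, str]:
    """Return the 5-char prefix sent to the API and the 35-char suffix kept locally."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range(suffix: str, range_body: str) -> int:
    """Look for our suffix in a range response made of 'SUFFIX:COUNT' lines."""
    for line in range_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0  # suffix not present, so the password isn't in the corpus


def pwned_count(password: str) -> int:
    """Query the live service (makes a network call)."""
    prefix, suffix = split_hash(password)
    request = urllib.request.Request(
        RANGE_URL + prefix,
        headers={"User-Agent": "pwned-passwords-sketch"},  # hypothetical value
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return count_in_range(suffix, response.read().decode("utf-8"))
```

A registration form could call `pwned_count(candidate)` and reject anything that comes back greater than zero. Because only the hash prefix leaves the machine, the service never sees the password itself.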
On multiple occasions now, I've had conversations that can best be paraphrased as follows:
Random Internet Person: I'm going to report you to the FBI for having all that stolen data
Me: Maybe you should start by Googling "troy hunt fbi" first...
But I understand where they're coming from and the paradox I refer to is the perceived conflict between handling what is usually the output of a crime whilst simultaneously trying to perform a community good. It's the same discussion I've often had with people citing privacy laws in their corner of the world (often the EU and GDPR) as the reason why HIBP shouldn't exist: "but you're processing data without informed consent!", they'll claim. The issue of there being other legal bases for processing aside, nobody consents to being in a data breach! The natural progression of that conversation is that being in a data breach is a parallel discussion to HIBP then indexing it and making it searchable, which is something I've devoted many words to addressing in the past.
But for all the bluster the occasional random internet person can have (and honestly, I could count the number of annual instances of this on one hand), nothing has come of any complaints. And when I say "complaints", it's often nothing more than a polite conversation which may simply conclude with an acknowledgment of opposing views and that's it. There has been one exception in the entire decade of running this service where a complaint did come via a government privacy regulator, I responded to all the questions that were asked and that was the end of it.
When you have a pet project like HIBP was in the beginning, it's usually just you putting in the hours. That's fine, it's a hobby and you're scratching an itch, so what does it matter that there's nobody else involved? Like many similar passion projects, HIBP consumed a lot of hours from early on, everything from obviously building the service then sourcing data breaches, verifying and disclosing them, writing up descriptions and even editing every single one of those 700+ logos by hand to be just the right dimensions and file size. But in the beginning, if I'd just stopped one day, what would happen? Nothing. But today, a genuinely important part of the internet that a huge number of individuals, corporations and governments have built dependencies on would stop working if I lost interest.
The dependency on just me was partly behind the possible sale in 2019, but clearly that didn't eventuate. There was always the option to employ people and build it out like most people would a normal company, but every time I gave that consideration it just didn't stack up for a whole bunch of reasons. It was certainly feasible from the perspective of building some sort of valuable commercial entity, but in just the same way as that question about my perfect day in the office sucked the soul from my body, so did the prospect of being responsible for other people. Employment contracts. Salary negotiations. Performance reviews. Sick leave and annual leave and all sorts of other people issues from strangers I'd need to entrust with "my baby". So, bringing in more people was a really unattractive idea, with 2 exceptions:
In early 2021, my (soon to be at the time) wife Charlotte started working for HIBP.
Charlotte had spent the last 8 years working with people just like me; software nerds. As a project manager for the NDC conferences based out of Norway, she'd dealt with hundreds of speakers (including me on many occasions), and thousands of attendees at the best conference I've ever been a part of. Plus, she spent a great deal of time coordinating sponsors, corporate attendees and all sorts of other folks that live in the tech world HIBP inhabited. For Charlotte, even though she's not a technical person (her qualifications are in PR and entrepreneurial studies), this was very familiar territory.
So, for the last few years, Charlotte has done absolutely everything that she can to ensure that I can focus on the things that need my attention. She onboards new corporate subscribers, handles masses of tickets for API and domain subscribers and does all the accounting and tax work. And she does this tirelessly every single day at all sorts of hours whether we're at home or travelling. She is... amazing 🤩
Earlier this year, Stefán Jökull Sigurðarson started working for us part time writing code, cleaning up code, migrating code and, well, doing lots of different code things.
Just today I asked Stefán what I should write about him, thinking he'd give me some bullet points I'd massage and then incorporate into this blog post. Instead, I reckon what he wrote was so spot on that I'm just going to quote the entire thing here:
"Just" that having had my eye on the service since it was released and then developing one of the first big integrations with the PwnedPasswords v2 API in EVE, coinciding with us meeting for the first time at NDC Oslo in 2018 shortly after, HIBP has managed to take me on this awesome journey where it has been a part of launching my public speaking career, contributing to OSS with Pwned Passwords, becoming an MVP and helped me meet a bunch of awesome people and allowed me to contribute to a better and hopefully safer internet. I'm very happy and honoured to a be a part of this project which is full of awesome challenges and interesting problems to deal with. Having meeting invites from the FBI in my inbox a few years after doing a few experimental rest calls to the Pwned Passwords API in early 2018 was definitely not something I was expecting 😅
What really resonated with me in Stefán's message is that for him, this isn't just a job, it's a passion. His journey is my journey in that we freely devoted our time to do something we love and it led to many wonderful things, including MVP roles and speaking at "Charlotte's" conference, NDC. Stefán is based in Iceland, but we've still had many opportunities to share beers together and establish a relationship that transcends merely writing code. I can't think of anyone better to do what he does today.
731 breaches later, here we are. So, what stands out? Just going off the top of my head here:
Ashley Madison. Everyone knows the name so it needs no introduction, but that incident in 2015 had a major impact on HIBP in terms of use of the service, and also a major impact on me in terms of the engagements I had with impacted parties. My blog post on Here’s what Ashley Madison members have told me still feels harrowing to read.
Collection #1. This is the one that really contributed to my stress levels in early 2019 and had a profound impact on my decision to look at selling the service. Read about where those 773M records came from (still the largest breach in HIBP to date).
Rosebutt. Don't make a joke about it, don't make a joke about it, don't... aw man, thanks The Register! (link to an archive.org version as they seem to have thought better of their image choice later on...) The point is that even serious data breaches can have their moments of levity.
Shit Express. Sometimes, you just need a bit of hilarity in your data breach. Shit Express is literally a site to send other people pieces of that - anonymously - and they got breached, thus somewhat affecting their anonymity. The more serious point is that as I later wrote, claims of anonymity are often highly misleading.
I often joke about my life being very much about getting up each morning, reading my emails and events from overnight and then just winging it from there. Of course there are the occasional scheduled things not to mention travel commitments, but for the most part it's very much just rolling with whatever is demanding attention on the day. This is also probably a significant part of why I don't really want to see this thing grow into a larger concern with more responsibilities, I just don't want to lose that freedom. Yet...
We're gradually moving in a direction where things become more formalised. 3 years ago, I did 100% of everything myself. 1 year ago, I did everything technical myself. 6 months ago, we had no ticketing system for support. But these are small, incremental steps forward and that's what I'd like to see continuing. I want HIBP to outlive me, I just don't want it to become a burden I'm beholden to in the process. I'd like to have more people involved but as you can see from above, that's been a very slow process with only those very close to me playing a role.
The only thing I have real certainty on at the moment is that there will be more breaches. I've commented many times recently that the scourge that is ransomware feels like it's really accelerated lately, I wonder how many of the people in the emails and documents and all sorts of other data that get dumped there ever learn of their exposure? It's a non-trivial exercise to index that (for all sorts of reasons), but it also seems like an increasingly worthy exercise. Who knows, let's see how I feel when I get up tomorrow morning 🙂
Finally, for this week's regular video, I'm going to make a birthday special and do it live with Charlotte. Please come and join us, I'm not entirely sure what we'll cover (I'll work it out on the morning!) but let's make a virtual 10th birthday party out of it 🎂
Allegedly, Acuity had a data breach. That's the context that accompanied a massive trove of data that was sent to me 2 years ago now. I looked into it, tried to attribute and verify it then put it in the "too hard basket" and moved onto more pressing issues. It was only this week as I desperately tried to make some space to process yet more data that I realised why I was short on space in the first place:
Ah, yeah - Acuity - that big blue 437GB blob. What follows is the process I went through trying to work out what on earth this thing is, the confusion surrounding the data, the shady characters dealing with it and ultimately, how it's now searchable in Have I Been Pwned (HIBP), which may be what brought you to this blog post in the first place.
One of the first things I do after receiving a data breach is to literally just Google it: acuity data breach. Which immediately yielded this top result from June:
Ah, so Acuity is a healthcare company. But wait - here's the next result:
That's not about healthcare, that's Acuity Brands. How many companies called "Acuity" that have been breached are there?! Let's see what references I have in my email:
Another one 🤦♂️ That "breach" could be circumstantial, so we'll call it a "maybe", but it's yet another Acuity with a question mark next to it. So how many "Acuity" companies are out there in total?! Just in the course of investigating this data, I came across a total of 6 of them that as far as I can tell, are completely unrelated:
Ugh, great. We'll work through them and try to figure out where they fit into the picture in a moment, but first let's look at the actual data. We already know it's 437GB, but it's the breadth of column headings that's most stunning; here's all 414 of them:
Just by eyeballing these, it really doesn't feel like the sort of data that comes from a healthcare provider, a brands company or a scheduler. The other 3, however... Maybe.
Some more data points before going further:
On that final point, here's an example of what I'm talking about:
The last names are the same, as are the salutations. The physical addresses are spot on accurate in their structure as are the phone numbers; there are no spaces, no dashes and no other artifacts typical of millions of different humans entering data. This is clean - too clean.
The "datasource" field is another interesting data point with the top 10 values being:
Each of these entries appeared at least hundreds of thousands of times, if not millions. Does that mean that Netflix, for example, provided customer data to this list? Almost certainly no, but it does feel reminiscent of the Acxiom / Live Ramp misattribution post I wrote a year ago where I listed full counts of a similar column. One of the top values there was also "TAGGED.COM" (also all in uppercase), alongside several other values that also appeared in both sources.
Back to attribution and a post on a popular hacking forum jumps out:
Many things here line up, for example the column names that are very unique to this data source, including "estimatedincomecode", "del_point_check_digit" and "secondaryaddresspresent". The attribution is to the insurance company named "Acuity", but is that accurate? Insurance companies collect a lot of data as it's relevant to how they run their business, but that data is highly unlikely to include fields such as:
That's much more in the "data enrichment" space where a company sells a massive data set so that it can expand the profile data of the purchaser's existing customer base. It's a legitimate, honest, legal business model. It's also indistinguishable from this:
Hey, it's 437GB! And the column names line up! And it's called Acuity! Slightly different column count to mine (and similar but different to the hacker forum post), and slightly different email count, but the similarities remain striking. How I got to this resource is also interesting, having come via someone I was discussing the data with a couple of years ago:
The YouTube video is a walkthrough of a campaign management tool to send emails to customers. Could that indicate the data as coming from Acuity Ads (now Illumin)? No, not in and of itself, the walkthrough there isn't that dissimilar to other campaign tools I've used in the past. No matter how much I looked, I just couldn't find a solid lead back to Acuity Ads and anything even remotely related was merely circumstantial. It could be from them, but it could also be from many other places and the mere fact that a near identical corpus of data was sitting there on an outright spam site only makes the whole mystery that much deeper. There was just one more interesting data point in that email:
i myself am in that dataset and i've been getting 100x more phishing/scam calls, emails, and physical mail
Let me end this with a best guess: this feels like the same situation as the massive Master Deeds incident in South Africa in 2017. In that case, a legally operating data aggregator (I think you know how I feel about those by now...) sold personal information to a real estate business who then left it publicly exposed. I say it feels the same because it's just such a clean set of data and it's clearly very comprehensive in terms of the columns. It's exactly what I'd expect a data aggregator to prepare and sell to other businesses so they could identify which of their existing customers likes needlework.
In the past, publishing blog posts like this has helped identify an origin service and if that happens again here then I'll be sure to provide an update. For now, I've loaded it into HIBP and flagged it as a spam list which means it won't impact the size of anyone's domains and bump them into a different subscription level. If you do have any interesting insights on this data, please leave a comment below and with any luck, one of the Acuity entities out there will emerge as the source.
Note: just after loading the data, I ran the calcs on how many of the addresses were pre-existing in HIBP. This seems like a statistically significant number 😲
So, 100% (just under actually, but it rounded up). Working through a bunch of sample addresses, they appeared across all sorts of other existing spam lists and dodgy data aggregator breaches. Who knows which ones came first, just more data in the big swimming pool of breaches. https://t.co/Ux2rw6uaAk
— Troy Hunt (@troyhunt) November 15, 2023
Edit (1 day later): After posting this, the party responsible for leaking the data turned around and said "that was only a small part of it, here's the whole thing", and released a further 14M records. I've added those into HIBP and will shortly be re-sending notifications to people monitoring domains as the count of impacted addresses will likely have changed. Everything else about the subsequent dataset is consistent with what you'll read below in terms of structure, patterns and conclusions.
The same threat actor has leaked larger amounts of data from LinkedIn dated 2023. They claim this new data contains 35M lines and is 12 GB uncompressed. They also issue an apology to @troyhunt. #Breach #Clearnet #DarkWeb #DarkWebInformer #Database #Leaks #Leaked #LinkedIn https://t.co/qBFAofvppU pic.twitter.com/Clg5o92b6t
— Dark Web Informer (@DarkWebInformer) November 7, 2023
I like to think of investigating data breaches as a sort of scientific search for truth. You start out with a theory (a set of data coming from an alleged source), but you don't have a vested interest in whether the claim is true or not, rather you follow the evidence and see where it leads. Verification that supports the alleged source is usually quite straightforward, but disproving a claim can be a rather time consuming exercise, especially when a dataset contains fragments of truth mixed in with data that is anything but. Which is what we have here today.
To lead with the conclusion and save you reading all the details if you're not inclined, the dataset so many people flagged to me this week titled "Linkedin Database 2023 2.5 Millions" turned out to be a combination of publicly available LinkedIn profile data and 5.8M email addresses mostly fabricated from a combination of first and last name. It all began with this tweet:
A threat actor has allegedly leaked a database from LinkedIn @LinkedIn dated 2023. They claim the database shows emails, profile data, phones, full names, and more confidential info. #Breach #Clearnet #DarkWeb #DarkWebInformer #Database #Leaks #Leaked #LinkedIn pic.twitter.com/8MQecKc1vz
— Dark Web Informer (@DarkWebInformer) November 4, 2023
All good lies are believable at face value; is it feasible a massive corpus of LinkedIn data is floating around? Well, they were proper breached in 2012 to the tune of 164M records (by which I mean that incident was genuinely internal data such as email addresses and passwords extracted out by a vulnerability), then they were massively scraped in 2021 with another 126M records going into Have I Been Pwned (HIBP). So, when you see a claim like the one above, it seems highly feasible at face value which is what many people take it at. But I'm a bit more suspicious than most people 🙂
First, the claim:
This one is similar to my twitter data scrapped [sic] but for linkedin plus 2023
Now, there's a whole debate about whether scraped data is breached data and indeed whether the definition of it even matters. With the rising prevalence of scraped data, this topic came up enough that I wrote a dedicated blog post about it a couple of years ago and concluded the following in terms of how we should define the term "breach":
A data breach occurs when information is obtained by an unauthorised party in a fashion in which it was not intended to be made available
Which makes scrapes like this alleged one a breach. If indeed it was accurate, LinkedIn data had been taken and redistributed in a way it was never intended to be by either the service itself or the individuals whose data was in this corpus. So, it's something to take seriously, and that warranted further investigation.
I scrolled through the 10M+ rows of data (many records spanned multiple rows due to line returns), and my eyes fell on a fellow Aussie who for the purposes of this exercise we'll call "EM", being the initials of her first and last name. Whilst the data I'm going to refer to is either public by design or fabricated, I don't want to use a real person as an example without their consent so let's just play it safe. Here's a fragment of EM's record:
There are 5 noteworthy parts of this that immediately caught my attention:
On its own, this record would be unremarkable. It'd be entirely feasible - this could very well be legit - except when you keep looking through the remainder of the data. A pattern quickly emerged and I'm going to bold it here because it's the smoking gun that ultimately indicates that a bunch of this data is fake:
Every single record with multiple email addresses had exactly the same alias on completely unrelated domains and it was almost always in the form of "[first name].[last name]@".
Representing email addresses in this fashion is certainly common, but it's far from ubiquitous, and that's easy to demonstrate. For example, I have tons of emails from Pluralsight so I dug one out from my friend "CU":
There's no dot, rather a dash. Every single real Pluralsight email address I looked at was a dash rather than a dot, yet when I delved into the alleged LinkedIn data and dug out another sample Pluralsight address, here's what I found:
That's not LM's real address because it has a dot instead of a dash. Every. Single. One. Is. Fake.
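The pattern is simple enough to test for mechanically. As a rough sketch (the function name and the sample values in the usage note are hypothetical, not taken from the data), this flags a record when every email address on it shares a single alias across different domains and that alias is exactly "first.last":

```python
def looks_fabricated(first_name: str, last_name: str, emails: list[str]) -> bool:
    """True when 2+ addresses on different domains all use the alias 'first.last'."""
    expected = f"{first_name}.{last_name}".lower()
    aliases = set()
    domains = set()
    for email in emails:
        # Split on the last '@' so the alias and the domain can be compared separately
        alias, _, domain = email.lower().rpartition("@")
        aliases.add(alias)
        domains.add(domain)
    # One alias reused on several unrelated domains, and it matches the name exactly
    return len(domains) > 1 and aliases == {expected}
```

For example, `looks_fabricated("Jane", "Doe", ["jane.doe@example.com", "jane.doe@example.org"])` returns `True`, whereas swapping one address for `jane-doe@example.com` returns `False`. It's only a heuristic: plenty of real addresses use the same format, so a hit on one record proves nothing, it's the same hit on every record that gives the game away.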
Let's try this the other way around and load up the existing breached accounts in HIBP for the domain of one of EM's alleged email addresses and see how they're formed:
That's definitely not the same format as EM's address, not by a long shot. And time and time again, the same pattern of addresses in the corpus of data in the original tweet emerged, drawing me to what seems to be a pretty logical conclusion:
Each email address was fabricated by taking the actual domain of a company the individual legitimately worked at and then constructing the alias from their name.
And these are legitimate companies too because every single LinkedIn profile I checked had all the cues of accurate information and each domain I checked in the corpus of data was indeed the correct one for the company they worked at. I imagine someone has effectively worked through the following logic:
On that final point, what is the point? The data wasn't being sold in that original tweet, rather it was freely downloadable. But per the date on EM's profile, the data could have been obtained much earlier and previously monetised. And on that, the date wasn't constant across records, rather there was a broad range of them as recent as July last year and as old as... well, I stopped when the records got older than me. What is this?!
I suspect the answer may partly lie in the column headings which I've pasted here in their entirety:
"PROFILE_KEY", "PROFILE_USERNAMES", "PROFILE_SPENDESK_IDS", "PROFILE_LINKEDIN_PUBLIC_IDENTIFIER", "PROFILE_LINKEDIN_ID", "PROFILE_SALES_NAVIGATOR_ID", "PROFILE_LINKEDIN_MEMBER_ID", "PROFILE_SALESFORCE_IDS", "PROFILE_AUTOPILOT_IDS", "PROFILE_PIPL_IDS", "PROFILE_HUBSPOT_IDS", "PROFILE_HAS_LINKEDIN_SOURCE", "PROFILE_HAS_SALES_NAVIGATOR_SOURCE", "PROFILE_HAS_SALESFORCE_SOURCE", "PROFILE_HAS_SPENDESK_SOURCE", "PROFILE_HAS_ASGARD_SOURCE", "PROFILE_HAS_AUTOPILOT_SOURCE", "PROFILE_HAS_PIPL_SOURCE", "PROFILE_HAS_HUBSPOT_SOURCE", "PROFILE_FETCHED_AT", "PROFILE_LINKEDIN_FETCHED_AT", "PROFILE_SALES_NAVIGATOR_FETCHED_AT", "PROFILE_SALESFORCE_FETCHED_AT", "PROFILE_SPENDESK_FETCHED_AT", "PROFILE_ASGARD_FETCHED_AT", "PROFILE_AUTOPILOT_FETCHED_AT", "PROFILE_PIPL_FETCHED_AT", "PROFILE_HUBSPOT_FETCHED_AT", "PROFILE_LINKEDIN_IS_NOT_FOUND", "PROFILE_SALES_NAVIGATOR_IS_NOT_FOUND", "PROFILE_EMAILS", "PROFILE_PERSONAL_EMAILS", "PROFILE_PHONES", "PROFILE_FIRST_NAME", "PROFILE_LAST_NAME", "PROFILE_TEAM", "PROFILE_HIERARCHY", "PROFILE_PERSONA", "PROFILE_GENDER", "PROFILE_COUNTRY_CODE", "PROFILE_SUMMARY", "PROFILE_INDUSTRY_NAME", "PROFILE_BIRTH_YEAR", "PROFILE_MARVIN_SEARCHES", "PROFILE_POSITION_STARTED_AT", "PROFILE_POSITION_TITLE", "PROFILE_POSITION_LOCATION", "PROFILE_POSITION_DESCRIPTION", "PROFILE_COMPANY_NAME", "PROFILE_COMPANY_LINKEDIN_ID", "PROFILE_COMPANY_LINKEDIN_UNIVERSAL_NAME", "PROFILE_COMPANY_SALESFORCE_ID", "PROFILE_COMPANY_SPENDESK_ID", "PROFILE_COMPANY_HUBSPOT_ID", "PROFILE_SKILLS", "PROFILE_LANGUAGES", "PROFILE_SCHOOLS", "PROFILE_EXTERNAL_SEARCHES", "PROFILE_LINKEDIN_HEADLINE", "PROFILE_LINKEDIN_LOCATION", "PROFILE_SALESFORCE_CREATED_AT", "PROFILE_SALESFORCE_STATUS", "PROFILE_SALESFORCE_LAST_ACTIVITY_AT", "PROFILE_SALESFORCE_OWNER_CONTACT_ID", "PROFILE_SALESFORCE_OWNER_CONTACT_NAME", "PROFILE_SPENDESK_SIGNUP_AT", "PROFILE_SPENDESK_DELETED_AT", "PROFILE_SPENDESK_ROLES", "PROFILE_SPENDESK_AVERAGE_NPS_SCORE", "PROFILE_SPENDESK_NPS_SCORES_COUNT", 
"PROFILE_SPENDESK_FIRST_NPS_SCORE", "PROFILE_SPENDESK_LAST_NPS_SCORE", "PROFILE_SPENDESK_LAST_NPS_SCORE_SENT_AT", "PROFILE_SPENDESK_PAYMENTS_COUNT", "PROFILE_SPENDESK_TOTAL_EUR_SPENT", "PROFILE_SPENDESK_ACTIVE_SUBSCRIPTIONS_COUNT", "PROFILE_SPENDESK_LAST_ACTIVITY_AT", "PROFILE_AUTOPILOT_MAIL_CLICKED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_CLICKED_AT", "PROFILE_AUTOPILOT_MAIL_OPENED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_OPENED_AT", "PROFILE_AUTOPILOT_MAIL_RECEIVED_COUNT", "PROFILE_AUTOPILOT_LAST_MAIL_RECEIVED_AT", "PROFILE_AUTOPILOT_MAIL_UNSUBSCRIBED_AT", "PROFILE_AUTOPILOT_MAIL_REPLIED_AT", "PROFILE_AUTOPILOT_LISTS", "PROFILE_AUTOPILOT_SEGMENTS", "PROFILE_HUBSPOT_CFO_CONNECT_SLACK_MEMBER_STATUS", "PROFILE_HUBSPOT_IS_CFO_CONNECT_MEETUPS_MEMBER", "PROFILE_HUBSPOT_CFO_CONNECT_AREAS_OF_EXPERTISE", "PROFILE_HUBSPOT_CORPORATE_FINANCE_EXPERIENCE_YEARS_RANGE"
Check out some of those names: LinkedIn is obviously there, but so is Salesforce and Spendesk and Hubspot, among others. This reads more like an aggregation of multiple sources than it does data solely scraped from LinkedIn. My hope is that in posting this someone might pop up and say "I recognise those column headings, they're from..." Who knows.
So, here's where that leaves us: this data is a combination of information sourced from public LinkedIn profiles, fabricated email addresses and in part (anecdotally based on simply eyeballing the data this is a small part), the other sources in the column headings above. But the people are real, the companies are real, the domains are real and in many cases, the email addresses themselves are real. There are over 1.8k HIBP subscribers in the data set and these are folks that have double opted-in so they've successfully received an email to that address in the past. Further, when the data was loaded into HIBP there were nearly a million email addresses that were already in the system so evidently, they were addresses that had previously been in use. Which stands to reason because even if every address was constructed by an algorithm, the pattern is common enough that there'll be a bunch of hits.
Because the conclusion is that there's a significant component of legitimate data in this corpus, I've loaded it into HIBP. But because there are also a significant number of fabricated email addresses in there, I've flagged it as a spam list which means the addresses won't impact the scale of anyone's paid subscription if they're monitoring domains. And whilst I know some people will suggest it shouldn't go in at all, time and time again when I've polled the public about similar incidents the overwhelming majority of people have said "we want to know about it then we'll make up our own minds what action needs to be taken". And in this case, even if you find an email address on your domain that doesn't actually exist, that person who either currently works at your company or previously did has still had their personal data dumped in this corpus. That's something most people will still want to know.
Lastly, one of the main reasons I decided to invest hours into this today is that I loathe disinformation and I hate people using that to then make statements that are completely off base. I'm looking at my Twitter feed now and see people angry at LinkedIn for this, blaming an insider due to recent layoffs there, accusing them of mishandling our data and so on and so forth. No, not this time, the evidence has led us somewhere completely different.
Last week I was contacted by CERT Poland. They'd observed a phishing campaign that had collected 68k credentials from unsuspecting victims and asked if HIBP may be used to help alert these individuals to their exposure. The campaign began with a typical email requesting more information:
In this case, the email contained a fake purchase order attachment which requested login credentials that were then posted back to infrastructure controlled by the attacker:
All in all, CERT Poland identified 202 other phishing campaigns using the same infrastructure which has subsequently been taken offline. Data accumulated by the malicious activity spanned from October 2022 until just last week.
The advice to impacted individuals is as follows:
Today, the US Justice Department announced a multinational operation involving actions in the United States, France, Germany, the Netherlands, and the United Kingdom to disrupt the botnet and malware known as Qakbot and take down its infrastructure. Beyond just taking down the backbone of the operation, the FBI began actively intercepting traffic from the botnet and instructing infected machines to uninstall the malware:
To disrupt the botnet, the FBI was able to redirect Qakbot botnet traffic to and through servers controlled by the FBI, which in turn instructed infected computers in the United States and elsewhere to download a file created by law enforcement that would uninstall the Qakbot malware
As part of the operation, the FBI have requested support from Have I Been Pwned (HIBP) to help notify impacted victims of their exposure to the malware. We provided similar support in 2021 with the Emotet botnet, although this time around with a grand total of 6.43M impacted email addresses. These are now all searchable in HIBP albeit with the incident flagged as "sensitive" so you'll need to verify you control the email address via the notification service first, or you can search any domains you control via the domain search feature. Further, the passwords from the malware will shortly be searchable in the Pwned Passwords service which can either be checked online or via the API. Pwned Passwords is presently requested 5 and a half billion times each month to help organisations prevent people from using known compromised passwords.
Guidance for those impacted by this incident is the same tried and tested advice given after previous malware incidents:
There's a "hidden" API on HIBP. Well, it's not "hidden" insofar as it's easily discoverable if you watch the network traffic from the client, but it's not meant to be called directly, rather only via the web app. It's called "unified search" and it looks just like this:
It's been there in one form or another since day 1 (so almost a decade now), and it serves a sole purpose: to perform searches from the home page. That is all - only from the home page. It's called asynchronously from the client without needing to post back the entire page and by design, it's super fast and super easy to use. Which is bad. Sometimes.
To understand why it's bad we need to go back in time all the way to when I first launched the API that was intended to be consumed programmatically by other people's services. That was easy, because it was basically just documenting the API that sat behind the home page of the website already, the predecessor to the one you see above. And then, unsurprisingly in retrospect, it started to be abused so I had to put a rate limit on it. Problem is, that was a very rudimentary IP-based rate limit and it could be circumvented by someone with enough IPs, so fast forward a bit further and I put auth on the API which required a nominal payment to access it. At the same time, that unified search endpoint was created and home page searches updated to use that rather than the publicly documented API. So, 2 APIs with 2 different purposes.
The primary objective for putting a price on the public API was to tackle abuse. And it did - it stopped it dead. By attaching a rate limit to a key that required a credit card to purchase it, abusive practices (namely enumerating large numbers of email addresses) disappeared. This wasn't just about putting a financial cost to queries, it was about putting an identity cost to them; people are reluctant to start doing nasty things with a key traceable back to their own payment card! Which is why they turned their attention to the non-authenticated, non-documented unified search API.
Let's look at a 3 day period of requests to that API earlier this year, keeping in mind this should only ever be requested organically by humans performing searches from the home page:
This is far from organic usage with requests peaking at 121.3k in just 5 minutes. Which poses an interesting question: how do you create an API that should only be consumed asynchronously from a web page and never programmatically via a script? You could chuck a CAPTCHA on the front page and require that be solved first but let's face it, that's not a pleasant user experience. Rate limit requests by IP? See the earlier problem with that. Block UA strings? Pointless, because they're easily randomised. Rate limit an ASN? It gets you part way there, but what happens when you get a genuine flood of traffic because the site has hit the mainstream news? It happens.
Over the years, I've played with all sorts of combinations of firewall rules based on parameters such as geolocations with incommensurate numbers of requests to their populations, JA3 fingerprints and, of course, the parameters mentioned above. Based on the chart above these obviously didn't catch all the abusive traffic, but they did catch a significant portion of it:
If you combine it with the previous graph, that's about a third of all the bad traffic in that period or in other words, two thirds of the bad traffic was still getting through. There had to be a better way, which brings us to Cloudflare's Turnstile:
With Turnstile, we adapt the actual challenge outcome to the individual visitor or browser. First, we run a series of small non-interactive JavaScript challenges gathering more signals about the visitor/browser environment. Those challenges include, proof-of-work, proof-of-space, probing for web APIs, and various other challenges for detecting browser-quirks and human behavior. As a result, we can fine-tune the difficulty of the challenge to the specific request and avoid ever showing a visual puzzle to a user.
"Avoid ever showing a visual puzzle to a user" is a polite way of saying they avoid the sucky UX of CAPTCHA. Instead, Turnstile offers the ability to issue a "non-interactive challenge" which implements the sorts of clever techniques mentioned above and as it relates to this blog post, that can be an invisible non-interactive challenge. This is one of 3 different widget types with the others being a visible non-interactive challenge and a non-intrusive interactive challenge. For my purposes on HIBP, I wanted a zero-friction implementation nobody saw, hence the invisible approach. Here's how it works:
Get it? Ok, let's break it down further as it relates to HIBP, starting with when the front page first loads and it embeds the Turnstile widget from Cloudflare:
<script src="https://challenges.cloudflare.com/turnstile/v0/api.js" async defer></script>
The widget takes responsibility for running the non-interactive challenge and returning a token. This needs to be persisted somewhere on the client side which brings us to embedding the widget:
<div ID="turnstileWidget" class="cf-turnstile" data-sitekey="0x4AAAAAAADY3UwkmqCvH8VR" data-callback="turnstileCompleted"></div>
Per the docs in that link, the main thing here is to have an element with the "cf-turnstile" class set on it. If you happen to go take a look at the HIBP HTML source right now, you'll see that element precisely as it appears in the code block above. However, check it out in your browser's dev tools so you can see how it renders in the DOM and it will look more like this:
Expand that DIV tag and you'll find a whole bunch more content set as a result of loading the widget, but that's not relevant right now. What's important is the data-token attribute because that's what's going to prove you're not a bot when you run the search. How you implement this from here is up to you, but what HIBP does is pick up the token, set it in the "cf-turnstile-response" header, then send it along with the request when that unified search endpoint is called:
So, at this point we've issued a challenge, the browser has solved the challenge and received a token back, now that token has been sent along with the request for the actual resource the user wanted, in this case the unified search endpoint. The final step is to validate the token and for this I'm using a Cloudflare worker. I've written a lot about workers in the past so here's the short pitch: it's code that runs in each one of Cloudflare's 300+ edge nodes around the world and can inspect and modify requests and responses on the fly. I already had a worker to do some other processing on unified search requests, so I just added the following:
// The token the browser obtained from the Turnstile widget
const token = request.headers.get('cf-turnstile-response');
if (token === null) {
  return new Response('Missing Turnstile token', { status: 401 });
}
// Build the siteverify request: secret key, token and the client's IP
const ip = request.headers.get('CF-Connecting-IP');
const formData = new FormData();
formData.append('secret', '[secret key goes here]');
formData.append('response', token);
formData.append('remoteip', ip);
// Ask Cloudflare whether the token is valid and hasn't been used before
const turnstileUrl = 'https://challenges.cloudflare.com/turnstile/v0/siteverify';
const result = await fetch(turnstileUrl, {
  body: formData,
  method: 'POST',
});
const outcome = await result.json();
if (!outcome.success) {
  return new Response('Invalid Turnstile token', { status: 401 });
}
That should be pretty self-explanatory and you can find the docs for this on Cloudflare's server-side validation page which goes into more detail, but in essence, it does the following:
And because this is all done in a Cloudflare worker, none of those 401 responses ever touch the origin. Not only do I not need to process the request in Azure, the person attempting to abuse my API gets a nice speedy response directly from an edge node near them 🙂
So, what does this mean for bots? If there's no token then they get booted out right away. If there's a token but it's not valid then they get booted out at the end. But can't they just take a previously generated token and use that? Well, yes, but only once:
If the same response is presented twice, the second and each subsequent request will generate an error stating that the response has already been consumed.
And remember, a real browser had to generate that token in the first place so it's not like you can just automate the process of token generation then throw it at the API above. (Sidenote: that server-side validation link includes how to handle idempotency, for example when retrying failed requests.) But what if a real human fails the verification? That's entirely up to you but in HIBP's case, that 401 response causes a fallback to a full page post back which then implements other controls, for example an interactive challenge.
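Pulling the client side of that flow together, here's a minimal sketch of how it could hang together. It is not HIBP's actual code: the `unifiedSearch` function, the `/unifiedsearch/` path and the injectable `fetchImpl` parameter are assumptions for illustration. Only the `turnstileCompleted` callback name, the `cf-turnstile-response` header and the 401 fallback behaviour come from the description above.

```javascript
// Sketch only: function names and the URL path are illustrative assumptions,
// not HIBP's real client code.
let turnstileToken = null;

// Named in the widget's data-callback attribute; Turnstile passes the token in
function turnstileCompleted(token) {
  turnstileToken = token;
}

async function unifiedSearch(account, fetchImpl = fetch) {
  if (turnstileToken === null) {
    // No solved challenge yet, so don't bother calling the API
    return { fallbackToPostBack: true, response: null };
  }
  const response = await fetchImpl('/unifiedsearch/' + encodeURIComponent(account), {
    headers: { 'cf-turnstile-response': turnstileToken },
  });
  // Tokens are single use, so forget this one whatever the outcome
  turnstileToken = null;
  if (response.status === 401) {
    // Rejected at the edge: hand over to the full page post back
    return { fallbackToPostBack: true, response: null };
  }
  return { fallbackToPostBack: false, response };
}
```

Because the token is cleared after every call, a second search needs a fresh token from the widget before it can go via the API again, which mirrors the "only once" behaviour of the server-side validation.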
Time for graphs and stats, starting with the one in the hero image of this page where we can see the number of times Turnstile was issued and how many times it was solved over the week prior to publishing this post:
That's a 91% hit rate of solved challenges which is great. That remaining 9% is either humans with a false positive or... bots getting rejected 😎
More graphs, this time how many requests to the unified search page were rejected by Turnstile:
That 990k number doesn't marry up with the 476k unsolved ones from before because they're 2 different things: the unsolved challenges are when the Turnstile widget is loaded but not solved (hopefully due to it being a bot rather than a false positive), whereas the 401 responses to the API are when a successful (and previously unused) Turnstile token isn't in the header. This could be because the token wasn't present, wasn't solved or had already been used. You get more of a sense of how many of these rejected requests were legit humans when you drill down into attributes like the JA3 fingerprints:
In other words, of those 990k failed requests, almost 40% of them were from the same 5 clients. Seems legit 🤔
And about a third were from clients with an identical UA string:
And so on and so forth. The point being that the number of actual legitimate requests from end users that were inconvenienced by Turnstile would be exceptionally small, almost certainly a very low single-digit percentage. I'll never know exactly because bots obviously attempt to emulate legit clients and sometimes legit clients look like bots and if we could easily solve this problem then we wouldn't need Turnstile in the first place! Anecdotally, that very small false positive number stacks up as people tend to complain pretty quickly when something isn't optimal, and I implemented this all the way back in March. Yep, 5 months ago, and I've waited this long to write about it just to be confident it's actually working. Over 100M Turnstile challenges later, I'm confident it is - I've not seen a single instance of abnormal traffic spikes to the unified search endpoint since rolling this out. What I did see initially though is a lot of this sort of thing:
By now it should be pretty obvious what's going on here, and it should be equally obvious that it didn't work out real well for them 😊
The bot problem is a hard one for those of us building services because we're continually torn in different directions. We want to build a slick UX for humans but an obtrusive one for bots. We want services to be easily consumable, but only in the way we intend them to... which might be by the good bots playing by the rules!
I don't know exactly what Cloudflare is doing in that challenge and I'll be honest, I don't even know what a "proof-of-space" is. But the point of using a service like this is that I don't need to know! What I do know is that Cloudflare sees about 20% of the internet's traffic and because of that, they're in an unrivalled position to look at a request and make a determination on its legitimacy.
If you're in my shoes, go and give Turnstile a go. And if you want to consume data from HIBP, go and check out the official API docs, the uh, unified search doesn't work real well for you any more 😎
I've been teaching my 13-year-old son Ari how to code since I first got him started on Scratch many years ago, and gradually progressed through to the current day where he's getting into Python in Visual Studio Code. As I was writing the new domain search API for Have I Been Pwned (HIBP) over the course of this year, I was trying to explain to him how powerful APIs are:
Think of HIBP as one website that does pretty much one thing; you load it in your browser and search through data breaches which then display on the screen. But when you have an API, it's no longer just locked into your browser, it's in all sorts of other systems. Mobile apps, other websites, dashboards and if you really want, you can even integrate the lights in your room with HIBP! Why? How? Well, there's a Home Assistant integration for HIBP and being pwned in a new breach could raise an event there you can then use YAML to perform an action with, for example flashing a light red. That might be weird and unnecessary, but when you have an API, suddenly all these things you never thought of are possible.
It took Brett Adams less than a day after we released the new domain search API last Monday for him to reach out to me with one of those ideas. He wanted to build a Splunk app (Brett is a Splunk MVP so this was right up his alley) to surface breached data about an organisation's domains right into the place where so many security engineers spend their days. He just wanted 2 new APIs to make the user experience the best it could be:
That seems so ridiculously obvious, why didn't I think of that originally?! But hey, easy fix, so the next day Brett had his APIs. And today, you also have the APIs because they're now all publicly documented and ready for you to consume. You also have Brett's Splunk app and because he's published it to Splunkbase, you can go and pull it into your own Splunk instance, plug in your HIBP API key and it's job done!
I'll leave you with a bunch of screen caps from Brett's work, starting with a zoomed in grab of what I suspect folks will find the most valuable - the addresses on their domains and their appearances across breaches:
That's a fragment of the broader dashboard that also breaks down the incidents over time:
The starting point for this is simply plugging your API key into the interface:
I like these headline figures and I picture particularly large organisations that have gone through various acquisitions of different brands with various domains finding this really useful:
And speaking of breaches, there's a lot of them which Brett has visualised across the course of time:
So that's it, you can see all the APIs documented on the HIBP website and you can grab Brett's app right now from Splunkbase. You can also find all the code for this in Brett's GitHub repo should you wish to have a read through it.
The HIBP APIs are there for other people to build awesome things. If you're one of those people, please get in touch with me and show me what you've created, I can't wait to see more integrations like Brett's 😊
This is a big one. A massive one. It's the culmination of a solid 7 months of work that finally, as of now, is live. The full back story is in my blog post from mid-June about The Big 5 Announcements but to save you trawling through all of that, here are the cliff notes:
I've spent the last 8 weeks since publishing that post crunching numbers, writing code, doing loads of formal things (namely terms of use and privacy policy), and regularly talking about it on my weekly video. I've had loads of enormously useful feedback, much of which has shaped the state of the services we're launching here today. Thank you everyone who contributed, now let me get into it and explain exactly what we've come up with 🙂
We've been thinking about the best way to structure this since January. How do we take something that has been provided for free for almost a decade and put a reasonable price on it? That's a highly subjective word - reasonable - and there'll never be complete consensus, so it's more about passing the pub test where your average person will look at this and go "yeah, that seems fair enough". Let me explain the thinking and how we reached the pricing structure you'll see further down:
Firstly, we wanted most domain searches to remain free. This keeps with the spirit of HIBP's roots being a community service and ensures the data is accessible without barrier to the majority of people. It would also mean that for most people, these changes would have absolutely no impact on the way they've been using the service, not unless they want access to the new bits.
Next, we wanted to divide the commercial offerings into a manageable number of tiers. The public API key has 4 tiers and I reckon that's the sweet spot; it's not too many options, but it's enough to provide a good separation between the scale of each. We then wanted to distribute the number of domains that would fall into the commercial category roughly equally between those 4 tiers, so it was pretty much a matter of taking what was left after the free ones and dividing them into 4 groups and putting a price on them.
Finally, we wanted the first commercial tier to be easily affordable so that most people could access it without thinking twice about it. My measure for that has always been "the cost of a cup of coffee", so I went down to my favourite local and checked what I was blindly paying when I waved my watch in the general direction of the EFTPOS machine:
$6 Aussie, or just under $4 in USD. Which led us to here (all in USD from now on):
Plan | Breached addresses | Percent of all domains | Price per month (USD) |
Pwned 0 | Up to 10 | 60% | Free! |
Pwned 1 | Up to 25 | 10% | $3.95 |
Pwned 2 | Up to 100 | 10% | $16.95 |
Pwned 3 | Up to 500 | 10% | $28.50 |
Pwned 4 | Unlimited | 10% | $115.00 |
What you're looking at here is a list of plan names (more on that soon), the size of the domain it covers (expressed in the number of breached email addresses on it), what percentage of all domains presently being monitored in HIBP this represents and, of course, the monthly price. As with the public API, if you subscribe annually then it's "pay for 10, get 12" which means that "Pwned 1" price works out at only $3.25 a month. As I flagged in the earlier post, this is all based around the number of addresses that appear in a breach, with one important caveat I'll expand on later: this number excludes all breaches flagged as a spam list. As a rough rule of thumb, over the years I've found approximately 20% of addresses on a domain have been breached so by that logic, you'll need 55 actual email addresses on a domain before there's a cost. Or up to 130 before it costs more than a coffee a month. (If you're a stickler for detail and are thinking those percentages are too perfect, I've rounded them from their actual values of 59.1%, 9.7%, 11.3%, 10.4% and 9.4%.)
But what if you have multiple domains? Easy - the one plan will cover all your domains within the size of that plan. For example, if you have 3 domains and one has 5 breached addresses, one has 20 and one has 90, you can get a single "Pwned 2" plan and cover them all. Or get a single "Pwned 1" plan and cover just the first 2. It's pretty simple.
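To make the multi-domain example concrete, here's a small sketch of the plan selection logic. The limits come straight from the table above, but the function names are made up for illustration and this isn't how HIBP implements it internally. The counts passed in are breached addresses per domain, excluding spam lists.

```javascript
// Plan limits taken from the pricing table above; Infinity models "Unlimited".
const PLANS = [
  { name: 'Pwned 0', maxBreached: 10 },
  { name: 'Pwned 1', maxBreached: 25 },
  { name: 'Pwned 2', maxBreached: 100 },
  { name: 'Pwned 3', maxBreached: 500 },
  { name: 'Pwned 4', maxBreached: Infinity },
];

// Smallest plan whose limit covers a single domain's breached address count
function planForDomain(breachedCount) {
  return PLANS.find((p) => breachedCount <= p.maxBreached).name;
}

// One plan covers every domain that fits, so the largest domain decides
function planForAllDomains(breachedCounts) {
  return planForDomain(Math.max(...breachedCounts));
}

// Which of the domains a given plan is able to search
function domainsCoveredBy(planName, breachedCounts) {
  const plan = PLANS.find((p) => p.name === planName);
  return breachedCounts.filter((count) => count <= plan.maxBreached);
}
```

With the 3 domains from the example, `planForAllDomains([5, 20, 90])` picks "Pwned 2", while `domainsCoveredBy('Pwned 1', [5, 20, 90])` leaves only the first 2.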
So that was our initial thinking - stand this up as a product that sits alongside the existing API key one then you just purchase whichever one you want. Then, Brendan gave me a much better idea - combine them all together! You can see the gears turning around in my head as I read his suggestion and as the days progressed and I gave it more thought, it became a brilliant idea. It massively simplifies the code base, it removes a lot of confusion that I'm sure would have otherwise ensued and perhaps most importantly, it gives you all something more than you would have had otherwise. The one fly in the ointment was the price disparity; the above prices are 13% to 15% higher than the old corresponding API key ones. So, what we've decided to do is run the old prices until 8 October then revise everything to the new prices above. That gives more than 60 days' notice to everyone with an existing API key (we'll have to email everyone anyway as the terms of use have changed to incorporate the domain bits), and there's clear verbiage everywhere about the change for anyone purchasing a new subscription. Plus, it gives everyone a little incentive to lock in for a year now and delay the increase until later in 2024. Thanks Brendan! 😊
So that's the rationale. There's no change for 60% of domains that have previously been searched, a negligible cost for the next 10% of them, with the remainder paying commensurately more based on their scale. But we didn't just want to whack a cost on an existing service and leave you down a few bucks a month with nothing more to show for it, so let's talk about new stuff!
There are two brand new features we're now offering to all commercial subscribers. Even if your domain is small and has fewer than 10 breached addresses on it, you can still get access to these features via the entry level plan and they're both pretty self-explanatory: API-level access and formal support.
API first as I think it's the coolest and it's exactly what it sounds like: there's now a public endpoint you can throw a domain at and get a JSON response of breached aliases and the incidents they've appeared in. It looks just like this:
GET https://haveibeenpwned.com/api/v3/breacheddomain/{domain}
hibp-api-key: [your key]
Which then responds like this:
{
"alias1": [
"Adobe"
],
"alias2": [
"Adobe",
"Gawker",
"Stratfor"
],
"alias3": [
"AshleyMadison"
]
}
If you're already paying for an API key, you have immediate access to this! Same key, same logic in terms of resolving the returned breach name to the full thing via the unauthenticated API that returns breach metadata, the only caveats are that it has to be a domain you've previously demonstrated you control and it has to be within your plan size (e.g. you have a Pwned 1 plan and your domains don't exceed 25 breached addresses). Otherwise:
Subscription upgrade required.
Just one more thing with the domain search API: it only makes sense to hit it after a new breach is loaded. There's absolutely no point in hammering away at it non-stop as you'll only get the same result so instead, try polling the brand new API we've just added to return only the most recent breach (it's massively cached at Cloudflare anyway) and just hit the domain search API when there's a new one. But because not everybody will do this and domain searches are expensive relative to other queries, the terms and conditions include this clause:
Controls such as rate limiting may be added to the domain search API if excessive API requests are made despite no new breaches appearing since the last request.
There is a rate limit based on a variety of factors and it's possible you may receive an HTTP 429 if you request it more frequently than is necessary. The only reason I'm not going into the details of how that works here is that I expect it will adapt and change pretty frequently in response to how people use the service. What I can confidently say now though, is that if you use the domain search feature in the way it's intended to work - querying each domain after a new breach is added - you won't have a problem with rate limits.
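As a rough sketch of that polling pattern, the function below checks the most recent breach first and only calls the domain search API when the name has changed since the last run. The `latestbreach` path and the `Name` property on its response are assumptions on my part, so check the API docs before relying on them, and the function name and `fetchImpl` parameter are purely illustrative.

```javascript
// Sketch only: the latestbreach path and the Name property are assumptions.
const API = 'https://haveibeenpwned.com/api/v3';

async function checkDomainIfNewBreach(domain, apiKey, lastSeenBreach, fetchImpl = fetch) {
  // Cheap, heavily cached call: what's the most recently loaded breach?
  const latestRes = await fetchImpl(`${API}/latestbreach`);
  const latest = await latestRes.json();
  if (latest.Name === lastSeenBreach) {
    // Nothing new has been loaded, so the domain result can't have changed
    return { lastSeenBreach, aliases: null };
  }
  // A new breach exists, so this is the moment to run the expensive query
  const res = await fetchImpl(`${API}/breacheddomain/${encodeURIComponent(domain)}`, {
    headers: { 'hibp-api-key': apiKey },
  });
  if (!res.ok) {
    // E.g. an HTTP 429: keep the old marker so the next run tries again
    return { lastSeenBreach, aliases: null, status: res.status };
  }
  return { lastSeenBreach: latest.Name, aliases: await res.json() };
}
```

The caller stores the returned `lastSeenBreach` and passes it back in on the next run, so a schedule of any frequency only ever results in one domain search per newly loaded breach.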
I'm really excited to see how people will integrate this data into their existing tooling, do please let me know if you do something awesome 😊
Then there's the formal support which we offer via Zendesk at support.haveibeenpwned.com. That launched with the API key upgrades last November and since that time, we've answered almost 600 tickets. We've been trying to fine tune things to the extent that the knowledge base there answers the most common questions, but there's certainly a great deal of time that still goes into supporting the questions that pop up. Adding domain searches to the mix will inevitably increase that, possibly by a significant order of magnitude which is why we're only making this available to commercial subscribers.
So, that's the new bits. If you're in that 60% group of people with smaller domains outside of the commercial tiers, you can get access to both the API and support by subscribing to the smallest possible plan for that cup of coffee a month. We feel that's a pretty reasonable balance, and I hope you do too.
Speaking of reasonable, about those spam lists...
I mentioned sharing as much as I could in my weekly update videos, including the intended pricing structure and how it would be based on the number of breached email addresses on a domain. Several people raised a very important point as it related to the calculations: data breaches ain't data breaches or more specifically, there are breaches in HIBP that shouldn't be treated like the other ones as they artificially inflate the pwn count. Could these be excluded?
The Onliner Spambot incident was the worst culprit and in the case of one person that contacted me, it caused his personal domain to read as though hundreds of addresses had been breached when the correct number was... zero. Someone else had their domain pegged at 40 breached addresses whereas once you took this breach out, the number came down to 13. This created somewhat of a rock and hard place situation because whilst those aliases did appear in this incident, they weren't real addresses. But what's a "real" email address anyway? Or more specifically, how can I tell via a string alone whether an address is real or not? A decade ago now I wrote about how hard this is and per the comments on that post, concluded that the only way to tell for sure is to send an email and have the recipient perform some sort of explicit action such as clicking on a link. Clearly, that's not feasible in this situation but equally, putting a price on a service based on a metric that has been artificially inflated just wasn't fair.
Adding spam lists back in 2016 was the right thing to do but equally, excluding them from the number that determines the pricing tier is also the right thing to do. We've tried to make this logic as clear as possible throughout the system and focus on a simple UX that's explicit but can also provide more insight if required:
And if you're interested in which breaches specifically have been classified as a spam list, I've added a filter to the API that lists all breaches. It's an unauthenticated API you can load directly in your browser via GET request and at the time of writing, has 11 breaches on it with nearly 1.4 billion records.
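If you want to reproduce that count yourself, one way to think about it is sketched below. It assumes an alias only counts when it appears in at least one breach that isn't a spam list, which is my reading of the exclusion rather than a documented algorithm. The input has the same shape as the domain search API response shown earlier, and the function name and the breach names in the usage note are illustrative.

```javascript
// Count breached aliases, ignoring any alias that only appears in spam lists.
// Assumption: an alias counts if at least one of its breaches is not a spam list.
function countBreachedAddresses(domainResult, spamListNames) {
  const spam = new Set(spamListNames);
  return Object.values(domainResult)
    .filter((breaches) => breaches.some((name) => !spam.has(name)))
    .length;
}
```

For example, passing `{ alias1: ['Adobe'], alias2: ['OnlinerSpambot'], alias3: ['OnlinerSpambot', 'Gawker'] }` with `['OnlinerSpambot']` as the spam list names counts 2 addresses rather than 3, because alias2 only ever appeared in the spam list.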
The very last thing from that screen cap is the "Enable debug mode" link and for that, we need to talk about "domain creep".
Data breaches are obviously an ongoing thing. Always have been, always will be so what that means is when you look at a domain today and see, say, 20 breached accounts on it, that might be 30 breached accounts tomorrow. I think everyone who uses HIBP understands that, but it does create a bit of a problem when domain searches are priced on a metric that can "creep". What if you've just paid for a year's worth of Pwned 1 subscription and per the example here, you've suddenly got more than 25 breached accounts on your domain and can no longer search it?
The sentiment of how this should be handled was always obvious: people have to get what they pay for. We didn't want a situation where someone could be left disappointed, and our fear was that the organic increase in breaches could lead to that event. The solution was easy: when you buy a subscription at a certain scale, every domain you're currently monitoring that can be searched on the first day of the subscription can still be searched on the last day of the subscription. If you take out one year of Pwned 1 today and per the example above, the domain creeps beyond 25 breached accounts tomorrow, it'll have zero impact for the next 364 days.
I'm conscious that this concept can get confusing: domain searches are based on the number of breached accounts on the domain but not including spam lists and then locked in at the size of the domain until the next subscription renewal... phew! The debug mode link mentioned above aims to show all this logic in its raw detail:
Even though domain1.com in this example has grown to 26 breached addresses, because it was 22 breached addresses when the subscription was taken out then that's the number it's locked at until it renews in August next year. I hope this is clear enough, do please leave a comment if we can do better.
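The locking rule can be expressed in a few lines. This is only a sketch of the logic as described here, with made-up names (`canSearch`, `lockedCounts`, `currentCount`) rather than anything from the real code base: a domain that was being monitored when the subscription started is judged on its locked count, while any other domain is judged on its current count.

```javascript
// Paid plan limits from the pricing table; Infinity models "Unlimited".
const PLAN_LIMITS = { 'Pwned 1': 25, 'Pwned 2': 100, 'Pwned 3': 500, 'Pwned 4': Infinity };

// subscription.lockedCounts maps domain name -> breached count on day one
function canSearch(domain, subscription) {
  const limit = PLAN_LIMITS[subscription.plan];
  const locked = subscription.lockedCounts[domain.name];
  // Use the locked count when there is one, otherwise today's count
  const effective = locked ?? domain.currentCount;
  return effective <= limit;
}
```

For domain1.com in the example, a locked count of 22 on a Pwned 1 plan keeps it searchable even though the current count is 26, whereas a domain with 26 breached addresses and no locked count would not fit.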
Lastly, let me put some raw numbers around the "domain creep" situation as I foresee this causing concern beyond what might be warranted. Let's start with the number of unique email addresses which is approximately 6 billion. There have been about 723M records added in the last 12 months and a bunch of those will be for the same email address (shout out to everyone who was pwned again in the last year!) Further, of that number, most email addresses were already pwned. That's a link through to the Twitter feed where I broadcast the percentage of previously seen addresses and you'll see that number is regularly around the 60% to 70% range. In other words, it's probably in the order of 250M new addresses we've seen in the last year which is approximately 4% of the entire corpus. So, yes, over the course of time we'll see domains slip into higher plans, but only at about the rate of CPI.
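For anyone who wants to check that back-of-the-envelope maths, here's the calculation spelled out. It's a rough sketch that takes the figures above at face value and ignores the same address appearing more than once within the 723M records, so treat the result as an upper bound on the order of magnitude rather than a precise figure.

```javascript
const recordsAdded = 723e6; // records loaded in the last 12 months
const corpus = 6e9;         // approximate unique addresses in HIBP

// 60% to 70% of addresses were previously seen, so 30% to 40% are new
const newLow = recordsAdded * (1 - 0.70);
const newHigh = recordsAdded * (1 - 0.60);
const midpoint = (newLow + newHigh) / 2;

// New addresses as a percentage of the whole corpus
const growthPercent = (midpoint / corpus) * 100;

console.log(Math.round(newLow / 1e6), Math.round(newHigh / 1e6), growthPercent.toFixed(1));
```

That gives a range of roughly 217M to 289M new addresses, with the midpoint landing at a little over 4% of the corpus.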
Finally, locking domain counts for the duration of the subscription creates additional incentive to make it an annual one, and that's beyond the existing incentive of "buy 10 months, get 12 months". That's also in addition to massively cutting down on the number of times you may need to deal with corporate bureaucracy. Speaking of which...
Let me start with a story: Many years ago during my lengthy tenure at Pfizer, I pushed hard to drive us away from traditional hosting models and towards modern cloud paradigms, namely the Azure App Service. Here we had a model where you could self-service provision resources that cost about $50 per month and completely replaced a model that was costing us tens of thousands a year. It was an easy win, however... the organisation demanded vendor assessments, compliance paperwork and a billing model which, of course, was favourable to them. But Microsoft's model was "chuck your credit card in and off you go", so that's what one of my colleagues did. And paid for it himself, entirely out of his own pocket in order to save one of the world's largest companies money. My point is that I've done time on the inside and I understand the barriers organisations put in place "because reasons". I touched on this in the June post about the upcoming domain changes:
To be honest, the experience with the public API keys has taught me that it's usually not money that's the barrier to using commercial services, it's corporate procurement bureaucracy. Onboarding documentation. Vendor assessments. Tax forms.
And so too, I have the experience from the outside having regularly received requests to invest hours doing manual labour for the sake of something an organisation is paying a few bucks a month for. That simply doesn't scale and the whole point of providing services like this at volume is that you can go and set everything up yourself with nothing more than a credit card. This one came in while preparing this blog post:
My company is looking to purchase an API key so we can automate user lookups on your site. Our procurement process is wildly complex and I was wondering if we have the option of submitting a Purchase Order instead of using the Stripe credit card payment method?
If this situation resonates, you have my sympathies and my own corporate bureaucracy scars are still raw! If there's more we can do to ease the onboarding path without creating manual labour on a per-customer basis then please let me know. I'm sure there are improvements that can be made, the last thing I want to see is you ending up like my old mate from Pfizer 😞
We've tried to do everything possible to remove barriers. We've made significant investments in legal counsel to get the terms of use and privacy policy right and we've tried to provide answers to all the regular questions in the FAQs. We've even publicly provided a W-8BEN-E US tax form which was often requested by folks in the US. But it won't be enough for some organisations, which is why we do exactly the same thing as Pfizer often found themselves doing which is to provide an enterprise-orientated process where we deal with all this rigmarole... and charge accordingly. If that's you, then get in touch with me.
There will be lots of "but what about...?" edge cases. Let me give you some examples and our views on them:
But what about addresses that don't actually exist?
For most data breaches, email addresses are extracted using a regular expression run over the entire corpus of data. You can see what this looks like in the open source email address extractor used to process breaches. So, what is an email address? Per my earlier explanation, it's anything that matches the regex when run across the breach. That could mean strings that aren't actually an address on a domain get caught up and reported incorrectly. It happens, but there's no way to practically stop it and it's extraordinarily rare.
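To make that concrete, here's a minimal Python sketch of regex-based extraction. The pattern below is a deliberately loose one assumed purely for illustration (it is not the regex from the open source extractor), and the sample data is made up. Note how a string that merely looks like an address gets swept up alongside the genuine ones:

```python
import re

# Loose, illustrative pattern only; the real extractor's regex may well differ.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_addresses(corpus: str) -> set[str]:
    """Return every unique, lower-cased string in the corpus that matches the pattern."""
    return {match.lower() for match in EMAIL_PATTERN.findall(corpus)}


# Made-up breach sample: two genuine addresses and one string that only looks like one.
sample = (
    "1,alice@example.com,hunter2\n"
    "2,Bob.Smith@example.com,letmein\n"
    "note: asset logo@2x.example.com was uploaded\n"
)

addresses = extract_addresses(sample)
```

Here `logo@2x.example.com` was never a mailbox, but it matches the pattern and so it would be reported against the domain like any other address.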
But what about email addresses from years ago that still appear as breached on a domain?
The argument here is that whilst these are genuine addresses that did indeed exist at one point, they aren't really relevant anymore either due to their age or the address no longer existing (e.g. ex staff). I have both a philosophical and a technical view on this, with the former being that data breaches are immutable. At a point in time, addresses were exposed, and that fact can never be reversed. As for the latter point, those addresses remain in a storage construct we need to continue to support, and every single domain query needs to pick those addresses up and return them to the code processing the search (the design of HIBP means that Azure's Table Storage returns the entire partition on each domain query). Further, in most cases, that doesn't change the total number of breached accounts being a reasonable metric for organisation size and subsequently, the pricing tier they should fit into.
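As a rough illustration of why old addresses still carry a cost, here's a toy in-memory model in Python. It is not the Azure SDK and not HIBP's actual code; the class name, the use of the alias as the key within each partition and the sample data are all assumptions made for the sketch. The only thing taken from the description above is that a domain query returns the whole partition:

```python
from collections import defaultdict


class DomainTable:
    """Toy stand-in for a table partitioned by domain (hypothetical, not the Azure SDK)."""

    def __init__(self) -> None:
        # domain -> alias -> set of breach names (the alias-as-key layout is an assumption)
        self._partitions: dict[str, dict[str, set[str]]] = defaultdict(dict)

    def add(self, address: str, breach: str) -> None:
        alias, _, domain = address.lower().rpartition("@")
        self._partitions[domain].setdefault(alias, set()).add(breach)

    def query_domain(self, domain: str) -> dict[str, set[str]]:
        # The entire partition comes back, so addresses of any age are always included.
        return dict(self._partitions.get(domain.lower(), {}))


table = DomainTable()
table.add("current.staff@example.com", "BreachB")
table.add("ex.staff@example.com", "BreachA")  # left the company years ago
table.add("someone@other.example", "BreachA")

result = table.query_domain("example.com")
```

Both aliases on `example.com` come back on every query, regardless of whether the mailbox still exists.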
But what about old breaches I don't care about any more causing me to require a higher plan?
It's a similar answer to the previous point insofar as the immutability of history and the need to store the data. It also remains the most reliable metric we have to determine the size of the domain and in many cases, the organisation that owns it. Think of this measurement primarily as a means of slicing up the corpus of data within HIBP and distributing the cost as equitably as possible across the organisations using the domain search feature.
But what about people who don't want to use a credit card?
I'll give you a two-part answer on this, beginning with the recognition that cards can pose legitimate challenges for some people. Just as I was drafting this blog post, someone trying to sign up to the public API reached out after failing to subscribe multiple times with different cards:
For a variety of reasons, I believe the guy is legit, but Stripe reports two payments declined by his bank and another due to an invalid CVC. But using Stripe doesn't just mean credit cards, it also means Apple Pay and Google Pay, WeChat Pay in China, EPS in Belgium, Afterpay in Australia and a raft of other payment mechanisms in different parts of the world. It's hard to imagine a legitimate case where someone does not have access to any of the available payment mechanisms, which brings me to the second part:
The reason we don't support the likes of anonymous cryptocurrency and rely solely on fiat money payments is that it very quickly weeds out the bad actors. That was the whole rationale for putting a payment gateway on the public API back in 2019 - to cut out the abuse. It turns out that once you have to pass the sort of KYC barriers financial institutions put in place, people don't misbehave under their own identity. And yes, there's always fraudulent use of cards, but Stripe has gotten so good at handling that (we pay for their Radar service as well), our dispute rate is only one in many thousands of transactions.
But what about [other reasons related to calculations and costs]?
Amongst the corpus of 12.6 billion records, there will be anomalies. It'll almost certainly be sub-1% and the anomalies won't be evenly distributed across domains; they'll affect some more than others. It's infeasible to ever get that down to zero and it's also infeasible to respond to every single request I know will come through asking for an anomaly to be rectified. The most practical way we could find to deal with this is to keep the pricing structure such that anomalies will be unlikely to have much impact of consequence.
We're also conscious that some people will challenge the cost and it happens all the time with the existing public API key either because of the individual's position in life or the nature of the organisation they work in. But this is why we've structured it as we have, with the majority of domains being within that free tier and the entry level cost being the cup of coffee that gets you access to things like API level access and formal support. This was the most reasonable, equitable model we could come up with and I hope that shines through in the explanations above.
I know there'll be individuals with catch all domains that have ended up in a couple of dozen data breaches and they think paying $3.95 to see them is unreasonable. I know there'll be organisations with much larger numbers who feel it's unreasonable because similarly sized orgs are more profitable. But I also know that I've been running domain searches totally out of my own pocket for almost a decade so whilst I'm sympathetic to anyone who now needs to pay for a service that was previously free, I'm also comfortable that a reasonable and well thought out model has been arrived at.
I'm excited to see what people do with the new API. The email address search one is presently requested millions of times a day and people have built all sorts of amazing things with it, everything from corporate awareness campaigns to tooling to help protect customers from account takeover attacks to integration within the corporate SOC. It's cases like that last one where I think the domain search API will really shine and if you do something awesome with it, please get in touch and let me know.
I know this was a long read, I hope it adequately explains the rationale for the subscription service and that you use it to do amazing things 😊
You can get started right now from the domain search page on HIBP.
Update: Following feedback and consultation with a range of existing users of the service, we now provide a model for the education and non-profit sectors. See the KB titled Do you provide discounts based on the nature of the organisation? for more information.
There are presently 201k people monitoring domains in Have I Been Pwned (HIBP). That's massive! That's 201k people that have searched for a domain, left their email address for future notifications when the domain appears in a new breach and successfully verified that they control the domain. But that's only a subset of all the domains searched, which totals 231k. In many instances, multiple people have searched for the same domain (most likely from the same company given they've successfully verified control), and also in many instances, people are obviously searching for and monitoring multiple domains. Companies have different brands, mergers and acquisitions happen and so on and so forth. Larger numbers of domains also means larger numbers of notifications; HIBP has now sent out 2.7M emails to those monitoring domains after a breach has occurred. And the largest number of the lot: all those domains being monitored encompass an eye watering 273M breached email addresses 😲
The point is, just as HIBP itself has escalated into something far bigger than I ever expected, so too has the domain search feature. Today, I'm launching an all new domain search experience and 5 announcements about major changes surrounding it. Let's jump into it!
Every time I look at numbers related to domain searches, they stagger me. One of the stats I found particularly interesting was that of those 200k people monitoring domains, 23k of them were monitoring 2 or more domains. 8.5k were monitoring 3 or more. 4.6k were 4 or more and so on and so forth. The point being that there are a very large number of people monitoring multiple domains. In fact, 1k people are monitoring 9 or more and hundreds have gone through the manual verification process at least 2 dozen times.
To make life much, much easier on those folks monitoring multiple domains, they're now all bundled up into a centralised dashboard accessible from the existing Domain search link on the website. Because I already know who is monitoring which domains and the email address they're using for notifications, that same email address can be used to verify your identity and drop you straight into the dashboard. Here's mine:
One of the problems the dashboard approach helps tackle is unsubscribing on an individual domain basis. In the past, the only way to unsubscribe from domain notifications was to wait until one landed in your inbox then unsubscribe from every single monitored domain in one go. It was an all or nothing affair that nuked the lot of them whereas now, it's a domain-by-domain exercise.
Another problem this solves is how I respond to an often-received question: "Hey, can you tell me which domains I'm currently subscribed to?" Uh, the ones you verified? Like, possibly almost a decade ag... ah, yeah, that's a poor answer! The dashboard now makes the answer crystal clear.
And finally, another massive problem it helps tackle is verification, and that brings me to the second big announcement:
I originally introduced domain searches to HIBP only 6 weeks after the project first launched. Up until this week, it functioned exactly the same way for almost a decade: plug in a domain name, verify control of it then see the results. Each and every time. What it meant is that if you wanted to search a domain, you successfully demonstrated control then you came back later and tried to search it again, you had to go back through the same process:
You'd be surprised at how many emails I get about the difficulty this poses. We don't have any of those 4 aliases on our domain. We can't add a meta tag. We can't upload a file. We can't touch DNS. It leaves me prone to asking "well do you really have control of the domain?" Thing is, "control" is a bit of a nuanced term; there are many people in roles where they don't have access to any of the above means of verification but they're legitimately responsible for infosec and responding to precisely the sorts of notifications HIBP sends out after a breach. Usually in these cases they can get support to go through the verification process, but it involves formal internal processes, ticketing, documentation and having to explain to some IT ops person why a data breach website with a funny name needs one of the above things to happen. This doesn't fix the pain of doing it once, but it does mean that it's now a one-off pain.
As the popularity of HIBP and domain searches has grown over the years, another challenge has emerged. Let me illustrate by example: in January this year, I loaded a rather large breach into HIBP:
New scraped data: Twitter had over 200M accounts scraped from a vulnerable API in 2021. Email addresses were passed in and Twitter profiles returned. 98% were already in @haveibeenpwned. Read more: https://t.co/FRBDFk3nkp
— Have I Been Pwned (@haveibeenpwned) January 5, 2023
That's a sizeable whack of data, in fact it was the 14th largest in HIBP out of the existing 644 in there at the time. It also had a massive impact on HIBP subscribers; I sent over 1 million emails to individuals using the notification service which made it the single largest corpus of notification emails we'd ever sent by a significant margin. But further, I also sent 60,851 emails to people monitoring domains. And that's when this started happening:
6 minutes later...
And so on and so forth until my inbox looked like this:
This was Azure auto-scale doing its thing and it was one of the early attractions for me building HIBP on Microsoft's PaaS offering way back in 2013. Need more resources? Just add more cloud! Job done, next problem. Except there are 2 major drawbacks with this:
Domain searches were actually one of the last remaining remnants of a resource intensive process still running on PaaS; most of the other important bits (namely email address searches and Pwned Passwords' k-anonymity searches) had been on Azure Functions for ages. Functions are awesome as they're "serverless" (except for the servers they run on, but don't let me get in the marketing team's way here), in that you're never deploying large logical containers of compute like with auto-scale so that solves problem 1 above.
As of now, all domain searches run on Azure Functions. There's literally no domain search logic remaining in the Azure App Service PaaS model, it's all gone. That moves things over to much more scalable infrastructure and massively reduces the likelihood of a timeout when searching a larger domain.
I didn't just want to ship a model from years ago and reproduce all the assumptions of the day, so I made a bunch of tweaks to further optimise things. These are all things that benefit both those searching domains and me running the platform as they reduce overhead on everyone.
For example, there was no point searching for a domain then listing every alias on it with "@domain.com" repeated on the end, so now you'll just see "alias@" instead. Doesn't sound like a lot, but imagine a domain with tens of thousands of results and then a heap of orgs running searches on them. More data equals more processing equals more egress bandwidth equals more latency and more cost. (Sidenote: if you're wondering "how costly can a bit of bandwidth really be", read my post from last year on How I Got Pwned by My Cloud Costs.)
The same logic extended to exporting the domain search results in Excel or JSON format - strip out the redundant data. I went even harder on the JSON front as this format is primarily used for ingestion into other apps where there's a large amount of programmatic control. So, rather than returning a heap of redundant breach metadata over and over again, now each alias just lists the name of the breach and you can match that up to the data from the breaches API. To be clear, the domain search JSON format itself was never an "API"; it wasn't designed for programmatic consumption, it required manual verification first and I set no expectation of stability. That's something that will change soon - there'll be a proper API - but I'll come back to that at the end of this post.
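Here's a minimal Python sketch of what consuming that slimmed-down JSON might look like. The exact shape of the export, the breach names and the dates are all assumptions invented for the example; the only idea taken from the description above is that each alias lists breach names, which are then matched against metadata fetched once from the breaches API:

```python
# Hypothetical compact export for one domain: alias -> names of breaches only.
domain = "example.com"
compact_results = {
    "alice": ["BreachA", "BreachB"],
    "bob": ["BreachA"],
}

# Made-up stand-in for breach metadata, retrieved once rather than repeated per alias.
breaches = [
    {"Name": "BreachA", "BreachDate": "2020-01-01"},
    {"Name": "BreachB", "BreachDate": "2021-06-15"},
]


def expand(domain: str, compact: dict[str, list[str]], breaches: list[dict]) -> list[tuple[str, str, str]]:
    """Join each alias back to its full address and to the breach metadata."""
    by_name = {breach["Name"]: breach for breach in breaches}
    rows = []
    for alias, names in sorted(compact.items()):
        for name in names:
            rows.append((f"{alias}@{domain}", name, by_name[name]["BreachDate"]))
    return rows


rows = expand(domain, compact_results, breaches)
```

The metadata for each breach is held once in `by_name` instead of being shipped again for every alias that appeared in it, which is where the saving comes from.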
Something else I've been working away on in the background is to better leverage Cloudflare's WAF to minimise the impact on the origin services. For example, last week I did a thread on blocking 401 and excessive 428 responses at the edge rather than having to process them (and pay to process them) at the origin. I've been using similar logic to keep some, well, let's just call it "very excessive" domain queries under control. For example, one particular domain was searched 140 times after a breach was loaded in April, followed by another 40 times immediately after a breach the following month:
Clearly, this is just unnecessary. Remember how domain searches are a resource intensive process that hits my bottom line pretty hard? Yeah, well, not any more!
And finally on the performance front, if you were previously monitoring multiple domains and you got a breach alert, you could run a single search that bundled all the results in together. You reckon searching for one domain can be resource intensive? Try throwing a bunch of them into the one search! As the system grew and grew, this model became increasingly hard to sustain and equally, it became increasingly noisy. So now, exactly the same domains can be searched one by one which breaks the processing down into smaller, more manageable units. Hey, wouldn't it be great to have an API around that so you could just automate the entire thing? Read on!
All these tweaks along with the move to Azure Functions have made a massive difference to the performance problem mentioned earlier, but another problem remains: I'm still paying for your domain searches. Azure Functions are charged based on a combination of how long they run for and how many resources they consume. Both those factors are extraordinarily small for individual email address searches, but they're not for domain searches. That's why soon, the largest users of the service are going to see a small fee.
Pick a brand. A big brand. If I was to bet you that either the brand directly or its parent company has used the HIBP domain search feature in the past, I'd win. I wouldn't win every bet, but I'd come out on top over a bunch of them and I know this because I have the data to be confident of my odds 🙂
Knowing which big brands use which domains for their email is actually a hard metric to define:
Anyone know where I can find a list of the Fortune 500’s domains used for email accounts? There may be more than 1 per company and it may be different to their primary website.
— Troy Hunt (@troyhunt) January 15, 2023
But by cobbling enough OSINT data together, I was able to confidently demonstrate that more than half the Fortune 500 have used this service and the vast majority of those continue to do so via ongoing domain monitoring. That's awesome! And that pattern extends all the way down to much more localised brands too: my bank. My telco. My supermarket. All sorts of commercial organisations running businesses and using data sourced from HIBP to help them do so.
I started analysing the metrics back at that tweet in Jan, just the week after all the domain searches following the scraped Twitter data going into HIBP. For the last 5 months, I've been trawling through the usage patterns and watching how organisations are using the service. I also paid a lot of attention to the reactions following the change in rate limits and annual billing for the public API that enables email address searches last Nov. That's now given me a pretty good sense of how to structure a commercial domain search model. It's not final yet, but I do hope to put the finishing touches on it next month and in the interim, welcome feedback on the high-level overview of how it'll work that I'll list here in point form:
That last point in particular is hotly requested and as of a couple of months ago, already under development:
UserVoice suggestion for @haveibeenpwned to add domain search capability to the API now started! Follow along, vote and subscribe to updates here: https://t.co/Z32eC0d9nb
— Troy Hunt (@troyhunt) April 20, 2023
I'm still working through the mechanics of all this, both technically and commercially. One part of that is looking at raw numbers, for example about half of all the domains being monitored have 10 or fewer breached accounts on them. These aren't commercial entities of any scale and whilst I'm not saying "10 is the free tier number", clearly there are a massive number of domains that are tiny and shouldn't be at all impacted by this.
To be honest, the experience with the public API keys has taught me that it's usually not money that's the barrier to using commercial services, it's corporate procurement bureaucracy. Onboarding documentation. Vendor assessments. Tax forms. All sorts of things that demand hours of our time, often for the sake of only $3.50 per month. So we politely decline 😊 I know that will be an issue, in fact I suspect it will be the issue and a lot of the work we've been doing this year is to try and ease that pain to the fullest extent possible. I'll talk more about that once things finally launch but for now, that's the direction we're heading and the sorts of issues we're tackling in preparation.
As we approach the 10th birthday of HIBP later this year, it's hard not to look back and reflect. So much has changed in that time, yet the service still feels very much like what it was on day 1. The challenge for me over this time has been to work out how to adapt to the changes whilst keeping true to the original intent of service. Nothing has happened quickly in that regard, and the transparent fashion in which I've chosen to run HIBP has made the rationale for any change very clear to everyone. Even this blog post has been 5 months in the making, gradually evolving to reflect my thinking on the issues until I was confident enough in the path forward.
Go and use the new dashboard. Give it a good run and let me know what you think as I'm sure there are many things we can do better. And do provide your feedback on both the changes announced here and those to come regarding the commercial tiers too, the more input we get on this the better equipped we are to make good decisions.
A quick summary first before the details: This week, the FBI in cooperation with international law enforcement partners took down a notorious marketplace trading in stolen identity data in an effort they've named "Operation Cookie Monster". They've provided millions of impacted email addresses and passwords to Have I Been Pwned (HIBP) so that victims of the incident can discover if they have been exposed. This breach has been flagged as "sensitive" which means it is not publicly searchable, rather you must demonstrate you control the email address being searched before the results are shown. This can be done via the free notification service on HIBP and involves you entering the email address then clicking on the link sent to your inbox. Specific guidance prepared by the FBI in conjunction with the Dutch police on further steps you can take to protect yourself is detailed at the end of this blog post on the gold background. That's the short version, here's the whole story:
Ever heard that saying about how "data is the new oil"? Or that "data is the currency of the digital economy"? You've probably seen stories and infographics about how much your personal information is worth, both to legitimate organisations and criminal networks. Like any valuable commodity, marketplaces selling data inevitably emerge, some operating as legal businesses and others, well, not so much. In its simplest form, the illegal data marketplace has long involved the exchange of currency for personal records containing attributes such as email addresses, passwords, names, etc. Cybercriminals then use this data for purposes ranging from identity theft to phishing attacks to credential stuffing. So, we (the good guys) adapt and build better defences. We block known breached passwords. We implement two factor authentication. We roll out user behavioural analytics that identifies abnormalities in logins (why is Joe suddenly logging in from the other side of the world with a new machine?) And in turn, the criminals adapt, which brings us to Genesis Market.
Until this week, Genesis had been up and running for 4 years. This is an excellent primer from Catalin Cimpanu, and it describes how in order to circumvent the aforementioned fraud protection measures, cybercriminals are increasingly relying on obtaining more abstract pieces of information from victims in order to gain access to their accounts. Rather than relying on the credentials themselves and then being subject to all the modern fraud detection services mentioned above, criminals instead began to trade in a combination of "fingerprints" and "cookies". The latter will be a familiar term to most people (and was obviously the inspiration for the name behind the FBI's operation), whilst the former refers to observable attributes of the user and their browser. To see a very easy demonstration of what fingerprinting involves, go and check out amiunique.org and hit the "View my browser fingerprint" button. You'll get something similar to this:
Among more than 1.6M sampled clients, nobody has the same fingerprint as me. Somehow, using the current version of Chrome on the current version of Windows, I am a unique snowflake. Why I'm so unique is partly explained by my time zone which is shared by less than half a percent of people, but it's when that's combined with the other observable fingerprint attributes that you realise just how special I really am. For example, less than 0.01% of people have a content language request header of "en-US,en,en-AU". Only 0.12% of people share a screen width of 5,120 pixels (I'm using an ultrawide monitor). And so on and so forth. Because they're so unique, fingerprints are increasingly used as a fraud detection method such that if a malicious party attempts to impersonate a legitimate user with otherwise correct attributes (for example, the correct cookies) but the wrong fingerprint, they're rejected. Which is why we now have IMPaaS.
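To illustrate the idea, here's a minimal Python sketch that boils a handful of observable attributes down to a single fingerprint and rejects a session when the fingerprint doesn't match. The attribute names, the time zone value and the exact-match comparison are assumptions made for the example; real fraud detection systems are far more nuanced than a straight hash comparison:

```python
import hashlib
import json


def fingerprint(attributes: dict[str, str]) -> str:
    """Hash a canonical (key-sorted) form of the attributes into one identifier."""
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def looks_like_same_device(known: dict[str, str], presented: dict[str, str]) -> bool:
    """Simplified check: correct cookies alone are not enough, the fingerprint must match too."""
    return fingerprint(known) == fingerprint(presented)


# Two values come from the example above; the time zone is a hypothetical placeholder.
victim = {
    "content_language": "en-US,en,en-AU",
    "screen_width": "5120",
    "timezone": "UTC+10",
}

# An impersonator with the right cookies but a different screen and time zone.
impersonator = {
    "content_language": "en-US,en,en-AU",
    "screen_width": "1920",
    "timezone": "UTC+1",
}

same = looks_like_same_device(victim, dict(victim))
different = looks_like_same_device(victim, impersonator)
```

It's precisely this sort of check that reconstructing the victim's fingerprint is designed to defeat.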
There's an excellent IMPaaS explanation from the Eindhoven University of Technology in the Netherlands via a paper titled Impersonation-as-a-Service: Characterising the Emerging Criminal Infrastructure for User Impersonation at Scale. Released only a year and a half after the emergence of Genesis, the paper explains the mechanics of IMPaaS:
IMPaaS allows attackers to systematically collect and enforce user profiles (consisting of user credentials, cookies, device and behavioural fingerprints, and other metadata) to circumvent risk-based authentication system and effectively bypass multifactor authentication mechanisms
In other words, if you have all the bits of information a website requires to persist authenticated state after the login process has successfully completed (including after any 2FA requirements), you can perform a modern equivalent of session hijacking. Obtaining this level of information is typically done via malicious software running on the victim's machine which can then grab anything useful and send it off to a C2 server where it can then be sold and used to commit fraud (from the IMPaaS paper):
Catalin's story from the early days of Genesis showed how buyers could browse through a list of compromised victims and pick their target based on the various services they had authenticated to, along with their operating system and location. Pricing was inevitably based on the value of those services with the examples below going for $41.30 each (and just like a legitimate marketplace, these were marked down prices so a real bargain!)
To make things as turn-key as possible for the criminals, buyers would then run a browser extension from Genesis that would reconstruct the required fingerprint based on the information the malware had obtained and grant them access to the victims' accounts (I'm having flashbacks of Firesheep here). It was that simple... until this week. As of now, the following banner greets anyone browsing to the Genesis website:
The aptly named "Operation Cookie Monster" is a joint effort between the FBI and a coalition of law enforcement agencies across the globe who have now put an abrupt end to Genesis. I imagine they'll be having some "discussions" with those involved in running the service, but what about the individuals who are the victims? These are the people whose identities have been put up for sale, purchased by other criminals and then abused to their detriment. The FBI approached me and asked if HIBP could be used as a mechanism to help warn victims of their exposure in the same way as we'd previously done with the Emotet malware a couple of years ago. This is well aligned with the mantra of HIBP - to do good and constructive things with data breaches after they occur - and I was happy to provide support.
There are 2 separate things that have now been loaded into HIBP, each disassociated from the other:
The Pwned Passwords API is presently hit more than 4 billion times each month, and the downloadable data set is hit, well, I don't know because anyone can grab it and run it offline. The point is that password corpuses loaded into HIBP have huge reach and are used by thousands of different online services to help people make better password choices. You're probably using it without even knowing it when you sign up or log in to various services but if you want to check it directly, you can browse to the web interface. (If you're worried about the privacy of your password, there's a full explainer on how the service preserves anonymity but I also suggest testing it after you've changed it as a generally good practice.)
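For a rough idea of how that anonymity is preserved, here's a minimal Python sketch of the k-anonymity approach: only the first 5 characters of the password's SHA-1 hash would ever leave your machine, and the comparison against the returned hash suffixes happens locally. The function names and the canned response below are made up for illustration, and the sketch makes no network call at all:

```python
import hashlib


def split_hash(password: str, prefix_len: int = 5) -> tuple[str, str]:
    """SHA-1 the password and split it into the prefix that is sent and the suffix that stays local."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:prefix_len], digest[prefix_len:]


def count_in_range(suffix: str, range_response: str) -> int:
    """Look for our suffix in a response made up of SUFFIX:COUNT lines; 0 means not found."""
    for line in range_response.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate == suffix:
            return int(count)
    return 0


prefix, suffix = split_hash("password")

# Canned stand-in for what a range query on `prefix` might return (counts are invented).
fake_response = f"{'0' * 35}:1\n{suffix}:42\n{'F' * 35}:7\n"

times_seen = count_in_range(suffix, fake_response)
```

The service only ever sees the 5 character prefix, which is shared by many different hashes, so it can't tell which password was actually being checked.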
The email address search is what HIBP is so well known for and that's obviously what will help you understand if you've been impacted. Per the opening paragraph, this breach is flagged as "sensitive" so you will not get a result when searching directly from the front page or via the API, rather you'll need to use the free notification service. This approach was chosen to avoid the risk of people being further targeted as a result of their inclusion in Genesis. All existing HIBP subscribers have been sent notification emails and between individuals and those monitoring domains, tens of thousands of emails have now been sent out. Whilst the volume of accounts represented is "8M", please note that this is merely an approximation (hence the perfectly round number on HIBP), intended to be an indicative representation of scale as many of the breached accounts didn't include email addresses. This number only represents the number of unique email addresses which showed up in the data set so consider it a subset of a much larger corpus.
Let me add some final context and this is important if you do find yourself in the Genesis data: due to the nature of how the malware collected personal information and the broad range of different services victims may have been using at the time, the exposed data can differ significantly person by person. What's been provided by the FBI is one set of passwords (incidentally, as SHA-1 and NTLM hash pairs fed into the law enforcement ingestion pipeline), one set of email addresses and a list of metadata. Beyond the data already listed here, the metadata includes names, physical addresses, phone numbers and full credit card details among other personal attributes. This does not mean that all impacted individuals had each of those data classes exposed. The hope is that by listing these fields it will help victims understand, for example, why they may have observed fraudulent transactions on their card, and they can then take informed and appropriate steps to better protect themselves.
Lastly, as flagged in the intro, following is the guidance prepared by the FBI and Dutch police on how people can safeguard themselves if they get a hit in the Genesis data or frankly, just want to better protect themselves in future:
The FBI reached out to Have I Been Pwned (HIBP) to continue sharing efforts to help victims determine if they've been victimized. In this instance, the data shared emanates from the Initial Access Broker Marketplace Genesis Market. The FBI has taken action against Genesis Market, and in the process has been able to extract victim information for the purposes of alerting victims.
In all, millions of passwords and email addresses were provided which span a wide range of countries and domains. These emails and passwords were sold on Genesis Market and were used by Genesis Market users to access the various accounts and platforms that were for sale.
Prepared in conjunction with the FBI, following is the recommended guidance for those that find themselves in this collection of data:
To safeguard yourself against fraud in the future, it is important that you immediately remove the malware from your computer and then change all your passwords. Do this as follows:
How can I prevent my data being stolen (again)?
Just one more thing to end on a lighter note: a quick shoutout to whoever at the bureau slipped a half-eaten cookie into the takedown image, having been munched on by what I can only assume is a very satisfied FBI agent after a successful "Operation Cookie Monster" 😊
What if I told you... that you could run a website from behind Cloudflare and only have 385 daily requests miss their cache and go through to the origin service?
No biggy, unless... that was out of a total of more than 166M requests in the same period:
Yep, we just hit "five nines" of cache hit ratio on Pwned Passwords: 99.999%. Actually, it was 99.9998% but we're at the point now where that's just splitting hairs. Let's talk about how we've managed to only have two requests in a million hit the origin, beginning with a bit of history:
Optimising Caching on Pwned Passwords (with Workers)- @troyhunt - https://t.co/KjBtCwmhmT pic.twitter.com/BSfJbWyxMy
— Cloudflare (@Cloudflare) August 9, 2018
Ah, memories 😊 Back then, Pwned Passwords was serving way fewer requests in a month than what we do in a day now and the cache hit ratio was somewhere around 92%. Put another way, instead of 2 in every million requests hitting the origin it was 85k. And we were happy with that! As the years progressed, the traffic grew and the caching model was optimised so our stats improved:
There it is - Pwned Passwords is now doing north of 2 *billion* requests a month, peaking at 91.59M in a day with a cache-hit ratio of 99.52%. All free, open source and out there for the community to do good with 😊 pic.twitter.com/DSJOjb2CxZ
— Troy Hunt (@troyhunt) May 24, 2022
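To put the ratios quoted so far on one scale, here's a minimal sketch of the arithmetic that converts between a cache hit ratio and "requests per million hitting the origin". The function names are made up for illustration, and the "more than 166M" daily total is treated as exactly 166,000,000, so the results are approximations rather than the exact dashboard figures.

```python
def hit_ratio_percent(origin_requests: int, total_requests: int) -> float:
    """Cache hit ratio as a percentage, from raw request counts."""
    return (1 - origin_requests / total_requests) * 100.0


def origin_requests_per_million(hit_ratio_pct: float) -> float:
    """How many requests in every million miss the cache at a given hit ratio."""
    return (100.0 - hit_ratio_pct) / 100.0 * 1_000_000


if __name__ == "__main__":
    # 385 origin requests in a day of roughly 166M requests
    # (assumption: 166,000,000 stands in for "more than 166M").
    print(f"{hit_ratio_percent(385, 166_000_000):.4f}% hit ratio")

    # The 2022 figure: a 99.52% hit ratio.
    print(f"{origin_requests_per_million(99.52):,.0f} origin requests per million")

    # The 2018 figure: 85k requests per million reaching the origin.
    print(f"{hit_ratio_percent(85_000, 1_000_000):.1f}% hit ratio")
```

Run against those numbers, 385 misses in 166M rounds to the 99.9998% quoted above, a 99.52% hit ratio works out to roughly 4,800 origin requests per million, and 85k per million corresponds to a 91.5% hit ratio, which is the "somewhere around 92%" from 2018.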
And that's pretty much where we levelled out, at about the 99-and-a-bit percent mark. We were really happy with that as it was now only 5k requests per million hitting the origin. There was bound to be a number somewhere around that mark due to the transient nature of cache and eviction criteria inevitably meaning a Cloudflare edge node somewhere would need to reach back to the origin website and pull a new copy of the data. But what if Cloudflare never had to do that unless explicitly instructed to do so? I mean, what if it just stayed in their cache unless we actually changed the source file and told them to update their version? Welcome to Cloudflare Cache Reserve:
Ok, so I may have annotated the important bit but that's what it feels like - magic - because you just turn it on and... that's it. You still serve your content the same way, you still need the appropriate cache headers and you still have the same tiered caching as before, but now there's a "cache reserve" sitting between that and your origin. It's backed by R2 which is their persistent data store and you can keep your cached things there for as long as you want. However, per the earlier link, it's not free:
You pay based on how much you store for how long, how much you write and how much you read. Let's put that in real terms and just as a brief refresher (longer version here), remember that Pwned Passwords is essentially just 16^5 (just over 1 million) text files of about 30KB each for the SHA-1 hashes and a similar number for the NTLM ones (albeit slightly smaller file sizes). Here are the Cache Reserve usage stats for the last 9 days:
We can now do some pretty simple maths with that and working on the assumption of 9 days, here's what we get:
2 bucks a day 😲 But this has taken nearly 16M requests off my origin service over this period of time so I haven't paid for the Azure Function execution (which is cheap) nor the egress bandwidth (which is not cheap). But why are there only 16M read operations over 9 days when earlier we saw more than 166M requests to the API in a single day? Because if you scroll back up to the "insert magic here" diagram, Cache Reserve is only a fallback position and most requests (i.e. 99.52% of them) are still served from the edge caches.
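The shape of that calculation (pay for storage over time, plus writes, plus reads) can be sketched as a small cost function. This is only an illustration under stated assumptions: the `Rates` class, the function names and the per-GB-month billing unit are hypothetical, and the unit prices used in the example are dummy values of 1.0, not Cloudflare's actual prices, so plug in the real rates from their pricing page before trusting any total. The storage estimate uses the rough figures from this post (16^5 files of about 30KB per hash type) and will overshoot a little because the NTLM files are slightly smaller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Unit prices supplied by the caller (hypothetical structure, not an official API)."""
    storage_per_gb_month: float
    per_million_writes: float
    per_million_reads: float


def estimated_storage_gb(files_per_hash_type: int = 16 ** 5,
                         kb_per_file: float = 30.0,
                         hash_types: int = 2) -> float:
    """Rough stored size in decimal GB: one file per 5-character hex prefix, per hash type."""
    return files_per_hash_type * kb_per_file * hash_types / 1_000_000


def cache_reserve_cost(stored_gb: float, reads: int, writes: int,
                       days: float, rates: Rates,
                       days_per_month: float = 30.0) -> float:
    """Total cost for the period: pro-rated storage plus write and read operations."""
    storage_cost = stored_gb * rates.storage_per_gb_month * days / days_per_month
    write_cost = writes / 1_000_000 * rates.per_million_writes
    read_cost = reads / 1_000_000 * rates.per_million_reads
    return storage_cost + write_cost + read_cost


if __name__ == "__main__":
    # Dummy unit prices of 1.0 so only the arithmetic is visible; NOT real pricing.
    dummy_rates = Rates(storage_per_gb_month=1.0,
                        per_million_writes=1.0,
                        per_million_reads=1.0)
    # "Nearly 16M" reads and "nearly 1M" writes over 9 days, rounded up.
    total = cache_reserve_cost(estimated_storage_gb(), 16_000_000, 1_000_000,
                               days=9, rates=dummy_rates)
    print(f"Estimated storage: {estimated_storage_gb():.1f} GB")
    print(f"Total for period (dummy rates): {total:.2f}, per day: {total / 9:.2f}")
```

Keeping the rates as an input rather than hard-coding them means the same function still works when the pricing or the traffic changes.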
Note also that there are nearly 1M write operations and there are 2 reasons for this:
An untold number of businesses rely on Pwned Passwords as an integral part of their registration, login and password reset flows. Seriously, the number is "untold" because we have no idea who's actually using it, we just know the service got hit three and a quarter billion times in the last 30 days:
Giving consumers of the service confidence that not only is it highly resilient, but also massively fast is essential to adoption. In turn, more adoption helps drive better password practices, fewer account takeovers and more smiles all round 😊
As those remaining hash prefixes populate Cache Reserve, keep an eye on the "cf-cache-status" response header. If you ever see a value of "MISS" then congratulations, you're literally one in a million!
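If you'd like to check that header yourself, here's a minimal sketch using only the Python standard library. It assumes the public Pwned Passwords range endpoint at `https://api.pwnedpasswords.com/range/{prefix}` (not spelled out in this post), and the function names and User-Agent string are invented for the example. Only the first 5 characters of the SHA-1 hash are ever sent to the API.

```python
import hashlib
import urllib.request
from typing import Optional, Tuple

# Assumed endpoint: the public Pwned Passwords k-anonymity range search.
RANGE_URL = "https://api.pwnedpasswords.com/range/{prefix}"


def sha1_prefix_and_suffix(password: str) -> Tuple[str, str]:
    """Split the uppercase SHA-1 hex digest into the 5-char prefix and 35-char suffix."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def cache_status_for_prefix(prefix: str, timeout: float = 10.0) -> Optional[str]:
    """Request one hash range and return the cf-cache-status header, if present."""
    request = urllib.request.Request(
        RANGE_URL.format(prefix=prefix),
        headers={"User-Agent": "cache-status-example"},  # arbitrary example value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Header lookup is case-insensitive; returns None if the header is absent.
        return response.headers.get("cf-cache-status")


# Example usage (makes a live network request, so it's left commented out):
#   prefix, _ = sha1_prefix_and_suffix("password")
#   print(prefix, cache_status_for_prefix(prefix))
```

Any value other than "MISS" means the response came from somewhere in Cloudflare's caching layers rather than the origin.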
Full disclosure: Cloudflare provides services to HIBP for free and they helped in getting Cache Reserve up and running. However, they had no idea I was writing this blog post and reading it live in its entirety is the first anyone there has seen it. Surprise! 👋