Classifying email providers of 2000+ Swiss municipalities via DNS, looking for feedback on methodology

/r/netsec - Information Security News & Discussion

10 March 2026 at 20:30

I built a pipeline and map that classify where Swiss municipalities host their email by probing public DNS records. I wanted to find out, using only public data, how many of them use Microsoft 365 or other US clouds.

Interactive map: https://mxmap.ch
Code: https://github.com/davidhuser/mxmap

The classification uses a hierarchical decision tree:

1. MX record keyword matching (highest priority): direct hostname patterns for Microsoft 365 (mail.protection.outlook.com), Google Workspace (aspmx.l.google.com), AWS SES, Infomaniak (Swiss provider)
2. CNAME chain resolution on MX hostnames: follows aliases to detect providers hidden behind vanity hostnames
3. Gateway detection: identifies security appliances (e.g. Trend Micro) by MX hostname, then falls through to SPF to identify the actual backend provider
4. Recursive SPF resolution: follows include: and redirect= chains (with loop detection, max 10 lookups) to expand the full SPF tree and match provider keywords
5. ASN lookup via Team Cymru DNS: maps MX server IPs to autonomous systems to detect Swiss ISP relay hosting (SWITCH, Swisscom, Sunrise, etc.). For these, autodiscover is checked to see whether a hyperscaler is actually behind the relay.
6. Autodiscover probing (CNAME + _autodiscover._tcp SRV): fallback to detect hidden Microsoft 365 usage behind self-hosted or ISP-relayed MX
7. Website scraping as a last resort: probes /kontakt, /contact, /impressum pages, extracts email addresses (including decrypting TYPO3-obfuscated mailto links), then classifies the email domain's infrastructure
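
For readers who want something concrete to poke at, here is a minimal sketch of how the MX keyword match, the gateway fallthrough and the recursive SPF expansion might fit together. It is not the code from the repository. The keyword tables, the function names and the `get_spf` callback are all assumptions made for illustration, and the lookup cap is a simple count of fetched records rather than RFC-exact accounting. The CNAME, ASN, autodiscover and scraping steps are left out.

```python
from typing import Callable, Optional

# Illustrative keyword tables (assumptions, not the project's real lists).
MX_KEYWORDS = {
    "mail.protection.outlook.com": "microsoft",
    "aspmx.l.google.com": "google",
    "amazonaws.com": "aws",
    "infomaniak": "infomaniak",
}
SPF_KEYWORDS = {
    "spf.protection.outlook.com": "microsoft",
    "_spf.google.com": "google",
    "amazonses.com": "aws",
    "infomaniak": "infomaniak",
}
GATEWAY_KEYWORDS = ("trendmicro",)  # hypothetical gateway pattern
MAX_SPF_LOOKUPS = 10


def match(hostname: str, table: dict[str, str]) -> Optional[str]:
    """Return the provider whose keyword occurs in the hostname, if any."""
    name = hostname.lower().rstrip(".")
    for keyword, provider in table.items():
        if keyword in name:
            return provider
    return None


def expand_spf(domain: str, get_spf: Callable[[str], Optional[str]]) -> list[str]:
    """Follow include: and redirect= terms breadth-first.

    get_spf(domain) returns the SPF TXT record or None. Returns the
    referenced domains in visit order, excluding the starting domain.
    """
    visited: list[str] = []
    queue = [domain]
    while queue and len(visited) < MAX_SPF_LOOKUPS:
        current = queue.pop(0).lower().rstrip(".")
        if current in visited:
            continue  # loop detection: never fetch a domain twice
        visited.append(current)
        record = get_spf(current) or ""
        for term in record.split():
            term = term.lower().lstrip("+-~?")  # drop SPF qualifiers
            for prefix in ("include:", "redirect="):
                if term.startswith(prefix):
                    queue.append(term[len(prefix):])
    return visited[1:]


def classify(domain: str, mx_hosts: list[str],
             get_spf: Callable[[str], Optional[str]]) -> tuple[str, str]:
    """Return (provider, evidence). MX keywords win over SPF."""
    for host in mx_hosts:
        provider = match(host, MX_KEYWORDS)
        if provider:
            return provider, "mx"
    behind_gateway = any(
        keyword in host.lower() for host in mx_hosts for keyword in GATEWAY_KEYWORDS
    )
    for spf_domain in expand_spf(domain, get_spf):
        provider = match(spf_domain, SPF_KEYWORDS)
        if provider:
            return provider, "gateway+spf" if behind_gateway else "spf"
    return "unknown", "none"
```

Passing the SPF lookup in as a callback keeps the logic testable without touching the network. In a real run, `get_spf` would wrap a DNS TXT query.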

Key design decisions:

MX takes precedence over SPF
Gateway + SPF expansion is critical — many municipalities use security appliances that mask the real provider
Three independent DNS resolvers (system, Google, Cloudflare) for resilience
Confidence scoring (0–100) with quality gates (avg ≥70, ≥80% high-confidence)
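
To make the quality gates concrete, here is a small sketch of one way to evaluate them over a list of per-municipality scores. The post does not say which score counts as "high-confidence", so the cutoff of 80 below is an assumption, as are the function and parameter names.

```python
from statistics import mean

HIGH_CONFIDENCE_CUTOFF = 80  # assumption: the post does not state this value


def passes_quality_gates(scores: list[float],
                         min_avg: float = 70.0,
                         min_high_share: float = 0.80,
                         high_cutoff: float = HIGH_CONFIDENCE_CUTOFF) -> bool:
    """Check both gates: average score and share of high-confidence results."""
    if not scores:
        return False  # nothing classified, so nothing to vouch for
    high_share = sum(score >= high_cutoff for score in scores) / len(scores)
    return mean(scores) >= min_avg and high_share >= min_high_share
```

Both gates have to hold. A run with a good average but a long tail of low-confidence results still fails.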

Results land in 7 categories: microsoft, google, aws, infomaniak, swiss-isp, self-hosted, unknown.

Where I'd especially appreciate feedback:

Do you think this is a good approach?
Are there MX/SPF patterns I'm missing for common provider setups?
Edge cases where gateway detection could misattribute the backend?
Are there better heuristics than autodiscover for detecting hyperscaler usage behind ISP relays?
Would you rather introduce a new "uncertain" category instead? If so, for which cases?

Thanks!

submitted by /u/dfhsr