Classifying email providers of 2000+ Swiss municipalities via DNS, looking for feedback on methodology
I built a pipeline and map that classifies where Swiss municipalities host their email by probing public DNS records. I wanted to find out how many of them use Microsoft 365 or other US clouds, based on public data only:
[Image: screenshot of the map]
- Interactive map: https://mxmap.ch
- Code: https://github.com/davidhuser/mxmap
The classification uses a hierarchical decision tree:
- MX record keyword matching (highest priority): direct hostname patterns for Microsoft 365 (mail.protection.outlook.com), Google Workspace (aspmx.l.google.com), AWS SES, Infomaniak (Swiss provider)
- CNAME chain resolution on MX hostnames: follows aliases to detect providers hidden behind vanity hostnames
- Gateway detection: identifies security appliances (e.g. Trend Micro) by MX hostname, then falls through to SPF to identify the actual backend provider
- Recursive SPF resolution: follows include: and redirect= chains (with loop detection, max 10 lookups) to expand the full SPF tree and match provider keywords
- ASN lookup via Team Cymru DNS: maps MX server IPs to autonomous systems to detect Swiss ISP relay hosting (SWITCH, Swisscom, Sunrise, etc.). For these, autodiscover is checked to see if a hyperscaler is actually behind the relay.
- Autodiscover probing (CNAME + _autodiscover._tcp SRV): fallback to detect hidden Microsoft 365 usage behind self-hosted or ISP-relayed MX
- Website scraping as last resort: probes /kontakt, /contact, /impressum pages, extracts email addresses (including decrypting TYPO3 obfuscated mailto links), then classifies the email domain's infrastructure
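To make the precedence concrete, here is a minimal offline sketch of how the MX, gateway, and SPF steps could fit together. It is not the code from the repo: the keyword tables, the `trendmicro` gateway marker, the function names, and the `get_spf` callback (a stand-in for a real TXT lookup) are all assumptions for illustration.

```python
from typing import Callable, Optional

# Illustrative keyword tables: assumptions for this sketch, not the repo's lists.
MX_KEYWORDS = {
    "mail.protection.outlook.com": "microsoft",
    "aspmx.l.google.com": "google",
    "amazonaws.com": "aws",
    "infomaniak": "infomaniak",
}
SPF_KEYWORDS = {
    "spf.protection.outlook.com": "microsoft",
    "_spf.google.com": "google",
    "amazonses.com": "aws",
    "infomaniak": "infomaniak",
}
GATEWAY_MARKERS = ("trendmicro",)  # hypothetical gateway hostname marker
MAX_SPF_LOOKUPS = 10


def match_keywords(name: str, table: dict[str, str]) -> Optional[str]:
    """Return the provider whose keyword occurs in the hostname, if any."""
    name = name.lower().rstrip(".")
    for keyword, provider in table.items():
        if keyword in name:
            return provider
    return None


def expand_spf(domain: str, get_spf: Callable[[str], Optional[str]]) -> list[str]:
    """Collect include:/redirect= targets, breadth-first, with loop detection."""
    seen: set[str] = set()
    targets: list[str] = []
    queue = [domain.lower()]
    while queue and len(seen) < MAX_SPF_LOOKUPS:
        current = queue.pop(0)
        if current in seen:
            continue  # already fetched: this is the loop detection
        seen.add(current)
        record = get_spf(current)
        if not record:
            continue
        for term in record.lower().split():
            term = term.lstrip("+-~?")  # drop an optional SPF qualifier
            if term.startswith("include:"):
                target = term[len("include:"):]
            elif term.startswith("redirect="):
                target = term[len("redirect="):]
            else:
                continue
            targets.append(target)
            queue.append(target)
    return targets


def classify(mx_hosts: list[str], domain: str,
             get_spf: Callable[[str], Optional[str]]) -> tuple[str, Optional[str]]:
    """Return (category, evidence); an MX keyword match wins over SPF."""
    gateway = False
    for host in mx_hosts:
        provider = match_keywords(host, MX_KEYWORDS)
        if provider:
            return provider, "mx"
        if any(marker in host.lower() for marker in GATEWAY_MARKERS):
            gateway = True
    for target in expand_spf(domain, get_spf):
        provider = match_keywords(target, SPF_KEYWORDS)
        if provider:
            return provider, ("gateway+spf" if gateway else "spf")
    return "unknown", None


# Made-up records: a gateway MX whose SPF chain loops back on itself.
FAKE_SPF = {
    "example-gemeinde.ch": "v=spf1 include:spf.gateway.example ~all",
    "spf.gateway.example": "v=spf1 include:spf.protection.outlook.com "
                           "include:example-gemeinde.ch -all",
}
print(classify(["mx1.trendmicro.example"], "example-gemeinde.ch", FAKE_SPF.get))
# prints ('microsoft', 'gateway+spf')
```

Injecting the lookup as a callback keeps the decision logic testable without network access. The later steps (CNAME chains, ASN, autodiscover, scraping) are left out here.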
Key design decisions:
- MX takes precedence over SPF
- Gateway + SPF expansion is critical: many municipalities use security appliances that mask the real provider
- Three independent DNS resolvers (system, Google, Cloudflare) for resilience
- Confidence scoring (0-100) with quality gates (avg ≥ 70, ≥ 80% high-confidence)
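As a rough illustration of the quality gate, the check could look like the sketch below. The function name is hypothetical, and the cutoff for what counts as "high-confidence" (80 here) is an assumption, since only the two gate thresholds are stated above.

```python
from statistics import mean


def passes_quality_gate(scores: list[float],
                        min_avg: float = 70.0,
                        min_high_share: float = 0.80,
                        high_threshold: float = 80.0) -> bool:
    """Check both gates: average score and share of high-confidence results.

    high_threshold is an assumed cutoff for "high-confidence".
    """
    if not scores:
        return False  # nothing classified, so the gate cannot pass
    high_share = sum(s >= high_threshold for s in scores) / len(scores)
    return mean(scores) >= min_avg and high_share >= min_high_share


print(passes_quality_gate([90, 85, 95, 80, 40]))  # prints True
print(passes_quality_gate([90, 90, 50, 50]))      # prints False
```

The second run fails even though its average is exactly 70, because only half of the scores reach the assumed high-confidence cutoff.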
Results land in 7 categories: microsoft, google, aws, infomaniak, swiss-isp, self-hosted, unknown.
Where I'd especially appreciate feedback:
- Do you think this is a good approach?
- Are there MX/SPF patterns I'm missing for common provider setups?
- Edge cases where gateway detection could misattribute the backend?
- Are there better heuristics than autodiscover for detecting hyperscaler usage behind ISP relays?
- Would you rather introduce a new category "uncertain" instead? If so, for which cases?
Thanks!