Classifying email providers of 2000+ Swiss municipalities via DNS, looking for feedback on methodology

I built a pipeline and map that classifies where Swiss municipalities host their email by probing public DNS records. I wanted to find out how many of them use Microsoft 365 or other US clouds, using only public data:

screenshot of map

The classification uses a hierarchical decision tree:

  1. MX record keyword matching (highest priority): direct hostname patterns for Microsoft 365 (mail.protection.outlook.com), Google Workspace (aspmx.l.google.com), AWS SES, and Infomaniak (a Swiss provider)
  2. CNAME chain resolution on MX hostnames: follows aliases to detect providers hidden behind vanity hostnames
  3. Gateway detection: identifies security appliances (e.g. Trend Micro) by MX hostname, then falls through to SPF to identify the actual backend provider
  4. Recursive SPF resolution: follows include: and redirect= chains (with loop detection, max 10 lookups) to expand the full SPF tree and match provider keywords
  5. ASN lookup via Team Cymru DNS: maps MX server IPs to autonomous systems to detect Swiss ISP relay hosting (SWITCH, Swisscom, Sunrise, etc.). For these, autodiscover is checked to see if a hyperscaler is actually behind the relay.
  6. Autodiscover probing (CNAME + _autodiscover._tcp SRV): fallback to detect hidden Microsoft 365 usage behind self-hosted or ISP-relayed MX
  7. Website scraping as last resort: probes /kontakt, /contact, /impressum pages, extracts email addresses (including decrypting TYPO3 obfuscated mailto links), then classifies the email domain's infrastructure
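
To make steps 1, 3 and 4 concrete, here is a minimal sketch of how MX keyword matching, the gateway fall-through and the recursive SPF expansion could fit together. It is not the actual pipeline code: the keyword tables, the function names and the `FAKE_TXT` records are made up for illustration, and DNS is replaced by an in-memory dict so the example runs offline. The lookup budget counts every TXT fetch (including the first), which is only one possible reading of "max 10 lookups".

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

# Illustrative keyword tables (assumptions, not the pipeline's real patterns).
PROVIDER_KEYWORDS: Dict[str, List[str]] = {
    "microsoft": ["mail.protection.outlook.com", "spf.protection.outlook.com"],
    "google": ["aspmx.l.google.com", "_spf.google.com"],
    "aws": ["amazonses.com", "amazonaws.com"],
    "infomaniak": ["infomaniak"],
}
GATEWAY_KEYWORDS = ["trendmicro"]  # assumed hostname fragment for one gateway vendor
MAX_SPF_LOOKUPS = 10

TxtLookup = Callable[[str], List[str]]  # domain -> list of TXT strings


def match_provider(text: str) -> Optional[str]:
    """Return the first provider whose keyword occurs in the text."""
    lowered = text.lower()
    for provider, keywords in PROVIDER_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return provider
    return None


def expand_spf(domain: str, get_txt: TxtLookup) -> List[str]:
    """Collect SPF records reachable via include:/redirect=, breadth first."""
    seen: Set[str] = set()
    queue = [domain.lower()]
    records: List[str] = []
    lookups = 0
    while queue and lookups < MAX_SPF_LOOKUPS:
        current = queue.pop(0)
        if current in seen:
            continue  # loop detection: never fetch the same domain twice
        seen.add(current)
        lookups += 1
        for txt in get_txt(current):
            if not txt.lower().startswith("v=spf1"):
                continue
            records.append(txt)
            for term in txt.split()[1:]:
                term = term.lower().lstrip("+-~?")  # drop the qualifier
                if term.startswith("include:"):
                    queue.append(term[len("include:"):])
                elif term.startswith("redirect="):
                    queue.append(term[len("redirect="):])
    return records


def classify(mx_hosts: List[str], domain: str, get_txt: TxtLookup) -> Tuple[str, str]:
    """Return (category, evidence); MX matches win over SPF matches."""
    via_gateway = False
    for host in mx_hosts:
        provider = match_provider(host)
        if provider:
            return provider, "mx"
        if any(keyword in host.lower() for keyword in GATEWAY_KEYWORDS):
            via_gateway = True
    for record in expand_spf(domain, get_txt):
        provider = match_provider(record)
        if provider:
            return provider, "gateway+spf" if via_gateway else "spf"
    return "unknown", "none"


# Made-up records, including a deliberate include loop.
FAKE_TXT: Dict[str, List[str]] = {
    "example-gemeinde.ch": ["v=spf1 include:spf.example-relay.ch -all"],
    "spf.example-relay.ch": [
        "v=spf1 include:spf.protection.outlook.com include:example-gemeinde.ch ~all"
    ],
    "spf.protection.outlook.com": ["v=spf1 ip4:192.0.2.0/24 -all"],
}


def fake_txt(domain: str) -> List[str]:
    return FAKE_TXT.get(domain, [])
```

With these fake records, `classify(["mx1.trendmicro.example"], "example-gemeinde.ch", fake_txt)` returns `("microsoft", "gateway+spf")`: the MX only reveals the gateway, and the provider comes from the expanded SPF tree. Returning the evidence alongside the category makes it easier to audit which step produced each result.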

Key design decisions:

  • MX takes precedence over SPF
  • Gateway + SPF expansion is critical: many municipalities use security appliances that mask the real provider
  • Three independent DNS resolvers (system, Google, Cloudflare) for resilience
  • Confidence scoring (0–100) with quality gates (avg ≥ 70, ≥ 80% high-confidence)
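
As a sketch of what the quality gates amount to, the check below takes a list of per-municipality confidence scores and tests both conditions. The function name and the `high_cutoff` of 80 are assumptions; the post does not say which score counts as "high-confidence", so that value is a placeholder to adjust.

```python
from statistics import mean
from typing import Sequence


def passes_quality_gates(
    scores: Sequence[float],
    min_average: float = 70.0,    # gate 1: average score must be at least 70
    min_high_share: float = 0.80, # gate 2: at least 80% of scores are "high"
    high_cutoff: float = 80.0,    # ASSUMPTION: what counts as high-confidence
) -> bool:
    """Return True only if both quality gates hold for the given scores."""
    if not scores:
        return False  # an empty run should not pass silently
    average = mean(scores)
    high_share = sum(score >= high_cutoff for score in scores) / len(scores)
    return average >= min_average and high_share >= min_high_share
```

For example, scores of 90, 85, 95, 80 and 40 pass (average 78, four of five at or above the cutoff), while 90, 90, 60, 60 and 90 fail the second gate despite the same average.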

Results land in 7 categories: microsoft, google, aws, infomaniak, swiss-isp, self-hosted, unknown.
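
The swiss-isp category hinges on the ASN step, so here is a rough sketch of that part. Team Cymru's DNS interface expects the IPv4 octets reversed under origin.asn.cymru.com and answers with a pipe-separated TXT record whose first field is the AS number. The helper names are hypothetical, the AS numbers in the table are my own illustrative picks for SWITCH, Swisscom and Sunrise and should be verified, and the sample answer in the usage note is invented in the real format.

```python
import ipaddress
from typing import Dict, Optional

# ASSUMPTION: illustrative AS numbers, verify before relying on them.
SWISS_ISP_ASNS: Dict[int, str] = {559: "SWITCH", 3303: "Swisscom", 6730: "Sunrise"}


def cymru_query_name(ip: str) -> str:
    """Build the TXT query name for an IPv4 address (IPv6 is out of scope here)."""
    address = ipaddress.ip_address(ip)
    if address.version != 4:
        raise ValueError("this sketch only handles IPv4")
    reversed_octets = ".".join(reversed(str(address).split(".")))
    return f"{reversed_octets}.origin.asn.cymru.com"


def parse_cymru_asn(txt: str) -> int:
    """Extract the first AS number from an 'ASN | prefix | CC | registry | date' answer."""
    first_field = txt.strip().strip('"').split("|")[0]
    return int(first_field.split()[0])  # the field may list several ASNs


def swiss_isp_for(txt: str) -> Optional[str]:
    """Return the ISP name if the answer's ASN is in the Swiss ISP table."""
    return SWISS_ISP_ASNS.get(parse_cymru_asn(txt))
```

For an MX at 192.0.2.25 the query name would be `25.2.0.192.origin.asn.cymru.com`, and an answer shaped like `559 | 192.0.2.0/24 | CH | ripencc | 2000-01-01` would map to the SWITCH entry and therefore to swiss-isp, unless autodiscover then points to a hyperscaler.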

Where I'd especially appreciate feedback:

  • Do you think this is a good approach?
  • Are there MX/SPF patterns I'm missing for common provider setups?
  • Edge cases where gateway detection could misattribute the backend?
  • Are there better heuristics than autodiscover for detecting hyperscaler usage behind ISP relays?
  • Would you introduce a separate "uncertain" category, and if so, for which cases?

Thanks!

submitted by /u/dfhsr