DNS Troubleshooting

A five-step methodology for systematic DNS debugging, working from the client to the authoritative server, plus common failure patterns.

Systematic DNS Debugging

DNS troubleshooting follows a pattern: identify which layer is broken (client, resolver, authoritative), then isolate the specific failure. Work from the client outward.

Step 1: Is DNS Reachable?

Test basic connectivity to the resolver
ping -c1 -W1 10.50.1.90 && echo "Reachable" || echo "Unreachable"
Test port 53 specifically — TCP and UDP
# UDP (normal queries)
dig @10.50.1.90 inside.domusdigitalis.dev A +short +timeout=2

# TCP (zone transfers, large responses)
dig @10.50.1.90 inside.domusdigitalis.dev A +short +tcp +timeout=2

If ping works but dig fails, the host is up but port 53 is not answering: a firewall is blocking it, or named is not running or not listening on that address.
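The three checks above can be folded into one verdict. This is a minimal sketch, assuming the exit codes of ping and the two dig runs are captured first; `diagnose_reachability` is a hypothetical helper name, not part of any tool. It relies on dig exiting non-zero when no server replies.

```shell
# Hypothetical helper: turn three exit codes into one diagnosis.
# Arguments: ping exit code, dig (UDP) exit code, dig +tcp exit code.
diagnose_reachability() {
    ping_rc=$1 udp_rc=$2 tcp_rc=$3
    if [ "$udp_rc" -eq 0 ] && [ "$tcp_rc" -eq 0 ]; then
        echo "OK: port 53 answers on UDP and TCP"
    elif [ "$udp_rc" -eq 0 ]; then
        echo "TCP 53 blocked or not listening"
    elif [ "$tcp_rc" -eq 0 ]; then
        echo "UDP 53 blocked"
    elif [ "$ping_rc" -eq 0 ]; then
        echo "Host up, port 53 unreachable: check firewall and named listener"
    else
        echo "Host unreachable or ICMP filtered: check routing and the resolver IP"
    fi
}

# Usage against the resolver from the commands above:
#   ping -c1 -W1 10.50.1.90 >/dev/null 2>&1; p=$?
#   dig @10.50.1.90 inside.domusdigitalis.dev A +short +timeout=2 >/dev/null 2>&1; u=$?
#   dig @10.50.1.90 inside.domusdigitalis.dev A +short +tcp +timeout=2 >/dev/null 2>&1; t=$?
#   diagnose_reachability "$p" "$u" "$t"
```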

Check which resolver the system is using
cat /etc/resolv.conf
systemd-resolved — check the actual active resolver
resolvectl status | awk '/DNS Servers/'

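On systemd-based systems, /etc/resolv.conf often points at the systemd-resolved stub listener (127.0.0.53) rather than the upstream resolver. Always check resolvectl to see the servers actually in use.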
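A short sketch of how the two checks can be chained: read the nameserver lines, and fall back to resolvectl only when the file points at the stub listener. The helper names `list_nameservers` and `uses_resolved_stub` are hypothetical, and the path argument exists only so the helpers can be pointed at a different file.

```shell
# Hypothetical helper: print the nameserver addresses from a resolv.conf-style file.
# The path argument defaults to /etc/resolv.conf.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Hypothetical helper: succeeds if the file points at the systemd-resolved stub listener.
uses_resolved_stub() {
    list_nameservers "$1" | grep -qxF '127.0.0.53'
}

# Usage:
#   if uses_resolved_stub; then
#       resolvectl status | awk '/DNS Servers/'
#   else
#       list_nameservers
#   fi
```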

Step 2: Is the Record There?

Query the authoritative server directly — bypass all caches
dig @10.50.1.90 inside.domusdigitalis.dev A +norecurse

If the authoritative server returns the record, the problem is caching or resolver configuration. If it doesn’t, the zone file is wrong or the zone failed to load.
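To confirm that the answer really came from authoritative data, look for the aa flag in the header's flags line. A minimal sketch, with `is_authoritative` as a hypothetical helper that reads full dig output (not +short) on stdin:

```shell
# Hypothetical helper: succeeds if dig output on stdin carries the aa flag,
# e.g. a header line such as ";; flags: qr aa; QUERY: 1, ANSWER: 1, ...".
is_authoritative() {
    grep -Eq '^;; flags:[^;]* aa( |;)'
}

# Usage:
#   dig @10.50.1.90 inside.domusdigitalis.dev A +norecurse | is_authoritative \
#       && echo "authoritative answer" \
#       || echo "not authoritative: answer is cached or forwarded"
```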

Check if the record exists in the zone file
sudo awk '/10\.50\.1\.20/' /var/named/inside.domusdigitalis.dev.zone
Validate the zone file for errors
sudo named-checkzone inside.domusdigitalis.dev /var/named/inside.domusdigitalis.dev.zone

Common zone file errors: missing trailing dot on FQDNs, duplicate records, CNAME at a name with other records, stale serial number.
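A stale serial is easiest to spot by comparing the serial named actually serves with the one in the zone file you just edited. A small sketch, assuming the usual `dig +short SOA` layout (primary name, contact, serial, then the four timers); `soa_serial` is a hypothetical helper name.

```shell
# Hypothetical helper: print the serial (third field) from `dig +short SOA`
# output on stdin. Lines with fewer than seven fields are ignored.
soa_serial() {
    awk 'NF >= 7 { print $3; exit }'
}

# Usage:
#   served=$(dig @10.50.1.90 inside.domusdigitalis.dev SOA +short | soa_serial)
#   echo "Serving serial: $served"
#   # Compare with the serial in /var/named/inside.domusdigitalis.dev.zone
```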

Step 3: Is the Cache Stale?

Check TTL — how long until the cache expires?
dig @10.50.1.90 inside.domusdigitalis.dev A +noall +answer

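The second column is the TTL in seconds. For a cached answer it counts down toward expiry; an answer served from authoritative data shows the full TTL from the zone. If you changed a record but the old answer persists, wait for TTL expiry or flush.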
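One way to tell a cached answer from a fresh one is to read the TTL twice, a few seconds apart, and see whether it drops. A minimal sketch, assuming the `+noall +answer` layout of name, TTL, class, type, data; `answer_ttl` is a hypothetical helper name.

```shell
# Hypothetical helper: print the TTL (second field) of the first answer
# record found in dig +noall +answer output on stdin.
answer_ttl() {
    awk '$3 == "IN" { print $2; exit }'
}

# Usage:
#   t1=$(dig @10.50.1.90 inside.domusdigitalis.dev A +noall +answer | answer_ttl)
#   sleep 5
#   t2=$(dig @10.50.1.90 inside.domusdigitalis.dev A +noall +answer | answer_ttl)
#   [ "$t2" -lt "$t1" ] && echo "cached: TTL is counting down" \
#                       || echo "TTL unchanged: answer is not aging in a cache"
```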

Flush and re-query
sudo rndc flush
dig @10.50.1.90 inside.domusdigitalis.dev A +short
Flush client-side cache too
# systemd-resolved
sudo resolvectl flush-caches
# (older systemd releases: sudo systemd-resolve --flush-caches)

# nscd
sudo nscd -i hosts

Step 4: Forward and Reverse Match

Validate forward/reverse consistency — required by RADIUS, Kerberos, SMTP
# Forward
dig ise-01.inside.domusdigitalis.dev A +short

# Reverse
dig -x 10.50.1.20 +short

The forward lookup should return 10.50.1.20. The reverse lookup should return ise-01.inside.domusdigitalis.dev. Mismatches cause ISE authentication failures and Kerberos ticket issues.
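When scripting this comparison, note that dig prints PTR targets with a trailing root dot and that DNS names are case-insensitive, so a plain string comparison can report a false mismatch. A small sketch; `ptr_matches` is a hypothetical helper name.

```shell
# Hypothetical helper: compare an expected hostname with PTR output,
# ignoring the trailing root dot and letter case.
# $1: expected name, $2: name returned by dig -x
ptr_matches() {
    expected=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
    actual=$(printf '%s' "$2" | tr 'A-Z' 'a-z')
    [ "${expected%.}" = "${actual%.}" ]
}

# Usage:
#   ptr_matches ise-01.inside.domusdigitalis.dev "$(dig -x 10.50.1.20 +short)" \
#       && echo "PTR OK" || echo "PTR mismatch"
```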

Bulk forward/reverse check for a subnet
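for ip in 10.50.1.{1..254}; do
    # Reverse lookup first: PTR name for this address (first answer only)
    ptr=$(dig -x "$ip" +short 2>/dev/null | head -n1)
    if [[ -n "$ptr" ]]; then
        # Forward lookup of that name; it may return several addresses
        addrs=$(dig "$ptr" A +short 2>/dev/null)
        grep -qxF -- "$ip" <<<"$addrs" || printf "MISMATCH: %s -> %s -> %s\n" "$ip" "$ptr" "${addrs//$'\n'/,}"
    fi
done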

Step 5: Delegation and Forwarding

Trace the delegation chain — find where it breaks
dig example.com +trace +nodnssec

If the trace stops at a referral that returns SERVFAIL or REFUSED, the delegation is broken at that level.

Test if forwarding is working
# Should resolve via AD DNS forwarder
dig @10.50.1.90 home-dc01.inside.domusdigitalis.dev A +short

# Should resolve via public DNS forwarder
dig @10.50.1.90 google.com A +short

If internal names resolve but public names don’t, the public forwarder is unreachable. If neither works, named isn’t running or the listener config is wrong.

Common Failure Patterns

SERVFAIL — the resolver tried but failed
dig example.com A
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL

Causes: DNSSEC validation failure, upstream server unreachable, zone file syntax error on the authoritative server. Try +cd to bypass DNSSEC validation and isolate the cause.
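The +cd comparison can be written down as a simple rule: if the normal query returns SERVFAIL and the same query with +cd does not, validation is the likely culprit. A minimal sketch; `dnssec_verdict` is a hypothetical helper, and the sed expression in the usage lines is just one way to pull the status out of dig's header.

```shell
# Hypothetical helper: compare the status with and without +cd.
# $1: status from the normal query, $2: status from the +cd query
dnssec_verdict() {
    if [ "$1" = "SERVFAIL" ] && [ "$2" != "SERVFAIL" ]; then
        echo "Likely DNSSEC validation failure"
    elif [ "$1" = "SERVFAIL" ]; then
        echo "Not DNSSEC: check upstream reachability and the authoritative zone"
    else
        echo "No SERVFAIL to explain"
    fi
}

# Usage:
#   normal=$(dig example.com A     | sed -n 's/.*status: \([A-Z]*\).*/\1/p')
#   nocheck=$(dig example.com A +cd | sed -n 's/.*status: \([A-Z]*\).*/\1/p')
#   dnssec_verdict "$normal" "$nocheck"
```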

REFUSED — the server won’t answer this query
dig @ns1.example.com example.com A
;; ->>HEADER<<- opcode: QUERY, status: REFUSED

Causes: client IP outside allow-query or allow-recursion ACL, recursion disabled and server is not authoritative for the zone.

NXDOMAIN — the name does not exist
dig nonexistent.example.com A
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN

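Causes: typo in the hostname, missing record in zone file, wrong zone being queried. Verify the zone file and check for missing trailing dots: a fully qualified name written without the trailing dot gets the zone origin appended, so host.example.com becomes host.example.com.example.com.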

Timeout — no response at all
dig @10.50.1.90 example.com +timeout=2
;; connection timed out; no servers could be reached

Causes: named not running, firewall blocking UDP/TCP 53, wrong IP in the query, named not listening on that interface.
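The four patterns above lend themselves to a small triage helper that reads the status from dig's header and suggests the first thing to check. This is a sketch, not a complete classifier; `dig_status` and `next_check` are hypothetical names, and the suggestions only restate the causes listed above.

```shell
# Hypothetical helper: print the status from dig output on stdin, or
# NO_RESPONSE when no header line is present (for example, after a timeout).
dig_status() {
    status=$(sed -n 's/^;; ->>HEADER<<-.*status: \([A-Z]*\).*/\1/p' | head -n 1)
    echo "${status:-NO_RESPONSE}"
}

# Hypothetical helper: map a status to the first thing to check.
next_check() {
    case "$1" in
        SERVFAIL)    echo "retry with +cd, then check upstream reachability" ;;
        REFUSED)     echo "check allow-query and allow-recursion ACLs" ;;
        NXDOMAIN)    echo "check spelling, the zone file, and which zone is queried" ;;
        NO_RESPONSE) echo "check that named is running and port 53 is open" ;;
        *)           echo "no known failure pattern for status $1" ;;
    esac
}

# Usage:
#   next_check "$(dig @10.50.1.90 example.com A +timeout=2 | dig_status)"
```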

BIND Server-Side Diagnostics

Check if named is running
systemctl status named
Check named error logs
sudo journalctl -u named --since "10 min ago" --no-pager
Enable debug logging temporarily
sudo rndc trace 1
# reproduce the issue
sudo rndc notrace
Validate all configuration
sudo named-checkconf /etc/named.conf && echo "Config OK"
sudo named-checkzone inside.domusdigitalis.dev /var/named/inside.domusdigitalis.dev.zone && echo "Zone OK"

Quick Diagnostic Pipeline

One-command DNS health check
# Sections are joined with ';' rather than '&&' so one failed check does not skip the rest
echo "=== Resolver ==="; awk '/^nameserver/' /etc/resolv.conf
echo "=== Forward ===";  dig @10.50.1.90 inside.domusdigitalis.dev A +short
echo "=== Reverse ===";  dig @10.50.1.90 -x 10.50.1.20 +short
echo "=== External ==="; dig @10.50.1.90 google.com A +short
echo "=== Named ===";    systemctl is-active named

See Also

  • dig — the primary debugging tool

  • Caching — cache flushing procedures

  • named — server-side diagnostics