DNS Troubleshooting

Systematic DNS debugging. Five-step methodology from client to authoritative server, plus common failure patterns.

Systematic DNS Debugging

DNS troubleshooting follows a pattern: identify which layer is broken (client, resolver, authoritative), then isolate the specific failure. Work from the client outward.

Step 1: Is DNS Reachable?

Test basic connectivity to the resolver

ping -c1 -W1 10.50.1.90 && echo "Reachable" || echo "Unreachable"

Test port 53 specifically — TCP and UDP

# UDP (normal queries)
dig @10.50.1.90 inside.domusdigitalis.dev A +short +timeout=2

# TCP (zone transfers, large responses)
dig @10.50.1.90 inside.domusdigitalis.dev A +short +tcp +timeout=2

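If ping works but dig times out, suspect a firewall blocking port 53, or named not running or not listening on that address. A failed ping on its own proves little, because many networks drop ICMP while still passing DNS.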
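The UDP and TCP probes above can be wrapped so each transport is reported on its own line. This is a minimal sketch, not part of the original procedure: the function name check_port53 and the DIG override variable are hypothetical, added only so the probe can be pointed at a stub. Note that dig exits 0 whenever it receives any reply (including NXDOMAIN), so "ok" here means "the server answered", not "the record exists".

```shell
# check_port53 SERVER NAME
# Probe DNS over UDP and TCP and report each transport separately.
# DIG is a hypothetical override for the dig binary (defaults to dig).
check_port53() {
    local server="$1" name="$2" proto flag
    for proto in udp tcp; do
        flag="+notcp"
        [ "$proto" = tcp ] && flag="+tcp"
        # +time=2 +tries=1: fail fast instead of waiting for retries
        if "${DIG:-dig}" "@$server" "$name" A +short "$flag" +time=2 +tries=1 >/dev/null 2>&1; then
            echo "$proto: ok"
        else
            echo "$proto: FAILED"
        fi
    done
}

# Usage: check_port53 10.50.1.90 inside.domusdigitalis.dev
```

If UDP fails and TCP works (or the reverse), the firewall rule is probably protocol-specific.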

Check which resolver the system is using

cat /etc/resolv.conf

systemd-resolved — check the actual active resolver

resolvectl status | awk '/DNS Servers/'

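On systems running systemd-resolved, /etc/resolv.conf usually lists only the local stub listener (127.0.0.53), not the upstream servers that are actually queried. Always check resolvectl on those systems.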
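A small helper can make the stub case obvious when reading resolv.conf. The sketch below is illustrative only: list_nameservers is a hypothetical name, and the optional file argument exists so it can be run against a copy of the file.

```shell
# list_nameservers [FILE]
# Print each nameserver from a resolv.conf-style file and flag the
# systemd-resolved stub address, which hides the real upstream servers.
list_nameservers() {
    awk '$1 == "nameserver" {
        note = ($2 == "127.0.0.53") ? "  <- systemd-resolved stub, check resolvectl status" : ""
        print $2 note
    }' "${1:-/etc/resolv.conf}"
}

# Usage: list_nameservers
```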

Step 2: Is the Record There?

Query the authoritative server directly — bypass all caches

dig @10.50.1.90 inside.domusdigitalis.dev A +norecurse

If the authoritative server returns the record, the problem is caching or resolver configuration. If it doesn’t, the record is missing from the zone data, or the zone file was edited but not reloaded.
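To confirm that an answer really came from authoritative data rather than a cache, look for the aa flag in dig's header. The following sketch reads full dig output on stdin; has_aa_flag is a hypothetical helper name, and it assumes the usual ";; flags: ..." header line that dig prints.

```shell
# has_aa_flag: read full dig output on stdin; succeed only if the
# response header carries the "aa" (authoritative answer) flag.
has_aa_flag() {
    awk '
        /^;; flags:/ {
            flags = $0
            sub(/^;; flags: */, "", flags)   # drop the label
            sub(/;.*/, "", flags)            # drop the counters after ";"
            if ((" " flags " ") ~ / aa /) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage (assumes 10.50.1.90 is authoritative for the zone):
# dig @10.50.1.90 inside.domusdigitalis.dev A +norecurse | has_aa_flag && echo "authoritative"
```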

Check if the record exists in the zone file

sudo awk '/10\.50\.1\.20([^0-9]|$)/' /var/named/inside.domusdigitalis.dev.zone

Validate the zone file for errors

sudo named-checkzone inside.domusdigitalis.dev /var/named/inside.domusdigitalis.dev.zone

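Common zone file errors: missing trailing dot on FQDNs, duplicate records, a CNAME at a name that has other records, and a serial number that was not incremented after an edit. named-checkzone catches syntax problems but cannot tell that a serial is stale.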
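The missing-trailing-dot error is easy to scan for. The sketch below is a rough heuristic, not a zone parser: find_missing_dots is a hypothetical name, it assumes record types are written in uppercase, and it strips comments naively without honoring quoted strings.

```shell
# find_missing_dots FILE
# Flag CNAME/NS/PTR/MX targets that contain a dot but do not end in one.
# Such names get the zone origin appended, which is rarely intended.
find_missing_dots() {
    awk '
        { sub(/;.*/, "") }    # strip comments (naive: ignores quoting)
        {
            for (i = 1; i < NF; i++) {
                if ($i ~ /^(CNAME|NS|PTR|MX)$/) {
                    # MX has a preference number before the target
                    target = ($i == "MX") ? $(i + 2) : $(i + 1)
                    if (target ~ /\./ && target !~ /\.$/)
                        printf "line %d: %s target %s has no trailing dot\n", NR, $i, target
                    next
                }
            }
        }
    ' "$1"
}

# Usage: sudo cat /var/named/inside.domusdigitalis.dev.zone > /tmp/zone && find_missing_dots /tmp/zone
```

Relative targets without any dot (for example a bare "web") are left alone, since those are usually deliberate.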

Step 3: Is the Cache Stale?

Check TTL — how long until the cache expires?

dig @10.50.1.90 inside.domusdigitalis.dev A +noall +answer

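The TTL is the second column, in seconds. A caching resolver shows the time remaining, so the value counts down between queries; an authoritative answer always shows the full configured TTL. If you changed a record but the old answer persists, wait for TTL expiry or flush.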
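Pulling the TTL out of the answer section makes it easier to compare two queries. A minimal sketch, with answer_ttls as a hypothetical helper name; it assumes the standard five-column answer format (name, TTL, class, type, data).

```shell
# answer_ttls: read `dig +noall +answer` output on stdin and print
# "name type ttl=N" for each record, skipping dig's ";" comment lines.
answer_ttls() {
    awk '!/^;/ && NF >= 5 { printf "%s %s ttl=%s\n", $1, $4, $2 }'
}

# Usage: dig @10.50.1.90 inside.domusdigitalis.dev A +noall +answer | answer_ttls
```

Running it twice a few seconds apart is a quick test: a value that decreases came from a cache, a value that stays fixed came from authoritative data.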

Flush and re-query

sudo rndc flush
dig @10.50.1.90 inside.domusdigitalis.dev A +short

Flush client-side cache too

# systemd-resolved (older releases: systemd-resolve --flush-caches)
sudo resolvectl flush-caches

# nscd
sudo nscd -i hosts

Step 4: Forward and Reverse Match

Validate forward/reverse consistency — required by RADIUS, Kerberos, SMTP

# Forward
dig ise-01.inside.domusdigitalis.dev A +short

# Reverse
dig -x 10.50.1.20 +short

The forward lookup should return 10.50.1.20. The reverse lookup should return ise-01.inside.domusdigitalis.dev. (dig prints fully qualified names with the trailing dot). Mismatches cause ISE authentication failures and Kerberos ticket issues.
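It helps to know what dig -x actually asks for: it reverses the octets of the address and queries the PTR record under in-addr.arpa. The sketch below builds that name for an IPv4 address; reverse_name is a hypothetical helper, shown only to make the mapping explicit.

```shell
# reverse_name IPV4
# Print the in-addr.arpa name that a PTR query for IPV4 looks up.
# Prints nothing if the argument does not have four dot-separated parts.
reverse_name() {
    echo "$1" | awk -F. 'NF == 4 { printf "%s.%s.%s.%s.in-addr.arpa.\n", $4, $3, $2, $1 }'
}

# Usage: dig "$(reverse_name 10.50.1.20)" PTR +short
```

This is also the owner name to look for when checking the reverse zone file by hand.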

Bulk forward/reverse check for a subnet

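for ip in 10.50.1.{1..254}; do
    # PTR lookup (IP -> name); drop dig's ";;" error lines, keep the first name
    name=$(dig -x "$ip" +short 2>/dev/null | grep -v '^;' | head -n1)
    if [[ -n "$name" ]]; then
        # A lookup (name -> IPs); the original IP must be among the answers
        addrs=$(dig "$name" A +short 2>/dev/null | grep -v '^;')
        grep -qxF "$ip" <<<"$addrs" ||
            printf "MISMATCH: %s -> %s -> %s\n" "$ip" "$name" "${addrs//$'\n'/,}"
    fi
done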

Step 5: Delegation and Forwarding

Trace the delegation chain — find where it breaks

dig example.com +trace +nodnssec

If the trace stops at a referral that returns SERVFAIL or REFUSED, the delegation is broken at that level.

Test if forwarding is working

# Should resolve via AD DNS forwarder
dig @10.50.1.90 home-dc01.inside.domusdigitalis.dev A +short

# Should resolve via public DNS forwarder
dig @10.50.1.90 google.com A +short

If internal names resolve but public names don’t, the public forwarder is unreachable. If neither works, named isn’t running or the listener config is wrong.
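The decision logic above can be written down as a small function. This is a sketch under assumptions: diagnose_forwarding and the DIG override variable are hypothetical names, and an empty +short answer is treated as "did not resolve". dig writes its ";; ..." error lines to stdout, so they are filtered out before testing for an answer.

```shell
# diagnose_forwarding SERVER INTERNAL_NAME EXTERNAL_NAME
# Query one internal and one public name and print the likely fault.
# DIG is a hypothetical override for the dig binary (defaults to dig).
diagnose_forwarding() {
    local server="$1" internal="$2" external="$3" int ext
    # grep -v '^;' drops dig error lines; || true keeps "no match" harmless
    int=$("${DIG:-dig}" "@$server" "$internal" A +short +time=2 +tries=1 2>/dev/null | grep -v '^;' || true)
    ext=$("${DIG:-dig}" "@$server" "$external" A +short +time=2 +tries=1 2>/dev/null | grep -v '^;' || true)
    if [ -n "$int" ] && [ -n "$ext" ]; then
        echo "both resolve: forwarding works"
    elif [ -n "$int" ]; then
        echo "internal only: check the public forwarder"
    elif [ -n "$ext" ]; then
        echo "external only: check the internal forwarder or zone"
    else
        echo "neither resolves: check that named is running and listening"
    fi
}

# Usage: diagnose_forwarding 10.50.1.90 home-dc01.inside.domusdigitalis.dev google.com
```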

Common Failure Patterns

SERVFAIL — the resolver tried but failed

dig example.com A
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL

Causes: DNSSEC validation failure, upstream server unreachable, zone file syntax error on the authoritative server. Try +cd to bypass DNSSEC validation and isolate the cause.

REFUSED — the server won’t answer this query

dig @ns1.example.com example.com A
;; ->>HEADER<<- opcode: QUERY, status: REFUSED

Causes: client IP outside allow-query or allow-recursion ACL, recursion disabled and server is not authoritative for the zone.

NXDOMAIN — the name does not exist

dig nonexistent.example.com A
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN

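Causes: typo in the hostname, missing record in zone file, wrong zone being queried. Verify the zone file, and check for a missing trailing dot: a name written without it has the zone origin appended, so www.example.com becomes www.example.com.example.com.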

Timeout — no response at all

dig @10.50.1.90 example.com +timeout=2
;; connection timed out; no servers could be reached

Causes: named not running, firewall blocking UDP/TCP 53, wrong IP in the query, named not listening on that interface.
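The four patterns above can be folded into one lookup that reads dig output and prints the matching causes. A minimal sketch: explain_dig is a hypothetical name, the cause text is condensed from this section, and the timeout case is detected only when no status line is present at all.

```shell
# explain_dig: read dig output on stdin, print the response status and
# the likely causes listed in this section.
explain_dig() {
    local out status
    out=$(cat)
    # Take the status from the header line, e.g. "status: SERVFAIL"
    status=$(printf '%s\n' "$out" | sed -n 's/.*status: \([A-Z]*\).*/\1/p' | head -n 1)
    if [ -z "$status" ]; then
        case "$out" in
            *"timed out"*) status=TIMEOUT ;;
        esac
    fi
    case "$status" in
        NOERROR)  echo "NOERROR: query succeeded" ;;
        SERVFAIL) echo "SERVFAIL: DNSSEC validation failure, unreachable upstream, or zone syntax error; retry with +cd" ;;
        REFUSED)  echo "REFUSED: client outside allow-query/allow-recursion, or recursion off and server not authoritative" ;;
        NXDOMAIN) echo "NXDOMAIN: typo, missing record, or wrong zone" ;;
        TIMEOUT)  echo "TIMEOUT: named down, firewall on port 53, wrong IP, or named not listening there" ;;
        *)        echo "UNKNOWN: no status line found" ;;
    esac
}

# Usage: dig @10.50.1.90 example.com A +time=2 | explain_dig
```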

BIND Server-Side Diagnostics

Check if named is running

systemctl status named

Check named error logs

sudo journalctl -u named --since "10 min ago" --no-pager

Enable debug logging temporarily

sudo rndc trace 1
# reproduce the issue
sudo rndc notrace
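Debug logging is easy to leave switched on by accident. One possible guard is a wrapper that always runs notrace after the reproduction command, whatever its result. This is a sketch: with_named_trace and the RNDC override variable are hypothetical names, not part of BIND.

```shell
# with_named_trace CMD [ARGS...]
# Raise named's debug level, run CMD to reproduce the issue, then turn
# tracing off again even when CMD fails. Returns CMD's exit status.
# RNDC is a hypothetical override (default: "sudo rndc"); it is expanded
# unquoted on purpose so that "sudo rndc" splits into two words.
with_named_trace() {
    local rc=0
    ${RNDC:-sudo rndc} trace 1 || return 1
    "$@" || rc=$?
    ${RNDC:-sudo rndc} notrace
    return "$rc"
}

# Usage: with_named_trace dig @10.50.1.90 inside.domusdigitalis.dev A
```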

Validate all configuration

sudo named-checkconf /etc/named.conf && echo "Config OK"
sudo named-checkzone inside.domusdigitalis.dev /var/named/inside.domusdigitalis.dev.zone && echo "Zone OK"

Quick Diagnostic Pipeline

One-command DNS health check

# ";" rather than "&&": a failing check must not skip the ones after it
echo "=== Resolver ==="; awk '/^nameserver/' /etc/resolv.conf
echo "=== Forward ===";  dig @10.50.1.90 inside.domusdigitalis.dev A +short
echo "=== Reverse ===";  dig @10.50.1.90 -x 10.50.1.20 +short
echo "=== External ==="; dig @10.50.1.90 google.com A +short
echo "=== Named ===";    systemctl is-active named
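For unattended use, the same checks can report PASS or FAIL instead of raw output. The sketch below is one way to do that, with assumptions stated: run_check and the FAILED counter are hypothetical names, and a check passes only when the command exits 0 and prints something other than dig's ";; ..." error lines (dig +short exits 0 even when the answer is empty).

```shell
# run_check LABEL CMD [ARGS...]
# Run one check and print PASS or FAIL with its label. A check passes
# only if CMD exits 0 and prints at least one line not starting with ";".
# Failures are counted in the global FAILED.
FAILED=0
run_check() {
    local label="$1" out
    shift
    if out=$("$@" 2>/dev/null) && [ -n "$(printf '%s\n' "$out" | grep -v '^;')" ]; then
        echo "PASS  $label"
    else
        echo "FAIL  $label"
        FAILED=$((FAILED + 1))
    fi
}

# Usage:
# run_check "forward" dig @10.50.1.90 inside.domusdigitalis.dev A +short
# run_check "reverse" dig @10.50.1.90 -x 10.50.1.20 +short
# run_check "named"   systemctl is-active named
# echo "$FAILED check(s) failed"
```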