Vault Troubleshooting Runbook
Overview
vault-01.inside.domusdigitalis.dev (10.50.1.60) is the central certificate management server. It runs:
- HashiCorp Vault PKI: internal certificates (EAP-TLS, servers)
- Certbot: Let’s Encrypt certificates via Cloudflare DNS-01
Vault is currently a single point of failure. See Vault Enterprise Hardening for HA cluster deployment.
Architecture Reference
- Certbot Renewal Flow: full renewal process with error states
- PKI Hierarchy: certificate authority chain
- Infrastructure Overview: system topology
Current Certificates
# List all managed certificates
sudo certbot certificates
| Certificate Name | Domains | Days Left | Status |
|---|---|---|---|
| 9800-wlc-01.inside.domusdigitalis.dev | 9800-wlc-01.inside.domusdigitalis.dev | ~90 | OK |
| guest.domusdigitalis.dev | guest.domusdigitalis.dev | ~90 | BROKEN - Path issue |
| ipmi-01.inside.domusdigitalis.dev | ipmi-01.inside.domusdigitalis.dev | ~90 | OK |
| kvm-01.inside.domusdigitalis.dev | kvm-01.inside.domusdigitalis.dev | ~90 | BROKEN - DNS NXDOMAIN |
Vault Sealed After Restart/Disruption
Vault seals automatically when it is restarted, when the VM is disrupted, or after power loss. This is by design (security). You must unseal it manually.
Symptoms
- vault-ssh-sign fails with 503 Vault is sealed
- SSH falls back to passphrase or YubiKey (no Vault cert)
- Any Vault API call fails
Cluster Details
| Property | Value |
|---|---|
| Storage Type | file |
| Cluster Name | vault-cluster-904d4b42 |
| Cluster ID | aff339e0-7b48-58ce-cb35-0ae05a230c97 |
| HA Enabled | false (single node) |
| Threshold | 2 of 3 unseal keys required |
Unseal Vault
Two of the three unseal keys are required (threshold 2):
vault operator unseal "$(gopass show -o v3/domains/d000/vault/unseal-key-1)"
vault operator unseal "$(gopass show -o v3/domains/d000/vault/unseal-key-2)"
vault status
# Expected in the output: Sealed false
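The seal state can also be checked from a script before and after unsealing. The sketch below relies on the exit codes of `vault status` (0 when unsealed, 2 when sealed, 1 on error). The function name `vault_state_from_exit_code` is hypothetical and not part of this runbook's tooling.

```shell
# Map the exit code of `vault status` to a readable state.
# Exit codes used by the Vault CLI: 0 = unsealed, 2 = sealed, 1 = error.
# vault_state_from_exit_code is a hypothetical helper name.
vault_state_from_exit_code() {
  case "$1" in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error" ;;
  esac
}

# Usage (assumes VAULT_ADDR points at vault-01):
#   vault status >/dev/null 2>&1; vault_state_from_exit_code "$?"
```

A result of `error` usually means the server is unreachable or `VAULT_ADDR` is wrong, which is a different problem from a sealed Vault.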
Prevention: Auto-Unseal (Future)
Phase 6 of Vault Enterprise Hardening will implement auto-unseal using one of:
- AWS KMS
- Azure Key Vault
- Transit secrets engine on another Vault
Until then, manual unseal is required after any Vault restart.
Known Issues and Fixes
CRITICAL: guest.domusdigitalis.dev - Wrong Cloudflare Path
Attempting to renew cert (guest.domusdigitalis.dev) from /etc/letsencrypt/renewal/guest.domusdigitalis.dev.conf produced an unexpected error:
dns_cloudflare_credentials: /root/.secrets/cloudflare.ini: No such file or directory
The certificate was issued with the credentials path /root/.secrets/cloudflare.ini, but the credentials are stored at /home/ansible/.secrets/cloudflare.ini, so every renewal fails before it reaches Cloudflare.
# 1. Check current config
sudo grep cloudflare /etc/letsencrypt/renewal/guest.domusdigitalis.dev.conf
# 2. Edit the renewal config
sudo vim /etc/letsencrypt/renewal/guest.domusdigitalis.dev.conf
# 3. Change:
# dns_cloudflare_credentials = /root/.secrets/cloudflare.ini
# To:
# dns_cloudflare_credentials = /home/ansible/.secrets/cloudflare.ini
# 4. Test renewal
sudo certbot renew --cert-name guest.domusdigitalis.dev --dry-run
# 5. If successful, force renewal
sudo certbot renew --cert-name guest.domusdigitalis.dev --force-renewal
To prevent this, always pass the absolute path to the ansible user’s secrets file when issuing a certificate:
certbot certonly \
--dns-cloudflare \
--dns-cloudflare-credentials /home/ansible/.secrets/cloudflare.ini \
-d guest.domusdigitalis.dev
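The same path mistake can exist in other renewal configs without being noticed until the next renewal. As a minimal sketch, the helpers below read the `dns_cloudflare_credentials` value from each renewal config and report any that point at a missing file. The function names are hypothetical, and the default directory is assumed from the paths used above.

```shell
# Print the credentials path recorded in one certbot renewal config.
# (hypothetical helper name)
cloudflare_credentials_path() {
  sed -n 's/^[[:space:]]*dns_cloudflare_credentials[[:space:]]*=[[:space:]]*//p' "$1" | head -n 1
}

# Report every renewal config in a directory whose credentials file is missing.
# Defaults to /etc/letsencrypt/renewal, the directory used in this runbook.
audit_renewal_credentials() {
  dir=${1:-/etc/letsencrypt/renewal}
  for conf in "$dir"/*.conf; do
    [ -e "$conf" ] || continue            # glob matched nothing
    path=$(cloudflare_credentials_path "$conf")
    [ -n "$path" ] || continue            # config does not use Cloudflare DNS
    [ -f "$path" ] || echo "MISSING: $conf -> $path"
  done
}

# Usage (run as root so the renewal directory and secrets are readable):
#   audit_renewal_credentials
```

No output means every config that names a credentials file points at one that exists.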
CRITICAL: kvm-01.inside.domusdigitalis.dev - DNS NXDOMAIN
Encountered NXDOMAIN when looking up TXT record: _acme-challenge.kvm-01.inside.domusdigitalis.dev
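Either the TXT record _acme-challenge.kvm-01.inside.domusdigitalis.dev was never created in the Cloudflare zone (for example, because the API token cannot edit it), or the propagation wait is too short for the record to be visible when Let’s Encrypt validates it.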
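To tell the two causes apart, the challenge record can be queried while a renewal attempt is in progress. The sketch below builds the record name and looks it up with `dig` against a public resolver. The function names and the choice of 1.1.1.1 are assumptions. The Cloudflare plugin removes the record after each attempt, so NXDOMAIN outside a run is expected.

```shell
# Build the DNS-01 challenge record name for a domain.
# A leading "*." is stripped because wildcard names validate at the base name.
# (hypothetical helper name)
acme_challenge_name() {
  printf '_acme-challenge.%s\n' "${1#\*.}"
}

# Query a public resolver for the TXT record (needs dig and network access).
# 1.1.1.1 is an assumed resolver; any public resolver works.
check_acme_txt() {
  dig +short TXT "$(acme_challenge_name "$1")" @1.1.1.1
}

# Usage, in a second terminal while certbot is waiting for propagation:
#   check_acme_txt kvm-01.inside.domusdigitalis.dev
```

If the record appears but only after the configured wait has elapsed, the propagation time is the problem. If it never appears, check the token permissions.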
# Edit renewal config
sudo vim /etc/letsencrypt/renewal/kvm-01.inside.domusdigitalis.dev.conf
# Add or increase:
dns_cloudflare_propagation_seconds = 60
# Check API token permissions
# Token needs: Zone:DNS:Edit for domusdigitalis.dev
# List zones accessible by token
curl -X GET "https://api.cloudflare.com/client/v4/zones" \
-H "Authorization: Bearer $(cat /home/ansible/.secrets/cloudflare.ini | grep api_token | cut -d= -f2 | tr -d ' ')" \
-H "Content-Type: application/json"
# Consider using wildcard for *.inside.domusdigitalis.dev
# This reduces individual cert management
certbot certonly \
--dns-cloudflare \
--dns-cloudflare-credentials /home/ansible/.secrets/cloudflare.ini \
-d "*.inside.domusdigitalis.dev" \
-d "inside.domusdigitalis.dev"
Deploy Hook Failures
/etc/letsencrypt/renewal-hooks/deploy/deploy-certs.sh
| Issue | Symptom | Fix |
|---|---|---|
| ISE API unreachable | | Check ISE is up, verify IP in secrets |
| age decryption failed | | Verify age key in |
| iPSK SSH failed | | Check iPSK VM is running, SSH key authorized |
| Missing environment variable | | Secrets not loaded, check age file path |
# Run deploy hook manually with debug
sudo bash -x /etc/letsencrypt/renewal-hooks/deploy/deploy-certs.sh
# Check what certificates triggered the hook
# (certbot sets these only inside the hook; they are empty in a normal shell)
echo "$RENEWED_LINEAGE"
echo "$RENEWED_DOMAINS"
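A missing variable is easier to diagnose when the hook fails early and names it. The guard below is a minimal sketch of that idea. `require_env` is a hypothetical helper and is not taken from deploy-certs.sh. `RENEWED_LINEAGE` and `RENEWED_DOMAINS` are the variables certbot exports to deploy hooks.

```shell
# Return non-zero and name each listed variable that is unset or empty.
# (hypothetical helper; pass only trusted variable names, since eval is used)
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage near the top of a deploy hook:
#   require_env RENEWED_LINEAGE RENEWED_DOMAINS || exit 1
```

Any secrets the hook expects after age decryption could be added to the same call, so a failed decryption surfaces as a named variable instead of a later API error.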
Systemd Timer Status
# Check timer status
systemctl status certbot-renew.timer
# Check recent runs
journalctl -u certbot-renew.service -n 50
# Force a renewal check
sudo systemctl start certbot-renew.service
Expected timer status:
certbot-renew.timer - Run certbot twice daily
Loaded: loaded (/lib/systemd/system/certbot-renew.timer; enabled)
Active: active (waiting)
Trigger: [next trigger time]
Full Renewal Test
# Dry-run all certificates
sudo certbot renew --dry-run
# Expected: All simulated renewals succeeded
# If failures, check each cert individually
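Checking each certificate individually can be scripted. The sketch below parses the `Certificate Name:` lines printed by `certbot certificates` and dry-runs each name separately. The function names are hypothetical.

```shell
# Extract certificate names from `certbot certificates` output on stdin.
# (hypothetical helper name)
cert_names() {
  sed -n 's/^[[:space:]]*Certificate Name:[[:space:]]*//p'
}

# Dry-run each certificate separately and report the ones that fail.
dry_run_each() {
  sudo certbot certificates 2>/dev/null | cert_names | while read -r name; do
    # </dev/null keeps certbot from consuming the list of names on stdin
    sudo certbot renew --cert-name "$name" --dry-run </dev/null >/dev/null 2>&1 \
      || echo "FAILED: $name"
  done
}

# Usage:
#   dry_run_each
```

For any name reported as failed, rerun the same `certbot renew --cert-name ... --dry-run` command without the output redirection to see the error.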
Emergency Procedures
Certificate About to Expire
# Force renewal immediately
sudo certbot renew --force-renewal
# If renewal fails, issue new cert
sudo certbot certonly \
--dns-cloudflare \
--dns-cloudflare-credentials /home/ansible/.secrets/cloudflare.ini \
-d [domain]
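Before forcing a renewal, it helps to confirm which certificates are actually close to expiry. The sketch below uses `openssl x509 -checkend`, which reports whether a certificate expires within a given number of seconds. The function name and the 14-day window in the usage line are assumptions.

```shell
# Succeed (exit 0) if the certificate expires within the given number of days.
# Returns 1 if it is still valid beyond that window, 2 if the file is unreadable.
# (hypothetical helper name)
expires_within_days() {
  cert=$1
  days=$2
  [ -r "$cert" ] || return 2
  # -checkend exits 0 when the cert is still valid after N seconds
  if openssl x509 -checkend "$(( days * 86400 ))" -noout -in "$cert" >/dev/null; then
    return 1
  fi
  return 0
}

# Usage (run as root to read the live directory):
#   expires_within_days /etc/letsencrypt/live/guest.domusdigitalis.dev/cert.pem 14 \
#     && echo "renew now"
```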
Deploy Hook Not Running
# Manually deploy after renewal
export RENEWED_LINEAGE="/etc/letsencrypt/live/[cert-name]"
export RENEWED_DOMAINS="[domain]"
sudo -E /etc/letsencrypt/renewal-hooks/deploy/deploy-certs.sh
ISE Guest Portal Cert Expired
# 1. Renew cert
sudo certbot renew --cert-name guest.domusdigitalis.dev --force-renewal
# 2. Manual deployment to ISE if hook fails
# Use netapi or direct API call
source <(dsec source d000 dev/network)
curl -sk -X POST "https://$<ise-pan-ip>/api/v1/certs/system-certificate/import" \
-H "Authorization: Basic $<ise-api-token>" \
-H "Content-Type: application/json" \
-d @- << EOF
{
"admin": false,
"eap": false,
"portal": true,
"portalGroupTag": "Default Portal Certificate Group",
"data": "$(base64 -w0 /etc/letsencrypt/live/guest.domusdigitalis.dev/fullchain.pem)",
"privateKeyData": "$(base64 -w0 /etc/letsencrypt/live/guest.domusdigitalis.dev/privkey.pem)"
}
EOF
Long-Term Solutions
Priority 1: Fix Current Failures
- Fix guest.domusdigitalis.dev cloudflare path
- Fix kvm-01.inside.domusdigitalis.dev DNS propagation/NXDOMAIN
- Verify all certs renew with --dry-run