INC-2026-04-13-001: Prevention

Prevention

Short-term (This Week)

Add nofail + use IP in ALL NAS mounts on kvm-02 — Done 2026-04-13. Replaced hostname nas-01 with IP 10.50.1.70, added nofail,x-systemd.device-timeout=10 to all three entries. kvm-01 fstab still needs the same fix.

# Fixed pattern (kvm-02 — applied):
10.50.1.70:/volume1/vms /mnt/nas/vms nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0
10.50.1.70:/volume1/isos /mnt/nas/isos nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0
10.50.1.70:/volume1/backups /mnt/nas/backups nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0

Fix kvm-01 fstab permanently — Four entries need correction:
- NAS mounts: replace nas-01 with 10.50.1.70, add nofail,x-systemd.device-timeout=10
- /dev/sdb1: change xfs to ext4, add nofail (this was the actual emergency mode blocker — filesystem type mismatch)
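
Before the next reboot, the corrected entries can be checked mechanically. The following is a minimal sketch, not an established tool: lint_fstab is a hypothetical helper name, and it assumes NFS sources are written as host:/path and that only IPv4 addresses count as "IP".

```shell
#!/usr/bin/env bash
# Sketch: flag NFS entries in an fstab file that lack nofail or _netdev,
# or that name the server by hostname instead of an IPv4 address.
# lint_fstab is a hypothetical helper, not part of any package.
lint_fstab() {
    local fstab="${1:-/etc/fstab}"
    awk '
        /^[ \t]*(#|$)/ { next }              # skip comments and blank lines
        $3 ~ /^nfs4?$/ {
            opts = "," $4 ","                # pad so each option matches as ,opt,
            if (opts !~ /,nofail,/)  print "missing nofail: " $1 " " $2
            if (opts !~ /,_netdev,/) print "missing _netdev: " $1 " " $2
            host = $1
            sub(/:.*/, "", host)             # keep only the server part of host:/path
            if (host !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/)
                print "hostname in source: " $1 " " $2
        }
    ' "$fstab"
}

# Example: lint_fstab /etc/fstab
```

No output means every NFS entry passed. This sketch does not look at local filesystems; findmnt --verify from util-linux may help catch a type mismatch like the /dev/sdb1 entry.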
Document the "never panic reboot" rule — kvm-01 VMs were running fine before the reboot. The ipsk-mgr-01 outage was a single VM issue, not a hypervisor issue. Rebooting the hypervisor killed 7 healthy VMs to troubleshoot 1. Correct response: virsh console <vm-name> to investigate the guest first.

Move WLC-02 qcow2 off NAS to local storage — VM images critical for HA must not depend on NAS availability.

# After NAS mount is restored (keep the VM shut off while copying so the image is consistent)
sudo cp /mnt/nas/vms/9800-WLC-02.qcow2 /var/lib/libvirt/images/
sudo virsh edit 9800-WLC-02
# Change <source file='/mnt/nas/vms/9800-WLC-02.qcow2'/> to
#        <source file='/var/lib/libvirt/images/9800-WLC-02.qcow2'/>
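
After the edit, it may be worth confirming that no disk still points at the NAS. The sketch below assumes the usual two-line header of virsh domblklist output and paths without spaces; disks_under is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: list VM disks whose source path starts with a given prefix.
# disks_under is a hypothetical helper; run as root or against the system URI.
disks_under() {
    local prefix="$1" vm
    shift
    for vm in "$@"; do
        # domblklist prints a header line and a separator, then "Target Source" rows
        virsh domblklist "$vm" | awk -v vm="$vm" -v p="$prefix" \
            'NR > 2 && index($2, p) == 1 { print vm ": " $2 }'
    done
}

# Example: check every defined VM for NAS-backed disks
# disks_under /mnt/nas/ $(virsh list --all --name)
```

An empty result for 9800-WLC-02 would indicate the domain no longer references /mnt/nas/.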

Verify VM autostart is configured — All critical VMs should autostart so a legitimate reboot doesn’t require manual intervention.

# On both KVMs
sudo virsh list --all --autostart
# Enable for any missing
sudo virsh autostart vyos-01
sudo virsh autostart bind-01
# ... etc
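
To avoid checking each VM by hand, the comparison against a list of critical VMs can be scripted. This is a sketch under assumptions: missing_autostart is a hypothetical helper, and the list of critical VMs is whatever the operator passes in.

```shell
#!/usr/bin/env bash
# Sketch: print which of the named VMs do NOT have autostart enabled.
# Uses `virsh list --all --no-autostart --name`; run as root.
missing_autostart() {
    local no_auto vm
    no_auto="$(virsh list --all --no-autostart --name)"
    for vm in "$@"; do
        # -F fixed string, -x whole-line match, so vyos-01 does not match vyos-010
        if printf '%s\n' "$no_auto" | grep -Fxq -- "$vm"; then
            printf '%s\n' "$vm"
        fi
    done
}

# Example: enable autostart for anything missing
# for vm in $(missing_autostart vyos-01 bind-01); do virsh autostart "$vm"; done
```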

Short-term (This Week) — Bridge PVID

Investigate and fix the br-mgmt PVID mismatch properly — The host is on PVID 1, VMs are on PVID 100. This is a pre-existing condition. The fix must be planned in a maintenance window with IPMI console ready. Requires understanding the full VLAN mapping: switch trunk native VLAN ↔ bridge PVID ↔ physical uplink PVID ↔ VM vnet PVID. Consult domus-infra-ops bridge design documentation before making any changes.
Re-add ISE-01 to switch RADIUS group — ISE-01 was temporarily removed from ISE-RADIUS group during incident. Re-add after kvm-01 recovery:
```
configure terminal
aaa group server radius ISE-RADIUS
 server name ISE-01
end
```

Long-term (This Quarter)

IPMI/BMC remote access — kvm-02 IPMI was available and saved the recovery when the bridge change made the host unreachable. Verify kvm-01 IPMI is also configured and accessible.

NFS mount monitoring — Add a cron job or systemd timer that checks NFS mount health and alerts (or remounts) before it becomes critical.

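# Simple NFS health check (timeout keeps a hung stale mount from blocking the check)
if ! timeout 10 mountpoint -q /mnt/nas/vms || [[ -z "$(timeout 10 ls -A /mnt/nas/vms/ 2>/dev/null)" ]]; then
    logger -p local0.crit "NFS mount /mnt/nas/vms/ is not mounted, empty, or unresponsive"
fi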

Eliminate NAS dependency for critical VM images — All production VM qcow2 images should be on local NVMe. NAS should be used only for backups and ISOs, not as primary VM storage. WLC-02 already moved to local storage during this incident.
Fix Te1/0/2 CRC errors — 386 CRC errors caused frame corruption. SFPs tested good. Replace fiber patch cable, clean connectors, re-test. kvm-01 temporarily on Te1/0/8. Move back to Te1/0/2 after cable fix confirmed (clear counters, monitor for new CRC).
Make bridge VLAN config persistent via nmcli — bridge vlan commands are ephemeral. Both kvm-01 and kvm-02 need nmcli connection modify br-mgmt bridge.vlan-default-pvid 0 bridge.vlans "100 pvid untagged, 10, 20, 30, 40, 110, 120" for persistence across reboots.
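
The nmcli change above could be wrapped so that both hosts receive the same value. This is a sketch only: build_bridge_vlans and persist_bridge_vlans are hypothetical helper names, and it assumes VLAN filtering is already enabled on the bridge (bridge.vlan-filtering), which the settings depend on.

```shell
#!/usr/bin/env bash
# Sketch: build the bridge.vlans string and store it in the NM connection profile.

# $1 = untagged PVID, remaining args = tagged VLAN IDs
build_bridge_vlans() {
    local pvid="$1" out v
    shift
    out="${pvid} pvid untagged"
    for v in "$@"; do
        out="${out}, ${v}"
    done
    printf '%s\n' "$out"
}

# $1 = connection name, $2 = PVID, remaining args = tagged VLAN IDs
persist_bridge_vlans() {
    local conn="$1"
    shift
    nmcli connection modify "$conn" \
        bridge.vlan-default-pvid 0 \
        bridge.vlans "$(build_bridge_vlans "$@")"
}

# Example (from the IPMI console, not over SSH):
# persist_bridge_vlans br-mgmt 100 10 20 30 40 110 120
# nmcli -g bridge.vlans connection show br-mgmt   # confirm the stored value
```

Modifying the profile only stores the setting. It takes effect when the connection is next activated, which briefly interrupts the bridge, so the out-of-band console rule applies here too.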
Document and drill the incident response runbook — This incident exposed two escalation patterns:
- Panic reboot of kvm-01 — killed 7 healthy VMs
- Untested bridge change over SSH — caused second outage
A runbook should define: when to reboot vs. investigate in-place, VM restart order (dependency chain), NAS mount troubleshooting, VRRP failover verification, and a rule to never modify bridge/network config over SSH without IPMI console ready
Add kvm-01 and kvm-02 to monitoring — Wazuh agents on both hypervisors should alert on: failed mount units, emergency mode entry, NFS stale mounts, VM state changes.
Resolve why bind-02, vault-02, vault-03 shut down overnight — These VMs were running on 2026-04-12 ~21:40 but were shut off by 2026-04-13 morning. Root cause unknown — check journalctl on kvm-02 and VM-level logs for shutdown events.
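
One possible starting point for that investigation is sketched below. It assumes libvirt's default per-domain log location (/var/log/libvirt/qemu/<vm>.log) and that libvirt writes a line containing "shutting down" when a domain stops; find_vm_shutdowns is a hypothetical helper, and the end of the journalctl window is a guess at "morning".

```shell
#!/usr/bin/env bash
# Sketch: print shutdown lines from per-domain QEMU logs.
# $1 = log directory, remaining args = VM names
find_vm_shutdowns() {
    local log_dir="$1" vm log
    shift
    for vm in "$@"; do
        log="${log_dir}/${vm}.log"
        if [ -r "$log" ]; then
            # -H prefixes each match with the file name
            grep -H 'shutting down' "$log" || printf '%s: no shutdown lines found\n' "$log"
        else
            printf '%s: not readable\n' "$log"
        fi
    done
}

# Example on kvm-02:
# find_vm_shutdowns /var/log/libvirt/qemu bind-02 vault-02 vault-03
# journalctl -u libvirtd --since "2026-04-12 21:40" --until "2026-04-13 09:00"
```

The timestamps on matching lines can then be correlated with the host journal for the same window.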
Debug libvirt VLAN hook on kvm-02 — Hook at /etc/libvirt/hooks/qemu did not fire on VM start. journalctl -t "libvirt-hook[vyos-02]" returned no entries even after systemctl restart libvirtd. VLAN config had to be applied manually. Without the hook, vnets revert to PVID 1 on every VM restart. Check: hook permissions, SELinux context, libvirtd hook execution path, and whether the hook runs in the started phase correctly.
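
The first items on that checklist can be verified with a small script. This is a sketch, not a complete diagnosis: check_libvirt_hook is a hypothetical helper that covers only existence, the executable bit, and the shebang line; the SELinux context still has to be inspected by hand.

```shell
#!/usr/bin/env bash
# Sketch: basic sanity checks for a libvirt hook script.
# Returns 0 if all checks pass, 1 otherwise.
check_libvirt_hook() {
    local hook="${1:-/etc/libvirt/hooks/qemu}" rc=0
    if [ ! -f "$hook" ]; then
        echo "FAIL: $hook does not exist"
        return 1
    fi
    if [ ! -x "$hook" ]; then
        echo "FAIL: $hook is not executable"
        rc=1
    fi
    if ! head -n 1 "$hook" | grep -q '^#!'; then
        echo "FAIL: $hook has no shebang line"
        rc=1
    fi
    if [ "$rc" -eq 0 ]; then
        echo "OK: $hook exists, is executable, and has a shebang"
    fi
    return "$rc"
}

# Example on kvm-02:
# check_libvirt_hook /etc/libvirt/hooks/qemu
# ls -Z /etc/libvirt/hooks/qemu    # show the SELinux context for manual review
```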

Lessons Learned

What Went Well

kvm-02 was accessible via SSH — provided a recovery foothold
vyos-02 maintained internet connectivity (VRRP partial failover)
Switch console available as last-resort management plane
VM disk images on local NVMe are safe — no data loss risk
bind-02 (secondary DNS) responded to queries when bind-01 was down — HA DNS worked as designed

What Could Be Improved

Panic reboot was the first escalation — kvm-01 VMs were running. Only ipsk-mgr-01 was down. The correct response was to investigate ipsk-mgr-01 specifically (virsh console ipsk-mgr-01, check MariaDB, check the VM itself). Rebooting the hypervisor killed 7 healthy VMs and the system couldn’t come back up.
Untested bridge change was the second escalation — The bridge PVID mismatch was a pre-existing condition that didn’t affect operations. Changing it during an active incident, over SSH, without consulting the documented bridge design, caused a second outage requiring IPMI recovery.
No nofail on NAS mounts — A single missing fstab option turned a recoverable situation into a boot failure. Compounded by using hostname (nas-01) instead of IP — DNS-dependent NFS mounts create a circular dependency when DNS itself runs on VMs that depend on the NFS mount.
fstab hostname creates circular dependency — NFS mount needs DNS to resolve nas-01 → DNS (bind) runs in a VM → VM images may be on NFS → boot deadlock.
Critical VM on NAS storage — WLC-02’s qcow2 on NAS means the HA standby is only as reliable as the NAS mount. HA pairs should have zero shared dependencies.
No incident response runbook — Under pressure, two destructive decisions were made. A documented decision tree would have guided toward less destructive triage first.

Key Takeaways

Never reboot a hypervisor to fix a single VM. Investigate the VM first — virsh console, virsh domifaddr, check the guest OS. The hypervisor is not the problem unless all VMs are failing simultaneously.
Every NFS/CIFS mount in fstab MUST have nofail, _netdev, and use IP not hostname. Without these, a network storage outage becomes a compute outage on every reboot. Hostnames create circular DNS dependencies.
Never modify bridge/network config over SSH without IPMI console ready. If the change breaks connectivity, you lose all access. Have out-of-band management open before touching L2.
Consult documentation before making infrastructure changes during an incident. The bridge design was documented in domus-infra-ops. Reading it first would have prevented the second outage.
HA pairs must not share failure domains. If WLC-02 depends on the same NAS as WLC-01, they fail together — that’s not HA.