INC-2026-04-13-001: Prevention
Prevention
Short-term (This Week)
- Add nofail and use the NAS IP in ALL NAS mounts on kvm-02. Done 2026-04-13. Replaced hostname nas-01 with IP 10.50.1.70 and added nofail,x-systemd.device-timeout=10 to all three entries. kvm-01 fstab still needs the same fix.

      # Fixed pattern (kvm-02, applied):
      10.50.1.70:/volume1/vms      /mnt/nas/vms      nfs  defaults,_netdev,nofail,x-systemd.device-timeout=10  0 0
      10.50.1.70:/volume1/isos     /mnt/nas/isos     nfs  defaults,_netdev,nofail,x-systemd.device-timeout=10  0 0
      10.50.1.70:/volume1/backups  /mnt/nas/backups  nfs  defaults,_netdev,nofail,x-systemd.device-timeout=10  0 0

- Fix kvm-01 fstab permanently. Four entries need correction:
  - NAS mounts: replace nas-01 with 10.50.1.70, add nofail,x-systemd.device-timeout=10
  - /dev/sdb1: change xfs to ext4, add nofail (this was the actual emergency mode blocker: a filesystem type mismatch)
- Document the "never panic reboot" rule. kvm-01 VMs were running fine before the reboot. The ipsk-mgr-01 outage was a single-VM issue, not a hypervisor issue. Rebooting the hypervisor killed 7 healthy VMs to troubleshoot 1. Correct response: use virsh console <vm-name> to investigate the guest first.
- Move WLC-02 qcow2 off NAS to local storage. VM images critical for HA must not depend on NAS availability.

      # After NAS mount is restored (copy while the VM is shut off so the image is consistent)
      sudo cp /mnt/nas/vms/9800-WLC-02.qcow2 /var/lib/libvirt/images/
      sudo virsh edit 9800-WLC-02
      # Change <source file='/mnt/nas/vms/9800-WLC-02.qcow2'/> to
      #        <source file='/var/lib/libvirt/images/9800-WLC-02.qcow2'/>

- Verify VM autostart is configured. All critical VMs should autostart so a legitimate reboot doesn’t require manual intervention.

      # On both KVMs: list the VMs that already have autostart enabled
      sudo virsh list --all --autostart
      # Enable for any missing
      sudo virsh autostart vyos-01
      sudo virsh autostart bind-01
      # ... etc
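The two fstab problems described above (hostname-based NFS sources and missing nofail) can be checked mechanically before the next reboot. The following is a minimal sketch, not a tool used during the incident: audit_nfs_fstab is a hypothetical helper name, and it assumes IPv4 literals and the standard six-column fstab layout.

```shell
#!/usr/bin/env bash
# Sketch: flag NFS entries in an fstab file that could block boot.
# audit_nfs_fstab is a hypothetical helper name, not an existing tool.
# Assumption: NFS sources use IPv4 literals (IPv6 literals would be misreported).
audit_nfs_fstab() {
    local fstab="${1:-/etc/fstab}"
    awk '
        /^[[:space:]]*#/ || NF < 4 { next }   # skip comments, blanks, short lines
        $3 !~ /^nfs4?$/            { next }   # only look at nfs / nfs4 entries
        {
            host = $1
            sub(/:.*/, "", host)              # keep the text before the first colon
            if (host !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/)
                printf "%s: source \"%s\" is a hostname, needs DNS at boot\n", $2, host
            if (("," $4 ",") !~ /,nofail,/)   # match nofail as a whole option
                printf "%s: missing nofail\n", $2
        }
    ' "$fstab"
}
```

Running audit_nfs_fstab with no argument reads /etc/fstab; empty output means no NFS entry was flagged. It does not catch the /dev/sdb1 filesystem type mismatch, which needs a separate check against the on-disk filesystem (for example with blkid).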
Short-term (This Week) — Bridge PVID
- Investigate and fix the br-mgmt PVID mismatch properly. The host is on PVID 1, VMs are on PVID 100. This is a pre-existing condition. The fix must be planned in a maintenance window with IPMI console ready. It requires understanding the full VLAN mapping: switch trunk native VLAN ↔ bridge PVID ↔ physical uplink PVID ↔ VM vnet PVID. Consult the domus-infra-ops bridge design documentation before making any changes.
- Re-add ISE-01 to switch RADIUS group. ISE-01 was temporarily removed from the ISE-RADIUS group during the incident. Re-add after kvm-01 recovery:

      configure terminal
      aaa group server radius ISE-RADIUS
       server name ISE-01
      end
Long-term (This Quarter)
- IPMI/BMC remote access. kvm-02 IPMI was available and saved the recovery when the bridge change made the host unreachable. Verify kvm-01 IPMI is also configured and accessible.
- NFS mount monitoring. Add a cron job or systemd timer that checks NFS mount health and alerts (or remounts) before it becomes critical.

      # Simple NFS health check
      [[ -z "$(ls -A /mnt/nas/vms/)" ]] && logger -p local0.crit "NFS mount /mnt/nas/vms/ is empty: stale mount suspected"

- Eliminate NAS dependency for critical VM images. All production VM qcow2 images should be on local NVMe. NAS should be used only for backups and ISOs, not as primary VM storage. WLC-02 already moved to local storage during this incident.
- Fix Te1/0/2 CRC errors. 386 CRC errors caused frame corruption. SFPs tested good. Replace the fiber patch cable, clean the connectors, and re-test. kvm-01 is temporarily on Te1/0/8. Move it back to Te1/0/2 after the cable fix is confirmed (clear counters, monitor for new CRC errors).
- Make bridge VLAN config persistent via nmcli. bridge vlan commands are ephemeral. Both kvm-01 and kvm-02 need the following for persistence across reboots:

      nmcli connection modify br-mgmt bridge.vlan-default-pvid 0 bridge.vlans "100 pvid untagged, 10, 20, 30, 40, 110, 120"

- Document and drill the incident response runbook. This incident exposed two escalation patterns:
  - Panic reboot of kvm-01: killed 7 healthy VMs
  - Untested bridge change over SSH: caused a second outage

  A runbook should define: when to reboot vs. investigate in place, VM restart order (dependency chain), NAS mount troubleshooting, VRRP failover verification, and a rule to never modify bridge/network config over SSH without IPMI console ready.
- Add kvm-01 and kvm-02 to monitoring. Wazuh agents on both hypervisors should alert on: failed mount units, emergency mode entry, NFS stale mounts, VM state changes.
- Resolve why bind-02, vault-02, vault-03 shut down overnight. These VMs were running on 2026-04-12 ~21:40 but were shut off by the morning of 2026-04-13. Root cause unknown: check journalctl on kvm-02 and VM-level logs for shutdown events.
- Debug libvirt VLAN hook on kvm-02. The hook at /etc/libvirt/hooks/qemu did not fire on VM start. journalctl -t "libvirt-hook[vyos-02]" returned no entries even after systemctl restart libvirtd. VLAN config had to be applied manually. Without the hook, vnets revert to PVID 1 on every VM restart. Check: hook permissions, SELinux context, libvirtd hook execution path, and whether the hook runs in the started phase correctly.
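The one-line NFS health check in the monitoring item can block indefinitely if the NAS stops answering, because ls itself waits on the mount. A hedged sketch of a variant that bounds the wait with timeout(1) and distinguishes an empty listing from an unresponsive one is shown here. check_nas_dir and report_nas_dir are hypothetical helper names, and the 5-second default is an assumption, not a tuned value.

```shell
#!/usr/bin/env bash
# Sketch of a bounded mount probe; helper names are hypothetical.
# check_nas_dir returns: 0 = listing non-empty, 1 = empty, 2 = failed or timed out.
check_nas_dir() {
    local dir="$1" limit="${2:-5}" listing
    # timeout(1) stops ls if it has not finished within $limit seconds;
    # -k 2 follows up with SIGKILL two seconds later if ls is still running.
    if ! listing="$(timeout -k 2 "$limit" ls -A -- "$dir" 2>/dev/null)"; then
        return 2
    fi
    [[ -n "$listing" ]] || return 1
    return 0
}

# Log a critical message for the two failure cases, using the same
# logger facility as the original one-liner.
report_nas_dir() {
    local dir="$1" rc=0
    check_nas_dir "$dir" || rc=$?
    case "$rc" in
        1) logger -p local0.crit "NFS mount $dir is empty, stale mount suspected" ;;
        2) logger -p local0.crit "NFS mount $dir did not respond, hung mount suspected" ;;
    esac
    return "$rc"
}
```

A cron job or systemd timer could call report_nas_dir /mnt/nas/vms and act on the exit code. One caveat: on a hard-mounted NFS share a blocked process may not exit promptly even after SIGKILL, so the bound is best-effort rather than guaranteed.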
Lessons Learned
What Went Well
- kvm-02 was accessible via SSH, which provided a recovery foothold
- vyos-02 maintained internet connectivity (VRRP partial failover)
- Switch console available as last-resort management plane
- VM disk images on local NVMe are safe: no data loss risk
- bind-02 (secondary DNS) responded to queries when bind-01 was down, so HA DNS worked as designed
What Could Be Improved
- Panic reboot was the first escalation. kvm-01 VMs were running; only ipsk-mgr-01 was down. The correct response was to investigate ipsk-mgr-01 specifically (virsh console ipsk-mgr-01, check MariaDB, check the VM itself). Rebooting the hypervisor killed 7 healthy VMs, and the system couldn’t come back up.
- Untested bridge change was the second escalation. The bridge PVID mismatch was a pre-existing condition that didn’t affect operations. Changing it during an active incident, over SSH, without consulting the documented bridge design, caused a second outage requiring IPMI recovery.
- No nofail on NAS mounts. A single missing fstab option turned a recoverable situation into a boot failure. This was compounded by using a hostname (nas-01) instead of an IP: DNS-dependent NFS mounts create a circular dependency when DNS itself runs on VMs that depend on the NFS mount.
- fstab hostname creates circular dependency. The NFS mount needs DNS to resolve nas-01 → DNS (bind) runs in a VM → VM images may be on NFS → boot deadlock.
- Critical VM on NAS storage. WLC-02’s qcow2 on NAS means the HA standby is only as reliable as the NAS mount. HA pairs should have zero shared dependencies.
- No incident response runbook. Under pressure, two destructive decisions were made. A documented decision tree would have guided responders toward less destructive triage first.
Key Takeaways
|