INC-2026-04-13-001: Prevention

Prevention

Short-term (This Week)

  • Add nofail + use IP in ALL NAS mounts on kvm-02 — Done 2026-04-13. Replaced hostname nas-01 with IP 10.50.1.70, added nofail,x-systemd.device-timeout=10 to all three entries. kvm-01 fstab still needs the same fix.

    # Fixed pattern (kvm-02 — applied):
    10.50.1.70:/volume1/vms /mnt/nas/vms nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0
    10.50.1.70:/volume1/isos /mnt/nas/isos nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0
    10.50.1.70:/volume1/backups /mnt/nas/backups nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0
  • Fix kvm-01 fstab permanently — Four entries need correction:

    • NAS mounts: replace nas-01 with 10.50.1.70, add nofail,x-systemd.device-timeout=10

    • /dev/sdb1: change xfs to ext4, add nofail (this was the actual emergency mode blocker — filesystem type mismatch)

  • Document the "never panic reboot" rule — kvm-01 VMs were running fine before the reboot. The ipsk-mgr-01 outage was a single VM issue, not a hypervisor issue. Rebooting the hypervisor killed 7 healthy VMs to troubleshoot 1. Correct response: virsh console <vm-name> to investigate the guest first.

  • Move WLC-02 qcow2 off NAS to local storage — VM images critical for HA must not depend on NAS availability.

    # After NAS mount is restored, with the VM shut off so the image is not copied mid-write
    sudo cp /mnt/nas/vms/9800-WLC-02.qcow2 /var/lib/libvirt/images/
    sudo virsh edit 9800-WLC-02
    # Change <source file='/mnt/nas/vms/9800-WLC-02.qcow2'/> to
    #        <source file='/var/lib/libvirt/images/9800-WLC-02.qcow2'/>
  • Verify VM autostart is configured — All critical VMs should autostart so a legitimate reboot doesn’t require manual intervention.

    # On both KVMs: list VMs that do not have autostart enabled
    sudo virsh list --all --no-autostart
    # Enable for any critical VM that appears in that list
    sudo virsh autostart vyos-01
    sudo virsh autostart bind-01
    # ... etc
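
The fstab rules in this list (IP instead of hostname, nofail and _netdev on every NAS entry) can be checked mechanically before the next reboot. The following is a minimal sketch, not the tooling used during the incident: the function name audit_net_mounts and its output labels are hypothetical, and it only recognizes IPv4 sources.

```shell
# Hypothetical helper: list NFS/CIFS fstab entries that break the rules above.
# Usage: audit_net_mounts /etc/fstab
audit_net_mounts() {
    local fstab="${1:-/etc/fstab}"
    awk '
        NF == 0 || $1 ~ /^#/ { next }      # skip blank lines and comments
        $3 ~ /^(nfs4?|cifs)$/ {
            problems = ""
            opts = "," $4 ","              # pad so each option can be matched whole
            if (opts !~ /,nofail,/)  problems = problems " missing-nofail"
            if (opts !~ /,_netdev,/) problems = problems " missing-_netdev"
            # Source should begin with an IPv4 address (ip:/path or //ip/share)
            if ($1 !~ "^(//)?[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+[:/]")
                problems = problems " hostname-source"
            if (problems != "") print $2 ":" problems
        }
    ' "$fstab"
}
```

An empty result means every network mount in the file passes. Running it against /etc/fstab on kvm-01 before and after the fix gives a quick confirmation that all three NAS entries were corrected; it does not check local entries such as /dev/sdb1.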

Short-term (This Week) — Bridge PVID

  • Investigate and fix the br-mgmt PVID mismatch properly — The host is on PVID 1, VMs are on PVID 100. This is a pre-existing condition. The fix must be planned in a maintenance window with IPMI console ready. Requires understanding the full VLAN mapping: switch trunk native VLAN ↔ bridge PVID ↔ physical uplink PVID ↔ VM vnet PVID. Consult domus-infra-ops bridge design documentation before making any changes.

  • Re-add ISE-01 to switch RADIUS group. ISE-01 was temporarily removed from the ISE-RADIUS group during the incident. Re-add it after kvm-01 recovers:

    configure terminal
    aaa group server radius ISE-RADIUS
     server name ISE-01
    end
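
Before the maintenance window for the br-mgmt PVID fix, it helps to record the current PVID of every bridge port so the VLAN mapping can be compared against the design documentation. The sketch below parses the text output of `bridge vlan show`. It assumes the usual iproute2 layout (a header row, then one row per VLAN, with the port name only on the first row for each port); the function name pvid_by_port is hypothetical.

```shell
# Hypothetical helper: print "port pvid" pairs from `bridge vlan show` text on stdin.
# Assumes a header row, then one row per VLAN, with the port name present only
# on the first row of each port and "PVID" among the flags of the PVID row.
pvid_by_port() {
    awk '
        NR == 1 || NF == 0 { next }          # skip the header row and blank lines
        /^[^ \t]/ { port = $1; vlan = $2 }   # first row for a port: name, then VLAN
        /^[ \t]/  { vlan = $1 }              # continuation row: VLAN only
        / PVID/   { print port, vlan }
    '
}

# Read-only usage on a hypervisor:
#   bridge vlan show | pvid_by_port
```

The command is read-only, so it is safe to run over SSH; only the changes themselves need the IPMI console ready.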

Long-term (This Quarter)

  • IPMI/BMC remote access — kvm-02 IPMI was available and saved the recovery when the bridge change made the host unreachable. Verify kvm-01 IPMI is also configured and accessible.

  • NFS mount monitoring — Add a cron job or systemd timer that checks NFS mount health and alerts (or remounts) before it becomes critical.

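    # Simple NFS health check: timeout catches a hung (stale) mount, mountpoint catches an unmounted path
    if ! timeout 10 mountpoint -q /mnt/nas/vms || [[ -z "$(timeout 10 ls -A /mnt/nas/vms/)" ]]; then
        logger -p local0.crit "NFS mount /mnt/nas/vms/ is missing, unresponsive, or empty"
    fi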
  • Eliminate NAS dependency for critical VM images — All production VM qcow2 images should be on local NVMe. NAS should be used only for backups and ISOs, not as primary VM storage. WLC-02 already moved to local storage during this incident.

  • Fix Te1/0/2 CRC errors. The port logged 386 CRC errors, indicating frame corruption on the link. SFPs tested good, so replace the fiber patch cable, clean the connectors, and re-test. kvm-01 is temporarily on Te1/0/8; move it back to Te1/0/2 once the cable fix is confirmed (clear counters, then monitor for new CRC errors).

  • Make bridge VLAN config persistent via nmcli. bridge vlan commands are ephemeral. Both kvm-01 and kvm-02 need nmcli connection modify br-mgmt bridge.vlan-default-pvid 0 bridge.vlans "100 pvid untagged, 10, 20, 30, 40, 110, 120" for persistence across reboots.

  • Document and drill the incident response runbook — This incident exposed two escalation patterns:

    • Panic reboot of kvm-01 — killed 7 healthy VMs

    • Untested bridge change over SSH — caused second outage

    A runbook should define: when to reboot vs. investigate in-place, VM restart order (dependency chain), NAS mount troubleshooting, VRRP failover verification, and a rule to never modify bridge/network config over SSH without IPMI console ready.

  • Add kvm-01 and kvm-02 to monitoring — Wazuh agents on both hypervisors should alert on: failed mount units, emergency mode entry, NFS stale mounts, VM state changes.

  • Resolve why bind-02, vault-02, vault-03 shut down overnight. These VMs were running on 2026-04-12 at ~21:40 but were shut off by the morning of 2026-04-13. Root cause is unknown; check journalctl on kvm-02 and VM-level logs for shutdown events.

  • Debug libvirt VLAN hook on kvm-02 — Hook at /etc/libvirt/hooks/qemu did not fire on VM start. journalctl -t "libvirt-hook[vyos-02]" returned no entries even after systemctl restart libvirtd. VLAN config had to be applied manually. Without the hook, vnets revert to PVID 1 on every VM restart. Check: hook permissions, SELinux context, libvirtd hook execution path, and whether the hook runs in the started phase correctly.
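
For the hook item above, the first two checks (file present, permissions) are easy to script. libvirt's hook documentation requires hook scripts to be executable, so a missing execute bit is a plausible reason for a hook that never fires. This is a minimal sketch; the function name check_hook and its output labels are hypothetical, and it does not cover SELinux context or the phase logic inside the hook.

```shell
# Hypothetical helper: first-pass check of a libvirt hook script.
# Prints one of: missing, not-executable, ok
check_hook() {
    local hook="${1:-/etc/libvirt/hooks/qemu}"
    if [ ! -f "$hook" ]; then
        echo "missing"
    elif [ ! -x "$hook" ]; then
        echo "not-executable"
    else
        echo "ok"
    fi
}
```

If the result is ok, the remaining items on the list still apply: `ls -Z /etc/libvirt/hooks/qemu` shows the SELinux context, and the script's handling of the phase argument needs to be read by hand.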

Lessons Learned

What Went Well

  • kvm-02 was accessible via SSH — provided a recovery foothold

  • vyos-02 maintained internet connectivity (VRRP partial failover)

  • Switch console available as last-resort management plane

  • VM disk images on local NVMe are safe — no data loss risk

  • bind-02 (secondary DNS) responded to queries when bind-01 was down — HA DNS worked as designed

What Could Be Improved

  • Panic reboot was the first escalation — kvm-01 VMs were running. Only ipsk-mgr-01 was down. The correct response was to investigate ipsk-mgr-01 specifically (virsh console ipsk-mgr-01, check MariaDB, check the VM itself). Rebooting the hypervisor killed 7 healthy VMs and the system couldn’t come back up.

  • Untested bridge change was the second escalation — The bridge PVID mismatch was a pre-existing condition that didn’t affect operations. Changing it during an active incident, over SSH, without consulting the documented bridge design, caused a second outage requiring IPMI recovery.

  • No nofail on NAS mounts. A single missing fstab option turned a recoverable situation into a boot failure. Using a hostname (nas-01) instead of an IP address compounded the problem.

  • fstab hostname creates circular dependency — NFS mount needs DNS to resolve nas-01 → DNS (bind) runs in a VM → VM images may be on NFS → boot deadlock.

  • Critical VM on NAS storage — WLC-02’s qcow2 on NAS means the HA standby is only as reliable as the NAS mount. HA pairs should have zero shared dependencies.

  • No incident response runbook — Under pressure, two destructive decisions were made. A documented decision tree would have guided toward less destructive triage first.
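
The "Critical VM on NAS storage" finding can be turned into a repeatable check: for each disk path reported by `virsh domblklist <vm-name>`, ask which filesystem type backs it. The sketch below is one way to do that with findmnt. The function name on_network_fs is hypothetical, and the list of filesystem types is an assumption that covers NFS and CIFS only.

```shell
# Hypothetical helper: succeed if the given path lives on a network filesystem.
# findmnt --target resolves a file path to the mount that contains it.
on_network_fs() {
    case "$(findmnt -n -o FSTYPE --target "$1" 2>/dev/null)" in
        nfs|nfs4|cifs) return 0 ;;
        *)             return 1 ;;
    esac
}
```

For example, `on_network_fs /mnt/nas/vms/9800-WLC-02.qcow2 && echo "image depends on NAS"` would flag the WLC-02 image while it is still on the NAS mount.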

Key Takeaways

  1. Never reboot a hypervisor to fix a single VM. Investigate the VM first — virsh console, virsh domifaddr, check the guest OS. The hypervisor is not the problem unless all VMs are failing simultaneously.

  2. Every NFS/CIFS mount in fstab MUST have nofail, _netdev, and use IP not hostname. Without these, a network storage outage becomes a compute outage on every reboot. Hostnames create circular DNS dependencies.

  3. Never modify bridge/network config over SSH without IPMI console ready. If the change breaks connectivity, you lose all access. Have out-of-band management open before touching L2.

  4. Consult documentation before making infrastructure changes during an incident. The bridge design was documented in domus-infra-ops. Reading it first would have prevented the second outage.

  5. HA pairs must not share failure domains. If WLC-02 depends on the same NAS as WLC-01, they fail together — that’s not HA.
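
Takeaway 1 reduces to a rule simple enough to encode in a runbook decision tree: treat the hypervisor as suspect only when every VM on it is failing. The following is an illustrative sketch of that rule, not an existing tool; the function name triage_scope and its output labels are hypothetical.

```shell
# Hypothetical runbook helper: decide where to start triage.
# Args: total number of VMs on the host, number of VMs currently failing.
triage_scope() {
    local total="$1" failed="$2"
    if [ "$failed" -eq 0 ]; then
        echo "no-action"
    elif [ "$failed" -lt "$total" ]; then
        echo "investigate-guest"       # some VMs are healthy: do not reboot the host
    else
        echo "investigate-hypervisor"  # every VM is failing: a host-level fault is plausible
    fi
}
```

Applied to this incident (8 VMs on kvm-01, 1 failing), `triage_scope 8 1` prints investigate-guest, which points at virsh console ipsk-mgr-01 rather than a host reboot.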