INC-2026-04-13-001: KVM Dual Hypervisor Outage — NAS Mount Failure Cascade

Incident Summary

| Field | Value |
| --- | --- |
| Detected | 2026-04-12 ~21:00 PST — user reported Domus-IoT WiFi prompting for password (iPSK Manager unreachable) |
| Mitigated | 2026-04-13 ~14:20 PST — kvm-02 fully operational, all services verified |
| Resolved | 2026-04-13 ~20:00 PST — all services operational on both KVMs |
| Duration | ~23 hours (2026-04-12 21:00 → 2026-04-13 20:00) |
| Severity | P1 (Critical) — Total infrastructure outage. All VMs on kvm-01 down. kvm-02 NAS mount degraded. No routing, DNS, auth, or PKI. |
| Impact | Complete loss of: VyOS routing (both routers), DNS (bind-01/02 effectively unreachable), 802.1X authentication (ISE), PKI/SSH CA (Vault), AD/FreeIPA identity, iPSK self-service, WLC HA pair, k3s cluster. Workstation forced to mobile hotspot. |
| Root Cause (Suspected) | NAS NFS mount failure. kvm-01 /etc/fstab contains a NAS mount without nofail — when NAS mount failed, reboot dropped to emergency mode. kvm-02 NAS mount at /mnt/nas/vms/ is stale/empty despite NAS being pingable, preventing WLC-02 from starting. |
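A network mount that lacks nofail can be caught before the next reboot. The following is a minimal sketch, not the tooling used during this incident: a hypothetical helper, check_fstab_nofail, that prints the mount point of every nfs, nfs4, or cifs entry in an fstab-format file whose options do not include nofail. The set of filesystem types checked is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: check_fstab_nofail is a hypothetical helper name.
# Prints the mount point of each network-filesystem entry without "nofail".
check_fstab_nofail() {
    local fstab="${1:-/etc/fstab}"
    awk '
        /^[ \t]*(#|$)/ { next }              # skip comments and blank lines
        $3 ~ /^(nfs|nfs4|cifs)$/ {           # assumed set of network fs types
            n = split($4, opts, ",")
            found = 0
            for (i = 1; i <= n; i++)
                if (opts[i] == "nofail") found = 1
            if (!found) print $2             # field 2 is the mount point
        }
    ' "$fstab"
}
```

Running check_fstab_nofail with no argument reads /etc/fstab. Every mount point it prints is an entry that may stall or fail the boot if the server cannot be reached.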

Environment

| Property | Value |
| --- | --- |
| kvm-01 | Supermicro SYS-E300-9D-8CN8TP, Rocky Linux 9.7, 10.50.1.100 — DOWN (emergency mode) |
| kvm-02 | Supermicro SYS-E300-9D, Rocky Linux 9.7, 10.50.1.101 — Degraded (NAS mount stale) |
| NAS | Synology 10.50.1.70 — Pingable from kvm-02, but NFS exports appear empty/stale |
| Switch | Cisco 3560CX-01 — Operational, Te1/0/1 and Te1/0/2 connected (L1 up to both Supermicros) |
| Workstation | Razer — on domus-movil mobile hotspot, wired 802.1X failing |

Timeline

Time (PST) Event

2026-04-12 ~21:00

User reports Domus-IoT WiFi prompting for password instead of iPSK auto-connect.

2026-04-12 ~21:05

Checked ISE (ise-02 on kvm-02) — ISE is up. Checked ODBC external identity source — shows ipsk-mgr-01 (10.50.1.30) unreachable.

2026-04-12 ~21:10

Attempted DNS resolution: dig +short ipsk-mgr-01.inside.domusdigitalis.dev — bind-01 (10.50.1.90) unreachable. bind-02 (10.50.1.91) responded: 10.50.1.30. Ping to 10.50.1.30 returned Destination Host Unreachable from 10.50.1.106.

2026-04-12 ~21:15

SSH to kvm-01 (10.50.1.100) failed: No route to host. Concluded kvm-01 is down — all 8 VMs lost.

2026-04-12 ~21:20

Panic reboot of kvm-01. System dropped to emergency/rescue mode prompting for root password. Error message referenced inability to mount /mnt partition (likely NAS NFS mount).

2026-04-12 ~21:30

Entered emergency mode root prompt. Unable to resolve mount issue. Multiple reboot attempts — same result.

2026-04-12 ~21:40

SSH to kvm-02 (10.50.1.101) — successful. VMs running: vyos-02, ise-02, 9800-WLC-02, bind-02, vault-02, vault-03.

2026-04-12 ~21:45

Attempted virsh start 9800-WLC-02, which failed: Cannot access storage file '/mnt/nas/vms/9800-WLC-02.qcow2' — No such file or directory. Confirmed /mnt/nas/vms/ exists but is empty. NAS (10.50.1.70) is pingable from kvm-02.

2026-04-12 ~21:50

Consoled into vyos-02 via virsh console — internet reachable (ping 8.8.8.8 succeeds from vyos-02). But kvm-02 host itself cannot ping 8.8.8.8 — routing anomaly.

2026-04-12 ~22:00

Gave up for the night. kvm-01 stuck in emergency mode. kvm-02 partially functional but NAS mount broken.

2026-04-13 morning

Workstation unable to connect to wired network (802.1X). Forced to mobile hotspot (domus-movil). Both KVMs effectively down from workstation perspective.

2026-04-13 morning

Switch console confirms Te1/0/1 and Te1/0/2 both show connected — L1 to both Supermicros is up.

2026-04-13 ~13:00

Switch can ping kvm-02 (.111) but not kvm-01 (.110). ISE RADIUS servers marked alive on switch (%RADIUS-4-RADIUS_ALIVE for both .20 and .21).

2026-04-13 ~13:05

Removed 802.1X template from switch port Gi1/0/4. EAP-TLS profile fails: Secrets were required, but not provided (keyring/passphrase issue). Static profile Domus-Wired-MGMT-Static times out at 90 seconds.

2026-04-13 ~13:10

Bypassed NetworkManager — manually assigned IP via ip addr add 10.50.1.106/24 dev enp130s0 and ip route add default via 10.50.1.1. Workstation on the wire. SSH to kvm-02 successful.

2026-04-13 ~13:15

kvm-02 virsh list: vyos-02 and ise-02 running, but bind-02, vault-02, vault-03 shut off (down since overnight). WLC-02 shut off (NAS mount). Started bind-02, vault-02, vault-03 — all came up.

2026-04-13 ~13:18

NAS mount diagnosis: mount -a fails — Failed to resolve server nas-01: Name or service not known. fstab uses hostname nas-01 not IP. Mounted manually by IP: mount -t nfs 10.50.1.70:/volume1/vms /mnt/nas/vms. All three NFS mounts restored. WLC-02 started successfully.

2026-04-13 ~13:22

Fixed kvm-02 /etc/fstab — replaced nas-01 hostname with 10.50.1.70 IP, added nofail,x-systemd.device-timeout=10 to all three NAS mount entries.

2026-04-13 ~13:25

All 6 kvm-02 VMs running. But kvm-02 host cannot ping any VM (ISE .21, bind .91). Investigated bridge VLAN configuration.

2026-04-13 ~13:30

Root cause of host-to-VM isolation identified: NetworkManager br-mgmt bridge configured with bridge.vlan-default-pvid: 1 and bridge.vlans: 1 pvid untagged. VMs are on PVID 100 (set by libvirt VLAN hook). Host and VMs in different VLANs on the same bridge.

2026-04-13 ~13:32

Confirmed via virsh console vyos-02: VRRP MASTER on all groups, holds VIP 10.50.1.1. vyos-02 can ping ISE (.21) and bind-02 (.91) but NOT kvm-02 host (.111). VM-to-VM networking is functional; only host is isolated.

2026-04-13 ~13:35

Self-inflicted outage #2: Attempted to fix bridge PVID via nmcli connection modify br-mgmt bridge.vlan-default-pvid 100 followed by nmcli device reapply br-mgmt. kvm-02 host became completely unreachable — lost SSH, switch could not ping .111. All management access to kvm-02 lost.

2026-04-13 ~13:40

Accessed kvm-02 via IPMI/BMC remote console. Reverted bridge change: nmcli connection modify br-mgmt bridge.vlan-default-pvid 1 bridge.vlans "1 pvid untagged, 10, 20, 30, 40, 100, 110, 120" + nmcli device reapply br-mgmt. kvm-02 host connectivity restored.

2026-04-13 ~13:50

Consulted domus-infra-ops kvm-01-rocky-rebuild.adoc Phase 5 for correct bridge VLAN config. Created /tmp/fix-bridge-vlans.sh script. Applied fix: removed VLAN 1, set PVID 100 on eno8 and br-mgmt self. Made persistent via nmcli connection modify br-mgmt bridge.vlan-default-pvid 0 bridge.vlans "100 pvid untagged, 10, 20, 30, 40, 110, 120".

2026-04-13 ~13:55

Restarted all VMs to trigger libvirt VLAN hook. Hook did not fire: journalctl -t "libvirt-hook[vyos-02]" returned no entries. Restarted libvirtd, retried — still no entries. Hook investigation deferred.

2026-04-13 ~14:00

Applied VLAN config manually to all vnets as stopgap: removed PVID 1, added VLANs 10/20/30/40/100/110/120, set PVID 100.

2026-04-13 ~14:02

kvm-02 RESTORED. All 6 VMs running. Host can ping vyos-02 (.3), ISE (.21), bind-02 (.91). Gateway VIP (.1) responding with ICMP redirects (normal — vyos-02 redirects to itself). Bridge VLAN state matches runbook specification.

2026-04-13 ~14:10

WLC-02 booted but showed "no bootable media". qcow2 on NAS corrupted (66 refcount errors). Copied to local storage (/var/lib/libvirt/images/), repaired with qemu-img check -r all (10 leaked clusters fixed, 0 corruptions). Updated VM disk path via virt-xml. WLC-02 booted successfully from local disk.

2026-04-13 ~14:20

Infrastructure verified end-to-end. Gateway VIP (.1) responding. DNS resolving (bind-02). ISE authenticating (802.1X MAB on switch, EAP-TLS WiFi). Internet reachable (8.8.8.8). P16g connected on Domus-Secure WiFi — confirmed working. AP associated with WLC-02.

2026-04-13 ~14:25

Razer workstation WiFi drops when LAN disconnected — Razer-specific issue, not infrastructure. P16g WiFi stable. Deferred to separate investigation.

2026-04-13 ~14:30

kvm-01 booted to normal login after commenting out /dev/sdb1 fstab entry (ext4 disk mounted as xfs — superblock mismatch). Root cause of emergency mode was sdb1 superblock error, not NAS mounts. NAS mounts also failing (hostname nas-01) but sdb1 was the actual blocker.

2026-04-13 ~14:35

kvm-01 VMs started: vyos-01, bind-01, vault-01, home-dc01, ipa-01, 9800-WLC-01 all up. ipsk-mgr-01 failed initially — cloud-init ISO on /mnt/onboard-ssd (sdb1 not mounted). Mounted sdb1 manually as ext4, ipsk-mgr-01 started.

2026-04-13 ~14:40

kvm-01 reachable from switch when connected to Te1/0/1 but NOT Te1/0/2 (original port). Zero L2 traffic on Te1/0/2 despite connected status. Swapped SFP cables to test — kvm-01 works on Te1/0/1, nothing works on Te1/0/2. Te1/0/2 port suspected dead (hardware failure or SFP issue).

2026-04-13 ~14:50

DHCP snooping drop for AP MAC 8c88.812a.0000 — stale binding from port move. AP operational on WLC-02, WiFi confirmed working (P16g on Domus-Secure).

2026-04-13 ~15:00

Current state: kvm-02 fully operational on Te1/0/1. kvm-01 offline pending Te1/0/2 port resolution. All kvm-02 services verified. kvm-01 VMs were running before shutdown.

2026-04-13 ~19:30

Te1/0/2 investigation: show interfaces Te1/0/2 revealed 386 CRC errors — corrupted frames on the link. SFPs tested good on Te1/0/1. Root cause: bad fiber patch cable or dirty connector on Te1/0/2 path.

2026-04-13 ~19:45

Moved kvm-01 to Te1/0/8. Configured as trunk (native VLAN 100, allowed VLANs 10/20/30/40/100/110/120). kvm-01 reachable via SSH.

2026-04-13 ~19:50

Started all kvm-01 VMs in dependency order via loop: for vm in vyos-01 bind-01 vault-01 home-dc01 ipa-01 9800-WLC-01 k3s-master-01; do sudo virsh start $vm && echo "$vm started"; sleep 5; done

2026-04-13 ~20:00

INCIDENT RESOLVED. All services verified:

  • iPSK Manager — login page accessible, Domus-IoT WiFi working, IoT clients connected

  • DNS HA — both bind servers responding: dig +short vault-02.inside.domusdigitalis.dev @10.50.1.90 ✓ and @10.50.1.91 ✓

  • Both KVMs operational: kvm-01 (8 VMs on Te1/0/8), kvm-02 (6 VMs on Te1/0/1)

  • VRRP — vyos-01 resuming MASTER (priority 200)
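The manual vnet stopgap from the ~14:00 entry (remove PVID 1, add VLANs 10/20/30/40/100/110/120, set PVID 100) can be sketched as a dry run. The exact commands run during the incident are not recorded here, so the iproute2 bridge vlan syntax below is an assumption, as are the helper name emit_vnet_vlan_cmds, the interface names vnet0 and vnet1, and the choice to mark VLAN 100 untagged on each vnet (mirroring the bridge.vlans setting).

```shell
#!/usr/bin/env bash
# Sketch only: emit_vnet_vlan_cmds is a hypothetical helper.
# It PRINTS the commands for each interface given; it does not run them.
emit_vnet_vlan_cmds() {
    local pvid=100                       # PVID from the timeline
    local tagged="10 20 30 40 110 120"   # remaining VLANs from the timeline
    local dev vid
    for dev in "$@"; do
        echo "bridge vlan del dev $dev vid 1"
        echo "bridge vlan add dev $dev vid $pvid pvid untagged"
        for vid in $tagged; do
            echo "bridge vlan add dev $dev vid $vid"
        done
    done
}
```

Printing first is deliberate, given that the earlier reapply on br-mgmt cut off management access. After reviewing the output of emit_vnet_vlan_cmds vnet0 vnet1, it can be piped to sudo sh from a session that does not depend on the bridge, such as the IPMI console.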

Symptoms

  • iPSK Manager (ipsk-mgr-01) unreachable — Domus-IoT users prompted for password

  • DNS primary (bind-01) unreachable — host unreachable on 10.50.1.90

  • kvm-01 SSH failed — No route to host on 10.50.1.100

  • kvm-01 stuck in emergency/rescue mode after reboot — /mnt mount failure

  • kvm-02 NAS mount /mnt/nas/vms/ is empty despite NAS being pingable

  • 9800-WLC-02 cannot start — qcow2 on NAS mount not found

  • kvm-02 host cannot reach internet (but vyos-02 VM can) — VRRP/routing anomaly

  • Workstation 802.1X authentication failing on the morning of 2026-04-13 — all wired profiles disconnected

  • Only connectivity: mobile hotspot and switch console
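An empty directory at a mount path is ambiguous: the share may not be mounted at all, or it may be mounted and genuinely empty. The sketch below separates the two cases. The helper name nas_mount_state and its three output strings are hypothetical; mountpoint is the util-linux utility.

```shell
#!/usr/bin/env bash
# Sketch only: nas_mount_state is a hypothetical helper.
# Prints one of: not-mounted, mounted-empty, ok.
nas_mount_state() {
    local dir="$1"
    if ! mountpoint -q "$dir" 2>/dev/null; then
        echo "not-mounted"       # nothing is mounted at this path
    elif [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "mounted-empty"     # mounted, but no entries are visible
    else
        echo "ok"
    fi
}
```

For example, nas_mount_state /mnt/nas/vms reporting not-mounted points at the mount step (fstab, name resolution, NFS) rather than at the contents of the export.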

Impact Assessment

| System | Status | Impact |
| --- | --- | --- |
| kvm-01 (8 VMs) | DOWN — emergency mode | vyos-01, WLC-01, vault-01, bind-01, home-dc01, ipa-01, ipsk-mgr-01, k3s-master-01 all offline |
| kvm-02 (5 VMs running) | DEGRADED | vyos-02, ise-02, bind-02, vault-02, vault-03 running. WLC-02 cannot start (NAS mount). |
| VyOS routing | DEGRADED | vyos-01 down. vyos-02 running but VRRP failover may not be fully functional — kvm-02 host can’t reach internet. |
| DNS | DEGRADED | bind-01 down. bind-02 running on kvm-02 but may be unreachable from workstation due to routing. |
| 802.1X Authentication | DEGRADED | ISE (ise-02) running on kvm-02 but workstation cannot authenticate — likely routing/VRRP issue preventing RADIUS traffic. |
| Vault PKI | DEGRADED | vault-01 (leader) down. vault-02/03 (followers) on kvm-02 — Raft may have elected new leader but clients can’t reach it. |
| AD / FreeIPA | DOWN | Both on kvm-01 — no Windows auth, Kerberos, sudo rules. |
| iPSK Manager | DOWN | On kvm-01 — IoT self-service PSK broken. |
| WLC HA | DOWN | WLC-01 (kvm-01) down. WLC-02 (kvm-02) can’t start — NAS qcow2 missing. |
| k3s | DOWN | Control plane on kvm-01. |
| Workstation connectivity | WORKAROUND | On mobile hotspot. No wired access. |
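The per-host VM status above can be derived from virsh list --all on each hypervisor. As a rough sketch, the hypothetical filter shut_off_vms reads that table on standard input and prints the names of domains in the shut off state. It assumes the default three-column layout (Id, Name, State) and domain names without spaces.

```shell
#!/usr/bin/env bash
# Sketch only: shut_off_vms is a hypothetical filter for `virsh list --all`.
# Assumes two header lines followed by rows of: Id  Name  State.
shut_off_vms() {
    awk 'NR > 2 && NF >= 4 && $(NF-1) == "shut" && $NF == "off" { print $2 }'
}
```

Typical use would be sudo virsh list --all | shut_off_vms on kvm-01 and kvm-02, comparing the result against the expected VM inventory.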

Business Impact

  • All home network services offline

  • Single user affected (home lab)

  • No data loss expected — VM disks on local NVMe (kvm-01: /mnt/onboard-ssd/libvirt/images/, kvm-02: /var/lib/libvirt/images/)

  • NAS-stored VM images (WLC-02) inaccessible

  • Workaround: mobile hotspot for internet, switch console for L2 management
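For a NAS-stored image such as the WLC-02 qcow2, the copy-then-repair approach recorded in the timeline can be sketched as follows. The helper name check_qcow2_copy and its arguments are hypothetical. The sketch relies on the documented qemu-img check exit codes (0 clean, 2 corruption found, 3 leaked clusters only) and works on a copy so that the original on the NAS is never modified.

```shell
#!/usr/bin/env bash
# Sketch only: check_qcow2_copy is a hypothetical helper.
# Copies SRC into DEST_DIR, checks the copy, and repairs it only if needed.
check_qcow2_copy() {
    local src="$1" dest_dir="$2" copy rc
    copy="$dest_dir/$(basename "$src")"
    cp "$src" "$copy" || return 1        # never touch the original image
    qemu-img check "$copy"
    rc=$?
    # 2 = corruption found, 3 = leaked clusters only; attempt repair on the copy.
    if [ "$rc" -eq 2 ] || [ "$rc" -eq 3 ]; then
        qemu-img check -r all "$copy"
        rc=$?
    fi
    return "$rc"
}
```

The VM should be shut off while its image is checked. Pointing the domain at the repaired copy (done with virt-xml during the incident) remains a separate, manual step.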

Metadata

| Field | Value |
| --- | --- |
| Incident ID | INC-2026-04-13-001 |
| Author | Evan Rosado |
| Created | 2026-04-13 |
| Last Updated | 2026-04-13 |
| Status | Resolved — all services operational on both KVMs |
| Post-Incident Review | After resolution verified |
| RCA Required | Yes (P1 — STD-011) |