INC-2026-04-13-001: KVM Dual Hypervisor Outage — NAS Mount Failure Cascade
Incident Summary
| Field | Value |
|---|---|
| Detected | 2026-04-12 ~21:00 PST. User reported Domus-IoT WiFi prompting for password (iPSK Manager unreachable). |
| Mitigated | 2026-04-13 ~14:20 PST. kvm-02 fully operational, all services verified. |
| Resolved | 2026-04-13 ~20:00 PST. All services operational on both KVMs. |
| Duration | ~23 hours (2026-04-12 21:00 → 2026-04-13 20:00) |
| Severity | P1 (Critical). Total infrastructure outage. All VMs on kvm-01 down. kvm-02 NAS mount degraded. No routing, DNS, auth, or PKI. |
| Impact | Complete loss of: VyOS routing (both routers), DNS (bind-01/02 effectively unreachable), 802.1X authentication (ISE), PKI/SSH CA (Vault), AD/FreeIPA identity, iPSK self-service, WLC HA pair, k3s cluster. Workstation forced to mobile hotspot. |
| Root Cause (Suspected) | NAS NFS mount failure. kvm-01 |
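
The duration and time-to-mitigate figures follow directly from the timestamps in the summary. A minimal check, assuming the approximate ("~") times can be treated as exact local times:

```python
from datetime import datetime

# Timestamps from the incident summary; the "~" approximations are
# treated as exact here (an assumption made for the arithmetic).
detected = datetime(2026, 4, 12, 21, 0)
mitigated = datetime(2026, 4, 13, 14, 20)
resolved = datetime(2026, 4, 13, 20, 0)


def hours_between(start: datetime, end: datetime) -> float:
    """Elapsed hours between two naive local timestamps."""
    return (end - start).total_seconds() / 3600


time_to_mitigate = hours_between(detected, mitigated)
time_to_resolve = hours_between(detected, resolved)

print(f"time to mitigate: {time_to_mitigate:.1f} h")
print(f"time to resolve:  {time_to_resolve:.1f} h")
```

Under that assumption, roughly 17 hours elapsed before kvm-02 was serving traffic again, and the remaining time went to bringing kvm-01 back.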
Environment
| Property | Value |
|---|---|
| kvm-01 | Supermicro SYS-E300-9D-8CN8TP, Rocky Linux 9.7, 10.50.1.100. DOWN (emergency mode). |
| kvm-02 | Supermicro SYS-E300-9D, Rocky Linux 9.7, 10.50.1.101. Degraded (NAS mount stale). |
| NAS | Synology 10.50.1.70. Pingable from kvm-02, but NFS exports appear empty/stale. |
| Switch | Cisco 3560CX-01. Operational, Te1/0/1 and Te1/0/2 connected (L1 up to both Supermicros). |
| Workstation | Razer, on |
Timeline
| Time (PST) | Event |
|---|---|
| 2026-04-12 ~21:00 | User reports Domus-IoT WiFi prompting for password instead of iPSK auto-connect. |
| 2026-04-12 ~21:05 | Checked ISE (ise-02 on kvm-02): ISE is up. Checked ODBC external identity source: shows ipsk-mgr-01 (10.50.1.30) unreachable. |
| 2026-04-12 ~21:10 | Attempted DNS resolution: |
| 2026-04-12 ~21:15 | SSH to kvm-01 (10.50.1.100) failed: |
| 2026-04-12 ~21:20 | Panic reboot of kvm-01. System dropped to emergency/rescue mode prompting for root password. Error message referenced inability to mount |
| 2026-04-12 ~21:30 | Entered emergency mode root prompt. Unable to resolve mount issue. Multiple reboot attempts, same result. |
| 2026-04-12 ~21:40 | SSH to kvm-02 (10.50.1.101) successful. VMs running: vyos-02, ise-02, 9800-WLC-02, bind-02, vault-02, vault-03. |
| 2026-04-12 ~21:45 | Attempted |
| 2026-04-12 ~21:50 | Consoled into vyos-02 via |
| 2026-04-12 ~22:00 | Gave up for the night. kvm-01 stuck in emergency mode. kvm-02 partially functional but NAS mount broken. |
| 2026-04-13 morning | Workstation unable to connect to wired network (802.1X). Forced to mobile hotspot ( |
| 2026-04-13 morning | Switch console confirms Te1/0/1 and Te1/0/2 both show |
| 2026-04-13 ~13:00 | Switch can ping kvm-02 (.111) but not kvm-01 (.110). ISE RADIUS servers marked alive on switch ( |
| 2026-04-13 ~13:05 | Removed 802.1X template from switch port Gi1/0/4. EAP-TLS profile fails: |
| 2026-04-13 ~13:10 | Bypassed NetworkManager and manually assigned IP via |
| 2026-04-13 ~13:15 | kvm-02 |
| 2026-04-13 ~13:18 | NAS mount diagnosis: |
| 2026-04-13 ~13:22 | Fixed kvm-02 |
| 2026-04-13 ~13:25 | All 6 kvm-02 VMs running, but the kvm-02 host cannot ping any VM (ISE .21, bind .91). Investigated bridge VLAN configuration. |
| 2026-04-13 ~13:30 | Root cause of host-to-VM isolation identified: NetworkManager |
| 2026-04-13 ~13:32 | Confirmed via |
| 2026-04-13 ~13:35 | Self-inflicted outage #2: attempted to fix bridge PVID via |
| 2026-04-13 ~13:40 | Accessed kvm-02 via IPMI/BMC remote console. Reverted bridge change: |
| 2026-04-13 ~13:50 | Consulted domus-infra-ops |
| 2026-04-13 ~13:55 | Restarted all VMs to trigger libvirt VLAN hook. Hook did not fire: |
| 2026-04-13 ~14:00 | Applied VLAN config manually to all vnets as stopgap: removed PVID 1, added VLANs 10/20/30/40/100/110/120, set PVID 100. |
| 2026-04-13 ~14:02 | kvm-02 RESTORED. All 6 VMs running. Host can ping vyos-02 (.3), ISE (.21), bind-02 (.91). Gateway VIP (.1) responding with ICMP redirects (normal: vyos-02 redirects to itself). Bridge VLAN state matches runbook specification. |
| 2026-04-13 ~14:10 | WLC-02 booted but showed "no bootable media". qcow2 on NAS corrupted (66 refcount errors). Copied to local storage ( |
| 2026-04-13 ~14:20 | Infrastructure verified end-to-end. Gateway VIP (.1) responding. DNS resolving (bind-02). ISE authenticating (802.1X MAB on switch, EAP-TLS WiFi). Internet reachable (8.8.8.8). P16g connected on Domus-Secure WiFi, confirmed working. AP associated with WLC-02. |
| 2026-04-13 ~14:25 | Razer workstation WiFi drops when LAN disconnected. This is a Razer-specific issue, not infrastructure; P16g WiFi is stable. Deferred to separate investigation. |
| 2026-04-13 ~14:30 | kvm-01 booted to normal login after commenting out |
| 2026-04-13 ~14:35 | kvm-01 VMs started: vyos-01, bind-01, vault-01, home-dc01, ipa-01, 9800-WLC-01 all up. ipsk-mgr-01 failed initially: cloud-init ISO on |
| 2026-04-13 ~14:40 | kvm-01 reachable from switch when connected to Te1/0/1 but NOT Te1/0/2 (original port). Zero L2 traffic on Te1/0/2 despite |
| 2026-04-13 ~14:50 | DHCP snooping drop for AP MAC |
| 2026-04-13 ~15:00 | Current state: kvm-02 fully operational on Te1/0/1. kvm-01 offline pending Te1/0/2 port resolution. All kvm-02 services verified. kvm-01 VMs were running before shutdown. |
| 2026-04-13 ~19:30 | Te1/0/2 investigation: |
| 2026-04-13 ~19:45 | Moved kvm-01 to Te1/0/8. Configured as trunk (native VLAN 100, allowed VLANs 10/20/30/40/100/110/120). kvm-01 reachable via SSH. |
| 2026-04-13 ~19:50 | Started all kvm-01 VMs in dependency order via loop: |
| 2026-04-13 ~20:00 | INCIDENT RESOLVED. All services verified. iPSK Manager: login page accessible, Domus-IoT WiFi working, IoT clients connected. DNS HA: both bind servers responding: |
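
The ~14:00 stopgap (remove PVID 1, add VLANs 10/20/30/40/100/110/120, set PVID 100 on every vnet) can be sketched as a dry-run command generator. The exact commands run during the incident are not recorded here, so this is only an illustration: the interface names `vnet0` and `vnet1` are hypothetical, and marking VLAN 100 as `untagged` is an assumption. The helper prints iproute2 `bridge vlan` commands and changes nothing.

```python
TRUNK_VLANS = [10, 20, 30, 40, 100, 110, 120]  # from the ~14:00 timeline entry
NATIVE_PVID = 100                               # from the ~14:00 timeline entry


def vlan_commands(dev: str, vlans=TRUNK_VLANS, pvid=NATIVE_PVID) -> list[str]:
    """Build iproute2 commands mirroring the manual stopgap for one vnet."""
    cmds = [f"bridge vlan del dev {dev} vid 1"]  # drop the default PVID 1
    for vid in vlans:
        cmd = f"bridge vlan add dev {dev} vid {vid}"
        if vid == pvid:
            # Assumption: the native VLAN is also sent untagged on egress.
            cmd += " pvid untagged"
        cmds.append(cmd)
    return cmds


if __name__ == "__main__":
    # Hypothetical interface names; `bridge vlan show` lists the real ones.
    for dev in ("vnet0", "vnet1"):
        for cmd in vlan_commands(dev):
            print(cmd)
```

Printing the commands first makes it possible to review them against the runbook before applying anything, which matters given that an earlier bridge change at ~13:35 cut off access to the host.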
Symptoms
- iPSK Manager (ipsk-mgr-01) unreachable: Domus-IoT users prompted for password
- DNS primary (bind-01) unreachable: `host unreachable` on 10.50.1.90
- kvm-01 SSH refused: `No route to host` on 10.50.1.100
- kvm-01 stuck in emergency/rescue mode after reboot: `/mnt` mount failure
- kvm-02 NAS mount `/mnt/nas/vms/` is empty despite NAS being pingable
- 9800-WLC-02 cannot start: qcow2 on NAS mount not found
- kvm-02 host cannot reach internet (but vyos-02 VM can): VRRP/routing anomaly
- Workstation 802.1X authentication failing on the morning of 2026-04-13: all wired profiles disconnected
- Only connectivity: mobile hotspot and switch console
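
The "empty despite NAS being pingable" symptom shows why a ping is not a sufficient health check for an NFS mount. A minimal sketch of a stricter check follows; the function name `mount_health` and its return labels are illustrative, not part of any existing tooling. Note that on a hung hard-mounted NFS share the `os.stat` call itself may block.

```python
import errno
import os


def mount_health(path: str) -> str:
    """Classify a mount point: missing, stale, error, not-mounted, empty, ok."""
    try:
        os.stat(path)
    except FileNotFoundError:
        return "missing"
    except OSError as exc:
        # ESTALE is what a stale NFS file handle typically reports.
        return "stale" if exc.errno == errno.ESTALE else "error"
    if not os.path.ismount(path):
        return "not-mounted"  # directory exists, but nothing is mounted on it
    with os.scandir(path) as entries:
        # A mounted but empty export matches the kvm-02 symptom.
        return "ok" if next(entries, None) is not None else "empty"


if __name__ == "__main__":
    print(mount_health("/mnt/nas/vms/"))
```

A check like this distinguishes "the directory is there but the NFS share is not mounted" from "the share is mounted but the export is empty", two cases that look identical when listing the directory by hand.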
Impact Assessment
| System | Status | Impact |
|---|---|---|
| kvm-01 (8 VMs) | DOWN (emergency mode) | vyos-01, WLC-01, vault-01, bind-01, home-dc01, ipa-01, ipsk-mgr-01, k3s-master-01 all offline |
| kvm-02 (5 VMs running) | DEGRADED | vyos-02, ise-02, bind-02, vault-02, vault-03 running. WLC-02 cannot start (NAS mount). |
| VyOS routing | DEGRADED | vyos-01 down. vyos-02 running but VRRP failover may not be fully functional: kvm-02 host cannot reach internet. |
| DNS | DEGRADED | bind-01 down. bind-02 running on kvm-02 but may be unreachable from workstation due to routing. |
| 802.1X Authentication | DEGRADED | ISE (ise-02) running on kvm-02 but workstation cannot authenticate, likely a routing/VRRP issue preventing RADIUS traffic. |
| Vault PKI | DEGRADED | vault-01 (leader) down. vault-02/03 (followers) on kvm-02. Raft may have elected new leader but clients cannot reach it. |
| AD / FreeIPA | DOWN | Both on kvm-01: no Windows auth, Kerberos, sudo rules. |
| iPSK Manager | DOWN | On kvm-01: IoT self-service PSK broken. |
| WLC HA | DOWN | WLC-01 (kvm-01) down. WLC-02 (kvm-02) cannot start: NAS qcow2 missing. |
| k3s | DOWN | Control plane on kvm-01. |
| Workstation connectivity | WORKAROUND | On mobile hotspot. No wired access. |
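
The Vault entry notes that Raft "may have elected new leader" with vault-01 down. The standard Raft majority rule makes that plausible: a cluster of n voters needs ⌊n/2⌋ + 1 reachable voters to elect a leader. A small sketch of the arithmetic, assuming all three Vault nodes are voters and that vault-02 and vault-03 could still reach each other on kvm-02:

```python
def raft_quorum(voters: int) -> int:
    """Smallest majority of a Raft cluster with the given number of voters."""
    return voters // 2 + 1


def can_elect_leader(voters: int, reachable: int) -> bool:
    """True when enough voters remain reachable to form a majority."""
    return reachable >= raft_quorum(voters)


# Node state during the outage, per the impact table (True = running).
vault_nodes = {"vault-01": False, "vault-02": True, "vault-03": True}
running = sum(vault_nodes.values())

print(f"quorum for {len(vault_nodes)} voters: {raft_quorum(len(vault_nodes))}")
print(f"leader election possible: {can_elect_leader(len(vault_nodes), running)}")
```

With two of three voters up, the cluster could in principle keep a leader; the outage for Vault clients is then explained by the routing failure rather than by loss of quorum. Losing either remaining node would have dropped the cluster below quorum.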
Business Impact
- All home network services offline
- Single user affected (home lab)
- No data loss expected: VM disks on local NVMe (kvm-01: `/mnt/onboard-ssd/libvirt/images/`, kvm-02: `/var/lib/libvirt/images/`)
- NAS-stored VM images (WLC-02) inaccessible
- Workaround: mobile hotspot for internet, switch console for L2 management
Metadata
| Field | Value |
|---|---|
| Incident ID | INC-2026-04-13-001 |
| Author | Evan Rosado |
| Created | 2026-04-13 |
| Last Updated | 2026-04-13 |
| Status | Resolved (all services operational on both KVMs) |
| Post-Incident Review | After resolution verified |
| RCA Required | Yes (P1, STD-011) |