INC-2026-04-13-001: Investigation
Failure Chain Analysis
This incident involves four interconnected failures. The original trigger was NAS NFS mount degradation, but two escalation decisions — a panic reboot and an untested bridge VLAN change — each converted a partial outage into a wider one.
Failure 1: kvm-01 Emergency Mode
Symptom: After reboot, kvm-01 drops to emergency/rescue mode. Prompts for root password. Reports inability to mount a /mnt path.
Hypothesis: /etc/fstab on kvm-01 contains an NFS mount to the Synology NAS (10.50.1.70) — likely /mnt/nas or /mnt/ssd. The mount entry lacks the nofail option. When the NFS server is unreachable or slow to respond during boot, systemd’s mount unit fails and the system drops to emergency mode.
Evidence:
- kvm-01 was running fine before reboot: the NAS mount was established during the original boot and stayed up
- The panic reboot broke the mount because NFS doesn’t survive a cold restart if the NAS is unreachable at boot time
- The error message referenced /mnt, consistent with NAS mount paths documented in infrastructure notes
- kvm-02 has the same pattern: /mnt/nas/ exists and has subdirectories (backups, isos, k3s, vms) but /mnt/nas/vms/ is empty (stale mount)
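The hypothesis above can be checked mechanically. The following is a minimal sketch, not the tooling used during the incident: the function name nfs_entries_missing_nofail is hypothetical, and it assumes standard six-field fstab lines with the filesystem type in field 3 and the options in field 4.

```shell
# Print NFS entries in an fstab-format file that lack the nofail option.
# $1: file to check (defaults to /etc/fstab). Function name is illustrative.
nfs_entries_missing_nofail() {
    awk '
        /^[[:space:]]*#/ || NF < 4 { next }      # skip comments and short lines
        $3 ~ /^nfs4?$/ {                         # only nfs / nfs4 entries
            n = split($4, opts, ",")
            has = 0
            for (i = 1; i <= n; i++) if (opts[i] == "nofail") has = 1
            if (!has) print NR ": " $0           # report line number and entry
        }
    ' "${1:-/etc/fstab}"
}
```

Run from the emergency shell as `nfs_entries_missing_nofail /etc/fstab`; any line it prints is a candidate for the boot failure.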
Failure 2: kvm-02 Stale NAS Mount
Symptom: /mnt/nas/vms/ directory exists but contains no files. virsh start 9800-WLC-02 fails because /mnt/nas/vms/9800-WLC-02.qcow2 is not found.
Confirmed: NFS mounts were not active — mount | awk '/nfs|nas/' returned only sunrpc (RPC pipefs), no actual NFS mounts. The directory structure (/mnt/nas/vms/, /mnt/nas/isos/, /mnt/nas/backups/) existed as empty local directories — the NFS mounts had silently failed or never mounted after a reboot.
Root cause: fstab used hostname nas-01 instead of IP 10.50.1.70. When DNS (bind-01/bind-02) was unavailable at boot time, NFS mount resolution failed: mount.nfs: Failed to resolve server nas-01: Name or service not known. The _netdev option was present (waits for network) but doesn’t help when the issue is DNS resolution, not network availability.
Evidence:
[evanusmodestus@kvm-02 ~]$ mount | awk '/nfs|nas/'
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
# No actual NFS mounts
[evanusmodestus@kvm-02 ~]$ sudo mount -a 2>&1
mount.nfs: Failed to resolve server nas-01: Name or service not known
mount.nfs: Failed to resolve server nas-01: Name or service not known
mount.nfs: Failed to resolve server nas-01: Name or service not known
# Mounting by IP works immediately
[evanusmodestus@kvm-02 ~]$ sudo mount -t nfs 10.50.1.70:/volume1/vms /mnt/nas/vms
[evanusmodestus@kvm-02 ~]$ ls /mnt/nas/vms
9800-WLC-01.qcow2 9800-WLC-02.qcow2 bind-01.qcow2 home-dc01.qcow2 ...
# Original fstab (broken — uses hostname):
nas-01:/volume1/vms /mnt/nas/vms nfs defaults,_netdev 0 0
# Fixed fstab (IP + nofail):
10.50.1.70:/volume1/vms /mnt/nas/vms nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0
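To find other entries with the same DNS dependency, a check along these lines may help. It is a sketch under stated assumptions: the function name nfs_entries_using_hostname is hypothetical, only IPv4 literals are recognised as addresses, and the server is taken to be everything before the first colon in field 1.

```shell
# Print NFS fstab entries whose server is a hostname rather than an IPv4 address.
# $1: file to check (defaults to /etc/fstab). Function name is illustrative.
nfs_entries_using_hostname() {
    awk '
        /^[[:space:]]*#/ || NF < 3 { next }      # skip comments and short lines
        $3 ~ /^nfs4?$/ {
            server = $1
            sub(/:.*/, "", server)               # keep the part before the first colon
            if (server !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/)
                print NR ": " server             # line number and offending hostname
        }
    ' "${1:-/etc/fstab}"
}
```

Against the original kvm-02 fstab entry this would report nas-01; against the fixed entry it prints nothing.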
Failure 3: Bridge VLAN PVID Mismatch — Host Isolated from VMs
Symptom: kvm-02 host (10.50.1.111) cannot ping any VM on the same bridge (ISE .21, bind-02 .91, vyos-02 .3). VMs can ping each other. Switch can ping kvm-02 host. vyos-02 is VRRP MASTER on all groups and holds VIP 10.50.1.1.
Confirmed root cause: NetworkManager br-mgmt bridge configured with bridge.vlan-default-pvid: 1 and bridge.vlans: 1 pvid untagged. The libvirt VLAN hook sets all VM vnets to PVID 100. Host on VLAN 1, VMs on VLAN 100 — different broadcast domains on the same bridge. VLAN filtering is enabled (/sys/class/net/br-mgmt/bridge/vlan_filtering = 1), so this isolation is enforced.
Evidence:
# Bridge VLAN state — host (br-mgmt) on PVID 1, VMs on PVID 100
[evanusmodestus@kvm-02 ~]$ nmcli connection show br-mgmt | awk '/bridge.vlan/'
bridge.vlan-filtering: yes
bridge.vlan-default-pvid: 1
bridge.vlans: 1 pvid untagged, 10, 20, 30, 40, 100, 110, 120
# VMs reachable from each other (both on VLAN 100)
vyos@vyos-02:~$ ping 10.50.1.21
64 bytes from 10.50.1.21: icmp_seq=1 ttl=64 time=1.56 ms # ISE reachable
# Host cannot reach VMs
[evanusmodestus@kvm-02 ~]$ ping 10.50.1.21
From 10.50.1.111 icmp_seq=1 Destination Host Unreachable # ISE unreachable
Impact: This explains why kvm-02 host couldn’t reach the internet (gateway VIP 10.50.1.1 is on VLAN 100 via vyos-02, but host is on VLAN 1). VMs functioned correctly between themselves — ISE responded to RADIUS from the switch (which reaches VMs through the physical trunk on VLAN 100, bypassing the host’s PVID 1).
This PVID mismatch may be a pre-existing condition that was masked because kvm-02 has a secondary default route via br-wan (192.168.1.254) which provides internet access on a separate interface. The host-to-VM isolation was never noticed because management was done via virsh from the host (which uses libvirt’s internal API, not network), not by pinging VM IPs.
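The per-port PVIDs can also be read from the kernel bridge rather than from NetworkManager. The sketch below parses the text of `bridge vlan show` from stdin; the function name bridge_pvids is hypothetical, and the parser assumes the usual iproute2 layout (port name on the first line of each group, VLAN-only continuation lines, and a "PVID" flag on the PVID line).

```shell
# Read `bridge vlan show` output on stdin and print "<port> <pvid>" per port.
# Function name is illustrative; output layout is an assumption about iproute2.
bridge_pvids() {
    awk '
        NF == 0 { next }                              # skip blank lines
        $1 ~ /^[0-9]+$/  { vlan = $1 }                # continuation line: VLAN id only
        $1 !~ /^[0-9]+$/ { port = $1; vlan = $2 }     # first line of a port group
        /PVID/ { print port, vlan }                   # emit only the PVID line
    '
}
```

Usage on the host would be `bridge vlan show | bridge_pvids`; a br-mgmt line showing 1 next to vnet lines showing 100 is the mismatch described above.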
Failure 4: Self-Inflicted Bridge Outage During Recovery
Symptom: After modifying br-mgmt bridge PVID from 1 to 100 via nmcli connection modify + nmcli device reapply, kvm-02 host became completely unreachable from all networks. SSH dropped. Switch could not ping .111. Required IPMI console access to revert.
Root cause: Changing the bridge PVID affected the physical uplink port (eno8) membership on the bridge. The switch trunk native VLAN has not yet been verified (it may be VLAN 1 or VLAN 100), but the PVID change most likely created a VLAN mismatch between the switch trunk and the bridge, isolating the host from the physical network entirely.
Resolution: Reverted via IPMI console:
sudo nmcli connection modify br-mgmt bridge.vlan-default-pvid 1 \
bridge.vlans "1 pvid untagged, 10, 20, 30, 40, 100, 110, 120"
sudo nmcli device reapply br-mgmt
Lesson: Never modify bridge VLAN settings on a production hypervisor without understanding the full trunk-to-bridge VLAN mapping. The bridge PVID, the physical uplink PVID, and the switch trunk native VLAN must all agree. Changes must be tested in a maintenance window with IPMI console ready — not over SSH.
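One way to reduce the risk of this kind of change is to pair it with an automatic rollback. The following is a generic sketch, not a NetworkManager feature: the function name apply_with_rollback is hypothetical, and the health check (a ping to the gateway VIP) and the apply command in the usage comment are assumptions rather than the exact commands run during the incident.

```shell
# Apply a change, run a health check, and roll back if the check fails.
# $1: apply command, $2: check command, $3: rollback command (each run via sh -c).
# Returns 0 if the change is kept, 1 if apply failed, 2 if the check failed.
apply_with_rollback() {
    awr_apply=$1; awr_check=$2; awr_rollback=$3
    if ! sh -c "$awr_apply"; then
        echo "apply failed; rolling back" >&2
        sh -c "$awr_rollback"
        return 1
    fi
    if sh -c "$awr_check"; then
        echo "check passed; change kept"
        return 0
    fi
    echo "check failed; rolling back" >&2
    sh -c "$awr_rollback"
    return 2
}

# Hypothetical usage for the br-mgmt change (apply command and check are assumed):
#   apply_with_rollback \
#     'nmcli connection modify br-mgmt bridge.vlan-default-pvid 100 && nmcli device reapply br-mgmt' \
#     'ping -c 3 -W 2 10.50.1.1' \
#     'nmcli connection modify br-mgmt bridge.vlan-default-pvid 1 bridge.vlans "1 pvid untagged, 10, 20, 30, 40, 100, 110, 120" && nmcli device reapply br-mgmt'
```

This only helps if the script outlives the SSH session, so it would need to run from the IPMI console or under nohup or tmux. It complements, and does not replace, testing in a maintenance window.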
Initial Triage (2026-04-12)
# DNS — bind-01 unreachable, bind-02 responded
dig +short ipsk-mgr-01.inside.domusdigitalis.dev @10.50.1.90
# communications error to 10.50.1.90#53: host unreachable
dig +short ipsk-mgr-01.inside.domusdigitalis.dev @10.50.1.91
# 10.50.1.30
# iPSK Manager — unreachable
ping 10.50.1.30
# From 10.50.1.106 icmp_seq=1 Destination Host Unreachable
# kvm-01 — down
ssh kvm-01
# ssh: connect to host 10.50.1.110 port 22: No route to host
# kvm-02 — accessible
ssh kvm-02
# Connected. VMs: vyos-02, ise-02, 9800-WLC-02, bind-02, vault-02, vault-03
# WLC-02 — NAS mount broken
sudo virsh start 9800-WLC-02
# error: Cannot access storage file '/mnt/nas/vms/9800-WLC-02.qcow2'
# (as uid:107, gid:107): No such file or directory
# NAS — reachable but mount stale
ping 10.50.1.70 # success
ls /mnt/nas/vms # empty
# Internet from vyos-02 VM — works
ping 8.8.8.8 # success (from inside vyos-02 via virsh console)
# Internet from kvm-02 host — fails
ping 8.8.8.8 # 100% packet loss
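The WLC-02 start failure above came from an empty local directory standing in for the NFS share. A small guard such as the following sketch can make that condition explicit before a VM is started; the function name run_if_mounted is hypothetical, and it relies on the util-linux `mountpoint` command being available on the host.

```shell
# Run a command only if the given directory is an active mountpoint.
# $1: directory to check; remaining arguments: the command to run.
run_if_mounted() {
    rim_dir=$1
    shift
    if mountpoint -q "$rim_dir"; then
        "$@"
    else
        echo "refusing: $rim_dir is not a mountpoint" >&2
        return 1
    fi
}

# Example: run_if_mounted /mnt/nas/vms sudo virsh start 9800-WLC-02
```

With the NAS unmounted, this reports the real problem (no mount) instead of the misleading "No such file or directory" from virsh.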
Diagnostic Commands Needed
The following commands need to be run to confirm root cause and plan remediation. Access path: switch console → kvm-02 SSH → virsh console for VMs.
On kvm-01 (requires emergency mode root shell or IPMI/BMC)
# 1. Check fstab for NAS mounts without nofail
awk '/nfs|nas|cifs/ {print NR": "$0}' /etc/fstab
# 2. Check what mount is failing
systemctl --failed
journalctl -b -p err
# 3. Check if commenting out NAS mount allows normal boot
# (requires editing fstab in emergency mode)
On kvm-02 (via SSH)
# 1. Check NFS mount status
mount | awk '/nfs|nas/'
df -h /mnt/nas/
stat /mnt/nas/vms/
# 2. Check fstab entries
awk '/nfs|nas/ {print NR": "$0}' /etc/fstab
# 3. Try remounting NAS
sudo umount -l /mnt/nas
sudo mount -a
# 4. Check VRRP status from vyos-02
sudo virsh console vyos-02
# Inside vyos-02:
show vrrp
show interfaces
# 5. Check routing on kvm-02 host
ip route show
On Switch Console (Cisco 3560CX-01)
! Check if kvm-01 is sending any L2 traffic
show mac address-table interface Te1/0/1
show mac address-table interface Te1/0/2
! Check ARP for hypervisor IPs
show arp | include 10.50.1.11
! Check VLAN assignments on trunk ports
show interfaces trunk
! Verify gateway reachability
ping 10.50.1.1
ping 10.50.1.110
ping 10.50.1.111
Root Cause (Confirmed)
Four failures in a cascade, each escalating the blast radius:
| # | Failure | Cause | Escalation |
|---|---|---|---|
| 1 | kvm-01 emergency mode | fstab NAS mount uses hostname | Panic reboot of hypervisor killed 8 healthy VMs to investigate 1 (ipsk-mgr-01) |
| 2 | kvm-02 NAS mount empty | Same fstab hostname issue | 3 VMs (bind-02, vault-02, vault-03) shut down overnight for unknown reason, possibly related to NAS dependency or resource exhaustion |
| 3 | Host-to-VM VLAN isolation | NetworkManager | Pre-existing condition, masked because host manages VMs via libvirt API (not network) and has secondary WAN route via |
| 4 | Self-inflicted bridge outage | Attempted to fix PVID mismatch via | Required IPMI console access to revert. Added ~15 minutes to recovery. |
The two escalation decisions are the real story:
- Panic reboot of kvm-01: The original symptom was ipsk-mgr-01 down (1 VM). The correct response was virsh console ipsk-mgr-01 to investigate the guest OS. Instead, rebooting the hypervisor killed 7 healthy VMs and exposed the fstab bug that prevented the system from coming back up.
- Untested bridge VLAN change: The bridge PVID mismatch was a pre-existing issue that didn’t affect operations (VMs worked, host managed via libvirt, external access via physical switch). Changing it during an active incident without consulting the documented bridge design in domus-infra-ops, and without IPMI console ready, caused a second outage.
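Before any hypervisor reboot, the blast radius can be made visible by counting the guests that would go down with it. This is a minimal sketch: the function name count_running_domains is hypothetical, and it expects one domain name per line on stdin, as produced by `virsh list --state-running --name`.

```shell
# Count running libvirt domains from a list of names on stdin (blank lines ignored).
count_running_domains() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Example: sudo virsh list --state-running --name | count_running_domains
```

A non-zero count is a prompt to investigate the single affected guest with virsh console first, as described above, rather than rebooting the host.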
Network Architecture (Reference)
Documented for future incident responders.
Bridge Design (kvm-02)
Physical: eno8 ──── br-mgmt ──── vnet0 (vyos-02 LAN)
│──────── vnet1 (vyos-02 WAN → br-wan)
│──────── vnet2 (ise-02)
│──────── vnet3 (bind-02)
│──────── vnet4-8 (vault-02/03, WLC-02)
Switch: Te1/0/2 ── trunk ── eno8 (VLAN 1,10,20,30,40,100,110,120)
VLAN Assignments
- PVID 100 (MGMT): All VMs via libvirt hook (vyos, ISE, bind, vault, WLC, etc.)
- PVID 1 (default): kvm-02 host br-mgmt interface, which holds 10.50.1.111/24
- Switch trunk native VLAN: Needs verification; may be VLAN 1 or VLAN 100
VRRP State (Confirmed Working)
vyos-02 VRRP: ALL MASTER (DATA, GUEST, IOT, MGMT, SECURITY, SERVICES, VOICE)
VIP 10.50.1.1 held on eth0 (secondary address)
VM-to-VM connectivity: FUNCTIONAL (ISE, bind-02, vault-02/03 all reachable)
Routing (kvm-02 Host)
default via 192.168.1.254 dev br-wan proto dhcp src 192.168.1.226 metric 425
default via 10.50.1.3 dev br-mgmt proto static metric 426
Host internet goes via br-wan, whose default route has the lower metric (425 vs 426) and is therefore preferred. Host-to-VM traffic via br-mgmt fails due to the PVID mismatch; this is the pre-existing bug that needs a proper fix in a maintenance window.
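When several default routes exist, Linux prefers the one with the lowest metric. The sketch below picks that route out of `ip route show` text on stdin; the function name preferred_default_route is hypothetical, and routes without an explicit metric are treated as metric 0.

```shell
# Read `ip route show` output on stdin; print "<dev> <metric>" of the
# default route with the lowest metric (missing metric is treated as 0).
preferred_default_route() {
    awk '
        $1 == "default" {
            dev = ""; m = 0
            for (i = 2; i < NF; i++) {
                if ($i == "dev")    dev = $(i + 1)
                if ($i == "metric") m = $(i + 1) + 0
            }
            if (!found || m < bestm) { best = dev; bestm = m; found = 1 }
        }
        END { if (found) print best, bestm }
    '
}

# Example: ip route show | preferred_default_route
```

For the two routes listed above this selects br-wan with metric 425.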
AAA / RADIUS (Switch)
aaa group server radius ISE-RADIUS
server name ISE-02 # 10.50.1.21 — ACTIVE on kvm-02
server name ISE-01 # 10.50.1.20 — DOWN (kvm-01) — temporarily removed during incident
ip radius source-interface Vlan100
radius server ISE-01 # Config retained but server unreachable
address ipv4 10.50.1.20
timeout 5
retransmit 2
ISE-01 was temporarily removed from the ISE-RADIUS group during the incident to prevent 15-second timeout delays. It must be re-added after kvm-01 recovery.
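The 15-second figure is consistent with the timeout and retransmit values in the configuration above, assuming the switch makes one initial attempt plus each configured retransmit before giving up on an unreachable server. A small sketch of that arithmetic (the function name radius_dead_server_delay is hypothetical):

```shell
# Worst-case wait on an unreachable RADIUS server, in seconds.
# $1: timeout per attempt, $2: retransmit count (initial attempt is added).
radius_dead_server_delay() {
    echo $(( $1 * ($2 + 1) ))
}

# Example with the values above: radius_dead_server_delay 5 2
```

With timeout 5 and retransmit 2 this gives 5 × 3 = 15 seconds per authentication attempt while ISE-01 is listed but down.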