INC-2026-04-13-001: Investigation

Failure Chain Analysis

This incident involves four interconnected failures. The original trigger was NAS NFS mount degradation, but two escalation decisions — a panic reboot and an untested bridge VLAN change — each converted a partial outage into a wider one.

Failure 1: kvm-01 Emergency Mode

Symptom: After reboot, kvm-01 drops to emergency/rescue mode. Prompts for root password. Reports inability to mount a /mnt path.

Hypothesis: /etc/fstab on kvm-01 contains an NFS mount to the Synology NAS (10.50.1.70) — likely /mnt/nas or /mnt/ssd. The mount entry lacks the nofail option. When the NFS server is unreachable or slow to respond during boot, systemd’s mount unit fails and the system drops to emergency mode.

Evidence:

kvm-01 was running fine before reboot — the NAS mount was established during the original boot and stayed up
The panic reboot broke the mount because NFS doesn’t survive a cold restart if the NAS is unreachable at boot time
The error message referenced /mnt — consistent with NAS mount paths documented in infrastructure notes
kvm-02 has the same pattern: /mnt/nas/ exists and has subdirectories (backups, isos, k3s, vms) but /mnt/nas/vms/ is empty — stale mount

Failure 2: kvm-02 Stale NAS Mount

Symptom: /mnt/nas/vms/ directory exists but contains no files. virsh start 9800-WLC-02 fails because /mnt/nas/vms/9800-WLC-02.qcow2 is not found.

Confirmed: NFS mounts were not active — mount | awk '/nfs|nas/' returned only sunrpc (RPC pipefs), no actual NFS mounts. The directory structure (/mnt/nas/vms/, /mnt/nas/isos/, /mnt/nas/backups/) existed as empty local directories — the NFS mounts had silently failed or never mounted after a reboot.

Root cause: fstab used hostname nas-01 instead of IP 10.50.1.70. When DNS (bind-01/bind-02) was unavailable at boot time, NFS mount resolution failed: mount.nfs: Failed to resolve server nas-01: Name or service not known. The _netdev option was present (waits for network) but doesn’t help when the issue is DNS resolution, not network availability.

Evidence:

[evanusmodestus@kvm-02 ~]$ mount | awk '/nfs|nas/'
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
# No actual NFS mounts

[evanusmodestus@kvm-02 ~]$ sudo mount -a 2>&1
mount.nfs: Failed to resolve server nas-01: Name or service not known
mount.nfs: Failed to resolve server nas-01: Name or service not known
mount.nfs: Failed to resolve server nas-01: Name or service not known

# Mounting by IP works immediately
[evanusmodestus@kvm-02 ~]$ sudo mount -t nfs 10.50.1.70:/volume1/vms /mnt/nas/vms
[evanusmodestus@kvm-02 ~]$ ls /mnt/nas/vms
9800-WLC-01.qcow2  9800-WLC-02.qcow2  bind-01.qcow2  home-dc01.qcow2  ...

# Original fstab (broken — uses hostname):
nas-01:/volume1/vms      /mnt/nas/vms     nfs  defaults,_netdev  0 0

# Fixed fstab (IP + nofail):
10.50.1.70:/volume1/vms /mnt/nas/vms nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0

Failure 3: Bridge VLAN PVID Mismatch — Host Isolated from VMs

Symptom: kvm-02 host (10.50.1.111) cannot ping any VM on the same bridge (ISE .21, bind-02 .91, vyos-02 .3). VMs can ping each other. Switch can ping kvm-02 host. vyos-02 is VRRP MASTER on all groups and holds VIP 10.50.1.1.

Confirmed root cause: NetworkManager br-mgmt bridge configured with bridge.vlan-default-pvid: 1 and bridge.vlans: 1 pvid untagged. The libvirt VLAN hook sets all VM vnets to PVID 100. Host on VLAN 1, VMs on VLAN 100 — different broadcast domains on the same bridge. VLAN filtering is enabled (/sys/class/net/br-mgmt/bridge/vlan_filtering = 1), so this isolation is enforced.

Evidence:

# Bridge VLAN state — host (br-mgmt) on PVID 1, VMs on PVID 100
[evanusmodestus@kvm-02 ~]$ nmcli connection show br-mgmt | awk '/bridge.vlan/'
bridge.vlan-filtering:                  yes
bridge.vlan-default-pvid:               1
bridge.vlans:                           1 pvid untagged, 10, 20, 30, 40, 100, 110, 120

# VMs reachable from each other (both on VLAN 100)
vyos@vyos-02:~$ ping 10.50.1.21
64 bytes from 10.50.1.21: icmp_seq=1 ttl=64 time=1.56 ms  # ISE reachable

# Host cannot reach VMs
[evanusmodestus@kvm-02 ~]$ ping 10.50.1.21
From 10.50.1.111 icmp_seq=1 Destination Host Unreachable   # ISE unreachable

Impact: This explains why kvm-02 host couldn’t reach the internet (gateway VIP 10.50.1.1 is on VLAN 100 via vyos-02, but host is on VLAN 1). VMs functioned correctly between themselves — ISE responded to RADIUS from the switch (which reaches VMs through the physical trunk on VLAN 100, bypassing the host’s PVID 1).

This PVID mismatch may be a pre-existing condition that was masked because kvm-02 has a secondary default route via br-wan (192.168.1.254) which provides internet access on a separate interface. The host-to-VM isolation was never noticed because management was done via virsh from the host (which uses libvirt’s internal API, not network), not by pinging VM IPs.

Failure 4: Self-Inflicted Bridge Outage During Recovery

Symptom: After modifying br-mgmt bridge PVID from 1 to 100 via nmcli connection modify + nmcli device reapply, kvm-02 host became completely unreachable from all networks. SSH dropped. Switch could not ping .111. Required IPMI console access to revert.

Root cause: Changing the bridge PVID affected the physical uplink port (eno8) membership on the bridge. The switch trunk port expects untagged traffic on VLAN 100 (native VLAN), but the bridge PVID change likely caused a VLAN mismatch between the switch trunk and the bridge, isolating the host from the physical network entirely.

Resolution: Reverted via IPMI console:

sudo nmcli connection modify br-mgmt bridge.vlan-default-pvid 1 \
  bridge.vlans "1 pvid untagged, 10, 20, 30, 40, 100, 110, 120"
sudo nmcli device reapply br-mgmt

Lesson: Never modify bridge VLAN settings on a production hypervisor without understanding the full trunk-to-bridge VLAN mapping. The bridge PVID, the physical uplink PVID, and the switch trunk native VLAN must all agree. Changes must be tested in a maintenance window with IPMI console ready — not over SSH.

Initial Triage (2026-04-12)

# DNS — bind-01 unreachable, bind-02 responded
dig +short ipsk-mgr-01.inside.domusdigitalis.dev @10.50.1.90
# communications error to 10.50.1.90#53: host unreachable

dig +short ipsk-mgr-01.inside.domusdigitalis.dev @10.50.1.91
# 10.50.1.30

# iPSK Manager — unreachable
ping 10.50.1.30
# From 10.50.1.106 icmp_seq=1 Destination Host Unreachable

# kvm-01 — down
ssh kvm-01
# ssh: connect to host 10.50.1.110 port 22: No route to host

# kvm-02 — accessible
ssh kvm-02
# Connected. VMs: vyos-02, ise-02, 9800-WLC-02, bind-02, vault-02, vault-03

# WLC-02 — NAS mount broken
sudo virsh start 9800-WLC-02
# error: Cannot access storage file '/mnt/nas/vms/9800-WLC-02.qcow2'
# (as uid:107, gid:107): No such file or directory

# NAS — reachable but mount stale
ping 10.50.1.70    # success
ls /mnt/nas/vms    # empty

# Internet from vyos-02 VM — works
ping 8.8.8.8       # success (from inside vyos-02 via virsh console)

# Internet from kvm-02 host — fails
ping 8.8.8.8       # 100% packet loss

Diagnostic Commands Needed

The following commands need to be run to confirm root cause and plan remediation. Access path: switch console → kvm-02 SSH → virsh console for VMs.

On kvm-01 (requires emergency mode root shell or IPMI/BMC)

# 1. Check fstab for NAS mounts without nofail
awk '/nfs|nas|cifs/ {print NR": "$0}' /etc/fstab

# 2. Check what mount is failing
systemctl --failed
journalctl -b -p err

# 3. Check if commenting out NAS mount allows normal boot
# (requires editing fstab in emergency mode)

On kvm-02 (via SSH)

# 1. Check NFS mount status
mount | awk '/nfs|nas/'
df -h /mnt/nas/
stat /mnt/nas/vms/

# 2. Check fstab entries
awk '/nfs|nas/ {print NR": "$0}' /etc/fstab

# 3. Try remounting NAS
sudo umount -l /mnt/nas
sudo mount -a

# 4. Check VRRP status from vyos-02
sudo virsh console vyos-02
# Inside vyos-02:
show vrrp
show interfaces

# 5. Check routing on kvm-02 host
ip route show

On Switch Console (Cisco 3560CX-01)

! Check if kvm-01 is sending any L2 traffic
show mac address-table interface Te1/0/1
show mac address-table interface Te1/0/2

! Check ARP for hypervisor IPs
show arp | include 10.50.1.11

! Check VLAN assignments on trunk ports
show interfaces trunk

! Verify gateway reachability
ping 10.50.1.1
ping 10.50.1.110
ping 10.50.1.111

Root Cause (Confirmed)

Four failures in a cascade, each escalating the blast radius:

# Failure Cause Escalation

#	Failure	Cause	Escalation
1	kvm-01 emergency mode	fstab NAS mount uses hostname `nas-01` without `nofail` — DNS unavailable at boot, mount fails, systemd drops to emergency mode	Panic reboot of hypervisor killed 8 healthy VMs to investigate 1 (ipsk-mgr-01)
2	kvm-02 NAS mount empty	Same fstab hostname issue — `nas-01` unresolvable without DNS. NFS mounts never activated. WLC-02 qcow2 on NAS unreachable.	3 VMs (bind-02, vault-02, vault-03) shut down overnight for unknown reason — possibly related to NAS dependency or resource exhaustion
3	Host-to-VM VLAN isolation	NetworkManager `br-mgmt` bridge PVID set to 1 (default). Libvirt VLAN hook sets VMs to PVID 100. Host and VMs in different VLANs. VLAN filtering enabled.	Pre-existing condition — masked because host manages VMs via libvirt API (not network) and has secondary WAN route via `br-wan`
4	Self-inflicted bridge outage	Attempted to fix PVID mismatch via `nmcli device reapply br-mgmt` over SSH — change broke physical uplink VLAN mapping, isolating host from all networks	Required IPMI console access to revert. Added ~15 minutes to recovery.

kvm-01 emergency mode

fstab NAS mount uses hostname nas-01 without nofail — DNS unavailable at boot, mount fails, systemd drops to emergency mode

Panic reboot of hypervisor killed 8 healthy VMs to investigate 1 (ipsk-mgr-01)

kvm-02 NAS mount empty

Same fstab hostname issue — nas-01 unresolvable without DNS. NFS mounts never activated. WLC-02 qcow2 on NAS unreachable.

3 VMs (bind-02, vault-02, vault-03) shut down overnight for unknown reason — possibly related to NAS dependency or resource exhaustion

Host-to-VM VLAN isolation

NetworkManager br-mgmt bridge PVID set to 1 (default). Libvirt VLAN hook sets VMs to PVID 100. Host and VMs in different VLANs. VLAN filtering enabled.

Pre-existing condition — masked because host manages VMs via libvirt API (not network) and has secondary WAN route via br-wan

Self-inflicted bridge outage

Attempted to fix PVID mismatch via nmcli device reapply br-mgmt over SSH — change broke physical uplink VLAN mapping, isolating host from all networks

Required IPMI console access to revert. Added ~15 minutes to recovery.

The two escalation decisions are the real story:

Panic reboot of kvm-01 — The original symptom was ipsk-mgr-01 down (1 VM). The correct response was virsh console ipsk-mgr-01 to investigate the guest OS. Instead, rebooting the hypervisor killed 7 healthy VMs and exposed the fstab bug that prevented the system from coming back up.
Untested bridge VLAN change — The bridge PVID mismatch was a pre-existing issue that didn’t affect operations (VMs worked, host managed via libvirt, external access via physical switch). Changing it during an active incident without consulting the documented bridge design in domus-infra-ops, and without IPMI console ready, caused a second outage.

Network Architecture (Reference)

Documented for future incident responders.

Bridge Design (kvm-02)

Physical:  eno8 ──── br-mgmt ──── vnet0 (vyos-02 LAN)
                         │──────── vnet1 (vyos-02 WAN → br-wan)
                         │──────── vnet2 (ise-02)
                         │──────── vnet3 (bind-02)
                         │──────── vnet4-8 (vault-02/03, WLC-02)

Switch:    Te1/0/2 ── trunk ── eno8 (VLAN 1,10,20,30,40,100,110,120)

VLAN Assignments

PVID 100 (MGMT): All VMs via libvirt hook — vyos, ISE, bind, vault, WLC, etc.
PVID 1 (default): kvm-02 host br-mgmt interface — hosts 10.50.1.111/24
Switch trunk native VLAN: Needs verification — may be VLAN 1 or VLAN 100

VRRP State (Confirmed Working)

vyos-02 VRRP: ALL MASTER (DATA, GUEST, IOT, MGMT, SECURITY, SERVICES, VOICE)
VIP 10.50.1.1 held on eth0 (secondary address)
VM-to-VM connectivity: FUNCTIONAL (ISE, bind-02, vault-02/03 all reachable)

Routing (kvm-02 Host)

default via 192.168.1.254 dev br-wan proto dhcp src 192.168.1.226 metric 425
default via 10.50.1.3 dev br-mgmt proto static metric 426

Host internet goes via br-wan (higher priority). Host-to-VM route via br-mgmt fails due to PVID mismatch — this is the pre-existing bug that needs a proper fix in a maintenance window.

AAA / RADIUS (Switch)

aaa group server radius ISE-RADIUS
 server name ISE-02        # 10.50.1.21 — ACTIVE on kvm-02
 server name ISE-01        # 10.50.1.20 — DOWN (kvm-01) — temporarily removed during incident
 ip radius source-interface Vlan100

radius server ISE-01       # Config retained but server unreachable
 address ipv4 10.50.1.20
 timeout 5
 retransmit 2

ISE-01 was temporarily removed from ISE-RADIUS group during incident to prevent 15-second timeout delays. Must be re-added after kvm-01 recovery.