INC-2026-04-13-001: Resolution
Recovery Executed
Phase 1: Workstation Access (Completed ~13:10)
802.1X EAP-TLS failed with the NetworkManager error "Secrets were required, but not provided" (keyring issue). The static NetworkManager profile timed out (port VLAN mismatch), so NetworkManager was bypassed entirely and the address was set by hand:
sudo ip addr add 10.50.1.106/24 dev enp130s0
sudo ip route add default via 10.50.1.1 dev enp130s0
Phase 2: kvm-02 VM Recovery (Completed ~13:18)
SSH to kvm-02 (10.50.1.101). Found bind-02, vault-02, and vault-03 shut off (down overnight, cause unknown). Started all three. 9800-WLC-02 could not start because the NAS mount was empty.
sudo virsh start bind-02
sudo virsh start vault-02
sudo virsh start vault-03
Phase 3: NAS Mount Restoration (Completed ~13:22)
mount -a failed: fstab uses the hostname nas-01, which cannot resolve while DNS is down. Mounted by IP instead:
# mount -a fails — DNS circular dependency
sudo mount -a 2>&1
# mount.nfs: Failed to resolve server nas-01: Name or service not known
# Mount by IP — works
sudo mount -t nfs 10.50.1.70:/volume1/vms /mnt/nas/vms
sudo mount -t nfs 10.50.1.70:/volume1/isos /mnt/nas/isos
sudo mount -t nfs 10.50.1.70:/volume1/backups /mnt/nas/backups
# WLC-02 now startable
sudo virsh start 9800-WLC-02
Phase 4: fstab Hardening (Completed ~13:22)
Fixed all three NAS mount entries — IP instead of hostname, added nofail:
sudo cp /etc/fstab /root/fstab.backup.2026-04-13
# Before (broken):
# nas-01:/volume1/vms /mnt/nas/vms nfs defaults,_netdev 0 0
# After (fixed):
# 10.50.1.70:/volume1/vms /mnt/nas/vms nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0
sudo systemctl daemon-reload
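The two weaknesses fixed here (a server given by hostname, and no nofail) can be checked mechanically before the next reboot. The following is a minimal sketch, assuming a helper named audit_nfs_fstab (a hypothetical name, not part of the runbook) that reads fstab-formatted text on stdin and only inspects nfs/nfs4 entries:

```shell
#!/usr/bin/env bash
# Sketch: flag NFS fstab entries that depend on DNS or lack nofail.
# Reads fstab-formatted text on stdin; prints one finding per problem.
audit_nfs_fstab() {
  awk '
    /^[ \t]*#/ || NF < 4 { next }                 # skip comments and short lines
    $3 ~ /^nfs4?$/ {
      split($1, src, ":")                         # src[1] = server part of server:/export
      if (src[1] !~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/)
        print "hostname-server: " $1
      if (("," $4 ",") !~ /,nofail,/)
        print "missing-nofail: " $1
    }
  '
}

# Example run against the before/after entries recorded in this phase.
audit_nfs_fstab <<'EOF'
nas-01:/volume1/vms /mnt/nas/vms nfs defaults,_netdev 0 0
10.50.1.70:/volume1/vms /mnt/nas/vms nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0
EOF
# prints:
# hostname-server: nas-01:/volume1/vms
# missing-nofail: nas-01:/volume1/vms
```

On a host, the same check would be run as `audit_nfs_fstab < /etc/fstab`; no output means no findings.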
Phase 5: Bridge VLAN Fix (Completed ~14:02)
Consulted domus-infra-ops kvm-01-rocky-rebuild.adoc Phase 5 for correct bridge config. Created and ran fix script:
#!/bin/bash
# /tmp/fix-bridge-vlans.sh — run as root
# Remove default VLAN 1
bridge vlan del vid 1 dev eno8
bridge vlan del vid 1 dev br-mgmt self
# Add tagged VLANs
for vid in 10 20 30 40 110 120; do
bridge vlan add vid $vid dev eno8
bridge vlan add vid $vid dev br-mgmt self
done
# Add VLAN 100 as PVID untagged (native VLAN)
bridge vlan add vid 100 dev eno8 pvid untagged
bridge vlan add vid 100 dev br-mgmt self pvid untagged
Made persistent:
sudo nmcli connection modify br-mgmt bridge.vlan-default-pvid 0 \
bridge.vlans "100 pvid untagged, 10, 20, 30, 40, 110, 120"
Self-inflicted outage during first attempt: the PVID was changed via nmcli device reapply over SSH without consulting the runbook, which dropped all connectivity. Recovered by reverting from the IPMI console. The second attempt used the correct commands from the runbook.
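Given the outage on the first attempt, one option is to print the planned commands for review before anything is applied. This is a minimal sketch, assuming a hypothetical helper plan_bridge_vlans that only echoes the same bridge vlan commands used in the fix script and executes nothing:

```shell
#!/usr/bin/env bash
# Sketch: print (not execute) the bridge VLAN commands for one uplink/bridge pair.
# Usage: plan_bridge_vlans <uplink> <bridge> <native_vid> <tagged_vid>...
plan_bridge_vlans() {
  local uplink="$1" bridge="$2" native="$3" vid
  shift 3
  # Drop the default VLAN 1 from both the uplink and the bridge device.
  echo "bridge vlan del vid 1 dev $uplink"
  echo "bridge vlan del vid 1 dev $bridge self"
  # Tagged VLANs.
  for vid in "$@"; do
    echo "bridge vlan add vid $vid dev $uplink"
    echo "bridge vlan add vid $vid dev $bridge self"
  done
  # Native VLAN last, as PVID untagged.
  echo "bridge vlan add vid $native dev $uplink pvid untagged"
  echo "bridge vlan add vid $native dev $bridge self pvid untagged"
}

# Values taken from this report.
plan_bridge_vlans eno8 br-mgmt 100 10 20 30 40 110 120
```

The printed plan can be compared against the runbook and then run from the console, not over SSH.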
Phase 6: vnet VLAN Fix (Completed ~14:02)
Libvirt VLAN hook (/etc/libvirt/hooks/qemu) did not fire on VM restart. journalctl -t "libvirt-hook[vyos-02]" returned no entries even after libvirtd restart. Applied VLAN config manually to all vnets:
for vnet in $(ip link show master br-mgmt 2>/dev/null | awk -F': ' '/vnet/{print $2}'); do
sudo bridge vlan del vid 1 dev $vnet pvid untagged 2>/dev/null
for vid in 10 20 30 40 100 110 120; do
sudo bridge vlan add vid $vid dev $vnet
done
sudo bridge vlan add vid 100 dev $vnet pvid untagged
done
NOTE: This is ephemeral. The vnets will revert to PVID 1 on the next VM restart until the hook issue is resolved.
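The report does not establish why the hook did not fire. A few generic preconditions can be ruled out quickly; the sketch below assumes a hypothetical helper hook_preflight and checks only that the hook file exists, is executable, and starts with a shebang. These are assumptions about common causes, not a diagnosis of this incident:

```shell
#!/usr/bin/env bash
# Sketch: basic preflight for a libvirt hook script.
# Prints one line per problem found and returns non-zero if any were found.
hook_preflight() {
  local hook="$1" rc=0
  if [ ! -f "$hook" ]; then
    echo "missing: $hook"
    return 1
  fi
  if [ ! -x "$hook" ]; then
    echo "not executable: $hook"
    rc=1
  fi
  if [ "$(head -c 2 "$hook")" != '#!' ]; then
    echo "no shebang: $hook"
    rc=1
  fi
  return $rc
}

# Hook path from this report; the check reads the file and changes nothing.
hook_preflight /etc/libvirt/hooks/qemu || true
```

If the preflight is clean, the next step would be to confirm that the hook logs at all, for example with the journalctl query already used above.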
Phase 7: kvm-01 Recovery (Completed ~14:35)
kvm-01 emergency mode had two causes, both in /etc/fstab:
- NAS mounts using hostname nas-01: DNS circular dependency (same as kvm-02)
- /dev/sdb1 mounted as xfs but the filesystem is ext4: the type mismatch caused the "sdb1 invalid superblock" error
# Original fstab entries (broken):
nas-01:/volume1/vms /mnt/nas/vms nfs defaults,_netdev 0 0
nas-01:/volume1/isos /mnt/nas/isos nfs defaults,_netdev 0 0
nas-01:/volume1/backups /mnt/nas/backups nfs defaults,_netdev 0 0
/dev/sdb1 /mnt/onboard-ssd xfs defaults 0 0 # <-- WRONG: ext4 not xfs
# Fix: commented all four lines to boot, then mounted manually:
sudo mount -t ext4 /dev/sdb1 /mnt/onboard-ssd
Permanent fix needed for fstab:
# NAS mounts — same pattern as kvm-02 (IP + nofail):
10.50.1.70:/volume1/vms /mnt/nas/vms nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0
10.50.1.70:/volume1/isos /mnt/nas/isos nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0
10.50.1.70:/volume1/backups /mnt/nas/backups nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0
# sdb1 — correct filesystem type + nofail:
/dev/sdb1 /mnt/onboard-ssd ext4 defaults,nofail 0 0
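The xfs/ext4 mismatch could have been caught before the reboot by comparing each declared type with what is on the device. A minimal sketch follows, assuming two hypothetical helpers: detect_fstype (a thin wrapper around blkid, which typically needs root) and fstab_type_mismatches, which reads fstab text on stdin and checks only entries whose source is a /dev/ path:

```shell
#!/usr/bin/env bash
# Sketch: compare the filesystem type declared in fstab with the detected one.
# detect_fstype is a separate function so it can be swapped out for testing.
detect_fstype() { blkid -o value -s TYPE "$1" 2>/dev/null; }

# Reads fstab text on stdin; entries not starting with /dev/ are skipped
# (this also skips comment lines, UUID= and LABEL= sources, and NFS mounts).
fstab_type_mismatches() {
  local dev mnt declared rest actual
  while read -r dev mnt declared rest; do
    case "$dev" in
      /dev/*) ;;
      *) continue ;;
    esac
    actual="$(detect_fstype "$dev")"
    # Report only when a type was detected and it differs from fstab.
    if [ -n "$actual" ] && [ "$actual" != "$declared" ]; then
      echo "$dev: fstab says $declared, device is $actual"
    fi
  done
}

# Example (run as root on the host):
#   fstab_type_mismatches < /etc/fstab
```

For the original kvm-01 entry, this would be expected to report that /dev/sdb1 is declared as xfs while the device holds ext4.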
Phase 8: Bridge VLAN Persistence (Pending — Both KVMs)
Bridge VLAN config applied via bridge vlan commands is ephemeral — lost on reboot. Must be persisted via nmcli. The libvirt VLAN hook handles vnets on VM start, but the bridge device and physical uplink need nmcli.
Correct persistent config (from kvm-01-rocky-rebuild.adoc Phase 5):
# Make bridge VLAN config persistent via nmcli
sudo nmcli connection modify br-mgmt \
bridge.vlan-filtering yes \
bridge.vlan-default-pvid 0 \
bridge.vlans "100 pvid untagged, 10, 20, 30, 40, 110, 120"
# Verify persistence — should survive reboot
nmcli connection show br-mgmt | awk '/bridge.vlan/'
# Expected output:
# bridge.vlan-filtering: yes
# bridge.vlan-default-pvid: 0
# bridge.vlans: 100 pvid untagged, 10, 20, 30, 40, 110, 120
Runtime fix (ephemeral — from runbook Phase 5):
# Enable VLAN filtering
sudo ip link set br-mgmt type bridge vlan_filtering 1
# Remove default VLAN 1
sudo bridge vlan del vid 1 dev eno8
sudo bridge vlan del vid 1 dev br-mgmt self
# Add tagged VLANs
for vid in 10 20 30 40 110 120; do
sudo bridge vlan add vid $vid dev eno8
sudo bridge vlan add vid $vid dev br-mgmt self
done
# Add VLAN 100 as native (PVID untagged)
sudo bridge vlan add vid 100 dev eno8 pvid untagged
sudo bridge vlan add vid 100 dev br-mgmt self pvid untagged
Expected state after fix:
port vlan-id
eno8 10
20
30
40
100 PVID Egress Untagged
110
120
br-mgmt 10
20
30
40
100 PVID Egress Untagged
110
120
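The expected state above can be checked without reading the table by eye. The sketch below assumes a hypothetical helper pvid_per_port that reads `bridge vlan show` style text on stdin (port name in the first column, continuation lines indented) and prints each port with its PVID:

```shell
#!/usr/bin/env bash
# Sketch: print "<port> <pvid>" for each port in `bridge vlan show` style text.
pvid_per_port() {
  awk '
    $1 == "port" { next }                   # header line
    /^[^ \t]/ { port = $1; vid = $2 }       # line that starts a new port section
    /^[ \t]/  { vid = $1 }                  # continuation line: VLAN id first
    /PVID/    { print port, vid }
  '
}

# Example input shaped like the expected state in this report.
pvid_per_port <<'EOF'
port              vlan-id
eno8              10
                  20
                  100 PVID Egress Untagged
                  110
br-mgmt           10
                  100 PVID Egress Untagged
EOF
# prints:
# eno8 100
# br-mgmt 100
```

On a host, this would be run as `bridge vlan show | pvid_per_port`; any port reporting PVID 1 still carries the default VLAN.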
WARNING: Never apply bridge VLAN changes over SSH; use IPMI/console. Connectivity drops when the PVID changes.
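The console-only rule can be enforced at the top of a fix script so it cannot be run over SSH by accident. This is a minimal sketch, assuming a hypothetical guard named require_console. It relies on sshd setting SSH_CONNECTION (and usually SSH_TTY) in the login session; sudo may strip these variables, so the check belongs before any sudo call:

```shell
#!/usr/bin/env bash
# Sketch: refuse to continue when the current shell was reached over SSH.
require_console() {
  if [ -n "${SSH_CONNECTION:-}" ] || [ -n "${SSH_TTY:-}" ]; then
    echo "refusing: SSH session detected; use IPMI/console" >&2
    return 1
  fi
  return 0
}

# Example: gate the risky step on the guard.
if require_console; then
  echo "console session: safe to apply bridge VLAN changes"
fi
```

The guard does not detect nested cases such as a tmux session started over SSH and later re-attached from the console, so it is a safety net and not a guarantee.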
Phase 9: Te1/0/2 Port Investigation (Completed ~19:45)
Root cause: 386 CRC errors on Te1/0/2 — frames arriving corrupted.
LAB-3560CX-01# show interfaces Te1/0/2 | include errors|CRC|input
420 input errors, 386 CRC, 0 frame, 0 overrun, 0 ignored
SFPs tested good on Te1/0/1 (both KVMs work). Switch port shows connected and STP forwarding. CRC errors indicate bad fiber patch cable or dirty connector on the Te1/0/2 path — link establishes but data is corrupted and dropped.
Resolution: Moved kvm-01 to Te1/0/8 instead of troubleshooting the cable during an active incident.
configure terminal
interface Te1/0/8
description TRUNK-TO-SUPERMICRO-KVM-01
switchport trunk allowed vlan 10,20,30,40,100,110,120
switchport trunk native vlan 100
switchport mode trunk
ip arp inspection trust
spanning-tree portfast edge trunk
ip dhcp snooping trust
end
Pending: Replace fiber patch cable on Te1/0/2 path. Clean SFP connectors. Re-test. If CRC clears, move kvm-01 back to Te1/0/2.
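For the re-test, what matters is whether the CRC counter keeps climbing after the cable is replaced, not its absolute value. A minimal sketch, assuming two hypothetical helpers (crc_of and crc_delta) that work on counter lines pasted from the switch; the second sample line below is made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: extract the CRC counter from a `show interfaces` counter line on stdin.
crc_of() { sed -n 's/.* \([0-9][0-9]*\) CRC.*/\1/p' | head -n 1; }

# Difference in CRC count between two captured counter lines (before, after).
crc_delta() {
  local before after
  before="$(printf '%s\n' "$1" | crc_of)"
  after="$(printf '%s\n' "$2" | crc_of)"
  if [ -z "$before" ] || [ -z "$after" ]; then
    echo "no CRC counter found" >&2
    return 1
  fi
  echo $(( after - before ))
}

# First line is from this report; the second is a made-up later sample.
crc_delta '420 input errors, 386 CRC, 0 frame, 0 overrun, 0 ignored' \
          '431 input errors, 397 CRC, 0 frame, 0 overrun, 0 ignored'
# prints: 11
```

A delta of 0 across a period of real traffic would suggest the path is clean; any positive delta means frames are still arriving corrupted.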
Phase 10: Switch Port Configuration Alignment (Pending)
Both trunk ports should have identical config. Currently Te1/0/1 has a stale switchport access vlan 10 that Te1/0/2 doesn’t:
! Clean up Te1/0/1 (remove stale access VLAN)
configure terminal
interface Te1/0/1
no switchport access vlan 10
end
! Target config for both trunk ports:
interface TenGigabitEthernet1/0/X
description TRUNK-TO-SUPERMICRO-KVM-0X
switchport trunk allowed vlan 10,20,30,40,100,110,120
switchport trunk native vlan 100
switchport mode trunk
ip arp inspection trust
spanning-tree portfast edge trunk
ip dhcp snooping trust
end
Phase 11: Switch RADIUS Cleanup (Pending)
ISE-01 was removed from RADIUS group during incident. Re-add after kvm-01 recovery:
configure terminal
aaa group server radius ISE-RADIUS
server name ISE-01
end
kvm-02 Final State
All 6 VMs: RUNNING (vyos-02, ise-02, bind-02, vault-02, vault-03, 9800-WLC-02)
Bridge: eno8 + br-mgmt = PVID 100, VLANs 10/20/30/40/110/120 (correct)
vnets: PVID 100, VLANs 10/20/30/40/100/110/120 (manual, ephemeral)
NAS: Mounted by IP, fstab fixed with nofail
VRRP: vyos-02 MASTER on all groups, VIP 10.50.1.1 held
Host→VM: Reachable (ping .3, .21, .91 confirmed)
nmcli: Persistent bridge config applied
kvm-01 Recovery (Pending)
kvm-01 remains in emergency mode. Requires physical console or IPMI access.
# From emergency mode root shell:
# 1. Backup fstab
cat /etc/fstab > /root/fstab.backup.2026-04-13
# 2. Fix NAS mounts — same pattern as kvm-02
# Replace nas-01 with 10.50.1.70, add nofail
vi /etc/fstab
# 3. If hostname is used (like kvm-02 was):
# Change: nas-01:/volume1/vms /mnt/nas/vms nfs defaults,_netdev 0 0
# To: 10.50.1.70:/volume1/vms /mnt/nas/vms nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0
# 4. Reload and exit emergency mode
systemctl daemon-reload
exit
# 5. Once booted, verify bridge VLAN state matches runbook Phase 5
bridge vlan show
# 6. Start VMs in dependency order
sudo virsh start vyos-01 # Router — restores VRRP master (priority 200)
sudo virsh start bind-01 # DNS primary
sudo virsh start vault-01 # PKI/SSH CA
sudo virsh start home-dc01 # AD DC
sudo virsh start ipa-01 # FreeIPA
sudo virsh start ipsk-mgr-01 # iPSK Manager
sudo virsh start 9800-WLC-01 # WLC primary
sudo virsh start k3s-master-01 # k3s control plane
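Because the order matters, it may help to stop at the first VM that fails to start so that dependants are not started on top of it. A minimal sketch, assuming a hypothetical helper start_in_order and a hypothetical RUN variable (set RUN=echo to print the commands without running them):

```shell
#!/usr/bin/env bash
# Sketch: start VMs one at a time in the given order, stopping at the first failure.
# RUN defaults to sudo; RUN=echo turns this into a dry run.
start_in_order() {
  local vm
  for vm in "$@"; do
    if ! ${RUN:-sudo} virsh start "$vm"; then
      echo "failed to start: $vm (stopping)" >&2
      return 1
    fi
  done
}

# Dry run with the dependency order listed in this report.
RUN=echo start_in_order vyos-01 bind-01 vault-01 home-dc01 \
  ipa-01 ipsk-mgr-01 9800-WLC-01 k3s-master-01
```

This only checks that virsh start returned success; it does not wait for services inside a guest (for example DNS on bind-01) to be ready before starting the next one.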
Verification Checklist
kvm-02 (Completed)
- All 6 VMs running: sudo virsh list --all
- NAS mount functional: ls /mnt/nas/vms/ shows qcow2 files
- fstab uses IP not hostname, has nofail: awk '/volume1/' /etc/fstab
- Bridge PVID 100 on eno8 and br-mgmt: bridge vlan show
- nmcli persistent config: nmcli c show br-mgmt | awk '/bridge.vlans/'
- Host can ping VMs: ISE (.21), bind-02 (.91), vyos-02 (.3)
- VRRP MASTER on vyos-02: show vrrp shows all groups MASTER
kvm-01 (Completed ~20:00)
- Boots to multi-user.target: fstab entries commented out, sdb1 mounted manually as ext4
- fstab needs permanent fix: uncomment with correct types (ext4 + nofail, NAS IP + nofail)
- Bridge VLAN: needs PVID 100 fix + nmcli persistence (same as kvm-02)
- All 8 VMs running and autostart enabled
- fstab fixed: sed -i '/pattern/c\replacement', with ext4 (not xfs) for sdb1 and IP + nofail for NAS
- Bridge VLAN persistent: nmcli connection modify br-mgmt bridge.vlan-filtering yes bridge.vlan-default-pvid 0 bridge.vlans "100 pvid untagged, 10, 20, 30, 40, 110, 120"
- VM autostart: for vm in vyos-01 bind-01 vault-01 ipa-01 ipsk-mgr-01 9800-WLC-01 k3s-master-01; do sudo virsh autostart $vm; done
- VRRP: vyos-01 resuming MASTER (priority 200)
- DNS HA confirmed: dig +short vault-02.inside.domusdigitalis.dev @10.50.1.90 ✓ and @10.50.1.91 ✓
- iPSK Manager: login page accessible, Domus-IoT WiFi working, IoT clients connected
- Vault: verify vault-01 rejoined Raft cluster
- Libvirt VLAN hook: needs same debug as kvm-02
kvm-02 Final Verification (Completed ~15:00)
All checks passed — typed manually from kvm-02 console.
# Bridge VLAN — PVID 100 on both eno8 and br-mgmt ✓
bridge vlan show dev eno8 | head -10
bridge vlan show dev br-mgmt self
# nmcli persistence — survives reboot ✓
nmcli c s br-mgmt | grep bridge.vlan
# bridge.vlan-filtering: yes
# bridge.vlan-default-pvid: 0
# bridge.vlans: 10, 20, 30, 40, 100 pvid untagged, 110, 120
# All 6 VMs running + autostart enabled ✓
sudo virsh list --all --autostart
# vyos-02, ise-02, bind-02, vault-02, vault-03, 9800-WLC-02 — all autostart
# NAS mount alive ✓
ls /mnt/nas/vms/
# Sweep all critical VM IPs — brace expansion + xargs
printf '%s\n' 10.50.1.{1,21,40,61,62} | xargs -I{} ping -c1 -W1 {}
# VIP .1 ✓ | ISE .21 ✓ | WLC .40 ✓ | Vault .61 ✓ | Vault .62 ✓
# Same sweep — for loop variant
for ip in 10.50.1.{1,21,40,61,62}; do ping -c1 -W1 $ip; done
kvm-01 Persistence Hardening (Completed ~20:30)
# 1. Verify before changing — always check current state first
nmcli c s br-mgmt | grep bridge.vlan
# bridge.vlan-filtering: no ← not set
# bridge.vlan-default-pvid: 1 ← wrong
# bridge.vlans: -- ← empty — no persistence
# 2. fstab fix via sed — replace commented lines with correct entries
sudo sed -i '/volume1\/vms/c\10.50.1.70:/volume1/vms /mnt/nas/vms nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0' /etc/fstab
sudo sed -i '/volume1\/isos/c\10.50.1.70:/volume1/isos /mnt/nas/isos nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0' /etc/fstab
sudo sed -i '/volume1\/backups/c\10.50.1.70:/volume1/backups /mnt/nas/backups nfs defaults,_netdev,nofail,x-systemd.device-timeout=10 0 0' /etc/fstab
sudo sed -i '/sdb1/c\/dev/sdb1 /mnt/onboard-ssd ext4 defaults,nofail 0 0' /etc/fstab
# 3. Verify fstab
awk '/volume1|sdb1/' /etc/fstab
sudo systemctl daemon-reload
# 4. Bridge VLAN persistence
sudo nmcli connection modify br-mgmt bridge.vlan-filtering yes \
bridge.vlan-default-pvid 0 \
bridge.vlans "100 pvid untagged, 10, 20, 30, 40, 110, 120"
nmcli c s br-mgmt | grep bridge.vlan
# 5. VM autostart
for vm in vyos-01 bind-01 vault-01 ipa-01 ipsk-mgr-01 9800-WLC-01 k3s-master-01; do
sudo virsh autostart $vm
done
sudo virsh list --all --autostart
# All 8 VMs: autostart enabled ✓
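The autostart loop above names seven VMs while the result line reports eight, so a comparison against the full start-order list may be worth running. A minimal sketch, assuming a hypothetical helper missing_autostart that takes the output of `virsh list --all --autostart --name` on stdin and the expected VM names as arguments:

```shell
#!/usr/bin/env bash
# Sketch: print every expected VM that is absent from the autostart list on stdin.
missing_autostart() {
  local have vm
  have="$(cat)"
  for vm in "$@"; do
    # -x: whole-line match, -F: fixed string (VM names are not patterns)
    printf '%s\n' "$have" | grep -qxF -- "$vm" || echo "$vm"
  done
}

# Example with a made-up autostart list that lacks one VM.
printf '%s\n' vyos-01 bind-01 vault-01 | missing_autostart vyos-01 bind-01 vault-01 home-dc01
# prints: home-dc01
```

On kvm-01 this would be run as `sudo virsh list --all --autostart --name | missing_autostart vyos-01 bind-01 vault-01 home-dc01 ipa-01 ipsk-mgr-01 9800-WLC-01 k3s-master-01`; empty output means all eight are set to autostart.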
Infrastructure (Pending)
- Workstation on EAP-TLS wired (fix keyring/secrets issue)
- ISE-01 re-added to switch RADIUS group
- Libvirt VLAN hook debugged and fixed on kvm-02
- Investigate why bind-02, vault-02, vault-03 shut down overnight