KVM Operations & Maintenance

Operational runbook for KVM hypervisor management across kvm-01 and kvm-02. Covers VM lifecycle, storage management, troubleshooting, and maintenance procedures.

1. Quick Reference

Task                    Command
List all VMs            sudo virsh list --all
VM state with reason    sudo virsh domstate <vm> --reason
Start/stop/restart      sudo virsh start/shutdown/reboot <vm>
Force stop              sudo virsh destroy <vm>
Console access          sudo virsh console <vm>
Check disk location     sudo virsh domblklist <vm>
Check VM memory/CPU     sudo virsh dominfo <vm>
Change disk path (sed)  sudo virsh dumpxml <vm> | sed 's|old|new|' | sudo virsh define /dev/stdin

2. Storage Architecture

2.1. Current Layout (kvm-01)

Path              Purpose                          Size
/dev/sda2 (root)  OS only - DO NOT store VMs here  14GB
/mnt/onboard-ssd  VM images, ISOs, backups         962GB

Root partition is only 14GB. All VM images MUST be stored on /mnt/onboard-ssd/libvirt/images/. Storing images on root will cause I/O errors when disk fills.
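A quick pre-flight check before creating a new image can catch this early (a sketch; the 90% threshold is arbitrary, and the paths are this runbook's kvm-01 layout):

```shell
# Pre-flight: confirm root headroom and that the image directory is mounted
IMG_DIR=/mnt/onboard-ssd/libvirt/images
root_used=$(df --output=pcent / | tail -1 | tr -dc '0-9')
echo "Root usage: ${root_used}%"
if [ "$root_used" -ge 90 ]; then
  echo "WARNING: root partition nearly full - free space before creating images" >&2
fi
df -h "$IMG_DIR" 2>/dev/null || echo "WARNING: $IMG_DIR not available - is the SSD mounted?" >&2
```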

2.2. Storage Paths

# VM disk images
/mnt/onboard-ssd/libvirt/images/

# ISO files
/mnt/onboard-ssd/libvirt/images/iso/

# Cloud-init ISOs
/mnt/onboard-ssd/libvirt/images/

# Base/template images
/mnt/onboard-ssd/libvirt/images/

2.3. Check Storage Usage

Host filesystem:

df -h | awk 'NR==1 || /G|T/ {print}'

VM images by size:

sudo du -sh /mnt/onboard-ssd/libvirt/images/* | sort -rh

Find which disk a VM uses:

sudo virsh domblklist <vm-name> | awk 'NR>2 && $2 != "-" {print $2}'
# Example:
sudo virsh domblklist ise-01 | awk 'NR>2 && $2 != "-" {print $2}'

3. VM Lifecycle Management

3.1. List VMs with Status

sudo virsh list --all | awk 'NR>2 && NF {printf "%s:", $2; for (i=3; i<=NF; i++) printf " %s", $i; print ""}'

3.2. Start/Stop/Restart

# Graceful shutdown (sends ACPI signal)
sudo virsh shutdown <vm-name>

# Force stop (like pulling power)
sudo virsh destroy <vm-name>

# Start
sudo virsh start <vm-name>

# Reboot
sudo virsh reboot <vm-name>

3.3. Delete VM Completely

This permanently deletes the VM and its disk. Cannot be undone.

Graceful removal (VM must be shut off):

sudo virsh undefine <vm-name> --remove-all-storage

Force removal (running VM):

sudo virsh destroy <vm-name> 2>/dev/null; sudo virsh undefine <vm-name> --remove-all-storage

Verify deletion:

sudo virsh list --all | grep <vm-name>
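For scripted cleanup, a small guard can prevent deleting a VM that is still running (`vm_delete` is a hypothetical helper, not part of virsh):

```shell
# Hypothetical wrapper: refuse to undefine unless the VM is already shut off
vm_delete() {
  local vm="$1"
  if [ "$(sudo virsh domstate "$vm" 2>/dev/null)" != "shut off" ]; then
    echo "Refusing: $vm is not shut off (or does not exist)" >&2
    return 1
  fi
  sudo virsh undefine "$vm" --remove-all-storage
}
# Usage: vm_delete <vm-name>
```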

3.4. Pause/Resume

# Pause (freeze in place)
sudo virsh suspend <vm-name>

# Resume
sudo virsh resume <vm-name>

4. Move VM to Different Storage

When VMs are on wrong storage (NAS, root partition), move them to local SSD.

Critical VMs (ISE, Vault, AD) should be on local SSD, not NAS.

NAS disconnects during VM I/O cause filesystem corruption. Learned from ise-02 incident.

4.1. Discover Storage

List block devices:

lsblk -o NAME,SIZE,TYPE,MOUNTPOINT | grep -v loop

Find disk devices:

sudo fdisk -l | awk '/^Disk \/dev\/[a-z]/ {print}'

Find libvirt image directories:

df -h | awk '/libvirt|images/'

4.2. Storage Paths by Hypervisor

Hypervisor  Local SSD                                  NAS (non-critical only)
kvm-01      /mnt/onboard-ssd/libvirt/images/           /mnt/nas/vms/
kvm-02      /var/lib/libvirt/images/ (1.8TB NVMe LVM)  /mnt/nas/vms/

4.3. Quick One-Liner (Experienced Users)

For VMs already shut down with disk already copied:

sudo virsh dumpxml <vm> | sed 's|/old/path/|/new/path/|' | sudo virsh define /dev/stdin
# Example: vault-01 NAS → local
sudo virsh dumpxml vault-01 | sed 's|/mnt/nas/vms/|/var/lib/libvirt/images/|' | sudo virsh define /dev/stdin

4.4. Move Procedure (sed Workflow)

1. Shut down the VM:

sudo virsh shutdown <vm-name>
# Wait for shutdown
while [ "$(sudo virsh domstate <vm-name>)" != "shut off" ]; do sleep 2; echo "waiting..."; done
echo "VM is off"
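If ACPI shutdown hangs (guest missing acpid, or stuck at a prompt), the bare loop above never exits. A variant with a timeout and force-off fallback (a sketch; `vm_shutdown_wait` is hypothetical and the 120 s limit is arbitrary):

```shell
# Hypothetical helper: graceful shutdown, force off after a timeout
vm_shutdown_wait() {
  local vm="$1" limit="${2:-120}" waited=0
  sudo virsh shutdown "$vm"
  while [ "$(sudo virsh domstate "$vm")" != "shut off" ]; do
    if [ "$waited" -ge "$limit" ]; then
      echo "Timed out after ${limit}s; forcing off $vm" >&2
      sudo virsh destroy "$vm"
      return 1
    fi
    sleep 2
    waited=$((waited + 2))
  done
}
# Usage: vm_shutdown_wait <vm-name> 120
```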

2. Find current disk location:

sudo virsh domblklist <vm-name> | awk 'NR>2 && $2 != "-" {print $2}'

3. Copy disk to local SSD (preserve original until verified):

# On kvm-01:
sudo cp /mnt/nas/vms/<vm-name>.qcow2 /mnt/onboard-ssd/libvirt/images/

# On kvm-02:
sudo cp /mnt/nas/vms/<vm-name>.qcow2 /var/lib/libvirt/images/

4. Export and update VM config with sed:

sudo virsh dumpxml <vm-name> > /tmp/<vm-name>.xml
# Find current path
awk '/source file=.*qcow2/' /tmp/<vm-name>.xml
# Replace NAS path with local SSD (kvm-02 example)
sed -i 's|/mnt/nas/vms/|/var/lib/libvirt/images/|g' /tmp/<vm-name>.xml
# Validate change
awk '/source file=.*qcow2/' /tmp/<vm-name>.xml

5. Redefine and start:

sudo virsh define /tmp/<vm-name>.xml
sudo virsh start <vm-name>

6. Verify VM boots and runs correctly:

sudo virsh domstate <vm-name>
ping -c3 <vm-ip>

7. Remove NAS copy (only after verified working):

# ONLY after VM confirmed working for 24+ hours
sudo rm /mnt/nas/vms/<vm-name>.qcow2

Example: Move ise-02 from NAS to local SSD on kvm-02
# Shutdown
sudo virsh shutdown ise-02

# Wait
while [ "$(sudo virsh domstate ise-02)" != "shut off" ]; do sleep 2; done

# Copy to local SSD (kvm-02: /var/lib/libvirt/images/)
sudo cp /mnt/nas/vms/ise-02.qcow2 /var/lib/libvirt/images/

# Export XML
sudo virsh dumpxml ise-02 > /tmp/ise-02.xml

# Find current path
awk '/source file=.*qcow2/' /tmp/ise-02.xml

# Update path (NAS → local NVMe)
sed -i 's|/mnt/nas/vms/|/var/lib/libvirt/images/|g' /tmp/ise-02.xml

# Verify new path
awk '/source file=.*qcow2/' /tmp/ise-02.xml

# Redefine and start
sudo virsh define /tmp/ise-02.xml
sudo virsh start ise-02

# Verify connectivity
ping -c3 10.50.1.21
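Before the final `rm` of the NAS copy, the two images can be compared byte-for-byte (a sketch; sha256 over a large image is slow, so run it in tmux/screen):

```shell
# Sketch: only remove the NAS original if checksums match
same_image() {
  [ "$(sudo sha256sum "$1" | awk '{print $1}')" = "$(sudo sha256sum "$2" | awk '{print $1}')" ]
}
# Usage:
#   same_image /mnt/nas/vms/ise-02.qcow2 /var/lib/libvirt/images/ise-02.qcow2 \
#     && sudo rm /mnt/nas/vms/ise-02.qcow2
```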

5. Troubleshooting

5.1. VM Paused Due to I/O Error

Symptoms:

sudo virsh domstate <vm-name> --reason
# Output: paused (I/O error)

Root cause: Usually disk full on host.

Diagnosis:

# Check host disk space
df -h /

# Check QEMU logs
sudo tail -20 /var/log/libvirt/qemu/<vm-name>.log

Fix:

  1. Free space on host (see storage section)

  2. Move VM images to larger partition

  3. Resume VM: sudo virsh resume <vm-name>
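Once space is freed, every I/O-paused VM can be resumed in one pass (a sketch; `--state-paused` is a standard `virsh list` filter):

```shell
# Sketch: resume all VMs that paused on an I/O error
resume_io_paused() {
  for vm in $(sudo virsh list --state-paused --name); do
    if sudo virsh domstate "$vm" --reason | grep -q "I/O error"; then
      echo "Resuming $vm"
      sudo virsh resume "$vm"
    fi
  done
}
```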

5.2. VM Unreachable - Orphaned vnet (Not in Bridge)

Symptoms:

  • VM running but can’t ping

  • ARP shows FAILED

  • bridge vlan add returns Operation not supported

Diagnosis:

VNET=$(sudo virsh domiflist <vm-name> | awk 'NR==3 {print $1}')
ip link show $VNET | grep master

If NO master br-mgmt shown, vnet is orphaned.

Fix - Add vnet to bridge:

sudo ip link set $VNET master br-mgmt

Then add VLANs and PVID:

for vid in 10 20 30 40 100 110 120; do sudo bridge vlan add vid $vid dev $VNET; done
sudo bridge vlan add vid 100 dev $VNET pvid untagged
sudo bridge vlan del vid 1 dev $VNET 2>/dev/null

Verify:

bridge vlan show dev $VNET
ping -c2 <vm-ip>

Root cause: This happens when nmcli conn up br-mgmt is run while VMs are running - it kicks all vnets out of the bridge.
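To catch this proactively (for example, right after any `nmcli conn up br-mgmt`), all vnets can be audited in one loop (sketch):

```shell
# Sketch: flag every vnet interface that has no bridge master (orphaned)
orphaned_vnets() {
  for v in $(ls /sys/class/net 2>/dev/null | grep '^vnet'); do
    ip -o link show "$v" | grep -q ' master ' || echo "ORPHANED: $v"
  done
}
```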

5.3. VM Won’t Start - Disk Not Found

Symptoms:

error: Failed to start domain 'vm-name'
error: Cannot access storage file '/path/to/disk.qcow2'

Fix: Update disk path in VM config:

sudo virsh edit <vm-name>
# Correct the <source file='/correct/path/disk.qcow2'/> line

5.4. Console Shows Nothing

Press Enter after connecting - console may need input to refresh.

sudo virsh console <vm-name>
# Press Enter
# To exit: Ctrl+]

5.5. Check Why VM Paused

sudo virsh domstate <vm-name> --reason

Common reasons:

  • paused (I/O error) - Disk full or storage issue

  • paused (user) - Manual suspend

  • paused (watchdog) - Guest OS triggered watchdog
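A loop version that prints state and reason for every defined VM (sketch):

```shell
# Sketch: state and pause reason for all defined VMs
vm_states() {
  for vm in $(sudo virsh list --all --name); do
    printf "%-20s %s\n" "$vm" "$(sudo virsh domstate "$vm" --reason 2>/dev/null)"
  done
}
```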

5.6. View libvirt Logs

# Recent libvirt daemon logs
sudo journalctl -u libvirtd --since "10 minutes ago" --no-pager | tail -30

# VM-specific QEMU logs
sudo tail -50 /var/log/libvirt/qemu/<vm-name>.log

5.7. DNS Resolution Failing (NAS/NFS Mounts)

Symptoms:

sudo mount -t nfs nas-01:/volume1/isos /mnt/nas/isos
# mount.nfs: Failed to resolve server nas-01: Name or service not known

Diagnosis:

cat /etc/resolv.conf
# If only pfSense (10.50.1.1), internal DNS names won't resolve

Fix - Add bind-01 as primary nameserver:

# Overwrite (preferred - clean config)
sudo tee /etc/resolv.conf <<'EOF'
nameserver 10.50.1.90
nameserver 10.50.1.1
EOF
# Or append without clobbering existing
echo "nameserver 10.50.1.90" | sudo tee -a /etc/resolv.conf

Workaround - Use IP directly:

sudo mount -t nfs 10.50.1.70:/volume1/isos /mnt/nas/isos

bind-01 (10.50.1.90) resolves internal .inside.domusdigitalis.dev names. pfSense (10.50.1.1) only forwards external DNS.

6. Maintenance Operations

6.1. Bulk VM Status Check

for vm in $(sudo virsh list --all --name); do
  state=$(sudo virsh domstate "$vm" 2>/dev/null)
  printf "%-20s %s\n" "$vm" "$state"
done

6.2. Find All VMs Using Root Partition

for vm in $(sudo virsh list --all --name); do
  disk=$(sudo virsh domblklist "$vm" 2>/dev/null | awk 'NR>2 && $2 ~ /^\/var\/lib/ {print $2}')
  [ -n "$disk" ] && echo "$vm: $disk"
done

6.3. Calculate Total VM Disk Usage

sudo du -sh /mnt/onboard-ssd/libvirt/images/ | awk '{print "Total VM storage: "$1}'

6.4. Check VM Resource Allocation

sudo virsh dominfo <vm-name> | awk '/Max memory|Used memory|CPU/'

All VMs summary:

printf "%-20s %8s %8s\n" "VM" "Memory" "CPUs"
for vm in $(sudo virsh list --name); do
  mem=$(sudo virsh dominfo "$vm" | awk '/Used memory/{print $3}')
  cpu=$(sudo virsh dominfo "$vm" | awk '/CPU\(s\)/{print $2}')
  printf "%-20s %6sKB %8s\n" "$vm" "$mem" "$cpu"
done

7. VM Migration Between Hypervisors

7.1. Migration from kvm-01 to kvm-02

CRITICAL: kvm-01 uses virbr0 (no VLAN filtering), kvm-02 uses br-mgmt (VLAN filtering).

When migrating VMs from kvm-01 to kvm-02, you MUST fix the vnet PVID or the VM will have no network connectivity. The vnet defaults to PVID 1 but the MGMT network is VLAN 100.

Symptoms of wrong PVID:

# VM is running but can't be pinged
ip neigh | grep <vm-ip>
# Shows: 10.50.1.X dev br-mgmt FAILED

Quick Fix (immediate, non-persistent):

# Find the vnet interface
VNET=$(sudo virsh domiflist <vm-name> | awk '/br-mgmt/ {print $1}')
echo "VM uses: $VNET"

# Check current PVID
bridge vlan show dev $VNET

# Fix PVID: remove PVID 1, add PVID 100
sudo bridge vlan add vid 100 dev $VNET pvid untagged
sudo bridge vlan del vid 1 dev $VNET pvid untagged

# Verify PVID 100
bridge vlan show dev $VNET

Persistent Fix (add VM to libvirt hook):

# Add VM to PVID100_VMS in the hook script
sudo vim /etc/libvirt/hooks/qemu

# Find this line:
# PVID100_VMS="vyos-01 vyos-02 ..."
# Add your VM name to the list

# Restart libvirtd to load changes
sudo systemctl restart libvirtd

7.2. Migration Procedure

1. Copy qcow2 via NFS:

# On kvm-01: Mount NAS if not mounted
sudo mount -t nfs 10.50.1.70:/volume1/vms /mnt/nas-vms

# Copy VM disk
sudo cp /mnt/onboard-ssd/libvirt/images/<vm-name>.qcow2 /mnt/nas-vms/

2. Create VM on kvm-02:

# Mount NAS
sudo mount -t nfs 10.50.1.70:/volume1/vms /mnt/nas/vms

# Import VM
sudo virt-install \
  --name <vm-name> \
  --memory 2048 \
  --vcpus 2 \
  --disk /mnt/nas/vms/<vm-name>.qcow2,bus=virtio \
  --import \
  --os-variant rocky9 \
  --network bridge=br-mgmt,model=virtio \
  --graphics vnc,listen=0.0.0.0 \
  --noautoconsole

3. Fix PVID (CRITICAL):

VNET=$(sudo virsh domiflist <vm-name> | awk '/br-mgmt/ {print $1}')
sudo bridge vlan add vid 100 dev $VNET pvid untagged
sudo bridge vlan del vid 1 dev $VNET pvid untagged

4. Fix VM gateway (if VM was using old gateway):

# Inside VM via Cockpit console
ip route show
# If shows old gateway (e.g., 10.50.1.1 pfSense):
sudo ip route del default
sudo ip route add default via 10.50.1.3  # VyOS

5. Make PVID persistent:

# Add to PVID100_VMS in /etc/libvirt/hooks/qemu
sudo sed -i 's/PVID100_VMS="\([^"]*\)"/PVID100_VMS="\1 <vm-name>"/' /etc/libvirt/hooks/qemu
grep PVID100_VMS /etc/libvirt/hooks/qemu
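The sed above appends unconditionally; run twice, it adds the name twice. An idempotent variant (a sketch; `add_pvid100_vm` is a hypothetical helper):

```shell
# Hypothetical helper: append a VM to PVID100_VMS only if not already listed
add_pvid100_vm() {
  local vm="$1" hook="${2:-/etc/libvirt/hooks/qemu}"
  if ! grep -q "PVID100_VMS=\"[^\"]*\b${vm}\b" "$hook"; then
    sudo sed -i "s/PVID100_VMS=\"\([^\"]*\)\"/PVID100_VMS=\"\1 ${vm}\"/" "$hook"
  fi
}
# Usage: add_pvid100_vm <vm-name>
```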

7.3. Migration from kvm-02 to kvm-01 (via Workstation)

SSH is NOT configured between kvm-01 and kvm-02. All transfers must go through the workstation.

  • kvm-01: 10.50.1.110 (/mnt/onboard-ssd/libvirt/images/)

  • kvm-02: 10.50.1.111 (/var/lib/libvirt/images/)

Use case: Move primary VMs from kvm-02 (NAS-dependent) to kvm-01 (onboard SSD) for resilience.

7.3.1. Standard Migration (Small VMs via Workstation /tmp)

1. On kvm-02 - Shutdown VMs and export:

# Shutdown VMs
for vm in vault-01 bind-01; do
  sudo virsh shutdown $vm
done
# Wait for shutdown, then export XMLs
for vm in vault-01 bind-01; do
  sudo virsh dumpxml $vm > /tmp/$vm.xml
done
# Stage qcow2 files (requires root)
for vm in vault-01 bind-01; do
  sudo cp /var/lib/libvirt/images/$vm.qcow2 /tmp/
  sudo chmod 644 /tmp/$vm.qcow2
done

2. From workstation - Pull files from kvm-02:

# Pull XMLs
scp kvm-02:/tmp/{vault-01,bind-01}.xml /tmp/
# Pull qcow2 files (may take several minutes per VM)
scp kvm-02:/tmp/{vault-01,bind-01}.qcow2 /tmp/

3. From workstation - Push to kvm-01:

# Push XMLs
scp /tmp/{vault-01,bind-01}.xml kvm-01:/tmp/
# Push qcow2 files
scp /tmp/{vault-01,bind-01}.qcow2 kvm-01:/tmp/

4. On kvm-01 - Move files and define VMs:

# Move qcow2 to libvirt images directory
for vm in vault-01 bind-01; do
  sudo mv /tmp/$vm.qcow2 /mnt/onboard-ssd/libvirt/images/
done
# Fix XML paths, define, and start VMs
for vm in vault-01 bind-01; do
  sed -i 's|/var/lib/libvirt/images/|/mnt/onboard-ssd/libvirt/images/|g' /tmp/$vm.xml
  sudo virsh define /tmp/$vm.xml
  sudo virsh start $vm
done

5. Verify:

sudo virsh list --all | grep -E 'vault-01|bind-01'
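Before the kvm-02 cleanup, it is worth confirming each image survived the two-hop copy (a sketch; `verify_copy` is hypothetical and assumes workstation SSH access to both hosts):

```shell
# Hypothetical check: compare a file's sha256 on two hosts
verify_copy() {  # verify_copy <host1> <path1> <host2> <path2>
  local a b
  a=$(ssh "$1" "sudo sha256sum '$2'" | awk '{print $1}')
  b=$(ssh "$3" "sudo sha256sum '$4'" | awk '{print $1}')
  [ -n "$a" ] && [ "$a" = "$b" ]
}
# Usage:
#   verify_copy kvm-02 /var/lib/libvirt/images/vault-01.qcow2 \
#               kvm-01 /mnt/onboard-ssd/libvirt/images/vault-01.qcow2 && echo "vault-01 OK"
```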

6. Cleanup on kvm-02 (after confirmed working):

for vm in vault-01 bind-01; do
  sudo virsh undefine $vm
  sudo rm /var/lib/libvirt/images/$vm.qcow2
  rm /tmp/$vm.xml /tmp/$vm.qcow2
done

7.3.2. Large VM Migration (via NAS Staging)

For VMs larger than workstation /tmp capacity (e.g., home-dc01 ~40GB), use NAS as intermediate storage.

1. On kvm-02 - Copy to NAS:

# Shutdown VM
sudo virsh shutdown home-dc01
# Export XML
sudo virsh dumpxml home-dc01 > /tmp/home-dc01.xml
# Copy qcow2 to NAS (if NAS mounted)
sudo cp /var/lib/libvirt/images/home-dc01.qcow2 /mnt/nas/vms/

2. From workstation - Transfer XML only:

scp kvm-02:/tmp/home-dc01.xml /tmp/
scp /tmp/home-dc01.xml kvm-01:/tmp/

3. On kvm-01 - Copy from NAS and define:

# Mount NAS if needed
sudo mount -t nfs {nas-ip}:/volume1/vms /mnt/nas/vms
# Copy from NAS to onboard SSD
sudo cp /mnt/nas/vms/home-dc01.qcow2 /mnt/onboard-ssd/libvirt/images/
# Fix XML path and define
sed -i 's|/var/lib/libvirt/images/|/mnt/onboard-ssd/libvirt/images/|g' /tmp/home-dc01.xml
sudo virsh define /tmp/home-dc01.xml
sudo virsh start home-dc01

4. Verify and cleanup:

sudo virsh list --all | grep home-dc01
ping -c3 {homedc-ip}

7.4. Import VM from NAS (No Existing XML)

For VMs where only the qcow2 exists (no XML definition), create a new VM definition.

Example: ipa-01 (FreeIPA)

1. Copy qcow2 from NAS:

sudo cp /mnt/nas/vms/ipa-01.qcow2 /mnt/onboard-ssd/libvirt/images/

2. Create VM with virt-install:

sudo virt-install \
  --name ipa-01 \
  --memory 4096 \
  --vcpus 2 \
  --disk /mnt/onboard-ssd/libvirt/images/ipa-01.qcow2,bus=virtio \
  --import \
  --os-variant rocky9 \
  --network bridge=br-mgmt,model=virtio \
  --graphics vnc,listen=0.0.0.0 \
  --noautoconsole

3. Trigger libvirt hook (virt-install doesn’t trigger "started" hook):

sudo virsh destroy ipa-01
sudo virsh start ipa-01

4. Verify PVID and connectivity:

VNET=$(sudo virsh domiflist ipa-01 | awk '/br-mgmt/ {print $1}')
bridge vlan show dev $VNET
ping -c3 {ipa-ip}

5. Fix clock drift (common after VM migration):

ssh ipa-01 "sudo timedatectl set-ntp true && sudo chronyc makestep"

6. Verify services (FreeIPA):

ssh ipa-01 "sudo ipactl status"

8. Resize VM Resources

8.1. Increase VM Memory (RAM)

Safe operation. The VM must be shut down first. Data on NFS/persistent storage is unaffected; on Kubernetes node VMs, pods restart automatically once the node comes back.

1. Check current allocation:

sudo virsh dominfo <vm-name> | grep -E 'Max memory|Used memory'

2. Shut down VM gracefully:

sudo virsh shutdown <vm-name>

# Wait for shutdown (check every 5 seconds)
while [ "$(sudo virsh domstate <vm-name>)" != "shut off" ]; do
  sleep 5
  echo "Waiting for shutdown..."
done
echo "VM is off"

3. Increase memory (example: 8GB):

# Set maximum memory (requires VM off)
sudo virsh setmaxmem <vm-name> 8G --config

# Set current memory
sudo virsh setmem <vm-name> 8G --config

4. Verify configuration:

sudo virsh dominfo <vm-name> | grep -E 'Max memory|Used memory'

5. Start VM:

sudo virsh start <vm-name>

6. Verify inside VM:

ssh <vm-name> "free -h | awk 'NR==2 {print \"Total RAM: \"\$2}'"

Example: Increase k3s-master-01 from 4GB to 8GB
# On kvm-01
sudo virsh shutdown k3s-master-01
# Wait...
sudo virsh setmaxmem k3s-master-01 8G --config
sudo virsh setmem k3s-master-01 8G --config
sudo virsh start k3s-master-01

# Verify
ssh k3s-master-01 "free -h"

8.2. Increase VM CPU

# Shut down first
sudo virsh shutdown <vm-name>

# Set vCPUs (example: 4 cores)
sudo virsh setvcpus <vm-name> 4 --config --maximum
sudo virsh setvcpus <vm-name> 4 --config

# Start
sudo virsh start <vm-name>

8.3. Resize VM Disk

Disk resize is more complex - requires filesystem resize inside VM. See separate runbook.

Increase disk size (VM must be off):

sudo qemu-img resize /mnt/onboard-ssd/libvirt/images/<vm-name>.qcow2 +20G

Then inside VM, extend the partition and filesystem.
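For completeness, the typical in-guest sequence can be sketched (assumptions: an ext4 filesystem directly on partition N of the disk, `growpart` from the cloud-utils package; XFS guests use `xfs_growfs` instead of `resize2fs`):

```shell
# Hypothetical in-guest helper: grow partition N of DEV, then the ext4 filesystem on it
grow_guest_fs() {
  local dev="$1" part="$2"
  sudo growpart "$dev" "$part" && sudo resize2fs "${dev}${part}"
}
# Usage (inside the VM): grow_guest_fs /dev/vda 1
```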

9. ISO/CDROM Management

9.1. Check Attached Media

sudo virsh domblklist <vm-name>
Example output (ISO still attached after install)
Target   Source
------------------------------------------------------------
vda      /mnt/nas/vms/ise-02.qcow2
sda      /mnt/nas/isos/Cisco-ISE-3.5.0.527.SPA.x86_64.iso

9.2. Eject ISO (VM Keeps Booting to Install Menu)

Problem: VM boots to installation menu instead of installed OS.

Cause: ISO still attached from initial install.

# Find which device has the ISO (usually sda or hdc)
sudo virsh domblklist <vm-name>

# Eject the ISO
sudo virsh change-media <vm-name> sda --eject

# Reboot to boot from installed disk
sudo virsh reboot <vm-name>

9.3. Attach ISO (For Recovery or Reinstall)

Use case: ISE password reset, OS recovery, reinstall.

# Attach ISO to existing CDROM device
sudo virsh change-media <vm-name> sda /path/to/image.iso --insert

# Or attach to a VM without CDROM device
sudo virsh attach-disk <vm-name> /path/to/image.iso sda --type cdrom --mode readonly

ISE Password Reset Procedure
  1. Attach ISE ISO: sudo virsh change-media ise-02 sda /mnt/nas/isos/Cisco-ISE-3.5.0.527.SPA.x86_64.iso --insert

  2. Reboot: sudo virsh reboot ise-02

  3. At boot menu, select Option 4: Reset Administrator Password

  4. Follow prompts to reset password

  5. Eject ISO: sudo virsh change-media ise-02 sda --eject

  6. Reboot to normal operation: sudo virsh reboot ise-02

9.4. Persistent CDROM Removal (sed Workflow)

Advanced approach using sed for scripted/repeatable XML editing:

1. Force off and export XML:

sudo virsh destroy <vm-name>
sudo virsh dumpxml <vm-name> > /tmp/<vm-name>.xml

2. Find ISO source line:

awk '/source file.*iso/' /tmp/<vm-name>.xml

3. Remove ISO source line (keeps empty CDROM device):

sed -i '/<source file=.*\.iso/d' /tmp/<vm-name>.xml

4. Validate CDROM block (should have no source):

awk '/disk.*cdrom/,/<\/disk>/' /tmp/<vm-name>.xml
Expected output (no source line)
<disk type='file' device='cdrom'>
  <driver name='qemu' type='raw'/>
  <target dev='sda' bus='sata'/>
  <readonly/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>

5. Redefine and start:

sudo virsh define /tmp/<vm-name>.xml
sudo virsh start <vm-name>

10. kvm-02 Network Architecture

kvm-02 uses Linux bridge VLAN filtering with native VLAN 100 (MGMT). This differs from kvm-01 which uses a simple untagged bridge (virbr0).

10.1. Physical Topology

                    PHYSICAL NETWORK
┌─────────────────────────────────────────────────────────────┐
│  C3560CX Switch                                             │
│  ├── Te1/0/1 (to kvm-02)                                    │
│  │   ├── Native VLAN: 100 (MGMT) ─── untagged traffic       │
│  │   └── Trunk: 20,30,40,100,110,120 ─── tagged traffic     │
│  └── Gi1/0/X (to other devices)                             │
└─────────────────────────────────────────────────────────────┘
                         │
                         │ untagged = VLAN 100
                         │ tagged = VLANs 20,30,40,110,120
                         ▼
┌─────────────────────────────────────────────────────────────┐
│  kvm-02 Host (10.50.1.98)                                   │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ eno8 (physical NIC)                                   │  │
│  │ ├── PVID 100 ─── untagged frames → VLAN 100           │  │
│  │ └── VLANs: 10,20,30,40,100,110,120                    │  │
│  └───────────────────────────────────────────────────────┘  │
│                         │                                    │
│                         ▼                                    │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ br-mgmt (Linux bridge with VLAN filtering)            │  │
│  │ ├── PVID 100 (self) ─── host traffic = VLAN 100       │  │
│  │ └── VLANs: 10,20,30,40,100,110,120                    │  │
│  └───────────────────────────────────────────────────────┘  │
│           │              │              │                    │
│           ▼              ▼              ▼                    │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │ vnet25      │ │ vnet31      │ │ vnetX       │            │
│  │ PVID 100    │ │ PVID 100    │ │ PVID 1      │            │
│  │ VLANs: all  │ │ VLANs: all  │ │ VLANs: all  │            │
│  └──────┬──────┘ └──────┬──────┘ └──────┬──────┘            │
│         │               │               │                    │
│         ▼               ▼               ▼                    │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐            │
│  │ vyos-02     │ │ 9800-WLC-02 │ │ other VM    │            │
│  │ eth0=MGMT   │ │ Vlan100=MGMT│ │ eth0.100    │            │
│  │ (untagged)  │ │ (native)    │ │ (tagged)    │            │
│  └─────────────┘ └─────────────┘ └─────────────┘            │
└─────────────────────────────────────────────────────────────┘

10.2. PVID (Port VLAN ID) Explained

PVID determines which VLAN receives UNTAGGED frames.

Interface          PVID         Effect
eno8               100          Switch native VLAN 100 (untagged) → tagged as VLAN 100 in bridge
br-mgmt (self)     100          Host traffic (10.50.1.98) → VLAN 100
vnet25 (vyos-02)   100          VyOS eth0 untagged → VLAN 100
vnet31 (WLC-02)    100          WLC native VLAN → VLAN 100
vnetX (other)      1 (default)  Uses eth0.100 (tagged), PVID doesn’t matter

10.3. Traffic Flow Examples

Example 1: kvm-02 host pings pfSense (10.50.1.1)

kvm-02 host (10.50.1.98)
    ↓ untagged
br-mgmt (PVID 100 → VLAN 100)
    ↓ VLAN 100
eno8 (PVID 100 → exits untagged)
    ↓ untagged
Switch Te1/0/1 (native 100 → VLAN 100)
    ↓ routed
pfSense (10.50.1.1)

Example 2: Switch pings WLC-02 (10.50.1.41)

Switch (VLAN 100)
    ↓ native VLAN 100 (untagged)
eno8 (PVID 100 → VLAN 100)
    ↓ VLAN 100
br-mgmt (VLAN 100)
    ↓ VLAN 100
vnet31 (PVID 100 → exits untagged)
    ↓ untagged
WLC-02 Vlan100 interface (10.50.1.41)

Example 3: VyOS receives DHCP request from VLAN 40

Client on VLAN 40
    ↓ tagged VLAN 40
Switch Te1/0/1 (trunk)
    ↓ tagged VLAN 40
eno8 (VLAN 40 allowed)
    ↓ VLAN 40
br-mgmt (VLAN 40)
    ↓ VLAN 40
vnet25 (VLAN 40 allowed)
    ↓ tagged VLAN 40
VyOS eth0.40 (DHCP server)

10.4. Key Differences: kvm-01 vs kvm-02

Aspect             kvm-01                        kvm-02
Bridge             virbr0 (simple NAT/untagged)  br-mgmt (VLAN filtering)
Physical NIC       No VLANs on host side         eno8 with VLAN trunk
Switch Connection  Access port or direct         Trunk (Te1/0/1, native 100)
VM VLAN handling   VMs handle VLANs internally   Bridge handles VLAN tagging
PVID Config        Not applicable                PVID 100 required on eno8, br-mgmt, VyOS/WLC vnets
Persistence        None needed                   systemd service + libvirt hook

10.5. Persistence Summary

Component            When Configured  Method
eno8 VLANs + PVID    Boot             bridge-vlan-pvid.service
br-mgmt PVID (self)  Boot             bridge-vlan-pvid.service
vnet interfaces      VM start         /etc/libvirt/hooks/qemu

11. Bridge VLAN Persistence

bridge vlan add commands are NOT persistent!

When VM restarts or host reboots, vnet interfaces are recreated WITHOUT VLAN tags. This causes VLAN-tagged traffic (DHCP, DNS) to fail silently.

11.1. Problem: VLANs vs PVID

Two separate issues require different fixes:

Issue          Symptom                                     Fix
Missing VLANs  Tagged traffic (VLANs 10,20,30...) dropped  bridge vlan add vid X dev vnetN
Wrong PVID     Untagged traffic goes to wrong VLAN         bridge vlan del vid 1 ... pvid + bridge vlan add vid 100 ... pvid

PVID (Port VLAN ID): Tags untagged ingress frames. Default is PVID 1. WLCs send management traffic untagged on native VLAN 100 - if PVID is 1, traffic goes to wrong VLAN.

# Check: Does vnet have required VLANs AND correct PVID?
sudo bridge vlan show dev vnet11

# BAD: PVID on wrong VLAN (1 instead of 100)
port    vlan-id
vnet11  1 PVID Egress Untagged    ← WRONG for WLC
        10
        100                        ← 100 exists but not PVID

# GOOD: PVID on VLAN 100 (for WLC native VLAN 100)
port    vlan-id
vnet11  10
        20
        30
        40
        100 PVID Egress Untagged   ← CORRECT for WLC
        110
        120

11.2. Diagnostic Commands

Find vnet interface for a VM:

sudo virsh domiflist <vm-name> | awk '/br-mgmt/ {print $1}'

Check VLAN config on vnet:

bridge vlan show dev vnet<N>

One-liner: Find VM’s vnet and show its VLANs:

VNET=$(sudo virsh domiflist 9800-WLC-02 | awk '/br-mgmt/ {print $1}') && bridge vlan show dev $VNET

11.3. Solution: Libvirt Hook Script

Libvirt hooks run on VM lifecycle events. Create a qemu hook that configures VLANs AND PVID when VM starts.

# Create hook script
sudo tee /etc/libvirt/hooks/qemu << 'EOF'
#!/bin/bash
# Libvirt QEMU hook - configures VLANs and PVID on br-mgmt vnet interfaces
# CRITICAL: Do NOT use 'virsh' commands here - causes deadlock with libvirtd
#
# CRITICAL: PVID determines which VLAN receives UNTAGGED frames
#
# VMs with eth0 = MGMT (untagged 10.50.1.x) need PVID 100:
#   - VyOS:  eth0 = 10.50.1.2/10.50.1.3 (MGMT untagged), eth0.X = tagged VLANs
#   - WLC:   Vlan100 + native trunk = MGMT untagged, other VLANs tagged
#
# VMs with eth0 = VLAN 100 tagged (eth0.100) can use PVID 1 (default)

GUEST_NAME="$1"
OPERATION="$2"

# VLANs to add to all br-mgmt vnet interfaces
VLANS="10 20 30 40 100 110 120"

# VMs that need PVID 100 (MGMT VLAN for untagged management traffic)
# - VyOS: eth0 = 10.50.1.x (MGMT untagged)
# - WLC: native VLAN 100 (management untagged)
# - ISE: eth0 = 10.50.1.x (MGMT untagged)
# - Any VM with eth0 on 10.50.1.x (MGMT VLAN 100)
PVID100_VMS="vyos-01 vyos-02 9800-WLC-01 9800-WLC-02 ise-01 ise-02 bind-01 home-dc01 keycloak-01 ipsk-manager vault-01"

case "$OPERATION" in
    started)
        # Run in background (&) to avoid blocking libvirtd
        (
            sleep 3  # Wait for interfaces to be fully created

            # Find vnet interfaces attached to br-mgmt
            for vnet in $(ip link show master br-mgmt 2>/dev/null | awk -F': ' '/vnet/{print $2}'); do
                logger -t "libvirt-hook" "$GUEST_NAME: Configuring $vnet"

                # Add all VLANs
                for vid in $VLANS; do
                    bridge vlan add vid "$vid" dev "$vnet" 2>/dev/null
                done

                # Check if this VM needs PVID 100 (VyOS, WLC)
                for vm in $PVID100_VMS; do
                    if [ "$GUEST_NAME" = "$vm" ]; then
                        logger -t "libvirt-hook" "$GUEST_NAME: Setting PVID 100 on $vnet (MGMT native VLAN)"
                        bridge vlan del vid 1 dev "$vnet" pvid untagged 2>/dev/null
                        bridge vlan add vid 100 dev "$vnet" pvid untagged 2>/dev/null
                    fi
                done

                logger -t "libvirt-hook" "$GUEST_NAME: $vnet configuration complete"
            done
        ) &
        ;;
esac
exit 0
EOF
# Make executable
sudo chmod +x /etc/libvirt/hooks/qemu

# Restart libvirtd to load hook
sudo systemctl restart libvirtd

11.4. Verify Hook Works

# Restart WLC-02 and check PVID
sudo virsh shutdown 9800-WLC-02 && sleep 5 && sudo virsh start 9800-WLC-02

# Wait for VM to start, then verify PVID is 100
sleep 10
VNET=$(sudo virsh domiflist 9800-WLC-02 | awk '/br-mgmt/ {print $1}')
bridge vlan show dev $VNET

# Expected: PVID on VLAN 100
port    vlan-id
vnet11  10
        20
        30
        40
        100 PVID Egress Untagged
        110
        120

# Check system log for hook execution
journalctl -t libvirt-hook --since "5 min ago"

11.5. Manual Fix (Non-Persistent)

Add VLANs only (VyOS, general VMs):

for vid in 10 20 30 40 100 110 120; do
  sudo bridge vlan add vid $vid dev vnet<N>
done

Fix PVID for WLC (native VLAN 100):

VNET=$(sudo virsh domiflist 9800-WLC-02 | awk '/br-mgmt/ {print $1}')
sudo bridge vlan del vid 1 dev $VNET pvid untagged
sudo bridge vlan add vid 100 dev $VNET pvid untagged

11.6. Affected VMs (kvm-02)

VM              Bridge   PVID Required                    Hook Action
vyos-02         br-mgmt  PVID 100 (eth0 = MGMT untagged)  Add VLANs + Set PVID 100
9800-WLC-02     br-mgmt  PVID 100 (native VLAN 100)       Add VLANs + Set PVID 100
ise-01, ise-02  br-mgmt  PVID 100 (eth0 = MGMT untagged)  Add VLANs + Set PVID 100
bind-01         br-mgmt  PVID 100 (eth0 = MGMT untagged)  Add VLANs + Set PVID 100
home-dc01       br-mgmt  PVID 100 (eth0 = MGMT untagged)  Add VLANs + Set PVID 100
keycloak-01     br-mgmt  PVID 100 (eth0 = MGMT untagged)  Add VLANs + Set PVID 100
vault-01        br-mgmt  PVID 100 (eth0 = MGMT untagged)  Add VLANs + Set PVID 100
ipsk-manager    br-mgmt  PVID 100 (eth0 = MGMT untagged)  Add VLANs + Set PVID 100
Other VMs       br-mgmt  PVID 1 (default)                 Add VLANs only

Why PVID 100 for most VMs?

VMs on br-mgmt use eth0 for management (10.50.1.x/24). This traffic is UNTAGGED. The bridge PVID tags incoming untagged frames - must be 100 to match MGMT VLAN.

Only VMs that use eth0.100 (tagged VLAN 100) can use PVID 1.

kvm-01 uses virbr0 (untagged bridge) - no VLAN filtering, no hook needed. VMs on kvm-01 (pfSense, WLC-01, etc.) handle VLANs internally.

11.7. Cross-Hypervisor XML Migration (kvm-01 ↔ kvm-02)

When migrating VMs between kvm-01 (Arch) and kvm-02 (RHEL 7), you must fix THREE compatibility issues in the XML:

Issue         kvm-01 (Arch)                     kvm-02 (RHEL 7)
QEMU binary   /usr/bin/qemu-system-x86_64       /usr/libexec/qemu-kvm
Machine type  pc-i440fx-10.1                    pc or pc-i440fx-rhel7.6.0
Disk path     /mnt/onboard-ssd/libvirt/images/  /var/lib/libvirt/images/

Shell prompt injection: If your shell has hooks that output text (like ⚡ No session active), SSH redirects will capture that output into files. Filter it out with grep -v.

11.7.1. Full Procedure: kvm-01 → kvm-02

1. Dump XML from source (filter shell noise):

ssh kvm-01 "sudo virsh dumpxml <vm-name>" | grep -v "session active" | grep -v "^⚡" > /tmp/<vm-name>.xml

2. Verify XML is clean:

head -3 /tmp/<vm-name>.xml
Expected (valid XML)
<domain type='kvm'>
  <name>9800-WLC-01</name>
  <uuid>920adcbd-5510-46f6-a48c-6b7280c82b2e</uuid>

3. Fix all three compatibility issues:

# Fix QEMU binary path
sed -i 's|/usr/bin/qemu-system-x86_64|/usr/libexec/qemu-kvm|' /tmp/<vm-name>.xml
# Fix machine type (check available: ssh kvm-02 "/usr/libexec/qemu-kvm -machine help | grep i440fx")
sed -i "s|machine='pc-i440fx-[^']*'|machine='pc'|" /tmp/<vm-name>.xml
# Fix disk path (NAS or kvm-01 SSD → kvm-02 local)
sed -i 's|/mnt/nas/vms/|/var/lib/libvirt/images/|' /tmp/<vm-name>.xml
sed -i 's|/mnt/onboard-ssd/libvirt/images/|/var/lib/libvirt/images/|' /tmp/<vm-name>.xml

4. Copy XML and disk image to target:

scp /tmp/<vm-name>.xml kvm-02:/tmp/
# Copy disk (if not already on NAS)
ssh kvm-02 "sudo cp /mnt/nas/vms/<vm-name>.qcow2 /var/lib/libvirt/images/"

5. Define and start on target:

ssh -t kvm-02 "sudo virsh define /tmp/<vm-name>.xml && sudo virsh start <vm-name>"

6. Fix PVID (if kvm-02 uses br-mgmt):

ssh kvm-02 "VNET=\$(sudo virsh domiflist <vm-name> | awk '/br-mgmt/ {print \$1}'); sudo bridge vlan add vid 100 dev \$VNET pvid untagged; sudo bridge vlan del vid 1 dev \$VNET pvid untagged 2>/dev/null"

11.7.2. Troubleshooting XML Import

Error                                    Cause                                  Fix

Start tag expected, '<' not found        Shell output captured in XML file      Re-dump with grep -v filter

Cannot check QEMU binary: No such file   Wrong QEMU path for target hypervisor  sed -i 's|old-path|new-path|'

Emulator does not support machine type   QEMU version mismatch                  Use pc or check -machine help

Cannot access storage file               Disk image not copied or wrong path    Copy image, fix <source file=…/> path
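For the storage-file error, the offending path can be pulled straight out of the XML for inspection; a sketch against a sample <source> line rather than live XML:

```shell
# Extract the disk path from a sample <source> element so it can be
# checked against the target host's image directory.
printf '%s\n' "<source file='/var/lib/libvirt/images/demo.qcow2'/>" \
  | sed -n "s|.*<source file='\([^']*\)'.*|\1|p"
# → /var/lib/libvirt/images/demo.qcow2
```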

11.7.3. Chronicle: 2026-03-07 WLC Migration

Problem: 9800-WLC-01 unreachable on kvm-02, vnet orphaned.

Resolution: Fresh XML import from kvm-01:

  1. Dumped XML - corrupted by shell prompt injection (⚡ No session active)

  2. Re-dumped with grep -v filter

  3. Fixed QEMU path: /usr/bin/qemu-system-x86_64 → /usr/libexec/qemu-kvm

  4. Fixed machine type: pc-i440fx-10.1 → pc

  5. Fixed disk path: /mnt/nas/vms/ → /var/lib/libvirt/images/

  6. Defined and started successfully

Key learning: kvm-01 (Arch, rolling) and kvm-02 (RHEL 7) have incompatible QEMU versions. Always transform XML when crossing hypervisors.
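The three transforms from this chronicle can be rolled into one helper; a hypothetical sketch (function name and demo file are illustrative, the sed expressions come from this section):

```shell
# Hypothetical helper: apply the kvm-01 → kvm-02 XML transforms in place.
fix_xml_for_kvm02() {
  sed -i \
    -e 's|/usr/bin/qemu-system-x86_64|/usr/libexec/qemu-kvm|' \
    -e "s|machine='pc-i440fx-[^']*'|machine='pc'|" \
    -e 's|/mnt/nas/vms/|/var/lib/libvirt/images/|' \
    -e 's|/mnt/onboard-ssd/libvirt/images/|/var/lib/libvirt/images/|' \
    "$1"
}

# Demo on a throwaway file:
printf '%s\n' "<emulator>/usr/bin/qemu-system-x86_64</emulator>" > /tmp/demo-vm.xml
fix_xml_for_kvm02 /tmp/demo-vm.xml
cat /tmp/demo-vm.xml
# → <emulator>/usr/libexec/qemu-kvm</emulator>
```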

11.8. Persist br-mgmt and eno8 PVID 100

The libvirt hook handles vnet interfaces when VMs start. But br-mgmt and eno8 themselves need PVID 100 at boot, BEFORE any VMs start.

Why both br-mgmt and eno8?

Physical Network Path:

Switch Te1/0/1 (native VLAN 100)
    ↓ untagged
eno8 (must tag as VLAN 100, not VLAN 1)
    ↓ VLAN 100
br-mgmt (must recognize as VLAN 100)
    ↓ VLAN 100
vnetX (PVID 100 for VMs)
    ↓
VM eth0 (10.50.1.x untagged)

If eno8 or br-mgmt has PVID 1, switch native VLAN 100 traffic gets tagged as VLAN 1, breaking connectivity.
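The mis-tagged case is easy to spot in bridge vlan show output; a sketch matching against a canned sample line (the column layout is assumed from typical iproute2 output, not captured from these hosts):

```shell
# Spot a wrong PVID in (canned) `bridge vlan show dev eno8` output.
printf '%s\n' 'eno8     1 PVID Egress Untagged' \
  | awk '/ 1 PVID/ {print "eno8 PVID is 1 - native VLAN 100 traffic will be mis-tagged"}'
# → eno8 PVID is 1 - native VLAN 100 traffic will be mis-tagged
```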

Create systemd service for bridge PVID persistence
sudo tee /etc/systemd/system/bridge-vlan-pvid.service << 'EOF'
[Unit]
Description=Configure br-mgmt and eno8 PVID 100 for MGMT VLAN
After=network.target NetworkManager.service
Before=libvirtd.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Add VLANs to both interfaces
ExecStart=/usr/sbin/bridge vlan add vid 10 dev eno8
ExecStart=/usr/sbin/bridge vlan add vid 20 dev eno8
ExecStart=/usr/sbin/bridge vlan add vid 30 dev eno8
ExecStart=/usr/sbin/bridge vlan add vid 40 dev eno8
ExecStart=/usr/sbin/bridge vlan add vid 100 dev eno8
ExecStart=/usr/sbin/bridge vlan add vid 110 dev eno8
ExecStart=/usr/sbin/bridge vlan add vid 120 dev eno8
# Set PVID 100 on eno8 (physical interface to switch)
ExecStart=/usr/sbin/bridge vlan add vid 100 dev eno8 pvid untagged
# Set PVID 100 on br-mgmt self (bridge's own interface for host traffic)
ExecStart=/usr/sbin/bridge vlan add vid 100 dev br-mgmt self pvid untagged

[Install]
WantedBy=multi-user.target
EOF
Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable bridge-vlan-pvid.service
sudo systemctl start bridge-vlan-pvid.service
Verify PVID 100 is set
bridge vlan show dev eno8 | awk '/100.*PVID/{print "eno8: PVID 100 ✓"}'
bridge vlan show dev br-mgmt | awk '/100.*PVID/{print "br-mgmt: PVID 100 ✓"}'
Expected output
eno8: PVID 100 ✓
br-mgmt: PVID 100 ✓

Order matters: This service must run:

  1. AFTER network (bridge exists)

  2. BEFORE libvirtd (VMs need correct bridge config)

The Before=libvirtd.service directive ensures the bridge is configured before libvirtd starts any autostarted VMs.

12. Chronicle: 2026-02-21

12.1. Issue: k3s-master-01 Paused

Symptoms:

  - VM repeatedly pausing after resume

  - virsh domstate --reason showed paused (I/O error)

Root Cause:

  - Host root partition (/dev/sda2) was 100% full (14GB total)

  - k3s-master-01.qcow2 was on the root partition and grew to 3.5GB

  - Left only 44MB free, triggering I/O errors

Resolution:

  1. Identified disk full:

    df -h /
    # Showed 0 bytes available
  2. Found VM image on wrong partition:

    sudo virsh domblklist k3s-master-01
    # Showed /var/lib/libvirt/images/k3s-master-01.qcow2
  3. Moved to SSD:

    sudo virsh destroy k3s-master-01
    sudo mv /var/lib/libvirt/images/k3s-master-01.qcow2 /mnt/onboard-ssd/libvirt/images/
    sudo virsh edit k3s-master-01  # Updated disk path
  4. Cleaned up unused VMs (ise-02, home-dc02):

    sudo virsh destroy ise-02; sudo virsh undefine ise-02 --remove-all-storage
    sudo virsh undefine home-dc02 --remove-all-storage

Prevention:

  - ALWAYS create new VMs with disks on /mnt/onboard-ssd/libvirt/images/

  - Monitor the root partition: df -h / | awk 'NR==2 {print $4}'

  - Consider symlinking /var/lib/libvirt/images to the SSD
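The monitoring one-liner can be extended into a threshold alert suitable for cron; a sketch assuming GNU df (the 1 GiB threshold is an arbitrary example, tune to taste):

```shell
# Warn when root free space drops below a threshold (sketch; GNU df assumed).
threshold_kb=1048576   # 1 GiB - example value, not a tested limit
free_kb=$(df --output=avail -k / | tail -1 | tr -d ' ')
if [ "$free_kb" -lt "$threshold_kb" ]; then
  echo "WARNING: / has only ${free_kb}KB free - VM I/O errors imminent"
else
  echo "OK: / has ${free_kb}KB free"
fi
```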