KVM Operations & Maintenance
Operational runbook for KVM hypervisor management across kvm-01 and kvm-02. Covers VM lifecycle, storage management, troubleshooting, and maintenance procedures.
1. Quick Reference
| Task | Command |
|---|---|
| List all VMs | |
| VM state with reason | |
| Start/stop/restart | |
| Force stop | |
| Console access | |
| Check disk location | |
| Check VM memory/CPU | |
| Change disk path (sed) | |
2. Storage Architecture
2.1. Current Layout (kvm-01)
| Path | Purpose | Size |
|---|---|---|
| | OS only - DO NOT store VMs here | 14GB |
| | VM images, ISOs, backups | 962GB |
|
Root partition is only 14GB. All VM images MUST be stored on |
2.2. Storage Paths
# VM disk images
/mnt/onboard-ssd/libvirt/images/
# ISO files
/mnt/onboard-ssd/libvirt/images/iso/
# Cloud-init ISOs
/mnt/onboard-ssd/libvirt/images/
# Base/template images
/mnt/onboard-ssd/libvirt/images/
2.3. Check Storage Usage
Host filesystem:
df -h | awk 'NR==1 || /G|T/ {print}'
VM images by size:
sudo du -sh /mnt/onboard-ssd/libvirt/images/* | sort -rh
Find which disk a VM uses:
sudo virsh domblklist <vm-name> | awk 'NR>2 && $2 != "-" {print $2}'
sudo virsh domblklist ise-01 | awk 'NR>2 && $2 != "-" {print $2}'
3. VM Lifecycle Management
3.2. Start/Stop/Restart
# Graceful shutdown (sends ACPI signal)
sudo virsh shutdown <vm-name>
# Force stop (like pulling power)
sudo virsh destroy <vm-name>
# Start
sudo virsh start <vm-name>
# Reboot
sudo virsh reboot <vm-name>
3.3. Delete VM Completely
This permanently deletes the VM and its disk. This cannot be undone.
Graceful removal (VM must be shut off):
sudo virsh undefine <vm-name> --remove-all-storage
Force removal (running VM):
sudo virsh destroy <vm-name> 2>/dev/null; sudo virsh undefine <vm-name> --remove-all-storage
Verify deletion:
sudo virsh list --all | grep <vm-name>
4. Move VM to Different Storage
When VMs are on the wrong storage (NAS, root partition), move them to the local SSD.
Critical VMs (ISE, Vault, AD) should be on local SSD, not NAS. NAS disconnects during VM I/O cause filesystem corruption (learned from the ise-02 incident).
4.1. Discover Storage
List block devices:
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT | grep -v loop
Find disk devices:
sudo fdisk -l | awk '/^Disk \/dev\/[a-z]/ {print}'
Find libvirt image directories:
df -h | awk '/libvirt|images/'
4.2. Storage Paths by Hypervisor
| Hypervisor | Local SSD | NAS (non-critical only) |
|---|---|---|
| kvm-01 | | |
| kvm-02 | | |
4.3. Quick One-Liner (Experienced Users)
For VMs already shut down with disk already copied:
sudo virsh dumpxml <vm> | sed 's|/old/path/|/new/path/|' | sudo virsh define /dev/stdin
sudo virsh dumpxml vault-01 | sed 's|/mnt/nas/vms/|/var/lib/libvirt/images/|' | sudo virsh define /dev/stdin
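Because the one-liner pipes straight into virsh define, a typo in the sed expression is only noticed afterwards. A minimal sketch of a dry run follows; the function name preview_path_change is hypothetical (not a libvirt tool), and it treats the old path as a sed pattern. It prints only the lines the substitution would change, in their rewritten form.

```shell
# Hypothetical helper: read domain XML on stdin and print only the lines
# that the path substitution would change (already rewritten).
# $1 = old path prefix (used as a sed pattern), $2 = new path prefix
preview_path_change() {
  # -n suppresses normal output; the trailing "p" prints a line only
  # when a substitution actually happened on it
  sed -n "s|$1|$2|gp"
}

# Usage on the hypervisor (not executed here):
#   sudo virsh dumpxml vault-01 | preview_path_change /mnt/nas/vms/ /var/lib/libvirt/images/
```

If the preview prints nothing, the old path did not match and the define step would re-import an unchanged definition.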
4.4. Move Procedure (sed Workflow)
1. Shut down the VM:
sudo virsh shutdown <vm-name>
# Wait for shutdown
while [ "$(sudo virsh domstate <vm-name>)" != "shut off" ]; do sleep 2; echo "waiting..."; done
echo "VM is off"
2. Find current disk location:
sudo virsh domblklist <vm-name> | awk 'NR>2 && $2 != "-" {print $2}'
3. Copy disk to local SSD (preserve original until verified):
# On kvm-01:
sudo cp /mnt/nas/vms/<vm-name>.qcow2 /mnt/onboard-ssd/libvirt/images/
# On kvm-02:
sudo cp /mnt/nas/vms/<vm-name>.qcow2 /var/lib/libvirt/images/
4. Export and update VM config with sed:
sudo virsh dumpxml <vm-name> > /tmp/<vm-name>.xml
# Find current path
awk '/source file=.*qcow2/' /tmp/<vm-name>.xml
# Replace NAS path with local SSD (kvm-02 example)
sed -i 's|/mnt/nas/vms/|/var/lib/libvirt/images/|g' /tmp/<vm-name>.xml
# Validate change
awk '/source file=.*qcow2/' /tmp/<vm-name>.xml
5. Redefine and start:
sudo virsh define /tmp/<vm-name>.xml
sudo virsh start <vm-name>
6. Verify VM boots and runs correctly:
sudo virsh domstate <vm-name>
ping -c3 <vm-ip>
7. Remove NAS copy (only after verified working):
# ONLY after VM confirmed working for 24+ hours
sudo rm /mnt/nas/vms/<vm-name>.qcow2
# Shutdown
sudo virsh shutdown ise-02
# Wait
while [ "$(sudo virsh domstate ise-02)" != "shut off" ]; do sleep 2; done
# Copy to local SSD (kvm-02: /var/lib/libvirt/images/)
sudo cp /mnt/nas/vms/ise-02.qcow2 /var/lib/libvirt/images/
# Export XML
sudo virsh dumpxml ise-02 > /tmp/ise-02.xml
# Find current path
awk '/source file=.*qcow2/' /tmp/ise-02.xml
# Update path (NAS → local NVMe)
sed -i 's|/mnt/nas/vms/|/var/lib/libvirt/images/|g' /tmp/ise-02.xml
# Verify new path
awk '/source file=.*qcow2/' /tmp/ise-02.xml
# Redefine and start
sudo virsh define /tmp/ise-02.xml
sudo virsh start ise-02
# Verify connectivity
ping -c3 10.50.1.21
5. Troubleshooting
5.1. VM Paused Due to I/O Error
Symptoms:
sudo virsh domstate <vm-name> --reason
# Output: paused (I/O error)
Root cause: Usually a full disk on the host.
Diagnosis:
# Check host disk space
df -h /
# Check QEMU logs
sudo tail -20 /var/log/libvirt/qemu/<vm-name>.log
Fix:
- Free space on host (see storage section)
- Move VM images to larger partition
- Resume VM:
sudo virsh resume <vm-name>
5.2. VM Unreachable - Orphaned vnet (Not in Bridge)
Symptoms:
- VM running but can’t ping
- ARP shows FAILED
- bridge vlan add returns Operation not supported
Diagnosis:
VNET=$(sudo virsh domiflist <vm-name> | awk 'NR==3 {print $1}')
ip link show $VNET | grep master
If no "master br-mgmt" is shown in the output, the vnet is orphaned.
Fix - Add vnet to bridge:
sudo ip link set $VNET master br-mgmt
Then add VLANs and PVID:
for vid in 10 20 30 40 100 110 120; do sudo bridge vlan add vid $vid dev $VNET; done
sudo bridge vlan add vid 100 dev $VNET pvid untagged
sudo bridge vlan del vid 1 dev $VNET 2>/dev/null
Verify:
bridge vlan show dev $VNET
ping -c2 <vm-ip>
Root cause: This happens when nmcli conn up br-mgmt is run while VMs are running - it kicks all vnets out of the bridge.
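The orphan check above can be turned into a small parser so it is easy to run against every vnet after a bridge restart. This is a sketch under assumptions: link_master is a hypothetical helper name, and it expects the one-line-per-interface output of ip -o link show on stdin.

```shell
# Hypothetical helper: print the bridge an interface is enslaved to,
# reading `ip -o link show dev <if>` output on stdin.
# Prints nothing when the interface has no master (orphaned vnet).
link_master() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "master") { print $(i + 1); exit } }'
}

# Usage on kvm-02 (not executed here):
#   m=$(ip -o link show dev "$VNET" | link_master)
#   [ "$m" = "br-mgmt" ] || echo "$VNET is orphaned (master: ${m:-none})"
```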
5.3. VM Won’t Start - Disk Not Found
Symptoms:
error: Failed to start domain 'vm-name'
error: Cannot access storage file '/path/to/disk.qcow2'
Fix: Update disk path in VM config:
sudo virsh edit <vm-name>
# Correct the <source file='/correct/path/disk.qcow2'/> line
5.4. Console Shows Nothing
Press Enter after connecting - console may need input to refresh.
sudo virsh console <vm-name>
# Press Enter
# To exit: Ctrl+]
5.5. Check Why VM Paused
sudo virsh domstate <vm-name> --reason
Common reasons:
- paused (I/O error) - Disk full or storage issue
- paused (user) - Manual suspend
- paused (watchdog) - Guest OS triggered watchdog
5.6. View libvirt Logs
# Recent libvirt daemon logs
sudo journalctl -u libvirtd --since "10 minutes ago" --no-pager | tail -30
# VM-specific QEMU logs
sudo tail -50 /var/log/libvirt/qemu/<vm-name>.log
5.7. DNS Resolution Failing (NAS/NFS Mounts)
Symptoms:
sudo mount -t nfs nas-01:/volume1/isos /mnt/nas/isos
# mount.nfs: Failed to resolve server nas-01: Name or service not known
Diagnosis:
cat /etc/resolv.conf
# If only pfSense (10.50.1.1), internal DNS names won't resolve
Fix - Add bind-01 as primary nameserver:
# Overwrite (preferred - clean config)
sudo tee /etc/resolv.conf <<'EOF'
nameserver 10.50.1.90
nameserver 10.50.1.1
EOF
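# Or prepend without clobbering existing entries. The resolver tries nameservers
# in listed order, so an appended entry is only used if the earlier ones do not respond.
sudo sed -i '1i nameserver 10.50.1.90' /etc/resolv.conf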
Workaround - Use IP directly:
sudo mount -t nfs 10.50.1.70:/volume1/isos /mnt/nas/isos
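Since the resolver consults nameservers in the order they are listed, the fix only works when bind-01 ends up first. A quick check, sketched here with a hypothetical helper name (first_nameserver), prints the first nameserver entry of a resolv.conf-style file.

```shell
# Hypothetical helper: print the first "nameserver" entry of a
# resolv.conf-style file (defaults to /etc/resolv.conf).
first_nameserver() {
  awk '$1 == "nameserver" { print $2; exit }' "${1:-/etc/resolv.conf}"
}

# Usage (not executed here):
#   [ "$(first_nameserver)" = "10.50.1.90" ] || echo "bind-01 is not the primary nameserver"
```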
|
bind-01 (10.50.1.90) resolves internal |
6. Maintenance Operations
6.1. Bulk VM Status Check
for vm in $(sudo virsh list --all --name); do
state=$(sudo virsh domstate "$vm" 2>/dev/null)
printf "%-20s %s\n" "$vm" "$state"
done
6.2. Find All VMs Using Root Partition
for vm in $(sudo virsh list --all --name); do
disk=$(sudo virsh domblklist "$vm" 2>/dev/null | awk 'NR>2 && $2 ~ /^\/var\/lib/ {print $2}')
[ -n "$disk" ] && echo "$vm: $disk"
done
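The loop above matches on the /var/lib path prefix, which would misreport a disk if /var/lib/libvirt/images were ever symlinked or mounted elsewhere. A sketch of a filesystem-based check follows; mount_of is a hypothetical helper name, and it assumes mount points without spaces.

```shell
# Hypothetical helper: print the mount point of the filesystem that holds a path.
# df -P guarantees one line per filesystem, so field 6 is the mount point.
mount_of() {
  df -P "$1" | awk 'NR == 2 { print $6 }'
}

# Usage on the hypervisor (not executed here; df may need sudo to reach the image):
#   for vm in $(sudo virsh list --all --name); do
#     for disk in $(sudo virsh domblklist "$vm" | awk 'NR>2 && $2 != "-" {print $2}'); do
#       [ "$(mount_of "$disk")" = "/" ] && echo "$vm: $disk"
#     done
#   done
```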
6.3. Calculate Total VM Disk Usage
sudo du -sh /mnt/onboard-ssd/libvirt/images/ | awk '{print "Total VM storage: "$1}'
6.4. Check VM Resource Allocation
sudo virsh dominfo <vm-name> | awk '/Max memory|Used memory|CPU/'
All VMs summary:
printf "%-20s %8s %8s\n" "VM" "Memory" "CPUs"
for vm in $(sudo virsh list --name); do
mem=$(sudo virsh dominfo "$vm" | awk '/Used memory/{print $3}')
cpu=$(sudo virsh dominfo "$vm" | awk '/CPU\(s\)/{print $2}')
printf "%-20s %6sKiB %8s\n" "$vm" "$mem" "$cpu"
done
7. VM Migration Between Hypervisors
7.1. Migration from kvm-01 to kvm-02
CRITICAL: kvm-01 uses virbr0 (no VLAN filtering), kvm-02 uses br-mgmt (VLAN filtering). When migrating VMs from kvm-01 to kvm-02, you MUST fix the vnet PVID or the VM will have no network connectivity. The vnet defaults to PVID 1 but the MGMT network is VLAN 100.
Symptoms of wrong PVID:
# VM is running but can't be pinged
ip neigh | grep <vm-ip>
# Shows: 10.50.1.X dev br-mgmt FAILED
Quick Fix (immediate, non-persistent):
# Find the vnet interface
VNET=$(sudo virsh domiflist <vm-name> | awk '/br-mgmt/ {print $1}')
echo "VM uses: $VNET"
# Check current PVID
bridge vlan show dev $VNET
# Fix PVID: remove PVID 1, add PVID 100
sudo bridge vlan add vid 100 dev $VNET pvid untagged
sudo bridge vlan del vid 1 dev $VNET pvid untagged
# Verify PVID 100
bridge vlan show dev $VNET
Persistent Fix (add VM to libvirt hook):
# Add VM to PVID100_VMS in the hook script
sudo vim /etc/libvirt/hooks/qemu
# Find this line:
# PVID100_VMS="vyos-01 vyos-02 ..."
# Add your VM name to the list
# Restart libvirtd to load changes
sudo systemctl restart libvirtd
7.2. Migration Procedure
1. Copy qcow2 via NFS:
# On kvm-01: Mount NAS if not mounted
sudo mount -t nfs 10.50.1.70:/volume1/vms /mnt/nas-vms
# Copy VM disk
sudo cp /mnt/onboard-ssd/libvirt/images/<vm-name>.qcow2 /mnt/nas-vms/
2. Create VM on kvm-02:
# Mount NAS
sudo mount -t nfs 10.50.1.70:/volume1/vms /mnt/nas/vms
# Import VM
sudo virt-install \
--name <vm-name> \
--memory 2048 \
--vcpus 2 \
--disk /mnt/nas/vms/<vm-name>.qcow2,bus=virtio \
--import \
--os-variant rocky9 \
--network bridge=br-mgmt,model=virtio \
--graphics vnc,listen=0.0.0.0 \
--noautoconsole
3. Fix PVID (CRITICAL):
VNET=$(sudo virsh domiflist <vm-name> | awk '/br-mgmt/ {print $1}')
sudo bridge vlan add vid 100 dev $VNET pvid untagged
sudo bridge vlan del vid 1 dev $VNET pvid untagged
4. Fix VM gateway (if VM was using old gateway):
# Inside VM via Cockpit console
ip route show
# If shows old gateway (e.g., 10.50.1.1 pfSense):
sudo ip route del default
sudo ip route add default via 10.50.1.3 # VyOS
5. Make PVID persistent:
# Add to PVID100_VMS in /etc/libvirt/hooks/qemu
sudo sed -i 's/PVID100_VMS="\([^"]*\)"/PVID100_VMS="\1 <vm-name>"/' /etc/libvirt/hooks/qemu
grep PVID100_VMS /etc/libvirt/hooks/qemu
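Running the sed in step 5 twice appends the VM name twice. A minimal idempotent variant is sketched below; add_pvid100_vm is a hypothetical helper name, and it assumes the PVID100_VMS="..." assignment sits on a single line at the start of a line, as in the hook script in this runbook. It requires GNU sed for -i.

```shell
# Hypothetical helper: add a VM to PVID100_VMS only if it is not already listed.
# $1 = VM name, $2 = hook file (defaults to /etc/libvirt/hooks/qemu)
add_pvid100_vm() {
  local vm="$1" hook="${2:-/etc/libvirt/hooks/qemu}"
  # Match the name as a whole word inside the quoted list
  if grep -Eq "^PVID100_VMS=\"([^\"]* )?${vm}( [^\"]*)?\"" "$hook"; then
    echo "$vm already listed"
  else
    sed -i "s/^PVID100_VMS=\"\([^\"]*\)\"/PVID100_VMS=\"\1 ${vm}\"/" "$hook"
  fi
}

# Usage (as root, not executed here):
#   add_pvid100_vm <vm-name>
```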
7.3. Migration from kvm-02 to kvm-01 (via Workstation)
SSH is NOT configured between kvm-01 and kvm-02. All transfers must go through the workstation.
Use case: Move primary VMs from kvm-02 (NAS-dependent) to kvm-01 (onboard SSD) for resilience.
7.3.1. Standard Migration (Small VMs via Workstation /tmp)
1. On kvm-02 - Shutdown VMs and export:
# Shutdown VMs
for vm in vault-01 bind-01; do
sudo virsh shutdown $vm
done
# Wait for shutdown, then export XMLs
for vm in vault-01 bind-01; do
while [ "$(sudo virsh domstate $vm)" != "shut off" ]; do sleep 2; done
sudo virsh dumpxml $vm > /tmp/$vm.xml
done
# Stage qcow2 files (requires root)
for vm in vault-01 bind-01; do
sudo cp /var/lib/libvirt/images/$vm.qcow2 /tmp/
sudo chmod 644 /tmp/$vm.qcow2
done
2. From workstation - Pull files from kvm-02:
# Pull XMLs
scp kvm-02:/tmp/{vault-01,bind-01}.xml /tmp/
# Pull qcow2 files (may take several minutes per VM)
scp kvm-02:/tmp/{vault-01,bind-01}.qcow2 /tmp/
3. From workstation - Push to kvm-01:
# Push XMLs
scp /tmp/{vault-01,bind-01}.xml kvm-01:/tmp/
# Push qcow2 files
scp /tmp/{vault-01,bind-01}.qcow2 kvm-01:/tmp/
4. On kvm-01 - Move files and define VMs:
# Move qcow2 to libvirt images directory
for vm in vault-01 bind-01; do
sudo mv /tmp/$vm.qcow2 /mnt/onboard-ssd/libvirt/images/
done
# Fix XML paths, define, and start VMs
for vm in vault-01 bind-01; do
sed -i 's|/var/lib/libvirt/images/|/mnt/onboard-ssd/libvirt/images/|g' /tmp/$vm.xml
sudo virsh define /tmp/$vm.xml
sudo virsh start $vm
done
5. Verify:
sudo virsh list --all | grep -E 'vault-01|bind-01'
6. Cleanup on kvm-02 (after confirmed working):
for vm in vault-01 bind-01; do
sudo virsh undefine $vm
sudo rm /var/lib/libvirt/images/$vm.qcow2
rm /tmp/$vm.xml /tmp/$vm.qcow2
done
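The image crosses two scp hops before the cleanup step deletes the source, so comparing checksums first is a cheap safeguard. The sketch below uses hypothetical helper names (sha_of, same_digest); the commented usage reuses the hostnames and paths from the steps above.

```shell
# Hypothetical helpers: compute a SHA-256 digest and compare two digests.
sha_of() {
  sha256sum "$1" | awk '{ print $1 }'
}

same_digest() {
  # Fails when either digest is empty, so a failed sha256sum is never a "match"
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage from the workstation, before cleanup on kvm-02 (not executed here):
#   src=$(ssh kvm-02 "sudo sha256sum /var/lib/libvirt/images/vault-01.qcow2" | awk '{print $1}')
#   dst=$(ssh kvm-01 "sudo sha256sum /mnt/onboard-ssd/libvirt/images/vault-01.qcow2" | awk '{print $1}')
#   same_digest "$src" "$dst" && echo "vault-01: copy verified"
```

The digest on kvm-01 only matches until the VM has been started there, so take it before step 4 starts the VM, or compare the staged /tmp copies instead.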
7.3.2. Large VM Migration (via NAS Staging)
For VMs larger than workstation /tmp capacity (e.g., home-dc01 ~40GB), use NAS as intermediate storage.
1. On kvm-02 - Copy to NAS:
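# Shutdown VM and wait until it is off before copying the disk
sudo virsh shutdown home-dc01
while [ "$(sudo virsh domstate home-dc01)" != "shut off" ]; do sleep 2; done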
# Export XML
sudo virsh dumpxml home-dc01 > /tmp/home-dc01.xml
# Copy qcow2 to NAS (if NAS mounted)
sudo cp /var/lib/libvirt/images/home-dc01.qcow2 /mnt/nas/vms/
2. From workstation - Transfer XML only:
scp kvm-02:/tmp/home-dc01.xml /tmp/
scp /tmp/home-dc01.xml kvm-01:/tmp/
3. On kvm-01 - Copy from NAS and define:
# Mount NAS if needed
sudo mount -t nfs {nas-ip}:/volume1/vms /mnt/nas/vms
# Copy from NAS to onboard SSD
sudo cp /mnt/nas/vms/home-dc01.qcow2 /mnt/onboard-ssd/libvirt/images/
# Fix XML path and define
sed -i 's|/var/lib/libvirt/images/|/mnt/onboard-ssd/libvirt/images/|g' /tmp/home-dc01.xml
sudo virsh define /tmp/home-dc01.xml
sudo virsh start home-dc01
4. Verify and cleanup:
sudo virsh list --all | grep home-dc01
ping -c3 {homedc-ip}
7.4. Import VM from NAS (No Existing XML)
For VMs where only the qcow2 exists (no XML definition), create a new VM definition.
Example: ipa-01 (FreeIPA)
1. Copy qcow2 from NAS:
sudo cp /mnt/nas/vms/ipa-01.qcow2 /mnt/onboard-ssd/libvirt/images/
2. Create VM with virt-install:
sudo virt-install \
--name ipa-01 \
--memory 4096 \
--vcpus 2 \
--disk /mnt/onboard-ssd/libvirt/images/ipa-01.qcow2,bus=virtio \
--import \
--os-variant rocky9 \
--network bridge=br-mgmt,model=virtio \
--graphics vnc,listen=0.0.0.0 \
--noautoconsole
3. Trigger libvirt hook (virt-install doesn’t trigger "started" hook):
sudo virsh destroy ipa-01
sudo virsh start ipa-01
4. Verify PVID and connectivity:
VNET=$(sudo virsh domiflist ipa-01 | awk '/br-mgmt/ {print $1}')
bridge vlan show dev $VNET
ping -c3 {ipa-ip}
5. Fix clock drift (common after VM migration):
ssh ipa-01 "sudo timedatectl set-ntp true && sudo chronyc makestep"
6. Verify services (FreeIPA):
ssh ipa-01 "sudo ipactl status"
8. Resize VM Resources
8.1. Increase VM Memory (RAM)
Safe operation. VM must be shut down. Data on NFS/persistent storage is unaffected. Pods auto-restart when VM comes back.
1. Check current allocation:
sudo virsh dominfo <vm-name> | grep -E 'Max memory|Used memory'
2. Shut down VM gracefully:
sudo virsh shutdown <vm-name>
# Wait for shutdown (check every 5 seconds)
while [ "$(sudo virsh domstate <vm-name>)" != "shut off" ]; do
sleep 5
echo "Waiting for shutdown..."
done
echo "VM is off"
3. Increase memory (example: 8GB):
# Set maximum memory (requires VM off)
sudo virsh setmaxmem <vm-name> 8G --config
# Set current memory
sudo virsh setmem <vm-name> 8G --config
4. Verify configuration:
sudo virsh dominfo <vm-name> | grep -E 'Max memory|Used memory'
5. Start VM:
sudo virsh start <vm-name>
6. Verify inside VM:
ssh <vm-name> "free -h | awk 'NR==2 {print \"Total RAM: \"\$2}'"
# On kvm-01
sudo virsh shutdown k3s-master-01
# Wait...
sudo virsh setmaxmem k3s-master-01 8G --config
sudo virsh setmem k3s-master-01 8G --config
sudo virsh start k3s-master-01
# Verify
ssh k3s-master-01 "free -h"
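The wait loops in this runbook poll forever if a guest ignores the ACPI shutdown signal. A minimal sketch with a timeout follows. The names wait_shut_off and VIRSH are hypothetical; VIRSH is only there so the loop can be exercised without a hypervisor.

```shell
# Assumption: VIRSH can be overridden; by default it is the command used in this runbook.
VIRSH="${VIRSH:-sudo virsh}"

# Hypothetical helper: wait until a domain reports "shut off".
# $1 = VM name, $2 = timeout in seconds (default 300), $3 = poll interval (default 5)
# Returns 1 if the VM is still not off when the timeout expires.
wait_shut_off() {
  local vm="$1" timeout="${2:-300}" interval="${3:-5}" waited=0
  while [ "$($VIRSH domstate "$vm" 2>/dev/null)" != "shut off" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timeout: $vm not shut off after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Usage (not executed here):
#   sudo virsh shutdown k3s-master-01
#   wait_shut_off k3s-master-01 || echo "consider: sudo virsh destroy k3s-master-01"
```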
9. ISO/CDROM Management
9.1. Check Attached Media
sudo virsh domblklist <vm-name>
Target   Source
------------------------------------------------------------
vda      /mnt/nas/vms/ise-02.qcow2
sda      /mnt/nas/isos/Cisco-ISE-3.5.0.527.SPA.x86_64.iso
9.2. Eject ISO (VM Keeps Booting to Install Menu)
Problem: VM boots to installation menu instead of installed OS.
Cause: ISO still attached from initial install.
# Find which device has the ISO (usually sda or hdc)
sudo virsh domblklist <vm-name>
# Eject the ISO
sudo virsh change-media <vm-name> sda --eject
# Reboot to boot from installed disk
sudo virsh reboot <vm-name>
9.3. Attach ISO (For Recovery or Reinstall)
Use case: ISE password reset, OS recovery, reinstall.
# Attach ISO to existing CDROM device
sudo virsh change-media <vm-name> sda /path/to/image.iso --insert
# Or attach to a VM without CDROM device
sudo virsh attach-disk <vm-name> /path/to/image.iso sda --type cdrom --mode readonly
|
ISE Password Reset Procedure
|
9.4. Persistent CDROM Removal (sed Workflow)
Advanced approach using sed for scripted/repeatable XML editing:
1. Shutdown and export XML:
sudo virsh destroy <vm-name>
sudo virsh dumpxml <vm-name> > /tmp/<vm-name>.xml
2. Find ISO source line:
awk '/source file.*iso/' /tmp/<vm-name>.xml
3. Remove ISO source line (keeps empty CDROM device):
sed -i '/<source file=.*\.iso/d' /tmp/<vm-name>.xml
4. Validate CDROM block (should have no source):
awk '/disk.*cdrom/,/<\/disk>/' /tmp/<vm-name>.xml
<disk type='file' device='cdrom'>
  <driver name='qemu' type='raw'/>
  <target dev='sda' bus='sata'/>
  <readonly/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
5. Redefine and start:
sudo virsh define /tmp/<vm-name>.xml
sudo virsh start <vm-name>
10. kvm-02 Network Architecture
kvm-02 uses Linux bridge VLAN filtering with native VLAN 100 (MGMT). This differs from kvm-01 which uses a simple untagged bridge (virbr0).
10.1. Physical Topology
PHYSICAL NETWORK
┌─────────────────────────────────────────────────────────────┐
│ C3560CX Switch │
│ ├── Te1/0/1 (to kvm-02) │
│ │ ├── Native VLAN: 100 (MGMT) ─── untagged traffic │
│ │ └── Trunk: 20,30,40,100,110,120 ─── tagged traffic │
│ └── Gi1/0/X (to other devices) │
└─────────────────────────────────────────────────────────────┘
│
│ untagged = VLAN 100
│ tagged = VLANs 20,30,40,110,120
▼
┌─────────────────────────────────────────────────────────────┐
│ kvm-02 Host (10.50.1.98) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ eno8 (physical NIC) │ │
│ │ ├── PVID 100 ─── untagged frames → VLAN 100 │ │
│ │ └── VLANs: 10,20,30,40,100,110,120 │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ br-mgmt (Linux bridge with VLAN filtering) │ │
│ │ ├── PVID 100 (self) ─── host traffic = VLAN 100 │ │
│ │ └── VLANs: 10,20,30,40,100,110,120 │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ vnet25 │ │ vnet31 │ │ vnetX │ │
│ │ PVID 100 │ │ PVID 100 │ │ PVID 1 │ │
│ │ VLANs: all │ │ VLANs: all │ │ VLANs: all │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ vyos-02 │ │ 9800-WLC-02 │ │ other VM │ │
│ │ eth0=MGMT │ │ Vlan100=MGMT│ │ eth0.100 │ │
│ │ (untagged) │ │ (native) │ │ (tagged) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
10.2. PVID (Port VLAN ID) Explained
PVID determines which VLAN receives UNTAGGED frames.
| Interface | PVID | Effect |
|---|---|---|
| eno8 | 100 | Switch native VLAN 100 (untagged) → tagged as VLAN 100 in bridge |
| br-mgmt (self) | 100 | Host traffic (10.50.1.98) → VLAN 100 |
| vnet25 (vyos-02) | 100 | VyOS eth0 untagged → VLAN 100 |
| vnet31 (WLC-02) | 100 | WLC native VLAN → VLAN 100 |
| vnetX (other) | 1 (default) | Uses eth0.100 (tagged), PVID doesn’t matter |
10.3. Traffic Flow Examples
Example 1: kvm-02 host pings pfSense (10.50.1.1)
kvm-02 host (10.50.1.98)
↓ untagged
br-mgmt (PVID 100 → VLAN 100)
↓ VLAN 100
eno8 (PVID 100 → exits untagged)
↓ untagged
Switch Te1/0/1 (native 100 → VLAN 100)
↓ routed
pfSense (10.50.1.1)
Example 2: Switch pings WLC-02 (10.50.1.41)
Switch (VLAN 100)
↓ native VLAN 100 (untagged)
eno8 (PVID 100 → VLAN 100)
↓ VLAN 100
br-mgmt (VLAN 100)
↓ VLAN 100
vnet31 (PVID 100 → exits untagged)
↓ untagged
WLC-02 Vlan100 interface (10.50.1.41)
Example 3: VyOS receives DHCP request from VLAN 40
Client on VLAN 40
↓ tagged VLAN 40
Switch Te1/0/1 (trunk)
↓ tagged VLAN 40
eno8 (VLAN 40 allowed)
↓ VLAN 40
br-mgmt (VLAN 40)
↓ VLAN 40
vnet25 (VLAN 40 allowed)
↓ tagged VLAN 40
VyOS eth0.40 (DHCP server)
10.4. Key Differences: kvm-01 vs kvm-02
| Aspect | kvm-01 | kvm-02 |
|---|---|---|
| Bridge | virbr0 (simple NAT/untagged) | br-mgmt (VLAN filtering) |
| Physical NIC | No VLANs on host side | eno8 with VLAN trunk |
| Switch Connection | Access port or direct | Trunk (Te1/0/1, native 100) |
| VM VLAN handling | VMs handle VLANs internally | Bridge handles VLAN tagging |
| PVID Config | Not applicable | PVID 100 required on eno8, br-mgmt, VyOS/WLC vnets |
| Persistence | None needed | systemd service + libvirt hook |
11. Bridge VLAN Persistence
When a VM restarts or the host reboots, vnet interfaces are recreated WITHOUT VLAN tags. This causes VLAN-tagged traffic (DHCP, DNS) to fail silently.
11.1. Problem: VLANs vs PVID
Two separate issues require different fixes:
| Issue | Symptom | Fix |
|---|---|---|
| Missing VLANs | Tagged traffic (VLANs 10,20,30…) dropped | |
| Wrong PVID | Untagged traffic goes to wrong VLAN | |
PVID (Port VLAN ID): Tags untagged ingress frames. Default is PVID 1. WLCs send management traffic untagged on native VLAN 100 - if PVID is 1, traffic goes to wrong VLAN.
# Check: Does vnet have required VLANs AND correct PVID?
sudo bridge vlan show dev vnet11
# BAD: PVID on wrong VLAN (1 instead of 100)
port vlan-id
vnet11 1 PVID Egress Untagged ← WRONG for WLC
10
100 ← 100 exists but not PVID
# GOOD: PVID on VLAN 100 (for WLC native VLAN 100)
port vlan-id
vnet11 10
20
30
40
100 PVID Egress Untagged ← CORRECT for WLC
110
120
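Reading the PVID off the listing by eye is error-prone when many VLANs are present. A minimal parser is sketched below; pvid_of is a hypothetical helper name, and it assumes the text output format shown above, where the VLAN ID immediately precedes the word PVID.

```shell
# Hypothetical helper: print the PVID from `bridge vlan show dev <if>` output on stdin.
# The VLAN ID is the field immediately before the literal "PVID".
pvid_of() {
  awk '{ for (i = 2; i <= NF; i++) if ($i == "PVID") { print $(i - 1); exit } }'
}

# Usage on kvm-02 (not executed here):
#   [ "$(bridge vlan show dev vnet11 | pvid_of)" = "100" ] || echo "vnet11: wrong PVID"
```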
11.2. Diagnostic Commands
Find vnet interface for a VM:
sudo virsh domiflist <vm-name> | awk '/br-mgmt/ {print $1}'
Check VLAN config on vnet:
bridge vlan show dev vnet<N>
One-liner: Find VM’s vnet and show its VLANs:
VNET=$(sudo virsh domiflist 9800-WLC-02 | awk '/br-mgmt/ {print $1}') && bridge vlan show dev $VNET
11.3. Solution: Libvirt Hook Script
Libvirt hooks run on VM lifecycle events. Create a qemu hook that configures VLANs AND PVID when VM starts.
sudo tee /etc/libvirt/hooks/qemu << 'EOF'
#!/bin/bash
# Libvirt QEMU hook - configures VLANs and PVID on br-mgmt vnet interfaces
# CRITICAL: Do NOT use 'virsh' commands here - causes deadlock with libvirtd
#
# CRITICAL: PVID determines which VLAN receives UNTAGGED frames
#
# VMs with eth0 = MGMT (untagged 10.50.1.x) need PVID 100:
# - VyOS: eth0 = 10.50.1.2/10.50.1.3 (MGMT untagged), eth0.X = tagged VLANs
# - WLC: Vlan100 + native trunk = MGMT untagged, other VLANs tagged
#
# VMs with eth0 = VLAN 100 tagged (eth0.100) can use PVID 1 (default)
GUEST_NAME="$1"
OPERATION="$2"
# VLANs to add to all br-mgmt vnet interfaces
VLANS="10 20 30 40 100 110 120"
# VMs that need PVID 100 (MGMT VLAN for untagged management traffic)
# - VyOS: eth0 = 10.50.1.x (MGMT untagged)
# - WLC: native VLAN 100 (management untagged)
# - ISE: eth0 = 10.50.1.x (MGMT untagged)
# - Any VM with eth0 on 10.50.1.x (MGMT VLAN 100)
PVID100_VMS="vyos-01 vyos-02 9800-WLC-01 9800-WLC-02 ise-01 ise-02 bind-01 home-dc01 keycloak-01 ipsk-manager vault-01"
case "$OPERATION" in
started)
# Run in background (&) to avoid blocking libvirtd
(
sleep 3 # Wait for interfaces to be fully created
# Find vnet interfaces attached to br-mgmt
for vnet in $(ip link show master br-mgmt 2>/dev/null | awk -F': ' '/vnet/{print $2}'); do
logger -t "libvirt-hook" "$GUEST_NAME: Configuring $vnet"
# Add all VLANs
for vid in $VLANS; do
bridge vlan add vid "$vid" dev "$vnet" 2>/dev/null
done
# Check if this VM needs PVID 100 (VyOS, WLC)
for vm in $PVID100_VMS; do
if [ "$GUEST_NAME" = "$vm" ]; then
logger -t "libvirt-hook" "$GUEST_NAME: Setting PVID 100 on $vnet (MGMT native VLAN)"
bridge vlan del vid 1 dev "$vnet" pvid untagged 2>/dev/null
bridge vlan add vid 100 dev "$vnet" pvid untagged 2>/dev/null
fi
done
logger -t "libvirt-hook" "$GUEST_NAME: $vnet configuration complete"
done
) &
;;
esac
exit 0
EOF
sudo chmod +x /etc/libvirt/hooks/qemu
sudo systemctl restart libvirtd
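One thing to watch in the hook above: the loop visits every vnet attached to br-mgmt, so when a guest listed in PVID100_VMS starts, PVID 100 is applied to all of them, not only to that guest's interfaces. libvirt passes the domain XML to hook scripts on stdin, so one possible refinement is to read the guest's own interface names from that XML. The sketch below is an assumption, not the deployed hook: guest_vnets is a hypothetical name, it assumes the XML handed to the started operation already contains the target dev elements, and it assumes one element per line with single-quoted attributes, as virsh dumpxml prints them.

```shell
# Hypothetical helper: print the vnet names of the interfaces attached to
# one bridge, reading libvirt domain XML on stdin. $1 = bridge name
guest_vnets() {
  # Split fields on single quotes so $2 is the first attribute value of a line
  awk -v br="$1" -F"'" '
    /<interface /              { in_if = 1; bridge = ""; dev = "" }
    in_if && /<source bridge=/ { bridge = $2 }
    in_if && /<target dev=/    { dev = $2 }
    /<\/interface>/            { if (in_if && bridge == br && dev != "") print dev; in_if = 0 }
  '
}

# Usage inside the hook (not executed here). stdin must be read before the
# background subshell starts, because it is only available to the hook itself:
#   VNETS=$(guest_vnets br-mgmt)
#   for vnet in $VNETS; do ...; done
```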
11.4. Verify Hook Works
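sudo virsh shutdown 9800-WLC-02
# Wait until the VM is fully off; a fixed 5-second sleep can be too short for a graceful shutdown
while [ "$(sudo virsh domstate 9800-WLC-02)" != "shut off" ]; do sleep 2; done
sudo virsh start 9800-WLC-02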
sleep 10
VNET=$(sudo virsh domiflist 9800-WLC-02 | awk '/br-mgmt/ {print $1}')
bridge vlan show dev $VNET
port vlan-id
vnet11 10
20
30
40
100 PVID Egress Untagged
110
120
journalctl -t libvirt-hook --since "5 min ago"
11.5. Manual Fix (Non-Persistent)
Add VLANs only (VMs that keep the default PVID 1):
for vid in 10 20 30 40 100 110 120; do
sudo bridge vlan add vid $vid dev vnet<N>
done
Fix PVID for WLC (native VLAN 100):
VNET=$(sudo virsh domiflist 9800-WLC-02 | awk '/br-mgmt/ {print $1}')
sudo bridge vlan del vid 1 dev $VNET pvid untagged
sudo bridge vlan add vid 100 dev $VNET pvid untagged
11.6. Affected VMs (kvm-02)
| VM | Bridge | PVID Required | Hook Action |
|---|---|---|---|
| vyos-02 | br-mgmt | PVID 100 (eth0 = MGMT untagged) | Add VLANs + Set PVID 100 |
| 9800-WLC-02 | br-mgmt | PVID 100 (native VLAN 100) | Add VLANs + Set PVID 100 |
| ise-01, ise-02 | br-mgmt | PVID 100 (eth0 = MGMT untagged) | Add VLANs + Set PVID 100 |
| bind-01 | br-mgmt | PVID 100 (eth0 = MGMT untagged) | Add VLANs + Set PVID 100 |
| home-dc01 | br-mgmt | PVID 100 (eth0 = MGMT untagged) | Add VLANs + Set PVID 100 |
| keycloak-01 | br-mgmt | PVID 100 (eth0 = MGMT untagged) | Add VLANs + Set PVID 100 |
| vault-01 | br-mgmt | PVID 100 (eth0 = MGMT untagged) | Add VLANs + Set PVID 100 |
| ipsk-manager | br-mgmt | PVID 100 (eth0 = MGMT untagged) | Add VLANs + Set PVID 100 |
| Other VMs | br-mgmt | PVID 1 (default) | Add VLANs only |
Why PVID 100 for most VMs? VMs on br-mgmt use eth0 for management (10.50.1.x/24). This traffic is UNTAGGED. The bridge PVID tags incoming untagged frames, so it must be 100 to match the MGMT VLAN. Only VMs that use eth0.100 (tagged VLAN 100) can use PVID 1.
kvm-01 uses virbr0 (untagged bridge): no VLAN filtering, no hook needed. VMs on kvm-01 (pfSense, WLC-01, etc.) handle VLANs internally.
11.7. Cross-Hypervisor XML Migration (kvm-01 ↔ kvm-02)
When migrating VMs between kvm-01 (Arch) and kvm-02 (RHEL 7), you must fix THREE compatibility issues in the XML:
| Issue | kvm-01 (Arch) | kvm-02 (RHEL 7) |
|---|---|---|
| QEMU binary | | |
| Machine type | | |
| Disk path | | |
|
Shell prompt injection: If your shell has hooks that output text (like |
11.7.1. Full Procedure: kvm-01 → kvm-02
1. Dump XML from source (filter shell noise):
ssh kvm-01 "sudo virsh dumpxml <vm-name>" | grep -v "session active" | grep -v "^⚡" > /tmp/<vm-name>.xml
2. Verify XML is clean:
head -3 /tmp/<vm-name>.xml
<domain type='kvm'>
  <name>9800-WLC-01</name>
  <uuid>920adcbd-5510-46f6-a48c-6b7280c82b2e</uuid>
3. Fix all three compatibility issues:
# Fix QEMU binary path
sed -i 's|/usr/bin/qemu-system-x86_64|/usr/libexec/qemu-kvm|' /tmp/<vm-name>.xml
# Fix machine type (check available: ssh kvm-02 "/usr/libexec/qemu-kvm -machine help | grep i440fx")
sed -i "s|machine='pc-i440fx-[^']*'|machine='pc'|" /tmp/<vm-name>.xml
# Fix disk path (NAS or kvm-01 SSD → kvm-02 local)
sed -i 's|/mnt/nas/vms/|/var/lib/libvirt/images/|' /tmp/<vm-name>.xml
sed -i 's|/mnt/onboard-ssd/libvirt/images/|/var/lib/libvirt/images/|' /tmp/<vm-name>.xml
4. Copy XML and disk image to target:
scp /tmp/<vm-name>.xml kvm-02:/tmp/
# Copy disk from the NAS to local storage (assumes the image is already on the NAS)
ssh kvm-02 "sudo cp /mnt/nas/vms/<vm-name>.qcow2 /var/lib/libvirt/images/"
5. Define and start on target:
ssh -t kvm-02 "sudo virsh define /tmp/<vm-name>.xml && sudo virsh start <vm-name>"
6. Fix PVID (if kvm-02 uses br-mgmt):
ssh kvm-02 "VNET=\$(sudo virsh domiflist <vm-name> | awk '/br-mgmt/ {print \$1}'); sudo bridge vlan add vid 100 dev \$VNET pvid untagged; sudo bridge vlan del vid 1 dev \$VNET pvid untagged 2>/dev/null"
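The three fixes in step 3 can also be applied as a single stream filter, which leaves the original dump untouched for comparison. This is a sketch only: to_kvm02_xml is a hypothetical name, and the paths and machine-type pattern are the ones used in the procedure above.

```shell
# Hypothetical filter: apply the kvm-01 -> kvm-02 fixes to domain XML on stdin.
to_kvm02_xml() {
  sed -e 's|/usr/bin/qemu-system-x86_64|/usr/libexec/qemu-kvm|' \
      -e "s|machine='pc-i440fx-[^']*'|machine='pc'|" \
      -e 's|/mnt/nas/vms/|/var/lib/libvirt/images/|' \
      -e 's|/mnt/onboard-ssd/libvirt/images/|/var/lib/libvirt/images/|'
}

# Usage from the workstation (not executed here):
#   to_kvm02_xml < /tmp/<vm-name>.xml > /tmp/<vm-name>.kvm-02.xml
#   diff /tmp/<vm-name>.xml /tmp/<vm-name>.kvm-02.xml
```

The diff should show exactly the emulator, machine, and disk source lines; anything else changing would suggest a pattern matched more than intended.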
11.7.2. Troubleshooting XML Import
| Error | Cause | Fix |
|---|---|---|
| | Shell output captured in XML file | Re-dump with |
| | Wrong QEMU path for target hypervisor | |
| | QEMU version mismatch | Use |
| | Disk image not copied or wrong path | Copy image, fix |
11.7.3. Chronicle: 2026-03-07 WLC Migration
Problem: 9800-WLC-01 unreachable on kvm-02, vnet orphaned.
Resolution: Fresh XML import from kvm-01:
- Dumped XML - corrupted by shell prompt injection (⚡ No session active)
- Re-dumped with grep -v filter
- Fixed QEMU path: /usr/bin/qemu-system-x86_64 → /usr/libexec/qemu-kvm
- Fixed machine type: pc-i440fx-10.1 → pc
- Fixed disk path: /mnt/nas/vms/ → /var/lib/libvirt/images/
- Defined and started successfully
Key learning: kvm-01 (Arch, rolling) and kvm-02 (RHEL 7) have incompatible QEMU versions. Always transform XML when crossing hypervisors.
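The grep -v filter used in this incident only works for noise strings known in advance. An alternative, sketched here under the assumption that the noise sits on its own lines, is to keep only what lies between the opening and closing domain tags. The name only_domain_xml is hypothetical.

```shell
# Hypothetical filter: keep only the <domain> ... </domain> element,
# dropping any shell noise printed before or after it.
only_domain_xml() {
  sed -n '/<domain[ >]/,/<\/domain>/p'
}

# Usage from the workstation (not executed here):
#   ssh kvm-01 "sudo virsh dumpxml <vm-name>" | only_domain_xml > /tmp/<vm-name>.xml
```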
11.8. Persist br-mgmt and eno8 PVID 100
The libvirt hook handles vnet interfaces when VMs start. But br-mgmt and eno8 themselves need PVID 100 at boot, BEFORE any VMs start.
Why both br-mgmt and eno8?
Physical Network Path:
Switch Te1/0/1 (native VLAN 100)
↓ untagged
eno8 (must tag as VLAN 100, not VLAN 1)
↓ VLAN 100
br-mgmt (must recognize as VLAN 100)
↓ VLAN 100
vnetX (PVID 100 for VMs)
↓
VM eth0 (10.50.1.x untagged)
If eno8 or br-mgmt has PVID 1, switch native VLAN 100 traffic gets tagged as VLAN 1, breaking connectivity.
sudo tee /etc/systemd/system/bridge-vlan-pvid.service << 'EOF'
[Unit]
Description=Configure br-mgmt and eno8 PVID 100 for MGMT VLAN
After=network.target NetworkManager.service
Before=libvirtd.service
[Service]
Type=oneshot
RemainAfterExit=yes
# Add VLANs to eno8 (physical uplink)
ExecStart=/usr/sbin/bridge vlan add vid 10 dev eno8
ExecStart=/usr/sbin/bridge vlan add vid 20 dev eno8
ExecStart=/usr/sbin/bridge vlan add vid 30 dev eno8
ExecStart=/usr/sbin/bridge vlan add vid 40 dev eno8
ExecStart=/usr/sbin/bridge vlan add vid 100 dev eno8
ExecStart=/usr/sbin/bridge vlan add vid 110 dev eno8
ExecStart=/usr/sbin/bridge vlan add vid 120 dev eno8
# Set PVID 100 on eno8 (physical interface to switch)
ExecStart=/usr/sbin/bridge vlan add vid 100 dev eno8 pvid untagged
# Set PVID 100 on br-mgmt self (bridge's own interface for host traffic)
ExecStart=/usr/sbin/bridge vlan add vid 100 dev br-mgmt self pvid untagged
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable bridge-vlan-pvid.service
sudo systemctl start bridge-vlan-pvid.service
bridge vlan show dev eno8 | awk '/100.*PVID/{print "eno8: PVID 100 ✓"}'
bridge vlan show dev br-mgmt | awk '/100.*PVID/{print "br-mgmt: PVID 100 ✓"}'
eno8: PVID 100 ✓
br-mgmt: PVID 100 ✓
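The two checks above confirm the PVID but not that every trunk VLAN was added. A sketch of a membership check follows; missing_vlans is a hypothetical helper name, and it assumes the text output of bridge vlan show with a single header line.

```shell
# Hypothetical helper: print each expected VLAN ID that is absent from
# `bridge vlan show dev <if>` output read on stdin, one per line.
# $1 = space-separated list of expected VLAN IDs
missing_vlans() {
  awk -v want="$1" '
    NR > 1 { for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) have[$i] = 1 }
    END    { n = split(want, w, " "); for (i = 1; i <= n; i++) if (!(w[i] in have)) print w[i] }
  '
}

# Usage on kvm-02 (not executed here); no output means all VLANs are present:
#   bridge vlan show dev eno8 | missing_vlans "10 20 30 40 100 110 120"
```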
|
Order matters: This service must run:
The |
12. Chronicle: 2026-02-21
12.1. Issue: k3s-master-01 Paused
Symptoms:
- VM repeatedly pausing after resume
- virsh domstate --reason showed paused (I/O error)
Root Cause:
- Host root partition (/dev/sda2) was 100% full (14GB total)
- k3s-master-01.qcow2 was on root partition, grew to 3.5GB
- Left only 44MB free, triggering I/O errors
Resolution:
- Identified disk full:
df -h / # Showed 0 bytes available
- Found VM image on wrong partition:
sudo virsh domblklist k3s-master-01 # Showed /var/lib/libvirt/images/k3s-master-01.qcow2
- Moved to SSD:
sudo virsh destroy k3s-master-01
sudo mv /var/lib/libvirt/images/k3s-master-01.qcow2 /mnt/onboard-ssd/libvirt/images/
sudo virsh edit k3s-master-01 # Updated disk path
- Cleaned up unused VMs (ise-02, home-dc02):
sudo virsh destroy ise-02; sudo virsh undefine ise-02 --remove-all-storage
sudo virsh undefine home-dc02 --remove-all-storage
Prevention:
- ALWAYS create new VMs with disks on /mnt/onboard-ssd/libvirt/images/
- Monitor root partition: df -h / | awk 'NR==2 {print $4}'
- Consider symlinking /var/lib/libvirt/images to SSD
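The monitoring item above can be made into a threshold check suitable for cron or a login banner. This is a sketch under assumptions: usage_pct and over_threshold are hypothetical helper names, and the 90 percent threshold in the usage line is an example value, not a figure from this incident.

```shell
# Hypothetical helper: print the used-space percentage (without "%") of the
# filesystem holding a path. df -P keeps each filesystem on one line.
usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed when usage of $1 is at or above $2 percent.
over_threshold() {
  [ "$(usage_pct "$1")" -ge "$2" ]
}

# Usage (not executed here):
#   over_threshold / 90 && echo "WARNING: root partition at $(usage_pct /)%"
```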