WRKLOG-2026-03-08
Summary
Massive infrastructure day. 9 sessions covering network, wireless, identity, and compute.
Key Accomplishments
| Area | Achievement |
|---|---|
| iPSK Manager | ISE ODBC integration complete - All 5 stored procedure tests passing. Fixed |
| Wireless DHCP | Root cause: WLC DHCP relay without VLAN 40 SVI. Fix: Disabled relay, let broadcasts flow to VyOS. |
| WLC Cleanup | Removed legacy policy/tag configs. Created standardized |
| WLC HA SSO | WLC-01 (Active) + WLC-02 (Standby HOT). 2-NIC approach with Gi2 for HA heartbeat. |
| WiFi EAP-TLS | Fixed VLAN naming mismatch, enabled DACL download ( |
| ISE Authorization | Restricted |
| Switch DACL | 3560CX now downloads DACLs - added VSA authentication. |
| VM Migration | Moved vault-01, bind-01, home-dc01, ipa-01 from kvm-02 (NAS) to kvm-01 (onboard SSD) for UPS independence. |
| Documentation | VyOS migration runbook hierarchy verified. Added missing |
Infrastructure State After Today
| System | Status | Notes |
|---|---|---|
| iPSK Manager | OPERATIONAL | ISE ODBC working, wireless DHCP fixed |
| WLC HA | OPERATIONAL | SSO Active/Standby HOT |
| EAP-TLS WiFi | OPERATIONAL | VLAN/DACL assignment working |
| Primary VMs | kvm-01 | vault, bind, DC, IPA migrated off NAS |
| VyOS HA | OPERATIONAL | Remaining VLANs pending (VOICE, STORAGE, DMZ) |
Today’s Priority Tasks
| Priority | Task | Status | Notes |
|---|---|---|---|
| P0 | iPSK Manager ISE ODBC integration | [x] DONE | All 5 tests passing, iPSK_AttributeFetch fixed |
| P0 | iPSK Wireless DHCP Fix | [x] DONE | Root cause: WLC DHCP relay without VLAN 40 SVI |
| P0 | WLC Configuration Cleanup | [x] DONE | Removed legacy profiles/tags, created DOMUS architecture |
| P1 | kvm-01 Rocky Linux rebuild (Phases 4-7) | [ ] CARRY-OVER | Blocked on iPSK completion |
| P1 | VyOS complete cutover (remaining VLANs) | [ ] CARRY-OVER | VOICE, STORAGE, DMZ pools |
| P1 | WLC HA SSO Configuration | [x] DONE | WLC-01 + WLC-02 in SSO pair (Session 8) |
Carried Over from 2026-03-07
Professional (CHLA)
| Priority | Task | Status | Notes |
|---|---|---|---|
| P0 | Linux SSH (Xianming Ding) | [x] RESOLVED | DACL missing AD DC DNS IPs. EAP-TLS cert test tomorrow. |
| P1 | iPSK Manager DB replication | [ ] CARRY-OVER | - |
| P1 | ISE 3.4p9 / 3.5 migration | [ ] CARRY-OVER | - |
| P1 | Monad - QRadar → Sentinel | [ ] ACTIVE | Cost-driven. Critical → Sentinel, bulk → local Linux. I own firewall/ISE/network sources. |
Monad Migration Context
Project: QRadar → Microsoft Sentinel migration
Constraint: Cost. Sentinel charges per GB ingested.
Strategy:
- Critical pipeline → Sentinel (security events, auth failures, alerts)
- Bulk logs → Local Linux syslog server (as backup/compliance)
My scope: Firewall (pfSense/VyOS), ISE, network devices - high log volume sources that need filtering.
Questions to answer:
1. What ISE log categories are "critical" vs "bulk"?
2. Can ISE syslog be split by severity/category?
3. What’s the local Linux target? (rsyslog? syslog-ng? Wazuh?)
Session 10: k3s Observability & jq Practice
Time: Evening
What Was Learned
| Topic | Key Insight |
|---|---|
| Shell prompt pollution | |
| kubectl JSON + jq | |
| Pod status truth | |
| jq aggregation | |
| Monad transforms | Use jq to filter/transform security data BEFORE SIEM ingestion. Same patterns apply. |
| MetalLB L2 mode | Services need |
Fixes Applied
| Issue | Fix | Result |
|---|---|---|
| wazuh-indexer-0 Unknown | | StatefulSet recreating |
| Monitoring ClusterIP only | | IPs assigned: Grafana .135, Prometheus .136, AlertManager .137 |
| svclb-dashboard Pending | Port conflict - ignored (dashboard service misconfigured) | Non-critical |
jq Patterns Practiced
# Filter non-Running pods
jq '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | {namespace: .metadata.namespace, name: .metadata.name, phase: .status.phase}' /tmp/pods.json
# Count by namespace
jq '[.items[].metadata.namespace] | group_by(.) | map({namespace: .[0], count: length})' /tmp/pods.json
# Deep container status inspection
jq '.items[] | select(.metadata.name == "wazuh-indexer-0") | {phase: .status.phase, containerStatuses: .status.containerStatuses}' /tmp/pods.json
Infrastructure State After Session
| Service | IP | Status |
|---|---|---|
| Grafana | 10.50.1.135:80 | ✓ Accessible |
| Prometheus | 10.50.1.136:9090 | ✓ Accessible |
| AlertManager | 10.50.1.137:9093 | ✓ Accessible |
| Wazuh Indexer | - | Recreating |
Session 11: Traefik DNS Mismatch Fix
Time: Late Evening
Problem
Monitoring services (Grafana, Prometheus, AlertManager) unreachable via DNS names despite IngressRoutes configured.
Root Cause Analysis
| Check | Finding |
|---|---|
| DNS records | |
| Traefik actual IP | 10.50.1.131 (MetalLB auto-assigned) |
| Who has .130? | |
Diagnosis commands:
# Check Traefik's actual IP
ssh k3s-master-01 "kubectl get svc traefik -n kube-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
# Returns: 10.50.1.131
# Check all LoadBalancer assignments
ssh k3s-master-01 "kubectl get svc -A -o jsonpath='{range .items[?(@.spec.type==\"LoadBalancer\")]}{.metadata.namespace}/{.metadata.name}: {.status.loadBalancer.ingress[0].ip}{\"\\n\"}{end}'" | grep -v "session active"
Fix
Follow: BIND Operations Quick Ref → Update Existing DNS Records (infra-ops runbook)
Phase 1: Pre-Validation
DNS_SERVER="bind-01"
DNS_IP="10.50.1.90"
DOMAIN="inside.domusdigitalis.dev"
FORWARD_ZONE="/var/named/inside.domusdigitalis.dev.zone"
echo "=== 1.1 CURRENT STATE ==="
ssh $DNS_SERVER "sudo grep -E 'grafana|prometheus|alertmanager' $FORWARD_ZONE"
echo "=== 1.2 CURRENT SERIAL ==="
ssh $DNS_SERVER "sudo rndc zonestatus $DOMAIN | grep serial"
echo "=== 1.3 CURRENT DNS RESOLUTION ==="
for h in grafana prometheus alertmanager; do
    echo -n "$h: "
    dig @$DNS_IP $h.$DOMAIN +short
done
Phase 2: Backup
TIMESTAMP=$(date +%Y%m%d%H%M)
echo "=== 2.1 CREATE BACKUP ==="
ssh $DNS_SERVER "sudo cp $FORWARD_ZONE ${FORWARD_ZONE}.bak.${TIMESTAMP}"
echo "=== 2.2 VERIFY BACKUP EXISTS ==="
ssh $DNS_SERVER "sudo ls -la ${FORWARD_ZONE}.bak.${TIMESTAMP}"
Phase 3: Capture Serial
CURRENT_SERIAL=$(dig @$DNS_IP $DOMAIN SOA +short | awk '{print $3}')
NEW_SERIAL=$((CURRENT_SERIAL + 1))
echo "=== 3.1 SERIAL VALUES ==="
echo "Current: $CURRENT_SERIAL"
echo "New: $NEW_SERIAL"
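The runbook simply adds 1 to the current serial. If the zone follows the date-based YYYYMMDDNN convention (the value 2026030401 seen in Session 12 suggests this, but the log does not state it), a helper could also roll the serial forward to today's date. A minimal sketch under that assumption; `next_serial` is a hypothetical helper, not part of the runbook:

```shell
# next_serial CURRENT [TODAY]
# Hypothetical helper: assumes a 10-digit YYYYMMDDNN serial.
# TODAY defaults to the local date; it is a parameter only to ease testing.
next_serial() {
    current=$1
    today=${2:-$(date +%Y%m%d)}
    # Strip the two-digit revision to get the date part of the serial.
    if [ "${current%??}" -lt "$today" ]; then
        # First change of the day: restart the revision counter at 01.
        echo "${today}01"
    else
        # Same day, or serial already ahead of the date: plain increment.
        echo $((current + 1))
    fi
}

next_serial 2026030401 20260308   # prints 2026030801
```

Either branch yields a serial larger than the old one, which is all BIND secondaries require. The plain `+ 1` used in the runbook remains valid; this variant only keeps the date prefix meaningful.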
Phase 4: Apply Changes
echo "=== 4.1 UPDATE IPS ==="
# Dots are escaped so the pattern matches the literal address only
ssh $DNS_SERVER "sudo sed -i '/grafana/s/10\.50\.1\.130/10.50.1.131/' $FORWARD_ZONE"
ssh $DNS_SERVER "sudo sed -i '/prometheus/s/10\.50\.1\.130/10.50.1.131/' $FORWARD_ZONE"
ssh $DNS_SERVER "sudo sed -i '/alertmanager/s/10\.50\.1\.130/10.50.1.131/' $FORWARD_ZONE"
echo "=== 4.2 UPDATE SERIAL ==="
ssh $DNS_SERVER "sudo sed -i 's/$CURRENT_SERIAL/$NEW_SERIAL/' $FORWARD_ZONE"
echo "=== 4.3 VERIFY CHANGES ==="
ssh $DNS_SERVER "sudo grep -E 'grafana|prometheus|alertmanager' $FORWARD_ZONE"
ssh $DNS_SERVER "sudo grep -E '[0-9]{10}.*[Ss]erial' $FORWARD_ZONE"
Phase 5: Validate & Reload
echo "=== 5.1 CHECK ZONE SYNTAX ==="
ssh $DNS_SERVER "sudo named-checkzone $DOMAIN $FORWARD_ZONE"
echo "=== 5.2 RELOAD ZONE ==="
ssh $DNS_SERVER "sudo rndc reload $DOMAIN"
echo "=== 5.3 VERIFY NEW SERIAL LOADED ==="
ssh $DNS_SERVER "sudo rndc zonestatus $DOMAIN | grep serial"
Phase 6: Post-Validation
echo "=== 6.1 VERIFY DNS RESOLUTION ==="
for h in grafana prometheus alertmanager; do
    result=$(dig @$DNS_IP $h.$DOMAIN +short)
    if [[ "$result" == "10.50.1.131" ]]; then
        echo "✓ $h → $result"
    else
        echo "✗ $h: expected 10.50.1.131, got '$result'"
    fi
done
echo "=== 6.2 TEST HTTPS ACCESS ==="
curl -sSk -o /dev/null -w "%{http_code}" https://grafana.inside.domusdigitalis.dev
# Expected: 200 or 302
Key Learning
MetalLB L2 mode auto-assigns IPs in order of service creation. If you need a specific IP:
# Option 1: Patch service with specific IP
kubectl patch svc traefik -n kube-system -p '{"spec":{"loadBalancerIP":"10.50.1.130"}}'
# Option 2: Let MetalLB auto-assign, update DNS to match
# (Simpler - what we did here)
Target architecture vs Reality:
| Service | Target (diagram) | Actual |
|---|---|---|
| Traefik | 10.50.1.130 | 10.50.1.131 |
| Wazuh Indexer | - | 10.50.1.130 |
Future cleanup: Reassign IPs to match infrastructure-radial-v7.d2 target.
Personal Infrastructure
| Priority | Task | Status |
|---|---|---|
| P0 | iPSK Manager ISE ODBC integration | [x] DONE |
| P1 | kvm-01 Rocky rebuild Phase 4-7 | [ ] CARRY-OVER |
| P1 | VyOS complete cutover | [ ] CARRY-OVER |
| P1 | kvm-01 bridge VLAN persistence (libvirt hook) | [ ] CARRY-OVER |
| P1 | WLC HA SSO Configuration | [x] DONE (Session 8) |
| P2 | vyos-01 HA establishment | [ ] FUTURE |
Session 12: Reverse Zone PTR Update
Time: Late Evening (continued from Session 11)
Problem
Forward zone updated (grafana/prometheus/alertmanager → 10.50.1.131), but reverse zone PTR for .131 still pointed to wazuh-indexer.
Root Cause
When updating DNS records, both forward (A) and reverse (PTR) zones must be updated together.
Fix
On bind-01 (SSH’d in):
REVERSE_ZONE="/var/named/10.50.1.rev"
# Phase 1: Backup
sudo cp $REVERSE_ZONE ${REVERSE_ZONE}.bak.$(date +%Y%m%d%H%M)
# Phase 2: Capture serial (note: zone files need sudo to read)
CURRENT_SERIAL=$(sudo awk '/; Serial/ {print $1}' $REVERSE_ZONE)
echo "Current serial: $CURRENT_SERIAL"
# Phase 3: Update PTR
sudo sed -i '/^131/s/wazuh-indexer/traefik/' $REVERSE_ZONE
# Phase 4: Increment serial
NEW_SERIAL=$((CURRENT_SERIAL + 1))
sudo sed -i "s/$CURRENT_SERIAL/$NEW_SERIAL/" $REVERSE_ZONE
echo "New serial: $NEW_SERIAL"
# Phase 5: Validate
sudo named-checkzone 1.50.10.in-addr.arpa $REVERSE_ZONE
# Phase 6: Reload and verify
sudo rndc reload 1.50.10.in-addr.arpa
dig +short -x 10.50.1.131 @localhost
Result
Current serial: 2026030401
New serial: 2026030402
zone 1.50.10.in-addr.arpa/IN: loaded serial 2026030402
OK
zone reload queued
traefik.inside.domusdigitalis.dev.
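Sessions 11 and 12 show how easily the forward and reverse zones drift apart. The following is a rough sketch of an offline consistency check. It assumes simplified one-record-per-line zone files in the `name IN A ip` and `octet IN PTR fqdn.` layout that the sed commands in this log rely on, and `check_ptr_alignment` is a hypothetical helper name. It does not handle $ORIGIN, TTL columns, or multi-line records.

```shell
# check_ptr_alignment FORWARD_ZONE REVERSE_ZONE DOMAIN PREFIX
# Hypothetical helper. Reports PTRs whose target has no matching A record,
# and A-record IPs inside PREFIX that have no PTR at all.
check_ptr_alignment() {
    awk -v domain="$3" -v prefix="$4" '
        # Forward zone: remember name -> IP and which IPs are in use.
        FILENAME == ARGV[1] {
            if ($2 == "IN" && $3 == "A") {
                a[$1 "." domain "."] = $4
                if (index($4, prefix ".") == 1) has_a[$4] = 1
            }
            next
        }
        # Reverse zone: each PTR target must resolve back to the same IP.
        $2 == "IN" && $3 == "PTR" {
            ip = prefix "." $1
            has_ptr[ip] = 1
            if (a[$4] != ip) printf "STALE PTR %s -> %s\n", ip, $4
        }
        END {
            for (ip in has_a)
                if (!(ip in has_ptr)) printf "MISSING PTR %s\n", ip
        }
    ' "$1" "$2"
}
```

Run on bind-01 with read access to both files, for example `check_ptr_alignment "$FORWARD_ZONE" "$REVERSE_ZONE" inside.domusdigitalis.dev 10.50.1`. Several A records sharing one address (grafana, prometheus, alertmanager) are not flagged, as long as the PTR target itself has an A record for that address.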
Key Learnings
| Issue | Solution |
|---|---|
| Reverse zone path | |
| awk serial pattern differs | Forward: |
| Zone files need sudo | Even for reading: |
| Always update BOTH zones | A record change → also update corresponding PTR |
Documentation Gap
Missing from bind-operations-quick-ref.adoc:
- Reverse zone path attribute (currently hardcoded)
- Example for reverse zone PTR update procedure
Action needed: Add {reverse-zone-path} attribute to antora.yml
Session 13: Wazuh DNS Alignment & k3s NAT Discovery
Time: Late Night
Wazuh DNS Mismatch
Fixed forward and reverse zone misalignment between DNS records and actual MetalLB IPs.
Forward Zone Updates:
# Fix wazuh-indexer: .131 → .130
sudo sed -i '/^wazuh-indexer/s/10.50.1.131/10.50.1.130/' $FORWARD_ZONE
# Fix wazuh-workers: .133 → .134
sudo sed -i '/^wazuh-workers/s/10.50.1.133/10.50.1.134/' $FORWARD_ZONE
# Add wazuh-dashboard
sudo sed -i '/^wazuh[[:space:]]/a wazuh-dashboard IN A 10.50.1.133' $FORWARD_ZONE
Reverse Zone Updates:
# Add 130 for wazuh-indexer
sudo sed -i '/^131/i 130 IN PTR wazuh-indexer.inside.domusdigitalis.dev.' $REVERSE_ZONE
# Fix 133: wazuh-workers → wazuh-dashboard
sudo sed -i '/^133/s/wazuh-workers/wazuh-dashboard/' $REVERSE_ZONE
# Fix 134: wazuh-api → wazuh-workers
sudo sed -i '/^134/s/wazuh-api/wazuh-workers/' $REVERSE_ZONE
Final DNS State:
| IP | Forward (A) | Reverse (PTR) |
|---|---|---|
| 10.50.1.130 | wazuh-indexer | wazuh-indexer |
| 10.50.1.131 | traefik | traefik |
| 10.50.1.132 | wazuh | wazuh |
| 10.50.1.133 | wazuh-dashboard | wazuh-dashboard |
| 10.50.1.134 | wazuh-workers | wazuh-workers |
Wazuh Indexer Pod - ImagePullBackOff
Symptom: Wazuh dashboard returns 503 - can’t reach indexer.
Root Cause Chain:
1. wazuh-indexer-0 stuck in Init:ImagePullBackOff
2. Init container can’t pull busybox image
3. k3s pod network (10.42.0.0/16) not in VyOS NAT rules
Fix Applied (vyos-01 + vyos-02):
configure
# Create k3s pod network group
set firewall group network-group NET_K3S_PODS description 'k3s Pod Network'
set firewall group network-group NET_K3S_PODS network '10.42.0.0/16'
# Add NAT masquerade rule
set nat source rule 170 source group network-group 'NET_K3S_PODS'
set nat source rule 170 outbound-interface name 'eth0'
set nat source rule 170 translation address 'masquerade'
set nat source rule 170 description 'k3s pods to internet'
commit
save
Status: NAT rule applied, needs verification tomorrow (test curl still timing out).
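Rule 170 only masquerades sources inside NET_K3S_PODS, so while verifying the fix it helps to confirm that the source address of a failing connection really falls inside 10.42.0.0/16 (an address such as 10.50.1.131 would not match this rule). A small pure-shell sketch; `ip_to_int` and `ip_in_cidr` are hypothetical helpers, IPv4 only:

```shell
# Hypothetical helpers, IPv4 only, no input validation.
ip_to_int() (
    # Runs in a subshell so the IFS change does not leak to the caller.
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# ip_in_cidr ADDRESS NETWORK/BITS -> exit status 0 when ADDRESS is inside.
ip_in_cidr() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}

ip_in_cidr 10.42.3.7 10.42.0.0/16 && echo "10.42.3.7 matches NET_K3S_PODS"
ip_in_cidr 10.50.1.131 10.42.0.0/16 || echo "10.50.1.131 does not match"
```

This checks only the match criterion of the rule. It says nothing about whether the traffic actually leaves via eth0 or whether a firewall rule drops it first.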
Key Learnings
| Concept | Insight |
|---|---|
| Masquerade | Linux/VyOS term for Cisco PAT/NAT Overload - hides RFC1918 behind single IP |
| Pod network isolation | k3s pods use 10.42.0.0/16, separate from node network - needs explicit NAT |
| Unix philosophy | Small tools, one job each: awk (Aho/Weinberger/Kernighan), grep (g/re/p), sed (stream editor), tee (T-junction) |
| awk for VyOS parsing | |
Carried to Tomorrow
- Verify k3s pod internet access after NAT rule
- Restart wazuh-indexer pod
- Test Wazuh dashboard accessibility
iPSK Manager Deployment Status
Current State
| Component | Status | Notes |
|---|---|---|
| VM (ipsk-mgr-01) | RUNNING | Ubuntu 24.04 on kvm-01 |
| Apache HTTPS | WORKING | HTTP 200 on ipsk-mgr-01.inside.domusdigitalis.dev/ |
| MariaDB | RUNNING | TLS enabled, network listening on 0.0.0.0:3306 |
| DNS Records | ADDED | ipsk-mgr-01, ipsk-mgr-02, ipsk-mgr (VIP) |
| Web Installer | COMPLETE | Setup wizard finished |
| ISE ODBC | COMPLETE | All 5 tests passing |
| iPSK Wireless | COMPLETE | DHCP relay fix applied, clients getting IPs |
| WLC Config | COMPLETE | Legacy cleanup, DOMUS tag architecture |
ISE ODBC Integration Requirements
From iPSK Manager documentation:
Required Stored Procedures:
- iPSK_AttributeFetch
- iPSK_AuthMACPlain
- iPSK_FetchGroups
- iPSK_FetchPasswordForMAC
- iPSK_MACLookup
Optional (non-expired variants):
- iPSK_AuthMACPlainNonExpired
- iPSK_FetchPasswordForMACNonExpired
- iPSK_MACLookupNonExpired
ISE ODBC Settings:
| Setting | Value |
|---|---|
| Name | iPSK-Manager-ODBC |
| ODBC Driver | MySQL ODBC 8.0 Unicode Driver |
| Hostname | 10.50.1.30 |
| Database | ipsk |
| Username | ipsk_readonly |
| Port | 3306 |
| Enable SSL | Yes |
Session Log
Session 1: ISE ODBC Identity Source Configuration - COMPLETE
Time: Morning
Objective: Configure ISE ODBC connection to iPSK Manager database.
Result: All 5 ISE ODBC tests passing.
Issue Found: iPSK_AttributeFetch procedure was missing. Root cause: v7 migration file had unsubstituted placeholders ({{DB_NAME}}, {{ISE_DB_USERNAME}}).
Fix Applied:
# Braces are literal in sed basic regex; writing \{ would start an interval expression instead
ssh ipsk-mgr-01 "sed 's/{{DB_NAME}}/ipsk/g; s/{{ISE_DB_USERNAME}}/ipsk-ise/g' /var/www/ipsk/supportfiles/db/migrations/v7__attribute_fetch_subscriber_name.sql | sudo mysql"
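The missing procedure traced back to placeholders that were never substituted. Before piping any migration file into mysql, a quick scan for leftover `{{NAME}}` tokens would catch this class of problem. A minimal sketch; `list_placeholders` is a hypothetical helper, and the uppercase-with-underscores token shape is an assumption based on the two placeholders named above:

```shell
# list_placeholders FILE...
# Hypothetical helper: prints each distinct {{UPPER_CASE}} token once.
list_placeholders() {
    grep -oh '{{[A-Z_][A-Z_]*}}' "$@" | sort -u
}
```

An empty result means nothing is left to substitute. Any output lists the names that still need a sed expression before the file is applied.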
ISE ODBC Test Results:
Test connection      Connection succeeded
Fetch attributes     Test succeeded
Fetch password info  Test succeeded
Fetch groups         Test succeeded
Check user exists    Test succeeded
Documentation Created:
- examples/ise/ipsk-api-operations.sh - ISE API reference with correct env vars
- Updated ipsk-manager-deployment.adoc with troubleshooting section
Key Learning: ISE API environment variables (from netapi source):
ISE_PAN_FQDN or ISE_PAN_IP
ISE_API_USER
ISE_API_PASS
ISE_API_TOKEN (Base64 alternative)
Session 2: iPSK ISE Policy Configuration - COMPLETE
Time: Afternoon
Objective: Configure ISE authorization rules to use iPSK ODBC identity source.
Status: Authorization rules already configured - Domus_IoT_Wireless exists in Domus_MAB policy set.
Existing Configuration Discovered:
- Policy Set: Domus_MAB (ID: 76bba980-befd-45d4-9c2a-4be81ac47f8c)
- Authorization Rule: Domus_IoT_Wireless (rank 1)
- Condition: iPSKManager:ExternalGroups equals Domus-IoT
- Profile: Domus-IoT-iPSK (created this session)
Session 3: iPSK End-to-End Validation - IN PROGRESS
Time: Evening
Objective: Validate complete iPSK flow from device to network access.
Test Device: 64:32:A8:C4:C7:19
Step 1: Release Rejected Endpoint
netapi ise release-rejected 64:32:A8:C4:C7:19
NOTE: The command is release-rejected, NOT release-rejected-endpoint.
Step 2: First Auth Attempt - VLAN Empty Issue
ISE auth succeeded (5200) but:
Tunnel-Private-Group-ID: Empty
Root Cause: Original profile iPSK-Auth not found in ERS API. Created new profile Domus-IoT-iPSK.
Step 3: Authorization Profile Investigation
Profile Domus-IoT-iPSK created with dynamic VLAN from iPSK Manager:
"vlan": {
"nameID": "ipsk-mgr-01:vlan",
"tagID": 1
}
This means ISE queries iPSK Manager ODBC for VLAN value per endpoint.
Step 4: iPSK Manager ODBC Query - VLAN Empty
Queried iPSK_AttributeFetch directly:
MAC="64:32:A8:C4:C7:19"
ssh ipsk-mgr-01 "sudo mysql ipsk -e \"SET @result=0; CALL iPSK_AttributeFetch('$MAC', @result);\""
Key Learning: iPSK_AttributeFetch requires 2 arguments:
1. MAC address (IN)
2. Result (OUT parameter)
Result: vlan: [] - EMPTY!
Step 5: Fix VLAN in iPSK Manager Database
MAC="64:32:A8:C4:C7:19"
VLAN="40"
ssh ipsk-mgr-01 "sudo mysql ipsk -e \"UPDATE endpoints SET vlan='$VLAN' WHERE macAddress='$MAC';\""
Verified:
vlan: [40]
dacl: []
Step 6: Second Auth - VLAN 40 Assigned, No DHCP
ISE now returns Tunnel-Private-Group-ID: 40 ✓
But WLC shows client stuck in IP Learn:
Client MAC: 6432.a8c4.c719
VLAN: 40
VLAN Name: IOT_VLAN
Policy Manager State: IP Learn
Client IPv4 Address: (empty)
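The same client shows up as 64:32:A8:C4:C7:19 in ISE and iPSK Manager and as 6432.a8c4.c719 on the WLC, which makes grepping across systems error-prone. A small sketch for converting between the two notations; `mac_to_cisco` and `mac_to_colon` are hypothetical helper names and perform no validation of the input:

```shell
# Hypothetical helpers: convert between the colon-separated form used by ISE
# and iPSK Manager and the dotted form shown in the WLC output.
mac_to_cisco() {
    printf '%s\n' "$1" | tr -d ':.-' | tr 'A-F' 'a-f' |
        sed 's/^\(....\)\(....\)\(....\)$/\1.\2.\3/'
}

mac_to_colon() {
    printf '%s\n' "$1" | tr -d ':.-' | tr 'a-f' 'A-F' |
        sed 's/\(..\)/\1:/g; s/:$//'
}

mac_to_cisco 64:32:A8:C4:C7:19   # prints 6432.a8c4.c719
mac_to_colon 6432.a8c4.c719      # prints 64:32:A8:C4:C7:19
```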
Step 7: VyOS DHCP Investigation
VyOS configuration verified:
Pool: IOT (10.50.40.100-10.50.40.199)
Interface: eth1.40 (10.50.40.1/24, 10.50.40.2/24 VRRP)
No IOT leases in DHCP table - requests not reaching VyOS.
Current Status: Troubleshooting DHCP path from AP/WLC to VyOS VLAN 40.
Documentation Created
- examples/ise/ipsk-odbc-operations.sh - CRITICAL: MariaDB troubleshooting commands
- examples/ise/ipsk-api-operations.sh - ISE ERS API commands for profiles
Key Learnings
| Topic | Learning |
|---|---|
| netapi command | |
| iPSK_AttributeFetch | Requires OUT parameter: |
| Dynamic VLAN | Profile uses |
| Database fix | |
Key Commands Reference
MariaDB Diagnostics
# Check network binding
ss -tlnp | grep 3306
# Check SSL status
sudo mysql -e "SHOW VARIABLES LIKE '%ssl%';"
# List stored procedures
sudo mysql -e "SHOW PROCEDURE STATUS WHERE Db='ipsk';" | awk -F'\t' 'NR>1{print $2}'
# Test stored procedure
sudo mysql ipsk -e "CALL iPSK_MACLookup('AA:BB:CC:DD:EE:FF');"
ISE ODBC Testing
# Check ISE can reach MariaDB (from workstation with netapi)
nc -zv 10.50.1.30 3306
# ISE RADIUS authentication check
netapi ise mnt sessions | awk '/iPSK/{print}'
Session 4: VLAN 40 DHCP Path Troubleshooting - IN PROGRESS
Time: Evening
Objective: Determine why VLAN 40 DHCP requests are not reaching VyOS.
Evidence: VLAN 10 Works, VLAN 40 Doesn’t
VyOS DHCP leases show VLAN 10 (DATA) active, VLAN 40 (IOT) empty:
IP Address    MAC address        Pool  Hostname
10.50.10.100  9c:83:06:ce:89:46  DATA  evan-s-z-fold7
10.50.10.102  dc:8c:37:96:20:a6  DATA  ap4800
No leases in IOT pool (10.50.40.x).
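When comparing pools like this, a per-pool count makes an empty pool obvious at a glance. A rough sketch that tallies leases from saved lease output; `count_leases_by_pool` is a hypothetical helper, and it assumes the four-column excerpt shown above (the real VyOS output may carry more columns depending on version, in which case the field number needs adjusting):

```shell
# count_leases_by_pool [FILE]
# Hypothetical helper. Assumes columns: IP, MAC, pool, hostname,
# preceded by a single header line. Reads stdin when no file is given.
count_leases_by_pool() {
    awk 'NR > 1 && NF >= 3 { n[$3]++ } END { for (p in n) print p, n[p] }' "$@" | sort
}
```

A pool that never appears in the result, such as IOT here, has no leases at all.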
Incident: kvm-02 Bridge VLAN Outage
CRITICAL LESSON: Adding VLANs to eno8 without preserving PVID 100 caused connectivity loss.
What Happened:
- Discovered kvm-02 eno8 and br-mgmt missing VLANs 20/30/40/110/120
- Added VLANs with: for vid in 10 20 30 40 100 110 120; do sudo bridge vlan add vid $vid dev eno8; done
- This REMOVED PVID 100 from eno8 (re-adding vid 100 without the pvid untagged flags clears them) → lost SSH to kvm-02
Recovery (via IPMI console):
sudo bridge vlan add vid 100 dev eno8 pvid untagged
Key Learning: When adding VLANs to a bridge interface with PVID, ALWAYS include the pvid untagged flags for the management VLAN:
# SAFE: Preserves PVID 100 for management
sudo bridge vlan add vid 100 dev eno8 pvid untagged
sudo bridge vlan add vid 40 dev eno8
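To avoid repeating the outage when looping over a VLAN list, the loop can special-case the management VLAN. The sketch below only prints the commands so they can be reviewed before anything touches the bridge; `bridge_vlan_cmds` is a hypothetical helper name:

```shell
# bridge_vlan_cmds DEVICE PVID VID...
# Hypothetical dry-run helper: prints one "bridge vlan add" command per VID
# and keeps the "pvid untagged" flags on the management VLAN.
bridge_vlan_cmds() {
    dev=$1
    pvid=$2
    shift 2
    for vid in "$@"; do
        if [ "$vid" = "$pvid" ]; then
            echo "bridge vlan add vid $vid dev $dev pvid untagged"
        else
            echo "bridge vlan add vid $vid dev $dev"
        fi
    done
}

bridge_vlan_cmds eno8 100 10 20 30 40 100 110 120
```

Once the printed commands look right, they can be run with sudo one at a time, ideally with the IPMI console already open.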
VMs Status: All VMs survived - no corruption.
kvm-02 Bridge VLANs After Fix
eno8 (physical NIC): 10, 20, 30, 40, 100 PVID Egress Untagged, 110, 120
br-mgmt (bridge): 10, 20, 30, 40, 100 PVID Egress Untagged, 110, 120
Troubleshooting Path
| Hop | Device | VLAN 40 Status | Verified |
|---|---|---|---|
| 1 | WLC trunk (Te0/0/0) | allowed vlan 1,10,20,30,40,100,110,120 | ✓ |
| 2 | 3560 Te1/0/1 (to kvm-01) | Trunk, VLAN 40 allowed | ✓ |
| 3 | kvm-01 virbr0 | All VLANs present | ✓ |
| 4 | kvm-02 eno8 | VLAN 40 present (after fix) | ✓ |
| 5 | kvm-02 br-mgmt | VLAN 40 present | ✓ |
| 6 | 3560 G1/0/11 (to VyOS) | Trunk, VLAN 40 allowed | Need verify |
| 7 | vyos-01 eth1.40 | Configured | tcpdump shows 0 packets |
Next Step: Verify 3560 Port to VyOS
Need to check which 3560 port vyos-01 is connected to and verify trunk config.
tcpdump Commands
Capture on VyOS trunk interface (see ALL VLANs):
sudo tcpdump -i eth1 -e port 67 or port 68 -n
Capture VLAN 40 only:
sudo tcpdump -i eth1.40 port 67 or port 68 -n
Capture VLAN 10 only (verification):
sudo tcpdump -i eth1.10 port 67 or port 68 -n
Session 5: WLC DHCP Relay Fix and Configuration Cleanup - COMPLETE
Time: Evening
Objective: Fix wireless VLAN 40 DHCP issue and clean up WLC legacy configuration.
Root Cause: WLC DHCP Relay Misconfiguration
Symptom: Wired VLAN 40 devices get DHCP, but wireless VLAN 40 clients stuck in "IP Learn" state.
Comparison of Policy Profiles:
POLICY-DOMUS_IoT (BROKEN):
  DHCP required:  ENABLED
  server address: 10.50.40.1
POLICY-DOMUS_SECURE (WORKING):
  DHCP required:  DISABLED
  server address: 0.0.0.0
Root Cause: WLC was configured as DHCP relay (ipv4 dhcp server 10.50.40.1) for IoT profile, but WLC only has Vlan100 SVI (10.50.1.40). Without an SVI on VLAN 40, the relay cannot function.
Fix Applied:
configure terminal
wireless profile policy POLICY-DOMUS_IoT
shutdown
no ipv4 dhcp required
no ipv4 dhcp server
no shutdown
exit
write memory
Result: Client immediately received DHCP (10.50.40.102) from VyOS.
WLC Configuration Cleanup
Removed orphaned WLAN-POLICY mappings from default-policy-tag:
configure terminal
wireless tag policy default-policy-tag
no wlan HomeRF
no wlan IoT_Net
no wlan DOMUS_IoT
no wlan Guest_Net
exit
Deleted orphaned policy profiles:
no wireless profile policy IoT-Policy
no wireless profile policy Guest-Policy
no wireless profile policy VLAN10-Policy
Deleted orphaned policy tags:
no wireless tag policy IoT-Tag
no wireless tag policy HomeRF-Tag
no wireless tag policy TAG-DOMUS_SECURE
New DOMUS Tag Architecture
Created standardized tags following DOMUS naming convention:
! Policy Tag - maps WLANs to policies
wireless tag policy TAG-DOMUS
description "Domus production policy tag"
wlan Domus-Secure policy POLICY-DOMUS_SECURE
wlan Domus-IoT policy POLICY-DOMUS_IoT
exit
! Site Tag - AP profile and local-site setting
wireless tag site TAG-DOMUS-SITE
description "Domus site settings"
ap-profile default-ap-profile
local-site
exit
! RF Tag - Radio frequency profiles
wireless tag rf TAG-DOMUS-RF
description "Domus RF settings"
24ghz-rf-policy Low_Client_Density_rf_24gh
5ghz-rf-policy Low_Client_Density_rf_5gh
exit
! Assign all tags to AP4800
ap dc8c.3796.20a6
policy-tag TAG-DOMUS
site-tag TAG-DOMUS-SITE
rf-tag TAG-DOMUS-RF
exit
write memory
Final State
show ap tag summary
AP Name Site Tag Name Policy Tag Name RF Tag Name Misconfigured Tag Source
AP4800 TAG-DOMUS-SITE TAG-DOMUS TAG-DOMUS-RF No Static
Key Learnings
| Topic | Learning |
|---|---|
| DHCP Relay | WLC needs SVI on target VLAN to relay DHCP. Without it, relay fails silently. |
| Policy Profile Fix | Remove |
| Tag Changes | Assigning new tags causes AP to disconnect and rejoin (1-2 min outage). |
| Production Impact | NEVER change AP tags in production without maintenance window. |
| RF Tag "Misconfigured" | Must assign DIFFERENT RF policies to 2.4GHz and 5GHz bands. |
| AP Config Syntax | Use MAC address format |
Documentation Updated
- wlc-vyos-integration.adoc - Added DHCP relay troubleshooting section
Runbook Updates Needed
- ipsk-manager-deployment.adoc - Add stored procedure verification section
- ipsk-manager-deployment.adoc - Add ISE ODBC stored procedure configuration
- kvm-operations.adoc - Add PVID warning for bridge VLAN commands
Session 6: WiFi EAP-TLS VLAN/DACL Troubleshooting - COMPLETE
Time: Late Evening
Objective: Fix Linux laptop (modestus-razer) failing to connect to Domus-Secure WiFi after successful EAP-TLS authentication.
Symptom
- Phone (Android) connects fine → gets DATA_VLAN (VLAN 10)
- Laptop (Arch Linux) EAP succeeds at ISE → WLC disconnects with reason 250
- ISE DataConnect shows PASSED=1 with Domus_Admin_Profile
wpa_supplicant Log Pattern
CTRL-EVENT-EAP-SUCCESS # ISE accepted authentication
CTRL-EVENT-DISCONNECTED reason=250 # WLC rejected session
Key: reason=250 means "Association denied" - typically VLAN/ACL failure.
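The telltale pattern is a successful EAP exchange followed immediately by a disconnect. A small awk sketch that picks this sequence out of a wpa_supplicant log; `find_post_eap_rejects` is a hypothetical helper, and where the log comes from (a file, or journal output piped in) is left as an assumption about the local setup:

```shell
# find_post_eap_rejects [FILE...]
# Hypothetical helper: prints the reason code of every disconnect that
# directly follows an EAP success. Reads stdin when no file is given.
find_post_eap_rejects() {
    awk '
        /CTRL-EVENT-EAP-SUCCESS/ { ok = 1; next }
        /CTRL-EVENT-DISCONNECTED/ {
            if (ok && match($0, /reason=[0-9]+/))
                print "post-EAP disconnect, " substr($0, RSTART, RLENGTH)
            ok = 0
        }
    ' "$@"
}
```

Disconnects that are not preceded by an EAP success are ignored, since those point at authentication itself rather than at VLAN or ACL handling on the WLC.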
Issue 1: VLAN Naming Mismatch
WLC logs showed:
*ewlc-infra-capwapEventTrace: VLAN Failure. Failed attribute name MANAGEMENT_VLAN
Problem: ISE sends MANAGEMENT_VLAN, WLC had MGMT_VLAN.
Fix:
vlan 100
name MANAGEMENT_VLAN
Issue 2: VLAN Not in Policy Profile
After rename, still failing. Policy profile had only vlan 10 hardcoded.
Fix: (Required brief SSID outage)
wireless profile policy POLICY-DOMUS_SECURE
shutdown
vlan MANAGEMENT_VLAN
no shutdown
Issue 3: DACL Download Disabled
WLC logs showed:
*ewlc-infra-capwapEventTrace: ACL Failure. Failed attribute name xACSACLx-IP-DACL_ADMIN_FULL-696eef58
Root Cause: WLC missing required commands for DACL download:
! Enable Vendor-Specific Attributes
radius-server vsa send authentication
! Enable network authorization
aaa authorization network default group radius
Issue 4: DHCP Timeout
After all WLC fixes, client authenticated but DHCP failed.
Root Cause: MANAGEMENT_VLAN (10.50.1.0/24) uses static IPs by design.
Fix: Configure static IP on Linux WiFi connection:
nmcli conn modify "Domus-WiFi-EAP-TLS" \
ipv4.method manual \
ipv4.addresses "10.50.1.200/24" \
ipv4.gateway "10.50.1.1" \
ipv4.dns "10.50.1.90"
Final Result
show wireless client summary
MAC Address AP Name WLAN State Protocol Method
7015.fbf8.47ec AP4800 4 Run 11ac Dot1x
show access-lists | include xACSACLx
Extended IP access list xACSACLx-IP-DACL_ADMIN_FULL-696eef58
2 permit ip any any (10 matches)
Success: Client connected, VLAN 100 assigned, DACL downloaded.
WLC Configuration Summary
! VLAN renamed
vlan 100
name MANAGEMENT_VLAN
! Policy profile updated
wireless profile policy POLICY-DOMUS_SECURE
vlan MANAGEMENT_VLAN
! DACL enablement
radius-server vsa send authentication
aaa authorization network default group radius
Documentation Created
- runbooks/wlc-eaptls-vlan-dacl-troubleshooting.adoc - Comprehensive troubleshooting runbook
Key Learnings
| Topic | Learning |
|---|---|
| VLAN names | Must match EXACTLY between ISE profile and WLC VLAN name |
| Policy profile VLANs | Policy must list ALL VLANs it can assign dynamically |
| VSA for DACL | |
| AAA authorization | |
| WLC disconnect reason 250 | Association denied - VLAN/ACL configuration issue |
| Infrastructure VLANs | May use static IPs - client must configure static on WiFi |
Session 7: Switch DACL, WLC VLAN, ISE Rule Refinement - COMPLETE
Time: Late Night
Objective: Fix switch DACL download, WLC default VLAN, and restrict ISE admin authorization rule.
Issue 1: Switch DACL Not Downloading (3560CX)
Symptom: Wired 802.1X working but DACL not applied on switch.
Diagnosis:
show run | include radius-server vsa
! (no output)
Fix:
configure terminal
radius-server vsa send authentication
end
write memory
Verification:
show access-session interface g1/0/2 details
ACS ACL: xACSACLx-IP-DACL_ADMIN_FULL-696eef58
Issue 2: WLC Default VLAN Wrong
Problem: After Session 6, policy profile default was MANAGEMENT_VLAN. Should be DATA_VLAN.
Fix:
configure terminal
wireless profile policy POLICY-DOMUS_SECURE
shutdown
no vlan MANAGEMENT_VLAN
vlan DATA_VLAN
no shutdown
exit
end
write memory
Verification:
show wireless profile policy detailed POLICY-DOMUS_SECURE | include VLAN|vlan
VLAN : DATA_VLAN
NOTE: AAA Override still allows ISE to dynamically assign MANAGEMENT_VLAN or IOT_VLAN.
Issue 3: Non-Admin Hitting Admin Rule
Symptom: Son’s laptop (modestus-p50) getting Domus_Admin_Profile and MANAGEMENT_VLAN.
Root Cause: Authorization rule Domus_Cert_Admins condition too broad:
CERTIFICATE:Subject - Organization contains 'Infrastructure'
All certs with O=Domus-Infrastructure matched - including non-admin users.
Solution: Add CN condition to restrict to admin workstations only.
API Approach:
Step 1 - Get current rule:
POLICY_ID="056a2880-5821-465f-adb2-90c32de0b06f"
curl -sk -u "$ISE_API_USER:$ISE_API_PASS" \
"https://$ISE_PAN_FQDN/api/v1/policy/network-access/policy-set/$POLICY_ID/authorization" \
-H "Accept: application/json" | jq '.response[] | select(.rule.name=="Domus_Cert_Admins")' > /tmp/admin_rule.json
Step 2 - Create updated rule with CN condition:
{
"rule": {
"id": "a493e874-a27b-4f31-8465-eca9b7f1feb3",
"name": "Domus_Cert_Admins",
"rank": 0,
"state": "enabled",
"condition": {
"conditionType": "ConditionAndBlock",
"isNegate": false,
"children": [
{
"conditionType": "ConditionAttributes",
"dictionaryName": "Network Access",
"attributeName": "EapAuthentication",
"operator": "equals",
"attributeValue": "EAP-TLS"
},
{
"conditionType": "ConditionAttributes",
"dictionaryName": "CERTIFICATE",
"attributeName": "Subject - Organization",
"operator": "contains",
"attributeValue": "Infrastructure"
},
{
"conditionType": "ConditionOrBlock",
"isNegate": false,
"children": [
{
"conditionType": "ConditionAttributes",
"dictionaryName": "CERTIFICATE",
"attributeName": "Subject - Common Name",
"operator": "contains",
"attributeValue": "razer"
},
{
"conditionType": "ConditionAttributes",
"dictionaryName": "CERTIFICATE",
"attributeName": "Subject - Common Name",
"operator": "contains",
"attributeValue": "aw"
}
]
}
]
}
},
"profile": ["Domus_Admin_Profile"]
}
Step 3 - Apply update:
RULE_ID="a493e874-a27b-4f31-8465-eca9b7f1feb3"
curl -sk -u "$ISE_API_USER:$ISE_API_PASS" \
-X PUT \
"https://$ISE_PAN_FQDN/api/v1/policy/network-access/policy-set/$POLICY_ID/authorization/$RULE_ID" \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-d @/tmp/admin_rule_updated.json | jq .
Verification:
netapi ise dc query "
SELECT
TO_CHAR(acs_timestamp, 'HH24:MI:SS') as time,
user_name,
selected_azn_profiles as profile
FROM mnt.radius_auth_48_live
WHERE user_name LIKE '%p50%'
ORDER BY acs_timestamp DESC
FETCH FIRST 3 ROWS ONLY"
Result:
TIME      USER_NAME                                PROFILE
18:06:19  modestus-p50.inside.domusdigitalis.dev   Domus_Secure_Profile  ← NEW (correct)
17:42:19  modestus-p50.inside.domusdigitalis.dev   Domus_Admin_Profile   ← OLD (wrong)
New Rule Condition
EAP-TLS AND
O contains 'Infrastructure' AND
(CN contains 'razer' OR CN contains 'aw')
Only modestus-razer and modestus-aw get admin profile. All other certs fall through to lower rules.
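The rule logic can be dry-run against candidate certificate subjects before touching ISE. The sketch below mirrors the condition tree with shell pattern matching; `matches_admin_rule` is a hypothetical helper, and it assumes "contains" is a case-sensitive substring match, which the log does not confirm for ISE:

```shell
# matches_admin_rule EAP_METHOD ORGANIZATION COMMON_NAME
# Hypothetical helper: exit status 0 when the subject would hit the admin rule.
matches_admin_rule() {
    eap=$1
    org=$2
    cn=$3
    # EAP-TLS AND ...
    [ "$eap" = "EAP-TLS" ] || return 1
    # ... O contains 'Infrastructure' AND ...
    case $org in
        *Infrastructure*) ;;
        *) return 1 ;;
    esac
    # ... (CN contains 'razer' OR CN contains 'aw')
    case $cn in
        *razer*|*aw*) return 0 ;;
        *) return 1 ;;
    esac
}

matches_admin_rule EAP-TLS Domus-Infrastructure modestus-p50.inside.domusdigitalis.dev \
    || echo "p50 falls through to lower rules"
```

One thing this makes visible: 'aw' is a very short substring, so any future hostname that happens to contain it would also match. A longer CN fragment would be a tighter condition.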
Key Learnings
| Topic | Learning |
|---|---|
| Switch DACL | Requires |
| WLC default VLAN | Use DATA_VLAN as default; AAA Override handles dynamic assignment |
| ISE OpenAPI | Certificate OU conditions not available; use CN or O instead |
| Rule condition structure | |
| API PUT | Must include full rule JSON, not just changed fields |
Session 8: WLC HA SSO Configuration - COMPLETE
Time: Night
Objective: Configure Stateful Switchover (SSO) between WLC-01 and WLC-02.
Prerequisites Verified
- WLC-01 (10.50.1.40) running on kvm-01
- WLC-02 (10.50.1.41) running on kvm-02
- Both running IOS-XE 17.15.x
- Both VMs have 2 NICs: Gi1 (trunk) and Gi2 (HA)
HA Configuration Applied
On WLC-01 (Active):
configure terminal
redundancy
mode sso
exit
chassis redundancy ha-interface GigabitEthernet 2 local-ip 169.254.1.1 /24 remote-ip 169.254.1.2
write memory
On WLC-02 (Standby):
configure terminal
redundancy
mode sso
exit
chassis redundancy ha-interface GigabitEthernet 2 local-ip 169.254.1.2 /24 remote-ip 169.254.1.1
write memory
Post-Reload Verification
show redundancy
Redundant System Information :
Hardware Mode = Duplex
Configured Redundancy Mode = sso
Operating Redundancy Mode = sso
Communications = Up
Current Processor Information :
Active Location = slot 1
Current Software state = ACTIVE
Peer Processor Information :
Standby Location = slot 2
Current Software state = STANDBY HOT
Key Learnings
| Topic | Learning |
|---|---|
| Interface availability | Check |
| 2-NIC approach | Use Gi2 for HA with link-local IPs (169.254.x.x) |
| Standby IP inaccessible | Normal SSO behavior - only Active WLC owns management IP |
| | Console shows hostname with |
| Communications = Up | Confirms HA heartbeat working between WLCs |
Documentation Updated
- wlc-ha-sso.adoc - Added "HA Interface Options" section with 3 approaches (3-NIC, 2-NIC, 1-NIC)
- wlc-vyos-integration.adoc - Updated Phase 4 to reflect completion
Session 9: VM Migration kvm-02 → kvm-01 - COMPLETE
Time: Night
Objective: Move all primary VMs from kvm-02 (NAS-dependent) to kvm-01 (onboard SSD) for resilience until UPS installed.
VMs Migrated
| VM | Size | Status |
|---|---|---|
| vault-01 | ~5GB | ✓ Migrated, SSH CA working |
| bind-01 | ~3GB | ✓ Migrated, DNS resolving |
| home-dc01 | 41GB | ✓ Migrated via NAS staging |
| ipa-01 | 3.3GB | ✓ Deployed from NAS (no XML existed) |
Migration Patterns
Standard (small VMs via workstation /tmp):
NOTE: SSH is not configured between kvm-01 and kvm-02. All transfers go via the workstation.
# kvm-02: Export
sudo virsh dumpxml $vm > /tmp/$vm.xml
sudo cp /var/lib/libvirt/images/$vm.qcow2 /tmp/ && sudo chmod 644 /tmp/$vm.qcow2
# Workstation: Transfer
scp kvm-02:/tmp/{vault-01,bind-01}.{xml,qcow2} /tmp/
scp /tmp/{vault-01,bind-01}.{xml,qcow2} kvm-01:/tmp/
# kvm-01: Define and start
for vm in vault-01 bind-01; do
    sudo mv /tmp/$vm.qcow2 /mnt/onboard-ssd/libvirt/images/
    sed -i 's|/var/lib/libvirt/images/|/mnt/onboard-ssd/libvirt/images/|g' /tmp/$vm.xml
    sudo virsh define /tmp/$vm.xml
    sudo virsh start $vm
done
Large VMs via NAS staging:
# kvm-02: Copy to NAS
sudo cp /var/lib/libvirt/images/home-dc01.qcow2 /mnt/nas/vms/
# kvm-01: Copy from NAS
sudo cp /mnt/nas/vms/home-dc01.qcow2 /mnt/onboard-ssd/libvirt/images/
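Copying a 41GB image through a NAS mount leaves room for a truncated or corrupted copy, and the log does not mention an integrity check. One possible safeguard, sketched here with `sha256sum` from coreutils; `same_checksum` is a hypothetical helper and not something the session actually ran:

```shell
# same_checksum FILE_A FILE_B -> exit status 0 when the SHA-256 digests match.
# Hypothetical helper; an unreadable file yields an empty digest and a failure.
same_checksum() {
    sum_a=$(sha256sum < "$1" | cut -d' ' -f1)
    sum_b=$(sha256sum < "$2" | cut -d' ' -f1)
    [ -n "$sum_a" ] && [ "$sum_a" = "$sum_b" ]
}
```

For example, `same_checksum /mnt/nas/vms/home-dc01.qcow2 /mnt/onboard-ssd/libvirt/images/home-dc01.qcow2 && echo "copy verified"`, run with enough privileges to read both files. The comparison is only meaningful while neither image is being written to.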
Import from NAS (no XML):
sudo cp /mnt/nas/vms/ipa-01.qcow2 /mnt/onboard-ssd/libvirt/images/
sudo virt-install \
--name ipa-01 \
--memory 4096 \
--vcpus 2 \
--disk /mnt/onboard-ssd/libvirt/images/ipa-01.qcow2,bus=virtio \
--import \
--os-variant rocky9 \
--network bridge=br-mgmt,model=virtio \
--graphics vnc,listen=0.0.0.0 \
--noautoconsole
# CRITICAL: virt-install doesn't trigger libvirt "started" hook
sudo virsh destroy ipa-01
sudo virsh start ipa-01
Final kvm-01 State
Id Name State
-------------------------------
11 k3s-master-01 running
14 vyos-01 running
21 ipsk-mgr-01 running
23 9800-WLC-01 running
24 vault-01 running
25 bind-01 running
26 home-dc01 running
27 ipa-01 running
kvm-02 State (Secondaries Only)
Id Name State
------------------------------
27 vyos-02 running
42 ise-02 running
66 9800-WLC-02 running
Key Learnings
| Topic | Learning |
|---|---|
| SSH between hypervisors | NOT configured - use workstation as intermediary |
| Large VM transfer | NAS as staging bypasses workstation /tmp quota |
| virt-install hook | Does NOT trigger libvirt "started" hook - use destroy/start cycle |
| Clock drift | Common after VM migration - |
| PVID verification | |
| Cockpit VMs | Install |
Documentation Updated
- kvm-operations.adoc - Added kvm-02 → kvm-01 migration via workstation
- kvm-operations.adoc - Added NAS import for VMs without XML
- CLAUDE.md - Updated VM Migration priority section
Tomorrow (2026-03-09)
Personal Infrastructure
- ~~Complete iPSK ISE integration~~ - DONE (Session 3-5)
- ~~WiFi EAP-TLS VLAN/DACL fix~~ - DONE (Session 6)
- ~~WLC HA SSO configuration~~ - DONE (Session 8)
- kvm-01 Rocky rebuild Phase 4-7
- VyOS remaining VLAN cutover (VOICE, STORAGE, DMZ)
- Test iPSK with additional IoT devices
CHLA
- Xianming Ding Linux SSH (priority)
- iPSK DB replication investigation