WRKLOG-2026-03-08

Summary

Massive infrastructure day. 9 sessions covering network, wireless, identity, and compute.

Key Accomplishments

Area               Achievement
iPSK Manager       ISE ODBC integration complete - All 5 stored procedure tests passing. Fixed iPSK_AttributeFetch migration script.
Wireless DHCP      Root cause: WLC DHCP relay without VLAN 40 SVI. Fix: Disabled relay, let broadcasts flow to VyOS.
WLC Cleanup        Removed legacy policy/tag configs. Created standardized TAG-DOMUS architecture.
WLC HA SSO         WLC-01 (Active) + WLC-02 (Standby HOT). 2-NIC approach with Gi2 for HA heartbeat.
WiFi EAP-TLS       Fixed VLAN naming mismatch, enabled DACL download (radius-server vsa send authentication), corrected default VLAN.
ISE Authorization  Restricted Domus_Cert_Admins rule to CN containing 'razer' or 'aw' only.
Switch DACL        3560CX now downloads DACLs - added VSA authentication.
VM Migration       Moved vault-01, bind-01, home-dc01, ipa-01 from kvm-02 (NAS) to kvm-01 (onboard SSD) for UPS independence.
Documentation      VyOS migration runbook hierarchy verified. Added missing vyos-vlan-fasttrack-migration.adoc to nav.

Infrastructure State After Today

System        Status       Notes
iPSK Manager  OPERATIONAL  ISE ODBC working, wireless DHCP fixed
WLC HA        OPERATIONAL  SSO Active/Standby HOT
EAP-TLS WiFi  OPERATIONAL  VLAN/DACL assignment working
Primary VMs   kvm-01       vault, bind, DC, IPA migrated off NAS
VyOS HA       OPERATIONAL  Remaining VLANs pending (VOICE, STORAGE, DMZ)

Today’s Priority Tasks

Priority  Task                                     Status          Notes
P0        iPSK Manager ISE ODBC integration        [x] DONE        All 5 tests passing, iPSK_AttributeFetch fixed
P0        iPSK Wireless DHCP Fix                   [x] DONE        Root cause: WLC DHCP relay without VLAN 40 SVI
P0        WLC Configuration Cleanup                [x] DONE        Removed legacy profiles/tags, created DOMUS architecture
P1        kvm-01 Rocky Linux rebuild (Phases 4-7)  [ ] CARRY-OVER  Blocked on iPSK completion
P1        VyOS complete cutover (remaining VLANs)  [ ] CARRY-OVER  VOICE, STORAGE, DMZ pools
P1        WLC HA SSO Configuration                 [x] DONE        WLC-01 + WLC-02 in SSO pair (Session 8)

Carried Over from 2026-03-07

Professional (CHLA)

Priority  Task                         Status          Notes
P0        Linux SSH (Xianming Ding)    [x] RESOLVED    DACL missing AD DC DNS IPs. EAP-TLS cert test tomorrow.
P1        iPSK Manager DB replication  [ ] CARRY-OVER  -
P1        ISE 3.4p9 / 3.5 migration    [ ] CARRY-OVER  -
P1        Monad - QRadar → Sentinel    [ ] ACTIVE      Cost-driven. Critical → Sentinel, bulk → local Linux. I own firewall/ISE/network sources.

Monad Migration Context

Project: QRadar → Microsoft Sentinel migration

Constraint: Cost. Sentinel charges per GB ingested.

Strategy:

  • Critical pipeline → Sentinel (security events, auth failures, alerts)

  • Bulk logs → Local Linux syslog server (as backup/compliance)

My scope: Firewall (pfSense/VyOS), ISE, network devices - high log volume sources that need filtering.

Questions to answer:

  1. What ISE log categories are "critical" vs "bulk"?

  2. Can ISE syslog be split by severity/category?

  3. What’s the local Linux target? (rsyslog? syslog-ng? Wazuh?)

Session 10: k3s Observability & jq Practice

Time: Evening

What Was Learned

Topic                   Key Insight
Shell prompt pollution  The hook message "⚡ No session active" was captured in JSON redirects. Fix: grep -v "session active" or [[ -t 1 ]] in shell hook.
kubectl JSON + jq       kubectl get pods -A -o json + jq '.items[] | select(.status.phase != "Running")' = filter anomalies.
Pod status truth        .status.phase isn’t always accurate. wazuh-indexer-0 showed Running phase but terminated containerStatus.
jq aggregation          group_by(.) | map({namespace: .[0], count: length}) = count by namespace.
Monad transforms        Use jq to filter/transform security data BEFORE SIEM ingestion. Same patterns apply.
MetalLB L2 mode         Services need type: LoadBalancer to get external IPs. Patch with kubectl patch svc.
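
The terminal check mentioned for prompt pollution can be sketched like this. It is a minimal illustration, not the actual hook: the function name session_banner is hypothetical, and [ -t 1 ] is used as the POSIX spelling of the [[ -t 1 ]] test.

```shell
# Hypothetical hook helper: print the session message only when stdout
# is a terminal, so redirects and command substitutions stay clean.
session_banner() {
  if [ -t 1 ]; then
    echo "⚡ No session active"
  fi
  return 0
}

# Inside a command substitution stdout is a pipe, so nothing is captured.
captured=$(session_banner)
echo "captured length: ${#captured}"
```

With a guard like this in the hook, the grep -v "session active" workaround on redirected output should no longer be needed.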

Fixes Applied

Issue                      Fix                                                        Result
wazuh-indexer-0 Unknown    kubectl delete pod wazuh-indexer-0 -n wazuh                StatefulSet recreating
Monitoring ClusterIP only  kubectl patch svc … -p '{"spec":{"type":"LoadBalancer"}}'  IPs assigned: Grafana .135, Prometheus .136, AlertManager .137
svclb-dashboard Pending    Port conflict - ignored (dashboard service misconfigured)  Non-critical

jq Patterns Practiced

# Filter non-Running pods
jq '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | {namespace: .metadata.namespace, name: .metadata.name, phase: .status.phase}' /tmp/pods.json

# Count by namespace
jq '[.items[].metadata.namespace] | group_by(.) | map({namespace: .[0], count: length})' /tmp/pods.json

# Deep container status inspection
jq '.items[] | select(.metadata.name == "wazuh-indexer-0") | {phase: .status.phase, containerStatuses: .status.containerStatuses}' /tmp/pods.json

Infrastructure State After Session

Service        IP                Status
Grafana        10.50.1.135:80    ✅ Accessible
Prometheus     10.50.1.136:9090  ✅ Accessible
AlertManager   10.50.1.137:9093  ✅ Accessible
Wazuh Indexer  -                 🔄 Recreating

Session 11: Traefik DNS Mismatch Fix

Time: Late Evening

Problem

Monitoring services (Grafana, Prometheus, AlertManager) unreachable via DNS names despite IngressRoutes configured.

Root Cause Analysis

Check              Finding
DNS records        grafana.inside.domusdigitalis.dev → 10.50.1.130
Traefik actual IP  10.50.1.131 (MetalLB auto-assigned)
Who has .130?      wazuh/indexer service claimed it first

Diagnosis commands:

# Check Traefik's actual IP
ssh k3s-master-01 "kubectl get svc traefik -n kube-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
# Returns: 10.50.1.131

# Check all LoadBalancer assignments
ssh k3s-master-01 "kubectl get svc -A -o jsonpath='{range .items[?(@.spec.type==\"LoadBalancer\")]}{.metadata.namespace}/{.metadata.name}: {.status.loadBalancer.ingress[0].ip}{\"\\n\"}{end}'" | grep -v "session active"

Fix

Follow: BIND Operations Quick Ref → Update Existing DNS Records (infra-ops runbook)

Phase 1: Pre-Validation

DNS_SERVER="bind-01"
DNS_IP="10.50.1.90"
DOMAIN="inside.domusdigitalis.dev"
FORWARD_ZONE="/var/named/inside.domusdigitalis.dev.zone"

echo "=== 1.1 CURRENT STATE ==="
ssh $DNS_SERVER "sudo grep -E 'grafana|prometheus|alertmanager' $FORWARD_ZONE"

echo "=== 1.2 CURRENT SERIAL ==="
ssh $DNS_SERVER "sudo rndc zonestatus $DOMAIN | grep serial"

echo "=== 1.3 CURRENT DNS RESOLUTION ==="
for h in grafana prometheus alertmanager; do
  echo -n "$h: "
  dig @$DNS_IP $h.$DOMAIN +short
done

Phase 2: Backup

TIMESTAMP=$(date +%Y%m%d%H%M)

echo "=== 2.1 CREATE BACKUP ==="
ssh $DNS_SERVER "sudo cp $FORWARD_ZONE ${FORWARD_ZONE}.bak.${TIMESTAMP}"

echo "=== 2.2 VERIFY BACKUP EXISTS ==="
ssh $DNS_SERVER "sudo ls -la ${FORWARD_ZONE}.bak.${TIMESTAMP}"

Phase 3: Capture Serial

CURRENT_SERIAL=$(dig @$DNS_IP $DOMAIN SOA +short | awk '{print $3}')
NEW_SERIAL=$((CURRENT_SERIAL + 1))

echo "=== 3.1 SERIAL VALUES ==="
echo "Current: $CURRENT_SERIAL"
echo "New:     $NEW_SERIAL"
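
The serial in this zone looks like the YYYYMMDDNN convention (for example 2026030401). The runbook simply adds 1, which is always valid. As an optional variation, and only under the assumption that the date-based convention should be kept, the next serial could be computed like this; next_serial is a hypothetical helper, not part of the runbook.

```shell
# next_serial CURRENT [TODAY]
# Assumes YYYYMMDDNN serials. TODAY defaults to the current date and can
# be passed explicitly to make the result reproducible.
next_serial() (
  current=$1
  today=${2:-$(date +%Y%m%d)}
  base="${today}01"
  if [ "$current" -lt "$base" ]; then
    echo "$base"            # first change on a new day: restart at 01
  else
    echo $((current + 1))   # another change for the same date: bump counter
  fi
)

next_serial 2026030401 20260304
next_serial 2026030401 20260308
```

The serial only ever moves forward, which is what secondaries need in order to pick up the change. After 99 changes for one date the counter spills into the date digits, which is still a valid (larger) serial.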

Phase 4: Apply Changes

echo "=== 4.1 UPDATE IPS ==="
ssh $DNS_SERVER "sudo sed -i '/grafana/s/10\.50\.1\.130/10.50.1.131/' $FORWARD_ZONE"
ssh $DNS_SERVER "sudo sed -i '/prometheus/s/10\.50\.1\.130/10.50.1.131/' $FORWARD_ZONE"
ssh $DNS_SERVER "sudo sed -i '/alertmanager/s/10\.50\.1\.130/10.50.1.131/' $FORWARD_ZONE"

echo "=== 4.2 UPDATE SERIAL ==="
ssh $DNS_SERVER "sudo sed -i 's/$CURRENT_SERIAL/$NEW_SERIAL/' $FORWARD_ZONE"

echo "=== 4.3 VERIFY CHANGES ==="
ssh $DNS_SERVER "sudo grep -E 'grafana|prometheus|alertmanager' $FORWARD_ZONE"
ssh $DNS_SERVER "sudo grep -E '[0-9]{10}.*[Ss]erial' $FORWARD_ZONE"

Phase 5: Validate & Reload

echo "=== 5.1 CHECK ZONE SYNTAX ==="
ssh $DNS_SERVER "sudo named-checkzone $DOMAIN $FORWARD_ZONE"

echo "=== 5.2 RELOAD ZONE ==="
ssh $DNS_SERVER "sudo rndc reload $DOMAIN"

echo "=== 5.3 VERIFY NEW SERIAL LOADED ==="
ssh $DNS_SERVER "sudo rndc zonestatus $DOMAIN | grep serial"

Phase 6: Post-Validation

echo "=== 6.1 VERIFY DNS RESOLUTION ==="
for h in grafana prometheus alertmanager; do
  result=$(dig @$DNS_IP $h.$DOMAIN +short)
  if [[ "$result" == "10.50.1.131" ]]; then
    echo "✓ $h → $result"
  else
    echo "✗ $h: expected 10.50.1.131, got '$result'"
  fi
done

echo "=== 6.2 TEST HTTPS ACCESS ==="
curl -sSk -o /dev/null -w "%{http_code}" https://grafana.inside.domusdigitalis.dev
# Expected: 200 or 302

Key Learning

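MetalLB (L2 mode here) auto-assigns the next free address from its pool as each LoadBalancer service is created, so the IP a service gets depends on creation order. If you need a specific IP: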

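# Option 1: Patch service with specific IP (the address must be free in the pool;
# spec.loadBalancerIP is deprecated upstream, and newer MetalLB releases prefer
# the metallb.universe.tf/loadBalancerIPs annotation)
kubectl patch svc traefik -n kube-system -p '{"spec":{"loadBalancerIP":"10.50.1.130"}}'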

# Option 2: Let MetalLB auto-assign, update DNS to match
# (Simpler - what we did here)

Target architecture vs Reality:

Service        Target (diagram)  Actual
Traefik        10.50.1.130       10.50.1.131
Wazuh Indexer  -                 10.50.1.130

Future cleanup: Reassign IPs to match infrastructure-radial-v7.d2 target.

Personal Infrastructure

Priority  Task                                           Status
P0        iPSK Manager ISE ODBC integration              [x] DONE
P1        kvm-01 Rocky rebuild Phase 4-7                 [ ] CARRY-OVER
P1        VyOS complete cutover                          [ ] CARRY-OVER
P1        kvm-01 bridge VLAN persistence (libvirt hook)  [ ] CARRY-OVER
P1        WLC HA SSO Configuration                       [x] DONE (Session 8)
P2        vyos-01 HA establishment                       [ ] FUTURE

Session 12: Reverse Zone PTR Update

Time: Late Evening (continued from Session 11)

Problem

Forward zone updated (grafana/prometheus/alertmanager → 10.50.1.131), but reverse zone PTR for .131 still pointed to wazuh-indexer.

Root Cause

The Session 11 change touched only the forward zone. When updating DNS records, both forward (A) and reverse (PTR) zones must be updated together.

Fix

On bind-01 (SSH’d in):

REVERSE_ZONE="/var/named/10.50.1.rev"

# Phase 1: Backup
sudo cp $REVERSE_ZONE ${REVERSE_ZONE}.bak.$(date +%Y%m%d%H%M)

# Phase 2: Capture serial (note: zone files need sudo to read)
CURRENT_SERIAL=$(sudo awk '/; Serial/ {print $1}' $REVERSE_ZONE)
echo "Current serial: $CURRENT_SERIAL"

# Phase 3: Update PTR
sudo sed -i '/^131/s/wazuh-indexer/traefik/' $REVERSE_ZONE

# Phase 4: Increment serial
NEW_SERIAL=$((CURRENT_SERIAL + 1))
sudo sed -i "s/$CURRENT_SERIAL/$NEW_SERIAL/" $REVERSE_ZONE
echo "New serial: $NEW_SERIAL"

# Phase 5: Validate
sudo named-checkzone 1.50.10.in-addr.arpa $REVERSE_ZONE

# Phase 6: Reload and verify
sudo rndc reload 1.50.10.in-addr.arpa
dig +short -x 10.50.1.131 @localhost

Result

Current serial: 2026030401
New serial: 2026030402
zone 1.50.10.in-addr.arpa/IN: loaded serial 2026030402
OK
zone reload queued
traefik.inside.domusdigitalis.dev.

Key Learnings

Issue                       Solution
Reverse zone path           /var/named/10.50.1.rev (not 1.50.10.in-addr.arpa.zone)
awk serial pattern differs  Forward: awk '/serial/', Reverse: awk '/; Serial/'
Zone files need sudo        Even for reading: sudo awk …
Always update BOTH zones    A record change → also update corresponding PTR
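
To make the A-to-PTR pairing concrete, the PTR owner name and the enclosing /24 reverse zone can be derived from the address itself. A minimal sketch; ptr_owner and reverse_zone_24 are hypothetical helper names, and a /24 reverse zone is assumed, as in this log.

```shell
# ptr_owner IPV4 -> fully qualified PTR owner name
ptr_owner() (
  IFS=.
  set -- $1            # split the address into its four octets
  echo "$4.$3.$2.$1.in-addr.arpa."
)

# reverse_zone_24 IPV4 -> name of the enclosing /24 reverse zone
reverse_zone_24() (
  IFS=.
  set -- $1
  echo "$3.$2.$1.in-addr.arpa"
)

ptr_owner 10.50.1.131
reverse_zone_24 10.50.1.131
```

The second value is the zone name passed to named-checkzone and rndc reload in the fix above; the last octet (131) is the record label inside that zone file.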

Documentation Gap

Missing from bind-operations-quick-ref.adoc:

  • Reverse zone path attribute (currently hardcoded)

  • Example for reverse zone PTR update procedure

Action needed: Add {reverse-zone-path} attribute to antora.yml

Session 13: Wazuh DNS Alignment & k3s NAT Discovery

Time: Late Night

Wazuh DNS Mismatch

Fixed forward and reverse zone misalignment between DNS records and actual MetalLB IPs.

Forward Zone Updates:

# Fix wazuh-indexer: .131 → .130
sudo sed -i '/^wazuh-indexer/s/10\.50\.1\.131/10.50.1.130/' $FORWARD_ZONE

# Fix wazuh-workers: .133 → .134
sudo sed -i '/^wazuh-workers/s/10\.50\.1\.133/10.50.1.134/' $FORWARD_ZONE

# Add wazuh-dashboard
sudo sed -i '/^wazuh[[:space:]]/a wazuh-dashboard IN  A       10.50.1.133' $FORWARD_ZONE

Reverse Zone Updates:

# Add 130 for wazuh-indexer
sudo sed -i '/^131/i 130     IN  PTR     wazuh-indexer.inside.domusdigitalis.dev.' $REVERSE_ZONE

# Fix 133: wazuh-workers → wazuh-dashboard
sudo sed -i '/^133/s/wazuh-workers/wazuh-dashboard/' $REVERSE_ZONE

# Fix 134: wazuh-api → wazuh-workers
sudo sed -i '/^134/s/wazuh-api/wazuh-workers/' $REVERSE_ZONE

Final DNS State:

IP           Forward (A)      Reverse (PTR)
10.50.1.130  wazuh-indexer    wazuh-indexer
10.50.1.131  traefik          traefik
10.50.1.132  wazuh            wazuh
10.50.1.133  wazuh-dashboard  wazuh-dashboard
10.50.1.134  wazuh-workers    wazuh-workers
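
A check like the following could catch forward/reverse drift before it shows up as a broken service. This is a sketch under simplifying assumptions: one record per line in "name IN A ip" and "octet IN PTR fqdn." form, a single /24 reverse zone, and a hypothetical function name check_dns_alignment. The sample records are illustrative, not a copy of the real zone files.

```shell
# check_dns_alignment FORWARD_FILE REVERSE_FILE DOMAIN PREFIX
# PREFIX is the first three octets of the /24 (for example 10.50.1).
# Reports PTRs whose target has no matching A record, and A-record
# addresses that have no PTR at all.
check_dns_alignment() {
  awk -v domain="$3" -v prefix="$4" '
    NR == FNR {                      # first file: forward zone
      if ($2 == "IN" && $3 == "A") { a[$1 "." domain "."] = $4; has_a[$4] = 1 }
      next
    }
    $2 == "IN" && $3 == "PTR" {      # second file: reverse zone
      ip = prefix "." $1
      has_ptr[ip] = 1
      got = ($4 in a) ? a[$4] : "(none)"
      if (got != ip) print "MISMATCH " ip ": PTR " $4 " has A " got
    }
    END {
      for (ip in has_a) if (!(ip in has_ptr)) print "MISSING PTR for " ip
    }
  ' "$1" "$2" | sort
}

tmp=$(mktemp -d)
cat > "$tmp/fwd" <<'EOF'
wazuh-indexer   IN  A  10.50.1.130
traefik         IN  A  10.50.1.131
wazuh-workers   IN  A  10.50.1.134
EOF
cat > "$tmp/rev" <<'EOF'
131  IN  PTR  traefik.inside.domusdigitalis.dev.
134  IN  PTR  wazuh-api.inside.domusdigitalis.dev.
EOF
check_dns_alignment "$tmp/fwd" "$tmp/rev" inside.domusdigitalis.dev 10.50.1
```

Checking from the PTR side means several names sharing one address (grafana, prometheus and alertmanager behind Traefik) do not raise false alarms.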

Wazuh Indexer Pod - ImagePullBackOff

Symptom: Wazuh dashboard returns 503 - can’t reach indexer.

Root Cause Chain:

  1. wazuh-indexer-0 stuck in Init:ImagePullBackOff

  2. Init container can’t pull busybox image

  3. k3s pod network (10.42.0.0/16) not in VyOS NAT rules

Fix Applied (vyos-01 + vyos-02):

configure

# Create k3s pod network group
set firewall group network-group NET_K3S_PODS description 'k3s Pod Network'
set firewall group network-group NET_K3S_PODS network '10.42.0.0/16'

# Add NAT masquerade rule
set nat source rule 170 source group network-group 'NET_K3S_PODS'
set nat source rule 170 outbound-interface name 'eth0'
set nat source rule 170 translation address 'masquerade'
set nat source rule 170 description 'k3s pods to internet'

commit
save

Status: NAT rule applied, needs verification tomorrow (test curl still timing out).

Key Learnings

Concept                Insight
Masquerade             Linux/VyOS term for Cisco PAT/NAT Overload - hides RFC1918 behind single IP
Pod network isolation  k3s pods use 10.42.0.0/16, separate from node network - needs explicit NAT
Unix philosophy        Small tools, one job each: awk (Aho/Weinberger/Kernighan), grep (g/re/p), sed (stream editor), tee (T-junction)
awk for VyOS parsing   show configuration commands | awk '/pattern/' - powerful for filtering VyOS output
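
The awk-for-VyOS idea can go beyond a plain pattern match by testing individual fields. A small sketch, run here against a saved copy of the commands from the NAT fix rather than a live router; the helper names vyos_config, masquerade_rules and rule_lines are hypothetical.

```shell
# Stand-in for `show configuration commands` output (lines from the fix above).
vyos_config() {
  cat <<'EOF'
set firewall group network-group NET_K3S_PODS description 'k3s Pod Network'
set firewall group network-group NET_K3S_PODS network '10.42.0.0/16'
set nat source rule 170 source group network-group 'NET_K3S_PODS'
set nat source rule 170 outbound-interface name 'eth0'
set nat source rule 170 translation address 'masquerade'
set nat source rule 170 description 'k3s pods to internet'
EOF
}

# Print the numbers of NAT source rules that masquerade.
masquerade_rules() {
  awk '$2 == "nat" && $3 == "source" && $6 == "translation" && $NF ~ /masquerade/ { print $5 }'
}

# Print every config line that belongs to one NAT source rule.
rule_lines() {
  awk -v rule="$1" '$2 == "nat" && $3 == "source" && $5 == rule'
}

vyos_config | masquerade_rules
vyos_config | rule_lines 170
```

Matching on field positions avoids false hits from descriptions or group names that happen to contain the search word.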

Carried to Tomorrow

  • Verify k3s pod internet access after NAT rule

  • Restart wazuh-indexer pod

  • Test Wazuh dashboard accessibility

iPSK Manager Deployment Status

Current State

Component         Status    Notes
VM (ipsk-mgr-01)  RUNNING   Ubuntu 24.04 on kvm-01
Apache HTTPS      WORKING   HTTP 200 on ipsk-mgr-01.inside.domusdigitalis.dev/
MariaDB           RUNNING   TLS enabled, network listening on 0.0.0.0:3306
DNS Records       ADDED     ipsk-mgr-01, ipsk-mgr-02, ipsk-mgr (VIP)
Web Installer     COMPLETE  Setup wizard finished
ISE ODBC          COMPLETE  All 5 tests passing
iPSK Wireless     COMPLETE  DHCP relay fix applied, clients getting IPs
WLC Config        COMPLETE  Legacy cleanup, DOMUS tag architecture

ISE ODBC Integration Requirements

From iPSK Manager documentation:

Required Stored Procedures:

  • iPSK_AttributeFetch

  • iPSK_AuthMACPlain

  • iPSK_FetchGroups

  • iPSK_FetchPasswordForMAC

  • iPSK_MACLookup

Optional (non-expired variants):

  • iPSK_AuthMACPlainNonExpired

  • iPSK_FetchPasswordForMACNonExpired

  • iPSK_MACLookupNonExpired

ISE ODBC Settings:

Setting      Value
Name         iPSK-Manager-ODBC
ODBC Driver  MySQL ODBC 8.0 Unicode Driver
Hostname     10.50.1.30
Database     ipsk
Username     ipsk_readonly
Port         3306
Enable SSL   Yes

Session Log

Session 1: ISE ODBC Identity Source Configuration - COMPLETE

Time: Morning

Objective: Configure ISE ODBC connection to iPSK Manager database.

Result: All 5 ISE ODBC tests passing.

Issue Found: iPSK_AttributeFetch procedure was missing. Root cause: v7 migration file had unsubstituted placeholders ({{DB_NAME}}, {{ISE_DB_USERNAME}}).

Fix Applied:

ssh ipsk-mgr-01 "sed 's/{{DB_NAME}}/ipsk/g; s/{{ISE_DB_USERNAME}}/ipsk-ise/g' /var/www/ipsk/supportfiles/db/migrations/v7__attribute_fetch_subscriber_name.sql | sudo mysql"
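
A pre-flight check for leftover placeholders would have flagged this before the migration ran. A minimal sketch: find_placeholders is a hypothetical helper, and the sample file content below is invented for illustration (it is not the real v7 migration). In sed's basic regular expressions the braces are literal, so no escaping is needed.

```shell
# find_placeholders FILE
# Prints each distinct {{NAME}} token still present in FILE.
find_placeholders() {
  grep -o '{{[A-Z_]*}}' "$1" | sort -u
}

# Invented stand-in for a migration file with unsubstituted tokens.
sample=$(mktemp)
cat > "$sample" <<'EOF'
USE `{{DB_NAME}}`;
GRANT EXECUTE ON PROCEDURE `{{DB_NAME}}`.`iPSK_AttributeFetch` TO '{{ISE_DB_USERNAME}}'@'%';
EOF

find_placeholders "$sample"

# After substitution nothing should remain.
sed 's/{{DB_NAME}}/ipsk/g; s/{{ISE_DB_USERNAME}}/ipsk-ise/g' "$sample" > "$sample.sql"
if [ -z "$(find_placeholders "$sample.sql")" ]; then
  echo "no placeholders left"
fi
```

Running the same check over the whole migrations directory would show whether other versions have the same problem.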

ISE ODBC Test Results:

Test connection          Connection succeeded
Fetch attributes         Test succeeded
Fetch password info      Test succeeded
Fetch groups             Test succeeded
Check user exists        Test succeeded

Documentation Created:

  • examples/ise/ipsk-api-operations.sh - ISE API reference with correct env vars

  • Updated ipsk-manager-deployment.adoc with troubleshooting section

Key Learning: ISE API environment variables (from netapi source):

ISE_PAN_FQDN or ISE_PAN_IP
ISE_API_USER
ISE_API_PASS
ISE_API_TOKEN (Base64 alternative)

Session 2: iPSK ISE Policy Configuration - COMPLETE

Time: Afternoon

Objective: Configure ISE authorization rules to use iPSK ODBC identity source.

Status: Authorization rules already configured - Domus_IoT_Wireless exists in Domus_MAB policy set.

Existing Configuration Discovered:

  • Policy Set: Domus_MAB (ID: 76bba980-befd-45d4-9c2a-4be81ac47f8c)

  • Authorization Rule: Domus_IoT_Wireless (rank 1)

  • Condition: iPSKManager:ExternalGroups equals Domus-IoT

  • Profile: Domus-IoT-iPSK (created this session)

Session 3: iPSK End-to-End Validation - IN PROGRESS

Time: Evening

Objective: Validate complete iPSK flow from device to network access.

Test Device: 64:32:A8:C4:C7:19

Step 1: Release Rejected Endpoint

netapi ise release-rejected 64:32:A8:C4:C7:19
Note: the command is release-rejected, NOT release-rejected-endpoint.

Step 2: First Auth Attempt - VLAN Empty Issue

ISE auth succeeded (5200) but:

Tunnel-Private-Group-ID: Empty

Root Cause: Original profile iPSK-Auth not found in ERS API. Created new profile Domus-IoT-iPSK.

Step 3: Authorization Profile Investigation

Profile Domus-IoT-iPSK created with dynamic VLAN from iPSK Manager:

"vlan": {
  "nameID": "ipsk-mgr-01:vlan",
  "tagID": 1
}

This means ISE queries iPSK Manager ODBC for VLAN value per endpoint.

Step 4: iPSK Manager ODBC Query - VLAN Empty

Queried iPSK_AttributeFetch directly:

MAC="64:32:A8:C4:C7:19"
ssh ipsk-mgr-01 "sudo mysql ipsk -e \"SET @result=0; CALL iPSK_AttributeFetch('$MAC', @result);\""

Key Learning: iPSK_AttributeFetch requires 2 arguments:

  1. MAC address (IN)

  2. Result (OUT parameter)

Result: vlan: [] - EMPTY!

Step 5: Fix VLAN in iPSK Manager Database

MAC="64:32:A8:C4:C7:19"
VLAN="40"
ssh ipsk-mgr-01 "sudo mysql ipsk -e \"UPDATE endpoints SET vlan='$VLAN' WHERE macAddress='$MAC';\""

Verified:

vlan: [40] dacl: []

Step 6: Second Auth - VLAN 40 Assigned, No DHCP

ISE now returns Tunnel-Private-Group-ID: 40 ✓

But WLC shows client stuck in IP Learn:

Client MAC: 6432.a8c4.c719
VLAN: 40
VLAN Name: IOT_VLAN
Policy Manager State: IP Learn
Client IPv4 Address: (empty)
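
The same device shows up as 64:32:A8:C4:C7:19 in ISE and netapi commands and as 6432.a8c4.c719 on the WLC. Two small converters make it easier to paste one into the other; the function names are hypothetical and only 48-bit MACs are assumed.

```shell
# to_cisco_mac 64:32:A8:C4:C7:19 -> lowercase dotted-quad form used by IOS-XE
to_cisco_mac() {
  printf '%s\n' "$1" | tr 'A-F' 'a-f' | tr -d ':.-' |
    sed 's/^\(....\)\(....\)\(....\)$/\1.\2.\3/'
}

# to_colon_mac 6432.a8c4.c719 -> uppercase colon-separated form
to_colon_mac() {
  printf '%s\n' "$1" | tr 'a-f' 'A-F' | tr -d ':.-' |
    sed 's/\(..\)/\1:/g; s/:$//'
}

to_cisco_mac 64:32:A8:C4:C7:19
to_colon_mac 6432.a8c4.c719
```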

Step 7: VyOS DHCP Investigation

VyOS configuration verified:

Pool: IOT (10.50.40.100-10.50.40.199)
Interface: eth1.40 (10.50.40.1/24, 10.50.40.2/24 VRRP)

No IOT leases in DHCP table - requests not reaching VyOS.

Current Status: Troubleshooting DHCP path from AP/WLC to VyOS VLAN 40.

Documentation Created

  • examples/ise/ipsk-odbc-operations.sh - CRITICAL: MariaDB troubleshooting commands

  • examples/ise/ipsk-api-operations.sh - ISE ERS API commands for profiles

Key Learnings

Topic                Learning
netapi command       release-rejected not release-rejected-endpoint
iPSK_AttributeFetch  Requires OUT parameter: SET @result=0; CALL proc('MAC', @result);
Dynamic VLAN         Profile uses ipsk-mgr-01:vlan - queries ODBC for per-endpoint VLAN
Database fix         UPDATE endpoints SET vlan='40' WHERE macAddress='…'

Key Commands Reference

MariaDB Diagnostics

# Check network binding
ss -tlnp | grep 3306
# Check SSL status
sudo mysql -e "SHOW VARIABLES LIKE '%ssl%';"
# List stored procedures
sudo mysql -e "SHOW PROCEDURE STATUS WHERE Db='ipsk';" | awk -F'\t' 'NR>1{print $2}'
# Test stored procedure
sudo mysql ipsk -e "CALL iPSK_MACLookup('AA:BB:CC:DD:EE:FF');"

ISE ODBC Testing

# Check ISE can reach MariaDB (from workstation with netapi)
nc -zv 10.50.1.30 3306
# ISE RADIUS authentication check
netapi ise mnt sessions | awk '/iPSK/{print}'

Session 4: VLAN 40 DHCP Path Troubleshooting - IN PROGRESS

Time: Evening

Objective: Determine why VLAN 40 DHCP requests are not reaching VyOS.

Evidence: VLAN 10 Works, VLAN 40 Doesn’t

VyOS DHCP leases show VLAN 10 (DATA) active, VLAN 40 (IOT) empty:

IP Address    MAC address        Pool    Hostname
10.50.10.100  9c:83:06:ce:89:46  DATA    evan-s-z-fold7
10.50.10.102  dc:8c:37:96:20:a6  DATA    ap4800

No leases in IOT pool (10.50.40.x).

Incident: kvm-02 Bridge VLAN Outage

CRITICAL LESSON: Adding VLANs to eno8 without preserving PVID 100 caused connectivity loss.

What Happened:

  • Discovered kvm-02 eno8 and br-mgmt missing VLANs 20/30/40/110/120

  • Added VLANs with: for vid in 10 20 30 40 100 110 120; do sudo bridge vlan add vid $vid dev eno8; done

  • This REMOVED PVID 100 from eno8 → lost SSH to kvm-02

Recovery (via IPMI console):

sudo bridge vlan add vid 100 dev eno8 pvid untagged

Key Learning: When adding VLANs to a bridge interface with PVID, ALWAYS include the pvid untagged flags for the management VLAN:

# SAFE: Preserves PVID 100 for management
sudo bridge vlan add vid 100 dev eno8 pvid untagged
sudo bridge vlan add vid 40 dev eno8
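
One way to avoid repeating the outage is to generate the commands first and review them before anything runs. A sketch only: bridge_vlan_plan is a hypothetical helper that prints commands and changes nothing on its own.

```shell
# bridge_vlan_plan DEV PVID VID...
# Prints one `bridge vlan add` command per VLAN and keeps the
# pvid/untagged flags on the management VLAN.
bridge_vlan_plan() (
  dev=$1
  pvid=$2
  shift 2
  for vid in "$@"; do
    if [ "$vid" = "$pvid" ]; then
      echo "bridge vlan add vid $vid dev $dev pvid untagged"
    else
      echo "bridge vlan add vid $vid dev $dev"
    fi
  done
)

bridge_vlan_plan eno8 100 10 20 30 40 100 110 120
```

After checking that exactly one line carries pvid untagged, the output can be piped to sudo sh from a session that does not depend on the interface being changed (for example the IPMI console).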

VMs Status: All VMs survived - no corruption.

kvm-02 Bridge VLANs After Fix

eno8 (physical NIC):
  10, 20, 30, 40, 100 PVID Egress Untagged, 110, 120

br-mgmt (bridge):
  10, 20, 30, 40, 100 PVID Egress Untagged, 110, 120

Troubleshooting Path

Hop  Device                    VLAN 40 Status                          Verified
1    WLC trunk (Te0/0/0)       allowed vlan 1,10,20,30,40,100,110,120  ✓
2    3560 Te1/0/1 (to kvm-01)  Trunk, VLAN 40 allowed                  ✓
3    kvm-01 virbr0             All VLANs present                       ✓
4    kvm-02 eno8               VLAN 40 present (after fix)             ✓
5    kvm-02 br-mgmt            VLAN 40 present                         ✓
6    3560 G1/0/11 (to VyOS)    Trunk, VLAN 40 allowed                  Need verify
7    vyos-01 eth1.40           Configured                              tcpdump shows 0 packets

Next Step: Verify 3560 Port to VyOS

Need to check which 3560 port vyos-01 is connected to and verify trunk config.

tcpdump Commands

Capture on VyOS trunk interface (see ALL VLANs):

sudo tcpdump -n -e -i eth1 'port 67 or port 68'

Capture VLAN 40 only:

sudo tcpdump -n -i eth1.40 'port 67 or port 68'

Capture VLAN 10 only (verification):

sudo tcpdump -n -i eth1.10 'port 67 or port 68'

Session 5: WLC DHCP Relay Fix and Configuration Cleanup - COMPLETE

Time: Evening

Objective: Fix wireless VLAN 40 DHCP issue and clean up WLC legacy configuration.

Root Cause: WLC DHCP Relay Misconfiguration

Symptom: Wired VLAN 40 devices get DHCP, but wireless VLAN 40 clients stuck in "IP Learn" state.

Comparison of Policy Profiles:

POLICY-DOMUS_IoT (BROKEN):
  DHCP required: ENABLED
  server address: 10.50.40.1

POLICY-DOMUS_SECURE (WORKING):
  DHCP required: DISABLED
  server address: 0.0.0.0

Root Cause: WLC was configured as DHCP relay (ipv4 dhcp server 10.50.40.1) for IoT profile, but WLC only has Vlan100 SVI (10.50.1.40). Without an SVI on VLAN 40, the relay cannot function.

Fix Applied:

configure terminal
wireless profile policy POLICY-DOMUS_IoT
  shutdown
  no ipv4 dhcp required
  no ipv4 dhcp server
  no shutdown
  exit
end
write memory

Result: Client immediately received DHCP (10.50.40.102) from VyOS.

WLC Configuration Cleanup

Removed orphaned WLAN-POLICY mappings from default-policy-tag:

configure terminal
wireless tag policy default-policy-tag
  no wlan HomeRF
  no wlan IoT_Net
  no wlan DOMUS_IoT
  no wlan Guest_Net
  exit

Deleted orphaned policy profiles:

no wireless profile policy IoT-Policy
no wireless profile policy Guest-Policy
no wireless profile policy VLAN10-Policy

Deleted orphaned policy tags:

no wireless tag policy IoT-Tag
no wireless tag policy HomeRF-Tag
no wireless tag policy TAG-DOMUS_SECURE

New DOMUS Tag Architecture

Created standardized tags following DOMUS naming convention:

! Policy Tag - maps WLANs to policies
wireless tag policy TAG-DOMUS
  description "Domus production policy tag"
  wlan Domus-Secure policy POLICY-DOMUS_SECURE
  wlan Domus-IoT policy POLICY-DOMUS_IoT
  exit

! Site Tag - AP profile and local-site setting
wireless tag site TAG-DOMUS-SITE
  description "Domus site settings"
  ap-profile default-ap-profile
  local-site
  exit

! RF Tag - Radio frequency profiles
wireless tag rf TAG-DOMUS-RF
  description "Domus RF settings"
  24ghz-rf-policy Low_Client_Density_rf_24gh
  5ghz-rf-policy Low_Client_Density_rf_5gh
  exit

! Assign all tags to AP4800
ap dc8c.3796.20a6
  policy-tag TAG-DOMUS
  site-tag TAG-DOMUS-SITE
  rf-tag TAG-DOMUS-RF
  exit
end

write memory

Final State

show ap tag summary
AP Name   Site Tag Name      Policy Tag Name   RF Tag Name      Misconfigured   Tag Source
AP4800    TAG-DOMUS-SITE     TAG-DOMUS         TAG-DOMUS-RF     No              Static

Key Learnings

Topic                   Learning
DHCP Relay              WLC needs SVI on target VLAN to relay DHCP. Without it, relay fails silently.
Policy Profile Fix      Remove ipv4 dhcp required and ipv4 dhcp server - let broadcasts flow.
Tag Changes             Assigning new tags causes AP to disconnect and rejoin (1-2 min outage).
Production Impact       NEVER change AP tags in production without maintenance window.
RF Tag "Misconfigured"  Must assign DIFFERENT RF policies to 2.4GHz and 5GHz bands.
AP Config Syntax        Use MAC address format ap dc8c.3796.20a6, not AP name.

Documentation Updated

  • wlc-vyos-integration.adoc - Added DHCP relay troubleshooting section

Runbook Updates Needed

  • ipsk-manager-deployment.adoc - Add stored procedure verification section

  • ipsk-manager-deployment.adoc - Add ISE ODBC stored procedure configuration

  • kvm-operations.adoc - Add PVID warning for bridge VLAN commands

Session 6: WiFi EAP-TLS VLAN/DACL Troubleshooting - COMPLETE

Time: Late Evening

Objective: Fix Linux laptop (modestus-razer) failing to connect to Domus-Secure WiFi after successful EAP-TLS authentication.

Symptom

  • Phone (Android) connects fine → gets DATA_VLAN (VLAN 10)

  • Laptop (Arch Linux) EAP succeeds at ISE → WLC disconnects with reason 250

  • ISE DataConnect shows PASSED=1 with Domus_Admin_Profile

wpa_supplicant Log Pattern

CTRL-EVENT-EAP-SUCCESS                    # ISE accepted authentication
CTRL-EVENT-DISCONNECTED reason=250        # WLC rejected session

Key: reason=250 means "Association denied" - typically VLAN/ACL failure.
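
The success-then-disconnect pattern is easy to miss in a long log. A small awk filter can flag it; this is a sketch, eap_ok_then_dropped is a hypothetical name, and it only looks at the two event strings shown above.

```shell
# eap_ok_then_dropped < LOG
# Prints disconnect events that follow an EAP success with no successful
# connection in between. Exit status 0 if at least one was found.
eap_ok_then_dropped() {
  awk '
    /CTRL-EVENT-EAP-SUCCESS/ { ok = 1; next }
    /CTRL-EVENT-DISCONNECTED/ {
      if (ok) { print "post-auth disconnect: " $0; found = 1 }
      ok = 0
      next
    }
    /CTRL-EVENT-CONNECTED/ { ok = 0 }
    END { exit found ? 0 : 1 }
  '
}

printf '%s\n' \
  'CTRL-EVENT-EAP-SUCCESS' \
  'CTRL-EVENT-DISCONNECTED reason=250' | eap_ok_then_dropped
```

A hit means the RADIUS side accepted the client and the failure is on the WLC side, which is where the VLAN and DACL issues below were found.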

Issue 1: VLAN Naming Mismatch

WLC logs showed:

*ewlc-infra-capwapEventTrace: VLAN Failure. Failed attribute name MANAGEMENT_VLAN

Problem: ISE sends MANAGEMENT_VLAN, WLC had MGMT_VLAN.

Fix:

vlan 100
 name MANAGEMENT_VLAN

Issue 2: VLAN Not in Policy Profile

After rename, still failing. Policy profile had only vlan 10 hardcoded.

Fix: (Required brief SSID outage)

wireless profile policy POLICY-DOMUS_SECURE
 shutdown
 vlan MANAGEMENT_VLAN
 no shutdown

Issue 3: DACL Download Disabled

WLC logs showed:

*ewlc-infra-capwapEventTrace: ACL Failure. Failed attribute name xACSACLx-IP-DACL_ADMIN_FULL-696eef58

Root Cause: WLC missing required commands for DACL download:

radius-server vsa send authentication       ! Enable Vendor-Specific Attributes
aaa authorization network default group radius  ! Enable network authorization

Issue 4: DHCP Timeout

After all WLC fixes, client authenticated but DHCP failed.

Root Cause: MANAGEMENT_VLAN (10.50.1.0/24) uses static IPs by design.

Fix: Configure static IP on Linux WiFi connection:

nmcli conn modify "Domus-WiFi-EAP-TLS" \
  ipv4.method manual \
  ipv4.addresses "10.50.1.200/24" \
  ipv4.gateway "10.50.1.1" \
  ipv4.dns "10.50.1.90"

Final Result

show wireless client summary
MAC Address        AP Name    WLAN   State    Protocol  Method
7015.fbf8.47ec     AP4800     4      Run      11ac      Dot1x

show access-lists | include xACSACLx
Extended IP access list xACSACLx-IP-DACL_ADMIN_FULL-696eef58
    2 permit ip any any (10 matches)

Success: Client connected, VLAN 100 assigned, DACL downloaded.

WLC Configuration Summary

! VLAN renamed
vlan 100
 name MANAGEMENT_VLAN

! Policy profile updated
wireless profile policy POLICY-DOMUS_SECURE
 vlan MANAGEMENT_VLAN

! DACL enablement
radius-server vsa send authentication
aaa authorization network default group radius

Documentation Created

  • runbooks/wlc-eaptls-vlan-dacl-troubleshooting.adoc - Comprehensive troubleshooting runbook

Key Learnings

Topic                      Learning
VLAN names                 Must match EXACTLY between ISE profile and WLC VLAN name
Policy profile VLANs       Policy must list ALL VLANs it can assign dynamically
VSA for DACL               radius-server vsa send authentication enables DACL download
AAA authorization          aaa authorization network default group radius processes attributes
WLC disconnect reason 250  Association denied - VLAN/ACL configuration issue
Infrastructure VLANs       May use static IPs - client must configure static on WiFi

Session 7: Switch DACL, WLC VLAN, ISE Rule Refinement - COMPLETE

Time: Late Night

Objective: Fix switch DACL download, WLC default VLAN, and restrict ISE admin authorization rule.

Issue 1: Switch DACL Not Downloading (3560CX)

Symptom: Wired 802.1X working but DACL not applied on switch.

Diagnosis:

show run | include radius-server vsa
! (no output)

Fix:

configure terminal
radius-server vsa send authentication
end
write memory

Verification:

show access-session interface g1/0/2 details
ACS ACL:  xACSACLx-IP-DACL_ADMIN_FULL-696eef58

Issue 2: WLC Default VLAN Wrong

Problem: After Session 6, policy profile default was MANAGEMENT_VLAN. Should be DATA_VLAN.

Fix:

configure terminal
wireless profile policy POLICY-DOMUS_SECURE
 shutdown
 no vlan MANAGEMENT_VLAN
 vlan DATA_VLAN
 no shutdown
 exit
end
write memory

Verification:

show wireless profile policy detailed POLICY-DOMUS_SECURE | include VLAN|vlan
VLAN                                : DATA_VLAN
AAA Override still allows ISE to dynamically assign MANAGEMENT_VLAN or IOT_VLAN.

Issue 3: Non-Admin Hitting Admin Rule

Symptom: Son’s laptop (modestus-p50) getting Domus_Admin_Profile and MANAGEMENT_VLAN.

Root Cause: Authorization rule Domus_Cert_Admins condition too broad:

CERTIFICATE:Subject - Organization contains 'Infrastructure'

All certs with O=Domus-Infrastructure matched - including non-admin users.

Solution: Add CN condition to restrict to admin workstations only.

API Approach:

Step 1 - Get current rule:

POLICY_ID="056a2880-5821-465f-adb2-90c32de0b06f"
curl -sk -u "$ISE_API_USER:$ISE_API_PASS" \
  "https://$ISE_PAN_FQDN/api/v1/policy/network-access/policy-set/$POLICY_ID/authorization" \
  -H "Accept: application/json" | jq '.response[] | select(.rule.name=="Domus_Cert_Admins")' > /tmp/admin_rule.json

Step 2 - Create updated rule with CN condition:

{
  "rule": {
    "id": "a493e874-a27b-4f31-8465-eca9b7f1feb3",
    "name": "Domus_Cert_Admins",
    "rank": 0,
    "state": "enabled",
    "condition": {
      "conditionType": "ConditionAndBlock",
      "isNegate": false,
      "children": [
        {
          "conditionType": "ConditionAttributes",
          "dictionaryName": "Network Access",
          "attributeName": "EapAuthentication",
          "operator": "equals",
          "attributeValue": "EAP-TLS"
        },
        {
          "conditionType": "ConditionAttributes",
          "dictionaryName": "CERTIFICATE",
          "attributeName": "Subject - Organization",
          "operator": "contains",
          "attributeValue": "Infrastructure"
        },
        {
          "conditionType": "ConditionOrBlock",
          "isNegate": false,
          "children": [
            {
              "conditionType": "ConditionAttributes",
              "dictionaryName": "CERTIFICATE",
              "attributeName": "Subject - Common Name",
              "operator": "contains",
              "attributeValue": "razer"
            },
            {
              "conditionType": "ConditionAttributes",
              "dictionaryName": "CERTIFICATE",
              "attributeName": "Subject - Common Name",
              "operator": "contains",
              "attributeValue": "aw"
            }
          ]
        }
      ]
    }
  },
  "profile": ["Domus_Admin_Profile"]
}

Step 3 - Apply update:

RULE_ID="a493e874-a27b-4f31-8465-eca9b7f1feb3"
curl -sk -u "$ISE_API_USER:$ISE_API_PASS" \
  -X PUT \
  "https://$ISE_PAN_FQDN/api/v1/policy/network-access/policy-set/$POLICY_ID/authorization/$RULE_ID" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d @/tmp/admin_rule_updated.json | jq .

Verification:

netapi ise dc query "
  SELECT
    TO_CHAR(acs_timestamp, 'HH24:MI:SS') as time,
    user_name,
    selected_azn_profiles as profile
  FROM mnt.radius_auth_48_live
  WHERE user_name LIKE '%p50%'
  ORDER BY acs_timestamp DESC
  FETCH FIRST 3 ROWS ONLY"

Result:

TIME      USER_NAME                               PROFILE
18:06:19  modestus-p50.inside.domusdigitalis.dev  Domus_Secure_Profile  ← NEW (correct)
17:42:19  modestus-p50.inside.domusdigitalis.dev  Domus_Admin_Profile   ← OLD (wrong)

New Rule Condition

EAP-TLS AND
O contains 'Infrastructure' AND
(CN contains 'razer' OR CN contains 'aw')

Only certificates whose CN contains 'razer' or 'aw' (currently modestus-razer and modestus-aw) get the admin profile. All other certificates fall through to lower rules.
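
Because `contains` is a substring match, the rule admits any certificate whose CN includes `razer` or `aw`, not only the two named hosts. The sketch below mirrors the condition in POSIX shell so candidate CNs can be checked offline. The function name `matches_admin_rule` and the CN `lawn-sensor` are hypothetical, and the sketch assumes case-sensitive matching, which may differ from how ISE evaluates `contains`.

```shell
# Sketch of the rule logic: EAP-TLS AND O contains 'Infrastructure'
# AND (CN contains 'razer' OR CN contains 'aw').
# Assumption: case-sensitive substring matching; ISE may behave differently.
matches_admin_rule() {
  # $1 = authentication method, $2 = certificate O, $3 = certificate CN
  [ "$1" = "EAP-TLS" ] || return 1
  case "$2" in
    *Infrastructure*) ;;
    *) return 1 ;;
  esac
  case "$3" in
    *razer*|*aw*) return 0 ;;
    *) return 1 ;;
  esac
}

# lawn-sensor is a made-up CN that happens to contain "aw".
for cn in modestus-razer modestus-aw modestus-p50 lawn-sensor; do
  if matches_admin_rule "EAP-TLS" "Infrastructure" "$cn"; then
    echo "$cn: admin rule matches"
  else
    echo "$cn: falls through"
  fi
done
```

The last iteration shows why a short substring such as `aw` deserves care: an unrelated CN that happens to contain it would also match the admin rule.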

Key Learnings

Topic                     Learning
Switch DACL               Requires radius-server vsa send authentication (same as WLC)
WLC default VLAN          Use DATA_VLAN as default; AAA Override handles dynamic assignment
ISE OpenAPI               Certificate OU conditions not available; use CN or O instead
Rule condition structure  ConditionAndBlock with nested ConditionOrBlock for complex logic
API PUT                   Must include full rule JSON, not just changed fields

Session 8: WLC HA SSO Configuration - COMPLETE

Time: Night

Objective: Configure Stateful Switchover (SSO) between WLC-01 and WLC-02.

Prerequisites Verified

  • WLC-01 (10.50.1.40) running on kvm-01

  • WLC-02 (10.50.1.41) running on kvm-02

  • Both running IOS-XE 17.15.x

  • Both VMs have 2 NICs: Gi1 (trunk) and Gi2 (HA)

HA Configuration Applied

On WLC-01 (Active):

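configure terminal
redundancy
 mode sso
end
! chassis and write commands run from privileged EXEC, so leave config mode first
chassis redundancy ha-interface GigabitEthernet 2 local-ip 169.254.1.1 /24 remote-ip 169.254.1.2
write memory
! reload this unit afterwards so the HA pair can form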

On WLC-02 (Standby):

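configure terminal
redundancy
 mode sso
end
! chassis and write commands run from privileged EXEC, so leave config mode first
chassis redundancy ha-interface GigabitEthernet 2 local-ip 169.254.1.2 /24 remote-ip 169.254.1.1
write memory
! reload this unit afterwards so the HA pair can form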

Post-Reload Verification

show redundancy
Redundant System Information :
             Hardware Mode = Duplex
    Configured Redundancy Mode = sso
       Operating Redundancy Mode = sso
              Communications = Up

Current Processor Information :
       Active Location = slot 1
   Current Software state = ACTIVE

Peer Processor Information :
      Standby Location = slot 2
   Current Software state = STANDBY HOT
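
For repeat checks after a failover or reload, the two fields that matter in this output can be tested mechanically. A minimal sketch in POSIX shell, assuming the `show redundancy` text has already been captured and is fed in on standard input. The function name `ha_is_healthy` is hypothetical, and collecting the output from the WLC is deliberately left out.

```shell
# Reads `show redundancy` output on stdin. Succeeds only when the HA link
# reports Up and a processor has reached STANDBY HOT.
ha_is_healthy() {
  awk '
    /Communications *= *Up/                   { comms = 1 }
    /Current Software state *= *STANDBY HOT/  { standby = 1 }
    END { exit !(comms && standby) }
  '
}

# Usage with an excerpt of the output recorded above.
ha_is_healthy <<'EOF' && echo "HA pair healthy"
              Communications = Up
   Current Software state = ACTIVE
   Current Software state = STANDBY HOT
EOF
```

Checking both fields avoids treating a pair as healthy while the standby is still synchronizing.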

Key Learnings

Topic                    Learning
Interface availability   Check show ip int brief - WLC VMs may only have Gi1/Gi2 (no Gi3)
2-NIC approach           Use Gi2 for HA with link-local IPs (169.254.x.x)
Standby IP inaccessible  Normal SSO behavior - only Active WLC owns management IP
-STBY prompt             Console shows hostname with -STBY suffix on standby unit
Communications = Up      Confirms HA heartbeat working between WLCs

Documentation Updated

  • wlc-ha-sso.adoc - Added "HA Interface Options" section with 3 approaches (3-NIC, 2-NIC, 1-NIC)

  • wlc-vyos-integration.adoc - Updated Phase 4 to reflect completion

Session 9: VM Migration kvm-02 → kvm-01 - COMPLETE

Time: Night

Objective: Move all primary VMs from kvm-02 (NAS-dependent) to kvm-01 (onboard SSD) for resilience until a UPS is installed.

VMs Migrated

VM         Size    Status
vault-01   ~5GB    ✓ Migrated, SSH CA working
bind-01    ~3GB    ✓ Migrated, DNS resolving
home-dc01  41GB    ✓ Migrated via NAS staging
ipa-01     3.3GB   ✓ Deployed from NAS (no XML existed)

Migration Patterns

Standard (small VMs via workstation /tmp):

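SSH is not configured between kvm-01 and kvm-02, so all transfers go through the workstation.

# kvm-02: Export the domain XML and disk image for each VM
# Assumption: each VM is shut off first; copying a live qcow2 can yield an inconsistent image
for vm in vault-01 bind-01; do
  sudo virsh dumpxml "$vm" > "/tmp/$vm.xml"
  sudo cp "/var/lib/libvirt/images/$vm.qcow2" /tmp/ && sudo chmod 644 "/tmp/$vm.qcow2"
done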

# Workstation: Transfer
scp kvm-02:/tmp/{vault-01,bind-01}.{xml,qcow2} /tmp/
scp /tmp/{vault-01,bind-01}.{xml,qcow2} kvm-01:/tmp/

# kvm-01: Define and start
for vm in vault-01 bind-01; do
  sudo mv /tmp/$vm.qcow2 /mnt/onboard-ssd/libvirt/images/
  sed -i 's|/var/lib/libvirt/images/|/mnt/onboard-ssd/libvirt/images/|g' /tmp/$vm.xml
  sudo virsh define /tmp/$vm.xml
  sudo virsh start $vm
done
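
The `sed` substitution is the step that retargets each domain's disk from the old image directory to the onboard SSD. A small sketch of its effect on a `<source>` element, using a made-up one-line XML fragment rather than real `virsh dumpxml` output. The function name `retarget_disk_paths` is hypothetical.

```shell
# Rewrites the libvirt image directory in domain XML read from stdin.
# The two paths are the ones used in the migration above.
retarget_disk_paths() {
  sed 's|/var/lib/libvirt/images/|/mnt/onboard-ssd/libvirt/images/|g'
}

# Illustrative fragment only; a real dumpxml contains much more.
echo "<source file='/var/lib/libvirt/images/vault-01.qcow2'/>" | retarget_disk_paths
# prints: <source file='/mnt/onboard-ssd/libvirt/images/vault-01.qcow2'/>
```

Using `|` as the `sed` delimiter keeps the slashes in both paths unescaped.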

Large VMs via NAS staging:

# kvm-02: Copy to NAS
sudo cp /var/lib/libvirt/images/home-dc01.qcow2 /mnt/nas/vms/

# kvm-01: Copy from NAS
sudo cp /mnt/nas/vms/home-dc01.qcow2 /mnt/onboard-ssd/libvirt/images/
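
Multi-hop copies of large images are worth verifying before the source is removed. A minimal sketch, assuming `sha256sum` from GNU coreutils is available on kvm-01. The helper name `same_image` is hypothetical, and this check was not part of the recorded procedure.

```shell
# Succeeds (exit 0) when both files exist and have identical SHA-256 digests.
# Returns 2 if either file is unreadable, 1 if the digests differ.
same_image() {
  [ -r "$1" ] && [ -r "$2" ] || return 2
  src_sum=$(sha256sum < "$1" | cut -d' ' -f1)
  dst_sum=$(sha256sum < "$2" | cut -d' ' -f1)
  [ "$src_sum" = "$dst_sum" ]
}

# Intended use on kvm-01 (not executed here):
#   same_image /mnt/nas/vms/home-dc01.qcow2 \
#              /mnt/onboard-ssd/libvirt/images/home-dc01.qcow2 && echo "copy verified"
```

Hashing a 41GB image reads it in full from both locations, so expect the check to take roughly as long as the copy itself.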

Import from NAS (no XML):

sudo cp /mnt/nas/vms/ipa-01.qcow2 /mnt/onboard-ssd/libvirt/images/

sudo virt-install \
  --name ipa-01 \
  --memory 4096 \
  --vcpus 2 \
  --disk /mnt/onboard-ssd/libvirt/images/ipa-01.qcow2,bus=virtio \
  --import \
  --os-variant rocky9 \
  --network bridge=br-mgmt,model=virtio \
  --graphics vnc,listen=0.0.0.0 \
  --noautoconsole

# CRITICAL: virt-install doesn't trigger libvirt "started" hook
sudo virsh destroy ipa-01
sudo virsh start ipa-01

Final kvm-01 State

Id   Name            State
-------------------------------
11   k3s-master-01   running
14   vyos-01         running
21   ipsk-mgr-01     running
23   9800-WLC-01     running
24   vault-01        running
25   bind-01         running
26   home-dc01       running
27   ipa-01          running

kvm-02 State (Secondaries Only)

Id   Name          State
------------------------------
27   vyos-02       running
42   ise-02        running
66   9800-WLC-02   running

Key Learnings

Topic                    Learning
SSH between hypervisors  NOT configured - use workstation as intermediary
Large VM transfer        NAS as staging bypasses workstation /tmp quota
virt-install hook        Does NOT trigger libvirt "started" hook - use destroy/start cycle
Clock drift              Common after VM migration - chronyc makestep to fix
PVID verification        bridge vlan show dev vnetX to confirm PVID 100
Cockpit VMs              Install cockpit-machines package for VM management tab
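
The PVID check can be scripted so it is repeatable after each migration. A sketch that scans `bridge vlan show dev <iface>` output for a given VLAN carrying the `PVID` flag. The function name `has_pvid` and the interface name `vnet3` are hypothetical, and the exact column layout of the iproute2 output varies between versions, so treat the parsing as an assumption to confirm against real output.

```shell
# Reads `bridge vlan show` output on stdin. Succeeds when VLAN $1 is
# immediately followed by the PVID flag on any line.
has_pvid() {
  awk -v vid="$1" '
    { for (i = 1; i < NF; i++) if ($i == vid && $(i + 1) == "PVID") found = 1 }
    END { exit !found }
  '
}

# Intended use on kvm-01 (not executed here):
#   bridge vlan show dev vnet3 | has_pvid 100 && echo "PVID 100 confirmed"
```

Matching on the VLAN ID and the adjacent `PVID` token, rather than on fixed columns, keeps the check tolerant of spacing differences.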

Documentation Updated

  • kvm-operations.adoc - Added kvm-02→kvm-01 migration via workstation

  • kvm-operations.adoc - Added NAS import for VMs without XML

  • CLAUDE.md - Updated VM Migration priority section

Tomorrow (2026-03-09)

Personal Infrastructure

  • ~~Complete iPSK ISE integration~~ - DONE (Session 3-5)

  • ~~WiFi EAP-TLS VLAN/DACL fix~~ - DONE (Session 6)

  • ~~WLC HA SSO configuration~~ - DONE (Session 8)

  • kvm-01 Rocky rebuild Phase 4-7

  • VyOS remaining VLAN cutover (VOICE, STORAGE, DMZ)

  • Test iPSK with additional IoT devices

CHLA

  • Xianming Ding Linux SSH (priority)

  • iPSK DB replication investigation