TAC Case: 802.1X Authentication Failures (~500 Endpoints)

Case Summary

SR Number: pending

Severity: S1 (production network down - medical facility)

Product: Cisco ISE 3.2 Patch 6

Contract: add SmartNet contract ID

Opened: 2026-03-12

Request: Live engineer with distributed ISE/MNT experience

Problem Statement

Approximately 500 endpoints are failing 802.1X authentication across wired and wireless networks. Affected devices include domain-joined Windows endpoints (EAP-TEAP, MSCHAPv2, EAP-TLS) and Jamf-managed Macs (EAP-TLS with SCEP-issued certificates).

CRITICAL: This is a medical facility. Patient care systems may be impacted.

SECONDARY SYMPTOM: Live Logs show authentication entries, but clicking details returns:

No data available for this record. Either the data is purged or authentication for this session record happened a week ago. Or if this is a 'PassiveID' or 'PassiveID Visibility' session, it will not have authentication details on ISE.

PassiveID/Visibility services are NOT enabled, so the PassiveID explanation in the message does not apply. This suggests an MNT database or replication issue.

Environment

ISE Deployment

Node                  Role            Status
--------------------------------------------
ppan.ise.chla.org     Primary PAN     check
span.ise.chla.org     Secondary PAN   check
pmnt.ise.chla.org     Primary MNT     check
smnt.ise.chla.org     Secondary MNT   check
psn-1.ise.chla.org    PSN             check
psn-2.ise.chla.org    PSN             check
psn-3.ise.chla.org    PSN             check
psn-4.ise.chla.org    PSN             check

ISE Version: 3.2 Patch 6

Deployment Type: Distributed (8 nodes)

Affected Networks

  • Wired (LAN) - 802.1X

  • Wireless (WLAN) - 802.1X

Affected Endpoints

Device Type                      Auth Method                   ~Count
-----------------------------------------------------------------------
Windows 10/11 (Domain Joined)    EAP-TEAP, MSCHAPv2, EAP-TLS   estimate
macOS (Jamf Managed)             EAP-TLS (SCEP certs)          estimate
WOWs (Wyse on Wheels)            confirm auth method           estimate
Chromebooks                      confirm auth method           estimate

WOWs and Chromebooks are critical to patient care. These devices are used at bedside for clinical workflows.

RADIUS Architecture (NetScaler Load Balancing)

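VIP                       Backend PSNs                              Usage
-----------------------------------------------------------------------------------------------------------
VIP-1 (NetScaler SNIP)    psn-1.ise.chla.org, psn-2.ise.chla.org    Primary for all NADs (except ASA)
VIP-2 (NetScaler SNIP)    psn-3.ise.chla.org, psn-4.ise.chla.org    Secondary for all NADs (except ASA)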

  • NADs configured: Primary = VIP-1, Secondary = VIP-2

  • ASA uses direct PSN addressing (not behind VIP)
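
The VIP-to-PSN mapping above can be written as a small lookup, which helps when rolling per-PSN failure counts up to the VIP a NAD actually targets. This is a minimal sketch: the function name vip_for_psn is hypothetical, and the ASA is left out because it addresses PSNs directly.

```shell
# Map a PSN hostname to the NetScaler VIP that fronts it.
# Mapping copied from the table above; vip_for_psn is a hypothetical helper name.
vip_for_psn() {
  case "${1%%.*}" in            # strip the domain, keep the short hostname
    psn-1|psn-2) echo "VIP-1" ;;
    psn-3|psn-4) echo "VIP-2" ;;
    *) echo "unknown"; return 1 ;;
  esac
}
```

For example, vip_for_psn psn-3.ise.chla.org prints VIP-2, and any non-PSN hostname returns a non-zero status.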

Timeline

Date/Time      Event
--------------------------------------------------------------------
~2026-03-05    Noticed lack of logs / logging anomalies (1 week ago)
2026-03-11     Authentication failures reported (~500 endpoints)
2026-03-12     TAC case opened
investigate    Any changes in the 2 weeks before 03-05? (patch, cert renewal, DB maintenance, replication changes)

Key observation: Logging issues preceded auth failures by ~6 days. These are likely related.

Symptoms

User Experience

  • Endpoints fail to connect to network

  • Previously working devices now failing

  • add specific error messages users see

ISE Live Logs

Failure Reason(s): check Operations > RADIUS > Live Logs

# Common failure reasons to look for:
# 12514 - EAP-TLS failed SSL/TLS handshake
# 12308 - Client certificate chain not trusted
# 22056 - Subject not found in identity store
# 24408 - User/machine not found in AD
# 24415 - Could not locate AD domain
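
When scanning exported failure rows, the code list above can be turned into a lookup so each numeric code prints with its description. This is a sketch only: the descriptions are copied from the list above rather than checked against the ISE message catalog, and failure_reason_text is a hypothetical helper name.

```shell
# Look up the description for an ISE failure code.
# Covers only the five codes listed above; anything else reports "not listed".
failure_reason_text() {
  case "$1" in
    12514) echo "EAP-TLS failed SSL/TLS handshake" ;;
    12308) echo "Client certificate chain not trusted" ;;
    22056) echo "Subject not found in identity store" ;;
    24408) echo "User/machine not found in AD" ;;
    24415) echo "Could not locate AD domain" ;;
    *) echo "not listed"; return 1 ;;
  esac
}
```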

Sample Failed Authentications:

Timestamp    Username/MAC    Auth Method    PSN    Failure Reason
-----------------------------------------------------------------
sample 1
sample 2
sample 3

Pattern Analysis

  • Failures on ALL PSNs or specific PSN?

  • Failures started suddenly or gradual increase?

  • Specific SSID/switch affected?

  • Time-based pattern?
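
The first question (all PSNs or one PSN) can be answered quickly once a few failed authentications are saved as tab-separated rows. The sketch below assumes rows in the same column order as the sample table (timestamp, username/MAC, auth method, PSN, failure reason); failures_by_psn is a hypothetical helper name, and how the rows get exported is left open.

```shell
# Count failed authentications per PSN.
# Input (stdin): tab-separated rows: timestamp, username/MAC, auth method, PSN, failure reason.
# Output: "<count> <psn>" lines, highest count first.
failures_by_psn() {
  awk -F '\t' 'NF >= 4 { n[$4]++ } END { for (p in n) print n[p], p }' |
    sort -k1,1nr -k2,2
}
```

A count dominated by one PSN points at that node; a roughly even spread points at something the PSNs share, such as AD, certificates, or the MNT path.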

Current Workaround

Adding devices by MAC address CSV import to General-Device-Onboard identity group.

This identity group is referenced in an authorization policy positioned before the default guest rule as a safety net.

Impact: Manual process, not scalable for 500+ devices.
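
Because the MAC list is assembled by hand from several sources, normalizing and de-duplicating it before import reduces rejected rows. The sketch below only cleans the MAC column; the exact CSV layout ISE expects is not covered here, and both function names are hypothetical.

```shell
# Normalize a MAC address to upper-case colon-separated form (AA:BB:CC:DD:EE:FF).
# Accepts colon, hyphen, or dot separators; returns non-zero on malformed input.
normalize_mac() {
  local hex
  hex=$(printf '%s' "$1" | tr -d ':.-' | tr 'a-f' 'A-F')
  case "$hex" in
    ''|*[!0-9A-F]*) return 1 ;;
  esac
  [ "${#hex}" -eq 12 ] || return 1
  printf '%s\n' "$hex" | sed 's/../&:/g; s/:$//'
}

# Read one MAC per line on stdin; print unique normalized MACs.
# Rejected lines are reported on stderr so they can be fixed by hand.
normalize_mac_list() {
  while IFS= read -r line; do
    [ -n "$line" ] || continue
    normalize_mac "$line" || printf 'skipped: %s\n' "$line" >&2
  done | sort -u
}
```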

Diagnostic Data

Support Bundle

# Set debug levels BEFORE reproducing (Primary PAN GUI):
# Administration > System > Logging > Debug Log Configuration
# Then generate the bundle:
# Operations > Troubleshoot > Download Logs > <node> > Support Bundle

Support bundle generated: [ ] Yes  [ ] No
Bundle filename: ise-support-bundle-YYYY-MM-DD.tar.gz

Debug Logs to Enable

Before reproducing the issue, enable these debugs on the failing PSN:

Component            Level
--------------------------
runtime-AAA          DEBUG
eap                  DEBUG
eap-tls              DEBUG
ad-connector         DEBUG
identity-store-AD    DEBUG

Show Commands (from ISE CLI)

# Run on each PSN
show application status ise
show logging application ise-psc.log tail count 100

# AD connectivity (test aaa group is an IOS-style NAD command; run it from a switch, not the ISE CLI)
test aaa group <AD-join-point> <test-user> <password>

MNT Replication Health (CRITICAL - check this first)

From Primary PAN GUI:

  • Administration > System > Deployment - check node sync status

  • Administration > System > Settings > Logging > Log Collector - verify pmnt/smnt status

From MNT CLI (pmnt.ise.chla.org):

# Database status
show application status ise

# Check replication
application configure ise
# Select option 24: View DB Replication Status
# (menu numbering varies by ISE version and patch; confirm the label before selecting)

# Disk space (if DB is full, session data won't write)
show disk

If replication is broken or DB is full, that explains both symptoms.
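
To spot a full partition quickly in saved show disk output from each MNT node, a small filter can print only the lines above a usage threshold. This is a sketch under an assumption: each relevant line contains a token of the form "NN% used". Verify that against the actual output before relying on it. The function name and the default threshold of 80 are hypothetical.

```shell
# Print lines from saved "show disk" output whose usage is at or above a threshold.
# ASSUMPTION: usage appears on each line as "<number>% used".
# Usage: flag_full_partitions [threshold_percent] < saved-output.txt
flag_full_partitions() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '
    match($0, /[0-9]+% used/) {
      pct = substr($0, RSTART, RLENGTH)   # e.g. "91% used"
      sub(/% used/, "", pct)              # keep only the number
      if (pct + 0 >= t + 0) print
    }'
}
```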

Session Data Check

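# From MNT CLI - check if session database is responding
application configure ise
# Menu numbering varies by ISE version and patch; confirm the label before selecting.
# Option 14 is listed here as "Purge Runtime Sessions" - DO NOT purge.
# Only confirm that the menu responds, then exit without running the purge.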

Working Theory

The "no data available" error for recent sessions + auth failures suggests:

  1. MNT database issue - session data not being written or replicated

  2. Disk space exhaustion - DB partition full, can’t write new records

  3. Replication failure - PSNs can’t sync session data to MNT

  4. Database corruption - requires TAC intervention

The logging issue appearing ~6 days before auth failures suggests a DB/storage problem that gradually worsened until it began affecting live authentications.

What TAC Will Ask

Be ready with:

  1. [ ] SmartNet contract number

  2. [ ] ISE version: 3.2 Patch 6

  3. [ ] When logging issues started: ~2026-03-05

  4. [ ] When auth failures started: 2026-03-11

  5. [ ] Any changes in past 2 weeks? (patches, certs, AD changes, VM snapshots)

  6. [ ] Output of show disk from each MNT node

  7. [ ] Output of show application status ise from each node

  8. [ ] DB replication status (option 24 from application configure ise)

  9. [ ] Support bundle from Primary PAN

TAC Communication Log

Date          Who          Notes
-------------------------------------
2026-03-12    your name    Case opened

Notes

TAC Engagement - 2026-03-12 15:08



Subject: ISE 3.2P6 - 802.1X Auth Failures ~500 Endpoints - Medical Facility - MNT Session Data Unavailable


Description:

ENVIRONMENT

  • ISE 3.2 Patch 6, 8-node distributed deployment

  • PAN: ppan.ise.chla.org, span.ise.chla.org

  • MNT: pmnt.ise.chla.org, smnt.ise.chla.org

  • PSN: psn-1 through psn-4 behind NetScaler VIPs (SNIP)

  • NADs point to VIP-1 (psn-1/2) primary, VIP-2 (psn-3/4) secondary

PROBLEM

~500 endpoints failing 802.1X authentication across wired and wireless networks.

Affected devices:

  • Windows 10/11 domain-joined (EAP-TEAP, MSCHAPv2, EAP-TLS)

  • macOS Jamf-managed (EAP-TLS via SCEP)

  • WOWs (Wyse on Wheels) - CRITICAL FOR PATIENT CARE

  • Chromebooks - CRITICAL FOR PATIENT CARE

TIMELINE

  • ~Mar 5: Noticed logging anomalies / lack of logs

  • Mar 11: Authentication failures reported (~500 endpoints)

  • Mar 12: TAC case opened

SECONDARY SYMPTOM

Live Logs show authentication entries, but clicking details returns: "No data available for this record. Either the data is purged or authentication for this session record happened a week ago."

PassiveID is NOT enabled. This appears to be an MNT database or replication issue.

CURRENT WORKAROUND

Adding devices by MAC CSV import to General-Device-Onboard identity group (not scalable).

REQUEST

Live engineer with distributed ISE/MNT experience. This is a medical facility with patient care impact.


scratch space for case notes

API Diagnostic Commands (netapi)

Run these before/during TAC call to have data ready.

Deployment Status

# Node overview
netapi ise api info

# All nodes with roles/services
netapi ise -f json api-call openapi GET "/api/v1/deployment/node" | jq -r '["HOSTNAME","IP","ROLES","SERVICES","STATUS"], (.response[] | [.hostname, .ipAddress, (.roles|join("/")), (.services|join(",")), .nodeStatus]) | @tsv' | column -t

MNT Health Check

# Check MNT node status specifically
netapi ise -f json api-call openapi GET "/api/v1/deployment/node" | jq '.response[] | select(.roles[] | contains("MNT")) | {hostname, ipAddress, status: .nodeStatus, services}'

Recent Auth Failures (Live Logs via API)

# Last 24 hours failed authentications
netapi ise -f json mnt failures --hours 24 | jq -r '.[] | [.timestamp, .username, .nas_ip, .failure_reason] | @tsv' | head -100

# Group failures by reason code
netapi ise -f json mnt failures --hours 24 | jq -r '.[].failure_reason' | sort | uniq -c | sort -rn

# Group failures by PSN (which PSN is seeing failures?)
netapi ise -f json mnt failures --hours 24 | jq -r '.[].psn' | sort | uniq -c | sort -rn

Active Sessions

# Current session count per PSN
netapi ise -f json mnt sessions | jq -r 'group_by(.psn) | .[] | {psn: .[0].psn, count: length}'

# Total active sessions
netapi ise -f json mnt sessions | jq 'length'

Policy Sets

# List authentication policy sets
netapi ise policy-sets

# Check the General-Device-Onboard identity group (workaround)
netapi ise -f json identity-groups | jq '.[] | select(.name | contains("General-Device-Onboard"))'

AD Connectivity

# AD join point status
netapi ise -f json api-call openapi GET "/api/v1/active-directory" | jq '.response[] | {name, domain, status: .adJoinPointStatus}'

Export Full Config (for TAC upload)

# Dump deployment info to JSON
netapi ise export > /tmp/ise-config-$(date +%Y%m%d).json

1. TAC Initial Observations

1.1 Disabled ISE Messaging Services

TAC observed that ISE Messaging Service for UDP syslog delivery to MNT is disabled.

Location: ppan.ise.chla.org/admin/#administration/administration_system/administration_system_logging/local_log

Setting: ISE Messaging Settings

Use "ISE Messaging Service" for UDP Syslogs delivery to MnT

Impact: If disabled, PSNs may fail to send session/auth records to MNT, contributing to “No data available for this record” errors.

2. TAC Recommendations Tracking

Recommendation: Enable ISE Messaging Services
  • Description: Turn on “Use ISE Messaging Service for UDP syslogs delivery to MnT”.
  • Owner: InfoSec Engineering
  • Status: Pending
  • Notes / Next Steps: Must be enabled during maintenance window; confirm PSN → MNT log ingestion resumes.

Recommendation: Resolve MNT Replication Failure
  • Description: PAN dashboard shows alarms: Replication Failed from PMnT. Deregister/re-register affected nodes.
  • Owner: InfoSec Engineering + TAC
  • Status: In Progress
  • Notes / Next Steps: Perform on both PMnT and SMnT. Validate DB state & cluster hashing before re-registration.

Recommendation: Promote SMnT to Primary MNT
  • Description: TAC recommends promoting secondary MNT to primary role temporarily.
  • Owner: InfoSec Engineering
  • Status: Pending Decision
  • Notes / Next Steps: Requires validation of replication health and disk space. Ensure no corruption on SMnT.

Recommendation: Upgrade to ISE 3.2 Patch 9
  • Description: TAC recommends installing latest patch to address known replication and logging issues.
  • Owner: InfoSec Engineering
  • Status: Pending
  • Notes / Next Steps: Download link: software.cisco.com/download/home/283801620/type/283802505/release/3.2%20Patch%209

Recommendation: Review Disk Space on PMnT + SMnT
  • Description: Verify DB/log partitions; full partitions can break logging and replication.
  • Owner: InfoSec Engineering
  • Status: In Progress
  • Notes / Next Steps: Capture from CLI: show disk

Recommendation: Validate ISE Node Sync Status
  • Description: Ensure deployment sync and configuration database replication are functioning.
  • Owner: InfoSec Engineering
  • Status: Pending
  • Notes / Next Steps: GUI: Admin → System → Deployment

4. Scratch Space (Working Notes)

(Keep this for live call notes, timestamps, commands run, replication output, disk output, etc.)

RabbitMQ service CPU on the ISE Primary MNT is over 100%.

Management Summary — Primary MNT CPU / RabbitMQ Issue

We identified a critical performance issue on the Primary Monitoring Node (MNT) within our Cisco ISE deployment. The RabbitMQ messaging service, which is responsible for processing authentication and session logs, is running at over 100% CPU. This indicates that the MNT is unable to process messages efficiently, causing backlog and instability in the logging and monitoring functions.

Recommended Action

Cisco TAC has advised us to reboot the Primary MNT to clear the overloaded messaging service. During this reboot:

  • The Secondary MNT will automatically take over all monitoring/logging responsibilities.

  • There is no impact to user authentication or network access. All authentication is handled by the four Policy Service Nodes (PSNs), which remain fully operational.

Why This Matters

The overloaded Primary MNT is contributing to the issues we are seeing with missing log data and failed session lookups. Addressing this is part of stabilizing the overall environment and restoring full visibility into authentication events.

Next Steps

  • After reboot, validate replication, queue processing, and log ingestion.

  • Continue working with Cisco TAC to assess whether additional corrective actions are required.

2026-03-12 16:19 pmnt services stopped and node rebooted
- [ ] app ise stop
- [ ] reboot
- [ ] saved ade-os
- [ ] acknowledged reboot
- [ ] ssh'd into server at 2026-03-12 16:21
- [ ] pmnt/admin#show uptime
 16:21:26 up 3 min,  1 user,  load average: 2.77, 1.35, 0.53
- [ ] 2026-03-12 16:29 running show application status ise

ISE PROCESS NAME                       STATE            PROCESS ID
--------------------------------------------------------------------
Database Listener                      running          8851
Database Server                        running          300 PROCESSES
Application Server                     running          28749
Profiler Database                      running          17555
ISE Indexing Engine                    disabled
AD Connector                           running          29781
M&T Session Database                   running          24955
M&T Log Processor                      running          29017
Certificate Authority Service          disabled
EST Service                            running          157854
SXP Engine Service                     disabled
TC-NAC Service                         disabled
PassiveID WMI Service                  disabled
PassiveID Syslog Service               disabled
PassiveID API Service                  disabled
PassiveID Agent Service                disabled
PassiveID Endpoint Service             disabled
PassiveID SPAN Service                 disabled
DHCP Server (dhcpd)                    disabled
DNS Server (named)                     disabled
ISE Messaging Service                  running          12480
ISE API Gateway Database Service       running          16233
ISE API Gateway Service                running          23247
ISE pxGrid Direct Service              disabled
Segmentation Policy Service            disabled
REST Auth Service                      running          145209
SSE Connector                          disabled
Hermes (pxGrid Cloud Agent)            disabled
McTrust (Meraki Sync Service)          disabled
ISE Node Exporter                      running          48340
ISE Prometheus Service                 disabled
ISE Grafana Service                    disabled
ISE MNT LogAnalytics Elasticsearch     running          57535
ISE Logstash Service                   running          75054
ISE Kibana Service                     running          92228

[scratch area for case log / CLI findings]

Appendix

  • show logging application rabbitmq.log tail count 50

ECCRB Submission (Emergency Change Control Review Board)

Summary

ISE Primary MNT RabbitMQ Message Queue Optimization - TAC-Guided Intervention

Description

During routine monitoring of the ISE distributed deployment, elevated CPU utilization (109%) was identified on the Primary Monitoring Node (pmnt.ise.chla.org) RabbitMQ messaging service. RabbitMQ handles inter-node communication for session logging and replication.

Cisco TAC was engaged proactively (S1 - Healthcare Environment). TAC analysis confirmed message queue saturation was degrading logging pipeline performance and would progressively impact operational visibility if not addressed.

TAC recommended controlled service restart to clear accumulated queue backlog and restore optimal message processing throughput.

Business Justification

  • ISE MNT provides visibility into 802.1X authentication events for 26,000+ endpoints

  • Degraded logging impacts security incident response capability

  • Proactive intervention prevents escalation to authentication service impact

Service Impact

None

  • Secondary MNT (smnt.ise.chla.org) provides continuous monitoring during restart

  • RADIUS authentication handled by four independent PSNs - unaffected by MNT operations

  • No endpoint connectivity impact

Detailed Implementation Plan

1. SSH to Primary MNT
   - ssh pmnt.ise.chla.org

2. Initiate reload
   - reload
   - Prompt: "Save ade-os [y/n]" → y

3. Wait for reboot (~5 minutes)

4. Reconnect and verify
   - ssh pmnt.ise.chla.org
   - Wait ~10 minutes for services to initialize
   - show application status ise
   - Confirm critical services running:
     * Database Listener: running
     * Database Server: running
     * Application Server: running
     * M&T Session Database: running
     * M&T Log Processor: running
     * ISE Messaging Service: running

5. Validate logging restored
   - Log into Primary PAN (ppan.ise.chla.org)
   - Navigate to Operations > RADIUS > Live Logs
   - Click endpoint details to confirm "No data available" error is resolved
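
Step 4 can be checked mechanically by saving the show application status ise output and testing the six critical services against it. The sketch below parses the layout shown in the captured output earlier in these notes (service name at the start of the line, followed by the state); check_critical_services is a hypothetical helper name.

```shell
# Verify that the six critical services from step 4 report "running".
# Input (stdin): saved output of "show application status ise".
# Prints "<service>: <state>" for each service that is not running (or "missing"
# if the service line is absent) and returns non-zero if any were printed.
check_critical_services() {
  awk '
    BEGIN {
      n = split("Database Listener|Database Server|Application Server|M&T Session Database|M&T Log Processor|ISE Messaging Service", want, "|")
    }
    {
      for (i = 1; i <= n; i++) {
        name = want[i]
        if (index($0, name) == 1) {          # line starts with the service name
          rest = substr($0, length(name) + 1)
          split(rest, f, " ")                # first field after the name is the state
          state[name] = f[1]
        }
      }
    }
    END {
      bad = 0
      for (i = 1; i <= n; i++) {
        s = (want[i] in state) ? state[want[i]] : "missing"
        if (s != "running") { print want[i] ": " s; bad = 1 }
      }
      exit bad
    }'
}
```

No output and a zero exit status mean all six services are up; anything printed is a service to raise with TAC before moving on to step 5.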

Detailed Backout Plan

Change cannot be backed out (server reboot is atomic).

Mitigation: Secondary MNT (smnt.ise.chla.org) automatically assumes primary monitoring role if pmnt fails to recover. No authentication impact regardless of outcome.

Testing Plan

Pre-Change Validation

  • Confirm Secondary MNT (smnt.ise.chla.org) is healthy and synchronized

  • Verify PSNs (psn-1 through psn-4) are processing RADIUS authentications

  • Document current RabbitMQ CPU utilization on pmnt

Post-Change Validation

  • SSH to pmnt.ise.chla.org - confirm node is accessible

  • Run show application status ise - all critical services running

  • Run show logging application rabbitmq.log tail count 50 - no errors

  • Log into Primary PAN (ppan.ise.chla.org)

  • Navigate to Operations > RADIUS > Live Logs

  • Click endpoint details on 3+ recent authentications - confirm data is available (no "No data available" error)

  • Verify replication status: Administration > System > Deployment

  • Confirm no alarms on dashboard

Success Criteria

  • Primary MNT services running

  • RabbitMQ CPU < 50%

  • Live Logs session details displaying correctly

  • No replication alarms
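
The CPU criterion above is a plain threshold comparison, sketched below so the pass/fail call is unambiguous when values are fractional. How the RabbitMQ CPU figure is obtained on the node is left open; cpu_below_limit is a hypothetical helper name, and the default limit of 50 comes from the criteria above.

```shell
# Return success when a CPU percentage is strictly below the limit (default 50).
# Usage: cpu_below_limit <cpu_percent> [limit_percent]
cpu_below_limit() {
  awk -v v="$1" -v lim="${2:-50}" 'BEGIN { exit !(v + 0 < lim + 0) }'
}
```

With the pre-change reading of 109, cpu_below_limit 109 fails; a post-change reading such as 12.5 passes.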

Benefits of Change

  • Restore RabbitMQ message processing to nominal throughput

  • Ensure continuous security event visibility for compliance and incident response

  • Prevent progressive degradation that could impact authentication diagnostics

  • Align with Cisco TAC best practices for ISE MNT health maintenance

Risk Analysis / Mitigation Plan

Risk                                                Probability             Mitigation
--------------------------------------------------------------------------------------------------------------------------------------------
Temporary loss of primary logging during restart    Expected (by design)    Secondary MNT assumes logging automatically; no data loss
Primary MNT fails to recover                        Low                     Secondary MNT continues operation; TAC on standby for escalation
Authentication impact                               None                    MNT is monitoring-only; PSNs operate independently

Evidence / Justification

Observed Condition - Primary MNT CLI (pmnt.ise.chla.org)

RabbitMQ messaging service at 109% CPU utilization. Node unable to process authentication/session logs, causing backlog.

  • Command: top / show application status ise

  • Finding: RabbitMQ process consuming >100% CPU

Observed Condition - Primary PAN Dashboard (ppan.ise.chla.org)

Alarms displayed:

  • "Replication Failed from PMnT"

  • MNT sync status showing degraded state

Location: Administration > System > Deployment

Finding: Primary MNT replication failing to secondary

Observed Condition - RADIUS Live Logs

Live Logs show authentication entries, but clicking session details returns:

No data available for this record. Either the data is purged or authentication for this session record happened a week ago.

Suspected Root Cause: RabbitMQ overload is preventing session data from being written to the MNT database.

TAC Engagement

  • SR Number: add your case number

  • Severity: S1 (Medical Facility - Patient Care Impact)

  • TAC Engineer: add name if available

  • Recommendation: Immediate reboot of Primary MNT to clear RabbitMQ queue

Authorization

Change authorized by Cisco TAC guidance via live WebEx session.