ISE Incident Response Prep: 802.1X Auth Failures
Executive Summary
Current Status: ISE is operating normally. TAC case remains open at S2 for monitoring.
What Happened:
- RabbitMQ messaging service on Primary MNT reached 100%+ CPU
- This caused a session logging backlog ("No data available" in Live Logs)
- Primary MNT rebooted 2026-03-12 16:19 per TAC recommendation
- All services restored, replication normalized

What Did NOT Happen:
- Authentication services (PSNs) were NEVER impacted
- Network access was NEVER denied due to this issue
- The ~500 endpoint failures are a separate investigation (auth protocol/certificate issues)
Key Stakeholders
| Name | Title | Interest | Likely Questions |
|---|---|---|---|
| Sarah Clizer | CISO | Security posture, risk | "Is ISE stable? What’s our exposure?" |
| Jonathan Carr | Assoc. Dir. Field Support Services | End-user impact | "Are users still having issues connecting?" |
| Albert Rodriguez | Manager, Collaboration Services | Network stability | "Is ISE causing network problems?" |
Defense Matrix
"Is ISE causing the authentication failures?"
| Answer | No. ISE PSNs are processing RADIUS authentications correctly. |
|---|---|
| Evidence | |
| Clarification | The ~500 endpoint failures predated the MNT issue and require separate investigation (certificate chain, supplicant config, AD connectivity). |
"Why did RabbitMQ spike to 100%+ CPU?"
| Answer | Message queue saturation from high session volume combined with replication delay. |
|---|---|
| Root Cause | |
| TAC Guidance | Known issue addressed in ISE 3.2 Patch 9. Reboot cleared backlog. Patch upgrade scheduled. |
"Is the network safe right now?"
| Answer | Yes. Authentication infrastructure is fully operational. |
|---|---|
| Architecture | |
| Monitoring | TAC case remains open. Proactive monitoring in place. |
"What’s the timeline to full resolution?"
| Pending Actions | |
|---|---|
| Current State | Stable. No authentication impact. Logging restored. |
| Next Steps | Schedule maintenance window for remaining changes after business validation. |
Quick Diagnostic Commands
Run these before/during any meeting to have current data:
System Health Check
# All nodes status
netapi ise -f json api-call openapi GET "/api/v1/deployment/node" | \
jq -r '["HOSTNAME","STATUS"], (.response[] | [.hostname, .nodeStatus]) | @tsv' | column -t
# Expected: All nodes "OK" or "Connected"
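The "Expected" check above can be scripted instead of eyeballed. The following is a minimal sketch, assuming the HOSTNAME/STATUS table printed by the command above is piped in on stdin. The function name `flag_unhealthy_nodes` and the sample hostnames are hypothetical; the accepted statuses ("OK", "Connected") are taken from the comment above.

```shell
#!/bin/sh
# Hypothetical helper: reads whitespace-separated "HOSTNAME STATUS" lines on
# stdin (works on raw TSV or on `column -t` output) and prints every node
# whose status is not "OK" or "Connected".
# Exit code: 0 when all nodes look healthy, 1 otherwise.
flag_unhealthy_nodes() {
  awk '
    NR == 1 && $1 == "HOSTNAME" { next }   # skip the header row if present
    NF >= 2 && $2 != "OK" && $2 != "Connected" {
      printf "UNHEALTHY: %s (%s)\n", $1, $2
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Example with made-up data; in practice, pipe the node status command into it.
printf 'HOSTNAME\tSTATUS\nise-psn-01\tConnected\nise-mnt-01\tDisconnected\n' \
  | flag_unhealthy_nodes || echo "At least one node needs attention"
```

The non-zero exit code makes the check usable in a pre-meeting script that stops on the first problem.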
Authentication Verification
# Active session count (proves auth is working)
netapi ise -f json mnt sessions | jq 'length'
# Sessions by PSN (distribution should be balanced)
netapi ise -f json mnt sessions | jq -r 'group_by(.psn) | .[] | "\(.[0].psn): \(length)"'
# Recent failures (should be minimal)
netapi ise -f json mnt failures --hours 1 | jq 'length'
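"Balanced" in the sessions-by-PSN check can be given a number. The following is a rough sketch, assuming the `<psn>: <count>` lines printed by the sessions-by-PSN command are piped in on stdin. The function name `psn_balance`, the `MAX_DEV` variable, its default of 15 percentage points, and the sample hostnames and counts are all hypothetical choices for illustration, not values from TAC or Cisco guidance.

```shell
#!/bin/sh
# Hypothetical helper: reads "<psn>: <count>" lines on stdin, prints each PSN's
# share of active sessions, and marks a PSN as SKEWED when its share differs
# from an even split by more than MAX_DEV percentage points (default: 15).
# Exit code: 0 when all PSNs are within tolerance, 1 otherwise.
psn_balance() {
  awk -F ': ' -v max_dev="${MAX_DEV:-15}" '
    NF == 2 { name[++n] = $1; count[n] = $2; total += $2 }
    END {
      if (n == 0 || total == 0) { print "no sessions reported"; exit 1 }
      even = 100 / n                        # share each PSN would have if balanced
      for (i = 1; i <= n; i++) {
        share = 100 * count[i] / total
        dev = share - even
        if (dev < 0) dev = -dev
        printf "%s %.1f%% %s\n", name[i], share, (dev > max_dev ? "SKEWED" : "ok")
        if (dev > max_dev) skewed = 1
      }
      exit (skewed ? 1 : 0)
    }
  '
}

# Example with made-up counts for four PSNs.
printf 'ise-psn-01: 310\nise-psn-02: 290\nise-psn-03: 305\nise-psn-04: 95\n' \
  | psn_balance || echo "Session distribution is uneven"
```

A skewed result is only a prompt to look closer: load balancer persistence and site layout can legitimately produce uneven counts.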
MNT Health Specifically
# Check MNT nodes (any() emits each node once, even if several of its roles match)
netapi ise -f json api-call openapi GET "/api/v1/deployment/node" | \
jq '.response[] | select(any(.roles[]; contains("MNT"))) | {hostname, status: .nodeStatus}'
Architecture Diagram
Key Points to Highlight:
- MNT nodes (red) are MONITORING only - they don’t process authentications
- PSNs (blue) handle all RADIUS - they were never impacted
- NetScaler provides VIP load balancing - resilient architecture
Timeline (For Reference)
| Date | Event |
|---|---|
| ~2026-03-05 | Logging anomalies noticed (in hindsight, the MNT queue was building) |
| 2026-03-11 | Authentication failures reported (~500 endpoints) |
| 2026-03-12 | TAC case opened (S1 - medical facility) |
| 2026-03-12 15:08 | TAC identified RabbitMQ CPU spike, recommended reboot |
| 2026-03-12 16:19 | Primary MNT rebooted per TAC |
| 2026-03-12 16:29 | All services confirmed running |
| 2026-03-13+ | Monitoring, case downgraded to S2 |
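If someone asks how long each step on 2026-03-12 took, the intervals can be derived from the timestamps in the table. The following is a small sketch; the function name `minutes_between` is hypothetical and it assumes both times fall on the same day.

```shell
#!/bin/sh
# Hypothetical helper: minutes between two same-day HH:MM timestamps.
minutes_between() {
  printf '%s %s\n' "$1" "$2" | awk '{
    split($1, a, ":"); split($2, b, ":")
    print (b[1] * 60 + b[2]) - (a[1] * 60 + a[2])
  }'
}

# Timestamps taken from the timeline table.
echo "TAC recommendation -> reboot: $(minutes_between 15:08 16:19) min"
echo "Reboot -> services confirmed: $(minutes_between 16:19 16:29) min"
```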
Related Open Issues
Strongline Gateway VLAN Assignment (INC-2026-03-16-001)
- Separate Issue - Not related to MNT/RabbitMQ
- Reported by: Arin Khachikyan (Network Engineer)
- Issue: 8 devices in wrong identity group → wrong VLAN
- Status: Under investigation by David Rukiza, Ntashamaje
Appendix: ISE Service Architecture
┌─────────────────────────────────────────────────────────────┐
│                       ISE Node Roles                        │
├─────────────────────────────────────────────────────────────┤
│ PAN (Policy Admin Node)                                     │
│   - Policy configuration and distribution                   │
│   - NOT involved in real-time authentication                │
├─────────────────────────────────────────────────────────────┤
│ PSN (Policy Service Node) ← HANDLES ALL AUTHENTICATION      │
│   - RADIUS server                                           │
│   - Real-time 802.1X processing                             │
│   - 4 nodes behind NetScaler VIPs                           │
├─────────────────────────────────────────────────────────────┤
│ MNT (Monitoring Node) ← THIS IS WHERE THE ISSUE WAS         │
│   - Session logging and reporting                           │
│   - Does NOT affect authentication                          │
│   - Primary + Secondary for redundancy                      │
└─────────────────────────────────────────────────────────────┘
CRITICAL DISTINCTION:
- MNT issues affect VISIBILITY (logs, reports)
- MNT issues do NOT affect AUTHENTICATION (network access)