ISE Incident Response Prep: 802.1X Auth Failures

Executive Summary

Current Status: ISE is operating normally. TAC case remains open at S2 for monitoring.

What Happened:

  • RabbitMQ messaging service on Primary MNT reached 100%+ CPU

  • Caused session logging backlog ("No data available" in Live Logs)

  • Primary MNT rebooted 2026-03-12 16:19 per TAC recommendation

  • All services restored, replication normalized

What Did NOT Happen:

  • Authentication services (PSNs) were NEVER impacted

  • Network access was NEVER denied due to this issue

  • The ~500 endpoint failures are a separate investigation (auth protocol/certificate issues)

Key Stakeholders

Name              Title                               Interest                Likely Questions

Sarah Clizer      CISO                                Security posture, risk  "Is ISE stable? What’s our exposure?"

Jonathan Carr     Assoc. Dir. Field Support Services  End-user impact         "Are users still having issues connecting?"

Albert Rodriguez  Manager, Collaboration Services     Network stability       "Is ISE causing network problems?"

Defense Matrix

"Is ISE causing the authentication failures?"

Answer: No. ISE PSNs are processing RADIUS authentications correctly.

Evidence

  • All 4 PSNs show running status for all RADIUS services

  • Active session count confirms endpoints authenticating

  • The MNT logging issue affected visibility, not authentication

Clarification

The ~500 endpoint failures predated the MNT issue and require separate investigation (certificate chain, supplicant config, AD connectivity).

"Why did RabbitMQ spike to 100%+ CPU?"

Answer: Message queue saturation from high session volume combined with replication delay.

Root Cause

  • MNT receives session data from all 4 PSNs

  • Replication between Primary/Secondary MNT was degraded

  • Queue backed up → CPU spike → logging stopped working

TAC Guidance

Known issue addressed in ISE 3.2 Patch 9. The reboot cleared the backlog. The patch upgrade is planned and still needs a maintenance window.

"Is the network safe right now?"

Answer: Yes. Authentication infrastructure is fully operational.

Architecture

  • 4 PSNs behind NetScaler VIPs - load balanced, redundant

  • Secondary MNT provides logging redundancy

  • PANs handle policy distribution (unaffected)

Monitoring

TAC case remains open. Proactive monitoring in place.

"What’s the timeline to full resolution?"

Pending Actions

  1. Enable the ISE Messaging Service - maintenance window required

  2. ISE 3.2 Patch 9 upgrade - TAC coordinated, addresses known replication issues

Current State

Stable. No authentication impact. Logging restored.

Next Steps

Schedule maintenance window for remaining changes after business validation.

Quick Diagnostic Commands

Run these before/during any meeting to have current data:

System Health Check

# All nodes status
netapi ise -f json api-call openapi GET "/api/v1/deployment/node" | \
  jq -r '["HOSTNAME","STATUS"], (.response[] | [.hostname, .nodeStatus]) | @tsv' | column -t

# Expected: All nodes "OK" or "Connected"
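
The expected-status check above can be automated so that a degraded node stands out at a glance. The following is a minimal sketch, not part of the existing tooling: `flag_unhealthy_nodes` is a hypothetical helper name, and it assumes the two-column hostname/status output of the command above (with or without `column -t`), with no spaces inside hostnames or status values. It treats only "OK" and "Connected" as healthy, per the expectation stated above.

```shell
# Hypothetical helper (assumption, not an existing tool).
# Reads "HOSTNAME STATUS" rows on stdin, prints every node whose status
# is neither OK nor Connected, and returns 1 if any such node was found.
flag_unhealthy_nodes() {
  awk '
    NR == 1 && $1 == "HOSTNAME" { next }      # skip the header row
    NF >= 2 && $2 != "OK" && $2 != "Connected" {
      printf "UNHEALTHY: %s (%s)\n", $1, $2   # report hostname and status
      bad = 1
    }
    END { exit bad + 0 }                      # 0 = all healthy, 1 = at least one problem
  '
}
```

To use it, append `| flag_unhealthy_nodes` to the node status pipeline; no output and a zero exit status mean every node reported an expected state.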

Authentication Verification

# Active session count (proves auth is working)
netapi ise -f json mnt sessions | jq 'length'

# Sessions by PSN (distribution should be balanced)
netapi ise -f json mnt sessions | jq -r 'group_by(.psn) | .[] | "\(.[0].psn): \(length)"'

# Recent failures (should be minimal)
netapi ise -f json mnt failures --hours 1 | jq 'length'
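
"Balanced" in the sessions-by-PSN check is a judgment call, so it can help to turn it into an explicit pass/fail. The sketch below is an illustration under stated assumptions: `check_psn_balance` is a hypothetical helper, it consumes the `<psn>: <count>` lines produced by the sessions-by-PSN command above, and the default threshold (busiest PSN carrying more than 2x the sessions of the least busy one) is an arbitrary starting point, not TAC or Cisco guidance.

```shell
# Hypothetical helper (assumption, not an existing tool).
# Reads "<psn>: <count>" lines on stdin. The optional first argument is the
# maximum allowed ratio between the busiest and least busy PSN (default 2).
# Exit status: 0 = balanced, 1 = imbalanced, 2 = no input rows.
check_psn_balance() {
  awk -F ': ' -v max_ratio="${1:-2}" '
    NF == 2 {
      count = $2 + 0
      if (n == 0 || count < min) { min = count; min_psn = $1 }
      if (n == 0 || count > max) { max = count; max_psn = $1 }
      n++
    }
    END {
      if (n == 0) { print "NO DATA"; exit 2 }
      if (max > min * max_ratio) {
        printf "IMBALANCED: %s=%d vs %s=%d\n", max_psn, max, min_psn, min
        exit 1
      }
      printf "BALANCED: %d PSNs, min=%d max=%d\n", n, min, max
    }
  '
}
```

Append `| check_psn_balance` (or `| check_psn_balance 1.5` for a stricter threshold) to the sessions-by-PSN pipeline. A PSN with zero sessions always reports as imbalanced when any other PSN has sessions, which is usually worth a look behind a load balancer.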

MNT Health Specifically

# List MNT nodes with their status (any() emits each node once, even if several roles match)
netapi ise -f json api-call openapi GET "/api/v1/deployment/node" | \
  jq '.response[] | select(any(.roles[]?; contains("MNT"))) | {hostname, status: .nodeStatus}'

Architecture Diagram

ISE 8-Node Distributed Deployment

Key Points to Highlight:

  • MNT nodes (red) are MONITORING only - they don’t process authentications

  • PSNs (blue) handle all RADIUS - they were never impacted

  • NetScaler provides VIP load balancing - resilient architecture

Timeline (For Reference)

Date              Event

~2026-03-05       Logging anomalies noticed (hindsight - MNT queue building)

2026-03-11        Authentication failures reported (~500 endpoints)

2026-03-12        TAC case opened (S1 - medical facility)

2026-03-12 15:08  TAC identified RabbitMQ CPU spike, recommended reboot

2026-03-12 16:19  Primary MNT rebooted per TAC

2026-03-12 16:29  All services confirmed running

2026-03-13+       Monitoring, case downgraded to S2

Strongline Gateway VLAN Assignment (INC-2026-03-16-001)

  • Separate Issue - Not related to MNT/RabbitMQ

  • Reported by: Arin Khachikyan (Network Engineer)

  • Issue: 8 devices in wrong identity group → wrong VLAN

  • Status: Under investigation by David Rukiza, Ntashamaje

Appendix: ISE Service Architecture

┌─────────────────────────────────────────────────────────────┐
│                    ISE Node Roles                           │
├─────────────────────────────────────────────────────────────┤
│ PAN (Policy Admin Node)                                     │
│   - Policy configuration and distribution                   │
│   - NOT involved in real-time authentication                │
├─────────────────────────────────────────────────────────────┤
│ PSN (Policy Service Node)  ← HANDLES ALL AUTHENTICATION     │
│   - RADIUS server                                           │
│   - Real-time 802.1X processing                             │
│   - 4 nodes behind NetScaler VIPs                           │
├─────────────────────────────────────────────────────────────┤
│ MNT (Monitoring Node)  ← THIS IS WHERE THE ISSUE WAS        │
│   - Session logging and reporting                           │
│   - Does NOT affect authentication                          │
│   - Primary + Secondary for redundancy                      │
└─────────────────────────────────────────────────────────────┘

CRITICAL DISTINCTION:
- MNT issues affect VISIBILITY (logs, reports)
- MNT issues do NOT affect AUTHENTICATION (network access)