802.1X Auth Failures - Resolution

Appendix

  • show logging application rabbitmq.log tail count 50

ECCRB Submission (Emergency Change Control Review Board)

Summary

ISE Primary MNT RabbitMQ Message Queue Optimization - TAC-Guided Intervention

Description

During routine monitoring of the ISE distributed deployment, elevated CPU utilization (109%) was identified in the RabbitMQ messaging service on the Primary Monitoring Node (pmnt.ise.chla.org). RabbitMQ handles inter-node communication for session logging and replication.

Cisco TAC was engaged proactively (S1 - Healthcare Environment). TAC analysis confirmed message queue saturation was degrading logging pipeline performance and would progressively impact operational visibility if not addressed.

TAC recommended a controlled restart (reload) of the Primary MNT node to clear the accumulated queue backlog and restore optimal message processing throughput.

Business Justification

  • ISE MNT provides visibility into 802.1X authentication events for ~26,000+ endpoints

  • Degraded logging impacts security incident response capability

  • Proactive intervention prevents escalation to authentication service impact

Service Impact

None

  • Secondary MNT (smnt.ise.chla.org) provides continuous monitoring during restart

  • RADIUS authentication handled by four independent PSNs - unaffected by MNT operations

  • No endpoint connectivity impact

Detailed Implementation Plan

1. SSH to Primary MNT
   - ssh pmnt.ise.chla.org

2. Initiate reload
   - reload
   - Prompt: "Save ade-os [y/n]" -> y

3. Wait for reboot (~5 minutes)

4. Reconnect and verify
   - ssh pmnt.ise.chla.org
   - Wait ~10 minutes for services to initialize
   - show application status ise
   - Confirm critical services running:
     * Database Listener: running
     * Database Server: running
     * Application Server: running
     * M&T Session Database: running
     * M&T Log Processor: running
     * ISE Messaging Service: running

5. Validate logging restored
   - Log into Primary PAN (ppan.ise.chla.org)
   - Navigate to Operations > RADIUS Live Logs
   - Click endpoint details to confirm "No data available" error is resolved
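
The service check in step 4 can be done by eye, but a small script makes the result easier to attach to the change record. The following is a minimal sketch, assuming the output of show application status ise has been saved as text and that its columns are separated by two or more spaces. The sample output, its process IDs, and the function names are illustrative, not captured from pmnt.

```python
import re

# Services that step 4 of the plan lists as critical.
REQUIRED_SERVICES = [
    "Database Listener",
    "Database Server",
    "Application Server",
    "M&T Session Database",
    "M&T Log Processor",
    "ISE Messaging Service",
]


def parse_service_states(status_output: str) -> dict[str, str]:
    """Map service name -> state from saved 'show application status ise' text."""
    states = {}
    for line in status_output.splitlines():
        # Assumption: columns are separated by runs of two or more spaces.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) >= 2 and not parts[0].startswith("-"):
            states[parts[0]] = parts[1].lower()
    return states


def missing_or_down(status_output: str) -> list[str]:
    """Return the required services that are absent or not in the 'running' state."""
    states = parse_service_states(status_output)
    return [name for name in REQUIRED_SERVICES if states.get(name) != "running"]


# Illustrative sample only; process IDs are made up.
SAMPLE = """\
ISE PROCESS NAME                       STATE            PROCESS ID
--------------------------------------------------------------------
Database Listener                      running          1001
Database Server                        running          90 PROCESSES
Application Server                     running          1002
M&T Session Database                   running          1003
M&T Log Processor                      initializing
ISE Messaging Service                  running          1004
"""

if __name__ == "__main__":
    print(missing_or_down(SAMPLE))
```

An empty list means every critical service reports running. Any names returned are the ones to recheck after waiting for initialization to finish.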

Detailed Backout Plan

Change cannot be backed out (a reboot cannot be reversed once initiated).

Mitigation: Secondary MNT (smnt.ise.chla.org) automatically assumes primary monitoring role if pmnt fails to recover. No authentication impact regardless of outcome.

Testing Plan

Pre-Change Validation

  • Confirm Secondary MNT (smnt.ise.chla.org) is healthy and synchronized

  • Verify PSNs (psn-1 through psn-4) are processing RADIUS authentications

  • Document current RabbitMQ CPU utilization on pmnt

Post-Change Validation

  • SSH to pmnt.ise.chla.org - confirm node is accessible

  • Run show application status ise - all critical services running

  • Run show logging application rabbitmq.log tail count 50 - no errors

  • Log into Primary PAN (ppan.ise.chla.org)

  • Navigate to Operations > RADIUS Live Logs

  • Click endpoint details on 3+ recent authentications - confirm data is available (no "No data available" error)

  • Verify replication status: Administration > System > Deployment

  • Confirm no alarms on dashboard
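
The "no errors" check on the rabbitmq.log tail can be sketched as a simple filter. This assumes the log lines carry a bracketed severity such as [error], which is the layout RabbitMQ commonly uses. The sample lines and the function name are hypothetical and should be adjusted to the format actually seen on pmnt.

```python
import re

# Assumption: each log line carries a bracketed severity, e.g. "[error]".
SEVERE_LEVEL = re.compile(r"\[(error|critical|alert|emergency)\]", re.IGNORECASE)


def error_lines(log_tail: str) -> list[str]:
    """Return the lines of a saved log tail that are logged at error level or above."""
    return [line for line in log_tail.splitlines() if SEVERE_LEVEL.search(line)]


# Hypothetical sample lines, not taken from the real rabbitmq.log.
SAMPLE_TAIL = """\
2024-01-01 10:00:01.100 [info] <0.500.0> accepting AMQP connection
2024-01-01 10:00:02.200 [warning] <0.501.0> closing AMQP connection
2024-01-01 10:00:03.300 [error] <0.502.0> hypothetical failure message
"""

if __name__ == "__main__":
    for line in error_lines(SAMPLE_TAIL):
        print(line)
```

Warnings are deliberately not flagged here. Whether they should count against the validation step is a judgment call for the change owner.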

Success Criteria

  • Primary MNT services running

  • RabbitMQ CPU < 50%

  • Live Logs session details displaying correctly

  • No replication alarms
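
The four success criteria can be evaluated together once the post-change observations are recorded. The sketch below is one way to do that; the PostChangeObservation fields and the failed_criteria helper are hypothetical names introduced for illustration, and the values are entered by the operator rather than collected automatically.

```python
from dataclasses import dataclass

# Threshold taken from the success criteria above.
CPU_THRESHOLD_PERCENT = 50.0


@dataclass
class PostChangeObservation:
    """Operator-recorded results of the post-change validation steps."""
    services_running: bool        # all critical Primary MNT services running
    rabbitmq_cpu_percent: float   # RabbitMQ CPU as observed on pmnt
    live_log_details_ok: bool     # Live Logs session details display correctly
    replication_alarms: int       # count of replication alarms on the dashboard


def failed_criteria(obs: PostChangeObservation) -> list[str]:
    """Return the success criteria that the observation does not meet."""
    failures = []
    if not obs.services_running:
        failures.append("Primary MNT services running")
    if not obs.rabbitmq_cpu_percent < CPU_THRESHOLD_PERCENT:
        failures.append("RabbitMQ CPU < 50%")
    if not obs.live_log_details_ok:
        failures.append("Live Logs session details displaying correctly")
    if obs.replication_alarms > 0:
        failures.append("No replication alarms")
    return failures


if __name__ == "__main__":
    # Pre-change state as described in the evidence section.
    before = PostChangeObservation(True, 109.0, False, 1)
    print(failed_criteria(before))
```

An empty result means the change met every criterion. Anything returned should be raised with TAC before the change is closed.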

Benefits of Change

  • Restore RabbitMQ message processing to nominal throughput

  • Ensure continuous security event visibility for compliance and incident response

  • Prevent progressive degradation that could impact authentication diagnostics

  • Align with Cisco TAC best practices for ISE MNT health maintenance

Risk Analysis / Mitigation Plan

Risk                                             | Probability          | Mitigation
-------------------------------------------------+----------------------+-----------------------------------------------------------------
Temporary loss of primary logging during restart | Expected (by design) | Secondary MNT assumes logging automatically; no data loss
Primary MNT fails to recover                     | Low                  | Secondary MNT continues operation; TAC on standby for escalation
Authentication impact                            | None                 | MNT is monitoring-only; PSNs operate independently

Evidence / Justification

Observed Condition - Primary MNT CLI (pmnt.ise.chla.org)

RabbitMQ messaging service at 109% CPU utilization. Node unable to process authentication/session logs, causing backlog.

  • Command: top / show application status ise

  • Finding: RabbitMQ process consuming >100% CPU

Observed Condition - Primary PAN Dashboard (ppan.ise.chla.org)

Alarms displayed:

  • "Replication Failed from PMnT"

  • MNT sync status showing degraded state

Location: Administration > System > Deployment

Finding: Primary MNT replication failing to secondary

Observed Condition - RADIUS Live Logs

Live Logs show authentication entries, but clicking session details returns:

No data available for this record. Either the data is purged or authentication for this session record happened a week ago.

Root Cause: RabbitMQ overload preventing session data from being written to MNT database.

TAC Engagement

  • SR Number: add your case number

  • Severity: S1 (Medical Facility - Patient Care Impact)

  • TAC Engineer: add name if available

  • Recommendation: Immediate reboot of Primary MNT to clear RabbitMQ queue

Authorization

Change authorized by Cisco TAC guidance via live WebEx session.
