STD-011: Incident Response

When something breaks, the response follows a defined structure: detect, triage, investigate, resolve, verify, document. Every incident gets a report. Severity determines urgency and what comes after.

Principles

  1. Structure over heroics. A calm, structured response outperforms frantic troubleshooting. Follow the workflow even under pressure.

  2. Document as you go. The incident report starts at detection, not after resolution. Timeline entries are captured in real time.

  3. Severity drives follow-up. P1 and P2 incidents MUST produce an RCA (STD-010); P3 incidents SHOULD; P4 incidents MAY.

  4. Verify before declaring resolved. "It seems to work" is not resolved. Verification commands with expected output MUST confirm restoration.
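
Principle 3 can be expressed as a small lookup. The following is a minimal sketch rather than part of the standard; the names Requirement, RCA_REQUIREMENT, and rca_requirement are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Requirement(Enum):
    """Requirement strength, using the MUST/SHOULD/MAY keywords of the standard."""
    MUST = "MUST"
    SHOULD = "SHOULD"
    MAY = "MAY"


# Principle 3 as data: severity -> how strongly an RCA (STD-010) is required.
RCA_REQUIREMENT = {
    "P1": Requirement.MUST,
    "P2": Requirement.MUST,
    "P3": Requirement.SHOULD,
    "P4": Requirement.MAY,
}


def rca_requirement(severity: str) -> Requirement:
    """Return the RCA requirement level for a severity such as 'P2'."""
    try:
        return RCA_REQUIREMENT[severity.upper()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Keeping the mapping as data rather than branching logic makes it easy to audit against the wording of the principle.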

Severity Classification

Level | Description | Criteria | Response Time
------|-------------|----------|--------------
P1 - Critical | Complete outage, data loss risk, security breach | Production down, no workaround, active data loss or security event | Immediate, all hands
P2 - High | Major feature broken, significant impact | Core functionality impaired, workaround difficult or degraded | < 1 hour
P3 - Medium | Functionality degraded, workaround available | Non-critical feature broken, users can work around it | < 4 hours
P4 - Low | Cosmetic, minor inconvenience | No functional impact, annoyance only | Next business day
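
The response-time targets in the severity table can be turned into a deadline calculation. This is a sketch under stated assumptions: "Immediate" is modeled as a zero-length window, and "Next business day" is left as None because the standard does not define business hours. RESPONSE_WINDOW and response_deadline are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response windows taken from the severity table.
# Assumption: P1 "immediate" is a zero-length window.
# Assumption: P4 "next business day" is not a fixed duration, so it is None.
RESPONSE_WINDOW = {
    "P1": timedelta(0),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": None,
}


def response_deadline(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Time before which a response should begin, or None for P4."""
    window = RESPONSE_WINDOW[severity]
    if window is None:
        return None
    return detected_at + window
```

For example, a P3 incident detected at 09:00 would have a response deadline of 13:00 the same day.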

Requirements

  1. Every incident MUST be documented in an incident report using INC-YYYY-MM-DD-slug.adoc naming.

  2. Incident reports MUST be filed in pages/case-studies/incidents/.

  3. Every incident report MUST include: Summary table (detected, mitigated, resolved, duration, severity, impact, root cause), Timeline, Symptoms, Investigation (triage + logs + findings), Root Cause, Resolution (fix + verification), Impact Assessment, Prevention (short-term + long-term), Lessons Learned, and Metadata.

  4. The Timeline MUST be populated in real time during investigation, not reconstructed after the fact.

  5. Resolution verification MUST include specific commands and their expected output — not just "confirmed working."

  6. P1 and P2 incidents MUST produce a corresponding RCA document (STD-010) within 5 business days.

  7. P1 and P2 incidents MUST include a Communication Log documenting who was notified and when.

  8. Changes made during incident resolution MUST be documented as change records (CR-) and follow change control (STD-005).

  9. The incident report MUST cross-reference all related documents: RCA, CR, runbooks used, related incidents.
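
Requirements 1 and 6 lend themselves to automated checks. The sketch below validates the INC-YYYY-MM-DD-slug.adoc naming and computes a date 5 business days out. Several details are assumptions, not taken from the standard: the allowed slug characters, the choice of start date for the 5-day window, and the treatment of business days as Monday to Friday with no holiday calendar. The function names are hypothetical.

```python
import re
from datetime import date, timedelta

# INC-YYYY-MM-DD-slug.adoc. Assumption: the slug is lowercase letters and
# digits separated by single hyphens; the standard does not define it.
INC_NAME = re.compile(
    r"^INC-(\d{4})-(\d{2})-(\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.adoc$"
)


def is_valid_incident_filename(name: str) -> bool:
    """True if the name matches the pattern and embeds a real calendar date."""
    match = INC_NAME.match(name)
    if match is None:
        return False
    year, month, day = (int(group) for group in match.groups()[:3])
    try:
        date(year, month, day)  # rejects dates such as 2024-13-04
    except ValueError:
        return False
    return True


def rca_due_date(start: date, business_days: int = 5) -> date:
    """Add business days to start, skipping Saturdays and Sundays only."""
    current = start
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current
```

With these assumptions, an RCA window starting on Monday 2024-03-04 ends on Monday 2024-03-11.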

Incident Response Workflow

1. DETECT      → Alert, user report, or observation triggers investigation
2. TRIAGE      → Classify severity (P1-P4), determine response urgency
3. INVESTIGATE → Initial triage checklist, log analysis, diagnostic commands
4. RESOLVE     → Apply fix following verify-change-verify (STD-005)
5. VERIFY      → Confirm restoration with specific commands + expected output
6. DOCUMENT    → Complete incident report, create CR for changes, trigger RCA if P1/P2
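
The VERIFY step pairs each command with its expected output. One way to make that pairing executable is sketched below; VerificationCheck and run_check are hypothetical names, and requiring an exact match on trimmed stdout is an assumption (some checks may need a substring or pattern match instead).

```python
import subprocess
from dataclasses import dataclass


@dataclass
class VerificationCheck:
    command: list[str]     # argv form, which avoids shell quoting issues
    expected_output: str   # text expected on stdout after trimming whitespace


def run_check(check: VerificationCheck, timeout: float = 10.0) -> bool:
    """Run one command; pass only on exit code 0 and matching stdout."""
    try:
        result = subprocess.run(
            check.command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        # Command missing, not executable, or hung: treat as not verified.
        return False
    return (
        result.returncode == 0
        and result.stdout.strip() == check.expected_output
    )
```

A check such as VerificationCheck(["systemctl", "is-active", "nginx"], "active") could then be recorded in the Resolution section alongside its result; the service name nginx is only an example.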

Initial Triage Checklist

Every investigation MUST start with:

  • Check service status (systemctl status, process health)

  • Review recent changes (deploys, config changes, updates)

  • Check dependent services and connectivity

  • Review error logs (journalctl -u <service> --since "1 hour ago")

  • Check monitoring dashboards (if available)
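
The two checklist items that name concrete commands can be scripted so their output is captured for the Timeline. This is a sketch assuming a systemd host; triage_commands and run_triage are hypothetical names, and the --no-pager flag is added so the commands do not wait for interactive input. The remaining checklist items (recent changes, dependencies, dashboards) are environment-specific and are not covered here.

```python
import shlex
import subprocess


def triage_commands(service: str, since: str = "1 hour ago") -> list[list[str]]:
    """Build the read-only status and log commands from the checklist."""
    return [
        ["systemctl", "status", service, "--no-pager"],
        ["journalctl", "-u", service, "--since", since, "--no-pager"],
    ]


def run_triage(service: str) -> dict[str, str]:
    """Run each command and collect its output, keyed by the command line."""
    outputs = {}
    for argv in triage_commands(service):
        key = shlex.join(argv)
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=30
            )
            outputs[key] = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of aborting the rest of the triage.
            outputs[key] = f"could not run: {exc}"
    return outputs
```

Note that systemctl status exits non-zero for a stopped or failed unit, so the sketch records output without treating the exit code as an error.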

Compliance

Check | Method | Pass Criterion
------|--------|---------------
Incident documented | Every known outage or failure has an INC- document | No undocumented incidents
Severity assigned | Summary table includes severity level P1-P4 | Classification present and justified
Timeline populated | Timeline table has at least: detection, investigation start, root cause identified, resolution | Minimum 4 timeline entries
Verification present | Resolution section includes specific commands with expected output | Not just "confirmed working"
RCA triggered | P1/P2 incidents have a corresponding RCA- document | No orphaned P1/P2 incidents
Changes tracked | Changes made during resolution are documented as CR- records | Resolution actions traceable
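
Most of the compliance checks above can be evaluated mechanically once a report is available in structured form. The sketch below assumes a report has already been parsed into a dictionary; that dictionary layout, its key names, and the function compliance_failures are all hypothetical. The "Incident documented" check and the "justified" part of severity classification need human review and are not covered.

```python
# Timeline events named in the "Timeline populated" check.
REQUIRED_TIMELINE_EVENTS = {
    "detection",
    "investigation start",
    "root cause identified",
    "resolution",
}


def compliance_failures(report: dict) -> list[str]:
    """Return a list of failed checks; an empty list means all passed."""
    failures = []

    severity = report.get("severity")
    if severity not in {"P1", "P2", "P3", "P4"}:
        failures.append("severity missing or not P1-P4")

    events = {entry.get("event") for entry in report.get("timeline", [])}
    missing = REQUIRED_TIMELINE_EVENTS - events
    if missing:
        failures.append("timeline missing: " + ", ".join(sorted(missing)))

    checks = report.get("verification", [])
    if not checks or any(
        not c.get("command") or not c.get("expected_output") for c in checks
    ):
        failures.append("verification lacks commands with expected output")

    if severity in {"P1", "P2"} and not report.get("rca"):
        failures.append("P1/P2 incident has no RCA document")

    if report.get("changes_made") and not report.get("change_records"):
        failures.append("changes made without CR records")

    return failures
```

Returning the full list of failures, rather than stopping at the first, lets a reviewer fix a report in one pass.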

Exceptions

Personal infrastructure incidents (home lab, personal workstations) MAY use a simplified format omitting the Communication Log section. All other sections remain required.