STD-011: Incident Response

When something breaks, the response follows a defined structure: detect, triage, investigate, resolve, verify, document. Every incident gets a report. Severity determines urgency and what comes after.

Principles

Structure over heroics. A calm, structured response outperforms frantic troubleshooting. Follow the workflow even under pressure.
Document as you go. The incident report starts at detection, not after resolution. Timeline entries are captured in real time.
Severity drives follow-up. P1 and P2 incidents MUST produce an RCA (STD-010). P3 incidents SHOULD produce one, and P4 incidents MAY.
Verify before declaring resolved. "It seems to work" is not resolved. Verification commands with expected output MUST confirm restoration.
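
The "verify before declaring resolved" principle can be sketched as a check that records the command, the expected output, and the actual output side by side. This is a minimal illustration, not a prescribed tool: the `verify` helper and `VerificationResult` type are hypothetical names, and the Python one-liner stands in for a real verification command so the sketch runs anywhere.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class VerificationResult:
    command: list[str]
    expected: str
    actual: str

    @property
    def passed(self) -> bool:
        # Exact match on purpose: a substring test would accept
        # "inactive" when the expected output is "active".
        return self.actual == self.expected


def verify(command: list[str], expected: str, timeout: float = 30.0) -> VerificationResult:
    """Run one verification command and capture expected vs. actual output."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return VerificationResult(command, expected, proc.stdout.strip())


# Stand-in command (assumption). In a real incident this would be the
# specific command named in the report's Resolution section.
result = verify([sys.executable, "-c", "print('active')"], expected="active")
print(result.command, "expected:", result.expected, "actual:", result.actual)
```

Keeping all three fields means the result can be pasted into the Resolution section as evidence rather than summarized as "confirmed working."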

Severity Classification

Level	Description	Criteria	Response Time
P1 — Critical	Complete outage, data loss risk, security breach	Production down, no workaround, active data loss or security event	Immediate, all hands
P2 — High	Major feature broken, significant impact	Core functionality impaired, workaround difficult or degraded	< 1 hour
P3 — Medium	Functionality degraded, workaround available	Non-critical feature broken, users can work around it	< 4 hours
P4 — Low	Cosmetic, minor inconvenience	No functional impact, annoyance only	Next business day
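
The table above can be encoded as data so that tooling derives a response deadline and the RCA obligation from the severity alone. The sketch below is one possible encoding. The names `SeverityPolicy`, `POLICY`, and `respond_by` are hypothetical, and "next business day" is assumed to mean the end of the next Monday-to-Friday day with holidays ignored.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    respond_within: Optional[timedelta]  # None means "next business day"
    rca: str  # requirement keyword for the follow-up RCA


POLICY = {
    "P1": SeverityPolicy("Critical", timedelta(0), "MUST"),
    "P2": SeverityPolicy("High", timedelta(hours=1), "MUST"),
    "P3": SeverityPolicy("Medium", timedelta(hours=4), "SHOULD"),
    "P4": SeverityPolicy("Low", None, "MAY"),
}


def respond_by(severity: str, detected: datetime) -> datetime:
    """Latest acceptable response time for an incident detected at `detected`."""
    policy = POLICY[severity]
    if policy.respond_within is not None:
        return detected + policy.respond_within
    # Assumption: weekends are the only non-business days.
    day = detected.date() + timedelta(days=1)
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return datetime.combine(day, time.max)


detected = datetime(2024, 3, 8, 16, 30)  # a Friday afternoon
print(respond_by("P2", detected))
print(respond_by("P4", detected).date())
```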

Requirements

Every incident MUST be documented in an incident report using INC-YYYY-MM-DD-slug.adoc naming.
Incident reports MUST be filed in pages/case-studies/incidents/.
Every incident report MUST include: Summary table (detected, mitigated, resolved, duration, severity, impact, root cause), Timeline, Symptoms, Investigation (triage + logs + findings), Root Cause, Resolution (fix + verification), Impact Assessment, Prevention (short-term + long-term), Lessons Learned, and Metadata.
The Timeline MUST be populated in real time during investigation, not reconstructed after the fact.
Resolution verification MUST include specific commands and their expected output — not just "confirmed working."
P1 and P2 incidents MUST produce a corresponding RCA document (STD-010) within 5 business days.
P1 and P2 incidents MUST include a Communication Log documenting who was notified and when.
Changes made during incident resolution MUST be documented as change records (CR-) and follow change control (STD-005).
The incident report MUST cross-reference all related documents: RCA, CR, runbooks used, related incidents.
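
The naming, filing, and RCA deadline requirements lend themselves to small helpers. The sketch below rests on several assumptions the standard does not state: the date in the filename is the detection date, the slug is lowercase alphanumerics separated by hyphens, the 5 business days are counted from the resolution date, and weekends are the only non-business days. All function names are hypothetical.

```python
import re
from datetime import date, timedelta
from pathlib import PurePosixPath

INCIDENT_DIR = PurePosixPath("pages/case-studies/incidents")
NAME_RE = re.compile(r"^INC-\d{4}-\d{2}-\d{2}-[a-z0-9]+(?:-[a-z0-9]+)*\.adoc$")


def slugify(title: str) -> str:
    """Lowercase the title and collapse anything non-alphanumeric to a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def incident_path(detected: date, title: str) -> PurePosixPath:
    """Build the report path following INC-YYYY-MM-DD-slug.adoc."""
    name = f"INC-{detected.isoformat()}-{slugify(title)}.adoc"
    if not NAME_RE.match(name):
        raise ValueError(f"cannot build a valid incident filename from {title!r}")
    return INCIDENT_DIR / name


def rca_due(resolved: date, business_days: int = 5) -> date:
    """Date by which a P1/P2 RCA is due, skipping Saturdays and Sundays."""
    day = resolved
    remaining = business_days
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:
            remaining -= 1
    return day


print(incident_path(date(2024, 3, 8), "DNS Resolver Outage"))
print(rca_due(date(2024, 3, 8)))
```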

Incident Response Workflow

1. DETECT    → Alert, user report, or observation triggers investigation
2. TRIAGE    → Classify severity (P1-P4), determine response urgency
3. INVESTIGATE → Initial triage checklist, log analysis, diagnostic commands
4. RESOLVE   → Apply fix following verify-change-verify (STD-005)
5. VERIFY    → Confirm restoration with specific commands + expected output
6. DOCUMENT  → Complete incident report, create CR for changes, trigger RCA if P1/P2
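
One way to picture the workflow together with the real-time timeline requirement is a small state tracker that timestamps every stage change as it happens. This is an illustrative sketch only: the `Incident` class is hypothetical, and it enforces a strictly forward progression, whereas a real investigation may loop back to an earlier stage.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Stage(IntEnum):
    DETECT = 1
    TRIAGE = 2
    INVESTIGATE = 3
    RESOLVE = 4
    VERIFY = 5
    DOCUMENT = 6


@dataclass
class Incident:
    stage: Stage = Stage.DETECT
    timeline: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # The report starts at detection, so the first entry is written here.
        self.log("incident detected")

    def log(self, note: str) -> None:
        """Append a timeline entry stamped with the current UTC time."""
        self.timeline.append((datetime.now(timezone.utc), self.stage.name, note))

    def advance(self, note: str) -> None:
        """Move to the next stage and record why."""
        if self.stage is Stage.DOCUMENT:
            raise ValueError("workflow is already at its final stage")
        self.stage = Stage(self.stage + 1)
        self.log(note)


incident = Incident()
incident.advance("classified as P2")
incident.advance("started log analysis")
for stamp, stage, note in incident.timeline:
    print(stamp.isoformat(), stage, note)
```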

Initial Triage Checklist

Every investigation MUST start with:

Check service status (systemctl status, process health)
Review recent changes (deploys, config changes, updates)
Check dependent services and connectivity
Review error logs (journalctl -u <service> --since "1 hour ago")
Check monitoring dashboards (if available)
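
The two checklist items that name concrete commands can be generated per service, so the exact invocations end up in the report. The sketch below only builds and prints the commands and does not run them. The helper names and the `nginx.service` unit are hypothetical examples, and `--no-pager` is added so output can be captured non-interactively. The remaining checklist items (recent changes, dependencies, dashboards) depend on the environment and are left out.

```python
import shlex


def triage_commands(service: str) -> list[list[str]]:
    """Argument vectors for the service-status and error-log checks."""
    return [
        ["systemctl", "status", service, "--no-pager"],
        ["journalctl", "-u", service, "--since", "1 hour ago", "--no-pager"],
    ]


def render(commands: list[list[str]]) -> list[str]:
    """Shell-quoted form of each command, suitable for pasting into a report."""
    return [shlex.join(cmd) for cmd in commands]


for line in render(triage_commands("nginx.service")):
    print(line)
```

Building argument vectors rather than shell strings avoids quoting mistakes if the commands are later passed to `subprocess.run`.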

Compliance

Check	Method	Pass Criterion
Incident documented	Every known outage or failure has an `INC-` document	No undocumented incidents
Severity assigned	Summary table includes severity level P1-P4	Classification present and justified
Timeline populated	Timeline table has at least: detection, investigation start, root cause identified, resolution	Minimum 4 timeline entries
Verification present	Resolution section includes specific commands with expected output	Not just "confirmed working"
RCA triggered	P1/P2 incidents have a corresponding `RCA-` document	No orphaned P1/P2 incidents
Changes tracked	Changes made during resolution are documented as `CR-` records	Resolution actions traceable
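
Most of the checks above can be evaluated mechanically once a report is parsed into a structure. The sketch below assumes a hypothetical dictionary shape (`severity`, `timeline`, `verification`, `rca`, `changes_made`, `change_records`), and none of these keys come from the standard. The "Incident documented" check is omitted because it needs a list of known outages from outside the report, and the "justified" part of the severity check still needs a human reviewer.

```python
def compliance_findings(report: dict) -> list[str]:
    """Return one message per failed check; an empty list means all passed."""
    findings = []
    severity = report.get("severity")

    if severity not in {"P1", "P2", "P3", "P4"}:
        findings.append("Severity assigned: missing or not P1-P4")

    if len(report.get("timeline", [])) < 4:
        findings.append("Timeline populated: fewer than 4 entries")

    checks = report.get("verification", [])
    if not checks or any(not c.get("command") or not c.get("expected") for c in checks):
        findings.append("Verification present: command or expected output missing")

    if severity in {"P1", "P2"} and not report.get("rca"):
        findings.append("RCA triggered: P1/P2 incident without an RCA- document")

    if report.get("changes_made") and not report.get("change_records"):
        findings.append("Changes tracked: changes made without CR- records")

    return findings


sample = {
    "severity": "P2",
    "timeline": ["detected", "investigation started", "root cause identified"],
    "verification": [{"command": "systemctl is-active <service>", "expected": "active"}],
    "changes_made": True,
    "change_records": [],
}
for finding in compliance_findings(sample):
    print(finding)
```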

Exceptions

Personal infrastructure incidents (home lab, personal workstations) MAY use a simplified format omitting the Communication Log section. All other sections remain required.
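
Combined with the Requirements section, this exception gives a simple rule for which sections a report needs. The sketch below is one reading of that rule: the Communication Log is required only for P1/P2 incidents, and the simplified format drops it for personal infrastructure. The function names and short section labels are hypothetical.

```python
BASE_SECTIONS = (
    "Summary", "Timeline", "Symptoms", "Investigation", "Root Cause",
    "Resolution", "Impact Assessment", "Prevention", "Lessons Learned",
    "Metadata",
)


def required_sections(severity: str, personal_infra: bool = False) -> tuple[str, ...]:
    """Sections a report must contain for this severity and scope."""
    sections = list(BASE_SECTIONS)
    # Communication Log applies to P1/P2 only, and MAY be omitted
    # under the personal-infrastructure exception.
    if severity in ("P1", "P2") and not personal_infra:
        sections.append("Communication Log")
    return tuple(sections)


def missing_sections(present: set[str], severity: str, personal_infra: bool = False) -> list[str]:
    """Required sections that are absent from the report."""
    return [s for s in required_sections(severity, personal_infra) if s not in present]


print(missing_sections(set(BASE_SECTIONS), "P1"))
print(missing_sections(set(BASE_SECTIONS), "P1", personal_infra=True))
```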

Related

STD-010: Root Cause Analysis — post-incident analysis standard
STD-005: Change Control — changes during resolution follow verify-change-verify
Incident Response Patterns — professional judgment patterns
Incident Report Template — starting point for new incident reports
Incident Runbook Template — operational response template