STD-011: Incident Response
When something breaks, the response follows a defined structure: detect, triage, investigate, resolve, verify, document. Every incident gets a report. Severity determines urgency and what comes after.
Principles
- Structure over heroics. A calm, structured response outperforms frantic troubleshooting. Follow the workflow even under pressure.
- Document as you go. The incident report starts at detection, not after resolution. Timeline entries are captured in real time.
- Severity drives follow-up. P1/P2 incidents MUST produce an RCA (STD-010). P3 SHOULD. P4 MAY.
- Verify before declaring resolved. "It seems to work" is not resolved. Verification commands with expected output MUST confirm restoration.
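The verification principle can be illustrated with a small check that runs a command and compares its output with the expected value. This is a minimal sketch rather than a prescribed tool: the `VerificationStep` type, the `run_verification` helper, and the service name `nginx` are assumptions made for illustration.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class VerificationStep:
    """One verification command plus the output that proves restoration."""
    command: list[str]
    expected_output: str


def run_verification(step: VerificationStep, timeout: float = 30.0) -> bool:
    """Pass only if the command exits 0 and prints exactly the expected text."""
    result = subprocess.run(
        step.command, capture_output=True, text=True, timeout=timeout
    )
    actual = result.stdout.strip()
    # Print both values so the output can be pasted into the incident report.
    print(f"$ {' '.join(step.command)}")
    print(f"expected: {step.expected_output!r}  actual: {actual!r}")
    return result.returncode == 0 and actual == step.expected_output


# Hypothetical example: `systemctl is-active` prints "active" for a running unit.
EXAMPLE_STEP = VerificationStep(["systemctl", "is-active", "nginx"], "active")
```

Recording the command, the expected output, and the actual output together gives the Resolution section something concrete to cite instead of "confirmed working."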
Severity Classification
| Level | Description | Criteria | Response Time |
|---|---|---|---|
| P1 (Critical) | Complete outage, data loss risk, security breach | Production down, no workaround, active data loss or security event | Immediate, all hands |
| P2 (High) | Major feature broken, significant impact | Core functionality impaired, workaround difficult or degraded | < 1 hour |
| P3 (Medium) | Functionality degraded, workaround available | Non-critical feature broken, users can work around it | < 4 hours |
| P4 (Low) | Cosmetic, minor inconvenience | No functional impact, annoyance only | Next business day |
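For tooling that needs to act on severity, the classification above can be encoded as a lookup table. The following is a sketch under assumptions: the `Severity` enum, the `SeverityPolicy` type, and the `rca_required` helper are illustrative names, and the RCA keywords are taken from the Principles section.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P1 = "Critical"
    P2 = "High"
    P3 = "Medium"
    P4 = "Low"


@dataclass(frozen=True)
class SeverityPolicy:
    response_target: str  # "Response Time" column, copied verbatim
    rca_keyword: str      # MUST / SHOULD / MAY, from the Principles section


POLICIES = {
    Severity.P1: SeverityPolicy("Immediate, all hands", "MUST"),
    Severity.P2: SeverityPolicy("< 1 hour", "MUST"),
    Severity.P3: SeverityPolicy("< 4 hours", "SHOULD"),
    Severity.P4: SeverityPolicy("Next business day", "MAY"),
}


def rca_required(severity: Severity) -> bool:
    """True only where the standard says an RCA MUST be produced."""
    return POLICIES[severity].rca_keyword == "MUST"
```

Keeping the policy in one table means a change to response targets or RCA rules is made in a single place.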
Requirements
- Every incident MUST be documented in an incident report using `INC-YYYY-MM-DD-slug.adoc` naming.
- Incident reports MUST be filed in `pages/case-studies/incidents/`.
- Every incident report MUST include: Summary table (detected, mitigated, resolved, duration, severity, impact, root cause), Timeline, Symptoms, Investigation (triage + logs + findings), Root Cause, Resolution (fix + verification), Impact Assessment, Prevention (short-term + long-term), Lessons Learned, and Metadata.
- The Timeline MUST be populated in real time during investigation, not reconstructed after the fact.
- Resolution verification MUST include specific commands and their expected output, not just "confirmed working."
- P1 and P2 incidents MUST produce a corresponding RCA document (STD-010) within 5 business days.
- P1 and P2 incidents MUST include a Communication Log documenting who was notified and when.
- Changes made during incident resolution MUST be documented as change records (`CR-`) and follow change control (STD-005).
- The incident report MUST cross-reference all related documents: RCA, CR, runbooks used, related incidents.
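The naming, filing, and RCA deadline requirements above lend themselves to small helpers. The sketch below makes several assumptions that the standard does not state: the slug rule (lowercase, hyphen-separated), counting the 5 business days from a caller-supplied start date, and treating business days as Monday to Friday with no holiday calendar. All function names are hypothetical.

```python
import re
from datetime import date, timedelta
from pathlib import PurePosixPath

INCIDENT_DIR = PurePosixPath("pages/case-studies/incidents")


def slugify(title: str) -> str:
    """Assumed slug rule: lowercase, non-alphanumeric runs become one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def incident_report_path(detected: date, title: str) -> PurePosixPath:
    """Build INC-YYYY-MM-DD-slug.adoc under the incidents directory."""
    return INCIDENT_DIR / f"INC-{detected.isoformat()}-{slugify(title)}.adoc"


def rca_due_date(start: date, business_days: int = 5) -> date:
    """Count forward Monday-Friday days; holidays are ignored (assumption)."""
    current = start
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current
```

For example, `incident_report_path(date(2024, 3, 5), "DNS Resolver Outage!")` yields `pages/case-studies/incidents/INC-2024-03-05-dns-resolver-outage.adoc`, and `rca_due_date(date(2024, 3, 5))` returns 2024-03-12 under the assumptions above.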
Incident Response Workflow
1. DETECT → Alert, user report, or observation triggers investigation
2. TRIAGE → Classify severity (P1-P4), determine response urgency
3. INVESTIGATE → Initial triage checklist, log analysis, diagnostic commands
4. RESOLVE → Apply fix following verify-change-verify (STD-005)
5. VERIFY → Confirm restoration with specific commands + expected output
6. DOCUMENT → Complete incident report, create CR for changes, trigger RCA if P1/P2
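One way to keep the timeline honest is to timestamp entries at the moment they are logged and to reject entries that skip a workflow phase. The sketch below is illustrative only: the `Phase` and `Timeline` names are assumptions, and allowing a return to an earlier phase (for example, re-investigating after a failed verification) is a design choice, not something the standard specifies.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Phase(IntEnum):
    DETECT = 1
    TRIAGE = 2
    INVESTIGATE = 3
    RESOLVE = 4
    VERIFY = 5
    DOCUMENT = 6


@dataclass
class Timeline:
    # Each entry is (UTC timestamp, phase, note), in the order it was logged.
    entries: list = field(default_factory=list)

    def log(self, phase: Phase, note: str) -> None:
        """Timestamp the entry now; reject phases that skip ahead."""
        furthest = max((p for _, p, _ in self.entries), default=0)
        if phase > furthest + 1:
            raise ValueError(f"cannot log {phase.name}: an earlier phase was skipped")
        self.entries.append((datetime.now(timezone.utc), phase, note))
```

Because the first allowed phase is `DETECT`, a timeline built this way always begins at detection, and the timestamps reflect when each entry was written rather than when it was remembered.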
Initial Triage Checklist
Every investigation MUST start with:
- Check service status (`systemctl status`, process health)
- Review recent changes (deploys, config changes, updates)
- Check dependent services and connectivity
- Review error logs (`journalctl -u <service> --since "1 hour ago"`)
- Check monitoring dashboards (if available)
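The two command-based items in the checklist can be scripted so their output is captured for the incident report. This is a minimal sketch assuming a systemd host; `triage_commands` and `run_triage` are hypothetical helper names, and the remaining checklist items (recent changes, dependencies, dashboards) still need manual review.

```python
import subprocess


def triage_commands(service: str, since: str = "1 hour ago") -> list[list[str]]:
    """Build the status and log commands from the checklist for one service."""
    return [
        ["systemctl", "status", service, "--no-pager"],
        ["journalctl", "-u", service, "--since", since, "--no-pager"],
    ]


def run_triage(service: str) -> list[tuple[str, int, str]]:
    """Run each command and return (command line, exit code, stdout)."""
    results = []
    for cmd in triage_commands(service):
        # No check=True: `systemctl status` exits non-zero for inactive or
        # failed units, and that output is exactly what triage needs to see.
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results.append((" ".join(cmd), proc.returncode, proc.stdout))
    return results
```

Passing arguments as a list avoids shell quoting problems with values such as `"1 hour ago"`.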
Compliance
| Check | Method | Pass Criterion |
|---|---|---|
| Incident documented | Every known outage or failure has an incident report | No undocumented incidents |
| Severity assigned | Summary table includes severity level P1-P4 | Classification present and justified |
| Timeline populated | Timeline table has at least: detection, investigation start, root cause identified, resolution | Minimum 4 timeline entries |
| Verification present | Resolution section includes specific commands with expected output | Not just "confirmed working" |
| RCA triggered | P1/P2 incidents have a corresponding RCA document | No orphaned P1/P2 incidents |
| Changes tracked | Changes made during resolution are documented as change records | Resolution actions traceable |
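Most of the per-report checks in the Compliance table can be automated once a report is available in structured form. The sketch below assumes a hypothetical `IncidentReport` record; its field names are illustrative, not part of the standard. The "Incident documented" check is left out because it spans all known outages rather than a single report.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentReport:
    """Hypothetical structured view of one incident report."""
    severity: str
    timeline: list = field(default_factory=list)
    verification_commands: list = field(default_factory=list)
    rca_ref: str = ""
    changes_made: bool = False
    cr_refs: list = field(default_factory=list)


def compliance_failures(report: IncidentReport) -> list[str]:
    """Return the names of the compliance checks this report fails."""
    failures = []
    if report.severity not in {"P1", "P2", "P3", "P4"}:
        failures.append("Severity assigned")
    if len(report.timeline) < 4:  # minimum 4 timeline entries
        failures.append("Timeline populated")
    if not report.verification_commands:
        failures.append("Verification present")
    if report.severity in {"P1", "P2"} and not report.rca_ref:
        failures.append("RCA triggered")
    if report.changes_made and not report.cr_refs:
        failures.append("Changes tracked")
    return failures
```

An empty list means the report passes every automated check; whether the severity is justified and the timeline entries are meaningful still calls for human review.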
Exceptions
Personal infrastructure incidents (home lab, personal workstations) MAY use a simplified format omitting the Communication Log section. All other sections remain required.
Related
- STD-010: Root Cause Analysis (post-incident analysis standard)
- STD-005: Change Control (changes during resolution follow verify-change-verify)
- Incident Response Patterns (professional judgment patterns)
- Incident Report Template (starting point for new incident reports)
- Incident Runbook Template (operational response template)