STD-010: Root Cause Analysis
Every significant failure gets an RCA. The purpose is prevention, not blame. A fix without prevention is firefighting — the RCA format forces extraction of the transferable lesson.
Principles
- Prevention over fix. Every RCA MUST answer four questions: (1) What happened, (2) Why it happened, (3) What was fixed, (4) How to PREVENT it. The prevention step is the most valuable.
- 5 Whys to root cause. Surface symptoms are not root causes. Ask "why" until you reach a systemic gap: process, knowledge, tooling, or design.
- Blameless by design. RCAs document system failures, not human failures. "Operator error" is never a root cause; the system that allowed the error is.
- Timely capture. RCA MUST be written within 5 business days of resolution for P1/P2 incidents. Context decays rapidly.
- Pattern extraction. Every RCA MUST produce at least one entry in the Pattern Journal. The RCA is the investigation; the pattern is the transferable wisdom.
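The "Timely capture" deadline can be computed mechanically. The sketch below is illustrative only: the function name `rca_due_date` is hypothetical, it assumes business days are Monday through Friday with no holiday calendar, and it assumes counting starts on the day after resolution. None of these details are specified by this standard.

```python
from datetime import date, timedelta


def rca_due_date(resolved_on: date, business_days: int = 5) -> date:
    """Return the date `business_days` working days after `resolved_on`.

    Assumptions (not defined by the standard): weekends are Saturday and
    Sunday, public holidays are ignored, and the resolution day itself
    does not count toward the total.
    """
    current = resolved_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


if __name__ == "__main__":
    # An incident resolved on Friday 2024-03-01 is due the following Friday.
    print(rca_due_date(date(2024, 3, 1)))
```

A team with a holiday calendar would need to extend the weekday test; the loop structure stays the same.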
Requirements
- Every P1 (Critical) and P2 (High) incident MUST have a corresponding RCA document.
- P3 incidents SHOULD have an RCA. P4 incidents MAY have one if the failure reveals a systemic issue.
- RCA documents MUST use the naming convention RCA-YYYY-MM-DD-NNN.adoc, where NNN is sequential per day.
- RCA documents MUST be filed in pages/case-studies/rca/.
- Every RCA MUST include: Executive Summary, Timeline, Problem Statement (symptoms + expected vs actual), Root Cause (5 Whys analysis + root cause statement), Contributing Factors, Impact (severity + duration + scope), Resolution (immediate actions + verification), Preventive Measures (short-term + long-term), Detection analysis, Lessons Learned, and Metadata.
- The Root Cause section MUST contain a 5 Whys analysis table AND a single-sentence root cause statement in a NOTE block.
- Preventive Measures MUST have assigned owners and tracked status.
- Every RCA MUST cross-reference the originating incident report (INC-) and any change records (CR-) created as a result.
- The Lessons Learned section MUST produce at least one Pattern Journal entry or standard update.
RCA Format
Naming
RCA-YYYY-MM-DD-NNN-slug.adoc, where YYYY-MM-DD is the date of the incident (not the RCA creation date), NNN is the sequential number for that day, and slug is a brief description.
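A filename check along these lines could run in a pre-commit hook or CI job. This is a minimal sketch, not an official validator: `parse_rca_filename` and `RcaName` are hypothetical names, and the pattern assumes a three-digit NNN and a lowercase, hyphen-separated slug. The slug is treated as optional because the standard shows the name both with and without it.

```python
import re
from datetime import date
from typing import NamedTuple, Optional

# Assumed shape: RCA-YYYY-MM-DD-NNN[-slug].adoc
# Three-digit NNN and lowercase hyphenated slugs are assumptions.
RCA_NAME = re.compile(
    r"RCA-(\d{4})-(\d{2})-(\d{2})-(\d{3})"
    r"(?:-([a-z0-9]+(?:-[a-z0-9]+)*))?\.adoc"
)


class RcaName(NamedTuple):
    incident_date: date
    sequence: int
    slug: Optional[str]


def parse_rca_filename(filename: str) -> Optional[RcaName]:
    """Parse an RCA filename, returning None if it does not conform."""
    match = RCA_NAME.fullmatch(filename)
    if match is None:
        return None
    year, month, day, seq, slug = match.groups()
    try:
        incident_date = date(int(year), int(month), int(day))
    except ValueError:  # e.g. month 13 or day 32
        return None
    return RcaName(incident_date, int(seq), slug)


if __name__ == "__main__":
    print(parse_rca_filename("RCA-2024-03-01-001-dns-outage.adoc"))
```

Validating the date with `datetime.date` rather than the regex alone rejects impossible dates such as month 13.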
Required Sections
| Section | Content |
|---|---|
| Executive Summary | Single paragraph: what happened, why, what was done, outcome |
| Timeline | Chronological table: detection → investigation → identification → fix → verification |
| Problem Statement | Symptoms, expected behavior, actual behavior |
| Root Cause (5 Whys) | Table drilling from symptom to systemic cause + single-sentence root cause statement |
| Contributing Factors | Table: factor, description, preventable (yes/no) |
| Impact | Severity (P1-P4), duration, users/systems affected, data loss, business impact |
| Resolution | Immediate actions with commands + verification commands |
| Preventive Measures | Short-term (this week) + long-term (this quarter), with owner and status |
| Detection | How detected, detection gap analysis (could we have caught this earlier?) |
| Lessons Learned | What went well, what could be improved, key takeaways |
| Related | xrefs to incident report, change records, codex entries, pattern journal |
| Metadata | RCA ID, author, dates, status (Draft/In Review/Final), review date |
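The required-sections table lends itself to an automated completeness check. The sketch below assumes each required section appears as an AsciiDoc section title (a line starting with two or more `=` characters) whose text begins with the section name from the table. The function name `missing_sections` and the prefix-matching rule are illustrative choices, not part of this standard.

```python
import re
from typing import List

# Section names taken from the Required Sections table.
REQUIRED_SECTIONS = [
    "Executive Summary", "Timeline", "Problem Statement", "Root Cause",
    "Contributing Factors", "Impact", "Resolution", "Preventive Measures",
    "Detection", "Lessons Learned", "Related", "Metadata",
]

# AsciiDoc section titles: two or more '=' followed by a space and the title.
# A single '=' is the document title and is deliberately not matched.
HEADING = re.compile(r"^={2,}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(adoc_text: str) -> List[str]:
    """Return required section names with no matching heading in the text."""
    titles = [title.lower() for title in HEADING.findall(adoc_text)]
    return [
        name for name in REQUIRED_SECTIONS
        # Prefix match so "Root Cause (5 Whys)" satisfies "Root Cause".
        if not any(title.startswith(name.lower()) for title in titles)
    ]


if __name__ == "__main__":
    draft = "= RCA-2024-03-01-001\n\n== Executive Summary\n\n== Timeline\n"
    print(missing_sections(draft))
```

This only confirms that headings exist; whether each section has meaningful content still needs human review.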
Compliance
| Check | Method | Pass Criterion |
|---|---|---|
| P1/P2 RCA exists | Every | No orphaned P1/P2 incidents |
| 5 Whys complete | RCA contains a 5 Whys table with at least 3 levels of depth | Root cause is systemic, not superficial |
| Prevention documented | Preventive Measures section has at least one short-term and one long-term action | Actions have owners and status |
| Pattern extracted | Lessons Learned references a pattern journal entry or standard update | Knowledge captured beyond the incident |
| Timely | RCA created within 5 business days of resolution for P1/P2 | No stale investigations |
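Two of the checks above ("5 Whys complete" and "Prevention documented") can be expressed over structured data. The following sketch assumes an RCA has already been parsed into a simple record; `Action`, `RcaRecord`, and `compliance_failures` are hypothetical names, and the "short-term" and "long-term" labels are an assumed encoding of the two horizons.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    description: str
    horizon: str      # assumed values: "short-term" or "long-term"
    owner: str = ""
    status: str = ""


@dataclass
class RcaRecord:
    whys: List[str] = field(default_factory=list)       # one entry per "why" level
    actions: List[Action] = field(default_factory=list)  # preventive measures


def compliance_failures(rca: RcaRecord) -> List[str]:
    """Return the names of the compliance checks this RCA fails."""
    failures = []
    # 5 Whys complete: at least 3 levels of depth.
    if len(rca.whys) < 3:
        failures.append("5 Whys complete")
    # Prevention documented: both horizons present, every action owned and tracked.
    horizons = {action.horizon for action in rca.actions}
    has_both = {"short-term", "long-term"} <= horizons
    all_tracked = all(a.owner.strip() and a.status.strip() for a in rca.actions)
    if not (has_both and all_tracked):
        failures.append("Prevention documented")
    return failures


if __name__ == "__main__":
    draft = RcaRecord(whys=["disk full", "logs not rotated"])
    print(compliance_failures(draft))
```

Counting levels is only a proxy: whether the final "why" is systemic rather than superficial remains a reviewer's judgment.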
Exceptions
P4 incidents do not require RCA unless they reveal a systemic pattern (3+ P4s with the same root cause category within 30 days).
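The systemic-pattern trigger can be detected with a sliding window over incident dates. This is a sketch under stated assumptions: the function name `systemic_p4_categories` is hypothetical, incidents are supplied as (date, root cause category) pairs, and "within 30 days" is read as the first and last of the three incidents being at most 30 days apart.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Iterable, List, Set, Tuple


def systemic_p4_categories(
    incidents: Iterable[Tuple[date, str]],
    threshold: int = 3,
    window_days: int = 30,
) -> Set[str]:
    """Return root cause categories with `threshold`+ P4s inside the window."""
    by_category: Dict[str, List[date]] = defaultdict(list)
    for occurred_on, category in incidents:
        by_category[category].append(occurred_on)

    flagged: Set[str] = set()
    window = timedelta(days=window_days)
    for category, dates in by_category.items():
        dates.sort()
        # Compare each incident with the one `threshold - 1` positions later;
        # if they fall within the window, so does everything between them.
        for i in range(len(dates) - threshold + 1):
            if dates[i + threshold - 1] - dates[i] <= window:
                flagged.add(category)
                break
    return flagged


if __name__ == "__main__":
    p4s = [
        (date(2024, 1, 1), "config-drift"),
        (date(2024, 1, 15), "config-drift"),
        (date(2024, 1, 30), "config-drift"),
        (date(2024, 1, 10), "dns"),
    ]
    print(systemic_p4_categories(p4s))
```

Any category returned here would need an RCA even though the individual incidents are P4.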
Related
- STD-005: Change Control. Changes made during resolution follow change control.
- STD-002: Deployment Validation. Validation failures trigger RCAs.
- RCA Patterns. Field notebook entries with dated examples.
- RCA Template. Starting point for new RCA documents.