STD-010: Root Cause Analysis

Every significant failure gets an RCA. The purpose is prevention, not blame. A fix without prevention is firefighting — the RCA format forces extraction of the transferable lesson.

Principles

  1. Prevention over fix. Every RCA MUST answer four questions: (1) What happened, (2) Why it happened, (3) What was fixed, (4) How to PREVENT it. The prevention step is the most valuable.

  2. 5 Whys to root cause. Surface symptoms are not root causes. Ask "why" until you reach a systemic gap — process, knowledge, tooling, or design.

  3. Blameless by design. RCAs document system failures, not human failures. "Operator error" is never a root cause — the system that allowed the error is.

  4. Timely capture. The RCA MUST be written within 5 business days of resolution for P1/P2 incidents. Context decays rapidly.

  5. Pattern extraction. Every RCA MUST produce at least one entry in the Pattern Journal or a standard update. The RCA is the investigation; the pattern is the transferable wisdom.
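
The 5-business-day window in principle 4 can be computed mechanically. The following is a minimal sketch, not a prescribed tool: it assumes business days are Monday through Friday and ignores public holidays, and the function names `rca_due_date` and `is_timely` are hypothetical.

```python
from datetime import date, timedelta


def rca_due_date(resolved_on: date, business_days: int = 5) -> date:
    """Return the date that falls `business_days` working days after resolution.

    Assumption: working days are Monday-Friday; holidays are not considered.
    """
    current = resolved_on
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def is_timely(resolved_on: date, rca_created_on: date) -> bool:
    """True if the RCA was created on or before its due date."""
    return rca_created_on <= rca_due_date(resolved_on)
```

For an incident resolved on a Friday, the due date under these assumptions is the following Friday, because the weekend days are skipped.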

Requirements

  1. Every P1 (Critical) and P2 (High) incident MUST have a corresponding RCA document.

  2. P3 incidents SHOULD have an RCA. P4 incidents MAY have one if the failure reveals a systemic issue.

  3. RCA documents MUST use the naming convention RCA-YYYY-MM-DD-NNN-slug.adoc, where the date is the incident date, NNN is sequential per day, and slug is a brief description.

  4. RCA documents MUST be filed in pages/case-studies/rca/.

  5. Every RCA MUST include: Executive Summary, Timeline, Problem Statement (symptoms + expected vs actual), Root Cause (5 Whys analysis + root cause statement), Contributing Factors, Impact (severity + duration + scope), Resolution (immediate actions + verification), Preventive Measures (short-term + long-term), Detection analysis, Lessons Learned, Related (cross-references), and Metadata.

  6. The Root Cause section MUST contain a 5 Whys analysis table AND a single-sentence root cause statement in a NOTE block.

  7. Preventive Measures MUST have assigned owners and tracked status.

  8. Every RCA MUST cross-reference the originating incident report (INC-) and any change records (CR-) created as a result.

  9. The Lessons Learned section MUST produce at least one Pattern Journal entry or standard update.
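
Requirement 7, together with the "Prevention documented" check under Compliance, lends itself to a simple automated check. The sketch below is illustrative only: the `PreventiveMeasure` record, its field names, and the horizon labels "short-term" and "long-term" are assumptions, not part of this standard.

```python
from dataclasses import dataclass


@dataclass
class PreventiveMeasure:
    description: str
    horizon: str  # assumed labels: "short-term" or "long-term"
    owner: str
    status: str


def preventive_measure_problems(measures: list[PreventiveMeasure]) -> list[str]:
    """List every way the measures fall short of the owner/status/horizon rules."""
    problems = []
    horizons = {m.horizon for m in measures}
    # At least one action is expected for each horizon.
    for required in ("short-term", "long-term"):
        if required not in horizons:
            problems.append(f"no {required} action")
    # Every action needs a non-empty owner and status.
    for m in measures:
        if not m.owner.strip():
            problems.append(f"missing owner: {m.description}")
        if not m.status.strip():
            problems.append(f"missing status: {m.description}")
    return problems
```

An empty result means the section passes; otherwise each string names one gap to close before the RCA moves to review.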

RCA Format

Naming

RCA-YYYY-MM-DD-NNN-slug.adoc — date of the incident (not the RCA creation date), sequential number, brief slug.
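
The naming rule can be enforced with a regular expression. This is a sketch under stated assumptions: NNN is taken to be exactly three digits and the slug is taken to be lowercase alphanumerics separated by hyphens, neither of which this standard spells out. The function name `parse_rca_filename` is hypothetical.

```python
import re
from datetime import date

# Assumed shape: three-digit sequence number, lowercase hyphenated slug.
RCA_NAME = re.compile(
    r"^RCA-(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"-(?P<seq>\d{3})-(?P<slug>[a-z0-9]+(?:-[a-z0-9]+)*)\.adoc$"
)


def parse_rca_filename(name: str):
    """Return the parsed parts of an RCA filename, or None if it is invalid."""
    m = RCA_NAME.match(name)
    if m is None:
        return None
    try:
        # Reject well-formed but impossible dates such as 2024-02-30.
        incident_date = date(int(m["year"]), int(m["month"]), int(m["day"]))
    except ValueError:
        return None
    return {"incident_date": incident_date, "seq": int(m["seq"]), "slug": m["slug"]}
```

Parsing the date rather than only pattern-matching it catches filenames that look right but carry a nonexistent calendar date.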

Required Sections

Section | Content
Executive Summary | Single paragraph: what happened, why, what was done, outcome
Timeline | Chronological table: detection → investigation → identification → fix → verification
Problem Statement | Symptoms, expected behavior, actual behavior
Root Cause (5 Whys) | Table drilling from symptom to systemic cause + single-sentence root cause statement
Contributing Factors | Table: factor, description, preventable (yes/no)
Impact | Severity (P1-P4), duration, users/systems affected, data loss, business impact
Resolution | Immediate actions with commands + verification commands
Preventive Measures | Short-term (this week) + long-term (this quarter), with owner and status
Detection | How detected, detection gap analysis (could we have caught this earlier?)
Lessons Learned | What went well, what could be improved, key takeaways
Related | xrefs to incident report, change records, codex entries, pattern journal
Metadata | RCA ID, author, dates, status (Draft/In Review/Final), review date
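
The presence of the required sections can be checked by scanning an RCA file's headings. The sketch below assumes that each section appears as an AsciiDoc heading line (one or more `=` characters followed by the title) and that a heading counts if it starts with the section name, so "Root Cause (5 Whys)" satisfies "Root Cause". The function name `missing_sections` is hypothetical.

```python
import re

REQUIRED_SECTIONS = [
    "Executive Summary", "Timeline", "Problem Statement", "Root Cause",
    "Contributing Factors", "Impact", "Resolution", "Preventive Measures",
    "Detection", "Lessons Learned", "Related", "Metadata",
]

# An AsciiDoc heading: leading '=' markers, spaces or tabs, then the title text.
HEADING = re.compile(r"^=+[ \t]+(.*\S)[ \t]*$", re.MULTILINE)


def missing_sections(adoc_text: str) -> list[str]:
    """Return the required section names that have no matching heading."""
    headings = [h.lower() for h in HEADING.findall(adoc_text)]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(h.startswith(section.lower()) for h in headings)
    ]
```

This only confirms that a heading exists; it does not judge whether the section's content is adequate, which remains a review task.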

Compliance

Check | Method | Pass Criterion
P1/P2 RCA exists | Every INC- with severity P1 or P2 has a corresponding RCA- with matching date | No orphaned P1/P2 incidents
5 Whys complete | RCA contains a 5 Whys table with at least 3 levels of depth | Root cause is systemic, not superficial
Prevention documented | Preventive Measures section has at least one short-term and one long-term action | Actions have owners and status
Pattern extracted | Lessons Learned references a pattern journal entry or standard update | Knowledge captured beyond the incident
Timely | RCA created within 5 business days of resolution for P1/P2 | No stale investigations
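
The "P1/P2 RCA exists" check can be sketched as a date match between incidents and RCAs. Everything named here is an assumption for illustration: the `Incident` record, its fields, and the idea that RCA dates have already been extracted (for example, from filenames).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    incident_id: str  # hypothetical identifier carrying the INC- prefix
    severity: str     # "P1" through "P4"
    occurred_on: date


def orphaned_incidents(incidents: list[Incident], rca_dates: list[date]) -> list[Incident]:
    """Return P1/P2 incidents with no RCA carrying the same incident date."""
    covered = set(rca_dates)
    return [
        inc
        for inc in incidents
        if inc.severity in {"P1", "P2"} and inc.occurred_on not in covered
    ]
```

Matching on date alone cannot tell apart two P1/P2 incidents from the same day; where that matters, matching on the incident cross-reference required in each RCA would be stricter.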

Exceptions

P4 incidents do not require an RCA unless they reveal a systemic pattern (3+ P4s with the same root cause category within 30 days).
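
The systemic-pattern threshold can be detected with a sliding window per root cause category. This sketch assumes "within 30 days" means that the earliest and latest of three incidents are no more than 30 days apart, and that each P4 incident has already been tagged with a category; the function name `systemic_p4_categories` is hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def systemic_p4_categories(p4_incidents, threshold=3, window_days=30):
    """Return categories with `threshold` or more P4s inside any window.

    `p4_incidents` is an iterable of (occurred_on, category) pairs.
    """
    by_category = defaultdict(list)
    for occurred_on, category in p4_incidents:
        by_category[category].append(occurred_on)

    window = timedelta(days=window_days)
    flagged = set()
    for category, dates in by_category.items():
        dates.sort()
        # Pair each date with the one (threshold - 1) positions later; if that
        # pair fits in the window, so does every incident between them.
        for start, end in zip(dates, dates[threshold - 1:]):
            if end - start <= window:
                flagged.add(category)
                break
    return flagged
```

Any category returned by this check would indicate that the P4 exception no longer applies and an RCA is warranted.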