Site Reliability Engineering

Body of Knowledge

| Topic | Description | Relevance | Career Tracks |
|---|---|---|---|
| SRE Fundamentals | SRE principles, DevOps vs SRE, toil reduction, automation, reliability engineering culture. | Critical | SRE, Platform Engineer |
| SLIs/SLOs/SLAs | Service level indicators, objectives, agreements, defining reliability targets, measurement strategies. | Critical | SRE, Product Manager |
| Error Budgets | Error budget concept, budget-based decisions, feature velocity vs reliability tradeoff, burn rate. | Critical | SRE, Engineering Manager |
| Incident Management | On-call, paging, incident response, severity levels, incident commander, communication, postmortems. | Critical | SRE, DevOps |
| Chaos Engineering | Fault injection, Chaos Monkey, Litmus, steady state hypothesis, blast radius, game days. | Medium | SRE, Platform Engineer |
| Capacity Planning | Load forecasting, resource planning, autoscaling, performance testing, cost optimization. | High | SRE, Platform Engineer |
| Runbooks and Documentation | Operational runbooks, escalation procedures, troubleshooting guides, knowledge management. | High | SRE, DevOps |
| Blameless Postmortems | Postmortem process, timeline reconstruction, root cause analysis, action items, learning culture. | Critical | SRE, Engineering Manager |
| Reliability Patterns | Circuit breakers, retries with backoff, bulkheads, timeouts, graceful degradation, load shedding. | High | SRE, Backend Developer |
| Production Excellence | Production readiness reviews, launch checklists, operational maturity, handoff processes. | High | SRE, Platform Engineer |
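
The Error Budgets topic above lists the error budget concept and burn rate without showing the arithmetic. The sketch below is one common way to compute them for a request-based SLO. It is illustrative only: the function names (`error_budget`, `burn_rate`, `budget_remaining`) and the sample numbers are assumptions, not part of this competency matrix or of any particular monitoring tool.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail under the SLO (0.999 -> about 0.001)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the budgeted error ratio.

    A value of 1.0 means the budget would be used up exactly at the end of
    the SLO window; values above 1.0 mean it runs out early.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return (failed / total) / error_budget(slo_target)


def budget_remaining(failed: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent for the requests seen so far."""
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = total * error_budget(slo_target)
    return 1.0 - failed / allowed_failures


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% SLO, one million requests, 250 failures.
    rate = burn_rate(failed=250, total=1_000_000, slo_target=0.999)
    left = budget_remaining(failed=250, total=1_000_000, slo_target=0.999)
    print(f"burn rate: {rate:.2f}, budget remaining: {left:.0%}")
```

With these sample numbers the service is burning budget at roughly a quarter of the sustainable rate, which under a budget-based policy would typically favor shipping features over reliability work. Real alerting setups usually evaluate burn rate over several time windows rather than a single aggregate.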

Personal Status

| Topic | Level | Evidence | Active Projects | Gaps |
|---|---|---|---|---|
| No personal status recorded | | | | |