Incident Response Template

Template

= [INCIDENT-YYYY-MM-DD] Brief Description
:description: Incident summary
:severity: P1 / P2 / P3 / P4
:status: investigating / identified / monitoring / resolved
:revdate: YYYY-MM-DD

== Incident Summary

[cols="1,2"]
|===
| Field | Value

| Detected
| YYYY-MM-DD HH:MM TZ

| Resolved
| YYYY-MM-DD HH:MM TZ (or "Ongoing")

| Duration
| X hours Y minutes

| Severity
| P1 (Critical) / P2 (High) / P3 (Medium) / P4 (Low)

| Impact
| Brief description of user/business impact

| Root Cause
| TBD / Brief description
|===

== Timeline

[cols="1,3"]
|===
| Time | Event

| HH:MM
| Issue first reported / detected

| HH:MM
| Initial investigation started

| HH:MM
| Root cause identified

| HH:MM
| Fix implemented

| HH:MM
| Monitoring confirmed resolution
|===

== Symptoms

What did users/systems experience?

* Symptom 1
* Symptom 2
* Symptom 3

== Investigation

=== Initial Triage

* [ ] Check monitoring dashboards
* [ ] Review recent changes (deploys, config changes)
* [ ] Check dependent services
* [ ] Review error logs

=== Diagnostic Commands

```bash
# Check service status
systemctl status <service>

# View recent logs
journalctl -u <service> --since "1 hour ago"

# Check connectivity
ping <host>
curl -v <endpoint>
```

=== Findings

Document what you discovered during investigation.

== Root Cause

Explain the underlying cause of the incident.

== Resolution

=== Immediate Fix

What was done to restore service?

```bash
# Commands executed
```

=== Verification

* [ ] Service responding normally
* [ ] Monitoring shows green
* [ ] Users confirmed functionality restored
* [ ] No error spikes in logs

== Prevention

=== Short-term

* [ ] Action item 1
* [ ] Action item 2

=== Long-term

* [ ] Action item 1
* [ ] Action item 2

== Lessons Learned

* What went well?
* What could be improved?
* What will we do differently?

== Related

* Link to monitoring dashboard
* Link to runbook used
* Link to related incidents

Severity Levels

Level Description Response Time

P1 - Critical

Complete outage, data loss risk, security breach

Immediate, all hands

P2 - High

Major feature broken, significant user impact

< 1 hour

P3 - Medium

Minor feature broken, workaround available

< 4 hours

P4 - Low

Cosmetic, minor inconvenience

Next business day

Communication Templates

Initial Notification

Subject: [INCIDENT] Brief description - Investigating

We are aware of an issue affecting [service/feature].

Impact: [description]
Status: Investigating

We will provide updates every [30 minutes / 1 hour].

Resolution Notification

Subject: [RESOLVED] Brief description

The issue affecting [service/feature] has been resolved.

Duration: X hours Y minutes
Root Cause: [brief explanation]
Resolution: [what was done]

We apologize for any inconvenience.