# Retry Strategies
Retrying is not always the right move. A retry on the wrong error doubles the damage. This page covers when to retry, how to retry safely, and how to stop retrying when the situation is unrecoverable.
## Which errors to retry
| HTTP code | Retry? | Reasoning |
|---|---|---|
| 429 | Yes | Rate limited — the server is telling you to slow down and try again |
| 500 | Yes (with caution) | Internal server error — may be transient |
| 502 | Yes | Bad gateway — upstream server may recover |
| 503 | Yes | Service unavailable — server is overloaded or restarting |
| 504 | Yes | Gateway timeout — upstream was too slow |
| 400 | No | Bad request — your payload is wrong; retrying sends the same bad data |
| 401 | No (refresh token first) | Unauthorized — retry only after re-authenticating |
| 403 | No | Forbidden — you lack permission; retrying will not grant it |
| 404 | No | Not found — the resource does not exist |
| 409 | Maybe | Conflict — depends on whether the conflict is transient (concurrent edit) or permanent (duplicate key) |
| Network timeout | Yes | The request may not have reached the server |
| Connection refused | Yes (limited) | Server may be starting up — but give up quickly |
The rule: retry on server errors (5xx) and rate limits (429). Never retry on client errors (4xx) without fixing the request first.
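The rule can be sketched as a small classifier function. This is a hypothetical helper, not part of any standard tooling; 401 and 409 are excluded because they need re-authentication or conflict inspection before any retry makes sense:

```shell
#!/usr/bin/env bash
# should_retry CODE -> exit 0 if the status code is worth retrying, 1 otherwise.
should_retry() {
  case "$1" in
    429|5[0-9][0-9]) return 0 ;;  # rate limits and server errors: retry
    *)               return 1 ;;  # client errors and everything else: do not
  esac
}
```

Usage: `should_retry "$http_code" && sleep "$delay"`.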
## Exponential backoff with jitter
Pure exponential backoff (1s, 2s, 4s, 8s…) causes thundering herds — all clients retry at the same moment. Jitter randomizes the delay to spread retries over time.
```bash
#!/usr/bin/env bash
# Exponential backoff with full jitter
retry_with_jitter() {
  local url="$1"
  local method="${2:-GET}"
  local data="$3"
  local max_retries=5
  local base_delay=1
  local max_delay=30
  local attempt=0
  local response http_code body ceiling delay

  while (( attempt <= max_retries )); do
    if [[ "$method" == "GET" ]]; then
      response=$(curl -sw '\n%{http_code}' "$url")
    else
      response=$(curl -sw '\n%{http_code}' -X "$method" -d "$data" \
        -H "Content-Type: application/json" "$url")
    fi
    http_code=$(echo "$response" | tail -1)
    body=$(echo "$response" | sed '$d')

    case "$http_code" in
      2[0-9][0-9])
        echo "$body"
        return 0
        ;;
      000|429|5[0-9][0-9])
        # 000 means curl never got a response (timeout, connection refused) --
        # per the table above, these are retryable
        attempt=$((attempt + 1))
        if (( attempt > max_retries )); then
          echo "Exhausted retries after ${max_retries} attempts" >&2
          return 1
        fi
        # Full jitter: random value between 0 and the exponential ceiling
        ceiling=$(( base_delay * (2 ** (attempt - 1)) ))
        (( ceiling > max_delay )) && ceiling=$max_delay
        delay=$(( RANDOM % (ceiling + 1) ))
        echo "Retry ${attempt}/${max_retries}: HTTP ${http_code}, waiting ${delay}s" >&2
        sleep "$delay"
        ;;
      *)
        echo "$body"
        return 1
        ;;
    esac
  done
}
```
Three jitter strategies exist:

| Strategy | Formula |
|---|---|
| Full jitter | `delay = random(0, min(cap, base * 2^attempt))` |
| Equal jitter | `delay = ceiling/2 + random(0, ceiling/2)` where `ceiling = min(cap, base * 2^attempt)` |
| Decorrelated jitter | `delay = min(cap, random(base, previous_delay * 3))` |
Full jitter produces the best spread in practice. AWS recommends it in their architecture blog.
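The three formulas differ only in how much of the exponential ceiling is randomized. A rough sketch of each, with arbitrary example values for the base, cap, and attempt number:

```shell
#!/usr/bin/env bash
base=1 cap=30 attempt=4

exp=$(( base * (2 ** attempt) ))   # exponential ceiling: 16
(( exp > cap )) && exp=$cap

# Full jitter: anywhere in [0, ceiling]
full=$(( RANDOM % (exp + 1) ))

# Equal jitter: half guaranteed, half randomized -> [ceiling/2, ceiling]
equal=$(( exp / 2 + RANDOM % (exp / 2 + 1) ))

# Decorrelated jitter: range depends on the previous delay, not the attempt
prev=${prev:-$base}
decor=$(( base + RANDOM % (prev * 3 - base + 1) ))
(( decor > cap )) && decor=$cap

echo "full=${full}s equal=${equal}s decorrelated=${decor}s"
```

Full jitter can pick very short delays (even 0s), which is exactly what spreads a herd of synchronized clients apart; equal jitter trades some of that spread for a guaranteed minimum wait.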
## Circuit breaker pattern
A circuit breaker stops calling a failing endpoint after repeated failures. It prevents wasting time and quota on an endpoint that is clearly down.
Three states:
| State | Behavior |
|---|---|
| Closed | Normal operation — requests pass through |
| Open | Too many failures — requests fail immediately without calling the API |
| Half-open | After a cooldown, allow one test request to check if the service recovered |
```bash
#!/usr/bin/env bash
# Simple circuit breaker using a state file (assumes API_URL is set)
CIRCUIT_FILE="/tmp/circuit-$(echo "$API_URL" | md5sum | cut -d' ' -f1)"
FAILURE_THRESHOLD=5
COOLDOWN_SECONDS=60

circuit_call() {
  local url="$1"

  # Check circuit state
  if [[ -f "$CIRCUIT_FILE" ]]; then
    local failures opened_at now
    failures=$(awk 'NR==1' "$CIRCUIT_FILE")
    opened_at=$(awk 'NR==2' "$CIRCUIT_FILE")
    now=$(date +%s)
    if (( failures >= FAILURE_THRESHOLD )); then
      local elapsed=$(( now - opened_at ))
      if (( elapsed < COOLDOWN_SECONDS )); then
        echo "Circuit OPEN: ${elapsed}s/${COOLDOWN_SECONDS}s cooldown" >&2
        return 1
      fi
      # Half-open: try one request
      echo "Circuit HALF-OPEN: testing..." >&2
    fi
  fi

  # Make the request once, capturing body and status together
  local response http_code body
  response=$(curl -sw '\n%{http_code}' "$url")
  http_code=$(echo "$response" | tail -1)
  body=$(echo "$response" | sed '$d')

  case "$http_code" in
    2[0-9][0-9])
      # Success: reset circuit
      rm -f "$CIRCUIT_FILE"
      echo "$body"
      return 0
      ;;
    429|5[0-9][0-9])
      # Failure: increment counter
      local current_failures=0
      [[ -f "$CIRCUIT_FILE" ]] && current_failures=$(awk 'NR==1' "$CIRCUIT_FILE")
      current_failures=$((current_failures + 1))
      printf '%d\n%d\n' "$current_failures" "$(date +%s)" > "$CIRCUIT_FILE"
      echo "Circuit failure ${current_failures}/${FAILURE_THRESHOLD}: HTTP ${http_code}" >&2
      return 1
      ;;
    *)
      echo "HTTP ${http_code}: client error, not counted as circuit failure" >&2
      return 1
      ;;
  esac
}
```
For production automation, use a proper circuit breaker library. The shell implementation above demonstrates the concept for understanding and quick scripts.
## Idempotency keys for safe retries
GET requests are inherently idempotent — repeating them has no side effect. PUT and DELETE are also idempotent by the HTTP specification, provided the server implements them correctly. POST is not: if you retry a POST that actually succeeded (but whose response was lost to a network error), you create a duplicate.
Idempotency keys solve this. You generate a unique key for each logical operation and send it as a header. The server deduplicates.
```bash
# Generate an idempotency key
idempotency_key=$(uuidgen)

# Use it on a POST request
curl -s -X POST "$API_URL/payments" \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: $idempotency_key" \
  -d '{"amount": 1000, "currency": "USD"}'

# Safe to retry with the SAME key -- server returns the cached response
curl -s -X POST "$API_URL/payments" \
  -H "Content-Type: application/json" \
  -H "Idempotency-Key: $idempotency_key" \
  -d '{"amount": 1000, "currency": "USD"}'
```
Providers that support idempotency keys:
- Stripe: `Idempotency-Key` header
- AWS: built into SDK retry logic
- Shopify: `Idempotency-Key` header
- Many payment and financial APIs
If the API does not support idempotency keys, use a read-before-write pattern:
```bash
# Check if the resource already exists before creating
existing=$(curl -s "${API_URL}/users?email=user@example.com" | jq '.data | length')
if (( existing == 0 )); then
  curl -s -X POST "$API_URL/users" -d '{"email": "user@example.com"}'
fi
```
This is not perfectly safe (race condition between check and create), but it prevents the most common duplicate scenario.
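If the server enforces uniqueness itself and returns 409 on a duplicate, a safer variant is to attempt the create unconditionally and treat a conflict as "already done". This is a sketch against a hypothetical `/users` endpoint, not a specific API:

```shell
#!/usr/bin/env bash
# create_user URL JSON -> succeeds if created (2xx) or already present (409)
create_user() {
  local url="$1" payload="$2" http_code
  http_code=$(curl -so /dev/null -w '%{http_code}' \
    -X POST -H "Content-Type: application/json" -d "$payload" "$url")
  case "$http_code" in
    2[0-9][0-9]) echo "created" ;;
    409)         echo "already exists" ;;   # duplicate: treated as success
    *)           echo "failed: HTTP $http_code" >&2; return 1 ;;
  esac
}
```

Unlike read-before-write, this has no race window: the uniqueness check and the write happen atomically on the server.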
## Maximum retry limits
Always set a ceiling. Without one, a persistent 500 error produces infinite retries.
Reasonable defaults:
| Scenario | Max retries | Reasoning |
|---|---|---|
| Interactive CLI | 3 | User is waiting — fail fast with a clear message |
| Background automation | 5-7 | More tolerance, but still bounded |
| Critical write operation | 3 (with idempotency key) | Fewer retries, but safe to retry |
| Batch processing | 10 (with circuit breaker) | Let the circuit breaker handle sustained failures |
After exhausting retries, log the failure with enough context to diagnose: HTTP code, response body, URL, timestamp, and attempt count.
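That final log entry is easiest to consume later if it is a single structured record. A sketch using `jq` to build the JSON safely (the `log_failure` helper and its field names are illustrative, not a standard):

```shell
#!/usr/bin/env bash
# log_failure URL CODE BODY ATTEMPTS -> one JSON line on stderr
log_failure() {
  jq -cn \
    --arg url "$1" --arg code "$2" --arg body "$3" --argjson attempts "$4" \
    '{ts: (now | todate), url: $url, http_code: $code,
      body: $body, attempts: $attempts}' >&2
}
```

Usage: `log_failure "$API_URL/users" "$http_code" "$body" "$max_retries"`. Building the record with `--arg` instead of string interpolation means a response body full of quotes or newlines cannot corrupt the log line.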