k3s Operations & Maintenance
Production operations runbook for k3s clusters. Commands favor structured output (awk, jq, custom-columns) over raw listings, so results can be scanned or piped without further massaging.
1. Network Diagnostics
1.1. Host Network State
Interface summary:
ip -4 -o addr show | awk '$2 != "lo" {for (i=1; i<NF; i++) if ($i == "scope") s=$(i+1); print $2": "$4" (scope:"s")"}'
eth0: 10.50.1.120/24 (scope:global)
NetworkManager connections:
nmcli -t -f NAME,DEVICE,STATE conn show | awk -F: '{printf "%-20s %-10s %s\n", $1, $2, $3}'
Default gateway:
ip route | awk '/^default/ {print "Gateway: "$3" dev "$5}'
DNS configuration:
awk '/^nameserver/{print "DNS: "$2} /^search/{print "Search: "$2}' /etc/resolv.conf
Connectivity matrix:
for target in 10.50.1.1 10.50.1.60 10.50.1.90 8.8.8.8; do
ping -c1 -W1 $target &>/dev/null && echo "$target: OK" || echo "$target: FAIL"
done
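ICMP is often filtered on production networks, so a clean ping matrix can lie in both directions. A TCP-level variant using bash's built-in /dev/tcp is a useful complement; the hosts and ports below are illustrative, not part of this runbook's inventory:

```shell
#!/usr/bin/env bash
# TCP-level connectivity check: works where ICMP is filtered.
# Hosts/ports are examples -- substitute your own targets.
check_tcp() {
  local host=$1 port=$2
  if timeout 1 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port}: OK"
  else
    echo "${host}:${port}: FAIL"
  fi
}
# check_tcp 10.50.1.90 6443   # e.g. the k3s API server on this lab network
check_tcp 127.0.0.1 1          # closed port: prints FAIL
```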
1.2. Fix NetworkManager Connection
Delete stale connections:
sudo nmcli conn delete "Wired connection 1" 2>/dev/null
sudo nmcli conn delete "cloud-init eth0" 2>/dev/null
Create static connection:
sudo nmcli conn add con-name eth0 type ethernet ifname eth0 \
ipv4.method manual \
ipv4.addresses 10.50.1.120/24 \
ipv4.gateway 10.50.1.1 \
ipv4.dns "10.50.1.90,10.50.1.91" \
autoconnect yes
Activate:
sudo nmcli conn up eth0
Verify:
nmcli -t conn show --active | awk -F: '{print $1": "$3}'
2. kubectl Advanced Patterns
2.1. Node Operations
Node status (custom-columns):
kubectl get nodes -o custom-columns='NAME:.metadata.name,READY:.status.conditions[-1].status,VERSION:.status.nodeInfo.kubeletVersion'
Node capacity (jq):
kubectl get nodes -o json | jq -r '.items[] | "CPU: \(.status.capacity.cpu) | Memory: \(.status.capacity.memory) | Pods: \(.status.capacity.pods)"'
Node conditions (jq deep dive):
kubectl get nodes -o json | jq -r '.items[].status.conditions[] | "\(.type): \(.status) (\(.reason))"'
Node labels (filtered):
kubectl get nodes -o json | jq -r '.items[] | .metadata.labels | to_entries[] | "\(.key)=\(.value)"' | grep -v "kubernetes.io"
2.2. Pod Operations
All pods (custom-columns):
kubectl get pods -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName,IP:.status.podIP'
Pod status matrix (awk pivot table):
kubectl get pods -A --no-headers | awk '
{
ns[$1]++
status[$4]++
}
END {
print "=== By Namespace ==="
for(n in ns) printf "%-20s %d\n", n, ns[n]
print "\n=== By Status ==="
for(s in status) printf "%-15s %d\n", s, status[s]
}'
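The pivot pattern above is easy to verify offline before pointing it at a cluster. Here it runs against canned `kubectl get pods -A --no-headers` lines (made-up pod names, not from a real cluster):

```shell
# Exercise the namespace/status pivot on canned --no-headers output.
# Columns: NAMESPACE NAME READY STATUS RESTARTS AGE, so $1=namespace, $4=status.
sample='kube-system cilium-abc12 1/1 Running 0 3d
kube-system coredns-xyz34 1/1 Running 2 3d
default web-5d9 0/1 CrashLoopBackOff 12 1d'
printf '%s\n' "$sample" | awk '
{ ns[$1]++; status[$4]++ }
END {
  for (n in ns)     printf "ns %s=%d\n", n, ns[n]
  for (s in status) printf "status %s=%d\n", s, status[s]
}'
```

Note that awk's `for (key in array)` iteration order is unspecified, so pipe through `sort` if stable output matters.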
Unhealthy pods only (jq):
kubectl get pods -A -o json | jq -r '.items[] | select(.status.phase != "Running" and .status.phase != "Succeeded") | "\(.metadata.namespace)/\(.metadata.name): \(.status.phase)"'
Pod restart counts (jq):
kubectl get pods -A -o json | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name): \([.status.containerStatuses[]?.restartCount] | add // 0) restarts"' | grep -v ": 0 restarts"
Pods with resource limits (jq):
kubectl get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits != null)) | "\(.metadata.namespace)/\(.metadata.name)"'
2.3. Service Operations
All services (custom-columns):
kubectl get svc -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,TYPE:.spec.type,CLUSTER-IP:.spec.clusterIP,PORTS:.spec.ports[*].port'
Services with external access:
kubectl get svc -A -o json | jq -r '.items[] | select(.spec.type == "LoadBalancer" or .spec.type == "NodePort") | "\(.metadata.namespace)/\(.metadata.name): \(.spec.type)"'
2.4. Deployment Operations
Deployment status:
kubectl get deploy -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,READY:.status.readyReplicas,DESIRED:.spec.replicas,AVAILABLE:.status.availableReplicas'
Deployments not fully ready:
kubectl get deploy -A -o json | jq -r '.items[] | select(.status.readyReplicas != .spec.replicas) | "\(.metadata.namespace)/\(.metadata.name): \(.status.readyReplicas // 0)/\(.spec.replicas)"'
3. Cilium Operations
3.1. Status Checks
Cilium pod status (kubectl fallback if the cilium CLI panics):
kubectl get pods -n kube-system -l k8s-app=cilium -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,READY:.status.containerStatuses[0].ready,NODE:.spec.nodeName'
Cilium operator status:
kubectl get pods -n kube-system -l name=cilium-operator -o custom-columns='NAME:.metadata.name,STATUS:.status.phase'
Cilium DaemonSet health:
kubectl get ds -n kube-system cilium -o json | jq -r '"Desired: \(.status.desiredNumberScheduled) | Ready: \(.status.numberReady) | Available: \(.status.numberAvailable)"'
3.2. Hubble Operations
Hubble relay status:
kubectl get pods -n kube-system -l k8s-app=hubble-relay -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,READY:.status.containerStatuses[0].ready'
Real-time flow observation:
hubble observe --last 10
Dropped traffic only:
hubble observe --verdict DROPPED --last 20
Traffic by namespace:
hubble observe --namespace kube-system --last 10
HTTP traffic:
hubble observe --protocol http --last 10
4. System Operations
4.1. k3s Service
Service status (structured):
systemctl show k3s --property=ActiveState,SubState,MainPID | awk -F= '{print $1": "$2}'
Recent k3s logs:
sudo journalctl -u k3s --since "5 minutes ago" --no-pager | tail -30
k3s errors only:
sudo journalctl -u k3s --since "1 hour ago" --no-pager | grep -iE 'error|fail|warn' | tail -20
4.2. Resource Usage
Node resource usage (metrics-server required):
kubectl top nodes 2>/dev/null || echo "metrics-server not installed"
Pod resource usage:
kubectl top pods -A 2>/dev/null | sort -k3 -rh | head -10
Host memory/CPU:
free -h | awk 'NR<=2 {print}'
uptime | awk -F'load average:' '{print "Load:"$2}'
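Raw load averages are only meaningful relative to core count; a 1-minute load of 4.0 is idle on a 16-core box and saturated on a 2-core one. A quick per-core normalization, reading Linux's /proc/loadavg (threshold of 1.0 per core is a rule of thumb, not a hard limit):

```shell
# Normalize the 1-minute load average by core count (Linux-specific: /proc/loadavg).
cores=$(nproc)
awk -v c="$cores" '{
  printf "load1=%s per-core=%.2f (%s)\n", $1, $1 / c, ($1 / c > 1.0 ? "saturated" : "ok")
}' /proc/loadavg
```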
4.3. Certificate Management
Check k3s certificates:
sudo sh -c 'for c in /var/lib/rancher/k3s/server/tls/*.crt; do echo "=== $c ==="; openssl x509 -in "$c" -noout -dates 2>/dev/null; done'
Certificate expiry (days remaining):
sudo sh -c 'for cert in /var/lib/rancher/k3s/server/tls/*.crt; do
  exp=$(openssl x509 -in "$cert" -noout -enddate 2>/dev/null | cut -d= -f2)
  [ -n "$exp" ] && {
    days=$(( ($(date -d "$exp" +%s) - $(date +%s)) / 86400 ))
    printf "%-50s %d days\n" "$(basename "$cert")" "$days"
  }
done' | sort -k2 -n
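The days-remaining arithmetic can be sanity-checked standalone with fixed dates before trusting it against real certificates (GNU `date -d`; the dates below are invented for the check):

```shell
# Sanity-check the expiry arithmetic with fixed GMT timestamps (GNU date).
# openssl prints enddate values in this format, e.g. "Jan  1 00:00:00 2030 GMT".
exp="Jan  1 00:00:00 2030 GMT"
now="Jan  1 00:00:00 2025 GMT"
days=$(( ($(date -d "$exp" +%s) - $(date -d "$now" +%s)) / 86400 ))
echo "$days days"   # -> 1826 days (the span includes leap year 2028)
```

Using GMT on both sides avoids off-by-one surprises from DST transitions in the local timezone.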
5. Troubleshooting
5.1. Node NotReady
Check node conditions:
kubectl get nodes -o json | jq -r '.items[].status.conditions[] | select(.status != "False" or .type == "Ready") | "\(.type): \(.status) - \(.message)"'
Check kubelet:
sudo systemctl status k3s | head -10
Check CNI (Cilium):
kubectl get pods -n kube-system -l k8s-app=cilium --no-headers | awk '{print $1": "$3}'
5.2. Pod CrashLoopBackOff
Find crashing pods:
kubectl get pods -A --no-headers | awk '$4 ~ /CrashLoop|Error/ {print $1"/"$2": "$4}'
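The `$4` filter depends on the column layout of `--no-headers` output (NAMESPACE NAME READY STATUS RESTARTS AGE); here it is exercised on canned lines with made-up pod names:

```shell
# STATUS-column filter on canned `kubectl get pods -A --no-headers` output.
printf '%s\n' \
  'default web-abc 0/1 CrashLoopBackOff 14 2d' \
  'default api-def 1/1 Running 0 2d' \
  'monitoring prom-ghi 0/1 Error 3 1d' |
awk '$4 ~ /CrashLoop|Error/ {print $1"/"$2": "$4}'
# -> default/web-abc: CrashLoopBackOff
# -> monitoring/prom-ghi: Error
```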
Get pod events:
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
Pod logs (last crash):
kubectl logs -n <namespace> <pod-name> --previous 2>/dev/null | tail -50
5.3. DNS Issues
Test DNS resolution from pod:
kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never -- nslookup kubernetes.default
Check CoreDNS pods:
kubectl get pods -n kube-system -l k8s-app=kube-dns -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,READY:.status.containerStatuses[0].ready'
5.4. Network Connectivity
Test pod-to-pod connectivity:
kubectl run -it --rm nettest --image=busybox:1.36 --restart=Never -- wget -qO- --timeout=2 http://<service-name>.<namespace>.svc.cluster.local
Check Cilium connectivity:
cilium connectivity test 2>/dev/null || echo "Run from host with cilium CLI"
6. Quick Reference
6.1. One-Liners
| Task | Command |
|---|---|
| Count pods per namespace | `kubectl get pods -A --no-headers \| awk '{print $1}' \| sort \| uniq -c \| sort -rn` |
| Find pods with most restarts | `kubectl get pods -A --no-headers \| sort -k5 -rn \| head -10` |
| Get pod IPs | `kubectl get pods -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,IP:.status.podIP'` |
| Find images in use | `kubectl get pods -A -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' \| sort -u` |
| Node taints | `kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'` |
| Events (errors only) | `kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' \| tail -20` |
7. Backup and Recovery
7.1. Architecture: Single-Node vs HA
k3s single-node uses SQLite, NOT etcd.
Check your configuration:
sudo ls /var/lib/rancher/k3s/server/db/
A state.db file means SQLite (single-node); an etcd/ directory means embedded etcd (HA).
7.2. Backup Locations
| Component | Path | Description |
|---|---|---|
| SQLite state (single-node) | `/var/lib/rancher/k3s/server/db/state.db` | Kubernetes state database |
| Manifests | `/var/lib/rancher/k3s/server/manifests/` | Static pod definitions, Cilium YAML |
| TLS certificates | `/var/lib/rancher/k3s/server/tls/` | API server, kubelet, service account certs |
| Node token | `/var/lib/rancher/k3s/server/token` | Join token for additional nodes |
7.3. Manual Backup to NAS
Mount NAS share:
sudo mkdir -p /mnt/k3s_backups
sudo mount -t nfs nas-01:/volume1/k3s_backups /mnt/k3s_backups
Backup SQLite database:
sudo mkdir -p /mnt/k3s_backups/etcd
sudo cp -v /var/lib/rancher/k3s/server/db/state.db /mnt/k3s_backups/etcd/state.db.$(date +%Y%m%d)
Backup manifests, TLS, token:
sudo cp -rv /var/lib/rancher/k3s/server/manifests /mnt/k3s_backups/
sudo cp -rv /var/lib/rancher/k3s/server/tls /mnt/k3s_backups/
sudo cp -v /var/lib/rancher/k3s/server/token /mnt/k3s_backups/
Verify backup:
ls -la /mnt/k3s_backups/
ls -la /mnt/k3s_backups/etcd/
ls -la /mnt/k3s_backups/manifests/
Unmount:
sudo umount /mnt/k3s_backups
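Listing files confirms the copies exist, but not that they are intact. Comparing checksums catches truncated or corrupted copies; the sketch below demonstrates the pattern on a scratch file (in practice, compare the source state.db against the copy on the NAS):

```shell
# Verify a copy bit-for-bit with sha256sum, demonstrated on a scratch file.
tmp=$(mktemp -d)
echo "fake sqlite payload" > "$tmp/state.db"
cp "$tmp/state.db" "$tmp/state.db.backup"
src=$(sha256sum "$tmp/state.db" | awk '{print $1}')
dst=$(sha256sum "$tmp/state.db.backup" | awk '{print $1}')
[ "$src" = "$dst" ] && echo "checksum OK" || echo "checksum MISMATCH"
rm -rf "$tmp"
```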
7.4. Automated Backup (Cron)
Create backup script:
sudo tee /usr/local/bin/k3s-backup.sh << 'EOF'
#!/bin/bash
set -e
NAS_SHARE="nas-01:/volume1/k3s_backups"
MOUNT_POINT="/mnt/k3s_backups"
DATE=$(date +%Y%m%d-%H%M)
K3S_SERVER="/var/lib/rancher/k3s/server"
# Mount if not mounted
mountpoint -q "$MOUNT_POINT" || mount -t nfs "$NAS_SHARE" "$MOUNT_POINT"
mkdir -p "$MOUNT_POINT/etcd"
# Backup SQLite
cp -v "$K3S_SERVER/db/state.db" "$MOUNT_POINT/etcd/state.db.$DATE"
# Backup configs once per day: the 00:00 run of the 6-hour cron
# (a check against hour 02 would never fire with "0 */6" scheduling)
if [[ $(date +%H) == "00" ]]; then
  cp -rv "$K3S_SERVER/manifests" "$MOUNT_POINT/manifests-$DATE"
  cp -rv "$K3S_SERVER/tls" "$MOUNT_POINT/tls-$DATE"
  cp -v "$K3S_SERVER/token" "$MOUNT_POINT/token-$DATE"
fi
# Cleanup backups older than 7 days
find $MOUNT_POINT/etcd -name "state.db.*" -mtime +7 -delete
find $MOUNT_POINT -maxdepth 1 -name "manifests-*" -mtime +7 -exec rm -rf {} \;
find $MOUNT_POINT -maxdepth 1 -name "tls-*" -mtime +7 -exec rm -rf {} \;
echo "$(date): Backup completed" >> /var/log/k3s-backup.log
EOF
sudo chmod +x /usr/local/bin/k3s-backup.sh
Add cron job (every 6 hours):
echo "0 */6 * * * root /usr/local/bin/k3s-backup.sh" | sudo tee /etc/cron.d/k3s-backup
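The script's retention rule (`find -mtime +7 -delete`) is worth dry-running before trusting it with real backups. The sketch below exercises it in a scratch directory using files with backdated mtimes:

```shell
# Exercise the 7-day retention rule in a scratch directory (GNU touch -d).
tmp=$(mktemp -d)
touch -d "10 days ago" "$tmp/state.db.old"   # older than the cutoff
touch "$tmp/state.db.new"                    # fresh, must survive
find "$tmp" -name "state.db.*" -mtime +7 -delete
ls "$tmp"    # -> state.db.new
rm -rf "$tmp"
```

To preview what a retention rule would remove without deleting anything, replace `-delete` with `-print`.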
7.5. Restore from Backup
Stop k3s:
sudo systemctl stop k3s
Restore SQLite:
sudo mount -t nfs nas-01:/volume1/k3s_backups /mnt/k3s_backups
sudo cp /mnt/k3s_backups/etcd/state.db.YYYYMMDD /var/lib/rancher/k3s/server/db/state.db
Restore manifests if needed:
sudo cp -rv /mnt/k3s_backups/manifests-YYYYMMDD/* /var/lib/rancher/k3s/server/manifests/
Start k3s:
sudo systemctl start k3s
kubectl get nodes
kubectl get pods -A
7.6. etcd Snapshots (HA Clusters Only)
If running HA (3+ server nodes), use native etcd snapshots:
Create snapshot:
sudo k3s etcd-snapshot save --name manual-$(date +%Y%m%d)
List snapshots:
sudo k3s etcd-snapshot list
Restore from snapshot:
sudo systemctl stop k3s
sudo k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<snapshot-name>