k3s Prometheus + Grafana
Complete deployment guide for the kube-prometheus-stack on k3s with persistent storage on Synology NAS.
Architecture
Components
| Component | Purpose | Persistence |
|---|---|---|
| Prometheus | Metrics collection and storage | 50Gi on NAS (nfs-client) |
| Grafana | Visualization and dashboards | 10Gi on NAS (nfs-client) |
| AlertManager | Alert routing and notifications | 5Gi on NAS (nfs-client) |
| Node Exporter | Host-level metrics (CPU, memory, disk) | None (DaemonSet) |
| kube-state-metrics | Kubernetes object metrics | None (Deployment) |
Concepts: Network-to-Kubernetes Reference
Reference table mapping traditional network concepts to Kubernetes equivalents.
Network Plane Comparison
| Concept | Traditional Network (CCNP) | Kubernetes |
|---|---|---|
| Data Plane | ASICs forwarding packets (CEF, TCAM) | Container runtime (containerd) moving packets between pods |
| Control Plane | Routing protocols (OSPF, BGP, EIGRP) | kube-controller-manager, kube-scheduler |
| Management Plane | CLI/API (IOS, NX-OS, DNA Center) | kubectl, Kubernetes API server |
| Overlay Network | VXLAN, OTV, LISP | Cilium (eBPF), Flannel (VXLAN), Calico (BGP) |
| Service Discovery | DNS, ARP, CDP/LLDP | CoreDNS, kube-proxy, Service objects |
| Load Balancing | F5, NetScaler, ECMP | Service (ClusterIP, LoadBalancer), Ingress |
| ACLs / Security | ACLs, VACL, SGT/dACL | NetworkPolicy, Cilium policies |
| QoS | DSCP, queuing, policing | Resource limits, PriorityClass |
Storage: SAN/NAS to PersistentVolume
| Concept | Traditional Storage | Kubernetes |
|---|---|---|
| Storage Pool | LUN, Volume Group, RAID | StorageClass |
| Volume Provisioning | Manual LUN masking, zoning | Dynamic provisioning (PVC → PV) |
| Mount/Export | NFS export, iSCSI target | PersistentVolumeClaim (PVC) |
| Storage Tiering | SSD tier, HDD tier, archive | StorageClass with different provisioners |
Why NFS StorageClass?
NFS allows pods to migrate between nodes while retaining their data: local-path pins a volume to a single node's disk, while nfs-client provides shared storage that any node can mount.
Prometheus vs Traditional Monitoring
| Concept | Traditional Monitoring | Prometheus |
|---|---|---|
| Data Collection | SNMP polling (GET, WALK) | HTTP scraping (pull from /metrics endpoints) |
| Data Format | MIBs, OIDs | OpenMetrics (text-based, human-readable) |
| Time Series | RRD files, SQL database | TSDB (custom time-series database) |
| Alerting | Threshold triggers → email/SNMP trap | PromQL rules → AlertManager → Slack/PagerDuty |
| Target Discovery | Manual polling lists, CDP/LLDP | ServiceMonitor, PodMonitor (label selectors) |
Prometheus scrapes metrics like node_network_receive_bytes_total{device="eth0"} instead of SNMP OIDs.
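As a rough equivalent of an SNMP GET, the same counter can be pulled from the Prometheus HTTP API once the stack is running. A minimal sketch, assuming the prometheus.inside.domusdigitalis.dev name and port exposure configured later in this runbook:
# Query the current value of a counter via the Prometheus HTTP API
curl -sG 'http://prometheus.inside.domusdigitalis.dev:9090/api/v1/query' \
  --data-urlencode 'query=node_network_receive_bytes_total{device="eth0"}' \
  | jq '.data.result[].value'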
Phase 1: NFS Storage Class
k3s includes the local-path provisioner by default, but we need NFS for shared storage across nodes and NAS backup integration.
Dynamic Provisioning
With a StorageClass + provisioner, PVCs automatically create PVs. No manual binding required.
1.1 Install NFS Subdir External Provisioner
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update
1.2 Create NFS StorageClass
helm install nfs-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
--namespace kube-system \
--set nfs.server=10.50.1.70 \
--set nfs.path=/volume1/k3s \
--set storageClass.name=nfs-client \
--set storageClass.defaultClass=false \
--set storageClass.archiveOnDelete=true
1.3 Verify StorageClass
kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION
local-path (default) rancher.io/local-path Delete WaitForFirstConsumer false
nfs-client cluster.local/nfs-provisioner Delete Immediate true
1.4 Test NFS Provisioning
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-nfs-claim
spec:
storageClassName: nfs-client
accessModes:
- ReadWriteMany
resources:
requests:
storage: 1Gi
EOF
# Verify
kubectl get pvc test-nfs-claim
# Cleanup
kubectl delete pvc test-nfs-claim
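Optionally, confirm the provisioner created a backing directory on the NAS. A sketch using the same nas-01 SSH alias as later phases; the directory name embeds the namespace, PVC name, and UID (after deletion with archiveOnDelete=true it survives as an archived-* copy):
ssh nas-01 "ls /volume1/k3s/ | grep test-nfs-claim"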
Phase 3: Create Namespace
kubectl create namespace monitoring
kubectl label namespace monitoring name=monitoring
Phase 4: Configure Values
Create the Helm values file with NFS persistence and custom settings.
4.1 Secrets Management (gopass + dsec)
Credentials are stored in two locations for different use cases:
| System | Location | Use Case |
|---|---|---|
| gopass | v3/domains/d000/k3s/<service> | Interactive retrieval, metadata, password managers |
| dsec | d000 dev/app (app.env.age in ~/.secrets) | Shell scripts, automation |
Step 1: Generate Password in gopass
Use gopass generate -e which generates a secure password AND opens your editor for metadata:
gopass generate -e v3/domains/d000/k3s/grafana 32
In the editor that opens, add metadata below the generated password:
<generated-password-on-first-line>
---
description: Grafana admin credentials (k3s monitoring)
url: https://grafana.inside.domusdigitalis.dev:3000
username: admin
namespace: monitoring
helm_release: prometheus
k3s_node: k3s-master-01.inside.domusdigitalis.dev
ip: 10.50.1.120
Save and exit. The password is now stored with metadata.
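To retrieve a single metadata field later without printing the password, gopass can look up an individual key. A sketch, assuming the YAML-style key: value lines shown above:
gopass show v3/domains/d000/k3s/grafana url
# Prints only the url: value from the entry's metadata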
Create DNS Records (if they don't exist)
DNS records are added to BIND (authoritative DNS). See DNS Operations for full procedure.
Records to add:
| Hostname | FQDN | IP |
|---|---|---|
| grafana | grafana.inside.domusdigitalis.dev | 10.50.1.120 |
| prometheus | prometheus.inside.domusdigitalis.dev | 10.50.1.120 |
| alertmanager | alertmanager.inside.domusdigitalis.dev | 10.50.1.120 |
Step 1: SSH to bind-01
ssh bind-01
Step 2: Add Forward (A) Records
sudo nsupdate -l << 'EOF'
zone inside.domusdigitalis.dev
update add grafana.inside.domusdigitalis.dev. 3600 A 10.50.1.120
update add prometheus.inside.domusdigitalis.dev. 3600 A 10.50.1.120
update add alertmanager.inside.domusdigitalis.dev. 3600 A 10.50.1.120
send
EOF
Verify:
for h in grafana prometheus alertmanager; do dig +short $h.inside.domusdigitalis.dev @localhost; done
Step 3: Add Reverse (PTR) Records
All three hostnames share the same IP, so only ONE PTR record is needed. Convention: use the "primary" service name.
sudo nsupdate -l << 'EOF'
zone 1.50.10.in-addr.arpa
update add 120.1.50.10.in-addr.arpa. 3600 PTR grafana.inside.domusdigitalis.dev.
send
EOF
Verify:
dig +short -x 10.50.1.120 @localhost
grafana.inside.domusdigitalis.dev.
Step 4: Verify SOA Serial Updated
dig SOA inside.domusdigitalis.dev +short | awk '{print "Serial: "$3}'
Step 5: Force Zone Transfer to Secondary
sudo rndc notify inside.domusdigitalis.dev
sudo rndc notify 1.50.10.in-addr.arpa
Step 6: Verify on bind-02
dig +short grafana.inside.domusdigitalis.dev @10.50.1.91
dig +short -x 10.50.1.120 @10.50.1.91
Step 7: Exit bind-01
exit
Step 8: Verify from Workstation
# Forward lookups
for h in grafana prometheus alertmanager; do echo -n "$h: "; dig +short $h.inside.domusdigitalis.dev; done
grafana: 10.50.1.120
prometheus: 10.50.1.120
alertmanager: 10.50.1.120
# Reverse lookup
dig +short -x 10.50.1.120
grafana.inside.domusdigitalis.dev.
Step 2: Add Prometheus/AlertManager Metadata (No Passwords)
gopass generate -e v3/domains/d000/k3s/prometheus 32
<generated-password - not used, but required by gopass>
---
description: Prometheus metrics server (no auth required)
url: http://prometheus.inside.domusdigitalis.dev:9090
namespace: monitoring
helm_release: prometheus
k3s_node: k3s-master-01.inside.domusdigitalis.dev
ip: 10.50.1.120
storage: 50Gi NFS
retention: 30d
gopass generate -e v3/domains/d000/k3s/alertmanager 32
<generated-password - not used, but required by gopass>
---
description: AlertManager notification routing (no auth required)
url: http://alertmanager.inside.domusdigitalis.dev:9093
namespace: monitoring
helm_release: prometheus
k3s_node: k3s-master-01.inside.domusdigitalis.dev
ip: 10.50.1.120
storage: 5Gi NFS
Step 3: Add to dsec (app.env.age)
dsec edit d000 dev/app
Add this section (copy Grafana password from gopass):
# ============================================================================
# === k3s Infrastructure Monitoring ===
# ============================================================================
# Prometheus, Grafana, AlertManager deployed via kube-prometheus-stack
# Namespace: monitoring
# Storage: NFS on nas-01:/volume1/k3s
# ============================================================================
# --- Grafana ---
K3S_GRAFANA_ADMIN_USER=admin
K3S_GRAFANA_ADMIN_PASS=<paste from: gopass show -o v3/domains/d000/k3s/grafana>
K3S_GRAFANA_URL=http://{k3s-master-01-ip}:3000
# --- Prometheus ---
K3S_PROMETHEUS_URL=http://{k3s-master-01-ip}:9090
# --- AlertManager ---
K3S_ALERTMANAGER_URL=http://{k3s-master-01-ip}:9093
Step 4: Sync and Push
# Push gopass to git
gopass sync
# Push dsec to git
cd ~/.secrets && git add -A && git commit -m "feat(d000/dev): Add k3s monitoring credentials" && git push origin main
Retrieve Password (for Helm install)
# Option A: From gopass (interactive)
GRAFANA_PASS=$(gopass show -o v3/domains/d000/k3s/grafana)
# Option B: From dsec (automation)
eval "$(dsec source d000 dev/app)"
GRAFANA_PASS=$K3S_GRAFANA_ADMIN_PASS
Step 5: Validate Credentials with curl (From Workstation)
After Helm install, validate Grafana authentication from your workstation.
Test 1: Basic Authentication (Direct API)
GRAFANA_PASS=$(gopass show -o v3/domains/d000/k3s/grafana)
curl -sS -u "admin:${GRAFANA_PASS}" http://grafana.inside.domusdigitalis.dev:3000/api/org | jq .
{
"id": 1,
"name": "Main Org.",
"address": {
"address1": "",
"address2": "",
"city": "",
"zipCode": "",
"state": "",
"country": ""
}
}
If the password is wrong, the response is:
{
"message": "invalid username or password"
}
Test 2: Verbose Mode (See Full HTTP Transaction)
GRAFANA_PASS=$(gopass show -o v3/domains/d000/k3s/grafana)
curl -v -u "admin:${GRAFANA_PASS}" http://grafana.inside.domusdigitalis.dev:3000/api/org 2>&1 | grep -E "^[<>*]|HTTP/"
* Trying 10.50.1.120:3000...
* Connected to grafana.inside.domusdigitalis.dev (10.50.1.120) port 3000
> GET /api/org HTTP/1.1 # Request line
> Host: grafana.inside.domusdigitalis.dev:3000 # Target host
> Authorization: Basic YWRtaW46...(base64)... # Credentials (base64 encoded)
> User-Agent: curl/8.x.x
> Accept: */*
>
< HTTP/1.1 200 OK # Success!
< Cache-Control: no-store # No caching (security)
< Content-Type: application/json # Response format
< X-Content-Type-Options: nosniff # XSS protection
< X-Frame-Options: deny # Clickjacking protection
< X-Xss-Protection: 1; mode=block # XSS protection
Test 3: Session Cookie Authentication (How Browser Works)
# Step 1: Login and capture session cookie
GRAFANA_PASS=$(gopass show -o v3/domains/d000/k3s/grafana)
curl -sS -c /tmp/grafana-cookies.txt \
-H "Content-Type: application/json" \
-d '{"user":"admin","password":"'"${GRAFANA_PASS}"'"}' \
http://grafana.inside.domusdigitalis.dev:3000/login | jq .
{
"message": "Logged in"
}
# Step 2: View the session cookie
cat /tmp/grafana-cookies.txt
# Netscape HTTP Cookie File
grafana.inside.domusdigitalis.dev FALSE / FALSE 0 grafana_session abc123def456...
# Step 3: Use cookie for subsequent requests (no password needed)
curl -sS -b /tmp/grafana-cookies.txt http://grafana.inside.domusdigitalis.dev:3000/api/user | jq .
{
"id": 1,
"email": "admin@localhost",
"name": "",
"login": "admin",
"theme": "",
"orgId": 1,
"isGrafanaAdmin": true,
"isDisabled": false,
"isExternal": false,
"authLabels": [],
"updatedAt": "2026-02-23T...",
"createdAt": "2026-02-23T...",
"isGrafanaAdminExternallySynced": false
}
Test 4: Health Check Endpoints (No Auth Required)
# Grafana health
curl -sS http://grafana.inside.domusdigitalis.dev:3000/api/health | jq .
{
"commit": "abc1234",
"database": "ok",
"version": "11.x.x"
}
# Prometheus health
curl -sS http://prometheus.inside.domusdigitalis.dev:9090/-/healthy
Prometheus Server is Healthy.
# AlertManager health
curl -sS http://alertmanager.inside.domusdigitalis.dev:9093/-/healthy
OK
Understanding HTTP Authentication
| Method | How It Works | When Used |
|---|---|---|
| Basic Auth | Authorization: Basic header (base64-encoded user:password) sent with every request | API calls, scripts, simple automation |
| Session Cookie | Login once → server returns a grafana_session cookie, sent on subsequent requests | Browser sessions, interactive use |
| API Key | Generate key in Grafana UI → send it as an Authorization: Bearer header | Long-lived automation, CI/CD (more secure than Basic) |
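For the API-key path, requests carry a bearer token instead of credentials. A sketch; GRAFANA_API_KEY is a placeholder for a key or service-account token generated in the Grafana UI:
GRAFANA_API_KEY="<paste key from Grafana UI>"
curl -sS -H "Authorization: Bearer ${GRAFANA_API_KEY}" \
  http://grafana.inside.domusdigitalis.dev:3000/api/org | jq .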
Troubleshooting with curl
# DNS resolution check
dig +short grafana.inside.domusdigitalis.dev
# Expected: 10.50.1.120
# Connection refused (service not running or firewall)
curl -v http://grafana.inside.domusdigitalis.dev:3000/api/health 2>&1 | grep -E "Connection refused|Failed to connect"
# Check if port-forward is running
ssh k3s-master-01.inside.domusdigitalis.dev "ss -tlnp | grep 3000"
# Check firewall
ssh k3s-master-01.inside.domusdigitalis.dev "sudo firewall-cmd --list-ports | grep 3000"
# Redirect loop (grafana.ini misconfigured)
curl -v -L http://grafana.inside.domusdigitalis.dev:3000/ 2>&1 | grep -E "Location:|HTTP/"
# Should NOT see multiple redirects to localhost
# Cleanup
rm -f /tmp/grafana-cookies.txt
4.2 Create Values File
cat > /tmp/prometheus-values.yaml << 'EOF'
# =============================================================================
# kube-prometheus-stack values
# Runbook: k3s-monitoring.adoc
# =============================================================================
# -----------------------------------------------------------------------------
# Global Settings
# -----------------------------------------------------------------------------
defaultRules:
create: true
rules:
alertmanager: true
etcd: false # k3s uses SQLite, not etcd
kubeApiserver: true
kubeScheduler: true
kubeControllerManager: true
kubeProxy: true
node: true
# -----------------------------------------------------------------------------
# Prometheus
# -----------------------------------------------------------------------------
prometheus:
prometheusSpec:
replicas: 1
retention: 30d
retentionSize: "45GB"
# NFS Storage
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: nfs-client
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 50Gi
# Resource limits
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: 1000m
memory: 2Gi
# Scrape all namespaces
podMonitorNamespaceSelector: {}
podMonitorSelector: {}
serviceMonitorNamespaceSelector: {}
serviceMonitorSelector: {}
# -----------------------------------------------------------------------------
# Grafana
# -----------------------------------------------------------------------------
grafana:
enabled: true
adminPassword: "CHANGE_ME_BEFORE_INSTALL"
# NFS Persistence
persistence:
enabled: true
storageClassName: nfs-client
size: 10Gi
# Resource limits
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
# Default dashboards
defaultDashboardsEnabled: true
defaultDashboardsTimezone: America/Los_Angeles
# Sidecar for dashboard provisioning
sidecar:
dashboards:
enabled: true
searchNamespace: ALL
datasources:
enabled: true
# Grafana.ini settings
# IMPORTANT: For port-forward access, use explicit IP and serve_from_sub_path: false
# The %(domain)s variable resolves to 'localhost' inside the container, breaking redirects
grafana.ini:
server:
root_url: "http://10.50.1.120:3000"
serve_from_sub_path: false
security:
admin_user: admin
cookie_secure: false # Set true when using HTTPS/Traefik
users:
auto_assign_org_role: Viewer
auth.anonymous:
enabled: false
# -----------------------------------------------------------------------------
# AlertManager
# -----------------------------------------------------------------------------
alertmanager:
enabled: true
alertmanagerSpec:
replicas: 1
# NFS Storage
storage:
volumeClaimTemplate:
spec:
storageClassName: nfs-client
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 5Gi
# Resource limits
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 200m
memory: 256Mi
# -----------------------------------------------------------------------------
# Node Exporter (DaemonSet)
# -----------------------------------------------------------------------------
nodeExporter:
enabled: true
# -----------------------------------------------------------------------------
# kube-state-metrics
# -----------------------------------------------------------------------------
kubeStateMetrics:
enabled: true
# -----------------------------------------------------------------------------
# Prometheus Operator
# -----------------------------------------------------------------------------
prometheusOperator:
enabled: true
resources:
requests:
cpu: 100m
memory: 100Mi
limits:
cpu: 200m
memory: 200Mi
# -----------------------------------------------------------------------------
# k3s Specific: Disable components not present
# -----------------------------------------------------------------------------
kubeEtcd:
enabled: false
kubeScheduler:
enabled: false
kubeControllerManager:
enabled: false
kubeProxy:
enabled: false
EOF
4.3 Inject Grafana Password
Option A: Shell sed (from workstation)
# Get password from gopass
GRAFANA_PASS=$(gopass show -o v3/domains/d000/k3s/grafana)
# Replace placeholder in values file
sed -i "s/CHANGE_ME_BEFORE_INSTALL/$GRAFANA_PASS/" /tmp/prometheus-values.yaml
# Verify (should show actual password, not placeholder)
grep adminPassword /tmp/prometheus-values.yaml
Option B: vi/vim on k3s node (minimal tooling)
If editing directly on the node with vi:
# Quick vim setup (no vimrc needed)
:set nu rnu ic scs is hls ts=2 sw=2 et ai
# Substitute - use # delimiter (passwords often contain /)
:%s#CHANGE_ME_BEFORE_INSTALL#YOUR_ACTUAL_PASS#g
When passwords contain /, use an alternative delimiter such as #, | or @. Example: :%s|old|new|g works identically to :%s/old/new/g.
Phase 5: Install kube-prometheus-stack
What Helm Deploys
helm install deploys these resources:
| Resource Type | Count | Purpose |
|---|---|---|
| Deployment | 3 | Grafana, Prometheus Operator, kube-state-metrics |
| StatefulSet | 2 | Prometheus, AlertManager (ordered startup, stable network IDs) |
| DaemonSet | 1 | Node Exporter (runs on EVERY node, like an SNMP agent) |
| Service | 5+ | ClusterIP services for internal communication |
| ServiceMonitor | 10+ | Auto-discovery rules for Prometheus scraping |
| ConfigMap | 5+ | Dashboards, alerting rules, Grafana datasources |
| Secret | 2+ | Grafana admin password, AlertManager config |
StatefulSet vs Deployment: Prometheus uses StatefulSet because each instance has its own TSDB on disk.
The Operator Pattern
The prometheus-operator watches for ServiceMonitor/PrometheusRule resources and automatically updates Prometheus configuration.
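For example, exposing a new application's metrics only requires creating a ServiceMonitor; the operator rewrites the Prometheus scrape configuration automatically. A minimal sketch with a hypothetical app labelled example-app whose Service exposes a port named metrics:
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
spec:
  namespaceSelector:
    any: true
  selector:
    matchLabels:
      app: example-app
  endpoints:
    - port: metrics
      interval: 30s
EOF
Because the values file sets serviceMonitorSelector: {}, Prometheus picks up any ServiceMonitor in any namespace without further configuration.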
5.1 Dry Run (Optional - Will Fail)
Dry-run will fail with CRD errors on first install. This is expected.
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values /tmp/prometheus-values.yaml \
--dry-run
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest:
resource mapping not found for name: "prometheus-kube-prometheus-alertmanager"
no matches for kind "Alertmanager" in version "monitoring.coreos.com/v1"
ensure CRDs are installed first
Why this happens: Dry-run validates against existing cluster state. CRDs (Custom Resource Definitions) like Alertmanager, Prometheus, ServiceMonitor don’t exist yet - they get installed during the actual Helm install.
Solution: Skip dry-run on first install, or proceed directly to 5.2. Dry-run works on subsequent upgrades.
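If the prometheus-community Helm repo is not yet configured on this host, add it before installing:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update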
5.2 Install
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values /tmp/prometheus-values.yaml
5.3 Wait for Pods
kubectl -n monitoring get pods -w
NAME READY STATUS RESTARTS AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 67s
prometheus-grafana-6dcfb98c6-rjxsm 3/3 Running 0 83s
prometheus-kube-prometheus-operator-664f9898f9-txw9b 1/1 Running 0 77s
prometheus-kube-state-metrics-7dfddfdf48-5cwlc 1/1 Running 0 77s
prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 66s
prometheus-prometheus-node-exporter-kp67p 1/1 Running 0 78s
Grafana may show 2/3 briefly while sidecars initialize. Wait for 3/3.
5.4 Verify Password
Confirm your password was deployed correctly:
kubectl -n monitoring get secrets prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo
Should match what you stored in gopass: gopass show -o v3/domains/d000/k3s/grafana
5.5 Verify PVCs
kubectl -n monitoring get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
prometheus-prometheus-kube-prometheus-prom... Bound pvc-xxxx 50Gi RWO nfs-client
alertmanager-prometheus-kube-prometheus-al... Bound pvc-xxxx 5Gi RWO nfs-client
prometheus-grafana Bound pvc-xxxx 10Gi RWO nfs-client
5.6 Verify NAS Storage
# Use short hostname (SSH config handles the rest)
ssh nas-01 "ls -la /volume1/k3s/"
drwxrwxrwx+ 1 root root 1012 Feb 22 21:56 .
drwxrwxrwx 1 root root 0 Feb 22 21:56 monitoring-alertmanager-prometheus-kube-prometheus-...
drwxrwxrwx 1 472 472 52 Feb 22 22:01 monitoring-prometheus-grafana-pvc-...
drwxrwxrwx 1 root root 26 Feb 22 21:56 monitoring-prometheus-prometheus-kube-prometheus-...
Directory names follow pattern: <namespace>-<pvc-name>-<pvc-uuid>
Phase 6: Access Services
6.1 Get Service IPs
kubectl -n monitoring get svc | awk '{print $1, $3, $5}'
NAME CLUSTER-IP PORT(S)
alertmanager-operated None 9093/TCP,9094/TCP,9094/UDP
prometheus-grafana 10.43.18.82 80/TCP
prometheus-kube-prometheus-alertmanager 10.43.69.110 9093/TCP,8080/TCP
prometheus-kube-prometheus-operator 10.43.162.126 443/TCP
prometheus-kube-prometheus-prometheus 10.43.205.102 9090/TCP,8080/TCP
prometheus-kube-state-metrics 10.43.147.44 8080/TCP
prometheus-operated None 9090/TCP
prometheus-prometheus-node-exporter 10.43.228.32 9100/TCP
6.2 Open Firewall Ports
Rocky Linux 9 ships with firewalld enabled. The port-forward binds to 0.0.0.0, but firewalld blocks external access until the ports are opened.
sudo firewall-cmd --add-port=3000/tcp --permanent # Grafana
sudo firewall-cmd --add-port=9090/tcp --permanent # Prometheus
sudo firewall-cmd --add-port=9093/tcp --permanent # AlertManager
sudo firewall-cmd --add-port=9100/tcp --permanent # Node Exporter (metrics scraping)
sudo firewall-cmd --reload
success
success
success
success
success
Port 9100 is required for Prometheus to scrape Node Exporter metrics. Without it, you’ll see TargetDown alerts and "no route to host" errors in Prometheus targets.
Verify from workstation:
nc -zv 10.50.1.120 3000
# Expected: Connection to 10.50.1.120 3000 port [tcp/hbci] succeeded!
6.3 Port Forward (Quick Access)
# Grafana (localhost:3000)
kubectl -n monitoring port-forward svc/prometheus-grafana 3000:80 --address 0.0.0.0 &
# Prometheus (localhost:9090)
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 --address 0.0.0.0 &
# AlertManager (localhost:9093)
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093 --address 0.0.0.0 &
After helm upgrade, pods restart and port-forwards lose connection. Restart them:
# Kill old port-forwards
kill %1 %2 %3
# Restart all
kubectl -n monitoring port-forward svc/prometheus-grafana 3000:80 --address 0.0.0.0 &
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 --address 0.0.0.0 &
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093 --address 0.0.0.0 &
6.4 Persistent Port-Forward (Systemd Services)
Background port-forwards (&) die when SSH disconnects. Use systemd for persistence.
Create Grafana Service
cat << 'EOF' | sudo tee /etc/systemd/system/grafana-pf.service
[Unit]
Description=Grafana Port Forward
After=network.target k3s.service
Wants=k3s.service
[Service]
Type=simple
Environment="KUBECONFIG=/etc/rancher/k3s/k3s.yaml"
ExecStart=/usr/local/bin/kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 --address=0.0.0.0
Restart=always
RestartSec=10
User=root
[Install]
WantedBy=multi-user.target
EOF
Create Prometheus Service
cat << 'EOF' | sudo tee /etc/systemd/system/prometheus-pf.service
[Unit]
Description=Prometheus Port Forward
After=network.target k3s.service
Wants=k3s.service
[Service]
Type=simple
Environment="KUBECONFIG=/etc/rancher/k3s/k3s.yaml"
ExecStart=/usr/local/bin/kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 --address=0.0.0.0
Restart=always
RestartSec=10
User=root
[Install]
WantedBy=multi-user.target
EOF
Create Alertmanager Service
cat << 'EOF' | sudo tee /etc/systemd/system/alertmanager-pf.service
[Unit]
Description=Alertmanager Port Forward
After=network.target k3s.service
Wants=k3s.service
[Service]
Type=simple
Environment="KUBECONFIG=/etc/rancher/k3s/k3s.yaml"
ExecStart=/usr/local/bin/kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093 --address=0.0.0.0
Restart=always
RestartSec=10
User=root
[Install]
WantedBy=multi-user.target
EOF
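The unit files above are only written to disk; they still need to be loaded and enabled (standard systemd workflow):
sudo systemctl daemon-reload
sudo systemctl enable --now grafana-pf prometheus-pf alertmanager-pf
systemctl --no-pager status grafana-pf prometheus-pf alertmanager-pf | grep -E "\.service|Active:"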
Phase 7: Traefik Ingress with Vault PKI (Production Access)
For production, expose via Traefik IngressRoute with TLS certificates from internal Vault PKI.
Why Vault PKI, not Let's Encrypt? The inside.domusdigitalis.dev zone is internal-only, so public ACME validation is impractical, and the lab already runs an internal Vault CA (DOMUS-ISSUING-CA) that clients trust.
7.0 Security Decision: Individual Certs vs Wildcard
| Approach | Pros | Cons |
|---|---|---|
Individual Certs (recommended) |
Blast radius limited to single service if key compromised; follows least-privilege; better audit trail |
More certs to manage; more Vault operations |
Wildcard Cert ( |
Single cert for all services; simpler management |
Single key compromise exposes ALL subdomains; violates least-privilege; broader attack surface |
This runbook uses individual certs per service.
7.1 Issue Certificates from Vault PKI
From workstation (where Vault CLI is configured):
Grafana Certificate
vault write -format=json pki_int/issue/domus-client \
common_name="grafana.inside.domusdigitalis.dev" \
ttl="8760h" > /tmp/grafana-cert.json
jq -r '.data.certificate' /tmp/grafana-cert.json > /tmp/grafana.crt
jq -r '.data.private_key' /tmp/grafana-cert.json > /tmp/grafana.key
jq -r '.data.ca_chain[]' /tmp/grafana-cert.json >> /tmp/grafana.crt
Verify:
openssl x509 -in /tmp/grafana.crt -noout -subject -issuer -dates | head -4
Prometheus Certificate
vault write -format=json pki_int/issue/domus-client \
common_name="prometheus.inside.domusdigitalis.dev" \
ttl="8760h" > /tmp/prometheus-cert.json
jq -r '.data.certificate' /tmp/prometheus-cert.json > /tmp/prometheus.crt
jq -r '.data.private_key' /tmp/prometheus-cert.json > /tmp/prometheus.key
jq -r '.data.ca_chain[]' /tmp/prometheus-cert.json >> /tmp/prometheus.crt
AlertManager Certificate
vault write -format=json pki_int/issue/domus-client \
common_name="alertmanager.inside.domusdigitalis.dev" \
ttl="8760h" > /tmp/alertmanager-cert.json
jq -r '.data.certificate' /tmp/alertmanager-cert.json > /tmp/alertmanager.crt
jq -r '.data.private_key' /tmp/alertmanager-cert.json > /tmp/alertmanager.key
jq -r '.data.ca_chain[]' /tmp/alertmanager-cert.json >> /tmp/alertmanager.crt
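Before creating the Kubernetes secrets, it is worth confirming each certificate and key actually pair up. A sketch that compares the RSA moduli; it assumes the domus-client role issues RSA keys (adjust the openssl command for EC keys):
for svc in grafana prometheus alertmanager; do
  c=$(openssl x509 -in /tmp/${svc}.crt -noout -modulus | openssl md5)
  k=$(openssl rsa -in /tmp/${svc}.key -noout -modulus | openssl md5)
  [ "$c" = "$k" ] && echo "${svc}: cert/key match" || echo "${svc}: MISMATCH"
done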
7.2 Transfer Certificates to k3s Node
scp /tmp/grafana.crt /tmp/grafana.key k3s-master-01.inside.domusdigitalis.dev:/tmp/
scp /tmp/prometheus.crt /tmp/prometheus.key k3s-master-01.inside.domusdigitalis.dev:/tmp/
scp /tmp/alertmanager.crt /tmp/alertmanager.key k3s-master-01.inside.domusdigitalis.dev:/tmp/
7.3 Create TLS Secrets in Kubernetes
On k3s-master-01:
kubectl -n monitoring create secret tls grafana-tls \
--cert=/tmp/grafana.crt \
--key=/tmp/grafana.key
kubectl -n monitoring create secret tls prometheus-tls \
--cert=/tmp/prometheus.crt \
--key=/tmp/prometheus.key
kubectl -n monitoring create secret tls alertmanager-tls \
--cert=/tmp/alertmanager.crt \
--key=/tmp/alertmanager.key
Verify:
kubectl -n monitoring get secrets | grep -E "tls"
alertmanager-tls   kubernetes.io/tls   2   10s
grafana-tls        kubernetes.io/tls   2   15s
prometheus-tls     kubernetes.io/tls   2   12s
7.4 Create IngressRoutes
cat <<'EOF' | kubectl apply -f -
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: grafana
namespace: monitoring
spec:
entryPoints:
- websecure
routes:
- match: Host(`grafana.inside.domusdigitalis.dev`)
kind: Rule
services:
- name: prometheus-grafana
port: 80
tls:
secretName: grafana-tls
EOF
cat <<'EOF' | kubectl apply -f -
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: prometheus
namespace: monitoring
spec:
entryPoints:
- websecure
routes:
- match: Host(`prometheus.inside.domusdigitalis.dev`)
kind: Rule
services:
- name: prometheus-kube-prometheus-prometheus
port: 9090
tls:
secretName: prometheus-tls
EOF
cat <<'EOF' | kubectl apply -f -
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
name: alertmanager
namespace: monitoring
spec:
entryPoints:
- websecure
routes:
- match: Host(`alertmanager.inside.domusdigitalis.dev`)
kind: Rule
services:
- name: prometheus-kube-prometheus-alertmanager
port: 9093
tls:
secretName: alertmanager-tls
EOF
Verify IngressRoutes exist:
kubectl get ingressroute -n monitoring
NAME           AGE
alertmanager   5s
grafana        15s
prometheus     10s
Verify hostnames with custom-columns:
kubectl get ingressroute -n monitoring -o custom-columns=NAME:.metadata.name,HOST:.spec.routes[0].match
NAME           HOST
alertmanager   Host(`alertmanager.inside.domusdigitalis.dev`)
grafana        Host(`grafana.inside.domusdigitalis.dev`)
prometheus     Host(`prometheus.inside.domusdigitalis.dev`)
7.5 Verify HTTPS Access
From workstation (must trust DOMUS-ROOT-CA):
curl -sS -o /dev/null -w "HTTP: %{http_code}\n" https://grafana.inside.domusdigitalis.dev
HTTP: 200
echo | openssl s_client -connect grafana.inside.domusdigitalis.dev:443 2>/dev/null | \
openssl x509 -noout -subject -issuer
subject=CN=grafana.inside.domusdigitalis.dev
issuer=CN=DOMUS-ISSUING-CA
7.6 DNS Records (if not already added)
DNS should already exist from Phase 4. Verify:
for h in grafana prometheus alertmanager; do
host ${h}.inside.domusdigitalis.dev | awk '{print $1, $NF}'
done
grafana.inside.domusdigitalis.dev 10.50.1.120
prometheus.inside.domusdigitalis.dev 10.50.1.120
alertmanager.inside.domusdigitalis.dev 10.50.1.120
If missing, add via BIND:
ssh bind-01 "sudo nsupdate -l << 'EOF'
zone inside.domusdigitalis.dev
update add grafana.inside.domusdigitalis.dev. 3600 A 10.50.1.120
update add prometheus.inside.domusdigitalis.dev. 3600 A 10.50.1.120
update add alertmanager.inside.domusdigitalis.dev. 3600 A 10.50.1.120
send
EOF"
Phase 8: Custom Dashboards
8.1 Import via ConfigMap
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
name: custom-dashboard-example
namespace: monitoring
labels:
grafana_dashboard: "1"
data:
my-dashboard.json: |
{
"title": "Custom Dashboard",
"uid": "custom-example",
"version": 1,
"panels": []
}
EOF
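To confirm the dashboard sidecar picked the ConfigMap up, query Grafana's search API. A sketch reusing the Basic-auth pattern from Phase 4:
GRAFANA_PASS=$(gopass show -o v3/domains/d000/k3s/grafana)
curl -sS -u "admin:${GRAFANA_PASS}" \
  "http://grafana.inside.domusdigitalis.dev:3000/api/search?query=Custom%20Dashboard" | jq -r '.[].uid'
# Expected (once the sidecar has synced): custom-example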
8.2 Recommended Dashboards
Import these from Grafana Dashboards:
| ID | Name | Description |
|---|---|---|
| 1860 | Node Exporter Full | Comprehensive host metrics |
| 13502 | Mini Kubernetes Cluster | Lightweight k8s overview |
| 15757 | Kubernetes Views / Pods | Pod-level metrics |
| 14981 | CoreDNS | DNS query metrics |
| 15759 | Kubernetes Views / Nodes | Node resource usage |
Phase 9: AlertManager Configuration
9.1 Configure Slack Alerts (Example)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
name: alertmanager-prometheus-kube-prometheus-alertmanager
namespace: monitoring
stringData:
alertmanager.yaml: |
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'namespace']
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
receivers:
- name: 'default'
# No action for default
- name: 'critical-alerts'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
channel: '#alerts'
send_resolved: true
EOF
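Confirm AlertManager loaded the new configuration. A sketch using the v2 status API, which returns the active config as text:
curl -sS http://alertmanager.inside.domusdigitalis.dev:9093/api/v2/status | jq -r '.config.original' | grep -A2 receivers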
Phase 10: Validation
10.1 Comprehensive Health Check
echo "=== Monitoring Stack Health Check ==="
echo -e "\n[1] Pods Status:"
kubectl -n monitoring get pods -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount'
echo -e "\n[2] PVCs:"
kubectl -n monitoring get pvc -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,CAPACITY:.status.capacity.storage'
echo -e "\n[3] Services:"
kubectl -n monitoring get svc -o custom-columns='NAME:.metadata.name,TYPE:.spec.type,CLUSTER-IP:.spec.clusterIP,PORT:.spec.ports[0].port'
echo -e "\n[4] Prometheus Targets:"
kubectl -n monitoring exec -it prometheus-prometheus-kube-prometheus-prometheus-0 -- wget -qO- http://localhost:9090/api/v1/targets | jq -r '.data.activeTargets | length'
echo -e "\n[5] Grafana Datasources:"
kubectl -n monitoring exec -it deploy/prometheus-grafana -c grafana -- curl -s http://localhost:3000/api/datasources | jq -r '.[].name'
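A quick pass/fail signal is the number of scrape targets currently down; zero is the goal. A sketch against the Prometheus HTTP API (the "or vector(0)" keeps the query from returning empty when nothing is down):
echo -e "\n[6] Targets down:"
curl -sG 'http://prometheus.inside.domusdigitalis.dev:9090/api/v1/query' \
  --data-urlencode 'query=count(up == 0) or vector(0)' | jq -r '.data.result[0].value[1]'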
Troubleshooting
Pods Stuck in Pending
# Check events
kubectl -n monitoring describe pod <pod-name> | grep -A10 Events
# Common cause: PVC not binding
kubectl -n monitoring get pvc
kubectl describe pvc <pvc-name>
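If a PVC on the nfs-client StorageClass is stuck Pending, the provisioner's logs usually name the problem (mount denied, export missing). The deployment name below is how Helm typically composes it from the Phase 1 release name; verify it first if it differs:
kubectl -n kube-system get deploy | grep nfs
kubectl -n kube-system logs deploy/nfs-provisioner-nfs-subdir-external-provisioner --tail=20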
NFS Mount Issues
# Test NFS from node
ssh k3s-master-01.inside.domusdigitalis.dev
sudo mount -t nfs 10.50.1.70:/volume1/k3s /mnt/test
ls /mnt/test
sudo umount /mnt/test
Grafana Login Issues
# Copy password to clipboard (secure - no screen exposure)
gopass show -c v3/domains/d000/k3s/grafana
# Or show password (use -o only in scripts, not interactively)
gopass show -o v3/domains/d000/k3s/grafana
# Get password from dsec
eval "$(dsec source d000 dev/app)" && echo $K3S_GRAFANA_ADMIN_PASS
# Get current password from k8s secret (what Helm deployed)
kubectl -n monitoring get secret prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d; echo
# Reset password (updates k8s only - also update gopass/dsec!)
kubectl -n monitoring exec -it deploy/prometheus-grafana -c grafana -- grafana-cli admin reset-admin-password <new-password>
Grafana Redirects to localhost (Can’t Connect)
Symptom: Browser shows "Unable to connect" or curl shows redirect to localhost:
< Location: http://localhost:3000/grafana/
Cause: The %(domain)s variable in root_url resolves to localhost inside the container.
Fix: Update values with explicit IP:
vi /tmp/prometheus-values.yaml
# Change grafana.ini section to:
grafana.ini:
server:
root_url: "http://{k3s-master-01-ip}:3000"
serve_from_sub_path: false
security:
cookie_secure: false
# Upgrade
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values /tmp/prometheus-values.yaml
# Restart port-forwards (pods restarted)
kill %1 %2 %3
kubectl -n monitoring port-forward svc/prometheus-grafana 3000:80 --address 0.0.0.0 &
Upgrade Stack
# Update repo
helm repo update
# Check available versions
helm search repo prometheus-community/kube-prometheus-stack --versions | head -5
# Upgrade
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values /tmp/prometheus-values.yaml
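If an upgrade misbehaves, Helm keeps prior revisions that can be rolled back to (pick the revision number from the history output):
helm history prometheus -n monitoring
helm rollback prometheus <REVISION> -n monitoring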
Uninstall
This deletes all monitoring data. PVCs with archiveOnDelete=true preserve data on NAS.
helm uninstall prometheus -n monitoring
kubectl delete namespace monitoring
Appendix: Vault PKI Certificate for Grafana
Replace the default Grafana self-signed certificate with a Vault-issued certificate to eliminate browser warnings and enable secure HTTPS access.
Issue Certificate from Vault
From workstation:
vault write -format=json pki_int/issue/domus-client \
common_name="grafana.inside.domusdigitalis.dev" \
ttl="8760h" > /tmp/grafana-cert.json
Extract Certificate Components
jq -r '.data.certificate' /tmp/grafana-cert.json > /tmp/grafana.crt
jq -r '.data.private_key' /tmp/grafana-cert.json > /tmp/grafana.key
jq -r '.data.ca_chain[]' /tmp/grafana-cert.json > /tmp/grafana-ca.crt
Verify Certificate
openssl x509 -in /tmp/grafana.crt -noout -subject -issuer -dates
Expected:
subject=CN=grafana.inside.domusdigitalis.dev
issuer=CN=DOMUS-ISSUING-CA
notBefore=...
notAfter=... (1 year from now)
Transfer to k3s Node
From workstation:
scp /tmp/grafana.crt /tmp/grafana.key /tmp/grafana-ca.crt k3s-master-01.inside.domusdigitalis.dev:/tmp/
Create Kubernetes TLS Secret
On k3s-master-01:
kubectl -n monitoring create secret tls grafana-tls-vault \
--cert=/tmp/grafana.crt \
--key=/tmp/grafana.key \
--dry-run=client -o yaml | kubectl apply -f -
Update Helm Values for HTTPS
Add TLS configuration to your values file:
grafana:
# ... existing settings ...
# Enable HTTPS with Vault certificate
grafana.ini:
server:
protocol: https
cert_file: /etc/grafana/tls/tls.crt
cert_key: /etc/grafana/tls/tls.key
root_url: "https://grafana.inside.domusdigitalis.dev"
security:
cookie_secure: true
# Mount TLS secret
extraSecretMounts:
- name: grafana-tls
secretName: grafana-tls-vault
mountPath: /etc/grafana/tls
readOnly: true
Upgrade Helm Release
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--values /tmp/prometheus-values.yaml
Update Port-Forward for HTTPS
If using port-forward (not Traefik), update the systemd service:
cat << 'EOF' | sudo tee /etc/systemd/system/grafana-pf.service
[Unit]
Description=Grafana Port Forward (HTTPS)
After=network.target k3s.service
Wants=k3s.service
[Service]
Type=simple
Environment="KUBECONFIG=/etc/rancher/k3s/k3s.yaml"
ExecStart=/usr/local/bin/kubectl port-forward -n monitoring svc/prometheus-grafana 3000:443 --address=0.0.0.0
Restart=always
RestartSec=10
User=root
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl restart grafana-pf
Verify HTTPS
echo | openssl s_client -connect grafana.inside.domusdigitalis.dev:3000 2>/dev/null | openssl x509 -noout -subject -issuer
Expected:
subject=CN=grafana.inside.domusdigitalis.dev
issuer=CN=DOMUS-ISSUING-CA
Browser must trust the DOMUS-ROOT-CA for the certificate to show as valid. Import the root CA into your browser/OS trust store if not already done.
Related Documentation
- Wazuh SIEM - Has Vault PKI cert section for dashboard