k3s Prometheus + Grafana

Complete deployment guide for the kube-prometheus-stack on k3s with persistent storage on Synology NAS.

Architecture

Prometheus Monitoring Stack

Components

Component          | Purpose                                 | Persistence
-------------------|-----------------------------------------|---------------------------------------
Prometheus         | Metrics collection and storage          | 50Gi on NAS (/volume1/k3s/prometheus)
Grafana            | Visualization and dashboards            | 10Gi on NAS (/volume1/k3s/grafana)
AlertManager       | Alert routing and notifications         | 5Gi on NAS (/volume1/k3s/alertmanager)
Node Exporter      | Host-level metrics (CPU, memory, disk)  | None (DaemonSet)
kube-state-metrics | Kubernetes object metrics               | None (Deployment)

Concepts: Network-to-Kubernetes Reference

Reference table mapping traditional network concepts to Kubernetes equivalents.

Network Plane Comparison

Concept           | Traditional Network (CCNP)           | Kubernetes
------------------|--------------------------------------|-------------------------------------------------------------
Data Plane        | ASICs forwarding packets (CEF, TCAM) | Container runtime (containerd) moving packets between pods
Control Plane     | Routing protocols (OSPF, BGP, EIGRP) | kube-controller-manager, kube-scheduler
Management Plane  | CLI/API (IOS, NX-OS, DNA Center)     | kubectl, Kubernetes API server
Overlay Network   | VXLAN, OTV, LISP                     | Cilium (eBPF), Flannel (VXLAN), Calico (BGP)
Service Discovery | DNS, ARP, CDP/LLDP                   | CoreDNS, kube-proxy, Service objects
Load Balancing    | F5, NetScaler, ECMP                  | Service (ClusterIP, LoadBalancer), Ingress
ACLs / Security   | ACLs, VACL, SGT/dACL                 | NetworkPolicy, Cilium policies
QoS               | DSCP, queuing, policing              | Resource limits, PriorityClass

Storage: SAN/NAS to PersistentVolume

Concept             | Traditional Storage         | Kubernetes
--------------------|-----------------------------|------------------------------------------
Storage Pool        | LUN, Volume Group, RAID     | StorageClass
Volume Provisioning | Manual LUN masking, zoning  | Dynamic provisioning (PVC → PV)
Mount/Export        | NFS export, iSCSI target    | PersistentVolumeClaim (PVC)
Storage Tiering     | SSD tier, HDD tier, archive | StorageClass with different provisioners

Why NFS StorageClass?

NFS lets pods migrate between nodes while keeping their data: local-path pins a volume to a single node's local disk, while nfs-client provides shared storage that any node can mount.

Prometheus vs Traditional Monitoring

Concept          | Traditional Monitoring               | Prometheus
-----------------|--------------------------------------|-----------------------------------------------
Data Collection  | SNMP polling (GET, WALK)             | HTTP scraping (/metrics endpoint)
Data Format      | MIBs, OIDs                           | OpenMetrics (text-based, human-readable)
Time Series      | RRD files, SQL database              | TSDB (custom time-series database)
Alerting         | Threshold triggers → email/SNMP trap | PromQL rules → AlertManager → Slack/PagerDuty
Target Discovery | Manual polling lists, CDP/LLDP       | ServiceMonitor, PodMonitor (label selectors)

Prometheus scrapes metrics like node_network_receive_bytes_total{device="eth0"} instead of SNMP OIDs.
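
The same contrast can be seen on the wire: instead of an SNMP GET against an OID, a metric is fetched with a plain HTTP query. A minimal sketch against the Prometheus HTTP API (assumes the Phase 6/7 URL and an eth0 interface on the nodes; both are placeholders for your environment):

# Query the receive-byte counter as a 5-minute rate over the HTTP API
curl -sG 'http://prometheus.inside.domusdigitalis.dev:9090/api/v1/query' \
  --data-urlencode 'query=rate(node_network_receive_bytes_total{device="eth0"}[5m])' \
  | jq -r '.data.result[] | "\(.metric.instance)  \(.value[1]) bytes/sec"'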

Service Discovery

What happens                          | How it maps
--------------------------------------|--------------------------------------------------
Pod starts with label app=prometheus  | Like a host coming online and registering in DNS
ServiceMonitor selects app=prometheus | Like a DNS zone transfer / dynamic DNS update
Prometheus scrapes discovered targets | Like NMS polling hosts discovered via CDP
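
A ServiceMonitor implementing the middle row might look like the sketch below. The name, label, and port are illustrative placeholders, not objects created elsewhere in this runbook; with the Phase 4 values (serviceMonitorSelector: {}), Prometheus picks up any ServiceMonitor in any namespace.

cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # placeholder name
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: prometheus          # scrape Services carrying this label
  endpoints:
    - port: http-metrics       # named port on the Service (placeholder)
      path: /metrics
      interval: 30s
EOF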

Prerequisites

  • k3s cluster operational: kubectl get nodes

  • Helm 3 installed on k3s master

  • NFS share created on NAS: /volume1/k3s

  • dsec credentials loaded: dsource d000 dev/storage

Verify k3s Cluster

ssh k3s-master-01.inside.domusdigitalis.dev
kubectl get nodes -o wide
kubectl get pods -A | grep -v Running

Phase 1: NFS Storage Class

k3s includes local-path provisioner by default, but we need NFS for shared storage across nodes and NAS backup integration.

Dynamic Provisioning

With a StorageClass + provisioner, PVCs automatically create PVs. No manual binding required.

1.1 Install NFS Subdir External Provisioner

helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm repo update

1.2 Create NFS StorageClass

helm install nfs-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \
  --namespace kube-system \
  --set nfs.server=10.50.1.70 \
  --set nfs.path=/volume1/k3s \
  --set storageClass.name=nfs-client \
  --set storageClass.defaultClass=false \
  --set storageClass.archiveOnDelete=true

1.3 Verify StorageClass

kubectl get storageclass
Expected Output
NAME                   PROVISIONER                                     RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION
local-path (default)   rancher.io/local-path                           Delete          WaitForFirstConsumer   false
nfs-client             cluster.local/nfs-provisioner                   Delete          Immediate              true

1.4 Test NFS Provisioning

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-nfs-claim
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
EOF

# Verify
kubectl get pvc test-nfs-claim

# Cleanup
kubectl delete pvc test-nfs-claim

Phase 2: Prometheus Helm Repository

2.1 Add Repository

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

2.2 Verify Repository

helm search repo prometheus-community/kube-prometheus-stack --versions | head -5

Phase 3: Create Namespace

kubectl create namespace monitoring
kubectl label namespace monitoring name=monitoring

Phase 4: Configure Values

Create the Helm values file with NFS persistence and custom settings.

4.1 Secrets Management (gopass + dsec)

Credentials are stored in two locations for different use cases:

System | Location                    | Use Case
-------|-----------------------------|----------------------------------------------------
gopass | v3/domains/d000/k3s/grafana | Interactive retrieval, metadata, password managers
dsec   | d000/dev/app.env.age        | Shell scripts, automation, eval "$(dsec source …)"

Step 1: Generate Password in gopass

Use gopass generate -e which generates a secure password AND opens your editor for metadata:

gopass generate -e v3/domains/d000/k3s/grafana 32

In the editor that opens, add metadata below the generated password:

<generated-password-on-first-line>
---
description: Grafana admin credentials (k3s monitoring)
url: https://grafana.inside.domusdigitalis.dev:3000
username: admin
namespace: monitoring
helm_release: prometheus
k3s_node: k3s-master-01.inside.domusdigitalis.dev
ip: 10.50.1.120

Save and exit. The password is now stored with metadata.

Create DNS Records (if missing)

DNS records are added to BIND (authoritative DNS). See DNS Operations for full procedure.

Records to add:

Hostname     | FQDN                                   | IP
-------------|----------------------------------------|------------
grafana      | grafana.inside.domusdigitalis.dev      | 10.50.1.120
prometheus   | prometheus.inside.domusdigitalis.dev   | 10.50.1.120
alertmanager | alertmanager.inside.domusdigitalis.dev | 10.50.1.120

Step 1: SSH to bind-01

ssh bind-01

Step 2: Add Forward (A) Records

sudo nsupdate -l << 'EOF'
zone inside.domusdigitalis.dev
update add grafana.inside.domusdigitalis.dev. 3600 A 10.50.1.120
update add prometheus.inside.domusdigitalis.dev. 3600 A 10.50.1.120
update add alertmanager.inside.domusdigitalis.dev. 3600 A 10.50.1.120
send
EOF

Verify:

for h in grafana prometheus alertmanager; do dig +short $h.inside.domusdigitalis.dev @localhost; done

Step 3: Add Reverse (PTR) Records

All three hostnames share the same IP, so only ONE PTR record is needed. Convention: use the "primary" service name.
sudo nsupdate -l << 'EOF'
zone 1.50.10.in-addr.arpa
update add 120.1.50.10.in-addr.arpa. 3600 PTR grafana.inside.domusdigitalis.dev.
send
EOF

Verify:

dig +short -x 10.50.1.120 @localhost
Expected output
grafana.inside.domusdigitalis.dev.

Step 4: Verify SOA Serial Updated

dig SOA inside.domusdigitalis.dev +short | awk '{print "Serial: "$3}'

Step 5: Force Zone Transfer to Secondary

sudo rndc notify inside.domusdigitalis.dev
sudo rndc notify 1.50.10.in-addr.arpa

Step 6: Verify on bind-02

dig +short grafana.inside.domusdigitalis.dev @10.50.1.91
dig +short -x 10.50.1.120 @10.50.1.91

Step 7: Exit bind-01

exit

Step 8: Verify from Workstation

# Forward lookups
for h in grafana prometheus alertmanager; do echo -n "$h: "; dig +short $h.inside.domusdigitalis.dev; done
Expected output
grafana: 10.50.1.120
prometheus: 10.50.1.120
alertmanager: 10.50.1.120
# Reverse lookup
dig +short -x 10.50.1.120
Expected output
grafana.inside.domusdigitalis.dev.

Step 2: Add Prometheus/AlertManager Metadata (No Passwords)

gopass generate -e v3/domains/d000/k3s/prometheus 32

In the editor, add:

<generated-password - not used, but required by gopass>
---
description: Prometheus metrics server (no auth required)
url: http://prometheus.inside.domusdigitalis.dev:9090
namespace: monitoring
helm_release: prometheus
k3s_node: k3s-master-01.inside.domusdigitalis.dev
ip: 10.50.1.120
storage: 50Gi NFS
retention: 30d

gopass generate -e v3/domains/d000/k3s/alertmanager 32

In the editor, add:

<generated-password - not used, but required by gopass>
---
description: AlertManager notification routing (no auth required)
url: http://alertmanager.inside.domusdigitalis.dev:9093
namespace: monitoring
helm_release: prometheus
k3s_node: k3s-master-01.inside.domusdigitalis.dev
ip: 10.50.1.120
storage: 5Gi NFS

Step 3: Add to dsec (app.env.age)

dsec edit d000 dev/app

Add this section (copy Grafana password from gopass):

# ============================================================================
# === k3s Infrastructure Monitoring ===
# ============================================================================
# Prometheus, Grafana, AlertManager deployed via kube-prometheus-stack
# Namespace: monitoring
# Storage: NFS on nas-01:/volume1/k3s
# ============================================================================

# --- Grafana ---
K3S_GRAFANA_ADMIN_USER=admin
K3S_GRAFANA_ADMIN_PASS=<paste from: gopass show -o v3/domains/d000/k3s/grafana>
K3S_GRAFANA_URL=http://{k3s-master-01-ip}:3000

# --- Prometheus ---
K3S_PROMETHEUS_URL=http://{k3s-master-01-ip}:9090

# --- AlertManager ---
K3S_ALERTMANAGER_URL=http://{k3s-master-01-ip}:9093

Step 4: Sync and Push

# Push gopass to git
gopass sync
# Push dsec to git
cd ~/.secrets && git add -A && git commit -m "feat(d000/dev): Add k3s monitoring credentials" && git push origin main

Retrieve Password (for Helm install)

# Option A: From gopass (interactive)
GRAFANA_PASS=$(gopass show -o v3/domains/d000/k3s/grafana)
# Option B: From dsec (automation)
eval "$(dsec source d000 dev/app)"
GRAFANA_PASS=$K3S_GRAFANA_ADMIN_PASS

Step 5: Validate Credentials with curl (From Workstation)

After Helm install, validate Grafana authentication from your workstation.

Test 1: Basic Authentication (Direct API)

GRAFANA_PASS=$(gopass show -o v3/domains/d000/k3s/grafana)
curl -sS -u "admin:${GRAFANA_PASS}" http://grafana.inside.domusdigitalis.dev:3000/api/org | jq .
Expected Response (HTTP 200 - Success)
{
  "id": 1,
  "name": "Main Org.",
  "address": {
    "address1": "",
    "address2": "",
    "city": "",
    "zipCode": "",
    "state": "",
    "country": ""
  }
}
Failed Response (HTTP 401 - Wrong Password)
{
  "message": "invalid username or password"
}

Test 2: Verbose Mode (See Full HTTP Transaction)

GRAFANA_PASS=$(gopass show -o v3/domains/d000/k3s/grafana)
curl -v -u "admin:${GRAFANA_PASS}" http://grafana.inside.domusdigitalis.dev:3000/api/org 2>&1 | grep -E "^[<>*]|HTTP/"
Expected Output (Annotated)
*   Trying 10.50.1.120:3000...
* Connected to grafana.inside.domusdigitalis.dev (10.50.1.120) port 3000
> GET /api/org HTTP/1.1                          # Request line
> Host: grafana.inside.domusdigitalis.dev:3000   # Target host
> Authorization: Basic YWRtaW46...(base64)...    # Credentials (base64 encoded)
> User-Agent: curl/8.x.x
> Accept: */*
>
< HTTP/1.1 200 OK                                # Success!
< Cache-Control: no-store                        # No caching (security)
< Content-Type: application/json                 # Response format
< X-Content-Type-Options: nosniff               # XSS protection
< X-Frame-Options: deny                         # Clickjacking protection
< X-Xss-Protection: 1; mode=block               # XSS protection

Test 3: Session Cookie Authentication (How Browser Works)

# Step 1: Login and capture session cookie
GRAFANA_PASS=$(gopass show -o v3/domains/d000/k3s/grafana)
curl -sS -c /tmp/grafana-cookies.txt \
  -H "Content-Type: application/json" \
  -d '{"user":"admin","password":"'"${GRAFANA_PASS}"'"}' \
  http://grafana.inside.domusdigitalis.dev:3000/login | jq .
Expected Response
{
  "message": "Logged in"
}
# Step 2: View the session cookie
cat /tmp/grafana-cookies.txt
Expected Output (Cookie File)
# Netscape HTTP Cookie File
grafana.inside.domusdigitalis.dev	FALSE	/	FALSE	0	grafana_session	abc123def456...
# Step 3: Use cookie for subsequent requests (no password needed)
curl -sS -b /tmp/grafana-cookies.txt http://grafana.inside.domusdigitalis.dev:3000/api/user | jq .
Expected Response (Your User Info)
{
  "id": 1,
  "email": "admin@localhost",
  "name": "",
  "login": "admin",
  "theme": "",
  "orgId": 1,
  "isGrafanaAdmin": true,
  "isDisabled": false,
  "isExternal": false,
  "authLabels": [],
  "updatedAt": "2026-02-23T...",
  "createdAt": "2026-02-23T...",
  "isGrafanaAdminExternallySynced": false
}

Test 4: Health Check Endpoints (No Auth Required)

# Grafana health
curl -sS http://grafana.inside.domusdigitalis.dev:3000/api/health | jq .
Expected Response
{
  "commit": "abc1234",
  "database": "ok",
  "version": "11.x.x"
}
# Prometheus health
curl -sS http://prometheus.inside.domusdigitalis.dev:9090/-/healthy
Expected Response
Prometheus Server is Healthy.
# AlertManager health
curl -sS http://alertmanager.inside.domusdigitalis.dev:9093/-/healthy
Expected Response
OK

Understanding HTTP Authentication

Method         | How It Works                                                                           | When Used
---------------|----------------------------------------------------------------------------------------|-------------------------------------------------------
Basic Auth     | Authorization: Basic base64(user:pass) header sent with every request                 | API calls, scripts, simple automation
Session Cookie | Login once → server returns Set-Cookie → browser sends cookie with subsequent requests | Browser sessions, interactive use
API Key        | Generate key in Grafana UI → Authorization: Bearer <key> header                       | Long-lived automation, CI/CD (more secure than Basic)
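
For the API Key row, a token generated in the Grafana UI (or via its service-account API) is sent as a Bearer header instead of Basic credentials. A sketch with a placeholder token value:

GRAFANA_TOKEN="glsa_xxxxxxxxxxxxxxxx"   # placeholder - paste the token generated in Grafana
curl -sS -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
  http://grafana.inside.domusdigitalis.dev:3000/api/org | jq .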

Troubleshooting with curl

# DNS resolution check
dig +short grafana.inside.domusdigitalis.dev
# Expected: 10.50.1.120

# Connection refused (service not running or firewall)
curl -v http://grafana.inside.domusdigitalis.dev:3000/api/health 2>&1 | grep -E "Connection refused|Failed to connect"

# Check if port-forward is running
ssh k3s-master-01.inside.domusdigitalis.dev "ss -tlnp | grep 3000"

# Check firewall
ssh k3s-master-01.inside.domusdigitalis.dev "sudo firewall-cmd --list-ports | grep 3000"
# Redirect loop (grafana.ini misconfigured)
curl -v -L http://grafana.inside.domusdigitalis.dev:3000/ 2>&1 | grep -E "Location:|HTTP/"

# Should NOT see multiple redirects to localhost
# Cleanup
rm -f /tmp/grafana-cookies.txt

4.2 Create Values File

cat > /tmp/prometheus-values.yaml << 'EOF'
# =============================================================================
# kube-prometheus-stack values
# Runbook: k3s-monitoring.adoc
# =============================================================================

# -----------------------------------------------------------------------------
# Global Settings
# -----------------------------------------------------------------------------
defaultRules:
  create: true
  rules:
    alertmanager: true
    etcd: false  # k3s uses SQLite, not etcd
    kubeApiserver: true
    kubeScheduler: true
    kubeControllerManager: true
    kubeProxy: true
    node: true

# -----------------------------------------------------------------------------
# Prometheus
# -----------------------------------------------------------------------------
prometheus:
  prometheusSpec:
    replicas: 1
    retention: 30d
    retentionSize: "45GB"

    # NFS Storage
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-client
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi

    # Resource limits
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 2Gi

    # Scrape all namespaces
    podMonitorNamespaceSelector: {}
    podMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}
    serviceMonitorSelector: {}

# -----------------------------------------------------------------------------
# Grafana
# -----------------------------------------------------------------------------
grafana:
  enabled: true
  adminPassword: "CHANGE_ME_BEFORE_INSTALL"

  # NFS Persistence
  persistence:
    enabled: true
    storageClassName: nfs-client
    size: 10Gi

  # Resource limits
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

  # Default dashboards
  defaultDashboardsEnabled: true
  defaultDashboardsTimezone: America/Los_Angeles

  # Sidecar for dashboard provisioning
  sidecar:
    dashboards:
      enabled: true
      searchNamespace: ALL
    datasources:
      enabled: true

  # Grafana.ini settings
  # IMPORTANT: For port-forward access, use explicit IP and serve_from_sub_path: false
  # The %(domain)s variable resolves to 'localhost' inside the container, breaking redirects
  grafana.ini:
    server:
      root_url: "http://10.50.1.120:3000"
      serve_from_sub_path: false
    security:
      admin_user: admin
      cookie_secure: false  # Set true when using HTTPS/Traefik
    users:
      auto_assign_org_role: Viewer
    auth.anonymous:
      enabled: false

# -----------------------------------------------------------------------------
# AlertManager
# -----------------------------------------------------------------------------
alertmanager:
  enabled: true
  alertmanagerSpec:
    replicas: 1

    # NFS Storage
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-client
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi

    # Resource limits
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
      limits:
        cpu: 200m
        memory: 256Mi

# -----------------------------------------------------------------------------
# Node Exporter (DaemonSet)
# -----------------------------------------------------------------------------
nodeExporter:
  enabled: true

# -----------------------------------------------------------------------------
# kube-state-metrics
# -----------------------------------------------------------------------------
kubeStateMetrics:
  enabled: true

# -----------------------------------------------------------------------------
# Prometheus Operator
# -----------------------------------------------------------------------------
prometheusOperator:
  enabled: true
  resources:
    requests:
      cpu: 100m
      memory: 100Mi
    limits:
      cpu: 200m
      memory: 200Mi

# -----------------------------------------------------------------------------
# k3s Specific: Disable components not present
# -----------------------------------------------------------------------------
kubeEtcd:
  enabled: false

kubeScheduler:
  enabled: false

kubeControllerManager:
  enabled: false

kubeProxy:
  enabled: false
EOF

4.3 Inject Grafana Password

Option A: Shell sed (from workstation)

# Get password from gopass
GRAFANA_PASS=$(gopass show -o v3/domains/d000/k3s/grafana)

# Replace placeholder in values file
sed -i "s/CHANGE_ME_BEFORE_INSTALL/$GRAFANA_PASS/" /tmp/prometheus-values.yaml

# Verify (should show actual password, not placeholder)
grep adminPassword /tmp/prometheus-values.yaml

Option B: vi/vim on k3s node (minimal tooling)

If editing directly on the node with vi:

# Quick vim setup (no vimrc needed)
:set nu rnu ic scs is hls ts=2 sw=2 et ai

# Substitute - use # delimiter (passwords often contain /)
:%s#CHANGE_ME_BEFORE_INSTALL#YOUR_ACTUAL_PASS#g
When passwords contain /, use alternative delimiters: #, |, or @.
Example: :%s|old|new|g works identically to :%s/old/new/g

Phase 5: Install kube-prometheus-stack

What Helm Deploys

helm install deploys these resources:

Resource Type  | Count | Purpose
---------------|-------|----------------------------------------------------------------
Deployment     | 3     | Grafana, Prometheus Operator, kube-state-metrics
StatefulSet    | 2     | Prometheus, AlertManager (ordered startup, stable network IDs)
DaemonSet      | 1     | Node Exporter (runs on EVERY node, like SNMP agent)
Service        | 5+    | ClusterIP services for internal communication
ServiceMonitor | 10+   | Auto-discovery rules for Prometheus scraping
ConfigMap      | 5+    | Dashboards, alerting rules, Grafana datasources
Secret         | 2+    | Grafana admin password, AlertManager config

StatefulSet vs Deployment: Prometheus and AlertManager run as StatefulSets because each replica needs a stable identity and its own persistent volume (Prometheus keeps its TSDB on disk).

The Operator Pattern

The prometheus-operator watches for ServiceMonitor/PrometheusRule resources and automatically updates Prometheus configuration.
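
The same pattern applies to alerting rules: apply a PrometheusRule and the operator folds it into the running Prometheus configuration without a restart. An illustrative sketch (the rule name and threshold are placeholders; the release: prometheus label is an assumption about the chart's default rule selector and may be unnecessary if that selector was relaxed):

cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-node-alerts
  namespace: monitoring
  labels:
    release: prometheus        # assumed label for the chart's default ruleSelector
spec:
  groups:
    - name: example.rules
      rules:
        - alert: NodeFilesystemAlmostFull
          expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) < 0.10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Filesystem on {{ $labels.instance }} is below 10% free"
EOF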

5.1 Dry Run (Optional - Will Fail)

Dry-run will fail with CRD errors on first install. This is expected.
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values /tmp/prometheus-values.yaml \
  --dry-run
Expected Error (first install)
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest:
resource mapping not found for name: "prometheus-kube-prometheus-alertmanager"
no matches for kind "Alertmanager" in version "monitoring.coreos.com/v1"
ensure CRDs are installed first

Why this happens: Dry-run validates against existing cluster state. CRDs (Custom Resource Definitions) like Alertmanager, Prometheus, ServiceMonitor don’t exist yet - they get installed during the actual Helm install.

Solution: Skip dry-run on first install, or proceed directly to 5.2. Dry-run works on subsequent upgrades.
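
To confirm whether the CRDs are already present (for example, before relying on --dry-run for a later upgrade), list them directly:

kubectl get crd | grep monitoring.coreos.com
# After a successful install this lists alertmanagers, prometheuses, prometheusrules,
# servicemonitors, podmonitors, ... all ending in .monitoring.coreos.com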

5.2 Install

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values /tmp/prometheus-values.yaml

5.3 Wait for Pods

kubectl -n monitoring get pods -w
Expected Output (all Running, 6 pods)
NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0   2/2     Running   0          67s
prometheus-grafana-6dcfb98c6-rjxsm                       3/3     Running   0          83s
prometheus-kube-prometheus-operator-664f9898f9-txw9b     1/1     Running   0          77s
prometheus-kube-state-metrics-7dfddfdf48-5cwlc           1/1     Running   0          77s
prometheus-prometheus-kube-prometheus-prometheus-0       2/2     Running   0          66s
prometheus-prometheus-node-exporter-kp67p                1/1     Running   0          78s
Grafana may show 2/3 briefly while sidecars initialize. Wait for 3/3.

5.4 Verify Password

Confirm your password was deployed correctly:

kubectl -n monitoring get secrets prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo

Should match what you stored in gopass: gopass show -o v3/domains/d000/k3s/grafana

5.5 Verify PVCs

kubectl -n monitoring get pvc
Expected Output
NAME                                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS
prometheus-prometheus-kube-prometheus-prom...    Bound    pvc-xxxx                                   50Gi       RWO            nfs-client
alertmanager-prometheus-kube-prometheus-al...    Bound    pvc-xxxx                                   5Gi        RWO            nfs-client
prometheus-grafana                               Bound    pvc-xxxx                                   10Gi       RWO            nfs-client

5.6 Verify NAS Storage

# Use short hostname (SSH config handles the rest)
ssh nas-01 "ls -la /volume1/k3s/"
Expected Output
drwxrwxrwx+ 1 root root 1012 Feb 22 21:56 .
drwxrwxrwx  1 root root    0 Feb 22 21:56 monitoring-alertmanager-prometheus-kube-prometheus-...
drwxrwxrwx  1  472  472   52 Feb 22 22:01 monitoring-prometheus-grafana-pvc-...
drwxrwxrwx  1 root root   26 Feb 22 21:56 monitoring-prometheus-prometheus-kube-prometheus-...
Directory names follow pattern: <namespace>-<pvc-name>-<pvc-uuid>

Phase 6: Access Services

6.1 Get Service IPs

kubectl -n monitoring get svc | awk '{print $1, $3, $5}'
Expected Output
NAME CLUSTER-IP PORT(S)
alertmanager-operated None 9093/TCP,9094/TCP,9094/UDP
prometheus-grafana 10.43.18.82 80/TCP
prometheus-kube-prometheus-alertmanager 10.43.69.110 9093/TCP,8080/TCP
prometheus-kube-prometheus-operator 10.43.162.126 443/TCP
prometheus-kube-prometheus-prometheus 10.43.205.102 9090/TCP,8080/TCP
prometheus-kube-state-metrics 10.43.147.44 8080/TCP
prometheus-operated None 9090/TCP
prometheus-prometheus-node-exporter 10.43.228.32 9100/TCP

6.2 Open Firewall Ports

Rocky Linux 9 has firewalld enabled. Port-forward binds to 0.0.0.0 but firewall blocks external access.

sudo firewall-cmd --add-port=3000/tcp --permanent   # Grafana
sudo firewall-cmd --add-port=9090/tcp --permanent   # Prometheus
sudo firewall-cmd --add-port=9093/tcp --permanent   # AlertManager
sudo firewall-cmd --add-port=9100/tcp --permanent   # Node Exporter (metrics scraping)
sudo firewall-cmd --reload
Expected Output
success
success
success
success
success
Port 9100 is required for Prometheus to scrape Node Exporter metrics. Without it, you’ll see TargetDown alerts and "no route to host" errors in Prometheus targets.

Verify from workstation:

nc -zv 10.50.1.120 3000
# Expected: Connection to 10.50.1.120 3000 port [tcp/hbci] succeeded!
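
Once 9100 is open, the node-exporter target should report as healthy. A quick check through the Prometheus targets API (assumes the job label is node-exporter, the kube-prometheus-stack default; adjust if renamed):

curl -sS 'http://10.50.1.120:9090/api/v1/targets' \
  | jq -r '.data.activeTargets[] | select(.labels.job=="node-exporter") | "\(.labels.instance)  \(.health)"'
# Expected: one line per node, ending in "up"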

6.3 Port Forward (Quick Access)

# Grafana (localhost:3000)
kubectl -n monitoring port-forward svc/prometheus-grafana 3000:80 --address 0.0.0.0 &

# Prometheus (localhost:9090)
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 --address 0.0.0.0 &

# AlertManager (localhost:9093)
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093 --address 0.0.0.0 &
After helm upgrade, pods restart and port-forwards lose connection. Restart them:
# Kill old port-forwards
kill %1 %2 %3

# Restart all
kubectl -n monitoring port-forward svc/prometheus-grafana 3000:80 --address 0.0.0.0 &
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 --address 0.0.0.0 &
kubectl -n monitoring port-forward svc/prometheus-kube-prometheus-alertmanager 9093:9093 --address 0.0.0.0 &

6.4 Persistent Port-Forward (Systemd Services)

Background port-forwards (&) die when SSH disconnects. Use systemd for persistence.

Create Grafana Service

cat << 'EOF' | sudo tee /etc/systemd/system/grafana-pf.service
[Unit]
Description=Grafana Port Forward
After=network.target k3s.service
Wants=k3s.service

[Service]
Type=simple
Environment="KUBECONFIG=/etc/rancher/k3s/k3s.yaml"
ExecStart=/usr/local/bin/kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 --address=0.0.0.0
Restart=always
RestartSec=10
User=root

[Install]
WantedBy=multi-user.target
EOF

Create Prometheus Service

cat << 'EOF' | sudo tee /etc/systemd/system/prometheus-pf.service
[Unit]
Description=Prometheus Port Forward
After=network.target k3s.service
Wants=k3s.service

[Service]
Type=simple
Environment="KUBECONFIG=/etc/rancher/k3s/k3s.yaml"
ExecStart=/usr/local/bin/kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090 --address=0.0.0.0
Restart=always
RestartSec=10
User=root

[Install]
WantedBy=multi-user.target
EOF

Create Alertmanager Service

cat << 'EOF' | sudo tee /etc/systemd/system/alertmanager-pf.service
[Unit]
Description=Alertmanager Port Forward
After=network.target k3s.service
Wants=k3s.service

[Service]
Type=simple
Environment="KUBECONFIG=/etc/rancher/k3s/k3s.yaml"
ExecStart=/usr/local/bin/kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093 --address=0.0.0.0
Restart=always
RestartSec=10
User=root

[Install]
WantedBy=multi-user.target
EOF

Enable Services

sudo systemctl daemon-reload
sudo systemctl enable --now grafana-pf prometheus-pf alertmanager-pf

Verify

systemctl status grafana-pf prometheus-pf alertmanager-pf --no-pager | awk '/Active:/{print prev, $0} {prev=$0}'
Expected Output
● grafana-pf.service      Active: active (running)
● prometheus-pf.service   Active: active (running)
● alertmanager-pf.service Active: active (running)

6.5 Access URLs

Service      | URL              | Credentials
-------------|------------------|---------------------------------------------------------------------------
Grafana      | 10.50.1.120:3000 | admin / gopass show -c v3/domains/d000/k3s/grafana (copies to clipboard)
Prometheus   | 10.50.1.120:9090 | None
AlertManager | 10.50.1.120:9093 | None

Phase 7: Traefik Ingress with Vault PKI (Production Access)

For production, expose via Traefik IngressRoute with TLS certificates from internal Vault PKI.

Why Vault PKI, not Let’s Encrypt?

  • inside.domusdigitalis.dev is an internal domain (not publicly resolvable)

  • Let’s Encrypt ACME challenges require public DNS

  • Vault PKI provides internal CA trust chain

  • Browser/OS must trust DOMUS-ROOT-CA for valid cert display

7.0 Security Decision: Individual Certs vs Wildcard

Approach                                    | Pros                                                                                                    | Cons
--------------------------------------------|---------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------
Individual Certs (recommended)              | Blast radius limited to single service if key compromised; follows least-privilege; better audit trail | More certs to manage; more Vault operations
Wildcard Cert (*.inside.domusdigitalis.dev) | Single cert for all services; simpler management                                                        | Single key compromise exposes ALL subdomains; violates least-privilege; broader attack surface

This runbook uses individual certs per service.

7.1 Issue Certificates from Vault PKI

From workstation (where Vault CLI is configured):

Grafana Certificate

vault write -format=json pki_int/issue/domus-client \
  common_name="grafana.inside.domusdigitalis.dev" \
  ttl="8760h" > /tmp/grafana-cert.json
jq -r '.data.certificate' /tmp/grafana-cert.json > /tmp/grafana.crt
jq -r '.data.private_key' /tmp/grafana-cert.json > /tmp/grafana.key
jq -r '.data.ca_chain[]' /tmp/grafana-cert.json >> /tmp/grafana.crt

Verify:

openssl x509 -in /tmp/grafana.crt -noout -subject -issuer -dates | head -4

Prometheus Certificate

vault write -format=json pki_int/issue/domus-client \
  common_name="prometheus.inside.domusdigitalis.dev" \
  ttl="8760h" > /tmp/prometheus-cert.json
jq -r '.data.certificate' /tmp/prometheus-cert.json > /tmp/prometheus.crt
jq -r '.data.private_key' /tmp/prometheus-cert.json > /tmp/prometheus.key
jq -r '.data.ca_chain[]' /tmp/prometheus-cert.json >> /tmp/prometheus.crt

AlertManager Certificate

vault write -format=json pki_int/issue/domus-client \
  common_name="alertmanager.inside.domusdigitalis.dev" \
  ttl="8760h" > /tmp/alertmanager-cert.json
jq -r '.data.certificate' /tmp/alertmanager-cert.json > /tmp/alertmanager.crt
jq -r '.data.private_key' /tmp/alertmanager-cert.json > /tmp/alertmanager.key
jq -r '.data.ca_chain[]' /tmp/alertmanager-cert.json >> /tmp/alertmanager.crt

7.2 Transfer Certificates to k3s Node

scp /tmp/grafana.crt /tmp/grafana.key k3s-master-01.inside.domusdigitalis.dev:/tmp/
scp /tmp/prometheus.crt /tmp/prometheus.key k3s-master-01.inside.domusdigitalis.dev:/tmp/
scp /tmp/alertmanager.crt /tmp/alertmanager.key k3s-master-01.inside.domusdigitalis.dev:/tmp/

7.3 Create TLS Secrets in Kubernetes

On k3s-master-01:

kubectl -n monitoring create secret tls grafana-tls \
  --cert=/tmp/grafana.crt \
  --key=/tmp/grafana.key
kubectl -n monitoring create secret tls prometheus-tls \
  --cert=/tmp/prometheus.crt \
  --key=/tmp/prometheus.key
kubectl -n monitoring create secret tls alertmanager-tls \
  --cert=/tmp/alertmanager.crt \
  --key=/tmp/alertmanager.key

Verify:

kubectl -n monitoring get secrets | grep -E "tls"
Expected output
alertmanager-tls   kubernetes.io/tls   2      10s
grafana-tls        kubernetes.io/tls   2      15s
prometheus-tls     kubernetes.io/tls   2      12s

7.4 Create IngressRoutes

cat <<'EOF' | kubectl apply -f -
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: grafana
  namespace: monitoring
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`grafana.inside.domusdigitalis.dev`)
      kind: Rule
      services:
        - name: prometheus-grafana
          port: 80
  tls:
    secretName: grafana-tls
EOF
cat <<'EOF' | kubectl apply -f -
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: prometheus
  namespace: monitoring
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`prometheus.inside.domusdigitalis.dev`)
      kind: Rule
      services:
        - name: prometheus-kube-prometheus-prometheus
          port: 9090
  tls:
    secretName: prometheus-tls
EOF
cat <<'EOF' | kubectl apply -f -
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`alertmanager.inside.domusdigitalis.dev`)
      kind: Rule
      services:
        - name: prometheus-kube-prometheus-alertmanager
          port: 9093
  tls:
    secretName: alertmanager-tls
EOF

Verify IngressRoutes exist:

kubectl get ingressroute -n monitoring
Expected output
NAME           AGE
alertmanager   5s
grafana        15s
prometheus     10s

Verify hostnames with custom-columns:

kubectl get ingressroute -n monitoring -o custom-columns=NAME:.metadata.name,HOST:.spec.routes[0].match
Expected output
NAME           HOST
alertmanager   Host(`alertmanager.inside.domusdigitalis.dev`)
grafana        Host(`grafana.inside.domusdigitalis.dev`)
prometheus     Host(`prometheus.inside.domusdigitalis.dev`)

7.5 Verify HTTPS Access

From workstation (must trust DOMUS-ROOT-CA):

curl -sS -o /dev/null -w "HTTP: %{http_code}\n" https://grafana.inside.domusdigitalis.dev
Expected output
HTTP: 200
echo | openssl s_client -connect grafana.inside.domusdigitalis.dev:443 2>/dev/null | \
  openssl x509 -noout -subject -issuer
Expected output
subject=CN=grafana.inside.domusdigitalis.dev
issuer=CN=DOMUS-ISSUING-CA

7.6 DNS Records (if not already added)

DNS should already exist from Phase 4. Verify:

for h in grafana prometheus alertmanager; do
  host ${h}.inside.domusdigitalis.dev | awk '{print $1, $NF}'
done
Expected output
grafana.inside.domusdigitalis.dev 10.50.1.120
prometheus.inside.domusdigitalis.dev 10.50.1.120
alertmanager.inside.domusdigitalis.dev 10.50.1.120

If missing, add via BIND:

ssh bind-01 "sudo nsupdate -l << 'EOF'
zone inside.domusdigitalis.dev
update add grafana.inside.domusdigitalis.dev. 3600 A 10.50.1.120
update add prometheus.inside.domusdigitalis.dev. 3600 A 10.50.1.120
update add alertmanager.inside.domusdigitalis.dev. 3600 A 10.50.1.120
send
EOF"

7.7 Update Access URLs

After enabling Traefik ingress, update the access table:

Service      | URL                                    | Credentials
-------------|----------------------------------------|----------------------------------------------------
Grafana      | grafana.inside.domusdigitalis.dev      | admin / gopass show -c v3/domains/d000/k3s/grafana
Prometheus   | prometheus.inside.domusdigitalis.dev   | None
AlertManager | alertmanager.inside.domusdigitalis.dev | None

Port-forward services (Phase 6) can be disabled once Traefik ingress is verified working.
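
For example, stop and disable the Phase 6 systemd units in one step, and optionally close the direct-access ports opened in 6.2 (keep 9100 for node-exporter scraping):

sudo systemctl disable --now grafana-pf prometheus-pf alertmanager-pf

# Optional: close the direct-access ports (Traefik serves 443 instead)
sudo firewall-cmd --remove-port=3000/tcp --permanent
sudo firewall-cmd --remove-port=9090/tcp --permanent
sudo firewall-cmd --remove-port=9093/tcp --permanent
sudo firewall-cmd --reload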

Phase 8: Custom Dashboards

8.1 Import via ConfigMap

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-dashboard-example
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  my-dashboard.json: |
    {
      "title": "Custom Dashboard",
      "uid": "custom-example",
      "version": 1,
      "panels": []
    }
EOF

Import these community dashboards via the Grafana UI (Dashboards → Import → enter the ID):

ID    | Name                     | Description
------|--------------------------|----------------------------
1860  | Node Exporter Full       | Comprehensive host metrics
13502 | Mini Kubernetes Cluster  | Lightweight k8s overview
15757 | Kubernetes Views / Pods  | Pod-level metrics
14981 | CoreDNS                  | DNS query metrics
15759 | Kubernetes Views / Nodes | Node resource usage
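
Alternatively, the same dashboards can be provisioned without the UI by fetching the JSON from grafana.com and dropping it into a ConfigMap that the Phase 4 sidecar picks up. A sketch for ID 1860 (the download URL format is an assumption about grafana.com's API; imported JSON may still need its datasource input mapped to the bundled Prometheus datasource):

curl -sSL 'https://grafana.com/api/dashboards/1860/revisions/latest/download' \
  -o /tmp/node-exporter-full.json
kubectl -n monitoring create configmap node-exporter-full \
  --from-file=/tmp/node-exporter-full.json
kubectl -n monitoring label configmap node-exporter-full grafana_dashboard="1"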

Phase 9: AlertManager Configuration

9.1 Configure Slack Alerts (Example)

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prometheus-kube-prometheus-alertmanager
  namespace: monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'default'
      routes:
        - match:
            severity: critical
          receiver: 'critical-alerts'
    receivers:
      - name: 'default'
        # No action for default
      - name: 'critical-alerts'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#alerts'
            send_resolved: true
EOF
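
To exercise the routing tree without waiting for a real failure, push a synthetic alert to AlertManager's v2 API; it should appear in the UI and, if the webhook URL is real, in the #alerts channel after group_wait. The severity label below is chosen to match the critical-alerts route:

curl -sS -XPOST 'http://alertmanager.inside.domusdigitalis.dev:9093/api/v2/alerts' \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "SyntheticTest", "severity": "critical"}, "annotations": {"summary": "Manually fired test alert"}}]'

# Confirm AlertManager received it
curl -sS 'http://alertmanager.inside.domusdigitalis.dev:9093/api/v2/alerts' | jq -r '.[].labels.alertname'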

Phase 10: Validation

10.1 Comprehensive Health Check

echo "=== Monitoring Stack Health Check ==="

echo -e "\n[1] Pods Status:"
kubectl -n monitoring get pods -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount'

echo -e "\n[2] PVCs:"
kubectl -n monitoring get pvc -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,CAPACITY:.status.capacity.storage'

echo -e "\n[3] Services:"
kubectl -n monitoring get svc -o custom-columns='NAME:.metadata.name,TYPE:.spec.type,CLUSTER-IP:.spec.clusterIP,PORT:.spec.ports[0].port'

echo -e "\n[4] Prometheus Targets:"
kubectl -n monitoring exec -it prometheus-prometheus-kube-prometheus-prometheus-0 -- wget -qO- http://localhost:9090/api/v1/targets | jq -r '.data.activeTargets | length'

echo -e "\n[5] Grafana Datasources:"
kubectl -n monitoring exec -it deploy/prometheus-grafana -c grafana -- curl -s http://localhost:3000/api/datasources | jq -r '.[].name'

10.2 Test Metrics Collection

# Query node CPU usage
kubectl -n monitoring exec -it prometheus-prometheus-kube-prometheus-prometheus-0 -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=node_cpu_seconds_total' | jq -r '.data.result | length'

Troubleshooting

Pods Stuck in Pending

# Check events
kubectl -n monitoring describe pod <pod-name> | grep -A10 Events

# Common cause: PVC not binding
kubectl -n monitoring get pvc
kubectl describe pvc <pvc-name>

NFS Mount Issues

# Test NFS from node
ssh k3s-master-01.inside.domusdigitalis.dev
sudo mount -t nfs 10.50.1.70:/volume1/k3s /mnt/test
ls /mnt/test
sudo umount /mnt/test

Grafana Login Issues

# Copy password to clipboard (secure - no screen exposure)
gopass show -c v3/domains/d000/k3s/grafana

# Or show password (use -o only in scripts, not interactively)
gopass show -o v3/domains/d000/k3s/grafana

# Get password from dsec
eval "$(dsec source d000 dev/app)" && echo $K3S_GRAFANA_ADMIN_PASS

# Get current password from k8s secret (what Helm deployed)
kubectl -n monitoring get secret prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d; echo

# Reset password (updates k8s only - also update gopass/dsec!)
kubectl -n monitoring exec -it deploy/prometheus-grafana -c grafana -- grafana-cli admin reset-admin-password <new-password>

Grafana Redirects to localhost (Can’t Connect)

Symptom: Browser shows "Unable to connect" or curl shows redirect to localhost:

< Location: http://localhost:3000/grafana/

Cause: The %(domain)s variable in root_url resolves to localhost inside the container.

Fix: Update values with explicit IP:

vi /tmp/prometheus-values.yaml

# Change grafana.ini section to:
  grafana.ini:
    server:
      root_url: "http://{k3s-master-01-ip}:3000"
      serve_from_sub_path: false
    security:
      cookie_secure: false

# Upgrade
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values /tmp/prometheus-values.yaml

# Restart port-forwards (pods restarted)
kill %1 %2 %3
kubectl -n monitoring port-forward svc/prometheus-grafana 3000:80 --address 0.0.0.0 &

Prometheus Not Scraping

# Check ServiceMonitor resources
kubectl get servicemonitors -A

# Check Prometheus config
kubectl -n monitoring exec -it prometheus-prometheus-kube-prometheus-prometheus-0 -- cat /etc/prometheus/prometheus.yml | head -50
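
To narrow the output to targets that are failing, filter the targets API for anything not reporting up (field names follow the Prometheus v1 API):

kubectl -n monitoring exec -it prometheus-prometheus-kube-prometheus-prometheus-0 -- \
  wget -qO- http://localhost:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | select(.health != "up") | "\(.labels.job)  \(.scrapeUrl)  \(.lastError)"'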

Upgrade Stack

# Update repo
helm repo update

# Check available versions
helm search repo prometheus-community/kube-prometheus-stack --versions | head -5

# Upgrade
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values /tmp/prometheus-values.yaml

Uninstall

This removes the monitoring stack and its Kubernetes resources. Because the nfs-client StorageClass was installed with archiveOnDelete=true, data from released PVCs is archived on the NAS rather than deleted.
helm uninstall prometheus -n monitoring
kubectl delete namespace monitoring

Appendix: Vault PKI Certificate for Grafana

Replace the default Grafana self-signed certificate with a Vault-issued certificate to eliminate browser warnings and enable secure HTTPS access.

Issue Certificate from Vault

From workstation:

vault write -format=json pki_int/issue/domus-client \
  common_name="grafana.inside.domusdigitalis.dev" \
  ttl="8760h" > /tmp/grafana-cert.json

Extract Certificate Components

jq -r '.data.certificate' /tmp/grafana-cert.json > /tmp/grafana.crt
jq -r '.data.private_key' /tmp/grafana-cert.json > /tmp/grafana.key
jq -r '.data.ca_chain[]' /tmp/grafana-cert.json > /tmp/grafana-ca.crt

Verify Certificate

openssl x509 -in /tmp/grafana.crt -noout -subject -issuer -dates

Expected:

subject=CN=grafana.inside.domusdigitalis.dev
issuer=CN=DOMUS-ISSUING-CA
notBefore=...
notAfter=... (1 year from now)

Transfer to k3s Node

From workstation:

scp /tmp/grafana.crt /tmp/grafana.key /tmp/grafana-ca.crt k3s-master-01.inside.domusdigitalis.dev:/tmp/

Create Kubernetes TLS Secret

On k3s-master-01:

kubectl -n monitoring create secret tls grafana-tls-vault \
  --cert=/tmp/grafana.crt \
  --key=/tmp/grafana.key \
  --dry-run=client -o yaml | kubectl apply -f -

Update Helm Values for HTTPS

Add TLS configuration to your values file:

grafana:
  # ... existing settings ...

  # Enable HTTPS with Vault certificate
  grafana.ini:
    server:
      protocol: https
      cert_file: /etc/grafana/tls/tls.crt
      cert_key: /etc/grafana/tls/tls.key
      root_url: "https://grafana.inside.domusdigitalis.dev"
    security:
      cookie_secure: true

  # Mount TLS secret
  extraSecretMounts:
    - name: grafana-tls
      secretName: grafana-tls-vault
      mountPath: /etc/grafana/tls
      readOnly: true

Upgrade Helm Release

helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --values /tmp/prometheus-values.yaml

Update Port-Forward for HTTPS

If using port-forward (not Traefik), update the systemd service:

cat << 'EOF' | sudo tee /etc/systemd/system/grafana-pf.service
[Unit]
Description=Grafana Port Forward (HTTPS)
After=network.target k3s.service
Wants=k3s.service

[Service]
Type=simple
Environment="KUBECONFIG=/etc/rancher/k3s/k3s.yaml"
ExecStart=/usr/local/bin/kubectl port-forward -n monitoring svc/prometheus-grafana 3000:443 --address=0.0.0.0
Restart=always
RestartSec=10
User=root

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl restart grafana-pf

Verify HTTPS

echo | openssl s_client -connect grafana.inside.domusdigitalis.dev:3000 2>/dev/null | openssl x509 -noout -subject -issuer

Expected:

subject=CN=grafana.inside.domusdigitalis.dev
issuer=CN=DOMUS-ISSUING-CA

Browser must trust the DOMUS-ROOT-CA for the certificate to show as valid. Import the root CA into your browser/OS trust store if not already done.