Kubernetes Advanced Reference
Preface
This document assumes familiarity with basic Kubernetes concepts. It covers the internals, edge cases, and production patterns that separate operators from experts.
The goal is precision and depth, not breadth. Each section stands alone as a reference.
Part I: API Machinery
The Kubernetes API Model
Every Kubernetes object is defined by three coordinates:
| Coordinate | Description | Example |
|---|---|---|
| Group | Logical collection of related kinds | `apps` |
| Version | API maturity level | `v1` |
| Kind | The object type | `Deployment` |
The combination forms a GroupVersionKind (GVK):
apps/v1/Deployment
core/v1/Pod # core group is empty string
batch/v1/Job
networking.k8s.io/v1/NetworkPolicy
GroupVersionResource (GVR)
While GVK identifies the type, GVR identifies the REST path:
GVK: apps/v1/Deployment
GVR: apps/v1/deployments # plural, lowercase
REST path: /apis/apps/v1/namespaces/{ns}/deployments/{name}
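Path construction from a GVR is mechanical; a small sketch (`rest_path` is an illustrative helper, not a client-go function):

```python
def rest_path(group: str, version: str, resource: str,
              namespace: str = "", name: str = "") -> str:
    """Build the API server REST path for a GroupVersionResource.

    The core group (empty string) lives under /api; all other groups
    live under /apis/{group}.
    """
    prefix = f"/api/{version}" if group == "" else f"/apis/{group}/{version}"
    parts = [prefix]
    if namespace:
        parts += ["namespaces", namespace]
    parts.append(resource)
    if name:
        parts.append(name)
    return "/".join(parts)

print(rest_path("apps", "v1", "deployments", namespace="prod", name="nginx"))
# /apis/apps/v1/namespaces/prod/deployments/nginx
print(rest_path("", "v1", "pods", namespace="default"))
# /api/v1/namespaces/default/pods
```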
The mapping between Kind and Resource is not always predictable:
| Kind | Resource |
|---|---|
| Endpoints | endpoints (same) |
| NetworkPolicy | networkpolicies |
| Ingress | ingresses |
Discovery and REST Mapping
The API server exposes discovery endpoints:
# List all API groups
kubectl get --raw /apis | jq '.groups[].name'
# List resources in a group
kubectl get --raw /apis/apps/v1 | jq '.resources[].name'
# Get specific resource schema
kubectl get --raw /apis/apps/v1 | jq '.resources[] | select(.name=="deployments")'
Watches and Informers
The watch mechanism is fundamental to Kubernetes' reactive architecture.
Watch Protocol:
GET /api/v1/namespaces/default/pods?watch=true&resourceVersion=12345
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
{"type":"ADDED","object":{...}}
{"type":"MODIFIED","object":{...}}
{"type":"DELETED","object":{...}}
{"type":"BOOKMARK","object":{"metadata":{"resourceVersion":"12350"}}}
Event Types:
| Type | Meaning |
|---|---|
ADDED |
Object created or first seen in watch |
MODIFIED |
Object spec or status changed |
DELETED |
Object removed |
BOOKMARK |
Checkpoint for resourceVersion (no object change) |
ERROR |
Watch failed, must re-list |
Informer Architecture:
An informer combines:
- Reflector: watches the API server, maintains the local cache
- Delta FIFO: queue of changes (adds, updates, deletes)
- Indexer: in-memory store with custom indexes
- Event Handlers: OnAdd, OnUpdate, OnDelete callbacks
// Conceptual informer usage (client-go)
informer := cache.NewSharedIndexInformer(
&cache.ListWatch{
ListFunc: func(options metav1.ListOptions) (runtime.Object, error) { ... },
WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) { ... },
},
&v1.Pod{},
resyncPeriod,
cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc},
)
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) { ... },
UpdateFunc: func(old, new interface{}) { ... },
DeleteFunc: func(obj interface{}) { ... },
})
Resource Versions and Consistency
Every Kubernetes object has a resourceVersion field - an opaque string representing the object’s version in etcd.
Consistency Guarantees:
| Operation | Consistency | resourceVersion Behavior |
|---|---|---|
| GET | Serializable (latest) | Returns current version |
| GET ?resourceVersion=0 | Any (from cache) | May return stale data |
| LIST | Serializable | Returns consistent snapshot |
| LIST ?resourceVersion=X | At least as fresh as X | May miss very recent changes |
| WATCH ?resourceVersion=X | Guaranteed delivery from X | Will see all changes after X |
Optimistic Concurrency:
Updates must include the current resourceVersion:
apiVersion: v1
kind: ConfigMap
metadata:
name: example
resourceVersion: "12345" # Must match current version
data:
key: newvalue
If the version doesn’t match, the API server returns 409 Conflict.
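When a 409 comes back, clients typically re-read the object and re-apply the change. A minimal sketch against a toy in-memory store (`Store`, `Conflict`, and `update_with_retry` are illustrative names, not client-go API):

```python
import copy

class Conflict(Exception):
    """Stand-in for an HTTP 409 from the API server."""

class Store:
    """Toy API server: accepts an update only when the submitted
    resourceVersion matches the stored one, then bumps the version."""
    def __init__(self, obj):
        self.obj = obj
    def get(self):
        return copy.deepcopy(self.obj)
    def update(self, obj):
        if obj["resourceVersion"] != self.obj["resourceVersion"]:
            raise Conflict("409: resourceVersion mismatch")
        self.obj = {**copy.deepcopy(obj),
                    "resourceVersion": str(int(obj["resourceVersion"]) + 1)}
        return copy.deepcopy(self.obj)

def update_with_retry(store, mutate, attempts=5):
    """The standard client pattern: re-GET and re-apply the change on 409."""
    for _ in range(attempts):
        obj = store.get()            # fresh copy with the current resourceVersion
        mutate(obj)
        try:
            return store.update(obj)
        except Conflict:
            continue                 # a concurrent writer won; fetch and retry
    raise RuntimeError("gave up after repeated conflicts")

store = Store({"resourceVersion": "12345", "data": {"key": "oldvalue"}})
result = update_with_retry(store, lambda o: o["data"].update(key="newvalue"))
```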
Part II: etcd Internals
Data Model
etcd stores Kubernetes data as key-value pairs:
Key: /registry/pods/default/nginx-abc123
Value: protobuf-encoded Pod object
Key: /registry/deployments/default/nginx
Value: protobuf-encoded Deployment object
Key: /registry/secrets/kube-system/bootstrap-token-abc123
Value: protobuf-encoded Secret (encrypted at rest if configured)
Key Structure:
/registry/{resource}/{namespace}/{name} # namespaced resources
/registry/{resource}/{name} # cluster-scoped resources
/registry/ranges/{type} # allocation ranges
MVCC and Revisions
etcd uses Multi-Version Concurrency Control (MVCC):
# Every write increments the global revision
etcdctl put /test/key1 "value1" # revision 100
etcdctl put /test/key2 "value2" # revision 101
etcdctl put /test/key1 "value2" # revision 102
# Read at specific revision
etcdctl get /test/key1 --rev=100 # returns "value1"
etcdctl get /test/key1 --rev=102 # returns "value2"
# Watch from revision
etcdctl watch /test/key1 --rev=100 # sees all changes since 100
A Kubernetes object's resourceVersion corresponds to the etcd revision (the object's mod_revision) at which it was last written.
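The revision semantics above can be modeled in a few lines (`MVCC` here is a toy in-memory model, not the etcd implementation):

```python
class MVCC:
    """Toy multi-version store: every write bumps one global revision,
    and reads/watches can target any retained revision."""
    def __init__(self):
        self.rev = 99               # start so the first write lands at 100
        self.history = []           # append-only list of (rev, key, value)

    def put(self, key, value):
        self.rev += 1
        self.history.append((self.rev, key, value))
        return self.rev

    def get(self, key, rev=None):
        """Return the value of key as of the given revision (latest if None)."""
        rev = self.rev if rev is None else rev
        for r, k, v in reversed(self.history):
            if k == key and r <= rev:
                return v
        return None

    def watch(self, key, rev):
        """Replay all changes to key at or after the given revision."""
        return [(r, v) for r, k, v in self.history if k == key and r >= rev]

db = MVCC()
db.put("/test/key1", "value1")   # revision 100
db.put("/test/key2", "value2")   # revision 101
db.put("/test/key1", "value2")   # revision 102
assert db.get("/test/key1", rev=100) == "value1"
assert db.get("/test/key1") == "value2"
assert db.watch("/test/key1", 100) == [(100, "value1"), (102, "value2")]
```

Compaction in this model would simply drop history entries below a revision; a watch starting before the compaction point can no longer be replayed, which is exactly the "required revision has been compacted" error below.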
Compaction
etcd retains all revisions until compacted:
# Check current revision
etcdctl endpoint status --write-out=table
# Compact to revision (deletes history before this point)
etcdctl compact 150000
# After compaction, watches before revision 150000 fail with:
# "mvcc: required revision has been compacted"
Automatic Compaction (Kubernetes default):
# kube-apiserver flag
--etcd-compaction-interval=5m
Defragmentation
Compaction marks space as free but doesn’t reclaim it. Defragmentation reclaims disk space:
# Check fragmentation (db size vs db size in use)
etcdctl endpoint status --write-out=table
# Defragment (blocks writes, run on one member at a time)
etcdctl defrag --endpoints=https://etcd-0:2379
# In production, defrag during maintenance window
for ep in etcd-0 etcd-1 etcd-2; do
etcdctl defrag --endpoints=https://${ep}:2379
sleep 10
done
Backup and Restore
# Snapshot (consistent point-in-time backup)
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify snapshot
etcdctl snapshot status /backup/etcd-20260222.db --write-out=table
# Restore (creates new data directory)
etcdctl snapshot restore /backup/etcd-20260222.db \
--name=etcd-0 \
--initial-cluster=etcd-0=https://10.0.0.10:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://10.0.0.10:2380 \
--data-dir=/var/lib/etcd-restored
Monitoring etcd
Critical metrics:
| Metric | Meaning |
|---|---|
| etcd_mvcc_db_total_size_in_bytes | Total database size |
| etcd_mvcc_db_total_size_in_use_in_bytes | Actual data size (difference = fragmentation) |
| etcd_disk_wal_fsync_duration_seconds | Write-ahead log sync latency (should be <10ms) |
| etcd_disk_backend_commit_duration_seconds | Backend commit latency (should be <25ms) |
| etcd_server_proposals_failed_total | Failed Raft proposals (indicates cluster issues) |
| etcd_server_leader_changes_seen_total | Leader elections (frequent = instability) |
Part III: Scheduler Deep Dive
Scheduling Framework
The scheduler uses a plugin-based framework with defined extension points:
Extension Points (in order):
| Phase | Purpose | Example Plugins |
|---|---|---|
| PreFilter | Pre-process or check pod info | PodTopologySpread |
| Filter | Exclude nodes that cannot run the pod | NodeAffinity, TaintToleration, NodePorts |
| PostFilter | Called if no nodes pass Filter | DefaultPreemption |
| PreScore | Pre-process for scoring | InterPodAffinity |
| Score | Rank remaining nodes | NodeResourcesBalancedAllocation, ImageLocality |
| NormalizeScore | Normalize scores to 0-100 | (built-in) |
| Reserve | Reserve resources for the pod | VolumeBinding |
| Permit | Approve/deny/wait | (custom plugins) |
| PreBind | Pre-binding operations | VolumeBinding |
| Bind | Bind pod to node | DefaultBinder |
| PostBind | Post-binding cleanup | (informational) |
Predicates (Filter Phase)
Predicates determine if a node CAN run a pod:
PodFitsResources:
node.Allocatable.cpu >= sum(pod.containers[*].requests.cpu)
node.Allocatable.memory >= sum(pod.containers[*].requests.memory)
PodFitsHostPorts:
for port in pod.spec.containers[*].ports:
if port.hostPort != 0:
port.hostPort not in node.usedHostPorts
PodMatchNodeSelector:
pod.spec.nodeSelector matches node.labels
TaintToleration:
for taint in node.taints:
exists pod.spec.tolerations[*] that tolerates taint
CheckNodeMemoryPressure (legacy; now expressed as the node.kubernetes.io/memory-pressure taint and handled by TaintToleration):
node.conditions.MemoryPressure != True
Priorities (Score Phase)
Priorities rank nodes that passed predicates:
LeastRequestedPriority:
score = ((capacity - sum(requests)) / capacity) * 100
# Prefers nodes with more available resources
BalancedResourceAllocation:
cpuFraction = requested.cpu / allocatable.cpu
memFraction = requested.memory / allocatable.memory
score = 100 - abs(cpuFraction - memFraction) * 100
# Prefers balanced CPU/memory usage
ImageLocalityPriority:
if image exists on node: score += size(image) / totalImageSize * 100
# Prefers nodes that already have the image
InterPodAffinityPriority:
for each matching pod on node:
score += weight from affinity rule
# Implements pod affinity/anti-affinity
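The two resource-scoring formulas above translate directly to code (function names are descriptive, not the plugin API):

```python
def least_requested(capacity: float, requested: float) -> float:
    """LeastRequested: prefer nodes with more free capacity."""
    return (capacity - requested) / capacity * 100

def balanced_resource_allocation(req_cpu, alloc_cpu, req_mem, alloc_mem) -> float:
    """BalancedResourceAllocation: penalize lopsided CPU vs memory usage."""
    cpu_fraction = req_cpu / alloc_cpu
    mem_fraction = req_mem / alloc_mem
    return 100 - abs(cpu_fraction - mem_fraction) * 100

# 1 of 4 cores requested -> 75% headroom
assert least_requested(4, 1) == 75.0
# half the CPU and half the memory requested -> perfectly balanced
assert balanced_resource_allocation(2, 4, 8, 16) == 100.0
# CPU at 90%, memory at 10% -> heavily penalized, score near 20
assert round(balanced_resource_allocation(3.6, 4, 1.6, 16)) == 20
```

Note the two plugins can disagree: a nearly empty but lopsided node scores high on LeastRequested and low on BalancedResourceAllocation, which is why both run and their weighted scores are summed.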
Scheduler Profiles
Multiple scheduling profiles for different workloads:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesBalancedAllocation
        weight: 1
      - name: ImageLocality
        weight: 1
- schedulerName: batch-scheduler
  plugins:
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: LeastAllocated  # replaces the removed NodeResourcesLeastAllocated plugin
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
Use in Pod:
apiVersion: v1
kind: Pod
spec:
schedulerName: batch-scheduler # Uses batch-scheduler profile
Preemption
When no node can fit a pod, preemption evicts lower-priority pods:
1. Pod cannot be scheduled (PostFilter triggered)
2. For each node:
a. Find pods that could be preempted (lower priority)
b. Simulate removing them
c. Check if pod would now fit
3. Select node with minimum disruption:
- Fewer preempted pods
- Lower aggregate priority of victims
- Fewest PodDisruptionBudget violations
4. Delete victim pods (graceful termination)
5. Pod gains NominatedNodeName
6. Reschedule once victims terminate
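Steps 2-3 above can be sketched as a simulation (a toy model: real preemption also honors PDBs, affinity, and graceful termination):

```python
def pick_preemption_node(request, pending_priority, nodes):
    """Sketch of victim selection: on each node, evict the lowest-priority
    pods (only those below the pending pod's priority) until the pod fits,
    then prefer the node with the fewest / lowest-priority victims."""
    best = None
    for name, node in nodes.items():
        free = node["allocatable"] - sum(p["request"] for p in node["pods"])
        victims = []
        for p in sorted(node["pods"], key=lambda p: p["priority"]):
            if free >= request:
                break
            if p["priority"] >= pending_priority:
                break                        # equal/higher priority pods are safe
            victims.append(p)
            free += p["request"]
        if free < request:
            continue                         # preemption cannot help on this node
        cost = (len(victims), sum(p["priority"] for p in victims))
        if best is None or cost < best[0]:
            best = (cost, name, victims)
    return (best[1], best[2]) if best else (None, [])

nodes = {
    "node-a": {"allocatable": 4, "pods": [
        {"name": "x", "request": 3, "priority": 100},
        {"name": "y", "request": 1, "priority": 50}]},
    "node-b": {"allocatable": 4, "pods": [
        {"name": "z", "request": 2, "priority": 10}]},
}
node, victims = pick_preemption_node(2, pending_priority=1000, nodes=nodes)
assert node == "node-b" and victims == []   # node-b already has room: zero disruption
```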
Priority Classes:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: critical-infrastructure
value: 1000000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Critical infrastructure pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: batch-workloads
value: 100
preemptionPolicy: Never # Will not preempt others
description: "Batch jobs that can wait"
Part IV: Controller Patterns
The Reconciliation Loop
Controllers implement the observe-diff-act pattern:
while true:
desired := getDesiredState() # From spec
current := getCurrentState() # From status/cluster
diff := compare(desired, current)
if diff.isEmpty():
continue
actions := planActions(diff)
for action in actions:
execute(action)
Level-Triggered vs Edge-Triggered:
| Aspect | Edge-Triggered | Level-Triggered (Kubernetes) |
|---|---|---|
| Trigger | On change event | On state difference |
| Missed events | Can lose events | Always converges |
| Idempotency | Must track what was done | Naturally idempotent |
| Implementation | React to ADDED/MODIFIED/DELETED | Compare desired vs current |
Kubernetes controllers are level-triggered: they react to state, not events. If the controller crashes and misses events, it will still converge on restart.
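A minimal level-triggered reconciler over sets of object names shows why convergence survives missed events (names are illustrative):

```python
def reconcile(desired, current, create, delete):
    """One level-triggered pass: compare the full desired state against the
    full observed state and act only on the difference. Missed events don't
    matter -- the next pass observes the same difference and converges."""
    for name in desired - current:
        create(name)
    for name in current - desired:
        delete(name)

cluster = {"pod-a", "pod-stale"}          # observed state, including a leftover
desired = {"pod-a", "pod-b"}              # spec

reconcile(desired, cluster, cluster.add, cluster.discard)
assert cluster == {"pod-a", "pod-b"}

# Idempotent: a second pass (e.g. after a crash/restart) is a no-op.
reconcile(desired, cluster, cluster.add, cluster.discard)
assert cluster == {"pod-a", "pod-b"}
```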
Work Queue Patterns
Controllers use work queues to decouple event handling from processing:
// Typical controller structure
type Controller struct {
informer cache.SharedIndexInformer
queue workqueue.RateLimitingInterface
}
func (c *Controller) Run(stopCh <-chan struct{}) {
defer c.queue.ShutDown()
// Start informer
go c.informer.Run(stopCh)
// Wait for cache sync
if !cache.WaitForCacheSync(stopCh, c.informer.HasSynced) {
return
}
// Process work queue
for c.processNextItem() {
}
}
func (c *Controller) processNextItem() bool {
key, shutdown := c.queue.Get()
if shutdown {
return false
}
defer c.queue.Done(key)
err := c.reconcile(key.(string))
if err != nil {
c.queue.AddRateLimited(key) // Retry with backoff
return true
}
c.queue.Forget(key) // Success, clear rate limit
return true
}
Leader Election
Only one controller instance should be active:
// Leader election using Lease objects
leaderElectionConfig := leaderelection.LeaderElectionConfig{
Lock: &resourcelock.LeaseLock{
LeaseMeta: metav1.ObjectMeta{
Name: "my-controller-lock",
Namespace: "kube-system",
},
Client: client.CoordinationV1(),
LockConfig: resourcelock.ResourceLockConfig{
Identity: hostname,
},
},
LeaseDuration: 15 * time.Second,
RenewDeadline: 10 * time.Second,
RetryPeriod: 2 * time.Second,
Callbacks: leaderelection.LeaderCallbacks{
OnStartedLeading: func(ctx context.Context) {
controller.Run(ctx.Done())
},
OnStoppedLeading: func() {
log.Fatal("lost leadership")
},
},
}
leaderelection.RunOrDie(ctx, leaderElectionConfig)
Lease Object:
kubectl get lease -n kube-system
kubectl get lease kube-controller-manager -n kube-system -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
name: kube-controller-manager
namespace: kube-system
spec:
holderIdentity: master-1_xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
leaseDurationSeconds: 15
acquireTime: "2026-02-22T10:00:00Z"
renewTime: "2026-02-22T10:05:30Z"
leaseTransitions: 3
Part V: Custom Resource Definitions
CRD Structure
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: certificates.cert-manager.io
spec:
group: cert-manager.io
names:
kind: Certificate
listKind: CertificateList
plural: certificates
singular: certificate
shortNames:
- cert
- certs
categories:
- cert-manager
scope: Namespaced
versions:
- name: v1
served: true
storage: true
subresources:
status: {}
additionalPrinterColumns:
- name: Ready
type: string
jsonPath: .status.conditions[?(@.type=="Ready")].status
- name: Secret
type: string
jsonPath: .spec.secretName
- name: Age
type: date
jsonPath: .metadata.creationTimestamp
schema:
openAPIV3Schema:
type: object
required:
- spec
properties:
spec:
type: object
required:
- secretName
- issuerRef
properties:
secretName:
type: string
duration:
type: string
pattern: '^[0-9]+(h|m|s)$'
issuerRef:
type: object
required:
- name
- kind
properties:
name:
type: string
kind:
type: string
enum:
- Issuer
- ClusterIssuer
status:
type: object
properties:
conditions:
type: array
items:
type: object
properties:
type:
type: string
status:
type: string
lastTransitionTime:
type: string
format: date-time
reason:
type: string
message:
type: string
Validation
Structural Schema (Required):
All CRDs must have a structural schema where:
- Every field has a type
- No additionalProperties: true at root
- No nullable: true without type
Common Validation Patterns:
properties:
# Enum constraint
protocol:
type: string
enum: ["TCP", "UDP", "SCTP"]
# Pattern constraint
hostname:
type: string
pattern: '^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$'
# Numeric constraints
replicas:
type: integer
minimum: 0
maximum: 100
# String constraints
name:
type: string
minLength: 1
maxLength: 63
# Default values
retries:
type: integer
default: 3
# Required one of
x-kubernetes-validations:
- rule: "has(self.secretRef) || has(self.configMapRef)"
message: "must specify either secretRef or configMapRef"
Conversion Webhooks
Support multiple API versions with conversion:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
spec:
conversion:
strategy: Webhook
webhook:
conversionReviewVersions: ["v1"]
clientConfig:
service:
name: my-conversion-webhook
namespace: my-system
path: /convert
caBundle: LS0tLS1CRUdJTi...
Part VI: Admission Control
Admission Pipeline
Request → Authentication → Authorization
        → Mutating Webhooks          (run serially; can modify the object)
        → Object Schema Validation
        → Validating Webhooks        (run in parallel; can only reject)
        → Persist to etcd
Built-in Admission Controllers
| Controller | Purpose |
|---|---|
| NamespaceLifecycle | Prevents operations in terminating/non-existent namespaces |
| LimitRanger | Applies default resource requests/limits |
| ServiceAccount | Mounts service account tokens |
| DefaultStorageClass | Assigns default storage class to PVCs |
| PodSecurity | Enforces Pod Security Standards |
| ResourceQuota | Enforces namespace resource quotas |
| PodTolerationRestriction | Limits toleration modifications |
Webhook Configuration
Mutating Webhook:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
name: vault-agent-injector
webhooks:
- name: vault.hashicorp.com
clientConfig:
service:
name: vault-agent-injector
namespace: vault
path: /mutate
caBundle: LS0tLS1CRUdJTi...
rules:
- apiGroups: [""]
apiVersions: ["v1"]
operations: ["CREATE"]
resources: ["pods"]
scope: "Namespaced"
namespaceSelector:
matchExpressions:
- key: vault-injection
operator: NotIn
values: ["disabled"]
failurePolicy: Ignore # Don't block pod creation if webhook fails
sideEffects: None
admissionReviewVersions: ["v1"]
reinvocationPolicy: IfNeeded # Re-run if other webhooks modify
timeoutSeconds: 10
Validating Webhook:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
name: policy-controller
webhooks:
- name: validation.policy.example.com
clientConfig:
service:
name: policy-controller
namespace: policy-system
path: /validate
rules:
- apiGroups: ["apps"]
apiVersions: ["v1"]
operations: ["CREATE", "UPDATE"]
resources: ["deployments"]
failurePolicy: Fail # Block if webhook unreachable
matchPolicy: Equivalent # Match equivalent API versions
Debugging Admission
# Check webhook configurations
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations
# Describe specific webhook
kubectl describe mutatingwebhookconfiguration vault-agent-injector
# Check webhook endpoint
kubectl get svc -n vault vault-agent-injector
kubectl get endpoints -n vault vault-agent-injector
# Test webhook directly (requires port-forward)
kubectl port-forward -n vault svc/vault-agent-injector 8080:443
curl -k -X POST https://localhost:8080/mutate \
-H "Content-Type: application/json" \
-d '{"apiVersion":"admission.k8s.io/v1","kind":"AdmissionReview",...}'
# Check audit logs for admission decisions
kubectl logs -n kube-system kube-apiserver-master | grep admission
Part VII: Advanced Networking
CNI Plugin Chain
{
"cniVersion": "1.0.0",
"name": "k8s-pod-network",
"plugins": [
{
"type": "cilium-cni"
},
{
"type": "bandwidth",
"capabilities": {"bandwidth": true}
},
{
"type": "portmap",
"capabilities": {"portMappings": true}
}
]
}
eBPF Datapath (Cilium)
Traditional kube-proxy uses iptables:
Packet → iptables (PREROUTING) → routing decision → iptables (FORWARD) →
→ iptables (POSTROUTING) → egress
# Results in O(n) rule matching for n services
# iptables rules must be fully rewritten on any change
Cilium’s eBPF datapath:
Packet → eBPF (XDP/TC ingress) → direct routing/NAT → eBPF (TC egress) → egress
# O(1) lookups using BPF maps
# Incremental updates without full rewrite
# Operates at kernel level, no context switching
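The complexity difference is easy to demonstrate (a toy model: real iptables rules and BPF maps are kernel data structures, not Python):

```python
# kube-proxy iptables model: scan rules until one matches -- O(n) in services
def iptables_lookup(rules, vip):
    for rule_vip, backend in rules:      # linear scan over every service rule
        if rule_vip == vip:
            return backend
    return None

# eBPF-style model: a hash map keyed by VIP -- O(1) regardless of service count
def bpf_map_lookup(service_map, vip):
    return service_map.get(vip)

rules = [(f"10.43.0.{i}", f"pod-{i}") for i in range(1, 201)]
service_map = dict(rules)

# Same answer, but the map lookup cost does not grow with the service count,
# and adding a service is one map insert rather than a full rule rewrite.
assert iptables_lookup(rules, "10.43.0.200") == "pod-200"
assert bpf_map_lookup(service_map, "10.43.0.200") == "pod-200"
```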
Key eBPF Maps:
| Map | Purpose |
|---|---|
| cilium_lxc | Pod endpoints (IP, MAC, identity) |
| cilium_ipcache | IP → identity mapping |
| cilium_lb4_services | Service VIP → backend mapping |
| cilium_lb4_backends | Backend pod IPs |
| cilium_policy | NetworkPolicy rules |
| cilium_ct4_global | Connection tracking |
# Inspect Cilium BPF maps
cilium bpf lb list
cilium bpf ct list global
cilium bpf policy get -n default
# Debug packet flow
cilium monitor --type trace
cilium monitor --type drop
# Hubble flow observation
hubble observe --namespace production --protocol TCP
hubble observe --verdict DROPPED
Service Implementation
ClusterIP:
1. Pod sends packet to ClusterIP (10.43.x.x)
2. eBPF/iptables performs DNAT:
- Select backend pod (round-robin, random, or session affinity)
- Rewrite destination IP to pod IP
3. Packet routed to backend pod
4. Reply packet is reverse-translated by conntrack (source rewritten back to the ClusterIP)
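Backend selection in step 2 can be sketched as follows (`ClusterIPService` is a toy model; real kube-proxy/eBPF state lives in the kernel):

```python
import hashlib
import itertools

class ClusterIPService:
    """Sketch of DNAT backend selection for a ClusterIP service:
    round-robin by default, sticky per client when session affinity is on."""
    def __init__(self, backends, session_affinity=False):
        self.backends = backends
        self.affinity = session_affinity
        self._rr = itertools.cycle(backends)   # default round-robin
        self._sticky = {}                      # client IP -> chosen backend

    def pick(self, client_ip):
        if self.affinity:
            if client_ip not in self._sticky:
                # hash the client IP onto a backend, then remember the choice
                h = int(hashlib.sha256(client_ip.encode()).hexdigest(), 16)
                self._sticky[client_ip] = self.backends[h % len(self.backends)]
            return self._sticky[client_ip]
        return next(self._rr)

svc = ClusterIPService(["10.244.1.5", "10.244.2.7"])
assert svc.pick("10.0.0.1") != svc.pick("10.0.0.1")   # alternates backends

sticky = ClusterIPService(["10.244.1.5", "10.244.2.7"], session_affinity=True)
assert sticky.pick("10.0.0.1") == sticky.pick("10.0.0.1")  # same client, same backend
```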
NodePort:
1. External client connects to NodeIP:NodePort
2. Any node can receive (kube-proxy listens on all nodes)
3. DNAT to backend pod (may be on different node)
4. If pod on different node: SNAT to node IP (default) or reject (externalTrafficPolicy: Local)
LoadBalancer:
Cloud LB → NodePort → ClusterIP → Pod
Or with MetalLB (bare metal):
1. MetalLB assigns external IP from pool
2. ARP announcement (Layer 2) or BGP advertisement (Layer 3)
3. Traffic flows to node advertising the IP
4. NodePort processing continues
NetworkPolicy
Default Deny All:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {} # Matches all pods
policyTypes:
- Ingress
- Egress
Allow Specific Traffic:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-policy
namespace: production
spec:
podSelector:
matchLabels:
app: api-server
policyTypes:
- Ingress
- Egress
ingress:
- from:
# From pods with specific label
- podSelector:
matchLabels:
role: frontend
# OR from specific namespace
- namespaceSelector:
matchLabels:
name: monitoring
# OR from specific CIDR
- ipBlock:
cidr: 10.50.1.0/24
except:
- 10.50.1.100/32
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app: database
ports:
- protocol: TCP
port: 5432
# Allow DNS
- to:
- namespaceSelector: {}
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
Cilium Extended Policies:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: layer7-policy
spec:
endpointSelector:
matchLabels:
app: api-server
ingress:
- fromEndpoints:
- matchLabels:
role: frontend
toPorts:
- ports:
- port: "8080"
protocol: TCP
rules:
http:
- method: "GET"
path: "/api/v1/.*"
- method: "POST"
path: "/api/v1/data"
headers:
- 'Content-Type: application/json'
Part VIII: Storage Deep Dive
CSI Architecture
Components:
| Component | Function |
|---|---|
| external-provisioner | Watches PVCs, calls CreateVolume |
| external-attacher | Watches VolumeAttachments, calls ControllerPublishVolume |
| external-resizer | Watches PVCs for resize, calls ControllerExpandVolume |
| external-snapshotter | Creates VolumeSnapshots via CreateSnapshot |
| node-driver-registrar | Registers the CSI driver with kubelet |
| CSI Driver | Implements the CSI gRPC interface |
CSI Methods:
Controller Service:
CreateVolume(name, capacity, parameters) → volume_id
DeleteVolume(volume_id)
ControllerPublishVolume(volume_id, node_id) → attach info
ControllerUnpublishVolume(volume_id, node_id)
CreateSnapshot(source_volume_id, name) → snapshot_id
ControllerExpandVolume(volume_id, new_capacity)
Node Service:
NodeStageVolume(volume_id, staging_path) # Mount to staging (e.g., format)
NodePublishVolume(volume_id, target_path) # Bind mount to pod
NodeUnpublishVolume(volume_id, target_path)
NodeUnstageVolume(volume_id, staging_path)
NodeExpandVolume(volume_id, new_capacity) # Filesystem resize
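The stage/publish split exists so one node-level mount can back many pods. A toy model of the bookkeeping (`NodeService` mirrors the CSI method names but is illustrative, with assertions standing in for gRPC errors):

```python
class NodeService:
    """Sketch: a volume is staged (formatted, mounted) once per node, then
    bind-mounted per pod; it may only be unstaged when no pod still uses it."""
    def __init__(self):
        self.staged = {}      # volume_id -> staging path
        self.published = {}   # volume_id -> set of pod target paths

    def node_stage_volume(self, volume_id, staging_path):
        if volume_id not in self.staged:          # idempotent, like real CSI
            self.staged[volume_id] = staging_path

    def node_publish_volume(self, volume_id, target_path):
        assert volume_id in self.staged, "must stage before publish"
        self.published.setdefault(volume_id, set()).add(target_path)

    def node_unpublish_volume(self, volume_id, target_path):
        self.published[volume_id].discard(target_path)

    def node_unstage_volume(self, volume_id):
        assert not self.published.get(volume_id), "pods still using volume"
        del self.staged[volume_id]

n = NodeService()
n.node_stage_volume("vol-1", "/var/lib/kubelet/staging/vol-1")
n.node_publish_volume("vol-1", "/pods/a/volumes/vol-1")
n.node_publish_volume("vol-1", "/pods/b/volumes/vol-1")  # two pods, one staging mount
n.node_unpublish_volume("vol-1", "/pods/a/volumes/vol-1")
n.node_unpublish_volume("vol-1", "/pods/b/volumes/vol-1")
n.node_unstage_volume("vol-1")                           # safe: no pods left
```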
Volume Lifecycle
1. User creates PVC
│
2. PV Controller finds/creates matching PV
│ (via StorageClass provisioner)
│
3. PVC and PV become Bound
│
4. Pod scheduled, references PVC
│
5. Volume Controller creates VolumeAttachment
│
6. external-attacher calls ControllerPublishVolume
│ (attaches volume to node in cloud/SAN)
│
7. kubelet sees pod needs volume
│
8. kubelet calls NodeStageVolume
│ (format filesystem if needed, mount to staging)
│
9. kubelet calls NodePublishVolume
│ (bind mount to pod's directory)
│
10. Container starts with volume mounted
Topology-Aware Provisioning
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: regional-ssd
provisioner: pd.csi.storage.gke.io
parameters:
type: pd-ssd
replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer # Don't provision until pod scheduled
allowedTopologies:
- matchLabelExpressions:
- key: topology.gke.io/zone
values:
- us-central1-a
- us-central1-b
Volume Snapshots
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: csi-snapclass
driver: pd.csi.storage.gke.io
deletionPolicy: Delete
parameters:
storage-locations: us-central1
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: prometheus-snapshot
spec:
volumeSnapshotClassName: csi-snapclass
source:
persistentVolumeClaimName: prometheus-data
---
# Restore from snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: prometheus-restored
spec:
storageClassName: regional-ssd
dataSource:
name: prometheus-snapshot
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
Part IX: Security Model
Pod Security Standards
Three profiles (replace deprecated PodSecurityPolicy):
| Profile | Restrictions |
|---|---|
| Privileged | No restrictions (cluster admins, CNI, storage drivers) |
| Baseline | Minimal restrictions: no privileged containers, no host namespaces, no hostPath volumes, limited capabilities |
| Restricted | Maximum restrictions: must run as non-root, must drop ALL capabilities, no privilege escalation, seccomp profile required |
Enforcement:
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/enforce-version: latest
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
RBAC Patterns
Principle of Least Privilege:
# Narrow: specific verbs, specific resources, specific names
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-restarter
  namespace: production
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  resourceNames: ["api-server", "web-frontend"]  # Only these deployments
  verbs: ["get", "patch"]  # Only read and patch (for rollout restart)
# Additional rule in the same Role if scaling is also needed
- apiGroups: ["apps"]
  resources: ["deployments/scale"]
  resourceNames: ["api-server", "web-frontend"]
  verbs: ["patch"]
Aggregated ClusterRoles:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: monitoring-reader
labels:
rbac.example.com/aggregate-to-monitoring: "true"
rules:
- apiGroups: [""]
resources: ["pods", "services", "endpoints"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: monitoring-aggregated
aggregationRule:
clusterRoleSelectors:
- matchLabels:
rbac.example.com/aggregate-to-monitoring: "true"
rules: [] # Automatically populated
Service Account Token Projection
Bound tokens (recommended):
apiVersion: v1
kind: Pod
spec:
serviceAccountName: my-app
automountServiceAccountToken: false # Disable default mount
containers:
- name: app
volumeMounts:
- name: token
mountPath: /var/run/secrets/tokens
readOnly: true
volumes:
- name: token
projected:
sources:
- serviceAccountToken:
path: token
expirationSeconds: 3600 # Short-lived
audience: my-app.example.com # Bound to specific audience
Secrets Management
Encryption at Rest:
# /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
- secrets
providers:
- aescbc:
keys:
- name: key1
secret: <base64-encoded-32-byte-key>
- identity: {} # Fallback for reading old unencrypted secrets
External Secrets:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: database-credentials
spec:
refreshInterval: 1h
secretStoreRef:
name: vault-backend
kind: ClusterSecretStore
target:
name: database-credentials
creationPolicy: Owner
data:
- secretKey: username
remoteRef:
key: secret/data/database/production
property: username
- secretKey: password
remoteRef:
key: secret/data/database/production
property: password
Part X: Debugging
kubectl debug
Ephemeral Debug Containers:
# Add debug container to running pod
kubectl debug -it pod/nginx --image=nicolaka/netshoot --target=nginx
# The --target flag shares process namespace with that container
# Allows: ps aux, strace -p <pid>, nsenter, etc.
Node Debugging:
# Debug node (creates privileged pod)
kubectl debug node/worker-1 -it --image=ubuntu
# Access node filesystem
chroot /host
Copy Pod for Debugging:
# Create copy with modified command
kubectl debug pod/myapp --copy-to=myapp-debug --container=app -- sh
# Create copy with different image
kubectl debug pod/myapp --copy-to=myapp-debug --set-image=app=busybox
crictl
Direct container runtime interaction:
# List containers
crictl ps -a
# Get container details
crictl inspect <container-id>
# Container logs (bypasses kubelet)
crictl logs <container-id>
# Execute in container
crictl exec -it <container-id> sh
# Pull image
crictl pull docker.io/library/nginx:latest
# List images
crictl images
# Pod sandbox operations
crictl pods
crictl inspectp <pod-id>
# Stats
crictl stats
Event Investigation
# All events, sorted by time
kubectl get events --sort-by=.metadata.creationTimestamp
# Events for specific resource
kubectl get events --field-selector involvedObject.name=nginx-abc123
# Warning events only
kubectl get events --field-selector type=Warning
# Events in watch mode
kubectl get events -w
# Describe shows events at bottom
kubectl describe pod nginx-abc123
Log Analysis
# Structured log parsing (JSON logs)
kubectl logs deployment/api-server | jq -r 'select(.level=="error") | .message'
# Multi-container
kubectl logs pod/myapp -c sidecar --previous
# Follow with timestamps
kubectl logs -f deployment/api-server --timestamps=true
# Since duration
kubectl logs deployment/api-server --since=1h
# Aggregate logs from all pods in deployment
kubectl logs deployment/api-server --all-containers=true --prefix=true
# Stern for multi-pod tailing (if installed)
stern "api-.*" -n production --since 10m
Resource Inspection
# Custom columns
kubectl get pods -o custom-columns=\
NAME:.metadata.name,\
NODE:.spec.nodeName,\
STATUS:.status.phase,\
IP:.status.podIP,\
RESTARTS:.status.containerStatuses[0].restartCount
# JSONPath
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
# Get specific field
kubectl get pod nginx -o jsonpath='{.spec.containers[0].image}'
# Sort by field
kubectl get pods --sort-by=.status.startTime
# Raw API request
kubectl get --raw /api/v1/namespaces/default/pods | jq .
Part XI: Performance Tuning
Resource Management
Requests vs Limits:
| Aspect | Requests | Limits |
|---|---|---|
| Scheduling | Used to find a suitable node | Not considered |
| QoS Class | Determines class | Must equal requests for Guaranteed |
| Enforcement | No effect at runtime | CPU throttled, memory OOM-killed on breach |
| Recommendation | Set based on actual usage | Set to prevent runaway consumption |
QoS Classes:
Guaranteed: requests == limits for all containers (CPU and memory)
→ Highest priority, last to be evicted
Burstable: requests < limits OR only requests set
→ Medium priority
BestEffort: no requests or limits set
→ Lowest priority, first to be evicted
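The classification rules can be written down directly (a simplified sketch: the real kubelet also defaults requests from limits before classifying):

```python
def qos_class(containers):
    """Classify a pod: Guaranteed requires requests == limits for both CPU
    and memory in every container; any resources set at all rules out
    BestEffort; everything in between is Burstable."""
    has_any = any(c.get("requests") or c.get("limits") for c in containers)
    if not has_any:
        return "BestEffort"
    for c in containers:
        for resource in ("cpu", "memory"):
            req = c.get("requests", {}).get(resource)
            lim = c.get("limits", {}).get(resource)
            if req is None or lim is None or req != lim:
                return "Burstable"
    return "Guaranteed"

assert qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                   "limits":   {"cpu": "500m", "memory": "1Gi"}}]) == "Guaranteed"
assert qos_class([{"requests": {"cpu": "500m"}}]) == "Burstable"
assert qos_class([{}]) == "BestEffort"
```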
Pod Disruption Budgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-server-pdb
spec:
minAvailable: 2 # OR maxUnavailable: 1
selector:
matchLabels:
app: api-server
unhealthyPodEvictionPolicy: IfHealthyBudget # New in 1.27
Topology Spread Constraints
apiVersion: v1
kind: Pod
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: api-server
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: api-server
Vertical Pod Autoscaler
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-server-vpa
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: api-server
updatePolicy:
updateMode: "Auto" # Off, Initial, Recreate, Auto
resourcePolicy:
containerPolicies:
- containerName: "*"
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 4
memory: 8Gi
controlledResources: ["cpu", "memory"]
controlledValues: RequestsAndLimits
Appendix A: Quick Reference
Object Short Names
| Kind | Short Name | API Group |
|---|---|---|
| ConfigMap | cm | core |
| DaemonSet | ds | apps |
| Deployment | deploy | apps |
| Endpoints | ep | core |
| Event | ev | core |
| HorizontalPodAutoscaler | hpa | autoscaling |
| Ingress | ing | networking.k8s.io |
| Namespace | ns | core |
| NetworkPolicy | netpol | networking.k8s.io |
| Node | no | core |
| PersistentVolume | pv | core |
| PersistentVolumeClaim | pvc | core |
| Pod | po | core |
| ReplicaSet | rs | apps |
| Service | svc | core |
| ServiceAccount | sa | core |
| StatefulSet | sts | apps |
| StorageClass | sc | storage.k8s.io |
Essential Commands
# Cluster info
kubectl cluster-info
kubectl get nodes -o wide
kubectl top nodes
# Resource usage
kubectl top pods --containers
kubectl describe node <node> | grep -A5 "Allocated resources"
# Debugging
kubectl get events --sort-by='.lastTimestamp'
kubectl logs <pod> --previous
kubectl debug -it <pod> --image=nicolaka/netshoot
# Dry run and diff
kubectl apply -f manifest.yaml --dry-run=server
kubectl diff -f manifest.yaml
# Force delete
kubectl delete pod <pod> --grace-period=0 --force
# Get all resources in namespace
kubectl api-resources --verbs=list --namespaced -o name | \
xargs -n 1 kubectl get --show-kind --ignore-not-found -n <namespace>
Appendix B: Recommended Reading
- Brendan Burns, Joe Beda, Kelsey Hightower. Kubernetes: Up and Running. O’Reilly, 2022.
- Michael Hausenblas, Stefan Schimanski. Programming Kubernetes. O’Reilly, 2019.
- Kubernetes Documentation. kubernetes.io/docs/
- Cilium Documentation. docs.cilium.io/
- etcd Documentation. etcd.io/docs/