Kubernetes Advanced Reference
Preface
This document assumes familiarity with basic Kubernetes concepts. It covers the internals, edge cases, and production patterns that separate operators from experts.
The goal is precision and depth, not breadth. Each section stands alone as a reference.
Part I: API Machinery
The Kubernetes API Model
Every Kubernetes object is defined by three coordinates:
| Coordinate | Description | Example |
|---|---|---|
| Group | Logical collection of related kinds | `apps` |
| Version | API maturity level | `v1` |
| Kind | The object type | `Deployment` |
The combination forms a GroupVersionKind (GVK):
apps/v1/Deployment
core/v1/Pod # core group is empty string
batch/v1/Job
networking.k8s.io/v1/NetworkPolicy
GroupVersionResource (GVR)
While GVK identifies the type, GVR identifies the REST path:
GVK: apps/v1/Deployment
GVR: apps/v1/deployments # plural, lowercase
REST path: /apis/apps/v1/namespaces/{ns}/deployments/{name}
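Path construction from a GVR is mechanical; a small sketch (`rest_path` is an illustrative helper, not a client-go function):

```python
def rest_path(group: str, version: str, resource: str,
              namespace: str = "", name: str = "") -> str:
    """Build the API server REST path for a GroupVersionResource.

    The core group (empty string) lives under /api; all other groups
    live under /apis/{group}.
    """
    prefix = f"/api/{version}" if group == "" else f"/apis/{group}/{version}"
    parts = [prefix]
    if namespace:
        parts += ["namespaces", namespace]
    parts.append(resource)
    if name:
        parts.append(name)
    return "/".join(parts)

print(rest_path("apps", "v1", "deployments", namespace="prod", name="nginx"))
# /apis/apps/v1/namespaces/prod/deployments/nginx
print(rest_path("", "v1", "pods", namespace="default"))
# /api/v1/namespaces/default/pods
```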
The mapping between Kind and Resource is not always predictable:
| Kind | Resource |
|---|---|
| Endpoints | endpoints (same) |
| NetworkPolicy | networkpolicies |
| Ingress | ingresses |
Discovery and REST Mapping
The API server exposes discovery endpoints:
# List all API groups
kubectl get --raw /apis | jq '.groups[].name'
# List resources in a group
kubectl get --raw /apis/apps/v1 | jq '.resources[].name'
# Get specific resource schema
kubectl get --raw /apis/apps/v1 | jq '.resources[] | select(.name=="deployments")'
Watches and Informers
The watch mechanism is fundamental to Kubernetes' reactive architecture.
Watch Protocol:
GET /api/v1/namespaces/default/pods?watch=true&resourceVersion=12345
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked
{"type":"ADDED","object":{...}}
{"type":"MODIFIED","object":{...}}
{"type":"DELETED","object":{...}}
{"type":"BOOKMARK","object":{"metadata":{"resourceVersion":"12350"}}}
Event Types:
| Type | Meaning |
|---|---|
ADDED |
Object created or first seen in watch |
MODIFIED |
Object spec or status changed |
DELETED |
Object removed |
BOOKMARK |
Checkpoint for resourceVersion (no object change) |
ERROR |
Watch failed, must re-list |
Informer Architecture:
An informer combines:
- Reflector: watches the API server, maintains the local cache
- Delta FIFO: queue of changes (adds, updates, deletes)
- Indexer: in-memory store with custom indexes
- Event Handlers: OnAdd, OnUpdate, OnDelete callbacks
// Conceptual informer usage (client-go)
informer := cache.NewSharedIndexInformer(
&cache.ListWatch{
ListFunc: func(options metav1.ListOptions) (runtime.Object, error) { ... },
WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) { ... },
},
&v1.Pod{},
resyncPeriod,
cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc},
)
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) { ... },
UpdateFunc: func(old, new interface{}) { ... },
DeleteFunc: func(obj interface{}) { ... },
})
Resource Versions and Consistency
Every Kubernetes object has a resourceVersion field - an opaque string representing the object’s version in etcd.
Consistency Guarantees:
| Operation | Consistency | resourceVersion Behavior |
|---|---|---|
| GET | Serializable (latest) | Returns current version |
| GET ?resourceVersion=0 | Any (from cache) | May return stale data |
| LIST | Serializable | Returns consistent snapshot |
| LIST ?resourceVersion=X | At least as fresh as X | May miss very recent changes |
| WATCH ?resourceVersion=X | Guaranteed delivery from X | Will see all changes after X |
Optimistic Concurrency:
Updates must include the current resourceVersion:
apiVersion: v1
kind: ConfigMap
metadata:
name: example
resourceVersion: "12345" # Must match current version
data:
key: newvalue
If the version doesn’t match, the API server returns 409 Conflict.
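When a 409 comes back, clients typically re-read the object and re-apply the change. A minimal sketch against a toy in-memory store (`Store`, `Conflict`, and `update_with_retry` are illustrative names, not client-go API):

```python
import copy

class Conflict(Exception):
    """Stand-in for an HTTP 409 from the API server."""

class Store:
    """Toy API server: accepts an update only when the submitted
    resourceVersion matches the stored one, then bumps the version."""
    def __init__(self, obj):
        self.obj = obj
    def get(self):
        return copy.deepcopy(self.obj)
    def update(self, obj):
        if obj["resourceVersion"] != self.obj["resourceVersion"]:
            raise Conflict("409: resourceVersion mismatch")
        self.obj = {**copy.deepcopy(obj),
                    "resourceVersion": str(int(obj["resourceVersion"]) + 1)}
        return copy.deepcopy(self.obj)

def update_with_retry(store, mutate, attempts=5):
    """The standard client pattern: re-GET and re-apply the change on 409."""
    for _ in range(attempts):
        obj = store.get()            # fresh copy with the current resourceVersion
        mutate(obj)
        try:
            return store.update(obj)
        except Conflict:
            continue                 # a concurrent writer won; fetch and retry
    raise RuntimeError("gave up after repeated conflicts")

store = Store({"resourceVersion": "12345", "data": {"key": "oldvalue"}})
result = update_with_retry(store, lambda o: o["data"].update(key="newvalue"))
```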
Part II: etcd Internals
Data Model
etcd stores Kubernetes data as key-value pairs:
Key: /registry/pods/default/nginx-abc123
Value: protobuf-encoded Pod object
Key: /registry/deployments/default/nginx
Value: protobuf-encoded Deployment object
Key: /registry/secrets/kube-system/bootstrap-token-abc123
Value: protobuf-encoded Secret (encrypted at rest if configured)
Key Structure:
/registry/{resource}/{namespace}/{name} # namespaced resources
/registry/{resource}/{name} # cluster-scoped resources
/registry/ranges/{type} # allocation ranges
MVCC and Revisions
etcd uses Multi-Version Concurrency Control (MVCC):
# Every write increments the global revision
etcdctl put /test/key1 "value1" # revision 100
etcdctl put /test/key2 "value2" # revision 101
etcdctl put /test/key1 "value2" # revision 102
# Read at specific revision
etcdctl get /test/key1 --rev=100 # returns "value1"
etcdctl get /test/key1 --rev=102 # returns "value2"
# Watch from revision
etcdctl watch /test/key1 --rev=100 # sees all changes since 100
A Kubernetes object's resourceVersion corresponds to the etcd revision (the object's mod_revision) at which it was last written.
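The revision semantics above can be modeled in a few lines (`MVCC` here is a toy in-memory model, not the etcd implementation):

```python
class MVCC:
    """Toy multi-version store: every write bumps one global revision,
    and reads/watches can target any retained revision."""
    def __init__(self):
        self.rev = 99               # start so the first write lands at 100
        self.history = []           # append-only list of (rev, key, value)

    def put(self, key, value):
        self.rev += 1
        self.history.append((self.rev, key, value))
        return self.rev

    def get(self, key, rev=None):
        """Return the value of key as of the given revision (latest if None)."""
        rev = self.rev if rev is None else rev
        for r, k, v in reversed(self.history):
            if k == key and r <= rev:
                return v
        return None

    def watch(self, key, rev):
        """Replay all changes to key at or after the given revision."""
        return [(r, v) for r, k, v in self.history if k == key and r >= rev]

db = MVCC()
db.put("/test/key1", "value1")   # revision 100
db.put("/test/key2", "value2")   # revision 101
db.put("/test/key1", "value2")   # revision 102
assert db.get("/test/key1", rev=100) == "value1"
assert db.get("/test/key1") == "value2"
assert db.watch("/test/key1", 100) == [(100, "value1"), (102, "value2")]
```

Compaction in this model would simply drop history entries below a revision; a watch starting before the compaction point can no longer be replayed, which is exactly the "required revision has been compacted" error below.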
Compaction
etcd retains all revisions until compacted:
# Check current revision
etcdctl endpoint status --write-out=table
# Compact to revision (deletes history before this point)
etcdctl compact 150000
# After compaction, watches before revision 150000 fail with:
# "mvcc: required revision has been compacted"
Automatic Compaction (Kubernetes default):
# kube-apiserver flag
--etcd-compaction-interval=5m
Defragmentation
Compaction marks space as free but doesn’t reclaim it. Defragmentation reclaims disk space:
# Check fragmentation (db size vs db size in use)
etcdctl endpoint status --write-out=table
# Defragment (blocks writes, run on one member at a time)
etcdctl defrag --endpoints=https://etcd-0:2379
# In production, defrag during maintenance window
for ep in etcd-0 etcd-1 etcd-2; do
etcdctl defrag --endpoints=https://${ep}:2379
sleep 10
done
Backup and Restore
# Snapshot (consistent point-in-time backup)
etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify snapshot
etcdctl snapshot status /backup/etcd-20260222.db --write-out=table
# Restore (creates new data directory)
etcdctl snapshot restore /backup/etcd-20260222.db \
--name=etcd-0 \
--initial-cluster=etcd-0=https://10.0.0.10:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://10.0.0.10:2380 \
--data-dir=/var/lib/etcd-restored
Monitoring etcd
Critical metrics:
| Metric | Meaning |
|---|---|
| etcd_mvcc_db_total_size_in_bytes | Total database size |
| etcd_mvcc_db_total_size_in_use_in_bytes | Actual data size (difference = fragmentation) |
| etcd_disk_wal_fsync_duration_seconds | Write-ahead log sync latency (should be <10ms) |
| etcd_disk_backend_commit_duration_seconds | Backend commit latency (should be <25ms) |
| etcd_server_proposals_failed_total | Failed Raft proposals (indicates cluster issues) |
| etcd_server_leader_changes_seen_total | Leader elections (frequent = instability) |
Part III: Scheduler Deep Dive
Scheduling Framework
The scheduler uses a plugin-based framework with defined extension points:
Extension Points (in order):
| Phase | Purpose | Example Plugins |
|---|---|---|
| PreFilter | Pre-process or check pod info | PodTopologySpread |
| Filter | Exclude nodes that cannot run the pod | NodeAffinity, TaintToleration, NodePorts |
| PostFilter | Called if no nodes pass Filter | DefaultPreemption |
| PreScore | Pre-process for scoring | InterPodAffinity |
| Score | Rank remaining nodes | NodeResourcesBalancedAllocation, ImageLocality |
| NormalizeScore | Normalize scores to 0-100 | (built-in) |
| Reserve | Reserve resources for the pod | VolumeBinding |
| Permit | Approve/deny/wait | (custom plugins) |
| PreBind | Pre-binding operations | VolumeBinding |
| Bind | Bind pod to node | DefaultBinder |
| PostBind | Post-binding cleanup | (informational) |
Predicates (Filter Phase)
Predicates determine if a node CAN run a pod:
PodFitsResources:
node.Allocatable.cpu >= sum(pod.containers[*].requests.cpu)
node.Allocatable.memory >= sum(pod.containers[*].requests.memory)
PodFitsHostPorts:
for port in pod.spec.containers[*].ports:
if port.hostPort != 0:
port.hostPort not in node.usedHostPorts
PodMatchNodeSelector:
pod.spec.nodeSelector matches node.labels
TaintToleration:
for taint in node.taints:
exists pod.spec.tolerations[*] that tolerates taint
CheckNodeMemoryPressure (legacy; now expressed as the node.kubernetes.io/memory-pressure taint and handled by TaintToleration):
node.conditions.MemoryPressure != True
Priorities (Score Phase)
Priorities rank nodes that passed predicates:
LeastRequestedPriority:
score = ((capacity - sum(requests)) / capacity) * 100
# Prefers nodes with more available resources
BalancedResourceAllocation:
cpuFraction = requested.cpu / allocatable.cpu
memFraction = requested.memory / allocatable.memory
score = 100 - abs(cpuFraction - memFraction) * 100
# Prefers balanced CPU/memory usage
ImageLocalityPriority:
if image exists on node: score += size(image) / totalImageSize * 100
# Prefers nodes that already have the image
InterPodAffinityPriority:
for each matching pod on node:
score += weight from affinity rule
# Implements pod affinity/anti-affinity
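The two resource-scoring formulas above translate directly to code (function names are descriptive, not the plugin API):

```python
def least_requested(capacity: float, requested: float) -> float:
    """LeastRequested: prefer nodes with more free capacity."""
    return (capacity - requested) / capacity * 100

def balanced_resource_allocation(req_cpu, alloc_cpu, req_mem, alloc_mem) -> float:
    """BalancedResourceAllocation: penalize lopsided CPU vs memory usage."""
    cpu_fraction = req_cpu / alloc_cpu
    mem_fraction = req_mem / alloc_mem
    return 100 - abs(cpu_fraction - mem_fraction) * 100

# 1 of 4 cores requested -> 75% headroom
assert least_requested(4, 1) == 75.0
# half the CPU and half the memory requested -> perfectly balanced
assert balanced_resource_allocation(2, 4, 8, 16) == 100.0
# CPU at 90%, memory at 10% -> heavily penalized, score near 20
assert round(balanced_resource_allocation(3.6, 4, 1.6, 16)) == 20
```

Note the two plugins can disagree: a nearly empty but lopsided node scores high on LeastRequested and low on BalancedResourceAllocation, which is why both run and their weighted scores are summed.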
Scheduler Profiles
Multiple scheduling profiles for different workloads:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled:
      - name: NodeResourcesBalancedAllocation
        weight: 1
      - name: ImageLocality
        weight: 1
- schedulerName: batch-scheduler
  plugins:
    score:
      disabled:
      - name: NodeResourcesBalancedAllocation
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: LeastAllocated  # replaces the removed NodeResourcesLeastAllocated plugin
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
Use in Pod:
apiVersion: v1
kind: Pod
spec:
schedulerName: batch-scheduler # Uses batch-scheduler profile
Preemption
When no node can fit a pod, preemption evicts lower-priority pods:
1. Pod cannot be scheduled (PostFilter triggered)
2. For each node:
a. Find pods that could be preempted (lower priority)
b. Simulate removing them
c. Check if pod would now fit
3. Select node with minimum disruption:
- Fewer preempted pods
- Lower aggregate priority of victims
- Fewest PodDisruptionBudget violations
4. Delete victim pods (graceful termination)
5. Pod gains NominatedNodeName
6. Reschedule once victims terminate
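Steps 2-3 above can be sketched as a simulation (a toy model: real preemption also honors PDBs, affinity, and graceful termination):

```python
def pick_preemption_node(request, pending_priority, nodes):
    """Sketch of victim selection: on each node, evict the lowest-priority
    pods (only those below the pending pod's priority) until the pod fits,
    then prefer the node with the fewest / lowest-priority victims."""
    best = None
    for name, node in nodes.items():
        free = node["allocatable"] - sum(p["request"] for p in node["pods"])
        victims = []
        for p in sorted(node["pods"], key=lambda p: p["priority"]):
            if free >= request:
                break
            if p["priority"] >= pending_priority:
                break                        # equal/higher priority pods are safe
            victims.append(p)
            free += p["request"]
        if free < request:
            continue                         # preemption cannot help on this node
        cost = (len(victims), sum(p["priority"] for p in victims))
        if best is None or cost < best[0]:
            best = (cost, name, victims)
    return (best[1], best[2]) if best else (None, [])

nodes = {
    "node-a": {"allocatable": 4, "pods": [
        {"name": "x", "request": 3, "priority": 100},
        {"name": "y", "request": 1, "priority": 50}]},
    "node-b": {"allocatable": 4, "pods": [
        {"name": "z", "request": 2, "priority": 10}]},
}
node, victims = pick_preemption_node(2, pending_priority=1000, nodes=nodes)
assert node == "node-b" and victims == []   # node-b already has room: zero disruption
```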
Priority Classes:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: critical-infrastructure
value: 1000000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Critical infrastructure pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: batch-workloads
value: 100
preemptionPolicy: Never # Will not preempt others
description: "Batch jobs that can wait"
Part IV: Controller Patterns
The Reconciliation Loop
Controllers implement the observe-diff-act pattern:
while true:
desired := getDesiredState() # From spec
current := getCurrentState() # From status/cluster
diff := compare(desired, current)
if diff.isEmpty():
continue
actions := planActions(diff)
for action in actions:
execute(action)
Level-Triggered vs Edge-Triggered:
| Aspect | Edge-Triggered | Level-Triggered (Kubernetes) |
|---|---|---|
| Trigger | On change event | On state difference |
| Missed events | Can lose events | Always converges |
| Idempotency | Must track what was done | Naturally idempotent |
| Implementation | React to ADDED/MODIFIED/DELETED | Compare desired vs current |
Kubernetes controllers are level-triggered: they react to state, not events. If the controller crashes and misses events, it will still converge on restart.
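A minimal level-triggered reconciler over sets of object names shows why convergence survives missed events (names are illustrative):

```python
def reconcile(desired, current, create, delete):
    """One level-triggered pass: compare the full desired state against the
    full observed state and act only on the difference. Missed events don't
    matter -- the next pass observes the same difference and converges."""
    for name in desired - current:
        create(name)
    for name in current - desired:
        delete(name)

cluster = {"pod-a", "pod-stale"}          # observed state, including a leftover
desired = {"pod-a", "pod-b"}              # spec

reconcile(desired, cluster, cluster.add, cluster.discard)
assert cluster == {"pod-a", "pod-b"}

# Idempotent: a second pass (e.g. after a crash/restart) is a no-op.
reconcile(desired, cluster, cluster.add, cluster.discard)
assert cluster == {"pod-a", "pod-b"}
```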
Work Queue Patterns
Controllers use work queues to decouple event handling from processing:
// Typical controller structure
type Controller struct {
informer cache.SharedIndexInformer
queue workqueue.RateLimitingInterface
}
func (c *Controller) Run(stopCh <-chan struct{}) {
defer c.queue.ShutDown()
// Start informer
go c.informer.Run(stopCh)
// Wait for cache sync
if !cache.WaitForCacheSync(stopCh, c.informer.HasSynced) {
return
}
// Process work queue
for c.processNextItem() {
}
}
func (c *Controller) processNextItem() bool {
key, shutdown := c.queue.Get()
if shutdown {
return false
}
defer c.queue.Done(key)
err := c.reconcile(key.(string))
if err != nil {
c.queue.AddRateLimited(key) // Retry with backoff
return true
}
c.queue.Forget(key) // Success, clear rate limit
return true
}
Leader Election
Only one controller instance should be active:
// Leader election using Lease objects
leaderElectionConfig := leaderelection.LeaderElectionConfig{
Lock: &resourcelock.LeaseLock{
LeaseMeta: metav1.ObjectMeta{
Name: "my-controller-lock",
Namespace: "kube-system",
},
Client: client.CoordinationV1(),
LockConfig: resourcelock.ResourceLockConfig{
Identity: hostname,
},
},
LeaseDuration: 15 * time.Second,
RenewDeadline: 10 * time.Second,
RetryPeriod: 2 * time.Second,
Callbacks: leaderelection.LeaderCallbacks{
OnStartedLeading: func(ctx context.Context) {
controller.Run(ctx.Done())
},
OnStoppedLeading: func() {
log.Fatal("lost leadership")
},
},
}
leaderelection.RunOrDie(ctx, leaderElectionConfig)
Lease Object:
kubectl get lease -n kube-system
kubectl get lease kube-controller-manager -n kube-system -o yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
name: kube-controller-manager
namespace: kube-system
spec:
holderIdentity: master-1_xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
leaseDurationSeconds: 15
acquireTime: "2026-02-22T10:00:00Z"
renewTime: "2026-02-22T10:05:30Z"
leaseTransitions: 3
Part V: Custom Resource Definitions
CRD Structure
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: certificates.cert-manager.io
spec:
group: cert-manager.io
names:
kind: Certificate
listKind: CertificateList
plural: certificates
singular: certificate
shortNames:
- cert
- certs
categories:
- cert-manager
scope: Namespaced
versions:
- name: v1
served: true
storage: true
subresources:
status: {}
additionalPrinterColumns:
- name: Ready
type: string
jsonPath: .status.conditions[?(@.type=="Ready")].status
- name: Secret
type: string
jsonPath: .spec.secretName
- name: Age
type: date
jsonPath: .metadata.creationTimestamp
schema:
openAPIV3Schema:
type: object
required:
- spec
properties:
spec:
type: object
required:
- secretName
- issuerRef
properties:
secretName:
type: string
duration:
type: string
pattern: '^[0-9]+(h|m|s)$'
issuerRef:
type: object
required:
- name
- kind
properties:
name:
type: string
kind:
type: string
enum:
- Issuer
- ClusterIssuer
status:
type: object
properties:
conditions:
type: array
items:
type: object
properties:
type:
type: string
status:
type: string
lastTransitionTime:
type: string
format: date-time
reason:
type: string
message:
type: string
Validation
Structural Schema (Required):
All CRDs must have a structural schema where:
- Every field has a type
- No additionalProperties: true at root
- No nullable: true without type
Common Validation Patterns:
properties:
# Enum constraint
protocol:
type: string
enum: ["TCP", "UDP", "SCTP"]
# Pattern constraint
hostname:
type: string
pattern: '^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$'
# Numeric constraints
replicas:
type: integer
minimum: 0
maximum: 100
# String constraints
name:
type: string
minLength: 1
maxLength: 63
# Default values
retries:
type: integer
default: 3
# Required one of
x-kubernetes-validations:
- rule: "has(self.secretRef) || has(self.configMapRef)"
message: "must specify either secretRef or configMapRef"
Conversion Webhooks
Support multiple API versions with conversion:
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
spec:
conversion:
strategy: Webhook
webhook:
conversionReviewVersions: ["v1"]
clientConfig:
service:
name: my-conversion-webhook
namespace: my-system
path: /convert
caBundle: LS0tLS1CRUdJTi...
Part VI: Admission Control
Admission Pipeline
Request → Authentication → Authorization
        → Mutating Webhooks          (run serially; can modify the object)
        → Object Schema Validation
        → Validating Webhooks        (run in parallel; can only reject)
        → Persist to etcd
Built-in Admission Controllers
| Controller | Purpose |
|---|---|
| NamespaceLifecycle | Prevents operations in terminating/non-existent namespaces |
| LimitRanger | Applies default resource requests/limits |
| ServiceAccount | Mounts service account tokens |
| DefaultStorageClass | Assigns default storage class to PVCs |
| PodSecurity | Enforces Pod Security Standards |
| ResourceQuota | Enforces namespace resource quotas |
| PodTolerationRestriction | Limits toleration modifications |
Webhook Configuration
Mutating Webhook:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
name: vault-agent-injector
webhooks:
- name: vault.hashicorp.com
clientConfig:
service:
name: vault-agent-injector
namespace: vault
path: /mutate
caBundle: LS0tLS1CRUdJTi...
rules:
- apiGroups: [""]
apiVersions: ["v1"]
operations: ["CREATE"]
resources: ["pods"]
scope: "Namespaced"
namespaceSelector:
matchExpressions:
- key: vault-injection
operator: NotIn
values: ["disabled"]
failurePolicy: Ignore # Don't block pod creation if webhook fails
sideEffects: None
admissionReviewVersions: ["v1"]
reinvocationPolicy: IfNeeded # Re-run if other webhooks modify
timeoutSeconds: 10
Validating Webhook:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
name: policy-controller
webhooks:
- name: validation.policy.example.com
clientConfig:
service:
name: policy-controller
namespace: policy-system
path: /validate
rules:
- apiGroups: ["apps"]
apiVersions: ["v1"]
operations: ["CREATE", "UPDATE"]
resources: ["deployments"]
failurePolicy: Fail # Block if webhook unreachable
matchPolicy: Equivalent # Match equivalent API versions
Debugging Admission
# Check webhook configurations
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations
# Describe specific webhook
kubectl describe mutatingwebhookconfiguration vault-agent-injector
# Check webhook endpoint
kubectl get svc -n vault vault-agent-injector
kubectl get endpoints -n vault vault-agent-injector
# Test webhook directly (requires port-forward)
kubectl port-forward -n vault svc/vault-agent-injector 8080:443
curl -k -X POST https://localhost:8080/mutate \
-H "Content-Type: application/json" \
-d '{"apiVersion":"admission.k8s.io/v1","kind":"AdmissionReview",...}'
# Check audit logs for admission decisions
kubectl logs -n kube-system kube-apiserver-master | grep admission
Part VII: Advanced Networking
CNI Plugin Chain
{
"cniVersion": "1.0.0",
"name": "k8s-pod-network",
"plugins": [
{
"type": "cilium-cni"
},
{
"type": "bandwidth",
"capabilities": {"bandwidth": true}
},
{
"type": "portmap",
"capabilities": {"portMappings": true}
}
]
}
eBPF Datapath (Cilium)
Traditional kube-proxy uses iptables:
Packet → iptables (PREROUTING) → routing decision → iptables (FORWARD) →
→ iptables (POSTROUTING) → egress
# Results in O(n) rule matching for n services
# iptables rules must be fully rewritten on any change
Cilium’s eBPF datapath:
Packet → eBPF (XDP/TC ingress) → direct routing/NAT → eBPF (TC egress) → egress
# O(1) lookups using BPF maps
# Incremental updates without full rewrite
# Operates at kernel level, no context switching
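The complexity difference is easy to demonstrate (a toy model: real iptables rules and BPF maps are kernel data structures, not Python):

```python
# kube-proxy iptables model: scan rules until one matches -- O(n) in services
def iptables_lookup(rules, vip):
    for rule_vip, backend in rules:      # linear scan over every service rule
        if rule_vip == vip:
            return backend
    return None

# eBPF-style model: a hash map keyed by VIP -- O(1) regardless of service count
def bpf_map_lookup(service_map, vip):
    return service_map.get(vip)

rules = [(f"10.43.0.{i}", f"pod-{i}") for i in range(1, 201)]
service_map = dict(rules)

# Same answer, but the map lookup cost does not grow with the service count,
# and adding a service is one map insert rather than a full rule rewrite.
assert iptables_lookup(rules, "10.43.0.200") == "pod-200"
assert bpf_map_lookup(service_map, "10.43.0.200") == "pod-200"
```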
Key eBPF Maps:
| Map | Purpose |
|---|---|
| cilium_lxc | Pod endpoints (IP, MAC, identity) |
| cilium_ipcache | IP → identity mapping |
| cilium_lb4_services | Service VIP → backend mapping |
| cilium_lb4_backends | Backend pod IPs |
| cilium_policy | NetworkPolicy rules |
| cilium_ct4_global | Connection tracking |
# Inspect Cilium BPF maps
cilium bpf lb list
cilium bpf ct list global
cilium bpf policy get -n default
# Debug packet flow
cilium monitor --type trace
cilium monitor --type drop
# Hubble flow observation
hubble observe --namespace production --protocol TCP
hubble observe --verdict DROPPED
Service Implementation
ClusterIP:
1. Pod sends packet to ClusterIP (10.43.x.x)
2. eBPF/iptables performs DNAT:
- Select backend pod (round-robin, random, or session affinity)
- Rewrite destination IP to pod IP
3. Packet routed to backend pod
4. Reply packet is reverse-translated by conntrack (source rewritten back to the ClusterIP)
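Backend selection in step 2 can be sketched as follows (`ClusterIPService` is a toy model; real kube-proxy/eBPF state lives in the kernel):

```python
import hashlib
import itertools

class ClusterIPService:
    """Sketch of DNAT backend selection for a ClusterIP service:
    round-robin by default, sticky per client when session affinity is on."""
    def __init__(self, backends, session_affinity=False):
        self.backends = backends
        self.affinity = session_affinity
        self._rr = itertools.cycle(backends)   # default round-robin
        self._sticky = {}                      # client IP -> chosen backend

    def pick(self, client_ip):
        if self.affinity:
            if client_ip not in self._sticky:
                # hash the client IP onto a backend, then remember the choice
                h = int(hashlib.sha256(client_ip.encode()).hexdigest(), 16)
                self._sticky[client_ip] = self.backends[h % len(self.backends)]
            return self._sticky[client_ip]
        return next(self._rr)

svc = ClusterIPService(["10.244.1.5", "10.244.2.7"])
assert svc.pick("10.0.0.1") != svc.pick("10.0.0.1")   # alternates backends

sticky = ClusterIPService(["10.244.1.5", "10.244.2.7"], session_affinity=True)
assert sticky.pick("10.0.0.1") == sticky.pick("10.0.0.1")  # same client, same backend
```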
NodePort:
1. External client connects to NodeIP:NodePort
2. Any node can receive (kube-proxy listens on all nodes)
3. DNAT to backend pod (may be on different node)
4. If pod on different node: SNAT to node IP (default) or reject (externalTrafficPolicy: Local)
LoadBalancer:
Cloud LB → NodePort → ClusterIP → Pod
Or with MetalLB (bare metal):
1. MetalLB assigns external IP from pool
2. ARP announcement (Layer 2) or BGP advertisement (Layer 3)
3. Traffic flows to node advertising the IP
4. NodePort processing continues
NetworkPolicy
Default Deny All:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: default-deny-all
namespace: production
spec:
podSelector: {} # Matches all pods
policyTypes:
- Ingress
- Egress
Allow Specific Traffic:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-policy
namespace: production
spec:
podSelector:
matchLabels:
app: api-server
policyTypes:
- Ingress
- Egress
ingress:
- from:
# From pods with specific label
- podSelector:
matchLabels:
role: frontend
# OR from specific namespace
- namespaceSelector:
matchLabels:
name: monitoring
# OR from specific CIDR
- ipBlock:
cidr: 10.50.1.0/24
except:
- 10.50.1.100/32
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app: database
ports:
- protocol: TCP
port: 5432
# Allow DNS
- to:
- namespaceSelector: {}
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
Cilium Extended Policies:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: layer7-policy
spec:
endpointSelector:
matchLabels:
app: api-server
ingress:
- fromEndpoints:
- matchLabels:
role: frontend
toPorts:
- ports:
- port: "8080"
protocol: TCP
rules:
http:
- method: "GET"
path: "/api/v1/.*"
- method: "POST"
path: "/api/v1/data"
headers:
- 'Content-Type: application/json'
Part VIII: Storage Deep Dive
CSI Architecture
Components:
| Component | Function |
|---|---|
| external-provisioner | Watches PVCs, calls CreateVolume |
| external-attacher | Watches VolumeAttachments, calls ControllerPublishVolume |
| external-resizer | Watches PVCs for resize, calls ControllerExpandVolume |
| external-snapshotter | Creates VolumeSnapshots via CreateSnapshot |
| node-driver-registrar | Registers the CSI driver with kubelet |
| CSI Driver | Implements the CSI gRPC interface |
CSI Methods:
Controller Service:
CreateVolume(name, capacity, parameters) → volume_id
DeleteVolume(volume_id)
ControllerPublishVolume(volume_id, node_id) → attach info
ControllerUnpublishVolume(volume_id, node_id)
CreateSnapshot(source_volume_id, name) → snapshot_id
ControllerExpandVolume(volume_id, new_capacity)
Node Service:
NodeStageVolume(volume_id, staging_path) # Mount to staging (e.g., format)
NodePublishVolume(volume_id, target_path) # Bind mount to pod
NodeUnpublishVolume(volume_id, target_path)
NodeUnstageVolume(volume_id, staging_path)
NodeExpandVolume(volume_id, new_capacity) # Filesystem resize
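The stage/publish split exists so one node-level mount can back many pods. A toy model of the bookkeeping (`NodeService` mirrors the CSI method names but is illustrative, with assertions standing in for gRPC errors):

```python
class NodeService:
    """Sketch: a volume is staged (formatted, mounted) once per node, then
    bind-mounted per pod; it may only be unstaged when no pod still uses it."""
    def __init__(self):
        self.staged = {}      # volume_id -> staging path
        self.published = {}   # volume_id -> set of pod target paths

    def node_stage_volume(self, volume_id, staging_path):
        if volume_id not in self.staged:          # idempotent, like real CSI
            self.staged[volume_id] = staging_path

    def node_publish_volume(self, volume_id, target_path):
        assert volume_id in self.staged, "must stage before publish"
        self.published.setdefault(volume_id, set()).add(target_path)

    def node_unpublish_volume(self, volume_id, target_path):
        self.published[volume_id].discard(target_path)

    def node_unstage_volume(self, volume_id):
        assert not self.published.get(volume_id), "pods still using volume"
        del self.staged[volume_id]

n = NodeService()
n.node_stage_volume("vol-1", "/var/lib/kubelet/staging/vol-1")
n.node_publish_volume("vol-1", "/pods/a/volumes/vol-1")
n.node_publish_volume("vol-1", "/pods/b/volumes/vol-1")  # two pods, one staging mount
n.node_unpublish_volume("vol-1", "/pods/a/volumes/vol-1")
n.node_unpublish_volume("vol-1", "/pods/b/volumes/vol-1")
n.node_unstage_volume("vol-1")                           # safe: no pods left
```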
Volume Lifecycle
1. User creates PVC
│
2. PV Controller finds/creates matching PV
│ (via StorageClass provisioner)
│
3. PVC and PV become Bound
│
4. Pod scheduled, references PVC
│
5. Volume Controller creates VolumeAttachment
│
6. external-attacher calls ControllerPublishVolume
│ (attaches volume to node in cloud/SAN)
│
7. kubelet sees pod needs volume
│
8. kubelet calls NodeStageVolume
│ (format filesystem if needed, mount to staging)
│
9. kubelet calls NodePublishVolume
│ (bind mount to pod's directory)
│
10. Container starts with volume mounted
Topology-Aware Provisioning
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: regional-ssd
provisioner: pd.csi.storage.gke.io
parameters:
type: pd-ssd
replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer # Don't provision until pod scheduled
allowedTopologies:
- matchLabelExpressions:
- key: topology.gke.io/zone
values:
- us-central1-a
- us-central1-b
Volume Snapshots
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: csi-snapclass
driver: pd.csi.storage.gke.io
deletionPolicy: Delete
parameters:
storage-locations: us-central1
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: prometheus-snapshot
spec:
volumeSnapshotClassName: csi-snapclass
source:
persistentVolumeClaimName: prometheus-data
---
# Restore from snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: prometheus-restored
spec:
storageClassName: regional-ssd
dataSource:
name: prometheus-snapshot
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
Part IX: Security Model
Pod Security Standards
Three profiles (replace deprecated PodSecurityPolicy):
| Profile | Restrictions |
|---|---|
| Privileged | No restrictions (cluster admins, CNI, storage drivers) |
| Baseline | Minimal restrictions: no privileged containers, no host namespaces, no hostPath volumes, limited capabilities |
| Restricted | Maximum restrictions: must run as non-root, must drop ALL capabilities, no privilege escalation, seccomp profile required |
Enforcement:
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/enforce-version: latest
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
RBAC Patterns
Principle of Least Privilege:
# Narrow: specific verbs, specific resources, specific names
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-restarter
  namespace: production
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  resourceNames: ["api-server", "web-frontend"]  # Only these deployments
  verbs: ["get", "patch"]  # Only read and patch (for rollout restart)
# Additional rule in the same Role if scaling is also needed
- apiGroups: ["apps"]
  resources: ["deployments/scale"]
  resourceNames: ["api-server", "web-frontend"]
  verbs: ["patch"]
Aggregated ClusterRoles:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: monitoring-reader
labels:
rbac.example.com/aggregate-to-monitoring: "true"
rules:
- apiGroups: [""]
resources: ["pods", "services", "endpoints"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: monitoring-aggregated
aggregationRule:
clusterRoleSelectors:
- matchLabels:
rbac.example.com/aggregate-to-monitoring: "true"
rules: [] # Automatically populated
Service Account Token Projection
Bound tokens (recommended):
apiVersion: v1
kind: Pod
spec:
serviceAccountName: my-app
automountServiceAccountToken: false # Disable default mount
containers:
- name: app
volumeMounts:
- name: token
mountPath: /var/run/secrets/tokens
readOnly: true
volumes:
- name: token
projected:
sources:
- serviceAccountToken:
path: token
expirationSeconds: 3600 # Short-lived
audience: my-app.example.com # Bound to specific audience
Secrets Management
Encryption at Rest:
# /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
- secrets
providers:
- aescbc:
keys:
- name: key1
secret: <base64-encoded-32-byte-key>
- identity: {} # Fallback for reading old unencrypted secrets
External Secrets:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: database-credentials
spec:
refreshInterval: 1h
secretStoreRef:
name: vault-backend
kind: ClusterSecretStore
target:
name: database-credentials
creationPolicy: Owner
data:
- secretKey: username
remoteRef:
key: secret/data/database/production
property: username
- secretKey: password
remoteRef:
key: secret/data/database/production
property: password
Part X: Debugging
kubectl debug
Ephemeral Debug Containers:
# Add debug container to running pod
kubectl debug -it pod/nginx --image=nicolaka/netshoot --target=nginx
# The --target flag shares process namespace with that container
# Allows: ps aux, strace -p <pid>, nsenter, etc.
Node Debugging:
# Debug node (creates privileged pod)
kubectl debug node/worker-1 -it --image=ubuntu
# Access node filesystem
chroot /host
Copy Pod for Debugging:
# Create copy with modified command
kubectl debug pod/myapp --copy-to=myapp-debug --container=app -- sh
# Create copy with different image
kubectl debug pod/myapp --copy-to=myapp-debug --set-image=app=busybox
crictl
Direct container runtime interaction:
# List containers
crictl ps -a
# Get container details
crictl inspect <container-id>
# Container logs (bypasses kubelet)
crictl logs <container-id>
# Execute in container
crictl exec -it <container-id> sh
# Pull image
crictl pull docker.io/library/nginx:latest
# List images
crictl images
# Pod sandbox operations
crictl pods
crictl inspectp <pod-id>
# Stats
crictl stats
Event Investigation
# All events, sorted by time
kubectl get events --sort-by=.metadata.creationTimestamp
# Events for specific resource
kubectl get events --field-selector involvedObject.name=nginx-abc123
# Warning events only
kubectl get events --field-selector type=Warning
# Events in watch mode
kubectl get events -w
# Describe shows events at bottom
kubectl describe pod nginx-abc123
Log Analysis
# Structured log parsing (JSON logs)
kubectl logs deployment/api-server | jq -r 'select(.level=="error") | .message'
# Multi-container
kubectl logs pod/myapp -c sidecar --previous
# Follow with timestamps
kubectl logs -f deployment/api-server --timestamps=true
# Since duration
kubectl logs deployment/api-server --since=1h
# Aggregate logs from all pods in deployment
kubectl logs deployment/api-server --all-containers=true --prefix=true
# Stern for multi-pod tailing (if installed)
stern "api-.*" -n production --since 10m
Resource Inspection
# Custom columns
kubectl get pods -o custom-columns=\
NAME:.metadata.name,\
NODE:.spec.nodeName,\
STATUS:.status.phase,\
IP:.status.podIP,\
RESTARTS:.status.containerStatuses[0].restartCount
# JSONPath
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
# Get specific field
kubectl get pod nginx -o jsonpath='{.spec.containers[0].image}'
# Sort by field
kubectl get pods --sort-by=.status.startTime
# Raw API request
kubectl get --raw /api/v1/namespaces/default/pods | jq .
Part XI: Performance Tuning
Resource Management
Requests vs Limits:
| Aspect | Requests | Limits |
|---|---|---|
| Scheduling | Used to find a suitable node | Not considered |
| QoS Class | Determines class | Must equal requests for Guaranteed |
| Enforcement | No effect at runtime | CPU throttled, memory OOM-killed on breach |
| Recommendation | Set based on actual usage | Set to prevent runaway consumption |
QoS Classes:
Guaranteed: requests == limits for all containers (CPU and memory)
→ Highest priority, last to be evicted
Burstable: requests < limits OR only requests set
→ Medium priority
BestEffort: no requests or limits set
→ Lowest priority, first to be evicted
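The classification rules can be written down directly (a simplified sketch: the real kubelet also defaults requests from limits before classifying):

```python
def qos_class(containers):
    """Classify a pod: Guaranteed requires requests == limits for both CPU
    and memory in every container; any resources set at all rules out
    BestEffort; everything in between is Burstable."""
    has_any = any(c.get("requests") or c.get("limits") for c in containers)
    if not has_any:
        return "BestEffort"
    for c in containers:
        for resource in ("cpu", "memory"):
            req = c.get("requests", {}).get(resource)
            lim = c.get("limits", {}).get(resource)
            if req is None or lim is None or req != lim:
                return "Burstable"
    return "Guaranteed"

assert qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                   "limits":   {"cpu": "500m", "memory": "1Gi"}}]) == "Guaranteed"
assert qos_class([{"requests": {"cpu": "500m"}}]) == "Burstable"
assert qos_class([{}]) == "BestEffort"
```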
Pod Disruption Budgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-server-pdb
spec:
minAvailable: 2 # OR maxUnavailable: 1
selector:
matchLabels:
app: api-server
unhealthyPodEvictionPolicy: IfHealthyBudget # New in 1.27
Topology Spread Constraints
apiVersion: v1
kind: Pod
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: api-server
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: api-server
Vertical Pod Autoscaler
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-server-vpa
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: api-server
updatePolicy:
updateMode: "Auto" # Off, Initial, Recreate, Auto
resourcePolicy:
containerPolicies:
- containerName: "*"
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 4
memory: 8Gi
controlledResources: ["cpu", "memory"]
controlledValues: RequestsAndLimits
Appendix A: Quick Reference
Object Short Names
| Kind | Short Name | API Group |
|---|---|---|
| ConfigMap | cm | core |
| DaemonSet | ds | apps |
| Deployment | deploy | apps |
| Endpoints | ep | core |
| Event | ev | core |
| HorizontalPodAutoscaler | hpa | autoscaling |
| Ingress | ing | networking.k8s.io |
| Namespace | ns | core |
| NetworkPolicy | netpol | networking.k8s.io |
| Node | no | core |
| PersistentVolume | pv | core |
| PersistentVolumeClaim | pvc | core |
| Pod | po | core |
| ReplicaSet | rs | apps |
| Service | svc | core |
| ServiceAccount | sa | core |
| StatefulSet | sts | apps |
| StorageClass | sc | storage.k8s.io |
Essential Commands
# Cluster info
kubectl cluster-info
kubectl get nodes -o wide
kubectl top nodes
# Resource usage
kubectl top pods --containers
kubectl describe node <node> | grep -A5 "Allocated resources"
# Debugging
kubectl get events --sort-by='.lastTimestamp'
kubectl logs <pod> --previous
kubectl debug -it <pod> --image=nicolaka/netshoot
# Dry run and diff
kubectl apply -f manifest.yaml --dry-run=server
kubectl diff -f manifest.yaml
# Force delete
kubectl delete pod <pod> --grace-period=0 --force
# Get all resources in namespace
kubectl api-resources --verbs=list --namespaced -o name | \
xargs -n 1 kubectl get --show-kind --ignore-not-found -n <namespace>
Appendix B: Recommended Reading
- Brendan Burns, Joe Beda, Kelsey Hightower. Kubernetes: Up and Running. O’Reilly, 2022.
- Michael Hausenblas, Stefan Schimanski. Programming Kubernetes. O’Reilly, 2019.
- Kubernetes Documentation. kubernetes.io/docs/
- Cilium Documentation. docs.cilium.io/
- etcd Documentation. etcd.io/docs/