Kubernetes Architecture Deep Dive
A deep technical reference on Kubernetes architecture, written by a network engineer transitioning from traditional infrastructure. This document bridges the gap between CCNP-level networking and cloud-native container orchestration.
Executive Summary
Kubernetes is an API-driven orchestration platform that treats infrastructure as code. Every component—pods, services, storage, networking—is a declarative object managed through a unified API.
| Concept | Traditional Infrastructure Equivalent |
|---|---|
| Pod | Process running on a server |
| Service | Load balancer VIP + DNS entry |
| Ingress | Reverse proxy / HAProxy |
| ConfigMap | Configuration file |
| Secret | Encrypted credential store |
| PersistentVolume | SAN LUN / NFS mount |
| Namespace | VLAN / VRF segmentation |
| NetworkPolicy | Firewall ACL |
The API-Driven Model
Everything in Kubernetes is an API call. This is the fundamental concept that makes Kubernetes powerful.
How It Works
When you run kubectl apply -f deployment.yaml:
1. kubectl serializes YAML → JSON
2. kubectl sends the object to kube-apiserver over HTTPS (POST to create, PATCH to update)
3. API server authenticates the caller (x509, token, OIDC)
4. API server authorizes the request (RBAC policies)
5. API server validates against the OpenAPI schema and runs admission control
6. API server persists to etcd (distributed key-value store)
7. Controllers WATCH for changes
8. Controllers RECONCILE reality to match desired state
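The steps above operate on a declarative manifest. For reference, a minimal sketch of what a deployment.yaml might contain (the name, image tag, and replica count are illustrative, not from this environment):

```yaml
# Hypothetical deployment.yaml -- names and image are illustrative
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.27
        ports:
        - containerPort: 80
```

Once persisted to etcd, the Deployment controller reconciles it into a ReplicaSet, which reconciles into Pods — nothing imperative ever touches a node directly.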
Control Plane Components
kube-apiserver
The central hub of Kubernetes. All communication flows through it.
| Function | Description | Network Equivalent |
|---|---|---|
| Authentication | Verify identity (x509, tokens, OIDC) | RADIUS/TACACS+ |
| Authorization | Check permissions (RBAC) | ACLs / privilege levels |
| Admission Control | Validate/mutate requests | Firewall inspection |
| Persistence | Store in etcd | Configuration database |
| Watch | Push changes to controllers | Syslog / SNMP traps |
etcd
Distributed key-value store. The single source of truth.
```
etcd cluster (Raft consensus)
├── /registry/pods/default/nginx-abc123
├── /registry/services/monitoring/prometheus
├── /registry/secrets/vault-system/vault-tls
└── /registry/configmaps/argocd/argocd-cm
```
Network analogy: This is like the NVRAM/startup-config that persists across reboots, but distributed and versioned.
Controllers
Specialized reconciliation loops for each resource type:
| Controller | Watches | Actions |
|---|---|---|
| Deployment Controller | Deployments | Creates/updates ReplicaSets |
| ReplicaSet Controller | ReplicaSets | Creates/deletes Pods |
| Service Controller | Services (type=LoadBalancer) | Provisions cloud LBs |
| Endpoint Controller | Services + Pods | Updates endpoint lists |
| Node Controller | Nodes | Marks unhealthy nodes |
Node Components
kubelet
The agent running on each node. Responsibilities:
- Register node with API server
- Watch for pod assignments
- Call container runtime (CRI)
- Report pod status back
- Execute liveness/readiness probes
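Because the kubelet itself executes probes, they are declared per container in the pod spec. A hedged sketch of a container fragment (the endpoint paths and port are assumptions):

```yaml
# Pod-spec fragment -- probe paths and port are illustrative assumptions
containers:
- name: app
  image: example/app:latest   # hypothetical image
  livenessProbe:              # failure -> kubelet restarts the container
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 15
  readinessProbe:             # failure -> pod removed from service endpoints
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
```

The distinction matters operationally: liveness failures cause restarts, readiness failures only pull the pod out of load-balancing rotation.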
Container Runtime (containerd)
The component that actually runs containers, sitting beneath the kubelet.
Key insight: Docker is NOT required. containerd speaks the same image format (OCI) but without Docker's overhead.
kube-proxy vs Cilium
Traditional kube-proxy uses iptables. Cilium replaces this with eBPF.
| Feature | kube-proxy (iptables) | Cilium (eBPF) |
|---|---|---|
| Implementation | iptables rules | eBPF programs in kernel |
| Performance | O(n) rule matching | O(1) hash lookups |
| Observability | Limited | Hubble (full flow visibility) |
| Network Policy | Basic | L3-L7 with identity |
| Service Mesh | Requires Istio/Linkerd | Built-in (optional) |
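The L3-L7 row is where the two diverge most: an identity-aware, HTTP-level rule has no iptables equivalent. A sketch using the CiliumNetworkPolicy CRD (labels, port, and path are assumptions, not from this environment):

```yaml
# Sketch: L7-aware policy via Cilium's CRD -- labels/port/path are assumptions
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-metrics-query
  namespace: monitoring
spec:
  endpointSelector:
    matchLabels:
      app: prometheus          # policy applies to these endpoints
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: grafana           # identity-based source, not an IP range
    toPorts:
    - ports:
      - port: "9090"
        protocol: TCP
      rules:
        http:                  # L7 filtering: only this method/path allowed
        - method: GET
          path: "/api/v1/query"
```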
Pod Lifecycle
States
| State | Description | Common Causes |
|---|---|---|
| Pending | Accepted but not scheduled | No node capacity, image pull |
| Running | At least one container running | Normal operation |
| Succeeded | All containers exited 0 | Jobs, init containers |
| Failed | Containers exited non-zero | Application crash |
| Unknown | Node communication lost | Network partition |
Pod Phases in Detail
Key transitions in the pod state machine:

- PENDING → Init: pod scheduled to a node, init containers starting
- Init → Initializing: init containers complete, main containers starting
- Initializing → RUNNING: containers ready, probes passing
- RUNNING → SUCCEEDED/FAILED: containers exit (exit code determines the state)
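The Init phase corresponds to initContainers in the pod spec, each of which must exit 0 before the main containers start. A sketch (the names, image, and db:5432 dependency are assumptions):

```yaml
# Sketch: a pod that passes through the Init phase before RUNNING
apiVersion: v1
kind: Pod
metadata:
  name: app-with-init          # hypothetical name
spec:
  initContainers:
  - name: wait-for-db          # runs to completion before main containers
    image: busybox:1.36
    command: ["sh", "-c", "until nc -z db 5432; do sleep 2; done"]
  containers:
  - name: app
    image: example/app:latest  # hypothetical image
```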
Networking Model
The Four Requirements
Kubernetes networking must satisfy:
- Pod-to-Pod: all pods can communicate without NAT
- Pod-to-Service: services provide stable endpoints
- External-to-Service: Ingress exposes services
- Pod-to-External: pods can reach the internet
Network Identity
Every pod gets:
- Unique IP (from CNI plugin)
- DNS name (pod-ip.namespace.pod.cluster.local)
- Namespace isolation (optional, via NetworkPolicy)
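The optional isolation in the last bullet is expressed as a NetworkPolicy object — the firewall-ACL analogue from the summary table. A sketch (namespace, labels, and port are assumptions):

```yaml
# Sketch: allow only Grafana to reach Prometheus -- labels/port are assumptions
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-grafana-to-prometheus
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: prometheus      # policy applies to these pods
  policyTypes: ["Ingress"]
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: grafana     # only traffic from matching pods is allowed
    ports:
    - protocol: TCP
      port: 9090
```

As with ACLs, selecting a pod in any Ingress policy implicitly denies all other ingress to it.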
Service Types
| Type | Scope | Use Case | Network Equivalent |
|---|---|---|---|
| ClusterIP | Internal only | Inter-service communication | Private VLAN |
| NodePort | Node IP:Port | Development, testing | Static NAT |
| LoadBalancer | External IP | Production external access | VIP with health checks |
| ExternalName | DNS CNAME | External service reference | DNS alias |
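As a concrete instance of the ClusterIP row, a sketch (name, namespace, and port are borrowed from the Prometheus examples elsewhere in this document; treat them as assumptions):

```yaml
# Sketch of a ClusterIP service -- the "private VLAN" row above
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  type: ClusterIP          # default; internal-only VIP
  selector:
    app: prometheus        # endpoints = pods carrying this label
  ports:
  - port: 9090             # the VIP port
    targetPort: 9090       # the container port behind it
```

CoreDNS then resolves prometheus.monitoring.svc.cluster.local to the stable VIP, regardless of pod churn.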
Storage Architecture
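The core abstraction here pairs a PersistentVolumeClaim with a provisioner that satisfies it. A minimal sketch, assuming k3s's bundled local-path StorageClass (name, namespace, and size are illustrative):

```yaml
# Sketch: claim storage via k3s's bundled local-path provisioner
# (name, size, and namespace are assumptions)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]   # single-node read/write, like a local LUN
  storageClassName: local-path
  resources:
    requests:
      storage: 10Gi
```

In SAN terms: the PVC is the LUN request, the StorageClass is the storage array profile, and the bound PersistentVolume is the provisioned LUN.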
Security Model
RBAC Model
```yaml
# Role: what actions on what resources
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: monitoring
  name: prometheus-reader
rules:
- apiGroups: [""]
  resources: ["pods", "services", "endpoints"]
  verbs: ["get", "list", "watch"]
---
# RoleBinding: who gets the role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: monitoring
  name: prometheus-reader-binding
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
roleRef:
  kind: Role
  name: prometheus-reader
  apiGroup: rbac.authorization.k8s.io
```
k3s: Lightweight Kubernetes
What k3s Simplifies
| Component | Full Kubernetes | k3s |
|---|---|---|
| Binary | Multiple (apiserver, scheduler, etc.) | Single binary (~60MB) |
| etcd | External cluster | Embedded SQLite or etcd |
| Container Runtime | Docker/containerd | containerd (built-in) |
| Networking | Manual CNI setup | Flannel included (or disable for Cilium) |
| Storage | Manual CSI setup | Local-path provisioner included |
| Load Balancer | Cloud provider or MetalLB | ServiceLB included |
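The "disable for Cilium" note in the Networking row maps to a pair of server flags. In config-file form (the k3s docs document /etc/rancher/k3s/config.yaml as the flag file), a sketch might look like:

```yaml
# Sketch: k3s server config disabling the bundled CNI so Cilium can take over
# (flag names per the k3s documentation; verify against your k3s version)
flannel-backend: "none"
disable-network-policy: true
```

With Flannel and the built-in policy controller out of the way, Cilium is installed afterwards (typically via Helm) as the cluster's CNI.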
Workload Patterns
Current Domus Digitalis Workloads
| Workload | Type | Purpose | Status |
|---|---|---|---|
| Prometheus | StatefulSet | Metrics collection | Planned |
| Grafana | Deployment | Visualization | Planned |
| ArgoCD | Deployment | GitOps CD | Planned |
| Traefik | DaemonSet/Deployment | Ingress | Planned |
| Wazuh | StatefulSet | SIEM/XDR | Planned |
| MinIO | StatefulSet | S3 storage | Planned |
Additional Workloads to Consider
| Workload | Purpose | Why Consider | Complexity |
|---|---|---|---|
| Loki | Log aggregation | Complements Prometheus (metrics + logs) | Medium |
| Tempo | Distributed tracing | Complete observability stack | Medium |
| Cert-Manager | Certificate automation | Auto-renew certs from Vault PKI | Low |
| External-DNS | DNS automation | Auto-create DNS entries for ingress | Low |
| Velero | Backup/restore | Disaster recovery for k8s resources | Medium |
| Kyverno | Policy engine | Enforce security policies | Medium |
| Falco | Runtime security | Detect anomalous behavior | Medium |
| Harbor | Container registry | Private OCI registry with scanning | High |
| Keycloak | Identity provider | SSO for all services (could move from VM) | High |
| Gitea | Git server | Could move from NAS to k8s | Medium |
| Vault | Secrets management | Could run Vault IN k8s (HA easier) | High |
| Teleport | Access management | SSH/k8s/DB access gateway | High |
| Backstage | Developer portal | Service catalog, documentation | High |
Mental Models
For Network Engineers
| Network Concept | Kubernetes Equivalent |
|---|---|
| VLAN | Namespace (logical isolation) |
| VRF | NetworkPolicy (routing isolation) |
| ACL | NetworkPolicy (allow/deny rules) |
| HSRP/VRRP | Service (stable VIP) |
| Load Balancer | Service type=LoadBalancer + Ingress |
| DNS | CoreDNS + Service discovery |
| SNMP/Syslog | Prometheus metrics + Loki logs |
| RADIUS | ServiceAccount + RBAC |
| 802.1X | Pod Security Admission |
| Spanning Tree | Pod anti-affinity (avoid single points) |
| BGP Peering | Cilium BGP (advertise service IPs) |
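The Spanning Tree row has a concrete form: anti-affinity tells the scheduler to spread replicas across failure domains. A pod-spec fragment as a sketch (the label is an assumption):

```yaml
# Pod-spec fragment -- spread replicas across nodes; label is an assumption
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: grafana                   # "don't co-locate with pods like me"
      topologyKey: kubernetes.io/hostname  # failure domain = the node
```

Swapping required for preferredDuringSchedulingIgnoredDuringExecution makes it a soft preference instead of a hard scheduling constraint.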
For Systems Administrators
| Sysadmin Concept | Kubernetes Equivalent |
|---|---|
| VM | Pod (but ephemeral) |
| systemd service | Deployment/StatefulSet |
| /etc/config | ConfigMap |
| /etc/secrets | Secret (or Vault) |
| cron job | CronJob |
| init scripts | Init containers |
| health check | Liveness/Readiness probes |
| log files | stdout/stderr → log aggregator |
| disk mount | PersistentVolumeClaim |
| firewall rules | NetworkPolicy |
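The cron-job row translates almost one-to-one — even the schedule syntax is the same. A sketch (schedule, names, and image are assumptions):

```yaml
# Sketch: the k8s counterpart of a crontab entry -- names/image are assumptions
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup            # hypothetical name
spec:
  schedule: "0 2 * * *"           # standard crontab syntax: 02:00 daily
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: example/backup:latest   # hypothetical image
```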
Command Reference
Essential kubectl Commands
```bash
# Cluster info
kubectl cluster-info
kubectl get nodes -o wide

# Workloads
kubectl get pods -A                      # All namespaces
kubectl get pods -n monitoring -o wide   # Specific namespace
kubectl describe pod <name>              # Detailed info
kubectl logs <pod> -f                    # Follow logs
kubectl logs <pod> -c <container>        # Specific container
kubectl exec -it <pod> -- /bin/sh        # Shell access

# Resources
kubectl top nodes                        # Node resource usage
kubectl top pods -A                      # Pod resource usage

# Debugging
kubectl get events --sort-by='.lastTimestamp'
kubectl describe node <node> | grep -A5 Conditions
kubectl auth can-i create pods --as=system:serviceaccount:default:mysa

# Advanced
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
kubectl get pods -o custom-columns='NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP'
```
Cilium Commands
```bash
# Status
cilium status
cilium connectivity test

# Network flows (Hubble)
hubble observe --namespace monitoring
hubble observe --protocol TCP --port 9090

# Policy
cilium policy get
cilium endpoint list
```
Helm Commands
```bash
# Repository management
helm repo add <name> <url>
helm repo update
helm search repo <keyword>

# Installation
helm install <release> <chart> -n <namespace> -f values.yaml
helm upgrade <release> <chart> -n <namespace> -f values.yaml
helm rollback <release> <revision>

# Inspection
helm list -A
helm get values <release> -n <namespace>
helm get manifest <release> -n <namespace>
```
Troubleshooting Framework
The Debugging Ladder
```
Level 1: Is the pod running?
├── kubectl get pods -n <namespace>
├── kubectl describe pod <pod>
└── kubectl logs <pod>

Level 2: Is the service routing?
├── kubectl get svc,endpoints -n <namespace>
├── kubectl exec <pod> -- curl <service>:<port>
└── hubble observe --namespace <namespace>

Level 3: Is networking working?
├── cilium connectivity test
├── kubectl exec <pod> -- nslookup <service>
└── kubectl exec <pod> -- ping <ip>

Level 4: Is storage attached?
├── kubectl get pv,pvc
├── kubectl describe pvc <pvc>
└── kubectl exec <pod> -- df -h

Level 5: Is the node healthy?
├── kubectl describe node <node>
├── kubectl top node
└── ssh <node> "journalctl -u k3s"
```
Common Issues and Fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| Pod stuck Pending | No resources / no PV | kubectl describe pod and check Events |
| Pod CrashLoopBackOff | App crashing | kubectl logs <pod> --previous |
| Service not resolving | CoreDNS issue | Check CoreDNS pods in kube-system |
| ImagePullBackOff | Registry auth / image not found | Check image name, registry credentials |
| Vault injection failed | TLS or auth issue | Check vault-agent logs in pod |
| NetworkPolicy blocking | Missing allow rule | hubble observe to find dropped flows |
Architecture Decision Records
ADR-001: k3s over Full Kubernetes
Decision: Use k3s instead of kubeadm/RKE/EKS.
Rationale:
- Single-node deployment (no HA requirement yet)
- Reduced resource overhead (~512MB vs 2GB+)
- Simpler operations (single binary)
- Still 100% Kubernetes compatible

Trade-offs:
- Less flexibility in component versions
- Some enterprise features require additional setup
ADR-002: Cilium over Flannel
Decision: Replace default Flannel with Cilium.
Rationale:
- eBPF performance (O(1) vs iptables O(n))
- Hubble observability (network flow visibility)
- L7 network policies (HTTP-aware rules)
- Native integration with Prometheus

Trade-offs:
- More complex initial setup
- Higher memory usage (~200MB)
ADR-003: Vault External over In-Cluster
Decision: Keep Vault as external VM, not in k8s.
Rationale:
- Vault manages secrets for k8s (chicken-and-egg problem)
- HA easier with dedicated VMs
- Existing investment in vault-01 infrastructure

Trade-offs:
- External dependency for k8s workloads
- Network latency for secret retrieval
Glossary
| Term | Definition |
|---|---|
| CNI | Container Network Interface - plugin API for pod networking |
| CRI | Container Runtime Interface - plugin API for container execution |
| CSI | Container Storage Interface - plugin API for storage |
| CRD | Custom Resource Definition - extend k8s API with custom types |
| DaemonSet | Run exactly one pod per node (like agents) |
| Deployment | Stateless workload with rolling updates |
| eBPF | Extended Berkeley Packet Filter - kernel-level programmability |
| Helm | Package manager for Kubernetes (charts = packages) |
| Ingress | L7 routing (HTTP host/path to service) |
| Kubelet | Node agent that runs pods |
| Namespace | Logical isolation boundary (like VLAN) |
| Operator | Custom controller managing complex applications |
| Pod | Smallest deployable unit (one or more containers) |
| RBAC | Role-Based Access Control |
| Service | Stable network endpoint for pods (like VIP) |
| StatefulSet | Stateful workload with stable identity and storage |