Performance Troubleshooting
Quick Reference
# System overview
top
htop
vmstat 1 5
# CPU
mpstat -P ALL 1
pidstat -u 1
# Memory
free -h
vmstat -s
pidstat -r 1
# Disk I/O
iostat -xz 1
iotop
pidstat -d 1
# Network
sar -n DEV 1
ss -s
iftop
# Process analysis
ps aux --sort=-%cpu | head
ps aux --sort=-%mem | head
strace -c -p <pid>
perf top
Performance Analysis Methodology
USE Method
| Metric | Description |
|---|---|
| Utilization | Time the resource is busy (percentage) |
| Saturation | Work queued waiting for the resource |
| Errors | Error events for the resource |
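As a concrete example, the U in USE for the CPU can be measured from two /proc/stat samples. This is a minimal sketch; the cpu_util helper name is illustrative, not a standard tool:

```shell
# Sketch: CPU utilization from two aggregate "cpu" lines of /proc/stat
# taken one second apart (fields: user nice system idle ...).
cpu_util() {
    # $1 and $2 are two "cpu ..." lines; awk sees them concatenated
    echo "$1 $2" | awk '{
        busy = ($13 + $14 + $15) - ($2 + $3 + $4)  # user+nice+system delta
        idle = $16 - $5                            # idle delta
        if (busy + idle > 0) printf "%d", 100 * busy / (busy + idle)
        else printf "0"
    }'
}

if [ -r /proc/stat ]; then
    s1=$(head -1 /proc/stat)
    sleep 1
    s2=$(head -1 /proc/stat)
    echo "CPU utilization: $(cpu_util "$s1" "$s2")%"
fi
```

Saturation is covered by the run-queue checks below; error counters vary by platform (dmesg, mcelog).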
Analysis Flow
┌─────────────────────────────────────────────────────────────────┐
│ Performance Issue │
└───────────────────────────┬─────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
   ┌────▼────┐         ┌────▼────┐         ┌────▼────┐
   │   CPU   │         │ Memory  │         │  Disk   │
   │  Issue  │         │  Issue  │         │I/O Issue│
   └────┬────┘         └────┬────┘         └────┬────┘
        │                   │                   │
   ┌────▼────┐         ┌────▼────┐         ┌────▼────┐
   │ mpstat  │         │  free   │         │ iostat  │
   │ pidstat │         │ vmstat  │         │ iotop   │
   │  perf   │         │ slabtop │         │blktrace │
   └─────────┘         └─────────┘         └─────────┘
System Overview Tools
top/htop
# Basic top
top
# Top hotkeys:
# 1 - Show individual CPUs
# M - Sort by memory
# P - Sort by CPU
# c - Show full command
# k - Kill process
# r - Renice process
# H - Show threads
# f - Configure fields
# Batch mode (for scripting)
top -b -n 1 > top_output.txt
# Monitor specific process
top -p 1234
# htop (interactive, better UI)
htop
# htop features:
# F6 - Sort by column
# F9 - Kill process
# F5 - Tree view
# \ - Filter
# / - Search
vmstat
# Basic vmstat (1 second interval, 5 samples)
vmstat 1 5
# Output columns:
# procs:
# r = Running/waiting processes
# b = Blocked processes (I/O)
# memory:
# swpd = Virtual memory used
# free = Idle memory
# buff = Buffer memory
# cache = Cache memory
# swap:
# si = Swapped in from disk
# so = Swapped out to disk
# io:
# bi = Blocks received from device
# bo = Blocks sent to device
# system:
# in = Interrupts per second
# cs = Context switches per second
# cpu:
# us = User time
# sy = System time
# id = Idle time
# wa = I/O wait time
# st = Stolen time (VM)
# With timestamps
vmstat -t 1 5
# Memory statistics
vmstat -s
# Disk statistics
vmstat -d
# Watch for high values:
# - r > CPU count = CPU saturation
# - b > 0 = I/O blocking
# - si/so > 0 = Active swapping
# - wa > 20% = I/O wait issues
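These rules of thumb can be scripted. A hedged sketch, assuming the standard 17-column vmstat layout documented above (check_vmstat is an illustrative name):

```shell
# Sketch: scan a vmstat report for the warning signs listed above.
# Column positions assume the standard layout: r=1, b=2, si=7, so=8, wa=16.
check_vmstat() {
    cpus=${1:-$(nproc)}
    awk -v cpus="$cpus" 'NR > 2 {               # skip the two header lines
        if ($1 > cpus)        print "CPU saturated: r=" $1 " > " cpus " CPUs"
        if ($2 > 0)           print "I/O blocking: b=" $2
        if ($7 > 0 || $8 > 0) print "Swapping: si=" $7 " so=" $8
        if ($16 > 20)         print "High I/O wait: wa=" $16 "%"
    }'
}

# Usage: vmstat 1 5 | check_vmstat
```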
sar (System Activity Reporter)
# Install sysstat package first
# CPU usage
sar -u 1 5
# Memory usage
sar -r 1 5
# Disk I/O
sar -d 1 5
# Network
sar -n DEV 1 5
# All statistics
sar -A 1 5
# Historical data (if collected)
sar -u -f /var/log/sa/sa01
# Enable data collection
systemctl enable --now sysstat
CPU Performance
Diagnosing CPU Issues
# CPU utilization per core
mpstat -P ALL 1
# Per-process CPU usage
pidstat -u 1
# Top CPU-consuming processes
ps aux --sort=-%cpu | head -10
# Load average
uptime
cat /proc/loadavg
# Load average interpretation:
# 1.00 = 100% on single CPU
# Compare to: nproc (CPU count)
# < CPU count = OK
# > CPU count = Overloaded
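The interpretation above can be automated; a small sketch (load_check is an illustrative helper, and awk handles the floating-point comparison):

```shell
# Sketch: compare the 1-minute load average to the CPU count.
load_check() {
    load=$1; cpus=$2
    awk -v l="$load" -v c="$cpus" 'BEGIN {
        if (l > c) print "OVERLOADED: load " l " on " c " CPUs"
        else       print "OK: load " l " on " c " CPUs"
    }'
}

if [ -r /proc/loadavg ]; then
    load_check "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
fi
```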
CPU Saturation
# Check run queue (vmstat r column)
vmstat 1
# If r > CPU count, system is saturated
# Check context switches
vmstat 1 | awk '{print $11, $12}'   # in (interrupts), cs (context switches)
# High context switches may indicate:
# - Too many processes
# - Lock contention
# - Interrupt issues
# Check scheduler latency
cat /proc/schedstat
Process CPU Analysis
# strace - system call tracing
strace -c -p <pid> # Summary
strace -tt -p <pid> # Detailed with timestamps
strace -e trace=open,read,write -p <pid>
# perf - performance profiling
perf top # Real-time profile
perf record -p <pid> sleep 30 # Record profile
perf report # Analyze recording
# Flame graphs (with perf)
perf record -F 99 -a -g sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg
# Check process state
cat /proc/<pid>/status | grep State
# R = Running, S = Sleeping, D = Disk sleep (uninterruptible)
Runaway Process Triage
When your system fans spin up unexpectedly, use this workflow:
# Step 1: Find top CPU consumers
ps aux --sort=-%cpu | head -15
# Step 2: Check system load
cat /proc/loadavg
# Output: 3.73 3.84 3.69 6/1780 2244908
# ^1m ^5m ^15m ^running/total ^last_pid
Identifying Orphaned Processes
Orphaned processes (PPID=1) were adopted by init after their parent died. They often indicate a crashed session or stuck background job.
# Find orphaned processes consuming CPU
ps -eo pid,ppid,stat,%cpu,etime,cmd --sort=-%cpu | awk '$2==1 && $4>5'
# Detailed check on suspect process
ps -p <PID> -o pid,ppid,stat,etime,%cpu,%mem,cmd
Real Example: Stuck hyprlock
$ ps aux --sort=-%cpu | head -5
USER PID %CPU %MEM TIME COMMAND
user 3603830 103 0.3 2369:32 hyprlock # <-- 39+ hours CPU time!
$ ps -p 3603830 -o pid,ppid,stat,etime,cmd
PID PPID STAT ELAPSED CMD
3603830 1 Rl 1-14:19:52 hyprlock # <-- PPID=1 (orphaned), Rl (running)
Process State Indicators
| State | Meaning | Action |
|---|---|---|
| R | Running (trailing l = multi-threaded) | Check if legitimate workload |
| S | Sleeping (interruptible) | Normal for most processes |
| D | Uninterruptible sleep (I/O) | Check disk/NFS issues |
| Z | Zombie (defunct) | Parent needs to reap; kill the parent |
| T | Stopped (signal/debugger) | Resume with kill -CONT <pid> or fg |
Safe Kill Workflow
# Graceful termination first
kill <PID>
# If still running after 5 seconds
kill -9 <PID>
# For zombie processes, kill the parent
ps -p <ZOMBIE_PID> -o ppid= # Get parent PID
kill <PARENT_PID>
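The graceful-then-force workflow can be wrapped in a function. A sketch (safe_kill is an illustrative name; note that kill -0 also "sees" zombies until the parent reaps them):

```shell
# Sketch: SIGTERM first, escalate to SIGKILL only if the process is
# still alive after a grace period.
safe_kill() {
    pid=$1
    grace=${2:-5}
    kill "$pid" 2>/dev/null || return 0          # already gone
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited gracefully
        sleep 1
        i=$((i + 1))
    done
    echo "PID $pid survived SIGTERM; sending SIGKILL" >&2
    kill -9 "$pid" 2>/dev/null || true
}

# Usage: safe_kill 1234      # default 5-second grace period
#        safe_kill 1234 10
```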
Bulk Cleanup
# Kill all processes by name
pkill hyprlock
# Kill all orphaned processes by a specific user (CAREFUL!)
ps -eo pid,ppid,user,cmd | awk '$2==1 && $3=="username" {print $1}' | xargs kill
# Kill processes consuming >90% CPU for more than a day
ps -eo pid,%cpu,etime,cmd --sort=-%cpu | awk '$2>90 && $3~/[0-9]+-/ {print $1}' | head -5
# ^-- etime format: days-HH:MM:SS, so [0-9]+- matches >1 day
CPU Tuning
# Check CPU frequency
cat /proc/cpuinfo | grep MHz
cpupower frequency-info
# Set CPU governor
cpupower frequency-set -g performance
# Governors: performance, powersave, ondemand, conservative
# Process priority (nice)
nice -n 10 command # Lower priority
renice -n -5 -p <pid> # Higher priority (root)
# CPU affinity
taskset -c 0,1 command # Run on CPUs 0,1
taskset -p -c 0 <pid> # Set affinity for running process
# Disable CPU cores (for testing)
echo 0 > /sys/devices/system/cpu/cpu3/online
Memory Performance
Diagnosing Memory Issues
# Memory overview
free -h
# total = Total physical RAM
# used = Used memory
# free = Completely unused
# buff/cache = Kernel buffers and page cache
# available = Estimated available for apps
# Detailed memory stats
cat /proc/meminfo
vmstat -s
# Per-process memory
ps aux --sort=-%mem | head -10
pidstat -r 1
# Memory by process
smem -tk
pmap -x <pid>
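The "available" figure above is worth checking programmatically. A sketch (mem_available_pct is an illustrative name; MemAvailable in /proc/meminfo is the kernel estimate that free -h reports):

```shell
# Sketch: percentage of memory available to applications.
mem_available_pct() {
    awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
         END { if (t > 0) printf "%d\n", 100 * a / t }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    pct=$(mem_available_pct)
    echo "Memory available: ${pct}%"
    if [ "$pct" -lt 10 ]; then
        echo "WARNING: less than 10% of memory available" >&2
    fi
fi
```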
Memory Saturation
# Check for swapping
vmstat 1 | awk '{print $7, $8}' # si, so columns
# si/so > 0 = Active swapping (performance impact)
# Check swap usage
free -h
swapon --show
# OOM killer activity
dmesg | grep -i "out of memory"
journalctl -k | grep -i oom
# Memory pressure
cat /proc/pressure/memory
# Check NUMA statistics
numastat
numactl --hardware
Memory Analysis
# Slab memory (kernel objects)
slabtop
cat /proc/slabinfo
# Page cache
cat /proc/meminfo | grep -E "Cached|Buffers|Active|Inactive"
# Per-process detailed
cat /proc/<pid>/status | grep -E "VmSize|VmRSS|VmSwap"
# VmSize = Virtual memory size
# VmRSS = Resident Set Size (physical memory)
# VmSwap = Swapped memory
# Memory maps
pmap -x <pid>
cat /proc/<pid>/smaps
# Find memory leaks
valgrind --leak-check=full ./program
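The VmSwap field above can be used to rank swap consumers system-wide. A sketch (swap_of and top_swap_users are illustrative names; kernel threads have no VmSwap line and are skipped):

```shell
# Sketch: rank processes by swapped-out memory from /proc/<pid>/status.
swap_of() {   # print "<swap kB> <name>" for one status file
    awk '/^Name:/ {name=$2} /^VmSwap:/ {print $2, name}' "$1" 2>/dev/null
}

top_swap_users() {
    for f in /proc/[0-9]*/status; do
        swap_of "$f"
    done | sort -rn | head -10
}

top_swap_users
```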
Memory Tuning
# Clear page cache (for testing)
sync; echo 3 > /proc/sys/vm/drop_caches
# Swappiness (0-100, lower = prefer RAM)
sysctl vm.swappiness
sysctl -w vm.swappiness=10
# Dirty page settings
sysctl vm.dirty_ratio # % RAM for dirty pages
sysctl vm.dirty_background_ratio # % before background flush
# OOM score adjustment
echo -1000 > /proc/<pid>/oom_score_adj # Never kill
echo 1000 > /proc/<pid>/oom_score_adj # Kill first
# Huge pages
cat /proc/meminfo | grep Huge
sysctl vm.nr_hugepages=128
# NUMA tuning
numactl --membind=0 --cpunodebind=0 ./program
Disk I/O Performance
Diagnosing I/O Issues
# I/O statistics
iostat -xz 1
# Key columns:
# r/s, w/s = Reads/writes per second
# rkB/s, wkB/s = KB read/written per second
# await = Average I/O wait time (ms)
# aqu-sz (avgqu-sz in older sysstat) = Average queue length
# %util = Device utilization
# Warning signs:
# - %util consistently > 60% = Device busy (sustained ~100% = saturated)
# - await > 10ms = Slow I/O
# - avgqu-sz > 1 = I/O queuing
# Per-process I/O
iotop
pidstat -d 1
# Block device statistics
cat /proc/diskstats
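The warning thresholds above can be checked automatically. A hedged sketch that keys off %util being the last iostat column, since fixed column numbers vary by sysstat version (flag_saturated is an illustrative name):

```shell
# Sketch: flag devices above a %util threshold in iostat -xz output.
flag_saturated() {
    threshold=${1:-60}
    awk -v th="$threshold" '
        /^Device/ { in_table = 1; next }   # device table starts here
        /^$/      { in_table = 0 }         # blank line ends the table
        in_table && NF > 2 && $NF + 0 > th {
            print $1 " is " $NF "% utilized"
        }'
}

# Usage: iostat -xz 1 3 | flag_saturated 60
```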
I/O Saturation
# Check for blocked processes
vmstat 1 | awk '{print $2}' # 'b' column
# b > 0 = Processes blocked on I/O
# I/O wait
vmstat 1 | awk '{print $16}' # 'wa' column
# wa > 20% = I/O bottleneck
# Queue depth: watch the aqu-sz (older sysstat: avgqu-sz) column;
# its position varies by sysstat version, so don't hardcode it
iostat -xz 1
# Check for I/O pressure
cat /proc/pressure/io
I/O Analysis
# Detailed block I/O tracing
blktrace -d /dev/sda -o trace
blkparse trace.* > trace.txt
# Simpler I/O tracing
sudo bpftrace -e 'tracepoint:block:block_rq_issue { @[args->comm] = count(); }'
# File-level I/O
fatrace # File access tracing
inotifywait -m /path # Monitor file events
# Find I/O-heavy processes
iotop -o # Only show I/O processes
pidstat -d 1 | grep -v "^$"
# Check filesystem
df -h
df -i # Inode usage
I/O Tuning
# I/O scheduler
cat /sys/block/sda/queue/scheduler
# Options: none, mq-deadline, kyber, bfq
# Change scheduler (for NVMe, 'none' is often best)
echo none > /sys/block/nvme0n1/queue/scheduler
# Read-ahead
cat /sys/block/sda/queue/read_ahead_kb
echo 256 > /sys/block/sda/queue/read_ahead_kb
# Queue depth
cat /sys/block/sda/queue/nr_requests
echo 256 > /sys/block/sda/queue/nr_requests
# Dirty page writeback (for write-heavy workloads)
sysctl -w vm.dirty_expire_centisecs=500
sysctl -w vm.dirty_writeback_centisecs=100
# Filesystem mount options
# noatime - Don't update access times
# nodiratime - Don't update directory access times
# barrier=0 - Disable write barriers (risky!)
Network Performance
Diagnosing Network Issues
# Network statistics
sar -n DEV 1 5
# Interface statistics
ip -s link
cat /proc/net/dev
# Socket statistics
ss -s
ss -tunap
# Per-process network
nethogs
iftop
# Packet analysis
tcpdump -i eth0 -c 100
Network Saturation
# Check for dropped packets
ip -s link show eth0 | grep -E "dropped|errors"
netstat -i
cat /proc/net/dev
# Check for buffer overflows
netstat -s | grep -i drop
netstat -s | grep -i overflow
# Socket buffer sizes
sysctl net.core.rmem_max
sysctl net.core.wmem_max
# TCP statistics
ss -s
cat /proc/net/netstat
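The drop and error counters above can also be read directly from sysfs, which is what ip -s link displays. A sketch (net_drop_check is an illustrative name; the root argument exists only to make the helper testable):

```shell
# Sketch: report interfaces with non-zero drop/error counters from
# /sys/class/net/<dev>/statistics/.
net_drop_check() {
    root=${1:-/sys/class/net}
    for dev in "$root"/*; do
        [ -d "$dev" ] || continue
        name=$(basename "$dev")
        for stat in rx_dropped tx_dropped rx_errors tx_errors; do
            f="$dev/statistics/$stat"
            if [ -r "$f" ]; then
                v=$(cat "$f")
                if [ "$v" -gt 0 ]; then
                    echo "$name: $stat=$v"
                fi
            fi
        done
    done
}

net_drop_check
```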
Network Analysis
# Connection states
ss -tan state established | wc -l
ss -tan state time-wait | wc -l
# Port utilization
ss -tulnp
# Bandwidth test
iperf3 -s # Server
iperf3 -c server_ip # Client
# Latency test
ping -c 100 host
mtr host
# DNS performance
dig +stats example.com
Network Tuning
# Increase socket buffer sizes
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
# TCP tuning
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Enable TCP window scaling
sysctl -w net.ipv4.tcp_window_scaling=1
# Increase connection backlog
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
# TIME_WAIT tuning
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=15
# Network queue (for high-speed NICs)
ethtool -g eth0 # Show ring buffer
ethtool -G eth0 rx 4096 # Increase RX buffer
# Interrupt coalescing
ethtool -c eth0
ethtool -C eth0 rx-usecs 50
Process Analysis
Finding Problem Processes
# CPU hogs
ps aux --sort=-%cpu | head -10
top -bn1 | head -20
# Memory hogs
ps aux --sort=-%mem | head -10
smem -tk | head -10
# I/O hogs
iotop -b -n 1 | head -10
pidstat -d 1 1
# Thread count
ps -eLf | awk 'NR>1 {print $2}' | sort | uniq -c | sort -rn | head   # threads per PID
# Open files
lsof -p <pid> | wc -l
ls -la /proc/<pid>/fd | wc -l
# File descriptor limits
cat /proc/<pid>/limits | grep "open files"
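The open-file count and limit checks above combine into one quick check. A sketch (fd_usage is an illustrative name):

```shell
# Sketch: compare a process's open fd count to its soft limit, using
# /proc/<pid>/fd and /proc/<pid>/limits.
fd_usage() {
    pid=$1
    count=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
    echo "PID $pid: $count of ${limit:-?} open file descriptors"
}

fd_usage $$    # check the current shell
```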
Process Profiling
# strace - system calls
strace -c -p <pid> # Summary
strace -T -p <pid> # With timing
strace -e trace=file -p <pid> # File operations
strace -e trace=network -p <pid> # Network operations
# ltrace - library calls
ltrace -c -p <pid>
# perf - CPU profiling
perf record -p <pid> -g sleep 30
perf report
# perf - specific events
perf stat -p <pid> sleep 10
# BPF tools
execsnoop # New processes
opensnoop # File opens
biolatency # Block I/O latency
Process Resource Limits
# View limits
cat /proc/<pid>/limits
ulimit -a
# Modify limits (in shell)
ulimit -n 65535 # Open files
ulimit -u 65535 # Max processes
# Persistent limits in /etc/security/limits.conf
# user soft nofile 65535
# user hard nofile 65535
# systemd service limits
systemctl show <service> | grep Limit
# Add to service file:
# [Service]
# LimitNOFILE=65535
# LimitNPROC=65535
System-Wide Analysis
BPF Tools (bcc/bpftrace)
# Install bcc-tools
dnf install bcc-tools
apt install bpfcc-tools
# Common tools (in /usr/share/bcc/tools/)
execsnoop # New process execution
opensnoop # File opens
biosnoop # Block I/O with latency
tcpconnect # TCP connections
tcpretrans # TCP retransmissions
runqlat # CPU scheduler latency
ext4slower # Slow ext4 operations
# bpftrace examples
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'  # most opens go through openat on modern kernels
bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'
Performance Recording
# Record system state
sar -A 1 60 > sar_output.txt &
vmstat 1 60 > vmstat_output.txt &
iostat -xz 1 60 > iostat_output.txt &
# Continuous monitoring script
while true; do
date >> /var/log/perf_monitor.log
vmstat 1 5 >> /var/log/perf_monitor.log
echo "---" >> /var/log/perf_monitor.log
sleep 60
done
# Performance Co-Pilot (PCP)
systemctl enable --now pmcd
pmstat
pmrep -t 1 kernel.all.load
Baseline Comparison
# Create baseline
sar -A -o /var/log/baseline_$(date +%Y%m%d).sar 1 3600
# Compare to baseline
sar -A -f /var/log/baseline_20240315.sar
# Quick baseline check
cat << 'EOF' > check_baseline.sh
#!/bin/bash
echo "=== Load Average ==="
uptime
echo "=== Memory ==="
free -h
echo "=== Disk ==="
df -h
echo "=== CPU ==="
mpstat 1 5
echo "=== I/O ==="
iostat -xz 1 5
EOF
chmod +x check_baseline.sh
Quick Troubleshooting Checklist
□ Check load average: uptime
□ Check CPU: mpstat -P ALL 1, top
□ Check memory: free -h, vmstat 1
□ Check swapping: vmstat 1 (si/so columns)
□ Check disk I/O: iostat -xz 1, iotop
□ Check disk space: df -h, df -i
□ Check network: sar -n DEV 1, ss -s
□ Find CPU-heavy processes: ps aux --sort=-%cpu
□ Find memory-heavy processes: ps aux --sort=-%mem
□ Find I/O-heavy processes: iotop
□ Check system logs: journalctl -p err -b
□ Check for OOM kills: dmesg | grep -i oom
Quick Command Reference
# === System Overview ===
top / htop # Interactive process viewer
vmstat 1 5 # Virtual memory stats
sar -A 1 5 # All system stats
# === CPU ===
mpstat -P ALL 1 # Per-CPU stats
pidstat -u 1 # Per-process CPU
perf top # CPU profiling
ps aux --sort=-%cpu | head # Top CPU processes
# === Memory ===
free -h # Memory overview
vmstat -s # Memory statistics
pidstat -r 1 # Per-process memory
ps aux --sort=-%mem | head # Top memory processes
# === Disk I/O ===
iostat -xz 1 # I/O statistics
iotop # Per-process I/O
pidstat -d 1 # Per-process disk
df -h # Disk space
# === Network ===
sar -n DEV 1 # Network stats
ss -s # Socket summary
nethogs # Per-process bandwidth
iftop # Interface bandwidth
# === Process ===
strace -c -p <pid> # System call trace
perf record -p <pid> # CPU profiling
lsof -p <pid> # Open files
pmap -x <pid> # Memory maps
# === Tuning ===
sysctl -a # All kernel parameters
ulimit -a # Shell resource limits