Performance Troubleshooting
Quick Reference
# System overview
top
htop
vmstat 1 5
# CPU
mpstat -P ALL 1
pidstat -u 1
# Memory
free -h
vmstat -s
pidstat -r 1
# Disk I/O
iostat -xz 1
iotop
pidstat -d 1
# Network
sar -n DEV 1
ss -s
iftop
# Process analysis
ps aux --sort=-%cpu | head
ps aux --sort=-%mem | head
strace -c -p <pid>
perf top
Performance Analysis Methodology
USE Method
| Metric | Description |
|---|---|
| Utilization | Time the resource is busy (percentage) |
| Saturation | Work queued waiting for the resource |
| Errors | Error events for the resource |
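As a concrete example, the U in USE for the CPU can be measured from two /proc/stat samples. This is a minimal sketch; the cpu_util helper name is illustrative, not a standard tool:

```shell
# Sketch: CPU utilization from two aggregate "cpu" lines of /proc/stat
# taken one second apart (fields: user nice system idle ...).
cpu_util() {
    # $1 and $2 are two "cpu ..." lines; awk sees them concatenated
    echo "$1 $2" | awk '{
        busy = ($13 + $14 + $15) - ($2 + $3 + $4)  # user+nice+system delta
        idle = $16 - $5                            # idle delta
        if (busy + idle > 0) printf "%d", 100 * busy / (busy + idle)
        else printf "0"
    }'
}

if [ -r /proc/stat ]; then
    s1=$(head -1 /proc/stat)
    sleep 1
    s2=$(head -1 /proc/stat)
    echo "CPU utilization: $(cpu_util "$s1" "$s2")%"
fi
```

Saturation is covered by the run-queue checks below; error counters vary by platform (dmesg, mcelog).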
Analysis Flow
┌─────────────────────────────────────────────────────────────────┐
│ Performance Issue │
└───────────────────────────┬─────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
   ┌────▼────┐         ┌────▼────┐         ┌────▼────┐
   │   CPU   │         │ Memory  │         │  Disk   │
   │  Issue  │         │  Issue  │         │I/O Issue│
   └────┬────┘         └────┬────┘         └────┬────┘
        │                   │                   │
   ┌────▼────┐         ┌────▼────┐         ┌────▼────┐
   │ mpstat  │         │  free   │         │ iostat  │
   │ pidstat │         │ vmstat  │         │ iotop   │
   │  perf   │         │ slabtop │         │blktrace │
   └─────────┘         └─────────┘         └─────────┘
System Overview Tools
top/htop
# Basic top
top
# Top hotkeys:
# 1 - Show individual CPUs
# M - Sort by memory
# P - Sort by CPU
# c - Show full command
# k - Kill process
# r - Renice process
# H - Show threads
# f - Configure fields
# Batch mode (for scripting)
top -b -n 1 > top_output.txt
# Monitor specific process
top -p 1234
# htop (interactive, better UI)
htop
# htop features:
# F6 - Sort by column
# F9 - Kill process
# F5 - Tree view
# \ - Filter
# / - Search
vmstat
# Basic vmstat (1 second interval, 5 samples)
vmstat 1 5
# Output columns:
# procs:
# r = Running/waiting processes
# b = Blocked processes (I/O)
# memory:
# swpd = Virtual memory used
# free = Idle memory
# buff = Buffer memory
# cache = Cache memory
# swap:
# si = Swapped in from disk
# so = Swapped out to disk
# io:
# bi = Blocks received from device
# bo = Blocks sent to device
# system:
# in = Interrupts per second
# cs = Context switches per second
# cpu:
# us = User time
# sy = System time
# id = Idle time
# wa = I/O wait time
# st = Stolen time (VM)
# With timestamps
vmstat -t 1 5
# Memory statistics
vmstat -s
# Disk statistics
vmstat -d
# Watch for high values:
# - r > CPU count = CPU saturation
# - b > 0 = I/O blocking
# - si/so > 0 = Active swapping
# - wa > 20% = I/O wait issues
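These rules of thumb can be scripted. A hedged sketch, assuming the standard 17-column vmstat layout documented above (check_vmstat is an illustrative name):

```shell
# Sketch: scan a vmstat report for the warning signs listed above.
# Column positions assume the standard layout: r=1, b=2, si=7, so=8, wa=16.
check_vmstat() {
    cpus=${1:-$(nproc)}
    awk -v cpus="$cpus" 'NR > 2 {               # skip the two header lines
        if ($1 > cpus)        print "CPU saturated: r=" $1 " > " cpus " CPUs"
        if ($2 > 0)           print "I/O blocking: b=" $2
        if ($7 > 0 || $8 > 0) print "Swapping: si=" $7 " so=" $8
        if ($16 > 20)         print "High I/O wait: wa=" $16 "%"
    }'
}

# Usage: vmstat 1 5 | check_vmstat
```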
sar (System Activity Reporter)
# Install sysstat package first
# CPU usage
sar -u 1 5
# Memory usage
sar -r 1 5
# Disk I/O
sar -d 1 5
# Network
sar -n DEV 1 5
# All statistics
sar -A 1 5
# Historical data (if collected)
sar -u -f /var/log/sa/sa01
# Enable data collection
systemctl enable --now sysstat
CPU Performance
Diagnosing CPU Issues
# CPU utilization per core
mpstat -P ALL 1
# Per-process CPU usage
pidstat -u 1
# Top CPU-consuming processes
ps aux --sort=-%cpu | head -10
# Load average
uptime
cat /proc/loadavg
# Load average interpretation:
# 1.00 = 100% on single CPU
# Compare to: nproc (CPU count)
# < CPU count = OK
# > CPU count = Overloaded
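The interpretation above can be automated; a small sketch (load_check is an illustrative helper, and awk handles the floating-point comparison):

```shell
# Sketch: compare the 1-minute load average to the CPU count.
load_check() {
    load=$1; cpus=$2
    awk -v l="$load" -v c="$cpus" 'BEGIN {
        if (l > c) print "OVERLOADED: load " l " on " c " CPUs"
        else       print "OK: load " l " on " c " CPUs"
    }'
}

if [ -r /proc/loadavg ]; then
    load_check "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
fi
```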
CPU Saturation
# Check run queue (vmstat r column)
vmstat 1
# If r > CPU count, system is saturated
# Check context switches
vmstat 1 | awk '{print $11, $12}'   # in (interrupts), cs (context switches)
# High context switches may indicate:
# - Too many processes
# - Lock contention
# - Interrupt issues
# Check scheduler latency
cat /proc/schedstat
Process CPU Analysis
# strace - system call tracing
strace -c -p <pid> # Summary
strace -tt -p <pid> # Detailed with timestamps
strace -e trace=open,read,write -p <pid>
# perf - performance profiling
perf top # Real-time profile
perf record -p <pid> sleep 30 # Record profile
perf report # Analyze recording
# Flame graphs (with perf)
perf record -F 99 -a -g sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg
# Check process state
cat /proc/<pid>/status | grep State
# R = Running, S = Sleeping, D = Disk sleep (uninterruptible)
Runaway Process Triage
When your system fans spin up unexpectedly, use this workflow:
# Step 1: Find top CPU consumers
ps aux --sort=-%cpu | head -15
# Step 2: Check system load
cat /proc/loadavg
# Output: 3.73 3.84 3.69 6/1780 2244908
# ^1m ^5m ^15m ^running/total ^last_pid
Identifying Orphaned Processes
Orphaned processes (PPID=1) were adopted by init after their parent died. They often indicate a crashed session or stuck background job.
# Find orphaned processes consuming CPU
ps -eo pid,ppid,stat,%cpu,etime,cmd --sort=-%cpu | awk '$2==1 && $4>5'
# Detailed check on suspect process
ps -p <PID> -o pid,ppid,stat,etime,%cpu,%mem,cmd
Real Example: Stuck hyprlock
$ ps aux --sort=-%cpu | head -5
USER PID %CPU %MEM TIME COMMAND
user 3603830 103 0.3 2369:32 hyprlock # <-- 39+ hours CPU time!
$ ps -p 3603830 -o pid,ppid,stat,etime,cmd
PID PPID STAT ELAPSED CMD
3603830 1 Rl 1-14:19:52 hyprlock # <-- PPID=1 (orphaned), Rl (running)
Process State Indicators
| State | Meaning | Action |
|---|---|---|
| R | Running (trailing l = multi-threaded) | Check if legitimate workload |
| S | Sleeping (interruptible) | Normal for most processes |
| D | Uninterruptible sleep (I/O) | Check disk/NFS issues |
| Z | Zombie (defunct) | Parent needs to reap; kill the parent |
| T | Stopped (signal/debugger) | Resume with kill -CONT <pid> or fg |
Safe Kill Workflow
# Graceful termination first
kill <PID>
# If still running after 5 seconds
kill -9 <PID>
# For zombie processes, kill the parent
ps -p <ZOMBIE_PID> -o ppid= # Get parent PID
kill <PARENT_PID>
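The graceful-then-force workflow can be wrapped in a function. A sketch (safe_kill is an illustrative name; note that kill -0 also "sees" zombies until the parent reaps them):

```shell
# Sketch: SIGTERM first, escalate to SIGKILL only if the process is
# still alive after a grace period.
safe_kill() {
    pid=$1
    grace=${2:-5}
    kill "$pid" 2>/dev/null || return 0          # already gone
    i=0
    while [ "$i" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited gracefully
        sleep 1
        i=$((i + 1))
    done
    echo "PID $pid survived SIGTERM; sending SIGKILL" >&2
    kill -9 "$pid" 2>/dev/null || true
}

# Usage: safe_kill 1234      # default 5-second grace period
#        safe_kill 1234 10
```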
Bulk Cleanup
# Kill all processes by name
pkill hyprlock
# Kill all orphaned processes by a specific user (CAREFUL!)
ps -eo pid,ppid,user,cmd | awk '$2==1 && $3=="username" {print $1}' | xargs kill
# Kill processes consuming >90% CPU for more than a day
ps -eo pid,%cpu,etime,cmd --sort=-%cpu | awk '$2>90 && $3~/[0-9]+-/ {print $1}' | head -5
# ^-- etime format: days-HH:MM:SS, so [0-9]+- matches >1 day
CPU Tuning
# Check CPU frequency
cat /proc/cpuinfo | grep MHz
cpupower frequency-info
# Set CPU governor
cpupower frequency-set -g performance
# Governors: performance, powersave, ondemand, conservative
# Process priority (nice)
nice -n 10 command # Lower priority
renice -n -5 -p <pid> # Higher priority (root)
# CPU affinity
taskset -c 0,1 command # Run on CPUs 0,1
taskset -p -c 0 <pid> # Set affinity for running process
# Disable CPU cores (for testing)
echo 0 > /sys/devices/system/cpu/cpu3/online
Memory Performance
Diagnosing Memory Issues
# Memory overview
free -h
# total = Total physical RAM
# used = Used memory
# free = Completely unused
# buff/cache = Kernel buffers and page cache
# available = Estimated available for apps
# Detailed memory stats
cat /proc/meminfo
vmstat -s
# Per-process memory
ps aux --sort=-%mem | head -10
pidstat -r 1
# Memory by process
smem -tk
pmap -x <pid>
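The "available" figure above is worth checking programmatically. A sketch (mem_available_pct is an illustrative name; MemAvailable in /proc/meminfo is the kernel estimate that free -h reports):

```shell
# Sketch: percentage of memory available to applications.
mem_available_pct() {
    awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
         END { if (t > 0) printf "%d\n", 100 * a / t }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    pct=$(mem_available_pct)
    echo "Memory available: ${pct}%"
    if [ "$pct" -lt 10 ]; then
        echo "WARNING: less than 10% of memory available" >&2
    fi
fi
```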
Memory Saturation
# Check for swapping
vmstat 1 | awk '{print $7, $8}' # si, so columns
# si/so > 0 = Active swapping (performance impact)
# Check swap usage
free -h
swapon --show
# OOM killer activity
dmesg | grep -i "out of memory"
journalctl -k | grep -i oom
# Memory pressure
cat /proc/pressure/memory
# Check NUMA statistics
numastat
numactl --hardware
Memory Analysis
# Slab memory (kernel objects)
slabtop
cat /proc/slabinfo
# Page cache
cat /proc/meminfo | grep -E "Cached|Buffers|Active|Inactive"
# Per-process detailed
cat /proc/<pid>/status | grep -E "VmSize|VmRSS|VmSwap"
# VmSize = Virtual memory size
# VmRSS = Resident Set Size (physical memory)
# VmSwap = Swapped memory
# Memory maps
pmap -x <pid>
cat /proc/<pid>/smaps
# Find memory leaks
valgrind --leak-check=full ./program
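The VmSwap field above can be used to rank swap consumers system-wide. A sketch (swap_of and top_swap_users are illustrative names; kernel threads have no VmSwap line and are skipped):

```shell
# Sketch: rank processes by swapped-out memory from /proc/<pid>/status.
swap_of() {   # print "<swap kB> <name>" for one status file
    awk '/^Name:/ {name=$2} /^VmSwap:/ {print $2, name}' "$1" 2>/dev/null
}

top_swap_users() {
    for f in /proc/[0-9]*/status; do
        swap_of "$f"
    done | sort -rn | head -10
}

top_swap_users
```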
Memory Tuning
# Clear page cache (for testing)
sync; echo 3 > /proc/sys/vm/drop_caches
# Swappiness (0-100, lower = prefer RAM)
sysctl vm.swappiness
sysctl -w vm.swappiness=10
# Dirty page settings
sysctl vm.dirty_ratio # % RAM for dirty pages
sysctl vm.dirty_background_ratio # % before background flush
# OOM score adjustment
echo -1000 > /proc/<pid>/oom_score_adj # Never kill
echo 1000 > /proc/<pid>/oom_score_adj # Kill first
# Huge pages
cat /proc/meminfo | grep Huge
sysctl vm.nr_hugepages=128
# NUMA tuning
numactl --membind=0 --cpunodebind=0 ./program
Disk I/O Performance
Diagnosing I/O Issues
# I/O statistics
iostat -xz 1
# Key columns:
# r/s, w/s = Reads/writes per second
# rkB/s, wkB/s = KB read/written per second
# await = Average I/O wait time (ms)
# aqu-sz (avgqu-sz in older sysstat) = Average queue length
# %util = Device utilization
# Warning signs:
# - %util consistently > 60% = Device busy (sustained ~100% = saturated)
# - await > 10ms = Slow I/O
# - avgqu-sz > 1 = I/O queuing
# Per-process I/O
iotop
pidstat -d 1
# Block device statistics
cat /proc/diskstats
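The warning thresholds above can be checked automatically. A hedged sketch that keys off %util being the last iostat column, since fixed column numbers vary by sysstat version (flag_saturated is an illustrative name):

```shell
# Sketch: flag devices above a %util threshold in iostat -xz output.
flag_saturated() {
    threshold=${1:-60}
    awk -v th="$threshold" '
        /^Device/ { in_table = 1; next }   # device table starts here
        /^$/      { in_table = 0 }         # blank line ends the table
        in_table && NF > 2 && $NF + 0 > th {
            print $1 " is " $NF "% utilized"
        }'
}

# Usage: iostat -xz 1 3 | flag_saturated 60
```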
I/O Saturation
# Check for blocked processes
vmstat 1 | awk '{print $2}' # 'b' column
# b > 0 = Processes blocked on I/O
# I/O wait
vmstat 1 | awk '{print $16}' # 'wa' column
# wa > 20% = I/O bottleneck
# Queue depth: watch the aqu-sz (older sysstat: avgqu-sz) column;
# its position varies by sysstat version, so don't hardcode it
iostat -xz 1
# Check for I/O pressure
cat /proc/pressure/io
I/O Analysis
# Detailed block I/O tracing
blktrace -d /dev/sda -o trace
blkparse trace.* > trace.txt
# Simpler I/O tracing
sudo bpftrace -e 'tracepoint:block:block_rq_issue { @[args->comm] = count(); }'
# File-level I/O
fatrace # File access tracing
inotifywait -m /path # Monitor file events
# Find I/O-heavy processes
iotop -o # Only show I/O processes
pidstat -d 1 | grep -v "^$"
# Check filesystem
df -h
df -i # Inode usage
I/O Tuning
# I/O scheduler
cat /sys/block/sda/queue/scheduler
# Options: none, mq-deadline, kyber, bfq
# Change scheduler (for NVMe, 'none' is often best)
echo none > /sys/block/nvme0n1/queue/scheduler
# Read-ahead
cat /sys/block/sda/queue/read_ahead_kb
echo 256 > /sys/block/sda/queue/read_ahead_kb
# Queue depth
cat /sys/block/sda/queue/nr_requests
echo 256 > /sys/block/sda/queue/nr_requests
# Dirty page writeback (for write-heavy workloads)
sysctl -w vm.dirty_expire_centisecs=500
sysctl -w vm.dirty_writeback_centisecs=100
# Filesystem mount options
# noatime - Don't update access times
# nodiratime - Don't update directory access times
# barrier=0 - Disable write barriers (risky!)
Network Performance
Diagnosing Network Issues
# Network statistics
sar -n DEV 1 5
# Interface statistics
ip -s link
cat /proc/net/dev
# Socket statistics
ss -s
ss -tunap
# Per-process network
nethogs
iftop
# Packet analysis
tcpdump -i eth0 -c 100
Network Saturation
# Check for dropped packets
ip -s link show eth0 | grep -E "dropped|errors"
netstat -i
cat /proc/net/dev
# Check for buffer overflows
netstat -s | grep -i drop
netstat -s | grep -i overflow
# Socket buffer sizes
sysctl net.core.rmem_max
sysctl net.core.wmem_max
# TCP statistics
ss -s
cat /proc/net/netstat
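The drop and error counters above can also be read directly from sysfs, which is what ip -s link displays. A sketch (net_drop_check is an illustrative name; the root argument exists only to make the helper testable):

```shell
# Sketch: report interfaces with non-zero drop/error counters from
# /sys/class/net/<dev>/statistics/.
net_drop_check() {
    root=${1:-/sys/class/net}
    for dev in "$root"/*; do
        [ -d "$dev" ] || continue
        name=$(basename "$dev")
        for stat in rx_dropped tx_dropped rx_errors tx_errors; do
            f="$dev/statistics/$stat"
            if [ -r "$f" ]; then
                v=$(cat "$f")
                if [ "$v" -gt 0 ]; then
                    echo "$name: $stat=$v"
                fi
            fi
        done
    done
}

net_drop_check
```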
Network Analysis
# Connection states
ss -tan state established | wc -l
ss -tan state time-wait | wc -l
# Port utilization
ss -tulnp
# Bandwidth test
iperf3 -s # Server
iperf3 -c server_ip # Client
# Latency test
ping -c 100 host
mtr host
# DNS performance
dig +stats example.com
Network Tuning
# Increase socket buffer sizes
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
# TCP tuning
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Enable TCP window scaling
sysctl -w net.ipv4.tcp_window_scaling=1
# Increase connection backlog
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
# TIME_WAIT tuning
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=15
# Network queue (for high-speed NICs)
ethtool -g eth0 # Show ring buffer
ethtool -G eth0 rx 4096 # Increase RX buffer
# Interrupt coalescing
ethtool -c eth0
ethtool -C eth0 rx-usecs 50
Process Analysis
Finding Problem Processes
# CPU hogs
ps aux --sort=-%cpu | head -10
top -bn1 | head -20
# Memory hogs
ps aux --sort=-%mem | head -10
smem -tk | head -10
# I/O hogs
iotop -b -n 1 | head -10
pidstat -d 1 1
# Thread count
ps -eLf | awk 'NR>1 {print $2}' | sort | uniq -c | sort -rn | head   # threads per PID
# Open files
lsof -p <pid> | wc -l
ls -la /proc/<pid>/fd | wc -l
# File descriptor limits
cat /proc/<pid>/limits | grep "open files"
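The open-file count and limit checks above combine into one quick check. A sketch (fd_usage is an illustrative name):

```shell
# Sketch: compare a process's open fd count to its soft limit, using
# /proc/<pid>/fd and /proc/<pid>/limits.
fd_usage() {
    pid=$1
    count=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits" 2>/dev/null)
    echo "PID $pid: $count of ${limit:-?} open file descriptors"
}

fd_usage $$    # check the current shell
```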
Process Profiling
# strace - system calls
strace -c -p <pid> # Summary
strace -T -p <pid> # With timing
strace -e trace=file -p <pid> # File operations
strace -e trace=network -p <pid> # Network operations
# ltrace - library calls
ltrace -c -p <pid>
# perf - CPU profiling
perf record -p <pid> -g sleep 30
perf report
# perf - specific events
perf stat -p <pid> sleep 10
# BPF tools
execsnoop # New processes
opensnoop # File opens
biolatency # Block I/O latency
Process Resource Limits
# View limits
cat /proc/<pid>/limits
ulimit -a
# Modify limits (in shell)
ulimit -n 65535 # Open files
ulimit -u 65535 # Max processes
# Persistent limits in /etc/security/limits.conf
# user soft nofile 65535
# user hard nofile 65535
# systemd service limits
systemctl show <service> | grep Limit
# Add to service file:
# [Service]
# LimitNOFILE=65535
# LimitNPROC=65535
System-Wide Analysis
BPF Tools (bcc/bpftrace)
# Install bcc-tools
dnf install bcc-tools
apt install bpfcc-tools
# Common tools (in /usr/share/bcc/tools/)
execsnoop # New process execution
opensnoop # File opens
biosnoop # Block I/O with latency
tcpconnect # TCP connections
tcpretrans # TCP retransmissions
runqlat # CPU scheduler latency
ext4slower # Slow ext4 operations
# bpftrace examples
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'  # most opens go through openat on modern kernels
bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'
Performance Recording
# Record system state
sar -A 1 60 > sar_output.txt &
vmstat 1 60 > vmstat_output.txt &
iostat -xz 1 60 > iostat_output.txt &
# Continuous monitoring script
while true; do
date >> /var/log/perf_monitor.log
vmstat 1 5 >> /var/log/perf_monitor.log
echo "---" >> /var/log/perf_monitor.log
sleep 60
done
# Performance Co-Pilot (PCP)
systemctl enable --now pmcd
pmstat
pmrep -t 1 kernel.all.load
Baseline Comparison
# Create baseline
sar -A -o /var/log/baseline_$(date +%Y%m%d).sar 1 3600
# Compare to baseline
sar -A -f /var/log/baseline_20240315.sar
# Quick baseline check
cat << 'EOF' > check_baseline.sh
#!/bin/bash
echo "=== Load Average ==="
uptime
echo "=== Memory ==="
free -h
echo "=== Disk ==="
df -h
echo "=== CPU ==="
mpstat 1 5
echo "=== I/O ==="
iostat -xz 1 5
EOF
chmod +x check_baseline.sh
Quick Troubleshooting Checklist
□ Check load average: uptime
□ Check CPU: mpstat -P ALL 1, top
□ Check memory: free -h, vmstat 1
□ Check swapping: vmstat 1 (si/so columns)
□ Check disk I/O: iostat -xz 1, iotop
□ Check disk space: df -h, df -i
□ Check network: sar -n DEV 1, ss -s
□ Find CPU-heavy processes: ps aux --sort=-%cpu
□ Find memory-heavy processes: ps aux --sort=-%mem
□ Find I/O-heavy processes: iotop
□ Check system logs: journalctl -p err -b
□ Check for OOM kills: dmesg | grep -i oom
Quick Command Reference
# === System Overview ===
top / htop # Interactive process viewer
vmstat 1 5 # Virtual memory stats
sar -A 1 5 # All system stats
# === CPU ===
mpstat -P ALL 1 # Per-CPU stats
pidstat -u 1 # Per-process CPU
perf top # CPU profiling
ps aux --sort=-%cpu | head # Top CPU processes
# === Memory ===
free -h # Memory overview
vmstat -s # Memory statistics
pidstat -r 1 # Per-process memory
ps aux --sort=-%mem | head # Top memory processes
# === Disk I/O ===
iostat -xz 1 # I/O statistics
iotop # Per-process I/O
pidstat -d 1 # Per-process disk
df -h # Disk space
# === Network ===
sar -n DEV 1 # Network stats
ss -s # Socket summary
nethogs # Per-process bandwidth
iftop # Interface bandwidth
# === Process ===
strace -c -p <pid> # System call trace
perf record -p <pid> # CPU profiling
lsof -p <pid> # Open files
pmap -x <pid> # Memory maps
# === Tuning ===
sysctl -a # All kernel parameters
ulimit -a # Shell resource limits