Performance Troubleshooting

Quick Reference

# System overview
top
htop
vmstat 1 5

# CPU
mpstat -P ALL 1
pidstat -u 1

# Memory
free -h
vmstat -s
pidstat -r 1

# Disk I/O
iostat -xz 1
iotop
pidstat -d 1

# Network
sar -n DEV 1
ss -s
iftop

# Process analysis
ps aux --sort=-%cpu | head
ps aux --sort=-%mem | head
strace -c -p <pid>
perf top

Performance Analysis Methodology

USE Method

Metric        Description

Utilization   Time the resource is busy (percentage)
Saturation    Work queued waiting for the resource
Errors        Count of error events for the resource
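A USE spot check for the CPU resource needs nothing beyond /proc; a minimal sketch (the interpretation comments are illustrative, not fixed rules):

```shell
#!/bin/sh
# USE-method spot check for the CPU, using /proc only.

# Utilization: non-idle share of CPU time since boot (/proc/stat, first line).
set -- $(head -1 /proc/stat)   # $1=cpu $2=user $3=nice $4=system $5=idle $6=iowait $7=irq $8=softirq
total=$(( $2 + $3 + $4 + $5 + $6 + $7 + $8 ))
idle=$(( $5 + $6 ))            # count iowait as idle for this estimate
util=$(( 100 * (total - idle) / total ))
echo "Utilization: ${util}% busy since boot"

# Saturation: runnable tasks vs CPU count (/proc/loadavg field 4 is running/total).
running=$(awk '{split($4, a, "/"); print a[1]}' /proc/loadavg)
cpus=$(nproc)
echo "Saturation:  ${running} runnable task(s) on ${cpus} CPU(s)"

# Errors: for CPUs, check the kernel log (e.g. dmesg | grep -i mce).
```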

Analysis Flow

┌─────────────────────────────────────────────────────────────────┐
│                    Performance Issue                             │
└───────────────────────────┬─────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        │                   │                   │
   ┌────▼────┐        ┌─────▼────┐        ┌────▼────┐
   │   CPU   │        │  Memory  │        │  Disk   │
   │  Issue  │        │  Issue   │        │I/O Issue│
   └────┬────┘        └────┬─────┘        └────┬────┘
        │                  │                   │
   ┌────▼────┐        ┌────▼─────┐        ┌────▼────┐
   │ mpstat  │        │  free    │        │ iostat  │
   │ pidstat │        │  vmstat  │        │ iotop   │
   │ perf    │        │  slabtop │        │ blktrace│
   └─────────┘        └──────────┘        └─────────┘

System Overview Tools

top/htop

# Basic top
top

# Top hotkeys:
#   1     - Show individual CPUs
#   M     - Sort by memory
#   P     - Sort by CPU
#   c     - Show full command
#   k     - Kill process
#   r     - Renice process
#   H     - Show threads
#   f     - Configure fields

# Batch mode (for scripting)
top -b -n 1 > top_output.txt

# Monitor specific process
top -p 1234

# htop (interactive, better UI)
htop

# htop features:
#   F6    - Sort by column
#   F9    - Kill process
#   F5    - Tree view
#   \     - Filter
#   /     - Search

vmstat

# Basic vmstat (1 second interval, 5 samples)
vmstat 1 5

# Output columns:
# procs:
#   r  = Running/waiting processes
#   b  = Blocked processes (I/O)
# memory:
#   swpd  = Virtual memory used
#   free  = Idle memory
#   buff  = Buffer memory
#   cache = Cache memory
# swap:
#   si = Swapped in from disk
#   so = Swapped out to disk
# io:
#   bi = Blocks received from device
#   bo = Blocks sent to device
# system:
#   in = Interrupts per second
#   cs = Context switches per second
# cpu:
#   us = User time
#   sy = System time
#   id = Idle time
#   wa = I/O wait time
#   st = Stolen time (VM)

# With timestamps
vmstat -t 1 5

# Memory statistics
vmstat -s

# Disk statistics
vmstat -d

# Watch for high values:
# - r > CPU count = CPU saturation
# - b > 0 = I/O blocking
# - si/so > 0 = Active swapping
# - wa > 20% = I/O wait issues
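The si/so values vmstat prints are rates derived from the kernel's cumulative counters in /proc/vmstat; when sysstat tools are unavailable, those counters can be read directly (a sketch; sample twice and diff to get a rate):

```shell
#!/bin/sh
# Cumulative swap activity since boot, straight from /proc/vmstat.
# pswpin/pswpout are lifetime page counts, not rates.
si=$(awk '$1=="pswpin"  {v=$2} END {print v+0}' /proc/vmstat)
so=$(awk '$1=="pswpout" {v=$2} END {print v+0}' /proc/vmstat)
echo "pages swapped in: $si, out: $so"
if [ "$so" -gt 0 ]; then
    echo "NOTE: system has swapped out at least once since boot"
fi
```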

sar (System Activity Reporter)

# Install sysstat package first

# CPU usage
sar -u 1 5

# Memory usage
sar -r 1 5

# Disk I/O
sar -d 1 5

# Network
sar -n DEV 1 5

# All statistics
sar -A 1 5

# Historical data (if collected)
sar -u -f /var/log/sa/sa01

# Enable data collection
systemctl enable --now sysstat

CPU Performance

Diagnosing CPU Issues

# CPU utilization per core
mpstat -P ALL 1

# Per-process CPU usage
pidstat -u 1

# Top CPU-consuming processes
ps aux --sort=-%cpu | head -10

# Load average
uptime
cat /proc/loadavg

# Load average interpretation:
# 1.00 = 100% on single CPU
# Compare to: nproc (CPU count)
# < CPU count = OK
# > CPU count = Overloaded
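That comparison is easy to script; a minimal sketch:

```shell
#!/bin/sh
# Compare the 1-minute load average to the CPU count.
load1=$(awk '{print $1}' /proc/loadavg)
cpus=$(nproc)
awk -v l="$load1" -v c="$cpus" 'BEGIN {
    printf "load %.2f on %d CPU(s): %s\n", l, c, (l > c) ? "overloaded" : "OK"
}'
```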

CPU Saturation

# Check run queue (vmstat r column)
vmstat 1

# If r > CPU count, system is saturated

# Check context switches (in = column 11, cs = column 12)
vmstat 1 | awk '{print $11, $12}'

# High context switches may indicate:
# - Too many processes
# - Lock contention
# - Interrupt issues

# Check scheduler latency
cat /proc/schedstat

Process CPU Analysis

# strace - system call tracing
strace -c -p <pid>             # Summary
strace -tt -p <pid>            # Detailed with timestamps
strace -e trace=open,read,write -p <pid>

# perf - performance profiling
perf top                       # Real-time profile
perf record -p <pid> sleep 30  # Record profile
perf report                    # Analyze recording

# Flame graphs (with perf)
perf record -F 99 -a -g sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > cpu.svg

# Check process state
cat /proc/<pid>/status | grep State
# R = Running, S = Sleeping, D = Disk sleep (uninterruptible)

Runaway Process Triage

When your system fans spin up unexpectedly, use this workflow:

# Step 1: Find top CPU consumers
ps aux --sort=-%cpu | head -15

# Step 2: Check system load
cat /proc/loadavg
# Output: 3.73 3.84 3.69 6/1780 2244908
#         ^1m  ^5m  ^15m ^running/total ^last_pid

Identifying Orphaned Processes

Orphaned processes (PPID=1) were adopted by init after their parent died. They often indicate a crashed session or stuck background job.

# Find orphaned processes consuming CPU
ps -eo pid,ppid,stat,%cpu,etime,cmd --sort=-%cpu | awk '$2==1 && $4>5'

# Detailed check on suspect process
ps -p <PID> -o pid,ppid,stat,etime,%cpu,%mem,cmd

Real Example: Stuck hyprlock

$ ps aux --sort=-%cpu | head -5
USER       PID %CPU %MEM  TIME COMMAND
user   3603830  103  0.3 2369:32 hyprlock    # <-- 39+ hours CPU time!

$ ps -p 3603830 -o pid,ppid,stat,etime,cmd
    PID    PPID STAT     ELAPSED CMD
3603830       1 Rl    1-14:19:52 hyprlock    # <-- PPID=1 (orphaned), Rl (running)

Process State Indicators

State    Meaning                        Action

R / Rl   Running (l = multi-threaded)   Check if legitimate workload
S / Sl   Sleeping (interruptible)       Normal for most processes
D        Uninterruptible sleep (I/O)    Check disk/NFS issues
Z        Zombie (defunct)               Parent needs to reap; kill parent
T        Stopped (signal/debugger)      Resume with kill -CONT or kill
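A quick census of states across all processes shows where to look first (the leading STAT character is the state):

```shell
#!/bin/sh
# Count processes by primary state character; a pileup of D suggests I/O trouble.
ps -eo stat= | cut -c1 | sort | uniq -c | sort -rn

# List uninterruptible (D) processes explicitly, if any.
dstate=$(ps -eo pid=,stat=,comm= | awk '$2 ~ /^D/')
if [ -n "$dstate" ]; then
    echo "D-state processes:"
    echo "$dstate"
fi
```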

Safe Kill Workflow

# Graceful termination first (SIGTERM lets the process clean up)
kill <PID>

# Escalate only if it is still running after ~5 seconds (SIGKILL cannot be caught)
kill -9 <PID>

# For zombie processes, kill the parent
ps -p <ZOMBIE_PID> -o ppid=    # Get parent PID
kill <PARENT_PID>
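The escalation above can be wrapped in a small helper; a sketch, with `alive` and `term_then_kill` as illustrative names:

```shell
#!/bin/sh
# Graceful-then-forceful kill: SIGTERM, wait up to ~5s, then SIGKILL.

alive() {  # true while the process exists and is not a zombie
    s=$(ps -o stat= -p "$1" 2>/dev/null | cut -c1)
    [ -n "$s" ] && [ "$s" != "Z" ]
}

term_then_kill() {
    pid=$1
    alive "$pid" || return 0
    kill "$pid" 2>/dev/null
    for i in 1 2 3 4 5; do
        alive "$pid" || return 0
        sleep 1
    done
    kill -9 "$pid" 2>/dev/null   # last resort
}

# Demo on a throwaway child process.
sleep 60 & victim=$!
term_then_kill "$victim"
wait "$victim" 2>/dev/null
echo "pid $victim is gone"
```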

Bulk Cleanup

# Kill all processes by name
pkill hyprlock

# Kill all orphaned processes owned by a specific user (CAREFUL!)
ps -eo pid,ppid,user,cmd | awk '$2==1 && $3=="username" {print $1}' | xargs -r kill

# List PIDs of processes consuming >90% CPU for more than a day (review before killing)
ps -eo pid,%cpu,etime,cmd --sort=-%cpu | awk '$2>90 && $3~/[0-9]+-/ {print $1}' | head -5
# ^-- etime format: [[dd-]hh:]mm:ss, so [0-9]+- matches an elapsed time of one day or more

CPU Tuning

# Check CPU frequency
cat /proc/cpuinfo | grep MHz
cpupower frequency-info

# Set CPU governor
cpupower frequency-set -g performance
# Governors: performance, powersave, ondemand, conservative

# Process priority (nice)
nice -n 10 command            # Lower priority
renice -n -5 -p <pid>         # Higher priority (root)

# CPU affinity
taskset -c 0,1 command        # Run on CPUs 0,1
taskset -p -c 0 <pid>         # Set affinity for running process

# Disable CPU cores (for testing)
echo 0 > /sys/devices/system/cpu/cpu3/online

Memory Performance

Diagnosing Memory Issues

# Memory overview
free -h
# total = Total physical RAM
# used = Used memory
# free = Completely unused
# buff/cache = Kernel buffers and page cache
# available = Estimated available for apps

# Detailed memory stats
cat /proc/meminfo
vmstat -s

# Per-process memory
ps aux --sort=-%mem | head -10
pidstat -r 1

# Memory by process
smem -tk
pmap -x <pid>
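free's "available" column is the kernel's MemAvailable estimate from /proc/meminfo; a sketch that turns it into an alertable percentage (the 10% threshold is an arbitrary example):

```shell
#!/bin/sh
# Percentage of RAM still available to applications, per MemAvailable.
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
pct=$(( 100 * avail_kb / total_kb ))
echo "available: ${avail_kb} kB of ${total_kb} kB (${pct}%)"
if [ "$pct" -lt 10 ]; then
    echo "WARNING: less than 10% of RAM available"
fi
```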

Memory Saturation

# Check for swapping
vmstat 1 | awk '{print $7, $8}'  # si, so columns
# si/so > 0 = Active swapping (performance impact)

# Check swap usage
free -h
swapon --show

# OOM killer activity
dmesg | grep -i "out of memory"
journalctl -k | grep -i oom

# Memory pressure
cat /proc/pressure/memory

# Check NUMA statistics
numastat
numactl --hardware

Memory Analysis

# Slab memory (kernel objects)
slabtop
cat /proc/slabinfo

# Page cache
cat /proc/meminfo | grep -E "Cached|Buffers|Active|Inactive"

# Per-process detailed
cat /proc/<pid>/status | grep -E "VmSize|VmRSS|VmSwap"
# VmSize = Virtual memory size
# VmRSS = Resident Set Size (physical memory)
# VmSwap = Swapped memory

# Memory maps
pmap -x <pid>
cat /proc/<pid>/smaps

# Find memory leaks
valgrind --leak-check=full ./program

Memory Tuning

# Clear page cache (for testing)
sync; echo 3 > /proc/sys/vm/drop_caches

# Swappiness (0-100, lower = prefer RAM)
sysctl vm.swappiness
sysctl -w vm.swappiness=10

# Dirty page settings
sysctl vm.dirty_ratio              # % RAM for dirty pages
sysctl vm.dirty_background_ratio   # % before background flush

# OOM score adjustment
echo -1000 > /proc/<pid>/oom_score_adj  # Never kill
echo 1000 > /proc/<pid>/oom_score_adj   # Kill first

# Huge pages
cat /proc/meminfo | grep Huge
sysctl vm.nr_hugepages=128

# NUMA tuning
numactl --membind=0 --cpunodebind=0 ./program

Disk I/O Performance

Diagnosing I/O Issues

# I/O statistics
iostat -xz 1

# Key columns:
#   r/s, w/s     = Reads/writes per second
#   rkB/s, wkB/s = KB read/written per second
#   await        = Average I/O wait time (ms)
#   avgqu-sz     = Average queue length
#   %util        = Device utilization

# Warning signs:
# - %util consistently near 100% = device saturated (less meaningful for SSDs/RAID)
# - await > 10ms = slow I/O
# - avgqu-sz > 1 = I/O queuing

# Per-process I/O
iotop
pidstat -d 1

# Block device statistics
cat /proc/diskstats
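iostat's numbers are derived from these counters; for instance, field 12 of /proc/diskstats is the number of I/Os currently in flight, i.e. the instantaneous queue depth (a sketch):

```shell
#!/bin/sh
# Instantaneous in-flight I/O per block device.
# /proc/diskstats: field 3 = device name, field 12 = I/Os currently in progress.
awk '{ printf "%-12s inflight=%s\n", $3, $12 }' /proc/diskstats
```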

I/O Saturation

# Check for blocked processes
vmstat 1 | awk '{print $2}'  # 'b' column
# b > 0 = Processes blocked on I/O

# I/O wait
vmstat 1 | awk '{print $16}'  # 'wa' column
# wa > 20% = I/O bottleneck

# Queue depth (avgqu-sz; newer sysstat calls it aqu-sz -- check the header for its column)
iostat -xz 1 | awk '{print $1, $10}'

# Check for I/O pressure
cat /proc/pressure/io

I/O Analysis

# Detailed block I/O tracing
blktrace -d /dev/sda -o trace
blkparse trace.* > trace.txt

# Simpler I/O tracing
sudo bpftrace -e 'tracepoint:block:block_rq_issue { @[args->comm] = count(); }'

# File-level I/O
fatrace                    # File access tracing
inotifywait -m /path       # Monitor file events

# Find I/O-heavy processes
iotop -o                   # Only show I/O processes
pidstat -d 1 | grep -v "^$"

# Check filesystem
df -h
df -i                      # Inode usage

I/O Tuning

# I/O scheduler
cat /sys/block/sda/queue/scheduler
# Options: none, mq-deadline, kyber, bfq

# Change scheduler (for NVMe, 'none' is often best)
echo none > /sys/block/nvme0n1/queue/scheduler

# Read-ahead
cat /sys/block/sda/queue/read_ahead_kb
echo 256 > /sys/block/sda/queue/read_ahead_kb

# Queue depth
cat /sys/block/sda/queue/nr_requests
echo 256 > /sys/block/sda/queue/nr_requests

# Dirty page writeback (for write-heavy workloads)
sysctl vm.dirty_expire_centisecs=500
sysctl vm.dirty_writeback_centisecs=100

# Filesystem mount options
# noatime    - Don't update access times
# nodiratime - Don't update directory access times
# barrier=0  - Disable write barriers (risky!)

Network Performance

Diagnosing Network Issues

# Network statistics
sar -n DEV 1 5

# Interface statistics
ip -s link
cat /proc/net/dev

# Socket statistics
ss -s
ss -tunap

# Per-process network
nethogs
iftop

# Packet analysis
tcpdump -i eth0 -c 100

Network Saturation

# Check for dropped packets
ip -s link show eth0 | grep -E "dropped|errors"
netstat -i
cat /proc/net/dev

# Check for buffer overflows
netstat -s | grep -i drop
netstat -s | grep -i overflow

# Socket buffer sizes
sysctl net.core.rmem_max
sysctl net.core.wmem_max

# TCP statistics
ss -s
cat /proc/net/netstat

Network Analysis

# Connection states
ss -tan state established | wc -l
ss -tan state time-wait | wc -l

# Port utilization
ss -tulnp

# Bandwidth test
iperf3 -s                  # Server
iperf3 -c server_ip        # Client

# Latency test
ping -c 100 host
mtr host

# DNS performance
dig +stats example.com

Network Tuning

# Increase socket buffer sizes
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# TCP tuning
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Enable TCP window scaling
sysctl -w net.ipv4.tcp_window_scaling=1

# Increase connection backlog
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535

# TIME_WAIT tuning
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=15

# Network queue (for high-speed NICs)
ethtool -g eth0            # Show ring buffer
ethtool -G eth0 rx 4096    # Increase RX buffer

# Interrupt coalescing
ethtool -c eth0
ethtool -C eth0 rx-usecs 50
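Settings applied with sysctl -w are lost at reboot; to persist them, place the values in a drop-in file (filename here is illustrative) and reload:

```
# /etc/sysctl.d/90-net-tuning.conf
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.somaxconn = 65535
net.ipv4.tcp_tw_reuse = 1

# Apply without rebooting:
#   sysctl --system
```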

Process Analysis

Finding Problem Processes

# CPU hogs
ps aux --sort=-%cpu | head -10
top -bn1 | head -20

# Memory hogs
ps aux --sort=-%mem | head -10
smem -tk | head -10

# I/O hogs
iotop -b -n 1 | head -10
pidstat -d 1 1

# Thread count
ps -eo pid,nlwp,comm --sort=-nlwp | head

# Open files
lsof -p <pid> | wc -l
ls -la /proc/<pid>/fd | wc -l

# File descriptor limits
cat /proc/<pid>/limits | grep "open files"
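Combining the open-file count with the soft limit gives descriptor pressure for one process; a sketch (shown against the current shell, substitute the PID under investigation):

```shell
#!/bin/sh
# File-descriptor usage vs the soft limit for one process.
pid=$$                                  # substitute the PID under investigation
open=$(ls /proc/"$pid"/fd | wc -l)
soft=$(awk '/^Max open files/ {print $4}' /proc/"$pid"/limits)
echo "pid $pid: $open of $soft file descriptors in use"
```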

Process Profiling

# strace - system calls
strace -c -p <pid>                    # Summary
strace -T -p <pid>                    # With timing
strace -e trace=file -p <pid>         # File operations
strace -e trace=network -p <pid>      # Network operations

# ltrace - library calls
ltrace -c -p <pid>

# perf - CPU profiling
perf record -p <pid> -g sleep 30
perf report

# perf - specific events
perf stat -p <pid> sleep 10

# BPF tools
execsnoop                             # New processes
opensnoop                             # File opens
biolatency                            # Block I/O latency

Process Resource Limits

# View limits
cat /proc/<pid>/limits
ulimit -a

# Modify limits (in shell)
ulimit -n 65535                       # Open files
ulimit -u 65535                       # Max processes

# Persistent limits in /etc/security/limits.conf
# user    soft    nofile    65535
# user    hard    nofile    65535

# systemd service limits
systemctl show <service> | grep Limit
# Add to service file:
# [Service]
# LimitNOFILE=65535
# LimitNPROC=65535

System-Wide Analysis

BPF Tools (bcc/bpftrace)

# Install bcc-tools
dnf install bcc-tools
apt install bpfcc-tools

# Common tools (in /usr/share/bcc/tools/)
execsnoop           # New process execution
opensnoop           # File opens
biosnoop            # Block I/O with latency
tcpconnect          # TCP connections
tcpretrans          # TCP retransmissions
runqlat             # CPU scheduler latency
ext4slower          # Slow ext4 operations

# bpftrace examples
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'

Performance Recording

# Record system state
sar -A 1 60 > sar_output.txt &
vmstat 1 60 > vmstat_output.txt &
iostat -xz 1 60 > iostat_output.txt &

# Continuous monitoring script
while true; do
    date >> /var/log/perf_monitor.log
    vmstat 1 5 >> /var/log/perf_monitor.log
    echo "---" >> /var/log/perf_monitor.log
    sleep 60
done

# Performance Co-Pilot (PCP)
systemctl enable --now pmcd
pmstat
pmrep -t 1 kernel.all.load

Baseline Comparison

# Create baseline
sar -A -o /var/log/baseline_$(date +%Y%m%d).sar 1 3600

# Compare to baseline
sar -A -f /var/log/baseline_20240315.sar

# Quick baseline check
cat << 'EOF' > check_baseline.sh
#!/bin/bash
echo "=== Load Average ==="
uptime
echo "=== Memory ==="
free -h
echo "=== Disk ==="
df -h
echo "=== CPU ==="
mpstat 1 5
echo "=== I/O ==="
iostat -xz 1 5
EOF
chmod +x check_baseline.sh

Quick Troubleshooting Checklist

□ Check load average: uptime
□ Check CPU: mpstat -P ALL 1, top
□ Check memory: free -h, vmstat 1
□ Check swapping: vmstat 1 (si/so columns)
□ Check disk I/O: iostat -xz 1, iotop
□ Check disk space: df -h, df -i
□ Check network: sar -n DEV 1, ss -s
□ Find CPU-heavy processes: ps aux --sort=-%cpu
□ Find memory-heavy processes: ps aux --sort=-%mem
□ Find I/O-heavy processes: iotop
□ Check system logs: journalctl -p err -b
□ Check for OOM kills: dmesg | grep -i oom
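Much of the checklist can be captured in a one-shot snapshot that needs only /proc and ps; a sketch:

```shell
#!/bin/sh
# One-shot triage snapshot using /proc and ps only (no sysstat required).
report=$(
    echo "== load average ==";  cat /proc/loadavg
    echo "== memory ==";        grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree)' /proc/meminfo
    echo "== top CPU ==";       ps aux --sort=-%cpu 2>/dev/null | head -4
    echo "== top memory ==";    ps aux --sort=-%mem 2>/dev/null | head -4
)
echo "$report"
```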

Quick Command Reference

# === System Overview ===
top / htop                     # Interactive process viewer
vmstat 1 5                     # Virtual memory stats
sar -A 1 5                     # All system stats

# === CPU ===
mpstat -P ALL 1                # Per-CPU stats
pidstat -u 1                   # Per-process CPU
perf top                       # CPU profiling
ps aux --sort=-%cpu | head     # Top CPU processes

# === Memory ===
free -h                        # Memory overview
vmstat -s                      # Memory statistics
pidstat -r 1                   # Per-process memory
ps aux --sort=-%mem | head     # Top memory processes

# === Disk I/O ===
iostat -xz 1                   # I/O statistics
iotop                          # Per-process I/O
pidstat -d 1                   # Per-process disk
df -h                          # Disk space

# === Network ===
sar -n DEV 1                   # Network stats
ss -s                          # Socket summary
nethogs                        # Per-process bandwidth
iftop                          # Interface bandwidth

# === Process ===
strace -c -p <pid>             # System call trace
perf record -p <pid>           # CPU profiling
lsof -p <pid>                  # Open files
pmap -x <pid>                  # Memory maps

# === Tuning ===
sysctl -a                      # All kernel parameters
ulimit -a                      # Shell resource limits

See Also