Statistics

Descriptive Statistics

Measures of center
Mean (average):     x̄ = (sum of all values) / n
Median:             middle value when sorted (average of two middle if even n)
Mode:               most frequently occurring value

When to use which:
  Mean:   symmetric data, no outliers (response times under normal load)
  Median: skewed data or outliers (salaries, latency with spikes)
  Mode:   categorical data (most common error code)

Example: latencies [2, 3, 3, 5, 5, 5, 8, 100] ms
  Mean = 131/8 ≈ 16.4 ms   (distorted by outlier)
  Median = (5+5)/2 = 5 ms   (better representation)
  Mode = 5 ms
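The example above can be reproduced with a short shell/awk sketch; the inline numbers are just the sample latencies from the example, and any file of numbers piped in works the same way:

```shell
# Mean and median of the example latencies [2, 3, 3, 5, 5, 5, 8, 100]
stats=$(printf '%s\n' 2 3 3 5 5 5 8 100 | sort -n | awk '
    {a[NR] = $1; sum += $1}
    END {
        mean = sum / NR
        # even n: average the two middle values; odd n: take the middle one
        if (NR % 2 == 0) median = (a[NR/2] + a[NR/2 + 1]) / 2
        else             median = a[(NR + 1) / 2]
        printf "mean=%.1f median=%.1f", mean, median
    }')
echo "$stats"    # mean=16.4 median=5.0
```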

Measures of spread
Range:             max - min
Variance:          σ^2 = sum((xi - x̄)^2) / n         (population)
                   s^2 = sum((xi - x̄)^2) / (n-1)     (sample)
Standard deviation: σ = sqrt(variance)
IQR:               Q3 - Q1 (interquartile range, middle 50%)

Quartiles:
  Q1 = 25th percentile
  Q2 = 50th percentile (median)
  Q3 = 75th percentile

Outlier detection (Tukey's rule):
  Below Q1 - 1.5×IQR  or  Above Q3 + 1.5×IQR
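A sketch of Tukey's rule on the earlier latency sample, using nearest-rank quartiles (tools like NumPy interpolate between ranks, so their quartiles can differ slightly):

```shell
# Quartiles, IQR, and Tukey fences; 100 ms falls above the upper fence
fences=$(printf '%s\n' 2 3 3 5 5 5 8 100 | sort -n | awk '
    {a[NR] = $1}
    END {
        q1i = int(NR * 0.25); if (q1i < NR * 0.25) q1i++   # ceil(NR * 0.25)
        q3i = int(NR * 0.75); if (q3i < NR * 0.75) q3i++   # ceil(NR * 0.75)
        iqr = a[q3i] - a[q1i]
        printf "Q1=%d Q3=%d IQR=%d low=%g high=%g", a[q1i], a[q3i], iqr,
               a[q1i] - 1.5 * iqr, a[q3i] + 1.5 * iqr
    }')
echo "$fences"    # Q1=3 Q3=5 IQR=2 low=0 high=8
```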

Percentiles and SLA Metrics

Percentile definitions
P50 (median):   50% of values fall below this
P90:            90% below — "typical worst case"
P95:            95% below — common SLA metric
P99:            99% below — tail latency
P99.9:          99.9% below — extreme tail

Example: API response times (ms)
  P50 = 45 ms     (half of requests faster than this)
  P90 = 120 ms    (90% under 120 ms)
  P95 = 250 ms    (SLA target)
  P99 = 800 ms    (1 in 100 this slow)

Why percentiles matter more than mean:
  Mean = 60 ms sounds great
  But P99 = 3000 ms means 1% of users wait 3 seconds
  At 10K requests/sec, that's 100 users per second hitting 3s

Computing percentiles from command line
# Sort numerically, find P95 from a file of latency values
sort -n latencies.txt | awk '
    function idx(n, p,   i) {          # nearest-rank index: ceil(n*p), min 1
        i = int(n * p)
        if (i < n * p) i++
        return (i < 1) ? 1 : i
    }
    {a[NR]=$1}
    END {
        p50=a[idx(NR, 0.50)]
        p90=a[idx(NR, 0.90)]
        p95=a[idx(NR, 0.95)]
        p99=a[idx(NR, 0.99)]
        printf "P50=%d P90=%d P95=%d P99=%d\n", p50, p90, p95, p99
    }'

# Basic stats from a column of numbers
awk '{sum+=$1; sumsq+=$1*$1; n++}
     END {
         mean=sum/n
         var=sumsq/n - mean^2     # population variance; scale by n/(n-1) for sample
         if (var < 0) var = 0     # guard tiny negatives from floating-point error
         printf "n=%d mean=%.2f sd=%.2f\n", n, mean, sqrt(var)
     }' data.txt

Correlation and Regression

Correlation coefficient
Pearson's r: measures linear relationship between two variables

r = sum((xi - x̄)(yi - ȳ)) / sqrt(sum(xi - x̄)^2 × sum(yi - ȳ)^2)

Interpretation:
  r = 1       perfect positive linear relationship
  r = -1      perfect negative linear relationship
  r = 0       no linear relationship (may still be nonlinear!)
  |r| > 0.7   strong correlation (rule of thumb; thresholds vary by field)
  |r| < 0.3   weak correlation

CAUTION: correlation ≠ causation
  Ice cream sales and drownings are correlated (summer)
  CPU usage and errors may correlate without one causing the other
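Pearson's r takes one awk pass using the computational form of the formula above; the whitespace-separated (x, y) pairs here are hypothetical sample data:

```shell
# Pearson's r from two columns: r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2)(n*Syy - Sy^2))
r=$(printf '%s\n' '1 2' '2 4' '3 5' '4 4' '5 5' | awk '
    {n++; sx+=$1; sy+=$2; sxx+=$1*$1; syy+=$2*$2; sxy+=$1*$2}
    END {
        num = n*sxy - sx*sy
        den = sqrt(n*sxx - sx*sx) * sqrt(n*syy - sy*sy)
        printf "%.3f", num/den
    }')
echo "r=$r"    # r=0.775
```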

Linear regression
Best fit line: ŷ = mx + b

Slope:     m = r × (sy / sx)
Intercept: b = ȳ - m × x̄

R^2 (coefficient of determination):
  R^2 = r^2
  Proportion of variance in y explained by x
  R^2 = 0.85 means "85% of variation explained by the model"
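The same running sums give the least-squares fit in closed form (the slope below is algebraically equal to m = r × (sy/sx) above); the (x, y) pairs are hypothetical sample data:

```shell
# Least-squares fit y-hat = m*x + b, plus R^2
fit=$(printf '%s\n' '1 2' '2 4' '3 5' '4 4' '5 5' | awk '
    {n++; sx+=$1; sy+=$2; sxx+=$1*$1; sxy+=$1*$2; syy+=$2*$2}
    END {
        m  = (n*sxy - sx*sy) / (n*sxx - sx*sx)   # slope
        b  = sy/n - m * sx/n                     # intercept: b = y-bar - m * x-bar
        r2 = (n*sxy - sx*sy)^2 / ((n*sxx - sx*sx) * (n*syy - sy*sy))
        printf "m=%.1f b=%.1f R2=%.2f", m, b, r2
    }')
echo "$fit"    # m=0.6 b=2.2 R2=0.60
```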

Hypothesis Testing (Concepts)

Framework
Null hypothesis H0:     "no effect" or "no difference"
Alternative hypothesis: "there IS an effect"

p-value: probability of seeing data this extreme if H0 is true
  p < 0.05: reject H0 (statistically significant)
  p >= 0.05: fail to reject H0 (not enough evidence)

Type I error (false positive):  reject H0 when it's true   (α)
Type II error (false negative): fail to reject when false   (β)

Example: "Did the new config reduce latency?"
  H0: mean latency before = mean latency after
  H1: mean latency after < mean latency before
  Collect samples, compute test statistic, find p-value

Practical application
A/B testing:     Compare two configurations
  Collect metrics from both, test if difference is significant

Anomaly detection:
  If observation > μ + 3σ, flag as anomaly
  P(normal value > 3σ) ≈ 0.13%
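A minimal 3-sigma flagger over made-up data (nineteen values near 50 plus one spike). One caveat worth knowing: σ here is computed from the contaminated sample itself, so in very small samples a single extreme point can inflate σ enough to mask itself:

```shell
# Flag values above mean + 3*sd (population sd computed from the data itself)
anomalies=$(printf '%s\n' 49 50 51 49 50 51 49 50 51 49 50 51 49 50 51 49 50 51 50 150 | awk '
    {a[NR] = $1; sum += $1; sumsq += $1 * $1}
    END {
        mean = sum / NR
        sd = sqrt(sumsq / NR - mean^2)
        for (i = 1; i <= NR; i++)
            if (a[i] > mean + 3 * sd) print a[i]
    }')
echo "anomalies: $anomalies"    # anomalies: 150
```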

Change detection:
  Before: mean response = 50 ms, sd = 10 ms
  After:  mean response = 55 ms
  Is a 5 ms increase significant or just noise?
  Depends on sample size and variance
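One back-of-envelope way to make that judgment: compare the shift to the standard error of the mean, sd/sqrt(n). Using the numbers above (shift = 5 ms, before sd = 10 ms) and two hypothetical sample sizes, against the common two-sided 5% cutoff of 1.96 (a real analysis would use a proper t-test on both samples):

```shell
# z = (mean_after - mean_before) / (sd / sqrt(n)); 1.96 ~ two-sided 5% cutoff
verdicts=$(for n in 4 100; do
    awk -v n="$n" 'BEGIN {
        z = (55 - 50) / (10 / sqrt(n))
        printf "n=%d z=%.1f %s\n", n, z, (z > 1.96 ? "significant" : "could be noise")
    }'
done)
echo "$verdicts"
# n=4 z=1.0 could be noise
# n=100 z=5.0 significant
```

The same 5 ms shift is indistinguishable from noise at n=4 but decisive at n=100.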