Statistics
Descriptive Statistics
Measures of center
Mean (average): x̄ = (sum of all values) / n
Median: middle value when sorted (average of the two middle values if n is even)
Mode: most frequently occurring value
When to use which:
Mean: symmetric data, no outliers (response times under normal load)
Median: skewed data or outliers (salaries, latency with spikes)
Mode: categorical data (most common error code)
Example: latencies [2, 3, 3, 5, 5, 5, 8, 100] ms
Mean = 131/8 ≈ 16.4 ms (distorted by outlier)
Median = (5+5)/2 = 5 ms (better representation)
Mode = 5 ms
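The arithmetic above can be checked from the shell; a small awk sketch over the same eight latency values:

```shell
# Mean, median, and mode for the latency example above.
# Input is already sorted; pipe through `sort -n` for unsorted data.
printf '%s\n' 2 3 3 5 5 5 8 100 | awk '
{a[NR]=$1; sum+=$1; count[$1]++}
END {
  mean = sum/NR
  # median: average the two middle values when NR is even
  if (NR % 2) median = a[(NR+1)/2]
  else        median = (a[NR/2] + a[NR/2+1]) / 2
  # mode: value with the highest count (unique here, so order is irrelevant)
  best = 0
  for (v in count) if (count[v] > best) { best = count[v]; mode = v }
  printf "mean=%.1f median=%.1f mode=%s\n", mean, median, mode
}'
# prints: mean=16.4 median=5.0 mode=5
```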
Measures of spread
Range: max - min
Variance: σ^2 = sum((xi - x̄)^2) / n (population)
s^2 = sum((xi - x̄)^2) / (n-1) (sample)
Standard deviation: σ = sqrt(variance)
IQR: Q3 - Q1 (interquartile range, middle 50%)
Quartiles:
Q1 = 25th percentile
Q2 = 50th percentile (median)
Q3 = 75th percentile
Outlier detection (Tukey's rule):
Below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
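Putting the spread measures together for the same latencies. This sketch uses nearest-rank quartiles; other quartile conventions (e.g. linear interpolation) give slightly different values:

```shell
# Range, sample sd, quartiles, IQR, and Tukey fences for [2,3,3,5,5,5,8,100]
printf '%s\n' 2 3 3 5 5 5 8 100 | sort -n | awk '
{a[NR]=$1; sum+=$1; sumsq+=$1*$1}
END {
  mean = sum/NR
  sd = sqrt((sumsq - NR*mean^2)/(NR-1))   # sample standard deviation (n-1)
  q1 = a[idx(NR,0.25)]; q3 = a[idx(NR,0.75)]; iqr = q3 - q1
  printf "range=%d sd=%.1f Q1=%d Q3=%d IQR=%d fences=[%.1f, %.1f]\n", a[NR]-a[1], sd, q1, q3, iqr, q1-1.5*iqr, q3+1.5*iqr
}
# nearest-rank index: ceil(NR*p)
function idx(n, p,    i) { i = int(n*p); if (i < n*p) i++; return i }'
# prints: range=98 sd=33.8 Q1=3 Q3=5 IQR=2 fences=[0.0, 8.0]
```

The value 100 lies above the upper fence (8.0), so Tukey's rule flags it as an outlier.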
Percentiles and SLA Metrics
Percentile definitions
P50 (median): 50% of values fall below this
P90: 90% below — "typical worst case"
P95: 95% below — common SLA metric
P99: 99% below — tail latency
P99.9: 99.9% below — extreme tail
Example: API response times (ms)
P50 = 45 ms (half of requests faster than this)
P90 = 120 ms (90% under 120 ms)
P95 = 250 ms (SLA target)
P99 = 800 ms (1 in 100 this slow)
Why percentiles matter more than mean:
Mean = 60 ms sounds great
But P99 = 3000 ms means 1% of users wait 3 seconds
At 10K requests/sec, that's 100 users per second hitting 3s
Computing percentiles from command line
# Sort numerically, then take nearest-rank percentiles from a file of latency values
sort -n latencies.txt | awk '
{a[NR]=$1}
END {
  printf "P50=%d P90=%d P95=%d P99=%d\n",
    a[idx(NR,0.50)], a[idx(NR,0.90)], a[idx(NR,0.95)], a[idx(NR,0.99)]
}
# nearest-rank index: ceil(NR*p); plain int() truncates, which can
# under-shoot the rank and even index a[0] on small inputs
function idx(n, p,    i) { i = int(n*p); if (i < n*p) i++; return i }'
# Basic stats from a column of numbers
awk '{sum+=$1; sumsq+=$1*$1; n++}
END {
  mean = sum/n
  var = sumsq/n - mean^2   # population variance; use (sumsq - n*mean^2)/(n-1) for the sample version
  printf "n=%d mean=%.2f sd=%.2f\n", n, mean, sqrt(var)
}' data.txt
Correlation and Regression
Correlation coefficient
Pearson's r: measures linear relationship between two variables
r = sum((xi - x̄)(yi - ȳ)) / sqrt(sum(xi - x̄)^2 × sum(yi - ȳ)^2)
Interpretation:
r = 1 perfect positive linear relationship
r = -1 perfect negative linear relationship
r = 0 no linear relationship (may still be nonlinear!)
|r| > 0.7 strong correlation
|r| < 0.3 weak correlation
CAUTION: correlation ≠ causation
Ice cream sales and drownings are correlated (summer)
CPU usage and errors may correlate without one causing the other
Linear regression
Best fit line: ŷ = mx + b
Slope: m = r × (sy / sx)
Intercept: b = ȳ - m × x̄
R^2 (coefficient of determination):
R^2 = r^2
Proportion of variance in y explained by x
R^2 = 0.85 means "85% of variation explained by the model"
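A sketch computing r, the fitted line, and R^2 in one awk pass over paired data; the five (x, y) pairs are made-up illustration values:

```shell
# Pearson's r, least-squares line, and R^2 for paired (x, y) data
printf '%s %s\n' 1 2 2 4 3 5 4 4 5 6 | awk '
{n++; sx+=$1; sy+=$2; sxx+=$1*$1; syy+=$2*$2; sxy+=$1*$2}
END {
  mx = sx/n; my = sy/n
  cxy = sxy - n*mx*my            # sum((xi - mx)(yi - my))
  vx  = sxx - n*mx*mx            # sum((xi - mx)^2)
  vy  = syy - n*my*my            # sum((yi - my)^2)
  r = cxy / sqrt(vx*vy)
  m = cxy / vx                   # slope (equivalent to r * sy/sx)
  b = my - m*mx                  # intercept
  printf "r=%.3f m=%.2f b=%.2f R2=%.3f\n", r, m, b, r*r
}'
# prints: r=0.853 m=0.80 b=1.80 R2=0.727
```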
Hypothesis Testing (Concepts)
Framework
Null hypothesis H0: "no effect" or "no difference"
Alternative hypothesis: "there IS an effect"
p-value: probability of seeing data this extreme if H0 is true
p < 0.05: reject H0 (statistically significant)
p >= 0.05: fail to reject H0 (not enough evidence)
Type I error (false positive): reject H0 when it's true (α)
Type II error (false negative): fail to reject when false (β)
Example: "Did the new config reduce latency?"
H0: mean latency before = mean latency after
H1: mean latency after < mean latency before
Collect samples, compute test statistic, find p-value
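A rough sketch of that comparison as a two-sample z-test (reasonable for large samples; small samples call for a t-test and its tables). The file paths and the latency values are made up for illustration:

```shell
# Two hypothetical latency samples (ms)
printf '%s\n' 50 52 48 51 49 50 53 47 50 50 > /tmp/before.txt
printf '%s\n' 45 46 44 45 47 43 45 46 44 45 > /tmp/after.txt
awk 'NR==FNR {n1++; s1+=$1; ss1+=$1*$1; next}
            {n2++; s2+=$1; ss2+=$1*$1}
END {
  m1 = s1/n1; m2 = s2/n2
  v1 = (ss1 - n1*m1^2)/(n1-1)   # sample variances
  v2 = (ss2 - n2*m2^2)/(n2-1)
  z = (m1 - m2) / sqrt(v1/n1 + v2/n2)   # Welch-style test statistic
  printf "before=%.1f after=%.1f z=%.2f\n", m1, m2, z
}' /tmp/before.txt /tmp/after.txt
# prints: before=50.0 after=45.0 z=7.50
```

For H1 "after < before", z > 1.64 rejects H0 at the 5% level (one-sided); here the drop is far beyond noise.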
Practical application
A/B testing: Compare two configurations
Collect metrics from both, test if difference is significant
Anomaly detection:
If observation > μ + 3σ, flag as anomaly
P(value > μ + 3σ) ≈ 0.13% when values are normally distributed
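A sketch of the 3σ rule: estimate μ and σ from a baseline file, then flag later observations above μ + 3σ. File paths and values are illustrative; note the baseline must be clean, since an anomaly included in the estimate inflates σ and can mask itself:

```shell
# Baseline observations (clean) and new observations (one spike)
printf '%s\n' 50 52 49 51 50 48 50 51 > /tmp/baseline.txt
printf '%s\n' 50 49 120 51           > /tmp/new.txt
awk 'NR==FNR {sum+=$1; sumsq+=$1*$1; n++; next}
     FNR==1 {mean = sum/n; sd = sqrt(sumsq/n - mean^2)}
     $1 > mean + 3*sd {print "line "FNR": "$1" anomaly"}
' /tmp/baseline.txt /tmp/new.txt
# prints: line 3: 120 anomaly
```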
Change detection:
Before: mean response = 50ms, sd = 10ms
After: mean response = 55ms
Is 5ms increase significant or just noise?
Depends on sample size and variance
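The sample-size dependence can be made concrete: the standard error of the mean is sd/sqrt(n), so the z-score for the 5 ms shift (sd = 10 ms from the "before" period) grows with n. The n values below are illustrative:

```shell
# z = (55 - 50) / (10 / sqrt(n)) for a few sample sizes
for n in 4 16 100; do
  awk -v n="$n" 'BEGIN {
    se = 10/sqrt(n)             # standard error of the mean
    printf "n=%d se=%.1f z=%.1f\n", n, se, 5/se
  }'
done
# prints:
# n=4 se=5.0 z=1.0
# n=16 se=2.5 z=2.0
# n=100 se=1.0 z=5.0
```

With 4 samples the shift is indistinguishable from noise; with 100 it is well past any conventional significance threshold.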