Information Theory

Shannon entropy, lossless and lossy compression, error detection and correction codes, and channel capacity theorem.

Information and Entropy

Shannon entropy
H(X) = -sum(p_i × log_2(p_i))    bits

Measures the average information content (surprise) per symbol.

Example: fair coin
  H = -(0.5 × log_2(0.5) + 0.5 × log_2(0.5))
  H = -(0.5 × -1 + 0.5 × -1) = 1 bit

Example: biased coin (90% heads)
  H = -(0.9 × log_2(0.9) + 0.1 × log_2(0.1))
  H ≈ -(0.9 × -0.152 + 0.1 × -3.322) ≈ 0.469 bits
  Less entropy = more predictable = more compressible

Maximum entropy: uniform distribution → H = log_2(n)
  26-letter alphabet, uniform: log_2(26) ≈ 4.7 bits per character
  Actual English text: ~1.0-1.5 bits per character (highly redundant)
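
The entropy values above can be checked with a few lines of Python (a minimal sketch; `entropy` is an illustrative helper name):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H(X) = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin   -> 1.0 bit
print(entropy([0.9, 0.1]))     # biased coin -> ~0.469 bits
print(entropy([1/26] * 26))    # uniform over 26 letters -> ~4.70 bits
```

The `if p > 0` guard matters: zero-probability outcomes contribute nothing to entropy, but log_2(0) would raise an error.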

Information content of single event
I(x) = -log_2(P(x))    bits

"How surprising is this event?"

P = 1/2:    I = 1 bit       (coin flip)
P = 1/8:    I = 3 bits      (one of 8 equally likely outcomes, e.g. 3 coin flips)
P = 1/256:  I = 8 bits      (specific byte value)
P = 1:      I = 0 bits      (certain event — no information)
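
The same table as a one-line function (a sketch; `self_information` is an illustrative name):

```python
import math

def self_information(p):
    """I(x) = -log2(P(x)), in bits: how surprising is an event of probability p?"""
    return -math.log2(p)

print(self_information(1/2))    # 1.0 bit  (coin flip)
print(self_information(1/256))  # 8.0 bits (specific byte value)
print(self_information(1.0))    # 0.0 bits (certain event)
```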

Data Compression

Lossless compression
Shannon's source coding theorem:
  Cannot compress below H(X) bits per symbol on average
  Optimal compression approaches entropy

Huffman coding:
  Frequent symbols → short codes
  Rare symbols → long codes
  Prefix-free: no code is prefix of another
  Approaches entropy for large block sizes
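
A minimal Huffman construction using Python's heapq, merging the two least-frequent subtrees until one tree remains (a sketch, not a production encoder; `huffman_codes` is an illustrative name):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Assign prefix-free codes: frequent symbols get shorter codewords."""
    # Heap entries: (frequency, tiebreaker, {symbol: code bits so far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
assert len(codes["a"]) < len(codes["c"])   # 'a' is most frequent, gets shortest code
```

Because codes are built by prepending bits as subtrees merge, no codeword can be a prefix of another, so the encoded stream decodes unambiguously.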

LZ77/LZ78 (used in gzip, zstd):
  Replace repeated sequences with back-references
  (offset, length) pairs
  Exploits patterns in data, not just symbol frequency

Compression ratios:
  English text: 2-3× (high redundancy)
  Binary executables: 1.5-2×
  Already compressed (JPEG, MP3): ~1× (near entropy)
  Random data: ~1× (cannot compress — already maximum entropy)
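
These ratios are easy to reproduce with Python's zlib (DEFLATE = LZ77 back-references + Huffman coding):

```python
import os
import zlib

text = b"the quick brown fox jumps over the lazy dog " * 100  # highly redundant
rand = os.urandom(4500)                                       # already max entropy

print(len(text), "->", len(zlib.compress(text)))  # large reduction
print(len(rand), "->", len(zlib.compress(rand)))  # ~1x, slightly larger (header overhead)
```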

Lossy compression
Discard information humans don't perceive:
  JPEG: spatial frequency, color subsampling
  MP3:  masking effect, inaudible frequencies
  H.264: temporal redundancy, motion prediction

Rate-distortion theory:
  Trade-off between bit rate and quality
  Minimum bits needed for given quality level

Error Detection and Correction

Error detection
Parity bit:     single bit, detects 1-bit errors
                Even parity: total 1s must be even
                Cannot correct; misses any even number of bit errors

Checksum:       sum of data values mod some number
                TCP/UDP use 16-bit one's complement sum
                Detects common errors, not adversarial
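
A sketch of the 16-bit one's complement sum along the lines of RFC 1071 (simplified for clarity, not a drop-in TCP implementation):

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's complement sum, as used by TCP/UDP headers (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                            # pad to whole 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold carry back in (one's complement)
    return ~total & 0xFFFF

header = b"\x45\x00\x00\x3c\x1c\x46\x40\x00"       # example bytes
c = internet_checksum(header)
# Receiver recomputes over data + checksum; a result of 0 means no error detected
assert internet_checksum(header + c.to_bytes(2, "big")) == 0
```

The verify-to-zero property falls out of one's complement arithmetic: the sum of the data words plus their complemented sum is all ones, whose complement is zero.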

CRC:            polynomial division over GF(2)
                CRC-32 in Ethernet, ZIP, PNG
                Detects all burst errors up to 32 bits
                Good for random errors, NOT cryptographic
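
Python's zlib exposes the same CRC-32 used by ZIP and PNG, which makes the detection property easy to see:

```python
import zlib

data = bytearray(b"hello, world")
crc = zlib.crc32(data)           # sender computes and transmits the CRC

data[3] ^= 0x01                  # a single bit flipped in transit
assert zlib.crc32(data) != crc   # receiver's CRC no longer matches
```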

Cryptographic hash: SHA-256, BLAKE3
                Detects any tampering (adversarial)
                Computationally infeasible to forge

Error correction
Hamming code:   can correct 1-bit errors, detect 2-bit
                Used in ECC RAM
                (7,4) Hamming: 4 data bits + 3 parity bits
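
A (7,4) Hamming encoder/decoder fits in a few lines (a sketch using the standard bit layout with parity bits at positions 1, 2, and 4; function names are illustrative):

```python
def hamming74_encode(d):
    """4 data bits -> 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4       # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4       # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4       # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the error; 0 = clean
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

code = hamming74_encode([1, 0, 1, 1])
code[4] ^= 1                              # corrupt one bit in transit
assert hamming74_decode(code) == [1, 0, 1, 1]
```

The trick is that each parity bit checks the positions whose binary index contains a particular bit, so the three syndrome bits spell out the error position directly.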

Reed-Solomon:   corrects burst errors
                Used in CDs, DVDs, QR codes, deep space
                Can correct up to t errors with 2t parity symbols

LDPC:           near Shannon limit performance
                Used in WiFi 6 (802.11ax), 5G, SSD controllers

Turbo codes:    near Shannon limit
                Used in 3G/4G cellular

Hamming distance:
  Number of bit positions where two codewords differ
  Minimum distance d:
    Detect up to d-1 errors
    Correct up to ⌊(d-1)/2⌋ errors
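
For codewords represented as integers, the distance is just the popcount of their XOR:

```python
def hamming_distance(a, b):
    """Number of bit positions in which two equal-length codewords differ."""
    return bin(a ^ b).count("1")

# Minimum distance 3 (as in Hamming(7,4)): detect up to 2 errors, correct 1
assert hamming_distance(0b0000000, 0b0000111) == 3
```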

Channel Capacity

Shannon–Hartley theorem (capacity of a noisy channel)
C = B × log_2(1 + SNR)

C = channel capacity (bits/second)
B = bandwidth (Hz)
SNR = signal-to-noise ratio (linear, not dB)

SNR_dB = 10 × log_10(SNR_linear)

Example: WiFi channel
  B = 20 MHz, SNR = 30 dB (1000 linear)
  C = 20×10^6 × log_2(1001) ≈ 200 Mbps theoretical max
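
The example, including the dB-to-linear conversion, as a small Python sketch (`shannon_capacity` is an illustrative name):

```python
import math

def shannon_capacity(bandwidth_hz, snr_db):
    """C = B * log2(1 + SNR), with SNR converted from dB to linear."""
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * math.log2(1 + snr_linear)

c = shannon_capacity(20e6, 30)   # the WiFi example above
print(c / 1e6)                   # ~199.3 Mbps
```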

Key insights:
  Doubling bandwidth → doubles capacity
  Doubling SNR → adds ~B bits/sec (diminishing returns)
  Error-free communication possible at any rate < C
  Above C: errors are unavoidable

Mutual information
I(X;Y) = H(X) - H(X|Y)

"How much does knowing Y tell us about X?"

If X and Y independent: I(X;Y) = 0 (knowing Y tells nothing)
If Y = X: I(X;Y) = H(X) (knowing Y tells everything)
Channel capacity: C = max I(X;Y) over all input distributions
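
Both extreme cases can be computed from a joint probability table, using the equivalent form I(X;Y) = sum p(x,y) log_2(p(x,y) / (p(x)p(y))) of the definition above (a sketch; `mutual_information` is an illustrative name):

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint distribution given as a 2-D table p(x, y)."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    return sum(
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]   # knowing Y tells nothing
copied      = [[0.5, 0.0], [0.0, 0.5]]       # Y = X, a fair coin
print(mutual_information(independent))       # 0.0
print(mutual_information(copied))            # 1.0 (= H(X))
```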

See Also

  • Signals — signal processing applies information theory to physical channels

  • Cryptography — entropy measures cryptographic strength

  • Physics — Landauer’s principle connects information to thermodynamics