Information Theory
Shannon entropy, lossless and lossy compression, error detection and correction codes, and the channel capacity theorem.
Information and Entropy
Shannon entropy
H(X) = -sum(p_i × log_2(p_i)) bits
Measures the average information content (surprise) per symbol.
Example: fair coin
H = -(0.5 × log_2(0.5) + 0.5 × log_2(0.5))
H = -(0.5 × -1 + 0.5 × -1) = 1 bit
Example: biased coin (90% heads)
H = -(0.9 × log_2(0.9) + 0.1 × log_2(0.1))
H ≈ -(0.9 × -0.152 + 0.1 × -3.322) ≈ 0.469 bits
Less entropy = more predictable = more compressible
Maximum entropy: uniform distribution → H = log_2(n)
English text (26 letters):
If letters were equally likely: log_2(26) ≈ 4.7 bits per character
Actual English: ~1.0-1.5 bits per character (letters are highly redundant)
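The coin and uniform-letter values above can be reproduced with a short sketch (Python; the function name is illustrative):

```python
import math

def shannon_entropy(probs):
    """H(X) = -sum(p * log2(p)), in bits per symbol; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(shannon_entropy([0.9, 0.1]))    # biased coin: ~0.469 bits
print(shannon_entropy([1/26] * 26))   # uniform over 26 letters: ~4.70 bits
```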
Information content of single event
I(x) = -log_2(P(x)) bits
"How surprising is this event?"
P = 1/2: I = 1 bit (coin flip)
P = 1/8: I = 3 bits (one of 8 equally likely outcomes, e.g. three coin flips)
P = 1/256: I = 8 bits (specific byte value)
P = 1: I = 0 bits (certain event — no information)
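The same four cases as a one-line sketch (Python; the function name is illustrative):

```python
import math

def self_information(p):
    """I(x) = -log2(P(x)): the surprise of a single event, in bits."""
    return -math.log2(p)

print(self_information(1/2))    # 1.0 bit  (coin flip)
print(self_information(1/8))    # 3.0 bits
print(self_information(1/256))  # 8.0 bits (specific byte value)
print(self_information(1))      # 0 bits   (certain event)
```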
Data Compression
Lossless compression
Shannon's source coding theorem:
Cannot compress below H(X) bits per symbol on average
Optimal compression approaches entropy
Huffman coding:
Frequent symbols → short codes
Rare symbols → long codes
Prefix-free: no code is prefix of another
Approaches entropy for large block sizes
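A compact Huffman sketch using a min-heap (a sketch, not a production encoder: a single-symbol input would get an empty code):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-free code: frequent symbols get short codes."""
    freq = Counter(text)
    # Heap entries: (frequency, tiebreaker, {symbol: code-so-far})
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, i, c2 = heapq.heappop(heap)
        # Prepend 0 to one subtree's codes and 1 to the other's
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, i, merged))
    return heap[0][2]

codes = huffman_codes("abracadabra")
# 'a' (5 occurrences) gets a shorter code than 'c' (1 occurrence)
assert len(codes["a"]) < len(codes["c"])
# Prefix-free: no code is a prefix of another
vals = list(codes.values())
assert all(not b.startswith(a) for a in vals for b in vals if a != b)
```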
LZ77/LZ78 (used in gzip, zstd):
Replace repeated sequences with back-references
(offset, length) pairs
Exploits patterns in data, not just symbol frequency
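The back-reference idea can be sketched as a toy greedy LZ77 (illustrative only: real implementations use hash chains, not this O(n²) scan):

```python
def lz77_compress(data, window=255):
    """Greedy LZ77: emit (offset, length, next_char) triples."""
    out, i = [], 0
    while i < len(data):
        best_off, best_len = 0, 0
        # Search the sliding window for the longest match
        for off in range(max(0, i - window), i):
            length = 0
            while (i + length < len(data)
                   and data[off + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - off, length
        nxt = data[i + best_len] if i + best_len < len(data) else ""
        out.append((best_off, best_len, nxt))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    s = ""
    for off, length, nxt in triples:
        for _ in range(length):
            s += s[-off]          # copy one char via the back-reference
        s += nxt
    return s

msg = "abcabcabcabcx"
# The repeats collapse into a single (offset=3, length=9) back-reference
assert lz77_decompress(lz77_compress(msg)) == msg
```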
Compression ratios:
English text: 2-3× (high redundancy)
Binary executables: 1.5-2×
Already compressed (JPEG, MP3): ~1× (near entropy)
Random data: ~1× (cannot compress — already maximum entropy)
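These ratios are easy to observe with zlib (DEFLATE, i.e. LZ77 + Huffman):

```python
import os
import zlib

text = b"the quick brown fox jumps over the lazy dog " * 100  # highly redundant
rand = os.urandom(4096)                                       # already maximum entropy

for label, data in [("repetitive text", text), ("random bytes", rand)]:
    packed = zlib.compress(data, 9)
    print(f"{label}: {len(data)} -> {len(packed)} bytes "
          f"({len(data) / len(packed):.1f}x)")
```

The repetitive text shrinks dramatically; the random block stays at roughly 1× (zlib even adds a few bytes of framing overhead).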
Lossy compression
Discard information humans don't perceive:
JPEG: spatial frequency, color subsampling
MP3: masking effect, inaudible frequencies
H.264: temporal redundancy, motion prediction
Rate-distortion theory:
Trade-off between bit rate and quality
Minimum bits needed for given quality level
Error Detection and Correction
Error detection
Parity bit: single bit, detects 1-bit errors
Even parity: total 1s must be even
Cannot correct; misses any even number of bit errors (e.g. 2-bit errors)
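A parity sketch showing both the detection and its blind spot:

```python
def even_parity_bit(bits):
    """Parity bit chosen so the total number of 1s is even."""
    return sum(bits) % 2

def check(word):
    """True means 'no error detected'."""
    return sum(word) % 2 == 0

data = [1, 0, 1, 1]
word = data + [even_parity_bit(data)]   # [1, 0, 1, 1, 1]: four 1s, even

assert check(word)          # clean word passes
word[2] ^= 1                # flip one bit
assert not check(word)      # 1-bit error detected
word[3] ^= 1                # flip a second bit
assert check(word)          # 2-bit error slips through undetected
```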
Checksum: sum of data values mod some number
TCP/UDP use 16-bit one's complement sum
Detects common errors, not adversarial
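A sketch of the RFC 1071-style Internet checksum (padding handling simplified):

```python
def ones_complement_sum(data: bytes) -> int:
    """Sum 16-bit words with end-around carry (one's complement addition)."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return total

def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum (RFC 1071)."""
    return ~ones_complement_sum(data) & 0xFFFF

msg = b"some packet payload"
ck = internet_checksum(msg)
# Sender transmits ck with the data; at the receiver, the one's complement
# sum of data plus checksum comes out all ones when nothing was corrupted
assert ones_complement_sum(msg) + ck == 0xFFFF
```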
CRC: polynomial division over GF(2)
CRC-32 in Ethernet, ZIP, PNG
Detects all burst errors up to 32 bits
Good for random errors, NOT cryptographic
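CRC-32 detection in action, via the same polynomial zlib exposes:

```python
import zlib

frame = bytearray(b"Ethernet-style payload")
crc = zlib.crc32(frame)            # same CRC-32 polynomial as Ethernet/ZIP/PNG

frame[3] ^= 0x01                   # flip a single bit "in transit"
assert zlib.crc32(frame) != crc    # corruption detected

frame[3] ^= 0x01                   # undo the flip
assert zlib.crc32(frame) == crc    # intact frame matches again
```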
Cryptographic hash: SHA-256, BLAKE3
Detects any tampering (adversarial)
Computationally infeasible to forge
Error correction
Hamming code: can correct 1-bit errors, detect 2-bit
Used in ECC RAM
(7,4) Hamming: 4 data bits + 3 parity bits
Reed-Solomon: corrects burst errors
Used in CDs, DVDs, QR codes, deep space
Can correct up to t errors with 2t parity symbols
LDPC: near Shannon limit performance
Used in WiFi 6 (802.11ax), 5G, and SSDs
Turbo codes: near Shannon limit
Used in 3G/4G cellular
Hamming distance:
Number of bit positions where two codewords differ
Minimum distance d:
Detect up to d-1 errors
Correct up to ⌊(d-1)/2⌋ errors
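A (7,4) Hamming sketch: three parity bits locate any single flipped bit by its syndrome (bit layout follows the classic 1-indexed scheme, with parity bits at positions 1, 2, 4):

```python
def hamming74_encode(d):
    """d = [d1, d2, d3, d4] data bits -> 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4      # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # covers positions 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # covers positions 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Fix a single-bit error: the syndrome spells out the flipped position."""
    c = c[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # check positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # check positions 2, 3, 6, 7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]   # check positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s3  # binary position of the error (0 = clean)
    if syndrome:
        c[syndrome - 1] ^= 1
    return c

code = hamming74_encode([1, 0, 1, 1])
corrupted = code[:]
corrupted[5] ^= 1                    # flip any one bit
assert hamming74_correct(corrupted) == code
```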
Channel Capacity
Shannon-Hartley theorem (channel capacity of a bandlimited Gaussian channel)
C = B × log_2(1 + SNR)
C = channel capacity (bits/second)
B = bandwidth (Hz)
SNR = signal-to-noise ratio (linear, not dB)
SNR_dB = 10 × log_10(SNR_linear)
Example: WiFi channel
B = 20 MHz, SNR = 30 dB (1000 linear)
C = 20×10^6 × log_2(1001) ≈ 200 Mbps theoretical max
Key insights:
Doubling bandwidth (at fixed SNR) → doubles capacity
Doubling SNR → adds ~B bits/sec (diminishing returns)
Error-free communication possible at any rate < C
Above C: errors are unavoidable
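The WiFi example and the bandwidth-vs-SNR trade-off, checked numerically (the function name is illustrative):

```python
import math

def shannon_capacity(bandwidth_hz, snr_db):
    """Shannon-Hartley: C = B * log2(1 + SNR), with SNR given in dB."""
    snr = 10 ** (snr_db / 10)              # dB -> linear
    return bandwidth_hz * math.log2(1 + snr)

c = shannon_capacity(20e6, 30)             # the 20 MHz, 30 dB example above
print(f"{c / 1e6:.0f} Mbps")               # ~199 Mbps

# Doubling bandwidth doubles capacity...
assert math.isclose(shannon_capacity(40e6, 30), 2 * c)
# ...while doubling SNR (+3 dB) adds only about B bits/sec
gain = shannon_capacity(20e6, 33) - c
assert abs(gain - 20e6) / 20e6 < 0.02
```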
Mutual information
I(X;Y) = H(X) - H(X|Y)
"How much does knowing Y tell us about X?"
If X and Y independent: I(X;Y) = 0 (knowing Y tells nothing)
If Y = X: I(X;Y) = H(X) (knowing Y tells everything)
Channel capacity: C = max I(X;Y) over all input distributions
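For the binary symmetric channel this maximization has a closed form, C = 1 - H(p), attained by a uniform input distribution (Python sketch):

```python
import math

def h2(p):
    """Binary entropy function H(p) in bits."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p:
    uniform input maximizes I(X;Y), giving C = 1 - H(p)."""
    return 1 - h2(p)

print(bsc_capacity(0.0))   # 1.0  — noiseless channel carries a full bit per use
print(bsc_capacity(0.5))   # 0.0  — output independent of input, I(X;Y) = 0
print(bsc_capacity(0.11))  # ~0.5 — half a bit per use survives the noise
```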
See Also
- Signals — signal processing applies information theory to physical channels
- Cryptography — entropy measures cryptographic strength
- Physics — Landauer’s principle connects information to thermodynamics