cut, sort, uniq Mastery
Philosophy: The Extraction Pipeline
These three commands form the core text extraction pipeline in Unix:
                      cut             sort            uniq
┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│  Input  │ --> │ Extract │ --> │  Order  │ --> │ Unique  │ --> Output
│  Data   │     │ Columns │     │  Lines  │     │  Lines  │
└─────────┘     └─────────┘     └─────────┘     └─────────┘
cut: Column Extraction
The Mental Model
Think of cut as a vertical slicer. While grep selects rows, cut selects columns.
Input (rows):
┌───────────────────────────────┐
│ user:x:1000:1000:Name:/home:sh│ grep selects THIS (rows)
│ root:x:0:0:root:/root:/bin/sh │
└───────────────────────────────┘
↓ ↓ ↓
cut selects THESE (columns)
Delimiter Mode (-d / -f)
# Extract field 1 (username) from /etc/passwd
cut -d: -f1 /etc/passwd
# Extract fields 1 and 7 (user and shell)
cut -d: -f1,7 /etc/passwd
# Extract fields 1 through 4
cut -d: -f1-4 /etc/passwd
# Extract field 3 onwards (UID and everything after)
cut -d: -f3- /etc/passwd
# Extract up to field 3
cut -d: -f-3 /etc/passwd
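One behavior worth knowing: lines that contain no delimiter at all are passed through unchanged, unless you suppress them with -s:

```shell
# Lines WITHOUT the delimiter pass through unchanged by default
printf 'a:b:c\nno-delim\n' | cut -d: -f2
# -> b
# -> no-delim

# -s suppresses lines that contain no delimiter
printf 'a:b:c\nno-delim\n' | cut -d: -f2 -s
# -> b
```

This matters when extracting fields from mixed-format logs, where header or banner lines would otherwise leak through untouched.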
Character Mode (-c)
# First 10 characters of each line
cut -c1-10 file.txt
# Character 5 onwards
cut -c5- file.txt
# Characters 1-8 and 15-20
cut -c1-8,15-20 file.txt
The --complement Flag
Extract everything EXCEPT the specified fields:
# Everything except field 2 (password placeholder)
cut -d: -f2 --complement /etc/passwd
# Remove sensitive columns from CSV export
cut -d, -f3,7 --complement sensitive.csv > sanitized.csv
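The --output-delimiter flag (a GNU cut extension, also listed in the quick reference below) pairs well with multi-field extraction:

```shell
# Re-join the extracted fields with a different separator on output
printf 'root:x:0:0:root:/root:/bin/sh\n' | \
  cut -d: -f1,7 --output-delimiter=' -> '
# -> root -> /bin/sh
```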
sort: Ordering Lines
Numeric Sort (-n)
Without -n, sort compares lexicographically, so 100 sorts before 20 (1, 10, 100, 2, 20). The -n flag compares by numeric value instead (1, 2, 10, 20, 100).
# Numeric sort (100 comes after 20)
sort -n numbers.txt
# Reverse numeric (largest first)
sort -rn numbers.txt
Human-Readable Sort (-h)
# Sort by human-readable sizes (1K, 2M, 3G)
du -h /var/log/* | sort -h
# Largest directories first
du -h /home/* | sort -rh
Version Sort (-V)
# Sort version numbers correctly
echo -e "v1.2\nv1.10\nv1.3" | sort -V
v1.2
v1.3
v1.10
Field-Based Sort (-k)
The -k option specifies which field(s) to sort by.
# Sort by 2nd field (fields are whitespace-separated by default)
sort -k2 file.txt
# Sort by 3rd field, numerically
sort -k3,3n file.txt
# Sort by 3rd field, then by 1st field (secondary sort)
sort -k3,3n -k1,1 file.txt
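A common gotcha worth demonstrating: -kN alone makes the sort key run from field N to the END of the line, while -kN,N restricts it to field N alone.

```shell
# -k2 compares "2 x" vs "2 y" (field 2 through end of line)
printf 'b 2 x\na 2 y\n' | sort -k2
# -k2,2 compares only field 2; on a tie, sort falls back to
# comparing the whole line
printf 'b 2 x\na 2 y\n' | sort -k2,2
```

With -k2 the "b" line sorts first (x < y); with -k2,2 field 2 ties and the whole-line fallback puts the "a" line first.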
Custom Delimiter (-t)
# Sort /etc/passwd by UID (field 3)
sort -t: -k3,3n /etc/passwd
# Sort CSV by 2nd column
sort -t, -k2,2 data.csv
Unique During Sort (-u)
# Sort and deduplicate in one pass
sort -u file.txt
This is more efficient than sort | uniq for large files.
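When locale-aware alphabetical order isn't required, forcing the C locale skips expensive collation and can speed up large sorts considerably (big_file.txt is a stand-in for any large input):

```shell
# Plain byte-order collation is often several times faster than
# sorting under a UTF-8 locale; the order differs for non-ASCII text
LC_ALL=C sort -u big_file.txt
```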
Check If Sorted (-c)
# Returns exit code 0 if sorted, 1 if not
sort -c file.txt && echo "Already sorted" || echo "Not sorted"
uniq: Deduplication
The Critical Rule
uniq only removes ADJACENT duplicate lines. If the input is not sorted (or otherwise grouped), duplicates elsewhere in the file survive. Always sort first.
Basic Operations
# Remove adjacent duplicates
sort file.txt | uniq
# Equivalent: sort with -u flag
sort -u file.txt
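A quick demonstration of the three most-used uniq flags, using an inline sample:

```shell
# -c prefixes each line with its count; -d keeps only duplicated
# lines; -u keeps only lines that appear exactly once
printf 'a\nb\na\n' | sort | uniq -c   # counts: 2 a, 1 b
printf 'a\nb\na\n' | sort | uniq -d   # -> a
printf 'a\nb\na\n' | sort | uniq -u   # -> b
```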
The Canonical Pipeline
The most common pattern:
<input> | cut -d'<delim>' -f<fields> | sort | uniq -c | sort -rn
# Top 10 client IPs from an access log
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
# Count users per login shell
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn
Infrastructure Automation Patterns
SSH Authorized Keys Cleanup
# Extract unique key fingerprints
# Show fingerprints of unique authorized keys
awk '{print $1, $2}' ~/.ssh/authorized_keys |  # key type + base64 blob
  sort -u |
  while read -r key; do
    echo "$key" | ssh-keygen -lf - 2>/dev/null
  done
Find Duplicate MAC Addresses in Network
# Parse ISE endpoints for duplicate MACs
netapi ise --format json ers endpoint | \
jq -r '.[].mac' | \
sort | uniq -d
Log Analysis: Top Error Messages
# Top 10 error patterns from systemd journal
journalctl -p err --since today --no-pager |
  cut -d: -f4- |  # strip timestamp/host/service prefix
  sort | uniq -c | sort -rn | head -10
DNS Query Analysis from pfSense
# Top queried domains
ssh pfsense "cat /var/log/resolver.log" 2>/dev/null | \
grep "query:" | \
awk '{print $8}' | \
cut -d'(' -f1 | \
sort | uniq -c | sort -rn | head -20
Certificate Expiry by Issuer
# Group certificates by issuer
netapi ise api-call openapi GET '/api/v1/certs/trusted-certificate?size=100' | \
jq -r '.response[].issuedBy' | \
sort | uniq -c | sort -rn
IP Range Discovery
# Find unique subnets from IP list
cut -d. -f1-3 ip_list.txt |  # first 3 octets = /24
  sort -u |
  while read -r subnet; do
    echo "${subnet}.0/24"
  done
Process Memory Usage Summary
# Memory by process name (RSS in KB)
ps aux --no-headers | \
awk '{sum[$11]+=$6} END {for (proc in sum) print sum[proc], proc}' | \
sort -rn | head -15
Systemd Service State Summary
# Count services by state
systemctl list-units --type=service --all --no-legend | \
awk '{print $3}' | \
sort | uniq -c | sort -rn
Git Author Statistics
# Commits per author
git log --format='%aN' | sort | uniq -c | sort -rn
# Lines changed per author
git log --format='%aN' --numstat | \
  awk '/^[0-9]/ {adds[author]+=$1; dels[author]+=$2}
       /^[^0-9]/ {author=$0}
       END {for (a in adds) print adds[a]+dels[a], adds[a], dels[a], a}' | \
  sort -rn
Network Connection Summary
# Connections per state
ss -tnap | tail -n+2 | \
awk '{print $1}' | \
sort | uniq -c | sort -rn
# Connections per remote IP
ss -tn | tail -n+2 | \
awk '{print $5}' | \
cut -d: -f1 | \
sort | uniq -c | sort -rn | head -10
Advanced Patterns
Frequency Analysis with Percentages
# Add percentage to frequency count
# Add percentage of the TOTAL to each frequency count
<input> | sort | uniq -c | sort -rn | \
  awk '{line[NR]=$0; cnt[NR]=$1; total+=$1}
       END {for (i=1; i<=NR; i++)
              printf "%s  %.2f%%\n", line[i], 100*cnt[i]/total}'
Top N with "Other" Aggregation
# Top 5 with everything else as "Other"
<input> | sort | uniq -c | sort -rn | \
awk 'NR<=5 {print}
     NR>5  {other+=$1}
     END {if (other>0) printf "%7d Other\n", other}'
Field Comparison Between Files
# IPs in file1 but not in file2
cut -d, -f1 file1.csv | sort -u > /tmp/ips1.txt
cut -d, -f1 file2.csv | sort -u > /tmp/ips2.txt
comm -23 /tmp/ips1.txt /tmp/ips2.txt
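comm covers the other set operations too; since both temp files are already sorted, intersection is one flag away:

```shell
# -12 suppresses lines unique to each file, leaving the intersection
comm -12 /tmp/ips1.txt /tmp/ips2.txt
# -13: only in file2;  -23: only in file1
```

comm requires sorted input, which is why each cut output above goes through sort -u first.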
Histogram Generation
# ASCII histogram of log times by hour
awk '{print $4}' access.log | cut -d: -f2 | sort | uniq -c | \
awk '{printf "%2s %6d ", $2, $1; for (i=0; i<$1/100; i++) printf "#"; print ""}'
00   1234 ############
01    567 #####
02    234 ##
...
Rolling Deduplication (Stream Processing)
# Deduplicate keeping first occurrence, no sorting needed
awk '!seen[$0]++' file.txt
This preserves input order and is often faster than sort | uniq on large files when you only need unique lines and don't need them sorted.
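The same trick generalizes to deduplicating on a key field rather than the whole line (data.csv here is a stand-in):

```shell
# Keep the first row seen for each value of column 1
awk -F, '!seen[$1]++' data.csv
```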
Multi-Column Unique
# Unique combinations of columns 1 and 3
cut -d, -f1,3 data.csv | sort -u
# Using awk for complex key
awk -F, '{print $1 "|" $3}' data.csv | sort -u
Common Mistakes and Fixes
Mistake: Using uniq Without sort
# WRONG - adjacent duplicates only
cat file.txt | uniq
# CORRECT
sort file.txt | uniq
Mistake: Numeric Sort Without -n
# WRONG - lexicographic (1, 10, 100, 2, 20)
sort numbers.txt
# CORRECT - numeric (1, 2, 10, 20, 100)
sort -n numbers.txt
Mistake: Wrong Field Delimiter
# WRONG - default is tab, not space
cut -f2 space_separated.txt
# CORRECT - specify delimiter
cut -d' ' -f2 space_separated.txt
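Even with the right delimiter, cut has a whitespace gotcha worth knowing:

```shell
# cut treats EVERY space as a separator, so runs of spaces create
# empty fields; awk collapses runs of whitespace instead
echo 'a   b' | cut -d' ' -f2     # -> empty (field 2 is "")
echo 'a   b' | awk '{print $2}'  # -> b
```

For data with variable spacing (ps, ls, ss output), prefer awk.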
Performance Comparison
| Operation | Tool | Notes |
|---|---|---|
| Extract columns | cut | Fastest for simple field extraction |
| Extract columns (complex) | awk | More flexible, slightly slower |
| Sort small files | sort | In-memory, very fast |
| Sort huge files | sort | External merge sort |
| Unique (must preserve order) | awk '!seen[$0]++' | O(n) single pass, uses hash |
| Unique (order doesn't matter) | sort -u | O(n log n) but optimized |
| Count occurrences | sort \| uniq -c | Standard approach |
| Count (huge files) | awk | Single pass, hash-based |
Quick Reference
CUT
-d<char> Field delimiter
-f<n> Field number(s) (1-indexed)
-f1,3 Fields 1 and 3
-f1-3 Fields 1 through 3
-f3- Field 3 to end
-c<n> Character positions
--complement Invert selection
--output-delimiter=<str> Output separator
SORT
-n Numeric sort
-r Reverse
-h Human-readable (1K, 2M)
-V Version sort (1.10 > 1.9)
-k<n> Sort by field n
-k<n>,<m> Sort by fields n through m
-t<char> Field delimiter
-u Unique (remove duplicates)
-f Case-insensitive
-s Stable sort
-R Random order (groups identical lines; use shuf for a true shuffle)
-c Check if sorted
-S <size> Buffer size (e.g., 50%)
--parallel=N Use N threads
UNIQ
-c Count occurrences
-d Only show duplicates
-u Only show unique (appear once)
-i Case-insensitive
-f<n> Skip first n fields
-s<n> Skip first n characters
-w<n> Compare only first n characters
See Also
- AWK Mastery - When cut isn't enough
- Grep Mastery - Row selection before cut
- Stream Processing - Pipeline composition
- xargs Mastery - Multiply pipeline output