cut, sort, uniq Mastery

Philosophy: The Extraction Pipeline

These three commands form the core text extraction pipeline in Unix:

           cut                  sort                uniq
┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│  Input  │ --> │ Extract │ --> │  Order  │ --> │  Unique │ --> Output
│  Data   │     │ Columns │     │  Lines  │     │  Lines  │
└─────────┘     └─────────┘     └─────────┘     └─────────┘

uniq requires sorted input; forgetting this is the most common mistake. uniq removes only adjacent duplicates - it does not search the entire stream.
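A quick way to see the adjacency rule, using printf to fabricate a three-line input:

```shell
# "a" appears twice, but the copies are not adjacent, so plain uniq keeps both
printf 'a\nb\na\n' | uniq          # three lines: a, b, a
# Sorting first makes the duplicates adjacent, so uniq can collapse them
printf 'a\nb\na\n' | sort | uniq   # two lines: a, b
```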

cut: Column Extraction

The Mental Model

Think of cut as a vertical slicer. While grep selects rows, cut selects columns.

Input (rows):
┌───────────────────────────────┐
│ user:x:1000:1000:Name:/home:sh│  grep selects THIS (rows)
│ root:x:0:0:root:/root:/bin/sh │
└───────────────────────────────┘
           ↓ ↓ ↓
         cut selects THESE (columns)

Delimiter Mode (-d / -f)

# Extract field 1 (username) from /etc/passwd
cut -d: -f1 /etc/passwd
# Extract fields 1 and 7 (user and shell)
cut -d: -f1,7 /etc/passwd
# Extract fields 1 through 4
cut -d: -f1-4 /etc/passwd
# Extract field 3 onwards (UID and everything after)
cut -d: -f3- /etc/passwd
# Extract up to field 3
cut -d: -f-3 /etc/passwd
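The same selections can be tried without touching /etc/passwd by piping in a fabricated passwd-style line (the username and paths here are made up):

```shell
# cut joins the selected fields with the input delimiter
echo 'alice:x:1000:1000:Alice:/home/alice:/bin/bash' | cut -d: -f1,7
# -> alice:/bin/bash
echo 'alice:x:1000:1000:Alice:/home/alice:/bin/bash' | cut -d: -f3-
# -> 1000:1000:Alice:/home/alice:/bin/bash
```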

Character Mode (-c)

# First 10 characters of each line
cut -c1-10 file.txt
# Character 5 onwards
cut -c5- file.txt
# Characters 1-8 and 15-20
cut -c1-8,15-20 file.txt

Byte Mode (-b)

# First 100 bytes per line (useful for fixed-width binary records)
cut -b1-100 data.bin

The --complement Flag

Extract everything EXCEPT the specified fields:

# Everything except field 2 (password placeholder)
cut -d: -f2 --complement /etc/passwd
# Remove sensitive columns from CSV export
cut -d, -f3,7 --complement sensitive.csv > sanitized.csv

Multi-Character Delimiters (cut Can’t Do This)

cut only supports single-character delimiters. For multi-character delimiters, use awk:

# cut -d"::" doesn't work!

# Use awk instead
echo "field1::field2::field3" | awk -F'::' '{print $2}'

The Output Delimiter (--output-delimiter)

# Change delimiter in output
cut -d: -f1,6,7 --output-delimiter='|' /etc/passwd
Output
root|/root|/bin/bash
evanusmodestus|/home/evanusmodestus|/bin/zsh

sort: Ordering Lines

Default Behavior

# Lexicographic (dictionary) sort
sort file.txt
# Reverse order
sort -r file.txt

Numeric Sort (-n)

Without -n, "100" comes before "20" because "1" < "2" lexicographically.

# Numeric sort (100 comes after 20)
sort -n numbers.txt
# Reverse numeric (largest first)
sort -rn numbers.txt
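The difference is easy to verify inline with a fabricated list:

```shell
printf '100\n20\n3\n' | sort      # lexicographic: 100, 20, 3
printf '100\n20\n3\n' | sort -n   # numeric: 3, 20, 100
```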

Human-Readable Sort (-h)

# Sort by human-readable sizes (1K, 2M, 3G)
du -h /var/log/* | sort -h
# Largest directories first
du -h /home/* | sort -rh

Version Sort (-V)

# Sort version numbers correctly
echo -e "v1.2\nv1.10\nv1.3" | sort -V
Output (correct)
v1.2
v1.3
v1.10

Field-Based Sort (-k)

The -k option specifies which field(s) to sort by.

# Sort by 2nd field (tab-separated by default)
sort -k2 file.txt
# Sort by 3rd field, numerically
sort -k3,3n file.txt
# Sort by 3rd field, then by 1st field (secondary sort)
sort -k3,3n -k1,1 file.txt

Field Specification Details

-k3       = Sort by field 3 to end of line
-k3,3     = Sort by field 3 only
-k3,3n    = Sort by field 3, numerically
-k3,3r    = Sort by field 3, reverse
-k3,3nr   = Sort by field 3, numeric reverse
-k3.2,3.5 = Sort by characters 2-5 within field 3
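A worked example of primary and secondary keys, on fabricated name/score lines (bob and cid tie on the score in field 2, so the name in field 1 breaks the tie):

```shell
printf 'bob 10 x\nann 9 x\ncid 10 x\n' | sort -k2,2n -k1,1
# -> ann 9 x
#    bob 10 x
#    cid 10 x
```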

Custom Delimiter (-t)

# Sort /etc/passwd by UID (field 3)
sort -t: -k3,3n /etc/passwd
# Sort CSV by 2nd column
sort -t, -k2,2 data.csv

Unique During Sort (-u)

# Sort and deduplicate in one pass
sort -u file.txt

This is more efficient than sort | uniq for large files.

Stable Sort (-s)

Preserve original order when keys are equal:

sort -s -k2,2 file.txt

Check If Sorted (-c)

# Returns exit code 0 if sorted, 1 if not
sort -c file.txt && echo "Already sorted" || echo "Not sorted"

Random Sort (-R)

# Shuffle lines randomly
sort -R file.txt
# Select 5 random lines
sort -R file.txt | head -5
# Note: sort -R hashes each key, so identical lines stay grouped together;
# for true per-line sampling, prefer: shuf -n 5 file.txt

Memory and Performance

# Use more memory for faster sorting (large files)
sort -S 50% file.txt  # Use 50% of available RAM
# Use temporary directory for huge files
sort -T /mnt/fast-ssd file.txt
# Parallel sort (use multiple cores)
sort --parallel=8 huge_file.txt

uniq: Deduplication

The Critical Rule

uniq only removes ADJACENT duplicates!

# WRONG - won't work correctly
cat file.txt | uniq

# CORRECT - must sort first
sort file.txt | uniq

Basic Operations

# Remove adjacent duplicates
sort file.txt | uniq
# Equivalent: sort with -u flag
sort -u file.txt

Count Occurrences (-c)

# Show count of each unique line
sort file.txt | uniq -c
Output
      3 apple
      1 banana
      5 cherry
# Sort by frequency (most common first)
sort file.txt | uniq -c | sort -rn

Show Only Duplicates (-d)

# Show only lines that appear more than once
sort file.txt | uniq -d

Show Only Unique (-u)

# Show only lines that appear exactly once
sort file.txt | uniq -u

Ignore Case (-i)

# Case-insensitive deduplication
sort -f file.txt | uniq -i
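Why both flags are needed: sort -f folds case so the variants become adjacent, then uniq -i treats them as equal (which case variant survives depends on locale):

```shell
# Without the case-folding sort, "APPLE" and "apple" may never sit next to each other
printf 'apple\nBanana\nAPPLE\n' | sort -f | uniq -i | wc -l   # 2 lines remain
```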

Skip Fields (-f)

# Ignore first 2 fields when comparing
sort file.txt | uniq -f 2

Skip Characters (-s)

# Ignore first 5 characters when comparing
sort file.txt | uniq -s 5

Compare Only N Characters (-w)

# Compare only first 10 characters
sort file.txt | uniq -w 10
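One common use: collapsing runs of log lines that share a timestamp prefix. With fabricated lines whose first 10 characters are the date:

```shell
printf '2024-01-01 first\n2024-01-01 second\n2024-01-02 third\n' | uniq -w 10
# -> 2024-01-01 first
#    2024-01-02 third
```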

The Canonical Pipeline

The most common pattern:

<input> | cut -d'<delim>' -f<fields> | sort | uniq -c | sort -rn

# Example: Top 10 IP addresses in access log
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10

# Example: Unique shells in use
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn

Infrastructure Automation Patterns

SSH Authorized Keys Cleanup

# Extract unique key fingerprints
awk '{print $1, $2}' ~/.ssh/authorized_keys |   # Extract key type + base64 blob
  sort -u |
  while read -r key; do
    echo "$key" | ssh-keygen -lf - 2>/dev/null
  done

Find Duplicate MAC Addresses in Network

# Parse ISE endpoints for duplicate MACs
netapi ise --format json ers endpoint | \
  jq -r '.[].mac' | \
  sort | uniq -d

Log Analysis: Top Error Messages

# Top 10 error patterns from systemd journal
journalctl -p err --since today --no-pager | \
  cut -d: -f4- |                # Remove timestamp/host/service prefix
  sort | uniq -c | sort -rn | head -10

DNS Query Analysis from pfSense

# Top queried domains
ssh pfsense "cat /var/log/resolver.log" 2>/dev/null | \
  grep "query:" | \
  awk '{print $8}' | \
  cut -d'(' -f1 | \
  sort | uniq -c | sort -rn | head -20

Certificate Expiry by Issuer

# Group certificates by issuer
netapi ise api-call openapi GET '/api/v1/certs/trusted-certificate?size=100' | \
  jq -r '.response[].issuedBy' | \
  sort | uniq -c | sort -rn

IP Range Discovery

# Find unique subnets from IP list
cut -d. -f1-3 ip_list.txt |     # Extract first 3 octets (/24)
  sort -u |
  while read -r subnet; do
    echo "${subnet}.0/24"
  done

Process Memory Usage Summary

# Memory by process name (RSS in KB)
ps aux --no-headers | \
  awk '{sum[$11]+=$6} END {for (proc in sum) print sum[proc], proc}' | \
  sort -rn | head -15

Systemd Service State Summary

# Count services by state
systemctl list-units --type=service --all --no-legend | \
  awk '{print $3}' | \
  sort | uniq -c | sort -rn

Git Author Statistics

# Commits per author
git log --format='%aN' | sort | uniq -c | sort -rn
# Lines changed per author
git log --format='%aN' --numstat | \
  awk '/^[0-9]/ {adds[author]+=$1; dels[author]+=$2}
       /^[^0-9-]/ {author=$0}
       END {for (a in adds) print adds[a]+dels[a], adds[a], dels[a], a}' | \
  sort -rn

Network Connection Summary

# Connections per state
ss -tnap | tail -n+2 | \
  awk '{print $1}' | \
  sort | uniq -c | sort -rn
# Connections per remote IP
ss -tn | tail -n+2 | \
  awk '{print $5}' | \
  cut -d: -f1 | \
  sort | uniq -c | sort -rn | head -10

Package Installation Dates (RHEL)

# Packages installed per day
rpm -qa --queryformat '%{INSTALLTIME:day}\n' | \
  sort | uniq -c | sort -k2

Arch Package Installation Dates

# Packages installed per day
grep "installed" /var/log/pacman.log | \
  cut -d' ' -f1 | cut -d'[' -f2 | cut -dT -f1 | \
  sort | uniq -c | sort -k2

Advanced Patterns

Frequency Analysis with Percentages

# Add each line's percentage of the total to the frequency count
<input> | sort | uniq -c | sort -rn | \
  awk '{count[NR]=$1; line[NR]=$0; total+=$1}
       END {for (i=1; i<=NR; i++)
              printf "%s %6.2f%%\n", line[i], (count[i]/total)*100}'

Top N with "Other" Aggregation

# Top 5 with everything else as "Other"
<input> | sort | uniq -c | sort -rn | \
  awk 'NR<=5 {print}
       NR>5 {other+=$1}
       END {if (other>0) printf "%7d Other\n", other}'

Field Comparison Between Files

# IPs in file1 but not in file2
cut -d, -f1 file1.csv | sort -u > /tmp/ips1.txt
cut -d, -f1 file2.csv | sort -u > /tmp/ips2.txt
comm -23 /tmp/ips1.txt /tmp/ips2.txt

Histogram Generation

# ASCII histogram of log times by hour
awk '{print $4}' access.log | cut -d: -f2 | sort | uniq -c | \
  awk '{printf "%2s %6d ", $2, $1; for (i=0; i<$1/100; i++) printf "#"; print ""}'
Output
00   1234 ############
01    567 #####
02    234 ##
...

Rolling Deduplication (Stream Processing)

# Deduplicate keeping first occurrence, no sorting needed
awk '!seen[$0]++' file.txt

This is much faster than sort | uniq for large files where you just need unique lines (don’t care about order).
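A quick check that order is preserved and the first occurrence wins:

```shell
printf 'b\na\nb\nc\na\n' | awk '!seen[$0]++'   # -> b, a, c (input order kept)
```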

Multi-Column Unique

# Unique combinations of columns 1 and 3
cut -d, -f1,3 data.csv | sort -u
# Using awk for complex key
awk -F, '{print $1 "|" $3}' data.csv | sort -u

Time-Based Aggregation

# Events per minute (assumes ISO timestamps, e.g. 2024-01-15T10:30:45)
awk '{print $1}' syslog |       # Extract the timestamp field
  cut -c1-16 |                  # Keep YYYY-MM-DDTHH:MM (minute precision)
  sort | uniq -c | sort -k2

Set Operations

# Union (all unique lines from both files)
sort file1.txt file2.txt | uniq

# Intersection (lines in both files)
sort file1.txt file2.txt | uniq -d

# Difference (lines only in file1; assumes no duplicate lines within either file)
sort file1.txt file2.txt file2.txt | uniq -u
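These identities can be verified with two fabricated sets, {a,b,c} and {b,c,d} (the /tmp file names are placeholders):

```shell
printf 'a\nb\nc\n' > /tmp/set1.txt
printf 'b\nc\nd\n' > /tmp/set2.txt
sort /tmp/set1.txt /tmp/set2.txt | uniq                     # union: a b c d
sort /tmp/set1.txt /tmp/set2.txt | uniq -d                  # intersection: b c
sort /tmp/set1.txt /tmp/set2.txt /tmp/set2.txt | uniq -u    # difference: a
```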

Common Mistakes and Fixes

Mistake: Using uniq Without sort

# WRONG - adjacent duplicates only
cat file.txt | uniq

# CORRECT
sort file.txt | uniq

Mistake: Numeric Sort Without -n

# WRONG - lexicographic (1, 10, 100, 2, 20)
sort numbers.txt

# CORRECT - numeric (1, 2, 10, 20, 100)
sort -n numbers.txt

Mistake: Wrong Field Delimiter

# WRONG - default is tab, not space
cut -f2 space_separated.txt

# CORRECT - specify delimiter
cut -d' ' -f2 space_separated.txt

Mistake: cut Field Numbers Start at 1

# WRONG - field 0 doesn't exist
cut -d: -f0 /etc/passwd

# CORRECT - first field is 1
cut -d: -f1 /etc/passwd

Mistake: Forgetting sort Is Case-Sensitive

# Case-sensitive (default)
sort file.txt  # "Apple" before "apple"

# Case-insensitive
sort -f file.txt

Performance Comparison

Operation                      Tool                          Notes
-----------------------------  ----------------------------  ------------------------------------
Extract columns                cut                           Fastest for simple field extraction
Extract columns (complex)      awk                           More flexible, slightly slower
Sort small files               sort                          In-memory, very fast
Sort huge files                sort -S 50% --parallel=N      External merge sort
Unique (must preserve order)   awk '!seen[$0]++'             O(n) single pass, uses hash
Unique (order doesn’t matter)  sort -u                       O(n log n) but optimized
Count occurrences              sort | uniq -c                Standard approach
Count (huge files)             awk '{count[$0]++} END {…}'   Single pass, hash-based

Quick Reference

CUT
  -d<char>    Field delimiter
  -f<n>       Field number(s) (1-indexed)
  -f1,3       Fields 1 and 3
  -f1-3       Fields 1 through 3
  -f3-        Field 3 to end
  -c<n>       Character positions
  --complement  Invert selection
  --output-delimiter=<str>  Output separator

SORT
  -n          Numeric sort
  -r          Reverse
  -h          Human-readable (1K, 2M)
  -V          Version sort (1.10 > 1.9)
  -k<n>       Sort by field n
  -k<n>,<m>   Sort by fields n through m
  -t<char>    Field delimiter
  -u          Unique (remove duplicates)
  -f          Case-insensitive
  -s          Stable sort
  -R          Random shuffle
  -c          Check if sorted
  -S <size>   Buffer size (e.g., 50%)
  --parallel=N  Use N threads

UNIQ
  -c          Count occurrences
  -d          Only show duplicates
  -u          Only show unique (appear once)
  -i          Case-insensitive
  -f<n>       Skip first n fields
  -s<n>       Skip first n characters
  -w<n>       Compare only first n characters

See Also