cut, sort, uniq Mastery
Philosophy: The Extraction Pipeline
These three commands form the core text extraction pipeline in Unix:
                      cut             sort            uniq
┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐
│  Input  │ --> │ Extract │ --> │  Order  │ --> │ Unique  │ --> Output
│  Data   │     │ Columns │     │  Lines  │     │  Lines  │
└─────────┘     └─────────┘     └─────────┘     └─────────┘
cut: Column Extraction
The Mental Model
Think of cut as a vertical slicer. While grep selects rows, cut selects columns.
Input (rows):
┌───────────────────────────────┐
│ user:x:1000:1000:Name:/home:sh│ grep selects THIS (rows)
│ root:x:0:0:root:/root:/bin/sh │
└───────────────────────────────┘
↓ ↓ ↓
cut selects THESE (columns)
Delimiter Mode (-d / -f)
# Extract field 1 (username) from /etc/passwd
cut -d: -f1 /etc/passwd
# Extract fields 1 and 7 (user and shell)
cut -d: -f1,7 /etc/passwd
# Extract fields 1 through 4
cut -d: -f1-4 /etc/passwd
# Extract field 3 onwards (UID and everything after)
cut -d: -f3- /etc/passwd
# Extract up to field 3
cut -d: -f-3 /etc/passwd
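One behavior worth knowing: lines that contain no delimiter at all are passed through unchanged, unless you suppress them with -s:

```shell
# Lines WITHOUT the delimiter pass through unchanged by default
printf 'a:b:c\nno-delim\n' | cut -d: -f2
# -> b
# -> no-delim

# -s suppresses lines that contain no delimiter
printf 'a:b:c\nno-delim\n' | cut -d: -f2 -s
# -> b
```

This matters when extracting fields from mixed-format logs, where header or banner lines would otherwise leak through untouched.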
Character Mode (-c)
# First 10 characters of each line
cut -c1-10 file.txt
# Character 5 onwards
cut -c5- file.txt
# Characters 1-8 and 15-20
cut -c1-8,15-20 file.txt
The --complement Flag
Extract everything EXCEPT the specified fields:
# Everything except field 2 (password placeholder)
cut -d: -f2 --complement /etc/passwd
# Remove sensitive columns from CSV export
cut -d, -f3,7 --complement sensitive.csv > sanitized.csv
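The --output-delimiter flag (a GNU cut extension, also listed in the quick reference below) pairs well with multi-field extraction:

```shell
# Re-join the extracted fields with a different separator on output
printf 'root:x:0:0:root:/root:/bin/sh\n' | \
  cut -d: -f1,7 --output-delimiter=' -> '
# -> root -> /bin/sh
```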
sort: Ordering Lines
Numeric Sort (-n)
Without -n, sort compares lexicographically, so 100 sorts before 20 (1, 10, 100, 2, 20). The -n flag compares by numeric value instead (1, 2, 10, 20, 100).
# Numeric sort (100 comes after 20)
sort -n numbers.txt
# Reverse numeric (largest first)
sort -rn numbers.txt
Human-Readable Sort (-h)
# Sort by human-readable sizes (1K, 2M, 3G)
du -h /var/log/* | sort -h
# Largest directories first
du -h /home/* | sort -rh
Version Sort (-V)
# Sort version numbers correctly
echo -e "v1.2\nv1.10\nv1.3" | sort -V
v1.2
v1.3
v1.10
Field-Based Sort (-k)
The -k option specifies which field(s) to sort by.
# Sort by 2nd field (fields are whitespace-separated by default)
sort -k2 file.txt
# Sort by 3rd field, numerically
sort -k3,3n file.txt
# Sort by 3rd field, then by 1st field (secondary sort)
sort -k3,3n -k1,1 file.txt
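A common gotcha worth demonstrating: -kN alone makes the sort key run from field N to the END of the line, while -kN,N restricts it to field N alone.

```shell
# -k2 compares "2 x" vs "2 y" (field 2 through end of line)
printf 'b 2 x\na 2 y\n' | sort -k2
# -k2,2 compares only field 2; on a tie, sort falls back to
# comparing the whole line
printf 'b 2 x\na 2 y\n' | sort -k2,2
```

With -k2 the "b" line sorts first (x < y); with -k2,2 field 2 ties and the whole-line fallback puts the "a" line first.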
Custom Delimiter (-t)
# Sort /etc/passwd by UID (field 3)
sort -t: -k3,3n /etc/passwd
# Sort CSV by 2nd column
sort -t, -k2,2 data.csv
Unique During Sort (-u)
# Sort and deduplicate in one pass
sort -u file.txt
This is more efficient than sort | uniq for large files.
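When locale-aware alphabetical order isn't required, forcing the C locale skips expensive collation and can speed up large sorts considerably (big_file.txt is a stand-in for any large input):

```shell
# Plain byte-order collation is often several times faster than
# sorting under a UTF-8 locale; the order differs for non-ASCII text
LC_ALL=C sort -u big_file.txt
```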
Check If Sorted (-c)
# Returns exit code 0 if sorted, 1 if not
sort -c file.txt && echo "Already sorted" || echo "Not sorted"
uniq: Deduplication
The Critical Rule
uniq only removes ADJACENT duplicate lines. If the input is not sorted (or otherwise grouped), duplicates elsewhere in the file survive. Always sort first.
Basic Operations
# Remove adjacent duplicates
sort file.txt | uniq
# Equivalent: sort with -u flag
sort -u file.txt
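A quick demonstration of the three most-used uniq flags, using an inline sample:

```shell
# -c prefixes each line with its count; -d keeps only duplicated
# lines; -u keeps only lines that appear exactly once
printf 'a\nb\na\n' | sort | uniq -c   # counts: 2 a, 1 b
printf 'a\nb\na\n' | sort | uniq -d   # -> a
printf 'a\nb\na\n' | sort | uniq -u   # -> b
```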
The Canonical Pipeline
The most common pattern:
<input> | cut -d'<delim>' -f<fields> | sort | uniq -c | sort -rn
# Top 10 client IPs from an access log
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
# Count users per login shell
cut -d: -f7 /etc/passwd | sort | uniq -c | sort -rn
Infrastructure Automation Patterns
SSH Authorized Keys Cleanup
# Extract unique key fingerprints
# Show fingerprints of unique authorized keys
awk '{print $1, $2}' ~/.ssh/authorized_keys |  # key type + base64 blob
  sort -u |
  while read -r key; do
    echo "$key" | ssh-keygen -lf - 2>/dev/null
  done
Find Duplicate MAC Addresses in Network
# Parse ISE endpoints for duplicate MACs
netapi ise --format json ers endpoint | \
jq -r '.[].mac' | \
sort | uniq -d
Log Analysis: Top Error Messages
# Top 10 error patterns from systemd journal
journalctl -p err --since today --no-pager |
  cut -d: -f4- |  # strip timestamp/host/service prefix
  sort | uniq -c | sort -rn | head -10
DNS Query Analysis from pfSense
# Top queried domains
ssh pfsense "cat /var/log/resolver.log" 2>/dev/null | \
grep "query:" | \
awk '{print $8}' | \
cut -d'(' -f1 | \
sort | uniq -c | sort -rn | head -20
Certificate Expiry by Issuer
# Group certificates by issuer
netapi ise api-call openapi GET '/api/v1/certs/trusted-certificate?size=100' | \
jq -r '.response[].issuedBy' | \
sort | uniq -c | sort -rn
IP Range Discovery
# Find unique subnets from IP list
cut -d. -f1-3 ip_list.txt |  # first 3 octets = /24
  sort -u |
  while read -r subnet; do
    echo "${subnet}.0/24"
  done
Process Memory Usage Summary
# Memory by process name (RSS in KB)
ps aux --no-headers | \
awk '{sum[$11]+=$6} END {for (proc in sum) print sum[proc], proc}' | \
sort -rn | head -15
Systemd Service State Summary
# Count services by state
systemctl list-units --type=service --all --no-legend | \
awk '{print $3}' | \
sort | uniq -c | sort -rn
Git Author Statistics
# Commits per author
git log --format='%aN' | sort | uniq -c | sort -rn
# Lines changed per author
git log --format='%aN' --numstat | \
  awk '/^[0-9]/ {adds[author]+=$1; dels[author]+=$2}
       /^[^0-9]/ {author=$0}
       END {for (a in adds) print adds[a]+dels[a], adds[a], dels[a], a}' | \
  sort -rn
Network Connection Summary
# Connections per state
ss -tnap | tail -n+2 | \
awk '{print $1}' | \
sort | uniq -c | sort -rn
# Connections per remote IP
ss -tn | tail -n+2 | \
awk '{print $5}' | \
cut -d: -f1 | \
sort | uniq -c | sort -rn | head -10
Advanced Patterns
Frequency Analysis with Percentages
# Add percentage to frequency count
# Add percentage of the TOTAL to each frequency count
<input> | sort | uniq -c | sort -rn | \
  awk '{line[NR]=$0; cnt[NR]=$1; total+=$1}
       END {for (i=1; i<=NR; i++)
              printf "%s  %.2f%%\n", line[i], 100*cnt[i]/total}'
Top N with "Other" Aggregation
# Top 5 with everything else as "Other"
<input> | sort | uniq -c | sort -rn | \
awk 'NR<=5 {print}
     NR>5  {other+=$1}
     END {if (other>0) printf "%7d Other\n", other}'
Field Comparison Between Files
# IPs in file1 but not in file2
cut -d, -f1 file1.csv | sort -u > /tmp/ips1.txt
cut -d, -f1 file2.csv | sort -u > /tmp/ips2.txt
comm -23 /tmp/ips1.txt /tmp/ips2.txt
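comm covers the other set operations too; since both temp files are already sorted, intersection is one flag away:

```shell
# -12 suppresses lines unique to each file, leaving the intersection
comm -12 /tmp/ips1.txt /tmp/ips2.txt
# -13: only in file2;  -23: only in file1
```

comm requires sorted input, which is why each cut output above goes through sort -u first.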
Histogram Generation
# ASCII histogram of log times by hour
awk '{print $4}' access.log | cut -d: -f2 | sort | uniq -c | \
awk '{printf "%2s %6d ", $2, $1; for (i=0; i<$1/100; i++) printf "#"; print ""}'
00   1234 ############
01    567 #####
02    234 ##
...
Rolling Deduplication (Stream Processing)
# Deduplicate keeping first occurrence, no sorting needed
awk '!seen[$0]++' file.txt
This preserves input order and is often faster than sort | uniq on large files when you only need unique lines and don't need them sorted.
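The same trick generalizes to deduplicating on a key field rather than the whole line (data.csv here is a stand-in):

```shell
# Keep the first row seen for each value of column 1
awk -F, '!seen[$1]++' data.csv
```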
Multi-Column Unique
# Unique combinations of columns 1 and 3
cut -d, -f1,3 data.csv | sort -u
# Using awk for complex key
awk -F, '{print $1 "|" $3}' data.csv | sort -u
Common Mistakes and Fixes
Mistake: Using uniq Without sort
# WRONG - adjacent duplicates only
cat file.txt | uniq
# CORRECT
sort file.txt | uniq
Mistake: Numeric Sort Without -n
# WRONG - lexicographic (1, 10, 100, 2, 20)
sort numbers.txt
# CORRECT - numeric (1, 2, 10, 20, 100)
sort -n numbers.txt
Mistake: Wrong Field Delimiter
# WRONG - default is tab, not space
cut -f2 space_separated.txt
# CORRECT - specify delimiter
cut -d' ' -f2 space_separated.txt
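Even with the right delimiter, cut has a whitespace gotcha worth knowing:

```shell
# cut treats EVERY space as a separator, so runs of spaces create
# empty fields; awk collapses runs of whitespace instead
echo 'a   b' | cut -d' ' -f2     # -> empty (field 2 is "")
echo 'a   b' | awk '{print $2}'  # -> b
```

For data with variable spacing (ps, ls, ss output), prefer awk.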
Performance Comparison
| Operation | Tool | Notes |
|---|---|---|
| Extract columns | cut | Fastest for simple field extraction |
| Extract columns (complex) | awk | More flexible, slightly slower |
| Sort small files | sort | In-memory, very fast |
| Sort huge files | sort | External merge sort |
| Unique (must preserve order) | awk '!seen[$0]++' | O(n) single pass, uses hash |
| Unique (order doesn't matter) | sort -u | O(n log n) but optimized |
| Count occurrences | sort \| uniq -c | Standard approach |
| Count (huge files) | awk | Single pass, hash-based |
Quick Reference
CUT
-d<char> Field delimiter
-f<n> Field number(s) (1-indexed)
-f1,3 Fields 1 and 3
-f1-3 Fields 1 through 3
-f3- Field 3 to end
-c<n> Character positions
--complement Invert selection
--output-delimiter=<str> Output separator
SORT
-n Numeric sort
-r Reverse
-h Human-readable (1K, 2M)
-V Version sort (1.10 > 1.9)
-k<n> Sort by field n
-k<n>,<m> Sort by fields n through m
-t<char> Field delimiter
-u Unique (remove duplicates)
-f Case-insensitive
-s Stable sort
-R Random order (groups identical lines; use shuf for a true shuffle)
-c Check if sorted
-S <size> Buffer size (e.g., 50%)
--parallel=N Use N threads
UNIQ
-c Count occurrences
-d Only show duplicates
-u Only show unique (appear once)
-i Case-insensitive
-f<n> Skip first n fields
-s<n> Skip first n characters
-w<n> Compare only first n characters
See Also
- AWK Mastery - When cut isn't enough
- Grep Mastery - Row selection before cut
- Stream Processing - Pipeline composition
- xargs Mastery - Multiply pipeline output