Regex Session 06: awk Regex Power

awk combines regex pattern matching with field-based data processing. Reach for it when a task needs fields, arithmetic, or per-line logic that grep and sed handle poorly.

awk Basics

Syntax: awk 'pattern { action }' file

  • If pattern matches, action executes

  • Default action is { print }

  • Default pattern is "match all lines"
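The two defaults can be seen side by side. This is a minimal sketch, assuming a made-up three-line input; the variable names are illustrative and not part of the session files.

```shell
# Hypothetical sample input
sample='alpha 1
beta 2
gamma 3'

# Pattern only: the default action { print } prints the whole matching line
pattern_only=$(printf '%s\n' "$sample" | awk '/^b/')
echo "$pattern_only"    # beta 2

# Action only: the default pattern matches every line
action_only=$(printf '%s\n' "$sample" | awk '{ print $1 }')
echo "$action_only"     # alpha, beta, gamma (one per line)
```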

Field Extraction

awk automatically splits lines into fields:

  • $0 = entire line

  • $1 = first field

  • $2 = second field

  • NF = number of fields

  • $NF = last field

# Print second field (whitespace-delimited by default)
echo "hello world today" | awk '{print $2}'
# Output: world

# Print last field
echo "one two three four" | awk '{print $NF}'
# Output: four
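Because NF is an ordinary number, it can also be used in arithmetic inside a field reference. A small sketch, reusing the same kind of sample line (the variable names are made up for illustration):

```shell
line='one two three four'

# NF on its own is the field count
field_count=$(echo "$line" | awk '{ print NF }')

# $(NF-1) is the second-to-last field
second_last=$(echo "$line" | awk '{ print $(NF-1) }')

echo "$field_count $second_last"    # 4 three
```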

Test File Setup

cat << 'EOF' > /tmp/awk-practice.txt
# Access Log
192.168.1.100 GET /api/users 200 145ms
10.50.1.20 POST /api/login 200 89ms
172.16.0.5 GET /api/products 404 23ms
192.168.1.100 GET /api/orders 500 2341ms
10.50.1.50 DELETE /api/users/123 403 45ms

# User Data
admin:x:1000:1000:Administrator:/home/admin:/bin/bash
developer:x:1001:1001:Developer User:/home/dev:/bin/zsh
service:x:999:999:Service Account:/var/lib/service:/sbin/nologin

# Network Config
interface=eth0 ip=192.168.1.100 netmask=255.255.255.0 gateway=192.168.1.1
interface=eth1 ip=10.50.1.20 netmask=255.255.255.0 gateway=10.50.1.1
interface=lo ip=127.0.0.1 netmask=255.0.0.0

# CSV Data
Name,Email,Department,Salary
John Doe,john@example.com,Engineering,85000
Jane Smith,jane@example.com,Marketing,72000
Bob Wilson,bob@example.com,Engineering,92000
EOF

Lesson 1: Pattern Matching

# Print lines containing "GET"
awk '/GET/' /tmp/awk-practice.txt

# Print lines containing "192.168" anywhere (the escaped dot matches a literal ".")
awk '/192\.168/' /tmp/awk-practice.txt

# Print lines NOT matching pattern
awk '!/^#/' /tmp/awk-practice.txt  # Exclude comments

Lesson 2: Field + Pattern Combinations

# Print specific fields from matching lines
awk '/GET/ {print $1, $4}' /tmp/awk-practice.txt
# Output: IP and status code

# Conditional on field value
awk '$4 == 500 {print $1, $3}' /tmp/awk-practice.txt
# Output: IP and path of requests that returned 500

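# Numeric comparison
awk '$5 + 0 > 100 {print $0}' /tmp/awk-practice.txt
# Note: "145ms" isn't numeric, so a bare $5 > 100 compares as strings.
# Adding 0 makes awk convert the leading digits first ("145ms" becomes 145).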
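The note about "145ms" can be made concrete. The sketch below assumes a single made-up value, "89ms", and shows how the same test gives different answers depending on whether awk compares strings or numbers.

```shell
# "89ms" does not look like a number, so awk compares it with 100 as a string.
# As strings, "8" sorts after "1", so the test is true.
as_string=$(echo "89ms" | awk '{ if ($1 > 100) print "slow"; else print "ok" }')

# Adding 0 forces a numeric conversion of the leading digits (89), so the test is false.
as_number=$(echo "89ms" | awk '{ if ($1 + 0 > 100) print "slow"; else print "ok" }')

echo "$as_string $as_number"    # slow ok
```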

Lesson 3: Field Separators

# Colon-separated (like /etc/passwd); NF == 7 keeps only the passwd-style lines
awk -F: 'NF == 7 {print $1, $NF}' /tmp/awk-practice.txt
# Prints: username and shell

# Comma-separated (CSV)
awk -F, '/Engineering/ {print $1, $4}' /tmp/awk-practice.txt

# Multiple separators
awk -F'[=:]' '{print $1, $2}' /tmp/awk-practice.txt
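When the argument to -F is longer than one character, awk treats it as a regex, so a bracket expression lets several delimiters split the same line. A minimal sketch, assuming a hypothetical line that mixes "=" and ":":

```shell
# Hypothetical input mixing both delimiters
line='user=admin:shell=/bin/bash'

# Fields are: user, admin, shell, /bin/bash
result=$(echo "$line" | awk -F'[=:]' '{ print $2, $4 }')

echo "$result"    # admin /bin/bash
```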

Lesson 4: Regex in Field Matching

# Match field against regex using ~
awk '$1 ~ /^192/' /tmp/awk-practice.txt

# Negated match using !~
awk '$1 !~ /^#/' /tmp/awk-practice.txt

# Match on several fields at once: API paths with a 4xx or 5xx status
awk '$3 ~ /\/api\// && $4 ~ /^[45][0-9]{2}$/' /tmp/awk-practice.txt

Lesson 5: Built-in Regex Functions

match() - Find and extract

# Find the first match on each line and print the matched text
awk '{
    if (match($0, /[0-9]+ms/)) {
        print substr($0, RSTART, RLENGTH)
    }
}' /tmp/awk-practice.txt
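match() returns the position of the match and also sets RSTART and RLENGTH, which is what makes the substr() call work. A short sketch on made-up one-line inputs:

```shell
# "145ms" starts at character 10 and is 5 characters long
found=$(echo "GET /api 145ms" | awk '{
    if (match($0, /[0-9]+ms/))
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
}')
echo "$found"      # 10 5 145ms

# With no match, match() returns 0, RSTART is 0, and RLENGTH is -1
missing=$(echo "no timing here" | awk '{ print match($0, /[0-9]+ms/), RSTART, RLENGTH }')
echo "$missing"    # 0 0 -1
```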

gsub() - Global substitution

# Replace all occurrences
awk '{gsub(/192\.168/, "10.0"); print}' /tmp/awk-practice.txt

# Replace in a specific field (set OFS so the rebuilt line keeps its commas)
awk -F, -v OFS=, '$2 ~ /@/ {gsub(/@.*/, "@COMPANY.COM", $2)} {print}' /tmp/awk-practice.txt
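Changing a field, including through gsub() with a field as its target, makes awk rebuild $0 using the output field separator OFS, which defaults to a single space. The sketch below uses one CSV row from the practice data to show the difference; the variable names are illustrative.

```shell
row='John Doe,john@example.com,Engineering,85000'

# Default OFS: the rebuilt record is joined with spaces, so the commas are lost
default_ofs=$(echo "$row" | awk -F, '{ gsub(/@.*/, "@COMPANY.COM", $2); print }')
echo "$default_ofs"    # John Doe john@COMPANY.COM Engineering 85000

# OFS set to a comma: the CSV shape survives
comma_ofs=$(echo "$row" | awk -F, -v OFS=, '{ gsub(/@.*/, "@COMPANY.COM", $2); print }')
echo "$comma_ofs"      # John Doe,john@COMPANY.COM,Engineering,85000
```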

sub() - Single substitution

# Replace first occurrence only
awk '{sub(/GET/, "REQUEST"); print}' /tmp/awk-practice.txt
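Both sub() and gsub() return the number of substitutions they made, which makes the difference between them easy to see. A minimal sketch on a made-up string:

```shell
# sub() replaces only the first match and returns 1
first_only=$(echo "a-b-c" | awk '{ n = sub(/-/, "+"); print n, $0 }')
echo "$first_only"    # 1 a+b-c

# gsub() replaces every match and returns the count
all_matches=$(echo "a-b-c" | awk '{ n = gsub(/-/, "+"); print n, $0 }')
echo "$all_matches"   # 2 a+b+c
```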

split() - Split by regex

# Split a field by pattern (log lines only, so other lines print nothing)
awk '/^[0-9]/ {
    split($5, arr, /[^0-9]+/)
    print "Duration:", arr[1], "ms"
}' /tmp/awk-practice.txt
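split() returns the number of pieces it produced, and the array it fills is indexed from 1. A small sketch that splits one IP address from the practice data into octets (the array name octet is made up for illustration):

```shell
# Split on a literal dot; n is the number of elements
result=$(echo "10.50.1.20" | awk '{
    n = split($0, octet, /\./)
    print n, octet[1], octet[n]
}')

echo "$result"    # 4 10 20
```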

Lesson 6: Practical Patterns

Log Analysis

# Count requests per IP
awk '/^[0-9]/ {count[$1]++} END {for (ip in count) print ip, count[ip]}' /tmp/awk-practice.txt

# Find slow requests (>100ms)
awk '{
    if (match($5, /[0-9]+/)) {
        # substr() returns a string; adding 0 forces a numeric comparison
        ms = substr($5, RSTART, RLENGTH) + 0
        if (ms > 100) print $1, $3, $5
    }
}' /tmp/awk-practice.txt

# Count status codes
awk '/^[0-9]/ {status[$4]++} END {for (s in status) print s, status[s]}' /tmp/awk-practice.txt
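The order in which a for (key in array) loop visits keys is unspecified, so the counts above can come out in any order. One common approach, sketched here on made-up status codes, is to pipe the result through sort:

```shell
# Hypothetical input: one status code per line
result=$(printf '200\n404\n200\n500\n200\n' |
    awk '{ status[$1]++ } END { for (s in status) print s, status[s] }' |
    sort)

echo "$result"
# 200 3
# 404 1
# 500 1
```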

Data Extraction

# Extract emails from lines
awk 'match($0, /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+/) {
    print substr($0, RSTART, RLENGTH)
}' /tmp/awk-practice.txt

# Extract the first IP address on each line
awk 'match($0, /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/) {
    print substr($0, RSTART, RLENGTH)
}' /tmp/awk-practice.txt

Data Transformation

# Convert key=value to JSON
awk -F'[ =]' '/interface/ {
    printf "{ \"interface\": \"%s\", \"ip\": \"%s\" }\n", $2, $4
}' /tmp/awk-practice.txt

# CSV to JSON-like (only rows whose second field looks like an email, which skips the header)
awk -F, '$2 ~ /@/ {
    printf "{ \"name\": \"%s\", \"email\": \"%s\", \"salary\": %s }\n", $1, $2, $4
}' /tmp/awk-practice.txt

Lesson 7: awk vs grep vs sed

Task              grep                    sed                                  awk
----------------  ----------------------  -----------------------------------  ------------------------------
Find lines        grep 'pattern' file     sed -n '/pattern/p' file             awk '/pattern/' file
Extract matches   grep -o 'pattern' file  Complex (s///p with capture groups)  match() + substr()
Replace           Not possible            sed 's/old/new/g' file               gsub(/old/, "new")
Field extraction  Not possible            Impractical (regex captures only)    awk '{print $2}'
Calculations      Not possible            Not practical                        awk '{sum+=$1} END{print sum}'

Rule of thumb:

  • grep: Find lines (simple patterns)

  • sed: Replace/transform text

  • awk: Field extraction, calculations, complex logic
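To see where the tools overlap and where awk pulls ahead, the same line-finding task can be run through all three. This is a rough sketch on a made-up three-line sample, assuming grep, sed, and awk are all installed:

```shell
# Hypothetical sample input
sample='GET /a 200
POST /b 500
GET /c 404'

# All three tools can find lines and give identical output
with_grep=$(printf '%s\n' "$sample" | grep 'GET')
with_sed=$(printf '%s\n' "$sample" | sed -n '/GET/p')
with_awk=$(printf '%s\n' "$sample" | awk '/GET/')

# awk can then pick a field from the matching lines in the same command
paths=$(printf '%s\n' "$sample" | awk '/GET/ { print $2 }')
echo "$paths"    # /a and /c, one per line
```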

Complex Example: Log Analysis Report

awk '
BEGIN {
    print "=== API Request Analysis ==="
    print ""
}
/^[0-9]/ {
    # Extract response time: adding 0 converts "145ms" to the number 145
    ms = $5 + 0

    # Count by status
    status[$4]++

    # Track slow requests (numeric comparison, since ms is a number)
    if (ms > 100) slow++

    # Sum for average
    total += ms
    count++
}
END {
    print "Status Code Distribution:"
    for (s in status) printf "  %s: %d requests\n", s, status[s]
    print ""
    printf "Slow requests (>100ms): %d\n", slow
    printf "Average response time: %.1f ms\n", total/count
}
' /tmp/awk-practice.txt

Exercises to Complete

  1. [ ] Print usernames and shells from passwd-style lines

  2. [ ] Find all requests that returned 4xx or 5xx

  3. [ ] Calculate average salary from CSV

  4. [ ] Extract all IP addresses and count occurrences

  5. [ ] Convert CSV to tab-separated

Self-Check

Solutions
# 1. Usernames and shells (passwd-style lines have exactly 7 colon-separated fields)
awk -F: 'NF == 7 {print $1, $NF}' /tmp/awk-practice.txt

# 2. 4xx/5xx requests
awk '$4 ~ /^[45][0-9]{2}$/ {print}' /tmp/awk-practice.txt

# 3. Average salary
awk -F, 'NR>1 && $4 ~ /[0-9]/ {sum+=$4; n++} END {print "Avg:", sum/n}' /tmp/awk-practice.txt

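# 4. IP count (loop so that every address on a line is counted, not just the first)
awk '{
    line = $0
    while (match(line, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/)) {
        count[substr(line, RSTART, RLENGTH)]++
        line = substr(line, RSTART + RLENGTH)
    }
} END {
    for (ip in count) print ip, count[ip]
}' /tmp/awk-practice.txt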

# 5. CSV to TSV
awk -F, 'BEGIN {OFS="\t"} {$1=$1; print}' /tmp/awk-practice.txt

Next Session

Session 07: vim Regex - Search and replace in your editor.