ETL Session 03: Log Processing

Log file analysis with the core Unix text tools. This session covers filtering logs with grep, transforming them with sed, and aggregating them with awk.

Pre-Session State

  • Can convert JSON to CSV

  • Understand jq basics

  • Know awk printf formatting
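As a quick warm-up on the last prerequisite, a minimal printf sketch (the host name and count are invented sample values, not real log data):

```shell
# awk printf refresher: left-justify the label in 10 columns,
# right-align the count in 3. "kvm-01 3" is an invented sample value.
echo "kvm-01 3" | awk '{printf "%-10s %3d errors\n", $1, $2}'
```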

Setup

cat > /tmp/app.log << 'EOF'
2026-03-18T10:00:01 INFO [kvm-01] Server started
2026-03-18T10:00:15 WARN [kvm-01] High memory usage: 85%
2026-03-18T10:01:00 ERROR [vault-01] Connection refused to 10.50.1.100
2026-03-18T10:01:30 INFO [kvm-02] Backup completed
2026-03-18T10:02:00 ERROR [kvm-01] Disk full on /var/log
2026-03-18T10:02:45 WARN [ise-01] Certificate expiring in 30 days
2026-03-18T10:03:00 INFO [vault-01] Unsealed successfully
2026-03-18T10:03:30 ERROR [kvm-02] OOM killer invoked
2026-03-18T10:04:00 INFO [kvm-01] Service nginx restarted
EOF

Lesson 1: grep Filtering

Concept: Filter lines matching patterns.

Exercise 1.1: Basic filtering

# Find all errors
grep ERROR /tmp/app.log

# Find errors OR warnings
grep -E 'ERROR|WARN' /tmp/app.log

# Case insensitive
grep -i error /tmp/app.log

Exercise 1.2: Inverse and context

# Lines NOT matching
grep -v INFO /tmp/app.log

# Show context around matches
grep -B 1 -A 1 ERROR /tmp/app.log  # 1 line before and after
grep -C 2 ERROR /tmp/app.log       # 2 lines context
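A related switch worth knowing before drilling into matches: -c reports how many lines matched instead of printing them. A sketch with inline sample lines standing in for the log:

```shell
# -c prints a count of matching lines, not the lines themselves.
printf '%s\n' \
  '2026-03-18T10:01:00 ERROR [vault-01] Connection refused' \
  '2026-03-18T10:01:30 INFO  [kvm-02] Backup completed' \
  '2026-03-18T10:02:00 ERROR [kvm-01] Disk full' |
  grep -c ERROR   # prints 2
```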

Exercise 1.3: Extract matches only

# Extract just the IP addresses
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' /tmp/app.log

# Extract hostnames from brackets
grep -oE '\[[a-z]+-[0-9]+\]' /tmp/app.log | tr -d '[]'
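Because -o emits one match per line, deduplication is just a pipe away. A sketch with inline sample lines (the second IP is invented for illustration):

```shell
# Unique IPs: each -o match lands on its own line, so sort -u dedups them.
printf '%s\n' \
  'ERROR refused to 10.50.1.100' \
  'ERROR timeout to 10.50.1.100' \
  'WARN  slow reply from 10.50.2.7' |
  grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | sort -u
```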

Lesson 2: sed Transformation

Concept: Stream editing for text transformation.

Exercise 2.1: Simple substitution

# Replace ERROR with [ERROR]
sed 's/ERROR/[ERROR]/' /tmp/app.log

# Global replacement (all occurrences)
sed 's/kvm/KVM/g' /tmp/app.log

Exercise 2.2: Delete lines

# Delete INFO lines
sed '/INFO/d' /tmp/app.log

# Keep only ERROR lines
sed '/ERROR/!d' /tmp/app.log

Exercise 2.3: Extract and transform

# Extract timestamp and level (\| alternation in a BRE is a GNU sed extension)
sed -n 's/\(.*\) \(INFO\|WARN\|ERROR\) .*/\1 \2/p' /tmp/app.log

# Reformat timestamp: replace T with a space, then keep only the date and time
# (everything after the time is dropped)
sed 's/T/ /' /tmp/app.log | sed 's/\([0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}\).*/\1/'
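To reformat the timestamp while keeping the message, capture the date and time halves explicitly. A sketch on one inline sample line:

```shell
# Replace the T separator but keep the rest of the line:
# capture the date and the time, reassemble with a space between them.
echo '2026-03-18T10:02:00 ERROR [kvm-01] Disk full on /var/log' |
  sed 's/^\([0-9-]*\)T\([0-9:]*\)/\1 \2/'
```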

Lesson 3: awk Aggregation

Concept: Field processing and statistics.

Exercise 3.1: Count by field

# Count by log level
awk '{level[$2]++} END {for (l in level) print l, level[l]}' /tmp/app.log

Output:

INFO 4
WARN 2
ERROR 3
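Note that awk's for (key in array) iteration order is unspecified, so those rows can come out in any order; piping through sort pins it down. A sketch with inline sample records (count printed first, so sort -rn orders by frequency):

```shell
# Print the count first, then sort numerically descending for a stable view.
printf '%s\n' 'x INFO a' 'x WARN b' 'x INFO c' 'x ERROR d' 'x INFO e' |
  awk '{level[$2]++} END {for (l in level) print level[l], l}' |
  sort -rn
```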

Exercise 3.2: Count by host

# Extract host from brackets, count
awk -F'[][]' '{hosts[$2]++} END {for (h in hosts) print h, hosts[h]}' /tmp/app.log

Exercise 3.3: Time-based analysis

# Errors per minute (the bucket key comes out as date:HH:MM)
grep ERROR /tmp/app.log | \
  awk -F'[T:]' '{minute=$1":"$2":"$3; errors[minute]++}
    END {for (m in errors) print m, errors[m]}' | sort

Exercise 3.4: Top errors

# Most common error messages
grep ERROR /tmp/app.log | \
  awk -F'] ' '{print $2}' | \
  sort | uniq -c | sort -rn | head -5

Lesson 4: Combined Pipeline

Concept: Chain grep/sed/awk for complex analysis.

Exercise 4.1: Error report

echo "=== ERROR REPORT ==="
echo "Generated: $(date)"
echo ""
echo "Errors by host:"
grep ERROR /tmp/app.log | \
  awk -F'[][]' '{hosts[$2]++} END {for (h in hosts) printf "  %s: %d\n", h, hosts[h]}'
echo ""
echo "Recent errors:"
grep ERROR /tmp/app.log | tail -3 | sed 's/^/  /'

Exercise 4.2: Summary statistics

awk '
  {
    total++
    level[$2]++
    # Extract host (three-argument match() is a gawk extension; requires GNU awk)
    match($0, /\[([a-z]+-[0-9]+)\]/, arr)
    if (arr[1]) hosts[arr[1]]++
  }
  END {
    print "Total entries:", total
    print "\nBy level:"
    for (l in level) printf "  %s: %d\n", l, level[l]
    print "\nBy host:"
    for (h in hosts) printf "  %s: %d\n", h, hosts[h]
  }' /tmp/app.log
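Three-argument match() needs GNU awk. On a system with only POSIX awk (e.g. macOS), the bracketed host can be pulled out with index() and substr() instead. A hedged sketch on one inline sample line:

```shell
# POSIX-portable host extraction: locate the brackets with index(),
# slice the host out with substr(). No gawk-only features used.
echo '2026-03-18T10:02:00 ERROR [kvm-01] Disk full on /var/log' | awk '{
  s = index($0, "[")
  e = index($0, "]")
  if (s && e > s) print substr($0, s + 1, e - s - 1)
}'
```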

Exercise 4.3: Alert pipeline

# Real-time error alerting (simulated)
grep ERROR /tmp/app.log | while IFS= read -r line; do
  host=$(echo "$line" | grep -oE '\[[a-z]+-[0-9]+\]' | tr -d '[]')
  msg=$(echo "$line" | awk -F'] ' '{print $2}')
  echo "ALERT: $host - $msg"
done
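The loop above spawns a grep and an awk per line; the same alert can be produced in a single awk pass, which also adapts directly to a live stream (feed it from tail -F instead of a file). A sketch on one inline sample line:

```shell
# Single-pass alert pipeline: awk filters, extracts the host, and splits
# the message in one process. For live logs, feed this from `tail -F`.
echo '2026-03-18T10:01:00 ERROR [vault-01] Connection refused to 10.50.1.100' |
  awk '/ERROR/ {
    host = $3
    gsub(/[][]/, "", host)            # strip brackets from [host]
    msg = $0
    sub(/^.*\] /, "", msg)            # message is everything after "] "
    printf "ALERT: %s - %s\n", host, msg
  }'
```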

Summary: What You Learned

Concept         Syntax                       Example
grep pattern    grep PATTERN file            grep ERROR log
grep regex      grep -E 'a|b'                grep -E 'ERROR|WARN'
grep extract    grep -o PATTERN              Extract matches only
sed substitute  sed 's/old/new/'             Replace first occurrence
sed global      sed 's/old/new/g'            Replace all occurrences
sed delete      sed '/pattern/d'             Delete matching lines
awk count       {arr[$1]++}                  Count by field
awk sum         {sum+=$1} END {print sum}    Sum a field
awk printf      printf "%s: %d\n", k, v      Formatted output

Exercises to Complete

  1. [ ] Count log entries by hour

  2. [ ] Find all unique IP addresses in logs

  3. [ ] Create error summary report with percentages

  4. [ ] Build alerting pipeline for ERROR level
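A hint for the first exercise: with -F'[T:]', the hour is the second field, so a date-plus-hour key buckets entries per hour. One possible shape, sketched on inline sample lines (not the only answer):

```shell
# Count entries per hour: key on "date hour", splitting on T and : .
printf '%s\n' \
  '2026-03-18T10:00:01 INFO x' \
  '2026-03-18T10:59:59 WARN y' \
  '2026-03-18T11:00:00 ERROR z' |
  awk -F'[T:]' '{hours[$1" "$2]++} END {for (h in hours) print h, hours[h]}' |
  sort
```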

Next Session

Session 04: API to Report - curl/jq patterns, enrichment.

Session Log

Timestamp  Notes

Start      <Record when you started>
End        <Record when you finished>