ETL Session 03: Log Processing
This session covers log file analysis: filtering logs with grep, transforming them with sed, and aggregating them with awk.
Pre-Session State
- Can convert JSON to CSV
- Understand jq basics
- Know awk printf formatting
Setup
cat > /tmp/app.log << 'EOF'
2026-03-18T10:00:01 INFO [kvm-01] Server started
2026-03-18T10:00:15 WARN [kvm-01] High memory usage: 85%
2026-03-18T10:01:00 ERROR [vault-01] Connection refused to 10.50.1.100
2026-03-18T10:01:30 INFO [kvm-02] Backup completed
2026-03-18T10:02:00 ERROR [kvm-01] Disk full on /var/log
2026-03-18T10:02:45 WARN [ise-01] Certificate expiring in 30 days
2026-03-18T10:03:00 INFO [vault-01] Unsealed successfully
2026-03-18T10:03:30 ERROR [kvm-02] OOM killer invoked
2026-03-18T10:04:00 INFO [kvm-01] Service nginx restarted
EOF
Lesson 1: grep Filtering
Concept: Filter lines matching patterns.
Exercise 1.1: Basic filtering
# Find all errors
grep ERROR /tmp/app.log
# Find errors OR warnings
grep -E 'ERROR|WARN' /tmp/app.log
# Case insensitive
grep -i error /tmp/app.log
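When only the number of matching lines matters, grep can count them itself. The following is a minimal sketch; the helper name count_level is hypothetical, and -w is assumed to be available (it is in GNU and BSD grep).

```shell
# count_level LEVEL: read log lines on stdin and print how many of them
# contain LEVEL as a whole word. (count_level is a hypothetical helper name.)
count_level() {
    # -c prints the number of matching lines instead of the lines themselves;
    # -w matches whole words only, so ERROR does not match ERRORS.
    grep -cw -- "$1"
}

# Demo on three lines taken from the sample log
count_level ERROR << 'EOF'
2026-03-18T10:00:01 INFO [kvm-01] Server started
2026-03-18T10:01:00 ERROR [vault-01] Connection refused to 10.50.1.100
2026-03-18T10:02:00 ERROR [kvm-01] Disk full on /var/log
EOF
```

Run against the full sample log with count_level ERROR < /tmp/app.log, this should print 3.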
Exercise 1.2: Inverse and context
# Lines NOT matching
grep -v INFO /tmp/app.log
# Show context around matches
grep -B 1 -A 1 ERROR /tmp/app.log # 1 line before and after
grep -C 2 ERROR /tmp/app.log # 2 lines context
Exercise 1.3: Extract matches only
# Extract just the IP addresses
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' /tmp/app.log
# Extract hostnames from brackets
grep -oE '\[[a-z]+-[0-9]+\]' /tmp/app.log | tr -d '[]'
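Extracted matches often repeat, and piping them through sort -u leaves each value once. This is a sketch under the assumption that hosts always appear as a lowercase name, a hyphen, and a number inside brackets, as in the sample log; unique_hosts is a hypothetical name.

```shell
# unique_hosts: read log lines on stdin and print each bracketed host once,
# in sorted order. (Hypothetical helper built from the commands above.)
unique_hosts() {
    grep -oE '\[[a-z]+-[0-9]+\]' | tr -d '[]' | sort -u
}

# Demo on three lines taken from the sample log
unique_hosts << 'EOF'
2026-03-18T10:00:01 INFO [kvm-01] Server started
2026-03-18T10:01:00 ERROR [vault-01] Connection refused to 10.50.1.100
2026-03-18T10:02:00 ERROR [kvm-01] Disk full on /var/log
EOF
```

sort -u is used rather than sort | uniq only because it is shorter; both give the same list here.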
Lesson 2: sed Transformation
Concept: Stream editing for text transformation.
Exercise 2.1: Simple substitution
# Replace ERROR with [ERROR]
sed 's/ERROR/[ERROR]/' /tmp/app.log
# Global replacement (all occurrences)
sed 's/kvm/KVM/g' /tmp/app.log
Exercise 2.2: Delete lines
# Delete INFO lines
sed '/INFO/d' /tmp/app.log
# Keep only ERROR lines
sed '/ERROR/!d' /tmp/app.log
Exercise 2.3: Extract and transform
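# Extract timestamp and level (-E: extended regex, so groups and | need no backslashes)
sed -nE 's/^([^ ]+) (INFO|WARN|ERROR) .*/\1 \2/p' /tmp/app.log
# Reformat timestamp: replace the T separator with a space and drop the rest of the line
sed -E 's/T([0-9]{2}:[0-9]{2}:[0-9]{2}).*/ \1/' /tmp/app.log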
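Substitution is also useful for redacting values before a log is shared. This is a rough sketch, assuming that any dotted group of four numbers in the log is an IPv4 address; the function name mask_ips and the placeholder x.x.x.x are illustrative choices, not part of the session material.

```shell
# mask_ips: read log lines on stdin and replace every IPv4-looking token
# with the placeholder x.x.x.x. (Hypothetical helper; the pattern does not
# check that each octet is in the range 0-255.)
mask_ips() {
    # -E enables extended regular expressions; the g flag replaces every
    # occurrence on a line, not just the first.
    sed -E 's/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/x.x.x.x/g'
}

# Demo on one line taken from the sample log
mask_ips << 'EOF'
2026-03-18T10:01:00 ERROR [vault-01] Connection refused to 10.50.1.100
EOF
```

The timestamp is left alone because it contains no dots.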
Lesson 3: awk Aggregation
Concept: Field processing and statistics.
Exercise 3.1: Count by field
# Count by log level
awk '{level[$2]++} END {for (l in level) print l, level[l]}' /tmp/app.log
Output (line order may vary):
INFO 4
WARN 2
ERROR 3
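awk does not guarantee the order in which a for (l in level) loop visits its keys, so the counts can come out in a different order from run to run or between awk implementations. One way to get stable output is to sort afterwards. The sketch below sorts by count, highest first, and then by name; count_levels_sorted is a hypothetical name.

```shell
# count_levels_sorted: read log lines on stdin, count by the second field,
# and print "LEVEL COUNT" lines ordered by count (descending), then by name.
count_levels_sorted() {
    awk '{level[$2]++} END {for (l in level) print l, level[l]}' |
        sort -k2,2nr -k1,1
}

# Demo on four lines taken from the sample log
count_levels_sorted << 'EOF'
2026-03-18T10:00:01 INFO [kvm-01] Server started
2026-03-18T10:00:15 WARN [kvm-01] High memory usage: 85%
2026-03-18T10:01:00 ERROR [vault-01] Connection refused to 10.50.1.100
2026-03-18T10:01:30 INFO [kvm-02] Backup completed
EOF
```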
Exercise 3.2: Count by host
# Extract host from brackets, count
awk -F'[][]' '{hosts[$2]++} END {for (h in hosts) print h, hosts[h]}' /tmp/app.log
Exercise 3.3: Time-based analysis
# Errors per minute (the key is the timestamp truncated to the minute)
grep ERROR /tmp/app.log | \
awk -F'[T:]' '{minute=$1"T"$2":"$3; errors[minute]++}
    END {for (m in errors) print m, errors[m]}' | sort
Exercise 3.4: Top errors
# Most common error messages
grep ERROR /tmp/app.log | \
awk -F'] ' '{print $2}' | \
sort | uniq -c | sort -rn | head -5
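Counting is one kind of aggregation; summing a numeric field follows the same pattern with += in place of ++. The sample log has no numeric column, so the sketch below assumes a hypothetical input format with a byte count in the third field. The format, the data, and the name sum_by_host are illustrative only.

```shell
# sum_by_host: read "TIMESTAMP HOST BYTES" lines on stdin (a hypothetical
# format, not the session log) and print the total bytes per host.
sum_by_host() {
    # bytes[$2] += $3 accumulates field 3 under the key in field 2
    awk '{bytes[$2] += $3} END {for (h in bytes) print h, bytes[h]}' | sort
}

# Demo with made-up numbers
sum_by_host << 'EOF'
2026-03-18T10:00:01 kvm-01 1200
2026-03-18T10:00:15 kvm-02 300
2026-03-18T10:01:00 kvm-01 800
EOF
```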
Lesson 4: Combined Pipeline
Concept: Chain grep/sed/awk for complex analysis.
Exercise 4.1: Error report
echo "=== ERROR REPORT ==="
echo "Generated: $(date)"
echo ""
echo "Errors by host:"
grep ERROR /tmp/app.log | \
awk -F'[][]' '{hosts[$2]++} END {for (h in hosts) printf " %s: %d\n", h, hosts[h]}'
echo ""
echo "Recent errors:"
grep ERROR /tmp/app.log | tail -3 | sed 's/^/ /'
Exercise 4.2: Summary statistics
awk '
{
    total++
    level[$2]++
    # Extract host (the 3-argument form of match() requires gawk)
    match($0, /\[([a-z]+-[0-9]+)\]/, arr)
    if (arr[1]) hosts[arr[1]]++
}
END {
    print "Total entries:", total
    print "\nBy level:"
    for (l in level) printf " %s: %d\n", l, level[l]
    print "\nBy host:"
    for (h in hosts) printf " %s: %d\n", h, hosts[h]
}' /tmp/app.log
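The match() call with a third argument used above is a gawk extension and fails on other awk implementations such as mawk or BSD awk. Below is a sketch of a more portable way to extract the host, using the RSTART and RLENGTH variables that match() sets in POSIX awk; host_counts is a hypothetical name.

```shell
# host_counts: read log lines on stdin and count entries per bracketed host
# using only POSIX awk features (RSTART/RLENGTH instead of a capture array).
host_counts() {
    awk '
    # match() returns 0 when nothing matches, so lines without a host are skipped
    match($0, /\[[a-z]+-[0-9]+\]/) {
        # Drop the surrounding brackets: start one character in, two shorter
        host = substr($0, RSTART + 1, RLENGTH - 2)
        hosts[host]++
    }
    END { for (h in hosts) printf "%s: %d\n", h, hosts[h] }
    ' | sort
}

# Demo on three lines taken from the sample log
host_counts << 'EOF'
2026-03-18T10:00:01 INFO [kvm-01] Server started
2026-03-18T10:01:00 ERROR [vault-01] Connection refused to 10.50.1.100
2026-03-18T10:02:00 ERROR [kvm-01] Disk full on /var/log
EOF
```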
Exercise 4.3: Alert pipeline
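# Real-time error alerting (simulated by reading the static sample log)
# IFS= and -r keep leading whitespace and backslashes in each line intact
grep ERROR /tmp/app.log | while IFS= read -r line; do
    host=$(echo "$line" | grep -oE '\[[a-z]+-[0-9]+\]' | tr -d '[]')
    msg=$(echo "$line" | awk -F'] ' '{print $2}')
    echo "ALERT: $host - $msg"
done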
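The loop above starts several processes for every matching line, which becomes slow on large logs. The same alert format can be produced in a single awk pass. This is a sketch assuming the TIMESTAMP LEVEL [host] message layout of the sample log; format_alerts is a hypothetical name.

```shell
# format_alerts: read log lines on stdin and print one ALERT line per ERROR
# entry. Unlike grep ERROR, this only matches ERROR in the level field.
format_alerts() {
    awk '
    $2 == "ERROR" && match($0, /\[[a-z]+-[0-9]+\] /) {
        # The match covers "[host] ", so strip "[", "]" and the trailing space
        host = substr($0, RSTART + 1, RLENGTH - 3)
        # Everything after the match is the message
        msg = substr($0, RSTART + RLENGTH)
        print "ALERT: " host " - " msg
    }'
}

# Demo on two lines taken from the sample log
format_alerts << 'EOF'
2026-03-18T10:03:00 INFO [vault-01] Unsealed successfully
2026-03-18T10:03:30 ERROR [kvm-02] OOM killer invoked
EOF
```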
Summary: What You Learned
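| Concept | Syntax | Example |
|---|---|---|
| grep pattern | grep PATTERN file | Lines containing a pattern |
| grep regex | grep -E 'REGEX' file | Extended regex, such as alternation |
| grep extract | grep -o | Extract matches only |
| sed substitute | s/old/new/ | Replace first match on each line |
| sed global | s/old/new/g | Replace all matches |
| sed delete | /pattern/d | Delete matching lines |
| awk count | arr[$N]++ | Count by field |
| awk sum | sum += $N | Sum field |
| awk printf | printf "fmt", args | Formatted output |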
Exercises to Complete
- [ ] Count log entries by hour
- [ ] Find all unique IP addresses in logs
- [ ] Create error summary report with percentages
- [ ] Build alerting pipeline for ERROR level
Next Session
Session 04: API to Report - curl/jq patterns, enrichment.
Session Log
| Timestamp | Notes |
|---|---|
| Start | <Record when you started> |
| End | <Record when you finished> |