ETL Session 02: JSON to CSV

Data format conversion. This session covers extracting data from JSON using jq and converting to CSV, TSV, and formatted tables.

Pre-Session State

  • Understand pipes and tee

  • Know basic xargs usage

  • Can run jq identity filter
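As a warm-up for the last prerequisite, the identity filter in one line (the sample object here is invented for illustration):

```shell
# '.' is the identity filter: parse stdin and pretty-print it unchanged
echo '{"name":"kvm-01","cpu":85}' | jq '.'
```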

Setup

cat > /tmp/servers.json << 'EOF'
[
  {"name": "kvm-01", "ip": "10.50.1.110", "role": "hypervisor", "cpu": 85},
  {"name": "kvm-02", "ip": "10.50.1.111", "role": "hypervisor", "cpu": 45},
  {"name": "vault-01", "ip": "10.50.1.60", "role": "secrets", "cpu": 32},
  {"name": "ise-01", "ip": "10.50.1.20", "role": "nac", "cpu": 78}
]
EOF
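Optional sanity check (an addition, not part of the original setup): jq's `empty` filter parses the input, prints nothing, and exits non-zero on invalid JSON, which makes it a cheap validity gate.

```shell
# Exits 0 and prints nothing if the file is well-formed JSON
jq empty /tmp/servers.json && echo "servers.json parses"
```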

Lesson 1: jq Field Extraction

Concept: Build arrays of fields for CSV conversion.

Exercise 1.1: Extract single field

cat /tmp/servers.json | jq -r '.[].name'

Output: Each name on its own line:

kvm-01
kvm-02
vault-01
ise-01

Exercise 1.2: Extract multiple fields as array

cat /tmp/servers.json | jq -c '.[] | [.name, .ip, .role]'

Output: One compact array per line, like ["kvm-01","10.50.1.110","hypervisor"]. Without -c, jq pretty-prints each array across several lines.

Exercise 1.3: Flatten to values

cat /tmp/servers.json | jq -r '.[] | [.name, .ip, .cpu] | @tsv'

Output: Tab-separated values (TSV), one server per line:

kvm-01	10.50.1.110	85
kvm-02	10.50.1.111	45

Lesson 2: @csv Output

Concept: @csv properly escapes and quotes CSV fields.
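To see that escaping in action, feed @csv a value containing a comma and an embedded quote (the record below is made up for illustration): fields are quoted and inner quotes are doubled, RFC 4180 style.

```shell
# A note field with a comma and embedded quotes survives intact
echo '[{"host":"kvm-01","note":"disk \"sda\", 80% full"}]' |
  jq -r '.[] | [.host, .note] | @csv'
# "kvm-01","disk ""sda"", 80% full"
```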

Exercise 2.1: Basic CSV

cat /tmp/servers.json | jq -r '.[] | [.name, .ip, .role, .cpu] | @csv'

Output:

"kvm-01","10.50.1.110","hypervisor",85
"kvm-02","10.50.1.111","hypervisor",45

Exercise 2.2: CSV with header

cat /tmp/servers.json | jq -r '
  ["name","ip","role","cpu"],
  (.[] | [.name, .ip, .role, .cpu])
  | @csv'

Output:

"name","ip","role","cpu"
"kvm-01","10.50.1.110","hypervisor",85

Exercise 2.3: Selective fields

# Only hypervisors, only name and cpu
cat /tmp/servers.json | jq -r '
  ["host","cpu_percent"],
  (.[] | select(.role == "hypervisor") | [.name, .cpu])
  | @csv'

Lesson 3: awk Post-Processing

Concept: Use awk to format jq output into tables.

Exercise 3.1: TSV to table

cat /tmp/servers.json | jq -r '.[] | [.name, .ip, .cpu] | @tsv' |
  awk -F'\t' '{printf "%-12s %-15s %3d%%\n", $1, $2, $3}'

Output:

kvm-01       10.50.1.110      85%
kvm-02       10.50.1.111      45%

Exercise 3.2: Table with header and summary

cat /tmp/servers.json | jq -r '.[] | [.name, .ip, .cpu] | @tsv' |
  awk -F'\t' '
    BEGIN {
      printf "%-12s %-15s %s\n", "HOST", "IP", "CPU"
      print "----------------------------------------"
    }
    { printf "%-12s %-15s %3d%%\n", $1, $2, $3; total+=$3; count++ }
    END {
      print "----------------------------------------"
      printf "Average CPU: %.1f%%\n", total/count
    }'

Exercise 3.3: Conditional formatting

cat /tmp/servers.json | jq -r '.[] | [.name, .cpu] | @tsv' |
  awk -F'\t' '{
    status = ($2 > 80) ? "HIGH" : ($2 > 50) ? "MED" : "OK"
    printf "%-12s %3d%% [%s]\n", $1, $2, status
  }'
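A common follow-on, not covered in the exercises above: sort the TSV rows numerically before formatting so the busiest host prints first. The sort step is an addition here; everything else is Exercise 3.3 unchanged.

```shell
# Sort descending on column 2 (CPU) before the awk formatting pass.
# 'sort' is assumed available (coreutils); -t$'\t' needs bash.
cat /tmp/servers.json | jq -r '.[] | [.name, .cpu] | @tsv' |
  sort -t$'\t' -k2,2 -nr |
  awk -F'\t' '{
    status = ($2 > 80) ? "HIGH" : ($2 > 50) ? "MED" : "OK"
    printf "%-12s %3d%% [%s]\n", $1, $2, status
  }'
```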

Lesson 4: Complete ETL Pipeline

Concept: Chain Extract → Transform → Load steps.

Exercise 4.1: JSON to report file

cat /tmp/servers.json | \
  jq -r '.[] | select(.cpu > 50) | [.name, .ip, .cpu] | @tsv' | \
  awk -F'\t' 'BEGIN {print "HIGH CPU SERVERS"} {printf "%s (%s): %d%%\n", $1, $2, $3}' | \
  tee /tmp/high-cpu-report.txt

Exercise 4.2: Multiple output formats

# Generate both CSV and markdown table
cat /tmp/servers.json | jq -r '.[] | [.name, .ip, .role] | @csv' > /tmp/servers.csv

cat /tmp/servers.json | jq -r '
  "| Name | IP | Role |",
  "|------|-----|------|",
  (.[] | "| \(.name) | \(.ip) | \(.role) |")' > /tmp/servers.md

Exercise 4.3: Aggregation pipeline

# Count servers by role
cat /tmp/servers.json | \
  jq -r 'group_by(.role) | .[] | "\(.[0].role): \(length)"'

Output:

hypervisor: 2
nac: 1
secrets: 1
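The same aggregation can feed the @csv formatter from Lesson 2, producing a loadable artifact instead of a human-readable line. Note that group_by sorts by the grouping key, so roles come out alphabetically.

```shell
# Emit one CSV row per role: role name, then group size
cat /tmp/servers.json |
  jq -r 'group_by(.role) | .[] | [.[0].role, length] | @csv'
# "hypervisor",2
# "nac",1
# "secrets",1
```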

Summary: What You Learned

Concept         Syntax                       Example
Extract field   .[] | .field                 .[] | .name
Build array     [.a, .b, .c]                 [.name, .ip, .cpu]
CSV output      @csv                         [...] | @csv
TSV output      @tsv                         [...] | @tsv
Header row      ["h1","h2"], (.[] | ...)     Literal array first
awk format      printf "%-10s"               Left-justify, width 10
awk aggregate   {sum+=$1} END {print sum}    Running totals
jq group        group_by(.field)             Aggregate by field

Exercises to Complete

  1. [ ] Convert servers.json to CSV with header

  2. [ ] Create a markdown table from JSON

  3. [ ] Build a report showing only high CPU hosts

  4. [ ] Count servers by role and format as table

Next Session

Session 03: Log Processing - grep, sed, awk for log analysis.

Session Log

Timestamp   Notes

Start       <Record when you started>
End         <Record when you finished>