ETL Session 04: API to Report

This session covers fetching data from APIs, processing it with jq, enriching it across multiple sources, and generating reports.

Pre-Session State

  • Can process logs with grep/sed/awk

  • Understand jq filtering and construction

  • Know CSV output patterns

Setup

# Simulate API responses
mkdir -p /tmp/api

# Hosts API
cat > /tmp/api/hosts.json << 'EOF'
{"data": [
  {"id": 1, "name": "kvm-01", "ip": "10.50.1.110", "status": "running"},
  {"id": 2, "name": "kvm-02", "ip": "10.50.1.111", "status": "running"},
  {"id": 3, "name": "vault-01", "ip": "10.50.1.60", "status": "warning"},
  {"id": 4, "name": "db-01", "ip": "10.50.1.70", "status": "stopped"}
]}
EOF

# Metrics API
cat > /tmp/api/metrics.json << 'EOF'
{"data": [
  {"host_id": 1, "cpu": 85, "memory": 72, "disk": 45},
  {"host_id": 2, "cpu": 32, "memory": 58, "disk": 67},
  {"host_id": 3, "cpu": 45, "memory": 89, "disk": 23},
  {"host_id": 4, "cpu": 0, "memory": 0, "disk": 95}
]}
EOF

Lesson 1: curl + jq Pattern

Concept: Extract data from an API and process it immediately.

Exercise 1.1: Basic API fetch

# Simulate: curl -s https://api.example.com/hosts | jq ...
cat /tmp/api/hosts.json | jq '.data[] | .name'
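
Note that the command above prints quoted JSON strings. jq's -r (raw output) flag strips the quotes, which matters once the values feed into shell pipelines. A quick comparison:

```shell
# Without -r, jq prints JSON strings (quoted); with -r it prints raw text
echo '{"data":[{"name":"kvm-01"}]}' | jq '.data[].name'      # "kvm-01"
echo '{"data":[{"name":"kvm-01"}]}' | jq -r '.data[].name'   # kvm-01
```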

Exercise 1.2: Filter and transform

# Get running hosts only
cat /tmp/api/hosts.json | jq -r '.data[] | select(.status == "running") | .name'

Exercise 1.3: Build new structure

# Restructure API response
cat /tmp/api/hosts.json | jq '.data | map({hostname: .name, address: .ip, state: .status})'
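
Since the pre-session skills include CSV output, it is worth noting the same restructuring can target CSV instead of JSON: jq's @csv builtin quotes the fields, and -r emits each line verbatim. A sketch against the fixture file from Setup:

```shell
# Header row, then one CSV row per host (@csv quotes string fields)
echo 'name,ip,status'
jq -r '.data[] | [.name, .ip, .status] | @csv' /tmp/api/hosts.json
```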

Lesson 2: Data Enrichment

Concept: Combine data from multiple sources.

Exercise 2.1: Manual join (lookup)

# Join hosts and metrics by host_id
jq -r '.data[] | "\(.id) \(.name) \(.ip)"' /tmp/api/hosts.json | \
while read -r id name ip; do
  # --argjson passes the shell variable in as a jq number, avoiding quoting issues
  cpu=$(jq -r --argjson id "$id" '.data[] | select(.host_id == $id) | .cpu' /tmp/api/metrics.json)
  echo "$name ($ip): CPU $cpu%"
done

Exercise 2.2: jq native join (slurp)

# Use jq to join two JSON files
jq -s '
  .[0].data as $hosts |
  .[1].data as $metrics |
  $hosts | map(. as $h |
    ($metrics[] | select(.host_id == $h.id)) as $m |
    {name: $h.name, ip: $h.ip, status: $h.status, cpu: $m.cpu, memory: $m.memory}
  )
' /tmp/api/hosts.json /tmp/api/metrics.json
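
The join above re-scans $metrics once per host. If your jq is 1.6 or newer, the INDEX builtin can build a host_id-keyed lookup table first. A sketch that produces merged objects rather than the hand-picked fields above:

```shell
# INDEX(.host_id) -> {"1": {...}, "2": {...}}; keys are stringified
jq -s '
  (.[1].data | INDEX(.host_id)) as $m |
  .[0].data | map(. + ($m[.id | tostring] // {}))
' /tmp/api/hosts.json /tmp/api/metrics.json
```

The `// {}` fallback keeps hosts with no matching metrics in the output instead of failing the lookup.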

Exercise 2.3: Add computed fields

# Add severity based on metrics
jq -s '
  .[0].data as $hosts |
  .[1].data as $metrics |
  $hosts | map(. as $h |
    ($metrics[] | select(.host_id == $h.id)) as $m |
    {
      name: $h.name,
      cpu: $m.cpu,
      severity: (if $m.cpu > 80 then "critical"
                 elif $m.cpu > 50 then "warning"
                 else "ok" end)
    }
  )
' /tmp/api/hosts.json /tmp/api/metrics.json

Lesson 3: Report Generation

Concept: Transform data into human-readable reports.

Exercise 3.1: Markdown report

echo "# Infrastructure Report"
echo ""
echo "Generated: $(date)"
echo ""
echo "## Host Status"
echo ""
echo "| Host | IP | Status |"
echo "|------|-----|--------|"
cat /tmp/api/hosts.json | jq -r '.data[] | "| \(.name) | \(.ip) | \(.status) |"'

Exercise 3.2: Summary section

echo "## Summary"
echo ""
total=$(cat /tmp/api/hosts.json | jq '.data | length')
running=$(cat /tmp/api/hosts.json | jq '[.data[] | select(.status == "running")] | length')
warning=$(cat /tmp/api/hosts.json | jq '[.data[] | select(.status == "warning")] | length')
stopped=$(cat /tmp/api/hosts.json | jq '[.data[] | select(.status == "stopped")] | length')

echo "- Total hosts: $total"
echo "- Running: $running"
echo "- Warning: $warning"
echo "- Stopped: $stopped"
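
Four separate jq calls are easy to read, but the per-status counts can also come from a single pass: group_by(.status) sorts the hosts into alphabetical buckets, and each bucket's length is the count. A sketch (statuses absent from the data simply will not appear):

```shell
# One invocation instead of four; prints "- <status>: <count>" per bucket
jq -r '.data | group_by(.status) | map("- \(.[0].status): \(length)") | .[]' /tmp/api/hosts.json
```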

Exercise 3.3: Alert section

echo "## Alerts"
echo ""

# High CPU
echo "### High CPU (>80%)"
jq -s -r '
  .[0].data as $hosts |
  .[1].data as $metrics |
  $hosts | map(. as $h |
    ($metrics[] | select(.host_id == $h.id)) as $m |
    select($m.cpu > 80) |
    "- \($h.name): \($m.cpu)%"
  ) | .[]
' /tmp/api/hosts.json /tmp/api/metrics.json

echo ""
echo "### Stopped Hosts"
cat /tmp/api/hosts.json | jq -r '.data[] | select(.status == "stopped") | "- \(.name) (\(.ip))"'

Lesson 4: Complete Pipeline

Concept: End-to-end ETL pipeline.

Exercise 4.1: Full report generator

#!/bin/bash
# infra-report.sh

HOSTS="/tmp/api/hosts.json"
METRICS="/tmp/api/metrics.json"
OUTPUT="/tmp/infra-report.md"

{
  echo "# Infrastructure Status Report"
  echo "Generated: $(date '+%Y-%m-%d %H:%M:%S')"
  echo ""

  echo "## Overview"
  echo ""
  echo "| Metric | Value |"
  echo "|--------|-------|"
  echo "| Total Hosts | $(jq '.data | length' "$HOSTS") |"
  echo "| Running | $(jq '[.data[] | select(.status == "running")] | length' "$HOSTS") |"
  echo "| Avg CPU | $(jq '[.data[].cpu] | add / length | floor' "$METRICS")% |"
  echo ""

  echo "## Detailed Status"
  echo ""
  echo "| Host | IP | Status | CPU | Memory | Disk |"
  echo "|------|-----|--------|-----|--------|------|"

  jq -s -r '
    .[0].data as $hosts |
    .[1].data as $metrics |
    $hosts[] | . as $h |
    ($metrics[] | select(.host_id == $h.id)) as $m |
    "| \($h.name) | \($h.ip) | \($h.status) | \($m.cpu)% | \($m.memory)% | \($m.disk)% |"
  ' "$HOSTS" "$METRICS"

  echo ""
  echo "## Alerts"
  echo ""

  jq -s -r '
    .[0].data as $hosts |
    .[1].data as $metrics |
    $hosts[] | . as $h |
    ($metrics[] | select(.host_id == $h.id)) as $m |
    select($m.cpu > 80 or $m.memory > 85 or $m.disk > 90 or $h.status != "running") |
    "- **\($h.name)**: " +
    (if $h.status != "running" then "Status: \($h.status) " else "" end) +
    (if $m.cpu > 80 then "CPU: \($m.cpu)% " else "" end) +
    (if $m.memory > 85 then "Memory: \($m.memory)% " else "" end) +
    (if $m.disk > 90 then "Disk: \($m.disk)% " else "" end)
  ' "$HOSTS" "$METRICS"

} > "$OUTPUT"

echo "Report generated: $OUTPUT"
cat "$OUTPUT"
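
To point this pipeline at a live API instead of the fixture files, the extraction step needs failure handling so a dead endpoint cannot silently produce an empty report. A sketch, in which the base URL is a placeholder and the fetch helper is hypothetical; curl's -f flag turns HTTP errors into non-zero exits:

```shell
API_BASE="https://api.example.com"   # placeholder -- substitute your real API

# Fetch one endpoint; write to a temp file first so a failed download
# never clobbers the last good copy on disk.
fetch() {
  local endpoint="$1" dest="$2"
  curl -fsS "$API_BASE/$endpoint" -o "$dest.tmp" || {
    echo "fetch failed: $endpoint" >&2
    return 1
  }
  mv "$dest.tmp" "$dest"
}

# Usage in the report script, before the { ... } block:
#   fetch hosts   "$HOSTS"
#   fetch metrics "$METRICS"
```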

Summary: What You Learned

  • API fetch: curl -s URL | jq (example: curl api/hosts | jq '.data')

  • jq slurp: jq -s (combine multiple files)

  • Join data: $a[] | select(.id == $b.id) (match by key)

  • Computed field: if…then…else…end (add derived values)

  • Markdown table: "| \(.x) | \(.y) |" (jq string template)

  • Shell variables: $(jq 'query' file) (capture jq output)

  • Report script: { … } > file (redirect a command group)

Exercises to Complete

  1. [ ] Create a script that joins hosts and metrics

  2. [ ] Generate CSV report with severity column

  3. [ ] Build alerting report (only problem hosts)

  4. [ ] Add a timestamp and run the script as a cron job

Next Session

Session 05: Infrastructure ETL - ISE, k8s, network patterns.

Session Log

  • Start: <Record when you started>

  • End: <Record when you finished>