ETL Session 01: Pipeline Basics

The Unix philosophy in action: small tools, each doing one job, connected by streams. This session covers the pipe operator, tee for splitting output, process substitution, xargs for building commands, and command grouping.

Pre-Session State

  • Can run basic shell commands

  • Understand stdin/stdout concept

  • Know basic grep/cat usage

Setup

# Create test data
cat > /tmp/hosts.txt << 'EOF'
kvm-01 10.50.1.110 hypervisor
kvm-02 10.50.1.111 hypervisor
vault-01 10.50.1.60 secrets
vault-02 10.50.1.61 secrets
ise-01 10.50.1.20 nac
EOF

Lesson 1: Pipe Operator

Concept: | connects the stdout of one command to the stdin of the next. stderr is not piped; it still goes to the terminal.

Exercise 1.1: Simple pipeline

# Extract → Filter → Count
cat /tmp/hosts.txt | grep hypervisor | wc -l

Output: 2

Exercise 1.2: Multi-stage pipeline

# Extract IPs from hypervisors, sort unique
cat /tmp/hosts.txt | grep hypervisor | awk '{print $2}' | sort -u

Exercise 1.3: Pipeline with formatting

# Format as "hostname: ip"
cat /tmp/hosts.txt | awk '{print $1 ": " $2}'
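
The same extract, transform, summarize pattern extends to grouping. A minimal sketch, assuming the three-column layout from Setup; the scratch file and the variable names `data` and `role_counts` are illustrative and not part of the session.

```shell
# Build a private copy of the sample data so the example is self-contained.
data=$(mktemp)
cat > "$data" << 'EOF'
kvm-01 10.50.1.110 hypervisor
kvm-02 10.50.1.111 hypervisor
vault-01 10.50.1.60 secrets
vault-02 10.50.1.61 secrets
ise-01 10.50.1.20 nac
EOF

# Extract the role column → sort so duplicates are adjacent → count each group
# → normalise to "role count" (uniq -c pads its counts with spaces).
role_counts=$(awk '{print $3}' "$data" | sort | uniq -c | awk '{print $2, $1}')
echo "$role_counts"

rm -f "$data"
```

sort must come before uniq -c because uniq only collapses adjacent duplicate lines.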

Lesson 2: tee - Split the Stream

Concept: tee writes to file AND stdout simultaneously.

Exercise 2.1: Save and continue

# Save intermediate result while continuing pipeline
cat /tmp/hosts.txt | grep hypervisor | tee /tmp/hypervisors.txt | wc -l
cat /tmp/hypervisors.txt  # Verify saved

Exercise 2.2: Multiple outputs

# Save to multiple files
cat /tmp/hosts.txt | tee /tmp/copy1.txt /tmp/copy2.txt | wc -l

Exercise 2.3: Append mode

# Append instead of overwrite (targets a scratch copy so /tmp/hosts.txt stays unchanged for later lessons)
echo "new-host 10.50.1.200 test" | tee -a /tmp/copy1.txt
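
To see overwrite and append side by side, here is a small sketch. It assumes a throwaway directory from mktemp; the names `workdir`, `stage.txt`, `count`, and `saved` are made up for illustration.

```shell
# Scratch area with a two-line sample so nothing in /tmp/hosts.txt is touched.
workdir=$(mktemp -d)
printf '%s\n' 'kvm-01 10.50.1.110 hypervisor' 'vault-01 10.50.1.60 secrets' > "$workdir/hosts.txt"

# tee saves the filtered lines; wc still receives the same stream.
count=$(grep hypervisor "$workdir/hosts.txt" | tee "$workdir/stage.txt" | wc -l | tr -d ' ')

# tee -a appends a second batch instead of truncating the file.
grep secrets "$workdir/hosts.txt" | tee -a "$workdir/stage.txt" > /dev/null

saved=$(wc -l < "$workdir/stage.txt" | tr -d ' ')
echo "piped: $count, saved after append: $saved"

rm -rf "$workdir"
```

Redirecting tee's stdout to /dev/null keeps only the file copy, which is handy when the on-screen echo is not wanted.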

Lesson 3: Process Substitution

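Concept: <(cmd) and >(cmd) expand to a file name (for example /dev/fd/63) connected to the command's output or input, so a command can stand in wherever a file argument is expected. This is a bash/zsh/ksh feature, not POSIX sh.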

Exercise 3.1: Compare two commands

# Compare sorted hostnames against an expected list (no output means they match)
diff <(cat /tmp/hosts.txt | awk '{print $1}' | sort) \
     <(printf '%s\n' ise-01 kvm-01 kvm-02 vault-01 vault-02)

Exercise 3.2: Multiple inputs

# Paste combines columns from multiple sources
paste <(cat /tmp/hosts.txt | awk '{print $1}') \
      <(cat /tmp/hosts.txt | awk '{print $2}')

Exercise 3.3: Output substitution

# Log to file while displaying on screen
cat /tmp/hosts.txt | grep hypervisor > >(tee /tmp/log.txt)
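
As a further illustration, the sketch below feeds two sorted streams to comm without temporary files, then prints what a substitution expands to. The inline host lists and the variable names `only_in_first` and `fd_path` are assumptions for the example; it needs bash, zsh, or ksh.

```shell
# Requires bash, zsh, or ksh: process substitution is not POSIX sh.
# comm needs sorted input; each <(...) supplies one sorted stream as a file name.
# -23 suppresses columns 2 and 3, leaving lines found only in the first stream.
only_in_first=$(comm -23 <(printf '%s\n' kvm-01 kvm-02 vault-01 | sort) \
                         <(printf '%s\n' kvm-01 vault-01 | sort))
echo "$only_in_first"

# The substitution itself expands to a path such as /dev/fd/63 (exact name varies).
fd_path=$(echo <(true))
echo "$fd_path"
```

Here comm reports kvm-02, the one hostname missing from the second list.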

Lesson 4: xargs - Build Commands

Concept: xargs converts stdin to command arguments.

Exercise 4.1: Basic xargs

# Pass all hostnames as arguments to a single echo
cat /tmp/hosts.txt | awk '{print $1}' | xargs echo "Hosts:"
# Output: Hosts: kvm-01 kvm-02 vault-01 vault-02 ise-01

Exercise 4.2: One at a time (-n 1)

# Run the command once per argument
cat /tmp/hosts.txt | awk '{print $2}' | xargs -n 1 echo "IP:"
# Output: IP: 10.50.1.110
#         IP: 10.50.1.111
#         ...

Exercise 4.3: Placeholder (-I)

# Use placeholder for positioning
cat /tmp/hosts.txt | awk '{print $1}' | xargs -I {} echo "Pinging {}..."
# Output: Pinging kvm-01...
#         Pinging kvm-02...
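
One difference between -n 1 and -I is easy to miss when input lines contain spaces. A minimal sketch, using two made-up "hostname ip" lines held in a variable named `lines` (illustrative only):

```shell
lines='kvm-01 10.50.1.110
vault-01 10.50.1.60'

# -n 1 splits on whitespace: four words, four echo calls.
per_word=$(printf '%s\n' "$lines" | xargs -n 1 echo "arg:")

# -I {} substitutes each whole input line: two lines, two echo calls.
per_line=$(printf '%s\n' "$lines" | xargs -I {} echo "line: {}")

echo "$per_word"
echo "$per_line"
```

With single-column input, as in the exercises above, the two modes run the same number of commands.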

Exercise 4.4: Parallel execution (-P)

# Run up to 4 pings in parallel
cat /tmp/hosts.txt | awk '{print $2}' | xargs -n 1 -P 4 ping -c 1
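
The parallel pattern can be tried without network access. In this sketch a small sh -c command stands in for ping (an assumption for illustration), and the output is sorted because parallel jobs can finish in any order.

```shell
# Stand-in for ping so the sketch runs offline:
# each job just reports the address it was given as $1 ("_" fills $0).
results=$(printf '%s\n' 10.50.1.110 10.50.1.111 10.50.1.60 \
  | xargs -n 1 -P 4 sh -c 'echo "checked $1"' _ \
  | sort)
echo "$results"
```

-P is widely available (GNU and BSD xargs) but is not part of POSIX, so check the local man page before relying on it in portable scripts.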

Lesson 5: Command Grouping

Concept: Group commands so their combined output forms a single stream that can be piped or redirected as a unit.

Exercise 5.1: Subshell grouping

# Combine multiple outputs
(echo "=== Hypervisors ===" && grep hypervisor /tmp/hosts.txt) | cat

Exercise 5.2: Brace grouping

# Same technique with braces; braces alone do not start a subshell (each stage of a pipeline still runs in its own)
{ echo "=== Report ==="; cat /tmp/hosts.txt; echo "=== End ==="; } | cat
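
The practical difference between the two forms shows up in variable scope. A minimal sketch; the variable names `stage`, `after_subshell`, and `after_braces` are invented for the example.

```shell
stage="initial"

# Parentheses fork a subshell: the assignment is lost when it exits.
( stage="set in subshell" )
after_subshell=$stage

# Braces run in the current shell: the assignment persists.
{ stage="set in braces"; }
after_braces=$stage

echo "after ( ): $after_subshell"
echo "after { }: $after_braces"
```

The same applies to cd and other shell state: changes made inside ( ) vanish when the subshell exits, while changes made inside { } remain.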

Summary: What You Learned

Concept            Syntax            Example
Pipe               cmd1 | cmd2       cat f | grep x
Tee                tee file          Save and continue
Tee append         tee -a file       Append mode
Process sub (in)   <(cmd)            diff <(cmd1) <(cmd2)
Process sub (out)  >(cmd)            cmd > >(tee log)
xargs basic        xargs cmd         Build command line
xargs one          xargs -n 1        One arg per execution
xargs placeholder  xargs -I {}       Position arg in command
xargs parallel     xargs -P N        Run N in parallel
Subshell           (cmd1; cmd2)      Grouped in subshell
Brace group        { cmd1; cmd2; }   Grouped, no subshell

Exercises to Complete

  1. [ ] Extract all IPs, save to file, count total

  2. [ ] Compare hostnames from two files using process substitution

  3. [ ] Ping all hosts in parallel using xargs

  4. [ ] Create a report with header, data, footer using grouping

Next Session

Session 02: JSON to CSV - jq transforms, @csv output.

Session Log

Timestamp  Notes
Start      <Record when you started>
End        <Record when you finished>