Regex Session 03: Groups & Capturing

Parentheses are among the most versatile regex features. They let you group subpatterns, capture the text those subpatterns match, and extract exactly the parts you need.

Why Groups Matter

Without groups, you can only match or not match. With groups, you can:

  1. Extract specific parts of a match

  2. Apply quantifiers to multi-character patterns

  3. Create alternatives with OR logic

  4. Backreference earlier matches

Test File Setup

cat << 'EOF' > /tmp/groups-practice.txt
# Log entries
2026-03-15T10:30:45 [INFO] User admin logged in from 192.168.1.100
2026-03-15T10:31:02 [WARN] Disk space low: 15% remaining
2026-03-15T10:32:00 [ERROR] Connection failed to server-db-01.example.com:5432
2026-03-15T10:33:00 [DEBUG] Query took 145ms for endpoint /api/v1/users

# Network data
IP: 192.168.1.100 MAC: AA:BB:CC:DD:EE:FF VLAN: 100
IP: 10.50.1.20 MAC: 14:F6:D8:7B:31:80 VLAN: 10
IP: 172.16.0.1 MAC: 98:BB:1E:1F:A7:13 VLAN: 999

# Config entries
server_url=https://api.example.com:8443/v2
database_host=db-prod-01.internal:5432
cache_server=redis://cache.local:6379

# Email addresses
contact: evan.rosado@domusdigitalis.dev
support: admin+help@example.com
bounce: no-reply@service.domain.co.uk
EOF

Lesson 1: Grouping for Quantifiers

Problem: Match repeated patterns like MAC address octets.

Without grouping:

# Without a group, ? applies only to the colon, so this matches any two hex characters, not a full MAC
grep -E '[A-F0-9]{2}:?' /tmp/groups-practice.txt

With grouping:

# Group the octet+colon, repeat 5 times, then final octet
grep -oE '([A-F0-9]{2}:){5}[A-F0-9]{2}' /tmp/groups-practice.txt

Output:

AA:BB:CC:DD:EE:FF
14:F6:D8:7B:31:80
98:BB:1E:1F:A7:13

Key insight: ([A-F0-9]{2}:){5} treats the entire group as a unit and repeats it 5 times.
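
The same grouping behaviour can be checked in Python. This is a minimal sketch rather than part of the original session: MAC_RE and find_macs are hypothetical names, and the sample line is copied from the practice file. It also shows a side effect worth knowing: a repeated capture group keeps only its last repetition.

```python
import re

# Same pattern as the grep example: (octet + colon) repeated 5 times, then a final octet
MAC_RE = re.compile(r'([A-F0-9]{2}:){5}[A-F0-9]{2}')

line = "IP: 10.50.1.20 MAC: 14:F6:D8:7B:31:80 VLAN: 10"


def find_macs(text):
    """Return every full MAC address found in text."""
    # group(0) is the whole match, regardless of how many groups the pattern has
    return [m.group(0) for m in MAC_RE.finditer(text)]


print(find_macs(line))                 # ['14:F6:D8:7B:31:80']
# The quantified group was matched 5 times; only the last repetition is kept
print(MAC_RE.search(line).group(1))    # 31:
```

If you need every octet individually, splitting the full match on ":" is usually simpler than trying to capture each repetition.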

Lesson 2: Capturing with Groups

Concept: Parentheses create "capture groups" - numbered containers for matched text.

# sed uses \1, \2, etc. to reference captured groups
# Extract the log level from [LEVEL]
echo "[ERROR] Connection failed" | sed -E 's/.*\[([A-Z]+)\].*/\1/'

Output: ERROR

Breakdown:

  - \[([A-Z]+)\] - Capture uppercase letters inside brackets

  - \1 - Reference the first capture group

Exercise 2.1: Extract timestamp and level

# Capture: group 1 = timestamp, group 2 = level
echo "2026-03-15T10:30:45 [INFO] Server started" | \
  sed -E 's/^([0-9T:-]+) \[([A-Z]+)\].*/Timestamp: \1, Level: \2/'

Output: Timestamp: 2026-03-15T10:30:45, Level: INFO

Exercise 2.2: Reformat date

# Convert YYYY-MM-DD to MM/DD/YYYY
echo "2026-03-15" | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\2\/\3\/\1/'

Output: 03/15/2026
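
For comparison, here is a rough Python equivalent of the two exercises above. It is a sketch under assumptions: the function names timestamp_and_level and iso_to_us are made up for illustration. Python's re.sub accepts the same \1, \2 references in the replacement string that sed does.

```python
import re

# Group 1 = timestamp, group 2 = level (same pattern as Exercise 2.1)
LOG_RE = re.compile(r'^([0-9T:-]+) \[([A-Z]+)\]')


def timestamp_and_level(line):
    """Return (timestamp, level), or None if the line is not a log entry."""
    m = LOG_RE.match(line)
    return m.groups() if m else None


def iso_to_us(date):
    """Convert YYYY-MM-DD to MM/DD/YYYY by reordering the three captured groups."""
    return re.sub(r'([0-9]{4})-([0-9]{2})-([0-9]{2})', r'\2/\3/\1', date)


print(timestamp_and_level("2026-03-15T10:30:45 [INFO] Server started"))
# ('2026-03-15T10:30:45', 'INFO')
print(iso_to_us("2026-03-15"))
# 03/15/2026
```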

Lesson 3: Non-Capturing Groups

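Problem: You want to group but NOT capture (keeps group numbering clean and avoids storing text you never use).

Syntax: (?:…) - group without capturing (supported by PCRE, Python, and most modern engines, but not by POSIX BRE/ERE, so use grep -P rather than grep -E)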

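# Capturing group: "foo" is stored as \1, the alternation as \2
echo "foobar foobaz" | grep -oP '(foo)(bar|baz)'

# Non-capturing group: "foo" is not stored, so the alternation becomes \1
echo "foobar foobaz" | grep -oP '(?:foo)(bar|baz)'
# Both commands print the same matches (foobar, foobaz); only the group numbering differs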

Use cases for non-capturing:

  - Performance (many matches)

  - Cleaner group numbering

  - When you only need some groups
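
Because grep -o prints only the whole match, the difference between the two group types is easier to see in a language that exposes the captured groups. A minimal Python sketch, with illustrative variable names:

```python
import re

text = "foobar foobaz"

# Two capturing groups: "foo" is group 1, the alternation is group 2
capturing = re.search(r'(foo)(bar|baz)', text)

# "foo" is grouped but not captured, so the alternation becomes group 1
non_capturing = re.search(r'(?:foo)(bar|baz)', text)

# The overall match is identical in both cases
print(capturing.group(0), non_capturing.group(0))   # foobar foobar

# Only the list of captured groups differs
print(capturing.groups())       # ('foo', 'bar')
print(non_capturing.groups())   # ('bar',)
```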

Lesson 4: Alternation (OR Logic)

Syntax: (option1|option2|option3)

# Match INFO, WARN, or ERROR
grep -E '\[(INFO|WARN|ERROR)\]' /tmp/groups-practice.txt

Output:

2026-03-15T10:30:45 [INFO] User admin logged in from 192.168.1.100
2026-03-15T10:31:02 [WARN] Disk space low: 15% remaining
2026-03-15T10:32:00 [ERROR] Connection failed to server-db-01.example.com:5432
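
The parentheses matter here because | has the lowest precedence of any regex operator. The Python sketch below (hypothetical names, log lines copied from the practice file) shows both the grouped alternation and what happens when the group is left out.

```python
import re

log_lines = [
    "2026-03-15T10:30:45 [INFO] User admin logged in from 192.168.1.100",
    "2026-03-15T10:31:02 [WARN] Disk space low: 15% remaining",
    "2026-03-15T10:32:00 [ERROR] Connection failed to server-db-01.example.com:5432",
    "2026-03-15T10:33:00 [DEBUG] Query took 145ms for endpoint /api/v1/users",
]

LEVEL_RE = re.compile(r'\[(INFO|WARN|ERROR)\]')


def matching_levels(lines):
    """Return the captured level of every line that matches LEVEL_RE."""
    levels = []
    for line in lines:
        m = LEVEL_RE.search(line)
        if m:
            levels.append(m.group(1))
    return levels


print(matching_levels(log_lines))   # ['INFO', 'WARN', 'ERROR'] (DEBUG is skipped)

# Without the group, the pattern means: "\[INFO" OR "WARN" OR "ERROR\]"
ungrouped = re.compile(r'\[INFO|WARN|ERROR\]')
print(bool(ungrouped.search("WARN with no brackets")))   # True
print(bool(LEVEL_RE.search("WARN with no brackets")))    # False
```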

Exercise 4.1: Match protocol prefixes

grep -oE '(https?|redis)://' /tmp/groups-practice.txt

Output:

https://
redis://

Exercise 4.2: Match domains with common TLDs

grep -oE '[a-z0-9.-]+\.(com|dev|uk)' /tmp/groups-practice.txt

Lesson 5: Backreferences

Concept: Reference earlier capture groups within the SAME pattern.

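Syntax: \1, \2, etc. (POSIX defines backreferences only for basic regex; grep -E accepts them as a GNU extension, as it does \b, \w, and \s)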

Exercise 5.1: Find repeated words

echo "the the quick brown fox fox" | grep -oE '\b(\w+)\s+\1\b'

Output:

the the
fox fox

Breakdown:

  - (\w+) - Capture a word

  - \s+ - One or more spaces

  - \1 - Match the SAME word again

Exercise 5.2: Find matching HTML tags

echo "<div>content</div> <span>more</span>" | grep -oP '<(\w+)>.*?</\1>'

Output:

<div>content</div>
<span>more</span>
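
Both backreference exercises translate directly to Python. This is a sketch with made-up function names. It uses finditer and group(0) because re.findall would return only the captured group, not the whole match.

```python
import re


def repeated_words(text):
    """Return each 'word word' pair where the same word appears twice in a row."""
    return [m.group(0) for m in re.finditer(r'\b(\w+)\s+\1\b', text)]


def matching_tags(text):
    """Return each <tag>...</tag> span whose closing tag matches its opening tag."""
    return [m.group(0) for m in re.finditer(r'<(\w+)>.*?</\1>', text)]


print(repeated_words("the the quick brown fox fox"))
# ['the the', 'fox fox']
print(matching_tags("<div>content</div> <span>more</span>"))
# ['<div>content</div>', '<span>more</span>']
print(matching_tags("<div>mismatched</span>"))
# []
```

Like the grep version, this is only suitable for simple, non-nested tags; it is not an HTML parser.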

Lesson 6: Named Groups (PCRE)

Concept: Name your capture groups for readability.

Python syntax: (?P<name>…​)

PCRE syntax: (?<name>…​) or (?'name'…​)

# Named groups in grep -P (grep -o still prints the whole match; the names only document the pattern)
echo "IP: 192.168.1.100 VLAN: 100" | \
  grep -oP 'IP: (?<ip>[0-9.]+) VLAN: (?<vlan>[0-9]+)'

Python example:

import re

text = "IP: 192.168.1.100 VLAN: 100"
pattern = r'IP: (?P<ip>[0-9.]+) VLAN: (?P<vlan>[0-9]+)'
match = re.search(pattern, text)

print(match.group('ip'))    # 192.168.1.100
print(match.group('vlan'))  # 100
print(match.groupdict())    # {'ip': '192.168.1.100', 'vlan': '100'}

Practical Applications

Extract Email Parts

# Capture: group 1 = local, group 2 = domain
grep -oP '([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})' /tmp/groups-practice.txt

To extract JUST the domain:

# PCRE lookbehind must be fixed-length, so use \K to drop the local part and @ from the match
grep -oP '[A-Za-z0-9._%+-]+@\K[A-Za-z0-9.-]+\.[A-Za-z]{2,}' /tmp/groups-practice.txt
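
When you need both parts at once, reading the capture groups directly is often simpler than trimming the match. A minimal Python sketch, assuming the same email pattern as above; split_email is a hypothetical helper name.

```python
import re

# Group 1 = local part, group 2 = domain
EMAIL_RE = re.compile(r'([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})')


def split_email(text):
    """Return (local, domain) for the first email address in text, or None."""
    m = EMAIL_RE.search(text)
    return m.groups() if m else None


print(split_email("support: admin+help@example.com"))
# ('admin+help', 'example.com')
print(split_email("bounce: no-reply@service.domain.co.uk"))
# ('no-reply', 'service.domain.co.uk')
```

This pattern is a practical approximation for log and config scraping, not a full validator for every legal email address.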

Parse URL Components

# Match protocol, host, port, and path as named groups (grep -o prints the whole URL, not the parts)
echo "https://api.example.com:8443/v2/users" | \
  grep -oP '(?P<proto>https?)://(?P<host>[^:/]+):?(?P<port>[0-9]*)(?P<path>/.*)?'

With sed:

echo "https://api.example.com:8443/v2/users" | \
  sed -E 's|(https?)://([^:/]+):?([0-9]*)(/.*)?|Protocol: \1\nHost: \2\nPort: \3\nPath: \4|'
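
In Python the same named-group pattern yields a dictionary of components. This is a sketch, with parse_url as a hypothetical helper name; note how an optional group that takes no part in the match is reported as None, while the port group matches an empty string.

```python
import re

URL_RE = re.compile(
    r'(?P<proto>https?)://(?P<host>[^:/]+):?(?P<port>[0-9]*)(?P<path>/.*)?'
)


def parse_url(url):
    """Return a dict of URL components, or None if the URL does not match."""
    m = URL_RE.match(url)
    return m.groupdict() if m else None


print(parse_url("https://api.example.com:8443/v2/users"))
# {'proto': 'https', 'host': 'api.example.com', 'port': '8443', 'path': '/v2/users'}
print(parse_url("https://example.com"))
# {'proto': 'https', 'host': 'example.com', 'port': '', 'path': None}
```

For production code, urllib.parse.urlsplit from the standard library handles far more edge cases than this pattern does.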

Reformat Log Lines

# Transform log format: select the log lines by level, then reorder the captured fields
grep -E '\[(INFO|WARN|ERROR)\]' /tmp/groups-practice.txt | \
  sed -E 's/^([0-9T:-]+) \[([A-Z]+)\] (.*)/\2 | \1 | \3/'

Output:

INFO | 2026-03-15T10:30:45 | User admin logged in from 192.168.1.100
WARN | 2026-03-15T10:31:02 | Disk space low: 15% remaining
ERROR | 2026-03-15T10:32:00 | Connection failed to server-db-01.example.com:5432
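
A Python version of this reformatting step might combine it with named groups, so the replacement string reads by field name rather than by number. This is a sketch under assumptions: the group names ts, level, and msg and the function name reformat are illustrative.

```python
import re

LOG_RE = re.compile(r'^(?P<ts>[0-9T:-]+) \[(?P<level>[A-Z]+)\] (?P<msg>.*)')


def reformat(line):
    """Rewrite 'TIMESTAMP [LEVEL] message' as 'LEVEL | TIMESTAMP | message'."""
    # \g<name> in the replacement refers to a named group;
    # lines that do not match are returned unchanged
    return LOG_RE.sub(r'\g<level> | \g<ts> | \g<msg>', line)


print(reformat("2026-03-15T10:31:02 [WARN] Disk space low: 15% remaining"))
# WARN | 2026-03-15T10:31:02 | Disk space low: 15% remaining
print(reformat("# Log entries"))
# # Log entries
```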

Summary: Group Syntax Reference

Syntax         Name                  Use Case
(…)            Capturing group       Extract data, apply quantifiers
(?:…)          Non-capturing group   Group without capturing (PCRE)
(a|b)          Alternation           Match one of multiple options
\1, \2         Backreference         Reference earlier capture
(?P<name>…)    Named group (Python)  Readable extraction
(?<name>…)     Named group (PCRE)    Readable extraction

Exercises to Complete

  1. [ ] Extract just the port numbers from the config entries

  2. [ ] Capture and swap date format from ISO to US format

  3. [ ] Find and extract the username from "User X logged in"

  4. [ ] Match server names that end with -01 or -02

  5. [ ] Parse email into local part and domain

Self-Check

Solutions
# 1. Extract ports (config lines only; the port is followed by / or the end of the line)
grep -oP '^\w+=.*:\K[0-9]+(?=/|$)' /tmp/groups-practice.txt

# 2. Swap date format
sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\2\/\3\/\1/g' /tmp/groups-practice.txt

# 3. Extract username after "User "
grep -oP '(?<=User )\w+' /tmp/groups-practice.txt

# 4. Match server-XX-01 or -02
grep -oE '[a-z]+-[a-z]+-0[12]' /tmp/groups-practice.txt

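# 5. Parse email parts (grep -o prints only whole matches, so use sed to print each group)
sed -nE 's/^[a-z]+: ([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+)$/Local: \1, Domain: \2/p' /tmp/groups-practice.txt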

Next Session

Session 04: Lookahead & Lookbehind - Match based on context without consuming characters.