Regex Session 03: Groups & Capturing
Parentheses are among the most versatile regex features. They let you group patterns, capture data, and extract exactly what you need.
Why Groups Matter
Without groups, you can only match or not match. With groups, you can:
- Extract specific parts of a match
- Apply quantifiers to multi-character patterns
- Create alternatives with OR logic
- Backreference earlier matches
Test File Setup
cat << 'EOF' > /tmp/groups-practice.txt
# Log entries
2026-03-15T10:30:45 [INFO] User admin logged in from 192.168.1.100
2026-03-15T10:31:02 [WARN] Disk space low: 15% remaining
2026-03-15T10:32:00 [ERROR] Connection failed to server-db-01.example.com:5432
2026-03-15T10:33:00 [DEBUG] Query took 145ms for endpoint /api/v1/users
# Network data
IP: 192.168.1.100 MAC: AA:BB:CC:DD:EE:FF VLAN: 100
IP: 10.50.1.20 MAC: 14:F6:D8:7B:31:80 VLAN: 10
IP: 172.16.0.1 MAC: 98:BB:1E:1F:A7:13 VLAN: 999
# Config entries
server_url=https://api.example.com:8443/v2
database_host=db-prod-01.internal:5432
cache_server=redis://cache.local:6379
# Email addresses
contact: evan.rosado@domusdigitalis.dev
support: admin+help@example.com
bounce: no-reply@service.domain.co.uk
EOF
Lesson 1: Grouping for Quantifiers
Problem: Match repeated patterns like MAC address octets.
Without grouping:
# This only makes : optional, not the whole octet
grep -E '[A-F0-9]{2}:?' /tmp/groups-practice.txt
With grouping:
# Group the octet+colon, repeat 5 times, then final octet
grep -oE '([A-F0-9]{2}:){5}[A-F0-9]{2}' /tmp/groups-practice.txt
Output:
AA:BB:CC:DD:EE:FF
14:F6:D8:7B:31:80
98:BB:1E:1F:A7:13
Key insight: ([A-F0-9]{2}:){5} treats the entire group as a unit and repeats it 5 times.
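For comparison, here is a minimal sketch of the same grouped quantifier in Python's re module, which this session also uses for named groups. The sample line is copied from the test file; the variable names are illustrative only.

```python
import re

line = "IP: 192.168.1.100 MAC: AA:BB:CC:DD:EE:FF VLAN: 100"

# The parentheses make "two hex digits plus colon" a single unit for {5}.
mac_pattern = re.compile(r"([A-F0-9]{2}:){5}[A-F0-9]{2}")

match = mac_pattern.search(line)
print(match.group(0))  # AA:BB:CC:DD:EE:FF
# A repeated capturing group keeps only its last repetition.
print(match.group(1))  # EE:
```

Note that group 1 holds only the final repetition, so a repeated group is a way to build the whole match, not to collect each octet.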
Lesson 2: Capturing with Groups
Concept: Parentheses create "capture groups" - numbered containers for matched text.
# sed uses \1, \2, etc. to reference captured groups
# Extract the log level from [LEVEL]
echo "[ERROR] Connection failed" | sed -E 's/.*\[([A-Z]+)\].*/\1/'
Output: ERROR
Breakdown:
- \[([A-Z]+)\] - Capture uppercase letters inside brackets
- \1 - Reference the first capture group
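The same capture can be read directly in Python, where the group is returned as a value instead of being substituted into the line. This is a sketch assuming the standard re module; the variable names are illustrative.

```python
import re

line = "[ERROR] Connection failed"

# \[ and \] match literal brackets; ([A-Z]+) captures the level between them.
match = re.search(r"\[([A-Z]+)\]", line)
level = match.group(1) if match else None
print(level)  # ERROR
```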
Exercise 2.1: Extract timestamp and level
# Capture: group 1 = timestamp, group 2 = level
echo "2026-03-15T10:30:45 [INFO] Server started" | \
sed -E 's/^([0-9T:-]+) \[([A-Z]+)\].*/Timestamp: \1, Level: \2/'
Output: Timestamp: 2026-03-15T10:30:45, Level: INFO
Exercise 2.2: Reformat date
# Convert YYYY-MM-DD to MM/DD/YYYY
echo "2026-03-15" | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\2\/\3\/\1/'
Output: 03/15/2026
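The reordering trick carries over to Python's re.sub, which uses the same \1, \2, \3 references in its replacement string. A minimal sketch, with illustrative variable names:

```python
import re

iso_date = "2026-03-15"

# Groups: 1 = year, 2 = month, 3 = day; the replacement reorders them.
us_date = re.sub(r"([0-9]{4})-([0-9]{2})-([0-9]{2})", r"\2/\3/\1", iso_date)
print(us_date)  # 03/15/2026
```

Because the replacement is a raw string and the delimiter is not /, the slashes need no escaping here.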
Lesson 3: Non-Capturing Groups
Problem: You want to group but NOT capture (the engine stores no submatch, and group numbering stays simpler).
Syntax: (?:…) - group without capturing (not part of POSIX BRE/ERE; with grep, use -P)
# Two capturing groups: (foo) is \1 and (bar|baz) is \2
echo "foobar foobaz" | grep -oP '(foo)(bar|baz)'
# Non-capturing first group: (bar|baz) is now \1
echo "foobar foobaz" | grep -oP '(?:foo)(bar|baz)'
# Both commands print foobar and foobaz; only the group numbering differs
Use cases for non-capturing:
- Performance (many matches)
- Cleaner group numbering
- When you only need some groups
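The effect on group numbering is easier to see where groups can be inspected individually. A small sketch using Python's re.findall, which returns the captured groups and not the whole match whenever the pattern contains groups:

```python
import re

text = "foobar foobaz"

# Two capturing groups: findall returns a (group 1, group 2) tuple per match.
both = re.findall(r"(foo)(bar|baz)", text)
# One capturing group: (bar|baz) is now group 1, so findall returns strings.
suffix_only = re.findall(r"(?:foo)(bar|baz)", text)

print(both)         # [('foo', 'bar'), ('foo', 'baz')]
print(suffix_only)  # ['bar', 'baz']
```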
Lesson 4: Alternation (OR Logic)
Syntax: (option1|option2|option3)
# Match INFO, WARN, or ERROR
grep -E '\[(INFO|WARN|ERROR)\]' /tmp/groups-practice.txt
Output:
2026-03-15T10:30:45 [INFO] User admin logged in from 192.168.1.100
2026-03-15T10:31:02 [WARN] Disk space low: 15% remaining
2026-03-15T10:32:00 [ERROR] Connection failed to server-db-01.example.com:5432
Exercise 4.1: Match protocol prefixes
grep -oE '(https?|redis)://' /tmp/groups-practice.txt
Output:
https://
redis://
Exercise 4.2: Match domains with common TLDs
grep -oE '[a-z0-9.-]+\.(com|dev|uk)' /tmp/groups-practice.txt
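The parentheses matter because | has the lowest precedence of any regex operator: without a group it splits the entire pattern in two. The sketch below, in Python with a line taken from the test file, contrasts the two forms; the variable names are illustrative.

```python
import re

line = "2026-03-15T10:31:02 [WARN] Disk space low: 15% remaining"

# Without a group, | splits the whole pattern: "\[INFO" OR "WARN\]".
ungrouped = re.search(r"\[INFO|WARN\]", line)
# With a group, only the level names are alternatives.
grouped = re.search(r"\[(INFO|WARN|ERROR)\]", line)

print(ungrouped.group(0))  # WARN]
print(grouped.group(0))    # [WARN]
print(grouped.group(1))    # WARN
```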
Lesson 5: Backreferences
Concept: Reference earlier capture groups within the SAME pattern.
Syntax: \1, \2, etc.
Exercise 5.1: Find repeated words
# Backreferences, \b, \w, and \s in -E mode are GNU grep extensions
echo "the the quick brown fox fox" | grep -oE '\b(\w+)\s+\1\b'
Output:
the the
fox fox
Breakdown:
- (\w+) - Capture a word
- \s+ - One or more whitespace characters
- \1 - Match the SAME word again
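The breakdown above can be sketched in Python, where the same pattern works unchanged. This assumes the standard re module; the variable names are illustrative.

```python
import re

text = "the the quick brown fox fox"

# (\w+) captures a word; \1 requires the identical text to appear again.
repeats = [m.group(0) for m in re.finditer(r"\b(\w+)\s+\1\b", text)]
print(repeats)  # ['the the', 'fox fox']
```

finditer is used here because findall would return only the captured word and not the doubled pair.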
Exercise 5.2: Find matching HTML tags
echo "<div>content</div> <span>more</span>" | grep -oP '<(\w+)>.*?</\1>'
Output:
<div>content</div>
<span>more</span>
Lesson 6: Named Groups (PCRE)
Concept: Name your capture groups for readability.
Python syntax: (?P<name>…)
PCRE syntax: (?<name>…) or (?'name'…)
# grep -P accepts named groups, but -o still prints the whole match
echo "IP: 192.168.1.100 VLAN: 100" | \
grep -oP 'IP: (?<ip>[0-9.]+) VLAN: (?<vlan>[0-9]+)'
Python example:
import re
text = "IP: 192.168.1.100 VLAN: 100"
pattern = r'IP: (?P<ip>[0-9.]+) VLAN: (?P<vlan>[0-9]+)'
match = re.search(pattern, text)
print(match.group('ip')) # 192.168.1.100
print(match.group('vlan')) # 100
print(match.groupdict()) # {'ip': '192.168.1.100', 'vlan': '100'}
Practical Applications
Extract Email Parts
# Capture: group 1 = local, group 2 = domain
grep -oP '([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})' /tmp/groups-practice.txt
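To extract JUST the domain, use \K to drop the local part from the output (PCRE rejects an unbounded lookbehind such as (?<=[A-Za-z0-9._%+-]+@)):
grep -oP '[A-Za-z0-9._%+-]+@\K[A-Za-z0-9.-]+\.[A-Za-z]{2,}' /tmp/groups-practice.txt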
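Since grep -o always prints the whole match, reading the two capture groups separately needs a tool with group access. A minimal sketch in Python, using one address from the test file and illustrative variable names:

```python
import re

entry = "support: admin+help@example.com"

# Group 1 = local part, group 2 = domain (same pattern as the grep example).
email_pattern = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")
match = email_pattern.search(entry)
local_part, domain = match.groups()

print(local_part)  # admin+help
print(domain)      # example.com
```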
Parse URL Components
# Groups: proto, host, port, path (grep -o still prints the whole URL)
echo "https://api.example.com:8443/v2/users" | \
grep -oP '(?P<proto>https?)://(?P<host>[^:/]+):?(?P<port>[0-9]*)(?P<path>/.*)?'
With sed (\n in the replacement is a GNU sed extension):
echo "https://api.example.com:8443/v2/users" | \
sed -E 's|(https?)://([^:/]+):?([0-9]*)(/.*)?|Protocol: \1\nHost: \2\nPort: \3\nPath: \4|'
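Named groups pay off most when the parts are consumed by name. The sketch below applies the same URL pattern in Python and collects the pieces with groupdict(); it assumes the standard re module, and the variable names are illustrative.

```python
import re

url = "https://api.example.com:8443/v2/users"

# Same pattern as the grep example, written with Python's (?P<name>...) syntax.
url_pattern = re.compile(
    r"(?P<proto>https?)://(?P<host>[^:/]+):?(?P<port>[0-9]*)(?P<path>/.*)?"
)
parts = url_pattern.match(url).groupdict()

print(parts)
# {'proto': 'https', 'host': 'api.example.com', 'port': '8443', 'path': '/v2/users'}
```

The port comes back as a string, and as an empty string when the URL has none, because [0-9]* is allowed to match nothing.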
Reformat Log Lines
# Transform log format (select INFO/WARN/ERROR lines, then reorder the fields)
grep -E '\[(INFO|WARN|ERROR)\]' /tmp/groups-practice.txt | \
sed -E 's/^([0-9T:-]+) \[([A-Z]+)\] (.*)/\2 | \1 | \3/'
Output:
INFO | 2026-03-15T10:30:45 | User admin logged in from 192.168.1.100
WARN | 2026-03-15T10:31:02 | Disk space low: 15% remaining
ERROR | 2026-03-15T10:32:00 | Connection failed to server-db-01.example.com:5432
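A rough Python equivalent of the reordering step, shown on two lines copied from the test file. This is a sketch assuming the standard re module; the list and variable names are illustrative.

```python
import re

log_lines = [
    "2026-03-15T10:30:45 [INFO] User admin logged in from 192.168.1.100",
    "2026-03-15T10:33:00 [DEBUG] Query took 145ms for endpoint /api/v1/users",
]

# Groups: 1 = timestamp, 2 = level, 3 = message.
log_pattern = re.compile(r"^([0-9T:-]+) \[([A-Z]+)\] (.*)")

reformatted = [log_pattern.sub(r"\2 | \1 | \3", line) for line in log_lines]
for line in reformatted:
    print(line)
# INFO | 2026-03-15T10:30:45 | User admin logged in from 192.168.1.100
# DEBUG | 2026-03-15T10:33:00 | Query took 145ms for endpoint /api/v1/users
```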
Summary: Group Syntax Reference
| Syntax | Name | Use Case |
|---|---|---|
| (…) | Capturing group | Extract data, apply quantifiers |
| (?:…) | Non-capturing group | Group without capturing (PCRE) |
| (a\|b) | Alternation | Match one of multiple options |
| \1 | Backreference | Reference earlier capture |
| (?P<name>…) | Named group (Python) | Readable extraction |
| (?<name>…) | Named group (PCRE) | Readable extraction |
Exercises to Complete
- [ ] Extract just the port numbers from the config entries
- [ ] Capture and swap date format from ISO to US format
- [ ] Find and extract the username from "User X logged in"
- [ ] Match server names that end with -01 or -02
- [ ] Parse email into local part and domain
Self-Check
Solutions
# 1. Extract ports (config lines only; the port is followed by / or end of line)
grep -oP '^[a-z_]+=.*:\K[0-9]+(?=/|$)' /tmp/groups-practice.txt
# 2. Swap date format
sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\2\/\3\/\1/g' /tmp/groups-practice.txt
# 3. Extract username after "User "
grep -oP '(?<=User )\w+' /tmp/groups-practice.txt
# 4. Match server-XX-01 or -02
grep -oE '[a-z]+-[a-z]+-0[12]' /tmp/groups-practice.txt
# 5. Parse email parts
grep -oP '([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+)' /tmp/groups-practice.txt
Next Session
Session 04: Lookahead & Lookbehind - Match based on context without consuming characters.