Drill 05: Groups & Backreferences
Capturing groups extract parts of matches and enable powerful backreference substitutions. This is where regex becomes truly useful for data transformation.
Core Concepts
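| Syntax | Meaning | Example |
|---|---|---|
| `(...)` | Capturing group | `(\w+)` |
| `\1`, `\2` | Backreference to group | `(\w+) \1` |
| `(?:...)` | Non-capturing group | `(?:Mrs?\. )?` |
| `(?P<name>...)` | Named group (Python) | `(?P<year>\d{4})` |
| `(?<name>...)` | Named group (PCRE) | `(?<year>\d{4})` |
| `\K` | Reset match start | `host: \K[\w-]+` |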
Why Groups Matter
- Extraction - Pull specific parts from complex patterns
- Backreferences - Match repeated text
- Replacement - Rearrange captured text
- Alternation - Group alternatives together
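The four uses listed above can be sketched in Python. This is a minimal illustration rather than part of the drill script; the sample `line` and the variable names are made up for the example.

```python
import re

# Hypothetical sample line, invented for this sketch
line = "2026-03-15 host=web-01 host=web-01 status=ok"

# Extraction: pull one part out of a larger pattern
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", line)
year = date.group(1)

# Backreference: \1 must repeat exactly what group 1 matched
repeated = re.search(r"(host=[\w-]+) \1", line)

# Replacement: rearrange the captured text
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", line)

# Alternation: group the alternatives so "status=" applies to both
status = re.search(r"status=(?:ok|fail)", line)

print(year)
print(repeated.group(1))
print(swapped)
print(status.group(0))
```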
Interactive CLI Drill
bash ~/atelier/_bibliotheca/domus-captures/docs/modules/ROOT/examples/regex-drills/05-groups.sh
Exercise Set 1: Basic Capturing
cat << 'EOF' > /tmp/ex-groups.txt
192.168.1.100
10.50.1.20
host: server-01
host: db-prod-02
user@example.com
admin@company.org
key="value123"
setting="production"
EOF
Ex 1.1: Extract first two octets of IP
Solution
grep -oP '(\d+\.\d+)\.\d+\.\d+' /tmp/ex-groups.txt
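# grep -o prints the whole match (the full IP), not group 1
# To print only the first two octets, move the rest into a lookahead:
grep -oP '(\d+\.\d+)(?=\.\d+\.\d+)' /tmp/ex-groups.txt
# Or use sed, which can print group 1 directly: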
sed -nE 's/^([0-9]+\.[0-9]+)\..*/\1/p' /tmp/ex-groups.txt
Output: 192.168, 10.50
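Since grep cannot print a single group, it may help to see the same extraction where groups are directly accessible. A small Python sketch, assuming a few lines copied from the exercise file into a list named `lines` (a name chosen for this example):

```python
import re

# A few lines from /tmp/ex-groups.txt, inlined for the example
lines = ["192.168.1.100", "10.50.1.20", "host: server-01"]

# Group 1 holds the first two octets; the rest must be present but is not captured
pattern = re.compile(r"^(\d+\.\d+)\.\d+\.\d+$")

prefixes = [m.group(1) for line in lines if (m := pattern.match(line))]
print(prefixes)  # ['192.168', '10.50']
```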
Ex 1.2: Extract hostname from "host: NAME"
Solution
grep -oP 'host: \K[\w-]+' /tmp/ex-groups.txt
\K resets the match start, so only the hostname is returned.
Output: server-01, db-prod-02
Ex 1.3: Extract username from email
Solution
grep -oP '^[^@]+(?=@)' /tmp/ex-groups.txt
# Or match through the @ and strip it with sed:
grep -oP '^([^@]+)@' /tmp/ex-groups.txt | sed 's/@$//'
Output: user, admin
Ex 1.4: Extract value from key="value"
Solution
grep -oP '(?<=")[^"]+(?=")' /tmp/ex-groups.txt
# Or with \K (any key name):
grep -oP '\w+="\K[^"]+' /tmp/ex-groups.txt
Output: value123, production
Exercise Set 2: Backreferences
cat << 'EOF' > /tmp/ex-backref.txt
the the quick fox
hello hello world
no no no more
yes yes
test TEST
abab
abba
12-34-12
aa:bb:cc:aa
EOF
Ex 2.1: Find repeated words
Solution
grep -P '\b(\w+)\s+\1\b' /tmp/ex-backref.txt
\1 matches the same text that group 1 captured.
Output: the lines containing "the the", "hello hello", "no no", and "yes yes". "test TEST" is skipped because the match is case-sensitive.
Ex 2.2: Find repeated characters (aabb pattern)
Solution
grep -P '(.)\1' /tmp/ex-backref.txt
Matching lines: "hello hello world" (ll), "abba" (bb), "aa:bb:cc:aa" (aa, bb, cc)
Ex 2.3: Find palindrome-like patterns (abba)
Solution
grep -P '(.)(.)\2\1' /tmp/ex-backref.txt
Matches "abba" pattern: capture a, capture b, b again, a again.
Ex 2.4: Find matching octets in sequence
Solution
grep -P '(\d+)-\d+-\1' /tmp/ex-backref.txt
Output: 12-34-12 (the first and third fields are both 12)
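The backreference patterns from this set can also be checked side by side in Python. The sketch below runs them over a subset of the exercise lines; the dictionary keys are labels invented for this example, and the case-insensitive variant is an addition showing how "test TEST" can be made to match.

```python
import re

# A subset of /tmp/ex-backref.txt, inlined for the example
lines = ["the the quick fox", "test TEST", "abab", "abba", "12-34-12"]

checks = {
    "repeated word": re.compile(r"\b(\w+)\s+\1\b"),
    "abba shape": re.compile(r"(.)(.)\2\1"),
    "first field repeats": re.compile(r"(\d+)-\d+-\1"),
}

# For each pattern, collect the lines it matches
results = {
    name: [line for line in lines if rx.search(line)]
    for name, rx in checks.items()
}
print(results)

# With re.IGNORECASE the backreference also compares case-insensitively
ignore_case = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)
print(bool(ignore_case.search("test TEST")))  # True
```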
Exercise Set 3: Replacement with Groups
Ex 3.1: Swap first and last name
Solution
echo "John Smith" | sed -E 's/(\w+) (\w+)/\2, \1/'
Output: Smith, John
Ex 3.2: Reformat date from MM/DD/YYYY to YYYY-MM-DD
Solution
echo "03/15/2026" | sed -E 's/([0-9]{2})\/([0-9]{2})\/([0-9]{4})/\3-\1-\2/'
Output: 2026-03-15
Ex 3.3: Add quotes around values
Solution
echo "key=value" | sed -E 's/(.+)=(.+)/\1="\2"/'
Output: key="value"
Ex 3.4: Extract domain from URL
Solution
echo "https://www.example.com/path" | sed -E 's|https?://([^/]+).*|\1|'
Output: www.example.com
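For comparison, the same kind of replacement in Python uses `re.sub`. A minimal sketch: the short group names `m`, `d`, `y` and the `padded` example are made up here to show the `\g<...>` forms.

```python
import re

# \1 and \2 in the replacement refer to captured groups, as in sed
name = re.sub(r"(\w+) (\w+)", r"\2, \1", "John Smith")

# \g<name> refers to a named group
date = re.sub(
    r"(?P<m>\d{2})/(?P<d>\d{2})/(?P<y>\d{4})",
    r"\g<y>-\g<m>-\g<d>",
    "03/15/2026",
)

# \g<1> avoids ambiguity when a literal digit follows the reference
padded = re.sub(r"(\d+)", r"\g<1>0", "port 808")

print(name)    # Smith, John
print(date)    # 2026-03-15
print(padded)  # port 8080
```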
Exercise Set 4: Non-Capturing Groups
Ex 4.1: Match phone formats without capturing area code
Solution
# Capturing: Each () creates a group
echo "555-123-4567" | grep -oP '(\d{3})-(\d{3})-(\d{4})'
# Groups: 1=555, 2=123, 3=4567
# Non-capturing group for the area code:
echo "555-123-4567" | grep -oP '(?:\d{3}-)(\d{3}-\d{4})'
# Only one group is captured (1=123-4567); grep -o still prints the whole match
Ex 4.2: Optional prefix without capturing
Solution
# Match "Mr." or "Mrs." optionally, capture name
echo -e "Mr. Smith\nJones\nMrs. Davis" | grep -oP '(?:Mrs?\. )?(\w+)'
The (?:Mrs?\. )? group makes the prefix optional without capturing it.
Ex 4.3: Alternation with non-capturing group
Solution
# Match file extensions without capturing them
echo -e "file.txt\nscript.sh\nimage.png" | grep -oP '\w+\.(?:txt|sh|png)'
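Because `grep -o` always prints the whole match, the effect of `(?:...)` is easier to see in a language that exposes the groups. A short Python sketch using the same phone and file-extension examples (variable names are chosen for this illustration):

```python
import re

phone = "555-123-4567"
capturing = re.search(r"(\d{3})-(\d{3})-(\d{4})", phone)
non_capturing = re.search(r"(?:\d{3}-)(\d{3}-\d{4})", phone)

print(capturing.groups())      # ('555', '123', '4567')
print(non_capturing.groups())  # ('123-4567',)
print(non_capturing.group(0))  # 555-123-4567 (the whole match is unchanged)

# findall returns group contents when the pattern has a capturing group,
# and whole matches when it has none
files = "file.txt script.sh image.png"
with_group = re.findall(r"\w+\.(txt|sh|png)", files)
without_group = re.findall(r"\w+\.(?:txt|sh|png)", files)
print(with_group)     # ['txt', 'sh', 'png']
print(without_group)  # ['file.txt', 'script.sh', 'image.png']
```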
Exercise Set 5: Named Groups (Python)
import re
# Named groups with (?P<name>...)
pattern = re.compile(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
)
text = "Date: 2026-03-15"
match = pattern.search(text)
if match:
    # Access by name
    print(f"Year: {match.group('year')}")
    print(f"Month: {match.group('month')}")
    print(f"Day: {match.group('day')}")
    # Or as dict
    print(match.groupdict())
    # {'year': '2026', 'month': '03', 'day': '15'}
Ex 5.1: Parse log entry with named groups
Solution
import re
log = "2026-03-15T10:30:45 [ERROR] Connection failed"
pattern = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}T[\d:]+)\s+'
    r'\[(?P<level>\w+)\]\s+'
    r'(?P<message>.+)'
)
match = pattern.match(log)
if match:
    data = match.groupdict()
    print(f"Level: {data['level']}")
    print(f"Message: {data['message']}")
Ex 5.2: Parse MAC address into octets
Solution
import re
mac = "AA:BB:CC:DD:EE:FF"
pattern = re.compile(
    r'(?P<o1>[0-9A-Fa-f]{2}):'
    r'(?P<o2>[0-9A-Fa-f]{2}):'
    r'(?P<o3>[0-9A-Fa-f]{2}):'
    r'(?P<o4>[0-9A-Fa-f]{2}):'
    r'(?P<o5>[0-9A-Fa-f]{2}):'
    r'(?P<o6>[0-9A-Fa-f]{2})'
)
match = pattern.match(mac)
if match:
    octets = [match.group(f'o{i}') for i in range(1, 7)]
    print(f"OUI: {':'.join(octets[:3])}")
Real-World Applications
Professional: Parse ISE Logs
# Extract MAC and result from ISE logs
grep -oP 'Calling-Station-ID=\K[0-9A-F:-]+' /var/log/ise-psc.log
# Parse authentication result
grep -oP '(?<=AuthenticationResult=)\w+' /var/log/ise-psc.log
Professional: Reformat Config Files
# Convert "interface Gi0/1" to "Gi0/1"
sed -E 's/interface (.*)/\1/' config.txt
# Add VLAN prefix: "100" → "vlan 100"
sed -E 's/^([0-9]+)$/vlan \1/' vlans.txt
Professional: Extract IP:Port
# Match "10.50.1.20:8080" with named groups (grep -o still prints the whole match)
echo "10.50.1.20:8080" | grep -oP '(?P<ip>[\d.]+):(?P<port>\d+)'
# Using sed for extraction
echo "10.50.1.20:8080" | sed -E 's/(.+):(.+)/IP=\1 PORT=\2/'
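When the two parts are needed as separate values, named groups in Python are one option. The sketch below is illustrative only: `ENDPOINT_RE` and `parse_endpoint` are hypothetical names, and the pattern checks the shape of the address, not whether each octet is in range.

```python
import re

# Hypothetical helper for this example; validates shape only, not octet ranges
ENDPOINT_RE = re.compile(r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+)")

def parse_endpoint(text):
    """Return (ip, port) for 'a.b.c.d:port', or None if the text does not fit."""
    m = ENDPOINT_RE.fullmatch(text)
    if m is None:
        return None
    return m.group("ip"), int(m.group("port"))

print(parse_endpoint("10.50.1.20:8080"))  # ('10.50.1.20', 8080)
print(parse_endpoint("not-an-endpoint"))  # None
```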
Personal: Reformat Dates in Notes
# Move the year to the front: "March 15, 2026" becomes "2026-March-15"
# (converting the month name to 03 needs more logic than a single substitution)
sed -E 's/(\w+) ([0-9]+), ([0-9]{4})/\3-\1-\2/' notes.txt
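The month-name conversion that the sed one-liner leaves out could be handled with a replacement function. A minimal Python sketch, assuming English month names spelled out in full; `MONTHS`, `DATE_RE`, `to_iso`, and `reformat_dates` are names invented for this example.

```python
import re

# Assumption: full English month names; abbreviations are not handled
MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

DATE_RE = re.compile(r"\b([A-Z][a-z]+) (\d{1,2}), (\d{4})\b")

def to_iso(match):
    """Build YYYY-MM-DD from the three groups; leave unknown month names alone."""
    month = MONTHS.get(match.group(1))
    if month is None:
        return match.group(0)
    return f"{match.group(3)}-{month:02d}-{int(match.group(2)):02d}"

def reformat_dates(text):
    # re.sub calls to_iso once per match and inserts its return value
    return DATE_RE.sub(to_iso, text)

print(reformat_dates("Met on March 15, 2026."))  # Met on 2026-03-15.
```

Passing a function to `re.sub` keeps the pattern simple: the groups only locate the pieces, and the lookup table does the conversion.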
Personal: Clean Phone Numbers
# Standardize "(555) 123-4567" to "555-123-4567"
sed -E 's/\(([0-9]{3})\) ([0-9]{3})-([0-9]{4})/\1-\2-\3/' contacts.txt
Personal: Extract Tags from Notes
# Find #hashtags
grep -oP '#\w+' ~/notes/*.md
# Find @mentions
grep -oP '@\w+' ~/notes/*.md
Tool Variants
grep: Group Extraction
# Use \K to extract after pattern
grep -oP 'user=\K\w+' file.txt
# Multiple groups with sed pipe
echo "a=1 b=2" | grep -oP '\w+=\d+' | sed 's/=/ → /'
sed: Group Replacement
# Basic group reference
sed -E 's/(foo)(bar)/\2\1/' file.txt # foobar → barfoo
# Multiple groups
sed -E 's/([A-Z]+):([0-9]+)/ID:\1 COUNT:\2/' file.txt
# Entire match reference with &
sed -E 's/[0-9]+/[&]/' file.txt # 123 → [123]
awk: Using match() with groups
# GNU awk: 3-argument match() stores the groups in array a (gawk extension)
echo "key=value" | gawk 'match($0, /(\w+)=(\w+)/, a) {print "Key:", a[1], "Val:", a[2]}'
# Standard awk extraction
echo "user@domain" | awk -F'@' '{print "User:", $1, "Domain:", $2}'
vim: Group Replacement
" Swap words
:%s/\(\w\+\) \(\w\+\)/\2 \1/g
" Add brackets around numbers
:%s/\([0-9]\+\)/[\1]/g
" Case conversion with groups: uppercase the first character of each word
:%s/\<\(\w\)/\u\1/g
Python: Groups and Named Groups
import re
text = "Server: web-01 IP: 10.50.1.100"
# Numbered groups
pattern = re.compile(r'Server: ([\w-]+) IP: ([\d.]+)')
match = pattern.search(text)
if match:
    print(f"Server: {match.group(1)}") # web-01
    print(f"IP: {match.group(2)}") # 10.50.1.100
# Named groups
pattern = re.compile(r'Server: (?P<name>[\w-]+) IP: (?P<ip>[\d.]+)')
match = pattern.search(text)
if match:
    data = match.groupdict()
    print(data) # {'name': 'web-01', 'ip': '10.50.1.100'}
The \K Trick (PCRE)
\K resets the match start - everything before it is required but not included in the match.
# Extract value after "key="
echo "key=secret123" | grep -oP 'key=\K\w+'
# Output: secret123 (not key=secret123)
# Extract text between markers
echo "START:content:END" | grep -oP 'START:\K[^:]+(?=:END)'
# Output: content
# Equivalent to positive lookbehind but more flexible
# \K works with variable-length patterns; lookbehind doesn't
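Python's built-in `re` module does not accept `\K`, so the usual equivalent there is a capturing group. The sketch below mirrors the two grep examples and also shows the variable-length case; the spaced-out `key   = secret123` input is made up for illustration.

```python
import re

# Capture the part after the prefix instead of resetting the match start
value = re.search(r"key=(\w+)", "key=secret123").group(1)
between = re.search(r"START:([^:]+):END", "START:content:END").group(1)
print(value)    # secret123
print(between)  # content

# A variable-length prefix is fine in front of a group...
spaced = re.search(r"key\s*=\s*(\w+)", "key   = secret123").group(1)
print(spaced)   # secret123

# ...but the same prefix is rejected by re when written as a lookbehind
try:
    re.compile(r"(?<=key\s*=\s*)\w+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)  # False
```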
Gotchas
Group Numbering
# Groups are numbered left to right by opening parenthesis
# Pattern: ((a)(b))(c)
# Group 1: ((a)(b)) = "ab"
# Group 2: (a) = "a"
# Group 3: (b) = "b"
# Group 4: (c) = "c"
echo "abc" | sed -E 's/((a)(b))(c)/\1-\2-\3-\4/'
# Output: ab-a-b-c
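The same numbering rule can be confirmed in Python, where `groups()` lists every group in order. In this short sketch the second pattern is an added variation showing that a non-capturing outer group does not take a number.

```python
import re

nested = re.match(r"((a)(b))(c)", "abc")
print(nested.groups())  # ('ab', 'a', 'b', 'c')

# Making the outer group non-capturing shifts the remaining numbers down
flat = re.match(r"(?:(a)(b))(c)", "abc")
print(flat.groups())    # ('a', 'b', 'c')

# A compiled pattern reports how many capturing groups it contains
group_count = re.compile(r"((a)(b))(c)").groups
print(group_count)      # 4
```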
sed Group Syntax
# BRE (default): Escape parentheses
sed 's/\(foo\)/[\1]/' file.txt
# ERE (-E flag): No escape needed
sed -E 's/(foo)/[\1]/' file.txt
Empty Groups
# Optional group might be empty
echo "hello" | grep -oP '(\d+)?hello'
# Group 1 is empty/unset if no digits
# Handle in Python:
import re
match = re.search(r'(\d+)?hello', 'hello')
print(match.group(1)) # None
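A few ways to cope with that `None` are sketched below. The fallback value `"0"` and the variable names are arbitrary choices for this example.

```python
import re

pattern = re.compile(r"(\d+)?hello")

# Group 1 did not participate in the match, so it is None
missing = pattern.search("hello")
count = missing.group(1) or "0"           # fall back to a default
as_strings = missing.groups(default="")   # replace None for every group at once

# When digits are present, the group holds them as usual
present = pattern.search("42hello")

# Alternative: (\d*) always participates, so it yields '' instead of None
always = re.search(r"(\d*)hello", "hello")

print(count)             # 0
print(as_strings)        # ('',)
print(present.group(1))  # 42
print(repr(always.group(1)))  # ''
```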
Greedy Groups
# Greedy captures too much
echo "<tag>a</tag><tag>b</tag>" | grep -oP '<tag>(.+)</tag>'
# Group 1: a</tag><tag>b
# Use lazy or negated class
echo "<tag>a</tag><tag>b</tag>" | grep -oP '<tag>([^<]+)</tag>'
# Group 1: a (first match), b (second match)
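The group contents described in the comments above can be inspected directly with `re.findall`, which returns group 1 for each match. This sketch also includes the lazy quantifier `.+?`, which the comment mentions but the grep example does not show.

```python
import re

html = "<tag>a</tag><tag>b</tag>"

greedy = re.findall(r"<tag>(.+)</tag>", html)       # .+ runs to the last </tag>
lazy = re.findall(r"<tag>(.+?)</tag>", html)        # .+? stops at the first </tag>
negated = re.findall(r"<tag>([^<]+)</tag>", html)   # [^<]+ cannot cross a tag

print(greedy)   # ['a</tag><tag>b']
print(lazy)     # ['a', 'b']
print(negated)  # ['a', 'b']
```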
Key Takeaways
| Concept | Usage |
|---|---|
| `(...)` | Capture for extraction or backreference |
| `\1`, `\2` | Reference captured groups in pattern/replacement |
| `(?:...)` | Group without capturing (efficiency) |
| `(?P<name>...)` | Named group (Python/PCRE) |
| `\K` | Reset match start (PCRE) - like lookbehind |
| `match.group(n)` | Access group n in Python |
| `match.groupdict()` | Get all named groups as dict |
Self-Test
- What does `(\w+) \1` match?
- How do you reference group 2 in sed replacement?
- What's the difference between `(a)` and `(?:a)`?
- How do you access a named group "year" in Python?
- What does `\K` do in PCRE?
Answers
- A word followed by a space and the same word again (a repeated word)
- `\2`
- `(a)` captures; `(?:a)` only groups (no capture)
- `match.group('year')` or `match.groupdict()['year']`
- Resets the match start: content before `\K` is required but not included in the result
Next Drill
Drill 06: Alternation - Master the | (OR) operator and proper grouping.