Drill 05: Groups & Backreferences

Capturing groups extract parts of a match and let you reuse them, either later in the same pattern (backreferences) or in a replacement string. This is where regex becomes useful for data transformation.

Core Concepts

Syntax	Meaning	Example
`(pattern)`	Capturing group	`(abc)` captures "abc"
`\1`, `\2`…	Backreference to group	`(\w+) \1` matches "the the"
`(?:pattern)`	Non-capturing group	`(?:abc)+` groups without capturing
`(?P<name>pattern)`	Named group (Python)	`(?P<year>\d{4})` names it "year"
`(?<name>pattern)`	Named group (PCRE)	`(?<year>\d{4})` names it "year"
`\K`	Reset match start	`foo\Kbar` matches only "bar"

Why Groups Matter

Extraction - Pull specific parts from complex patterns
Backreferences - Match repeated text
Replacement - Rearrange captured text
Alternation - Group alternatives together
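
The four uses above can be sketched with Python's `re` module. This is only an illustration: the sample strings (`db-prod-02:5432`, `image.png`, and so on) are assumptions chosen for the example, not data from the drill files.

```python
import re

# Extraction: pull the port out of a host:port string.
port = re.search(r':(\d+)$', 'db-prod-02:5432').group(1)

# Backreference: \1 must repeat whatever group 1 matched.
has_repeat = bool(re.search(r'\b(\w+) \1\b', 'the the quick fox'))

# Replacement: rearrange captured text.
swapped = re.sub(r'(\w+) (\w+)', r'\2, \1', 'John Smith')

# Alternation: the group limits the scope of "|".
is_image = bool(re.fullmatch(r'\w+\.(?:png|jpg)', 'image.png'))

print(port, has_repeat, swapped, is_image)
```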

Interactive CLI Drill

bash ~/atelier/_bibliotheca/domus-captures/docs/modules/ROOT/examples/regex-drills/05-groups.sh

Exercise Set 1: Basic Capturing

cat << 'EOF' > /tmp/ex-groups.txt
192.168.1.100
10.50.1.20
host: server-01
host: db-prod-02
user@example.com
admin@company.org
key="value123"
setting="production"
EOF

Ex 1.1: Extract first two octets of IP

Solution

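grep -oP '(\d+\.\d+)\.\d+\.\d+' /tmp/ex-groups.txt
# -o prints the whole match (the full IP); grep cannot print a single group

# To print only the first two octets, move the rest into a lookahead:
grep -oP '(\d+\.\d+)(?=\.\d+\.\d+)' /tmp/ex-groups.txt
# Or use sed, which can print group 1 directly:
sed -nE 's/^([0-9]+\.[0-9]+)\..*/\1/p' /tmp/ex-groups.txt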

Output: 192.168, 10.50

Ex 1.2: Extract hostname from "host: NAME"

Solution

grep -oP 'host: \K[\w-]+' /tmp/ex-groups.txt

\K resets match, so only hostname is returned. Output: server-01, db-prod-02
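
Python's built-in `re` module does not support `\K`, so the same extraction there usually relies on a capturing group. A minimal sketch, with sample lines copied from the exercise file:

```python
import re

lines = ["host: server-01", "host: db-prod-02", "user@example.com"]

# Group 1 plays the role of "everything after \K".
hosts = [m.group(1) for line in lines
         if (m := re.search(r'host: ([\w-]+)', line))]

print(hosts)  # ['server-01', 'db-prod-02']
```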

Ex 1.3: Extract username from email

Solution

grep -oP '^[^@]+(?=@)' /tmp/ex-groups.txt
# Or match through the "@" and strip it with sed:
grep -oP '^([^@]+)@' /tmp/ex-groups.txt | sed 's/@$//'

Output: user, admin

Ex 1.4: Extract value from key="value"

Solution

grep -oP '(?<=")[^"]+(?=")' /tmp/ex-groups.txt
# Or with \K (any key name, so both lines match):
grep -oP '\w+="\K[^"]+' /tmp/ex-groups.txt

Output: value123, production

Exercise Set 2: Backreferences

cat << 'EOF' > /tmp/ex-backref.txt
the the quick fox
hello hello world
no no no more
yes yes
test TEST
abab
abba
12-34-12
aa:bb:cc:aa
EOF

Ex 2.1: Find repeated words

Solution

grep -P '\b(\w+)\s+\1\b' /tmp/ex-backref.txt

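\1 references what group 1 matched. Without -o, grep prints the whole matching lines: "the the quick fox", "hello hello world", "no no no more", "yes yes". "test TEST" is skipped because the comparison is case-sensitive.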
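
The same check can be sketched in Python, where `re.IGNORECASE` also makes the backreference comparison case-insensitive. The `lines` list below is a subset of the exercise file, and the case-insensitive variant is an addition for illustration, not part of the drill.

```python
import re

lines = ["the the quick fox", "hello hello world", "no no no more",
         "yes yes", "test TEST", "abab"]

pattern = re.compile(r'\b(\w+)\s+\1\b')
pattern_ci = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)

# Lines containing an exact repeated word.
exact = [line for line in lines if pattern.search(line)]
# Lines containing a repeated word, ignoring case.
loose = [line for line in lines if pattern_ci.search(line)]

print(exact)
print(loose)
```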

Ex 2.2: Find doubled characters (aa, bb)

Solution

grep -P '(.)\1' /tmp/ex-backref.txt

Matching lines: "hello hello world" (ll), "abba" (bb), "aa:bb:cc:aa" (aa, bb, cc)

Ex 2.3: Find palindrome-like patterns (abba)

Solution

grep -P '(.)(.)\2\1' /tmp/ex-backref.txt

Matches the "abba" pattern: capture a, capture b, then b again, then a again. Because . matches any character, the line "aa:bb:cc:aa" also matches (on ":bb:").
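
To see exactly which substrings this pattern picks up, here is a small Python sketch over a few lines from the exercise file. The letters-only variant `([a-z])([a-z])\2\1` is an assumed tightening for illustration, not part of the drill.

```python
import re

lines = ["abab", "abba", "12-34-12", "aa:bb:cc:aa"]

any_char = re.compile(r'(.)(.)\2\1')          # the drill's pattern
letters = re.compile(r'([a-z])([a-z])\2\1')   # assumed stricter variant

# Map each line to the substrings the drill's pattern matches in it.
hits = {line: [m.group(0) for m in any_char.finditer(line)] for line in lines}
# Lines that still match when both groups must be lowercase letters.
strict = [line for line in lines if letters.search(line)]

print(hits)
print(strict)
```

With `.`, punctuation counts as a character, so `:bb:` qualifies as a mirrored sequence; restricting the groups to letters leaves only `abba`.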

Ex 2.4: Find matching octets in sequence

Solution

grep -P '(\d+)-\d+-\1' /tmp/ex-backref.txt

Output: 12-34-12 (the first and third numbers match)

Exercise Set 3: Replacement with Groups

Ex 3.1: Swap first and last name

Solution

echo "John Smith" | sed -E 's/(\w+) (\w+)/\2, \1/'

Output: Smith, John

Ex 3.2: Reformat date from MM/DD/YYYY to YYYY-MM-DD

Solution

echo "03/15/2026" | sed -E 's/([0-9]{2})\/([0-9]{2})\/([0-9]{4})/\3-\1-\2/'

Output: 2026-03-15
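
The same rearrangement can be sketched in Python with named groups, which keeps the replacement readable once there are three or more groups. The input string is an assumption for illustration.

```python
import re

date_re = re.compile(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})')

# \g<name> refers to a named group inside the replacement string.
iso = date_re.sub(r'\g<year>-\g<month>-\g<day>', "Due 03/15/2026")

print(iso)  # Due 2026-03-15
```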

Ex 3.3: Add quotes around values

Solution

echo "key=value" | sed -E 's/(.+)=(.+)/\1="\2"/'

Output: key="value"

Ex 3.4: Extract domain from URL

Solution

echo "https://www.example.com/path" | sed -E 's|https?://([^/]+).*|\1|'

Output: www.example.com

Exercise Set 4: Non-Capturing Groups

Ex 4.1: Match phone formats without capturing area code

Solution

# Capturing: each () creates a group
echo "555-123-4567" | grep -oP '(\d{3})-(\d{3})-(\d{4})'
# Groups: 1=555, 2=123, 3=4567

# Non-capturing group for the area code:
echo "555-123-4567" | grep -oP '(?:\d{3}-)(\d{3}-\d{4})'
# Group 1 is now 123-4567; grep -o still prints the whole match (555-123-4567)

Ex 4.2: Optional prefix without capturing

Solution

# Match "Mr." or "Mrs." optionally, capture name
echo -e "Mr. Smith\nJones\nMrs. Davis" | grep -oP '(?:Mrs?\. )?(\w+)'

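The (?:Mrs?\. )? groups the optional prefix without capturing it, so the name is group 1. grep -o still prints the whole match, prefix included.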
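
Since `grep -o` cannot show individual groups, a short Python sketch makes the effect of the non-capturing prefix visible. The pattern and inputs are the ones from the exercise.

```python
import re

pattern = re.compile(r'(?:Mrs?\. )?(\w+)')

# The optional prefix is matched but not captured; group 1 is the name.
names = [pattern.match(s).group(1) for s in ["Mr. Smith", "Jones", "Mrs. Davis"]]

print(names)           # ['Smith', 'Jones', 'Davis']
print(pattern.groups)  # 1 (only one capturing group exists)
```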

Ex 4.3: Alternation with non-capturing group

Solution

# Match file extensions without capturing them
echo -e "file.txt\nscript.sh\nimage.png" | grep -oP '\w+\.(?:txt|sh|png)'

Exercise Set 5: Named Groups (Python)

import re

# Named groups with (?P<name>...)
pattern = re.compile(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
)

text = "Date: 2026-03-15"
match = pattern.search(text)

if match:
    # Access by name
    print(f"Year: {match.group('year')}")
    print(f"Month: {match.group('month')}")
    print(f"Day: {match.group('day')}")

    # Or as dict
    print(match.groupdict())
    # {'year': '2026', 'month': '03', 'day': '15'}

Ex 5.1: Parse log entry with named groups

Solution

import re

log = "2026-03-15T10:30:45 [ERROR] Connection failed"

pattern = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}T[\d:]+)\s+'
    r'\[(?P<level>\w+)\]\s+'
    r'(?P<message>.+)'
)

match = pattern.match(log)
if match:
    data = match.groupdict()
    print(f"Level: {data['level']}")
    print(f"Message: {data['message']}")

Ex 5.2: Parse MAC address into octets

Solution

import re

mac = "AA:BB:CC:DD:EE:FF"

pattern = re.compile(
    r'(?P<o1>[0-9A-Fa-f]{2}):'
    r'(?P<o2>[0-9A-Fa-f]{2}):'
    r'(?P<o3>[0-9A-Fa-f]{2}):'
    r'(?P<o4>[0-9A-Fa-f]{2}):'
    r'(?P<o5>[0-9A-Fa-f]{2}):'
    r'(?P<o6>[0-9A-Fa-f]{2})'
)

match = pattern.match(mac)
if match:
    octets = [match.group(f'o{i}') for i in range(1, 7)]
    print(f"OUI: {':'.join(octets[:3])}")

Real-World Applications

Professional: Parse ISE Logs

# Extract MAC and result from ISE logs
grep -oP 'Calling-Station-ID=\K[0-9A-F:-]+' /var/log/ise-psc.log

# Parse authentication result
grep -oP '(?<=AuthenticationResult=)\w+' /var/log/ise-psc.log

Professional: Reformat Config Files

# Convert "interface Gi0/1" to "Gi0/1"
sed -E 's/interface (.*)/\1/' config.txt

# Add VLAN prefix: "100" → "vlan 100"
sed -E 's/^([0-9]+)$/vlan \1/' vlans.txt

Professional: Extract IP:Port

# Match "10.50.1.20:8080" with named groups
# (grep -o prints the whole match, not the individual groups)
echo "10.50.1.20:8080" | grep -oP '(?P<ip>[\d.]+):(?P<port>\d+)'

# Using sed for extraction
echo "10.50.1.20:8080" | sed -E 's/(.+):(.+)/IP=\1 PORT=\2/'
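
To actually get the two parts as separate values, Python's `groupdict()` is one option. A minimal sketch using the same pattern and sample input:

```python
import re

m = re.fullmatch(r'(?P<ip>[\d.]+):(?P<port>\d+)', '10.50.1.20:8080')

parts = m.groupdict()       # named groups as a dict of strings
port = int(parts['port'])   # convert the captured text when a number is needed

print(parts)  # {'ip': '10.50.1.20', 'port': '8080'}
```

Note that `[\d.]+` only checks the character set; it does not validate that the text is a well-formed IPv4 address.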

Personal: Reformat Dates in Notes

# Move the year to the front: "March 15, 2026" becomes "2026-March-15"
# (simplified: turning the month name into "03" needs more logic than one substitution)
sed -E 's/(\w+) ([0-9]+), ([0-9]{4})/\3-\1-\2/' notes.txt
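
The missing month conversion can be sketched in Python by passing a function to `re.sub`. The `MONTHS` table and the `to_iso` helper are hypothetical names introduced for this example, and only full English month names are handled.

```python
import re

MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

def to_iso(m):
    """Turn a 'Month D, YYYY' match into 'YYYY-MM-DD'."""
    month = MONTHS.get(m.group(1))
    if month is None:
        return m.group(0)  # not a month name: leave the text untouched
    return f"{m.group(3)}-{month:02d}-{int(m.group(2)):02d}"

text = "Met on March 15, 2026 and again on July 4, 2026."
converted = re.sub(r'(\w+) (\d{1,2}), (\d{4})', to_iso, text)

print(converted)  # Met on 2026-03-15 and again on 2026-07-04.
```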

Personal: Clean Phone Numbers

# Standardize "(555) 123-4567" to "555-123-4567"
sed -E 's/\(([0-9]{3})\) ([0-9]{3})-([0-9]{4})/\1-\2-\3/' contacts.txt

Personal: Extract Tags from Notes

# Find #hashtags
grep -oP '#\w+' ~/notes/*.md

# Find @mentions
grep -oP '@\w+' ~/notes/*.md

Tool Variants

grep: Group Extraction

# Use \K to extract after pattern
grep -oP 'user=\K\w+' file.txt

# Multiple matches, each reformatted through a sed pipe
echo "a=1 b=2" | grep -oP '\w+=\d+' | sed 's/=/ → /'

sed: Group Replacement

# Basic group reference
sed -E 's/(foo)(bar)/\2\1/' file.txt  # foobar → barfoo

# Multiple groups
sed -E 's/([A-Z]+):([0-9]+)/ID:\1 COUNT:\2/' file.txt

# Entire match reference with &
# (use + rather than *: [0-9]* can match the empty string at the start of a line)
sed -E 's/[0-9]+/[&]/' file.txt  # 123 → [123]

awk: Using match() with groups

# GNU awk: the third argument to match() is a gawk extension
# that stores the captured groups in an array
echo "key=value" | gawk 'match($0, /(\w+)=(\w+)/, a) {print "Key:", a[1], "Val:", a[2]}'

# Standard awk extraction
echo "user@domain" | awk -F'@' '{print "User:", $1, "Domain:", $2}'

vim: Group Replacement

" Swap words
:%s/\(\w\+\) \(\w\+\)/\2 \1/g

" Add brackets around numbers
:%s/\([0-9]\+\)/[\1]/g

" Case conversion with groups: \u uppercases the next character,
" and \< anchors the match to the start of each word
:%s/\<\(\w\)/\u\1/g

Python: Groups and Named Groups

import re

text = "Server: web-01 IP: 10.50.1.100"

# Numbered groups
pattern = re.compile(r'Server: ([\w-]+) IP: ([\d.]+)')
match = pattern.search(text)
if match:
    print(f"Server: {match.group(1)}")  # web-01
    print(f"IP: {match.group(2)}")      # 10.50.1.100

# Named groups
pattern = re.compile(r'Server: (?P<name>[\w-]+) IP: (?P<ip>[\d.]+)')
match = pattern.search(text)
if match:
    data = match.groupdict()
    print(data)  # {'name': 'web-01', 'ip': '10.50.1.100'}

The \K Trick (PCRE)

\K resets the match start - everything before it is required but not included in the match.

# Extract value after "key="
echo "key=secret123" | grep -oP 'key=\K\w+'
# Output: secret123 (not key=secret123)

# Extract text between markers
echo "START:content:END" | grep -oP 'START:\K[^:]+(?=:END)'
# Output: content

# Similar to a positive lookbehind, but more flexible:
# \K works after a variable-length prefix, while PCRE lookbehind generally requires a fixed length
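
Python's `re` module has no `\K` and enforces the fixed-width lookbehind restriction, so a capturing group is the usual substitute there. A small sketch, with sample strings chosen for illustration:

```python
import re

# A fixed-width lookbehind compiles and works.
fixed = re.search(r'(?<=key=)\w+', 'key=secret123').group(0)

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=\w+=)\w+')
    variable_ok = True
except re.error:
    variable_ok = False

# A capturing group handles the variable-length prefix instead.
grouped = re.search(r'\w+=(\w+)', 'setting=production').group(1)

print(fixed, variable_ok, grouped)
```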

Gotchas

Group Numbering

# Groups are numbered left to right by opening parenthesis
# Pattern: ((a)(b))(c)
# Group 1: ((a)(b)) = "ab"
# Group 2: (a) = "a"
# Group 3: (b) = "b"
# Group 4: (c) = "c"

echo "abc" | sed -E 's/((a)(b))(c)/\1-\2-\3-\4/'
# Output: ab-a-b-c
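
The numbering rule can be checked in Python, where `groups()` returns the captures in group order and `start(n)` shows where each group begins:

```python
import re

m = re.match(r'((a)(b))(c)', 'abc')

captured = m.groups()                       # groups 1..4 in order
starts = [m.start(i) for i in range(1, 5)]  # offset where each group begins

print(captured)  # ('ab', 'a', 'b', 'c')
print(starts)    # [0, 0, 1, 2]
```

Groups 1 and 2 both start at offset 0; group 1 gets the lower number because its opening parenthesis comes first.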

sed Group Syntax

# BRE (default): Escape parentheses
sed 's/\(foo\)/[\1]/' file.txt

# ERE (-E flag): No escape needed
sed -E 's/(foo)/[\1]/' file.txt

Empty Groups

# An optional group may not participate in the match
echo "hello" | grep -oP '(\d+)?hello'
# Group 1 is unset when there are no digits

# Handle in Python:
import re
match = re.search(r'(\d+)?hello', 'hello')
print(match.group(1))  # None
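
A few ways to guard against the `None` in Python, sketched under the assumption that a missing number should default to empty or zero:

```python
import re

m = re.search(r'(\d+)?hello', 'hello')

unset = m.group(1)                # None: the group did not participate
filled = m.groups(default='')     # substitute '' for unset groups
count = int(m.group(1) or 0)      # fall back to 0 before converting

# Alternative: make the content optional instead of the group,
# so the group always participates and yields '' when empty.
m2 = re.search(r'(\d*)hello', 'hello')
always = m2.group(1)

print(unset, filled, count, repr(always))
```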

Greedy Groups

# Greedy captures too much
echo "<tag>a</tag><tag>b</tag>" | grep -oP '<tag>(.+)</tag>'
# Group 1: a</tag><tag>b (one match spanning the whole line)

# Use a lazy quantifier (.+?) or a negated class
echo "<tag>a</tag><tag>b</tag>" | grep -oP '<tag>([^<]+)</tag>'
# Group 1: a (first match), b (second match)
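
Because `grep -o` prints whole matches, the group contents are easier to compare in Python, where `findall` returns group 1 for each match. A minimal sketch with the same input:

```python
import re

html = "<tag>a</tag><tag>b</tag>"

greedy = re.findall(r'<tag>(.+)</tag>', html)       # .+ runs to the last </tag>
lazy = re.findall(r'<tag>(.+?)</tag>', html)        # .+? stops at the first </tag>
negated = re.findall(r'<tag>([^<]+)</tag>', html)   # [^<]+ cannot cross a tag

print(greedy)   # ['a</tag><tag>b']
print(lazy)     # ['a', 'b']
print(negated)  # ['a', 'b']
```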

Key Takeaways

Concept	Usage
`(pattern)`	Capture for extraction or backreference
`\1`, `\2`	Reference captured groups in pattern/replacement
`(?:pattern)`	Group without capturing (efficiency)
`(?P<name>…)`	Named group (Python/PCRE)
`\K`	Reset match start (PCRE) - like lookbehind
`match.group(n)`	Access group n in Python
`match.groupdict()`	Get all named groups as dict

Self-Test

What does (\w+) \1 match?
How do you reference group 2 in sed replacement?
What’s the difference between (a) and (?:a)?
How do you access a named group "year" in Python?
What does \K do in PCRE?

Answers

A word followed by space and the same word (repeated word)
\2
(a) captures, (?:a) only groups (no capture)
match.group('year') or match.groupdict()['year']
Resets match start - content before \K is required but not in result

Next Drill

Drill 06: Alternation - Master the | (OR) operator and proper grouping.