Drill 05: Groups & Backreferences

Capturing groups extract parts of a match and enable backreferences, both inside a pattern (to match repeated text) and in substitutions (to rearrange captured text). This is where regex becomes useful for data transformation.

Core Concepts

Syntax             Meaning                 Example
(pattern)          Capturing group         (abc) captures "abc"
\1, \2…            Backreference to group  (\w+) \1 matches "the the"
(?:pattern)        Non-capturing group     (?:abc)+ groups without capturing
(?P<name>pattern)  Named group (Python)    (?P<year>\d{4}) names it "year"
(?<name>pattern)   Named group (PCRE)      (?<year>\d{4}) names it "year"
\K                 Reset match start       foo\Kbar matches only "bar"

Why Groups Matter

  1. Extraction - Pull specific parts from complex patterns

  2. Backreferences - Match repeated text

  3. Replacement - Rearrange captured text

  4. Alternation - Group alternatives together
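The four uses above can be sketched with Python's re module. The sample strings and variable names here are illustrative assumptions, not part of the drill files.

```python
import re

# 1. Extraction: pull the port out of a host:port string
port = re.search(r':(\d+)$', 'db-prod-02:5432').group(1)

# 2. Backreference: \1 must repeat exactly what group 1 matched
repeated = re.search(r'\b(\w+) \1\b', 'it was the the best').group(1)

# 3. Replacement: rearrange captured text
swapped = re.sub(r'(\w+) (\w+)', r'\2, \1', 'John Smith')

# 4. Alternation: the group limits how far | reaches
is_image = bool(re.fullmatch(r'\w+\.(?:png|jpg)', 'photo.jpg'))

print(port, repeated, swapped, is_image)  # 5432 the Smith, John True
```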

Interactive CLI Drill

bash ~/atelier/_bibliotheca/domus-captures/docs/modules/ROOT/examples/regex-drills/05-groups.sh

Exercise Set 1: Basic Capturing

cat << 'EOF' > /tmp/ex-groups.txt
192.168.1.100
10.50.1.20
host: server-01
host: db-prod-02
user@example.com
admin@company.org
key="value123"
setting="production"
EOF

Ex 1.1: Extract first two octets of IP

Solution
grep -oP '(\d+\.\d+)\.\d+\.\d+' /tmp/ex-groups.txt
# Prints the whole match; grep cannot print a single group

# To print only the first two octets, move the rest into a lookahead:
grep -oP '(\d+\.\d+)(?=\.\d+\.\d+)' /tmp/ex-groups.txt
# Or use sed:
sed -nE 's/^([0-9]+\.[0-9]+)\..*/\1/p' /tmp/ex-groups.txt

Output: 192.168, 10.50

Ex 1.2: Extract hostname from "host: NAME"

Solution
grep -oP 'host: \K[\w-]+' /tmp/ex-groups.txt

\K resets the match start, so only the hostname is returned. Output: server-01, db-prod-02
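Python's built-in re module does not support \K, so the same extraction is usually written there with a capturing group. A minimal sketch, with sample lines copied from /tmp/ex-groups.txt (the variable names are assumptions):

```python
import re

lines = ["host: server-01", "host: db-prod-02", "user@example.com"]

# "host: " must be present but sits outside the group,
# so group(1) plays the role of the text after \K
hosts = [m.group(1) for line in lines
         if (m := re.search(r'host: ([\w-]+)', line))]

print(hosts)  # ['server-01', 'db-prod-02']
```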

Ex 1.3: Extract username from email

Solution
grep -oP '^[^@]+(?=@)' /tmp/ex-groups.txt
# Or match through the @ and strip it with sed:
grep -oP '^([^@]+)@' /tmp/ex-groups.txt | sed 's/@$//'

Output: user, admin

Ex 1.4: Extract value from key="value"

Solution
grep -oP '(?<=")[^"]+(?=")' /tmp/ex-groups.txt
# Or with \K (match any key name, not just "key"):
grep -oP '\w+="\K[^"]+' /tmp/ex-groups.txt

Output: value123, production
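When both the key and the value are needed, two capturing groups can return them together. A small Python sketch under that assumption (the sample lines mirror /tmp/ex-groups.txt; `pairs` is an illustrative name):

```python
import re

lines = ['key="value123"', 'setting="production"', 'user@example.com']

# With two groups, findall returns (key, value) tuples
pairs = dict(p for line in lines
             for p in re.findall(r'(\w+)="([^"]*)"', line))

print(pairs)  # {'key': 'value123', 'setting': 'production'}
```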

Exercise Set 2: Backreferences

cat << 'EOF' > /tmp/ex-backref.txt
the the quick fox
hello hello world
no no no more
yes yes
test TEST
abab
abba
12-34-12
aa:bb:cc:aa
EOF

Ex 2.1: Find repeated words

Solution
grep -P '\b(\w+)\s+\1\b' /tmp/ex-backref.txt

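\1 references what group 1 matched. Without -o, grep prints each matching line in full; the repeated words found are "the the", "hello hello", "no no", and "yes yes". "test TEST" is not matched because the backreference is case-sensitive.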
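A backreference repeats the captured text exactly, so case matters unless the whole pattern is case-insensitive. A hedged Python sketch of that difference, using three lines from /tmp/ex-backref.txt (variable names are illustrative):

```python
import re

lines = ["the the quick fox", "no no no more", "test TEST"]

# Case-sensitive: \1 must repeat group 1 exactly
strict = re.compile(r'\b(\w+)\s+\1\b')
# With IGNORECASE the backreference is compared case-insensitively too
loose = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)

strict_hits = [m.group(0) for line in lines for m in strict.finditer(line)]
loose_hits = [m.group(0) for line in lines for m in loose.finditer(line)]

print(strict_hits)  # ['the the', 'no no']
print(loose_hits)   # ['the the', 'no no', 'test TEST']
```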

Ex 2.2: Find repeated characters (aabb pattern)

Solution
grep -P '(.)\1' /tmp/ex-backref.txt

Matching lines: "hello hello world" (ll), "abba" (bb), "aa:bb:cc:aa" (aa, bb, cc)

Ex 2.3: Find palindrome-like patterns (abba)

Solution
grep -P '(.)(.)\2\1' /tmp/ex-backref.txt

Matches the "abba" pattern: capture a, capture b, then b again, then a again. Because . matches any character, the line aa:bb:cc:aa also matches (":bb:").
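Since . matches any character, a pattern like (.)(.)\2\1 also accepts punctuation as the outer pair. One way to tighten it, sketched in Python with lines from /tmp/ex-backref.txt, is to restrict both groups to letters (the character class is an assumption about what counts as a palindrome here):

```python
import re

lines = ["abab", "abba", "aa:bb:cc:aa"]

# . matches any character, so ":bb:" also counts as x-y-y-x
any_char = re.compile(r'(.)(.)\2\1')
# Restricting both groups to letters keeps only letter palindromes
letters = re.compile(r'([a-z])([a-z])\2\1')

any_hits = [m.group(0) for line in lines for m in any_char.finditer(line)]
letter_hits = [m.group(0) for line in lines for m in letters.finditer(line)]

print(any_hits)     # ['abba', ':bb:']
print(letter_hits)  # ['abba']
```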

Ex 2.4: Find matching octets in sequence

Solution
grep -P '(\d+)-\d+-\1' /tmp/ex-backref.txt

Output: 12-34-12 (the first and third numbers are identical)

Exercise Set 3: Replacement with Groups

Ex 3.1: Swap first and last name

Solution
echo "John Smith" | sed -E 's/(\w+) (\w+)/\2, \1/'

Output: Smith, John

Ex 3.2: Reformat date from MM/DD/YYYY to YYYY-MM-DD

Solution
echo "03/15/2026" | sed -E 's/([0-9]{2})\/([0-9]{2})\/([0-9]{4})/\3-\1-\2/'

Output: 2026-03-15
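The same reordering can be written in Python with named groups, which keeps the replacement readable when the group order changes. A minimal sketch (the pattern name and sample text are illustrative):

```python
import re

us_date = re.compile(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})')

# \g<name> refers to a named group inside the replacement string
iso = us_date.sub(r'\g<year>-\g<month>-\g<day>', 'Due 03/15/2026')

print(iso)  # Due 2026-03-15
```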

Ex 3.3: Add quotes around values

Solution
echo "key=value" | sed -E 's/(.+)=(.+)/\1="\2"/'

Output: key="value"

Ex 3.4: Extract domain from URL

Solution
echo "https://www.example.com/path" | sed -E 's|https?://([^/]+).*|\1|'

Output: www.example.com

Exercise Set 4: Non-Capturing Groups

Ex 4.1: Match phone formats without capturing area code

Solution
# Capturing: each () creates a numbered group
echo "555-123-4567" | grep -oP '(\d{3})-(\d{3})-(\d{4})'
# Groups: 1=555, 2=123, 3=4567

# Non-capturing group for the area code:
echo "555-123-4567" | grep -oP '(?:\d{3}-)(\d{3}-\d{4})'
# Group 1 is now 123-4567; grep -o still prints the whole match, 555-123-4567

Ex 4.2: Optional prefix without capturing

Solution
# Match "Mr." or "Mrs." optionally, capture name
echo -e "Mr. Smith\nJones\nMrs. Davis" | grep -oP '(?:Mrs?\. )?(\w+)'

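The (?:Mrs?\. )? groups the optional prefix without capturing it, so the name stays group 1. grep -o still prints the whole match, prefix included.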
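grep cannot show which groups were captured, so the effect of (?:...) is easier to see in Python, where groups() lists every capturing group. A short sketch comparing the two forms on the same names (variable names are illustrative):

```python
import re

names = ["Mr. Smith", "Jones", "Mrs. Davis"]

capturing = re.compile(r'(Mrs?\. )?(\w+)')
non_capturing = re.compile(r'(?:Mrs?\. )?(\w+)')

with_prefix = [capturing.match(n).groups() for n in names]
name_only = [non_capturing.match(n).groups() for n in names]

print(with_prefix)  # [('Mr. ', 'Smith'), (None, 'Jones'), ('Mrs. ', 'Davis')]
print(name_only)    # [('Smith',), ('Jones',), ('Davis',)]
```

With the non-capturing form, the name is always group 1 regardless of whether a prefix is present.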

Ex 4.3: Alternation with non-capturing group

Solution
# Match file extensions without capturing them
echo -e "file.txt\nscript.sh\nimage.png" | grep -oP '\w+\.(?:txt|sh|png)'

Exercise Set 5: Named Groups (Python)

import re

# Named groups with (?P<name>...)
pattern = re.compile(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
)

text = "Date: 2026-03-15"
match = pattern.search(text)

if match:
    # Access by name
    print(f"Year: {match.group('year')}")
    print(f"Month: {match.group('month')}")
    print(f"Day: {match.group('day')}")

    # Or as dict
    print(match.groupdict())
    # {'year': '2026', 'month': '03', 'day': '15'}

Ex 5.1: Parse log entry with named groups

Solution
import re

log = "2026-03-15T10:30:45 [ERROR] Connection failed"

pattern = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}T[\d:]+)\s+'
    r'\[(?P<level>\w+)\]\s+'
    r'(?P<message>.+)'
)

match = pattern.match(log)
if match:
    data = match.groupdict()
    print(f"Level: {data['level']}")
    print(f"Message: {data['message']}")

Ex 5.2: Parse MAC address into octets

Solution
import re

mac = "AA:BB:CC:DD:EE:FF"

pattern = re.compile(
    r'(?P<o1>[0-9A-Fa-f]{2}):'
    r'(?P<o2>[0-9A-Fa-f]{2}):'
    r'(?P<o3>[0-9A-Fa-f]{2}):'
    r'(?P<o4>[0-9A-Fa-f]{2}):'
    r'(?P<o5>[0-9A-Fa-f]{2}):'
    r'(?P<o6>[0-9A-Fa-f]{2})'
)

match = pattern.match(mac)
if match:
    octets = [match.group(f'o{i}') for i in range(1, 7)]
    print(f"OUI: {':'.join(octets[:3])}")

Real-World Applications

Professional: Parse ISE Logs

# Extract MAC and result from ISE logs
grep -oP 'Calling-Station-ID=\K[0-9A-F:-]+' /var/log/ise-psc.log

# Parse authentication result
grep -oP '(?<=AuthenticationResult=)\w+' /var/log/ise-psc.log

Professional: Reformat Config Files

# Convert "interface Gi0/1" to "Gi0/1"
sed -E 's/interface (.*)/\1/' config.txt

# Add VLAN prefix: "100" → "vlan 100"
sed -E 's/^([0-9]+)$/vlan \1/' vlans.txt

Professional: Extract IP:Port

# Match "10.50.1.20:8080" with named groups (grep prints the whole match; it cannot output the groups separately)
echo "10.50.1.20:8080" | grep -oP '(?P<ip>[\d.]+):(?P<port>\d+)'

# Using sed for extraction
echo "10.50.1.20:8080" | sed -E 's/(.+):(.+)/IP=\1 PORT=\2/'
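To actually get the IP and port as separate values, the named groups can be read in Python. A hedged sketch; the stricter octet pattern is an assumption, not the drill's original regex:

```python
import re

# Four dot-separated groups of 1-3 digits, then a colon and a port
endpoint = re.compile(r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+)')

m = endpoint.fullmatch("10.50.1.20:8080")
ip, port = (m.group('ip'), int(m.group('port'))) if m else (None, None)

print(ip, port)  # 10.50.1.20 8080
```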

Personal: Reformat Dates in Notes

# Reorder "March 15, 2026" to "2026-March-15"
# (simplified: converting the month name to "03" needs more logic)
sed -E 's/(\w+) ([0-9]+), ([0-9]{4})/\3-\1-\2/' notes.txt
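The missing month-name conversion can be sketched in Python by passing a function as the replacement. The MONTHS table, the to_iso helper, and the sample sentence are all hypothetical names introduced for illustration:

```python
import re

# Hypothetical lookup table: full English month names to two-digit numbers
MONTHS = {name: f"{i:02d}" for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

DATE = re.compile(r'\b([A-Z][a-z]+) (\d{1,2}), (\d{4})\b')

def to_iso(match):
    """Build YYYY-MM-DD from the three captured groups."""
    month, day, year = match.groups()
    if month not in MONTHS:
        return match.group(0)  # leave unknown words untouched
    return f"{year}-{MONTHS[month]}-{int(day):02d}"

result = DATE.sub(to_iso, "Met on March 15, 2026 and May 3, 2026.")
print(result)  # Met on 2026-03-15 and 2026-05-03.
```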

Personal: Clean Phone Numbers

# Standardize "(555) 123-4567" to "555-123-4567"
sed -E 's/\(([0-9]{3})\) ([0-9]{3})-([0-9]{4})/\1-\2-\3/' contacts.txt

Personal: Extract Tags from Notes

# Find #hashtags
grep -oP '#\w+' ~/notes/*.md

# Find @mentions
grep -oP '@\w+' ~/notes/*.md

Tool Variants

grep: Group Extraction

# Use \K to extract after pattern
grep -oP 'user=\K\w+' file.txt

# Print each key=value match, then reformat it with sed
echo "a=1 b=2" | grep -oP '\w+=\d+' | sed 's/=/ → /'

sed: Group Replacement

# Basic group reference
sed -E 's/(foo)(bar)/\2\1/' file.txt  # foobar → barfoo

# Multiple groups
sed -E 's/([A-Z]+):([0-9]+)/ID:\1 COUNT:\2/' file.txt

# Entire match reference with & (use + so the pattern cannot match an empty string)
sed -E 's/[0-9]+/[&]/' file.txt  # 123 → [123]

awk: Using match() with groups

# GNU awk: match() with an array as third argument (a gawk extension, not PCRE)
echo "key=value" | gawk 'match($0, /(\w+)=(\w+)/, a) {print "Key:", a[1], "Val:", a[2]}'

# Standard awk extraction
echo "user@domain" | awk -F'@' '{print "User:", $1, "Domain:", $2}'

vim: Group Replacement

" Swap words
:%s/\(\w\+\) \(\w\+\)/\2 \1/g

" Add brackets around numbers
:%s/\([0-9]\+\)/[\1]/g

" Case conversion with groups: uppercase the first character of each word
:%s/\<\(\w\)/\u\1/g

Python: Groups and Named Groups

import re

text = "Server: web-01 IP: 10.50.1.100"

# Numbered groups
pattern = re.compile(r'Server: ([\w-]+) IP: ([\d.]+)')
match = pattern.search(text)
if match:
    print(f"Server: {match.group(1)}")  # web-01
    print(f"IP: {match.group(2)}")      # 10.50.1.100

# Named groups
pattern = re.compile(r'Server: (?P<name>[\w-]+) IP: (?P<ip>[\d.]+)')
match = pattern.search(text)
if match:
    data = match.groupdict()
    print(data)  # {'name': 'web-01', 'ip': '10.50.1.100'}

The \K Trick (PCRE)

\K resets the match start - everything before it is required but not included in the match.

# Extract value after "key="
echo "key=secret123" | grep -oP 'key=\K\w+'
# Output: secret123 (not key=secret123)

# Extract text between markers
echo "START:content:END" | grep -oP 'START:\K[^:]+(?=:END)'
# Output: content

# Equivalent to positive lookbehind but more flexible
# \K works with variable-length patterns; lookbehind doesn't
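Python's re module has no \K and also requires fixed-width lookbehind, so a variable-length prefix is handled there with a capturing group. A small sketch of that restriction and the workaround (the sample text is an assumption):

```python
import re

text = "id=7 name=alpha"

# Fixed-width lookbehind is accepted
fixed = re.findall(r'(?<=name=)\w+', text)

# A variable-width lookbehind is rejected when the pattern is compiled
try:
    re.compile(r'(?<=\w+=)\w+')
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("rejected:", exc)

# A capturing group handles the variable-length prefix instead
grouped = re.findall(r'\w+=(\w+)', text)

print(fixed)    # ['alpha']
print(grouped)  # ['7', 'alpha']
```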

Gotchas

Group Numbering

# Groups are numbered left to right by opening parenthesis
# Pattern: ((a)(b))(c)
# Group 1: ((a)(b)) = "ab"
# Group 2: (a) = "a"
# Group 3: (b) = "b"
# Group 4: (c) = "c"

echo "abc" | sed -E 's/((a)(b))(c)/\1-\2-\3-\4/'
# Output: ab-a-b-c
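The numbering rule can be checked in Python, where groups() returns the groups in order. Making the outer group non-capturing removes it from the numbering, as this sketch shows:

```python
import re

# Four opening parentheses, so four groups
nested = re.match(r'((a)(b))(c)', 'abc').groups()

# The outer (?:...) no longer counts, so the rest shift down by one
flat = re.match(r'(?:(a)(b))(c)', 'abc').groups()

print(nested)  # ('ab', 'a', 'b', 'c')
print(flat)    # ('a', 'b', 'c')
```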

sed Group Syntax

# BRE (default): Escape parentheses
sed 's/\(foo\)/[\1]/' file.txt

# ERE (-E flag): No escape needed
sed -E 's/(foo)/[\1]/' file.txt

Empty Groups

# Optional group might be empty
echo "hello" | grep -oP '(\d+)?hello'
# Group 1 is empty/unset if no digits

# Handle in Python:
import re
match = re.search(r'(\d+)?hello', 'hello')
print(match.group(1))  # None
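A few ways to cope with an unset group in Python, sketched under the assumption that a missing number should default to zero:

```python
import re

m = re.search(r'(\d+)?hello', 'hello')

# An optional group that did not participate is None, not ''
count = int(m.group(1)) if m.group(1) is not None else 0

# groups() accepts a default for groups that did not participate
defaults = m.groups(default='0')

# In re.sub, an unmatched group expands to an empty string (Python 3.5+)
replaced = re.sub(r'(\d+)?hello', r'[\1]', 'hello')

print(count, defaults, replaced)  # 0 ('0',) []
```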

Greedy Groups

# Greedy captures too much
echo "<tag>a</tag><tag>b</tag>" | grep -oP '<tag>(.+)</tag>'
# Group 1: a</tag><tag>b

# Use lazy or negated class
echo "<tag>a</tag><tag>b</tag>" | grep -oP '<tag>([^<]+)</tag>'
# Group 1: a (first match), b (second match)
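Because grep -o prints whole matches, the group contents are easier to compare in Python, where findall returns group 1 directly. A minimal sketch of the greedy, lazy, and negated-class variants:

```python
import re

html = "<tag>a</tag><tag>b</tag>"

greedy = re.findall(r'<tag>(.+)</tag>', html)
lazy = re.findall(r'<tag>(.+?)</tag>', html)
negated = re.findall(r'<tag>([^<]+)</tag>', html)

print(greedy)   # ['a</tag><tag>b']
print(lazy)     # ['a', 'b']
print(negated)  # ['a', 'b']
```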

Key Takeaways

Concept            Usage
(pattern)          Capture for extraction or backreference
\1, \2             Reference captured groups in pattern/replacement
(?:pattern)        Group without capturing (efficiency)
(?P<name>…)        Named group (Python/PCRE)
\K                 Reset match start (PCRE) - like lookbehind
match.group(n)     Access group n in Python
match.groupdict()  Get all named groups as dict

Self-Test

  1. What does (\w+) \1 match?

  2. How do you reference group 2 in sed replacement?

  3. What’s the difference between (a) and (?:a)?

  4. How do you access a named group "year" in Python?

  5. What does \K do in PCRE?

Answers
  1. A word followed by space and the same word (repeated word)

  2. \2

  3. (a) captures, (?:a) only groups (no capture)

  4. match.group('year') or match.groupdict()['year']

  5. Resets match start - content before \K is required but not in result

Next Drill

Drill 06: Alternation - Master the | (OR) operator and proper grouping.