Groups & Capturing

Groups serve two purposes: they bundle elements for quantification and alternation, and they capture matched text for extraction or backreferencing. Understanding groups unlocks regex’s power for data extraction.

Basic Groups `()`

Parentheses create a group that:

Bundles elements as a single unit
Captures the matched text

Pattern: (ab)+
Text:    abababab
Matches: ^^^^^^^^
         (group matched 4 times)
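
One detail worth knowing: when a capturing group repeats, the group keeps only the text from its final repetition. A minimal sketch, assuming Python's standard `re` module (the variable names are illustrative):

```python
import re

# (ab)+ repeats the whole group; the match spans all four repetitions.
match = re.fullmatch(r'(ab)+', 'abababab')

full_text = match.group(0)     # the entire match
last_capture = match.group(1)  # only the final repetition is kept

print(full_text)     # abababab
print(last_capture)  # ab
```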

Grouping for Quantifiers

Without grouping, quantifiers apply only to the preceding element:

Pattern: ab+
Matches: ab, abb, abbb (b repeated)

Pattern: (ab)+
Matches: ab, abab, ababab (ab repeated)
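
The difference can be checked directly. The sketch below assumes Python's `re` module; `fully_matches` is a hypothetical helper written for this example, not a library function.

```python
import re

def fully_matches(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the whole string."""
    return re.fullmatch(pattern, text) is not None

print(fully_matches(r'ab+', 'abbb'))    # True  (+ applies to b only)
print(fully_matches(r'ab+', 'abab'))    # False
print(fully_matches(r'(ab)+', 'abab'))  # True  (+ applies to the group)
print(fully_matches(r'(ab)+', 'abbb'))  # False
```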

Infrastructure Example:

Pattern: ([0-9]{1,3}\.){3}[0-9]{1,3}
Matches: 192.168.1.100
         (octets with dots grouped, repeated 3 times)
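
The same pattern in Python, as a rough sketch. `IPV4_SHAPE` and `looks_like_ipv4` are illustrative names. The pattern checks only the shape of the address, not that each octet is in the range 0-255.

```python
import re

# Three "digits + dot" groups followed by a final run of digits.
IPV4_SHAPE = re.compile(r'([0-9]{1,3}\.){3}[0-9]{1,3}')

def looks_like_ipv4(text: str) -> bool:
    """Return True when the whole string has the dotted-quad shape."""
    return IPV4_SHAPE.fullmatch(text) is not None

print(looks_like_ipv4('192.168.1.100'))    # True
print(looks_like_ipv4('192.168.1'))        # False
print(looks_like_ipv4('999.999.999.999'))  # True: shape only, no range check
```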

Capturing Groups

Capturing groups are numbered left to right, in the order of their opening parentheses.

Pattern: (\d{4})-(\d{2})-(\d{2})
Text:    2026-03-15

Group 0: 2026-03-15  (entire match)
Group 1: 2026
Group 2: 03
Group 3: 15

Using Captures in grep

# Extract date parts
echo "2026-03-15" | grep -oP '(\d{4})-(\d{2})-(\d{2})'
# Output: 2026-03-15

# Extract just the year (using lookahead - covered later)
echo "Date: 2026-03-15" | grep -oP '\d{4}(?=-\d{2}-\d{2})'
# Output: 2026

Using Captures in Python

import re

text = "Error on 2026-03-15 at 14:30:00"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)

if match:
    print(match.group(0))  # 2026-03-15 (full match)
    print(match.group(1))  # 2026 (year)
    print(match.group(2))  # 03 (month)
    print(match.group(3))  # 15 (day)
    print(match.groups())  # ('2026', '03', '15')

Using Captures in JavaScript

const text = "Error on 2026-03-15 at 14:30:00";
const match = text.match(/(\d{4})-(\d{2})-(\d{2})/);

if (match) {
    console.log(match[0]);  // 2026-03-15 (full match)
    console.log(match[1]);  // 2026 (year)
    console.log(match[2]);  // 03 (month)
    console.log(match[3]);  // 15 (day)
}

Backreferences

Backreferences match the same text that was captured earlier.

Syntax

Syntax	Context	Example
`\1`, `\2`, etc.	Pattern matching	`(\w+)\s+\1` matches "the the"
`$1`, `$2`, etc.	Replacement (JavaScript, Perl, .NET; `sed` and Python use `\1`)	Replace with captured text
`\g{1}`, `\g{name}`	PCRE named/numbered	Avoids ambiguity
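
Python does not follow the `$1` convention: its `re` module uses `\1` or `\g<1>` in replacement strings. A short sketch with made-up sample strings:

```python
import re

text = "name=\"web\" alt='db'"

# \1 in the pattern: the closing quote must equal the opening quote.
quoted = re.findall(r"""(["'])(.*?)\1""", text)
print(quoted)  # [('"', 'web'), ("'", 'db')]

# \g<1> in the replacement: unambiguous even when a digit follows.
padded = re.sub(r'(\d+)', r'\g<1>0', "port 808")
print(padded)  # port 8080
```

Writing `\10` in the replacement would be read as a reference to group 10, which is why the `\g<1>` form exists.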

Finding Duplicates

Pattern: \b(\w+)\s+\1\b
Text:    the the quick brown fox fox
Matches: ^^^^^^^             ^^^^^^^

# Find duplicate words
grep -P '\b(\w+)\s+\1\b' document.txt

# Find duplicate lines (adjacent)
uniq -d file.txt

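# Find duplicate configuration keys on adjacent lines
# (-z reads the whole file as one record so \n can match; GNU grep)
grep -Pzo '(?m)^([^=\s]+)=.*\n\1=.*' config.ini | tr '\0' '\n'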
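
Backreferences compare text exactly, so "The the" counts as a repeat only when matching is case-insensitive. The sketch below assumes Python's `re` module. `DUPLICATE_WORD` is an illustrative name and the `config` string is made-up sample data.

```python
import re

# Case-insensitive so that "The the" counts as a repeat.
DUPLICATE_WORD = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)
duplicate_words = [m.group(0) for m in DUPLICATE_WORD.finditer("The the quick brown fox fox")]
print(duplicate_words)  # ['The the', 'fox fox']

# Keys repeated on two adjacent lines; re.MULTILINE lets ^ match at each line start.
config = "port=8080\nport=9090\nhost=a\n"
duplicate_keys = re.findall(r'^([^=\s]+)=.*\n\1=', config, re.MULTILINE)
print(duplicate_keys)  # ['port']
```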

Backreferences in sed

# Swap first and last name
echo "Rosado, Evan" | sed -E 's/(\w+), (\w+)/\2 \1/'
# Output: Evan Rosado

# Repeat a number (write its digits twice)
echo "Count: 42" | sed -E 's/([0-9]+)/\1\1/'
# Output: Count: 4242

# Quote values
echo "name=value" | sed -E 's/=(.+)/="\1"/'
# Output: name="value"
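
For comparison, a sketch of the same three substitutions with Python's `re.sub`. `count=1` mimics `sed` without the `g` flag, which replaces only the first match.

```python
import re

# Swap "Last, First" into "First Last".
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', "Rosado, Evan", count=1)

# Write the captured digits twice.
repeated = re.sub(r'([0-9]+)', r'\1\1', "Count: 42", count=1)

# Wrap the value after "=" in double quotes.
quoted = re.sub(r'=(.+)', r'="\1"', "name=value", count=1)

print(swapped)   # Evan Rosado
print(repeated)  # Count: 4242
print(quoted)    # name="value"
```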

Infrastructure Examples

# Swap IP octets (reverse)
echo "192.168.1.100" | sed -E 's/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/\4.\3.\2.\1/'
# Output: 100.1.168.192

# Reformat MAC address (: to -)
echo "AA:BB:CC:DD:EE:FF" | sed -E 's/([A-F0-9]{2}):([A-F0-9]{2}):([A-F0-9]{2}):([A-F0-9]{2}):([A-F0-9]{2}):([A-F0-9]{2})/\1-\2-\3-\4-\5-\6/'
# Output: AA-BB-CC-DD-EE-FF

# Extract hostname from FQDN
echo "server.inside.domusdigitalis.dev" | sed -E 's/^([^.]+)\..+/\1/'
# Output: server

Non-Capturing Groups `(?:)`

When you need grouping but don’t need to capture:

Pattern: (?:ab)+
Matches: abab (but doesn't capture)

# Skips storing the captured text (usually a small saving)
# Keeps group numbers cleaner
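
In Python the choice also changes what `re.findall` returns: with a capturing group it returns the group's text, and without one it returns the full matches. A minimal sketch with a made-up sample string:

```python
import re

text = "abab cd ababab"

capturing = re.findall(r'(ab)+', text)        # group 1 of each match
non_capturing = re.findall(r'(?:ab)+', text)  # the full matches

print(capturing)      # ['ab', 'ab']
print(non_capturing)  # ['abab', 'ababab']

# A non-capturing group leaves nothing in groups().
no_groups = re.search(r'(?:ab)+', text).groups()
print(no_groups)      # ()
```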

When to Use

Use Capturing `()`	Use Non-Capturing `(?:)`
Need to extract the matched text	Just need grouping for quantifiers
Need backreference	Don’t need the value
Few groups	Many groups (performance)

Example: Protocol Matching

# Capturing both parts (the protocol capture is unnecessary)
Pattern: (https?)://([^/]+)
Group 1: protocol (e.g. https)
Group 2: domain

# Non-capturing protocol, capturing domain
Pattern: (?:https?)://([^/]+)
Group 1: domain (the one we care about)

# Extract domain, ignore protocol
echo "https://api.example.com/v1" | grep -oP '(?:https?)://\K[^/]+'
# Output: api.example.com
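
A sketch of how the group numbering shifts in Python, assuming the `re` module. The variable names are illustrative.

```python
import re

url = "https://api.example.com/v1"

# Two capturing groups: the domain is group 2.
with_protocol = re.search(r'(https?)://([^/]+)', url).groups()

# Protocol is non-capturing: the domain becomes group 1.
host_only = re.search(r'(?:https?)://([^/]+)', url).groups()

print(with_protocol)  # ('https', 'api.example.com')
print(host_only)      # ('api.example.com',)
```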

Named Groups `(?P<name>)` or `(?<name>)`

Named groups improve readability and maintenance.

Syntax by Flavor

Flavor	Define	Reference
Python	`(?P<name>…)`	`(?P=name)` in patterns, `\g<name>` in replacements
PCRE, JavaScript	`(?<name>…)`	`\k<name>`
.NET	`(?<name>…)` or `(?'name'…)`	`\k<name>`
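
In Python the two reference forms are used in different places: `(?P=name)` inside the pattern and `\g<name>` inside a replacement string. A short sketch; the group names `word`, `last` and `first` are chosen for this example.

```python
import re

# (?P=word) refers back to the named group inside the pattern.
repeat = re.search(r'\b(?P<word>\w+)\s+(?P=word)\b', "word word here")
repeated_word = repeat.group('word')

# \g<name> refers to named groups inside the replacement string.
swapped = re.sub(r'(?P<last>\w+), (?P<first>\w+)', r'\g<first> \g<last>', "Rosado, Evan")

print(repeated_word)  # word
print(swapped)        # Evan Rosado
```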

Python Example

import re

log = "2026-03-15 ERROR Connection refused from 192.168.1.100"

pattern = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>\w+) (?P<message>.+)'
match = re.search(pattern, log)

if match:
    print(match.group('date'))     # 2026-03-15
    print(match.group('level'))    # ERROR
    print(match.group('message'))  # Connection refused from 192.168.1.100
    print(match.groupdict())       # {'date': '2026-03-15', 'level': 'ERROR', ...}

JavaScript Example (ES2018+)

const log = "2026-03-15 ERROR Connection refused";

const pattern = /(?<date>\d{4}-\d{2}-\d{2}) (?<level>\w+) (?<message>.+)/;
const match = log.match(pattern);

if (match) {
    console.log(match.groups.date);     // 2026-03-15
    console.log(match.groups.level);    // ERROR
    console.log(match.groups.message);  // Connection refused
}

PCRE (grep -P) Example

# Named groups in grep (limited support)
echo "192.168.1.100" | grep -oP '(?<oct1>\d+)\.(?<oct2>\d+)\.(?<oct3>\d+)\.(?<oct4>\d+)'
# Still outputs: 192.168.1.100
# (grep doesn't have substitution, use with other tools)

Nested Groups

Groups can be nested. Numbering follows opening parenthesis order.

Pattern: (((\d{2}):(\d{2})):(\d{2}))
Text:    14:30:45

Group 1: 14:30:45  (outermost)
Group 2: 14:30     (hour:minute)
Group 3: 14        (hour)
Group 4: 30        (minute)
Group 5: 45        (second)

Counting Rule: Count ( from left to right.

((a)(b(c)))
12  3 4

Group 1: abc
Group 2: a
Group 3: bc
Group 4: c
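
The numbering can be confirmed in Python. This sketch assumes the `re` module and reads the group count from the compiled pattern's `groups` attribute.

```python
import re

match = re.fullmatch(r'((a)(b(c)))', 'abc')

# Map each group number to the text it captured.
numbered = {n: match.group(n) for n in range(1, match.re.groups + 1)}
print(numbered)  # {1: 'abc', 2: 'a', 3: 'bc', 4: 'c'}

# The nested time pattern follows the same rule.
time_groups = re.fullmatch(r'(((\d{2}):(\d{2})):(\d{2}))', '14:30:45').groups()
print(time_groups)  # ('14:30:45', '14:30', '14', '30', '45')
```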

Practical Patterns

Extract Key-Value Pairs

import re

config = """
server=192.168.1.1
port=8080
timeout=30
"""

pattern = r'^(\w+)=(.+)$'
for match in re.finditer(pattern, config, re.MULTILINE):
    key, value = match.groups()
    print(f"{key}: {value}")

# Output:
# server: 192.168.1.1
# port: 8080
# timeout: 30

Parse Log Entries

import re

log = "Mar 15 14:30:45 server sshd[12345]: Failed password for root from 10.0.0.1"

pattern = r'(\w+\s+\d+\s+[\d:]+)\s+(\w+)\s+(\w+)\[(\d+)\]:\s+(.+)'
match = re.search(pattern, log)

if match:
    timestamp, host, service, pid, message = match.groups()
    # timestamp: Mar 15 14:30:45
    # host: server
    # service: sshd
    # pid: 12345
    # message: Failed password for root from 10.0.0.1

Validate and Extract URL Parts

import re

url = "https://api.example.com:8443/v1/users?id=123"

pattern = r'^(?P<proto>https?)://(?P<host>[^:/]+)(?::(?P<port>\d+))?(?P<path>/[^?]*)?(?:\?(?P<query>.+))?$'
match = re.match(pattern, url)

if match:
    print(match.groupdict())
    # {'proto': 'https', 'host': 'api.example.com', 'port': '8443',
    #  'path': '/v1/users', 'query': 'id=123'}

Extract IP and Port

# From netstat/ss output
ss -tlnp | grep -oP '(\d+\.\d+\.\d+\.\d+):(\d+)'

# Separate IP and port with sed
ss -tlnp | sed -nE 's/.*\s([0-9.]+):([0-9]+).*/IP: \1, Port: \2/p'
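
A Python sketch of the same split using named groups. The sample `line` is hypothetical, written in the style of `ss -tln` output rather than captured from a real host.

```python
import re

# Hypothetical line in the style of `ss -tln` output.
line = "LISTEN 0      128          0.0.0.0:22        0.0.0.0:*"

match = re.search(r'(?P<ip>\d+\.\d+\.\d+\.\d+):(?P<port>\d+)', line)
endpoint = match.groupdict()

print(endpoint)  # {'ip': '0.0.0.0', 'port': '22'}
```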

Self-Test Exercises

Try each challenge FIRST. Only read the answer after you’ve attempted it.

Setup Test Data

cat << 'EOF' > /tmp/groups.txt
Date: 2026-03-15
Time: 14:30:45
Server: web-01.inside.domusdigitalis.dev
IP: 192.168.1.100
MAC: AA:BB:CC:DD:EE:FF
Log: 2026-03-15 ERROR Connection refused
URL: https://api.example.com:8443/v1/users
Duplicate: the the quick brown
Duplicate: word word here
Config: server=192.168.1.1
Name: Rosado, Evan
Pair: key=value
EOF

Challenge 1: Extract Date Components

Goal: Extract just the date (2026-03-15) from lines containing dates

Answer

grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' /tmp/groups.txt

This extracts the whole date. For individual parts (year, month, day), you’d use capturing groups in Python or sed.

Challenge 2: Extract Hostname from FQDN

Goal: Extract just "web-01" from "web-01.inside.domusdigitalis.dev"

Answer

# Using \K (PCRE)
grep -oP 'Server: \K[^.]+' /tmp/groups.txt

# Using sed with capture group
grep 'Server:' /tmp/groups.txt | sed -E 's/Server: ([^.]+)\..+/\1/'

[^.]+ matches one or more non-dot characters (the hostname).

Challenge 3: Find Duplicate Words

Goal: Find lines with duplicate adjacent words (like "the the" or "word word")

Answer

grep -P '\b(\w+)\s+\1\b' /tmp/groups.txt

(\w+) captures a word, \1 matches the SAME word again. The command uses PCRE (-P); GNU grep also accepts backreferences with -E as an extension.

Challenge 4: Swap Name Format

Goal: Change "Rosado, Evan" to "Evan Rosado"

Answer

grep "Name:" /tmp/groups.txt | sed -E 's/Name: (\w+), (\w+)/Name: \2 \1/'

\1 is first capture (Rosado), \2 is second (Evan). Swap them.

Challenge 5: Extract Key-Value Pairs

Goal: Extract lines that match "key=value" format

Answer

grep -E '\w+=.+' /tmp/groups.txt

Or extract just the key=value token from each matching line:

grep -oE '\w+=[^[:space:]]+' /tmp/groups.txt
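
Pulling the key and the value apart needs capturing groups, which `grep -o` cannot return on their own. A minimal Python sketch over three lines from the test data:

```python
import re

lines = ["Config: server=192.168.1.1", "Pair: key=value", "Time: 14:30:45"]

pairs = []
for line in lines:
    m = re.search(r'(\w+)=(\S+)', line)
    if m:  # lines without "=" are skipped
        pairs.append(m.groups())

print(pairs)  # [('server', '192.168.1.1'), ('key', 'value')]
```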

Challenge 6: Group for Repetition (IP Pattern)

Goal: Match the IP address using grouping for the repeated octets

Answer

grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' /tmp/groups.txt

([0-9]{1,3}\.){3} groups "digits + dot" and repeats 3 times.

Challenge 7: Extract Time Components

Goal: Extract just the time (14:30:45)

Answer

grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}' /tmp/groups.txt

Or using groups: ([0-9]{2}:){2}[0-9]{2}

Challenge 8: Reverse IP Octets with sed

Goal: Transform 192.168.1.100 → 100.1.168.192

Answer

echo "192.168.1.100" | sed -E 's/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/\4.\3.\2.\1/'

Four capture groups, referenced in reverse order.

Challenge 9: Extract Domain from URL

Goal: Extract "api.example.com" from the URL

Answer

# Using \K
grep -oP 'https?://\K[^:/]+' /tmp/groups.txt

# Using lookbehind
grep -oP '(?<=://)[^:/]+' /tmp/groups.txt

[^:/]+ matches until we hit colon or slash.

Challenge 10: Non-Capturing Group

Goal: Match http:// or https:// URLs but don’t capture the protocol

Answer

grep -oP '(?:https?)://\S+' /tmp/groups.txt

(?:…) groups without capturing. Useful when you just need grouping for quantifiers or alternation.

Challenge 11: Extract Port from URL

Goal: Extract just "8443" from the URL

Answer

grep -oP ':\K\d+(?=/)' /tmp/groups.txt

\K resets the match start, \d+ matches the port, (?=/) ensures it’s followed by a slash.
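
Python's built-in `re` module does not support `\K`, so a capturing group or a lookbehind does the same job there. A sketch using the URL line from the test data:

```python
import re

line = "URL: https://api.example.com:8443/v1/users"

# Capturing group: match the colon and slash, keep only the digits.
port_from_group = re.search(r':(\d+)/', line).group(1)

# Lookbehind and lookahead: the match itself is just the digits.
port_from_lookaround = re.search(r'(?<=:)\d+(?=/)', line).group(0)

print(port_from_group)       # 8443
print(port_from_lookaround)  # 8443
```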

Challenge 12: MAC Address with Grouped Octets

Goal: Match the MAC address using grouping for the repeated pattern

Answer

grep -oE '([A-F0-9]{2}:){5}[A-F0-9]{2}' /tmp/groups.txt

([A-F0-9]{2}:){5} groups "two hex chars + colon" repeated 5 times.

Common Mistakes

Mistake 1: Wrong Group Number

Pattern: ((a)(b))
         12  3

# Wrong assumption: Group 2 is "ab"
# Correct: Group 2 is "a"

Fix: Count opening parentheses from left.

Mistake 2: Capturing When Not Needed

# Wasteful - creates unnecessary captures
Pattern: (https?)://(www\.)?(.+)

# Better - use non-capturing for protocol/www
Pattern: (?:https?)://(?:www\.)?(.+)

Mistake 3: Backreference to Wrong Group

# Looking for duplicate words
# Wrong - there is only one group, so \2 refers to a group that doesn't exist
grep -P '(\w+)\s+\2' file.txt

# Correct
grep -P '(\w+)\s+\1' file.txt
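
Python rejects the same mistake when the pattern is compiled. In this sketch, `compiles` is a hypothetical helper that reports whether a pattern is accepted.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True when the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r'(\w+)\s+\2'))  # False: group 2 does not exist
print(compiles(r'(\w+)\s+\1'))  # True
```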

Key Takeaways

() captures text - for extraction and backreferences
(?:) groups without capturing - for quantifiers/alternation only
Groups numbered left-to-right - by opening parenthesis
\1, \2 for backreferences - match same text again
Named groups (?P<name>) or (?<name>) - improve readability
Use non-capturing when you don't need the text - cleaner numbering, slightly less work for the engine

Next Module

Alternation & Conditionals - OR logic and conditional patterns.