Groups & Capturing

Groups serve two purposes: they bundle elements for quantification and alternation, and they capture matched text for extraction or backreferencing. Understanding groups unlocks regex’s power for data extraction.

Basic Groups ()

Parentheses create a group that:

  1. Bundles elements as a single unit

  2. Captures the matched text

Pattern: (ab)+
Text:    abababab
Matches: ^^^^^^^^
         (group matched 4 times)

Grouping for Quantifiers

Without grouping, quantifiers apply only to the preceding element:

Pattern: ab+
Matches: ab, abb, abbb (b repeated)

Pattern: (ab)+
Matches: ab, abab, ababab (ab repeated)
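
The difference can be checked quickly in Python. This is a minimal sketch with illustrative sample strings; it uses re.fullmatch so each pattern has to account for the whole string. The last lines show a related detail: a quantified capturing group keeps only its final repetition.

```python
import re

# ab+ repeats only the "b"; (ab)+ repeats the whole "ab" unit.
b_repeated = re.fullmatch(r'ab+', 'abbb') is not None
no_group = re.fullmatch(r'ab+', 'abab') is not None
grouped = re.fullmatch(r'(ab)+', 'abab') is not None
print(b_repeated, no_group, grouped)

# A quantified capturing group remembers only its last repetition.
m = re.fullmatch(r'(a|b)+', 'aab')
last_iteration = m.group(1)
print(last_iteration)
```

If every repetition is needed, a common approach is to match the repeated unit with re.findall or re.finditer instead of quantifying the group.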

Infrastructure Example:

Pattern: ([0-9]{1,3}\.){3}[0-9]{1,3}
Matches: 192.168.1.100
         (octets with dots grouped, repeated 3 times)
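
As a rough check of the pattern above, the following Python sketch runs it against the sample address and against an out-of-range address chosen for illustration. It suggests two things worth knowing: the group holds only the last "octet + dot" repetition, and the pattern validates shape, not the 0-255 range.

```python
import re

ip_pattern = re.compile(r'([0-9]{1,3}\.){3}[0-9]{1,3}')

m = ip_pattern.fullmatch('192.168.1.100')
full = m.group(0)
last_repetition = m.group(1)  # only the final "octet + dot" repetition survives
print(full, last_repetition)

# The pattern checks the shape only; each octet may exceed 255.
out_of_range = ip_pattern.fullmatch('999.999.999.999') is not None
print(out_of_range)
```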

Capturing Groups

Capturing groups are numbered left to right by the position of their opening parenthesis.

Pattern: (\d{4})-(\d{2})-(\d{2})
Text:    2026-03-15

Group 0: 2026-03-15  (entire match)
Group 1: 2026
Group 2: 03
Group 3: 15

Using Captures in grep

# Match a date (grep -o prints the whole match; it cannot print individual groups)
echo "2026-03-15" | grep -oP '(\d{4})-(\d{2})-(\d{2})'
# Output: 2026-03-15

# Extract just the year (using lookahead - covered later)
echo "Date: 2026-03-15" | grep -oP '\d{4}(?=-\d{2}-\d{2})'
# Output: 2026

Using Captures in Python

import re

text = "Error on 2026-03-15 at 14:30:00"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)

if match:
    print(match.group(0))  # 2026-03-15 (full match)
    print(match.group(1))  # 2026 (year)
    print(match.group(2))  # 03 (month)
    print(match.group(3))  # 15 (day)
    print(match.groups())  # ('2026', '03', '15')

Using Captures in JavaScript

const text = "Error on 2026-03-15 at 14:30:00";
const match = text.match(/(\d{4})-(\d{2})-(\d{2})/);

if (match) {
    console.log(match[0]);  // 2026-03-15 (full match)
    console.log(match[1]);  // 2026 (year)
    console.log(match[2]);  // 03 (month)
    console.log(match[3]);  // 15 (day)
}

Backreferences

A backreference matches the exact text that a group captured earlier, not the group's pattern: (\d)\1 matches "11" but not "12".

Syntax

Syntax            Context                                     Example
\1, \2, etc.      Pattern matching                            (\w+)\s+\1 matches "the the"
$1, $2, etc.      Replacement (Perl, JavaScript, and others;  Replace with captured text
                  sed uses \1)
\g{1}, \g{name}   PCRE named/numbered                         Avoids ambiguity when a digit follows, e.g. \g{1}0

Finding Duplicates

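Pattern: \b(\w+)\s+\1\b
Text:    The the quick brown fox fox
Matches:                     ^^^^^^^
         ("The the" is skipped: \1 compares case-sensitively unless matching is case-insensitive)

# Find duplicate words (add -i to also catch "The the")
grep -P '\b(\w+)\s+\1\b' document.txt

# Find duplicate lines (adjacent)
uniq -d file.txt

# Find the same key on two adjacent lines
# (grep matches one line at a time, so \n can only match with -z, which reads the file
#  as a single record; (?m) lets ^ match at each line start with GNU grep, and tr turns
#  the NUL terminator that -z prints back into a newline)
grep -Pzo '(?m)^([^=\s]+)=.*\n\1=' config.ini | tr '\0' '\n'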
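
The same duplicate-word pattern can be tried in Python. The sketch below is illustrative only; it suggests that the backreference compares text case-sensitively by default, that re.IGNORECASE relaxes this, and that re.sub can collapse each duplicate to a single word.

```python
import re

text = "The the quick brown fox fox"

# Default matching: \1 must repeat the captured text exactly, including case.
dup = re.compile(r'\b(\w+)\s+\1\b')
case_sensitive = [m.group(0) for m in dup.finditer(text)]
print(case_sensitive)

# With IGNORECASE the backreference also ignores case.
dup_icase = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)
case_insensitive = [m.group(0) for m in dup_icase.finditer(text)]
print(case_insensitive)

# Keep only the first word of each duplicated pair.
collapsed = dup_icase.sub(r'\1', text)
print(collapsed)
```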

Backreferences in sed

# Swap first and last name
echo "Rosado, Evan" | sed -E 's/(\w+), (\w+)/\2 \1/'
# Output: Evan Rosado

# Repeat a number's digits (42 becomes 4242, not 84)
echo "Count: 42" | sed -E 's/([0-9]+)/\1\1/'
# Output: Count: 4242

# Quote values
echo "name=value" | sed -E 's/=(.+)/="\1"/'
# Output: name="value"

Infrastructure Examples

# Swap IP octets (reverse)
echo "192.168.1.100" | sed -E 's/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/\4.\3.\2.\1/'
# Output: 100.1.168.192

# Reformat MAC address (: to -)
echo "AA:BB:CC:DD:EE:FF" | sed -E 's/([A-F0-9]{2}):([A-F0-9]{2}):([A-F0-9]{2}):([A-F0-9]{2}):([A-F0-9]{2}):([A-F0-9]{2})/\1-\2-\3-\4-\5-\6/'
# Output: AA-BB-CC-DD-EE-FF

# Extract hostname from FQDN
echo "server.inside.domusdigitalis.dev" | sed -E 's/^([^.]+)\..+/\1/'
# Output: server
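
For comparison, here is a minimal Python sketch of the same substitutions using re.sub, which also writes replacement backreferences as \1, \2, and so on. The MAC example uses a shorter pattern with a lookahead instead of six groups, and the final line is an added illustration of \g<1>, which Python offers for the case where a literal digit follows the group reference.

```python
import re

# Reverse the four octets.
reversed_ip = re.sub(r'(\d+)\.(\d+)\.(\d+)\.(\d+)', r'\4.\3.\2.\1', '192.168.1.100')

# Replace each ":" that sits between two hex pairs with "-".
mac_dashed = re.sub(r'([A-F0-9]{2}):(?=[A-F0-9]{2})', r'\1-', 'AA:BB:CC:DD:EE:FF')

# Keep only the first label of the FQDN.
hostname = re.sub(r'^([^.]+)\..+', r'\1', 'server.inside.domusdigitalis.dev')

# \g<1>0 means "group 1, then a literal 0"; \10 would be read as group 10.
padded = re.sub(r'(\d+)', r'\g<1>0', 'Count: 42')

print(reversed_ip, mac_dashed, hostname, padded, sep='\n')
```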

Non-Capturing Groups (?:)

When you need grouping but don’t need to capture:

Pattern: (?:ab)+
Matches: abab (but doesn't capture)

# Skips storing the matched text, which can save a little memory and time
# Keeps group numbers cleaner

When to Use

Use Capturing ()                    Use Non-Capturing (?:)
Need to extract the matched text    Just need grouping for quantifiers
Need backreference                  Don’t need the value
Few groups                          Many groups (performance)

Example: Protocol Matching

# Capturing (unnecessary capture)
Pattern: (https?)://([^/]+)
Group 1: https
Group 2: domain

# Non-capturing protocol, capturing domain
Pattern: (?:https?)://([^/]+)
Group 1: domain (the one we care about)

# Extract domain, ignore protocol
echo "https://api.example.com/v1" | grep -oP '(?:https?)://\K[^/]+'
# Output: api.example.com
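
The effect on group numbering is easy to see in Python. This sketch uses the sample URL from the grep example and a [^/]+ domain pattern chosen for illustration; it compares the tuples returned by groups() with and without the non-capturing prefix.

```python
import re

url = "https://api.example.com/v1"

# Both parts captured: the domain is group 2.
with_capture = re.search(r'(https?)://([^/]+)', url).groups()

# Protocol grouped but not captured: the domain becomes group 1.
without_capture = re.search(r'(?:https?)://([^/]+)', url).groups()

print(with_capture)
print(without_capture)
```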

Named Groups (?P<name>) or (?<name>)

Named groups improve readability and maintenance.

Syntax by Flavor

Flavor             Define                         Reference
Python             (?P<name>...)                  (?P=name) in the pattern; \g<name> in the replacement
PCRE, JavaScript   (?<name>...)                   \k<name>
.NET               (?<name>...) or (?'name'...)   \k<name>

Python Example

import re

log = "2026-03-15 ERROR Connection refused from 192.168.1.100"

pattern = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>\w+) (?P<message>.+)'
match = re.search(pattern, log)

if match:
    print(match.group('date'))     # 2026-03-15
    print(match.group('level'))    # ERROR
    print(match.group('message'))  # Connection refused from 192.168.1.100
    print(match.groupdict())       # {'date': '2026-03-15', 'level': 'ERROR', ...}
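
Named groups can also be referenced, not just read back. The following sketch, with sample strings borrowed from elsewhere in this module, shows the two Python forms: \g<name> inside a replacement string and (?P=name) inside the pattern itself.

```python
import re

# \g<name> in the replacement: swap "Last, First" to "First Last".
swapped = re.sub(r'(?P<last>\w+), (?P<first>\w+)',
                 r'\g<first> \g<last>',
                 'Rosado, Evan')
print(swapped)

# (?P=name) in the pattern: match the same word twice in a row.
dup = re.search(r'\b(?P<word>\w+)\s+(?P=word)\b', 'word word here')
repeated = dup.group('word')
print(repeated)
```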

JavaScript Example (ES2018+)

const log = "2026-03-15 ERROR Connection refused";

const pattern = /(?<date>\d{4}-\d{2}-\d{2}) (?<level>\w+) (?<message>.+)/;
const match = log.match(pattern);

if (match) {
    console.log(match.groups.date);     // 2026-03-15
    console.log(match.groups.level);    // ERROR
    console.log(match.groups.message);  // Connection refused
}

PCRE (grep -P) Example

# grep -P accepts named groups, but -o still prints only the whole match
echo "192.168.1.100" | grep -oP '(?<oct1>\d+)\.(?<oct2>\d+)\.(?<oct3>\d+)\.(?<oct4>\d+)'
# Output: 192.168.1.100
# (grep cannot print or substitute individual groups; use sed, Perl, or Python for that)

Nested Groups

Groups can be nested. Numbering follows opening parenthesis order.

Pattern: (((\d{2}):(\d{2})):(\d{2}))
Text:    14:30:45

Group 1: 14:30:45  (outermost)
Group 2: 14:30     (hour:minute)
Group 3: 14        (hour)
Group 4: 30        (minute)
Group 5: 45        (second)

Counting Rule: Count ( from left to right.

((a)(b(c)))
12  3 4

Group 1: abc
Group 2: a
Group 3: bc
Group 4: c
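
The counting rule can be confirmed in Python, where groups() returns the captures in group-number order. This is a small sketch using the two patterns above; the span list is an added illustration of where each numbered group sits in the text.

```python
import re

m = re.fullmatch(r'(((\d{2}):(\d{2})):(\d{2}))', '14:30:45')
time_groups = m.groups()
print(time_groups)

# (start, end) offsets for groups 1 through 5.
time_spans = [m.span(i) for i in range(1, 6)]
print(time_spans)

abc_groups = re.fullmatch(r'((a)(b(c)))', 'abc').groups()
print(abc_groups)
```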

Practical Patterns

Extract Key-Value Pairs

import re

config = """
server=192.168.1.1
port=8080
timeout=30
"""

pattern = r'^(\w+)=(.+)$'
for match in re.finditer(pattern, config, re.MULTILINE):
    key, value = match.groups()
    print(f"{key}: {value}")

# Output:
# server: 192.168.1.1
# port: 8080
# timeout: 30

Parse Log Entries

import re

log = "Mar 15 14:30:45 server sshd[12345]: Failed password for root from 10.0.0.1"

pattern = r'(\w+\s+\d+\s+[\d:]+)\s+(\w+)\s+(\w+)\[(\d+)\]:\s+(.+)'
match = re.search(pattern, log)

if match:
    timestamp, host, service, pid, message = match.groups()
    # timestamp: Mar 15 14:30:45
    # host: server
    # service: sshd
    # pid: 12345
    # message: Failed password for root from 10.0.0.1

Validate and Extract URL Parts

import re

url = "https://api.example.com:8443/v1/users?id=123"

pattern = r'^(?P<proto>https?)://(?P<host>[^:/]+)(?::(?P<port>\d+))?(?P<path>/[^?]*)?(?:\?(?P<query>.+))?$'
match = re.match(pattern, url)

if match:
    print(match.groupdict())
    # {'proto': 'https', 'host': 'api.example.com', 'port': '8443',
    #  'path': '/v1/users', 'query': 'id=123'}

Extract IP and Port

# From netstat/ss output
ss -tlnp | grep -oP '(\d+\.\d+\.\d+\.\d+):(\d+)'

# Separate IP and port with sed
ss -tlnp | sed -nE 's/.*\s([0-9.]+):([0-9]+).*/IP: \1, Port: \2/p'
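
When the IP and port are needed as separate values, a scripting language is often more convenient than grep. The sketch below applies the same two-group pattern in Python to a hypothetical sample line; the line is only an assumed approximation of socket-listing output, not captured from a real system.

```python
import re

# Hypothetical sample line, assumed to resemble `ss -tln` output.
line = "LISTEN 0      128    192.168.1.100:8443    0.0.0.0:*"

m = re.search(r'(\d+\.\d+\.\d+\.\d+):(\d+)', line)
if m:
    ip = m.group(1)
    port = int(m.group(2))  # convert the captured digits to a number
    print(f"IP: {ip}, Port: {port}")
```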

Self-Test Exercises

Try each challenge FIRST. Only read the answer after you’ve attempted it.

Setup Test Data

cat << 'EOF' > /tmp/groups.txt
Date: 2026-03-15
Time: 14:30:45
Server: web-01.inside.domusdigitalis.dev
IP: 192.168.1.100
MAC: AA:BB:CC:DD:EE:FF
Log: 2026-03-15 ERROR Connection refused
URL: https://api.example.com:8443/v1/users
Duplicate: the the quick brown
Duplicate: word word here
Config: server=192.168.1.1
Name: Rosado, Evan
Pair: key=value
EOF

Challenge 1: Extract Date Components

Goal: Extract just the date (2026-03-15) from lines containing dates

Answer
grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' /tmp/groups.txt

This extracts the whole date. For individual parts (year, month, day), you’d use capturing groups in Python or sed.
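
One way to get the individual parts is sketched below in Python. To stay self-contained it uses an inline copy of three lines from the test data instead of reading /tmp/groups.txt; the variable names are illustrative.

```python
import re

# Inline copy of the relevant test-data lines.
lines = [
    "Date: 2026-03-15",
    "Time: 14:30:45",
    "Log: 2026-03-15 ERROR Connection refused",
]

date_re = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

parts = []
for line in lines:
    m = date_re.search(line)
    if m:
        year, month, day = m.groups()
        parts.append((year, month, day))

print(parts)
```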


Challenge 2: Extract Hostname from FQDN

Goal: Extract just "web-01" from "web-01.inside.domusdigitalis.dev"

Answer
# Using \K (PCRE)
grep -oP 'Server: \K[^.]+' /tmp/groups.txt

# Using sed with capture group
grep 'Server:' /tmp/groups.txt | sed -E 's/Server: ([^.]+)\..+/\1/'

[^.]+ matches one or more non-dot characters (the hostname).


Challenge 3: Find Duplicate Words

Goal: Find lines with duplicate adjacent words (like "the the" or "word word")

Answer
grep -P '\b(\w+)\s+\1\b' /tmp/groups.txt

(\w+) captures a word, \1 matches the SAME word again. The answer uses -P for consistency with the rest of this module; GNU grep also accepts this pattern with -E.


Challenge 4: Swap Name Format

Goal: Change "Rosado, Evan" to "Evan Rosado"

Answer
grep "Name:" /tmp/groups.txt | sed -E 's/Name: (\w+), (\w+)/Name: \2 \1/'

\1 is first capture (Rosado), \2 is second (Evan). Swap them.


Challenge 5: Extract Key-Value Pairs

Goal: Extract lines that match "key=value" format

Answer
grep -E '\w+=.+' /tmp/groups.txt

Or extract just the key=value pair itself, without the label in front of it:

grep -oE '\w+=[^[:space:]]+' /tmp/groups.txt

Challenge 6: Group for Repetition (IP Pattern)

Goal: Match the IP address using grouping for the repeated octets

Answer
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' /tmp/groups.txt

([0-9]{1,3}\.){3} groups "digits + dot" and repeats 3 times.


Challenge 7: Extract Time Components

Goal: Extract just the time (14:30:45)

Answer
grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}' /tmp/groups.txt

Or using groups: ([0-9]{2}:){2}[0-9]{2}


Challenge 8: Reverse IP Octets with sed

Goal: Transform 192.168.1.100 → 100.1.168.192

Answer
echo "192.168.1.100" | sed -E 's/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/\4.\3.\2.\1/'

Four capture groups, referenced in reverse order.


Challenge 9: Extract Domain from URL

Goal: Extract "api.example.com" from the URL

Answer
# Using \K
grep -oP 'https?://\K[^:/]+' /tmp/groups.txt

# Using lookbehind
grep -oP '(?<=://)[^:/]+' /tmp/groups.txt

[^:/]+ matches until we hit colon or slash.


Challenge 10: Non-Capturing Group

Goal: Match http:// or https:// URLs but don’t capture the protocol

Answer
grep -oP '(?:https?)://\S+' /tmp/groups.txt

(?:...) groups without capturing. Useful when you just need grouping for quantifiers or alternation.


Challenge 11: Extract Port from URL

Goal: Extract just "8443" from the URL

Answer
grep -oP ':\K\d+(?=/)' /tmp/groups.txt

\K drops everything matched so far from the reported match, \d+ matches the port, and (?=/) ensures it’s followed by a slash. No capture group is involved.
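
Python's standard re module does not support \K, so there the usual substitute is a capturing group around the part to keep. A minimal sketch, using the URL line from the test data inline:

```python
import re

line = "URL: https://api.example.com:8443/v1/users"

# The colon is matched but left outside the group; the lookahead requires a slash.
m = re.search(r':(\d+)(?=/)', line)
port = m.group(1) if m else None
print(port)
```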


Challenge 12: MAC Address with Grouped Octets

Goal: Match the MAC address using grouping for the repeated pattern

Answer
grep -oE '([A-F0-9]{2}:){5}[A-F0-9]{2}' /tmp/groups.txt

([A-F0-9]{2}:){5} groups "two hex chars + colon" repeated 5 times.

Common Mistakes

Mistake 1: Wrong Group Number

Pattern: ((a)(b))
         12  3

# Wrong assumption: Group 2 is "ab"
# Correct: Group 2 is "a"

Fix: Count opening parentheses from left.

Mistake 2: Capturing When Not Needed

# Wasteful - creates unnecessary captures
Pattern: (https?)://(www\.)?(.+)

# Better - use non-capturing for protocol/www
Pattern: (?:https?)://(?:www\.)?(.+)

Mistake 3: Backreference to Wrong Group

# Looking for duplicate words
# Wrong - \2 doesn't exist or is wrong group
grep -P '(\w+)\s+\2' file.txt

# Correct
grep -P '(\w+)\s+\1' file.txt
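
Some engines catch this mistake when the pattern is compiled. As an illustration, the Python sketch below tries both versions: the reference to a nonexistent group 2 is rejected with re.error, while the corrected pattern matches a sample duplicate.

```python
import re

# \2 refers to a group that does not exist, so compilation fails.
try:
    re.compile(r'(\w+)\s+\2')
    wrong_compiles = True
except re.error as exc:
    wrong_compiles = False
    print(f"rejected: {exc}")

# \1 refers to the only group in the pattern.
correct_matches = re.search(r'(\w+)\s+\1', 'the the') is not None
print(correct_matches)
```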

Key Takeaways

  1. () captures text - for extraction and backreferences

  2. (?:) groups without capturing - for quantifiers/alternation only

  3. Groups numbered left-to-right - by opening parenthesis

  4. \1, \2 for backreferences - match same text again

  5. Named groups (?P<name>) - improve readability

  6. Use non-capturing when possible - better performance

Next Module

Alternation & Conditionals - OR logic and conditional patterns.