Groups & Capturing
Groups serve two purposes: they bundle elements for quantification and alternation, and they capture matched text for extraction or backreferencing. Understanding groups unlocks regex’s power for data extraction.
Basic Groups ()
Parentheses create a group that:
- Bundles elements as a single unit
- Captures the matched text
Pattern: (ab)+
Text: abababab
Matches: ^^^^^^^^
(group matched 4 times)
Grouping for Quantifiers
Without grouping, quantifiers apply only to the preceding element:
Pattern: ab+
Matches: ab, abb, abbb (b repeated)
Pattern: (ab)+
Matches: ab, abab, ababab (ab repeated)
Infrastructure Example:
Pattern: ([0-9]{1,3}\.){3}[0-9]{1,3}
Matches: 192.168.1.100
(octets with dots grouped, repeated 3 times)
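The difference can be checked with a minimal Python sketch that uses only the standard `re` module (the variable names are illustrative). It also shows a consequence of quantifying a capturing group: the group keeps only the text from its last repetition.

```python
import re

# Without a group, + applies only to "b".
plain = re.fullmatch(r'ab+', 'abbb')
# With a group, + applies to the whole unit "ab".
grouped = re.fullmatch(r'(ab)+', 'ababab')

print(plain.group(0))    # abbb
print(grouped.group(0))  # ababab
print(re.fullmatch(r'ab+', 'abab'))  # None: "ab+" cannot repeat "ab"

# A quantified capturing group retains only its final iteration.
ip = re.fullmatch(r'([0-9]{1,3}\.){3}[0-9]{1,3}', '192.168.1.100')
print(ip.group(0))  # 192.168.1.100
print(ip.group(1))  # 1.
```

If every octet is needed separately, give each one its own group instead of repeating a single group.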
Capturing Groups
Captured groups are numbered left-to-right by opening parenthesis.
Pattern: (\d{4})-(\d{2})-(\d{2})
Text: 2026-03-15
Group 0: 2026-03-15 (entire match)
Group 1: 2026
Group 2: 03
Group 3: 15
Using Captures in grep
# Match a date (grep -o prints the whole match, not the individual groups)
echo "2026-03-15" | grep -oP '(\d{4})-(\d{2})-(\d{2})'
# Output: 2026-03-15
# Extract just the year (using lookahead - covered later)
echo "Date: 2026-03-15" | grep -oP '\d{4}(?=-\d{2}-\d{2})'
# Output: 2026
Using Captures in Python
import re
text = "Error on 2026-03-15 at 14:30:00"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)
if match:
    print(match.group(0))  # 2026-03-15 (full match)
    print(match.group(1))  # 2026 (year)
    print(match.group(2))  # 03 (month)
    print(match.group(3))  # 15 (day)
    print(match.groups())  # ('2026', '03', '15')
Using Captures in JavaScript
const text = "Error on 2026-03-15 at 14:30:00";
const match = text.match(/(\d{4})-(\d{2})-(\d{2})/);
if (match) {
  console.log(match[0]); // 2026-03-15 (full match)
  console.log(match[1]); // 2026 (year)
  console.log(match[2]); // 03 (month)
  console.log(match[3]); // 15 (day)
}
Backreferences
Backreferences match the same text that was captured earlier.
Syntax
| Syntax | Context | Example |
|---|---|---|
| \1, \2 | Pattern matching | (\w+)\s+\1 matches a repeated word |
| \1 or $1 | Replacement (most tools) | Replace with captured text |
| \g{1}, \k<name> | PCRE named/numbered | Avoids ambiguity |
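The ambiguity arises when a literal digit follows a numbered backreference. A small sketch in Python, whose replacement strings use `\g<1>` for the same purpose; the sample text is illustrative:

```python
import re

text = "Count: 42"

# Intent: append a literal "0" after group 1.
# r'\10' is read as a reference to group 10, which does not exist here.
try:
    re.sub(r'(\d+)', r'\10', text)
    ambiguous_error = None
except re.error as exc:
    ambiguous_error = str(exc)

# \g<1> delimits the group number, so the trailing 0 stays literal.
result = re.sub(r'(\d+)', r'\g<1>0', text)

print(ambiguous_error is not None)  # True
print(result)                       # Count: 420
```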
Finding Duplicates
Pattern: \b(\w+)\s+\1\b
Text: The the quick brown fox fox
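Matches: "fox fox" ("The the" also matches, but only with case-insensitive matching such as grep -i, because \1 must repeat the captured text exactly)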
# Find duplicate words
grep -P '\b(\w+)\s+\1\b' document.txt
# Find duplicate lines (adjacent)
uniq -d file.txt
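# Find duplicate configuration keys on adjacent lines
# (grep matches line by line, so -z is needed for \n to match; GNU grep assumed)
grep -Pzo '(?m)^([^=\s]+)=.*\n\1=.*' config.ini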
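The same backreference works in Python, both inside the pattern and inside a replacement. This is a minimal sketch; `re.IGNORECASE` is added so that "The the" counts as a duplicate even though the two words differ in case.

```python
import re

text = "The the quick brown fox fox"

# \1 inside the pattern must repeat the text that group 1 captured.
dupes = re.compile(r'\b(\w+)\s+\1\b', re.IGNORECASE)

found = [m.group(0) for m in dupes.finditer(text)]
print(found)  # ['The the', 'fox fox']

# \1 inside the replacement inserts the captured text once.
cleaned = dupes.sub(r'\1', text)
print(cleaned)  # The quick brown fox
```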
Backreferences in sed
# Swap first and last name
echo "Rosado, Evan" | sed -E 's/(\w+), (\w+)/\2 \1/'
# Output: Evan Rosado
# Repeat the captured digits (string duplication, not arithmetic)
echo "Count: 42" | sed -E 's/([0-9]+)/\1\1/'
# Output: Count: 4242
# Quote values
echo "name=value" | sed -E 's/=(.+)/="\1"/'
# Output: name="value"
Infrastructure Examples
# Swap IP octets (reverse)
echo "192.168.1.100" | sed -E 's/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/\4.\3.\2.\1/'
# Output: 100.1.168.192
# Reformat MAC address (: to -)
echo "AA:BB:CC:DD:EE:FF" | sed -E 's/([A-F0-9]{2}):([A-F0-9]{2}):([A-F0-9]{2}):([A-F0-9]{2}):([A-F0-9]{2}):([A-F0-9]{2})/\1-\2-\3-\4-\5-\6/'
# Output: AA-BB-CC-DD-EE-FF
# Extract hostname from FQDN
echo "server.inside.domusdigitalis.dev" | sed -E 's/^([^.]+)\..+/\1/'
# Output: server
Non-Capturing Groups (?:)
When you need grouping but don’t need to capture:
Pattern: (?:ab)+
Matches: abab (but doesn't capture)
# Skips storing the matched text (usually a small performance gain)
# Keeps group numbers cleaner
When to Use
| Use Capturing () | Use Non-Capturing (?:) |
|---|---|
| Need to extract the matched text | Just need grouping for quantifiers |
| Need backreference | Don’t need the value |
| Few groups | Many groups (performance) |
Example: Protocol Matching
# Capturing (unnecessary capture)
Pattern: (https?)://(\w+)
Group 1: https
Group 2: domain
# Non-capturing protocol, capturing domain
Pattern: (?:https?)://(\w+)
Group 1: domain (the one we care about)
# Extract domain, ignore protocol
echo "https://api.example.com/v1" | grep -oP '(?:https?)://\K[^/]+'
# Output: api.example.com
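In Python the choice is visible in what `re.findall` returns. A short sketch, with an illustrative sample string and a simplified host pattern (`[\w.]+`) that is an assumption of this example:

```python
import re

text = "see http://example.org and https://api.example.com/v1"

# Two capturing groups: findall returns one tuple per match.
both = re.findall(r'(https?)://([\w.]+)', text)
print(both)  # [('http', 'example.org'), ('https', 'api.example.com')]

# Non-capturing protocol: only the host is captured, so findall
# returns plain strings and the host is group 1.
hosts = re.findall(r'(?:https?)://([\w.]+)', text)
print(hosts)  # ['example.org', 'api.example.com']
```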
Named Groups (?P<name>) or (?<name>)
Named groups improve readability and maintenance.
Syntax by Flavor
| Flavor | Define | Reference |
|---|---|---|
| Python | (?P<name>...) | (?P=name) |
| PCRE, JavaScript | (?<name>...) | \k<name> |
| .NET | (?<name>...) or (?'name'...) | \k<name> |
Python Example
import re
log = "2026-03-15 ERROR Connection refused from 192.168.1.100"
pattern = r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>\w+) (?P<message>.+)'
match = re.search(pattern, log)
if match:
    print(match.group('date'))     # 2026-03-15
    print(match.group('level'))    # ERROR
    print(match.group('message'))  # Connection refused from 192.168.1.100
    print(match.groupdict())       # {'date': '2026-03-15', 'level': 'ERROR', ...}
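Named groups can also be referenced by name. A brief sketch of Python's two forms, `(?P=name)` inside a pattern and `\g<name>` inside a replacement, reusing sample strings from this module:

```python
import re

# (?P=word) refers back to the group named "word" inside the pattern.
dupe = re.compile(r'\b(?P<word>\w+)\s+(?P=word)\b')
print(dupe.search("word word here").group('word'))  # word

# \g<name> refers to a named group inside a replacement string.
swapped = re.sub(r'(?P<last>\w+), (?P<first>\w+)',
                 r'\g<first> \g<last>',
                 "Rosado, Evan")
print(swapped)  # Evan Rosado
```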
JavaScript Example (ES2018+)
const log = "2026-03-15 ERROR Connection refused";
const pattern = /(?<date>\d{4}-\d{2}-\d{2}) (?<level>\w+) (?<message>.+)/;
const match = log.match(pattern);
if (match) {
  console.log(match.groups.date);    // 2026-03-15
  console.log(match.groups.level);   // ERROR
  console.log(match.groups.message); // Connection refused
}
PCRE (grep -P) Example
# Named groups in grep (limited support)
echo "192.168.1.100" | grep -oP '(?<oct1>\d+)\.(?<oct2>\d+)\.(?<oct3>\d+)\.(?<oct4>\d+)'
# Still outputs: 192.168.1.100
# (grep doesn't have substitution, use with other tools)
Nested Groups
Groups can be nested. Numbering follows opening parenthesis order.
Pattern: (((\d{2}):(\d{2})):(\d{2}))
Text: 14:30:45
Group 1: 14:30:45 (outermost)
Group 2: 14:30 (hour:minute)
Group 3: 14 (hour)
Group 4: 30 (minute)
Group 5: 45 (second)
Counting Rule: Count ( from left to right.
((a)(b(c)))
12  3 4
Group 1: abc
Group 2: a
Group 3: bc
Group 4: c
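The numbering can be confirmed in Python. This sketch applies the time pattern from above and prints each group with its start offset:

```python
import re

m = re.fullmatch(r'(((\d{2}):(\d{2})):(\d{2}))', '14:30:45')

# groups() lists captures in group-number order (1, 2, 3, ...).
print(m.groups())  # ('14:30:45', '14:30', '14', '30', '45')

# start(n) is the offset where group n begins. Outer groups open first,
# so they get the lower numbers even when several start at offset 0.
for n in range(1, 6):
    print(n, m.start(n), m.group(n))
```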
Practical Patterns
Extract Key-Value Pairs
import re
config = """
server=192.168.1.1
port=8080
timeout=30
"""
pattern = r'^(\w+)=(.+)$'
for match in re.finditer(pattern, config, re.MULTILINE):
    key, value = match.groups()
    print(f"{key}: {value}")
# Output:
# server: 192.168.1.1
# port: 8080
# timeout: 30
Parse Log Entries
import re
log = "Mar 15 14:30:45 server sshd[12345]: Failed password for root from 10.0.0.1"
pattern = r'(\w+\s+\d+\s+[\d:]+)\s+(\w+)\s+(\w+)\[(\d+)\]:\s+(.+)'
match = re.search(pattern, log)
if match:
    timestamp, host, service, pid, message = match.groups()
    # timestamp: Mar 15 14:30:45
    # host: server
    # service: sshd
    # pid: 12345
    # message: Failed password for root from 10.0.0.1
Validate and Extract URL Parts
import re
url = "https://api.example.com:8443/v1/users?id=123"
pattern = r'^(?P<proto>https?)://(?P<host>[^:/]+)(?::(?P<port>\d+))?(?P<path>/[^?]*)?(?:\?(?P<query>.+))?$'
match = re.match(pattern, url)
if match:
    print(match.groupdict())
    # {'proto': 'https', 'host': 'api.example.com', 'port': '8443',
    #  'path': '/v1/users', 'query': 'id=123'}
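Optional groups that do not take part in the match come back as `None`. The sketch below reuses the same pattern on a URL without a port or query string; the URL and the `"80"` fallback are illustrative assumptions.

```python
import re

pattern = re.compile(
    r'^(?P<proto>https?)://(?P<host>[^:/]+)(?::(?P<port>\d+))?'
    r'(?P<path>/[^?]*)?(?:\?(?P<query>.+))?$'
)

# This URL has no port and no query string.
match = pattern.match("http://example.com/index.html")
parts = match.groupdict()
print(parts['port'])   # None
print(parts['query'])  # None

# groupdict(default=...) substitutes a value for unmatched groups.
with_defaults = match.groupdict(default="")
# Fallback chosen by the caller; "80" is only an example value.
port = match.group('port') or "80"
print(port)  # 80
```

Checking for `None` before converting a value such as the port to an integer avoids a `TypeError` on URLs that omit it.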
Extract IP and Port
# From netstat/ss output
ss -tlnp | grep -oP '(\d+\.\d+\.\d+\.\d+):(\d+)'
# Separate IP and port with sed
ss -tlnp | sed -nE 's/.*\s([0-9.]+):([0-9]+).*/IP: \1, Port: \2/p'
Self-Test Exercises
Tip: Try each challenge first. Read the answer only after you’ve attempted it.
Setup Test Data
cat << 'EOF' > /tmp/groups.txt
Date: 2026-03-15
Time: 14:30:45
Server: web-01.inside.domusdigitalis.dev
IP: 192.168.1.100
MAC: AA:BB:CC:DD:EE:FF
Log: 2026-03-15 ERROR Connection refused
URL: https://api.example.com:8443/v1/users
Duplicate: the the quick brown
Duplicate: word word here
Config: server=192.168.1.1
Name: Rosado, Evan
Pair: key=value
EOF
Challenge 1: Extract Date Components
Goal: Extract just the date (2026-03-15) from lines containing dates
Answer
grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' /tmp/groups.txt
This extracts the whole date. For individual parts (year, month, day), you’d use capturing groups in Python or sed.
Challenge 2: Extract Hostname from FQDN
Goal: Extract just "web-01" from "web-01.inside.domusdigitalis.dev"
Answer
# Using \K (PCRE)
grep -oP 'Server: \K[^.]+' /tmp/groups.txt
# Using sed with capture group
grep 'Server:' /tmp/groups.txt | sed -E 's/Server: ([^.]+)\..+/\1/'
[^.]+ matches one or more non-dot characters (the hostname).
Challenge 3: Find Duplicate Words
Goal: Find lines with duplicate adjacent words (like "the the" or "word word")
Answer
grep -P '\b(\w+)\s+\1\b' /tmp/groups.txt
(\w+) captures a word, \1 matches the SAME word again. PCRE required.
Challenge 4: Swap Name Format
Goal: Change "Rosado, Evan" to "Evan Rosado"
Answer
grep "Name:" /tmp/groups.txt | sed -E 's/Name: (\w+), (\w+)/Name: \2 \1/'
\1 is first capture (Rosado), \2 is second (Evan). Swap them.
Challenge 5: Extract Key-Value Pairs
Goal: Extract lines that match "key=value" format
Answer
grep -E '\w+=.+' /tmp/groups.txt
Or print only the key=value token rather than the whole line:
grep -oE '\w+=[^[:space:]]+' /tmp/groups.txt
Challenge 6: Group for Repetition (IP Pattern)
Goal: Match the IP address using grouping for the repeated octets
Answer
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' /tmp/groups.txt
([0-9]{1,3}\.){3} groups "digits + dot" and repeats it 3 times.
Challenge 7: Extract Time Components
Goal: Extract just the time (14:30:45)
Answer
grep -oE '[0-9]{2}:[0-9]{2}:[0-9]{2}' /tmp/groups.txt
Or using groups: ([0-9]{2}:){2}[0-9]{2}
Challenge 8: Reverse IP Octets with sed
Goal: Transform 192.168.1.100 → 100.1.168.192
Answer
echo "192.168.1.100" | sed -E 's/([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/\4.\3.\2.\1/'
Four capture groups, referenced in reverse order.
Challenge 9: Extract Domain from URL
Goal: Extract "api.example.com" from the URL
Answer
# Using \K
grep -oP 'https?://\K[^:/]+' /tmp/groups.txt
# Using lookbehind
grep -oP '(?<=://)[^:/]+' /tmp/groups.txt
[^:/]+ matches until we hit colon or slash.
Challenge 10: Non-Capturing Group
Goal: Match http:// or https:// URLs but don’t capture the protocol
Answer
grep -oP '(?:https?)://\S+' /tmp/groups.txt
(?:…) groups without capturing. Useful when you just need grouping for quantifiers or alternation.
Challenge 11: Extract Port from URL
Goal: Extract just "8443" from the URL
Answer
grep -oP ':\K\d+(?=/)' /tmp/groups.txt
\K resets the match start, \d+ matches the port, and (?=/) ensures it’s followed by a slash.
Challenge 12: MAC Address with Grouped Octets
Goal: Match the MAC address using grouping for the repeated pattern
Answer
grep -oE '([A-F0-9]{2}:){5}[A-F0-9]{2}' /tmp/groups.txt
([A-F0-9]{2}:){5} groups "two hex chars + colon" repeated 5 times.
Common Mistakes
Mistake 1: Wrong Group Number
Pattern: ((a)(b))
12  3
# Wrong assumption: Group 2 is "ab"
# Correct: Group 2 is "a"
Fix: Count opening parentheses from left.
Mistake 2: Capturing When Not Needed
# Wasteful - creates unnecessary captures
Pattern: (https?)://(www\.)?(.+)
# Better - use non-capturing for protocol/www
Pattern: (?:https?)://(?:www\.)?(.+)
Mistake 3: Backreference to Wrong Group
# Looking for duplicate words
# Wrong - \2 doesn't exist or is wrong group
grep -P '(\w+)\s+\2' file.txt
# Correct
grep -P '(\w+)\s+\1' file.txt
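Some engines report this mistake up front. As a sketch of how Python's `re` behaves, compiling a pattern that references an undefined group raises `re.error`, and the `groups` attribute of a compiled pattern gives the number of capturing groups available:

```python
import re

# Python rejects a reference to a group the pattern never defines.
try:
    re.compile(r'(\w+)\s+\2')
    compile_error = None
except re.error as exc:
    compile_error = str(exc)
print(compile_error)

# pattern.groups reports how many capturing groups exist.
good = re.compile(r'(\w+)\s+\1')
print(good.groups)  # 1
```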
Key Takeaways
- () captures text: for extraction and backreferences
- (?:) groups without capturing: for quantifiers/alternation only
- Groups are numbered left-to-right by opening parenthesis
- \1, \2 for backreferences: match the same text again
- Named groups (?P<name>) improve readability
- Use non-capturing groups when possible: better performance
Next Module
Alternation & Conditionals - OR logic and conditional patterns.