Drill 10: Advanced Techniques
The final drill. These advanced techniques separate regex masters from regex users. Atomic groups prevent catastrophic backtracking, conditionals enable context-aware matching, and recursive patterns match nested structures.
Pattern Complexity Levels
| Level | Techniques | Use Case |
|---|---|---|
| Basic | | Simple matching |
| Intermediate | Groups, anchors, lookahead | Extraction and validation |
| Advanced | Lookbehind, backreferences | Context-aware matching |
| Expert | Atomic groups, conditionals, recursion | Complex parsing, optimization |
Interactive CLI Drill
bash ~/atelier/_bibliotheca/domus-captures/docs/modules/ROOT/examples/regex-drills/10-advanced.sh
Section 1: Atomic Groups and Possessive Quantifiers
The Backtracking Problem
Normal regex engines use backtracking - when a match fails, they try alternative paths.
# Catastrophic backtracking example
# Pattern: (a+)+b
# Input: aaaaaaaaaaaaaaaaaaaaaaaac
# The engine tries:
# (aaaaaaaaaaaaaaaaaaaaaaaaa) - fails (no b)
# (aaaaaaaaaaaaaaaaaaaaaaaa)(a) - fails
# (aaaaaaaaaaaaaaaaaaaaaaa)(aa) - fails
# ... exponential combinations!
This can hang your system or cause ReDoS (Regular Expression Denial of Service) attacks.
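A rough way to see where the blow-up comes from: on a failing input, `(a+)+` can carve a run of n a's into consecutive groups in 2^(n-1) different ways, and a backtracking engine may try each of them before giving up. The sketch below only counts those splits; `count_splits` is a hypothetical helper written for this illustration, not something a regex engine exposes.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    """Count the ways (a+)+ can carve a run of n 'a' characters into groups."""
    if n == 0:
        return 1  # nothing left to split: exactly one way to finish
    # The first group takes k characters (k = 1..n); the rest is split recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))


for n in (5, 10, 20, 25):
    print(f"{n:>2} a's -> {count_splits(n):,} candidate splits")
```

Each extra `a` doubles the count, which is why a few more characters of hostile input can turn a fast match into a hang.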
Atomic Groups: (?>…)
Atomic groups prevent backtracking into the group once matched.
# PCRE atomic group syntax
echo "aaaaaac" | grep -P '(?>a+)b'
# No match, but returns instantly (no backtracking)
# When the trailing b is present, the atomic group matches normally
echo "aaaaaab" | grep -P '(?>a+)b'
# Output: aaaaaab
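For engines without `(?>...)`, a common workaround is to capture inside a lookahead and then consume the capture with a backreference. The sketch below uses Python's built-in `re`, which only gained native atomic groups in 3.11; the variable names are illustrative. It relies on lookaheads being atomic: once the lookahead has captured the run of a's, the engine does not go back and shorten it.

```python
import re

# Plain greedy a+ can hand one 'a' back, so the trailing "ab" still matches.
greedy = re.compile(r"a+ab")

# Emulated atomic group: the lookahead captures the longest run of a's and
# \1 then consumes exactly that text. Nothing can be handed back afterwards.
atomic_like = re.compile(r"(?=(a+))\1ab")

# With a plain "b" after the run, the emulated atomic group matches as usual.
atomic_then_b = re.compile(r"(?=(a+))\1b")

for name, pattern in [
    ("greedy", greedy),
    ("atomic_like", atomic_like),
    ("atomic_then_b", atomic_then_b),
]:
    print(name, bool(pattern.search("aaab")))
```

The second pattern can never match, for the same reason `(?>a+)ab` cannot: the run of a's is locked in, so no `a` is left for the literal that follows.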
Possessive Quantifiers: ++, *+, ?+
Possessive quantifiers are shorthand for atomic groups.
# Possessive: a++ (same as (?>a+))
echo "aaaaaab" | grep -P 'a++b'
# Output: aaaaaab
# Possessive prevents backtracking
echo "aaaaaab" | grep -P 'a++ab'
# No match (a++ takes all a's and won't give one back for the literal a)
# All possessive forms:
# *+ = zero or more, possessive
# ++ = one or more, possessive
# ?+ = zero or one, possessive
# {n,m}+ = bounded, possessive
When to Use Possessive/Atomic
# Use when you KNOW the matched portion should never be given back
# Email local part (before @)
grep -oP '[a-zA-Z0-9._%+-]++(?=@)' emails.txt
# Path components (won't backtrack on slashes)
grep -oP '/[^/]++'
# Quoted strings (once you see closing quote, done)
grep -oP '"[^"]*+"'
# Number sequences (all digits belong together)
grep -oP '\d++'
Performance Comparison
# Create test file with potential ReDoS input
echo "aaaaaaaaaaaaaaaaaaaaaaaaaaaaab" > /tmp/redos.txt
echo "aaaaaaaaaaaaaaaaaaaaaaaaaaaac" >> /tmp/redos.txt
# Slow (exponential backtracking on non-match)
time grep -P '(a+)+b' /tmp/redos.txt 2>/dev/null
# Fast (atomic group prevents backtracking)
time grep -P '(?>a+)+b' /tmp/redos.txt
# Also fast (possessive quantifier)
time grep -P 'a++b' /tmp/redos.txt
Section 2: Conditional Patterns
Basic Conditional: (?(condition)yes|no)
Match different patterns based on conditions.
# Syntax: (?(condition)yes-pattern|no-pattern)
# The no-pattern is optional
# Condition types:
# (1), (2)... - Did group N match?
# (<name>) - Did named group match?
# (R) - Are we in recursion?
# (?=...) - Lookahead condition
Conditional on Group Match
# Match phone with or without area code parentheses
# If opening paren exists, require closing paren
echo "(555) 123-4567" | grep -oP '(\()?\d{3}(?(1)\)\s?|-)\d{3}-\d{4}'
# Output: (555) 123-4567
echo "555-123-4567" | grep -oP '(\()?\d{3}(?(1)\)\s?|-)\d{3}-\d{4}'
# Output: 555-123-4567
# Breakdown:
# (\()? - Group 1: optional opening paren
# \d{3} - Three digits
# (?(1)\)\s?|-) - If group 1 matched, expect ) and an optional space, else expect -
# \d{3}-\d{4} - Rest of number
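Python's built-in `re` also supports the group-test form `(?(1)yes|no)`, so the same idea can be checked without grep. This is a minimal sketch: the `PHONE` name and the sample inputs are assumptions for illustration, and the yes-branch allows one optional space after the closing parenthesis.

```python
import re

# Group 1 is the optional "(". If it took part in the match, require ")" plus an
# optional space; otherwise require "-" between area code and number.
PHONE = re.compile(r"(\()?\d{3}(?(1)\)\s?|-)\d{3}-\d{4}")

samples = ["(555) 123-4567", "555-123-4567", "(555-123-4567", "555) 123-4567"]
for candidate in samples:
    verdict = "ok" if PHONE.fullmatch(candidate) else "rejected"
    print(f"{candidate!r}: {verdict}")
```

The last two samples are rejected because the parentheses are unpaired, which is exactly what the conditional is there to enforce.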
Conditional on Named Group
# Match balanced quotes (single or double)
echo '"hello"' | grep -oP "(?<q>['\"]).*?(?(q)\\k<q>)"
# The conditional is redundant here: group q is mandatory, so (?(q)...) is always true
# Simpler: a backreference, with a negative lookahead to stop at the matching quote
echo '"hello"' | grep -oP '(["\x27])(?:(?!\1).)*\1'
# Output: "hello"
echo "'world'" | grep -oP '(["\x27])(?:(?!\1).)*\1'
# Output: 'world'
Conditional with Lookahead
# Match number with optional decimal, require decimal point before decimals
echo "123.45" | grep -oP '\d+(?(?=\.)\.\d+)'
# Output: 123.45
echo "123" | grep -oP '\d+(?(?=\.)\.\d+)'
# Output: 123
# Breakdown:
# \d+ - One or more digits
# (?(?=\.) - IF lookahead sees a dot...
# \.\d+ - ...THEN match dot and decimals
# ) - (no ELSE clause - just skip)
Practical Conditional Examples
# Match URL with optional protocol
# If protocol exists, require ://
echo "https://example.com" | grep -oP '(https?)?(?(1)://)[\w.-]+\.\w+'
# (?(1)://) is a conditional with no else branch: :// is required only when group 1 matched
# Alternative approach using alternation
echo -e "https://example.com\nexample.com" | grep -oP '(?:https?://)?[\w.-]+\.\w+'
# Match version with optional v prefix
# If v exists, must be lowercase
echo "v1.2.3" | grep -oP '(v)?\d+\.\d+\.\d+'
echo "1.2.3" | grep -oP '(v)?\d+\.\d+\.\d+'
Section 3: Recursive Patterns
What is Recursion in Regex?
Recursive patterns match nested structures like parentheses, HTML tags, or JSON.
# Standard regex CANNOT match arbitrary nesting:
# ((a)(b(c))) - how many levels? Unknown.
# PCRE recursive patterns CAN:
# (?R) or (?0) - recurse entire pattern
# (?1), (?2) - recurse specific group
Basic Recursion: (?R)
# Match nested parentheses
echo "(a(b(c)d)e)" | grep -oP '\((?:[^()]+|(?R))*\)'
# Output: (a(b(c)d)e)
# Breakdown:
# \( - Opening paren
# (?: - Non-capturing group:
# [^()]+ - Non-paren characters OR
# |(?R) - Recurse the ENTIRE pattern
# )* - Zero or more times
# \) - Closing paren
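It can help to read the recursive pattern as an ordinary recursive function. The sketch below mirrors `\((?:[^()]+|(?R))*\)` step by step in Python, whose built-in `re` has no recursion; `match_group` is a hypothetical helper for illustration, not how PCRE is implemented.

```python
def match_group(text: str, pos: int = 0) -> int:
    """Return the index just past a balanced (...) starting at pos, or -1."""
    if pos >= len(text) or text[pos] != "(":
        return -1  # the pattern must start with \(
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == ")":
            return i + 1  # closing \) found: this level is complete
        if ch == "(":
            end = match_group(text, i)  # the (?R) step: match a nested group
            if end == -1:
                return -1
            i = end
        else:
            i += 1  # the [^()]+ step: skip a non-paren character
    return -1  # ran out of input before the closing paren


print(match_group("(a(b(c)d)e)"))
print(match_group("(a(b)"))
```

Each nested opening parenthesis starts a fresh call, just as each `(?R)` starts a fresh attempt at the whole pattern.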
Recursion into Specific Group: (?1)
# More controlled recursion - recurse group 1 only
echo "(a(b(c)d)e)" | grep -oP '(\((?:[^()]+|(?1))*\))'
# Output: (a(b(c)d)e)
# Match nested braces
echo "{a{b{c}d}e}" | grep -oP '(\{(?:[^{}]+|(?1))*\})'
# Output: {a{b{c}d}e}
Matching Nested HTML Tags
# Simple nested div matching
echo "<div><div>inner</div></div>" | grep -oP '<div>(?:[^<]+|<div>(?:[^<]+|(?R))*</div>)*</div>'
# More general (but still limited):
# Match <tag>...</tag> with same tag nesting
pattern='<(\w+)>(?:[^<]+|<\1>(?R)*</\1>)*</\1>'
NOTE: For real HTML/XML, use a proper parser. Regex for nested markup is fragile.
Recursion with Named Groups
# Named group recursion
echo "(a(b)c)" | grep -oP '(?<paren>\((?:[^()]+|(?&paren))*\))'
# Output: (a(b)c)
# (?&name) recurses into named group
Practical Recursion Examples
# Match JSON arrays (simplified)
json='[1,[2,3],[4,[5,6]]]'
echo "$json" | grep -oP '\[(?:[^\[\]]+|(?R))*\]'
# Output: [1,[2,3],[4,[5,6]]]
# Match S-expressions (Lisp-style)
sexp='(define (square x) (* x x))'
echo "$sexp" | grep -oP '\((?:[^()]+|(?R))*\)'
# Output: (define (square x) (* x x))
# Match function calls with nested calls
code='func(a, other(b, c), d)'
echo "$code" | grep -oP '\w+\((?:[^()]+|(?R))*\)'
# Output: func(a, other(b, c), d)
# (grep -o prints non-overlapping matches, so the inner other(b, c) is not listed separately)
Section 4: Subroutine Calls
Define Once, Use Multiple Times
# Define IP octet pattern once, use 4 times
# (?(DEFINE)...) defines patterns without matching
# GNU grep -P rejects patterns that contain newlines, so the multi-line (?x) form runs through perl
perl -ne 'print "$&\n" while /(?x)
(?(DEFINE)
(?<octet>25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
)
(?&octet)\.(?&octet)\.(?&octet)\.(?&octet)
/g' <<< "192.168.1.100"
# Output: 192.168.1.100
Complex Pattern Definition
# Define reusable sub-patterns
# (run through perl because GNU grep -P rejects patterns that contain newlines)
perl -ne 'print "$&\n" while /(?x)
(?(DEFINE)
(?<year>\d{4})
(?<month>0[1-9]|1[0-2])
(?<day>0[1-9]|[12][0-9]|3[01])
(?<hour>[01][0-9]|2[0-3])
(?<minute>[0-5][0-9])
(?<second>[0-5][0-9])
)
(?&year)-(?&month)-(?&day)T(?&hour):(?&minute):(?&second)
/g' <<< "2026-03-26T14:30:45"
# Output: 2026-03-26T14:30:45
Benefits of DEFINE
- Readability - Name your sub-patterns
- DRY - Don't Repeat Yourself
- Maintainability - Change once, update everywhere
- Testing - Test sub-patterns in isolation
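Engines without `(?(DEFINE)...)`, such as Python's built-in `re`, can get most of the same benefits by composing the pattern from named string fragments. A minimal sketch, reusing the octet pattern from above; `OCTET` and `IPV4` are names chosen for this example.

```python
import re

# Define the octet once as a plain string fragment...
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# ...then reuse it four times when building the full pattern.
IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")

for candidate in ["192.168.1.100", "10.0.0.1", "256.1.1.1", "1.2.3"]:
    print(candidate, bool(IPV4.fullmatch(candidate)))
```

The fragment can be unit-tested on its own, which covers the "test sub-patterns in isolation" point without any engine support.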
Section 5: Branch Reset Groups
Problem: Group Numbers Shift with Alternation
# Normal alternation - groups numbered sequentially
echo "hello world" | perl -pe 's/(hello)|(world)/[$1$2]/g'
# Group 1 or Group 2 matches, never both
# This makes backreferences awkward
Solution: Branch Reset (?|…)
# Branch reset: alternatives share group numbers
echo "hello world" | perl -pe 's/(?|(hello)|(world))/[$1]/g'
# Output: [hello] [world]
# Both "hello" and "world" are Group 1
Practical Branch Reset
# Match different date formats, normalize to group 1-3
pattern='(?|(\d{4})-(\d{2})-(\d{2})|(\d{2})/(\d{2})/(\d{4}))'
# Both formats fill groups 1, 2, 3 (differently ordered though)
# This is powerful for parsing multiple input formats
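Python's built-in `re` has no branch reset, and the pattern above leaves the fields in a different order per format anyway. One workaround, sketched here under the assumption that only these two formats occur, is to give each branch its own named groups and merge them afterwards; `DATE` and `to_iso` are illustrative names.

```python
import re

# Each branch gets its own group names; exactly one branch will have values.
DATE = re.compile(
    r"(?P<y1>\d{4})-(?P<m1>\d{2})-(?P<d1>\d{2})"
    r"|(?P<m2>\d{2})/(?P<d2>\d{2})/(?P<y2>\d{4})"
)


def to_iso(text: str):
    """Normalize YYYY-MM-DD or MM/DD/YYYY to YYYY-MM-DD, or return None."""
    match = DATE.fullmatch(text)
    if match is None:
        return None
    groups = match.groupdict()
    # Coalesce: take whichever branch actually captured each field.
    year = groups["y1"] or groups["y2"]
    month = groups["m1"] or groups["m2"]
    day = groups["d1"] or groups["d2"]
    return f"{year}-{month}-{day}"


print(to_iso("2026-03-26"))
print(to_iso("03/26/2026"))
```

Naming the groups per field also removes the ordering problem, since year, month, and day are looked up by name rather than by position.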
Section 6: Script Runs (Unicode)
Match Within Same Script
# Perl/PCRE2 feature: ensure text is all same Unicode script
# Detects mixed-script attacks (homoglyph attacks)
# Match only ASCII text
grep -P '^[\p{ASCII}]+$' file.txt
# Match only Cyrillic
grep -P '^\p{Cyrillic}+$' file.txt
# Match only Latin script
grep -P '^\p{Latin}+$' file.txt
Security Application
# Detect potential homoglyph attacks in domains
# Cyrillic 'а' (U+0430) looks like Latin 'a'
# Find domains that mix Latin and Cyrillic letters (suspicious)
grep -P '(?=.*\p{Latin})(?=.*\p{Cyrillic})' domains.txt | while IFS= read -r domain; do
echo "Potential homoglyph: $domain"
done
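The same check can be approximated in Python with only the standard library. This sketch uses the first word of each character's Unicode name ("LATIN", "CYRILLIC", and so on) as a stand-in for its script; that is a heuristic assumption, not the real Unicode Script property, and `scripts_in` and `looks_mixed` are hypothetical helper names.

```python
import unicodedata


def scripts_in(text: str) -> set:
    """Collect the script-like name prefixes of all letters in text."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])  # e.g. "LATIN", "CYRILLIC"
    return scripts


def looks_mixed(domain: str) -> bool:
    """Flag domains whose letters come from more than one script."""
    return len(scripts_in(domain)) > 1


print(looks_mixed("paypal.com"))
print(looks_mixed("p\u0430ypal.com"))  # second letter is Cyrillic U+0430
```

Digits, dots, and hyphens are skipped, so an ordinary ASCII domain is never flagged.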
Section 7: Optimization Techniques
Anchoring
# Anchored patterns are faster - engine knows where to start
# Slow (checks every position)
grep -P 'ERROR:.*failed' huge.log
# Faster (starts from line beginning only)
grep -P '^ERROR:.*failed' huge.log
# Even faster with fixed prefix
grep -F 'ERROR:' huge.log | grep -P 'failed$'
Avoid Catastrophic Patterns
# DANGEROUS patterns (exponential backtracking):
# (a+)+
# (a*)*
# (a|aa)+
# (.*a){10}
# SAFE alternatives:
# a+
# a*
# a+
# (?>[^a]*a){10} # Atomic groups prevent backtracking
Use Possessive When Possible
# Convert greedy to possessive when backtracking won't help
# Instead of:
grep -oP '"[^"]*"' file.txt
# Use (when you know quote won't be escaped):
grep -oP '"[^"]*+"' file.txt
Specific Character Classes
# More specific = faster
# Slow (matches anything)
grep -P '.*@.*\..*' emails.txt
# Faster (specific character class)
grep -P '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' emails.txt
# Even better - use negated classes
grep -P '[^@]+@[^.]+\.[^.]+'
Unroll Loops
# Instead of:
grep -P '(?:abc|def)+' file.txt
# Unroll for common patterns:
grep -P '(?:abc|def)(?:abc|def)*' file.txt
# Or use possessive:
grep -P '(?:abc|def)++' file.txt
Real-World Applications
Professional: ReDoS Prevention Audit
#!/bin/bash
# Scan code for potentially dangerous regex patterns
find . -name "*.py" -o -name "*.js" -o -name "*.rb" | while read f; do
# Find nested quantifiers
if grep -qP '\([^)]*[*+]\)[*+]' "$f"; then
echo "DANGER: Nested quantifiers in $f"
grep -nP '\([^)]*[*+]\)[*+]' "$f"
fi
# Find patterns with multiple .*
if grep -qP '\.\*.*\.\*' "$f"; then
echo "WARNING: Multiple .* in $f"
grep -nP '\.\*.*\.\*' "$f"
fi
done
Professional: Complex Log Parser
#!/bin/bash
# Parse logs with optional fields using conditionals
# Log format: [LEVEL] [TIMESTAMP] (optional PID) Message
log_pattern='(?x)
^\[(?<level>\w+)\] # Required: level
\s+\[(?<timestamp>[\d:-]+)\] # Required: timestamp
(?:\s+\((?<pid>\d+)\))? # Optional: (pid)
\s+(?<message>.+)$ # Required: message
'
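# GNU grep -P rejects patterns that contain newlines, so hand the pattern to perl through the environment
LOG_PATTERN="$log_pattern" perl -ne 'print if /$ENV{LOG_PATTERN}/' /var/log/app.log | head -10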
Professional: Config Validator with Recursion
#!/bin/bash
# Validate nested JSON-like structures
validate_nesting() {
local content=$1
# Group 1 is one balanced {...} or [...] block; (?1) recurses into it.
# (?R) would re-apply the ^ and $ anchors at every level and never match nested input.
# grep works line by line, so the content must sit on a single line.
local balanced='^(\{(?:[^{}\[\]]++|(?1))*\}|\[(?:[^{}\[\]]++|(?1))*\])$'
if ! printf '%s\n' "$content" | grep -qP "$balanced"; then
echo "ERROR: Unbalanced braces or brackets"
return 1
fi
echo "OK: Nesting is valid"
return 0
}
Personal: Extract Nested Markdown Lists
#!/bin/bash
# Parse nested bullet lists from markdown
# Match list items with their nesting level
grep -hP '^\s*[-*]\s+' ~/notes/*.md | while IFS= read -r line; do
# Count leading spaces to determine nesting
spaces=$(echo "$line" | grep -oP '^\s*' | wc -c)
level=$((spaces / 2))
content=$(echo "$line" | grep -oP '[-*]\s+\K.*')
printf "%s%s\n" "$(printf ' %.0s' $(seq 1 $level))" "$content"
done
Personal: Smart Date Parser
#!/bin/bash
# Parse multiple date formats into ISO format
parse_date() {
local input=$1
# ISO format: 2026-03-26
if echo "$input" | grep -qP '^\d{4}-\d{2}-\d{2}$'; then
echo "$input"
return
fi
# US format: 03/26/2026
if echo "$input" | grep -qP '^\d{2}/\d{2}/\d{4}$'; then
echo "$input" | sed -E 's|([0-9]{2})/([0-9]{2})/([0-9]{4})|\3-\1-\2|'
return
fi
# Written: March 26, 2026
if match=$(echo "$input" | grep -oP '(?<month>\w+)\s+(?<day>\d+),\s+(?<year>\d{4})'); then
# Would need month name lookup...
echo "Complex: $input"
return
fi
echo "Unknown format: $input"
}
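The written-month case is where the shell version stops. As a sketch of one way to finish it, the Python below swaps the regex approach for `datetime.strptime`, which already knows month names. It assumes the default English ("C") locale for `%B`; `FORMATS` and `parse_date` are names chosen for this example.

```python
from datetime import datetime

# Formats tried in order: ISO, US, and written month ("March 26, 2026").
FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")


def parse_date(text: str):
    """Return the date as YYYY-MM-DD, or None if no known format fits."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # this format did not fit; try the next one
    return None


for sample in ["2026-03-26", "03/26/2026", "March 26, 2026", "yesterday"]:
    print(sample, "->", parse_date(sample))
```

Unlike the regex checks, this also rejects impossible dates such as month 13, because `strptime` validates the values and not just their shape.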
Professional: ISE RADIUS Session Analyzer
#!/bin/bash
# Parse ISE RADIUS sessions with nested authentication details
# Uses recursion for nested JSON responses
# ISE MNT API returns nested session data
# Extract nested attribute groups with recursion
parse_ise_session() {
local session_json=$1
# Extract nested attribute-value pairs using recursion
# Pattern matches {...} including nested braces
echo "$session_json" | grep -oP '\{(?:[^{}]+|(?R))*\}'
}
# Conditional pattern for auth status
# If status is PASSED, expect posture; else expect failure reason
ise_auth_pattern='(?x)
"authentication_status":\s*"(?<status>(?<passed>PASSED)|FAILED)"
.*?
(?(<passed>)
"posture_status":\s*"(?<posture>[^"]+)"
|
"failure_reason":\s*"(?<reason>[^"]+)"
)
'
# Extract sessions with optional VLAN assignment
grep -oP '(?x)
"mac_address":\s*"(?<mac>[0-9A-Fa-f:]{17})"
.*?
"nas_ip":\s*"(?<nas>\d+\.\d+\.\d+\.\d+)"
(?:.*?"assigned_vlan":\s*"(?<vlan>\d+)")?
' /tmp/ise-sessions.json
Professional: Cisco ACL Validator
#!/bin/bash
# Validate Cisco ACL syntax, including object-group references
# Uses alternation to accept either an address or an object-group on each side
# ACL line with optional object-groups
acl_pattern='(?x)
^(?<action>permit|deny)\s+
(?<protocol>ip|tcp|udp|icmp)\s+
(?:
(?<src_type>host|any|\d+\.\d+\.\d+\.\d+)
|
object-group\s+(?<src_group>\S+)
)\s+
(?:
(?<dst_type>host|any|\d+\.\d+\.\d+\.\d+)
|
object-group\s+(?<dst_group>\S+)
)
(?:\s+(?<ports>eq|range|gt|lt)\s+[\d\s-]+)?
'
validate_acl() {
local acl_file=$1
local errors=0
while IFS= read -r line; do
# Skip comments and empty lines
[[ "$line" =~ ^[[:space:]]*[!#] ]] && continue
[[ -z "$line" ]] && continue
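# The pattern spans several lines, which GNU grep -P rejects, so match with perl
if ! ACL_PATTERN="$acl_pattern" perl -ne 'exit 1 unless /$ENV{ACL_PATTERN}/' <<< "$line"; then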
echo "INVALID: $line"
((errors++))
fi
done < "$acl_file"
echo "Validation complete: $errors errors found"
return $errors
}
Professional: Network Syslog Parser with Conditionals
#!/bin/bash
# Parse Cisco syslog with conditional field extraction
# Different formats based on facility code
# If %SEC, expect security fields; if %LINK, expect an interface; else keep the message
syslog_pattern='(?x)
^(?<timestamp>\w+\s+\d+\s+[\d:]+)\s+
(?<host>\S+)\s+
%(?<facility>(?<sec>SEC)|(?<link>LINK)|\w+)-(?<severity>\d)-(?<mnemonic>\w+):\s+
(?(<sec>)
# SEC facility: expect IP addresses
.*?(?<src_ip>\d+\.\d+\.\d+\.\d+).*?(?<dst_ip>\d+\.\d+\.\d+\.\d+)
|
(?(<link>)
# LINK facility: expect interface
.*?(?<interface>(?:Gi|Fa|Te)\S+)
|
# Default: capture rest as message
(?<message>.+)
)
)
'
# Parse with possessive quantifiers for efficiency (no backtracking on long messages)
grep -oP '(?x)
^\w+\s++\d+\s++[\d:]++\s++ # Timestamp (possessive)
\S++\s++ # Hostname
%\w++-\d+-\w++:\s++ # Facility-Severity-Mnemonic
(?:
(?:denied|permitted)\s++ # ACL action
(?<proto>tcp|udp|icmp)\s++ # Protocol
(?<src>\d++\.\d++\.\d++\.\d++) # Source IP (possessive)
)?
.*$
' /var/log/syslog
Personal: Finance Transaction Categorizer
#!/bin/bash
# Categorize bank transactions using DEFINE for reusable patterns
# Demonstrates DRY regex with named subroutines
# Define reusable patterns for financial parsing
finance_pattern='(?x)
(?(DEFINE)
(?<amount>\$?\d{1,3}(?:,\d{3})*(?:\.\d{2})?)
(?<date>\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2})
(?<merchant>[A-Za-z0-9\s&\x27-]+?)
(?<category>grocery|dining|gas|utility|transfer|income)
)
^(?&date)\s+
(?<desc>(?&merchant))\s+
(?<amt>-?(?&amount))
'
categorize_transaction() {
local desc=$1
local amount=$2
# Negative amounts are expenses, categorized by merchant keyword (first match wins)
# Positive amounts are treated as income
if [[ "$amount" =~ ^- ]]; then
if grep -qiP 'walmart|target|costco' <<< "$desc"; then category="grocery"
elif grep -qiP 'starbucks|mcdonald|restaurant' <<< "$desc"; then category="dining"
elif grep -qiP 'shell|chevron|exxon' <<< "$desc"; then category="gas"
elif grep -qiP 'electric|water|internet' <<< "$desc"; then category="utility"
else category="uncategorized"
fi
else
category="income"
fi
echo "$category"
}
# Parse CSV fields, allowing quoted fields that contain commas (no recursion needed)
# Matches: "field, with comma",normal field,"another, field"
csv_pattern='(?:^|,)(?:"([^"]*+)"|([^,]++))'
# Extract transactions with balanced description parentheses
# e.g., "AMAZON (Prime Membership (Annual))"
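# (?1) recurses only the parenthesized group; (?R) would restart the whole pattern, date included
grep -oP '\d{2}/\d{2}/\d{4}\s+[A-Z]+\s+(?:[^()]+|(\((?:[^()]+|(?1))*\)))+\s+-?\$[\d,.]+' \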
~/finances/transactions.csv
Personal: Gopass Entry Validator
#!/bin/bash
# Validate gopass entry structure with nested YAML
# Uses recursion for nested key-value structures
validate_gopass_entry() {
local entry=$1
# First line must be password (non-empty)
if ! echo "$entry" | head -1 | grep -qP '^.+$'; then
echo "ERROR: First line must be password"
return 1
fi
# Remaining lines: YAML key: value with optional nesting
# Uses possessive quantifiers for efficient parsing
yaml_pattern='(?x)
^(?<indent>\s*+) # Possessive indent capture
(?<key>[a-z_]++): # Possessive key
(?:
\s++(?<value>.++) # Inline value (possessive)
|
\n(?:\1\s++.++\n)*+ # Nested block (possessive)
)
'
if echo "$entry" | tail -n +2 | grep -qvP '^[a-z_]+:\s*.+$|^\s+[a-z_]+:'; then
echo "WARNING: Non-standard YAML detected"
fi
}
# Atomic group for matching gopass paths (no backtracking on store names)
gopass_path_pattern='(?>(v3|personal|work))/(?>[a-z_]+/)*+[a-z_]+$'
gopass list | grep -P "$gopass_path_pattern"
Tool Variants
PCRE (grep -P)
# Full PCRE support
grep -oP '(?>a+)b' # Atomic groups
grep -oP 'a++b' # Possessive quantifiers
grep -oP '(?R)' # Recursion
grep -oP '(?(1)yes|no)' # Conditionals
grep -oP '(?|a|b)' # Branch reset (limited)
Python: regex module
# Standard re module has LIMITED advanced features
# (group conditionals work; atomic groups and possessive quantifiers need Python 3.11+)
import re
# For full PCRE-like support, use regex module
import regex
# Atomic groups
pattern = regex.compile(r'(?>a+)b')
# Possessive quantifiers
pattern = regex.compile(r'a++b')
# Recursion
pattern = regex.compile(r'\((?:[^()]+|(?R))*\)')
# Branch reset
pattern = regex.compile(r'(?|(\w+)|(\d+))')
# DEFINE
pattern = regex.compile(r'''(?x)
(?(DEFINE)(?<octet>\d{1,3}))
(?&octet)\.(?&octet)\.(?&octet)\.(?&octet)
''')
Perl
# Perl's own regex engine supports all of these natively (PCRE is modeled on it)
my $text = "(a(b(c)d)e)";
# Recursion
if ($text =~ /(\((?:[^()]+|(?1))*\))/) {
print "Matched: $1\n";
}
# Conditionals
$text =~ s/(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}/PHONE/g;
# Branch reset
$text =~ s/(?|(foo)|(bar))/$1/g;
# DEFINE
my $ip_pattern = qr{
(?(DEFINE)
(?<octet>25[0-5]|2[0-4]\d|[01]?\d\d?)
)
(?&octet)\.(?&octet)\.(?&octet)\.(?&octet)
}x;
JavaScript (Limited)
// JavaScript lacks most advanced features
// No atomic groups, recursion, conditionals, or branch reset
// Alternative: use multiple simpler patterns
function matchNestedParens(str) {
let depth = 0;
let start = -1;
let results = [];
for (let i = 0; i < str.length; i++) {
if (str[i] === '(') {
if (depth === 0) start = i;
depth++;
} else if (str[i] === ')') {
depth--;
if (depth === 0 && start >= 0) {
results.push(str.substring(start, i + 1));
}
}
}
return results;
}
vim (Limited)
" Vim regex engine doesn't support PCRE advanced features
" Use external commands for complex patterns:
" Filter through perl for recursion
:%!perl -pe 's/\((?:[^()]+|(?R))*\)/<MATCHED>/g'
" Or grep -P for possessive
:%!grep -oP 'a++'
Gotchas
Possessive vs Atomic
# They're equivalent, but syntax differs
# Possessive: modifier AFTER quantifier
a++ # One or more a, possessive
a*+ # Zero or more a, possessive
a?+ # Zero or one a, possessive
# Atomic: wrapper AROUND expression
(?>a+) # Same as a++
(?>a*) # Same as a*+
(?>a?) # Same as a?+
# Atomic can wrap complex expressions
(?>foo|bar|baz)+ # Can't do this with possessive
Recursion Depth Limits
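# Very deep nesting can hit the engine's resource limits
# PCRE2 caps matching with build-time match, depth, and heap limits;
# the often-quoted 250 is its default limit on parenthesis nesting inside the pattern itself
# Very deeply nested input may still fail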
# For safety, add depth limits in code:
validate_nesting() {
local str=$1
local max_depth=${2:-50}
local depth=0
for ((i=0; i<${#str}; i++)); do
char="${str:$i:1}"
if [[ "$char" == "(" ]]; then
((depth++))
if ((depth > max_depth)); then
echo "ERROR: Max depth exceeded"
return 1
fi
elif [[ "$char" == ")" ]]; then
((depth--))
fi
done
}
DEFINE Block Scope
# DEFINE patterns only exist within the regex
# They don't match any text themselves
grep -oP '(?(DEFINE)(?<foo>bar))(?&foo)' <<< "bar"
# Output: bar (matched by (?&foo), not by DEFINE)
# The DEFINE part matches zero characters
Branch Reset Numbering
# Groups WITHIN branch reset share numbers
# Groups OUTSIDE continue from highest
# (?|(a)(b)|(c)(d))(e)
# In first branch: $1=a, $2=b
# In second branch: $1=c, $2=d
# After branch reset: $3=e
echo "abe" | perl -pe 's/(?|(a)(b)|(c)(d))(e)/$1-$2-$3/'
# Output: a-b-e
Conditional Group Numbering
# (?(N)...) checks if group N MATCHED, not if it EXISTS
# This is WRONG thinking:
# "If group 1 exists in the pattern, use yes-pattern"
# This is CORRECT:
# "If group 1 successfully matched something, use yes-pattern"
# Example: Optional prefix
echo "FOO:bar" | grep -oP '(FOO:)?(?<=(?(1)FOO:|))(\w+)'
# Group 1 = "FOO:" (matched)
# Lookbehind expects "FOO:" because group 1 matched
echo "bar" | grep -oP '(FOO:)?(?<=(?(1)FOO:|))(\w+)'
# Group 1 = empty (did not match)
# Lookbehind expects empty because group 1 didn't match
Recursion vs Subroutine Calls
# (?R) - recurses ENTIRE pattern (anchors included!)
# (?N) - recurses JUST group N (more flexible)
# DANGEROUS: (?R) with anchors
echo "nested" | grep -oP '^(\w+(?R)?\w+)$' # May not work as expected
# SAFER: Use group recursion
echo "(a(b)c)" | grep -oP '(\((?:[^()]+|(?1))*\))' # Recurses just group 1
# Here group 1 wraps the whole pattern, so (?1) behaves like (?R)
# For clarity, use named groups:
echo "(a(b)c)" | grep -oP '(?<paren>\((?:[^()]+|(?&paren))*\))'
Possessive vs Atomic Performance
# Both PREVENT backtracking, but timing differs
# Possessive: checks DURING matching
a++b # Never stores backtrack positions for a's
# Atomic: checks AFTER matching
(?>a+)b # Matches all a's, THEN locks them in
# Practical difference: none for simple cases
# But atomic can wrap complex alternations:
(?>pattern1|pattern2|pattern3)+ # Lock after each alternation match
# No possessive equivalent for this
Recursion Stack Overflow
# Deep recursion can crash!
# Generate deep nesting
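python3 -c "print('(' * 1000 + ')' * 1000)" > /tmp/deep.txt
# This may SEGFAULT or error:
grep -oP '\((?:[^()]+|(?R))*\)' /tmp/deep.txt
# Possible error: grep reports that a PCRE backtracking or stack limit was exceeded
# Solutions:
# 1. Raise the limits where the tool allows it
# (GNU grep has no option for this; pcre2grep offers --match-limit and --depth-limit)
pcre2grep --match-limit=10000000 -o '...'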
# 2. Use iterative approach instead of recursion
# 3. Pre-validate nesting depth before regex
Tool Feature Matrix
| Feature | grep -P | perl | Python re | Python regex |
|---|---|---|---|---|
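| Atomic | ✓ | ✓ | ✓ (3.11+) | ✓ |
| Possessive | ✓ | ✓ | ✓ (3.11+) | ✓ |
| Recursion | ✓ | ✓ | ✗ | ✓ |
| Conditional | ✓ | ✓ | ✓ (group test only) | ✓ |
| DEFINE | ✓ | ✓ | ✗ | ✓ |
| Branch Reset | ✓ | ✓ | ✗ | ✓ |
NOTE: Python's built-in re module lacks recursion, DEFINE, and branch reset, and only gained atomic groups and possessive quantifiers in Python 3.11. Install the regex module for PCRE-like support.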
Key Takeaways
| Technique | Purpose |
|---|---|
| (?>...) | Atomic group - prevent backtracking into group |
| a++, a*+, a?+ | Possessive quantifiers - never backtrack |
| (?(cond)yes\|no) | Conditional - match based on condition |
| (?R), (?1) | Recursion - match nested structures |
| (?(DEFINE)...) | Define reusable sub-patterns |
| (?\|...) | Branch reset - share group numbers |
| Performance | Anchor, be specific, use possessive |
Decision Tree: When to Use What
Need to prevent backtracking?
├── Yes → Use atomic groups or possessive quantifiers
│   ├── Simple expression? → Possessive (a++)
│   └── Complex expression? → Atomic ((?>...))
└── No → Continue
Need conditional matching?
├── Yes → Use (?(cond)yes|no)
│   ├── Condition on group match? → (?(1)...)
│   └── Condition on lookahead? → (?(?=...)...)
└── No → Continue
Need to match nesting?
├── Yes → Use recursion
│   ├── Entire pattern? → (?R)
│   └── Specific group? → (?1) or (?&name)
└── No → Continue
Need same pattern multiple times?
├── Yes → Use DEFINE
│   └── Define once, call with (?&name)
└── No → Continue
Need same group number in alternatives?
├── Yes → Use branch reset (?|...)
└── No → Standard alternation
Self-Test: Master Level
1. What's the difference between (a+)b and (?>a+)b when matching "aaac"?
2. Write a pattern to match balanced parentheses to arbitrary depth.
3. What does (?(1)yes|no) do?
4. When would you use (?(DEFINE)…)?
5. Why are possessive quantifiers faster?
Answers
1. (a+)b backtracks through shorter runs of a's (slower, still fails). (?>a+)b fails at once for each starting position (no backtracking into the group).
2. \((?:[^()]+|(?R))*\) - recurses on nested parentheses
3. If group 1 matched, use "yes" pattern; otherwise use "no" pattern
4. To define reusable sub-patterns without matching text (DRY principle)
5. They never backtrack - once matched, the engine never tries alternatives
Boss Level Challenge
Create a regex that:
1. Matches valid JSON arrays with nested arrays
2. Captures the depth of deepest nesting
3. Fails fast on unbalanced brackets
Solution
# Matching nested JSON arrays with depth tracking
# Note: This is a simplified validator
json_array='(?x)
(?<array>
\[ # Opening bracket
(?:
\s*
(?:
"(?:[^"\\]|\\.)*" # String
| -?\d+(?:\.\d+)? # Number
| true|false|null # Literals
| (?&array) # Nested array (recursion)
)
\s*
(?:,\s*)? # Optional comma
)*
\] # Closing bracket
)
'
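# Test (the pattern spans several lines, which GNU grep -P rejects, so match with perl)
echo '[1, [2, [3, 4]], 5]' | JSON_ARRAY="$json_array" perl -ne 'print "$&\n" while /$ENV{JSON_ARRAY}/g'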
# For depth counting, need programmatic approach:
count_depth() {
local str=$1 max=0 current=0
for ((i=0; i<${#str}; i++)); do
case "${str:$i:1}" in
'[') ((current++)); ((current > max)) && max=$current ;;
']') ((current--)) ;;
esac
done
echo "Max depth: $max"
}
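The shell counter above reports a depth even when the brackets do not balance. As a sketch of the "fail fast" requirement, the Python function below tracks depth in a single pass, stops at the first stray closing bracket, and skips brackets inside double-quoted strings. `max_bracket_depth` is a hypothetical helper, and it checks bracket structure only, not full JSON validity.

```python
def max_bracket_depth(text: str) -> int:
    """Return the deepest [ ] nesting level; raise ValueError if unbalanced."""
    depth = deepest = 0
    in_string = escaped = False
    for pos, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False       # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False     # closing quote ends the string
        elif ch == '"':
            in_string = True
        elif ch == "[":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == "]":
            depth -= 1
            if depth < 0:             # fail fast on a stray closing bracket
                raise ValueError(f"unexpected ']' at offset {pos}")
    if depth != 0:
        raise ValueError(f"{depth} unclosed '['")
    return deepest


print(max_bracket_depth("[1, [2, [3, 4]], 5]"))
```

Pairing this with the recursive pattern gives both halves of the challenge: the regex confirms the overall shape, and the counter supplies the depth that a regex cannot capture.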
Congratulations!
You’ve completed the Regex Mastery curriculum. You now understand:
- Character classes, quantifiers, anchors
- Groups, backreferences, alternation
- Lookahead and lookbehind assertions
- Infrastructure-specific patterns
- Atomic groups and possessive quantifiers
- Conditionals and recursion
- Performance optimization
You are now a regex expert.
Where to Go From Here
- Practice daily - Use regex for log analysis, data extraction, validation
- Study ReDoS - Understand and prevent regex denial of service
- Learn your tools - Master grep -P, sed, awk, Python regex module
- Build a pattern library - Document your most useful patterns
- Teach others - Best way to solidify knowledge
Return to Index
Regex Mastery Index - Review all drills and exercises.