Drill 10: Advanced Techniques

The final drill. These advanced techniques separate regex masters from regex users. Atomic groups prevent catastrophic backtracking, conditionals enable context-aware matching, and recursive patterns match nested structures.

Pattern Complexity Levels

Level         Techniques                              Use Case
------------  --------------------------------------  -----------------------------
Basic         ., *, +, ?, []                          Simple matching
Intermediate  Groups, anchors, lookahead              Extraction and validation
Advanced      Lookbehind, backreferences              Context-aware matching
Expert        Atomic groups, conditionals, recursion  Complex parsing, optimization

Interactive CLI Drill

bash ~/atelier/_bibliotheca/domus-captures/docs/modules/ROOT/examples/regex-drills/10-advanced.sh

Section 1: Atomic Groups and Possessive Quantifiers

The Backtracking Problem

Backtracking regex engines (PCRE, Perl, Python, JavaScript) explore one path at a time: when a path fails, they step back and try the next alternative.

# Catastrophic backtracking example
# Pattern: (a+)+b
# Input: aaaaaaaaaaaaaaaaaaaaaaaac

# The engine tries:
# (aaaaaaaaaaaaaaaaaaaaaaaaa) - fails (no b)
# (aaaaaaaaaaaaaaaaaaaaaaaa)(a) - fails
# (aaaaaaaaaaaaaaaaaaaaaaa)(aa) - fails
# ... exponential combinations!

With enough input this can hang the process, and when the input is attacker-controlled it becomes a ReDoS (Regular Expression Denial of Service) attack.
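To get a feel for the growth, the sketch below counts how many ways the nested quantifier in (a+)+ can split a run of n a's into groups. Each split is a path a backtracking engine may have to try before giving up. The helper name count_splits is made up for this illustration, and real engines apply optimizations, so treat the figures as intuition rather than a measurement.

```shell
#!/bin/bash
# Illustrative only: count_splits is a hypothetical helper, not part of any tool.
# A run of n a's can be divided into one or more non-empty groups in 2^(n-1)
# ways, which is how many groupings (a+)+ can propose for that run.
count_splits() {
    local n=$1
    echo $(( 1 << (n - 1) ))
}

for n in 5 10 20 25; do
    printf "%2d a's: %d possible groupings\n" "$n" "$(count_splits "$n")"
done
```

Each extra a doubles the count, which is why a few more characters of input can turn an instant failure into a stall.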

Atomic Groups: (?>…​)

Atomic groups prevent backtracking into the group once matched.

# PCRE atomic group syntax
echo "aaaaaac" | grep -P '(?>a+)b'
# No match: once (?>a+) has taken the a's, the engine cannot re-enter the group to retry

# The atomic group still matches when a b really follows
echo "aaaaaab" | grep -P '(?>a+)b'
# Output: aaaaaab

Possessive Quantifiers: ++, *+, ?+

Possessive quantifiers are shorthand for atomic groups.

# Possessive: a++ (same as (?>a+))
echo "aaaaaab" | grep -P 'a++b'
# Output: aaaaaab

# Possessive prevents backtracking
echo "aaaaaa" | grep -P 'a++a'
# No match (a++ takes all a's, won't give one back for the trailing a)

# All possessive forms:
# *+  = zero or more, possessive
# ++  = one or more, possessive
# ?+  = zero or one, possessive
# {n,m}+ = bounded, possessive
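A side-by-side check can make the difference between greedy and possessive concrete. This is a minimal sketch that assumes GNU grep built with PCRE support (the -P flag); pcre_test is a hypothetical helper name introduced only for this example.

```shell
#!/bin/bash
# Assumes GNU grep with -P. pcre_test is a hypothetical helper:
# it prints "match" or "no match" for a pattern and an input string.
pcre_test() {
    local pattern=$1 input=$2
    if printf '%s\n' "$input" | grep -qP -- "$pattern"; then
        echo "match"
    else
        echo "no match"
    fi
}

pcre_test 'a+a'  'aaaa'    # greedy a+ gives one a back for the final a
pcre_test 'a++a' 'aaaa'    # possessive a++ keeps every a, so the final a cannot match
pcre_test 'a++b' 'aaaab'   # nothing needs to be given back here
```

The second call is the one to watch: the possessive form fails on input the greedy form accepts, so possessive quantifiers are only safe when the following token can never need characters the quantifier consumed.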

When to Use Possessive/Atomic

# Use when you KNOW the matched portion should never be given back

# Email local part (before @)
grep -oP '[a-zA-Z0-9._%+-]++(?=@)' emails.txt

# Path components (won't backtrack on slashes)
grep -oP '/[^/]++'

# Quoted strings (once you see closing quote, done)
grep -oP '"[^"]*+"'

# Number sequences (all digits belong together)
grep -oP '\d++'

Performance Comparison

# Create test file with potential ReDoS input
echo "aaaaaaaaaaaaaaaaaaaaaaaaaaaaab" > /tmp/redos.txt
echo "aaaaaaaaaaaaaaaaaaaaaaaaaaaac" >> /tmp/redos.txt

# Slow (exponential backtracking on non-match)
time grep -P '(a+)+b' /tmp/redos.txt 2>/dev/null

# Fast (atomic group prevents backtracking)
time grep -P '(?>a+)+b' /tmp/redos.txt

# Also fast (possessive quantifier)
time grep -P 'a++b' /tmp/redos.txt

Section 2: Conditional Patterns

Basic Conditional: (?(condition)yes|no)

Match different patterns based on conditions.

# Syntax: (?(condition)yes-pattern|no-pattern)
# The no-pattern is optional

# Condition types:
# (1), (2)...    - Did group N match?
# (<name>)      - Did named group match?
# (R)           - Are we in recursion?
# (?=...)       - Lookahead condition

Conditional on Group Match

# Match phone with or without area code parentheses
# If opening paren exists, require closing paren

echo "(555) 123-4567" | grep -oP '(\()?\d{3}(?(1)\) |-)\d{3}-\d{4}'
# Output: (555) 123-4567

echo "555-123-4567" | grep -oP '(\()?\d{3}(?(1)\) |-)\d{3}-\d{4}'
# Output: 555-123-4567

# Breakdown:
# (\()?        - Group 1: optional opening paren
# \d{3}        - Three digits
# (?(1)\) |-)  - If group 1 matched, expect ") " (paren plus space), else expect -
# \d{3}-\d{4}  - Rest of number
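The conditional is easiest to appreciate on inputs that should be rejected. The sketch below anchors the pattern and runs it over two valid and two malformed numbers. It assumes GNU grep with -P and the two phone formats used in this section ("(555) 123-4567" and "555-123-4567"); check_phone is a hypothetical helper name.

```shell
#!/bin/bash
# Assumes GNU grep with -P. check_phone is a hypothetical helper name.
# The conditional (?(1)...) demands ") " only when group 1 captured "(",
# and demands "-" otherwise.
check_phone() {
    if printf '%s\n' "$1" | grep -qP '^(\()?\d{3}(?(1)\) |-)\d{3}-\d{4}$'; then
        echo "valid"
    else
        echo "invalid"
    fi
}

for number in "(555) 123-4567" "555-123-4567" "(555-123-4567" "555) 123-4567"; do
    printf '%-16s %s\n' "$number" "$(check_phone "$number")"
done
```

A plain alternation such as \)|- in place of the conditional would accept the two malformed inputs as well, which is the gap the conditional closes.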

Conditional on Named Group

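# Match balanced quotes (single or double)
echo '"hello"' | grep -oP "(?<q>['\"]).*?(?(<q>)\\k<q>)"
# Output: "hello"
# The conditional adds nothing here: group q always participates in the match,
# so the condition is always true and a plain backreference does the same job.

# Simpler: match a quoted string with a backreference to the opening quote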
echo '"hello"' | grep -oP '(["\x27])(?:(?!\1).)*\1'
# Output: "hello"

echo "'world'" | grep -oP '(["\x27])(?:(?!\1).)*\1'
# Output: 'world'

Conditional with Lookahead

# Match number with optional decimal, require decimal point before decimals
echo "123.45" | grep -oP '\d+(?(?=\.)\.\d+)'
# Output: 123.45

echo "123" | grep -oP '\d+(?(?=\.)\.\d+)'
# Output: 123

# Breakdown:
# \d+           - One or more digits
# (?(?=\.)      - IF lookahead sees a dot...
#   \.\d+       - ...THEN match dot and decimals
# )             - (no ELSE clause - just skip)

Practical Conditional Examples

# Match URL with optional protocol
# If protocol exists, require ://
echo "https://example.com" | grep -oP '(https?)?(?(1)://)[\w.-]+\.\w+'
# Output: https://example.com

# Alternative approach using an optional group (no conditional needed)
echo -e "https://example.com\nexample.com" | grep -oP '(?:https?://)?[\w.-]+\.\w+'

# Match version with optional lowercase v prefix
echo "v1.2.3" | grep -oP '(v)?\d+\.\d+\.\d+'
echo "1.2.3" | grep -oP '(v)?\d+\.\d+\.\d+'

Section 3: Recursive Patterns

What is Recursion in Regex?

Recursive patterns match nested structures like parentheses, HTML tags, or JSON.

# Standard regex CANNOT match arbitrary nesting:
# ((a)(b(c)))  - how many levels? Unknown.

# PCRE recursive patterns CAN:
# (?R) or (?0) - recurse entire pattern
# (?1), (?2)   - recurse specific group

Basic Recursion: (?R)

# Match nested parentheses
echo "(a(b(c)d)e)" | grep -oP '\((?:[^()]+|(?R))*\)'
# Output: (a(b(c)d)e)

# Breakdown:
# \(           - Opening paren
# (?:          - Non-capturing group:
#   [^()]+     -   Non-paren characters OR
#   |(?R)      -   Recurse the ENTIRE pattern
# )*           - Zero or more times
# \)           - Closing paren
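For validation, as opposed to extraction, the pattern needs anchors, and anchors do not mix well with (?R). The sketch below is one way to handle that, assuming GNU grep with -P: the recursive part is wrapped in group 1 and called with (?1), so ^ and $ stay outside the recursion. The helper name is_balanced is hypothetical.

```shell
#!/bin/bash
# Assumes GNU grep with -P. is_balanced is a hypothetical helper name.
# Group 1 holds the recursive part and is re-entered with (?1), so the
# ^ and $ anchors are applied once, not on every recursive call.
# [^()]++ is possessive to limit backtracking on unbalanced input.
is_balanced() {
    if printf '%s\n' "$1" | grep -qP '^(\((?:[^()]++|(?1))*\))$'; then
        echo "balanced"
    else
        echo "unbalanced"
    fi
}

is_balanced "(a(b(c)d)e)"
is_balanced "(a(b(c)d)e"
is_balanced "()"
```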

Recursion into Specific Group: (?1)

# More controlled recursion - recurse group 1 only
echo "(a(b(c)d)e)" | grep -oP '(\((?:[^()]+|(?1))*\))'
# Output: (a(b(c)d)e)

# Match nested braces
echo "{a{b{c}d}e}" | grep -oP '(\{(?:[^{}]+|(?1))*\})'
# Output: {a{b{c}d}e}

Matching Nested HTML Tags

# Simple nested div matching
echo "<div><div>inner</div></div>" | grep -oP '<div>(?:[^<]+|<div>(?:[^<]+|(?R))*</div>)*</div>'

# More general (but still limited):
# Match <tag>...</tag> with same tag nesting
pattern='<(\w+)>(?:[^<]+|<\1>(?R)*</\1>)*</\1>'

NOTE: For real HTML/XML, use a proper parser. Regex for nested markup is fragile.

Recursion with Named Groups

# Named group recursion
echo "(a(b)c)" | grep -oP '(?<paren>\((?:[^()]+|(?&paren))*\))'
# Output: (a(b)c)

# (?&name) recurses into named group

Practical Recursion Examples

# Match JSON arrays (simplified)
json='[1,[2,3],[4,[5,6]]]'
echo "$json" | grep -oP '\[(?:[^\[\]]+|(?R))*\]'
# Output: [1,[2,3],[4,[5,6]]]

# Match S-expressions (Lisp-style)
sexp='(define (square x) (* x x))'
echo "$sexp" | grep -oP '\((?:[^()]+|(?R))*\)'
# Output: (define (square x) (* x x))

# Match function calls with nested calls
code='func(a, other(b, c), d)'
echo "$code" | grep -oP '\w+\((?:[^()]+|(?R))*\)'
# Output: func(a, other(b, c), d)
# (grep -o reports non-overlapping matches, so the inner other(b, c) is not listed separately)

Section 4: Subroutine Calls

Define Once, Use Multiple Times

# Define IP octet pattern once, use 4 times
# (?(DEFINE)...) defines patterns without matching
# perl is used here because GNU grep -P rejects patterns that contain newlines

perl -ne 'print "$&\n" while /
  (?(DEFINE)
    (?<octet>25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
  )
  (?&octet)\.(?&octet)\.(?&octet)\.(?&octet)
/xg' <<< "192.168.1.100"
# Output: 192.168.1.100

Complex Pattern Definition

# Define reusable sub-patterns
# perl is used because GNU grep -P rejects patterns that contain newlines
perl -ne 'print "$&\n" while /
  (?(DEFINE)
    (?<year>\d{4})
    (?<month>0[1-9]|1[0-2])
    (?<day>0[1-9]|[12][0-9]|3[01])
    (?<hour>[01][0-9]|2[0-3])
    (?<minute>[0-5][0-9])
    (?<second>[0-5][0-9])
  )
  (?&year)-(?&month)-(?&day)T(?&hour):(?&minute):(?&second)
/xg' <<< "2026-03-26T14:30:45"
# Output: 2026-03-26T14:30:45

Benefits of DEFINE

  1. Readability - Name your sub-patterns

  2. DRY - Don’t Repeat Yourself

  3. Maintainability - Change once, update everywhere

  4. Testing - Test sub-patterns in isolation
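The fourth benefit can be sketched directly: keep the DEFINE block in a variable and call a single sub-pattern between anchors to test it on its own. The example assumes GNU grep with -P and keeps the pattern on one line, since a newline inside a grep pattern is treated as a separator between patterns. The names octet_def and is_octet are hypothetical.

```shell
#!/bin/bash
# Assumes GNU grep with -P. octet_def and is_octet are hypothetical names.
# The DEFINE block matches nothing by itself; (?&octet) calls the sub-pattern.
octet_def='(?(DEFINE)(?<octet>25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))'

is_octet() {
    if printf '%s\n' "$1" | grep -qP "${octet_def}^(?&octet)\$"; then
        echo "ok"
    else
        echo "bad"
    fi
}

for value in 0 192 255 256 1000; do
    printf '%-5s %s\n' "$value" "$(is_octet "$value")"
done
```

The same octet_def string can then be prefixed to the full IP pattern, so the piece that was tested is the piece that ships.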

Section 5: Branch Reset Groups

Problem: Group Numbers Shift with Alternation

# Normal alternation - groups numbered sequentially
echo "hello world" | perl -pe 's/(hello)|(world)/[$1$2]/g'
# Group 1 or Group 2 matches, never both

# This makes backreferences awkward

Solution: Branch Reset (?|…​)

# Branch reset: alternatives share group numbers
echo "hello world" | perl -pe 's/(?|(hello)|(world))/[$1]/g'
# Output: [hello] [world]

# Both "hello" and "world" are Group 1

Practical Branch Reset

# Match different date formats, normalize to group 1-3
pattern='(?|(\d{4})-(\d{2})-(\d{2})|(\d{2})/(\d{2})/(\d{4}))'

# Both formats fill groups 1, 2, 3 (differently ordered though)
# This is powerful for parsing multiple input formats

Section 6: Script Runs (Unicode)

Match Within Same Script

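# Restrict a match to characters from one Unicode script using \p{...} properties
# Helps detect mixed-script (homoglyph) attacks
# (recent Perl and PCRE2 releases also offer true script runs: (*script_run:...) or (*sr:...))

# Match only ASCII text (an explicit range works even where \p{ASCII} is unavailable)
grep -P '^[\x00-\x7F]+$' file.txt

# Match only Cyrillic (needs a UTF-8 locale)
grep -P '^\p{Cyrillic}+$' file.txt

# Match only Latin script
grep -P '^\p{Latin}+$' file.txt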

Security Application

# Detect potential homoglyph attacks in domains
# Cyrillic 'а' (U+0430) looks like Latin 'a' (U+0061)

# Flag domains containing any non-ASCII character (a coarse first filter)
grep -P '[^\x00-\x7F]' domains.txt | while IFS= read -r domain; do
    echo "Potential homoglyph: $domain"
done
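A locale-independent version of the same filter can be sketched with plain grep in the C locale, where matching is done on raw bytes. This is only a coarse check under that assumption: it flags every non-ASCII byte, including legitimate internationalized names. The helper name has_non_ascii is hypothetical.

```shell
#!/bin/bash
# has_non_ascii is a hypothetical helper name. LC_ALL=C makes grep compare
# raw bytes, so any byte outside printable ASCII (space through ~) is a hit
# regardless of the system locale.
has_non_ascii() {
    if printf '%s\n' "$1" | LC_ALL=C grep -q '[^ -~]'; then
        echo "suspicious"
    else
        echo "ascii"
    fi
}

has_non_ascii "example.com"
# Here the a is written as the UTF-8 bytes of Cyrillic U+0430 (octal 320 260)
has_non_ascii "$(printf 'ex\320\260mple.com')"
```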

Section 7: Optimization Techniques

Anchoring

# Anchored patterns are faster - engine knows where to start

# Slow (checks every position)
grep -P 'ERROR:.*failed' huge.log

# Faster (starts from line beginning only)
grep -P '^ERROR:.*failed' huge.log

# Even faster with fixed prefix
grep -F 'ERROR:' huge.log | grep -P 'failed$'

Avoid Catastrophic Patterns

# DANGEROUS patterns (exponential backtracking):
# (a+)+
# (a*)*
# (a|aa)+
# (.*a){10}

# SAFE alternatives:
# a+
# a*
# a+
# (?>[^a]*a){10}  # Atomic groups prevent backtracking
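A rough textual check can catch the first two shapes in the list before a pattern ever runs. The sketch below is a heuristic only, assuming a grep with ERE support: it looks for a group that ends in * or + and is itself followed by * or +. It misses other dangerous shapes such as (a|aa)+ and can flag harmless patterns. The helper name has_nested_quantifier is hypothetical.

```shell
#!/bin/bash
# has_nested_quantifier is a hypothetical helper name. Heuristic only:
# flags "( ... * or + ) followed by * or +" in the text of a regex.
has_nested_quantifier() {
    if printf '%s\n' "$1" | grep -qE '\([^)]*[*+]\)[*+]'; then
        echo "risky"
    else
        echo "ok"
    fi
}

has_nested_quantifier '(a+)+'
has_nested_quantifier '(a*)*'
has_nested_quantifier 'a+'
has_nested_quantifier '(?>[^a]*a){10}'
```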

Use Possessive When Possible

# Convert greedy to possessive when backtracking won't help

# Instead of:
grep -oP '"[^"]*"' file.txt

# Use (when you know quote won't be escaped):
grep -oP '"[^"]*+"' file.txt

Specific Character Classes

# More specific = faster

# Slow (matches anything)
grep -P '.*@.*\..*' emails.txt

# Faster (specific character class)
grep -P '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' emails.txt

# Even better - use negated classes
grep -P '[^@]+@[^.]+\.[^.]+' emails.txt

Unroll Loops

# Instead of:
grep -P '(?:abc|def)+' file.txt

# Unroll for common patterns:
grep -P '(?:abc|def)(?:abc|def)*' file.txt

# Or use possessive:
grep -P '(?:abc|def)++' file.txt

Real-World Applications

Professional: ReDoS Prevention Audit

#!/bin/bash
# Scan code for potentially dangerous regex patterns

find . -name "*.py" -o -name "*.js" -o -name "*.rb" | while read f; do
    # Find nested quantifiers
    if grep -qP '\([^)]*[*+]\)[*+]' "$f"; then
        echo "DANGER: Nested quantifiers in $f"
        grep -nP '\([^)]*[*+]\)[*+]' "$f"
    fi

    # Find patterns with multiple .*
    if grep -qP '\.\*.*\.\*' "$f"; then
        echo "WARNING: Multiple .* in $f"
        grep -nP '\.\*.*\.\*' "$f"
    fi
done

Professional: Complex Log Parser

#!/bin/bash
# Parse logs with an optional field

# Log format: [LEVEL] [TIMESTAMP] (optional PID) Message
log_pattern='(?x)
    ^\[(?<level>\w+)\]                    # Required: level
    \s+\[(?<timestamp>[\d:-]+)\]          # Required: timestamp
    (?:\s+\((?<pid>\d+)\))?               # Optional: (pid)
    \s+(?<message>.+)$                    # Required: message
'

# GNU grep -P rejects patterns that contain newlines, so hand the pattern to perl
LOG_PATTERN="$log_pattern" perl -ne 'print if /$ENV{LOG_PATTERN}/' /var/log/app.log | head -10

Professional: Config Validator with Recursion

#!/bin/bash
# Validate nested JSON-like structures

validate_nesting() {
    local content=$1

    # Group recursion (?1) is used instead of (?R): (?R) would re-apply the
    # ^ and $ anchors on every nested level and never match

    # Balanced braces: {...{...}...}
    if echo "$content" | grep -qP '^(\{(?:[^{}]++|(?1))*\})$'; then
        echo "OK: Balanced braces"
        return 0
    fi

    # Balanced brackets: [...[...]...]
    if echo "$content" | grep -qP '^(\[(?:[^\[\]]++|(?1))*\])$'; then
        echo "OK: Balanced brackets"
        return 0
    fi

    echo "ERROR: Unbalanced or unrecognized nesting"
    return 1
}

Personal: Extract Nested Markdown Lists

#!/bin/bash
# Parse nested bullet lists from markdown

# Match list items with their nesting level
# -h drops the filename prefix; IFS= and -r keep leading spaces and backslashes intact
grep -hP '^\s*[-*]\s+' ~/notes/*.md | while IFS= read -r line; do
    # Leading whitespace determines nesting (2 spaces per level)
    indent=${line%%[![:space:]]*}
    level=$(( ${#indent} / 2 ))
    content=$(printf '%s\n' "$line" | grep -oP '^\s*[-*]\s+\K.*')
    printf '%*s%s\n' $((level * 2)) '' "$content"
done

Personal: Smart Date Parser

#!/bin/bash
# Parse multiple date formats into ISO format

parse_date() {
    local input=$1

    # ISO format: 2026-03-26
    if echo "$input" | grep -qP '^\d{4}-\d{2}-\d{2}$'; then
        echo "$input"
        return
    fi

    # US format: 03/26/2026
    if echo "$input" | grep -qP '^\d{2}/\d{2}/\d{4}$'; then
        echo "$input" | sed -E 's|([0-9]{2})/([0-9]{2})/([0-9]{4})|\3-\1-\2|'
        return
    fi

    # Written: March 26, 2026
    if match=$(echo "$input" | grep -oP '(?<month>\w+)\s+(?<day>\d+),\s+(?<year>\d{4})'); then
        # Would need month name lookup...
        echo "Complex: $input"
        return
    fi

    echo "Unknown format: $input"
}

Professional: ISE RADIUS Session Analyzer

#!/bin/bash
# Parse ISE RADIUS sessions with nested authentication details
# Uses recursion for nested JSON responses

# ISE MNT API returns nested session data
# Extract nested attribute groups with recursion
parse_ise_session() {
    local session_json=$1

    # Extract nested attribute-value pairs using recursion
    # Pattern matches {...} including nested braces
    echo "$session_json" | grep -oP '\{(?:[^{}]+|(?R))*\}'
}

# Conditional pattern for auth status
# If status is PASSED, expect posture; else expect failure reason
# Group "passed" participates only for PASSED, so the conditional can test it
ise_auth_pattern='(?x)
    "authentication_status":\s*"(?:(?<passed>PASSED)|FAILED)"
    .*?
    (?(<passed>)
        "posture_status":\s*"(?<posture>[^"]+)"
      |
        "failure_reason":\s*"(?<reason>[^"]+)"
    )
'

# Extract sessions with optional VLAN assignment
# (perl, because GNU grep -P rejects patterns that contain newlines)
perl -ne 'print "$&\n" while /
    "mac_address":\s*"(?<mac>[0-9A-Fa-f:]{17})"
    .*?
    "nas_ip":\s*"(?<nas>\d+\.\d+\.\d+\.\d+)"
    (?:.*?"assigned_vlan":\s*"(?<vlan>\d+)")?
/xg' /tmp/ise-sessions.json

Professional: Cisco ACL Validator

#!/bin/bash
# Validate Cisco ACL syntax, including object-group references
# Uses alternation and named groups for the source and destination fields

# ACL line with optional nested groups
acl_pattern='(?x)
    ^(?<action>permit|deny)\s+
    (?<protocol>ip|tcp|udp|icmp)\s+
    (?:
        (?<src_type>host|any|\d+\.\d+\.\d+\.\d+)
        |
        object-group\s+(?<src_group>\S+)
    )\s+
    (?:
        (?<dst_type>host|any|\d+\.\d+\.\d+\.\d+)
        |
        object-group\s+(?<dst_group>\S+)
    )
    (?:\s+(?<ports>eq|range|gt|lt)\s+[\d\s-]+)?
'

validate_acl() {
    local acl_file=$1
    local errors=0

    while IFS= read -r line; do
        # Skip comments and empty lines
        [[ "$line" =~ ^[[:space:]]*[!#] ]] && continue
        [[ -z "$line" ]] && continue

        # grep -P rejects newlines in a pattern; (?x) ignores the spaces that replace them
        if ! echo "$line" | grep -qP "${acl_pattern//$'\n'/ }"; then
            echo "INVALID: $line"
            ((errors++))
        fi
    done < "$acl_file"

    echo "Validation complete: $errors errors found"
    return $errors
}

Professional: Network Syslog Parser with Conditionals

#!/bin/bash
# Parse Cisco syslog with conditional field extraction
# Different formats based on facility code

# If %SEC, expect IP addresses; if %LINK, expect an interface; else keep the message
syslog_pattern='(?x)
    ^(?<timestamp>\w+\s+\d+\s+[\d:]+)\s+
    (?<host>\S+)\s+
    %(?:(?<sec>SEC)|(?<link>LINK)|(?<facility>\w+))-(?<severity>\d)-(?<mnemonic>\w+):\s+
    (?(<sec>)
        # SEC facility: expect IP addresses
        .*?(?<src_ip>\d+\.\d+\.\d+\.\d+).*?(?<dst_ip>\d+\.\d+\.\d+\.\d+)
      |
        (?(<link>)
            # LINK facility: expect interface
            .*?(?<interface>(?:Gi|Fa|Te)\S+)
          |
            # Default: capture rest as message
            (?<message>.+)
        )
    )
'

# Parse with possessive quantifiers for efficiency (no backtracking on long messages)
# perl is used because GNU grep -P rejects patterns that contain newlines
perl -ne 'print if /
    ^\w+\s++\d+\s++[\d:]++\s++      # Timestamp (possessive)
    \S++\s++                         # Hostname
    %\w++-\d+-\w++:\s++              # Facility-Severity-Mnemonic
    (?:
        (?:denied|permitted)\s++    # ACL action
        (?<proto>tcp|udp|icmp)\s++  # Protocol
        (?<src>\d++\.\d++\.\d++\.\d++)  # Source IP (possessive)
    )?
    .*$
/x' /var/log/syslog

Personal: Finance Transaction Categorizer

#!/bin/bash
# Categorize bank transactions using DEFINE for reusable patterns
# Demonstrates DRY regex with named subroutines

# Define reusable patterns for financial parsing
finance_pattern='(?x)
    (?(DEFINE)
        (?<amount>\$?\d{1,3}(?:,\d{3})*(?:\.\d{2})?)
        (?<date>\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2})
        (?<merchant>[A-Za-z0-9\s&\x27-]+?)
        (?<category>grocery|dining|gas|utility|transfer|income)
    )
    ^(?&date)\s+
    (?<desc>(?&merchant))\s+
    (?<amt>-?(?&amount))
'

categorize_transaction() {
    local desc=$1
    local amount=$2

    # If the amount is negative, check expense categories
    # If positive, it's income
    if [[ "$amount" =~ ^- ]]; then
        # Map merchant keywords to a category (case-insensitive)
        if   grep -qiP 'walmart|target|costco' <<< "$desc"; then category="grocery"
        elif grep -qiP 'starbucks|mcdonald|restaurant' <<< "$desc"; then category="dining"
        elif grep -qiP 'shell|chevron|exxon' <<< "$desc"; then category="gas"
        elif grep -qiP 'electric|water|internet' <<< "$desc"; then category="utility"
        else category="uncategorized"
        fi
    else
        category="income"
    fi

    echo "$category"
}

# Parse CSV fields; quoted fields may contain commas
# Matches: "field, with comma",normal field,"another, field"
csv_pattern='(?:^|,)(?:"([^"]*+)"|([^,]++))'

# Extract transactions whose description is a balanced parenthesized group
# e.g., "AMAZON (Prime Membership (Annual))"
# (?1) recurses into the parenthesized group only; (?R) would restart at the date
grep -oP '\d{2}/\d{2}/\d{4}\s+[A-Z]+\s+(\((?:[^()]++|(?1))*\))\s+-?\$[\d,.]+' \
    ~/finances/transactions.csv

Personal: Gopass Entry Validator

#!/bin/bash
# Validate gopass entry structure (password line plus YAML-style metadata)
# Uses possessive quantifiers and a backreference to the indent for nested blocks

validate_gopass_entry() {
    local entry=$1

    # First line must be password (non-empty)
    if ! echo "$entry" | head -1 | grep -qP '^.+$'; then
        echo "ERROR: First line must be password"
        return 1
    fi

    # Remaining lines: YAML key: value with optional nesting
    # Uses possessive quantifiers for efficient parsing
    yaml_pattern='(?x)
        ^(?<indent>\s*+)           # Possessive indent capture
        (?<key>[a-z_]++):          # Possessive key
        (?:
            \s++(?<value>.++)      # Inline value (possessive)
          |
            \n(?:\1\s++.++\n)*+    # Nested block (possessive)
        )
    '

    if echo "$entry" | tail -n +2 | grep -qvP '^[a-z_]+:\s*.+$|^\s+[a-z_]+:'; then
        echo "WARNING: Non-standard YAML detected"
    fi
}

# Atomic group for matching gopass paths (no backtracking on store names)
gopass_path_pattern='(?>(v3|personal|work))/(?>[a-z_]+/)*+[a-z_]+$'

gopass list | grep -P "$gopass_path_pattern"

Tool Variants

PCRE (grep -P)

# Full PCRE support
grep -oP '(?>a+)b'              # Atomic groups
grep -oP 'a++b'                 # Possessive quantifiers
grep -oP '(?R)'                 # Recursion
grep -oP '(?(1)yes|no)'         # Conditionals
grep -oP '(?|a|b)'              # Branch reset (limited)

Python: regex module

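# Standard re module: conditionals are supported, atomic groups and possessive
# quantifiers arrive in Python 3.11; recursion, DEFINE, and branch reset are missing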
import re

# For full PCRE-like support, use regex module
import regex

# Atomic groups
pattern = regex.compile(r'(?>a+)b')

# Possessive quantifiers
pattern = regex.compile(r'a++b')

# Recursion
pattern = regex.compile(r'\((?:[^()]+|(?R))*\)')

# Branch reset
pattern = regex.compile(r'(?|(\w+)|(\d+))')

# DEFINE
pattern = regex.compile(r'''(?x)
    (?(DEFINE)(?<octet>\d{1,3}))
    (?&octet)\.(?&octet)\.(?&octet)\.(?&octet)
''')

Perl

# Perl's own regex engine supports all of these natively (PCRE is modeled on it)
my $text = "(a(b(c)d)e)";

# Recursion
if ($text =~ /(\((?:[^()]+|(?1))*\))/) {
    print "Matched: $1\n";
}

# Conditionals
$text =~ s/(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}/PHONE/g;

# Branch reset
$text =~ s/(?|(foo)|(bar))/$1/g;

# DEFINE
my $ip_pattern = qr{
    (?(DEFINE)
        (?<octet>25[0-5]|2[0-4]\d|[01]?\d\d?)
    )
    (?&octet)\.(?&octet)\.(?&octet)\.(?&octet)
}x;

JavaScript (Limited)

// JavaScript lacks most advanced features
// No atomic groups, recursion, conditionals, or branch reset

// Alternative: use multiple simpler patterns
function matchNestedParens(str) {
    let depth = 0;
    let start = -1;
    let results = [];

    for (let i = 0; i < str.length; i++) {
        if (str[i] === '(') {
            if (depth === 0) start = i;
            depth++;
        } else if (str[i] === ')') {
            depth--;
            if (depth === 0 && start >= 0) {
                results.push(str.substring(start, i + 1));
            }
        }
    }
    return results;
}

vim (Limited)

" Vim's regex engine has its own atomic group syntax (\(...\)\@>) but no
" recursion, conditionals, or possessive quantifiers.
" Use external commands for complex patterns:

" Filter through perl for recursion
:%!perl -pe 's/\((?:[^()]+|(?R))*\)/<MATCHED>/g'

" Or grep -P for possessive
:%!grep -oP 'a++'

Gotchas

Possessive vs Atomic

# They're equivalent, but syntax differs

# Possessive: modifier AFTER quantifier
a++    # One or more a, possessive
a*+    # Zero or more a, possessive
a?+    # Zero or one a, possessive

# Atomic: wrapper AROUND expression
(?>a+)  # Same as a++
(?>a*)  # Same as a*+
(?>a?)  # Same as a?+

# Atomic can wrap complex expressions
(?>foo|bar|baz)+   # Each iteration is locked in separately; (?:foo|bar|baz)++ locks the whole repetition
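Wrapping an alternation in an atomic group changes what can match, not only how fast it fails. The sketch below assumes GNU grep with -P and compares a normal group with an atomic one on the same input; try_match is a hypothetical helper name.

```shell
#!/bin/bash
# Assumes GNU grep with -P. try_match is a hypothetical helper name.
try_match() {
    if printf '%s\n' "$2" | grep -qP -- "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

# Normal group: after "a" is not followed by c, the engine retries with "ab"
try_match '(?:a|ab)c' 'abc'
# Atomic group: the first alternative that succeeds ("a") is locked in,
# so "ab" is never tried and the overall match fails
try_match '(?>a|ab)c' 'abc'
```

Ordering the alternatives longest first, as in (?>ab|a)c, avoids this trap.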

Recursion Depth Limits

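# Very deep nesting can hit the engine's resource limits

# PCRE2 limits are set at build time: the often-quoted 250 is the default cap on
# nested parentheses in the pattern itself; match-time recursion is bounded by
# the match, depth, and heap limits instead
# Very deeply nested structures may fail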

# For safety, add depth limits in code:
validate_nesting() {
    local str=$1
    local max_depth=${2:-50}
    local depth=0

    for ((i=0; i<${#str}; i++)); do
        char="${str:$i:1}"
        if [[ "$char" == "(" ]]; then
            ((depth++))
            if ((depth > max_depth)); then
                echo "ERROR: Max depth exceeded"
                return 1
            fi
        elif [[ "$char" == ")" ]]; then
            ((depth--))
        fi
    done
}

DEFINE Block Scope

# DEFINE patterns only exist within the regex
# They don't match any text themselves

grep -oP '(?(DEFINE)(?<foo>bar))(?&foo)' <<< "bar"
# Output: bar (matched by (?&foo), not by DEFINE)

# The DEFINE part matches zero characters

Branch Reset Numbering

# Groups WITHIN branch reset share numbers
# Groups OUTSIDE continue from highest

# (?|(a)(b)|(c)(d))(e)
# In first branch: $1=a, $2=b
# In second branch: $1=c, $2=d
# After branch reset: $3=e

echo "abe" | perl -pe 's/(?|(a)(b)|(c)(d))(e)/$1-$2-$3/'
# Output: a-b-e

Conditional Group Numbering

# (?(N)...) checks if group N MATCHED, not if it EXISTS

# This is WRONG thinking:
# "If group 1 exists in the pattern, use yes-pattern"

# This is CORRECT:
# "If group 1 successfully matched something, use yes-pattern"

# Example: optional opening bracket that requires a closing one
echo "<tag>" | grep -oP '^(<)?\w+(?(1)>)$'
# Output: <tag>
# Group 1 matched "<", so the yes-pattern ">" is required

echo "tag" | grep -oP '^(<)?\w+(?(1)>)$'
# Output: tag
# Group 1 did not participate, so the (empty) no-pattern applies

Recursion vs Subroutine Calls

# (?R) - recurses ENTIRE pattern (anchors included!)
# (?N) - recurses JUST group N (more flexible)

# DANGEROUS: (?R) with anchors - every recursive call re-applies ^ and $
echo "(a(b)c)" | grep -oP '^\((?:[^()]+|(?R))*\)$'
# No match: the nested (b) cannot satisfy ^ or $

# SAFER: put the recursive part in group 1 and recurse into that
echo "(a(b)c)" | grep -oP '^(\((?:[^()]+|(?1))*\))$'
# Output: (a(b)c)

# For clarity, use named groups:
echo "(a(b)c)" | grep -oP '(?<paren>\((?:[^()]+|(?&paren))*\))'

Possessive vs Atomic Performance

# Both PREVENT backtracking, but timing differs

# Possessive: checks DURING matching
a++b    # Never stores backtrack positions for a's

# Atomic: checks AFTER matching
(?>a+)b # Matches all a's, THEN locks them in

# Practical difference: none for simple cases
# But atomic can wrap complex alternations:

(?>pattern1|pattern2|pattern3)+   # Lock after each alternation match
# No possessive equivalent for this

Recursion Stack Overflow

# Deep recursion can exhaust the engine's limits

# Generate deep nesting
python3 -c "print('(' * 1000 + ')' * 1000)" > /tmp/deep.txt

# This may crash (older stack-based PCRE builds) or stop with a limit error:
grep -oP '\((?:[^()]+|(?R))*\)' /tmp/deep.txt
# e.g. grep: exceeded PCRE's backtracking limit (exact wording varies by version)

# Solutions:
# 1. Raise the limits where the tool exposes them (for example pcre2grep)
pcre2grep -o --match-limit=10000000 '...' /tmp/deep.txt

# 2. Use iterative approach instead of recursion
# 3. Pre-validate nesting depth before regex

Tool Feature Matrix

Feature               grep -P  perl  Python re  Python regex
--------------------  -------  ----  ---------  ------------
Atomic (?>…)          ✓        ✓     ✓ (3.11+)  ✓
Possessive ++         ✓        ✓     ✓ (3.11+)  ✓
Recursion (?R)        ✓        ✓     ✗          ✓
Conditional (?(N)…)   ✓        ✓     ✓          ✓
DEFINE                ✓        ✓     ✗          ✓
Branch Reset (?|…)    ✓        ✓     ✗          ✓

NOTE: Python's built-in re module supports conditionals, and atomic groups and possessive quantifiers from Python 3.11, but not recursion, DEFINE, or branch reset. Install the regex module for PCRE-like support.

Key Takeaways

Technique          Purpose
-----------------  ----------------------------------------------
(?>…)              Atomic group - prevent backtracking into group
++, *+, ?+         Possessive quantifiers - never backtrack
(?(cond)yes|no)    Conditional - match based on condition
(?R), (?1)         Recursion - match nested structures
(?(DEFINE)…)       Define reusable sub-patterns
(?|…)              Branch reset - share group numbers
Performance        Anchor, be specific, use possessive

Decision Tree: When to Use What

Need to prevent backtracking?
├── Yes → Use atomic groups or possessive quantifiers
│   ├── Simple expression? → Possessive (a++)
│   └── Complex expression? → Atomic ((?>...))
└── No → Continue

Need conditional matching?
├── Yes → Use (?(cond)yes|no)
│   ├── Condition on group match? → (?(1)...)
│   └── Condition on lookahead? → (?(?=...)...)
└── No → Continue

Need to match nesting?
├── Yes → Use recursion
│   ├── Entire pattern? → (?R)
│   └── Specific group? → (?1) or (?&name)
└── No → Continue

Need same pattern multiple times?
├── Yes → Use DEFINE
│   └── Define once, call with (?&name)
└── No → Continue

Need same group number in alternatives?
├── Yes → Use branch reset (?|...)
└── No → Standard alternation

Self-Test: Master Level

  1. What’s the difference between (a+)b and (?>a+)b when matching "aaac"?

  2. Write a pattern to match balanced parentheses to arbitrary depth.

  3. What does (?(1)yes|no) do?

  4. When would you use (?(DEFINE)…​)?

  5. Why are possessive quantifiers faster?

Answers
  1. Both fail, since there is no b. (a+)b first gives the a's back one at a time, retrying b after each step; (?>a+)b cannot give anything back, so each attempt fails immediately.

  2. \((?:[^()]+|(?R))*\) - recurses on nested parentheses

  3. If group 1 matched, use "yes" pattern; otherwise use "no" pattern

  4. To define reusable sub-patterns without matching text (DRY principle)

  5. They never backtrack - once matched, the engine never tries alternatives

Boss Level Challenge

Create a regex that:

  1. Matches valid JSON arrays with nested arrays

  2. Captures the depth of deepest nesting

  3. Fails fast on unbalanced brackets

Solution
# Matching nested JSON arrays with depth tracking
# Note: This is a simplified validator

json_array='(?x)
    (?<array>
        \[                          # Opening bracket
        (?:
            \s*
            (?:
                "(?:[^"\\]|\\.)*"   # String
                | -?\d+(?:\.\d+)?   # Number
                | true|false|null   # Literals
                | (?&array)         # Nested array (recursion)
            )
            \s*
            (?:,\s*)?               # Optional comma
        )*
        \]                          # Closing bracket
    )
'

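# Test (via perl: GNU grep -P rejects patterns that contain newlines)
echo '[1, [2, [3, 4]], 5]' | JSON_ARRAY="$json_array" perl -ne 'print "$&\n" while /$ENV{JSON_ARRAY}/g'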

# For depth counting, need programmatic approach:
count_depth() {
    local str=$1 max=0 current=0
    for ((i=0; i<${#str}; i++)); do
        case "${str:$i:1}" in
            '[') ((current++)); ((current > max)) && max=$current ;;
            ']') ((current--)) ;;
        esac
    done
    echo "Max depth: $max"
}

Congratulations!

You’ve completed the Regex Mastery curriculum. You now understand:

  • Character classes, quantifiers, anchors

  • Groups, backreferences, alternation

  • Lookahead and lookbehind assertions

  • Infrastructure-specific patterns

  • Atomic groups and possessive quantifiers

  • Conditionals and recursion

  • Performance optimization

You are now a regex expert.

Where to Go From Here

  1. Practice daily - Use regex for log analysis, data extraction, validation

  2. Study ReDoS - Understand and prevent regex denial of service

  3. Learn your tools - Master grep -P, sed, awk, Python regex module

  4. Build a pattern library - Document your most useful patterns

  5. Teach others - Best way to solidify knowledge

Return to Index

Regex Mastery Index - Review all drills and exercises.