Regex Session 09: Advanced Patterns

Move beyond basic patterns to handle real-world complexity: multi-line data, nested structures, ambiguous formats, and edge cases that break naive patterns.

The Advanced Mindset

Basic regex: "Match this pattern."
Advanced regex: "Match this pattern BUT NOT that, AND only in this context, AND handle edge cases."

Test Data Setup

cat << 'EOF' > /tmp/advanced-practice.txt
# Multi-line log entry
2026-03-15T10:30:45 [ERROR] Database connection failed
  Stack trace:
    at connect() in db.py:45
    at main() in app.py:12
  Cause: Timeout after 30s

# JSON-like structure
{
  "user": "admin",
  "roles": ["admin", "user", "viewer"],
  "settings": {
    "theme": "dark",
    "notifications": true
  }
}

# Config with sections
[database]
host = db-prod-01.internal
port = 5432
user = app_user

[cache]
host = redis.internal
port = 6379

# Mixed format data
Server: web-01 (192.168.1.100) [ACTIVE]
Server: web-02 (192.168.1.101) [STANDBY]
Server: db-01 (10.50.1.50) [ACTIVE] (Primary)

# Tricky email variations
simple@example.com
user.name+tag@sub.domain.co.uk
"quoted string"@example.com
admin@192.168.1.100

# Nested parentheses
func(arg1, func2(nested), arg3)
call(a, b(c, d(e)), f)
EOF

Lesson 1: Multi-line Patterns

The Problem

# Standard grep doesn't handle multi-line patterns
grep -E "ERROR.*Stack trace" /tmp/advanced-practice.txt
# No match - they're on different lines

Solution: PCRE with (?s) or grep -Pzo

# -z treats input as NUL-separated (effectively one "line")
# (?s) makes . match newlines
grep -Pzo '(?s)\[ERROR\].*?Stack trace:.*?(?=\n\n|\Z)' /tmp/advanced-practice.txt

Python Multi-line

import re

text = open('/tmp/advanced-practice.txt').read()

# DOTALL flag: . matches newline
pattern = re.compile(
    r'\[ERROR\].*?Stack trace:.*?Cause:.*?\n',
    re.DOTALL
)

match = pattern.search(text)
if match:
    print(match.group())

Lesson 2: Lazy vs Greedy Matching

Greedy (default): match as MUCH as possible.
Lazy (with ?): match as LITTLE as possible.

echo '<div>first</div><div>second</div>' | grep -oP '<div>.*</div>'
# Greedy: <div>first</div><div>second</div>

echo '<div>first</div><div>second</div>' | grep -oP '<div>.*?</div>'
# Lazy: <div>first</div>

When to Use Each

Scenario                   Use             Pattern
Extract first occurrence   Lazy *?         <tag>.*?</tag>
Match entire span          Greedy *        ".*"
Specific delimiter         Negated class   <tag>[^<]*</tag>

Pro tip: Negated character class [^X]* is often clearer and faster than lazy matching.
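A quick sketch of that equivalence: both patterns stop at the first closing tag, but the negated class never needs to backtrack.

```python
import re

html = '<div>first</div><div>second</div>'

lazy = re.findall(r'<div>.*?</div>', html)
negated = re.findall(r'<div>[^<]*</div>', html)

# Both find the same two non-overlapping matches
print(lazy)     # ['<div>first</div>', '<div>second</div>']
print(negated)  # same result, without any backtracking
```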

Lesson 3: Atomic Groups and Possessive Quantifiers

Problem: Backtracking can cause catastrophic performance.

import re
import time

# This pattern is vulnerable to catastrophic backtracking
evil_pattern = r'(a+)+b'
evil_string = 'a' * 25  # No 'b' at end

start = time.time()
re.search(evil_pattern, evil_string)  # Takes forever!
print(f"Time: {time.time() - start}")

Solution: Possessive Quantifiers (regex module)

# Install: pip install regex
import regex

# Possessive ++ doesn't backtrack
safe_pattern = r'(a++)+b'
regex.search(safe_pattern, evil_string)  # Fast - a++ never gives characters back

Atomic Groups

# Atomic group (?>...) prevents backtracking into the group
pattern = r'(?>a+)b'
# Once a+ matches, it won't give back characters

Lesson 4: Conditional Patterns

PCRE supports: (?(condition)yes|no)

# Match a number, optionally parenthesized
# If group 1 (the opening paren) matched, require a closing paren
echo -e "(123)\n456\n(789" | \
  grep -P '^(\()?\d+(?(1)\))$'
# Matches "(123)" and "456", but not the unbalanced "(789"
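Python's re module supports the same construct, including conditions on named groups. A small sketch that accepts a token with either both quotes or none:

```python
import re

# Group "q" captures an optional opening quote; the conditional (?(q)")
# requires a closing quote only if the opening quote matched.
pattern = re.compile(r'^(?P<q>")?\w+(?(q)")$')

for s in ['"hi"', 'hi', '"hi', 'hi"']:
    print(f"{s!r}: {bool(pattern.match(s))}")
# '"hi"': True, 'hi': True, '"hi': False, 'hi"': False
```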

Practical: Validate Paired Delimiters

import re

def has_balanced_parens(text: str) -> bool:
    """Check if parentheses are balanced (simple check)."""
    # Count open and close
    opens = len(re.findall(r'\(', text))
    closes = len(re.findall(r'\)', text))
    return opens == closes

# For proper nesting validation, you need recursion (regex alone can't do it)

Lesson 5: Recursive Patterns (PCRE)

Match nested structures:

# Match balanced parentheses with content
echo "func(arg1, inner(nested), arg3)" | \
  grep -oP '\((?:[^()]+|(?R))*\)'

Explanation:
- \( - opening paren
- (?:...) - non-capturing group
- [^()]+ - non-paren characters
- |(?R) - OR recurse into the entire pattern
- \) - closing paren

# Python's re module doesn't support recursion
# Use regex module instead
import regex

pattern = regex.compile(r'\((?:[^()]+|(?R))*\)')
text = "func(a, b(c, d(e)), f)"
matches = pattern.findall(text)
print(matches)
# Output: ['(a, b(c, d(e)), f)']
# findall returns non-overlapping matches, so only the outermost group
# is reported; the nested parens are consumed inside it

Lesson 6: Branch Reset Groups

Problem: Different alternatives have different group numbers.

# Without branch reset
echo "Jan 15" | grep -oP '(Jan|Feb|Mar)\s+(\d+)'
# Group 1 = month, Group 2 = day

echo "15 Jan" | grep -oP '(\d+)\s+(Jan|Feb|Mar)'
# Now Group 1 = day, Group 2 = month!

Solution: Branch reset (?|…​)

# With branch reset - group numbering restarts at 1 in each branch,
# so the whole pattern defines two groups instead of four
# (plain numbered groups are used: PCRE rejects giving the same group
# number different names across branches)
echo -e "Jan 15\n15 Jan" | \
  grep -oP '(?|(Jan|Feb|Mar)\s+(\d+)|(\d+)\s+(Jan|Feb|Mar))'
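grep -o only prints the full match, so the renumbering is easier to see from Python with the third-party regex module (the stdlib re module does not support branch reset):

```python
# pip install regex
import regex

pattern = regex.compile(r'(?|(Jan|Feb|Mar)\s+(\d+)|(\d+)\s+(Jan|Feb|Mar))')

# Numbering restarts at 1 in each branch, so the pattern exposes
# only two groups, filled by whichever branch matched.
print(pattern.match('Jan 15').groups())  # ('Jan', '15')
print(pattern.match('15 Jan').groups())  # ('15', 'Jan')
```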

Lesson 7: Complex Extraction Patterns

Extract Server Info with Multiple Formats

import re

text = """
Server: web-01 (192.168.1.100) [ACTIVE]
Server: web-02 (192.168.1.101) [STANDBY]
Server: db-01 (10.50.1.50) [ACTIVE] (Primary)
"""

# Handle optional (Primary) tag
pattern = re.compile(
    r'Server:\s+(?P<name>[\w-]+)\s+'
    r'\((?P<ip>[\d.]+)\)\s+'
    r'\[(?P<status>\w+)\]'
    r'(?:\s+\((?P<role>\w+)\))?'  # Optional role
)

for match in pattern.finditer(text):
    info = match.groupdict()
    role = info['role'] or 'Standard'
    print(f"{info['name']}: {info['ip']} ({info['status']}, {role})")

Output:

web-01: 192.168.1.100 (ACTIVE, Standard)
web-02: 192.168.1.101 (STANDBY, Standard)
db-01: 10.50.1.50 (ACTIVE, Primary)

Parse INI-style Config Sections

import re
from collections import defaultdict

def parse_ini(text: str) -> dict:
    """Parse INI-style config into nested dict."""
    config = defaultdict(dict)
    current_section = 'default'

    section_pattern = re.compile(r'^\[(\w+)\]$')
    value_pattern = re.compile(r'^(\w+)\s*=\s*(.+)$')

    for line in text.split('\n'):
        line = line.strip()
        if not line or line.startswith('#'):
            continue

        section_match = section_pattern.match(line)
        if section_match:
            current_section = section_match.group(1)
            continue

        value_match = value_pattern.match(line)
        if value_match:
            key, value = value_match.groups()
            config[current_section][key] = value

    return dict(config)

# Usage
config = parse_ini(open('/tmp/advanced-practice.txt').read())
print(config.get('database', {}))
# Output: {'host': 'db-prod-01.internal', 'port': '5432', 'user': 'app_user'}

Lesson 8: Context-Aware Matching

Match Only in Specific Context

# Match "port" only in [database] section
# A flag-based awk script is used because a /start/,/end/ range would
# end on the [database] line itself (it also matches /^\[/)
awk '/^\[database\]/{f=1; next} /^\[/{f=0} f' /tmp/advanced-practice.txt | \
  grep -oP 'port\s*=\s*\K\d+'

import re

def find_in_section(text: str, section: str, key: str) -> str | None:
    """Find key value within specific INI section."""
    # Match section and extract its content
    section_pattern = re.compile(
        rf'\[{section}\].*?(?=\n\[|\Z)',
        re.DOTALL
    )
    section_match = section_pattern.search(text)
    if not section_match:
        return None

    # Find key within section
    key_pattern = re.compile(rf'^{key}\s*=\s*(.+)$', re.MULTILINE)
    key_match = key_pattern.search(section_match.group())
    return key_match.group(1) if key_match else None

# Usage
text = open('/tmp/advanced-practice.txt').read()
print(find_in_section(text, 'database', 'port'))  # 5432
print(find_in_section(text, 'cache', 'port'))     # 6379

Negative Context (Match NOT in Context)

# Match IPs that are NOT in the [database] section:
# drop that section first, then extract dotted quads from the rest
awk '/^\[database\]/{skip=1; next} /^\[/{skip=0} !skip' /tmp/advanced-practice.txt | \
  grep -oP '\b\d{1,3}(?:\.\d{1,3}){3}\b'

Lesson 9: Edge Case Handling

Email Variations

import re

# Comprehensive email pattern (simplified RFC 5322)
EMAIL_PATTERN = re.compile(r'''
    (?P<local>
        [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+  # Normal characters
        |
        "(?:[^"\\]|\\.)*"                   # Quoted string
    )
    @
    (?P<domain>
        (?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}   # Domain name
        |
        \d{1,3}(?:\.\d{1,3}){3}            # IP address
    )
''', re.VERBOSE)

test_emails = [
    "simple@example.com",
    "user.name+tag@sub.domain.co.uk",
    '"quoted string"@example.com',
    "admin@192.168.1.100",
]

for email in test_emails:
    match = EMAIL_PATTERN.match(email)
    if match:
        print(f"✓ {email}: {match.groupdict()}")
    else:
        print(f"✗ {email}: No match")

Handling Escaped Characters

import re

# Match quoted strings, handling escaped quotes
text = r'Say "Hello \"World\"" and "Goodbye"'

# Pattern that handles escaped quotes inside
pattern = r'"(?:[^"\\]|\\.)*"'

matches = re.findall(pattern, text)
print(matches)
# Output: ['"Hello \\"World\\""', '"Goodbye"']

Lesson 10: Real-World Challenges

Challenge 1: Parse Mixed Log Formats

import re

# Multiple timestamp formats
LOG_PATTERNS = [
    # ISO format: 2026-03-15T10:30:45
    re.compile(r'(?P<ts>\d{4}-\d{2}-\d{2}T[\d:]+)\s+\[(?P<level>\w+)\]\s+(?P<msg>.+)'),
    # Syslog format: Mar 15 10:30:45
    re.compile(r'(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+(?P<msg>.+)'),
    # Apache format: [15/Mar/2026:10:30:45 +0000]
    re.compile(r'\[(?P<ts>[^\]]+)\]\s+(?P<msg>.+)'),
]

def parse_log(line: str) -> dict | None:
    """Try multiple patterns to parse a log line."""
    for pattern in LOG_PATTERNS:
        match = pattern.match(line)
        if match:
            return match.groupdict()
    return None
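A compact, self-contained check of the fallback idea (the ISO pattern is copied from LOG_PATTERNS above):

```python
import re

# Try each format in order; return the first groupdict that matches
patterns = [
    re.compile(r'(?P<ts>\d{4}-\d{2}-\d{2}T[\d:]+)\s+\[(?P<level>\w+)\]\s+(?P<msg>.+)'),
    re.compile(r'(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+(?P<msg>.+)'),
]

line = '2026-03-15T10:30:45 [ERROR] Database connection failed'
parsed = next((p.match(line).groupdict() for p in patterns if p.match(line)), None)
print(parsed)
# {'ts': '2026-03-15T10:30:45', 'level': 'ERROR', 'msg': 'Database connection failed'}
```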

Challenge 2: Validate Complex Identifiers

import re

# Kubernetes resource name: lowercase alphanumeric, hyphens, max 63 chars
K8S_NAME = re.compile(r'^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$')

# Docker image reference
DOCKER_IMAGE = re.compile(r'''
    ^
    (?:(?P<registry>[a-z0-9.-]+(?::\d+)?)/)?  # Optional registry
    (?P<name>[a-z0-9]+(?:[._-][a-z0-9]+)*)     # Image name
    (?::(?P<tag>[a-zA-Z0-9._-]+))?             # Optional tag
    (?:@(?P<digest>sha256:[a-f0-9]{64}))?      # Optional digest
    $
''', re.VERBOSE)

# Test
images = [
    "nginx",
    "nginx:latest",
    "registry.example.com/nginx:v1.2.3",
    "gcr.io/project/image@sha256:" + "a" * 64,
]

for img in images:
    match = DOCKER_IMAGE.match(img)
    if match:
        print(f"✓ {img}: {match.groupdict()}")
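K8S_NAME is defined above but never exercised; a quick spot-check of the rule (pattern copied verbatim):

```python
import re

# Must start and end with lowercase alphanumerics, hyphens only in the
# middle, 1-63 characters total
K8S_NAME = re.compile(r'^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$')

for name in ['web-01', 'a', '-bad', 'bad-', 'Bad', 'x' * 64]:
    print(f"{name!r}: {bool(K8S_NAME.match(name))}")
# 'web-01': True, 'a': True, '-bad': False, 'bad-': False,
# 'Bad': False, and the 64-character name: False
```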

Performance Tips

Technique                   Why
Anchor patterns             ^pattern fails faster than an unanchored pattern
Use character classes       [^X]* is faster than lazy .*?
Avoid nested quantifiers    (a+)+ can backtrack catastrophically
Compile patterns            Reuse compiled patterns in loops
Fail fast                   Put the most-likely-to-fail conditions first
Use possessive when safe    a++ prevents backtracking
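Compiling once matters most inside hot loops. A rough sketch of how to measure the difference (timings vary by machine, so no numbers are claimed here):

```python
import re
import timeit

text = 'x' * 1000 + 'needle'
compiled = re.compile(r'needle$')

# re.search() caches compiled patterns internally, but an explicit
# compile still skips the per-call cache lookup in tight loops.
t_compiled = timeit.timeit(lambda: compiled.search(text), number=10_000)
t_module = timeit.timeit(lambda: re.search(r'needle$', text), number=10_000)
print(f"compiled: {t_compiled:.4f}s  module-level: {t_module:.4f}s")
```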

Exercises to Complete

  1. [ ] Write a pattern to match JSON strings (handling escapes)

  2. [ ] Parse multi-line stack traces into structured data

  3. [ ] Match nested function calls to any depth

  4. [ ] Create a pattern that validates URLs with query parameters

  5. [ ] Build a log parser that handles 3 different timestamp formats

Self-Check

Solutions

import re

# 1. JSON strings with escapes
JSON_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')

# 2. Multi-line stack traces
STACK_TRACE = re.compile(
    r'Stack trace:\n(?P<frames>(?:\s+at .+\n)+)',
    re.MULTILINE
)

# 3. Nested function calls (requires regex module)
import regex
NESTED_CALLS = regex.compile(r'\w+\((?:[^()]+|(?R))*\)')

# 4. URL with query params
URL_PATTERN = re.compile(r'''
    ^
    (?P<scheme>https?)://
    (?P<host>[a-zA-Z0-9.-]+)
    (?::(?P<port>\d+))?
    (?P<path>/[^\s?#]*)?
    (?:\?(?P<query>[^\s#]*))?
    (?:\#(?P<fragment>[^\s]*))?
    $
''', re.VERBOSE)

# 5. Multi-format timestamp parser
TIMESTAMP_PATTERNS = {
    'iso': re.compile(r'\d{4}-\d{2}-\d{2}T[\d:]+'),
    'syslog': re.compile(r'\w{3}\s+\d+\s+[\d:]+'),
    'apache': re.compile(r'\[\d+/\w+/\d+:[\d:]+\s+[+-]\d+\]'),
}
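A spot-check for solution 4 (the pattern is copied from URL_PATTERN above, with the same group names):

```python
import re

URL = re.compile(r'''
    ^(?P<scheme>https?)://
    (?P<host>[a-zA-Z0-9.-]+)
    (?::(?P<port>\d+))?
    (?P<path>/[^\s?#]*)?
    (?:\?(?P<query>[^\s#]*))?
    (?:\#(?P<fragment>\S*))?$
''', re.VERBOSE)

m = URL.match('https://example.com:8080/search?q=regex#results')
print(m.groupdict())
# {'scheme': 'https', 'host': 'example.com', 'port': '8080',
#  'path': '/search', 'query': 'q=regex', 'fragment': 'results'}
```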

Next Session

Session 10: Performance & Optimization - Write efficient patterns that scale.