Regex Session 09: Advanced Patterns
Move beyond basic patterns to handle real-world complexity: multi-line data, nested structures, ambiguous formats, and edge cases that break naive patterns.
The Advanced Mindset
Basic regex: "Match this pattern."
Advanced regex: "Match this pattern BUT NOT that, AND only in this context, AND handle edge cases."
Test Data Setup
cat << 'EOF' > /tmp/advanced-practice.txt
# Multi-line log entry
2026-03-15T10:30:45 [ERROR] Database connection failed
Stack trace:
at connect() in db.py:45
at main() in app.py:12
Cause: Timeout after 30s
# JSON-like structure
{
"user": "admin",
"roles": ["admin", "user", "viewer"],
"settings": {
"theme": "dark",
"notifications": true
}
}
# Config with sections
[database]
host = db-prod-01.internal
port = 5432
user = app_user
[cache]
host = redis.internal
port = 6379
# Mixed format data
Server: web-01 (192.168.1.100) [ACTIVE]
Server: web-02 (192.168.1.101) [STANDBY]
Server: db-01 (10.50.1.50) [ACTIVE] (Primary)
# Tricky email variations
simple@example.com
user.name+tag@sub.domain.co.uk
"quoted string"@example.com
admin@192.168.1.100
# Nested parentheses
func(arg1, func2(nested), arg3)
call(a, b(c, d(e)), f)
EOF
Lesson 1: Multi-line Patterns
The Problem
# Standard grep doesn't handle multi-line patterns
grep -E "ERROR.*Stack trace" /tmp/advanced-practice.txt
# No match - they're on different lines
Solution: PCRE with (?s) or grep -Pzo
# -z treats records as NUL-terminated; a normal text file contains no NULs,
# so grep sees the whole file as one searchable "line"
# (?s) makes . match newlines
grep -Pzo '(?s)\[ERROR\].*?Stack trace:.*?(?=\n\n|\Z)' /tmp/advanced-practice.txt
Python Multi-line
import re
text = open('/tmp/advanced-practice.txt').read()
# DOTALL flag: . matches newline
pattern = re.compile(
r'\[ERROR\].*?Stack trace:.*?Cause:.*?\n',
re.DOTALL
)
match = pattern.search(text)
if match:
print(match.group())
Lesson 2: Lazy vs Greedy Matching
Greedy (default): match as MUCH as possible.
Lazy (with ?): match as LITTLE as possible.
echo '<div>first</div><div>second</div>' | grep -oP '<div>.*</div>'
# Greedy: <div>first</div><div>second</div>
echo '<div>first</div><div>second</div>' | grep -oP '<div>.*?</div>'
# Lazy: <div>first</div>
When to Use Each
| Scenario | Use | Pattern |
|---|---|---|
| Extract first occurrence | Lazy | `<div>.*?</div>` |
| Match entire span | Greedy | `<div>.*</div>` |
| Specific delimiter | Negated class | `<div>[^<]*</div>` |
Pro tip: Negated character class [^X]* is often clearer and faster than lazy matching.
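The same contrast in Python, reusing the `<div>` sample from above; a minimal sketch:

```python
import re

html = '<div>first</div><div>second</div>'

# Lazy quantifier: the engine stops and retries the tail at every position
lazy = re.findall(r'<div>.*?</div>', html)

# Negated class: [^<]* runs to the next '<' in one pass, no backtracking
negated = re.findall(r'<div>[^<]*</div>', html)

print(lazy)     # ['<div>first</div>', '<div>second</div>']
print(negated)  # same matches, found without backtracking
```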
Lesson 3: Atomic Groups and Possessive Quantifiers
Problem: Backtracking can cause catastrophic performance.
import re
import time
# This pattern is vulnerable to catastrophic backtracking
evil_pattern = r'(a+)+b'
evil_string = 'a' * 25 # No 'b' at end
start = time.time()
re.search(evil_pattern, evil_string) # Takes forever!
print(f"Time: {time.time() - start}")
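To see the exponential blow-up without hanging your shell, time the same pattern on slowly growing inputs (a sketch; exact timings vary by machine):

```python
import re
import time

# Each extra 'a' roughly doubles the work: (a+)+ can split the run of a's
# in exponentially many ways, and every split is tried before failing
for n in (16, 18, 20):
    s = 'a' * n
    t0 = time.perf_counter()
    result = re.search(r'(a+)+b', s)  # never matches - there is no 'b'
    print(n, result, round(time.perf_counter() - t0, 4))
```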
Solution: Possessive Quantifiers (regex module)
# Install: pip install regex
import regex
# Possessive ++ never gives characters back, so failure is immediate
# (Python 3.11+ also supports possessive quantifiers in the built-in re)
safe_pattern = r'(a++)+b'
regex.search(safe_pattern, evil_string)  # Returns None almost instantly
Atomic Groups
# Atomic group (?>...) prevents backtracking into the group
pattern = r'(?>a+)b'
# Once a+ matches, it won't give back characters
Lesson 4: Conditional Patterns
PCRE supports: (?(condition)yes|no)
# (?(1)yes|no): if group 1 matched, require the "yes" branch.
# Here a number may be wrapped in parens, but an opening paren
# demands a matching closing paren
echo -e "(123)\n456\n(789" | grep -P '^(\()?\d+(?(1)\))$'
# Matches (123) and 456, but not (789
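Python's built-in re supports the same construct; a minimal sketch validating optionally parenthesized numbers:

```python
import re

# (?(1)...) - the closing paren is required only when group 1 captured one
pattern = re.compile(r'^(\()?\d+(?(1)\))$')

for s in ['(123)', '456', '(789', '789)']:
    print(s, bool(pattern.match(s)))
# (123) True, 456 True, (789 False, 789) False
```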
Practical: Validate Paired Delimiters
import re
def has_balanced_parens(text: str) -> bool:
"""Check if parentheses are balanced (simple check)."""
# Count open and close
opens = len(re.findall(r'\(', text))
closes = len(re.findall(r'\)', text))
return opens == closes
# Counting alone also misses ordering errors like ")(". For proper nesting
# you need recursion (PCRE, next lesson) or an explicit depth counter -
# Python's re alone can't do it
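A depth counter is the usual non-regex fix for proper nesting; a minimal sketch:

```python
def is_properly_nested(text: str) -> bool:
    """Reject if a ')' ever appears before its matching '('."""
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:          # closed before opened
                return False
    return depth == 0              # every open paren was closed

print(is_properly_nested('func(a, b(c))'))  # True
print(is_properly_nested(')('))             # False, though the counts match
```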
Lesson 5: Recursive Patterns (PCRE)
Match nested structures:
# Match balanced parentheses with content
echo "func(arg1, inner(nested), arg3)" | \
grep -oP '\((?:[^()]+|(?R))*\)'
Explanation:
- \( - Opening paren
- (?:…) - Non-capturing group
- [^()]+ - Non-paren characters
- |(?R) - OR recurse entire pattern
- \) - Closing paren
# Python's re module doesn't support recursion
# Use regex module instead
import regex
pattern = regex.compile(r'\((?:[^()]+|(?R))*\)')
text = "func(a, b(c, d(e)), f)"
matches = pattern.findall(text)
print(matches)
# Output: ['(a, b(c, d(e)), f)']
# findall scans left to right without overlap, so the outermost
# match swallows the nested groups
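To recover the nested spans as well, the regex module's `overlapped=True` retries the pattern at every start position; a sketch with the same pattern:

```python
import regex  # third-party 'regex' module, as above

pattern = regex.compile(r'\((?:[^()]+|(?R))*\)')
text = "func(a, b(c, d(e)), f)"

# overlapped=True attempts a match at each position, so the inner
# parenthesized groups are reported alongside the outermost one
print(pattern.findall(text, overlapped=True))
# ['(a, b(c, d(e)), f)', '(c, d(e))', '(e)']
```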
Lesson 6: Branch Reset Groups
Problem: Different alternatives have different group numbers.
# Without branch reset
echo "Jan 15" | grep -oP '(Jan|Feb|Mar)\s+(\d+)'
# Group 1 = month, Group 2 = day
echo "15 Jan" | grep -oP '(\d+)\s+(Jan|Feb|Mar)'
# Now Group 1 = day, Group 2 = month!
Solution: Branch reset (?|…)
# With branch reset, numbering restarts in each branch, so the month
# is group 1 whichever order matched. (Numbered groups are used here:
# PCRE rejects a name reused for differently numbered groups.)
echo -e "Jan 15\n15 Jan" | \
grep -oP '(?|(Jan|Feb|Mar)\s+\d+|\d+\s+(Jan|Feb|Mar))'
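The regex module supports branch reset too; a sketch where the captured digits land in group 1 whichever format matched (the ID formats are invented for illustration):

```python
import regex  # third-party; built-in re does not support (?|...)

# Three ID formats; with (?|...), the digits are group 1 in every branch
pattern = regex.compile(r'(?|ID:(\d+)|#(\d+)|No\.\s*(\d+))')

for s in ['ID:42', '#7', 'No. 99']:
    print(pattern.search(s).group(1))
# 42, 7, 99
```

Without the `(?|...)` wrapper the three alternatives would capture into groups 1, 2, and 3, and the caller would have to check all of them.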
Lesson 7: Complex Extraction Patterns
Extract Server Info with Multiple Formats
import re
text = """
Server: web-01 (192.168.1.100) [ACTIVE]
Server: web-02 (192.168.1.101) [STANDBY]
Server: db-01 (10.50.1.50) [ACTIVE] (Primary)
"""
# Handle optional (Primary) tag
pattern = re.compile(
r'Server:\s+(?P<name>[\w-]+)\s+'
r'\((?P<ip>[\d.]+)\)\s+'
r'\[(?P<status>\w+)\]'
r'(?:\s+\((?P<role>\w+)\))?' # Optional role
)
for match in pattern.finditer(text):
info = match.groupdict()
role = info['role'] or 'Standard'
print(f"{info['name']}: {info['ip']} ({info['status']}, {role})")
Output:
web-01: 192.168.1.100 (ACTIVE, Standard)
web-02: 192.168.1.101 (STANDBY, Standard)
db-01: 10.50.1.50 (ACTIVE, Primary)
Parse INI-style Config Sections
import re
from collections import defaultdict
def parse_ini(text: str) -> dict:
"""Parse INI-style config into nested dict."""
config = defaultdict(dict)
current_section = 'default'
section_pattern = re.compile(r'^\[(\w+)\]$')
value_pattern = re.compile(r'^(\w+)\s*=\s*(.+)$')
for line in text.split('\n'):
line = line.strip()
if not line or line.startswith('#'):
continue
section_match = section_pattern.match(line)
if section_match:
current_section = section_match.group(1)
continue
value_match = value_pattern.match(line)
if value_match:
key, value = value_match.groups()
config[current_section][key] = value
return dict(config)
# Usage
config = parse_ini(open('/tmp/advanced-practice.txt').read())
print(config.get('database', {}))
# Output: {'host': 'db-prod-01.internal', 'port': '5432', 'user': 'app_user'}
Lesson 8: Context-Aware Matching
Match Only in Specific Context
# Match "port" only in the [database] section.
# (An awk /start/,/end/ range won't work here: the [database] line itself
# matches /^\[/, so the range would end on its very first line.)
awk '/^\[database\]/{s=1; next} /^\[/{s=0} s' /tmp/advanced-practice.txt | grep -oP 'port\s*=\s*\K\d+'
import re
def find_in_section(text: str, section: str, key: str) -> str | None:
"""Find key value within specific INI section."""
# Match section and extract its content
section_pattern = re.compile(
rf'\[{section}\].*?(?=\n\[|\Z)',
re.DOTALL
)
section_match = section_pattern.search(text)
if not section_match:
return None
# Find key within section
key_pattern = re.compile(rf'^{key}\s*=\s*(.+)$', re.MULTILINE)
key_match = key_pattern.search(section_match.group())
return key_match.group(1) if key_match else None
# Usage
text = open('/tmp/advanced-practice.txt').read()
print(find_in_section(text, 'database', 'port')) # 5432
print(find_in_section(text, 'cache', 'port')) # 6379
Negative Context (Match NOT in Context)
# Match IPs that are NOT in the [database] section:
# drop that section's lines first, then extract IPs from the rest
awk '/^\[database\]/{s=1; next} /^\[/{s=0} !s' /tmp/advanced-practice.txt | \
  grep -oP '\d{1,3}(\.\d{1,3}){3}'
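The same idea in Python: delete the unwanted section, then match in what remains. A sketch over an inline sample (the hostnames and IPs are invented):

```python
import re

sample = """\
[database]
host = db-prod-01.internal
[cache]
host = 192.168.5.9
Server: web-01 (192.168.1.100) [ACTIVE]
"""

# Remove the [database] section, then extract IPs from what is left
without_db = re.sub(r'\[database\].*?(?=\n\[|\Z)', '', sample, flags=re.DOTALL)
print(re.findall(r'\b\d{1,3}(?:\.\d{1,3}){3}\b', without_db))
# ['192.168.5.9', '192.168.1.100']
```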
Lesson 9: Edge Case Handling
Email Variations
import re
# Comprehensive email pattern (simplified RFC 5322)
EMAIL_PATTERN = re.compile(r'''
(?P<local>
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+ # Normal characters
|
"(?:[^"\\]|\\.)*" # Quoted string
)
@
(?P<domain>
(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,} # Domain name
|
\d{1,3}(?:\.\d{1,3}){3} # IP address
)
''', re.VERBOSE)
test_emails = [
"simple@example.com",
"user.name+tag@sub.domain.co.uk",
'"quoted string"@example.com',
"admin@192.168.1.100",
]
for email in test_emails:
match = EMAIL_PATTERN.match(email)
if match:
print(f"✓ {email}: {match.groupdict()}")
else:
print(f"✗ {email}: No match")
Handling Escaped Characters
import re
# Match quoted strings, handling escaped quotes
text = r'Say "Hello \"World\"" and "Goodbye"'
# Pattern that handles escaped quotes inside
pattern = r'"(?:[^"\\]|\\.)*"'
matches = re.findall(pattern, text)
print(matches)
# Output: ['"Hello \\"World\\""', '"Goodbye"']
Lesson 10: Real-World Challenges
Challenge 1: Parse Mixed Log Formats
import re
from datetime import datetime
# Multiple timestamp formats
LOG_PATTERNS = [
# ISO format: 2026-03-15T10:30:45
re.compile(r'(?P<ts>\d{4}-\d{2}-\d{2}T[\d:]+)\s+\[(?P<level>\w+)\]\s+(?P<msg>.+)'),
# Syslog format: Mar 15 10:30:45
re.compile(r'(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+(?P<msg>.+)'),
# Apache format: [15/Mar/2026:10:30:45 +0000]
re.compile(r'\[(?P<ts>[^\]]+)\]\s+(?P<msg>.+)'),
]
def parse_log(line: str) -> dict | None:
"""Try multiple patterns to parse a log line."""
for pattern in LOG_PATTERNS:
match = pattern.match(line)
if match:
return match.groupdict()
return None
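A quick run of the dispatcher; the first two patterns are restated so the snippet stands alone, and the sample log lines are invented:

```python
import re

LOG_PATTERNS = [
    # ISO: 2026-03-15T10:30:45 [LEVEL] message
    re.compile(r'(?P<ts>\d{4}-\d{2}-\d{2}T[\d:]+)\s+\[(?P<level>\w+)\]\s+(?P<msg>.+)'),
    # Syslog: Mar 15 10:30:45 host message
    re.compile(r'(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+(?P<host>\S+)\s+(?P<msg>.+)'),
]

def parse_log(line):
    """Return the first matching pattern's captures, or None."""
    for pattern in LOG_PATTERNS:
        match = pattern.match(line)
        if match:
            return match.groupdict()
    return None

print(parse_log('2026-03-15T10:30:45 [ERROR] Database connection failed'))
print(parse_log('Mar 15 10:30:45 web-01 session opened'))
print(parse_log('no timestamp here'))  # None
```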
Challenge 2: Validate Complex Identifiers
import re
# Kubernetes resource name: lowercase alphanumeric, hyphens, max 63 chars
K8S_NAME = re.compile(r'^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$')
# Docker image reference
DOCKER_IMAGE = re.compile(r'''
^
(?:(?P<registry>[a-z0-9.-]+(?::\d+)?)/)? # Optional registry
(?P<name>[a-z0-9]+(?:[._-][a-z0-9]+)*(?:/[a-z0-9]+(?:[._-][a-z0-9]+)*)*)  # Name, with optional path components
(?::(?P<tag>[a-zA-Z0-9._-]+))? # Optional tag
(?:@(?P<digest>sha256:[a-f0-9]{64}))? # Optional digest
$
''', re.VERBOSE)
# Test
images = [
"nginx",
"nginx:latest",
"registry.example.com/nginx:v1.2.3",
"gcr.io/project/image@sha256:" + "a" * 64,
]
for img in images:
match = DOCKER_IMAGE.match(img)
if match:
print(f"✓ {img}: {match.groupdict()}")
Performance Tips
| Technique | Why |
|---|---|
| Anchor patterns | `^` and `$` let mismatches fail immediately |
| Use character classes | `[^X]*` finds its boundary without backtracking |
| Avoid nested quantifiers | Patterns like `(a+)+` invite catastrophic backtracking |
| Compile patterns | Reuse compiled patterns in loops |
| Fail fast | Put most-likely-to-fail conditions first |
| Use possessive when safe | `++` and `(?>…)` forbid backtracking entirely |
Exercises to Complete
- [ ] Write a pattern to match JSON strings (handling escapes)
- [ ] Parse multi-line stack traces into structured data
- [ ] Match nested function calls to any depth
- [ ] Create a pattern that validates URLs with query parameters
- [ ] Build a log parser that handles 3 different timestamp formats
Self-Check
Solutions
import re
# 1. JSON strings with escapes
JSON_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')
# 2. Multi-line stack traces
STACK_TRACE = re.compile(
r'Stack trace:\n(?P<frames>(?:\s+at .+\n)+)',
re.MULTILINE
)
# 3. Nested function calls (requires regex module)
import regex
NESTED_CALLS = regex.compile(r'\w+\((?:[^()]+|(?R))*\)')
# 4. URL with query params
URL_PATTERN = re.compile(r'''
^
(?P<scheme>https?)://
(?P<host>[a-zA-Z0-9.-]+)
(?::(?P<port>\d+))?
(?P<path>/[^\s?#]*)?
(?:\?(?P<query>[^\s#]*))?
(?:\#(?P<fragment>[^\s]*))?
$
''', re.VERBOSE)
# 5. Multi-format timestamp parser
TIMESTAMP_PATTERNS = {
'iso': re.compile(r'\d{4}-\d{2}-\d{2}T[\d:]+'),
'syslog': re.compile(r'\w{3}\s+\d+\s+[\d:]+'),
'apache': re.compile(r'\[\d+/\w+/\d+:[\d:]+\s+[+-]\d+\]'),
}
Next Session
Session 10: Performance & Optimization - Write efficient patterns that scale.