Regex Patterns

Regular expression syntax across BRE, ERE, and PCRE flavors.

The Three Flavors

Flavor	Flag	Tools	Key Differences
BRE	(default)	grep, sed, ed	`\+`, `\?`, `\|`, `\(\)` need escaping
ERE	-E	grep -E, sed -E, awk	`+`, `?`, `|`, `()` work without escaping
PCRE	-P	grep -P, perl, python (`re` is Perl-style, not full PCRE)	`\d`, `\w`, `\b`, lookahead/behind

Quick test of which flavors your grep supports:

echo "test123" | grep -P '\d+'    # PCRE - should print test123
echo "test123" | grep -E '[0-9]+' # ERE - should print test123
echo "test123" | grep '[0-9]\+'   # BRE - note escaped +

Metacharacters

Char	Meaning	Example
`.`	Any single character (except newline)	`c.t` matches cat, cut, c9t
`*`	Zero or more of previous	`go*d` matches gd, god, good, goood
`+`	One or more of previous (ERE/PCRE)	`go+d` matches god, good (not gd)
`?`	Zero or one of previous (ERE/PCRE)	`colou?r` matches color, colour
`|`	Alternation (OR); written `\|` in GNU BRE	`cat|dog` matches cat or dog
`^`	Start of line	`^#` matches lines starting with #
`$`	End of line	`\.conf$` matches lines ending in .conf
`\`	Escape metacharacter	`\.` matches literal dot
`[]`	Character class	`[aeiou]` matches any vowel
`()`	Grouping / capture	`(ab)+` matches ab, abab, ababab

Character Classes

Custom Classes

[abc]       # a, b, or c
[a-z]       # lowercase letter
[A-Z]       # uppercase letter
[0-9]       # digit
[a-zA-Z]    # any letter
[a-zA-Z0-9] # alphanumeric
[^abc]      # NOT a, b, or c (negation)
[^0-9]      # NOT a digit
[-abc]      # literal hyphen (first position)
[abc-]      # literal hyphen (last position)
[]abc]      # literal ] (first position)
[.?*]       # metacharacters literal inside []

POSIX Classes (BRE/ERE)

[[:alnum:]]  # alphanumeric [a-zA-Z0-9]
[[:alpha:]]  # alphabetic [a-zA-Z]
[[:digit:]]  # digit [0-9]
[[:lower:]]  # lowercase [a-z]
[[:upper:]]  # uppercase [A-Z]
[[:space:]]  # whitespace (space, tab, newline)
[[:blank:]]  # space or tab only
[[:punct:]]  # punctuation
[[:xdigit:]] # hex digit [0-9A-Fa-f]

PCRE Shortcuts (grep -P)

\d    # digit [0-9]
\D    # NOT digit [^0-9]
\w    # word char [a-zA-Z0-9_]
\W    # NOT word char
\s    # whitespace
\S    # NOT whitespace
\b    # word boundary
\B    # NOT word boundary

Quantifiers

Greedy	Lazy (PCRE)	Meaning	Example
`*`	`*?`	0 or more	`a*` matches "", a, aa, aaa
`+`	`+?`	1 or more	`a+` matches a, aa, aaa (not "")
`?`	`??`	0 or 1	`a?` matches "", a
`{n}`	`{n}?`	exactly n	`a{3}` matches aaa only
`{n,}`	`{n,}?`	n or more	`a{2,}` matches aa, aaa, aaaa…
`{n,m}`	`{n,m}?`	n to m	`a{2,4}` matches aa, aaa, aaaa

Greedy vs Lazy

# Greedy (default): match as MUCH as possible
echo "<div>content</div>" | grep -oP '<.*>'
# Output: <div>content</div>

# Lazy: match as LITTLE as possible
echo "<div>content</div>" | grep -oP '<.*?>'
# Output: <div>
#         </div>

Anchors and Boundaries

^pattern    # Start of line
pattern$    # End of line
^$          # Empty line
^.+$        # Non-empty line
\bword\b    # Whole word only (PCRE; GNU grep BRE/ERE also accept \b)
\<word\>    # Whole word only (GNU grep BRE/ERE)

# Examples
grep '^#' file           # Comment lines
grep '\.conf$' file      # Lines ending in .conf
grep -v '^$' file        # Non-empty lines
grep -P '\berror\b' file # "error" as whole word

Groups and Capture

Basic Grouping

(pattern)      # Capture group - saves match
(?:pattern)    # Non-capture group (PCRE) - just grouping

# Alternation within group
grep -E '(cat|dog)' file      # cat or dog
grep -E 'gr(a|e)y' file       # gray or grey
grep -E '(ab)+' file          # ab, abab, ababab

# BRE requires escaping
grep '\(cat\|dog\)' file      # BRE version

Backreferences

\1    # First capture group
\2    # Second capture group

# Match repeated characters
grep -E '(.)\1' file          # aa, bb, cc...
grep -E '(.)\1\1' file        # aaa, bbb, ccc...

# Match repeated words
grep -E '\b(\w+)\s+\1\b' file # "the the", "is is"

# sed replacement with groups
echo "hello world" | sed -E 's/(\w+) (\w+)/\2 \1/'
# Output: world hello

Named Groups (PCRE)

(?<name>pattern)   # Named capture
\k<name>           # Named backreference

# Example: Match IP and reference by name
grep -P '(?<ip>\d+\.\d+\.\d+\.\d+).*\k<ip>' file
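
To see that \k<ip> requires the same text again rather than just any address, the pattern can be run against two lines. This is a minimal sketch assuming GNU grep with PCRE support; the two sample lines are made up for illustration.

```shell
# Two made-up lines: only the first repeats the same address.
lines='src=10.0.0.1 dst=10.0.0.1
src=10.0.0.1 dst=10.0.0.2'

# \k<ip> must match the exact text captured by (?<ip>...), not just any IP.
same_ip=$(printf '%s\n' "$lines" | grep -P '(?<ip>\d+\.\d+\.\d+\.\d+).*\k<ip>')
echo "$same_ip"
# Prints: src=10.0.0.1 dst=10.0.0.1
```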

Lookahead and Lookbehind (PCRE)

These match a position, not characters. They don’t consume input.

Syntax	Name	Matches
`(?=pattern)`	Positive lookahead	Position followed by pattern
`(?!pattern)`	Negative lookahead	Position NOT followed by pattern
`(?<=pattern)`	Positive lookbehind	Position preceded by pattern
`(?<!pattern)`	Negative lookbehind	Position NOT preceded by pattern

Examples

# Find "error" followed by a number
grep -P 'error(?=\s*\d)' file

# Find "error" NOT followed by "404" anywhere later on the line
grep -P 'error(?!.*404)' file

# Extract price (digits after $); single quotes stop the shell from expanding $1
echo 'Price: $199' | grep -oP '(?<=\$)\d+'
# Output: 199

# Find port numbers after a colon, unless that colon belongs to "127.0.0.1:"
grep -P ':(?<!127\.0\.0\.1:)\d{2,5}\b' file

Password Validation Pattern

# At least: 8 chars, 1 upper, 1 lower, 1 digit
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$
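
The pattern can be tried out with grep -P. This is a minimal sketch: the candidate passwords are made up for illustration, and GNU grep built with PCRE support is assumed.

```shell
# Hypothetical sample passwords, one per line.
candidates='short1A
alllowercase1
NoDigitsHere
Valid1Password'

# Keep only the lines that satisfy all three lookaheads and the length check.
accepted=$(printf '%s\n' "$candidates" | grep -P '^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$')
echo "$accepted"
# Prints: Valid1Password
```

Each rejected line fails exactly one condition (length, uppercase, digit), which makes it easy to see which lookahead does what.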

Flavor Comparison Table

Feature	BRE	ERE	PCRE
One or more `+`	`\+`	`+`	`+`
Zero or one `?`	`\?`	`?`	`?`
Alternation `|`	`\|`	`|`	`|`
Grouping `()`	`\(\)`	`()`	`()`
`\d` for digits	No	No	Yes
`\w` for word	GNU ext.	GNU ext.	Yes
`\b` word boundary	GNU ext.	GNU ext.	Yes
Lookahead	No	No	Yes
Lookbehind	No	No	Yes
Lazy quantifiers	No	No	Yes
Non-capture groups	No	No	Yes

Drills: Basic Patterns

Practice these on regex101.com, then replicate them in the terminal.

Test Data

10.50.1.1 - - [12/Mar/2026:10:23:45 +0000] "GET /api/v1/users HTTP/1.1" 200 1234
192.168.1.100 - admin [12/Mar/2026:10:23:46 +0000] "POST /login HTTP/1.1" 401 89
10.50.1.20 - - [12/Mar/2026:10:23:47 +0000] "GET /health HTTP/1.1" 200 15
172.16.0.50 - evan [12/Mar/2026:10:23:48 +0000] "DELETE /api/v1/users/5 HTTP/1.1" 403 201
fe80::1 - - [12/Mar/2026:10:23:49 +0000] "GET /metrics HTTP/1.1" 200 8492
MAC: 14:F6:D8:7B:31:80 assigned to VLAN 10
MAC: 98:BB:1E:1F:A7:13 assigned to VLAN 20
error: connection refused to 10.50.1.50:389
warning: certificate expires in 30 days
ERROR: authentication failed for user 'admin'

Drill 1: IP Addresses

# Match any IPv4 address
\b\d{1,3}(?:\.\d{1,3}){3}\b

# Terminal:
grep -oP '\b\d{1,3}(?:\.\d{1,3}){3}\b' file

Drill 2: 10.x.x.x Network Only

# Only 10.x.x.x addresses
\b10(?:\.\d{1,3}){3}\b

# Terminal:
grep -oP '\b10(?:\.\d{1,3}){3}\b' file

Drill 3: MAC Addresses

# Standard MAC format (colon-separated)
\b([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b

# Terminal:
grep -oP '\b([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b' file

Drill 4: HTTP Status Codes

# 4xx errors
" (4\d{2}) "

# Terminal:
grep -oP '" \K4\d{2}(?= )' file

# All status codes
grep -oP '" \K[0-9]{3}(?= )' file

Drill 5: Usernames (not -)

# Capture username field (third field, when it is not -)
 - (\w+) \[

# Terminal (\w+ cannot match "-", so anonymous requests are skipped):
grep -oP ' - \K\w+(?= \[)' file

Drills: Intermediate

Drill 6: HTTP Methods

# Extract HTTP method (opening quote, method, then whitespace)
"(GET|POST|PUT|DELETE|PATCH)\s

# Terminal:
grep -oP '"\K(GET|POST|PUT|DELETE|PATCH)' file

Drill 7: Request Paths

# Path after method
"(?:GET|POST|PUT|DELETE|PATCH) ([^ ]+)

# Terminal:
grep -oP '"(?:GET|POST|PUT|DELETE|PATCH) \K[^ ]+' file

Drill 8: Log Levels (Case-Insensitive)

# Match error/warning/ERROR/WARNING etc.
(?i)\b(error|warn(?:ing)?|fatal|critical)\b

# Terminal:
grep -iP '\b(error|warn(ing)?|fatal|critical)\b' file

Drill 9: Port Numbers

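# Extract port after IP:port (\K drops the IP from the reported match,
# so timestamps and MAC addresses are not picked up)
\b\d{1,3}(?:\.\d{1,3}){3}:\K\d{2,5}\b

# Terminal:
grep -oP '\b\d{1,3}(?:\.\d{1,3}){3}:\K\d{2,5}\b' file
# Output on the test data: 389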

Drill 10: VLAN IDs

# VLAN followed by number
VLAN\s+(\d+)

# Terminal:
grep -oP 'VLAN\s+\K\d+' file

Drills: Advanced

Drill 11: Date Extraction

# Extract date from log brackets
\[(\d{2}/\w{3}/\d{4})

# Terminal:
grep -oP '\[\K\d{2}/\w{3}/\d{4}' file

Drill 12: Failed Auth Lines

# Lines with 401 or 403 status
" (40[13]) "

# Full line extraction:
grep -P '" 40[13] ' file

Drill 13: Extract Quoted Strings

# Content inside double quotes
"([^"]+)"

# Terminal:
grep -oP '"[^"]+"' file

Drill 14: IPv6 Detection

# Simple IPv6 pattern (link-local example)
fe80::[0-9a-fA-F:]+

# Terminal:
grep -oP 'fe80::[0-9a-fA-F:]+' file

Drill 15: Certificate Expiry Days

# Extract number from "expires in X days"
expires in (\d+) days

# Terminal:
grep -oP 'expires in \K\d+(?= days)' file

Common Infrastructure Patterns

Network

# IPv4
\b\d{1,3}(?:\.\d{1,3}){3}\b

# IPv4 with CIDR
\b\d{1,3}(?:\.\d{1,3}){3}/\d{1,2}\b

# MAC (colon)
\b([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b

# MAC (hyphen - Windows style)
\b([0-9A-Fa-f]{2}-){5}[0-9A-Fa-f]{2}\b

# Port number (any 1-5 digit number; the 0-65535 range is not enforced)
\b([0-9]{1,5})\b

# FQDN
\b[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z]{2,})+\b

# URL
https?://[^\s<>"]+

Security

# JWT token (3 base64url parts separated by dots)
eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+

# API key patterns (generic)
[A-Za-z0-9]{32,}

# AWS Access Key
AKIA[0-9A-Z]{16}

# Private key header
-----BEGIN [A-Z ]+ PRIVATE KEY-----

# Password in URL (security audit)
://[^:]+:([^@]+)@
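
The password-in-URL pattern can also drive a redaction step. The sketch below uses a slightly stricter variant of it ([^:/@]+ for the user part, so the match cannot start in the scheme or host) and a made-up connection string; a sed that supports -E is assumed.

```shell
# Made-up connection string containing credentials.
url='postgres://app_user:s3cr3t@db.internal:5432/orders'

# Keep "://user:" (group 1) and replace the password up to the "@" with a mask.
masked=$(printf '%s\n' "$url" | sed -E 's#(://[^:/@]+:)[^@]+@#\1****@#')
echo "$masked"
# Prints: postgres://app_user:****@db.internal:5432/orders
```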

Logs

# Syslog timestamp
[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}

# ISO 8601
\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}

# Apache/Nginx combined log (full line)
^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+)

# Error levels
\b(EMERG|ALERT|CRIT|ERR|WARN|NOTICE|INFO|DEBUG)\b

Config Files

# Key=Value
^(\w+)\s*=\s*(.+)$

# YAML key: value
^(\s*)(\w+):\s*(.+)?$

# Comment lines (# or //)
^\s*(#|//).*$

# Empty or whitespace-only lines
^\s*$

# INI section headers
^\[([^\]]+)\]$
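
As a rough illustration of the section-header and Key=Value patterns in use, the sketch below runs them through sed -E. The config text is made up, and \w and \s are spelled out as POSIX classes because they are not portable in ERE.

```shell
# Made-up INI-style config used only for this demo.
config='# database settings
[database]
host = db.internal
port=5432'

# Section headers: print the name inside [...].
sections=$(printf '%s\n' "$config" | sed -nE 's/^\[([^]]+)\]$/\1/p')

# Key=Value pairs, normalized to "key=value" with surrounding spaces removed.
pairs=$(printf '%s\n' "$config" |
  sed -nE 's/^([[:alnum:]_]+)[[:space:]]*=[[:space:]]*(.+)$/\1=\2/p')

echo "$sections"   # database
echo "$pairs"      # host=db.internal, then port=5432
```

The comment line and the section header are skipped by the Key=Value expression because neither starts with a word character followed by =.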

Regex in sed and awk

sed Examples

# Basic substitution
sed 's/old/new/' file

# Global (all occurrences on line)
sed 's/old/new/g' file

# Using capture groups
sed -E 's/([0-9]+)/[\1]/' file           # 123 -> [123]

# Swap two words
sed -E 's/(\w+) (\w+)/\2 \1/' file

# Delete lines matching pattern
sed '/pattern/d' file

# Print only matching lines (like grep)
sed -n '/pattern/p' file

# In-place edit
sed -i 's/old/new/g' file
sed -i.bak 's/old/new/g' file    # With backup

awk Examples

# Match pattern
awk '/pattern/ {print}' file

# Match and extract field
awk '/error/ {print $1, $NF}' file

# Regex on specific field
awk '$1 ~ /^10\./ {print}' file           # First field starts with 10.

# Negative match
awk '$1 !~ /^192\./ {print}' file         # First field NOT starting with 192.

# gsub (global substitution)
awk '{gsub(/old/, "new"); print}' file

# sub (first occurrence only)
awk '{sub(/old/, "new"); print}' file

# Match and extract with gensub (GNU awk)
awk '{print gensub(/.*:([0-9]+).*/, "\\1", "g")}' file

Common Gotchas

Escaping in Different Contexts

# Shell escaping - use single quotes for regex
grep 'pattern with $special' file         # $ is literal
grep "pattern with $special" file         # $ is shell variable!

# BRE vs ERE escaping
grep 'a\+b' file      # BRE: one or more a
grep -E 'a+b' file    # ERE: same thing

# Escaping in character class
grep '[.?*]' file     # Metacharacters are literal inside []
grep '[^abc]' file    # ^ means NOT only at start of []

Greedy Matching Trap

# Problem: Greedy .* matches too much
echo '<div>one</div><div>two</div>' | grep -oP '<div>.*</div>'
# Output: <div>one</div><div>two</div>  (one match, not two!)

# Solution: Lazy .*?
echo '<div>one</div><div>two</div>' | grep -oP '<div>.*?</div>'
# Output: <div>one</div>
#         <div>two</div>

# Alternative: Negated class (works in BRE/ERE too)
echo '<div>one</div><div>two</div>' | grep -oE '<div>[^<]*</div>'

Word Boundary Differences

# PCRE word boundary
grep -P '\bword\b' file

# GNU BRE/ERE word boundary
grep '\<word\>' file

# Portable ERE alternative (the match also includes the neighboring non-letter characters)
grep -E '(^|[^a-zA-Z])word($|[^a-zA-Z])' file

Newline Handling

# . does NOT match newline by default
# Use -z so grep reads NUL-delimited records (a whole text file becomes one record)

# Match across lines (GNU grep); (?s) lets . match the newlines
grep -Pzo '(?s)start.*?end' file

# In sed, use N to read next line
sed 'N;s/line1\nline2/replaced/' file

Atomic Groups and Possessive Quantifiers (PCRE)

These prevent backtracking - once matched, the engine won’t give up characters. Critical for performance and avoiding catastrophic backtracking.

Possessive Quantifiers

Add + after any quantifier to make it possessive:

Greedy	Possessive	Behavior
`*`	`*+`	0+ chars, no backtrack
`+`	`++`	1+ chars, no backtrack
`?`	`?+`	0 or 1, no backtrack
`{n,m}`	`{n,m}+`	n-m chars, no backtrack

# Compare greedy vs possessive
echo "aaaaaaaaab" | grep -P 'a+b'     # Matches - greedy backtracks
echo "aaaaaaaaab" | grep -P 'a++b'    # Matches - no backtrack needed

echo "aaaaaaaaaa" | grep -P 'a+b'     # No match (backtracks, tries, fails)
echo "aaaaaaaaaa" | grep -P 'a++b'    # No match (fails fast, no backtrack)

Atomic Groups

(?>pattern) - Once matched, contents cannot be backtracked into.

# Atomic group syntax
(?>pattern)

# Example: atomic group around the integer part of a number
echo "3.14" | grep -oP '(?>\d+)\.?\d*'
# Output: 3.14 (atomic group matches "3" and keeps it, then \.?\d* matches ".14")

# Without atomic group
echo "3.14" | grep -oP '(\d+)\.?\d*'
# Same output, but on a failed match the engine may backtrack into \d+ unnecessarily

When to Use

# Pattern that causes catastrophic backtracking
# DON'T: (a+)+ against "aaaaaaaaaaaaaaaaaaaaX"

# DO: Use possessive or atomic
(?>a+)+     # Atomic group prevents inner backtrack
a++         # Possessive on inner quantifier

# Real example: Matching quoted strings efficiently
# DON'T: ".*"  (backtracks excessively on non-matches)
# DO:    "[^"]*"  (no backtracking possible)
# OR:    "(?>[^"]*)"  (atomic for complex inner patterns)

Recursive Patterns (PCRE)

Match nested structures like parentheses, HTML tags, JSON.

Basic Recursion

(?R)        # Recurse entire pattern
(?0)        # Same as (?R)
(?1)        # Recurse first capture group
(?2)        # Recurse second capture group
(?&name)    # Recurse named group
(?P>name)   # Python-style named recursion

Matching Balanced Parentheses

# Match balanced parens: (), (()), ((())), etc.
\((?:[^()]*|(?R))*\)

# Breakdown:
# \(        opening paren
# (?:       non-capture group for contents:
#   [^()]*    any non-paren chars
#   |         OR
#   (?R)      recurse entire pattern (nested parens)
# )*        zero or more content items
# \)        closing paren

# Test:
echo "(a(b(c)d)e)" | grep -oP '\((?:[^()]*|(?R))*\)'
# Output: (a(b(c)d)e)

echo "((nested))" | grep -oP '\((?:[^()]*|(?R))*\)'
# Output: ((nested))

Matching Nested Structures

# Match balanced braces (JSON-like)
\{(?:[^{}]*|(?R))*\}

# Match balanced brackets
\[(?:[^\[\]]*|(?R))*\]

# Match balanced angle brackets (XML-like)
<(?:[^<>]*|(?R))*>

Group Recursion

# Recurse specific group instead of whole pattern
# (?1) recurses group 1

# Pattern: word = (nested stuff)
(\w+)\s*=\s*(\((?:[^()]*|(?2))*\))

# (?2) recurses only the paren-matching group
echo "config = ((a)(b))" | grep -oP '(\w+)\s*=\s*(\((?:[^()]*|(?2))*\))'
# Output: config = ((a)(b))

Named Group Recursion

# Define pattern once, recurse by name
(?<parens>\((?:[^()]*|(?&parens))*\))

# Match: expression (with (nested) parens)
\w+\s*(?<parens>\((?:[^()]*|(?&parens))*\))

echo "func((a,b),(c,d))" | grep -oP '\w+(?<parens>\((?:[^()]*|(?&parens))*\))'
# Output: func((a,b),(c,d))

Conditional Patterns (PCRE)

Match different patterns based on conditions.

Syntax

(?(condition)yes-pattern|no-pattern)
(?(condition)yes-pattern)    # No-pattern is empty match

Conditional on Capture Group

# (?(1)...|...) - true if group 1 matched
# Match optional opening paren, require closing if present

(\()?[a-z]+(?(1)\))

# Breakdown:
# (\()?     optional capture of opening paren (group 1)
# [a-z]+    one or more letters
# (?(1)\))  IF group 1 matched, require closing paren

echo "abc" | grep -oP '(\()?[a-z]+(?(1)\))'
# Output: abc (no parens needed)

echo "(abc)" | grep -oP '(\()?[a-z]+(?(1)\))'
# Output: (abc) (parens balanced)

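echo "(abc" | grep -oP '^(\()?[a-z]+(?(1)\))$'
# No output (opening but no closing). The anchors matter here: unanchored,
# the engine retries at the "a", leaves group 1 unset, and prints abc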

Conditional on Lookahead

# (?(?=condition)yes|no)
# Match based on what follows

# If the next character is a digit, match a number; otherwise match a word
(?(?=\d)\d+|\w+)

echo "abc 123" | grep -oP '(?(?=\d)\d+|\w+)'
# Output: abc
#         123

Practical Examples

# Match phone with optional country code
# If the number starts with +, require a country code before the local pattern
(?(?=\+)\+\d{1,3}[-\s]?)\d{3}[-\s]?\d{3}[-\s]?\d{4}

# Match quoted or unquoted value
# (")? captures optional quote, (?(1)") requires closing if present
(")?[^",]+(?(1)")

echo 'field,"quoted value",plain' | grep -oP '(")?[^",]+(?(1)")'
# Output: field
#         "quoted value"
#         plain

Branch Reset Groups (PCRE)

(?|…) - Alternatives share the same group numbers. Useful when different patterns should populate the same capture group.

Without Branch Reset

# Normal alternation - different group numbers
(cat)|(dog)|(bird)

# "cat"  → $1 = "cat",  $2 = undef, $3 = undef
# "dog"  → $1 = undef,  $2 = "dog", $3 = undef
# "bird" → $1 = undef,  $2 = undef, $3 = "bird"

With Branch Reset

# Branch reset - same group number for all alternatives
(?|(cat)|(dog)|(bird))

# "cat"  → $1 = "cat"
# "dog"  → $1 = "dog"
# "bird" → $1 = "bird"

# All alternatives populate group 1!
echo -e "cat\ndog\nbird" | grep -oP '(?|(cat)|(dog)|(bird))'
# Output: cat
#         dog
#         bird

Practical Use Case

# Extract date in multiple formats, normalize to one group
# Format 1: 2026-03-14  (ISO)
# Format 2: 03/14/2026  (US)
# Format 3: 14.03.2026  (EU)

(?|(\d{4})-(\d{2})-(\d{2})|(\d{2})/(\d{2})/(\d{4})|(\d{2})\.(\d{2})\.(\d{4}))

# This is messy - the captures don't align semantically
# Better to extract and post-process, but branch reset helps

# Extract IP:port or just IP (port optional)
(?|(\d+\.\d+\.\d+\.\d+):(\d+)|(\d+\.\d+\.\d+\.\d+)())

# Group 1 = IP, Group 2 = port (or empty)

Mode Modifiers (PCRE)

Change regex behavior inline. Apply to whole pattern or sections.

Global Modifiers

Modifier	Name	Effect
`(?i)`	Case insensitive	`a` matches `a` or `A`
`(?m)`	Multiline	`^` and `$` match line boundaries
`(?s)`	Single-line (dotall)	`.` matches newline too
`(?x)`	Extended (free-spacing)	Whitespace ignored, `#` comments allowed
`(?U)`	Ungreedy	Quantifiers lazy by default

Inline Modifier Syntax

# Apply to entire pattern (at start)
(?i)pattern           # Case insensitive

# Apply to section only
normal(?i)insensitive(?-i)normal_again

# Multiple modifiers
(?im)pattern          # Case insensitive + multiline

# Negative (turn off)
(?-i)                 # Turn OFF case insensitivity

Case Insensitive `(?i)`

# Match ERROR, error, Error, etc.
(?i)error

echo -e "ERROR\nerror\nError" | grep -P '(?i)error'
# Matches all three

# Same as grep -i:
grep -iP 'error' file

# Apply to portion only:
(?i:error)\s+(\d+)    # "ERROR 404", "error 500" - (?i:...) affects only the text inside the group

Multiline `(?m)`

# Without (?m): ^ matches start of STRING only
# With (?m): ^ matches start of each LINE

# Match lines starting with #
(?m)^#.*$

# grep and perl -n are line-oriented, so ^ already matches at each line there
# (?m) is useful in contexts where input is one big string

# Example in a Perl one-liner (-0777 slurps all input into one string):
echo -e "line1\n#comment\nline2" | perl -0777 -ne 'print "$&\n" while /(?m)^#.*$/g'
# Output: #comment

Single-line/Dotall `(?s)`

# Without (?s): . matches any char EXCEPT newline
# With (?s): . matches newline too

# Match everything between START and END, including newlines
(?s)START.*?END

# Example:
echo -e "START\nmulti\nline\nEND" | grep -Pzo '(?s)START.*?END'
# Output: START
#         multi
#         line
#         END

# grep -z treats input as null-delimited (whole file as one record)

Extended `(?x)` (Free-Spacing)

# Whitespace ignored, # starts comments
# Allows readable complex patterns

(?x)
  ^                    # Start of line
  (\d{3})              # Area code
  [-.\s]?              # Optional separator
  (\d{3})              # Exchange
  [-.\s]?              # Optional separator
  (\d{4})              # Subscriber
  $                    # End of line

# Equivalent to:
^(\d{3})[-.\s]?(\d{3})[-.\s]?(\d{4})$

# In grep (must escape or use single line):
grep -P '(?x) ^ (\d{3}) [-.\s]? (\d{3}) [-.\s]? (\d{4}) $' file

Ungreedy `(?U)`

# Without (?U): quantifiers greedy by default
# With (?U): quantifiers lazy by default

# Match minimal between tags
(?U)<tag>.*</tag>

# Same as:
<tag>.*?</tag>

# With (?U), .* is lazy, .*? becomes greedy (inverted!)
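
The inversion is easy to check side by side. A minimal sketch, assuming GNU grep with PCRE support; the tag text is made up.

```shell
html='<tag>one</tag><tag>two</tag>'

# Default greedy .* runs to the last </tag>: one match.
greedy=$(printf '%s\n' "$html" | grep -oP '<tag>.*</tag>')

# (?U) flips the default, so plain .* stops at the first </tag>: two matches.
ungreedy=$(printf '%s\n' "$html" | grep -oP '(?U)<tag>.*</tag>')

# Under (?U), .*? is the greedy form again: one match.
inverted=$(printf '%s\n' "$html" | grep -oP '(?U)<tag>.*?</tag>')

echo "$greedy"
echo "$ungreedy"
echo "$inverted"
```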

Unicode Properties (PCRE)

Match characters by Unicode category, script, or property. Requires PCRE2 or Perl with Unicode support.

General Categories

\p{L}    # Any letter (any script)
\p{Ll}   # Lowercase letter
\p{Lu}   # Uppercase letter
\p{N}    # Any number
\p{Nd}   # Decimal digit
\p{P}    # Punctuation
\p{S}    # Symbol
\p{Z}    # Separator (space, line, paragraph)
\p{C}    # Other (control, format, unassigned, private use, surrogate)

\P{L}    # NOT a letter (uppercase P negates)

Script Matching

\p{Latin}      # Latin script
\p{Greek}      # Greek script
\p{Cyrillic}   # Cyrillic script
\p{Han}        # Chinese characters
\p{Hiragana}   # Japanese hiragana
\p{Katakana}   # Japanese katakana
\p{Arabic}     # Arabic script
\p{Hebrew}     # Hebrew script

# Long form (Perl and newer PCRE2 releases):
\p{Script=Latin}
\p{Script=Greek}

# Match word in any script:
[\p{L}\p{M}]+  # Letters + combining marks

Practical Examples

# Match any letter (international)
echo "Héllo Wörld 你好" | grep -oP '\p{L}+'
# Output: Héllo
#         Wörld
#         你好

# Match email with international characters
[\p{L}\p{N}._%+-]+@[\p{L}\p{N}.-]+\.\p{L}{2,}

# Match only ASCII letters (not international)
[A-Za-z]+
# vs any letter:
\p{L}+

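# Detect non-ASCII characters (security audit)
[^\x00-\x7F]
# Or (Perl and newer PCRE2 releases only):
\P{ASCII}

# Match emoji (basic)
[\x{1F300}-\x{1F9FF}]
# Or by property (Perl and newer PCRE2 releases only; also matches digits, # and *):
\p{Emoji}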

Unicode Character Classes

# Combining marks (accents, etc.)
\p{M}      # Any mark
\p{Mn}     # Non-spacing mark
\p{Mc}     # Spacing combining mark

# Letter + marks (proper word matching)
[\p{L}\p{M}]+

# Example: Match accented words
echo "café naïve résumé" | grep -oP '[\p{L}\p{M}]+'
# Output: café
#         naïve
#         résumé

Common Unicode Gotchas

# \w may NOT match international letters, depending on the grep/PCRE build
echo "Müller" | grep -oP '\w+'
# Output on older GNU grep (ü not matched):
#   M
#   ller
# Newer GNU grep enables Unicode properties in UTF-8 locales and prints Müller

# Fix: Use \p{L} or enable Unicode mode
echo "Müller" | grep -oP '[\p{L}\p{N}_]+'
# Output: Müller

# PCRE/PCRE2: put (*UCP) at pattern start; Perl: use the /u flag or (?u)
echo "Müller" | grep -oP '(*UCP)\w+'
# Output: Müller

# Byte vs character length (echo adds a trailing newline to both counts)
echo "café" | wc -c   # 6 bytes (é = 2 bytes in UTF-8, plus the newline)
echo "café" | wc -m   # 5 characters (4 letters plus the newline)

Regex Engine Internals

Understanding NFA vs DFA and catastrophic backtracking.

NFA vs DFA Engines

Feature	NFA (grep -P, Perl)	DFA (grep -E, awk)
Backreferences	Yes	No (GNU grep -E accepts them by falling back to a slower backtracking matcher)
Lookaround	Yes	No
Lazy quantifiers	Yes	No
Atomic groups	Yes	No
Speed guarantee	No (can backtrack)	Yes (linear time)
Tools	grep -P, Perl, Python	grep -E, egrep, awk

Catastrophic Backtracking

# The evil pattern: nested quantifiers with overlap, followed by something that fails
(a+)+b

# Against "aaaaaaaaaaaaaaaaaaaaX":
# Engine tries every possible way to divide the a's
# 2^n combinations = exponential time = hang/crash

# Worse: (a*)*  or  (a+)*  or  (a|aa)+

# Demonstration (DON'T run on long strings):
# echo "aaaaaaaaaaaaaaaaaaaaaaaaX" | grep -P '(a+)+b'
# Depending on the PCRE version and its optimizations this may return at once,
# run for a very long time, or abort with a backtracking-limit error

# Real-world examples of vulnerable patterns:
(.*a)+           # "aaaa...X" causes backtracking
(x+x+)+y         # Overlapping x's
(\w+)*           # Any word char, nested quantifiers

Fixing Catastrophic Backtracking

# Method 1: Possessive quantifiers
(a+)+     # BAD - catastrophic
(a++)     # GOOD - possessive inner prevents backtrack

# Method 2: Atomic groups
(a+)+     # BAD
(?>a+)    # GOOD - atomic group

# Method 3: Eliminate alternation overlap
(a|aa)+   # BAD - overlapping alternatives
a+        # GOOD - just match all a's

# Method 4: Use negated character class instead of .*
".*"          # BAD with nested quotes
"[^"]*"       # GOOD - can't backtrack

# Method 5: Anchor patterns
.*foo         # Potentially slow on long non-matching lines
^.*foo        # Better - only one starting position is tried per line

ReDoS (Regular Expression Denial of Service)

# Vulnerable patterns to NEVER use with untrusted input:
(a+)+
([a-zA-Z]+)*
(a|aa)+
(.*a){n}     # Where n is significant

# Security audit: Find vulnerable patterns in code
grep -rP '\([^)]*[+*]\)[+*]' --include="*.py"
grep -rP '\.\*[^?]' --include="*.py"

# Safe alternatives:
# 1. Use atomic groups/possessive quantifiers
# 2. Set regex timeout in your language
# 3. Limit input length before regex
# 4. Use DFA engine for untrusted input
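
Safe alternatives 2 and 3 can be combined in a small wrapper. This is only a sketch: safe_match and MAX_LEN are hypothetical names, the 256-character cap and 2-second budget are arbitrary, and the timeout command from GNU coreutils is assumed.

```shell
# Hypothetical guard for untrusted input: cap the length first, then run the
# match under a time budget.
MAX_LEN=256

safe_match() {
  # $1 = PCRE pattern, $2 = untrusted input
  if [ "${#2}" -gt "$MAX_LEN" ]; then
    echo "rejected: input too long"
    return 0
  fi
  if printf '%s\n' "$2" | timeout 2 grep -qP -- "$1"; then
    echo "match"
  else
    # Covers no match, a grep error (e.g. backtracking limit), and a timeout.
    echo "no match"
  fi
}

safe_match '^[a-z]++\d+$' 'abc123'                # match
safe_match '(a+)+b' "$(printf '%0300d' 0)"        # rejected: input too long
```

A production version would likely report timeouts separately from ordinary non-matches; they are folded together here to keep the sketch short.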

Performance Tips

# 1. Anchor when possible
^pattern      # Much faster than searching entire line
pattern$      # Fails fast if line end doesn't match

# 2. Most specific first
(cat|catch)   # "cat" is tried first; "catch" is used only if the rest of the pattern forces backtracking
(catch|cat)   # Better - longer/specific first

# 3. Avoid .* at pattern start
.*foo         # Scans entire string
[^f]*foo      # Better - stops at first 'f'

# 4. Use non-capturing groups when capture not needed
(?:abc)+      # Faster than (abc)+

# 5. Character class vs alternation
[aeiou]       # Faster
(a|e|i|o|u)   # Slower (creates capture group + alternation overhead)

# 6. Pre-filter with fast grep before complex regex
grep 'error' file | grep -P 'complex(?=pattern)'

Drills: Expert Level

These patterns use advanced PCRE features.

Drill 16: Balanced Parentheses Validation

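# Match only strings that are one balanced paren group.
# (?1) recurses group 1 only; (?R) would re-apply the ^ and $ anchors inside
# the nested parens and reject every nested input.
^(\((?:[^()]++|(?1))*\))$

# Test data
printf '%s\n' '(valid)' '((nested))' '((broken)' '()()' | while IFS= read -r line; do
  printf '%s -> ' "$line"
  printf '%s\n' "$line" | grep -qP '^(\((?:[^()]++|(?1))*\))$' && echo "VALID" || echo "INVALID"
done
# Output: (valid) -> VALID
#         ((nested)) -> VALID
#         ((broken) -> INVALID
#         ()() -> INVALID (two groups side by side, not one)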

Drill 17: Atomic IP Validation

# Efficient IP validation with atomic groups
^(?>(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?>25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$

# Atomic groups prevent backtracking on invalid IPs

# Test:
echo -e "192.168.1.1\n256.1.1.1\n10.0.0.1\nabc" | grep -P '^(?>(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?>25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$'

Drill 18: Password Strength (Multiple Lookaheads)

# At least: 12 chars, 1 upper, 1 lower, 1 digit, 1 special
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{12,}$

# Extended with Unicode (international passwords):
^(?=.*\p{Ll})(?=.*\p{Lu})(?=.*\d)(?=.*[!@#$%^&*]).{12,}$

# Test (single quotes avoid history expansion of "!" in interactive shells):
printf '%s\n' 'weak' 'StrongPass1!' 'ValidPass123!' | grep -P '^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{12,}$'

Drill 19: Semver Parsing

# Semantic versioning: MAJOR.MINOR.PATCH(-prerelease)?(+build)?
^(?<major>0|[1-9]\d*)\.(?<minor>0|[1-9]\d*)\.(?<patch>0|[1-9]\d*)(?:-(?<prerelease>[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?(?:\+(?<build>[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?$

# Simpler version:
^\d+\.\d+\.\d+(-[a-zA-Z0-9.]+)?(\+[a-zA-Z0-9.]+)?$

# Test:
echo -e "1.2.3\n1.0.0-alpha\n2.1.0+build.123\n1.0.0-beta+exp.sha.5114f85" | \
  grep -P '^\d+\.\d+\.\d+(-[a-zA-Z0-9.]+)?(\+[a-zA-Z0-9.]+)?$'

Drill 20: RFC 5322 Email (Simplified)

# Practical email regex (not full RFC 5322, but handles 99%)
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

# With Unicode support:
^[\p{L}\p{N}._%+-]+@[\p{L}\p{N}.-]+\.\p{L}{2,}$

# More complete (handles quoted local parts):
^(?:[a-zA-Z0-9._%+-]+|"[^"]+")@[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?)*\.[a-zA-Z]{2,}$

Drill 21: JSON Key-Value Extraction

# Extract "key": "value" pairs from JSON
"([^"]+)":\s*"([^"]+)"

# With possible escapes in values:
"([^"]+)":\s*"((?:[^"\\]|\\.)*)"

# Test:
echo '{"name": "John", "city": "New York"}' | grep -oP '"([^"]+)":\s*"\K[^"]+'
# Output: John
#         New York

Drill 22: Nested HTML Tag Matching

# Match balanced <div>...</div> with nesting
<div\b[^>]*>(?:[^<]*|<(?!/?div\b)|(?R))*</div>

# Explanation:
# <div\b[^>]*>   opening <div> tag with attributes
# (?:            non-capture alternation:
#   [^<]*          text (no tags)
#   |<(?!/?div\b)  tag that's not div (using negative lookahead)
#   |(?R)          recurse for nested div
# )*             zero or more
# </div>         closing tag

# Note: Real HTML parsing should use proper parser, not regex

Drill 23: Conditionals - Optional Sections

# Match log entries: IP (optional user) timestamp message
# If user present (not -), capture it

^(\d+\.\d+\.\d+\.\d+)\s+(-|\w+)\s+\[([^\]]+)\]\s+(.+)$

# Capture the user only when present: "-" is matched but left out of group 2
^(\d+\.\d+\.\d+\.\d+)\s+(?:-|(\w+))\s+\[([^\]]+)\]\s+(.+)$

# Test data:
# 10.0.0.1 evan [2026-03-14] message
# 10.0.0.1 - [2026-03-14] message

Drill 24: Mode Modifier Scoping

# Case-insensitive keyword, case-sensitive error code
(?i)error:\s*(?-i)([A-Z0-9_]+)

# This matches "ERROR: FILE_NOT_FOUND" or "error: FILE_NOT_FOUND"
# But $1 only captures uppercase codes

echo -e "ERROR: TEST_123\nerror: REAL_CODE" | grep -oP '(?i)error:\s*(?-i)([A-Z0-9_]+)' | grep -oP '[A-Z0-9_]+$'
# Output: TEST_123
#         REAL_CODE

Drill 25: Branch Reset for Format Normalization

# Match phone in multiple formats, normalize groups
# (?| branch reset - all alternatives use same group numbers

(?|(\d{3})-(\d{3})-(\d{4})|(\d{3})\.(\d{3})\.(\d{4})|\((\d{3})\)\s*(\d{3})-(\d{4}))

# Group 1,2,3 = area, exchange, subscriber (regardless of format)

echo -e "555-123-4567\n555.123.4567\n(555) 123-4567" | grep -oP '(?|(\d{3})-(\d{3})-(\d{4})|(\d{3})\.(\d{3})\.(\d{4})|\((\d{3})\)\s*(\d{3})-(\d{4}))'

Real-World Complex Patterns

Production-ready patterns with explanations.

URL Parsing

# Full URL with capture groups
^(?<scheme>https?|ftp)://(?<host>[^:/\s]+)(?::(?<port>\d+))?(?<path>/[^\s?#]*)?(?:\?(?<query>[^\s#]*))?(?:#(?<fragment>\S*))?$

# Simplified:
^(https?|ftp)://([^:/\s]+)(:\d+)?(/[^\s?#]*)?(\?[^\s#]*)?(#\S*)?$

# Example extraction:
URL="https://api.example.com:8443/v1/users?page=1#section"
echo "$URL" | grep -oP '(?<=://)[^:/]+'  # Host: api.example.com
echo "$URL" | grep -oP '(?<=:)\d+'        # Port: 8443
echo "$URL" | grep -oP '://[^/]+\K/[^?#]*'  # Path: /v1/users

Log Parsing (Apache Combined)

# Apache combined log format
^(?<ip>\S+)\s+\S+\s+(?<user>\S+)\s+\[(?<time>[^\]]+)\]\s+"(?<method>\S+)\s+(?<path>\S+)\s+(?<proto>[^"]+)"\s+(?<status>\d+)\s+(?<bytes>\d+)\s+"(?<referrer>[^"]*)"\s+"(?<agent>[^"]*)"$

# Extract 4xx/5xx errors with response time > 1000ms
# (Assuming log has response time at end)
^(\S+).*" [45]\d{2} .* (\d{4,})$

Credit Card Masking

# Match credit card numbers (various formats)
\b(?:\d{4}[-\s]?){3}\d{4}\b

# Mask all but last 4 digits (sed has no \d or lookahead, so use perl)
echo "4111-1111-1111-1234" | perl -pe 's/\d(?=(?:[-\s]?\d){4})/X/g'
# Output: XXXX-XXXX-XXXX-1234

# Validate Luhn checksum requires code, not just regex

Security: Secret Detection

# AWS Access Key ID
\b(AKIA[0-9A-Z]{16})\b

# AWS Secret Key (40 char base64)
\b([A-Za-z0-9+/]{40})\b

# GitHub Personal Access Token
\bghp_[A-Za-z0-9]{36}\b

# Generic API Key (32+ hex or alphanumeric)
\b[A-Za-z0-9]{32,}\b

# JWT Token
\beyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\b

# Private Key
-----BEGIN\s+(RSA|EC|OPENSSH|DSA|ENCRYPTED)?\s*PRIVATE\s+KEY-----

# Combined secrets scanner:
grep -rP '(AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36}|-----BEGIN.*PRIVATE KEY-----)' .

Network: ACL/Firewall Rules

# Cisco ACL parsing
^(?<action>permit|deny)\s+(?<proto>ip|tcp|udp|icmp)\s+(?<src>\S+)\s+(?<srcwc>\S+)\s+(?<dst>\S+)\s+(?<dstwc>\S+)(?:\s+eq\s+(?<port>\d+))?

# iptables rule extraction (-s is the source address; -p is the protocol)
-A\s+(?<chain>\w+)\s+(?:-s\s+(?<src>\S+)\s+)?(?:-d\s+(?<dst>\S+)\s+)?.*-j\s+(?<action>ACCEPT|DROP|REJECT)

# Extract source IPs from deny rules (\K instead of a variable-length lookbehind)
grep -oP 'deny\s+ip\s+\K\S+(?=\s)' acl.txt

Config File Validation

# YAML key-value (basic)
^(\s*)([a-zA-Z_][a-zA-Z0-9_-]*):\s*(.*)$

# INI file section and key
^\s*\[([^\]]+)\]|^\s*([^=\s]+)\s*=\s*(.*)$

# systemd unit file
^\[(?<section>\w+)\]|^(?<key>[A-Z][a-zA-Z]+)=(?<value>.*)$

# Check for hardcoded IPs in config (audit)
grep -rP '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b' /etc/

Quick Reference Card

Metacharacters

.       any char          *       0 or more
+       1 or more         ?       0 or 1
^       line start        $       line end
\b      word boundary     |       alternation
[]      char class        ()      grouping

PCRE Shortcuts

\d  digit [0-9]           \D  not digit
\w  word [a-zA-Z0-9_]     \W  not word
\s  whitespace            \S  not whitespace

Quantifiers

{n}     exactly n         {n,}    n or more
{n,m}   n to m            *?      lazy (PCRE)

Lookaround (PCRE)

(?=x)   followed by x     (?!x)   not followed by x
(?<=x)  preceded by x     (?<!x)  not preceded by x

Possessive & Atomic (PCRE)

*+      possessive 0+     (?>x)   atomic group
++      possessive 1+     (?|...) branch reset
?+      possessive 0/1    (?R)    recurse pattern
{n,m}+  possessive n-m    (?1)    recurse group 1

Mode Modifiers (PCRE)

(?i)  case insensitive    (?m)  multiline (^$ per line)
(?s)  dotall (. = \n)     (?x)  extended/free-spacing
(?U)  ungreedy default    (?-i) turn off modifier

Conditionals (PCRE)

(?(1)y|n)   if group 1, match y else n
(?(?=x)y|n) if followed by x, match y else n

Unicode (PCRE)

\p{L}   any letter        \p{N}   any number
\p{Ll}  lowercase         \p{Lu}  uppercase
\p{P}   punctuation       \p{S}   symbol
\p{Han} Chinese           \p{Latin} Latin script
\P{L}   NOT letter        (*UCP)  Unicode mode

grep Flags

-E  ERE (extended)        -P  PCRE (perl)
-o  only matching         -i  case insensitive
-v  invert match          -c  count
-n  line numbers          -l  filenames only
-z  null-delimited        -A/-B/-C context