Regex Patterns
Regular expression syntax across BRE, ERE, and PCRE flavors.
The Three Flavors
| Flavor | Flag | Tools | Key Differences |
|---|---|---|---|
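| BRE | (default) | grep, sed, ed | `+`, `?`, `\|`, `(`, `)` are literal; escape them (`\+`, `\?`, `\\|`, `\(\)`) to use them as operators |
| ERE | -E | grep -E, sed -E, awk | `+`, `?`, `\|`, `()` work as operators unescaped |
| PCRE | -P | grep -P, perl, python | Adds `\d`-style shortcuts, lookaround, lazy quantifiers, non-capture groups |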
Quick test to see which flavors your grep supports:
echo "test123" | grep -P '\d+' # PCRE - should print test123
echo "test123" | grep -E '[0-9]+' # ERE - should print test123
echo "test123" | grep '[0-9]\+' # BRE - note escaped +
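The three probes above can be wrapped in one function that reports what the local grep accepts. This is a minimal sketch for a POSIX shell; `grep_flavors` is a hypothetical helper name, not a standard command.

```shell
# Report which regex flavors the local grep accepts.
# grep_flavors is a hypothetical helper, not a standard command.
grep_flavors() {
    flavors=""
    # BRE: probe with the portable \{1,\} interval rather than the GNU-only \+
    if echo "test123" | grep -q '[0-9]\{1,\}' 2>/dev/null; then
        flavors="$flavors BRE"
    fi
    # ERE: + works unescaped
    if echo "test123" | grep -Eq '[0-9]+' 2>/dev/null; then
        flavors="$flavors ERE"
    fi
    # PCRE: -P is missing from BSD/busybox grep and from builds without PCRE
    if echo "test123" | grep -Pq '\d+' 2>/dev/null; then
        flavors="$flavors PCRE"
    fi
    # Strip the leading space before printing
    echo "${flavors# }"
}

grep_flavors
```

A script can then choose between a `-P` pattern and an ERE fallback depending on whether the output contains `PCRE`.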
Metacharacters
| Char | Meaning | Example |
|---|---|---|
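| `.` | Any single character (except newline) | `a.c` matches `abc` |
| `*` | Zero or more of previous | `ab*` matches `a`, `ab`, `abb` |
| `+` | One or more of previous (ERE/PCRE) | `(ab)+` matches `ab`, `abab` |
| `?` | Zero or one of previous (ERE/PCRE) | `colou?r` matches `color`, `colour` |
| `\|` | Alternation (OR) | `cat\|dog` |
| `^` | Start of line | `^#` |
| `$` | End of line | `\.conf$` |
| `\` | Escape metacharacter | `\.` matches a literal dot |
| `[]` | Character class | `[abc]` |
| `()` | Grouping / capture | `gr(a\|e)y` |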
Custom Classes
[abc] # a, b, or c
[a-z] # lowercase letter
[A-Z] # uppercase letter
[0-9] # digit
[a-zA-Z] # any letter
[a-zA-Z0-9] # alphanumeric
[^abc] # NOT a, b, or c (negation)
[^0-9] # NOT a digit
[-abc] # literal hyphen (first position)
[abc-] # literal hyphen (last position)
[]abc] # literal ] (first position)
[.?*] # metacharacters literal inside []
POSIX Classes (BRE/ERE)
[[:alnum:]] # alphanumeric [a-zA-Z0-9]
[[:alpha:]] # alphabetic [a-zA-Z]
[[:digit:]] # digit [0-9]
[[:lower:]] # lowercase [a-z]
[[:upper:]] # uppercase [A-Z]
[[:space:]] # whitespace (space, tab, newline)
[[:blank:]] # space or tab only
[[:punct:]] # punctuation
[[:xdigit:]] # hex digit [0-9A-Fa-f]
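POSIX classes work in any `grep -E`, so they are a reasonable default when a pattern has to run on both GNU and BSD tools. A small sketch that keeps only `KEY=value` lines with an identifier-like key (the sample lines are made up for illustration):

```shell
# Keep only KEY=value lines whose key is identifier-like.
# The sample lines are made up for illustration.
valid=$(printf '%s\n' 'PATH=/usr/bin' '9lives=no' 'user_name=evan' 'bad key=1' |
    grep -E '^[[:alpha:]_][[:alnum:]_]*=[^[:space:]]+$')
echo "$valid"
# Prints:
# PATH=/usr/bin
# user_name=evan
```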
PCRE Shortcuts (grep -P)
\d # digit [0-9]
\D # NOT digit [^0-9]
\w # word char [a-zA-Z0-9_]
\W # NOT word char
\s # whitespace
\S # NOT whitespace
\b # word boundary
\B # NOT word boundary
Quantifiers
| Greedy | Lazy (PCRE) | Meaning | Example |
|---|---|---|---|
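| `*` | `*?` | 0 or more | `a*` |
| `+` | `+?` | 1 or more | `a+` |
| `?` | `??` | 0 or 1 | `a?` |
| `{n}` | `{n}?` | exactly n | `a{3}` |
| `{n,}` | `{n,}?` | n or more | `a{2,}` |
| `{n,m}` | `{n,m}?` | n to m | `a{2,4}` |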
# Greedy (default): match as MUCH as possible
echo "<div>content</div>" | grep -oP '<.*>'
# Output: <div>content</div>
# Lazy: match as LITTLE as possible
echo "<div>content</div>" | grep -oP '<.*?>'
# Output: <div>
# </div>
Anchors and Boundaries
^pattern # Start of line
pattern$ # End of line
^$ # Empty line
^.+$ # Non-empty line
\bword\b # Whole word only (PCRE)
\<word\> # Whole word only (GNU grep BRE/ERE)
# Examples
grep '^#' file # Comment lines
grep '\.conf$' file # Lines ending in .conf
grep -v '^$' file # Non-empty lines
grep -P '\berror\b' file # "error" as whole word
Basic Grouping
(pattern) # Capture group - saves match
(?:pattern) # Non-capture group (PCRE) - just grouping
# Alternation within group
grep -E '(cat|dog)' file # cat or dog
grep -E 'gr(a|e)y' file # gray or grey
grep -E '(ab)+' file # ab, abab, ababab
# BRE requires escaping
grep '\(cat\|dog\)' file # BRE version
Backreferences
\1 # First capture group
\2 # Second capture group
# Match repeated characters
grep -E '(.)\1' file # aa, bb, cc...
grep -E '(.)\1\1' file # aaa, bbb, ccc...
# Match repeated words
grep -E '\b(\w+)\s+\1\b' file # "the the", "is is"
# sed replacement with groups
echo "hello world" | sed -E 's/(\w+) (\w+)/\2 \1/'
# Output: world hello
Named Groups (PCRE)
(?<name>pattern) # Named capture
\k<name> # Named backreference
# Example: Match IP and reference by name
grep -P '(?<ip>\d+\.\d+\.\d+\.\d+).*\k<ip>' file
Lookahead and Lookbehind (PCRE)
These match a position, not characters. They don’t consume input.
| Syntax | Name | Matches |
|---|---|---|
| `(?=...)` | Positive lookahead | Position followed by pattern |
| `(?!...)` | Negative lookahead | Position NOT followed by pattern |
| `(?<=...)` | Positive lookbehind | Position preceded by pattern |
| `(?<!...)` | Negative lookbehind | Position NOT preceded by pattern |
# Find "error" followed by a number
grep -P 'error(?=\s*\d)' file
# Find "error" with no "404" anywhere later on the line
grep -P 'error(?!.*404)' file
# Extract price (digits after $)
echo 'Price: $199' | grep -oP '(?<=\$)\d+' # single quotes: inside "..." the shell would expand $1
# Output: 199
# Find port numbers NOT preceded by "127.0.0.1:"
# (anchor on the colon; a bare \d{2,5} would also match digits inside the address itself)
grep -oP '(?<!127\.0\.0\.1):\K\d{2,5}\b' file
# At least: 8 chars, 1 upper, 1 lower, 1 digit
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$
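Stacked lookaheads act as a logical AND of independent conditions checked from the same position. The sketch below wraps the password pattern above in a function and shows the same policy written without lookaheads, as a chain of ERE filters. `is_strong` and `is_strong_ere` are hypothetical helper names, and the first one assumes a grep with `-P` support.

```shell
# is_strong is a hypothetical helper: exit status 0 when the password passes.
# Requires grep -P (PCRE).
is_strong() {
    printf '%s\n' "$1" | grep -qP '^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$'
}

# The same policy without lookaheads: each condition is a separate ERE filter,
# and the line must survive all of them.
is_strong_ere() {
    printf '%s\n' "$1" | grep -E '.{8,}' | grep -E '[a-z]' |
        grep -E '[A-Z]' | grep -qE '[0-9]'
}

for pw in 'password' 'Passw0rd' 'Sh0rt'; do
    if is_strong "$pw"; then echo "$pw: strong"; else echo "$pw: weak"; fi
done
# Prints:
# password: weak
# Passw0rd: strong
# Sh0rt: weak
```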
Flavor Comparison Table
| Feature | BRE | ERE | PCRE |
|---|---|---|---|
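| One or more | `\+` | `+` | `+` |
| Zero or one | `\?` | `?` | `?` |
| Alternation | `\\|` | `\|` | `\|` |
| Grouping | `\(\)` | `()` | `()` |
| `\d` | No | No | Yes |
| `\w` | No (GNU tools: yes) | No (GNU tools: yes) | Yes |
| `\s` | No (GNU tools: yes) | No (GNU tools: yes) | Yes |
| Lookahead | No | No | Yes |
| Lookbehind | No | No | Yes |
| Lazy quantifiers | No | No | Yes |
| Non-capture groups | No | No | Yes |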
Drills: Basic Patterns
Practice these in regex101.com, then replicate in terminal.
Test Data
10.50.1.1 - - [12/Mar/2026:10:23:45 +0000] "GET /api/v1/users HTTP/1.1" 200 1234
192.168.1.100 - admin [12/Mar/2026:10:23:46 +0000] "POST /login HTTP/1.1" 401 89
10.50.1.20 - - [12/Mar/2026:10:23:47 +0000] "GET /health HTTP/1.1" 200 15
172.16.0.50 - evan [12/Mar/2026:10:23:48 +0000] "DELETE /api/v1/users/5 HTTP/1.1" 403 201
fe80::1 - - [12/Mar/2026:10:23:49 +0000] "GET /metrics HTTP/1.1" 200 8492
MAC: 14:F6:D8:7B:31:80 assigned to VLAN 10
MAC: 98:BB:1E:1F:A7:13 assigned to VLAN 20
error: connection refused to 10.50.1.50:389
warning: certificate expires in 30 days
ERROR: authentication failed for user 'admin'
Drill 1: IP Addresses
# Match any IPv4 address
\b\d{1,3}(?:\.\d{1,3}){3}\b
# Terminal:
grep -oP '\b\d{1,3}(?:\.\d{1,3}){3}\b' file
Drill 2: 10.x.x.x Network Only
# Only 10.x.x.x addresses
\b10(?:\.\d{1,3}){3}\b
# Terminal:
grep -oP '\b10(?:\.\d{1,3}){3}\b' file
Drill 3: MAC Addresses
# Standard MAC format (colon-separated)
\b([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b
# Terminal:
grep -oP '\b([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b' file
Drill 4: HTTP Status Codes
# 4xx errors
" (4\d{2}) "
# Terminal:
grep -oP '" \K4\d{2}(?= )' file
# All status codes
grep -oP '" \K[0-9]{3}(?= )' file
Drill 5: Usernames (not -)
# Capture username field (third field, not -)
- (\w+) \[
# Terminal (\w+ cannot match "-", so anonymous requests are skipped automatically):
grep -oP ' - \K\w+(?= \[)' file
Drill 6: HTTP Methods
# Extract HTTP method (the opening quote is followed by the method, then a space)
"(GET|POST|PUT|DELETE|PATCH)\b
# Terminal:
grep -oP '"\K(GET|POST|PUT|DELETE|PATCH)' file
Drill 7: Request Paths
# Path after method
"(?:GET|POST|PUT|DELETE|PATCH) ([^ ]+)
# Terminal:
grep -oP '"(?:GET|POST|PUT|DELETE|PATCH) \K[^ ]+' file
Drill 8: Log Levels (Case-Insensitive)
# Match error/warning/ERROR/WARNING etc.
(?i)\b(error|warn(?:ing)?|fatal|critical)\b
# Terminal:
grep -iP '\b(error|warn(ing)?|fatal|critical)\b' file
Drill 9: Port Numbers
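# Extract port after IP:port
# (require the address first; (?<=:) alone also matches timestamp and MAC fields)
(?:\d{1,3}\.){3}\d{1,3}:\K\d{2,5}\b
# Terminal:
grep -oP '(?:\d{1,3}\.){3}\d{1,3}:\K\d{2,5}\b' file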
Drill 10: VLAN IDs
# VLAN followed by number
VLAN\s+(\d+)
# Terminal:
grep -oP 'VLAN\s+\K\d+' file
Drill 11: Date Extraction
# Extract date from log brackets
\[(\d{2}/\w{3}/\d{4})
# Terminal:
grep -oP '\[\K\d{2}/\w{3}/\d{4}' file
Drill 12: Failed Auth Lines
# Lines with 401 or 403 status
" (40[13]) "
# Full line extraction:
grep -P '" 40[13] ' file
Drill 13: Extract Quoted Strings
# Content inside double quotes
"([^"]+)"
# Terminal:
grep -oP '"[^"]+"' file
Drill 14: IPv6 Detection
# Simple IPv6 pattern (link-local example)
fe80::[0-9a-fA-F:]+
# Terminal:
grep -oP 'fe80::[0-9a-fA-F:]+' file
Drill 15: Certificate Expiry Days
# Extract number from "expires in X days"
expires in (\d+) days
# Terminal:
grep -oP 'expires in \K\d+(?= days)' file
Network
# IPv4
\b\d{1,3}(?:\.\d{1,3}){3}\b
# IPv4 with CIDR
\b\d{1,3}(?:\.\d{1,3}){3}/\d{1,2}\b
# MAC (colon)
\b([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b
# MAC (hyphen - Windows style)
\b([0-9A-Fa-f]{2}-){5}[0-9A-Fa-f]{2}\b
# Port number
\b([0-9]{1,5})\b
# FQDN
\b[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z]{2,})+\b
# URL
https?://[^\s<>"]+
Security
# JWT token (3 base64 parts)
eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+
# API key patterns (generic)
[A-Za-z0-9]{32,}
# AWS Access Key
AKIA[0-9A-Z]{16}
# Private key header
-----BEGIN [A-Z ]+ PRIVATE KEY-----
# Password in URL (security audit)
://[^:]+:([^@]+)@
Logs
# Syslog timestamp
[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}
# ISO 8601
\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}
# Apache/Nginx combined log (full line)
^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+)
# Error levels
\b(EMERG|ALERT|CRIT|ERR|WARN|NOTICE|INFO|DEBUG)\b
Config Files
# Key=Value
^(\w+)\s*=\s*(.+)$
# YAML key: value
^(\s*)(\w+):\s*(.+)?$
# Comment lines (# or //)
^\s*(#|//).*$
# Empty or whitespace-only lines
^\s*$
# INI section headers
^\[([^\]]+)\]$
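A sketch of how the comment, blank-line, and Key=Value patterns above might be combined in a pipeline. It uses POSIX classes in place of `\w` and `\s` so it does not depend on GNU extensions; the sample config is made up for illustration.

```shell
# Made-up sample config for illustration.
conf='# sample settings
[server]
host = example.org
port=8080

// legacy comment
name = my app'

# 1) drop comment and blank lines, 2) print "key -> value" for Key=Value lines.
# Section headers such as [server] do not match the sed pattern and are skipped.
pairs=$(printf '%s\n' "$conf" |
    grep -Ev '^[[:space:]]*(#|//|$)' |
    sed -nE 's/^([[:alnum:]_]+)[[:space:]]*=[[:space:]]*(.+)$/\1 -> \2/p')
echo "$pairs"
# Prints:
# host -> example.org
# port -> 8080
# name -> my app
```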
sed Examples
# Basic substitution
sed 's/old/new/' file
# Global (all occurrences on line)
sed 's/old/new/g' file
# Using capture groups
sed -E 's/([0-9]+)/[\1]/' file # 123 -> [123]
# Swap two words
sed -E 's/(\w+) (\w+)/\2 \1/' file
# Delete lines matching pattern
sed '/pattern/d' file
# Print only matching lines (like grep)
sed -n '/pattern/p' file
# In-place edit
sed -i 's/old/new/g' file
sed -i.bak 's/old/new/g' file # With backup
awk Examples
# Match pattern
awk '/pattern/ {print}' file
# Match and extract field
awk '/error/ {print $1, $NF}' file
# Regex on specific field
awk '$1 ~ /^10\./ {print}' file # First field starts with 10.
# Negative match
awk '$1 !~ /^192\./ {print}' file # First field NOT starting with 192.
# gsub (global substitution)
awk '{gsub(/old/, "new"); print}' file
# sub (first occurrence only)
awk '{sub(/old/, "new"); print}' file
# Match and extract with gensub (GNU awk)
awk '{print gensub(/.*:([0-9]+).*/, "\\1", "g")}' file
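Field-level matching is where awk tends to be more convenient than grep. The sketch below runs against the first four lines of the drill test data and assumes that log layout, in which the status code is field 9 and the request path is field 7.

```shell
log='10.50.1.1 - - [12/Mar/2026:10:23:45 +0000] "GET /api/v1/users HTTP/1.1" 200 1234
192.168.1.100 - admin [12/Mar/2026:10:23:46 +0000] "POST /login HTTP/1.1" 401 89
10.50.1.20 - - [12/Mar/2026:10:23:47 +0000] "GET /health HTTP/1.1" 200 15
172.16.0.50 - evan [12/Mar/2026:10:23:48 +0000] "DELETE /api/v1/users/5 HTTP/1.1" 403 201'

# Print client IP, path, and status for every 4xx response.
# $9 is the status field in this layout; [0-9][0-9] avoids {n},
# which some older awk implementations do not support.
errors=$(printf '%s\n' "$log" | awk '$9 ~ /^4[0-9][0-9]$/ {print $1, $7, $9}')
echo "$errors"
# Prints:
# 192.168.1.100 /login 401
# 172.16.0.50 /api/v1/users/5 403
```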
Escaping in Different Contexts
# Shell escaping - use single quotes for regex
grep 'pattern with $special' file # $ is literal
grep "pattern with $special" file # $ is shell variable!
# BRE vs ERE escaping
grep 'a\+b' file # BRE: one or more a
grep -E 'a+b' file # ERE: same thing
# Escaping in character class
grep '[.?*]' file # Metacharacters are literal inside []
grep '[^abc]' file # ^ means NOT only at start of []
Greedy Matching Trap
# Problem: Greedy .* matches too much
echo '<div>one</div><div>two</div>' | grep -oP '<div>.*</div>'
# Output: <div>one</div><div>two</div> (one match, not two!)
# Solution: Lazy .*?
echo '<div>one</div><div>two</div>' | grep -oP '<div>.*?</div>'
# Output: <div>one</div>
# <div>two</div>
# Alternative: Negated class (works in BRE/ERE too)
echo '<div>one</div><div>two</div>' | grep -oE '<div>[^<]*</div>'
Word Boundary Differences
# PCRE word boundary
grep -P '\bword\b' file
# GNU BRE/ERE word boundary
grep '\<word\>' file
# Portable ERE alternative (note: it also consumes the neighboring characters)
grep -E '(^|[^a-zA-Z])word($|[^a-zA-Z])' file
Newline Handling
# . does NOT match newline by default
# Use -z for null-delimited (multiline) input in grep
# Match across lines (GNU grep); (?s) is needed so that . matches the newlines
grep -Pzo '(?s)start.*?end' file
# In sed, use N to read next line
sed 'N;s/line1\nline2/replaced/' file
Atomic Groups and Possessive Quantifiers (PCRE)
These prevent backtracking - once matched, the engine won’t give up characters. Critical for performance and avoiding catastrophic backtracking.
Possessive Quantifiers
Add + after any quantifier to make it possessive:
| Greedy | Possessive | Behavior |
|---|---|---|
| `*` | `*+` | 0+ chars, no backtrack |
| `+` | `++` | 1+ chars, no backtrack |
| `?` | `?+` | 0 or 1, no backtrack |
| `{n,m}` | `{n,m}+` | n-m chars, no backtrack |
# Compare greedy vs possessive
echo "aaaaaaaaab" | grep -P 'a+b' # Matches - greedy backtracks
echo "aaaaaaaaab" | grep -P 'a++b' # Matches - no backtrack needed
echo "aaaaaaaaaa" | grep -P 'a+b' # No match (backtracks, tries, fails)
echo "aaaaaaaaaa" | grep -P 'a++b' # No match (fails fast, no backtrack)
Atomic Groups
(?>pattern) - Once matched, contents cannot be backtracked into.
# Atomic group syntax
(?>pattern)
# Example: Match integer OR float, prefer integer
echo "3.14" | grep -oP '(?>\d+)\.?\d*'
# Output: 3.14 (atomic group matches "3" and will not give it back; the rest matches ".14")
# Without atomic group - ambiguity
echo "3.14" | grep -oP '(\d+)\.?\d*'
# Also matches, but engine may backtrack unnecessarily
When to Use
# Pattern that causes catastrophic backtracking
# DON'T: (a+)+ against "aaaaaaaaaaaaaaaaaaaaX"
# DO: Use possessive or atomic
(?>a+)+ # Atomic group prevents inner backtrack
a++ # Possessive on inner quantifier
# Real example: Matching quoted strings efficiently
# DON'T: ".*" (backtracks excessively on non-matches)
# DO: "[^"]*" (no backtracking possible)
# OR: "(?>[^"]*)" (atomic for complex inner patterns)
Recursive Patterns (PCRE)
Match nested structures like parentheses, HTML tags, JSON.
Basic Recursion
(?R) # Recurse entire pattern
(?0) # Same as (?R)
(?1) # Recurse first capture group
(?2) # Recurse second capture group
(?&name) # Recurse named group
(?P>name) # Python-style named recursion
Matching Balanced Parentheses
# Match balanced parens: (), (()), ((())), etc.
\((?:[^()]*|(?R))*\)
# Breakdown:
# \( opening paren
# (?: non-capture group for contents:
# [^()]* any non-paren chars
# | OR
# (?R) recurse entire pattern (nested parens)
# )* zero or more content items
# \) closing paren
# Test:
echo "(a(b(c)d)e)" | grep -oP '\((?:[^()]*|(?R))*\)'
# Output: (a(b(c)d)e)
echo "((nested))" | grep -oP '\((?:[^()]*|(?R))*\)'
# Output: ((nested))
Matching Nested Structures
# Match balanced braces (JSON-like)
\{(?:[^{}]*|(?R))*\}
# Match balanced brackets
\[(?:[^\[\]]*|(?R))*\]
# Match balanced angle brackets (XML-like)
<(?:[^<>]*|(?R))*>
Group Recursion
# Recurse specific group instead of whole pattern
# (?1) recurses group 1
# Pattern: word = (nested stuff)
(\w+)\s*=\s*(\((?:[^()]*|(?2))*\))
# (?2) recurses only the paren-matching group
echo "config = ((a)(b))" | grep -oP '(\w+)\s*=\s*(\((?:[^()]*|(?2))*\))'
# Output: config = ((a)(b))
Named Group Recursion
# Define pattern once, recurse by name
(?<parens>\((?:[^()]*|(?&parens))*\))
# Match: expression (with (nested) parens)
\w+\s*(?<parens>\((?:[^()]*|(?&parens))*\))
echo "func((a,b),(c,d))" | grep -oP '\w+(?<parens>\((?:[^()]*|(?&parens))*\))'
# Output: func((a,b),(c,d))
Conditional Patterns (PCRE)
Match different patterns based on conditions.
Syntax
(?(condition)yes-pattern|no-pattern)
(?(condition)yes-pattern) # No-pattern is empty match
Conditional on Capture Group
# (?(1)...|...) - true if group 1 matched
# Match optional opening paren, require closing if present
(\()?[a-z]+(?(1)\))
# Breakdown:
# (\()? optional capture of opening paren (group 1)
# [a-z]+ one or more letters
# (?(1)\)) IF group 1 matched, require closing paren
echo "abc" | grep -oP '(\()?[a-z]+(?(1)\))'
# Output: abc (no parens needed)
echo "(abc)" | grep -oP '(\()?[a-z]+(?(1)\))'
# Output: (abc) (parens balanced)
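echo "(abc" | grep -oP '(\()?[a-z]+(?(1)\))'
# Output: abc (the engine retries one position later, where group 1 is unset)
echo "(abc" | grep -oP '^(\()?[a-z]+(?(1)\))$'
# No output (anchored: an opening paren without a closing one is rejected)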
Conditional on Lookahead
# (?(?=condition)yes|no)
# Match based on what follows
# If the next character is a digit, match a number; else match a word
(?(?=\d)\d+|\w+)
echo "abc 123" | grep -oP '(?(?=\d)\d+|\w+)'
# Output: abc
# 123
Practical Examples
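# Match phone with optional country code
# The condition must be a group reference or a lookaround, so test for + with (?=\+)
# If + comes next, require a country code; otherwise the conditional matches nothing
(?(?=\+)\+\d{1,3}[-\s]?)\d{3}[-\s]?\d{3}[-\s]?\d{4}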
# Match quoted or unquoted value
# (")? captures optional quote, (?(1)") requires closing if present
(")?[^",]+(?(1)")
echo 'field,"quoted value",plain' | grep -oP '(")?[^",]+(?(1)")'
# Output: field
# "quoted value"
# plain
Branch Reset Groups (PCRE)
(?|…) - Alternatives share the same group numbers.
Useful when different patterns should populate the same capture group.
Without Branch Reset
# Normal alternation - different group numbers
(cat)|(dog)|(bird)
# "cat" → $1 = "cat", $2 = undef, $3 = undef
# "dog" → $1 = undef, $2 = "dog", $3 = undef
# "bird" → $1 = undef, $2 = undef, $3 = "bird"
With Branch Reset
# Branch reset - same group number for all alternatives
(?|(cat)|(dog)|(bird))
# "cat" → $1 = "cat"
# "dog" → $1 = "dog"
# "bird" → $1 = "bird"
# All alternatives populate group 1!
echo -e "cat\ndog\nbird" | grep -oP '(?|(cat)|(dog)|(bird))'
# Output: cat
# dog
# bird
Practical Use Case
# Extract date in multiple formats, normalize to one group
# Format 1: 2026-03-14 (ISO)
# Format 2: 03/14/2026 (US)
# Format 3: 14.03.2026 (EU)
(?|(\d{4})-(\d{2})-(\d{2})|(\d{2})/(\d{2})/(\d{4})|(\d{2})\.(\d{2})\.(\d{4}))
# This is messy - the captures don't align semantically
# Better to extract and post-process, but branch reset helps
# Extract IP:port or just IP (port optional)
(?|(\d+\.\d+\.\d+\.\d+):(\d+)|(\d+\.\d+\.\d+\.\d+)())
# Group 1 = IP, Group 2 = port (or empty)
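For the date case above, where the branch-reset captures do not line up semantically, one substitution per input format is often simpler than a single combined pattern. A sketch with `sed -E`; `to_iso` is a hypothetical helper name, and it assumes two-digit day and month fields as in the three formats listed.

```shell
# to_iso is a hypothetical helper: rewrite US (MM/DD/YYYY) and
# EU (DD.MM.YYYY) dates on stdin to ISO (YYYY-MM-DD).
to_iso() {
    sed -E \
        -e 's#([0-9]{2})/([0-9]{2})/([0-9]{4})#\3-\1-\2#g' \
        -e 's#([0-9]{2})\.([0-9]{2})\.([0-9]{4})#\3-\2-\1#g'
}

printf '%s\n' '2026-03-14' '03/14/2026' '14.03.2026' | to_iso
# Prints 2026-03-14 three times
```

Each expression knows the field order of its own format, so the reordering happens in the replacement instead of in shared group numbers.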
Mode Modifiers (PCRE)
Change regex behavior inline. Apply to whole pattern or sections.
Global Modifiers
| Modifier | Name | Effect |
|---|---|---|
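| `(?i)` | Case insensitive | Letters match regardless of case |
| `(?m)` | Multiline | `^` and `$` match at every line |
| `(?s)` | Single-line (dotall) | `.` also matches newline |
| `(?x)` | Extended (free-spacing) | Whitespace ignored, `#` starts a comment |
| `(?U)` | Ungreedy | Quantifiers lazy by default |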
Inline Modifier Syntax
# Apply to entire pattern (at start)
(?i)pattern # Case insensitive
# Apply to section only
normal(?i)insensitive(?-i)normal_again
# Multiple modifiers
(?im)pattern # Case insensitive + multiline
# Negative (turn off)
(?-i) # Turn OFF case insensitivity
Case Insensitive (?i)
# Match ERROR, error, Error, etc.
(?i)error
echo -e "ERROR\nerror\nError" | grep -P '(?i)error'
# Matches all three
# Same as grep -i:
grep -iP 'error' file
# Apply to portion only:
(?i:error)\s+(\d+) # "ERROR 404", "error 500" - only "error" is matched case-insensitively
Multiline (?m)
# Without (?m): ^ matches start of STRING only
# With (?m): ^ matches start of each LINE
# Match lines starting with #
(?m)^#.*$
# Equivalent to grep's default behavior (line-oriented)
# Useful in contexts where input is one big string
# Example in Perl one-liner (-0777 slurps the input into one string, so (?m) matters):
echo -e "line1\n#comment\nline2" | perl -0777 -ne 'print "$&\n" while /(?m)^#.*$/g'
# Output: #comment
Single-line/Dotall (?s)
# Without (?s): . matches any char EXCEPT newline
# With (?s): . matches newline too
# Match everything between START and END, including newlines
(?s)START.*?END
# Example:
echo -e "START\nmulti\nline\nEND" | grep -Pzo '(?s)START.*?END'
# Output: START
# multi
# line
# END
# grep -z treats input as null-delimited (whole file as one record)
Extended (?x) (Free-Spacing)
# Whitespace ignored, # starts comments
# Allows readable complex patterns
(?x)
^ # Start of line
(\d{3}) # Area code
[-.\s]? # Optional separator
(\d{3}) # Exchange
[-.\s]? # Optional separator
(\d{4}) # Subscriber
$ # End of line
# Equivalent to:
^(\d{3})[-.\s]?(\d{3})[-.\s]?(\d{4})$
# In grep (must escape or use single line):
grep -P '(?x) ^ (\d{3}) [-.\s]? (\d{3}) [-.\s]? (\d{4}) $' file
Ungreedy (?U)
# Without (?U): quantifiers greedy by default
# With (?U): quantifiers lazy by default
# Match minimal between tags
(?U)<tag>.*</tag>
# Same as:
<tag>.*?</tag>
# With (?U), .* is lazy, .*? becomes greedy (inverted!)
Unicode Properties (PCRE)
Match characters by Unicode category, script, or property. Requires PCRE2 or Perl with Unicode support.
General Categories
\p{L} # Any letter (any script)
\p{Ll} # Lowercase letter
\p{Lu} # Uppercase letter
\p{N} # Any number
\p{Nd} # Decimal digit
\p{P} # Punctuation
\p{S} # Symbol
\p{Z} # Separator (space, line, paragraph)
\p{C} # Control/format/private use
\P{L} # NOT a letter (uppercase P negates)
Script Matching
\p{Latin} # Latin script
\p{Greek} # Greek script
\p{Cyrillic} # Cyrillic script
\p{Han} # Chinese characters
\p{Hiragana} # Japanese hiragana
\p{Katakana} # Japanese katakana
\p{Arabic} # Arabic script
\p{Hebrew} # Hebrew script
# Long form:
\p{Script=Latin}
\p{Script=Greek}
# Match word in any script:
[\p{L}\p{M}]+ # Letters + combining marks
Practical Examples
# Match any letter (international)
echo "Héllo Wörld 你好" | grep -oP '\p{L}+'
# Output: Héllo
# Wörld
# 你好
# Match email with international characters
[\p{L}\p{N}._%+-]+@[\p{L}\p{N}.-]+\.\p{L}{2,}
# Match only ASCII letters (not international)
[A-Za-z]+
# vs any letter:
\p{L}+
# Detect non-ASCII characters (security audit)
[^\x00-\x7F]
# Or:
\P{ASCII}
# Match emoji (basic)
[\x{1F300}-\x{1F9FF}]
# Or category:
\p{Emoji}
Unicode Character Classes
# Combining marks (accents, etc.)
\p{M} # Any mark
\p{Mn} # Non-spacing mark
\p{Mc} # Spacing combining mark
# Letter + marks (proper word matching)
[\p{L}\p{M}]+
# Example: Match accented words
echo "café naïve résumé" | grep -oP '[\p{L}\p{M}]+'
# Output: café
# naïve
# résumé
Common Unicode Gotchas
# \w may NOT match international letters, depending on locale and grep version.
# In the C locale it is ASCII-only:
echo "Müller" | LC_ALL=C grep -oP '\w+'
# Output: M
#         ller (the bytes of ü are not matched)
# Fix: Use \p{L} or enable Unicode mode
echo "Müller" | grep -oP '[\p{L}\p{N}_]+'
# Output: Müller
# Perl/PCRE2 - enable Unicode word characters:
# (?u) or (*UCP) at pattern start
echo "Müller" | grep -oP '(*UCP)\w+'
# Output: Müller
# Byte vs character length
echo "café" | wc -c # 6 bytes (é = 2 bytes in UTF-8)
echo "café" | wc -m # 5 characters
Regex Engine Internals
Understanding NFA vs DFA and catastrophic backtracking.
NFA vs DFA Engines
| Feature | NFA (grep -P, Perl) | DFA (grep -E, awk) |
|---|---|---|
| Backreferences | Yes | No |
| Lookaround | Yes | No |
| Lazy quantifiers | Yes | No |
| Atomic groups | Yes | No |
| Speed guarantee | No (can backtrack) | Yes (linear time) |
| Tools | grep -P, Perl, Python | grep -E, egrep, awk |
Catastrophic Backtracking
# The evil pattern: nested quantifiers with overlap
(a+)+
# Against "aaaaaaaaaaaaaaaaaaaaX":
# Engine tries every possible way to divide the a's
# 2^n combinations = exponential time = hang/crash
# Worse: (a*)* or (a+)* or (a|aa)+
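# Demonstration (DON'T run on long strings):
# echo "aaaaaaaaaaaaaaaaaaaaaaaaX" | grep -P '(a+)+b'
# Depending on the PCRE version this may hang, abort with a backtracking-limit
# error, or fail fast thanks to engine optimizations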
# Real-world examples of vulnerable patterns:
(.*a)+ # "aaaa...X" causes backtracking
(x+x+)+y # Overlapping x's
(\w+)* # Any word char, nested quantifiers
Fixing Catastrophic Backtracking
# Method 1: Possessive quantifiers
(a+)+ # BAD - catastrophic
(a++) # GOOD - possessive inner prevents backtrack
# Method 2: Atomic groups
(a+)+ # BAD
(?>a+) # GOOD - atomic group
# Method 3: Eliminate alternation overlap
(a|aa)+ # BAD - overlapping alternatives
a+ # GOOD - just match all a's
# Method 4: Use negated character class instead of .*
".*" # BAD with nested quotes
"[^"]*" # GOOD - can't backtrack
# Method 5: Anchor patterns
.*foo # Potentially slow on long non-matching lines
^.*foo # Better - fails fast at start of line
ReDoS (Regular Expression Denial of Service)
# Vulnerable patterns to NEVER use with untrusted input:
(a+)+
([a-zA-Z]+)*
(a|aa)+
(.*a){n} # Where n is significant
# Security audit: Find vulnerable patterns in code
grep -rP '\([^)]*[+*]\)[+*]' --include="*.py"
grep -rP '\.\*[^?]' --include="*.py"
# Safe alternatives:
# 1. Use atomic groups/possessive quantifiers
# 2. Set regex timeout in your language
# 3. Limit input length before regex
# 4. Use DFA engine for untrusted input
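Alternatives 2 to 4 can be combined in a small wrapper. This is a sketch only: `safe_match` is a hypothetical helper, the 1000-character cap and 2-second limit are arbitrary illustration values, and `timeout` is assumed to be available (it ships with GNU coreutils).

```shell
# safe_match is a hypothetical wrapper: pattern in $1, untrusted input on stdin.
safe_match() {
    # Tip 3: drop lines longer than 1000 characters (arbitrary cap).
    # Tip 2: give grep at most 2 seconds.
    # Tip 4: use -E rather than -P, avoiding the backtracking PCRE matcher.
    awk 'length($0) <= 1000' | timeout 2 grep -E -- "$1"
}

printf '%s\n' 'aaaaaaaaaaaaaaaaaaaaaaaaX' 'aaab' | safe_match '(a+)+b'
# Prints: aaab
```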
Performance Tips
# 1. Anchor when possible
^pattern # Much faster than searching entire line
pattern$ # Fails fast if line end doesn't match
# 2. Most specific first
(cat|catch) # "catch" never matches (cat matches first)
(catch|cat) # Better - longer/specific first
# 3. Avoid .* at pattern start
.*foo # Scans entire string
[^f]*foo # Better - stops at first 'f'
# 4. Use non-capturing groups when capture not needed
(?:abc)+ # Faster than (abc)+
# 5. Character class vs alternation
[aeiou] # Faster
(a|e|i|o|u) # Slower (creates capture group + alternation overhead)
# 6. Pre-filter with fast grep before complex regex
grep 'error' file | grep -P 'complex(?=pattern)'
Drills: Expert Level
These patterns use advanced PCRE features.
Drill 16: Balanced Parentheses Validation
# Match only strings with balanced parens
# (?R) would re-apply the ^ and $ anchors inside the recursion, so recurse group 1 instead
^(\((?:[^()]++|(?1))*\))+$
# Test data
echo -e "(valid)\n((nested))\n((broken)\n()()" | while read line; do
echo -n "$line -> "
echo "$line" | grep -qP '^(\((?:[^()]++|(?1))*\))+$' && echo "VALID" || echo "INVALID"
done
Drill 17: Atomic IP Validation
# Efficient IP validation with atomic groups
^(?>(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?>25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$
# Atomic groups prevent backtracking on invalid IPs
# Test:
echo -e "192.168.1.1\n256.1.1.1\n10.0.0.1\nabc" | grep -P '^(?>(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?>25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$'
Drill 18: Password Strength (Multiple Lookaheads)
# At least: 12 chars, 1 upper, 1 lower, 1 digit, 1 special
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{12,}$
# Extended with Unicode (international passwords):
^(?=.*\p{Ll})(?=.*\p{Lu})(?=.*\d)(?=.*[!@#$%^&*]).{12,}$
# Test:
echo -e "weak\nStrongPass1!\nValidPass123!" | grep -P '^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{12,}$'
Drill 19: Semver Parsing
# Semantic versioning: MAJOR.MINOR.PATCH(-prerelease)?(+build)?
^(?<major>0|[1-9]\d*)\.(?<minor>0|[1-9]\d*)\.(?<patch>0|[1-9]\d*)(?:-(?<prerelease>[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?(?:\+(?<build>[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?$
# Simpler version:
^\d+\.\d+\.\d+(-[a-zA-Z0-9.]+)?(\+[a-zA-Z0-9.]+)?$
# Test:
echo -e "1.2.3\n1.0.0-alpha\n2.1.0+build.123\n1.0.0-beta+exp.sha.5114f85" | \
grep -P '^\d+\.\d+\.\d+(-[a-zA-Z0-9.]+)?(\+[a-zA-Z0-9.]+)?$'
Drill 20: RFC 5322 Email (Simplified)
# Practical email regex (not full RFC 5322, but covers most everyday addresses)
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
# With Unicode support:
^[\p{L}\p{N}._%+-]+@[\p{L}\p{N}.-]+\.\p{L}{2,}$
# More complete (handles quoted local parts):
^(?:[a-zA-Z0-9._%+-]+|"[^"]+")@[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?)*\.[a-zA-Z]{2,}$
Drill 21: JSON Key-Value Extraction
# Extract "key": "value" pairs from JSON
"([^"]+)":\s*"([^"]+)"
# With possible escapes in values:
"([^"]+)":\s*"((?:[^"\\]|\\.)*)"
# Test:
echo '{"name": "John", "city": "New York"}' | grep -oP '"([^"]+)":\s*"\K[^"]+'
# Output: John
# New York
Drill 22: Nested HTML Tag Matching
# Match balanced <div>...</div> with nesting
<div\b[^>]*>(?:[^<]*|<(?!/?div\b)|(?R))*</div>
# Explanation:
# <div\b[^>]*> opening <div> tag with attributes
# (?: non-capture alternation:
# [^<]* text (no tags)
# |<(?!/?div\b) tag that's not div (using negative lookahead)
# |(?R) recurse for nested div
# )* zero or more
# </div> closing tag
# Note: Real HTML parsing should use proper parser, not regex
Drill 23: Conditionals - Optional Sections
# Match log entries: IP (optional user) timestamp message
# If user present (not -), capture it
^(\d+\.\d+\.\d+\.\d+)\s+(-|\w+)\s+\[([^\]]+)\]\s+(.+)$
# With conditional for user: if the field starts with -, consume it; otherwise capture the name in group 2
^(\d+\.\d+\.\d+\.\d+)\s+(?(?=-)-|(\w+))\s+\[([^\]]+)\]\s+(.+)$
# Test data:
# 10.0.0.1 evan [2026-03-14] message
# 10.0.0.1 - [2026-03-14] message
Drill 24: Mode Modifier Scoping
# Case-insensitive keyword, case-sensitive error code
(?i)error:\s*(?-i)([A-Z0-9_]+)
# This matches "ERROR: FILE_NOT_FOUND" or "error: FILE_NOT_FOUND"
# But $1 only captures uppercase codes
echo -e "ERROR: TEST_123\nerror: REAL_CODE" | grep -oP '(?i)error:\s*(?-i)([A-Z0-9_]+)' | grep -oP '[A-Z0-9_]+$'
# Output: TEST_123
# REAL_CODE
Drill 25: Branch Reset for Format Normalization
# Match phone in multiple formats, normalize groups
# (?| branch reset - all alternatives use same group numbers
(?|(\d{3})-(\d{3})-(\d{4})|(\d{3})\.(\d{3})\.(\d{4})|\((\d{3})\)\s*(\d{3})-(\d{4}))
# Group 1,2,3 = area, exchange, subscriber (regardless of format)
echo -e "555-123-4567\n555.123.4567\n(555) 123-4567" | grep -oP '(?|(\d{3})-(\d{3})-(\d{4})|(\d{3})\.(\d{3})\.(\d{4})|\((\d{3})\)\s*(\d{3})-(\d{4}))'
Real-World Complex Patterns
Production-ready patterns with explanations.
URL Parsing
# Full URL with capture groups
^(?<scheme>https?|ftp)://(?<host>[^:/\s]+)(?::(?<port>\d+))?(?<path>/[^\s?#]*)?(?:\?(?<query>[^\s#]*))?(?:#(?<fragment>\S*))?$
# Simplified:
^(https?|ftp)://([^:/\s]+)(:\d+)?(/[^\s?#]*)?(\?[^\s#]*)?(#\S*)?$
# Example extraction:
URL="https://api.example.com:8443/v1/users?page=1#section"
echo "$URL" | grep -oP '(?<=://)[^:/]+' # Host: api.example.com
echo "$URL" | grep -oP '(?<=:)\d+' # Port: 8443
echo "$URL" | grep -oP '://[^/]+\K/[^?#]+' # Path: /v1/users (skip past scheme and host first)
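The simplified pattern can also pull every component from a single regex by printing one capture group at a time. A sketch with `sed -E`; `url_part` is a hypothetical helper, and the pattern is an ERE adaptation (`[0-9]` instead of `\d`, an extra group around the port digits), not the exact PCRE above.

```shell
url='https://api.example.com:8443/v1/users?page=1#section'
# ERE adaptation of the simplified pattern.
# Groups: 1 scheme, 2 host, 4 port, 5 path, 6 query, 7 fragment
re='^(https?|ftp)://([^:/?#]+)(:([0-9]+))?(/[^?#]*)?(\?[^#]*)?(#.*)?$'

# url_part is a hypothetical helper: print capture group $1 of $re for $url.
# "~" is used as the s delimiter because the pattern contains "/" and "#".
url_part() {
    printf '%s\n' "$url" | sed -nE "s~$re~\\$1~p"
}

echo "scheme: $(url_part 1)"
echo "host:   $(url_part 2)"
echo "port:   $(url_part 4)"
echo "path:   $(url_part 5)"
```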
Log Parsing (Apache Combined)
# Apache combined log format
^(?<ip>\S+)\s+\S+\s+(?<user>\S+)\s+\[(?<time>[^\]]+)\]\s+"(?<method>\S+)\s+(?<path>\S+)\s+(?<proto>[^"]+)"\s+(?<status>\d+)\s+(?<bytes>\d+)\s+"(?<referrer>[^"]*)"\s+"(?<agent>[^"]*)"$
# Extract 4xx/5xx errors with response time > 1000ms
# (Assuming log has response time at end)
^(\S+).*" [45]\d{2} .* (\d{4,})$
Credit Card Masking
# Match credit card numbers (various formats)
\b(?:\d{4}[-\s]?){3}\d{4}\b
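# Mask all but last 4 digits (sed has no \d or lookahead, so capture the separators instead)
echo "4111-1111-1111-1234" | sed -E 's/[0-9]{4}([- ]?)[0-9]{4}([- ]?)[0-9]{4}([- ]?)([0-9]{4})/XXXX\1XXXX\2XXXX\3\4/'
# Output: XXXX-XXXX-XXXX-1234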
# Validate Luhn checksum requires code, not just regex
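A sketch of the code side of that check: the regex finds candidates, and a short awk program verifies the Luhn checksum. `luhn_ok` is a hypothetical helper name.

```shell
# luhn_ok is a hypothetical helper: exit status 0 if the digits in $1
# pass the Luhn check. Separators such as "-" and " " are ignored.
luhn_ok() {
    printf '%s\n' "$1" | tr -cd '0-9\n' | awk '{
        n = length($0); sum = 0
        # Walk from the rightmost digit; double every second digit.
        for (i = n; i >= 1; i--) {
            d = substr($0, i, 1) + 0
            if ((n - i) % 2 == 1) { d *= 2; if (d > 9) d -= 9 }
            sum += d
        }
        # Valid when there is at least one digit and the sum ends in 0.
        exit !(n > 0 && sum % 10 == 0)
    }'
}

for card in '4111-1111-1111-1111' '4111-1111-1111-1234'; do
    if luhn_ok "$card"; then echo "$card: checksum ok"; else echo "$card: checksum failed"; fi
done
# Prints:
# 4111-1111-1111-1111: checksum ok
# 4111-1111-1111-1234: checksum failed
```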
Security: Secret Detection
# AWS Access Key ID
\b(AKIA[0-9A-Z]{16})\b
# AWS Secret Key (40 char base64)
\b([A-Za-z0-9+/]{40})\b
# GitHub Personal Access Token
\bghp_[A-Za-z0-9]{36}\b
# Generic API Key (32+ hex or alphanumeric)
\b[A-Za-z0-9]{32,}\b
# JWT Token
\beyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\b
# Private Key
-----BEGIN\s+(RSA|EC|OPENSSH|DSA|ENCRYPTED)?\s*PRIVATE\s+KEY-----
# Combined secrets scanner:
grep -rP '(AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36}|-----BEGIN.*PRIVATE KEY-----)' .
Network: ACL/Firewall Rules
# Cisco ACL parsing
^(?<action>permit|deny)\s+(?<proto>ip|tcp|udp|icmp)\s+(?<src>\S+)\s+(?<srcwc>\S+)\s+(?<dst>\S+)\s+(?<dstwc>\S+)(?:\s+eq\s+(?<port>\d+))?
# iptables rule extraction
-A\s+(?<chain>\w+)\s+(?:-[sp]\s+(?<src>\S+)\s+)?(?:-d\s+(?<dst>\S+)\s+)?.*-j\s+(?<action>ACCEPT|DROP|REJECT)
# Extract source IPs from deny rules (\K instead of lookbehind, which cannot contain \s+)
grep -oP 'deny\s+ip\s+\K\S+' acl.txt
Config File Validation
# YAML key-value (basic)
^(\s*)([a-zA-Z_][a-zA-Z0-9_-]*):\s*(.*)$
# INI file section and key
^\s*\[([^\]]+)\]|^\s*([^=\s]+)\s*=\s*(.*)$
# systemd unit file
^\[(?<section>\w+)\]|^(?<key>[A-Z][a-zA-Z]+)=(?<value>.*)$
# Check for hardcoded IPs in config (audit)
grep -rP '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b' /etc/
Quick Reference Card
.        any char                 *        0 or more
+        1 or more                ?        0 or 1
^        line start               $        line end
\b       word boundary            |        alternation
[]       char class               ()       grouping
\d       digit [0-9]              \D       not digit
\w       word [a-zA-Z0-9_]        \W       not word
\s       whitespace               \S       not whitespace
{n}      exactly n                {n,}     n or more
{n,m}    n to m                   *?       lazy (PCRE)
(?=x)    followed by x            (?!x)    not followed by x
(?<=x)   preceded by x            (?<!x)   not preceded by x
*+       possessive 0+            (?>x)    atomic group
++       possessive 1+            (?|...)  branch reset
?+       possessive 0/1           (?R)     recurse pattern
{n,m}+   possessive n-m           (?1)     recurse group 1
(?i)     case insensitive         (?m)     multiline (^$ per line)
(?s)     dotall (. = \n)          (?x)     extended/free-spacing
(?U)     ungreedy default         (?-i)    turn off modifier
(?(1)y|n)     if group 1, match y else n
(?(?=x)y|n)   if followed by x, match y else n
\p{L}    any letter               \p{N}    any number
\p{Ll}   lowercase                \p{Lu}   uppercase
\p{P}    punctuation              \p{S}    symbol
\p{Han}  Chinese                  \p{Latin} Latin script
\P{L}    NOT letter               (*UCP)   Unicode mode
-E       ERE (extended)           -P       PCRE (perl)
-o       only matching            -i       case insensitive
-v       invert match             -c       count
-n       line numbers             -l       filenames only
-z       null-delimited           -A/-B/-C context