Regex Patterns

Regular expression syntax across BRE, ERE, and PCRE flavors.

The Three Flavors

Flavor  Flag       Tools                  Key Differences
BRE     (default)  grep, sed, ed          \+, \?, \|, \(\) need a backslash (\+, \?, \| are GNU extensions)
ERE     -E         grep -E, sed -E, awk   +, ?, |, () work without escaping
PCRE    -P         grep -P, perl, python  \d, \w, \b, lookahead/behind

Quick test of which flavors your grep supports:

echo "test123" | grep -P '\d+'    # PCRE - should print test123
echo "test123" | grep -E '[0-9]+' # ERE - should print test123
echo "test123" | grep '[0-9]\+'   # BRE - note escaped +
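The same checks can be wrapped in a small helper that reports which flavors the local grep accepts. This is only a sketch: the function name grep_supports is hypothetical, and it assumes a grep that exits non-zero when it does not recognize a flag.

```shell
# grep_supports FLAG PATTERN: succeed if `grep FLAG PATTERN` matches "test123".
# Pass an empty FLAG to test plain BRE.
grep_supports() {
  if [ -n "$1" ]; then
    printf 'test123\n' | grep -q "$1" -- "$2" 2>/dev/null
  else
    printf 'test123\n' | grep -q -- "$2" 2>/dev/null
  fi
}

grep_supports ''   '[0-9][0-9]*' && echo "BRE:  yes" || echo "BRE:  no"
grep_supports '-E' '[0-9]+'      && echo "ERE:  yes" || echo "ERE:  no"
grep_supports '-P' '\d+'         && echo "PCRE: yes" || echo "PCRE: no"
```

A "no" for PCRE usually means grep was built without PCRE support, in which case the -P examples in this reference need perl or another tool.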

Metacharacters

Char  Meaning                                Example
.     Any single character (except newline)  c.t matches cat, cut, c9t
*     Zero or more of previous               go*d matches gd, god, good, goood
+     One or more of previous (ERE/PCRE)     go+d matches god, good (not gd)
?     Zero or one of previous (ERE/PCRE)     colou?r matches color, colour
|     Alternation (OR)                       cat|dog matches cat or dog
^     Start of line                          ^# matches lines starting with #
$     End of line                            \.conf$ matches lines ending in .conf
\     Escape metacharacter                   \. matches literal dot
[]    Character class                        [aeiou] matches any vowel
()    Grouping / capture                     (ab)+ matches ab, abab, ababab

Custom Classes

[abc]       # a, b, or c
[a-z]       # lowercase letter
[A-Z]       # uppercase letter
[0-9]       # digit
[a-zA-Z]    # any letter
[a-zA-Z0-9] # alphanumeric
[^abc]      # NOT a, b, or c (negation)
[^0-9]      # NOT a digit
[-abc]      # literal hyphen (first position)
[abc-]      # literal hyphen (last position)
[]abc]      # literal ] (first position)
[.?*]       # metacharacters literal inside []

POSIX Classes (BRE/ERE)

[[:alnum:]]  # alphanumeric [a-zA-Z0-9]
[[:alpha:]]  # alphabetic [a-zA-Z]
[[:digit:]]  # digit [0-9]
[[:lower:]]  # lowercase [a-z]
[[:upper:]]  # uppercase [A-Z]
[[:space:]]  # whitespace (space, tab, newline)
[[:blank:]]  # space or tab only
[[:punct:]]  # punctuation
[[:xdigit:]] # hex digit [0-9A-Fa-f]
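POSIX classes are only valid inside a bracket expression, which is why they appear with doubled brackets. The sketch below shows two typical uses with portable tools; the function name is_identifier and the sample strings are made up for illustration.

```shell
# Validate an identifier: a letter or underscore, then letters, digits or underscores.
is_identifier() {
  printf '%s\n' "$1" | grep -Eq '^[[:alpha:]_][[:alnum:]_]*$'
}

is_identifier "user_01" && echo "user_01: ok"
is_identifier "9lives"  || echo "9lives: rejected"

# Strip leading and trailing blanks (spaces or tabs) with sed.
printf '  \tpadded value \t\n' | sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g'
```

Unlike [a-z], the named classes follow the current locale, so [[:alpha:]] can also match accented letters in a UTF-8 locale.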

PCRE Shortcuts (grep -P)

\d    # digit [0-9]
\D    # NOT digit [^0-9]
\w    # word char [a-zA-Z0-9_]
\W    # NOT word char
\s    # whitespace
\S    # NOT whitespace
\b    # word boundary
\B    # NOT word boundary

Quantifiers

Greedy  Lazy (PCRE)  Meaning    Example
*       *?           0 or more  a* matches "", a, aa, aaa
+       +?           1 or more  a+ matches a, aa, aaa (not "")
?       ??           0 or 1     a? matches "", a
{n}     {n}?         exactly n  a{3} matches aaa only
{n,}    {n,}?        n or more  a{2,} matches aa, aaa, aaaa...
{n,m}   {n,m}?       n to m     a{2,4} matches aa, aaa, aaaa

Greedy vs Lazy
# Greedy (default): match as MUCH as possible
echo "<div>content</div>" | grep -oP '<.*>'
# Output: <div>content</div>

# Lazy: match as LITTLE as possible
echo "<div>content</div>" | grep -oP '<.*?>'
# Output: <div>
#         </div>

Anchors and Boundaries

^pattern    # Start of line
pattern$    # End of line
^$          # Empty line
^.+$        # Non-empty line
\bword\b    # Whole word only (PCRE)
\<word\>    # Whole word only (GNU grep BRE/ERE)

# Examples
grep '^#' file           # Comment lines
grep '\.conf$' file      # Lines ending in .conf
grep -v '^$' file        # Non-empty lines
grep -P '\berror\b' file # "error" as whole word

Basic Grouping

(pattern)      # Capture group - saves match
(?:pattern)    # Non-capture group (PCRE) - just grouping

# Alternation within group
grep -E '(cat|dog)' file      # cat or dog
grep -E 'gr(a|e)y' file       # gray or grey
grep -E '(ab)+' file          # ab, abab, ababab

# BRE requires escaping
grep '\(cat\|dog\)' file      # BRE version

Backreferences

\1    # First capture group
\2    # Second capture group

# Match repeated characters
grep -E '(.)\1' file          # aa, bb, cc...
grep -E '(.)\1\1' file        # aaa, bbb, ccc...

# Match repeated words
grep -E '\b(\w+)\s+\1\b' file # "the the", "is is"

# sed replacement with groups
echo "hello world" | sed -E 's/(\w+) (\w+)/\2 \1/'
# Output: world hello
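Capture groups become most useful when the replacement reorders them. As a small illustration (the function name to_iso_date and the sample text are hypothetical), three groups can turn a DD/MM/YYYY date into ISO order:

```shell
# Reformat DD/MM/YYYY as YYYY-MM-DD by reordering three capture groups.
# "#" is used as the s command delimiter so the slashes need no escaping.
to_iso_date() {
  sed -E 's#([0-9]{2})/([0-9]{2})/([0-9]{4})#\3-\2-\1#g'
}

echo "released 14/03/2026" | to_iso_date
# prints: released 2026-03-14
```

The pattern uses [0-9] rather than \d so it works in sed -E, which has no PCRE shortcuts.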

Named Groups (PCRE)

(?<name>pattern)   # Named capture
\k<name>           # Named backreference

# Example: Match IP and reference by name
grep -P '(?<ip>\d+\.\d+\.\d+\.\d+).*\k<ip>' file

Lookahead and Lookbehind (PCRE)

These match a position, not characters. They don’t consume input.

Syntax         Name                 Matches
(?=pattern)    Positive lookahead   Position followed by pattern
(?!pattern)    Negative lookahead   Position NOT followed by pattern
(?<=pattern)   Positive lookbehind  Position preceded by pattern
(?<!pattern)   Negative lookbehind  Position NOT preceded by pattern

Examples
# Find "error" followed by a number
grep -P 'error(?=\s*\d)' file

# Find "error" NOT followed by "404"
grep -P 'error(?!.*404)' file

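# Extract price (digits after $); single quotes stop the shell from expanding $1
echo 'Price: $199' | grep -oP '(?<=\$)\d+'
# Output: 199

# Find numbers after ":" that are NOT preceded by "127.0.0.1:"
# (matching the ":" first stops the engine from sliding into the middle of a number)
grep -P ':(?<!127\.0\.0\.1:)\K\d{2,5}\b' file

Password Validation Pattern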
# At least: 8 chars, 1 upper, 1 lower, 1 digit
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$
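Each lookahead in this pattern is an independent condition tested from the same position, so the whole pattern is a logical AND of rules. For tools that lack -P, roughly the same check can be written as a chain of filters. Note that this swaps in a different technique (a pipeline, not lookahead), and the function name check_password is hypothetical.

```shell
# check_password PASSWORD: succeed if the password has at least 8 characters
# including one lowercase letter, one uppercase letter and one digit.
# Each grep passes the line on only if its rule holds, so the last
# grep -q succeeds only when every rule matched.
check_password() {
  printf '%s\n' "$1" |
    grep -E '^.{8,}$' |
    grep '[[:lower:]]' |
    grep '[[:upper:]]' |
    grep -q '[[:digit:]]'
}

for pw in 'short1A' 'alllowercase1' 'Passw0rdOK'; do
  if check_password "$pw"; then echo "$pw: accepted"; else echo "$pw: rejected"; fi
done
```

One difference from the PCRE version: the POSIX classes follow the locale, so accented letters may count as lowercase or uppercase here.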

Flavor Comparison Table

Feature             BRE        ERE        PCRE
One or more (+)     \+ (GNU)   +          +
Zero or one (?)     \? (GNU)   ?          ?
Alternation (|)     \| (GNU)   |          |
Grouping ()         \(\)       ()         ()
\d for digits       No         No         Yes
\w for word         GNU only   GNU only   Yes
\b word boundary    GNU only   GNU only   Yes
Lookahead           No         No         Yes
Lookbehind          No         No         Yes
Lazy quantifiers    No         No         Yes
Non-capture groups  No         No         Yes

Drills: Basic Patterns

Practice these in regex101.com, then replicate in terminal.

Test Data

10.50.1.1 - - [12/Mar/2026:10:23:45 +0000] "GET /api/v1/users HTTP/1.1" 200 1234
192.168.1.100 - admin [12/Mar/2026:10:23:46 +0000] "POST /login HTTP/1.1" 401 89
10.50.1.20 - - [12/Mar/2026:10:23:47 +0000] "GET /health HTTP/1.1" 200 15
172.16.0.50 - evan [12/Mar/2026:10:23:48 +0000] "DELETE /api/v1/users/5 HTTP/1.1" 403 201
fe80::1 - - [12/Mar/2026:10:23:49 +0000] "GET /metrics HTTP/1.1" 200 8492
MAC: 14:F6:D8:7B:31:80 assigned to VLAN 10
MAC: 98:BB:1E:1F:A7:13 assigned to VLAN 20
error: connection refused to 10.50.1.50:389
warning: certificate expires in 30 days
ERROR: authentication failed for user 'admin'
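
To replicate the drills in a terminal, the test data has to live in a file first. A minimal sketch, assuming a scratch file from mktemp (the variable name LOG is hypothetical; the drills refer to the same data as file):

```shell
# Write the sample log to a temporary file.
# The quoted 'EOF' keeps $, \ and quotes in the data literal.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
10.50.1.1 - - [12/Mar/2026:10:23:45 +0000] "GET /api/v1/users HTTP/1.1" 200 1234
192.168.1.100 - admin [12/Mar/2026:10:23:46 +0000] "POST /login HTTP/1.1" 401 89
10.50.1.20 - - [12/Mar/2026:10:23:47 +0000] "GET /health HTTP/1.1" 200 15
172.16.0.50 - evan [12/Mar/2026:10:23:48 +0000] "DELETE /api/v1/users/5 HTTP/1.1" 403 201
fe80::1 - - [12/Mar/2026:10:23:49 +0000] "GET /metrics HTTP/1.1" 200 8492
MAC: 14:F6:D8:7B:31:80 assigned to VLAN 10
MAC: 98:BB:1E:1F:A7:13 assigned to VLAN 20
error: connection refused to 10.50.1.50:389
warning: certificate expires in 30 days
ERROR: authentication failed for user 'admin'
EOF

# Sanity checks using ERE only, so they run even without grep -P.
grep -c '' "$LOG"                               # number of lines
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' "$LOG"   # every IPv4 address
grep -oE 'VLAN [0-9]+' "$LOG"                   # VLAN assignments
```

With the file in place, each drill command can be run by substituting "$LOG" for file.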

Drill 1: IP Addresses

# Match any IPv4 address
\b\d{1,3}(?:\.\d{1,3}){3}\b

# Terminal:
grep -oP '\b\d{1,3}(?:\.\d{1,3}){3}\b' file

Drill 2: 10.x.x.x Network Only

# Only 10.x.x.x addresses
\b10(?:\.\d{1,3}){3}\b

# Terminal:
grep -oP '\b10(?:\.\d{1,3}){3}\b' file

Drill 3: MAC Addresses

# Standard MAC format (colon-separated)
\b([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b

# Terminal:
grep -oP '\b([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b' file

Drill 4: HTTP Status Codes

# 4xx errors
" (4\d{2}) "

# Terminal:
grep -oP '" \K4\d{2}(?= )' file

# All status codes
grep -oP '" \K[0-9]{3}(?= )' file

Drill 5: Usernames (not -)

# Capture username field (third field, when it is not "-")
 - (\w+) \[

# Terminal (\w+ cannot match "-", so no extra filter is needed):
grep -oP ' - \K\w+(?= \[)' file

Drill 6: HTTP Methods

# Extract HTTP method (the token right after the opening quote)
"(GET|POST|PUT|DELETE|PATCH)\b

# Terminal:
grep -oP '"\K(GET|POST|PUT|DELETE|PATCH)' file

Drill 7: Request Paths

# Path after method
"(?:GET|POST|PUT|DELETE|PATCH) ([^ ]+)

# Terminal:
grep -oP '"(?:GET|POST|PUT|DELETE|PATCH) \K[^ ]+' file

Drill 8: Log Levels (Case-Insensitive)

# Match error/warning/ERROR/WARNING etc.
(?i)\b(error|warn(?:ing)?|fatal|critical)\b

# Terminal:
grep -iP '\b(error|warn(ing)?|fatal|critical)\b' file

Drill 9: Port Numbers

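# Extract port after IP:port
\d{1,3}(?:\.\d{1,3}){3}:(\d{1,5})\b

# Terminal (anchoring on the IP avoids matching timestamps and MAC octets):
grep -oP '\d{1,3}(?:\.\d{1,3}){3}:\K\d{1,5}\b' file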

Drill 10: VLAN IDs

# VLAN followed by number
VLAN\s+(\d+)

# Terminal:
grep -oP 'VLAN\s+\K\d+' file

Drill 11: Date Extraction

# Extract date from log brackets
\[(\d{2}/\w{3}/\d{4})

# Terminal:
grep -oP '\[\K\d{2}/\w{3}/\d{4}' file

Drill 12: Failed Auth Lines

# Lines with 401 or 403 status
" (40[13]) "

# Full line extraction:
grep -P '" 40[13] ' file

Drill 13: Extract Quoted Strings

# Content inside double quotes
"([^"]+)"

# Terminal:
grep -oP '"[^"]+"' file

Drill 14: IPv6 Detection

# Simple IPv6 pattern (link-local example)
fe80::[0-9a-fA-F:]+

# Terminal:
grep -oP 'fe80::[0-9a-fA-F:]+' file

Drill 15: Certificate Expiry Days

# Extract number from "expires in X days"
expires in (\d+) days

# Terminal:
grep -oP 'expires in \K\d+(?= days)' file

Network

# IPv4
\b\d{1,3}(?:\.\d{1,3}){3}\b

# IPv4 with CIDR
\b\d{1,3}(?:\.\d{1,3}){3}/\d{1,2}\b

# MAC (colon)
\b([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b

# MAC (hyphen - Windows style)
\b([0-9A-Fa-f]{2}-){5}[0-9A-Fa-f]{2}\b

# Port number
\b([0-9]{1,5})\b

# FQDN
\b[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z]{2,})+\b

# URL
https?://[^\s<>"]+

Security

# JWT token (3 base64 parts)
eyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+

# API key patterns (generic)
[A-Za-z0-9]{32,}

# AWS Access Key
AKIA[0-9A-Z]{16}

# Private key header
-----BEGIN [A-Z ]+ PRIVATE KEY-----

# Password in URL (security audit)
://[^:]+:([^@]+)@

Logs

# Syslog timestamp
[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}

# ISO 8601
\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}

# Apache/Nginx combined log (full line)
^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+)

# Error levels
\b(EMERG|ALERT|CRIT|ERR|WARN|NOTICE|INFO|DEBUG)\b

Config Files

# Key=Value
^(\w+)\s*=\s*(.+)$

# YAML key: value
^(\s*)(\w+):\s*(.+)?$

# Comment lines (# or //)
^\s*(#|//).*$

# Empty or whitespace-only lines
^\s*$

# INI section headers
^\[([^\]]+)\]$
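
These patterns can be tried against a small sample. The sketch below assumes a throwaway INI-style file (its contents and the variable name CONF are invented for the example) and uses POSIX classes so it runs with plain grep -E and sed -E:

```shell
# Sample config used only for this illustration.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
# database settings
[database]
host = db.example.com
port=5432

// legacy comment
[cache]
ttl = 300
EOF

# Drop comment lines and blank lines.
grep -Ev '^[[:space:]]*(#|//|$)' "$CONF"

# List section names.
sed -nE 's/^\[([^]]+)\]$/\1/p' "$CONF"

# Print key/value pairs as "key -> value".
sed -nE 's/^([[:alnum:]_]+)[[:space:]]*=[[:space:]]*(.+)$/\1 -> \2/p' "$CONF"
```

Inside a bracket expression a literal ] must come first, hence [^]]+ rather than the PCRE-style [^\]]+.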

sed Examples

# Basic substitution
sed 's/old/new/' file

# Global (all occurrences on line)
sed 's/old/new/g' file

# Using capture groups
sed -E 's/([0-9]+)/[\1]/' file           # 123 -> [123]

# Swap two words
sed -E 's/(\w+) (\w+)/\2 \1/' file

# Delete lines matching pattern
sed '/pattern/d' file

# Print only matching lines (like grep)
sed -n '/pattern/p' file

# In-place edit
sed -i 's/old/new/g' file
sed -i.bak 's/old/new/g' file    # With backup

awk Examples

# Match pattern
awk '/pattern/ {print}' file

# Match and extract field
awk '/error/ {print $1, $NF}' file

# Regex on specific field
awk '$1 ~ /^10\./ {print}' file           # First field starts with 10.

# Negative match
awk '$1 !~ /^192\./ {print}' file         # First field NOT starting with 192.

# gsub (global substitution)
awk '{gsub(/old/, "new"); print}' file

# sub (first occurrence only)
awk '{sub(/old/, "new"); print}' file

# Match and extract with gensub (GNU awk)
awk '{print gensub(/.*:([0-9]+).*/, "\\1", "g")}' file

Escaping in Different Contexts

# Shell escaping - use single quotes for regex
grep 'pattern with $special' file         # $ is literal
grep "pattern with $special" file         # $ is shell variable!

# BRE vs ERE escaping
grep 'a\+b' file      # BRE: one or more a
grep -E 'a+b' file    # ERE: same thing

# Escaping in character class
grep '[.?*]' file     # Metacharacters are literal inside []
grep '[^abc]' file    # ^ means NOT only at start of []

Greedy Matching Trap

# Problem: Greedy .* matches too much
echo '<div>one</div><div>two</div>' | grep -oP '<div>.*</div>'
# Output: <div>one</div><div>two</div>  (one match, not two!)

# Solution: Lazy .*?
echo '<div>one</div><div>two</div>' | grep -oP '<div>.*?</div>'
# Output: <div>one</div>
#         <div>two</div>

# Alternative: Negated class (works in BRE/ERE too)
echo '<div>one</div><div>two</div>' | grep -oE '<div>[^<]*</div>'

Word Boundary Differences

# PCRE word boundary
grep -P '\bword\b' file

# GNU BRE/ERE word boundary
grep '\<word\>' file

# Portable ERE alternative (no \b or \< needed; the match includes the neighboring character)
grep -E '(^|[^a-zA-Z])word($|[^a-zA-Z])' file

Newline Handling

# . does NOT match newline by default
# Use -z for null-delimited (multiline) in grep

# Match across lines (GNU grep): -z reads the file as one record, (?s) lets . match newline
grep -Pzo '(?s)start.*?end' file

# In sed, use N to read next line
sed 'N;s/line1\nline2/replaced/' file

Atomic Groups and Possessive Quantifiers (PCRE)

These prevent backtracking - once matched, the engine won’t give up characters. Critical for performance and avoiding catastrophic backtracking.

Possessive Quantifiers

Add + after any quantifier to make it possessive:

Greedy  Possessive  Behavior
*       *+          0+ chars, no backtrack
+       ++          1+ chars, no backtrack
?       ?+          0 or 1, no backtrack
{n,m}   {n,m}+      n-m chars, no backtrack

# Compare greedy vs possessive
echo "aaaaaaaaab" | grep -P 'a+b'     # Matches - greedy backtracks
echo "aaaaaaaaab" | grep -P 'a++b'    # Matches - no backtrack needed

echo "aaaaaaaaaa" | grep -P 'a+b'     # No match (backtracks, tries, fails)
echo "aaaaaaaaaa" | grep -P 'a++b'    # No match (fails fast, no backtrack)

Atomic Groups

(?>pattern) - Once matched, contents cannot be backtracked into.

# Atomic group syntax
(?>pattern)

# Example: Match an integer with an optional fractional part
echo "3.14" | grep -oP '(?>\d+)\.?\d*'
# Output: 3.14 (atomic group matches "3" and never gives it back, then ".14" follows)

# Without atomic group - ambiguity
echo "3.14" | grep -oP '(\d+)\.?\d*'
# Also matches, but engine may backtrack unnecessarily

When to Use

# Pattern that causes catastrophic backtracking
# DON'T: (a+)+ against "aaaaaaaaaaaaaaaaaaaaX"

# DO: Use possessive or atomic
(?>a+)+     # Atomic group prevents inner backtrack
a++         # Possessive on inner quantifier

# Real example: Matching quoted strings efficiently
# DON'T: ".*"  (backtracks excessively on non-matches)
# DO:    "[^"]*"  (no backtracking possible)
# OR:    "(?>[^"]*)"  (atomic for complex inner patterns)

Recursive Patterns (PCRE)

Match nested structures like parentheses, HTML tags, JSON.

Basic Recursion

(?R)        # Recurse entire pattern
(?0)        # Same as (?R)
(?1)        # Recurse first capture group
(?2)        # Recurse second capture group
(?&name)    # Recurse named group
(?P>name)   # Python-style named recursion

Matching Balanced Parentheses

# Match balanced parens: (), (()), ((())), etc.
\((?:[^()]*|(?R))*\)

# Breakdown:
# \(        opening paren
# (?:       non-capture group for contents:
#   [^()]*    any non-paren chars
#   |         OR
#   (?R)      recurse entire pattern (nested parens)
# )*        zero or more content items
# \)        closing paren

# Test:
echo "(a(b(c)d)e)" | grep -oP '\((?:[^()]*|(?R))*\)'
# Output: (a(b(c)d)e)

echo "((nested))" | grep -oP '\((?:[^()]*|(?R))*\)'
# Output: ((nested))

Matching Nested Structures

# Match balanced braces (JSON-like)
\{(?:[^{}]*|(?R))*\}

# Match balanced brackets
\[(?:[^\[\]]*|(?R))*\]

# Match balanced angle brackets (XML-like)
<(?:[^<>]*|(?R))*>

Group Recursion

# Recurse specific group instead of whole pattern
# (?1) recurses group 1

# Pattern: word = (nested stuff)
(\w+)\s*=\s*(\((?:[^()]*|(?2))*\))

# (?2) recurses only the paren-matching group
echo "config = ((a)(b))" | grep -oP '(\w+)\s*=\s*(\((?:[^()]*|(?2))*\))'
# Output: config = ((a)(b))

Named Group Recursion

# Define pattern once, recurse by name
(?<parens>\((?:[^()]*|(?&parens))*\))

# Match: expression (with (nested) parens)
\w+\s*(?<parens>\((?:[^()]*|(?&parens))*\))

echo "func((a,b),(c,d))" | grep -oP '\w+(?<parens>\((?:[^()]*|(?&parens))*\))'
# Output: func((a,b),(c,d))

Conditional Patterns (PCRE)

Match different patterns based on conditions.

Syntax

(?(condition)yes-pattern|no-pattern)
(?(condition)yes-pattern)    # No-pattern is empty match

Conditional on Capture Group

# (?(1)...|...) - true if group 1 matched
# Match optional opening paren, require closing if present

(\()?[a-z]+(?(1)\))

# Breakdown:
# (\()?     optional capture of opening paren (group 1)
# [a-z]+    one or more letters
# (?(1)\))  IF group 1 matched, require closing paren

echo "abc" | grep -oP '(\()?[a-z]+(?(1)\))'
# Output: abc (no parens needed)

echo "(abc)" | grep -oP '(\()?[a-z]+(?(1)\))'
# Output: (abc) (parens balanced)

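echo "(abc" | grep -oP '(\()?[a-z]+(?(1)\))'
# Output: abc (unanchored, so the match simply starts after the "(")

echo "(abc" | grep -P '^(\()?[a-z]+(?(1)\))$'
# No output (anchored: an opening paren without a closing one is rejected)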

Conditional on Lookahead

# (?(?=condition)yes|no)
# Match based on what follows

# If the next character is a digit, match a number; otherwise match a word
(?(?=\d)\d+|\w+)

echo "abc 123" | grep -oP '(?(?=\d)\d+|\w+)'
# Output: abc
#         123

Practical Examples

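# Match phone with optional country code
# If the next character is +, require a country code prefix; else go straight to the local pattern
# (a condition must be a group reference or a lookaround, so the test is written as (?=\+))
(?(?=\+)\+\d{1,3}[-\s]?)\d{3}[-\s]?\d{3}[-\s]?\d{4}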

# Match quoted or unquoted value
# (")? captures optional quote, (?(1)") requires closing if present
(")?[^",]+(?(1)")

echo 'field,"quoted value",plain' | grep -oP '(")?[^",]+(?(1)")'
# Output: field
#         "quoted value"
#         plain

Branch Reset Groups (PCRE)

(?|...) - Alternatives share the same group numbers. Useful when different patterns should populate the same capture group.

Without Branch Reset

# Normal alternation - different group numbers
(cat)|(dog)|(bird)

# "cat"  → $1 = "cat",  $2 = undef, $3 = undef
# "dog"  → $1 = undef,  $2 = "dog", $3 = undef
# "bird" → $1 = undef,  $2 = undef, $3 = "bird"

With Branch Reset

# Branch reset - same group number for all alternatives
(?|(cat)|(dog)|(bird))

# "cat"  → $1 = "cat"
# "dog"  → $1 = "dog"
# "bird" → $1 = "bird"

# All alternatives populate group 1!
echo -e "cat\ndog\nbird" | grep -oP '(?|(cat)|(dog)|(bird))'
# Output: cat
#         dog
#         bird

Practical Use Case

# Extract date in multiple formats, normalize to one group
# Format 1: 2026-03-14  (ISO)
# Format 2: 03/14/2026  (US)
# Format 3: 14.03.2026  (EU)

(?|(\d{4})-(\d{2})-(\d{2})|(\d{2})/(\d{2})/(\d{4})|(\d{2})\.(\d{2})\.(\d{4}))

# This is messy - the captures don't align semantically
# Better to extract and post-process, but branch reset helps

# Extract IP:port or just IP (port optional)
(?|(\d+\.\d+\.\d+\.\d+):(\d+)|(\d+\.\d+\.\d+\.\d+)())

# Group 1 = IP, Group 2 = port (or empty)

Mode Modifiers (PCRE)

Change regex behavior inline. Apply to whole pattern or sections.

Global Modifiers

Modifier  Name                     Effect
(?i)      Case insensitive         a matches a or A
(?m)      Multiline                ^ and $ match line boundaries
(?s)      Single-line (dotall)     . matches newline too
(?x)      Extended (free-spacing)  Whitespace ignored, # comments allowed
(?U)      Ungreedy                 Quantifiers lazy by default

Inline Modifier Syntax

# Apply to entire pattern (at start)
(?i)pattern           # Case insensitive

# Apply to section only
normal(?i)insensitive(?-i)normal_again

# Multiple modifiers
(?im)pattern          # Case insensitive + multiline

# Negative (turn off)
(?-i)                 # Turn OFF case insensitivity

Case Insensitive (?i)

# Match ERROR, error, Error, etc.
(?i)error

echo -e "ERROR\nerror\nError" | grep -P '(?i)error'
# Matches all three

# Same as grep -i:
grep -iP 'error' file

# Apply to portion only:
(?i:error)\s+(\d+)    # "ERROR 404", "error 500"; everything outside (?i:...) stays case-sensitive

Multiline (?m)

# Without (?m): ^ matches start of STRING only
# With (?m): ^ matches start of each LINE

# Match lines starting with #
(?m)^#.*$

# Equivalent to grep's default behavior (line-oriented)
# Useful in contexts where input is one big string

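# Example in Perl one-liner (-0777 slurps the whole input into one string,
# so (?m) is what lets ^ and $ match at each line):
echo -e "line1\n#comment\nline2" | perl -0777 -ne 'print "$1\n" while /(?m)^(#.*)$/g'
# Output: #comment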

Single-line/Dotall (?s)

# Without (?s): . matches any char EXCEPT newline
# With (?s): . matches newline too

# Match everything between START and END, including newlines
(?s)START.*?END

# Example:
echo -e "START\nmulti\nline\nEND" | grep -Pzo '(?s)START.*?END'
# Output: START
#         multi
#         line
#         END

# grep -z splits records on NUL bytes instead of newlines, so an ordinary text file is read as one record

Extended (?x) (Free-Spacing)

# Whitespace ignored, # starts comments
# Allows readable complex patterns

(?x)
  ^                    # Start of line
  (\d{3})              # Area code
  [-.\s]?              # Optional separator
  (\d{3})              # Exchange
  [-.\s]?              # Optional separator
  (\d{4})              # Subscriber
  $                    # End of line

# Equivalent to:
^(\d{3})[-.\s]?(\d{3})[-.\s]?(\d{4})$

# In grep (must escape or use single line):
grep -P '(?x) ^ (\d{3}) [-.\s]? (\d{3}) [-.\s]? (\d{4}) $' file

Ungreedy (?U)

# Without (?U): quantifiers greedy by default
# With (?U): quantifiers lazy by default

# Match minimal between tags
(?U)<tag>.*</tag>

# Same as:
<tag>.*?</tag>

# With (?U), .* is lazy, .*? becomes greedy (inverted!)

Unicode Properties (PCRE)

Match characters by Unicode category, script, or property. Requires PCRE2 or Perl with Unicode support.

General Categories

\p{L}    # Any letter (any script)
\p{Ll}   # Lowercase letter
\p{Lu}   # Uppercase letter
\p{N}    # Any number
\p{Nd}   # Decimal digit
\p{P}    # Punctuation
\p{S}    # Symbol
\p{Z}    # Separator (space, line, paragraph)
\p{C}    # Control/format/private use

\P{L}    # NOT a letter (uppercase P negates)

Script Matching

\p{Latin}      # Latin script
\p{Greek}      # Greek script
\p{Cyrillic}   # Cyrillic script
\p{Han}        # Chinese characters
\p{Hiragana}   # Japanese hiragana
\p{Katakana}   # Japanese katakana
\p{Arabic}     # Arabic script
\p{Hebrew}     # Hebrew script

# Long form:
\p{Script=Latin}
\p{Script=Greek}

# Match word in any script:
[\p{L}\p{M}]+  # Letters + combining marks

Practical Examples

# Match any letter (international)
echo "Héllo Wörld 你好" | grep -oP '\p{L}+'
# Output: Héllo
#         Wörld
#         你好

# Match email with international characters
[\p{L}\p{N}._%+-]+@[\p{L}\p{N}.-]+\.\p{L}{2,}

# Match only ASCII letters (not international)
[A-Za-z]+
# vs any letter:
\p{L}+

# Detect non-ASCII characters (security audit)
[^\x00-\x7F]
# Or (binary properties need a recent PCRE2; older builds reject them):
\P{ASCII}

# Match emoji (basic)
[\x{1F300}-\x{1F9FF}]
# Or by property (same PCRE2 version caveat):
\p{Emoji}

Unicode Character Classes

# Combining marks (accents, etc.)
\p{M}      # Any mark
\p{Mn}     # Non-spacing mark
\p{Mc}     # Spacing combining mark

# Letter + marks (proper word matching)
[\p{L}\p{M}]+

# Example: Match accented words
echo "café naïve résumé" | grep -oP '[\p{L}\p{M}]+'
# Output: café
#         naïve
#         résumé

Common Unicode Gotchas

# \w may NOT match non-ASCII letters, depending on locale and grep/PCRE build
echo "Müller" | LC_ALL=C grep -oP '\w+'
# Output: M
#         ller   (ü not matched in the C locale)

# Fix: Use \p{L} (in a UTF-8 locale) or enable Unicode mode
echo "Müller" | grep -oP '[\p{L}\p{N}_]+'
# Output: Müller

# PCRE: (*UCP) at pattern start makes \w, \d, \b Unicode-aware
# (Perl itself controls this with the /u and /a regex flags)
echo "Müller" | grep -oP '(*UCP)\w+'
# Output: Müller

# Byte vs character length
echo "café" | wc -c   # 6 bytes (é = 2 bytes in UTF-8, plus the trailing newline)
echo "café" | wc -m   # 5 characters (including the trailing newline)

Regex Engine Internals

Understanding NFA vs DFA and catastrophic backtracking.

NFA vs DFA Engines

Feature           NFA / backtracking (grep -P, Perl)  DFA (grep -E, awk)
Backreferences    Yes                                  No (GNU grep -E accepts them but falls back to a slower matcher)
Lookaround        Yes                                  No
Lazy quantifiers  Yes                                  No
Atomic groups     Yes                                  No
Speed guarantee   No (can backtrack)                   Yes (linear time)
Tools             grep -P, Perl, Python                grep -E, egrep, awk

Catastrophic Backtracking

# The evil pattern: nested quantifiers with overlap
(a+)+

# Against "aaaaaaaaaaaaaaaaaaaaX":
# Engine tries every possible way to divide the a's
# 2^n combinations = exponential time = hang/crash

# Worse: (a*)*  or  (a+)*  or  (a|aa)+

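# Demonstration (DON'T run on long strings):
# echo "aaaaaaaaaaaaaaaaaaaaaaaaX" | grep -P '(a+)+b'
# A naive backtracking engine takes exponential time here; PCRE may
# short-circuit the match or stop with a backtracking-limit error instead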

# Real-world examples of vulnerable patterns:
(.*a)+           # "aaaa...X" causes backtracking
(x+x+)+y         # Overlapping x's
(\w+)*           # Any word char, nested quantifiers

Fixing Catastrophic Backtracking

# Method 1: Possessive quantifiers
(a+)+     # BAD - catastrophic
(a++)     # GOOD - possessive inner prevents backtrack

# Method 2: Atomic groups
(a+)+     # BAD
(?>a+)    # GOOD - atomic group

# Method 3: Eliminate alternation overlap
(a|aa)+   # BAD - overlapping alternatives
a+        # GOOD - just match all a's

# Method 4: Use negated character class instead of .*
".*"          # BAD with nested quotes
"[^"]*"       # GOOD - can't backtrack

# Method 5: Anchor patterns
.*foo         # Potentially slow on long non-matching lines (retried from every start position)
^.*foo        # Better - attempted once, from the start of the line

ReDoS (Regular Expression Denial of Service)

# Vulnerable patterns to NEVER use with untrusted input:
(a+)+
([a-zA-Z]+)*
(a|aa)+
(.*a){n}     # Where n is significant

# Security audit: Find vulnerable patterns in code (heuristic, expect false positives)
grep -rP '\([^)]*[+*]\)[+*]' --include="*.py" .
grep -rP '\.\*[^?]' --include="*.py" .

# Safe alternatives:
# 1. Use atomic groups/possessive quantifiers
# 2. Set regex timeout in your language
# 3. Limit input length before regex
# 4. Use DFA engine for untrusted input
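
Alternatives 2 to 4 can be combined in a thin wrapper around grep. The sketch below is one possible arrangement, not a hardened implementation: the function name safe_grep and the limits (1000 characters, 2 seconds) are arbitrary assumptions, and timeout is the GNU coreutils utility, which may be missing on other systems.

```shell
# safe_grep PATTERN: read untrusted lines on stdin, cap each line at
# 1000 characters, and give grep at most 2 seconds of wall-clock time.
# grep -E is used instead of grep -P so matching does not backtrack.
# Exit status 124 from timeout(1) means the time limit was hit.
safe_grep() {
  cut -c1-1000 | timeout 2 grep -E -- "$1"
}

printf 'aaaaaaaaaaaaaaaaaaaaaaaaX\nab\n' | safe_grep '(a+)+b'
# prints: ab
```

The nested-quantifier pattern is harmless here because the ERE matcher does not explore every way of splitting the run of a's.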

Performance Tips

# 1. Anchor when possible
^pattern      # Much faster than searching entire line
pattern$      # Fails fast if line end doesn't match

# 2. Most specific first (backtracking engines try alternatives left to right)
(cat|catch)   # Matches only "cat" inside "catch" unless the rest of the pattern forces a retry
(catch|cat)   # Better - longer/specific first

# 3. Avoid .* at pattern start
.*foo         # Scans entire string, then backtracks to find "foo"
[^f]*foo      # Cheaper - stops at first 'f' (not equivalent when anchored: it rejects an earlier 'f')

# 4. Use non-capturing groups when capture not needed
(?:abc)+      # Faster than (abc)+

# 5. Character class vs alternation
[aeiou]       # Faster
(a|e|i|o|u)   # Slower (creates capture group + alternation overhead)

# 6. Pre-filter with fast grep before complex regex
grep 'error' file | grep -P 'complex(?=pattern)'

Drills: Expert Level

These patterns use advanced PCRE features.

Drill 16: Balanced Parentheses Validation

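# Match only strings that consist of one balanced paren group
# (?1) recurses group 1 only; (?R) would re-apply ^ and $ and reject nesting
^(\((?:[^()]*|(?1))*\))$

# Test data
echo -e "(valid)\n((nested))\n((broken)\n()()" | while IFS= read -r line; do
  echo -n "$line -> "
  echo "$line" | grep -qP '^(\((?:[^()]*|(?1))*\))$' && echo "VALID" || echo "INVALID"
done
# Output: (valid) -> VALID
#         ((nested)) -> VALID
#         ((broken) -> INVALID
#         ()() -> INVALID (two groups side by side, not one)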

Drill 17: Atomic IP Validation

# Efficient IP validation with atomic groups
^(?>(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?>25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$

# Atomic groups prevent backtracking on invalid IPs

# Test:
echo -e "192.168.1.1\n256.1.1.1\n10.0.0.1\nabc" | grep -P '^(?>(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?>25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$'

Drill 18: Password Strength (Multiple Lookaheads)

# At least: 12 chars, 1 upper, 1 lower, 1 digit, 1 special
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{12,}$

# Extended with Unicode (international passwords):
^(?=.*\p{Ll})(?=.*\p{Lu})(?=.*\d)(?=.*[!@#$%^&*]).{12,}$

# Test:
echo -e "weak\nStrongPass1!\nValidPass123!" | grep -P '^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*]).{12,}$'

Drill 19: Semver Parsing

# Semantic versioning: MAJOR.MINOR.PATCH(-prerelease)?(+build)?
^(?<major>0|[1-9]\d*)\.(?<minor>0|[1-9]\d*)\.(?<patch>0|[1-9]\d*)(?:-(?<prerelease>[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?(?:\+(?<build>[0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?$

# Simpler version:
^\d+\.\d+\.\d+(-[a-zA-Z0-9.]+)?(\+[a-zA-Z0-9.]+)?$

# Test:
echo -e "1.2.3\n1.0.0-alpha\n2.1.0+build.123\n1.0.0-beta+exp.sha.5114f85" | \
  grep -P '^\d+\.\d+\.\d+(-[a-zA-Z0-9.]+)?(\+[a-zA-Z0-9.]+)?$'

Drill 20: RFC 5322 Email (Simplified)

# Practical email regex (not full RFC 5322, but handles 99%)
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

# With Unicode support:
^[\p{L}\p{N}._%+-]+@[\p{L}\p{N}.-]+\.\p{L}{2,}$

# More complete (handles quoted local parts):
^(?:[a-zA-Z0-9._%+-]+|"[^"]+")@[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?)*\.[a-zA-Z]{2,}$

Drill 21: JSON Key-Value Extraction

# Extract "key": "value" pairs from JSON
"([^"]+)":\s*"([^"]+)"

# With possible escapes in values:
"([^"]+)":\s*"((?:[^"\\]|\\.)*)"

# Test:
echo '{"name": "John", "city": "New York"}' | grep -oP '"([^"]+)":\s*"\K[^"]+'
# Output: John
#         New York

Drill 22: Nested HTML Tag Matching

# Match balanced <div>...</div> with nesting
<div\b[^>]*>(?:[^<]*|<(?!/?div\b)|(?R))*</div>

# Explanation:
# <div\b[^>]*>   opening <div> tag with attributes
# (?:            non-capture alternation:
#   [^<]*          text (no tags)
#   |<(?!/?div\b)  tag that's not div (using negative lookahead)
#   |(?R)          recurse for nested div
# )*             zero or more
# </div>         closing tag

# Note: Real HTML parsing should use proper parser, not regex

Drill 23: Conditionals - Optional Sections

# Match log entries: IP (optional user) timestamp message
# If user present (not -), capture it

^(\d+\.\d+\.\d+\.\d+)\s+(-|\w+)\s+\[([^\]]+)\]\s+(.+)$

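# Capture the user only when it is a real name (group 2 stays unset for "-"),
# so a later conditional such as (?(2)...) can branch on it:
^(\d+\.\d+\.\d+\.\d+)\s+(?:(\w+)|-)\s+\[([^\]]+)\]\s+(.+)$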

# Test data:
# 10.0.0.1 evan [2026-03-14] message
# 10.0.0.1 - [2026-03-14] message

Drill 24: Mode Modifier Scoping

# Case-insensitive match, but capture preserves case
(?i)error:\s*(?-i)([A-Z0-9_]+)

# This matches "ERROR: FILE_NOT_FOUND" or "error: FILE_NOT_FOUND"
# But $1 only captures uppercase codes

echo -e "ERROR: TEST_123\nerror: REAL_CODE" | grep -oP '(?i)error:\s*(?-i)([A-Z0-9_]+)' | grep -oP '[A-Z0-9_]+$'
# Output: TEST_123
#         REAL_CODE

Drill 25: Branch Reset for Format Normalization

# Match phone in multiple formats, normalize groups
# (?| branch reset - all alternatives use same group numbers

(?|(\d{3})-(\d{3})-(\d{4})|(\d{3})\.(\d{3})\.(\d{4})|\((\d{3})\)\s*(\d{3})-(\d{4}))

# Group 1,2,3 = area, exchange, subscriber (regardless of format)

echo -e "555-123-4567\n555.123.4567\n(555) 123-4567" | grep -oP '(?|(\d{3})-(\d{3})-(\d{4})|(\d{3})\.(\d{3})\.(\d{4})|\((\d{3})\)\s*(\d{3})-(\d{4}))'

Real-World Complex Patterns

Production-ready patterns with explanations.

URL Parsing

# Full URL with capture groups
^(?<scheme>https?|ftp)://(?<host>[^:/\s]+)(?::(?<port>\d+))?(?<path>/[^\s?#]*)?(?:\?(?<query>[^\s#]*))?(?:#(?<fragment>\S*))?$

# Simplified:
^(https?|ftp)://([^:/\s]+)(:\d+)?(/[^\s?#]*)?(\?[^\s#]*)?(#\S*)?$

# Example extraction:
URL="https://api.example.com:8443/v1/users?page=1#section"
echo "$URL" | grep -oP '(?<=://)[^:/]+'  # Host: api.example.com
echo "$URL" | grep -oP '(?<=:)\d+'        # Port: 8443
echo "$URL" | grep -oP '://[^/]+\K/[^?#]*'  # Path: /v1/users

Log Parsing (Apache Combined)

# Apache combined log format
^(?<ip>\S+)\s+\S+\s+(?<user>\S+)\s+\[(?<time>[^\]]+)\]\s+"(?<method>\S+)\s+(?<path>\S+)\s+(?<proto>[^"]+)"\s+(?<status>\d+)\s+(?<bytes>\d+)\s+"(?<referrer>[^"]*)"\s+"(?<agent>[^"]*)"$

# Extract 4xx/5xx errors with response time > 1000ms
# (Assuming log has response time at end)
^(\S+).*" [45]\d{2} .* (\d{4,})$

Credit Card Masking

# Match credit card numbers (various formats)
\b(?:\d{4}[-\s]?){3}\d{4}\b

# Mask all but last 4 digits (needs lookahead, so use perl; sed has no \d or (?=...))
echo "4111-1111-1111-1234" | perl -pe 's/\d(?=(?:[-\s]?\d){4})/X/g'
# Output: XXXX-XXXX-XXXX-1234

# Validate Luhn checksum requires code, not just regex
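
As a rough illustration of the code that the regex cannot replace, here is a Luhn check in POSIX shell. The function name luhn_valid is hypothetical, and the sketch only verifies the checksum, not card length or issuer prefixes.

```shell
# luhn_valid NUMBER: succeed if the digits of NUMBER pass the Luhn checksum.
# Spaces and hyphens are stripped first. Runs in a subshell, so the
# helper variables do not leak into the caller.
luhn_valid() (
  digits=$(printf '%s' "$1" | tr -d ' -')
  case $digits in
    ''|*[!0-9]*) exit 1 ;;      # reject empty input or non-digit characters
  esac
  sum=0
  double=0                      # the rightmost digit is not doubled
  while [ -n "$digits" ]; do
    d=${digits#"${digits%?}"}   # last character
    digits=${digits%?}          # drop it
    if [ "$double" -eq 1 ]; then
      d=$((d * 2))
      if [ "$d" -gt 9 ]; then d=$((d - 9)); fi
    fi
    sum=$((sum + d))
    double=$((1 - double))      # alternate between plain and doubled digits
  done
  [ $((sum % 10)) -eq 0 ]
)

luhn_valid '4111-1111-1111-1111' && echo "4111-1111-1111-1111: checksum ok"
luhn_valid '4111-1111-1111-1234' || echo "4111-1111-1111-1234: checksum failed"
```

A typical pipeline would use the regex to find candidates and this check to discard 16-digit numbers that are not card numbers.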

Security: Secret Detection

# AWS Access Key ID
\b(AKIA[0-9A-Z]{16})\b

# AWS Secret Key (40 char base64)
\b([A-Za-z0-9+/]{40})\b

# GitHub Personal Access Token
\bghp_[A-Za-z0-9]{36}\b

# Generic API Key (32+ hex or alphanumeric)
\b[A-Za-z0-9]{32,}\b

# JWT Token
\beyJ[A-Za-z0-9_-]+\.eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\b

# Private Key
-----BEGIN\s+(RSA|EC|OPENSSH|DSA|ENCRYPTED)?\s*PRIVATE\s+KEY-----

# Combined secrets scanner:
grep -rP '(AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36}|-----BEGIN.*PRIVATE KEY-----)' .
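
The exit status of grep makes such a scanner easy to use as a gate in scripts. A minimal sketch, with a hypothetical function name scan_secrets and a throwaway directory; it uses grep -E because this particular pattern needs no PCRE features, and the key shown is the placeholder value commonly used in AWS documentation:

```shell
# scan_secrets DIR: print file:line:match for likely secrets under DIR.
# Exit status is 0 if anything was found, 1 if the tree looks clean.
scan_secrets() {
  grep -rnE 'AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36}|-----BEGIN.*PRIVATE KEY-----' -- "$1"
}

# Demo tree with one placeholder key and one harmless file.
DEMO=$(mktemp -d)
printf 'aws_key = AKIAIOSFODNN7EXAMPLE\n' > "$DEMO/settings.ini"
printf 'nothing to see here\n' > "$DEMO/readme.txt"

if scan_secrets "$DEMO"; then echo "secrets found"; else echo "clean"; fi
```

Pattern-based scanning produces both false positives and misses, so a hit is a prompt for review rather than proof of a leak.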

Network: ACL/Firewall Rules

# Cisco ACL parsing
^(?<action>permit|deny)\s+(?<proto>ip|tcp|udp|icmp)\s+(?<src>\S+)\s+(?<srcwc>\S+)\s+(?<dst>\S+)\s+(?<dstwc>\S+)(?:\s+eq\s+(?<port>\d+))?

# iptables rule extraction
-A\s+(?<chain>\w+)\s+(?:-s\s+(?<src>\S+)\s+)?(?:-d\s+(?<dst>\S+)\s+)?.*-j\s+(?<action>ACCEPT|DROP|REJECT)

# Extract source IPs from deny rules
# (\K instead of lookbehind: PCRE lookbehind cannot contain unbounded quantifiers like \s+)
grep -oP 'deny\s+ip\s+\K\S+(?=\s)' acl.txt

Config File Validation

# YAML key-value (basic)
^(\s*)([a-zA-Z_][a-zA-Z0-9_-]*):\s*(.*)$

# INI file section and key
^\s*\[([^\]]+)\]|^\s*([^=\s]+)\s*=\s*(.*)$

# systemd unit file
^\[(?<section>\w+)\]|^(?<key>[A-Z][a-zA-Z]+)=(?<value>.*)$

# Check for hardcoded IPs in config (audit)
grep -rP '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b' /etc/

Quick Reference Card

Metacharacters
.       any char          *       0 or more
+       1 or more         ?       0 or 1
^       line start        $       line end
\b      word boundary     |       alternation
[]      char class        ()      grouping
PCRE Shortcuts
\d  digit [0-9]           \D  not digit
\w  word [a-zA-Z0-9_]     \W  not word
\s  whitespace            \S  not whitespace
Quantifiers
{n}     exactly n         {n,}    n or more
{n,m}   n to m            *?      lazy (PCRE)
Lookaround (PCRE)
(?=x)   followed by x     (?!x)   not followed by x
(?<=x)  preceded by x     (?<!x)  not preceded by x
Possessive & Atomic (PCRE)
*+      possessive 0+     (?>x)   atomic group
++      possessive 1+     (?|...) branch reset
?+      possessive 0/1    (?R)    recurse pattern
{n,m}+  possessive n-m    (?1)    recurse group 1
Mode Modifiers (PCRE)
(?i)  case insensitive    (?m)  multiline (^$ per line)
(?s)  dotall (. = \n)     (?x)  extended/free-spacing
(?U)  ungreedy default    (?-i) turn off modifier
Conditionals (PCRE)
(?(1)y|n)   if group 1, match y else n
(?(?=x)y|n) if followed by x, match y else n
Unicode (PCRE)
\p{L}   any letter        \p{N}   any number
\p{Ll}  lowercase         \p{Lu}  uppercase
\p{P}   punctuation       \p{S}   symbol
\p{Han} Chinese           \p{Latin} Latin script
\P{L}   NOT letter        (*UCP)  Unicode mode
grep Flags
-E  ERE (extended)        -P  PCRE (perl)
-o  only matching         -i  case insensitive
-v  invert match          -c  count
-n  line numbers          -l  filenames only
-z  null-delimited        -A/-B/-C context