Regex Fundamentals
Every regex pattern consists of two types of characters: literals that match themselves, and metacharacters that have special meaning. Understanding this distinction is the foundation of regex mastery.
Literal Characters
Most characters match themselves exactly.
Pattern: cat
Text:    The cat sat on the mat.
Match:       ^^^
Key Points:
- Case-sensitive by default (Cat ≠ cat)
- Spaces are literal (hello world matches exactly that)
- Numbers are literal (192 matches "192")
Case Sensitivity
| Pattern | Matches | Does Not Match |
|---|---|---|
| error | "error" | "Error", "ERROR" |
| Error | "Error" | "error", "ERROR" |
Making case-insensitive:
# grep with -i flag
grep -i 'error' logfile.txt
# PCRE inline modifier
grep -P '(?i)error' logfile.txt
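The effect of -i can be checked by counting matching lines with and without it. This is a minimal sketch; the sample log lines and variable names are made up for illustration:

```shell
# Hypothetical sample data: three spellings of "error" plus one clean line.
sample=$(printf '%s\n' 'error: disk full' 'Error: timeout' 'ERROR: refused' 'all good')

# -c prints the number of matching lines instead of the lines themselves.
sensitive=$(printf '%s\n' "$sample" | grep -c 'error')
insensitive=$(printf '%s\n' "$sample" | grep -ci 'error')

printf 'case-sensitive:   %s\n' "$sensitive"
printf 'case-insensitive: %s\n' "$insensitive"
```

Only the all-lowercase line matches without -i; all three spellings match with it.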
Metacharacters
These characters have special meaning and require escaping to match literally:
| Character | Meaning | To Match Literally |
|---|---|---|
| . | Any single character | \. |
| * | Zero or more of preceding | \* |
| + | One or more of preceding | \+ |
| ? | Zero or one of preceding | \? |
| ^ | Start of line | \^ |
| $ | End of line | \$ |
| (pipe char) | Alternation (OR) | \ followed by the pipe char |
| ( ) | Grouping | \( \) |
| [ ] | Character class | \[ \] |
| { } | Quantifier | \{ \} |
| \ | Escape character | \\ |
The Dot Metacharacter
The dot . matches any single character except newline.
Pattern: c.t
Matches: cat, cot, cut, c1t, c@t, c t
Does not match: ct (no character), cart (two characters)
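The list above can be reproduced in a terminal. The sketch below feeds the same candidate strings to grep; the -x flag (match the whole line) is added here so that each string is tested as a whole rather than searched for a substring:

```shell
# Candidate strings, one per line.
words=$(printf '%s\n' 'cat' 'cot' 'c1t' 'c t' 'ct' 'cart')

# -x requires the pattern to match the whole line, so c.t means
# exactly three characters: c, any one character, t.
dot_matches=$(printf '%s\n' "$words" | grep -x 'c.t')

printf '%s\n' "$dot_matches"
```

The strings ct and cart are filtered out because the dot must consume exactly one character.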
Practical Example: IP Octets
Pattern: 192.168
Text:    192.168.1.1 and 192X168Y1Z1
Match:   ^^^^^^^         ^^^^^^^
Problem: Dot matches ANY character, including 'X'
Solution: Escape the dot
Pattern: 192\.168
Text:    192.168.1.1 and 192X168Y1Z1
Match:   ^^^^^^^
Escaping Rules
When to Escape
Escape these characters when you want their literal meaning:
. * + ? ^ $ | ( ) [ ] { } \
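When the text to search for arrives in a variable, the escaping can be automated. The helper below is a sketch rather than a standard utility: escape_ere is a hypothetical name, and it assumes the result will be used as an ERE (grep -E).

```shell
# escape_ere (hypothetical helper): prefix every ERE metacharacter in $1
# with a backslash so the result matches the original text literally.
escape_ere() {
  # The bracket expression lists ] [ \ . * + ? ^ $ | ( ) { }
  # (] must come first to be taken literally inside [...]).
  printf '%s\n' "$1" | sed 's/[][\.*+?^$|(){}]/\\&/g'
}

escape_ere '192.168.1.100'
escape_ere 'cost=$100'
escape_ere '[ERROR]'
```

A possible use is grep -E "$(escape_ere "$needle")" file, where needle is whatever variable holds the literal text.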
Escape Sequences
| Sequence | Meaning | Example |
|---|---|---|
| \\ | Literal backslash | |
| \. | Literal period | |
| \* | Literal asterisk | |
| \? | Literal question mark | |
| \( \) | Literal parenthesis | |
| \[ \] | Literal bracket | |
Infrastructure Example: File Paths
# Windows path - must escape backslashes
grep -E 'C:\\Users\\Admin' config.txt
# URL with query parameters
grep -E 'api\.example\.com/v1/users\?id=' access.log
# IP address
grep -E '10\.50\.1\.20' network.log
Combining Literals and Metacharacters
Pattern: Match Log Level
Pattern: \[ERROR\]
Text:    [ERROR] Connection failed
Match:   ^^^^^^^
Explanation:
- \[ matches literal [
- ERROR matches literally
- \] matches literal ]
Pattern: Match File Extension
Pattern: \.log$
Text:    access.log
Match:         ^^^^
Explanation:
- \. matches literal .
- log matches literally
- $ anchors to end of line
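To see why both the escaped dot and the anchor matter, the same pattern can be run against a few names that merely resemble log files. The file names below are invented for illustration:

```shell
# Hypothetical file names, one per line.
names=$(printf '%s\n' 'access.log' 'access.log.1' 'catalog' 'error.log')

# \. is a literal dot, log is literal text, $ anchors at end of line.
log_files=$(printf '%s\n' "$names" | grep '\.log$')

printf '%s\n' "$log_files"
```

The name catalog is rejected because it has no dot before log, and access.log.1 is rejected because .log is not at the end of the line.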
Common Mistakes
Mistake 1: Unescaped Dots
# Wrong - matches 192X168Y1Z100 too
grep '192.168.1.100' file.txt
# Correct - matches only IP
grep '192\.168\.1\.100' file.txt
Mistake 2: Forgetting Case Sensitivity
# Misses "ERROR" and "Error"
grep 'error' logfile.txt
# Catches all cases
grep -i 'error' logfile.txt
Mistake 3: Special Characters in Passwords/Strings
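# If searching for "cost=$100"
# Fragile - $ is the end-of-line anchor, so this never matches in ERE or PCRE
# (plain BRE happens to treat a mid-pattern $ as literal)
grep 'cost=$100' file.txt
# Correct - escape special chars
grep 'cost=\$100' file.txt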
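When the whole search string is literal, grep -F (fixed strings) avoids escaping altogether. The sketch below runs on a throwaway file; its contents are invented for illustration:

```shell
# Throwaway file with strings full of regex metacharacters.
tmpfile=$(mktemp)
printf '%s\n' 'cost=$100' 'cost=' 'path=C:\Users\Admin' > "$tmpfile"

# -F treats the pattern as plain text: $ and \ need no escaping.
fixed_hits=$(grep -cF 'cost=$100' "$tmpfile")
path_hits=$(grep -cF 'C:\Users\Admin' "$tmpfile")

# Same string as an ERE: $ is an anchor, so nothing can match.
# (|| true because grep exits non-zero when it finds no match.)
ere_hits=$(grep -cE 'cost=$100' "$tmpfile" || true)

rm -f "$tmpfile"
printf 'fixed: %s, path: %s, ere: %s\n' "$fixed_hits" "$path_hits" "$ere_hits"
```

The trade-off is that -F disables all regex features, so it only helps when nothing in the pattern needs to be a metacharacter.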
Self-Test Exercises
Try each challenge FIRST. Type your pattern in the terminal. Only read the answer after you’ve attempted it honestly.
Setup Test Data
Run this once to create your test file:
cat << 'EOF' > /tmp/fundamentals.txt
The quick brown fox jumps over the lazy dog.
IP: 192.168.1.100
IP: 10.50.1.20
FAKE: 192X168Y1Z100
MAC: AA:BB:CC:DD:EE:FF
Price: $99.99
Total: $1,234.56
Path: C:\Users\Admin\Documents
Path: /home/evan/atelier
URL: https://api.example.com/v1/users?id=123
[ERROR] Connection refused
[WARN] Low disk space
[INFO] Server started
error: lowercase error
ERROR: UPPERCASE ERROR
What? Really? Yes!
2+2=4
5*5=25
EOF
Challenge 1: Find Literal Text
Goal: Find lines containing the word "fox"
Answer
grep (BRE/ERE - identical for literals):
grep 'fox' /tmp/fundamentals.txt
Output: The quick brown fox jumps over the lazy dog.
Literal strings work the same in BRE and ERE - no flags needed.
In Other Tools (sed, awk, find)
sed (print matching lines):
sed -n '/fox/p' /tmp/fundamentals.txt
-n suppresses default output, /fox/p prints matches.
awk (pattern matching):
awk '/fox/' /tmp/fundamentals.txt
awk prints lines matching the pattern by default.
find -regex (match filenames):
# If you had a file named "fox.txt" in /tmp:
find /tmp -regex '.*fox.*'
find -regex matches the FULL PATH, not just filename. Use .* to match leading path.
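The full-path behaviour is easy to verify in a scratch directory. This is a small sketch assuming a find that supports -regex (GNU or BSD); the directory and file names are made up:

```shell
# Scratch directory with two hypothetical files.
workdir=$(mktemp -d)
touch "$workdir/fox.txt" "$workdir/dog.txt"

# The regex must match the entire path, so a bare file-name pattern finds nothing.
name_only=$(find "$workdir" -regex 'fox.*' | wc -l)

# Allowing for the leading directories with .*/ finds the file.
full_path=$(find "$workdir" -regex '.*/fox\.txt' | wc -l)

rm -rf "$workdir"
printf 'name-only pattern: %s, full-path pattern: %s\n' "$name_only" "$full_path"
```

The first count is 0 and the second is 1, because only the second pattern accounts for the directory part of the path.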
Challenge 2: Case Insensitive
Goal: Find ALL error lines (ERROR, error, Error - any case)
Answer - Multiple Approaches (varying efficiency)
Level 1: The -i flag (easiest)
grep -i 'error' /tmp/fundamentals.txt
Works in grep, but not all tools have this flag.
Level 2: ERE alternation (grep -E)
grep -E '(ERROR|error|Error)' /tmp/fundamentals.txt
In ERE, | means OR. Parentheses group without escaping.
Level 3: BRE alternation (plain grep)
grep '\(ERROR\|error\|Error\)' /tmp/fundamentals.txt
In BRE, write \( \) for grouping and \| for alternation. Note that \| is a GNU extension; strict POSIX BRE has no alternation operator.
Level 4: Character class (most portable)
grep '[Ee][Rr][Rr][Oo][Rr]' /tmp/fundamentals.txt
No flags, no alternation - works in ANY regex engine. Each [Xx] matches either case for that position.
Why this matters: sed, awk, vim, Python all have different defaults. Learning to match without -i makes you portable across tools.
In Other Tools (sed, awk, find)
sed (case-insensitive with character class):
sed -n '/[Ee][Rr][Rr][Oo][Rr]/p' /tmp/fundamentals.txt
In sed, -i means edit in place, not ignore case. Character classes work everywhere; GNU sed also accepts an I modifier after the address, as in /error/Ip.
awk (multiple approaches):
# Character class (portable)
awk '/[Ee][Rr][Rr][Oo][Rr]/' /tmp/fundamentals.txt
# Using tolower() function
awk 'tolower($0) ~ /error/' /tmp/fundamentals.txt
# IGNORECASE variable (GNU awk)
awk 'BEGIN{IGNORECASE=1} /error/' /tmp/fundamentals.txt
find -iregex (case-insensitive regex):
# -iregex is case-insensitive version of -regex
find /tmp -iregex '.*error.*'
# Without -iregex, use character class:
find /tmp -regex '.*[Ee][Rr][Rr][Oo][Rr].*'
Challenge 3: Escape the Brackets
Goal: Match [ERROR] including the literal brackets
Answer
grep (BRE):
grep '\[ERROR\]' /tmp/fundamentals.txt
Brackets [] define character classes. Escape with \[ and \] for literals.
grep -E (ERE):
grep -E '\[ERROR\]' /tmp/fundamentals.txt
Same escaping in ERE - brackets always need escaping for literal match.
In Other Tools (sed, awk, find)
sed:
sed -n '/\[ERROR\]/p' /tmp/fundamentals.txt
Same escaping rules as grep BRE.
awk:
awk '/\[ERROR\]/' /tmp/fundamentals.txt
awk uses ERE - same escaping.
find -regex:
find /tmp -regex '.*\[ERROR\].*'
find -regex also requires bracket escaping.
Alternative - use character class to match bracket:
# Match literal [ and ] using one-character bracket expressions: [[] and []]
grep '[[]ERROR[]]' /tmp/fundamentals.txt # Tricky but works
This is obscure - escaping is clearer.
Challenge 4: Escape the Dot (Critical!)
Goal: Match ONLY 192.168.1.100 (not 192X168Y1Z100)
Answer
grep (BRE):
grep '192\.168\.1\.100' /tmp/fundamentals.txt
The . matches ANY character. Escape it with \. for literal period.
grep -E (ERE):
grep -E '192\.168\.1\.100' /tmp/fundamentals.txt
Same escaping - dot is a metacharacter in all flavors.
The Bug:
# WRONG - matches 192X168Y1Z100 too!
grep '192.168.1.100' /tmp/fundamentals.txt
In Other Tools (sed, awk, find)
sed:
sed -n '/192\.168\.1\.100/p' /tmp/fundamentals.txt
awk:
awk '/192\.168\.1\.100/' /tmp/fundamentals.txt
find -regex:
# Match files with this IP in the name
find /tmp -regex '.*192\.168\.1\.100.*'
Why this matters: Unescaped dots in IP matching are one of the most common regex bugs in log parsing scripts.
Challenge 5: Escape the Dollar Sign
Goal: Find the line with $99.99
Answer
grep (BRE):
grep '\$99\.99' /tmp/fundamentals.txt
$ means end-of-line in regex. Escape for literal dollar sign.
grep -E (ERE):
grep -E '\$99\.99' /tmp/fundamentals.txt
Same escaping required.
Shell quoting matters:
# Single quotes protect $ from shell expansion
grep '\$99' file.txt # Correct
# Double quotes - the shell processes \ and $ before grep sees the pattern
grep "\$99" file.txt # Shell strips the backslash: grep receives $99 (unescaped)
grep "\\\$99" file.txt # grep receives \$99 - works but confusing
Always use single quotes for regex patterns.
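One way to debug quoting is to print the exact string the shell would hand to grep. The sketch below uses printf '%s' for that; the variable names are arbitrary:

```shell
# What grep would receive as its pattern argument in each case.
single=$(printf '%s' '\$99')      # single quotes: passed through untouched
double=$(printf '%s' "\$99")      # double quotes: the shell removes the backslash
escaped=$(printf '%s' "\\\$99")   # \\ becomes \ and \$ becomes $

printf 'single : %s\n' "$single"
printf 'double : %s\n' "$double"
printf 'escaped: %s\n' "$escaped"
```

The single-quoted and triple-backslash forms both deliver \$99 to grep, while the plain double-quoted form delivers an unescaped $99.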
In Other Tools (sed, awk, find)
sed:
sed -n '/\$99\.99/p' /tmp/fundamentals.txt
awk:
awk '/\$99\.99/' /tmp/fundamentals.txt
Outside a regex, awk uses $ for field references ($1, $2); inside /.../ it is the end-of-line anchor, so the escape is still required.
awk alternative - avoid regex entirely:
awk 'index($0, "$99.99")' /tmp/fundamentals.txt
index() does literal string matching, no regex escaping needed.
find -regex:
find /tmp -regex '.*\$99\.99.*'
Challenge 6: Windows Path
Goal: Find the Windows path C:\Users\Admin
Answer
grep (BRE):
grep 'C:\\Users\\Admin' /tmp/fundamentals.txt
Backslashes need escaping: \\ for literal \
grep -E (ERE):
grep -E 'C:\\Users\\Admin' /tmp/fundamentals.txt
Same escaping in ERE.
Double escaping nightmare:
# In double quotes, shell eats one level of escaping
grep "C:\\\\Users\\\\Admin" file.txt # Confusing!
# Single quotes - regex engine sees exactly what you type
grep 'C:\\Users\\Admin' file.txt # Much clearer
In Other Tools (sed, awk, find)
sed:
sed -n '/C:\\Users\\Admin/p' /tmp/fundamentals.txt
awk:
awk '/C:\\Users\\Admin/' /tmp/fundamentals.txt
awk with literal match (avoids regex):
awk 'index($0, "C:\\Users\\Admin")' /tmp/fundamentals.txt
Note: In awk string literals, \\ is also needed for literal backslash.
find -regex:
find /tmp -regex '.*C:\\Users\\Admin.*'
Infrastructure note: Windows paths in Linux logs (Samba, Wine) require this escaping constantly.
Challenge 7: Match Question Mark
Goal: Find lines containing literal ?
Answer
grep (BRE) - tricky!:
grep '?' /tmp/fundamentals.txt
In BRE, ? is NOT a metacharacter - it’s literal! No escaping needed.
grep -E (ERE) - must escape:
grep -E '\?' /tmp/fundamentals.txt
In ERE, ? means "zero or one of preceding" - must escape for literal.
The BRE/ERE trap:
| Character | BRE | ERE |
|---|---|---|
| ? | Literal (no escape) | Metachar (escape: \?) |
| + | Literal (no escape) | Metachar (escape: \+) |
| \ + (pipe char) | Alternation (GNU extension) | Literal pipe |
| (pipe char) | Literal pipe | Alternation |
In Other Tools (sed, awk, find)
sed (BRE default):
sed -n '/?/p' /tmp/fundamentals.txt
No escaping in BRE sed.
sed -E (ERE):
sed -En '/\?/p' /tmp/fundamentals.txt
awk (always ERE):
awk '/\?/' /tmp/fundamentals.txt
find -regex (GNU find defaults to Emacs-style regex, where ? is also a metacharacter):
find /tmp -regex '.*\?.*'
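Gotcha: This BRE/ERE difference trips up many people, and blindly escaping does not help: in GNU BRE, \? turns the question mark into the zero-or-one operator. When in doubt, use a bracket expression such as [?], which is literal in both flavors.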
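The four ways of matching a literal question mark can be compared side by side. This is a minimal sketch using one line from the test data; the variable names are arbitrary:

```shell
line='What? Really? Yes!'

# Each grep -c prints 1 if the line contains a literal question mark.
bre_plain=$(printf '%s\n' "$line" | grep -c '?')       # BRE: ? is literal
ere_escaped=$(printf '%s\n' "$line" | grep -cE '\?')   # ERE: must escape
bre_class=$(printf '%s\n' "$line" | grep -c '[?]')     # bracket form in BRE
ere_class=$(printf '%s\n' "$line" | grep -cE '[?]')    # bracket form in ERE

printf '%s %s %s %s\n' "$bre_plain" "$ere_escaped" "$bre_class" "$ere_class"
```

All four report a match. The bracket form is the only spelling that reads the same in both flavors.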
Challenge 8: Match Math Expression
Goal: Find the line with 5*5=25
Answer
grep (BRE):
grep '5\*5=25' /tmp/fundamentals.txt
* means "zero or more of preceding" - escape for literal asterisk.
grep -E (ERE):
grep -E '5\*5=25' /tmp/fundamentals.txt
Same escaping in ERE - * is metachar in all flavors.
The danger of unescaped *:
# WRONG - matches "55=25", "555=25", even "5=25"!
grep '5*5=25' /tmp/fundamentals.txt
# Why? 5* means "zero or more 5s", then literal 5=25
In Other Tools (sed, awk, find)
sed:
sed -n '/5\*5=25/p' /tmp/fundamentals.txt
awk:
awk '/5\*5=25/' /tmp/fundamentals.txt
awk literal match (no regex):
awk 'index($0, "5*5=25")' /tmp/fundamentals.txt
find -regex:
find /tmp -regex '.*5\*5=25.*'
Real-world: Escaping * matters in log parsing (e.g., SELECT * FROM in SQL logs).
Challenge 9: Extract IP with -o
Goal: Extract ONLY the IP address 192.168.1.100 (not the whole line)
Answer
grep -oE (ERE with extraction):
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' /tmp/fundamentals.txt
- -o outputs only the matched part (not the full line)
- -E enables ERE so + works without escaping
- [0-9]+ means one or more digits
- \. matches a literal dot
grep BRE equivalent:
grep -o '[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+' /tmp/fundamentals.txt
In BRE, + needs escaping as \+ to mean "one or more".
In Other Tools (sed, awk)
sed (extract with substitution):
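# Replace entire line with just the IP
# The [^0-9] before the group stops the greedy .* from swallowing the IP's leading digits
# (this assumes a non-digit character precedes the IP on the line)
sed -n 's/.*[^0-9]\([0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+\).*/\1/p' /tmp/fundamentals.txt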
sed doesn’t have -o. Use capture groups \(…\) and back-reference \1.
sed -E (ERE):
sed -En 's/.*[^0-9]([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+).*/\1/p' /tmp/fundamentals.txt
awk (field extraction + regex):
# Print each field if it matches IP pattern
awk '{for(i=1;i<=NF;i++) if($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) print $i}' /tmp/fundamentals.txt
awk with match() and substr():
awk 'match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/) {print substr($0, RSTART, RLENGTH)}' /tmp/fundamentals.txt
RSTART and RLENGTH give position and length of match.
Note: grep -o is much simpler. Use awk/sed when you need more complex processing.
Challenge 10: Both IPs
Goal: Extract ALL IPs from the file (both 192.168.1.100 and 10.50.1.20)
Answer
grep -oE (ERE with quantifier):
grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /tmp/fundamentals.txt
Output:
192.168.1.100
10.50.1.20
{1,3} limits each octet to 1-3 digits.
grep BRE equivalent:
grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' /tmp/fundamentals.txt
In BRE, braces need escaping: \{1,3\} not {1,3}.
More precise IP pattern (validates 0-255):
# This rejects 999.999.999.999
grep -oE '(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])' /tmp/fundamentals.txt
Complex but correct. The simple {1,3} pattern matches invalid IPs like 999.999.999.999.
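An alternative to the long alternation is to match the simple shape first and check the numeric range separately. The sketch below does the range check in awk; the sample addresses are made up, and this is one possible approach rather than the method used above.

```shell
# Candidate addresses, one per line (hypothetical input).
candidates=$(printf '%s\n' '192.168.1.100' '999.999.999.999' '10.50.1.20' '256.1.1.1')

# Split on dots, then require four all-digit fields, each at most 255.
valid_ips=$(printf '%s\n' "$candidates" | awk -F. '
  NF == 4 {
    ok = 1
    for (i = 1; i <= 4; i++)
      if ($i !~ /^[0-9]+$/ || $i + 0 > 255) ok = 0
    if (ok) print
  }')

printf '%s\n' "$valid_ips"
```

Only 192.168.1.100 and 10.50.1.20 survive. Comparing numbers with > 255 is easier to read and audit than encoding the 0-255 range in a regex.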
In Other Tools (sed, awk, find)
sed -E (extracts one IP per line):
sed -En 's/.*[^0-9]([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*/\1/p' /tmp/fundamentals.txt
Note: This only extracts ONE IP per line. Use grep -o for multiple.
awk (all IPs, multiple per line):
awk '{
while (match($0, /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/)) {
print substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + RLENGTH)
}
}' /tmp/fundamentals.txt
awk with split() (tokenize on anything that is not a digit or a dot):
awk '{
n = split($0, a, /[^0-9.]/)
for (i=1; i<=n; i++)
if (a[i] ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/)
print a[i]
}' /tmp/fundamentals.txt
find -regex (files with IPs in name):
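find /var/log -regextype posix-basic -regex '.*[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}.*'
GNU find defaults to Emacs-style regex, so -regextype posix-basic is used here to select POSIX BRE and its \{1,3\} syntax. BSD find uses BRE by default.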
Infrastructure use case: Extracting IPs from firewall logs, access logs, network configs.
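A typical follow-up to extraction is counting how often each address appears. The pipeline below is a sketch on invented firewall-style lines; the log format and variable names are assumptions, not taken from a real device:

```shell
# Hypothetical firewall log lines.
log=$(printf '%s\n' \
  'ACCEPT src=10.50.1.20 dst=192.168.1.100' \
  'DROP src=10.50.1.20 dst=192.168.1.1' \
  'ACCEPT src=10.50.1.20 dst=192.168.1.100')

# Extract every IP-shaped token, then count occurrences, most frequent first.
ip_counts=$(printf '%s\n' "$log" |
  grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}' |
  sort | uniq -c | sort -rn)

# The first line of the ranking holds the most frequent address.
top_ip=$(printf '%s\n' "$ip_counts" | awk 'NR == 1 { print $2 }')

printf '%s\n' "$ip_counts"
printf 'most frequent: %s\n' "$top_ip"
```

The group (\.[0-9]{1,3}){3} is a shorter spelling of the three repeated dot-plus-octet parts used in the patterns above.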
Gotcha: The Escaped Bracket Mistake
A common beginner error is escaping the wrong character:
# WRONG - escaping bracket instead of dot
grep '[0-9]*.\[0-9]*' file
# CORRECT - escape the dot, not the bracket
grep '[0-9]*\.[0-9]*' file
The [0-9] is a character class - the brackets define it. You don’t escape those.
The . matches ANY character - that’s what you escape with \.
Key Takeaways
- Most characters are literal - they match themselves
- Metacharacters have special meaning - memorize the list
- Backslash escapes metacharacters - \. matches literal dot
- Case matters - use -i flag or (?i) modifier when needed
- Test patterns incrementally - build from simple to complex
Regex vs Globs: Know the Difference
These tools do NOT use regex. Don’t confuse glob patterns with regex!
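| Tool/Context | Pattern Type | How * Works |
|---|---|---|
| ls, mv, cp arguments | Shell glob | Any characters (including none) |
| find -name | Shell glob | Use -regex for regex |
| xargs | No pattern matching | Just passes arguments to commands |
| [[ $x == pattern ]] | Shell glob | Use =~ for regex |
| case/esac | Shell glob | Extended glob with shopt -s extglob |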
The Trap: Same Symbol, Different Meaning
# GLOB: * means "any characters" (like .* in regex)
ls *.txt # Matches: foo.txt, bar.txt (but not the hidden file .txt: * skips leading dots by default)
# REGEX: * means "zero or more of PRECEDING"
grep 'a*' file.txt # Matches: "", "a", "aa", "aaa"...
grep '.*' file.txt # This is the regex equivalent of glob's *
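The difference can be checked on a single string by testing it once in a glob context and once in a regex context. This is a minimal sketch; the file name and variable names are arbitrary:

```shell
name='notes.txt'

# Glob context: case patterns use shell glob rules, so * means "any characters".
case $name in
  *.txt) glob_result='match' ;;
  *)     glob_result='no match' ;;
esac

# Regex context: the equivalent pattern needs .* and an escaped dot.
if printf '%s\n' "$name" | grep -q '^.*\.txt$'; then
  regex_result='match'
else
  regex_result='no match'
fi

# Glob syntax handed to grep: in BRE a leading * is a literal asterisk,
# so this looks for "*", then any character, then "txt".
if printf '%s\n' "$name" | grep -q '*.txt'; then
  misuse_result='match'
else
  misuse_result='no match'
fi

printf 'glob: %s, regex: %s, glob syntax in grep: %s\n' \
  "$glob_result" "$regex_result" "$misuse_result"
```

The first two tests match and the third does not, since notes.txt contains no literal asterisk.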
Quick Decision Tree
Am I using regex?
├─ grep, grep -E, grep -P → YES (BRE, ERE, PCRE)
├─ sed, sed -E → YES (BRE, ERE)
├─ awk → YES (ERE always)
├─ find -regex → YES (Emacs-style by default in GNU find, matches FULL path)
├─ find -name → NO (glob)
├─ ls, mv, cp arguments → NO (shell glob expansion)
├─ xargs → NO (no pattern matching)
├─ [[ $x == pattern ]] → NO (glob, use =~ for regex)
└─ case/esac → NO (glob)
Test Yourself
# Create test files
touch /tmp/test1.txt /tmp/test2.txt /tmp/test.log
# Glob (shell expands before command runs)
ls /tmp/test*.txt
# Output: /tmp/test1.txt /tmp/test2.txt
# Regex (grep interprets pattern)
ls /tmp | grep 'test.*\.txt'
# Output: test1.txt test2.txt
# WRONG: Using regex syntax in glob context
ls /tmp/test.+\.txt # FAILS - glob doesn't understand + or \.
Next Module
Character Classes - Matching sets of characters with [] and shorthand classes.