Regex Fundamentals

Every regex pattern consists of two types of characters: literals that match themselves, and metacharacters that have special meaning. Understanding this distinction is the foundation of regex mastery.

Literal Characters

Most characters match themselves exactly.

Pattern: cat
Text:    The cat sat on the mat.
Match:       ^^^

Key Points:
- Case-sensitive by default (Cat ≠ cat)
- Spaces are literal (hello world matches exactly that)
- Numbers are literal (192 matches "192")

Case Sensitivity

Pattern    Matches    Does Not Match
error      "error"    "Error", "ERROR"
Error      "Error"    "error", "ERROR"

Making case-insensitive:

# grep with -i flag
grep -i 'error' logfile.txt

# PCRE inline modifier
grep -P '(?i)error' logfile.txt

Metacharacters

These characters have special meaning in ERE and PCRE and require escaping to match literally. (In BRE, +, ?, |, (), and {} are literal unless escaped.)

Character    Meaning                      To Match Literally
.            Any single character         \.
*            Zero or more of preceding    \*
+            One or more of preceding     \+
?            Zero or one of preceding     \?
^            Start of line                \^
$            End of line                  \$
|            Alternation (OR)             \|
()           Grouping                     \(\)
[]           Character class              \[\]
{}           Quantifier                   \{\}
\            Escape character             \\

The Dot Metacharacter

The dot . matches any single character except newline.

Pattern: c.t
Matches: cat, cot, cut, c1t, c@t, c t
Does not match: ct (no character), cart (two characters)
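The list above can be checked from the shell. This is a minimal sketch using a handful of made-up sample words (not part of the test file); grep -x requires the whole line to match, so "cart" is rejected rather than partially matched.

```shell
# Sample words (hypothetical), one per line
words=$(printf '%s\n' cat cot c1t 'c t' ct cart)

# -x forces the pattern to match the entire line
dot_matches=$(printf '%s\n' "$words" | grep -x 'c.t')

printf '%s\n' "$dot_matches"
```

Only the four three-character words survive; "ct" has nothing for the dot to consume and "cart" has one character too many.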

Practical Example: IP Octets

Pattern: 192.168
Text:    192.168.1.1 and 192X168Y1Z1
Match:   ^^^^^^^         ^^^^^^^

Problem: Dot matches ANY character, including 'X'

Solution: Escape the dot

Pattern: 192\.168
Text:    192.168.1.1 and 192X168Y1Z1
Match:   ^^^^^^^
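The same comparison can be run directly. The sketch below assumes only a grep that supports -o (GNU and BSD grep both do) and uses the sample text from the example above.

```shell
line='192.168.1.1 and 192X168Y1Z1'

# -o prints each match on its own line
unescaped=$(printf '%s\n' "$line" | grep -o '192.168')
escaped=$(printf '%s\n' "$line" | grep -o '192\.168')

printf 'unescaped:\n%s\n' "$unescaped"
printf 'escaped:\n%s\n' "$escaped"
```

The unescaped pattern reports two matches (192.168 and 192X168); the escaped one reports only the real dotted prefix.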

Escaping Rules

When to Escape

Escape these characters when you want their literal meaning:

. * + ? ^ $ | ( ) [ ] { } \

Escape Sequences

Sequence    Meaning                  Example
\\          Literal backslash        C:\\Users matches "C:\Users"
\.          Literal period           192\.168 matches "192.168"
\*          Literal asterisk         5\*5 matches "5*5"
\?          Literal question mark    What\? matches "What?"
\(          Literal parenthesis      f\(x\) matches "f(x)"
\[          Literal bracket          array\[0\] matches "array[0]"

Infrastructure Example: File Paths

# Windows path - must escape backslashes
grep -E 'C:\\Users\\Admin' config.txt

# URL with query parameters
grep -E 'api\.example\.com/v1/users\?id=' access.log

# IP address
grep -E '10\.50\.1\.20' network.log

Combining Literals and Metacharacters

Pattern: Match Log Level

Pattern: \[ERROR\]
Text:    [ERROR] Connection failed
Match:   ^^^^^^^

Explanation:
- \[ matches literal [
- ERROR matches literally
- \] matches literal ]

Pattern: Match File Extension

Pattern: \.log$
Text:    access.log
Match:         ^^^^

Explanation:
- \. matches literal .
- log matches literally
- $ anchors to end of line
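
To see what the escape and the anchor each contribute, the sketch below runs both a strict and a loose pattern over a few hypothetical file names (these names are illustrative and not taken from the test file).

```shell
names=$(printf '%s\n' access.log access.log.1 catalog error.log)

# Escaped dot plus $ anchor: only names that end in ".log"
anchored=$(printf '%s\n' "$names" | grep '\.log$')

# Unescaped and unanchored: "." is any character, anywhere in the name
loose_count=$(printf '%s\n' "$names" | grep -c '.log')

printf 'anchored:\n%s\n' "$anchored"
printf 'loose pattern matched %s of 4 names\n' "$loose_count"
```

The loose pattern accepts all four names, including "catalog" (the dot matches "a") and the rotated "access.log.1".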

Common Mistakes

Mistake 1: Unescaped Dots

# Wrong - matches 192X168Y1Z100 too
grep '192.168.1.100' file.txt

# Correct - matches only IP
grep '192\.168\.1\.100' file.txt

Mistake 2: Forgetting Case Sensitivity

# Misses "ERROR" and "Error"
grep 'error' logfile.txt

# Catches all cases
grep -i 'error' logfile.txt

Mistake 3: Special Characters in Passwords/Strings

# If searching for "cost=$100"
# Fragile - $ is the end-of-line anchor (everywhere in ERE; at the end of the pattern in BRE)
grep 'cost=$100' file.txt

# Correct - escape special chars
grep 'cost=\$100' file.txt

Self-Test Exercises

Try each challenge FIRST. Type your pattern in the terminal. Only read the answer after you’ve attempted it honestly.

Setup Test Data

Run this once to create your test file:

cat << 'EOF' > /tmp/fundamentals.txt
The quick brown fox jumps over the lazy dog.
IP: 192.168.1.100
IP: 10.50.1.20
FAKE: 192X168Y1Z100
MAC: AA:BB:CC:DD:EE:FF
Price: $99.99
Total: $1,234.56
Path: C:\Users\Admin\Documents
Path: /home/evan/atelier
URL: https://api.example.com/v1/users?id=123
[ERROR] Connection refused
[WARN] Low disk space
[INFO] Server started
error: lowercase error
ERROR: UPPERCASE ERROR
What? Really? Yes!
2+2=4
5*5=25
EOF

Challenge 1: Find Literal Text

Goal: Find lines containing the word "fox"

Answer

grep (BRE/ERE - identical for literals):

grep 'fox' /tmp/fundamentals.txt

Output: The quick brown fox jumps over the lazy dog.

Literal strings work the same in BRE and ERE - no flags needed.

In Other Tools (sed, awk, find)

sed (print matching lines):

sed -n '/fox/p' /tmp/fundamentals.txt

-n suppresses default output, /fox/p prints matches.

awk (pattern matching):

awk '/fox/' /tmp/fundamentals.txt

awk prints lines matching the pattern by default.

find -regex (match filenames):

# If you had a file named "fox.txt" in /tmp:
find /tmp -regex '.*fox.*'

find -regex matches the FULL PATH, not just the filename. Use .* to match the leading path.
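
The full-path behavior is easy to confirm in a scratch directory. This sketch assumes mktemp -d is available and uses two hypothetical file names; it does not touch anything else in /tmp.

```shell
# Scratch directory so the example does not depend on existing files
dir=$(mktemp -d)
touch "$dir/fox.txt" "$dir/dog.txt"

# The pattern must cover the whole path, so a bare file name finds nothing
bare=$(find "$dir" -type f -regex 'fox\.txt')

# A leading .*/ absorbs the directory part of the path
full=$(find "$dir" -type f -regex '.*/fox\.txt')

printf 'bare: [%s]\nfull: [%s]\n' "$bare" "$full"
rm -rf "$dir"
```

The first search prints nothing between the brackets; the second prints the full path to fox.txt.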

Challenge 2: Case Insensitive

Goal: Find ALL error lines (ERROR, error, Error - any case)

Answer - Multiple Approaches (varying efficiency)

Level 1: The -i flag (easiest)

grep -i 'error' /tmp/fundamentals.txt

Works in grep, but not all tools have this flag.


Level 2: ERE alternation (grep -E)

grep -E '(ERROR|error|Error)' /tmp/fundamentals.txt

In ERE, | means OR. Parentheses group without escaping.


Level 3: BRE alternation (plain grep)

grep '\(ERROR\|error\|Error\)' /tmp/fundamentals.txt

In BRE, alternation is written \| (a GNU extension) and grouping is written \(\).


Level 4: Character class (most portable)

grep '[Ee][Rr][Rr][Oo][Rr]' /tmp/fundamentals.txt

No flags, no alternation - works in ANY regex engine. Each [Xx] matches either case for that position.


Why this matters: sed, awk, vim, Python all have different defaults. Learning to match without -i makes you portable across tools.
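
The approaches are not fully equivalent, which a quick count makes visible. The sketch below uses four made-up lines (an assumption, chosen to include an oddly cased "eRRor") instead of the test file.

```shell
sample=$(printf '%s\n' 'error: lowercase' 'ERROR: UPPERCASE' 'eRRor: odd casing' 'Warning only')

# -i and the character-class pattern accept any casing
flag=$(printf '%s\n' "$sample" | grep -ci 'error')
class=$(printf '%s\n' "$sample" | grep -c '[Ee][Rr][Rr][Oo][Rr]')

# Alternation only accepts the spellings that were listed
alt=$(printf '%s\n' "$sample" | grep -cE '(ERROR|error|Error)')

printf 'flag=%s class=%s alt=%s\n' "$flag" "$class" "$alt"
```

Alternation misses the mixed-case line because "eRRor" is not one of the three listed spellings, so it reports one match fewer than the other two.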

In Other Tools (sed, awk, find)

sed (case-insensitive with character class):

sed -n '/[Ee][Rr][Rr][Oo][Rr]/p' /tmp/fundamentals.txt

In sed, -i means in-place editing, not case-insensitivity. GNU sed does accept an I modifier on the address (/error/Ip), but character classes work in any sed.

awk (multiple approaches):

# Character class (portable)
awk '/[Ee][Rr][Rr][Oo][Rr]/' /tmp/fundamentals.txt

# Using tolower() function
awk 'tolower($0) ~ /error/' /tmp/fundamentals.txt

# IGNORECASE variable (GNU awk)
awk 'BEGIN{IGNORECASE=1} /error/' /tmp/fundamentals.txt

find -iregex (case-insensitive regex):

# -iregex is case-insensitive version of -regex
find /tmp -iregex '.*error.*'

# Without -iregex, use character class:
find /tmp -regex '.*[Ee][Rr][Rr][Oo][Rr].*'

Challenge 3: Escape the Brackets

Goal: Match [ERROR] including the literal brackets

Answer

grep (BRE):

grep '\[ERROR\]' /tmp/fundamentals.txt

Brackets [] define character classes. Escape with \[ and \] for literals.

grep -E (ERE):

grep -E '\[ERROR\]' /tmp/fundamentals.txt

Same escaping in ERE - brackets always need escaping for literal match.

In Other Tools (sed, awk, find)

sed:

sed -n '/\[ERROR\]/p' /tmp/fundamentals.txt

Same escaping rules as grep BRE.

awk:

awk '/\[ERROR\]/' /tmp/fundamentals.txt

awk uses ERE - same escaping.

find -regex:

find /tmp -regex '.*\[ERROR\].*'

find -regex also requires bracket escaping.

Alternative - use character class to match bracket:

# Match literal [ and ] using bracket expressions: [[] contains [, and []] contains ]
grep '[[]ERROR[]]' /tmp/fundamentals.txt   # Tricky but works

This is obscure - escaping is clearer.


Challenge 4: Escape the Dot (Critical!)

Goal: Match ONLY 192.168.1.100 (not 192X168Y1Z100)

Answer

grep (BRE):

grep '192\.168\.1\.100' /tmp/fundamentals.txt

The . matches ANY character. Escape it with \. for literal period.

grep -E (ERE):

grep -E '192\.168\.1\.100' /tmp/fundamentals.txt

Same escaping - dot is a metacharacter in all flavors.

The Bug:

# WRONG - matches 192X168Y1Z100 too!
grep '192.168.1.100' /tmp/fundamentals.txt

In Other Tools (sed, awk, find)

sed:

sed -n '/192\.168\.1\.100/p' /tmp/fundamentals.txt

awk:

awk '/192\.168\.1\.100/' /tmp/fundamentals.txt

find -regex:

# Match files with this IP in the name
find /tmp -regex '.*192\.168\.1\.100.*'

Why this matters: Unescaped dots in IP matching are among the most common regex bugs in log parsing scripts.


Challenge 5: Escape the Dollar Sign

Goal: Find the line with $99.99

Answer

grep (BRE):

grep '\$99\.99' /tmp/fundamentals.txt

$ means end-of-line in regex. Escape for literal dollar sign.

grep -E (ERE):

grep -E '\$99\.99' /tmp/fundamentals.txt

Same escaping required.

Shell quoting matters:

# Single quotes protect $ from shell expansion
grep '\$99' file.txt      # Correct

# Double quotes - the shell processes \ and $ before grep sees the pattern
grep "\$99" file.txt      # Shell strips the backslash; grep receives $99
grep "\\\$99" file.txt    # grep receives \$99 - works but confusing

Always use single quotes for regex patterns.
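
To see what grep would actually receive under each quoting style, printf can stand in for grep. This is a small sketch; nothing here is specific to grep, it only shows how the shell rewrites the argument.

```shell
# printf '%s\n' prints its argument exactly as the shell hands it over
single=$(printf '%s\n' '\$99')      # single quotes: nothing is touched
double=$(printf '%s\n' "\$99")      # double quotes: \$ collapses to $
triple=$(printf '%s\n' "\\\$99")    # \\ becomes \ and \$ becomes $

printf 'single: %s\ndouble: %s\ntriple: %s\n' "$single" "$double" "$triple"
```

Only the single-quoted form and the three-backslash double-quoted form deliver \$99 to the command, which is why single quotes are the simpler habit.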

In Other Tools (sed, awk, find)

sed:

sed -n '/\$99\.99/p' /tmp/fundamentals.txt

awk:

awk '/\$99\.99/' /tmp/fundamentals.txt

Inside an awk regex, $ is the end-of-line anchor (outside a regex it marks field references such as $1 and $2), so escape it to match a literal dollar sign.

awk alternative - avoid regex entirely:

awk 'index($0, "$99.99")' /tmp/fundamentals.txt

index() does literal string matching, no regex escaping needed.

find -regex:

find /tmp -regex '.*\$99\.99.*'

Challenge 6: Windows Path

Goal: Find the Windows path C:\Users\Admin

Answer

grep (BRE):

grep 'C:\\Users\\Admin' /tmp/fundamentals.txt

Backslashes need escaping: \\ for literal \

grep -E (ERE):

grep -E 'C:\\Users\\Admin' /tmp/fundamentals.txt

Same escaping in ERE.

Double escaping nightmare:

# In double quotes, shell eats one level of escaping
grep "C:\\\\Users\\\\Admin" file.txt   # Confusing!

# Single quotes - regex engine sees exactly what you type
grep 'C:\\Users\\Admin' file.txt       # Much clearer

In Other Tools (sed, awk, find)

sed:

sed -n '/C:\\Users\\Admin/p' /tmp/fundamentals.txt

awk:

awk '/C:\\Users\\Admin/' /tmp/fundamentals.txt

awk with literal match (avoids regex):

awk 'index($0, "C:\\Users\\Admin")' /tmp/fundamentals.txt

Note: In awk string literals, \\ is also needed for literal backslash.

find -regex:

find /tmp -regex '.*C:\\Users\\Admin.*'

Infrastructure note: Windows paths in Linux logs (Samba, Wine) require this escaping constantly.


Challenge 7: Match Question Mark

Goal: Find lines containing literal ?

Answer

grep (BRE) - tricky!:

grep '?' /tmp/fundamentals.txt

In BRE, ? is NOT a metacharacter - it’s literal! No escaping needed.

grep -E (ERE) - must escape:

grep -E '\?' /tmp/fundamentals.txt

In ERE, ? means "zero or one of preceding" - must escape for literal.

The BRE/ERE trap:

Character    BRE                    ERE
?            Literal (no escape)    Metachar (escape: \?)
+            Literal (no escape)    Metachar (escape: \+)
\|           Alternation (GNU)      Literal pipe
|            Literal pipe           Alternation
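
The difference for ? can be observed with two one-line inputs. This sketch uses the "What?" line from the test data plus a hypothetical input "Wha" to expose the quantifier behavior.

```shell
# BRE: ? is an ordinary character, so "Wha" does not match What?
bre_count=$(printf '%s\n' 'Wha' | grep -c 'What?')

# ERE: ? makes the preceding "t" optional, so "Wha" matches
ere_count=$(printf '%s\n' 'Wha' | grep -cE 'What?')

# A bracket expression is literal in both flavors
line='What? Really? Yes!'
bre_class=$(printf '%s\n' "$line" | grep -o 'What[?]')
ere_class=$(printf '%s\n' "$line" | grep -oE 'What[?]')

printf 'BRE=%s ERE=%s\n' "$bre_count" "$ere_count"
printf '%s | %s\n' "$bre_class" "$ere_class"
```

Writing the character as [?] gives the same literal match with or without -E.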

In Other Tools (sed, awk, find)

sed (BRE default):

sed -n '/?/p' /tmp/fundamentals.txt

No escaping in BRE sed.

sed -E (ERE):

sed -En '/\?/p' /tmp/fundamentals.txt

awk (always ERE):

awk '/\?/' /tmp/fundamentals.txt

find -regex (GNU find defaults to Emacs-style syntax, where ? is also a metacharacter):

find /tmp -regex '.*\?.*'

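Gotcha: This BRE/ERE difference is why many people get confused. When in doubt, use a bracket expression such as [?], which is literal in both flavors. Blindly escaping is not safe here: in GNU BRE, \? is itself the "zero or one" quantifier.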


Challenge 8: Match Math Expression

Goal: Find the line with 5*5=25

Answer

grep (BRE):

grep '5\*5=25' /tmp/fundamentals.txt

* means "zero or more of preceding" - escape for literal asterisk.

grep -E (ERE):

grep -E '5\*5=25' /tmp/fundamentals.txt

Same escaping in ERE - * is metachar in all flavors.

The danger of unescaped *:

# WRONG - matches "55=25", "555=25", even "5=25"!
grep '5*5=25' /tmp/fundamentals.txt

# Why? 5* means "zero or more 5s", then literal 5=25

In Other Tools (sed, awk, find)

sed:

sed -n '/5\*5=25/p' /tmp/fundamentals.txt

awk:

awk '/5\*5=25/' /tmp/fundamentals.txt

awk literal match (no regex):

awk 'index($0, "5*5=25")' /tmp/fundamentals.txt

find -regex:

find /tmp -regex '.*5\*5=25.*'

Real-world: Escaping * matters in log parsing (e.g., SELECT * FROM in SQL logs).
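
The effect of the unescaped asterisk can be counted directly. The sketch below uses three hypothetical lines, only one of which contains the real expression.

```shell
lines=$(printf '%s\n' '5*5=25' '55=25' '5=25')

# Unescaped: "5*" is "zero or more 5s", so every line containing 5=25 matches
loose=$(printf '%s\n' "$lines" | grep -c '5*5=25')

# Escaped: only the literal text 5*5=25 matches
strict=$(printf '%s\n' "$lines" | grep -c '5\*5=25')

printf 'loose=%s strict=%s\n' "$loose" "$strict"
```

The unescaped pattern accepts all three lines, while the escaped one accepts only the first.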


Challenge 9: Extract IP with -o

Goal: Extract ONLY the IP address 192.168.1.100 (not the whole line)

Answer

grep -oE (ERE with extraction):

grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' /tmp/fundamentals.txt

  • -o outputs only the matched part (not full line)
  • -E enables ERE so + works without escaping
  • [0-9]+ means one or more digits
  • \. matches literal dot

Because the pattern is generic, on the test file it prints both 192.168.1.100 and 10.50.1.20.

grep BRE equivalent:

grep -o '[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+' /tmp/fundamentals.txt

In BRE, + needs escaping as \+ to mean "one or more" (a GNU extension).

In Other Tools (sed, awk)

sed (extract with substitution):

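# Replace entire line with just the IP.
# [^0-9] stops the greedy .* from swallowing the IP's leading digits
# (so the IP must be preceded by a non-digit character).
sed -n 's/.*[^0-9]\([0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+\).*/\1/p' /tmp/fundamentals.txt

sed doesn’t have -o. Use capture groups \(…\) and back-reference \1.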

sed -E (ERE; the [^0-9] keeps the greedy .* from eating the first octet's digits):

sed -En 's/.*[^0-9]([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+).*/\1/p' /tmp/fundamentals.txt
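
A leading .* in this kind of substitution is greedy, which is worth checking before trusting the captured value. The sketch below compares a bare .* with a guarded version on a single sample line; the variable name ip is just a convenience for this example.

```shell
line='IP: 192.168.1.100'
ip='[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'

# Greedy .* keeps as much as it can, leaving the group a truncated first octet
greedy=$(printf '%s\n' "$line" | sed -En "s/.*($ip).*/\1/p")

# Requiring a non-digit right before the group protects the first octet
guarded=$(printf '%s\n' "$line" | sed -En "s/.*[^0-9]($ip).*/\1/p")

printf 'greedy:  %s\nguarded: %s\n' "$greedy" "$guarded"
```

The guarded form assumes the address is preceded by at least one non-digit character, which holds for the "IP: " prefix in the test data.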

awk (field extraction + regex):

# Print each field if it matches IP pattern
awk '{for(i=1;i<=NF;i++) if($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) print $i}' /tmp/fundamentals.txt

awk with match() and substr():

awk 'match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/) {print substr($0, RSTART, RLENGTH)}' /tmp/fundamentals.txt

RSTART and RLENGTH give position and length of match.

Note: grep -o is much simpler. Use awk/sed when you need more complex processing.


Challenge 10: Both IPs

Goal: Extract ALL IPs from the file (both 192.168.1.100 and 10.50.1.20)

Answer

grep -oE (ERE with quantifier):

grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' /tmp/fundamentals.txt

Output:

192.168.1.100
10.50.1.20

{1,3} limits each octet to 1-3 digits.

grep BRE equivalent:

grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' /tmp/fundamentals.txt

In BRE, braces need escaping: \{1,3\} not {1,3}.

More precise IP pattern (validates 0-255):

# This rejects 999.999.999.999
grep -oE '(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])' /tmp/fundamentals.txt

Complex but correct. The simple {1,3} pattern matches invalid IPs like 999.999.999.999.
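
The repeated octet group is easier to read when it is factored into a variable. The sketch below is one possible way to do that; is_ipv4 is a hypothetical helper name, and grep -x is used so the whole input must be an address rather than merely contain one.

```shell
# One octet: 250-255, 200-249, 100-199, or 0-99
octet='(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'

# is_ipv4 (hypothetical helper): succeeds only if the argument is a dotted quad
is_ipv4() {
  printf '%s\n' "$1" | grep -Eqx "$octet(\.$octet){3}"
}

for candidate in 192.168.1.100 10.50.1.20 999.999.999.999 256.1.1.1; do
  if is_ipv4 "$candidate"; then
    echo "$candidate valid"
  else
    echo "$candidate invalid"
  fi
done
```

Without -x (or explicit anchors), an input such as 256.1.1.1 would still produce a partial match on "56.1.1.1", so whole-line matching matters for validation.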

In Other Tools (sed, awk, find)

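sed -E (extract one IP per line):

sed -En 's/.*[^0-9]([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}).*/\1/p' /tmp/fundamentals.txt

Note: This only extracts ONE IP per line (the last one, because the leading .* is greedy), and the [^0-9] requires a non-digit before the IP. Use grep -o for multiple.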

awk (all IPs, multiple per line):

awk '{
  while (match($0, /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/)) {
    print substr($0, RSTART, RLENGTH)
    $0 = substr($0, RSTART + RLENGTH)
  }
}' /tmp/fundamentals.txt

awk with split() (regex separator):

awk '{
  n = split($0, a, /[^0-9.]/)
  for (i=1; i<=n; i++)
    if (a[i] ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/)
      print a[i]
}' /tmp/fundamentals.txt

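find -regex (files with IPs in name):

find /var/log -regextype posix-extended -regex '.*[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}.*'

GNU find defaults to Emacs-style regex syntax; -regextype posix-extended (a GNU find option) selects ERE so {1,3} works as an interval.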

Infrastructure use case: Extracting IPs from firewall logs, access logs, network configs.


Gotcha: The Escaped Bracket Mistake

A common beginner error is escaping the wrong character:

# WRONG - escaping bracket instead of dot
grep '[0-9]*.\[0-9]*' file

# CORRECT - escape the dot, not the bracket
grep '[0-9]*\.[0-9]*' file

The [0-9] is a character class - the brackets define it. You don’t escape those. The . matches ANY character - that’s what you escape with \.
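
Running both patterns over two sample lines shows what each one really matches. The lines below are hypothetical; the second is constructed to contain the literal text "[0-9]" that the wrong pattern ends up looking for.

```shell
sample=$(printf '%s\n' 'Price: $99.99' 'tag x[0-9]')

# Wrong: \[ is a literal bracket, so this hunts for the text "[0-9" itself
wrong=$(printf '%s\n' "$sample" | grep -o '[0-9]*.\[0-9]*')

# Correct: digits, a literal dot, digits
right=$(printf '%s\n' "$sample" | grep -o '[0-9]*\.[0-9]*')

printf 'wrong: %s\nright: %s\n' "$wrong" "$right"
```

The wrong pattern ignores the price entirely and matches "x[0-9]" on the second line; the corrected pattern extracts 99.99.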

Key Takeaways

  1. Most characters are literal - they match themselves

  2. Metacharacters have special meaning - memorize the list

  3. Backslash escapes metacharacters - \. matches literal dot

  4. Case matters - use -i flag or (?i) modifier when needed

  5. Test patterns incrementally - build from simple to complex

Regex vs Globs: Know the Difference

These tools do NOT use regex. Don’t confuse glob patterns with regex!

Tool/Context           Pattern Type           How * Works / Notes
ls *.txt               Shell glob             Any characters (including none)
find -name '*.log'     Shell glob             Use -regex for actual regex
xargs                  No pattern matching    Just passes arguments to commands
[[ $x == pattern ]]    Shell glob             Use =~ operator for regex
case $x in pattern)    Shell glob             Extended glob with shopt -s extglob

The Trap: Same Symbol, Different Meaning

# GLOB: * means "any characters" (like .* in regex)
ls *.txt              # Matches: foo.txt, bar.txt (hidden files like .txt are skipped by default)

# REGEX: * means "zero or more of PRECEDING"
grep 'a*' file.txt    # Matches: "", "a", "aa", "aaa"...
grep '.*' file.txt    # This is the regex equivalent of glob's *
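
The two meanings of * can be compared on plain strings, without touching the filesystem. This sketch uses a hypothetical file name and the shell's case statement for the glob side.

```shell
name='notes.txt'

# Glob context (case): * stands for any run of characters
case $name in
  *.txt) glob_result=match ;;
  *)     glob_result=no-match ;;
esac

# Regex context: .* plays the role of glob's *, and the dot must be escaped
regex_result=$(printf '%s\n' "$name" | grep -c '^.*\.txt$')

# In a regex, a* is "zero or more a", so it matches even a line with no "a"
star_result=$(printf '%s\n' 'xyz' | grep -c 'a*')

printf 'glob=%s regex=%s star=%s\n' "$glob_result" "$regex_result" "$star_result"
```

The last count is 1 because a* is satisfied by the empty string, which is why a lone a* is rarely a useful search pattern.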

Quick Decision Tree

Am I using regex?

├─ grep, grep -E, grep -P  → YES (BRE, ERE, PCRE)
├─ sed, sed -E             → YES (BRE, ERE)
├─ awk                     → YES (ERE always)
├─ find -regex             → YES (Emacs-style by default in GNU find, matches FULL path)
├─ find -name              → NO (glob)
├─ ls, mv, cp arguments    → NO (shell glob expansion)
├─ xargs                   → NO (no pattern matching)
├─ [[ $x == pattern ]]     → NO (glob, use =~ for regex)
└─ case/esac               → NO (glob)

Test Yourself

# Create test files
touch /tmp/test1.txt /tmp/test2.txt /tmp/test.log

# Glob (shell expands before command runs)
ls /tmp/test*.txt
# Output: /tmp/test1.txt /tmp/test2.txt

# Regex (grep interprets pattern)
ls /tmp | grep 'test.*\.txt'
# Output: test1.txt and test2.txt, one per line (plus any other matching names in /tmp)

# WRONG: Using regex syntax in glob context
ls /tmp/test.+\.txt    # FAILS - glob doesn't understand + or \.

Next Module

Character Classes - Matching sets of characters with [] and shorthand classes.