Character Classes

Character classes define a set of characters, any one of which can match at that position. They are fundamental to matching variable content like digits, letters, or specific character sets.

Basic Character Classes

Syntax: Square Brackets

A character class is enclosed in square brackets []. Any single character inside the brackets can match.

Pattern: [aeiou]
Matches: Any single vowel (a, e, i, o, or u)

Pattern: [0123456789]
Matches: Any single digit

Order Doesn’t Matter

[abc] = [bca] = [cab]

All match a single character: 'a', 'b', or 'c'.
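A quick way to check both points (one character per match, order irrelevant) is to run two differently ordered classes through grep -o and compare the results. This is only a sketch; the sample string and variable names are made up for illustration.

```shell
# Hypothetical sample input containing a, b, c and other characters.
sample='a cab, a bed'

# -o prints each match on its own line; every match is exactly one character.
out_abc=$(printf '%s\n' "$sample" | grep -o '[abc]')
out_cab=$(printf '%s\n' "$sample" | grep -o '[cab]')

# Both classes describe the same set, so the outputs are identical.
if [ "$out_abc" = "$out_cab" ]; then
    echo "same matches"
fi
```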

Character Ranges

Use a hyphen (-) between two characters to specify a range.

Pattern        Matches          Description
[a-z]          a through z      Lowercase letters
[A-Z]          A through Z      Uppercase letters
[0-9]          0 through 9      Digits
[a-zA-Z]       a-z, A-Z         All letters
[a-zA-Z0-9]    a-z, A-Z, 0-9    Alphanumeric
[A-Fa-f0-9]    Hex characters   Hexadecimal digits

Combining Ranges and Literals

Pattern: [a-zA-Z_][a-zA-Z0-9_]*
Meaning: Valid identifier (starts with letter/underscore, followed by alphanumeric/underscore)
Matches: myVar, _private, count123, MAX_VALUE
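The identifier pattern only validates a whole string when it is anchored; without ^ and $ it would also match the valid tail of an invalid name. A minimal sketch, assuming a hypothetical helper named is_identifier and using LC_ALL=C so that the ranges mean plain ASCII:

```shell
# Return success when the whole argument is a valid identifier.
# ^ and $ anchor the pattern so that partial matches are rejected.
is_identifier() {
    printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-zA-Z_][a-zA-Z0-9_]*$'
}

for name in myVar _private count123 MAX_VALUE 123invalid my-var; do
    if is_identifier "$name"; then
        echo "$name: valid"
    else
        echo "$name: invalid"
    fi
done
```

The first four names should be reported as valid; 123invalid fails on its first character and my-var fails on the hyphen.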

Negated Character Classes

Caret ^ at the START of a class negates it.

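Pattern      Matches
[^0-9]       Any character EXCEPT digits
[^a-zA-Z]    Any non-letter
[^aeiou]     Any character except a lowercase vowel (consonants, uppercase letters, digits, punctuation)
[^ \t\n]     Any character except space, tab, or newline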

Important: ^ only negates when it’s the FIRST character inside [].

[^abc]  → NOT a, b, or c
[a^bc]  → a, ^, b, or c (^ is literal here)
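The difference between the two caret positions can be seen on a three-line input. The sketch below is illustrative only; the input and variable names are invented for the example.

```shell
# Three one-character lines: a digit, a letter, and a caret.
input=$(printf '7\nx\n^\n')

# Caret first: negation. Selects lines containing a character that is not a digit.
not_digit=$(printf '%s\n' "$input" | grep '[^0-9]')

# Caret not first: literal. Selects lines containing a digit or a caret.
digit_or_caret=$(printf '%s\n' "$input" | grep '[0-9^]')

echo "negated class selected:"
echo "$not_digit"
echo "literal caret class selected:"
echo "$digit_or_caret"
```

The negated class selects the x and ^ lines; the class with the caret in second position selects the 7 and ^ lines.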

Special Characters Inside Classes

Most metacharacters lose their special meaning inside []:

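Character    Inside []                         Notes
.            Literal dot                       No need to escape
*            Literal asterisk                  No need to escape
+            Literal plus                      No need to escape
?            Literal question mark             No need to escape
^            Negation (if first) or literal    Escape or place not-first
-            Range operator                    Escape or place first/last
]            Ends class                        Escape or place first
\            Escape character                  Still works in PCRE-style engines

Note: in POSIX bracket expressions as used by grep, grep -E, and sed, a backslash inside [] is a literal backslash, not an escape. With those tools, rely on placement (first, last, not-first) rather than escaping.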

Matching Literal Hyphen, Caret, Bracket

# Hyphen at end (literal)
[a-z-]

# Hyphen at start (literal)
[-a-z]

# Hyphen escaped (PCRE-style engines only; in POSIX grep/sed the \ is literal inside [])
[a-z\-0-9]

# Caret not first (literal)
[a-z^]

# Closing bracket first (literal in POSIX, PCRE, and Python; not in JavaScript)
[]a-z]
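The placement rules can be tried directly with grep, where escaping inside [] is not available. This is a small sketch with made-up input; -x makes grep match whole lines, and LC_ALL=C keeps a-z to ASCII.

```shell
# One token per line: a letter, a hyphen, a closing bracket, a digit.
tokens=$(printf 'q\n-\n]\n5\n')

# Hyphen last: lowercase letters plus a literal hyphen.
hyphen_last=$(printf '%s\n' "$tokens" | LC_ALL=C grep -x '[a-z-]')

# Closing bracket first: a literal ] plus lowercase letters.
bracket_first=$(printf '%s\n' "$tokens" | LC_ALL=C grep -x '[]a-z]')

echo "hyphen-last class selected:"
echo "$hyphen_last"
echo "bracket-first class selected:"
echo "$bracket_first"
```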

Shorthand Character Classes

PCRE and most modern engines provide shorthand:

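Shorthand    Longhand (ASCII)     Matches               Notes
\d           [0-9]                Digit                 PCRE-style engines only
\D           [^0-9]               Non-digit             PCRE-style engines only
\w           [A-Za-z0-9_]         Word character        PCRE-style; also a GNU grep/sed extension
\W           [^A-Za-z0-9_]        Non-word character    PCRE-style; also a GNU grep/sed extension
\s           [ \t\n\r\f\v]        Whitespace            PCRE-style; also a GNU grep/sed extension
\S           [^ \t\n\r\f\v]       Non-whitespace        PCRE-style; also a GNU grep/sed extension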

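Availability:
- grep -P, ripgrep, Python, JavaScript, Perl: YES
- grep, grep -E, sed, awk: not defined by POSIX, so use longhand for portable scripts. \d is not supported; GNU grep and GNU sed accept \w, \W, \s, and \S as extensions.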

Using Shorthand in Practice

# PCRE: Match IP address
grep -P '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}' file.txt

# ERE equivalent (no shorthand)
grep -E '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' file.txt

POSIX Character Classes

POSIX defines named classes (must be inside another []):

POSIX Class    Equivalent (C locale)                  Matches
[:alpha:]      [A-Za-z]                               Letters
[:digit:]      [0-9]                                  Digits
[:alnum:]      [A-Za-z0-9]                            Alphanumeric
[:space:]      [ \t\n\r\f\v]                          Whitespace
[:upper:]      [A-Z]                                  Uppercase
[:lower:]      [a-z]                                  Lowercase
[:punct:]      !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~       Punctuation
[:xdigit:]     [A-Fa-f0-9]                            Hex digits

Usage:

# Note: Double brackets required
grep '[[:digit:]]' file.txt
grep '[[:alpha:]][[:digit:]]' file.txt
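Because a POSIX class name sits inside an ordinary bracket expression, it can be anchored, repeated, and negated like any other class member. The following sketch uses invented sample lines to show the three cases.

```shell
# Hypothetical sample lines.
lines=$(printf 'abc\n123\na1\n!!\n')

# Lines made up entirely of digits.
only_digits=$(printf '%s\n' "$lines" | grep -E '^[[:digit:]]+$')

# Lines containing a letter immediately followed by a digit.
letter_then_digit=$(printf '%s\n' "$lines" | grep '[[:alpha:]][[:digit:]]')

# Negation goes in the outer brackets: lines with no alphanumeric characters.
no_alnum=$(printf '%s\n' "$lines" | grep -E '^[^[:alnum:]]+$')

echo "digits only:       $only_digits"
echo "letter then digit: $letter_then_digit"
echo "no alphanumerics:  $no_alnum"
```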

Infrastructure Patterns

Hex Characters (MAC addresses, hashes)

Pattern: [A-Fa-f0-9]
Use: Single hex digit

Pattern: [A-Fa-f0-9]{2}
Use: Hex byte (AA, FF, 0a)

Pattern: [A-Fa-f0-9]{32}
Use: MD5 hash

Pattern: [A-Fa-f0-9]{64}
Use: SHA-256 hash
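These length-based patterns only tell hashes apart when the match covers the whole string, since an unanchored {32} also matches inside a 64-character value. A minimal sketch, assuming a hypothetical helper named classify_hash; the 64-character sample is built by doubling the MD5 string from the test data and is not a real SHA-256 digest.

```shell
# Classify a string by hex length. -x forces a whole-line match.
classify_hash() {
    if printf '%s\n' "$1" | LC_ALL=C grep -Eqx '[A-Fa-f0-9]{32}'; then
        echo "md5-length"
    elif printf '%s\n' "$1" | LC_ALL=C grep -Eqx '[A-Fa-f0-9]{64}'; then
        echo "sha256-length"
    else
        echo "unknown"
    fi
}

md5='5d41402abc4b2a76b9719d911017c592'
long="$md5$md5"   # 64 hex characters, for illustration only

classify_hash "$md5"
classify_hash "$long"

# Without -x, the 32-character pattern also matches the 64-character string.
if printf '%s\n' "$long" | LC_ALL=C grep -Eq '[A-Fa-f0-9]{32}'; then
    echo "unanchored {32} matches the 64-character value too"
fi
```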

Port Numbers

Pattern: [0-9]{1,5}
Use: 1-5 digit number (0-99999)
Note: Doesn't validate range (65535 max)
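Since the character class checks only the shape of the value, the range limit has to be enforced separately. One possible approach, sketched here with a hypothetical helper named is_port, is to pair the regex with a numeric comparison:

```shell
# Return success when the argument is a decimal port number in 0-65535.
is_port() {
    # Step 1: shape check with the character class (1 to 5 digits, nothing else).
    printf '%s\n' "$1" | grep -Eq '^[0-9]{1,5}$' || return 1
    # Step 2: numeric range check, which the regex alone does not express.
    [ "$1" -le 65535 ]
}

for candidate in 443 80 65535 65536 99999 12a; do
    if is_port "$candidate"; then
        echo "$candidate: valid port"
    else
        echo "$candidate: not a valid port"
    fi
done
```

Splitting the check this way keeps the pattern readable; encoding 0-65535 purely as a regex is possible but much longer.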

Usernames

Pattern: [a-z][a-z0-9_-]{2,31}
Use: Linux username rules
- Starts with lowercase letter
- 3-32 characters
- Lowercase, digits, underscore, hyphen

Log Levels

Pattern: [DIWEF][A-Z]+
Use: Matches DEBUG, INFO, WARN, ERROR, FATAL (and any other uppercase word of two or more letters that starts with D, I, W, E, or F)
- First letter distinguishes level
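The first-letter class is compact but loose. The sketch below compares it with an explicit alternation on a made-up word list; EOF and TRACE are included only to show what each pattern accepts and rejects.

```shell
# Hypothetical word list: five real levels plus two other uppercase words.
words=$(printf 'DEBUG\nINFO\nWARN\nERROR\nFATAL\nEOF\nTRACE\n')

# Loose: any uppercase word whose first letter is D, I, W, E, or F.
loose=$(printf '%s\n' "$words" | LC_ALL=C grep -Ex '[DIWEF][A-Z]+')

# Strict: only the five named levels.
strict=$(printf '%s\n' "$words" | grep -Ex 'DEBUG|INFO|WARN|ERROR|FATAL')

echo "loose pattern accepted:"
echo "$loose"
echo "strict pattern accepted:"
echo "$strict"
```

The loose pattern also accepts EOF, while both patterns reject TRACE. The character class is usually fine when the surrounding context (such as a "Level: " prefix) already narrows the match.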

Self-Test Exercises

Try each challenge FIRST. Only read the answer after you’ve attempted it.

Setup Test Data

cat << 'EOF' > /tmp/classes.txt
User: admin123
User: _system
User: 123invalid
MAC: AA:BB:CC:DD:EE:FF
MAC: aa:bb:cc:dd:ee:ff
MAC: GG:HH:II:JJ:KK:LL
Hash: 5d41402abc4b2a76b9719d911017c592
Port: 443
Port: 80
Port: 99999
Level: INFO
Level: ERROR
Level: DEBUG
Temperature: -5C
Temperature: 25C
VLAN: 10
VLAN: 999
IP: 192.168.1.100
IP: 10.50.1.20
EOF

Challenge 1: Match Any Digit

Goal: Find lines containing any digit 0-9

Answer
grep '[0-9]' /tmp/classes.txt

[0-9] matches any single digit


Challenge 2: Match Hex Characters Only

Goal: Extract valid hex octets (AA, BB, aa, ff, etc.) but NOT invalid ones (GG, HH)

Answer
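grep '^MAC: ' /tmp/classes.txt | cut -d' ' -f2 | grep -oE '[A-Fa-f0-9]{2}'

[A-Fa-f0-9] matches hex chars only (A-F, a-f, 0-9). Filtering to the MAC lines first matters: run against the whole file, the same class also picks up pairs such as "ad" in admin123 and "AC" in "MAC".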


Challenge 3: Valid MAC Address

Goal: Match only valid MAC addresses (hex chars, not GG:HH:II...)

Answer
grep -E '([A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2}' /tmp/classes.txt

This won’t match GG:HH:II:JJ:KK:LL because G, H, I aren’t hex


Challenge 4: Username Starting with Letter

Goal: Find usernames that start with a letter (valid)

Answer
grep -E 'User: [a-zA-Z]' /tmp/classes.txt

[a-zA-Z] matches any letter


Challenge 5: Username Starting with Digit (Invalid)

Goal: Find usernames that start with a digit (invalid)

Answer
grep -E 'User: [0-9]' /tmp/classes.txt

Output: User: 123invalid


Challenge 6: Match NOT a Digit

Goal: Find lines where the character after "User: " is NOT a digit

Answer
grep -E 'User: [^0-9]' /tmp/classes.txt

[^0-9] = negated class = NOT a digit


Challenge 7: Extract Port Numbers

Goal: Extract just the port numbers (443, 80, 99999)

Answer
grep -oE 'Port: [0-9]+' /tmp/classes.txt | grep -oE '[0-9]+'

Or with lookbehind (PCRE):

grep -oP '(?<=Port: )[0-9]+' /tmp/classes.txt

Challenge 8: Log Levels (Uppercase Words)

Goal: Extract log levels (INFO, ERROR, DEBUG)

Answer
grep -oE 'Level: [A-Z]+' /tmp/classes.txt | grep -oE '[A-Z]+$'

[A-Z]+ = one or more uppercase letters. The second grep strips the "Level: " prefix, as in Challenge 7.


Challenge 9: Negative Numbers

Goal: Match temperatures including negative values (-5C, 25C)

Answer
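grep -oE -e '-?[0-9]+C' /tmp/classes.txt

-? = optional minus sign. The -e flag is needed because the pattern starts with a hyphen, which grep would otherwise try to parse as an option.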


Challenge 10: PCRE vs ERE

Goal: Match digits using PCRE shorthand, then using ERE equivalent

Answer
# PCRE shorthand (grep -P required)
grep -oP '\d+' /tmp/classes.txt

# ERE equivalent (works everywhere)
grep -oE '[0-9]+' /tmp/classes.txt

\d = [0-9] but only works with -P


Challenge 11: Alphanumeric Only

Goal: Match the hash (32 lowercase hex characters, a subset of the alphanumerics)

Answer
grep -oE '[a-f0-9]{32}' /tmp/classes.txt

MD5 hash is 32 hex characters


Challenge 12: VLAN Range

Goal: Match VLAN IDs (1-4 digit numbers after "VLAN: ")

Answer
grep -oE 'VLAN: [0-9]{1,4}' /tmp/classes.txt

{1,4} limits to 1-4 digits (VLANs go up to 4094)

Common Mistakes

Mistake 1: Unescaped Hyphen

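# Wrong - hyphen sits between two ranges
[a-z-0-9]  # Undefined in POSIX; engines disagree on the result

# Correct - hyphen at end
[a-z0-9-]

# Correct in PCRE-style engines - hyphen escaped
# (in POSIX grep/sed the backslash is literal inside [])
[a-z\-0-9]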

Mistake 2: Shorthand in Wrong Engine

# Fails in basic grep/awk
grep '\d+' file.txt  # \d not recognized

# Works in PCRE
grep -P '\d+' file.txt

# Works in any tool that supports POSIX ERE
grep -E '[0-9]+' file.txt

Mistake 3: Forgetting Double Brackets for POSIX

# Wrong
grep '[:digit:]' file.txt  # A plain class of ':', 'd', 'i', 'g', 't'; GNU grep rejects it with an error

# Correct
grep '[[:digit:]]' file.txt  # Matches digits

Key Takeaways

  1. [] defines a set - any character inside can match

  2. Ranges use hyphen - [a-z], [0-9]

  3. ^ at start negates - [^0-9] means non-digit

  4. Shorthand is not universal - \d, \w, \s work in PCRE-style engines; use longhand or POSIX classes for portable patterns

  5. POSIX uses double brackets - [[:digit:]], [[:alpha:]]

  6. Most metacharacters are literal inside []

Next Module

Quantifiers - Specifying repetition with *, +, ?, and {n,m}.