Drill 07: Lookahead
Lookahead assertions match a position based on what follows, without consuming those characters. This enables powerful conditional matching like "find X only if followed by Y" or "find X only if NOT followed by Y."
Core Concepts
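| Syntax | Meaning | Example |
|---|---|---|
| `(?=...)` | Positive lookahead - must follow | `foo(?=bar)` matches "foo" only when "bar" follows |
| `(?!...)` | Negative lookahead - must NOT follow | `foo(?!bar)` matches "foo" only when "bar" does not follow |
| Zero-width | Doesn’t consume characters | Position assertion, not a match |
| PCRE only | Not in BRE/ERE | Use `grep -P` |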
Zero-Width Explained
Lookahead assertions check what’s ahead WITHOUT including it in the match:
# Without lookahead: "foobar" consumed entirely
echo "foobar" | grep -oP 'foo.*'
# Output: foobar
# With lookahead: only "foo" matched, "bar" verified but not consumed
echo "foobar" | grep -oP 'foo(?=bar)'
# Output: foo
# The difference matters for extraction and replacement
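The same distinction can be observed from Python's `re` module, which also supports lookahead. The sketch below is only an illustration using the sample string from above: it compares the match spans and shows why the difference matters for replacement.

```python
import re

text = "foobar"

# Consuming pattern: the match covers all six characters.
consumed = re.search(r"foo.*", text)
print(consumed.group(), consumed.span())   # foobar (0, 6)

# Lookahead: "bar" is checked, but the match ends after "foo".
peeked = re.search(r"foo(?=bar)", text)
print(peeked.group(), peeked.span())       # foo (0, 3)

# Replacement only touches the consumed part of the match.
print(re.sub(r"foo(?=bar)", "FOO", text))  # FOObar
print(re.sub(r"foobar", "FOO", text))      # FOO
```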
Interactive CLI Drill
bash ~/atelier/_bibliotheca/domus-captures/docs/modules/ROOT/examples/regex-drills/07-lookahead.sh
Exercise Set 1: Positive Lookahead Basics
cat << 'EOF' > /tmp/ex-lookahead.txt
filename.txt
filename.pdf
filename.doc
config.yaml
config.json
config.xml
script.sh
script.py
script.rb
user123
user456admin
admin789
EOF
Ex 1.1: Match "filename" only if followed by ".txt"
Solution
grep -oP 'filename(?=\.txt)' /tmp/ex-lookahead.txt
Output: filename (only from the .txt line)
Note: The .txt is verified but NOT included in the match.
Ex 1.2: Match "config" only if followed by ".yaml" or ".json"
Solution
grep -oP 'config(?=\.(yaml|json))' /tmp/ex-lookahead.txt
Output: config (twice - for yaml and json lines)
Ex 1.3: Match "user" followed by digits
Solution
grep -oP 'user(?=\d+)' /tmp/ex-lookahead.txt
Output: user (from user123 and user456admin lines)
Ex 1.4: Match "script" followed by .py or .sh
Solution
grep -oP 'script(?=\.(py|sh))' /tmp/ex-lookahead.txt
Exercise Set 2: Negative Lookahead Basics
cat << 'EOF' > /tmp/ex-neglook.txt
192.168.1.100
192.168.1.1
10.0.0.1
172.16.0.100
8.8.8.8
1.1.1.1
user_test
user_prod
user_dev
admin_test
admin_prod
EOF
Ex 2.1: Match "user_" NOT followed by "test"
Solution
grep -oP 'user_(?!test)' /tmp/ex-neglook.txt
Output: user_ (from user_prod and user_dev)
Ex 2.2: Match IPs NOT ending in ".1"
Solution
grep -P '\d+\.\d+\.\d+\.(?!1$)\d+' /tmp/ex-neglook.txt
Output: Lines with IPs not ending in .1
Ex 2.3: Match "admin_" NOT followed by "test"
Solution
grep -oP 'admin_(?!test)\w+' /tmp/ex-neglook.txt
Output: admin_prod
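Ex 2.4: Match "192.168." NOT followed by a third octet of 1
Solution
# Match 192.168. NOT followed by "1."
grep -P '192\.168\.(?!1\.)' /tmp/ex-neglook.txt
This matches 192.168.x.y addresses whose third octet x is not 1. Both 192.168 addresses in the sample file have a third octet of 1, so the command prints nothing here; add a line such as 192.168.2.10 to see a match.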
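A minimal Python sketch of the same negative lookahead, for readers who want to experiment outside grep. The address 192.168.2.10 is a hypothetical entry that is not in the exercise file; it is added so that at least one address passes the check.

```python
import re

# Sample addresses from the exercise file, plus one hypothetical entry
# (192.168.2.10) whose third octet is not 1.
addresses = ["192.168.1.100", "192.168.1.1", "10.0.0.1", "192.168.2.10"]

# "192.168." must not be followed by "1." (a third octet of exactly 1).
pattern = re.compile(r"192\.168\.(?!1\.)")

matching = [addr for addr in addresses if pattern.search(addr)]
print(matching)  # ['192.168.2.10']
```

Because the lookahead includes the trailing dot, a third octet such as 10 or 100 is still accepted; only the octet 1 itself is rejected.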
Exercise Set 3: Password Validation
Lookahead is perfect for password complexity rules (must contain X AND Y AND Z):
cat << 'EOF' > /tmp/ex-passwords.txt
password
Password1
Password1!
Pass1!
Abcd1234!
abc123
ABC123!
longpasswordwithoutspecials
P@ssw0rd!
EOF
Ex 3.1: Must contain at least one digit
Solution
grep -P '(?=.*\d)' /tmp/ex-passwords.txt
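(?=.*\d) - From the current position, look ahead for any characters followed by a digit. The pattern consumes nothing, so grep simply prints every line that contains a digit.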
Ex 3.2: Must contain at least one uppercase
Solution
grep -P '(?=.*[A-Z])' /tmp/ex-passwords.txt
Ex 3.3: Must contain digit AND uppercase AND special char
Solution
grep -P '(?=.*\d)(?=.*[A-Z])(?=.*[!@#$%^&*])' /tmp/ex-passwords.txt
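Multiple lookaheads - ALL must be satisfied. Output: Password1!, Pass1!, Abcd1234!, ABC123!, P@ssw0rd! (this pattern has no length or lowercase requirement, so Pass1! and ABC123! also pass)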
Ex 3.4: Full password validation (8+ chars, digit, upper, lower, special)
Solution
grep -P '^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]).{8,}$' /tmp/ex-passwords.txt
Breaking it down:
- (?=.*\d) - must contain digit
- (?=.*[a-z]) - must contain lowercase
- (?=.*[A-Z]) - must contain uppercase
- (?=.*[!@#$%^&*]) - must contain special
- .{8,} - must be 8+ characters
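Because each lookahead in the breakdown above is an independent condition, the rules can also be tested one at a time to report which one failed. The following is a sketch under that assumption; the `RULES` table and the `failed_rules` helper are hypothetical names introduced here, and the length rule is rewritten as a lookahead so that it fits the same shape.

```python
import re

# Each rule is one lookahead from the pattern above, named for reporting.
# "length 8+" is expressed as a lookahead here purely for uniformity.
RULES = {
    "digit": r"(?=.*\d)",
    "lowercase": r"(?=.*[a-z])",
    "uppercase": r"(?=.*[A-Z])",
    "special": r"(?=.*[!@#$%^&*])",
    "length 8+": r"(?=.{8,}$)",
}

def failed_rules(password):
    """Return the names of the rules that the password does not satisfy."""
    return [name for name, rule in RULES.items()
            if not re.match(rule, password)]

# Passwords taken from the exercise file.
for pwd in ["Password1!", "Pass1!", "ABC123!", "password"]:
    print(pwd, failed_rules(pwd))
```

An empty list means the password satisfies every rule, which is the same result the single combined pattern would give.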
Exercise Set 4: Data Extraction
cat << 'EOF' > /tmp/ex-extract.txt
price: $99.99 USD
price: $149.50 USD
price: 50 EUR
total: $1,234.56
discount: 20%
tax: 8.5%
quantity: 100 units
quantity: 50 units
EOF
Ex 4.1: Extract numbers followed by "USD"
Solution
grep -oP '\$[\d,.]+(?= USD)' /tmp/ex-extract.txt
Output: $99.99, $149.50
Note: The " USD" is checked but not included in output.
Ex 4.2: Extract percentages (numbers followed by %)
Solution
grep -oP '[\d.]+(?=%)' /tmp/ex-extract.txt
Output: 20, 8.5
Ex 4.3: Extract quantities followed by "units"
Solution
grep -oP '\d+(?= units)' /tmp/ex-extract.txt
Output: 100, 50
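Extraction with lookahead is handy when the result feeds further processing, since the unit suffix never ends up in the captured text. The sketch below is one possible way to turn the exercise data into numbers in Python; the variable names are illustrative, and the capture group around the digits is an addition that keeps the "$" out of the result.

```python
import re

# A subset of the lines from the exercise file.
text = "\n".join([
    "price: $99.99 USD",
    "price: $149.50 USD",
    "price: 50 EUR",
    "total: $1,234.56",
    "discount: 20%",
    "tax: 8.5%",
])

# Amounts after "$", but only when " USD" follows.
usd = [float(m.replace(",", ""))
       for m in re.findall(r"\$([\d,.]+)(?= USD)", text)]
print(usd)       # [99.99, 149.5]

# Numbers immediately followed by "%"; the sign itself is not captured.
percents = [float(m) for m in re.findall(r"[\d.]+(?=%)", text)]
print(percents)  # [20.0, 8.5]
```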
Exercise Set 5: Code Pattern Matching
cat << 'EOF' > /tmp/ex-code.txt
function getName() {
function getData() {
function setConfig() {
const name = "value";
let count = 0;
var oldStyle = true;
import React from 'react';
import { useState } from 'react';
export default App;
export const helper;
EOF
Ex 5.1: Match "function" followed by "get"
Solution
grep -oP 'function (?=get)' /tmp/ex-code.txt
Output: `function ` (from getName and getData lines)
Ex 5.2: Match variable declarations NOT using "var"
Solution
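grep -P '^(?!var\b)\w+ \w+ =' /tmp/ex-code.txt
Output: the const and let lines. Note: a negative lookahead is only useful in front of a pattern that could otherwise match the excluded text; in (?!var)(const|let) the alternation already rules out "var", so the lookahead does nothing.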
Ex 5.3: Match "export" followed by "default"
Solution
grep -oP 'export(?= default)' /tmp/ex-code.txt
Ex 5.4: Match imports from 'react'
Solution
grep -oP "import .+(?= from 'react')" /tmp/ex-code.txt
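A negative lookahead placed at the start of a line can exclude one keyword while a generic pattern accepts the rest. The sketch below illustrates this in Python on a few lines from the exercise file; the "keyword name =" shape is an assumption made for this example, not a full declaration parser.

```python
import re

# A few lines from the exercise file.
code = """function getName() {
const name = "value";
let count = 0;
var oldStyle = true;
export const helper;"""

# Declarations shaped like "<keyword> <name> =" whose keyword is not "var".
pattern = re.compile(r"^(?!var\b)\w+ \w+ =", re.MULTILINE)
print(pattern.findall(code))  # ['const name =', 'let count =']
```

The `\b` after `var` keeps the lookahead from rejecting identifiers that merely begin with "var", such as a hypothetical `variant`.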
Real-World Applications
Professional: ISE Log Filtering
# Match MACs followed by "Passed" (successful auth)
grep -oP '[0-9A-F:]{17}(?=.*Passed)' /var/log/ise-psc.log
# Match usernames NOT followed by "Failed"
grep -oP 'User-Name=\K\w+(?!.*Failed)' /var/log/ise-psc.log
# Match IPs followed by specific policy
grep -oP '[\d.]+(?=.*Wired_802\.1X)' /var/log/ise-psc.log
Professional: Log Analysis
# Error messages followed by stack traces
grep -P 'ERROR(?=.*at .+\.java:\d+)' app.log
# Match timestamps followed by ERROR/FATAL
grep -oP '\d{4}-\d{2}-\d{2}T[\d:]+(?=.*(ERROR|FATAL))' app.log
# API calls NOT followed by 200
grep -P 'GET /api/\w+(?!.*200)' access.log
Professional: Network Config Validation
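# grep matches line by line, so these checks assume each stanza has been flattened onto a single line
# Interfaces with IP but NOT shutdown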
grep -P 'interface \w+(?=.*ip address)(?!.*shutdown)' config.txt
# VLANs followed by specific name patterns
grep -oP 'vlan \d+(?= name (DATA|VOICE|MGMT))' config.txt
# ACLs NOT followed by "permit any any" (potential security)
grep -P 'ip access-list \w+(?!.*permit any any)' config.txt
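Device configurations are usually multi-line, while grep applies a pattern to one line at a time. One possible approach, sketched below in Python, is to split the text into per-interface blocks first and then apply the same two lookaheads to each block. The configuration snippet is hypothetical and invented for this illustration; real configs will differ.

```python
import re

# Hypothetical IOS-style snippet, invented for this sketch.
config = """\
interface Gi0/1
 ip address 10.0.0.1 255.255.255.0
!
interface Gi0/2
 ip address 10.0.1.1 255.255.255.0
 shutdown
!
interface Gi0/3
 description unused
"""

# One block per interface: lazily extend until the next "interface" line
# (or the end of the text) lies ahead.
blocks = re.findall(r"^interface .*?(?=^interface |\Z)", config, re.M | re.S)

# The same two lookaheads as the grep pattern, now spanning a whole block.
rule = re.compile(r"interface (\S+)(?=.*ip address)(?!.*^ shutdown$)",
                  re.M | re.S)

active = []
for block in blocks:
    m = rule.match(block)
    if m:
        active.append(m.group(1))

print(active)  # ['Gi0/1']
```

Gi0/2 is rejected by the negative lookahead and Gi0/3 by the positive one, so only Gi0/1 remains.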
Personal: Email Filtering
# Subjects containing "urgent" followed by action words (-r is needed to search a directory)
grep -rPi 'subject:.*urgent(?=.*(action|required|asap))' ~/mail/
# From addresses NOT at spam-like domains (the lookahead sits right after "@" so that \w+ cannot backtrack around it)
grep -rPi 'from:.*@(?!\w+\.(spam|junk|promo)\.)' ~/mail/
Personal: Document Organization
# Files with dates followed by specific types
ls | grep -P '\d{4}-\d{2}-\d{2}(?=.*(report|summary|review))'
# Notes containing TODO followed by priority
grep -Pi 'TODO(?=.*(high|urgent|p1))' ~/notes/*.md
# Entries with amounts followed by category
grep -oP '\$[\d.]+(?=.*(groceries|utilities|rent))' ~/budget.txt
Personal: Health/Fitness Tracking
# Workout entries followed by duration
grep -oP '\w+day(?=.*\d+ min)' ~/fitness.log
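# Meals NOT followed by "healthy" tag
# (no ".*" before the lookahead: a greedy ".*" would run to the end of the line, where the check always passes)
grep -P 'meal:(?!.*#healthy)' ~/diet.log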
# Weight entries followed by trend indicator
grep -oP '[\d.]+(?= (kg|lbs).*↓)' ~/weight.log
Tool Variants
grep -P: PCRE Lookahead
# Basic positive lookahead
grep -oP 'foo(?=bar)' file.txt
# Basic negative lookahead
grep -oP 'foo(?!bar)' file.txt
# Multiple conditions
grep -oP '(?=.*pattern1)(?=.*pattern2).*' file.txt
# Combined with other features
grep -oP '(?=.*\d)[A-Za-z]+' file.txt
Python: Lookahead
import re
text = "foobar foobaz fooqux"
# Positive lookahead
pattern = re.compile(r'foo(?=bar)')
matches = pattern.findall(text)
print(matches) # ['foo'] - only before 'bar'
# Negative lookahead
pattern = re.compile(r'foo(?!bar)')
matches = pattern.findall(text)
print(matches) # ['foo', 'foo'] - before 'baz' and 'qux'
# Password validation
def validate_password(pwd):
    pattern = re.compile(
        r'^(?=.*\d)'        # Must have digit
        r'(?=.*[a-z])'      # Must have lowercase
        r'(?=.*[A-Z])'      # Must have uppercase
        r'(?=.*[!@#$%])'    # Must have special
        r'.{8,}$'           # 8+ characters
    )
    return bool(pattern.match(pwd))
# Conditional extraction
text = "price: $99.99 USD, price: 50 EUR"
pattern = re.compile(r'\$[\d.]+(?= USD)')
usd_prices = pattern.findall(text) # ['$99.99']
JavaScript: Lookahead
const text = "foobar foobaz fooqux";
// Positive lookahead
const positive = text.match(/foo(?=bar)/g);
console.log(positive); // ['foo']
// Negative lookahead
const negative = text.match(/foo(?!bar)/g);
console.log(negative); // ['foo', 'foo']
// Password validation
const passwordRegex = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%]).{8,}$/;
console.log(passwordRegex.test("Password1!")); // true
sed: Limited Lookahead Support
# sed doesn't support lookahead natively
# Workaround: Use grep -P for filtering, then sed
# Extract then transform
grep -oP 'foo(?=bar)' file.txt | sed 's/foo/FOO/'
# Or use perl for one-liner
perl -pe 's/foo(?=bar)/FOO/g' file.txt
awk: No Native Lookahead
# awk doesn't support lookahead
# Workaround: Two-step matching
# Check if pattern exists, then extract
awk '/foobar/ {gsub(/bar/, ""); print}' file.txt
# gawk has no PCRE mode either; emulate the lookahead with a capture group
# (the three-argument match() is a gawk extension)
gawk 'match($0, /(foo)bar/, a) {print a[1]}' file.txt
Combining Multiple Lookaheads
Multiple lookaheads at the same position create AND logic:
# Must satisfy ALL conditions:
# - Contains "error"
# - Contains a number
# - Contains "critical"
grep -P '(?=.*error)(?=.*\d+)(?=.*critical)' logs.txt
# Password must have:
# - Digit
# - Uppercase
# - Lowercase
# - Special character
# - 12+ characters
grep -P '^(?=.*\d)(?=.*[A-Z])(?=.*[a-z])(?=.*[!@#$%^&*]).{12,}$' passwords.txt
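Since every condition has the same `(?=.*X)` shape, the combined pattern can be generated from a list of fragments. The helper below is a hypothetical sketch (the name `all_of` and the log lines are invented for illustration); fragments are inserted as raw regex, so literal text would need `re.escape` first.

```python
import re

def all_of(*fragments):
    """Build a pattern that matches text satisfying every regex fragment."""
    return re.compile("".join(f"(?=.*{frag})" for frag in fragments))

# Hypothetical log lines for illustration.
logs = [
    "error 42: critical disk failure",
    "error: critical failure",
    "warning 7: critical temperature",
]

pattern = all_of("error", r"\d", "critical")
hits = [line for line in logs if pattern.search(line)]
print(hits)  # ['error 42: critical disk failure']
```

The second line lacks a number and the third lacks "error", so only the first line satisfies all three lookaheads.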
Gotchas
Lookahead Position Matters
# WRONG: Lookahead at wrong position
echo "foobar" | grep -oP '(?=bar)foo'
# No match - the lookahead needs "bar" and the pattern needs "foo" at the same position
# CORRECT: Lookahead after the match point
echo "foobar" | grep -oP 'foo(?=bar)'
# Match: "foo"
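The position rule can be checked quickly in Python. The sketch below repeats both cases and adds a third, illustrative one showing that a leading lookahead is legitimate when it agrees with the text that the rest of the pattern consumes.

```python
import re

# Lookahead and "foo" are tested at the same position, so this never matches.
wrong = re.search(r"(?=bar)foo", "foobar")
print(wrong)          # None

# Lookahead after "foo" is tested at position 3, where "bar" begins.
right = re.search(r"foo(?=bar)", "foobar")
print(right.group())  # foo

# A leading lookahead that agrees with what follows: it restricts \w+
# to words that start with "foo".
words = re.findall(r"\b(?=foo)\w+", "foobar barfoo food")
print(words)          # ['foobar', 'food']
```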
Zero-Width Doesn’t Consume
# Lookahead doesn't consume - pattern continues from same position
echo "foobar" | grep -oP 'foo(?=bar)bar'
# Match: "foobar" - after lookahead, "bar" still needs to be matched
# vs capturing group which DOES consume
echo "foobar" | grep -oP 'foo(bar)'
# Match: "foobar" - "bar" was consumed by the group
Not Supported in BRE/ERE
# FAILS: No lookahead in basic grep
grep 'foo(?=bar)' file.txt # Literal match!
# CORRECT: Use PCRE
grep -P 'foo(?=bar)' file.txt
# Or use awk/sed workarounds for BRE/ERE
Nested Lookahead Complexity
# Complex nested lookaheads can be confusing
# (?=(?=.*a)(?=.*b)) # Redundant nesting
# Simpler: Multiple lookaheads at same level
(?=.*a)(?=.*b) # Check for 'a' AND 'b' anywhere
Lookahead vs Lazy Quantifiers
# Lookahead: Stop AT the pattern (don't include it)
echo "foo:bar:baz" | grep -oP '.+(?=:baz)'
# Output: foo:bar
# Lazy: Match minimum up to pattern
echo "foo:bar:baz" | grep -oP '.+?:'
# Output: foo: and bar: (on separate lines, because -o prints every match)
# Different purposes - understand which you need
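The two behaviours can be compared side by side with `re.findall`, which, like `grep -o`, returns every non-overlapping match. This is a small illustrative sketch using the same sample string.

```python
import re

text = "foo:bar:baz"

# Greedy match that must stop right before ":baz".
before_baz = re.findall(r".+(?=:baz)", text)
print(before_baz)  # ['foo:bar']

# Lazy match up to and including the nearest ":".
lazy = re.findall(r".+?:", text)
print(lazy)        # ['foo:', 'bar:']
```

The lookahead version leaves the delimiter out of the result, while the lazy version includes it and repeats for each delimiter it finds.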
Key Takeaways
| Concept | Remember |
|---|---|
| `(?=...)` | Positive lookahead - must be followed by pattern |
| `(?!...)` | Negative lookahead - must NOT be followed by pattern |
| Zero-width | Lookahead checks but doesn’t consume characters |
| Multiple lookaheads | Creates AND logic - all must match |
| Position | Lookahead checks from current position forward |
| PCRE only | Use `grep -P` (not available in BRE/ERE) |
| Password rules | Perfect use case for multiple positive lookaheads |
Self-Test
1. What does `foo(?=bar)` match in "foobar"?
2. What’s the difference between `foo(?=bar)` and `foobar`?
3. How do you match "error" NOT followed by "handled"?
4. Why use multiple lookaheads for password validation?
5. Does `(?=bar)foo` match anything in "foobar"?
Answers
1. Just "foo" - the lookahead verifies "bar" follows but doesn’t include it
2. `foo(?=bar)` matches only "foo"; `foobar` matches "foobar" entirely
3. `error(?!handled)` or `error(?! handled)`
4. Multiple lookaheads create AND logic - must satisfy all conditions simultaneously
5. No - the lookahead requires "bar" to start at the same position where "foo" must start, which is impossible
Next Drill
Drill 08: Lookbehind - Master (?<=...) positive and (?<!...) negative lookbehind assertions.