Drill 07: Lookahead

Lookahead assertions match a position based on what follows, without consuming those characters. This enables powerful conditional matching like "find X only if followed by Y" or "find X only if NOT followed by Y."

Core Concepts

Syntax | Meaning | Example
(?=pattern) | Positive lookahead - must follow | foo(?=bar) matches "foo" only if followed by "bar"
(?!pattern) | Negative lookahead - must NOT follow | foo(?!bar) matches "foo" only if NOT followed by "bar"
Zero-width | Doesn’t consume characters | Position assertion, not a match
PCRE only | Not in BRE/ERE | Use grep -P, Python, JavaScript

Zero-Width Explained

Lookahead assertions check what’s ahead WITHOUT including it in the match:

# Without lookahead: "foobar" consumed entirely
echo "foobar" | grep -oP 'foo.*'
# Output: foobar

# With lookahead: only "foo" matched, "bar" verified but not consumed
echo "foobar" | grep -oP 'foo(?=bar)'
# Output: foo

# The difference matters for extraction and replacement
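The replacement case can be sketched in Python with the standard `re` module. The sample string and the `FOO` replacement are illustrative assumptions, not part of the drill files. Because the lookahead consumes nothing, `re.sub` rewrites only `foo` and leaves `bar` in place:

```python
import re

text = "foobar foobaz"

# Lookahead: only "foo" is part of the match, so only "foo" is replaced.
with_lookahead = re.sub(r"foo(?=bar)", "FOO", text)

# No lookahead: "bar" is part of the match, so it is replaced as well.
without_lookahead = re.sub(r"foobar", "FOO", text)

print(with_lookahead)     # FOObar foobaz
print(without_lookahead)  # FOO foobaz
```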

Interactive CLI Drill

bash ~/atelier/_bibliotheca/domus-captures/docs/modules/ROOT/examples/regex-drills/07-lookahead.sh

Exercise Set 1: Positive Lookahead Basics

cat << 'EOF' > /tmp/ex-lookahead.txt
filename.txt
filename.pdf
filename.doc
config.yaml
config.json
config.xml
script.sh
script.py
script.rb
user123
user456admin
admin789
EOF

Ex 1.1: Match "filename" only if followed by ".txt"

Solution
grep -oP 'filename(?=\.txt)' /tmp/ex-lookahead.txt

Output: filename (only from the .txt line)

Note: The .txt is verified but NOT included in the match.

Ex 1.2: Match "config" only if followed by ".yaml" or ".json"

Solution
grep -oP 'config(?=\.(yaml|json))' /tmp/ex-lookahead.txt

Output: config (twice - for yaml and json lines)

Ex 1.3: Match "user" followed by digits

Solution
grep -oP 'user(?=\d+)' /tmp/ex-lookahead.txt

Output: user (from user123 and user456admin lines)

Ex 1.4: Match "script" followed by .py or .sh

Solution
grep -oP 'script(?=\.(py|sh))' /tmp/ex-lookahead.txt

Exercise Set 2: Negative Lookahead Basics

cat << 'EOF' > /tmp/ex-neglook.txt
192.168.1.100
192.168.1.1
10.0.0.1
172.16.0.100
8.8.8.8
1.1.1.1
user_test
user_prod
user_dev
admin_test
admin_prod
EOF

Ex 2.1: Match "user_" NOT followed by "test"

Solution
grep -oP 'user_(?!test)' /tmp/ex-neglook.txt

Output: user_ (from user_prod and user_dev)

Ex 2.2: Match IPs NOT ending in ".1"

Solution
grep -P '\d+\.\d+\.\d+\.(?!1$)\d+' /tmp/ex-neglook.txt

Output: 192.168.1.100, 172.16.0.100, 8.8.8.8 (whole lines, since -o is not used)

Ex 2.3: Match "admin_" NOT followed by "test"

Solution
grep -oP 'admin_(?!test)\w+' /tmp/ex-neglook.txt

Output: admin_prod

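Ex 2.4: Match "192.168." NOT followed by a specific third octet

Solution
# Match 192.168. NOT followed by "1."
grep -P '192\.168\.(?!1\.)' /tmp/ex-neglook.txt

This matches 192.168.x.y addresses whose third octet x is not exactly 1 (10 or 100 still pass). Both 192.168 lines in the sample file have a third octet of 1, so the command prints nothing here.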

Exercise Set 3: Password Validation

Lookahead is perfect for password complexity rules (must contain X AND Y AND Z):

cat << 'EOF' > /tmp/ex-passwords.txt
password
Password1
Password1!
Pass1!
Abcd1234!
abc123
ABC123!
longpasswordwithoutspecials
P@ssw0rd!
EOF

Ex 3.1: Must contain at least one digit

Solution
grep -P '(?=.*\d)' /tmp/ex-passwords.txt

(?=.*\d) - From start, look ahead for any chars followed by a digit.

Ex 3.2: Must contain at least one uppercase

Solution
grep -P '(?=.*[A-Z])' /tmp/ex-passwords.txt

Ex 3.3: Must contain digit AND uppercase AND special char

Solution
grep -P '(?=.*\d)(?=.*[A-Z])(?=.*[!@#$%^&*])' /tmp/ex-passwords.txt

Multiple lookaheads - ALL must be satisfied.

Output: Password1!, Pass1!, Abcd1234!, ABC123!, P@ssw0rd! (no length or lowercase rule is applied yet)

Ex 3.4: Full password validation (8+ chars, digit, upper, lower, special)

Solution
grep -P '^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%^&*]).{8,}$' /tmp/ex-passwords.txt

Breaking it down:
- (?=.*\d) - must contain digit
- (?=.*[a-z]) - must contain lowercase
- (?=.*[A-Z]) - must contain uppercase
- (?=.*[!@#$%^&*]) - must contain special
- .{8,} - must be 8+ characters
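Since each lookahead is an independent condition, the rules can also be tested one at a time to report which ones a password fails. The following is a minimal sketch, not a production validator; the `RULES` names and the `failed_rules` helper are hypothetical, and the length rule is rewritten as a lookahead so that it can be checked on its own:

```python
import re

# Hypothetical rule names mapped to the lookaheads from Ex 3.4.
RULES = {
    "digit": r"(?=.*\d)",
    "lowercase": r"(?=.*[a-z])",
    "uppercase": r"(?=.*[A-Z])",
    "special": r"(?=.*[!@#$%^&*])",
    "length": r"(?=.{8,}$)",  # length check expressed as a lookahead
}

def failed_rules(password):
    """Return the names of the rules that the password does not satisfy."""
    # re.match anchors at the start, like the ^ in the grep pattern.
    return [name for name, rule in RULES.items()
            if not re.match(rule, password)]

for pwd in ["Password1!", "Pass1!", "ABC123!"]:
    print(pwd, failed_rules(pwd))
# Password1! []
# Pass1! ['length']
# ABC123! ['lowercase', 'length']
```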

Exercise Set 4: Data Extraction

cat << 'EOF' > /tmp/ex-extract.txt
price: $99.99 USD
price: $149.50 USD
price: 50 EUR
total: $1,234.56
discount: 20%
tax: 8.5%
quantity: 100 units
quantity: 50 units
EOF

Ex 4.1: Extract numbers followed by "USD"

Solution
grep -oP '\$[\d,.]+(?= USD)' /tmp/ex-extract.txt

Output: $99.99, $149.50

Note: The " USD" is checked but not included in output.

Ex 4.2: Extract percentages (numbers followed by %)

Solution
grep -oP '[\d.]+(?=%)' /tmp/ex-extract.txt

Output: 20, 8.5

Ex 4.3: Extract quantities followed by "units"

Solution
grep -oP '\d+(?= units)' /tmp/ex-extract.txt

Output: 100, 50
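The same extractions can be sketched in Python, where the zero-width property means the matched text converts directly to numbers with no suffix to strip. The inline sample is a shortened copy of the exercise file, and the variable names are illustrative:

```python
import re

text = """price: $99.99 USD
price: 50 EUR
discount: 20%
quantity: 100 units"""

# The capture group drops the "$"; the lookahead keeps " USD" out of the match.
usd_amounts = [float(v) for v in re.findall(r"\$([\d.]+)(?= USD)", text)]
percentages = [float(v) for v in re.findall(r"[\d.]+(?=%)", text)]
unit_counts = [int(v) for v in re.findall(r"\d+(?= units)", text)]

print(usd_amounts, percentages, unit_counts)  # [99.99] [20.0] [100]
```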

Exercise Set 5: Code Pattern Matching

cat << 'EOF' > /tmp/ex-code.txt
function getName() {
function getData() {
function setConfig() {
const name = "value";
let count = 0;
var oldStyle = true;
import React from 'react';
import { useState } from 'react';
export default App;
export const helper;
EOF

Ex 5.1: Match "function" followed by "get"

Solution
grep -oP 'function (?=get)' /tmp/ex-code.txt

Output: `function ` (from getName and getData lines)

Ex 5.2: Match variable declarations NOT using "var"

Solution
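grep -P '^(?!var\b)\w+ \w+ =' /tmp/ex-code.txt

Output: the const and let lines. (?!var\b) rejects lines that start with the keyword var; \w+ \w+ = then matches any other keyword, a name, and an assignment.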

Ex 5.3: Match "export" followed by "default"

Solution
grep -oP 'export(?= default)' /tmp/ex-code.txt

Ex 5.4: Match imports from 'react'

Solution
grep -oP "import .+(?= from 'react')" /tmp/ex-code.txt

Real-World Applications

Professional: ISE Log Filtering

# Match MACs followed by "Passed" (successful auth)
grep -oP '[0-9A-F:]{17}(?=.*Passed)' /var/log/ise-psc.log

# Match usernames NOT followed by "Failed"
grep -oP 'User-Name=\K\w+(?!.*Failed)' /var/log/ise-psc.log

# Match IPs followed by specific policy
grep -oP '[\d.]+(?=.*Wired_802\.1X)' /var/log/ise-psc.log

Professional: Log Analysis

# Error messages followed by stack traces
# (grep matches line by line by default, so the stack frame must share a line with ERROR)
grep -P 'ERROR(?=.*at .+\.java:\d+)' app.log

# Match timestamps followed by ERROR/FATAL
grep -oP '\d{4}-\d{2}-\d{2}T[\d:]+(?=.*(ERROR|FATAL))' app.log

# API calls NOT followed by 200
grep -P 'GET /api/\w+(?!.*200)' access.log

Professional: Network Config Validation

# Interfaces with IP but NOT shutdown
# (assumes each interface's settings are flattened onto one line)
grep -P 'interface \w+(?=.*ip address)(?!.*shutdown)' config.txt

# VLANs followed by specific name patterns
grep -oP 'vlan \d+(?= name (DATA|VOICE|MGMT))' config.txt

# ACLs NOT followed by "permit any any" on the same line (potential security)
grep -P 'ip access-list \w+(?!.*permit any any)' config.txt

Personal: Email Filtering

# Subjects containing "urgent" followed by action words
grep -rPi 'subject:.*urgent(?=.*(action|required|asap))' ~/mail/

# From addresses whose domain is NOT followed by .spam/.junk/.promo
# (\w++ is possessive, so it cannot backtrack to slip past the lookahead)
grep -rPi 'from:.*@\w++(?!\.(spam|junk|promo)\.)' ~/mail/

Personal: Document Organization

# Files with dates followed by specific types
ls | grep -P '\d{4}-\d{2}-\d{2}(?=.*(report|summary|review))'

# Notes containing TODO followed by priority
grep -Pi 'TODO(?=.*(high|urgent|p1))' ~/notes/*.md

# Entries with amounts followed by category
grep -oP '\$[\d.]+(?=.*(groceries|utilities|rent))' ~/budget.txt

Personal: Health/Fitness Tracking

# Workout entries followed by duration
grep -oP '\w+day(?=.*\d+ min)' ~/fitness.log

# Meals NOT followed by "healthy" tag
# (the lookahead sits right after "meal:" so it sees the whole rest of the line)
grep -P 'meal:(?!.*#healthy)' ~/diet.log

# Weight entries followed by trend indicator
grep -oP '[\d.]+(?= (kg|lbs).*↓)' ~/weight.log
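A negative lookahead only guards the position where it sits, so a greedy `.*` placed in front of it can run to the end of the line and let the assertion pass trivially. The sketch below illustrates this with two invented diet-log lines; the list names are hypothetical:

```python
import re

# Invented sample lines for this sketch.
lines = ["meal: salad #healthy", "meal: burger"]

# Greedy ".*" runs to the end of the line, where nothing follows,
# so the negative lookahead always succeeds and every line matches.
too_late = [l for l in lines if re.search(r"meal:.*(?!.*#healthy)", l)]

# Asserting right after "meal:" checks the whole rest of the line.
guarded = [l for l in lines if re.search(r"meal:(?!.*#healthy)", l)]

print(too_late)  # ['meal: salad #healthy', 'meal: burger']
print(guarded)   # ['meal: burger']
```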

Tool Variants

grep -P: PCRE Lookahead

# Basic positive lookahead
grep -oP 'foo(?=bar)' file.txt

# Basic negative lookahead
grep -oP 'foo(?!bar)' file.txt

# Multiple conditions
grep -oP '(?=.*pattern1)(?=.*pattern2).*' file.txt

# Combined with other features
grep -oP '(?=.*\d)[A-Za-z]+' file.txt

Python: Lookahead

import re

text = "foobar foobaz fooqux"

# Positive lookahead
pattern = re.compile(r'foo(?=bar)')
matches = pattern.findall(text)
print(matches)  # ['foo'] - only before 'bar'

# Negative lookahead
pattern = re.compile(r'foo(?!bar)')
matches = pattern.findall(text)
print(matches)  # ['foo', 'foo'] - before 'baz' and 'qux'

# Password validation
def validate_password(pwd):
    pattern = re.compile(
        r'^(?=.*\d)'       # Must have digit
        r'(?=.*[a-z])'     # Must have lowercase
        r'(?=.*[A-Z])'     # Must have uppercase
        r'(?=.*[!@#$%])'   # Must have special
        r'.{8,}$'          # 8+ characters
    )
    return bool(pattern.match(pwd))

# Conditional extraction
text = "price: $99.99 USD, price: 50 EUR"
pattern = re.compile(r'\$[\d.]+(?= USD)')
usd_prices = pattern.findall(text)  # ['$99.99']

JavaScript: Lookahead

const text = "foobar foobaz fooqux";

// Positive lookahead
const positive = text.match(/foo(?=bar)/g);
console.log(positive);  // ['foo']

// Negative lookahead
const negative = text.match(/foo(?!bar)/g);
console.log(negative);  // ['foo', 'foo']

// Password validation
const passwordRegex = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!@#$%]).{8,}$/;
console.log(passwordRegex.test("Password1!")); // true

sed: Limited Lookahead Support

# sed doesn't support lookahead natively
# Workaround: Use grep -P for filtering, then sed

# Extract then transform
grep -oP 'foo(?=bar)' file.txt | sed 's/foo/FOO/'

# Or use perl for one-liner
perl -pe 's/foo(?=bar)/FOO/g' file.txt

awk: No Native Lookahead

# awk doesn't support lookahead
# Workaround: Two-step matching

# Check if pattern exists, then extract
awk '/foobar/ {gsub(/bar/, ""); print}' file.txt

# gawk has no lookahead either; capture the part you want instead
# (the three-argument match() is a gawk extension)
gawk 'match($0, /(foo)bar/, a) {print a[1]}' file.txt

Combining Multiple Lookaheads

Multiple lookaheads at the same position create AND logic:

# Must satisfy ALL conditions:
# - Contains "error"
# - Contains a number
# - Contains "critical"
grep -P '(?=.*error)(?=.*\d+)(?=.*critical)' logs.txt

# Password must have:
# - Digit
# - Uppercase
# - Lowercase
# - Special character
# - 12+ characters
grep -P '^(?=.*\d)(?=.*[A-Z])(?=.*[a-z])(?=.*[!@#$%^&*]).{12,}$' passwords.txt
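To make the AND logic concrete, here is a minimal Python sketch that compares the combined lookahead pattern with three independent searches joined by `all()`. The log lines are invented for illustration and `matches_all` is a hypothetical helper:

```python
import re

# Sample log lines invented for this sketch.
lines = [
    "error 42: critical disk failure",
    "error: critical disk failure",
    "warning 7: critical temperature",
]

combined = re.compile(r"(?=.*error)(?=.*\d)(?=.*critical)")

def matches_all(line):
    """Same AND logic written as three independent searches."""
    return all(re.search(p, line) for p in ("error", r"\d", "critical"))

by_lookahead = [line for line in lines if combined.search(line)]
by_searches = [line for line in lines if matches_all(line)]

print(by_lookahead)  # ['error 42: critical disk failure']
```

Both approaches select the same lines; the lookahead form is useful when only a single pattern can be supplied, as with grep -P.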

Gotchas

Lookahead Position Matters

# WRONG: Lookahead at wrong position
echo "foobar" | grep -oP '(?=bar)foo'
# No match - the lookahead needs "bar" at the same position where "foo" must start

# CORRECT: Lookahead after the match point
echo "foobar" | grep -oP 'foo(?=bar)'
# Match: "foo"
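The same behaviour can be checked in Python. This short sketch also shows that a leading lookahead is not wrong in itself; it only has to agree with the text that is matched at that position:

```python
import re

text = "foobar"

# Lookahead and "foo" are tested at the same position: no position in the
# text starts with both "bar" and "foo", so this can never match.
no_match = re.search(r"(?=bar)foo", text)

# A leading lookahead works when it agrees with what is matched after it.
leading = re.search(r"(?=foobar)foo", text)

# The usual form: lookahead placed after the text to match.
trailing = re.search(r"foo(?=bar)", text)

print(no_match)          # None
print(leading.group())   # foo
print(trailing.group())  # foo
```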

Zero-Width Doesn’t Consume

# Lookahead doesn't consume - pattern continues from same position
echo "foobar" | grep -oP 'foo(?=bar)bar'
# Match: "foobar" - after lookahead, "bar" still needs to be matched

# vs capturing group which DOES consume
echo "foobar" | grep -oP 'foo(bar)'
# Match: "foobar" - "bar" was consumed by the group
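Match spans make the zero-width behaviour visible. In this minimal Python sketch, the match ends at index 3 when `bar` is only asserted, and at index 6 when `bar` is also matched after the assertion:

```python
import re

text = "foobar"

asserted_only = re.search(r"foo(?=bar)", text)
asserted_then_matched = re.search(r"foo(?=bar)bar", text)

print(asserted_only.span())          # (0, 3)
print(asserted_then_matched.span())  # (0, 6)
```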

Not Supported in BRE/ERE

# FAILS: No lookahead in basic grep
grep 'foo(?=bar)' file.txt  # Literal match!

# CORRECT: Use PCRE
grep -P 'foo(?=bar)' file.txt

# Or use awk/sed workarounds for BRE/ERE

Nested Lookahead Complexity

# Complex nested lookaheads can be confusing
# (?=(?=.*a)(?=.*b))  # Redundant nesting

# Simpler: Multiple lookaheads at same level
(?=.*a)(?=.*b)  # Check for 'a' AND 'b' anywhere

Lookahead vs Lazy Quantifiers

# Lookahead: Stop AT the pattern (don't include it)
echo "foo:bar:baz" | grep -oP '.+(?=:baz)'
# Output: foo:bar

# Lazy: Match minimum up to pattern
echo "foo:bar:baz" | grep -oP '.+?:'
# Output: foo: and bar: (grep -o prints every match, one per line)

# Different purposes - understand which you need
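A Python sketch of the same comparison, using `re.findall` to list every match the way grep -o does:

```python
import re

text = "foo:bar:baz"

# Greedy match that stops AT ":baz" without including it.
up_to_baz = re.findall(r".+(?=:baz)", text)

# Lazy match that includes the first ":" it reaches, repeated along the string.
lazy_chunks = re.findall(r".+?:", text)

print(up_to_baz)    # ['foo:bar']
print(lazy_chunks)  # ['foo:', 'bar:']
```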

Key Takeaways

Concept | Remember
(?=pattern) | Positive lookahead - must be followed by pattern
(?!pattern) | Negative lookahead - must NOT be followed by pattern
Zero-width | Lookahead checks but doesn’t consume characters
Multiple lookaheads | Creates AND logic - all must match
Position | Lookahead checks from current position forward
PCRE only | Use grep -P, Python, JavaScript - not sed/awk
Password rules | Perfect use case for multiple positive lookaheads

Self-Test

  1. What does foo(?=bar) match in "foobar"?

  2. What’s the difference between foo(?=bar) and foobar?

  3. How do you match "error" NOT followed by "handled"?

  4. Why use multiple lookaheads for password validation?

  5. Does (?=bar)foo match anything in "foobar"?

Answers
  1. Just "foo" - the lookahead verifies "bar" follows but doesn’t include it

  2. foo(?=bar) matches only "foo"; foobar matches "foobar" entirely

  3. error(?!handled) or error(?! handled)

  4. Multiple lookaheads create AND logic - must satisfy all conditions simultaneously

  5. No - the lookahead and "foo" are tested at the same position, and no position can start with both "bar" and "foo"

Next Drill

Drill 08: Lookbehind - Master (?<=…) positive and (?<!…) negative lookbehind assertions.