Drill 08: Lookbehind

Lookbehind assertions check what comes BEFORE your match position without including it in the result. Combined with lookahead, they enable surgical text extraction.
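
To make "without including it in the result" concrete, here is a minimal sketch that runs the same search with and without a lookbehind. It assumes GNU grep built with PCRE support (the -P flag); the variable names are only for illustration.

```shell
# Assumes GNU grep with PCRE support (-P).
sample='user@example.com'

# Plain pattern: the @ is consumed, so it shows up in the result.
plain=$(printf '%s\n' "$sample" | grep -oP '@\w+')

# Lookbehind: the @ must be there, but it is not part of the match.
behind=$(printf '%s\n' "$sample" | grep -oP '(?<=@)\w+')

echo "plain:  $plain"    # plain:  @example
echo "behind: $behind"   # behind: example
```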

Core Concepts

Syntax	Meaning	Example
`(?<=pattern)`	Positive lookbehind	`(?<=@)\w+` matches the word right after @
`(?<!pattern)`	Negative lookbehind	`(?<!\.)\d+` matches digits not directly preceded by a decimal point
`\K`	Reset match start (PCRE)	`@\K\w+` gives the same result as `(?<=@)\w+`, without the fixed-width limit
`(?<=a|b)`	Alternation in lookbehind	Each alternative must itself be fixed-width

The Fixed-Width Limitation

CRITICAL: In most regex engines, lookbehind patterns must be fixed-width (known length at compile time).
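
Fixed-width does not mean literal-only. As a rough sketch (assuming GNU grep with -P; the sample strings are made up for illustration), character classes, escapes such as \d, and exact counts like {3} all have a known length, so they are accepted inside a lookbehind:

```shell
# Assumes GNU grep with PCRE support (-P). Sample strings are illustrative.

# \d{3} always spans exactly 3 characters, so it is fixed-width.
exact=$(printf '%s\n' '123abc' | grep -oP '(?<=\d{3})abc')

# v, one digit, one dot: 3 characters, even though the digit itself varies.
minor=$(printf '%s\n' 'v2.15' | grep -oP '(?<=v\d\.)\d+')

echo "$exact"   # abc
echo "$minor"   # 15
```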

# WORKS: Fixed width (3 characters)
grep -oP '(?<=abc)def' <<< "abcdef"

# WORKS: Alternation with same-width alternatives
grep -oP '(?<=cat|dog)food' <<< "catfood dogfood"

# FAILS: Variable width (quantifiers)
grep -oP '(?<=a+)def' <<< "aaadef"
# Error: lookbehind assertion is not fixed length

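# WORKS in PCRE: top-level alternatives may differ in length,
# as long as each one is fixed-width on its own
grep -oP '(?<=cat|mouse)food' <<< "mousefood"
# Output: food
# Stricter engines (e.g. Python's re) reject this: all alternatives must share one width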

Solution: Use \K for variable-width patterns (PCRE only).

The \K Alternative

\K resets the match start - everything before it is required but not included in the match. Unlike lookbehind, it has no fixed-width restriction.

# Variable-width with \K (works!)
echo "aaadef" | grep -oP 'a+\Kdef'
# Output: def

# Variable alternation with \K
echo "mousefood catfood" | grep -oP '(cat|mouse)\Kfood'
# Output: food (twice)

Interactive CLI Drill

bash ~/atelier/_bibliotheca/domus-captures/docs/modules/ROOT/examples/regex-drills/08-lookbehind.sh

Exercise Set 1: Positive Lookbehind

cat << 'EOF' > /tmp/ex-lookbehind.txt
user@example.com
admin@company.org
price: $99.99
cost: $1,234.56
version=2.0.1
config=production
host: server-01
target: db-prod
EOF

Ex 1.1: Extract domain from email (after @)

Solution

grep -oP '(?<=@)[a-z.]+' /tmp/ex-lookbehind.txt

Output: example.com, company.org

Ex 1.2: Extract price amount (after $)

Solution

grep -oP '(?<=\$)[0-9,]+\.[0-9]{2}' /tmp/ex-lookbehind.txt

Output: 99.99, 1,234.56

Note: $ must be escaped as \$ because $ is a regex anchor.
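
A quick way to see why the backslash matters, sketched under the assumption of GNU grep with -P: an unescaped $ inside the lookbehind is an end-of-line assertion, so nothing can follow it and the search comes back empty.

```shell
# Assumes GNU grep with PCRE support (-P).
# Single quotes keep the shell from expanding $ in the data and the patterns.
line='price: $99.99'

# Unescaped $: asserts end of line, so no digits can follow it. No match.
unescaped=$(printf '%s\n' "$line" | grep -oP '(?<=$)[0-9.]+' || true)

# Escaped \$: a literal dollar sign.
escaped=$(printf '%s\n' "$line" | grep -oP '(?<=\$)[0-9.]+')

echo "unescaped: [$unescaped]"   # unescaped: []
echo "escaped:   [$escaped]"     # escaped:   [99.99]
```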

Ex 1.3: Extract value after equals sign

Solution

grep -oP '(?<==)[a-z0-9.]+' /tmp/ex-lookbehind.txt

Output: 2.0.1, production

Ex 1.4: Extract hostname after "host: "

Solution

grep -oP '(?<=host: )[\w-]+' /tmp/ex-lookbehind.txt

Output: server-01

Alternative with \K:

grep -oP 'host: \K[\w-]+' /tmp/ex-lookbehind.txt

Exercise Set 2: Negative Lookbehind

cat << 'EOF' > /tmp/ex-neg-behind.txt
100
3.14159
.50
$100
-50
0.75
10.5
port80
EOF

Ex 2.1: Match numbers NOT preceded by decimal point

Solution

grep -oP '(?<!\.)\b\d+' /tmp/ex-neg-behind.txt

Output: 100, 3, 100, 50, 0, 10

Explanation: (?<!\.) ensures no decimal point before the digits.
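
The \b in the solution matters as much as the lookbehind. The sketch below (assuming GNU grep with -P) drops it to show what happens: the engine moves one digit to the right, where the preceding character is a digit rather than a dot, and matches the tail of the fraction.

```shell
# Assumes GNU grep with PCRE support (-P).

# Without \b: "1" is rejected (preceded by "."), but "4159" is accepted
# because the character before "4" is "1", not a dot.
loose=$(printf '%s\n' '3.14159' | grep -oP '(?<!\.)\d+' | paste -sd ' ' -)

# With \b: a match may only start at a word boundary, so the fraction
# cannot be entered partway through.
strict=$(printf '%s\n' '3.14159' | grep -oP '(?<!\.)\b\d+' | paste -sd ' ' -)

echo "loose:  $loose"    # loose:  3 4159
echo "strict: $strict"   # strict: 3
```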

Ex 2.2: Match numbers NOT preceded by dollar sign

Solution

grep -oP '(?<!\$)\b\d+' /tmp/ex-neg-behind.txt

Excludes the "100" after "$" in "$100".

Ex 2.3: Match numbers NOT preceded by minus sign

Solution

grep -oP '(?<!-)\b\d+' /tmp/ex-neg-behind.txt

Excludes "50" in "-50".

Ex 2.4: Match "port" NOT preceded by letter

Solution

echo -e "port80\nexport\ntransport\nport" | grep -oP '(?<![a-z])port'

Output: port (from port80), port (standalone)

Does NOT match: export, transport (preceded by letters)

Exercise Set 3: Combining Lookbehind with Lookahead

cat << 'EOF' > /tmp/ex-both.txt
<tag>content</tag>
key="value"
[section]
(parenthetical)
{placeholder}
BEGIN:data:END
prefix_middle_suffix
EOF

Ex 3.1: Extract content between angle brackets

Solution

grep -oP '(?<=>)[^<]+(?=<)' /tmp/ex-both.txt

Output: content

Lookbehind checks for >, lookahead checks for <.

Ex 3.2: Extract value between quotes

Solution

grep -oP '(?<=")[^"]+(?=")' /tmp/ex-both.txt

Output: value

Ex 3.3: Extract text between square brackets

Solution

grep -oP '(?<=\[)[^\]]+(?=\])' /tmp/ex-both.txt

Output: section

Note: ] must be escaped or first in negated class.
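
Both spellings from the note can be checked side by side. This is a small sketch assuming GNU grep with -P; in PCRE a ] placed immediately after [^ is taken literally, so no backslash is needed there.

```shell
# Assumes GNU grep with PCRE support (-P).
line='[section]'

# Escaped: \] inside the class.
escaped=$(printf '%s\n' "$line" | grep -oP '(?<=\[)[^\]]+(?=\])')

# Unescaped: ] directly after [^ is a literal member of the class.
first=$(printf '%s\n' "$line" | grep -oP '(?<=\[)[^]]+(?=\])')

echo "$escaped"   # section
echo "$first"     # section
```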

Ex 3.4: Extract middle section between delimiters

Solution

grep -oP '(?<=BEGIN:)[^:]+(?=:END)' /tmp/ex-both.txt

Output: data

Or using \K for the left side:

grep -oP 'BEGIN:\K[^:]+(?=:END)' /tmp/ex-both.txt

Exercise Set 4: Variable-Width with \K

cat << 'EOF' > /tmp/ex-varwidth.txt
Hello, World!
Hi there!
Hey everyone!
Greetings, friends!
prefix123suffix
pre456suf
pref789suffix
LOG: [INFO] Application started
LOG: [ERROR] Connection failed
EOF

Ex 4.1: Extract text after greeting (variable-length greetings)

Solution

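# PCRE accepts a lookbehind here, because each alternative is fixed-width:
# grep -oP '(?<=Hello, |Hi |Hey ).*' /tmp/ex-varwidth.txt
# Engines that need one common width (e.g. Python's re) reject it.

# \K works regardless of prefix length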
grep -oP '(Hello, |Hi |Hey )\K.*' /tmp/ex-varwidth.txt

Output: World!, there!, everyone!

Ex 4.2: Extract numbers between variable-length prefix/suffix

Solution

grep -oP 'pre[a-z]*\K\d+' /tmp/ex-varwidth.txt

Output: 123, 456, 789

Note: [a-z]* matches the variable-length rest of the prefix. A greedy \w* would also swallow digits and leave only the last one (3, 6, 9) for \d+.

Ex 4.3: Extract log level

Solution

grep -oP 'LOG: \[\K[A-Z]+(?=\])' /tmp/ex-varwidth.txt

Output: INFO, ERROR

Combines \K for left side and lookahead for right side.

Ex 4.4: Extract text after variable colons

Solution

echo "a:b::c:::d" | grep -oP ':+\K[^:]+'

Output: b, c, d

\K after :+ allows variable colon sequences.

Exercise Set 5: Network and Config Patterns

cat << 'EOF' > /tmp/ex-network.txt
interface GigabitEthernet0/1
interface FastEthernet0/24
ip address 192.168.1.1 255.255.255.0
ip address 10.50.1.100 255.255.255.0
permit tcp any host 10.50.1.20 eq 443
deny tcp any host 10.50.1.30 eq 22
vlan 100 name DATA_VLAN
vlan 200 name VOICE_VLAN
EOF

Ex 5.1: Extract interface names

Solution

grep -oP '(?<=interface )\S+' /tmp/ex-network.txt

Output: GigabitEthernet0/1, FastEthernet0/24

Ex 5.2: Extract IP addresses from "ip address" lines

Solution

grep -oP '(?<=ip address )\d+\.\d+\.\d+\.\d+' /tmp/ex-network.txt

Output: 192.168.1.1, 10.50.1.100

Ex 5.3: Extract destination IPs from ACLs

Solution

grep -oP '(?<=host )\d+\.\d+\.\d+\.\d+' /tmp/ex-network.txt

Output: 10.50.1.20, 10.50.1.30

Ex 5.4: Extract VLAN names

Solution

grep -oP '(?<=name )\w+' /tmp/ex-network.txt

Output: DATA_VLAN, VOICE_VLAN

Real-World Applications

Professional: ISE Log Parsing

# Extract MAC address after "Calling-Station-ID="
grep -oP '(?<=Calling-Station-ID=)[0-9A-F:-]+' /var/log/ise-psc.log

# Using \K for longer prefix
grep -oP 'Calling-Station-ID=\K[0-9A-F:-]+' /var/log/ise-psc.log

# Extract username after "UserName="
grep -oP '(?<=UserName=)\S+' /var/log/ise-psc.log

# Extract policy set after "SelectedAccessService="
grep -oP '(?<=SelectedAccessService=)[^,]+' /var/log/ise-psc.log

Professional: Network Config Extraction

# Extract VLAN IDs from switchport config
grep -oP '(?<=switchport access vlan )\d+' config.txt

# Extract IP from interface config
grep -oP '(?<=ip address )\d+\.\d+\.\d+\.\d+' config.txt

# Extract hostname from device config
grep -oP '(?<=hostname )\S+' config.txt

# Extract NTP servers
grep -oP '(?<=ntp server )\S+' config.txt

Professional: Log Analysis

# Extract log level after timestamp
grep -oP '\d{4}-\d{2}-\d{2}T[\d:]+\s+\K\[?\w+\]?' /var/log/app.log

# Extract error code after "ERROR:"
grep -oP '(?<=ERROR: E)\d+' /var/log/app.log

# Extract response time after "took "
grep -oP '(?<=took )\d+(?=ms)' /var/log/app.log

# Extract URL path after method (\K, because the grouped methods differ in length)
grep -oP '(?:GET|POST|PUT|DELETE) \K\S+' /var/log/access.log

Professional: Security Audit

# Extract SSH user attempts
grep -oP '(?<=user=)\w+' /var/log/auth.log

# Extract source IPs from failed logins
grep -oP '(?<=from )\d+\.\d+\.\d+\.\d+(?= port)' /var/log/auth.log

# Extract certificate CN
grep -oP '(?<=CN=)[^,/]+' /var/log/ssl.log

# Find secrets NOT preceded by REDACTED marker
grep -P '(?<!\[REDACTED\] )password\s*=' config.txt

Personal: Document Parsing

# Extract amounts after dollar sign
grep -oP '(?<=\$)\d+(?:\.\d{2})?' ~/receipts/*.txt

# Extract dates after "Date:"
grep -oP '(?<=Date: )\d{4}-\d{2}-\d{2}' ~/documents/*.txt

# Extract email domains
grep -oP '(?<=@)[a-z0-9.-]+\.[a-z]+' ~/contacts.txt

# Extract phone numbers after "Tel:"
grep -oP '(?<=Tel: )[\d-]+' ~/contacts.txt

Personal: Note Analysis

# Extract task names after checkbox
grep -oP '(?<=\[ \] )[^\n]+' ~/notes/*.md

# Extract tags (after #)
grep -oP '(?<=#)\w+' ~/notes/*.md

# Extract due dates from tasks
grep -oP '(?<=due: )\d{4}-\d{2}-\d{2}' ~/tasks.md

# Extract priority levels after "P"
grep -oP '(?<=\[P)\d(?=\])' ~/tasks.md

Personal: Financial Tracking

# Extract amounts after currency symbols (assumes a UTF-8 locale)
grep -oP '(?<=[$€£])\d+(?:,\d{3})*(?:\.\d{2})?' ~/budget.txt

# Extract vendor names from transactions
grep -oP '(?<=Paid: )[A-Z][a-z]+(?: [A-Z][a-z]+)*' ~/expenses.txt

# Extract interest rates
grep -oP '(?<=APR: )\d+\.\d+(?=%)' ~/accounts.txt

# Extract account numbers (last 4 after "****")
grep -oP '(?<=\*{4})\d{4}' ~/accounts.txt

Personal: Calendar & Time Tracking

# Extract event names after time
grep -oP '(?<=\d{2}:\d{2} )[A-Z][^\n]+' ~/calendar.txt

# Extract durations after "Duration:"
grep -oP '(?<=Duration: )\d+(?= hours?)' ~/timesheet.txt

# Extract project names from time entries
grep -oP '(?<=\[)[^\]]+(?=\])' ~/timesheet.txt

Tool Variants

grep: Lookbehind with -P

# Positive lookbehind
grep -oP '(?<=prefix)pattern' file.txt

# Negative lookbehind
grep -oP '(?<!exclude)pattern' file.txt

# \K alternative (more flexible)
grep -oP 'prefix\Kpattern' file.txt

# Combining with lookahead
grep -oP '(?<=left)middle(?=right)' file.txt

sed: No Native Lookbehind (Use Workarounds)

# sed doesn't support lookbehind
# Workaround: Capture and replace

# Extract after prefix:
echo "prefix:value" | sed 's/prefix:\(.*\)/\1/'
# Output: value

# Extract the field between "prefix:" and the next colon:
echo "prefix:value:suffix" | sed 's/prefix:\([^:]*\):.*/\1/'
# Output: value

# Use with capturing groups:
echo "key=value" | sed -E 's/.*=(.+)/\1/'
# Output: value

awk: Field-Based Alternative

# awk doesn't support lookbehind
# Use field splitting instead

# Extract after @
echo "user@domain.com" | awk -F'@' '{print $2}'
# Output: domain.com

# Extract after =
echo "key=value" | awk -F'=' '{print $2}'
# Output: value

# Extract with match() (the three-argument form is a gawk extension)
echo "prefix123suffix" | awk 'match($0, /prefix([0-9]+)/, a) {print a[1]}'
# Output: 123

vim: Lookbehind Patterns

" Positive lookbehind (vim uses \@<= )
/\(prefix\)\@<=pattern

" Negative lookbehind (vim uses \@<! )
/\(prefix\)\@<!pattern

" Example: Find numbers after $
/\$\@<=\d\+

" Example: Find words NOT after "the "
/\(the \)\@<!\<\w\+\>

" Extract value after = (using substitute)
:%s/.*=\(\w\+\)/\1/

Vim uses \@<= and \@<! instead of (?<=...) and (?<!...).

Python: Full Lookbehind Support

import re

text = "user@example.com price: $99.99"

# Positive lookbehind
domain = re.search(r'(?<=@)[a-z.]+', text)
print(domain.group())  # example.com

# Extract price
price = re.search(r'(?<=\$)\d+\.\d+', text)
print(price.group())  # 99.99

# Negative lookbehind
text2 = "export transport port"
words = re.findall(r'(?<![a-z])port', text2)
print(words)  # ['port'] - only standalone port

# Variable-width lookbehind:
# the standard re module only accepts fixed-width lookbehind;
# the third-party regex package (pip install regex) also accepts variable-width

JavaScript: Lookbehind (ES2018+)

const text = "user@example.com";

// Positive lookbehind
const domain = text.match(/(?<=@)[a-z.]+/);
console.log(domain[0]);  // example.com

// Negative lookbehind
const text2 = "export transport port";
const words = text2.match(/(?<![a-z])port/g);
console.log(words);  // ['port']

// Note: Lookbehind added in ES2018
// Not supported in older browsers

Gotchas

Fixed-Width Requirement

# FAILS: Quantifiers in lookbehind
grep -oP '(?<=a+)b' <<< "aaab"
# Error: lookbehind assertion is not fixed length

# WORKS in PCRE: top-level alternatives of different lengths
# (each alternative is fixed-width on its own)
grep -oP '(?<=cat|mouse)s' <<< "cats mouses"
# Output: s, s
# Python's re rejects this: it needs one common width

# WORKS: Same-length alternatives
grep -oP '(?<=cat|dog)s' <<< "cats dogs"
# Output: s, s (both 3 characters)

# SOLUTION for quantified prefixes: Use \K
grep -oP 'a+\Kb' <<< "aaab"
# Output: b

Escaping Special Characters

# $ needs escaping (it's a regex anchor)
grep -oP '(?<=\$)\d+' <<< "Price: $100"
# Output: 100

# [ ] need escaping
grep -oP '(?<=\[)INFO(?=\])' <<< "[INFO] Message"
# Output: INFO

# Parentheses need escaping for literal match
grep -oP '(?<=\()value(?=\))' <<< "(value)"
# Output: value

Lookbehind vs \K Behavior

# Lookbehind: Position must FOLLOW the pattern
echo "abcdef" | grep -oP '(?<=abc)...'
# Output: def

# \K: Everything before \K is matched but discarded
echo "abcdef" | grep -oP 'abc\K...'
# Output: def

# Difference with overlapping matches:
echo "aaa" | grep -oP '(?<=a)a'
# Output: a, a (positions 1 and 2)

echo "aaa" | grep -oP 'a\Ka'
# Output: a (only one match - 'aa' consumed)

Zero-Width Nature

# Lookbehind doesn't consume characters
echo "abc" | grep -oP '(?<=a)b'
# Output: b (only 'b', not 'ab')

# This matters for replacement, but sed has no lookbehind,
# so 's/(?<=a)b/X/' is not an option.
# Use a capturing group instead:
echo "abc" | sed 's/\(a\)b/\1X/'
# Output: aXc
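
Zero-width also means a delimiter can be shared by two neighbouring matches. In this sketch (GNU grep with -P assumed; the colon-separated string is made up for illustration), the middle colon closes x and opens y, which only works when the pattern does not consume it.

```shell
# Assumes GNU grep with PCRE support (-P). Sample string is illustrative.
line=':x:y:'

# Consuming pattern: the first match eats the middle colon,
# so nothing is left to open the second field.
consumed=$(printf '%s\n' "$line" | grep -oP ':[^:]+:' | paste -sd ' ' -)

# Lookarounds: the colons are checked but never consumed.
shared=$(printf '%s\n' "$line" | grep -oP '(?<=:)[^:]+(?=:)' | paste -sd ' ' -)

echo "consumed: $consumed"   # consumed: :x:
echo "shared:   $shared"     # shared:   x y
```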

Combining Multiple Lookbehinds

# Multiple lookbehinds: both must match
echo "123abc456" | grep -oP '(?<=\d)(?<=[0-9]{3})abc'
# First lookbehind: preceded by digit
# Second lookbehind: preceded by 3 digits
# Both check from same position

# More practical: lookbehind + other assertion
echo "abc123def" | grep -oP '(?<=abc)\d+(?=def)'
# Output: 123
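
Stacked lookbehinds become more useful when one is positive and the other negative. A sketch, assuming GNU grep with -P and a made-up input: take the digits after = but skip them when the operator is really !=.

```shell
# Assumes GNU grep with PCRE support (-P). Sample string is illustrative.
line='a=1 b!=2 c=3'

# (?<==)   the previous character is "="
# (?<!!=)  the previous two characters are not "!="
values=$(printf '%s\n' "$line" | grep -oP '(?<==)(?<!!=)\d+' | paste -sd ' ' -)

echo "$values"   # 1 3
```

Both assertions are evaluated at the same position, just before the first digit, so their different widths (one character and two) do not interfere with each other.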

Key Takeaways

Concept	Usage
`(?<=pattern)`	Match if preceded by pattern (fixed-width)
`(?<!pattern)`	Match if NOT preceded by pattern (fixed-width)
`\K`	Reset match start (variable-width alternative)
Fixed-width rule	Lookbehind can’t use `*`, `+`, `{n,m}`
Escaping	Remember `$`, `[`, `]`, `(`, `)` need `\`
Zero-width	Lookbehind checks position, doesn’t consume
vim syntax	`\@<=` for positive, `\@<!` for negative
sed limitation	No lookbehind - use capturing groups instead

When to Use What

Scenario	Use	Example
Simple fixed prefix	Lookbehind	`(?<=\$)\d+`
Variable-length prefix	`\K`	`prefix.*\Kvalue`
Extract between delimiters	Both lookbehind + lookahead	`(?<=\[)[^\]]+(?=\])`
sed/awk	Capturing groups	`s/.*=\(.*\)/\1/`
Exclude certain prefixes	Negative lookbehind	`(?<![a-z])port`
vim search	`\@<=` syntax	`/\$\@<=\d\+`

Self-Test

What’s the difference between (?<=@) and @\K?
Why does (?<=\w+) fail?
How do you match "port" NOT preceded by letters?
What’s the vim equivalent of (?<=prefix)?
How do you extract text between [ and ] using lookbehind/lookahead?

Answers

Both leave the @ out of the result, but \K accepts variable-width prefixes, which lookbehind generally does not. \K also consumes the @ (it is only dropped from the reported match), while lookbehind consumes nothing.
\w+ is variable-width (1 or more characters). Lookbehind requires fixed-width patterns.
(?<![a-z])port - negative lookbehind for any lowercase letter
/\(prefix\)\@<=pattern - vim puts \@<= after the group
(?<=\[)[^\]]+(?=\]) - lookbehind for [, negated class for content, lookahead for ]

Next Drill

Drill 09: Infrastructure Patterns - Real-world network, security, and config patterns.