Drill 08: Lookbehind

Lookbehind assertions check what comes BEFORE your match position without including it in the result. Combined with lookahead, they enable surgical text extraction.

Core Concepts

Syntax         Meaning                     Example
(?<=pattern)   Positive lookbehind         (?<=@)\w+ matches domain after @
(?<!pattern)   Negative lookbehind         (?<!\.)\d+ matches digits not after decimal
\K             Reset match start (PCRE)    @\K\w+ same as (?<=@)\w+ but flexible
(?<=a|b)       Alternation in lookbehind   Fixed-width alternatives OK

The Fixed-Width Limitation

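CRITICAL: In most regex engines, lookbehind patterns must be fixed-width (known length at compile time). PCRE, the engine behind grep -P, relaxes this slightly: top-level alternatives may differ in length as long as each alternative is itself fixed-width. Unbounded quantifiers such as + and * are still rejected.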

# WORKS: Fixed width (3 characters)
grep -oP '(?<=abc)def' <<< "abcdef"

# WORKS: Alternation with same-width alternatives
grep -oP '(?<=cat|dog)food' <<< "catfood dogfood"

# FAILS: Variable width (quantifiers)
grep -oP '(?<=a+)def' <<< "aaadef"
# Error: lookbehind assertion is not fixed length (exact wording varies by PCRE version)

# WORKS in PCRE: top-level alternatives of different widths, each one fixed
grep -oP '(?<=cat|mouse)food' <<< "mousefood"
# Output: food
# Stricter engines (for example Python's re) reject this pattern

Solution: Use \K for variable-width patterns (available in PCRE and Perl, but not in Python's re or JavaScript).

The \K Alternative

\K resets the match start - everything before it is required but not included in the match. Unlike lookbehind, it has no fixed-width restriction.

# Variable-width with \K (works!)
echo "aaadef" | grep -oP 'a+\Kdef'
# Output: def

# Variable alternation with \K
echo "mousefood catfood" | grep -oP '(cat|mouse)\Kfood'
# Output: food (twice)

Interactive CLI Drill

bash ~/atelier/_bibliotheca/domus-captures/docs/modules/ROOT/examples/regex-drills/08-lookbehind.sh

Exercise Set 1: Positive Lookbehind

cat << 'EOF' > /tmp/ex-lookbehind.txt
user@example.com
admin@company.org
price: $99.99
cost: $1,234.56
version=2.0.1
config=production
host: server-01
target: db-prod
EOF

Ex 1.1: Extract domain from email (after @)

Solution
grep -oP '(?<=@)[a-z.]+' /tmp/ex-lookbehind.txt

Output: example.com, company.org

Ex 1.2: Extract price amount (after $)

Solution
grep -oP '(?<=\$)[0-9,]+\.[0-9]{2}' /tmp/ex-lookbehind.txt

Output: 99.99, 1,234.56

Note: $ must be escaped as \$ because $ is a regex anchor.

Ex 1.3: Extract value after equals sign

Solution
grep -oP '(?<==)[a-z0-9.]+' /tmp/ex-lookbehind.txt

Output: 2.0.1, production

Ex 1.4: Extract hostname after "host: "

Solution
grep -oP '(?<=host: )[\w-]+' /tmp/ex-lookbehind.txt

Output: server-01

Alternative with \K:

grep -oP 'host: \K[\w-]+' /tmp/ex-lookbehind.txt

Exercise Set 2: Negative Lookbehind

cat << 'EOF' > /tmp/ex-neg-behind.txt
100
3.14159
.50
$100
-50
0.75
10.5
port80
EOF

Ex 2.1: Match numbers NOT preceded by decimal point

Solution
grep -oP '(?<!\.)\b\d+' /tmp/ex-neg-behind.txt

Output: 100, 3, 100, 50, 0, 10

Explanation: (?<!\.) ensures no decimal point before the digits.

Ex 2.2: Match numbers NOT preceded by dollar sign

Solution
grep -oP '(?<!\$)\b\d+' /tmp/ex-neg-behind.txt

Excludes the "100" after "$" in "$100".

Ex 2.3: Match numbers NOT preceded by minus sign

Solution
grep -oP '(?<!-)\b\d+' /tmp/ex-neg-behind.txt

Excludes "50" in "-50".

Ex 2.4: Match "port" NOT preceded by letter

Solution
echo -e "port80\nexport\ntransport\nport" | grep -oP '(?<![a-z])port'

Output: port (from port80), port (standalone)

Does NOT match: export, transport (preceded by letters)

Exercise Set 3: Combining Lookbehind with Lookahead

cat << 'EOF' > /tmp/ex-both.txt
<tag>content</tag>
key="value"
[section]
(parenthetical)
{placeholder}
BEGIN:data:END
prefix_middle_suffix
EOF

Ex 3.1: Extract content between angle brackets

Solution
grep -oP '(?<=>)[^<]+(?=<)' /tmp/ex-both.txt

Output: content

Lookbehind checks for >, lookahead checks for <.

Ex 3.2: Extract value between quotes

Solution
grep -oP '(?<=")[^"]+(?=")' /tmp/ex-both.txt

Output: value

Ex 3.3: Extract text between square brackets

Solution
grep -oP '(?<=\[)[^\]]+(?=\])' /tmp/ex-both.txt

Output: section

Note: ] must be escaped, or placed first in the character class (directly after ^ in a negated class).

Ex 3.4: Extract middle section between delimiters

Solution
grep -oP '(?<=BEGIN:)[^:]+(?=:END)' /tmp/ex-both.txt

Output: data

Or using \K for the left side:

grep -oP 'BEGIN:\K[^:]+(?=:END)' /tmp/ex-both.txt

Exercise Set 4: Variable-Width with \K

cat << 'EOF' > /tmp/ex-varwidth.txt
Hello, World!
Hi there!
Hey everyone!
Greetings, friends!
prefix123suffix
pre456suf
pref789suffix
LOG: [INFO] Application started
LOG: [ERROR] Connection failed
EOF

Ex 4.1: Extract text after greeting (variable-length greetings)

Solution
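# PCRE accepts this lookbehind because each top-level alternative is fixed-width,
# but stricter engines (for example Python's re) reject it:
# grep -oP '(?<=Hello, |Hi |Hey ).*' /tmp/ex-varwidth.txt

# \K works regardless of prefix length
grep -oP '(Hello, |Hi |Hey )\K.*' /tmp/ex-varwidth.txt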

Output: World!, there!, everyone!

Ex 4.2: Extract numbers between variable-length prefix/suffix

Solution
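grep -oP 'pre[a-z]*\K\d+' /tmp/ex-varwidth.txt

Output: 123, 456, 789

Note: [a-z]* matches the variable-length prefix continuation. A greedy \w* would also swallow digits and leave only the last one for \d+ (output 3, 6, 9).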

Ex 4.3: Extract log level

Solution
grep -oP 'LOG: \[\K[A-Z]+(?=\])' /tmp/ex-varwidth.txt

Output: INFO, ERROR

Combines \K for left side and lookahead for right side.

Ex 4.4: Extract text after variable colons

Solution
echo "a:b::c:::d" | grep -oP ':+\K[^:]+'

Output: b, c, d

\K after :+ allows variable colon sequences.

Exercise Set 5: Network and Config Patterns

cat << 'EOF' > /tmp/ex-network.txt
interface GigabitEthernet0/1
interface FastEthernet0/24
ip address 192.168.1.1 255.255.255.0
ip address 10.50.1.100 255.255.255.0
permit tcp any host 10.50.1.20 eq 443
deny tcp any host 10.50.1.30 eq 22
vlan 100 name DATA_VLAN
vlan 200 name VOICE_VLAN
EOF

Ex 5.1: Extract interface names

Solution
grep -oP '(?<=interface )\S+' /tmp/ex-network.txt

Output: GigabitEthernet0/1, FastEthernet0/24

Ex 5.2: Extract IP addresses from "ip address" lines

Solution
grep -oP '(?<=ip address )\d+\.\d+\.\d+\.\d+' /tmp/ex-network.txt

Output: 192.168.1.1, 10.50.1.100

Ex 5.3: Extract destination IPs from ACLs

Solution
grep -oP '(?<=host )\d+\.\d+\.\d+\.\d+' /tmp/ex-network.txt

Output: 10.50.1.20, 10.50.1.30

Ex 5.4: Extract VLAN names

Solution
grep -oP '(?<=name )\w+' /tmp/ex-network.txt

Output: DATA_VLAN, VOICE_VLAN

Real-World Applications

Professional: ISE Log Parsing

# Extract MAC address after "Calling-Station-ID="
grep -oP '(?<=Calling-Station-ID=)[0-9A-F:-]+' /var/log/ise-psc.log

# Using \K for longer prefix
grep -oP 'Calling-Station-ID=\K[0-9A-F:-]+' /var/log/ise-psc.log

# Extract username after "UserName="
grep -oP '(?<=UserName=)\S+' /var/log/ise-psc.log

# Extract policy set after "SelectedAccessService="
grep -oP '(?<=SelectedAccessService=)[^,]+' /var/log/ise-psc.log

Professional: Network Config Extraction

# Extract VLAN IDs from switchport config
grep -oP '(?<=switchport access vlan )\d+' config.txt

# Extract IP from interface config
grep -oP '(?<=ip address )\d+\.\d+\.\d+\.\d+' config.txt

# Extract hostname from device config
grep -oP '(?<=hostname )\S+' config.txt

# Extract NTP servers
grep -oP '(?<=ntp server )\S+' config.txt

Professional: Log Analysis

# Extract log level after timestamp
grep -oP '\d{4}-\d{2}-\d{2}T[\d:]+\s+\K\[?\w+\]?' /var/log/app.log

# Extract the numeric part of an error code after "ERROR: E"
grep -oP '(?<=ERROR: E)\d+' /var/log/app.log

# Extract response time after "took "
grep -oP '(?<=took )\d+(?=ms)' /var/log/app.log

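# Extract URL path after method
# (the methods differ in length inside a group, which older PCRE versions reject in a lookbehind, so use \K)
grep -oP '(?:GET|POST|PUT|DELETE) \K\S+' /var/log/access.log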

Professional: Security Audit

# Extract SSH user attempts
grep -oP '(?<=user=)\w+' /var/log/auth.log

# Extract source IPs from failed logins
grep -oP '(?<=from )\d+\.\d+\.\d+\.\d+(?= port)' /var/log/auth.log

# Extract certificate CN
grep -oP '(?<=CN=)[^,/]+' /var/log/ssl.log

# Find secrets NOT preceded by REDACTED marker
grep -P '(?<!\[REDACTED\] )password\s*=' config.txt

Personal: Document Parsing

# Extract amounts after dollar sign
grep -oP '(?<=\$)\d+(?:\.\d{2})?' ~/receipts/*.txt

# Extract dates after "Date:"
grep -oP '(?<=Date: )\d{4}-\d{2}-\d{2}' ~/documents/*.txt

# Extract email domains
grep -oP '(?<=@)[a-z0-9.-]+\.[a-z]+' ~/contacts.txt

# Extract phone numbers after "Tel:"
grep -oP '(?<=Tel: )[\d-]+' ~/contacts.txt

Personal: Note Analysis

# Extract task names after checkbox
grep -oP '(?<=\[ \] )[^\n]+' ~/notes/*.md

# Extract tags (after #)
grep -oP '(?<=#)\w+' ~/notes/*.md

# Extract due dates from tasks
grep -oP '(?<=due: )\d{4}-\d{2}-\d{2}' ~/tasks.md

# Extract priority levels after "P"
grep -oP '(?<=\[P)\d(?=\])' ~/tasks.md

Personal: Financial Tracking

# Extract amounts after currency symbols (literal symbols; assumes a UTF-8 locale)
grep -oP '(?<=[\$€£])\d+(?:,\d{3})*(?:\.\d{2})?' ~/budget.txt

# Extract vendor names from transactions
grep -oP '(?<=Paid: )[A-Z][a-z]+(?: [A-Z][a-z]+)*' ~/expenses.txt

# Extract interest rates
grep -oP '(?<=APR: )\d+\.\d+(?=%)' ~/accounts.txt

# Extract account numbers (last 4 after "****")
grep -oP '(?<=\*{4})\d{4}' ~/accounts.txt

Personal: Calendar & Time Tracking

# Extract event names after time
grep -oP '(?<=\d{2}:\d{2} )[A-Z][^\n]+' ~/calendar.txt

# Extract durations after "Duration:"
grep -oP '(?<=Duration: )\d+(?= hours?)' ~/timesheet.txt

# Extract project names from time entries
grep -oP '(?<=\[)[^\]]+(?=\])' ~/timesheet.txt

Tool Variants

grep: Lookbehind with -P

# Positive lookbehind
grep -oP '(?<=prefix)pattern' file.txt

# Negative lookbehind
grep -oP '(?<!exclude)pattern' file.txt

# \K alternative (more flexible)
grep -oP 'prefix\Kpattern' file.txt

# Combining with lookahead
grep -oP '(?<=left)middle(?=right)' file.txt

sed: No Native Lookbehind (Use Workarounds)

# sed doesn't support lookbehind
# Workaround: Capture and replace

# Extract after prefix:
echo "prefix:value" | sed 's/prefix:\(.*\)/\1/'
# Output: value

# Extract the value between prefix and suffix:
echo "prefix:value:suffix" | sed 's/prefix:\([^:]*\):.*/\1/'
# Output: value

# Use with capturing groups:
echo "key=value" | sed -E 's/.*=(.+)/\1/'
# Output: value

awk: Field-Based Alternative

# awk doesn't support lookbehind
# Use field splitting instead

# Extract after @
echo "user@domain.com" | awk -F'@' '{print $2}'
# Output: domain.com

# Extract after =
echo "key=value" | awk -F'=' '{print $2}'
# Output: value

# Extract with match() (the three-argument array form is a gawk extension)
echo "prefix123suffix" | awk 'match($0, /prefix([0-9]+)/, a) {print a[1]}'
# Output: 123

vim: Lookbehind Patterns

" Positive lookbehind (vim uses \@<= )
/\(prefix\)\@<=pattern

" Negative lookbehind (vim uses \@<! )
/\(prefix\)\@<!pattern

" Example: Find numbers after $
/\$\@<=\d\+

" Example: Find words NOT after "the "
/\(the \)\@<!\<\w\+\>

" Extract value after = (using substitute)
:%s/.*=\(\w\+\)/\1/

Vim uses \@<= and \@<! instead of (?<=) and (?<!).

Python: Full Lookbehind Support

import re

text = "user@example.com price: $99.99"

# Positive lookbehind
domain = re.search(r'(?<=@)[a-z.]+', text)
print(domain.group())  # example.com

# Extract price
price = re.search(r'(?<=\$)\d+\.\d+', text)
print(price.group())  # 99.99

# Negative lookbehind
text2 = "export transport port"
words = re.findall(r'(?<![a-z])port', text2)
print(words)  # ['port'] - only standalone port

# Variable-width lookbehind:
# The standard re module requires fixed-width lookbehind patterns.
# The third-party regex package (pip install regex) accepts variable-width ones.
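
The fixed-width restriction in the standard re module can be checked directly. The following is a minimal sketch, not part of the original drill: the helper names compiles and extract_after are made up for illustration. It shows re rejecting variable-width lookbehind, and a capture group standing in for \K, which re does not support.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed-width lookbehind is accepted
fixed_ok = compiles(r'(?<=abc)def')

# A quantifier inside the lookbehind is rejected
quantifier_ok = compiles(r'(?<=a+)def')

# Alternatives of different lengths are rejected as well
mixed_alternatives_ok = compiles(r'(?<=cat|mouse)food')

def extract_after(prefix: str, body: str, text: str) -> list:
    """Capture-group workaround: match prefix + body, return only the body."""
    return [m.group(1) for m in re.finditer(f'(?:{prefix})({body})', text)]

foods = extract_after(r'cat|mouse', r'food', 'catfood mousefood')
tails = extract_after(r'a+', r'def', 'aaadef')

print(fixed_ok, quantifier_ok, mixed_alternatives_ok)  # True False False
print(foods, tails)  # ['food', 'food'] ['def']
```

The capture group consumes the prefix, just as \K does in PCRE, so the same caveat about overlapping matches applies.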

JavaScript: Lookbehind (ES2018+)

const text = "user@example.com";

// Positive lookbehind
const domain = text.match(/(?<=@)[a-z.]+/);
console.log(domain[0]);  // example.com

// Negative lookbehind
const text2 = "export transport port";
const words = text2.match(/(?<![a-z])port/g);
console.log(words);  // ['port']

// Note: Lookbehind added in ES2018
// Not supported in older browsers

Gotchas

Fixed-Width Requirement

# FAILS: Quantifiers in lookbehind
grep -oP '(?<=a+)b' <<< "aaab"
# Error: lookbehind assertion is not fixed length

# WORKS in PCRE: Different-length alternatives at the top level
grep -oP '(?<=cat|mouse)s' <<< "cats mouses"
# Output: s, s (stricter engines such as Python's re reject this pattern)

# WORKS: Same-length alternatives
grep -oP '(?<=cat|dog)s' <<< "cats dogs"
# Output: s, s (both 3 characters)

# SOLUTION: Use \K
grep -oP '(cat|mouse)\Ks' <<< "cats mouses"
# Output: s, s

Escaping Special Characters

# $ needs escaping (it's a regex anchor)
grep -oP '(?<=\$)\d+' <<< "Price: $100"
# Output: 100

# [ ] need escaping
grep -oP '(?<=\[)INFO(?=\])' <<< "[INFO] Message"
# Output: INFO

# Parentheses need escaping for literal match
grep -oP '(?<=\()value(?=\))' <<< "(value)"
# Output: value

Lookbehind vs \K Behavior

# Lookbehind: Position must FOLLOW the pattern
echo "abcdef" | grep -oP '(?<=abc)...'
# Output: def

# \K: Everything before \K is matched but discarded
echo "abcdef" | grep -oP 'abc\K...'
# Output: def

# Difference with overlapping matches:
echo "aaa" | grep -oP '(?<=a)a'
# Output: a, a (positions 1 and 2)

echo "aaa" | grep -oP 'a\Ka'
# Output: a (only one match - 'aa' consumed)
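
The same consumption difference can be reproduced in Python. The standard re module has no \K, so in this sketch a capture group plays its role (an assumption made for illustration, not an exact equivalent): the prefix is part of the match and is therefore consumed.

```python
import re

text = "aaa"

# Lookbehind is zero-width: the preceding 'a' is only inspected, never consumed,
# so the 'a' at index 1 and the 'a' at index 2 both match.
lookbehind_hits = [(m.start(), m.group()) for m in re.finditer(r'(?<=a)a', text)]

# Consuming the prefix (what \K does): the first match uses up indexes 0-1,
# and the single 'a' left at index 2 has no partner.
consumed_hits = [(m.start(1), m.group(1)) for m in re.finditer(r'a(a)', text)]

print(lookbehind_hits)  # [(1, 'a'), (2, 'a')]
print(consumed_hits)    # [(1, 'a')]
```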

Zero-Width Nature

# Lookbehind doesn't consume characters
echo "abc" | grep -oP '(?<=a)b'
# Output: b (only 'b', not 'ab')

# This matters for replacement. sed has no lookbehind, so this does NOT work:
# echo "abc" | sed -E 's/(?<=a)b/X/'
# Use a capturing group instead:
echo "abc" | sed 's/\(a\)b/\1X/'
# Output: aXc
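
For comparison, an engine that supports lookbehind can do the replacement directly. A minimal Python sketch (Python is used here only because sed lacks the feature): both working forms give the same result, but the lookbehind version never touches the prefix, while the capture-group version has to put it back with \1.

```python
import re

text = "abc"

# Lookbehind: only 'b' is part of the match, so only 'b' is replaced
with_lookbehind = re.sub(r'(?<=a)b', 'X', text)

# Capture group: 'ab' is matched, so 'a' has to be re-inserted via \1
with_group = re.sub(r'(a)b', r'\1X', text)

# Forgetting the backreference silently drops the prefix
prefix_lost = re.sub(r'(a)b', 'X', text)

print(with_lookbehind, with_group, prefix_lost)  # aXc aXc Xc
```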

Combining Multiple Lookbehinds

# Multiple lookbehinds: both must match
echo "123abc456" | grep -oP '(?<=\d)(?<=[0-9]{3})abc'
# First lookbehind: preceded by digit
# Second lookbehind: preceded by 3 digits
# Both check from same position

# More practical: lookbehind + other assertion
echo "abc123def" | grep -oP '(?<=abc)\d+(?=def)'
# Output: 123
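
Stacked lookbehinds become more useful when one of them is negative. A small Python sketch with made-up sample text: the first assertion requires three digits before abc, and the second vetoes the specific prefix 000. Both are evaluated at the same position.

```python
import re

text = "123abc 000abc 12abc"

# (?<=\d{3})  the three characters before the match must all be digits
# (?<!000)    but those three characters must not be "000"
pattern = re.compile(r'(?<=\d{3})(?<!000)abc')

# Record where each accepted 'abc' starts
hits = [m.start() for m in pattern.finditer(text)]
print(hits)  # [3]
```

Only the first abc qualifies: the second is vetoed by the negative lookbehind, and the third has just two digits in front of it.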

Key Takeaways

Concept            Usage
(?<=pattern)       Match if preceded by pattern (fixed-width)
(?<!pattern)       Match if NOT preceded by pattern (fixed-width)
\K                 Reset match start (variable-width alternative)
Fixed-width rule   Lookbehind can’t use *, +, {n,m}
Escaping           Remember $, [, ], (, ) need \
Zero-width         Lookbehind checks position, doesn’t consume
vim syntax         \@<= for positive, \@<! for negative
sed limitation     No lookbehind - use capturing groups instead

When to Use What

Scenario                     Use                           Example
Simple fixed prefix          Lookbehind                    (?<=\$)\d+
Variable-length prefix       \K                            prefix.*\Kvalue
Extract between delimiters   Both lookbehind + lookahead   (?<=\[)[^\]]+(?=\])
sed/awk                      Capturing groups              s/.*=\(.*\)/\1/
Exclude certain prefixes     Negative lookbehind           (?<![a-z])port
vim search                   \@<= syntax                   /\$\@<=\d\+

Self-Test

  1. What’s the difference between (?<=@) and @\K?

  2. Why does (?<=\w+) fail?

  3. How do you match "port" NOT preceded by letters?

  4. What’s the vim equivalent of (?<=prefix)?

  5. How do you extract text between [ and ] using lookbehind/lookahead?

Answers
  1. Both match position after @, but \K can handle variable-width patterns and lookbehind cannot. Also \K consumes the @ while lookbehind doesn’t.

  2. \w+ is variable-width (1 or more characters). Lookbehind requires fixed-width patterns.

  3. (?<![a-z])port - negative lookbehind for any lowercase letter

  4. /\(prefix\)\@<=pattern - vim uses \@<= after the group

  5. (?<=\[)[^\]]+(?=\]) - lookbehind for [, negated class for content, lookahead for ]

Next Drill

Drill 09: Infrastructure Patterns - Real-world network, security, and config patterns.