Drill 08: Lookbehind
Lookbehind assertions check what comes BEFORE your match position without including it in the result. Combined with lookahead, they enable surgical text extraction.
Core Concepts
| Syntax | Meaning | Example |
|---|---|---|
| (?<=...) | Positive lookbehind | (?<=@)[a-z.]+ |
| (?<!...) | Negative lookbehind | (?<![a-z])port |
| \K | Reset match start (PCRE) | host: \K[\w-]+ |
| (?<=a\|b) | Alternation in lookbehind | Fixed-width alternatives OK |
The Fixed-Width Limitation
CRITICAL: In most regex engines, lookbehind patterns must be fixed-width (known length at compile time).
# WORKS: Fixed width (3 characters)
grep -oP '(?<=abc)def' <<< "abcdef"
# WORKS: Alternation with same-width alternatives
grep -oP '(?<=cat|dog)food' <<< "catfood dogfood"
# FAILS: Variable width (quantifiers)
grep -oP '(?<=a+)def' <<< "aaadef"
# Error: lookbehind assertion is not fixed length
# WORKS in PCRE: top-level alternatives may have different fixed widths
grep -oP '(?<=cat|mouse)food' <<< "mousefood"
# Output: food
# Stricter engines (e.g. Python's re) reject different-length alternatives
Solution: Use \K for variable-width patterns (PCRE only).
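How strict the fixed-width rule is depends on the engine. As a rough illustration, the sketch below probes Python's standard re module, which is stricter than PCRE about alternation. The helper name compiles is made up for this example.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed-width lookbehind and same-width alternatives are accepted.
print(compiles(r'(?<=abc)def'))         # True
print(compiles(r'(?<=cat|dog)food'))    # True

# Quantifiers and different-width alternatives are rejected by re.
print(compiles(r'(?<=a+)def'))          # False
print(compiles(r'(?<=cat|mouse)food'))  # False
```

The same probe is a quick way to check a pattern before relying on it in a script.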
The \K Alternative
\K resets the match start - everything before it is required but not included in the match. Unlike lookbehind, it has no fixed-width restriction.
# Variable-width with \K (works!)
echo "aaadef" | grep -oP 'a+\Kdef'
# Output: def
# Variable alternation with \K
echo "mousefood catfood" | grep -oP '(cat|mouse)\Kfood'
# Output: food (twice)
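Python's re module does not understand \K, so the usual stand-in there is a capture group: the prefix is still required, but only the group is kept. A minimal sketch of the two examples above, with variable names chosen for illustration:

```python
import re

# Prefix is matched but discarded; findall returns only group 1.
after_as = re.findall(r'a+(def)', 'aaadef')
after_animal = re.findall(r'(?:cat|mouse)(food)', 'mousefood catfood')

print(after_as)      # ['def']
print(after_animal)  # ['food', 'food']
```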
Interactive CLI Drill
bash ~/atelier/_bibliotheca/domus-captures/docs/modules/ROOT/examples/regex-drills/08-lookbehind.sh
Exercise Set 1: Positive Lookbehind
cat << 'EOF' > /tmp/ex-lookbehind.txt
user@example.com
admin@company.org
price: $99.99
cost: $1,234.56
version=2.0.1
config=production
host: server-01
target: db-prod
EOF
Ex 1.1: Extract domain from email (after @)
Solution
grep -oP '(?<=@)[a-z.]+' /tmp/ex-lookbehind.txt
Output: example.com, company.org
Ex 1.2: Extract price amount (after $)
Solution
grep -oP '(?<=\$)[0-9,]+\.[0-9]{2}' /tmp/ex-lookbehind.txt
Output: 99.99, 1,234.56
Note: $ must be escaped as \$ because $ is a regex anchor.
Ex 1.3: Extract value after equals sign
Solution
grep -oP '(?<==)[a-z0-9.]+' /tmp/ex-lookbehind.txt
Output: 2.0.1, production
Ex 1.4: Extract hostname after "host: "
Solution
grep -oP '(?<=host: )[\w-]+' /tmp/ex-lookbehind.txt
Output: server-01
Alternative with \K:
grep -oP 'host: \K[\w-]+' /tmp/ex-lookbehind.txt
Exercise Set 2: Negative Lookbehind
cat << 'EOF' > /tmp/ex-neg-behind.txt
100
3.14159
.50
$100
-50
0.75
10.5
port80
EOF
Ex 2.1: Match numbers NOT preceded by decimal point
Solution
grep -oP '(?<!\.)\b\d+' /tmp/ex-neg-behind.txt
Output: 100, 3, 100, 50, 0, 10
Explanation: (?<!\.) ensures no decimal point before the digits.
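To see why each line matches or not, the same pattern can be run line by line in Python. This is only a cross-check sketch; the list simply repeats the sample file contents.

```python
import re

lines = ["100", "3.14159", ".50", "$100", "-50", "0.75", "10.5", "port80"]
pattern = re.compile(r'(?<!\.)\b\d+')

# Collect every match per line, in order.
matches = [m for line in lines for m in pattern.findall(line)]
print(matches)  # ['100', '3', '100', '50', '0', '10']
```

Digits after a decimal point are rejected by the lookbehind, and "port80" yields nothing because there is no word boundary between "t" and "8".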
Ex 2.2: Match numbers NOT preceded by dollar sign
Solution
grep -oP '(?<!\$)\b\d+' /tmp/ex-neg-behind.txt
Excludes the "100" after "$" in "$100".
Ex 2.3: Match numbers NOT preceded by minus sign
Solution
grep -oP '(?<!-)\b\d+' /tmp/ex-neg-behind.txt
Excludes "50" in "-50".
Ex 2.4: Match "port" NOT preceded by letter
Solution
echo -e "port80\nexport\ntransport\nport" | grep -oP '(?<![a-z])port'
Output: port (from port80), port (standalone)
Does NOT match: export, transport (preceded by letters)
Exercise Set 3: Combining Lookbehind with Lookahead
cat << 'EOF' > /tmp/ex-both.txt
<tag>content</tag>
key="value"
[section]
(parenthetical)
{placeholder}
BEGIN:data:END
prefix_middle_suffix
EOF
Ex 3.1: Extract content between angle brackets
Solution
grep -oP '(?<=>)[^<]+(?=<)' /tmp/ex-both.txt
Output: content
Lookbehind checks for >, lookahead checks for <.
Ex 3.2: Extract value between quotes
Solution
grep -oP '(?<=")[^"]+(?=")' /tmp/ex-both.txt
Output: value
Ex 3.3: Extract text between square brackets
Solution
grep -oP '(?<=\[)[^\]]+(?=\])' /tmp/ex-both.txt
Output: section
Note: ] must be escaped or first in negated class.
Ex 3.4: Extract middle section between delimiters
Solution
grep -oP '(?<=BEGIN:)[^:]+(?=:END)' /tmp/ex-both.txt
Output: data
Or using \K for the left side:
grep -oP 'BEGIN:\K[^:]+(?=:END)' /tmp/ex-both.txt
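The lookbehind-plus-lookahead patterns in this set carry over to Python's re unchanged, because every lookbehind here is fixed-width. A small sketch pairing each sample line with its pattern:

```python
import re

# (sample line, pattern) pairs taken from the exercises above
cases = [
    ('<tag>content</tag>', r'(?<=>)[^<]+(?=<)'),
    ('key="value"',        r'(?<=")[^"]+(?=")'),
    ('[section]',          r'(?<=\[)[^\]]+(?=\])'),
    ('BEGIN:data:END',     r'(?<=BEGIN:)[^:]+(?=:END)'),
]

# Neither delimiter is part of the match; only the text between them is.
extracted = [re.search(pattern, line).group() for line, pattern in cases]
print(extracted)  # ['content', 'value', 'section', 'data']
```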
Exercise Set 4: Variable-Width with \K
cat << 'EOF' > /tmp/ex-varwidth.txt
Hello, World!
Hi there!
Hey everyone!
Greetings, friends!
prefix123suffix
pre456suf
pref789suffix
LOG: [INFO] Application started
LOG: [ERROR] Connection failed
EOF
Ex 4.1: Extract text after greeting (variable-length greetings)
Solution
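# Lookbehind with different-length alternatives is engine-dependent:
# grep -oP '(?<=Hello, |Hi |Hey ).*'   # accepted by PCRE (top-level alternatives), rejected by stricter engines
# \K works regardless of prefix length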
grep -oP '(Hello, |Hi |Hey )\K.*' /tmp/ex-varwidth.txt
Output: World!, there!, everyone!
Ex 4.2: Extract numbers between variable-length prefix/suffix
Solution
grep -oP 'pre[a-z]*\K\d+' /tmp/ex-varwidth.txt
Output: 123, 456, 789
Note: [a-z]* matches the variable-length prefix continuation. A greedy \w* would also swallow digits and leave only the last one (3, 6, 9) for \d+.
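One subtlety with a variable-length prefix is greediness: \w matches digits as well as letters, so a greedy \w* takes most of the number before \d+ gets a chance. The sketch below uses a capture group in place of \K (Python's re has no \K) to compare the two prefix styles.

```python
import re

text = "prefix123suffix"

# \w* is greedy and matches digits, so it backtracks only until
# \d+ can match a single digit.
greedy = re.search(r'pre\w*(\d+)', text).group(1)

# [a-z]* stops at the first digit, leaving the whole number for \d+.
letters_only = re.search(r'pre[a-z]*(\d+)', text).group(1)

print(greedy)        # 3
print(letters_only)  # 123
```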
Ex 4.3: Extract log level
Solution
grep -oP 'LOG: \[\K[A-Z]+(?=\])' /tmp/ex-varwidth.txt
Output: INFO, ERROR
Combines \K for left side and lookahead for right side.
Ex 4.4: Extract text after variable colons
Solution
echo "a:b::c:::d" | grep -oP ':+\K[^:]+'
Output: b, c, d
\K after :+ allows variable colon sequences.
Exercise Set 5: Network and Config Patterns
cat << 'EOF' > /tmp/ex-network.txt
interface GigabitEthernet0/1
interface FastEthernet0/24
ip address 192.168.1.1 255.255.255.0
ip address 10.50.1.100 255.255.255.0
permit tcp any host 10.50.1.20 eq 443
deny tcp any host 10.50.1.30 eq 22
vlan 100 name DATA_VLAN
vlan 200 name VOICE_VLAN
EOF
Ex 5.1: Extract interface names
Solution
grep -oP '(?<=interface )\S+' /tmp/ex-network.txt
Output: GigabitEthernet0/1, FastEthernet0/24
Ex 5.2: Extract IP addresses from "ip address" lines
Solution
grep -oP '(?<=ip address )\d+\.\d+\.\d+\.\d+' /tmp/ex-network.txt
Output: 192.168.1.1, 10.50.1.100
Ex 5.3: Extract destination IPs from ACLs
Solution
grep -oP '(?<=host )\d+\.\d+\.\d+\.\d+' /tmp/ex-network.txt
Output: 10.50.1.20, 10.50.1.30
Ex 5.4: Extract VLAN names
Solution
grep -oP '(?<=name )\w+' /tmp/ex-network.txt
Output: DATA_VLAN, VOICE_VLAN
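When several of these extractions are needed at once, it can be convenient to run them together. The following is a sketch only: it reuses the sample config above and the four lookbehind patterns from this set, and the dictionary keys are arbitrary names picked for the example.

```python
import re

config = """interface GigabitEthernet0/1
interface FastEthernet0/24
ip address 192.168.1.1 255.255.255.0
ip address 10.50.1.100 255.255.255.0
permit tcp any host 10.50.1.20 eq 443
deny tcp any host 10.50.1.30 eq 22
vlan 100 name DATA_VLAN
vlan 200 name VOICE_VLAN
"""

# Each pattern uses a fixed-width literal prefix in the lookbehind.
patterns = {
    "interfaces": r'(?<=interface )\S+',
    "addresses":  r'(?<=ip address )\d+\.\d+\.\d+\.\d+',
    "acl_hosts":  r'(?<=host )\d+\.\d+\.\d+\.\d+',
    "vlan_names": r'(?<=name )\w+',
}

summary = {key: re.findall(pattern, config) for key, pattern in patterns.items()}
for key, values in summary.items():
    print(key, values)
```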
Real-World Applications
Professional: ISE Log Parsing
# Extract MAC address after "Calling-Station-ID="
grep -oP '(?<=Calling-Station-ID=)[0-9A-F:-]+' /var/log/ise-psc.log
# Using \K for longer prefix
grep -oP 'Calling-Station-ID=\K[0-9A-F:-]+' /var/log/ise-psc.log
# Extract username after "UserName="
grep -oP '(?<=UserName=)\S+' /var/log/ise-psc.log
# Extract policy set after "SelectedAccessService="
grep -oP '(?<=SelectedAccessService=)[^,]+' /var/log/ise-psc.log
Professional: Network Config Extraction
# Extract VLAN IDs from switchport config
grep -oP '(?<=switchport access vlan )\d+' config.txt
# Extract IP from interface config
grep -oP '(?<=ip address )\d+\.\d+\.\d+\.\d+' config.txt
# Extract hostname from device config
grep -oP '(?<=hostname )\S+' config.txt
# Extract NTP servers
grep -oP '(?<=ntp server )\S+' config.txt
Professional: Log Analysis
# Extract log level after timestamp
grep -oP '\d{4}-\d{2}-\d{2}T[\d:]+\s+\K\[?\w+\]?' /var/log/app.log
# Extract numeric error code after "ERROR: E"
grep -oP '(?<=ERROR: E)\d+' /var/log/app.log
# Extract response time after "took "
grep -oP '(?<=took )\d+(?=ms)' /var/log/app.log
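# Extract URL path after method
# (\K instead of lookbehind: the alternation is nested in a group and its branches differ in length)
grep -oP '(?:GET|POST|PUT|DELETE) \K\S+' /var/log/access.log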
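For comparison, here is how the method-prefixed path and the response time might be pulled out in Python. The log lines are hypothetical, invented for this sketch. A capture group handles the method prefix because re accepts neither \K nor alternatives of different widths inside a lookbehind.

```python
import re

# Hypothetical access-log lines, for illustration only
log = "GET /index.html took 35ms\nDELETE /api/items/7 took 120ms"

# Variable-length method prefix: use a capture group.
paths = re.findall(r'(?:GET|POST|PUT|DELETE) (\S+)', log)

# Fixed-width prefix: a lookbehind works as in grep -P.
timings = re.findall(r'(?<=took )\d+(?=ms)', log)

print(paths)    # ['/index.html', '/api/items/7']
print(timings)  # ['35', '120']
```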
Professional: Security Audit
# Extract SSH user attempts
grep -oP '(?<=user=)\w+' /var/log/auth.log
# Extract source IPs from failed logins
grep -oP '(?<=from )\d+\.\d+\.\d+\.\d+(?= port)' /var/log/auth.log
# Extract certificate CN
grep -oP '(?<=CN=)[^,/]+' /var/log/ssl.log
# Find secrets NOT preceded by REDACTED marker
grep -P '(?<!\[REDACTED\] )password\s*=' config.txt
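The negative-lookbehind audit check above can be sketched in Python as a line filter. The config lines here are made up for the example.

```python
import re

# Hypothetical config lines, for illustration only
lines = [
    'db password = hunter2',
    '[REDACTED] password = ****',
    'password=letmein',
]

# Match "password =" only when the redaction marker does not precede it.
unredacted = re.compile(r'(?<!\[REDACTED\] )password\s*=')
flagged = [line for line in lines if unredacted.search(line)]
print(flagged)  # ['db password = hunter2', 'password=letmein']
```

A negative lookbehind also succeeds at the start of a line, where there is nothing before the match to compare against, which is why the third line is flagged.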
Personal: Document Parsing
# Extract amounts after dollar sign
grep -oP '(?<=\$)\d+(?:\.\d{2})?' ~/receipts/*.txt
# Extract dates after "Date:"
grep -oP '(?<=Date: )\d{4}-\d{2}-\d{2}' ~/documents/*.txt
# Extract email domains
grep -oP '(?<=@)[a-z0-9.-]+\.[a-z]+' ~/contacts.txt
# Extract phone numbers after "Tel:"
grep -oP '(?<=Tel: )[\d-]+' ~/contacts.txt
Personal: Note Analysis
# Extract task names after checkbox
grep -oP '(?<=\[ \] )[^\n]+' ~/notes/*.md
# Extract tags (after #)
grep -oP '(?<=#)\w+' ~/notes/*.md
# Extract due dates from tasks
grep -oP '(?<=due: )\d{4}-\d{2}-\d{2}' ~/tasks.md
# Extract priority levels after "P"
grep -oP '(?<=\[P)\d(?=\])' ~/tasks.md
Personal: Financial Tracking
# Extract amounts after currency symbols (needs a UTF-8 locale for € and £)
grep -oP '(?<=[$€£])\d+(?:,\d{3})*(?:\.\d{2})?' ~/budget.txt
# Extract vendor names from transactions
grep -oP '(?<=Paid: )[A-Z][a-z]+(?: [A-Z][a-z]+)*' ~/expenses.txt
# Extract interest rates
grep -oP '(?<=APR: )\d+\.\d+(?=%)' ~/accounts.txt
# Extract account numbers (last 4 after "****")
grep -oP '(?<=\*{4})\d{4}' ~/accounts.txt
Personal: Calendar & Time Tracking
# Extract event names after time
grep -oP '(?<=\d{2}:\d{2} )[A-Z][^\n]+' ~/calendar.txt
# Extract durations after "Duration:"
grep -oP '(?<=Duration: )\d+(?= hours?)' ~/timesheet.txt
# Extract project names from time entries
grep -oP '(?<=\[)[^\]]+(?=\])' ~/timesheet.txt
Tool Variants
grep: Lookbehind with -P
# Positive lookbehind
grep -oP '(?<=prefix)pattern' file.txt
# Negative lookbehind
grep -oP '(?<!exclude)pattern' file.txt
# \K alternative (more flexible)
grep -oP 'prefix\Kpattern' file.txt
# Combining with lookahead
grep -oP '(?<=left)middle(?=right)' file.txt
sed: No Native Lookbehind (Use Workarounds)
# sed doesn't support lookbehind
# Workaround: Capture and replace
# Extract after prefix:
echo "prefix:value" | sed 's/prefix:\(.*\)/\1/'
# Output: value
# Extract the field between prefix and suffix:
echo "prefix:value:suffix" | sed 's/prefix:\([^:]*\):.*/\1/'
# Output: value
# Use with capturing groups:
echo "key=value" | sed -E 's/.*=(.+)/\1/'
# Output: value
awk: Field-Based Alternative
# awk doesn't support lookbehind
# Use field splitting instead
# Extract after @
echo "user@domain.com" | awk -F'@' '{print $2}'
# Output: domain.com
# Extract after =
echo "key=value" | awk -F'=' '{print $2}'
# Output: value
# Extract with match() (the third, array argument is a gawk extension)
echo "prefix123suffix" | awk 'match($0, /prefix([0-9]+)/, a) {print a[1]}'
# Output: 123
vim: Lookbehind Patterns
" Positive lookbehind (vim uses \@<= )
/\(prefix\)\@<=pattern
" Negative lookbehind (vim uses \@<! )
/\(prefix\)\@<!pattern
" Example: Find numbers after $
/\$\@<=\d\+
" Example: Find words NOT after "the "
/\(the \)\@<!\<\w\+\>
" Extract value after = (using substitute)
:%s/.*=\(\w\+\)/\1/
Vim uses \@<= and \@<! instead of (?<=) and (?<!).
Python: Full Lookbehind Support
import re
text = "user@example.com price: $99.99"
# Positive lookbehind
domain = re.search(r'(?<=@)[a-z.]+', text)
print(domain.group()) # example.com
# Extract price
price = re.search(r'(?<=\$)\d+\.\d+', text)
print(price.group()) # 99.99
# Negative lookbehind
text2 = "export transport port"
words = re.findall(r'(?<![a-z])port', text2)
print(words) # ['port'] - only standalone port
# Variable-width lookbehind:
# the standard re module requires fixed-width lookbehind in every Python version
# the third-party regex module (pip install regex) supports variable-width lookbehind
JavaScript: Lookbehind (ES2018+)
const text = "user@example.com";
// Positive lookbehind
const domain = text.match(/(?<=@)[a-z.]+/);
console.log(domain[0]); // example.com
// Negative lookbehind
const text2 = "export transport port";
const words = text2.match(/(?<![a-z])port/g);
console.log(words); // ['port']
// Note: Lookbehind added in ES2018
// Not supported in older browsers
Gotchas
Fixed-Width Requirement
# FAILS: Quantifiers in lookbehind
grep -oP '(?<=a+)b' <<< "aaab"
# Error: lookbehind assertion is not fixed length
# WORKS in PCRE: top-level alternatives may differ in length
grep -oP '(?<=cat|mouse)s' <<< "cats mouses"
# Output: s, s
# (stricter engines such as Python's re reject different-length alternatives)
# WORKS everywhere: Same-length alternatives
grep -oP '(?<=cat|dog)s' <<< "cats dogs"
# Output: s, s (both 3 characters)
# SOLUTION for truly variable-width prefixes: Use \K
grep -oP '(cat|mouse)\Ks' <<< "cats mouses"
# Output: s, s
Escaping Special Characters
# $ needs escaping (it's a regex anchor)
# Single-quote the input too, or the shell expands $1 inside "$100"
grep -oP '(?<=\$)\d+' <<< 'Price: $100'
# Output: 100
# [ ] need escaping
grep -oP '(?<=\[)INFO(?=\])' <<< "[INFO] Message"
# Output: INFO
# Parentheses need escaping for literal match
grep -oP '(?<=\()value(?=\))' <<< "(value)"
# Output: value
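In code, the escaping can be delegated to the regex library. The sketch below assumes a hypothetical helper, after_literal, that builds a lookbehind from a literal prefix using Python's re.escape. A literal prefix is always fixed-width, so the lookbehind is always valid.

```python
import re

def after_literal(prefix: str, body: str) -> re.Pattern:
    """Build a pattern matching `body` only when preceded by the literal `prefix`."""
    return re.compile(f'(?<={re.escape(prefix)}){body}')

price = after_literal('$', r'\d+').findall('Price: $100')
level = after_literal('[', r'[A-Z]+').findall('[INFO] Message')
inner = after_literal('(', r'\w+').findall('(value)')

print(price)  # ['100']
print(level)  # ['INFO']
print(inner)  # ['value']
```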
Lookbehind vs \K Behavior
# Lookbehind: Position must FOLLOW the pattern
echo "abcdef" | grep -oP '(?<=abc)...'
# Output: def
# \K: Everything before \K is matched but discarded
echo "abcdef" | grep -oP 'abc\K...'
# Output: def
# Difference with overlapping matches:
echo "aaa" | grep -oP '(?<=a)a'
# Output: a, a (positions 1 and 2)
echo "aaa" | grep -oP 'a\Ka'
# Output: a (only one match - 'aa' consumed)
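The overlapping-match difference can be reproduced in Python, with a consuming capture group standing in for \K (which re does not support). The match spans make the positions explicit.

```python
import re

# Lookbehind is zero-width: each match consumes a single 'a',
# so the search can resume at the very next character.
lookbehind_spans = [m.span() for m in re.finditer(r'(?<=a)a', 'aaa')]

# The consuming version uses up two characters per match,
# leaving a lone 'a' that cannot match again.
consuming_hits = re.findall(r'a(a)', 'aaa')

print(lookbehind_spans)  # [(1, 2), (2, 3)]
print(consuming_hits)    # ['a']
```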
Zero-Width Nature
# Lookbehind doesn't consume characters
echo "abc" | grep -oP '(?<=a)b'
# Output: b (only 'b', not 'ab')
# This matters for replacement:
echo "abc" | sed -E 's/(?<=a)b/X/' # sed doesn't support lookbehind
# Use capturing group instead:
echo "abc" | sed 's/\(a\)b/\1X/'
# Output: aXc
Combining Multiple Lookbehinds
# Multiple lookbehinds: both must match
echo "123abc456" | grep -oP '(?<=\d)(?<=[0-9]{3})abc'
# First lookbehind: preceded by digit
# Second lookbehind: preceded by 3 digits
# Both check from same position
# More practical: lookbehind + other assertion
echo "abc123def" | grep -oP '(?<=abc)\d+(?=def)'
# Output: 123
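Stacked lookbehinds behave the same way in Python. As a small sketch with made-up inputs: both assertions are checked at the same position, so the match fails when fewer than three digits precede "abc".

```python
import re

stacked = re.compile(r'(?<=\d)(?<=[0-9]{3})abc')

three_digits = stacked.findall('123abc456')  # both lookbehinds satisfied
two_digits = stacked.findall('12abc')        # only the first is satisfied

print(three_digits)  # ['abc']
print(two_digits)    # []
```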
Key Takeaways
| Concept | Usage |
|---|---|
| (?<=...) | Match if preceded by pattern (fixed-width) |
| (?<!...) | Match if NOT preceded by pattern (fixed-width) |
| \K | Reset match start (variable-width alternative) |
| Fixed-width rule | Lookbehind can't use quantifiers such as +, *, or ? |
| Escaping | Remember \$, \[, and \( for literal characters |
| Zero-width | Lookbehind checks position, doesn't consume |
| vim syntax | \(pattern\)\@<= and \(pattern\)\@<! |
| sed limitation | No lookbehind - use capturing groups instead |
When to Use What
| Scenario | Use | Example |
|---|---|---|
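| Simple fixed prefix | Lookbehind | (?<=host: )[\w-]+ |
| Variable-length prefix | \K | a+\Kdef |
| Extract between delimiters | Both lookbehind + lookahead | (?<=\[)[^\]]+(?=\]) |
| sed/awk | Capturing groups | s/prefix:\(.*\)/\1/ |
| Exclude certain prefixes | Negative lookbehind | (?<![a-z])port |
| vim search | \@<= | /\(prefix\)\@<=pattern |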
Self-Test
- What's the difference between (?<=@) and @\K?
- Why does (?<=\w+) fail?
- How do you match "port" NOT preceded by letters?
- What's the vim equivalent of (?<=prefix)?
- How do you extract text between [ and ] using lookbehind/lookahead?
Answers
- Both match at the position after @, but \K can follow variable-width patterns and lookbehind cannot. Also, \K consumes the @ while lookbehind doesn't.
- \w+ is variable-width (1 or more characters). Lookbehind requires fixed-width patterns.
- (?<![a-z])port - negative lookbehind for any lowercase letter
- /\(prefix\)\@<=pattern - vim uses \@<= after the group
- (?<=\[)[^\]]+(?=\]) - lookbehind for [, negated class for content, lookahead for ]
Next Drill
Drill 09: Infrastructure Patterns - Real-world network, security, and config patterns.