Stack & Functions
Function calling, stack management, and calling conventions.
The Stack: Concept
The stack is a region of memory that grows DOWNWARD (toward lower addresses). RSP (the stack pointer) always points to the TOP of the stack: the most recently pushed item, which is also the lowest in-use stack address.
Memory Address
High
│
│ ┌─────────────────┐
│ │ Previous data │
│ ├─────────────────┤ ← RSP before push
│ │ Pushed value │
│ ├─────────────────┤ ← RSP after push
│ │ (empty/garbage) │
│ ├─────────────────┤
▼ │ │
Low
PUSH decreases RSP (moves DOWN in the diagram, toward lower addresses)
POP increases RSP (moves UP in the diagram, toward higher addresses)
Why downward? Mostly historical, but it had a practical benefit: the stack and heap grow toward each other from opposite ends of free memory, so neither needs a fixed split decided in advance. If they meet, you’re out of memory.
Push and Pop
; PUSH - Store value on stack
push rax ; RSP = RSP - 8; [RSP] = RAX
; (64-bit mode: the default push size is 8 bytes; a 32-bit push cannot be encoded)
; Equivalent to:
sub rsp, 8
mov [rsp], rax
; PUSH memory
push QWORD PTR [rbx] ; Push value from memory address in rbx
; PUSH immediate
push 42 ; Push literal value
; POP - Retrieve value from stack
pop rax ; RAX = [RSP]; RSP = RSP + 8
; Equivalent to:
mov rax, [rsp]
add rsp, 8
; POP to memory
pop QWORD PTR [rbx] ; Pop into memory address
; Common patterns
push rbx ; Save callee-saved register
push rbp
; ... do work ...
pop rbp ; Restore in REVERSE order
pop rbx
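The push/pop equivalences above can be modeled in a few lines of C. This is only a sketch: the ModelStack type and the model_* helpers are hypothetical names, memory is a small array of 8-byte slots, and rsp is a byte offset rather than a real address.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_SLOTS 16

/* Hypothetical model: slot 0 is the lowest address, and `rsp` is a byte
   offset into `mem`. */
typedef struct {
    uint64_t mem[STACK_SLOTS];
    uint64_t rsp;
} ModelStack;

static void model_init(ModelStack *s) {
    s->rsp = STACK_SLOTS * 8;            /* empty stack: RSP at the highest address */
}

static void model_push(ModelStack *s, uint64_t value) {
    assert(s->rsp >= 8);                 /* the model refuses to overflow */
    s->rsp -= 8;                         /* sub rsp, 8 */
    s->mem[s->rsp / 8] = value;          /* mov [rsp], value */
}

static uint64_t model_pop(ModelStack *s) {
    assert(s->rsp < STACK_SLOTS * 8);    /* the model refuses to underflow */
    uint64_t value = s->mem[s->rsp / 8]; /* mov value, [rsp] */
    s->rsp += 8;                         /* add rsp, 8 */
    return value;
}
```

Pushing two values and popping them returns them in reverse order, and rsp ends where it started, which is the same LIFO discipline the save/restore pattern relies on.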
CRITICAL: The x86-64 calling conventions (not the CPU itself) require RSP to be 16-byte aligned at the point of every CALL. CALL then pushes an 8-byte return address, so on function entry RSP mod 16 == 8. Account for that 8, plus any pushes, when allocating stack space.
System V AMD64 Calling Convention (Linux)
╔═══════════════════════════════════════════════════════════════════╗
║ ARGUMENT PASSING ║
╠═══════════════════════════════════════════════════════════════════╣
║ Arg# │ Integer/Pointer │ Floating Point │ Additional ║
║─────────┼─────────────────┼────────────────┼──────────────────────║
║ 1 │ RDI │ XMM0 │ ║
║ 2 │ RSI │ XMM1 │ ║
║ 3 │ RDX │ XMM2 │ ║
║ 4 │ RCX │ XMM3 │ ║
║ 5 │ R8 │ XMM4 │ ║
║ 6 │ R9 │ XMM5 │ ║
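║ 7 │ Stack │ XMM6 │ ║
║ 8 │ Stack │ XMM7 │ ║
║ 9+ │ Stack │ Stack │ ║
║ Integer and floating-point arguments are counted separately. ║
║ Stack args are pushed right to left; at callee entry they sit at ║
║ [RSP+8], [RSP+16]... ║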
╠═══════════════════════════════════════════════════════════════════╣
║ RETURN VALUES ║
╠═══════════════════════════════════════════════════════════════════╣
║ RAX │ Integer return (up to 64-bit) ║
║ RDX:RAX│ 128-bit return (RDX=high, RAX=low) ║
║ XMM0 │ Floating point return ║
╠═══════════════════════════════════════════════════════════════════╣
║ REGISTER PRESERVATION ║
╠═══════════════════════════════════════════════════════════════════╣
║ Caller-saved (scratch): RAX, RCX, RDX, RSI, RDI, R8-R11 ║
║ → Function CAN destroy these; caller must save if needed ║
║ ║
║ Callee-saved (preserved): RBX, RBP, R12-R15, RSP ║
║ → Function MUST restore these before returning ║
╚═══════════════════════════════════════════════════════════════════╝
// C function
long add3(long a, long b, long c) {
return a + b + c;
}
// Caller in C
long result = add3(10, 20, 30);
; Caller (before call)
mov rdi, 10 ; First argument
mov rsi, 20 ; Second argument
mov rdx, 30 ; Third argument
call add3 ; Call function
; RAX now contains return value
; Callee (add3 implementation)
add3:
; No need to save anything - we only use scratch registers
mov rax, rdi ; rax = a
add rax, rsi ; rax = a + b
add rax, rdx ; rax = a + b + c
ret ; Return (result in rax)
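As a quick reference for the integer column of the table, the mapping can be written as a small C lookup. This is a sketch under a simplifying assumption: every argument is an integer or pointer (no floating-point or struct arguments), and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the register that holds the n-th (1-based) integer/pointer
   argument under System V AMD64, or NULL if it is passed on the stack. */
static const char *sysv_int_arg_register(unsigned n) {
    static const char *const regs[] = {"RDI", "RSI", "RDX", "RCX", "R8", "R9"};
    if (n >= 1 && n <= 6)
        return regs[n - 1];
    return NULL;
}

/* For stack-passed integer arguments (n >= 7), returns the offset from RSP
   at callee entry. [RSP] holds the return address, so argument 7 is at +8. */
static long sysv_int_arg_stack_offset(unsigned n) {
    assert(n >= 7);
    return 8L * (n - 6);
}
```

For add3 this gives RDI, RSI, and RDX for a, b, and c, matching the caller code above.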
Stack Frame Structure
High Address
┌────────────────────────────────────────┐
│ Caller's stack frame │
├────────────────────────────────────────┤
│ Argument 8 (if > 6 args) │ [RBP + 24]
├────────────────────────────────────────┤
│ Argument 7 (if > 6 args) │ [RBP + 16]
├────────────────────────────────────────┤
│ Return Address (pushed by CALL) │ [RBP + 8]
├────────────────────────────────────────┤
│ Saved RBP (old frame pointer) │ [RBP] ← RBP points here
├────────────────────────────────────────┤
│ Local variable 1 │ [RBP - 8]
├────────────────────────────────────────┤
│ Local variable 2 │ [RBP - 16]
├────────────────────────────────────────┤
│ Local variable 3 │ [RBP - 24]
├────────────────────────────────────────┤
│ ... more locals ... │ ← RSP points here
└────────────────────────────────────────┘
Low Address
; Standard function prologue
my_function:
push rbp ; Save caller's base pointer
mov rbp, rsp ; Set up our frame pointer
sub rsp, 32 ; Allocate 32 bytes for locals (a multiple of 16 keeps RSP 16-byte aligned)
; Now we can use:
; [rbp - 8] for local var 1
; [rbp - 16] for local var 2
; etc.
; Arguments came in registers:
; rdi = arg1, rsi = arg2, rdx = arg3, rcx = arg4, r8 = arg5, r9 = arg6
; If we need to save them to stack:
mov [rbp - 8], rdi ; Save arg1 to local
mov [rbp - 16], rsi ; Save arg2 to local
; ... function body ...
; Standard function epilogue
mov rsp, rbp ; Deallocate locals
pop rbp ; Restore caller's base pointer
ret ; Return to caller
; Alternative epilogue (one instruction)
leave ; Equivalent to: mov rsp, rbp; pop rbp
ret
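Because every frame stores the caller's RBP at [RBP] and the return address at [RBP + 8], the frames form a linked list that a debugger can walk. The sketch below models that chain with ordinary C structs; it does not read the real machine stack, and FrameRecord and walk_frames are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame record matching the diagram:
   [RBP] = caller's saved RBP, [RBP + 8] = return address. */
typedef struct FrameRecord {
    const struct FrameRecord *saved_rbp;
    uintptr_t return_address;
} FrameRecord;

/* Follows the saved-RBP chain, storing up to `max` return addresses in
   `out`. Returns the number of frames visited; a NULL saved RBP ends it. */
static size_t walk_frames(const FrameRecord *rbp, uintptr_t *out, size_t max) {
    size_t n = 0;
    assert(out != NULL);
    while (rbp != NULL && n < max) {
        out[n++] = rbp->return_address;
        rbp = rbp->saved_rbp;
    }
    return n;
}
```

This only works when every function in the chain keeps a frame pointer; code built without one breaks the chain, which is why debuggers usually rely on separate unwind information instead.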
CALL and RET Instructions
; CALL instruction does two things:
; 1. Push return address (address of next instruction) onto stack
; 2. Jump to target function
call my_function ; Equivalent to:
; push rip ; (conceptual only: RIP cannot be pushed directly; the value is the address AFTER the call)
; jmp my_function
; RET instruction does the reverse:
; 1. Pop return address from stack
; 2. Jump to that address
ret ; Equivalent to:
; pop rip
; RET with immediate (clean up stack args - rare in x86-64)
ret 16 ; Pop return address, then add 16 to RSP
; Used when callee cleans up stack arguments
Visualizing CALL/RET:
Before CALL:
RSP → ┌─────────────┐
│ ... │
After CALL (inside function):
RSP → ┌─────────────┐
│ Return Addr │ ← Pushed by CALL
├─────────────┤
│ ... │
After RET (back in caller):
RSP → ┌─────────────┐
│ ... │ ← Return address popped
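The same sequence can be traced with a toy model in C. The ModelCpu type and both helpers are hypothetical, and the stack is a small array indexed by byte offset, so this is an illustration of the bookkeeping rather than of real hardware.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_STACK_BYTES 64

typedef struct {
    uint64_t rip;                          /* address of the current instruction */
    uint64_t rsp;                          /* byte offset into `stack` */
    uint64_t stack[MODEL_STACK_BYTES / 8];
} ModelCpu;

/* CALL: push the address of the NEXT instruction, then jump to the target. */
static void model_call(ModelCpu *cpu, uint64_t target, uint64_t call_length) {
    assert(cpu->rsp >= 8);
    uint64_t return_address = cpu->rip + call_length;
    cpu->rsp -= 8;
    cpu->stack[cpu->rsp / 8] = return_address;
    cpu->rip = target;
}

/* RET: pop the return address into RIP. */
static void model_ret(ModelCpu *cpu) {
    assert(cpu->rsp + 8 <= MODEL_STACK_BYTES);
    cpu->rip = cpu->stack[cpu->rsp / 8];
    cpu->rsp += 8;
}
```

With rip at 0x401000 and a 5-byte call (the length of a near relative CALL), model_call stores 0x401005 on the stack, and model_ret resumes there with rsp back at its original value.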
Leaf Functions (Optimized)
A "leaf" function doesn’t call other functions. It can skip frame setup.
// Leaf function - no other function calls
int multiply(int a, int b) {
return a * b;
}
; Optimized leaf function (no frame setup needed)
multiply:
mov eax, edi ; eax = first arg (a)
imul eax, esi ; eax = a * b
ret ; Return result in eax
; No push/pop, no stack allocation - maximum efficiency
// Non-leaf function - calls another function
int complex_calc(int a, int b) {
int x = multiply(a, b); // Calls another function
return x + a; // 'a' must survive the call
}
; Non-leaf must save callee-saved registers if it uses them
complex_calc:
push rbx ; Save rbx (we'll use it); RSP mod 16 is now 0
mov ebx, edi ; Keep 'a' in a callee-saved register
; Call multiply(a, b)
; edi already has 'a', esi already has 'b'
call multiply ; Stack is already 16-byte aligned, no padding needed
; eax = multiply result
; ebx still has original 'a' (callee-saved, preserved across call)
add eax, ebx ; x + a
pop rbx ; Restore rbx
ret
Stack Alignment Requirements
; x86-64 ABI requires 16-byte alignment at CALL
; CALL pushes 8 bytes, so on function entry: RSP mod 16 == 8
; Before calling another function, RSP mod 16 must == 0
; WRONG: Misaligned stack
my_func:
sub rsp, 16 ; Entry: RSP mod 16 = 8; after sub 16 it is still 8
call other_func ; Misaligned! Callee may crash on aligned SSE accesses
add rsp, 16
ret
; CORRECT: an odd multiple of 8 restores alignment
my_func:
sub rsp, 24 ; (8 - 24) mod 16 = 0
call other_func ; OK, aligned
add rsp, 24
ret
; CORRECT pattern: Calculate alignment
my_func:
push rbp ; -8 bytes, RSP mod 16 = 0
mov rbp, rsp
sub rsp, 32 ; 32 bytes for locals, RSP mod 16 = 0
; Ready to call other functions
; Alternative: Just ensure stack is 16-aligned before CALL
my_func:
sub rsp, 8 ; Now RSP mod 16 = 0
call other_func
add rsp, 8
ret
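Quick rule: at every CALL, the bytes you have pushed plus the bytes you have subtracted from RSP must total 8 mod 16 (8, 24, 40, ...). With the standard push rbp prologue, that means allocating locals in multiples of 16.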
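The alignment arithmetic is easy to get wrong by hand, so here is a small C helper that computes the padding. It is a sketch with a hypothetical name, and it assumes the function was entered through a normal CALL (so RSP mod 16 == 8 on entry) and that every push is 8 bytes.

```c
#include <assert.h>

/* Returns how many extra bytes to subtract from RSP so that it is 16-byte
   aligned at the next CALL, given the number of 8-byte pushes already done
   and the bytes reserved for locals. */
static unsigned long call_alignment_padding(unsigned long pushes,
                                            unsigned long local_bytes) {
    /* The leading 8 is the return address pushed by the CALL into us. */
    unsigned long used = 8 + 8 * pushes + local_bytes;
    unsigned long padding = (16 - used % 16) % 16;
    assert((used + padding) % 16 == 0);
    return padding;
}
```

For example, no pushes and no locals needs 8 bytes of padding, while push rbp followed by sub rsp, 32 needs none.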
The Red Zone (128 bytes below RSP)
┌─────────────────────┐
RSP → │ Top of stack │
├─────────────────────┤
│ │
│ RED ZONE │ ← 128 bytes
│ (Leaf functions │ Usable WITHOUT adjusting RSP
│ can use this) │ Not clobbered by signal handlers in user space
│ │
├─────────────────────┤
│ RSP - 128 │
└─────────────────────┘
; Leaf function using red zone (no sub rsp needed!)
simple_leaf:
mov [rsp - 8], rdi ; Use red zone for temp storage
mov [rsp - 16], rsi
; ... do work ...
mov rax, [rsp - 8]
ret ; Never touched RSP
; NOTE: In user space the ABI guarantees signal handlers leave the red zone intact
; Only leaf functions can rely on it: any push or call overwrites it
; Kernel code CANNOT use the red zone (hardware interrupts push onto the current stack)
In practice: Compilers automatically use the red zone for small leaf functions. You’ll see it in optimized code.
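To make the bounds concrete, the check below tests whether an access falls entirely inside the red zone. It is a sketch: in_red_zone is a hypothetical helper that treats addresses as plain integers. Separately, GCC and Clang accept -mno-red-zone to stop the compiler from using the red zone, which is how kernel code is typically built.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RED_ZONE_BYTES 128

/* True if the `size`-byte access starting at `addr` lies entirely within
   the 128 bytes below `rsp`, i.e. in [rsp - 128, rsp). */
static bool in_red_zone(uint64_t rsp, uint64_t addr, uint64_t size) {
    assert(rsp >= RED_ZONE_BYTES);   /* keep rsp - 128 from wrapping around */
    return addr >= rsp - RED_ZONE_BYTES && addr + size <= rsp;
}
```

With this check, the 8-byte stores at [rsp - 8] and [rsp - 16] in simple_leaf are inside the zone, while an 8-byte store at [rsp - 136] is not.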
Practical Example: String Length
// C implementation
size_t my_strlen(const char *s) {
size_t len = 0;
while (*s != '\0') {
len++;
s++;
}
return len;
}
; Hand-written assembly (one byte per iteration)
my_strlen:
; Input: RDI = pointer to string
; Output: RAX = length
xor eax, eax ; len = 0 (also clears upper RAX)
.loop:
cmp BYTE PTR [rdi + rax], 0 ; Is s[len] == 0?
je .done ; If zero, we're done
inc rax ; len++
jmp .loop ; Continue
.done:
ret ; Return length in RAX
; Alternative: walk the pointer instead of indexing (same two registers, simpler addressing mode)
my_strlen_v2:
mov rax, rdi ; Save start pointer
.loop:
cmp BYTE PTR [rdi], 0
je .done
inc rdi ; s++
jmp .loop
.done:
sub rdi, rax ; length = end - start
mov rax, rdi ; Return in RAX
ret
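For comparison, the pointer-walking variant maps back to C as shown here. The name my_strlen_ptr is hypothetical; the logic is the same end-minus-start subtraction the assembly performs.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors my_strlen_v2: advance a pointer to the terminator, then subtract
   the start pointer to get the length. */
static size_t my_strlen_ptr(const char *s) {
    assert(s != NULL);
    const char *p = s;          /* p plays the role of RDI */
    while (*p != '\0')
        p++;                    /* inc rdi */
    return (size_t)(p - s);     /* sub rdi, rax */
}
```

An optimizing compiler may generate code close to either assembly version from this source, so it is worth comparing against compiler output before hand-tuning.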
Variadic Functions (printf-style)
; For variadic functions (printf, etc.), AL must contain an upper bound on
; the number of vector registers (XMM) used for arguments (0-8)
; Calling printf("x=%d, y=%d\n", 10, 20)
section .data
fmt: db "x=%d, y=%d", 10, 0 ; Format string with newline
section .text
lea rdi, [rel fmt] ; First arg: format string
mov esi, 10 ; Second arg: x value
mov edx, 20 ; Third arg: y value
xor eax, eax ; AL = 0 (no floating point args)
call printf
; Calling printf("pi=%f\n", 3.14159)
section .data
pi_fmt: db "pi=%f", 10, 0
pi_val: dq 3.14159 ; Double precision
section .text
lea rdi, [rel pi_fmt]
movsd xmm0, [rel pi_val] ; Float arg in XMM0
mov eax, 1 ; AL = 1 (one XMM register used)
call printf
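The callee side of a variadic call is usually written in C with the standard stdarg.h macros, which hide the register and stack details. The sketch below uses a hypothetical sum_ints function; the va_start, va_arg, and va_end macros are standard C.

```c
#include <assert.h>
#include <stdarg.h>

/* Sums `count` int arguments passed after `count`. */
static long sum_ints(int count, ...) {
    assert(count >= 0);
    va_list ap;
    long total = 0;
    va_start(ap, count);             /* start after the last named parameter */
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);    /* fetch the next int argument */
    va_end(ap);
    return total;
}
```

A call such as sum_ints(8, 1, 2, 3, 4, 5, 6, 7, 8) has more integer arguments than there are argument registers, so under System V some of them arrive on the stack; va_arg handles both cases transparently.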
Recursion
// Factorial in C
long factorial(long n) {
if (n <= 1) return 1;
return n * factorial(n - 1);
}
factorial:
; Base case: n <= 1
cmp rdi, 1
jle .base_case
; Recursive case: n * factorial(n-1)
push rdi ; Save n (RDI is caller-saved, the call overwrites it); also realigns RSP to 16
dec rdi ; n - 1
call factorial ; factorial(n-1)
pop rdi ; Restore n
imul rax, rdi ; rax = factorial(n-1) * n
ret
.base_case:
mov eax, 1 ; Return 1
ret
; Stack unwinding visualization for factorial(4):
;
; factorial(4): push 4, call factorial(3)
; factorial(3): push 3, call factorial(2)
; factorial(2): push 2, call factorial(1)
; factorial(1): return 1
; factorial(2): pop 2, return 2*1 = 2
; factorial(3): pop 3, return 3*2 = 6
; factorial(4): pop 4, return 4*6 = 24
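The push-then-pop pattern in that trace can be reproduced in C with an explicit array standing in for the machine stack. This is a sketch with hypothetical names (factorial_explicit, MAX_DEPTH), and it assumes n is small enough to fit the fixed-size array.

```c
#include <assert.h>

#define MAX_DEPTH 32

/* Computes n! using an explicit stack in place of recursion. */
static long factorial_explicit(long n) {
    long saved[MAX_DEPTH];          /* stands in for the machine stack */
    int top = 0;

    /* "Call" phase: save n, then descend to n - 1 until the base case. */
    while (n > 1) {
        assert(top < MAX_DEPTH);
        saved[top++] = n;           /* push rdi */
        n--;                        /* dec rdi */
    }

    long result = 1;                /* base case: mov eax, 1 */

    /* "Return" phase: restore each saved n in reverse order and multiply. */
    while (top > 0)
        result *= saved[--top];     /* pop rdi; imul rax, rdi */

    return result;
}
```

For n = 4 the array holds 4, 3, 2 at the deepest point, and the multiplications happen in the order 2, 3, 4, matching the trace.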
Common Stack Gotchas
; WRONG: Forgetting to preserve callee-saved registers
my_func:
mov rbx, rdi ; Using RBX without saving it!
call other_func
mov rax, rbx
ret ; Caller's RBX is now corrupted!
; CORRECT:
my_func:
push rbx ; Save RBX
mov rbx, rdi
call other_func
mov rax, rbx
pop rbx ; Restore RBX
ret
; WRONG: Stack imbalance
my_func:
push rax
; ... forget to pop ...
ret ; Returns to WRONG address (pushed value)!
; WRONG: Using stack after deallocating
my_func:
sub rsp, 16
mov [rsp], rax ; Store value
add rsp, 16 ; Deallocate
mov rcx, [rsp - 16] ; WRONG! This memory is below RSP and no longer reserved
; Any later push or call overwrites it
ret
; WRONG: Mismatched push/pop order
push rax
push rbx
pop rax ; Wrong! Gets RBX value
pop rbx ; Wrong! Gets RAX value
; CORRECT: LIFO order
push rax
push rbx
pop rbx ; Restore RBX
pop rax ; Restore RAX
Debugging Stack with GDB
# Start GDB
gdb ./program
# Set breakpoint at function
(gdb) break my_function
(gdb) run
# Examine stack
(gdb) info registers rsp rbp # Stack and frame pointers
(gdb) x/20gx $rsp # 20 quad words at stack top
(gdb) x/s *(char**)($rsp) # If stack has string pointer
# Backtrace (call stack)
(gdb) bt # Show call stack
(gdb) bt full # With local variables
# Examine stack frame
(gdb) info frame # Current frame details
(gdb) info locals # Local variables
(gdb) info args # Function arguments
# Walk up/down call stack
(gdb) up # Go to caller
(gdb) down # Go back to callee
# Print return address
(gdb) x/gx $rbp+8 # Return address is at RBP+8 (after the prologue, if the function keeps a frame pointer)