Stack & Functions

Function calling, stack management, and calling conventions.

The Stack: Concept

The stack is a region of memory that grows DOWNWARD (toward lower addresses). RSP (Stack Pointer) always points to the TOP of the stack: the most recently pushed value, which sits at the lowest address in use.

Memory Address
     High
      │
      │   ┌─────────────────┐
      │   │ Previous data   │ ← RSP before push
      │   ├─────────────────┤
      │   │ Pushed value    │ ← RSP after push (8 bytes lower)
      │   ├─────────────────┤
      │   │ (unused/garbage)│
      ▼   └─────────────────┘
     Low

PUSH decreases RSP (moves DOWN in the diagram, toward lower addresses)
POP  increases RSP (moves UP in the diagram, toward higher addresses)

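Why downward? Mostly historical, but it was useful in the classic process layout: the stack and heap grow toward each other, so neither needs a fixed share of memory in advance. If they meet, you’re out of memory. On modern systems the stack usually has a fixed size limit, and overflowing it faults long before it reaches the heap.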

Push and Pop

; PUSH - Store value on stack
push rax            ; RSP = RSP - 8; [RSP] = RAX
                    ; (64-bit mode: pushes 8 bytes by default; a 4-byte push is not encodable)

; Equivalent to:
sub rsp, 8
mov [rsp], rax

; PUSH memory
push qword [rbx]      ; Push value from memory address in rbx (NASM syntax)

; PUSH immediate
push 42             ; Push literal value

; POP - Retrieve value from stack
pop rax             ; RAX = [RSP]; RSP = RSP + 8

; Equivalent to:
mov rax, [rsp]
add rsp, 8

; POP to memory
pop qword [rbx]       ; Pop into memory address (NASM syntax)

; Common patterns
push rbx            ; Save callee-saved register
push rbp
; ... do work ...
pop rbp             ; Restore in REVERSE order
pop rbx
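
The PUSH/POP rules above can be modeled in a few lines of C. This is only a sketch: SimStack, sim_init, sim_push and sim_pop are hypothetical names, and the "stack" is a small array with rsp kept as a byte offset rather than a real address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: 64 eight-byte slots, rsp is a byte offset into them. */
enum { SIM_STACK_BYTES = 64 * 8 };

typedef struct {
    uint64_t slots[SIM_STACK_BYTES / 8];
    size_t rsp;                     /* starts at the high end, moves down */
} SimStack;

static void sim_init(SimStack *s) { s->rsp = SIM_STACK_BYTES; }

/* push: RSP = RSP - 8, then [RSP] = value */
static void sim_push(SimStack *s, uint64_t value) {
    assert(s->rsp >= 8);            /* simulated stack overflow check */
    s->rsp -= 8;
    s->slots[s->rsp / 8] = value;
}

/* pop: value = [RSP], then RSP = RSP + 8 */
static uint64_t sim_pop(SimStack *s) {
    assert(s->rsp + 8 <= (size_t)SIM_STACK_BYTES);  /* nothing to pop otherwise */
    uint64_t value = s->slots[s->rsp / 8];
    s->rsp += 8;
    return value;
}
```

Pushing two values and popping them returns them in reverse order and leaves rsp where it started, which is why saved registers are restored in reverse.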

CRITICAL: The System V AMD64 ABI requires RSP to be 16-byte aligned at every CALL. The CALL instruction pushes 8 bytes (return address), so on function entry, RSP mod 16 == 8. Account for that when pushing registers or allocating stack space.

System V AMD64 Calling Convention (Linux)

╔═══════════════════════════════════════════════════════════════════╗
║                    ARGUMENT PASSING                                ║
╠═══════════════════════════════════════════════════════════════════╣
║  Slot   │ Integer/Pointer │ Floating Point │ Notes                ║
║─────────┼─────────────────┼────────────────┼──────────────────────║
║    1    │      RDI        │     XMM0       │ Integer and FP slots ║
║    2    │      RSI        │     XMM1       │ are counted          ║
║    3    │      RDX        │     XMM2       │ independently.       ║
║    4    │      RCX        │     XMM3       │                      ║
║    5    │      R8         │     XMM4       │                      ║
║    6    │      R9         │     XMM5       │                      ║
║    7    │     (stack)     │     XMM6       │ Offsets below are    ║
║    8    │     (stack)     │     XMM7       │ at callee entry.     ║
║  Rest   │   Stack (right to left)          │ [RSP+8], [RSP+16]... ║
╠═══════════════════════════════════════════════════════════════════╣
║                    RETURN VALUES                                   ║
╠═══════════════════════════════════════════════════════════════════╣
║  RAX    │ Integer return (up to 64-bit)                           ║
║  RDX:RAX│ 128-bit return (RDX=high, RAX=low)                      ║
║  XMM0   │ Floating point return                                   ║
╠═══════════════════════════════════════════════════════════════════╣
║                    REGISTER PRESERVATION                           ║
╠═══════════════════════════════════════════════════════════════════╣
║ Caller-saved (scratch): RAX, RCX, RDX, RSI, RDI, R8-R11, XMM0-15  ║
║   → Function CAN destroy these; caller must save if needed        ║
║                                                                   ║
║ Callee-saved (preserved): RBX, RBP, R12-R15, RSP                  ║
║   → Function MUST restore these before returning                  ║
╚═══════════════════════════════════════════════════════════════════╝

// C function
long add3(long a, long b, long c) {
    return a + b + c;
}

// Caller in C
long result = add3(10, 20, 30);

; Caller (before call)
mov rdi, 10         ; First argument
mov rsi, 20         ; Second argument
mov rdx, 30         ; Third argument
call add3           ; Call function
; RAX now contains return value

; Callee (add3 implementation)
add3:
    ; No need to save anything - we only use scratch registers
    mov rax, rdi    ; rax = a
    add rax, rsi    ; rax = a + b
    add rax, rdx    ; rax = a + b + c
    ret             ; Return (result in rax)
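
To make the register assignment concrete, here is a rough C sketch that walks an argument list and reports where each argument would go. It assumes only scalar arguments (integer/pointer or float/double) and ignores structs, long double and other ABI classes. The function name assign_arg_locations and the 'i'/'f' encoding are hypothetical. Integer and floating-point arguments are counted separately, and the ABI provides XMM0 through XMM7 for the latter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* kinds: one char per argument, 'i' = integer/pointer, 'f' = float/double.
   out[k] receives the register name, or "stack", for argument k. */
static void assign_arg_locations(const char *kinds, const char **out) {
    static const char *const int_regs[] = {"RDI", "RSI", "RDX", "RCX", "R8", "R9"};
    static const char *const fp_regs[]  = {"XMM0", "XMM1", "XMM2", "XMM3",
                                           "XMM4", "XMM5", "XMM6", "XMM7"};
    size_t next_int = 0, next_fp = 0;   /* the two classes use separate counters */
    size_t n = strlen(kinds);

    assert(out != NULL);
    for (size_t k = 0; k < n; k++) {
        if (kinds[k] == 'f')
            out[k] = (next_fp < 8) ? fp_regs[next_fp++] : "stack";
        else
            out[k] = (next_int < 6) ? int_regs[next_int++] : "stack";
    }
}
```

For a signature like f(long, double, long), the second long lands in RSI, not RDX, because the double does not consume an integer slot.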

Stack Frame Structure

High Address
┌────────────────────────────────────────┐
│          Caller's stack frame          │
├────────────────────────────────────────┤
│   Argument 8 (if > 6 args)             │ [RBP + 24]
├────────────────────────────────────────┤
│   Argument 7 (if > 6 args)             │ [RBP + 16]
├────────────────────────────────────────┤
│   Return Address (pushed by CALL)      │ [RBP + 8]
├────────────────────────────────────────┤
│   Saved RBP (old frame pointer)        │ [RBP] ← RBP points here
├────────────────────────────────────────┤
│   Local variable 1                     │ [RBP - 8]
├────────────────────────────────────────┤
│   Local variable 2                     │ [RBP - 16]
├────────────────────────────────────────┤
│   Local variable 3                     │ [RBP - 24]
├────────────────────────────────────────┤
│   ... more locals ...                  │ ← RSP points here
└────────────────────────────────────────┘
Low Address

; Standard function prologue
my_function:
    push rbp            ; Save caller's base pointer
    mov rbp, rsp        ; Set up our frame pointer
    sub rsp, 32         ; Allocate 32 bytes for locals (a multiple of 16 keeps RSP aligned)

    ; Now we can use:
    ; [rbp - 8]  for local var 1
    ; [rbp - 16] for local var 2
    ; etc.

    ; Arguments came in registers:
    ; rdi = arg1, rsi = arg2, rdx = arg3, rcx = arg4, r8 = arg5, r9 = arg6

    ; If we need to save them to stack:
    mov [rbp - 8], rdi   ; Save arg1 to local
    mov [rbp - 16], rsi  ; Save arg2 to local

    ; ... function body ...

; Standard function epilogue
    mov rsp, rbp        ; Deallocate locals
    pop rbp             ; Restore caller's base pointer
    ret                 ; Return to caller

; Alternative epilogue (one instruction)
    leave               ; Equivalent to: mov rsp, rbp; pop rbp
    ret
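
The frame offsets above are plain address arithmetic. The following C sketch computes them from the value of RSP at function entry, assuming the standard push rbp / mov rbp, rsp / sub rsp, N prologue. FrameLayout and frame_layout are hypothetical names, and any address passed in is just an example value.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t rbp;          /* frame pointer after "mov rbp, rsp" */
    uint64_t rsp;          /* stack pointer after "sub rsp, locals_bytes" */
    uint64_t return_addr;  /* address of the slot holding the return address */
    uint64_t arg7;         /* address of the first stack-passed argument */
    uint64_t local1;       /* address of the first 8-byte local */
} FrameLayout;

/* entry_rsp: RSP on function entry, pointing at the return address. */
static FrameLayout frame_layout(uint64_t entry_rsp, uint64_t locals_bytes) {
    FrameLayout f;
    assert(entry_rsp % 16 == 8);     /* ABI state right after CALL */
    f.rbp = entry_rsp - 8;           /* push rbp; mov rbp, rsp */
    f.rsp = f.rbp - locals_bytes;    /* sub rsp, locals_bytes */
    f.return_addr = f.rbp + 8;       /* [RBP + 8] */
    f.arg7 = f.rbp + 16;             /* [RBP + 16] */
    f.local1 = f.rbp - 8;            /* [RBP - 8] */
    return f;
}
```

Note that return_addr equals entry_rsp: the prologue does not move the return address, it only records where the frame starts.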

CALL and RET Instructions

; CALL instruction does two things:
; 1. Push return address (address of next instruction) onto stack
; 2. Jump to target function

call my_function    ; Equivalent to:
                    ;   push rip    ; (actually pushes address AFTER call)
                    ;   jmp my_function

; RET instruction does the reverse:
; 1. Pop return address from stack
; 2. Jump to that address

ret                 ; Equivalent to:
                    ;   pop rip

; RET with immediate (clean up stack args - rare in x86-64)
ret 16              ; Pop return address, then add 16 to RSP
                    ; Used when callee cleans up stack arguments

Visualizing CALL/RET:

Before CALL:
    RSP → ┌─────────────┐
          │   ...       │

After CALL (inside function):
    RSP → ┌─────────────┐
          │ Return Addr │  ← Pushed by CALL
          ├─────────────┤
          │   ...       │

After RET (back in caller):
    RSP → ┌─────────────┐
          │   ...       │  ← Return address popped
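
The same CALL/RET bookkeeping, as a small C model. This is a sketch under simplifying assumptions: Cpu, sim_call and sim_ret are hypothetical names, memory is a tiny array indexed by byte offset, and call_len stands in for the encoded length of the CALL instruction.

```c
#include <assert.h>
#include <stdint.h>

enum { MEM_WORDS = 32 };

typedef struct {
    uint64_t mem[MEM_WORDS];  /* simulated stack memory, 8 bytes per word */
    uint64_t rsp;             /* byte offset into mem */
    uint64_t rip;             /* simulated instruction pointer */
} Cpu;

/* call: push the address of the next instruction, then jump to target. */
static void sim_call(Cpu *c, uint64_t target, uint64_t call_len) {
    assert(c->rsp >= 8);
    c->rsp -= 8;
    c->mem[c->rsp / 8] = c->rip + call_len;   /* return address */
    c->rip = target;
}

/* ret imm: pop the return address into RIP, then add imm to RSP.
   A plain "ret" is sim_ret(c, 0). */
static void sim_ret(Cpu *c, uint64_t imm) {
    assert(c->rsp + 8 + imm <= MEM_WORDS * 8);
    c->rip = c->mem[c->rsp / 8];
    c->rsp += 8 + imm;
}
```

A call followed by a plain ret leaves RSP unchanged and resumes right after the call; with ret 16 the callee also discards 16 bytes of stack arguments.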

Leaf Functions (Optimized)

A "leaf" function doesn’t call other functions. It can skip frame setup.

// Leaf function - no other function calls
int multiply(int a, int b) {
    return a * b;
}

; Optimized leaf function (no frame setup needed)
multiply:
    mov eax, edi        ; eax = first arg (a)
    imul eax, esi       ; eax = a * b
    ret                 ; Return result in eax
    ; No push/pop, no stack allocation - maximum efficiency

// Non-leaf function - calls another function
int complex_calc(int a, int b) {
    int x = multiply(a, b);  // Calls another function
    return x + 1;
}

; Non-leaf must save callee-saved registers if it uses them
complex_calc:
    push rbx            ; Save rbx; this push also realigns RSP to 16 bytes
                        ; (entry: RSP mod 16 == 8, after push: 0)

    mov ebx, edi        ; Keep 'a' in a callee-saved register
                        ; (illustration only: x + 1 does not need 'a' again)

    ; Call multiply(a, b)
    ; edi already has 'a', esi already has 'b'
    call multiply

    ; eax = multiply result
    ; ebx still has original 'a' (callee-saved, preserved across call)

    add eax, 1          ; x + 1

    pop rbx             ; Restore rbx
    ret

Stack Alignment Requirements

; The System V AMD64 ABI requires RSP to be 16-byte aligned at each CALL
; CALL pushes 8 bytes, so on function entry: RSP mod 16 == 8
; Before calling another function, RSP mod 16 must == 0

; WRONG: Misaligned stack
my_func:
    ; Entry: RSP mod 16 = 8
    sub rsp, 16         ; (8 - 16) mod 16 = 8, still misaligned
    call other_func     ; ABI violation: callee may fault on aligned SSE accesses (e.g. movaps)
    add rsp, 16
    ret

; OK: 24 is an odd multiple of 8, so (8 - 24) mod 16 = 0
my_func:
    sub rsp, 24
    call other_func     ; Aligned
    add rsp, 24
    ret

; CORRECT pattern: Calculate alignment
my_func:
    push rbp            ; -8 bytes, RSP mod 16 = 0
    mov rbp, rsp
    sub rsp, 32         ; 32 bytes for locals, RSP mod 16 = 0
    ; Ready to call other functions

; Alternative: Just ensure stack is 16-aligned before CALL
my_func:
    sub rsp, 8          ; Now RSP mod 16 = 0
    call other_func
    add rsp, 8
    ret

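Quick rule: On entry RSP mod 16 == 8, so the bytes you push plus the bytes you subtract from RSP must total an odd multiple of 8 (8, 24, 40, ...) before any CALL. With a push rbp prologue, that means allocating locals in multiples of 16.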
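
The alignment arithmetic can be written down as a small helper. This is a sketch, not part of any toolchain: aligned_frame_size is a hypothetical name, and it assumes every prologue push is 8 bytes and that RSP mod 16 == 8 on entry.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the smallest "sub rsp, N" amount that holds locals_bytes and leaves
   RSP 16-byte aligned, given how many 8-byte registers the prologue pushed. */
static uint64_t aligned_frame_size(uint64_t locals_bytes, unsigned pushes) {
    /* Bytes already on the stack since the caller's aligned CALL site:
       8 for the return address, plus 8 per pushed register. */
    uint64_t used = 8 + 8 * (uint64_t)pushes;
    uint64_t total = used + locals_bytes;
    uint64_t padded = (total + 15) & ~(uint64_t)15;   /* round up to 16 */
    uint64_t result = padded - used;

    assert((used + result) % 16 == 0);   /* RSP is aligned after the sub */
    return result;
}
```

For example, a function with no pushes and no locals still needs sub rsp, 8 before a call, while one that pushes rbp and wants 32 bytes of locals needs exactly sub rsp, 32.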

The Red Zone (128 bytes below RSP)

           ┌─────────────────────┐
    RSP →  │ Top of stack        │
           ├─────────────────────┤
           │                     │
           │  RED ZONE           │  ← 128 bytes
           │  (Leaf functions    │     Usable WITHOUT adjusting RSP
           │   can use this)     │     Preserved across signals (user space),
           │                     │     but overwritten by any PUSH or CALL
           ├─────────────────────┤
           │ RSP - 128           │
           └─────────────────────┘

; Leaf function using red zone (no sub rsp needed!)
simple_leaf:
    mov [rsp - 8], rdi    ; Use red zone for temp storage
    mov [rsp - 16], rsi
    ; ... do work ...
    mov rax, [rsp - 8]
    ret                   ; Never touched RSP

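; NOTE: The System V ABI guarantees that signal and interrupt handlers
; do not modify the red zone of user-space code
; Any PUSH or CALL in the function overwrites it, so use it only in leaf functions
; Kernel code CANNOT use the red zone (hardware interrupts push onto the
; current stack); the Linux kernel, for example, is built with -mno-red-zone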

In practice: Compilers automatically use the red zone for small leaf functions. You’ll see it in optimized code.
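
As a quick check of what "128 bytes below RSP" means, this C sketch tests whether an access falls inside the red zone. in_red_zone and RED_ZONE_BYTES are hypothetical names, and the addresses are plain integers rather than real pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { RED_ZONE_BYTES = 128 };

/* True when the size-byte access starting at addr lies entirely
   within [rsp - 128, rsp). */
static bool in_red_zone(uint64_t rsp, uint64_t addr, uint64_t size) {
    assert(rsp >= RED_ZONE_BYTES);          /* keep the subtraction from wrapping */
    if (size == 0 || size > RED_ZONE_BYTES)
        return false;
    return addr >= rsp - RED_ZONE_BYTES && addr <= rsp - size;
}
```

With rsp = 0x1000, an 8-byte store to rsp - 8 or rsp - 128 is inside the red zone, while rsp - 136 is below it and rsp itself is ordinary stack.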

Practical Example: String Length

// C implementation
size_t my_strlen(const char *s) {
    size_t len = 0;
    while (*s != '\0') {
        len++;
        s++;
    }
    return len;
}

; Hand-optimized assembly
my_strlen:
    ; Input: RDI = pointer to string
    ; Output: RAX = length

    xor eax, eax        ; len = 0 (also clears upper RAX)

.loop:
    cmp byte [rdi + rax], 0       ; Is s[len] == 0? (NASM syntax)
    je .done                       ; If zero, we're done
    inc rax                        ; len++
    jmp .loop                      ; Continue

.done:
    ret                 ; Return length in RAX

; Alternative: advance the pointer and subtract at the end
; (same register count, simpler addressing in the loop)
my_strlen_v2:
    mov rax, rdi        ; Save start pointer
.loop:
    cmp byte [rdi], 0
    je .done
    inc rdi             ; s++
    jmp .loop
.done:
    sub rdi, rax        ; length = end - start
    mov rax, rdi        ; Return in RAX
    ret
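
For comparison, here is the pointer-walking variant written in C. It is a sketch meant to mirror my_strlen_v2 step by step; my_strlen_ptr is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors my_strlen_v2: advance a pointer to the terminator, then subtract. */
static size_t my_strlen_ptr(const char *s) {
    assert(s != NULL);
    const char *p = s;          /* p plays the role of RDI, s the saved start */
    while (*p != '\0')
        p++;                    /* inc rdi */
    return (size_t)(p - s);     /* length = end - start */
}
```

Both variants read one byte per iteration; they differ only in whether the loop tracks an index or a moving pointer.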

Variadic Functions (printf-style)

; For variadic functions (printf, etc.), AL must hold an upper bound on
; the number of vector registers (XMM) used for arguments (0 to 8)
; RSP must also be 16-byte aligned at the call, as for any function

; Calling printf("x=%d, y=%d\n", 10, 20)
section .data
    fmt: db "x=%d, y=%d", 10, 0   ; Format string with newline

section .text
    lea rdi, [rel fmt]    ; First arg: format string
    mov esi, 10           ; Second arg: x value
    mov edx, 20           ; Third arg: y value
    xor eax, eax          ; AL = 0 (no floating point args)
    call printf

; Calling printf("pi=%f\n", 3.14159)
section .data
    pi_fmt: db "pi=%f", 10, 0
    pi_val: dq 3.14159    ; Double precision

section .text
    lea rdi, [rel pi_fmt]
    movsd xmm0, [rel pi_val]  ; Double arg in XMM0
    mov eax, 1            ; AL = 1 (one XMM register used)
    call printf
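
The examples above show the caller's side. On the callee's side, C code reaches variadic arguments through the standard stdarg.h macros, and the compiler generates the code that locates each argument in registers or on the stack. A minimal sketch, where sum_longs is a hypothetical function:

```c
#include <assert.h>
#include <stdarg.h>

/* Sums `count` arguments of type long passed after `count`. */
static long sum_longs(int count, ...) {
    va_list ap;
    long total = 0;

    assert(count >= 0);
    va_start(ap, count);              /* start after the last named parameter */
    for (int i = 0; i < count; i++)
        total += va_arg(ap, long);    /* caller must really pass longs */
    va_end(ap);
    return total;
}
```

A call such as sum_longs(3, 10L, 20L, 30L) passes count in EDI and the three values in RSI, RDX and RCX. The L suffixes matter because va_arg reads each argument as a long.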

Recursion

// Factorial in C
long factorial(long n) {
    if (n <= 1) return 1;
    return n * factorial(n - 1);
}

factorial:
    ; Base case: n <= 1
    cmp rdi, 1
    jle .base_case

    ; Recursive case: n * factorial(n-1)
    push rdi            ; Save n (caller-saved: the recursive call overwrites RDI)
                        ; The push also makes RSP 16-byte aligned for the call

    dec rdi             ; n - 1
    call factorial      ; factorial(n-1)

    pop rdi             ; Restore n
    imul rax, rdi       ; rax = factorial(n-1) * n
    ret

.base_case:
    mov eax, 1          ; Return 1
    ret

; Stack unwinding visualization for factorial(4):
;
; factorial(4): push 4, call factorial(3)
;   factorial(3): push 3, call factorial(2)
;     factorial(2): push 2, call factorial(1)
;       factorial(1): return 1
;     factorial(2): pop 2, return 2*1 = 2
;   factorial(3): pop 3, return 3*2 = 6
; factorial(4): pop 4, return 4*6 = 24
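
The push-on-the-way-down, pop-on-the-way-up pattern in that trace can be reproduced in C with an explicit array in place of the hardware stack. This is only an illustrative sketch: factorial_explicit and MAX_DEPTH are hypothetical, and overflow of long is not handled.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_DEPTH = 64 };

/* Computes n! the way the assembly does: save n before each "call",
   then restore and multiply after each "return".
   *max_depth (if non-NULL) reports how many values were saved. */
static long factorial_explicit(long n, int *max_depth) {
    long saved[MAX_DEPTH];
    int top = 0;                      /* number of saved values */

    while (n > 1) {                   /* recursive case: push n, recurse on n - 1 */
        assert(top < MAX_DEPTH);
        saved[top++] = n;
        n--;
    }
    if (max_depth != NULL)
        *max_depth = top;

    long result = 1;                  /* base case returns 1 */
    while (top > 0)                   /* each return: pop n, multiply */
        result *= saved[--top];
    return result;
}
```

For n = 4 the array holds 4, 3, 2 at its deepest point and the pops multiply in the order 2, 3, 4, matching the trace. In the real assembly each level also stores an 8-byte return address next to the saved n.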

Common Stack Gotchas

; WRONG: Forgetting to preserve callee-saved registers
my_func:
    mov rbx, rdi        ; Using RBX without saving it!
    call other_func
    mov rax, rbx
    ret                 ; Caller's RBX is now corrupted!

; CORRECT:
my_func:
    push rbx            ; Save RBX
    mov rbx, rdi
    call other_func
    mov rax, rbx
    pop rbx             ; Restore RBX
    ret

; WRONG: Stack imbalance
my_func:
    push rax
    ; ... forget to pop ...
    ret                 ; Returns to WRONG address (pushed value)!

; WRONG: Using stack after deallocating
my_func:
    sub rsp, 16
    mov [rsp], rax      ; Store value
    add rsp, 16         ; Deallocate
    mov rbx, [rsp - 16] ; WRONG! This memory is now "free"
                        ; Any later PUSH or CALL overwrites it; only the
                        ; System V red zone keeps it intact until then
    ret

; WRONG: Mismatched push/pop order
    push rax
    push rbx
    pop rax             ; Wrong! Gets RBX value
    pop rbx             ; Wrong! Gets RAX value

; CORRECT: LIFO order
    push rax
    push rbx
    pop rbx             ; Restore RBX
    pop rax             ; Restore RAX

Debugging Stack with GDB

# Start GDB
gdb ./program

# Set breakpoint at function
(gdb) break my_function
(gdb) run

# Examine stack
(gdb) info registers rsp rbp       # Stack and frame pointers
(gdb) x/20gx $rsp                  # 20 quad words at stack top
(gdb) x/s *(char**)($rsp)          # If stack has string pointer

# Backtrace (call stack)
(gdb) bt                           # Show call stack
(gdb) bt full                      # With local variables

# Examine stack frame
(gdb) info frame                   # Current frame details
(gdb) info locals                  # Local variables
(gdb) info args                    # Function arguments

# Walk up/down call stack
(gdb) up                           # Go to caller
(gdb) down                         # Go back to callee

# Print return address
(gdb) x/gx $rbp+8                  # Return address is at RBP+8 (after the prologue, if the function sets up RBP)