Phase 10: AI Stack (Ollama + RTX 5090)

The P16g has an NVIDIA RTX 5090 with 24GB GDDR7 — capable of running 30B+ parameter models locally. This phase deploys Ollama for local LLM inference with persistent storage and a custom API service.

Install Ollama

curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Verify GPU detected
ollama run --verbose llama3.2:1b "hello" 2>&1 | head -5

Configure Model Storage

Models are large (10-40 GB each), so store them on /home (1.6 TB btrfs) rather than on / (250 GB root).

# Create model storage directory
sudo mkdir -p /home/ollama-models
sudo chown ollama:ollama /home/ollama-models
# Bind mount to Ollama's default location
sudo mkdir -p /var/lib/ollama/.ollama/models
echo "/home/ollama-models /var/lib/ollama/.ollama/models none bind 0 0" | sudo tee -a /etc/fstab
# Mount now
sudo mount -a
# Verify mount
df -hT /var/lib/ollama/.ollama/models
# Should show btrfs from /home partition, NOT root
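An alternative to the fstab bind mount is to point Ollama straight at /home with the OLLAMA_MODELS environment variable (a documented Ollama setting) via a systemd drop-in. A minimal sketch; use either this or the bind mount, not both:

```ini
# /etc/systemd/system/ollama.service.d/models-dir.conf
# Alternative to the fstab bind mount: store models directly on /home.
[Service]
Environment="OLLAMA_MODELS=/home/ollama-models"
```

After creating the drop-in, run sudo systemctl daemon-reload && sudo systemctl restart ollama.service.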

Model Selection for RTX 5090 (24GB GDDR7)

| Model | Size (disk) | VRAM | Use Case |
| --- | --- | --- | --- |
| qwen3-coder:30b | ~18 GB | ~20 GB | Primary coding assistant — fits in 24GB VRAM |
| qwen2.5-coder:32b | ~19 GB | ~21 GB | Alternative coder — slightly larger context |
| qwen2.5-coder:14b | ~9 GB | ~11 GB | Fast inference when 30B is too slow |
| llama3.2:3b | ~2 GB | ~3 GB | Quick queries, testing, lightweight tasks |
| nomic-embed-text | ~270 MB | ~1 GB | Embeddings for RAG pipelines |

VRAM usage is approximate. Actual usage depends on context length and quantization. With 24GB, you can run 30B Q4 models comfortably. 70B models require CPU offloading (slow).
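The sizing behind the table can be sketched as weight bytes (parameters × quantization bits / 8) plus roughly 20% for KV cache and runtime overhead. A back-of-envelope helper; the 1.2 overhead factor is an assumption, not an Ollama figure, and real usage grows with num_ctx:

```shell
# Rough VRAM estimate in GB: billions of parameters * quant bits / 8,
# plus an assumed ~20% for KV cache and runtime overhead.
estimate_vram_gb() {
  local params_b=$1 bits=$2
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.2 }'
}
estimate_vram_gb 30 4   # 30B at Q4 -> 18.0
```

This lines up with the ~18-20 GB figures for the 30B model above, and shows why 70B even at Q4 (~42 GB) cannot fit in 24 GB without CPU offloading.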

Pull Models

# Primary coder (largest that fits in VRAM)
ollama pull qwen3-coder:30b
# Alternative coder
ollama pull qwen2.5-coder:32b
# Fast fallback
ollama pull qwen2.5-coder:14b
# Lightweight for quick queries
ollama pull llama3.2:3b
# Embeddings
ollama pull nomic-embed-text
# Verify all models pulled
ollama list
# Check storage usage
du -sh /home/ollama-models/

Create Custom Models

Custom modelfiles define system prompts and parameters for specific use cases.

mkdir -p ~/.ollama/Modelfiles

domus-chat-v3 (Infrastructure Assistant)

cat > ~/.ollama/Modelfiles/domus-chat-v3.Modelfile << 'EOF'
FROM qwen2.5-coder:14b

SYSTEM """
You are a senior infrastructure engineer assistant for the Domus Digitalis environment.
You specialize in: Cisco ISE, HashiCorp Vault, VyOS, Kubernetes (k3s), Ansible,
Linux administration (Arch, RHEL), 802.1X EAP-TLS, and AsciiDoc documentation.
Be concise, use code blocks, suggest verification commands.
"""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192
EOF
ollama create domus-chat-v3 -f ~/.ollama/Modelfiles/domus-chat-v3.Modelfile

quick (Fast Responses)

cat > ~/.ollama/Modelfiles/quick.Modelfile << 'EOF'
FROM llama3.2:3b

SYSTEM "Be extremely concise. One-line answers when possible. No preamble."

PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF
ollama create quick -f ~/.ollama/Modelfiles/quick.Modelfile

Ollama as systemd Service

Ensure Ollama runs persistently and starts on boot.

# Check if install script created the service
systemctl is-enabled ollama.service
# If not enabled:
sudo systemctl enable --now ollama.service
# Verify running
systemctl is-active ollama.service
# Check Ollama is listening
ss -tlnp | grep 11434
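By default Ollama binds only to 127.0.0.1:11434. If other hosts on the LAN should reach it, the documented OLLAMA_HOST environment variable can widen the bind address via a drop-in. A sketch, assuming you actually want LAN exposure and have firewall rules in place:

```ini
# /etc/systemd/system/ollama.service.d/listen.conf
# Bind on all interfaces instead of loopback only
# (then: sudo systemctl daemon-reload && sudo systemctl restart ollama.service).
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```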

Custom API Service (ollama-local)

Optional: a custom FastAPI wrapper for Ollama with additional endpoints.

cd ~/atelier/_projects/personal/ollama-local
uv sync
# Start API service
uv run uvicorn api.main:app --port 8080 &
# Test health endpoint
curl -s http://localhost:8080/health | jq
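The wrapper sits in front of Ollama's native REST API on port 11434, which can also be queried directly. A sketch using the documented /api/generate endpoint; the model and prompt here are illustrative:

```shell
# Build a non-streaming request body for Ollama's native generate endpoint.
body=$(printf '{"model": "%s", "prompt": "%s", "stream": false}' \
  "llama3.2:3b" "What is the default SSH port?")
echo "$body"
# Send it once the service is up (prints just the completion text):
# curl -s http://localhost:11434/api/generate -d "$body" | jq -r .response
```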

Inference Verification

# Test primary model inference
ollama run qwen3-coder:30b "Write a bash function that checks if a port is open"
# Test custom model
ollama run domus-chat-v3 "How do I check ISE authentication logs?"
# Test quick model
ollama run quick "What is the default SSH port?"
# Monitor GPU during inference
nvidia-smi -l 1
# (Ctrl+C to stop — watch VRAM usage during model loading)
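As a local counterpart to the first inference prompt above, here is one way such a port check might look in bash. A sketch using bash's /dev/tcp pseudo-device; port_open is a hypothetical helper name:

```shell
# Returns 0 if a TCP connect to host:port succeeds within 2 seconds.
# Uses bash's built-in /dev/tcp redirection; no netcat required.
port_open() {
  local host=${1:-localhost} port=$2
  timeout 2 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null
}
port_open localhost 11434 && echo "ollama reachable" || echo "ollama not reachable"
```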

GPU Memory Management

# Check current VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
# If VRAM is full (model won't load): Ollama auto-unloads after 5 minutes idle,
# or force an unload by restarting the service:
sudo systemctl restart ollama.service
# Keep models loaded longer (optional) via a systemd drop-in:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama.service

Troubleshooting

| Symptom | Fix |
| --- | --- |
| error: no GPUs detected | Verify nvidia-smi works. Check lsmod \| grep nvidia. Reinstall nvidia-open if needed. |
| Model pulls fail (network error) | Check DNS (dig +short ollama.com). If on the iPSK VLAN, check firewall rules. |
| Out of VRAM (model too large) | Use a smaller quantization (ollama pull model:q4_0 instead of the default tag), or fall back to the 14B model. |
| Slow inference | Check nvidia-smi: the GPU should show high utilization. If inference is running on the CPU, the NVIDIA driver may not be loaded. |
| Models disappear after reboot | Verify the fstab bind mount: mount \| grep ollama-models. If missing, run sudo mount -a. |

Check Status

[ ] Ollama installed and running
[ ] Model storage on /home (bind mount verified)
[ ] qwen3-coder:30b pulled
[ ] qwen2.5-coder:14b pulled
[ ] llama3.2:3b pulled
[ ] Custom models created (domus-chat-v3, quick)
[ ] Ollama systemd service enabled
[ ] Inference test passed (30B model on RTX 5090)
[ ] GPU VRAM usage validated during inference
[ ] ollama-local API service working (if applicable)