# Phase 10: AI Stack (Ollama + RTX 5090)
The P16g has an NVIDIA RTX 5090 with 24GB GDDR7 — capable of running 30B+ parameter models locally. This phase deploys Ollama for local LLM inference with persistent storage and a custom API service.
## Install Ollama

```shell
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

# Verify GPU detected
ollama run --verbose llama3.2:1b "hello" 2>&1 | head -5
```
## Configure Model Storage

Models are large (10–40 GB each). Store them on /home (1.6 TB btrfs), not on / (250 GB root).

```shell
# Create model storage directory
sudo mkdir -p /home/ollama-models
sudo chown ollama:ollama /home/ollama-models

# Bind mount to Ollama's default location
sudo mkdir -p /var/lib/ollama/.ollama/models
echo "/home/ollama-models /var/lib/ollama/.ollama/models none bind 0 0" | sudo tee -a /etc/fstab

# Mount now
sudo mount -a

# Verify mount
df -hT /var/lib/ollama/.ollama/models
# Should show btrfs from the /home partition, NOT root
```
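As an alternative to the bind mount, Ollama reads the `OLLAMA_MODELS` environment variable for its model path, which can be set via a systemd drop-in (a sketch; the drop-in filename `models-path.conf` is an arbitrary choice):

```shell
# Alternative: point Ollama at /home directly instead of bind-mounting.
# OLLAMA_MODELS overrides the default model directory.
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/models-path.conf <<'EOF'
[Service]
Environment="OLLAMA_MODELS=/home/ollama-models"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama.service
```

The bind mount keeps the default path working for any tooling that hardcodes it; the environment variable is simpler if only Ollama touches the directory.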
## Model Selection for RTX 5090 (24GB GDDR7)

| Model | Size | VRAM | Use Case |
|---|---|---|---|
| `qwen3-coder:30b` | ~18 GB | ~20 GB | Primary coding assistant — fits in 24GB VRAM |
| `qwen2.5-coder:32b` | ~19 GB | ~21 GB | Alternative coder — slightly larger context |
| `qwen2.5-coder:14b` | ~9 GB | ~11 GB | Fast inference when 30B is too slow |
| `llama3.2:3b` | ~2 GB | ~3 GB | Quick queries, testing, lightweight tasks |
| `nomic-embed-text` | ~270 MB | ~1 GB | Embeddings for RAG pipelines |

VRAM usage is approximate. Actual usage depends on context length and quantization. With 24GB, you can run 30B Q4 models comfortably. 70B models require CPU offloading (slow).
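You can sanity-check the VRAM column with a rule of thumb: at Q4 quantization, weights take roughly 0.6 GB per billion parameters, plus a couple of GB for the KV cache. The 0.6 GB/B and 2 GB figures below are approximations for estimation only, not exact Ollama numbers:

```shell
# Rough VRAM estimate for a Q4-quantized model (assumption: ~0.6 GB of
# weights per billion parameters plus ~2 GB KV cache at moderate context).
estimate_vram_gb() {
  awk -v p="$1" 'BEGIN { printf "%.0f", p * 0.6 + 2 }'
}
estimate_vram_gb 30   # → 20 (matches the ~20 GB row for qwen3-coder:30b)
```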
## Pull Models

```shell
# Primary coder (largest that fits in VRAM)
ollama pull qwen3-coder:30b

# Alternative coder
ollama pull qwen2.5-coder:32b

# Fast fallback
ollama pull qwen2.5-coder:14b

# Lightweight for quick queries
ollama pull llama3.2:3b

# Embeddings
ollama pull nomic-embed-text

# Verify all models pulled
ollama list

# Check storage usage
du -sh /home/ollama-models/
```
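The verification can be scripted: a small helper that diffs the expected model set against `ollama list` output (a sketch; it reads the listing from stdin so the logic is testable without a live daemon):

```shell
# Report which expected models are missing from `ollama list` output.
missing_models() {
  have="$(awk 'NR>1 {print $1}')"
  for m in "$@"; do
    case "$have" in *"$m"*) ;; *) echo "$m" ;; esac
  done
}

# Against a live daemon (no output means everything is pulled):
# ollama list | missing_models qwen3-coder:30b qwen2.5-coder:32b \
#   qwen2.5-coder:14b llama3.2:3b nomic-embed-text

# Demonstration on a canned listing:
printf 'NAME ID SIZE\nqwen3-coder:30b x 18GB\n' | \
  missing_models qwen3-coder:30b nomic-embed-text   # → nomic-embed-text
```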
## Create Custom Models

Custom Modelfiles define system prompts and parameters for specific use cases.

```shell
mkdir -p ~/.ollama/Modelfiles
```
### domus-chat-v3 (Infrastructure Assistant)

```shell
cat > ~/.ollama/Modelfiles/domus-chat-v3.Modelfile << 'EOF'
FROM qwen2.5-coder:14b
SYSTEM """
You are a senior infrastructure engineer assistant for the Domus Digitalis environment.
You specialize in: Cisco ISE, HashiCorp Vault, VyOS, Kubernetes (k3s), Ansible,
Linux administration (Arch, RHEL), 802.1X EAP-TLS, and AsciiDoc documentation.
Be concise, use code blocks, suggest verification commands.
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
EOF

ollama create domus-chat-v3 -f ~/.ollama/Modelfiles/domus-chat-v3.Modelfile
```
### quick (Fast Responses)

```shell
cat > ~/.ollama/Modelfiles/quick.Modelfile << 'EOF'
FROM llama3.2:3b
SYSTEM "Be extremely concise. One-line answers when possible. No preamble."
PARAMETER temperature 0.1
PARAMETER num_ctx 4096
EOF

ollama create quick -f ~/.ollama/Modelfiles/quick.Modelfile
```
## Ollama as systemd Service

Ensure Ollama runs persistently and starts on boot.

```shell
# Check whether the install script created and enabled the service
systemctl is-enabled ollama.service

# If not enabled:
sudo systemctl enable --now ollama.service

# Verify running
systemctl is-active ollama.service

# Check Ollama is listening (default port 11434)
ss -tlnp | grep 11434
```
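The REST API offers the same check: `GET /api/tags` lists installed models. This sketch extracts the model names without requiring jq:

```shell
# Extract model names from Ollama's /api/tags JSON response.
parse_models() {
  grep -o '"name":"[^"]*"' | sed 's/"name":"\(.*\)"/\1/'
}

# Against a live daemon:
# curl -s http://localhost:11434/api/tags | parse_models

# Demonstration on a canned response:
echo '{"models":[{"name":"quick:latest"},{"name":"llama3.2:3b"}]}' | parse_models
```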
## Custom API Service (ollama-local)

Optional: a custom FastAPI wrapper around Ollama that exposes additional endpoints.

```shell
cd ~/atelier/_projects/personal/ollama-local
uv sync

# Start API service
uv run uvicorn api.main:app --port 8080 &

# Test health endpoint
curl -s http://localhost:8080/health | jq
```
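For comparison, the stock Ollama API on port 11434 can be driven directly without the wrapper. A sketch using naive JSON quoting (fine for simple prompts, not for prompts containing quotes):

```shell
# Build a minimal /api/generate request body. "stream":false makes Ollama
# return one JSON object instead of a token stream.
gen_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}
gen_payload quick "What is the default SSH port?"

# POST it (assumes the ollama service is running):
# curl -s http://localhost:11434/api/generate \
#   -d "$(gen_payload quick 'What is the default SSH port?')" | jq -r .response
```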
## Inference Verification

```shell
# Test primary model inference
ollama run qwen3-coder:30b "Write a bash function that checks if a port is open"

# Test custom model
ollama run domus-chat-v3 "How do I check ISE authentication logs?"

# Test quick model
ollama run quick "What is the default SSH port?"

# Monitor GPU during inference — watch VRAM usage during model loading
# (Ctrl+C to stop)
nvidia-smi -l 1
```
## GPU Memory Management

```shell
# Check current VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits

# If VRAM is full (model won't load): Ollama auto-unloads idle models after
# 5 minutes, or restart the service to force an unload:
sudo systemctl restart ollama.service

# Keep models loaded longer (optional). Create a drop-in with
# `sudo systemctl edit ollama.service` containing:
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=30m"
# then restart: sudo systemctl restart ollama.service
```
## Troubleshooting

| Symptom | Fix |
|---|---|
| GPU not detected | Verify `nvidia-smi` works, then restart `ollama.service` |
| Model pulls fail (network error) | Check DNS resolution and connectivity, then retry the pull |
| Out of VRAM (model too large) | Use a smaller model or quantization, e.g. `qwen2.5-coder:14b` |
| Slow inference | Check `nvidia-smi` during inference; confirm the model runs on GPU, not CPU |
| Models disappear after reboot | Verify the fstab bind mount: `df -hT /var/lib/ollama/.ollama/models` |
## Checklist

| Check | Status |
|---|---|
| Ollama installed and running | [ ] |
| Model storage on /home (bind mount verified) | [ ] |
| `qwen3-coder:30b` pulled | [ ] |
| `qwen2.5-coder:32b` pulled | [ ] |
| Remaining models pulled (`qwen2.5-coder:14b`, `llama3.2:3b`, `nomic-embed-text`) | [ ] |
| Custom models created (domus-chat-v3, quick) | [ ] |
| Ollama systemd service enabled | [ ] |
| Inference test passed (30B model on RTX 5090) | [ ] |
| GPU VRAM usage validated during inference | [ ] |
| ollama-local API service working (if applicable) | [ ] |