Aider + Ollama: Offline Coding Assistant
Quick Start
# Primary — newest MoE model (3.3B active params, fast + smart)
aider --model ollama_chat/qwen3-coder:30b
# Alternative — battle-tested, 71.4% Aider benchmark
aider --model ollama_chat/qwen2.5-coder:32b
# Fallback — fast, smaller VRAM footprint
aider --model ollama_chat/qwen2.5-coder:14b
Use the `ollama_chat/` prefix rather than plain `ollama/`; Aider's chat-style endpoint gives better results with Ollama models.
Pre-Flight (While Online)
Pull models before going offline:
ollama pull qwen2.5-coder:32b
ollama pull qwen2.5-coder:14b
# Verify they're cached
ollama list
Set the environment variable (add to ~/.zshrc):
export OLLAMA_API_BASE=http://127.0.0.1:11434
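A quick sanity check before launching aider: this sketch falls back to the standard local endpoint if the variable was never exported, so a missing `~/.zshrc` entry still points aider at the right place.

```shell
# Use the existing value if set; otherwise fall back to the local default.
export OLLAMA_API_BASE="${OLLAMA_API_BASE:-http://127.0.0.1:11434}"
echo "aider will talk to: $OLLAMA_API_BASE"
```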
Configuration Files
| File | Purpose |
|---|---|
| `~/.aider.conf.yml` | Global settings — architect mode, auto-commit OFF, whole edit format |
| `~/.aider.model.settings.yml` | CRITICAL — sets `num_ctx` to a usable context window |
| `CONVENTIONS.md` (global) | Global coding standards (AsciiDoc rules, git conventions) |
| `CONVENTIONS.md` (per repo) | Project-specific rules (partials system, attributes, prefixes) |
| `.aiderignore` | Excludes large/irrelevant directories from context |
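As an illustration, a minimal `.aiderignore` might look like this (gitignore syntax; the directory names below are examples, not taken from this repo):

```text
# .aiderignore: these paths never enter the model's context
build/
public/
node_modules/
*.png
*.pdf
```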
Why These Settings Matter
num_ctx (Context Window)
Ollama defaults to a 2048-token context window, which silently truncates your files and instructions. The `.aider.model.settings.yml` file raises `num_ctx` to 32768 so the model actually sees everything you /add.
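A minimal sketch of what `~/.aider.model.settings.yml` can contain, assuming the Quick Start model names (this follows Aider's model-settings format; `extra_params` is passed through to the Ollama API):

```yaml
- name: ollama_chat/qwen3-coder:30b
  extra_params:
    num_ctx: 32768
- name: ollama_chat/qwen2.5-coder:32b
  extra_params:
    num_ctx: 32768
```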
Architect Mode
Architect mode is the single biggest quality improvement for local models. It separates two concerns:
- Architect pass — the model reasons about what to do (natural language)
- Editor pass — the model formats the file edits (structured output)
Quantized local models fail when asked to reason AND follow strict formatting in one pass. Architect mode fixes this.
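In configuration terms, the two passes can be wired up roughly like this in `.aider.conf.yml` (the editor model is an assumed choice; any smaller local model can handle the formatting pass):

```yaml
architect: true                               # pass 1: reason about the change
editor-model: ollama_chat/qwen2.5-coder:14b   # pass 2: format the edits (assumed choice)
editor-edit-format: editor-whole              # simplest output format for local models
```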
Edit Format: whole, not diff
| Format | How It Works | Local Model Result |
|---|---|---|
| `diff` | Search/replace blocks | FAILS — quantized models can’t follow the syntax |
| `whole` | Returns entire file | Works — no formatting to get wrong |
| `editor-whole` | Simplified `whole` for architect mode | Best — used automatically in architect mode |
Critical Safety Settings
| Setting | Value | Why |
|---|---|---|
| `auto-commits` | `false` | CRITICAL. Local models WILL produce bad code. Review everything before committing. |
| `suggest-shell-commands` | `false` | 32B models hallucinate shell commands. Don’t trust them. |
| `dirty-commits` | `false` | Only commit staged changes, never working tree noise. |
| `attribute-author` / `attribute-committer` | `false` | No AI attribution per policy. |
| `architect` | `true` | Separates thinking from editing for better quality. |
| `edit-format` | `whole` | Quantized models cannot reliably produce diff format. |
| `temperature` | low | Low temperature = precise formatting. Set in model settings. |
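Put together, these settings correspond to a `~/.aider.conf.yml` along these lines (a sketch; key names mirror Aider's CLI flags, and temperature lives in the model settings file rather than here):

```yaml
architect: true
edit-format: whole
auto-commits: false
dirty-commits: false
suggest-shell-commands: false
attribute-author: false
attribute-committer: false
```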
Workflow: How to Use Effectively
Rule 1: Small, focused tasks
Local models lose coherence on multi-step tasks. Ask for ONE thing at a time:
# GOOD — single focused request
/add docs/modules/ROOT/pages/templates/TEMPLATE-math-concept.adoc
> Create a new math concept page about logarithms following this template exactly
# BAD — multi-step, model will drift
> Reorganize the worklogs, update the nav, create a new partial, and fix the xrefs
Rule 2: Add only the files the model needs
# Add specific files to context
/add docs/modules/ROOT/pages/drafts/my-file.adoc
/add docs/antora.yml
# Drop files when switching tasks
/drop docs/antora.yml
Rule 3: Load reference files as read-only
# Read-only — model sees it but won't edit it (enables prompt caching)
/read docs/modules/ROOT/pages/templates/TEMPLATE-math-concept.adoc
# Then ask for work based on that reference
> Using the template I loaded, create a new page for logarithms
Rule 4: Review every diff
Since auto-commits are OFF, Aider shows diffs but doesn’t commit:
# 1. Aider makes edits (shown as diff)
# 2. Review the diff carefully
# 3. If good: commit manually
git diff
gach << 'EOF'
docs(scope): Your commit message
EOF
# 4. If bad: revert
git checkout -- path/to/file.adoc
Rule 5: Use /ask before /code
# Research mode — model reads but doesn't edit
/ask What attributes are defined in docs/antora.yml for ISE?
# Review the plan, then let it execute
/code Add a new practice problem to the math problem set
Rule 6: Keep context under 25K tokens
Above ~25K tokens, local models stop following instructions. Symptoms:
- Ignores conventions
- Produces wrong edit format
- Hallucinates file paths
Fix: /drop files aggressively, use .aiderignore, keep /add list small.
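There is no built-in token meter for local models, but a rough 4-characters-per-token heuristic is enough to know when you are near the 25K budget. A minimal sketch (the file is a generated stand-in):

```shell
# Estimate tokens for a file you plan to /add (heuristic: ~4 chars per token).
printf '%.0s=' $(seq 1 4000) > /tmp/sample.adoc   # 4000-char stand-in file
chars=$(wc -c < /tmp/sample.adoc)
echo "$(( chars / 4 )) tokens (budget: ~25000)"   # → 1000 tokens (budget: ~25000)
```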
Model Comparison (Your Hardware)
RTX 5090 Mobile (24GB VRAM) + 64GB RAM:
| Model | VRAM | Speed | Code Quality | Best For |
|---|---|---|---|---|
| `qwen3-coder:30b` | ~18GB | Fast (MoE, 3.3B active) | Excellent | Primary — newest, fast, smart |
| `qwen2.5-coder:32b` | ~19GB | Medium | Excellent (71.4% Aider benchmark) | Battle-tested alternative |
| `qwen2.5-coder:14b` | ~9GB | Fast | Good (69.2%) | Quick edits, small VRAM |

codestral:22b, deepseek-r1:14b, llava, and analyst were removed on 2026-03-30. Qwen models outperform them on all benchmarks for this hardware.
Limitations vs Claude Code
| Capability | Local Model | Workaround |
|---|---|---|
| Multi-file refactoring | Unreliable | One file at a time, manual coordination |
| Following complex rules | Drifts after ~5 rules | CONVENTIONS.md is distilled to essentials |
| AsciiDoc attribute verification | Will hallucinate attributes | Always grep antora.yml yourself first |
| Large context | 32K usable max | Keep /add list small, /drop aggressively |
| Commit messages | Generic | Write your own (auto-commit is OFF) |
| Shell commands | Dangerous | Never trust generated shell commands |
| Cross-repo awareness | No | Only knows files you /add |
| Template adherence | Partial | Use /read to load template, architect mode helps |
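The attribute-verification workaround is a one-liner. A sketch against a throwaway file (the attribute name `project-name` is made up; substitute whatever the model claims exists):

```shell
# Build a stand-in antora.yml, then confirm the attribute really is defined.
cat > /tmp/antora.yml << 'EOF'
name: docs
asciidoc:
  attributes:
    project-name: Domus
EOF
grep -n 'project-name:' /tmp/antora.yml   # → 4:    project-name: Domus
```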
Troubleshooting
Model producing garbage or ignoring instructions
Most likely cause: num_ctx too small. Verify your model settings are loaded:
# Check that .aider.model.settings.yml exists
cat ~/.aider.model.settings.yml
If it's missing, the model runs with a 2048-token context and everything breaks silently.
Model too slow
# Check VRAM usage
nvidia-smi
# Switch to smaller model
aider --model ollama_chat/qwen2.5-coder:14b
Edit format errors
If Aider complains about malformed edits, the model is failing to produce valid output:
# Return to plain code mode (which uses the whole format configured globally)
/chat-mode code
Or restart with:
aider --model ollama_chat/qwen2.5-coder:32b --edit-format whole
Context overloaded
# Clear everything and start fresh
/clear
/drop          # with no arguments, drops all files
/add only-the-file-you-need.adoc
Ollama not responding
# Check if Ollama is running
ss -tlnp | grep 11434
# Restart Ollama service
sudo systemctl restart ollama
Model not found
# List available models
ollama list
# Pull if missing (requires internet)
ollama pull qwen2.5-coder:32b
Custom Ollama Chat Models
Create purpose-built chat models with system prompts baked in — no CONVENTIONS.md needed for interactive use.
Creating a Modelfile
When using zsh heredocs, the delimiter must be quoted (`<< 'EOF'`) so the shell does not expand `$variables` or command substitutions inside the Modelfile body.
cat << 'EOF' > /tmp/domus-chat-v3.Modelfile
FROM qwen3-coder:30b
SYSTEM """You are a senior infrastructure and security engineer assistant. Follow these rules strictly:
- AsciiDoc only, never markdown
- NEVER add :toc: attributes (Antora UI handles navigation)
- Use stem:[...] for inline math, [stem] with ++++ for display blocks
- Use [%collapsible] for expandable sections
- No AI attribution ever
- Be direct, concise, no preamble
- Use AsciiDoc source blocks: [source,bash] then ---- delimiters
- Never use markdown backtick fences
- Use {attribute} references, never hardcode IPs or hostnames
- For code blocks with attributes, add subs=attributes+"""
PARAMETER num_ctx 32768
PARAMETER temperature 0.3
PARAMETER top_k 40
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
EOF
Do NOT add a `TEMPLATE` instruction: the chat template is inherited from the `FROM` model, and overriding it incorrectly will break the model's output.
ollama create domus-chat-v3 -f /tmp/domus-chat-v3.Modelfile
ollama run domus-chat-v3
Modelfile Syntax Rules
| Rule | Detail |
|---|---|
| Triple quotes | Required for multi-line SYSTEM prompts. Opening and closing `"""` must both be present. |
| Quoted heredoc | Critical in zsh. Unquoted heredocs expand `$variables` and command substitutions into the Modelfile. |
| Instructions are case-insensitive | `FROM`, `from`, and `From` all work; uppercase is the convention. |
| No escaping inside triple quotes | Parser reads raw content until the closing `"""`. |
| Changes require re-create | Editing the Modelfile on disk does nothing. Must re-run `ollama create`. |
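The quoted-heredoc rule is easy to verify for yourself. This sketch writes the same body twice; only the quoted version keeps `$HOME` as literal text, which is what a Modelfile SYSTEM prompt needs:

```shell
# Unquoted delimiter: the shell expands $HOME before the file is written.
cat << EOF > /tmp/unquoted.txt
path is $HOME
EOF

# Quoted delimiter: the body is written verbatim.
cat << 'EOF' > /tmp/quoted.txt
path is $HOME
EOF

grep -F '$HOME' /tmp/quoted.txt                               # → path is $HOME
grep -F '$HOME' /tmp/unquoted.txt || echo "already expanded"  # → already expanded
```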
Few-Shot Examples (Optional)
Pre-load examples to steer the model’s output format:
MESSAGE user "Create a collapsible section about SSH keys"
MESSAGE assistant "[%collapsible]
.SSH Key Generation
====
[source,bash]
----
ssh-keygen -t ed25519 -C \"user@host\"
----
===="
Active Custom Models (as of 2026-03-30)
| Model Name | Base | Purpose |
|---|---|---|
| `domus-chat-v3` | `qwen3-coder:30b` | Interactive chat with AsciiDoc + infrastructure conventions |
| | | Ultra-fast Q&A, one paragraph max |
Modelfiles are stored in `~/.ollama/Modelfiles/`.
Capturing Sessions to File
Ollama’s interactive REPL does not write to files. Use these workarounds:
One-shot (pipe output)
ollama run domus-chat-v3 "explain btrfs snapshots" > ~/output.adoc
Interactive with capture (tee)
ollama run domus-chat-v3 | tee ~/session-output.adoc
Model output goes to both screen and file. Your typed prompts are not captured.
Full session recording (script)
script -q ~/session-raw.txt -c "ollama run domus-chat-v3"
Captures both sides (input + output). Clean up into AsciiDoc afterward, or ask the model as its last prompt:
Now take everything we discussed and output it as a single clean AsciiDoc document
Inspecting Existing Models
# See what a model shipped with (system prompt, parameters, template)
ollama show qwen2.5-coder:32b --modelfile
Removing Unused Models
# Check disk usage
ollama list
# Remove a model
ollama rm model-name