1. Choosing a Coding Model

With 16 GB VRAM, the target is models that fit entirely in GPU memory at a reasonable quantization level. The shortlist evaluated:

Model	Notes
`qwen2.5-coder:32b-instruct-q4_K_M`	Best coding benchmark scores (92.7% HumanEval), but needs ~20–22 GB VRAM — requires CPU offload
`devstral-small:24b`	Consistently top-ranked for coding quality in the Ollama ecosystem; fits with Q4 quant
`qwen3:14b`	Good general + coding balance; fits cleanly in 16 GB
`deepseek-r1:14b`	Chain-of-thought reasoning; excellent for debugging
Kimi K2.6 (MoE)	Emerging top-tier; 42B active params, MIT licensed — worth tracking

Decision: Pull qwen2.5-coder:32b-instruct-q4_K_M as the primary model. At Q4_K_M it needs roughly 4–6 GB more than the 16 GB of VRAM available, so Ollama offloads the overflow layers to the CPU. Expect about 5–10 tok/s, which is acceptable for interactive coding use. The 14B models from the shortlist are the fallback if latency is unacceptable.
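As a rough back-of-the-envelope check on the offload, the sketch below estimates how many layers stay on the GPU if the weights are spread evenly across layers. The 21 GB model size (midpoint of the range above), the 64-layer count, and the 1.5 GB reserved for KV cache and runtime buffers are illustrative assumptions, not measured values.

```python
import math


def gpu_layer_split(model_gb: float, vram_gb: float, n_layers: int,
                    reserve_gb: float = 1.5) -> tuple[int, int]:
    """Estimate (layers_on_gpu, layers_on_cpu) for evenly sized layers.

    reserve_gb is an assumed allowance for KV cache and runtime buffers.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    on_gpu = min(n_layers, math.floor(usable_gb / per_layer_gb))
    return on_gpu, n_layers - on_gpu


# Assumed figures: 21 GB of weights, 64 layers, a 16 GB card.
gpu, cpu = gpu_layer_split(model_gb=21.0, vram_gb=16.0, n_layers=64)
print(f"{gpu} layers on GPU, {cpu} layers on CPU")
```

Once the model is loaded, ollama ps reports the actual CPU/GPU split, which is the number to trust over this estimate.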

Pulling the Model

ollama pull qwen2.5-coder:32b-instruct-q4_K_M

Ollama fetches the GGUF directly — no manual quantization needed. Verify with:

ollama list
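The same check can be scripted against the /api/tags endpoint, which returns the installed models as JSON. This is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes the response carries a top-level "models" list whose entries have a "name" field.

```python
import json
import urllib.request


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def installed_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


def is_installed(model: str, names: list[str]) -> bool:
    """True if the exact tag appears in the list of installed models."""
    return model in names
```

Calling is_installed("qwen2.5-coder:32b-instruct-q4_K_M", installed_models()) on the server should return True after the pull completes.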

2. Ollama Server Configuration

By default, Ollama binds only to localhost. To make it reachable on the LAN (or by Docker containers like Open WebUI), override the bind address in the systemd service:

sudo systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NOHISTORY=1"
Environment="OLLAMA_NO_CLOUD=1"

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

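Keep it LAN-only with UFW. The rule below admits only the local subnet (adjust 10.0.0.0/24 to match yours); everything else stays blocked as long as UFW is enabled with its default deny-incoming policy: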

sudo ufw allow from 10.0.0.0/24 to any port 11434
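To see which clients the rule covers, the subnet can be checked with Python's ipaddress module. This is only an illustration of what the /24 mask means; the helper name is hypothetical and the sample addresses are made up.

```python
import ipaddress

# Subnet from the UFW rule: 10.0.0.0 through 10.0.0.255.
ALLOWED = ipaddress.ip_network("10.0.0.0/24")


def is_allowed(client_ip: str) -> bool:
    """True if the client address falls inside the allowed subnet."""
    return ipaddress.ip_address(client_ip) in ALLOWED


for ip in ("10.0.0.42", "10.0.1.5", "192.168.1.10"):
    print(ip, "allowed" if is_allowed(ip) else "blocked")
```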

3. Open WebUI (Local Browser Interface)

Open WebUI runs in Docker and gives a ChatGPT-style interface to any Ollama model:

docker run -d \
  --network host \
  -e PORT=3000 \
  -e OLLAMA_BASE_URL=http://localhost:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Using --network host avoids the container-vs-host localhost confusion. Host networking has no port mapping, so without PORT=3000 Open WebUI would listen on its default port 8080. Access it at http://localhost:3000.

4. Reaching Ollama from LAN Clients via SSH Tunnel

From any machine on the same LAN (Linux, macOS, or Windows with Git Bash/OpenSSH), set up a local port forward:

ssh -N -L 11434:localhost:11434 username@ip-address

Leave this running. Ollama's API is now accessible at http://localhost:11434 on the client.
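A quick way to confirm the tunnel is up before pointing tools at it is to test whether the forwarded port accepts connections. The helper below is a small sketch with a hypothetical name; it only checks that something is listening, not that it is Ollama.

```python
import socket


def port_open(host: str = "localhost", port: int = 11434,
              timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat all as "not open".
        return False
```

If port_open() returns False on the client, the ssh -N -L process has probably exited or the forward failed to bind.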

Simplify with `~/.ssh/config`

On the client machine, add:

Host ollama-server
    HostName {ip-address}
    User {username}
    IdentityFile ~/.ssh/{your_key}
    LocalForward 11434 localhost:11434
    LocalForward 3000 localhost:3000

Then the tunnel is simply:

ssh -N ollama-server

The second LocalForward also tunnels Open WebUI, so http://localhost:3000 in the client browser gives the full web UI.

5. Using the CLI from Various Clients

Linux / macOS (Ollama installed)

OLLAMA_HOST=http://localhost:11434 ollama list
OLLAMA_HOST=http://localhost:11434 ollama run qwen2.5-coder:32b-instruct-q4_K_M

Windows — Git Bash (no Ollama binary)

Git Bash doesn't include the ollama binary, but curl works fine:

# List models
curl http://localhost:11434/api/tags

# Chat completion
curl http://localhost:11434/api/chat \
  -d '{"model":"qwen2.5-coder:32b-instruct-q4_K_M","messages":[{"role":"user","content":"Explain async/await in Python"}],"stream":false}'

Windows — PowerShell (Ollama installed)

Install Ollama from ollama.com (client-only, no GPU needed). Then:

$env:OLLAMA_HOST="http://localhost:11434"; ollama list

Cline CLI (AI Coding Agent)

Cline is an agentic coding tool that wraps Ollama with file-edit and terminal capabilities.

Install:

npm install -g cline   # Node.js 18+ required

Configure:

cline auth

Choose Ollama as the provider, set the base URL to http://localhost:11434, and specify the model name (qwen2.5-coder:32b-instruct-q4_K_M).

Alternatively, edit ~/.cline/data/settings/providers.json directly:

{
  "ollama": {
    "ollamaBaseUrl": "http://localhost:11434",
    "ollamaModelId": "qwen2.5-coder:32b-instruct-q4_K_M"
  }
}

Run a task:

cline -P ollama -m qwen2.5-coder:32b-instruct-q4_K_M "Refactor this function to be async"

Or open the interactive TUI:

cline -i

Note: Cline's agentic features (file edits, terminal execution) require the model to support function calling. qwen2.5-coder does support it — make sure you're using the qwen2.5-coder tag, not older qwen variants.

Aider (AI Pair Programmer)

Aider is a terminal-native AI pair programmer with deep Git integration. Unlike Cline, it has no GUI and no IDE dependency — you work entirely in your shell, and every change the model makes is automatically committed to Git with a generated commit message.

Install:

pip install aider-install
aider-install   # installs the latest aider into its own environment

Or directly via pip (into a venv):

pip install aider-chat

Configure for Ollama:

Point aider at your Ollama instance using the OLLAMA_API_BASE environment variable and prefix the model name with ollama/:

export OLLAMA_API_BASE=http://localhost:11434
aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M

For a remote Ollama server (via SSH tunnel), the same env var works:

export OLLAMA_API_BASE=http://localhost:11434   # tunnel already running
aider --model ollama/qwen2.5-coder:32b-instruct-q4_K_M myfile.py

Persist config in .aider.conf.yml (project-level):

model: ollama/qwen2.5-coder:32b-instruct-q4_K_M
auto-commits: false   # recommended: review commits manually

Start a session on specific files:

aider main.py utils.py

Aider reads those files, builds a repo map of the surrounding codebase, and enters an interactive chat. Describe the change you want; it edits the files and, unless auto-commits is disabled as in the config above, commits the result. Type /help for commands, /exit to quit.

Architect mode (stronger reasoning, lower cost):

aider --architect --model ollama/qwen2.5-coder:32b-instruct-q4_K_M \
      --editor-model ollama/qwen3:14b

The architect model reasons about the solution; the editor model writes the actual diffs. Useful for complex refactors where you want the bigger model thinking but not burning context on boilerplate.

Aider vs Cline — Which to Use

Both are free, open-source, and model-agnostic. The difference is philosophy and environment:

	Aider	Cline
Environment	Pure terminal / CLI	VS Code extension (GUI) + CLI 2.0
Git integration	First-class — auto-commits every change with a generated message	You handle Git yourself
Repo awareness	Builds a repo map, so the model can reason across files it hasn't opened	Works on files you explicitly include
Human oversight	Asks per-command; you see diffs before they land	Visual approval at each step in the GUI; auto-approve mode available
Agentic features	Pair programmer — edits files and runs tests	Full agent — edits files, runs terminal commands, controls browser
Multimodel	Architect + editor mode (two models cooperating)	Parallel subagents (since Feb 2026)
Best for	Terminal-first developers, Git-heavy workflows, large codebases	VS Code users who want visual oversight and broader agent automation

6. Working from the CLI on the Server Itself

On the server machine, no tunnel is needed — Ollama is already local.

Quick Interactive Chat

ollama run qwen2.5-coder:32b-instruct-q4_K_M

Type your prompt, hit Enter. Exit with /bye or Ctrl+D.

Pipe Input / Scripting

echo "Write a Python function to parse ISO 8601 dates" | ollama run qwen2.5-coder:32b-instruct-q4_K_M

Or with a heredoc for multi-line prompts:

ollama run qwen2.5-coder:32b-instruct-q4_K_M << 'EOF'
Review this code for bugs:

def parse_dates(dates):
    return [datetime.strptime(d, "%Y-%m-%d") for d in dates]
EOF
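For batch jobs, the same stdin pipe can be driven from Python with subprocess. This is a minimal sketch, assuming ollama is on the PATH; the run_prompt helper and its cmd parameter are hypothetical names introduced for illustration.

```python
import subprocess

MODEL = "qwen2.5-coder:32b-instruct-q4_K_M"


def run_prompt(prompt: str, cmd=("ollama", "run", MODEL),
               timeout: float = 600) -> str:
    """Send the prompt on stdin to the command and return its stdout."""
    result = subprocess.run(
        list(cmd),
        input=prompt,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()
```

Looping run_prompt over a list of source files gives a simple scripted review pass; the cmd parameter exists mainly so the helper can be exercised without a model loaded.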

REST API Directly

curl http://localhost:11434/api/chat \
  -d '{
    "model": "qwen2.5-coder:32b-instruct-q4_K_M",
    "messages": [{"role": "user", "content": "What does this SQL do: SELECT * FROM orders WHERE status = '\''pending'\''"}],
    "stream": false
  }'
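The same request can be built from Python with only the standard library, which sidesteps the shell-quoting gymnastics for prompts that contain quotes. This is a sketch rather than a client library: the helper names are hypothetical, and it assumes the non-streaming response carries the reply under message.content.

```python
import json
import urllib.request

MODEL = "qwen2.5-coder:32b-instruct-q4_K_M"


def build_chat_request(prompt: str, model: str = MODEL,
                       base_url: str = "http://localhost:11434"):
    """Build (but do not send) a non-streaming POST to /api/chat."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt and return the assistant's reply text."""
    req = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["message"]["content"]
```

For example, chat("What does this SQL do: SELECT * FROM orders WHERE status = 'pending'") needs no quote escaping at all.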

Streaming Output

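ollama run streams by default, and so does the API unless "stream" is set to false. Each streamed line is a separate JSON object, so jq can read the sequence directly and print the tokens as they arrive. The -j flag joins them without inserting a newline after every token, while curl -N and --unbuffered turn off output buffering:

curl -sN http://localhost:11434/api/chat \
  -d '{"model":"qwen2.5-coder:32b-instruct-q4_K_M","messages":[{"role":"user","content":"Hello"}],"stream":true}' \
  | jq --unbuffered -rj '.message.content // empty'; echo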
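The streamed response is newline-delimited JSON, so it can also be consumed from Python one line at a time. The generator below is a minimal sketch with a hypothetical name; it assumes each chunk carries its text under message.content and that the final chunk sets "done" to true.

```python
import json
from typing import Iterable, Iterator


def stream_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from an NDJSON /api/chat stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break  # final chunk: stop reading
```

An HTTP response returned by urllib.request.urlopen can be passed in directly, since iterating over it yields one line at a time; printing each fragment with end="" and flush=True reproduces the live typing effect.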

Cline CLI on the Server

Same installation as on any other machine — but here no tunnel or OLLAMA_HOST override is needed since Cline defaults to http://localhost:11434.

Summary

Scenario	Method
Browser UI (any client)	SSH tunnel port 3000, open `http://localhost:3000`
CLI on server	`ollama run <model>` directly
CLI on Linux/Mac client	SSH tunnel + `OLLAMA_HOST=http://localhost:11434 ollama run`
CLI on Windows (Git Bash)	SSH tunnel + `curl http://localhost:11434/api/chat`
CLI on Windows (PowerShell)	SSH tunnel + `$env:OLLAMA_HOST=...; ollama list`
Agentic coding (any client)	SSH tunnel + Cline CLI pointed at `http://localhost:11434`
