Run a Local LLM with Ollama: Home Setup Guide
TL;DR: To run a local LLM with Ollama, install Ollama for your OS, then run ollama run gemma3 to download and chat with a model completely offline. A 4B model runs fine on 8 GB of RAM; larger models want a GPU with 8 GB or more of VRAM. Nothing leaves your machine — no API keys, no cloud, no per-token bill.
Why run a local LLM with Ollama?
Three reasons, in the order that usually matters. Privacy: your prompts and files never leave the box. Cost: after the download, inference is free — no metered API. And it works offline, which is handy on a flaky connection or an air-gapped homelab.
Ollama is the path of least resistance. It bundles the model runtime (llama.cpp under the hood), handles downloads and quantization, and exposes a local HTTP API on port 11434 that mimics the shape most tools already expect. You install one thing and you're chatting in about two minutes.
The tradeoff is honest: a 4B or 12B model you can run at home is not going to match a frontier cloud model on hard reasoning. But for summarizing, drafting, classification, code autocomplete, and RAG over your own documents, a local model is more than enough — and it's yours.
What hardware do you need? (Ollama GPU requirements)
You do not need a GPU. Ollama runs on CPU and system RAM, and small models are genuinely usable that way. A GPU mostly buys you speed — tokens per second — not capability.
The rule of thumb that actually predicts whether a model will run: a 4-bit quantized model needs roughly its download size resident in memory, plus a bit of headroom for context. So an 8 GB model wants about 10–12 GB free. If it fits in your GPU's VRAM, it runs on the GPU. If it doesn't, Ollama spills the rest into system RAM and runs slower. Check where a model actually landed with ollama ps:
NAME ID SIZE PROCESSOR UNTIL
llama3:70b bcfb190ca3a7 42 GB 100% GPU 4 minutes from now
That PROCESSOR column is the one to watch. 100% GPU is what you want; 48%/52% CPU/GPU means the model didn't fully fit and you're paying for it in speed.
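If you'd rather check this from a script than by eye, the PROCESSOR column is easy to parse. Here is a minimal sketch, assuming the column only ever takes the shapes Ollama prints today (100% GPU, 100% CPU, or an N%/M% CPU/GPU split). The function name gpu_percent is hypothetical, not part of Ollama.

```python
import re
from typing import Optional

# Matches a split such as "48%/52% CPU/GPU" (CPU share first, GPU share second).
_SPLIT = re.compile(r"(\d+)%/(\d+)%\s+CPU/GPU")
# Matches a single-device entry such as "100% GPU" or "100% CPU".
_SINGLE = re.compile(r"(\d+)%\s+(GPU|CPU)\b")


def gpu_percent(ps_line: str) -> Optional[int]:
    """Return the GPU share (0-100) found in one line of `ollama ps` output.

    Returns None when the line has no recognizable PROCESSOR value,
    for example the header row.
    """
    split = _SPLIT.search(ps_line)
    if split:
        return int(split.group(2))
    single = _SINGLE.search(ps_line)
    if single:
        pct, device = int(single.group(1)), single.group(2)
        # "100% CPU" means nothing landed on the GPU.
        return pct if device == "GPU" else 100 - pct
    return None
```

Feed it the lines of ollama ps output (for instance from subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout.splitlines()) and treat anything below 100 as a sign the model spilled into system RAM.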
On the GPU side, the official support baseline is: Nvidia cards with compute capability 5.0+ and driver 531 or newer (so a GTX 750 Ti and up, though anything modern is fine); AMD Radeon via ROCm; and Apple Silicon via Metal, where unified memory means the whole RAM pool doubles as VRAM. That last point is why an M-series Mac with 16 GB or 32 GB is such a good local-LLM machine.
How do I install Ollama?
Pick your OS. All three get you the same ollama command.
macOS
Download the app from ollama.com/download and drag it to Applications, or use the install script. macOS 14 (Sonoma) or later is required.
curl -fsSL https://ollama.com/install.sh | sh
Linux
One line. The script sets Ollama up as a systemd service that starts on boot.
curl -fsSL https://ollama.com/install.sh | sh
Confirm it's alive:
ollama -v
systemctl status ollama
Windows
Either download OllamaSetup.exe from the site and double-click it, or run this in PowerShell:
irm https://ollama.com/install.ps1 | iex
The installer registers a background service and adds ollama to your PATH. Open a fresh PowerShell window afterward so the PATH change takes effect, then check ollama -v.
I've had the "command not found" thing happen right after install — nine times out of ten it's a terminal that was open before installation and hasn't picked up the new PATH. Close it and open a new one.
Running your first model
Just run a model name. Ollama downloads it on first use, then drops you into a chat:
ollama run gemma3
Type your question, hit enter. Use /bye to exit, and """ to wrap multi-line input. If the model is multimodal, you can point it at a local image:
ollama run gemma3 "What's in this image? /home/you/desktop/photo.png"
The everyday commands you'll actually use:
- ollama pull gemma3: download a model without running it
- ollama ls: list models you've downloaded (ollama list works too)
- ollama ps: show what's currently loaded, and whether it's on GPU or CPU
- ollama stop gemma3: unload a model from memory now
- ollama rm gemma3: delete it from disk to reclaim space
There's also an interactive launcher — run ollama with no arguments and you get a menu to pick a model or wire Ollama into tools like Claude Code, Codex, or OpenCode. Nice for discovery; the explicit commands above are faster once you know what you want.
Which model should you pick for your hardware?
Start small and size up. A 1B model that answers instantly beats a 27B model that swaps to disk and takes 40 seconds a reply. Here are verified download sizes for the Gemma 3 family, which is a sensible, well-documented default across a range of machines:
| Model tag | Download size | Comfortable on | Good for |
|---|---|---|---|
| gemma3:270m | 292 MB | Anything, even a Pi | Testing, tiny tasks |
| gemma3:1b | 815 MB | 8 GB RAM, no GPU | Fast text tasks, autocomplete |
| gemma3:4b | 3.3 GB | 8 GB RAM / 6 GB GPU | General chat, summarizing, vision |
| gemma3:12b | 8.1 GB | 16 GB RAM / 12 GB GPU | Better reasoning, longer docs |
| gemma3:27b | ~17 GB | 32 GB RAM / 24 GB GPU | Best local quality in the family |
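
To turn the table and the earlier memory rule of thumb into something you can run, here is a rough sketch. It assumes a flat 1.5x headroom factor over the download size, which is the conservative end of the "8 GB model wants about 10–12 GB free" rule. That factor and the function name largest_fitting_tag are illustrative assumptions, not anything Ollama computes.

```python
from typing import Dict, Optional

# Download sizes in GB, copied from the table above (27b is approximate).
GEMMA3_SIZES_GB: Dict[str, float] = {
    "gemma3:270m": 0.292,
    "gemma3:1b": 0.815,
    "gemma3:4b": 3.3,
    "gemma3:12b": 8.1,
    "gemma3:27b": 17.0,
}

# Assumption: memory needed is about download size x 1.5 (weights plus context headroom).
HEADROOM = 1.5


def largest_fitting_tag(
    free_gb: float,
    sizes: Dict[str, float] = GEMMA3_SIZES_GB,
    headroom: float = HEADROOM,
) -> Optional[str]:
    """Return the biggest model tag whose estimated footprint fits in free_gb."""
    fitting = [(size, tag) for tag, size in sizes.items() if size * headroom <= free_gb]
    # max() compares by size first, so the largest model that fits wins.
    return max(fitting)[1] if fitting else None
```

Pass in free VRAM if you want the model fully on the GPU, or free system RAM for a CPU-only run. The estimate ignores large context windows, so treat a borderline fit as a "no".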
Google's newer gemma4 family (released April 2026 under an Apache 2.0 license) is now what Ollama's own README points to by default, with edge tags e2b/e4b and larger 26b/31b builds. Sizes differ from Gemma 3, so check the current numbers on the Ollama model library before pulling. Other solid local families worth a look: llama3.2 (small, fast), qwen2.5-coder (code), and deepseek-r1 distills (reasoning).
Making Ollama actually useful (the settings that matter)
The context window default will bite you
This is the single most common "why is my model dumb" cause. By default Ollama uses a context window of 4096 tokens, regardless of what the model supports. Feed it a long document and it silently truncates. Raise it globally when starting the server:
OLLAMA_CONTEXT_LENGTH=32768 ollama serve
Or per-session inside a chat with /set parameter num_ctx 32768, or per-request via the API's num_ctx option. Bigger context costs more memory, so don't crank it to the model's max unless you need it.
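For the per-request route, a minimal sketch using only Python's standard library might look like the following. It assumes the server is on the default localhost:11434 and that you have already pulled the model. The helper names build_chat_request and send_chat are hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    model: str,
    prompt: str,
    num_ctx: int,
    host: str = "http://localhost:11434",  # assumption: default local server
) -> urllib.request.Request:
    """Build a POST to /api/chat that overrides the context window for this call only."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Per-request override; it does not change the server-wide default.
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the server running, send_chat(build_chat_request("gemma3", "Summarize this report: ...", 32768)) returns the reply as a string.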
Keep models loaded (or don't)
By default a model stays in memory for 5 minutes after its last use, then unloads. If you're hitting it repeatedly, that reload delay is annoying. Set OLLAMA_KEEP_ALIVE — a duration like 30m, or -1 to keep it resident until you stop it.
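The same setting can also be passed per request: the API accepts a keep_alive field, and a request with no prompt simply loads (or unloads) the model. A small sketch, assuming the default local server; build_keep_alive_request is a hypothetical helper name.

```python
import json
import urllib.request
from typing import Union


def build_keep_alive_request(
    model: str,
    keep_alive: Union[str, int],
    host: str = "http://localhost:11434",  # assumption: default local server
) -> urllib.request.Request:
    """Build a prompt-less POST to /api/generate that only sets residency.

    keep_alive takes a duration string such as "30m", -1 to keep the model
    loaded until stopped, or 0 to unload it right away.
    """
    payload = {"model": model, "keep_alive": keep_alive}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending build_keep_alive_request("gemma3", -1) with urllib.request.urlopen warms the model up before the first real prompt; the same call with 0 frees the memory again.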
Move the model directory
Models are big and they pile up. Default locations:
- macOS: ~/.ollama/models
- Linux: /usr/share/ollama/.ollama/models
- Windows: C:\Users\<username>\.ollama\models
Point them at a bigger drive with the OLLAMA_MODELS environment variable. On Linux, the ollama service user needs write access to the new path, so sudo chown -R ollama:ollama /your/new/path after creating it.
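If a script needs to know where models currently live, the lookup order described above can be sketched like this. It assumes OLLAMA_MODELS always wins when set and otherwise falls back to the defaults listed in this section. The function name models_dir is hypothetical, and the Linux default applies to the systemd service install.

```python
import os
import platform
from pathlib import Path
from typing import Mapping, Optional


def models_dir(
    env: Optional[Mapping[str, str]] = None,
    system: Optional[str] = None,
) -> Path:
    """Return the directory Ollama is expected to store models in.

    env and system are injectable so the logic can be checked without
    touching the real environment.
    """
    env = os.environ if env is None else env
    system = system or platform.system()
    override = env.get("OLLAMA_MODELS")
    if override:
        return Path(override)
    if system == "Linux":
        # Default for the systemd service, which runs as the ollama user.
        return Path("/usr/share/ollama/.ollama/models")
    # macOS and Windows both default to .ollama/models under the home directory.
    return Path.home() / ".ollama" / "models"
```

Calling models_dir() with no arguments reads the real environment, which makes it a quick way to confirm that an OLLAMA_MODELS change is visible to your shell. The Ollama service itself needs the variable set in its own environment, not just in your terminal.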
Reaching it from another machine (homelab)
Ollama binds to 127.0.0.1:11434 by default: localhost only, which is the safe choice. To use it from other machines on your LAN (the local AI homelab case), set OLLAMA_HOST so it listens on all interfaces. On a Linux systemd install:
sudo systemctl edit ollama.service
Add, under [Service]:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Then run sudo systemctl daemon-reload && sudo systemctl restart ollama.
Do this only behind your own firewall. Ollama has no authentication of its own, so an open port 11434 on the public internet is an open door. If you need remote access, put it behind a reverse proxy with auth, or a tunnel like Tailscale/Cloudflare, not a raw port forward.
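Once the server listens on the LAN, a quick way to confirm another machine can reach it is to ask for its model list via the /api/tags endpoint. This is a minimal sketch: the IP address in the usage note is a placeholder, and the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List


def tags_url(host: str, port: int = 11434) -> str:
    """URL of the endpoint that lists models stored on the server."""
    return f"http://{host}:{port}/api/tags"


def model_names(tags_response: Dict[str, Any]) -> List[str]:
    """Pull the model names out of a decoded /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_remote_models(host: str, timeout: float = 5.0) -> List[str]:
    """Query a remote Ollama server; raises URLError if it is unreachable."""
    with urllib.request.urlopen(tags_url(host), timeout=timeout) as resp:
        return model_names(json.load(resp))
```

From a laptop on the same network, list_remote_models("192.168.1.50") should return the tags you pulled on the server. A timeout or connection error usually means OLLAMA_HOST was not picked up or a firewall is in the way.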
Using it from code
The local REST API is the whole point once you move past chatting. Native endpoint:
curl http://localhost:11434/api/chat -d '{
"model": "gemma3",
"messages": [{ "role": "user", "content": "Why is the sky blue?" }],
"stream": false
}'
There are official Python (pip install ollama) and JavaScript (npm i ollama) libraries, plus an OpenAI-compatible endpoint so most existing SDKs and agent frameworks work by just pointing their base URL at your local server.
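To illustrate the OpenAI-compatible route without installing any SDK, here is a sketch that talks to the /v1/chat/completions path with the standard library. It assumes the default local server; the helper names build_openai_style_request and first_reply are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict


def build_openai_style_request(
    model: str,
    prompt: str,
    base_url: str = "http://localhost:11434/v1",  # assumption: default local server
) -> urllib.request.Request:
    """Build a chat-completions request in the OpenAI wire format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body: Dict[str, Any]) -> str:
    """Extract the assistant text from an OpenAI-style response body."""
    return response_body["choices"][0]["message"]["content"]
```

With a real SDK the idea is the same: set its base URL to http://localhost:11434/v1 and pass any placeholder API key, since the local server does not check it.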
Common problems and how to fix them
- Model runs on CPU when you have a GPU. Run ollama ps to confirm. On Nvidia, check your driver is 531+. On Linux after a suspend/resume, the GPU sometimes drops out; reload the driver with sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm.
- Out of memory / very slow. The model doesn't fit. Drop to a smaller tag (12b instead of 27b) or lower your num_ctx.
- Downloads crawl on Windows in WSL2. Known networking quirk: disabling "Large Send Offload V2" on the vEthernet (WSL) adapter fixes it.
- Upgrading. macOS and Windows auto-update from the tray/menu bar. On Linux, re-run the install script.
FAQ
Is Ollama free?
Yes. Ollama is open source (MIT licensed), and the local models it runs are free to download. You pay only in disk space, electricity, and the price of your hardware. Individual models carry their own licenses — Gemma 4, for instance, ships under Apache 2.0 — so check the license if you plan to build a commercial product on one.
Can I run a local LLM without a GPU?
Yes. Ollama runs on CPU and system RAM. Small models (1B–4B) are responsive on a modern laptop with no dedicated GPU. Larger models still run CPU-only, just slower. A GPU improves tokens-per-second, not what the model can do.
Does Ollama send my prompts to the cloud?
Not when you run local models — everything stays on your machine. Ollama does now offer optional cloud-hosted models and web search as separate features; if you want a strictly local setup, you can disable cloud features entirely by setting OLLAMA_NO_CLOUD=1 or disable_ollama_cloud in ~/.ollama/server.json.
How much disk space do I need?
Budget by model. A 1B model is under a gigabyte; a 12B model is around 8 GB; 27B–31B models run 17–20 GB each. If you collect a few, a spare 100 GB fills up faster than you'd think — which is why moving OLLAMA_MODELS to a big drive is worth doing early.
What's the difference between Ollama and llama.cpp?
Ollama uses llama.cpp as its inference backend and wraps it in model management, an HTTP API, and a friendly CLI. If you want maximum control over quantization and runtime flags, use llama.cpp directly. For running models at home with the least friction, Ollama is the better starting point.
Can I self-host an LLM for my whole household or team?
Yes — that's the homelab use case. Run Ollama on one capable machine, set OLLAMA_HOST so it listens on your LAN, and point clients (a chat UI like Open WebUI, editor plugins, or your own scripts) at it. Keep it behind your firewall and add authentication via a reverse proxy if you expose it beyond the local network.