Ollama
Ollama is the simplest way to run powerful open-weight AI models locally on your own hardware. One command — ollama run llama3.3 — downloads and runs a frontier-class model with no cloud account, no subscription, and no data leaving your machine. It then serves the model via an OpenAI-compatible API that any AI-powered application can use as a drop-in replacement for ChatGPT.
Overview
Launched in September 2023, Ollama transformed local AI from a complex DIY project into a two-minute setup. It handles all the complexity of model quantization, GPU memory management, and API serving so developers can focus on building.
As of April 2026, Ollama supports hundreds of models from its official library — including Llama 3.3, DeepSeek R1, Mistral, Qwen 2.5 Coder, Gemma 3, and Phi-4 — plus any custom GGUF model from Hugging Face. It runs on Apple Silicon (via Metal), NVIDIA (via CUDA), and AMD (via ROCm) GPUs, and falls back to CPU when no GPU is available.
Key Features
- One-Command Model Management: `ollama pull llama3.3` downloads and caches the model; `ollama run llama3.3` starts an interactive chat. That's it.
- OpenAI-Compatible API: Ollama serves a local REST API at `localhost:11434` that is compatible with the OpenAI Python SDK — point any OpenAI-powered app at your local machine.
- Hardware-Accelerated Inference: Automatically detects and uses Apple Metal, NVIDIA CUDA, or AMD ROCm for GPU-accelerated inference. Falls back to optimized CPU (AVX2) when needed.
- Multi-Model Serving: Run multiple models simultaneously and switch between them via the API.
- Custom Models via Modelfile: Create custom system prompts, model blends, and fine-tuned variants using a simple `Modelfile` format.
- Vision Model Support: Run vision models like LLaVA and Moondream locally to analyze images without sending them to the cloud.
- Streaming Responses: Full streaming support for real-time token output.
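A Modelfile that customizes a base model can be as short as this (a minimal sketch; the model name `reviewer`, system prompt, and parameter value are illustrative):

```
# Modelfile: build a custom assistant on top of a base model
FROM llama3.3
PARAMETER temperature 0.7
SYSTEM """
You are a concise senior code reviewer. Answer with short, direct feedback.
"""
```

Build and run it with `ollama create reviewer -f Modelfile`, then `ollama run reviewer`.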
How It Works
Ollama uses llama.cpp under the hood — a highly optimized C++ inference engine for running quantized GGUF models. When you run a model:
- Download: Pulls the quantized GGUF model weights from the Ollama registry.
- Load: Loads model layers into GPU VRAM (and system RAM for overflow).
- Serve: Starts a local HTTP server on port 11434.
- Infer: Processes prompts using optimized matrix operations on your GPU.
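The VRAM needed at the load step can be estimated with simple arithmetic: weights take roughly parameter count times bits per weight, and Q4-class quantization uses about 4.5 bits per weight. This is a rough sketch; real usage adds KV-cache and runtime overhead that grow with context length:

```python
def approx_weights_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of quantized model weights in GB (decimal)."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A 7B model at ~4.5 bits/weight is under 4 GB of weights, which is why
# it fits on an 8 GB GPU with room left over for the KV cache.
print(f"7B  -> {approx_weights_gb(7):.1f} GB")   # 3.9 GB
print(f"70B -> {approx_weights_gb(70):.1f} GB")  # 39.4 GB
```

The same arithmetic explains the ~40 GB figure commonly quoted for 70B models.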
Technical Architecture:
- Inference Engine: llama.cpp (C++, cross-platform).
- Model Format: GGUF (quantized, memory-mapped).
- GPU Backends: Metal (Apple), CUDA (NVIDIA), ROCm (AMD), Vulkan (experimental).
- API: OpenAI-compatible REST API (v1/chat/completions, v1/completions, v1/embeddings).
- License: MIT (fully open source).
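The request bodies for those endpoints are the same JSON the OpenAI API uses. As a sketch, this is a minimal payload you would POST to `v1/chat/completions`:

```python
import json

# Minimal body for POST http://localhost:11434/v1/chat/completions
payload = {
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,  # True streams tokens back incrementally
}
print(json.dumps(payload, indent=2))
```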
Use Cases
Privacy-First Development
- Run code generation models locally so your proprietary code never leaves your machine.
- Analyze sensitive documents, customer data, and internal reports without cloud exposure.
Cost-Free Agent Infrastructure
- Power Aider, Cline, and OpenClaw with local models for unlimited, zero-cost agent runs.
- Run multi-agent workflows on your own hardware without per-token API fees.
Offline & Air-Gapped Environments
- AI assistance in locations without reliable internet (travel, remote work, secure facilities).
- Air-gapped enterprise environments where cloud AI is not permitted.
Model Evaluation
- Quickly test different models and quantization levels for specific tasks.
- Compare local model performance against cloud APIs for the same prompts.
Getting Started
Step 1: Install Ollama
```shell
# macOS / Linux:
curl -fsSL https://ollama.com/install.sh | sh

# macOS (Homebrew):
brew install ollama

# Windows: download the installer from https://ollama.com/download
```
Step 2: Run Your First Model
```shell
# Pull and run Llama 3.3 (70B; the smaller models below run on most hardware):
ollama run llama3.3

# Or try DeepSeek R1 for reasoning tasks:
ollama run deepseek-r1:7b

# For code-specific tasks:
ollama run qwen2.5-coder:7b
```
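The model references above follow a `name:tag` convention, where an omitted tag means `latest`. A small helper to split them (the function name is my own):

```python
def split_model_ref(ref: str) -> tuple[str, str]:
    """Split an Ollama-style model reference into (name, tag)."""
    name, sep, tag = ref.partition(":")
    return (name, tag) if sep else (name, "latest")

print(split_model_ref("deepseek-r1:7b"))  # ('deepseek-r1', '7b')
print(split_model_ref("llama3.3"))        # ('llama3.3', 'latest')
```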
Step 3: Use the API (OpenAI-Compatible)
```python
from openai import OpenAI

# Point the OpenAI SDK at your local Ollama instance:
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Explain quantum entanglement simply."}],
)
print(response.choices[0].message.content)
```
Step 4: Connect to Cline or Aider
```shell
# In Aider, point to local Ollama:
aider --model ollama/deepseek-r1:7b

# In Cline (VS Code):
# Settings → API Provider → Ollama → Model: deepseek-r1:7b
```
Step 5: Manage Models
```shell
# List installed models:
ollama list

# Remove a model:
ollama rm llama3.3

# See running models:
ollama ps
```
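For scripting, `ollama list` prints a whitespace-aligned table whose first column is the model name. A sketch that pulls the names out (the sample output below is illustrative, not real registry data):

```python
# Illustrative `ollama list` output; real IDs, sizes, and dates will differ.
sample = """\
NAME                ID              SIZE      MODIFIED
qwen2.5-coder:7b    abc123def456    4.7 GB    3 weeks ago
deepseek-r1:7b      789abc012def    4.7 GB    2 days ago
"""

def installed_models(listing: str) -> list[str]:
    """Return model names from `ollama list`-style output (skips the header row)."""
    return [line.split()[0] for line in listing.strip().splitlines()[1:]]

print(installed_models(sample))  # ['qwen2.5-coder:7b', 'deepseek-r1:7b']
```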
Choosing the Right Model
| Task | Recommended Model | Min VRAM |
|---|---|---|
| General chat | llama3.1:8b | 8GB |
| Coding | qwen2.5-coder:7b | 8GB |
| Reasoning | deepseek-r1:7b | 8GB |
| Vision | llava:7b | 8GB |
| Maximum quality | llama3.3:70b | 40GB |
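The table above can be turned into a small lookup that filters recommendations by the VRAM you actually have (a sketch; the tags and thresholds mirror the table, and `llama3.1:8b` is an assumption for the 8 GB general-chat slot):

```python
# (task, model tag, minimum VRAM in GB), mirroring the table above
RECOMMENDED = [
    ("general chat", "llama3.1:8b", 8),
    ("coding", "qwen2.5-coder:7b", 8),
    ("reasoning", "deepseek-r1:7b", 8),
    ("vision", "llava:7b", 8),
    ("maximum quality", "llama3.3:70b", 40),
]

def models_that_fit(vram_gb: int) -> list[tuple[str, str]]:
    """Return (task, model) pairs whose minimum VRAM fits the given budget."""
    return [(task, tag) for task, tag, need in RECOMMENDED if need <= vram_gb]

print(models_that_fit(8))  # every 7B/8B option, but not the 70B model
```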
Best Practices
- Start with smaller models (7B) and scale up as needed.
- Use quantized Q4_K_M for the best quality/speed trade-off on consumer hardware.
- Keep Ollama running as a background service (`ollama serve`) for fast response times.
Pricing & Access
- Ollama Tool: Completely free and open-source (MIT license).
- Models: Free to download from the Ollama registry.
- No Subscription: Pay nothing except your hardware electricity costs.
Limitations
- Hardware Requirements: Quality degrades significantly on CPU-only inference. 8GB+ VRAM recommended for 7B models.
- Speed vs. Cloud: Local inference on consumer hardware is slower than cloud APIs for the same model size.
- Model Quality Ceiling: Even the best local models (70B) can lag behind frontier cloud models (Claude, GPT-4.1) on complex reasoning.
- No GUI: Terminal and API-based — use LM Studio if you want a graphical interface.
Community & Support
- Official Website: ollama.com
- GitHub: github.com/ollama/ollama (60K+ stars)
- Model Library: ollama.com/library
- Discord: Active community for troubleshooting hardware/model issues.
- Reddit: r/ollama