LM Studio
LM Studio is the friendliest way to get started with local AI. While Ollama is for developers who love the terminal, LM Studio is for everyone else — it provides a beautiful desktop application for finding, downloading, and chatting with hundreds of open-weight models, plus a local server that makes any model available to other apps.
Overview
Launched in 2023, LM Studio has become the most popular GUI for local AI, downloaded millions of times across macOS, Windows, and Linux. Its key strength is discoverability: a built-in browser connected to Hugging Face lets you search, filter, and download models with a single click — no command line required.
In April 2026, LM Studio runs the latest models including Llama 3.3, DeepSeek R1, Mistral Nemo, Phi-4, and Qwen 2.5 Coder. It supports GGUF and MLX formats, providing Apple Silicon users with the fastest possible local inference via the MLX framework. Its local server mode turns your machine into a privacy-preserving AI backend for any application.
Key Features
- Model Discovery Browser: Search and download models from Hugging Face directly within the app. Filters for model family, size, quantization, and hardware compatibility make finding the right model intuitive.
- Chat Interface: A clean, ChatGPT-like interface for conversations, with support for multi-turn context, system prompts, and character personas.
- Local Server Mode: Start a local OpenAI-compatible server with one click and use it as a drop-in backend for any app that supports OpenAI's API.
- MLX Support (Apple Silicon): Native MLX model format support for Apple M-series chips, providing 2-3x faster inference than GGUF on the same hardware.
- Multi-Model Loading: Load multiple models simultaneously and switch between them within the app without reloading.
- Hardware Performance Monitor: Real-time display of GPU/CPU load, VRAM usage, and tokens-per-second speed for each model.
- Prompt Templates: Built-in support for model-specific chat templates (ChatML, Llama, Alpaca, etc.) so models respond correctly out of the box.
How It Works
LM Studio uses llama.cpp for GGUF models and the MLX framework for Apple Silicon-native inference. When you load a model:
- Download: GGUF or MLX weights are downloaded and cached locally (~5-50GB depending on model size).
- Load: Model weights are memory-mapped into VRAM (and system RAM for offloading).
- Serve: An optional local server starts at `localhost:1234` with OpenAI-compatible endpoints.
- Infer: Requests are processed with hardware acceleration and responses are streamed back.
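Because the endpoints are OpenAI-compatible, these steps can be exercised with nothing but the standard library. A minimal sketch using `urllib` against the default port (the model identifier is an example; substitute whatever LM Studio shows for your loaded model):

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default server address

def build_chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(model: str, prompt: str) -> str:
    """POST to the local /v1/chat/completions endpoint and return the reply text."""
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With a model loaded and the server running:
# print(chat("llama-3.1-8b", "Hello!"))
```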
Technical Architecture:
- Inference Engines: llama.cpp (GGUF), MLX (Apple Silicon), ROCm (AMD experimental).
- Model Formats: GGUF (universal), MLX (Apple Silicon optimized).
- API: OpenAI-compatible REST API (`/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`).
- Platforms: macOS (10.15+), Windows (10+), Linux (Ubuntu 20.04+).
Use Cases
Private Document Analysis
- Load a high-quality model and analyze sensitive business documents, legal contracts, or personal notes without cloud exposure.
- Run an embedding model locally to build a private, offline semantic search system.
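One hedged sketch of the offline semantic search idea: embed documents through the local `/v1/embeddings` endpoint and rank them by cosine similarity. The embedding model id below is a placeholder; use one you have actually downloaded in LM Studio.

```python
import json
import math
import urllib.request

def embed(texts, model="text-embedding-nomic-embed-text-v1.5"):
    """Fetch embeddings from LM Studio's local /v1/embeddings endpoint.

    The model id above is a placeholder example.
    """
    req = urllib.request.Request(
        "http://localhost:1234/v1/embeddings",
        data=json.dumps({"model": model, "input": texts}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return [d["embedding"] for d in json.load(resp)["data"]]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank(query_vec, doc_vecs):
    """Return document indices, most similar to the query first."""
    sims = [cosine(query_vec, v) for v in doc_vecs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
```

Embed your corpus once, cache the vectors on disk, and at query time only the query text needs a round trip to the local server.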
App Development & Prototyping
- Use the local server as a free, private backend for testing AI-powered features in your app.
- Compare multiple models on the same prompt to find the best one for your use case.
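Comparing models through the server can be scripted. A sketch that sends the same prompt to several loaded models in turn (the model names in the example are placeholders):

```python
import json
import urllib.request

BASE = "http://localhost:1234/v1"  # LM Studio's default local server

def extract_reply(response: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]

def compare(models: list, prompt: str) -> dict:
    """Send the same prompt to each model; map model name to its reply."""
    results = {}
    for name in models:
        body = {"model": name, "messages": [{"role": "user", "content": prompt}]}
        req = urllib.request.Request(
            BASE + "/chat/completions",
            data=json.dumps(body).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            results[name] = extract_reply(json.load(resp))
    return results

# Model names are examples; use the identifiers LM Studio shows as loaded:
# compare(["llama-3.1-8b", "qwen2.5-coder-14b"], "Explain mmap in one sentence.")
```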
Learning & Exploration
- Explore different model families, sizes, and quantization levels to understand how they affect output quality.
- No API bills — run hundreds of experiments at zero cost.
Offline AI Assistance
- Chat with a local model in environments without internet access: flights, remote locations, secure networks.
Getting Started
Step 1: Download LM Studio
- Visit lmstudio.ai and download for your OS.
- Install and launch the application.
- Complete the first-time setup wizard (hardware detection is automatic).
Step 2: Find and Download a Model
- Click the "Discover" tab (search icon in the left sidebar).
- Search for a model: try `"llama-3.3"` or `"deepseek-r1"`.
- Filter by "My Hardware" to see only models that fit your VRAM.
- Click Download on your preferred quantization (Q4_K_M is usually the best balance).
Recommended starting models by hardware:
| VRAM | Recommended Model |
|---|---|
| 8GB | Llama 3.2 3B Q8 or DeepSeek R1 7B Q4 |
| 16GB | Llama 3.1 8B Q8 or Qwen2.5-Coder 14B Q4 |
| 24GB | Mistral Nemo 12B Q8 or DeepSeek R1 32B Q4 |
| 48GB+ | Llama 3.3 70B Q4 or DeepSeek R1 70B Q4 |
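The rough rule of thumb behind this table: weight memory is parameter count times bits per weight divided by 8, plus headroom for the KV cache and activations. A small sketch (the 1.2 overhead factor is an assumption; long contexts need more):

```python
def estimated_vram_gb(params_b: float, bits_per_weight: float,
                      overhead: float = 1.2) -> float:
    """Rough VRAM estimate for a quantized model.

    params_b: parameter count in billions.
    bits_per_weight: ~4.5 for Q4_K_M, ~8.5 for Q8_0 (approximate values).
    overhead: assumed fudge factor for KV cache and activations.
    """
    weight_gb = params_b * bits_per_weight / 8  # billions of params -> GB
    return weight_gb * overhead

# An 8B model at Q4_K_M (~4.5 bits/weight) needs roughly 5-6 GB:
print(round(estimated_vram_gb(8, 4.5), 1))  # prints 5.4
```

This matches the table's pattern: each VRAM tier fits either a larger model at Q4 or a smaller one at Q8.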
Step 3: Chat with the Model
- Click the "Chat" tab (speech bubble icon).
- Select your downloaded model from the dropdown at the top.
- Click "Load Model" — model loads in 10-30 seconds.
- Start chatting in the input box at the bottom.
Step 4: Start the Local Server
- Click the "Local Server" tab (≡ icon).
- Select your model and click "Start Server".
- The server runs at `http://localhost:1234`.
- Use it with the OpenAI SDK:

```python
from openai import OpenAI

# Point the SDK at the local server; the API key can be any non-empty string.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="llama-3.1-8b",  # use the model identifier shown in LM Studio
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
Best Practices
- Choose Q4_K_M quantization for the best balance of quality and speed.
- Use MLX models on Apple Silicon for significantly faster inference.
- Preload your most-used model at startup to avoid wait times.
- Monitor VRAM usage — if the bar is red, try a smaller or more quantized model.
Pricing & Access
- Personal Use: LM Studio is completely free for personal, non-commercial use.
- Commercial Use: Requires a commercial license (contact LM Studio for pricing).
- LM Studio Teams: Enterprise plan for centralized model management, SSO, and team access.
Limitations
- Storage Requirements: Models range from 5GB (7B Q4) to 40GB+ (70B Q4) — requires significant disk space.
- VRAM Constraints: Larger models require substantial GPU VRAM (8-80GB). CPU fallback is very slow.
- Model Quality vs. Cloud: Even the best local models can't match frontier cloud models (GPT-4.1, Claude) on complex tasks.
- Commercial License: Free version is personal use only — businesses need a commercial license.
Community & Support
- Official Website: lmstudio.ai
- Documentation: lmstudio.ai/docs
- Discord: Official LM Studio Discord — active support community.
- Reddit: r/LocalLLaMA — broader local AI community.
- GitHub: github.com/lmstudio-ai