Ollama vs LM Studio vs Jan vs LocalAI vs vLLM (2026)
Five local LLM runtimes, five very different shapes. Two of them (Ollama, LM Studio) are aimed at a developer running a model on a laptop. One (Jan) wraps the same engine with a privacy-first desktop app. Two (LocalAI, vLLM) are server products you put behind a load balancer. We pulled every claim below from the project’s own README, docs, or pricing page — the goal is a decision, not a benchmark shootout we can’t reproduce.

TL;DR + decision tree
- Want a CLI you can script with the smoothest install on the planet? Ollama. One binary, one command, MCP server included.
- Want a desktop app with a model browser and a built-in chat UI? LM Studio. Closed binary, but the most polished onboarding for a non-CLI user.
- Want privacy-first, fully open-source desktop with project workspaces and an extension model? Jan. AGPLv3, telemetry-off by default.
- Want a self-hosted OpenAI-compatible drop-in that runs LLMs, embeddings, image and audio models behind one API and natively orchestrates MCP tools? LocalAI.
- Want production GPU inference at scale with continuous batching and PagedAttention? vLLM. Not a desktop tool — it’s a server.
We’ll cover each in detail below — the matrix first, then per-tool deep dives, then a memory-by-RAM section, then the MCP integration question, which is how most readers arrive at this comparison in the first place.
What local LLM runtimes actually do
A “local LLM runtime” is the software that loads model weights into RAM (or VRAM), accepts a prompt, runs forward passes, and streams tokens back. It is not the model. The model is a multi-gigabyte file of weights — the runtime is the inference engine that executes those weights on your hardware.
Three concepts you need before the matrix makes sense:
- Quantization. Llama 3 8B in full precision (FP16) is about 16GB. Quantization compresses the weights down to 4 or 5 bits per parameter, reducing the same model to ~4–6GB at a small quality cost. Q4_K_M is the sweet-spot quant most consumer runtimes use; Q5_K_M is slightly larger and slightly better.
- Formats. GGUF (the successor to GGML) is the dominant CPU+GPU consumer format, used by llama.cpp/Ollama/LM Studio/Jan. AWQ and GPTQ are GPU-optimized quantization formats favored by vLLM and TGI. Safetensors holds raw FP16 weights for unquantized inference.
- Inference engine. Most consumer runtimes bundle llama.cpp under the hood (Ollama, LM Studio, Jan, LocalAI all do, in different forms). vLLM is its own engine, written for high-throughput GPU serving. When you choose a runtime you are mostly choosing a wrapper around llama.cpp — except for vLLM, which is a fundamentally different shape of software.
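A quick back-of-envelope for how bits-per-weight turns into file size. The effective rate for Q4_K_M is roughly 4.7 bits per weight per the llama.cpp quant tables — treat these as estimates, not quotes:

```
file size ≈ params × effective bits-per-weight ÷ 8
Llama 3 8B at Q4_K_M:  8.0e9 × ~4.7 ÷ 8 ≈ 4.7 GB
Llama 3 70B at Q4_K_M: 7.0e10 × ~4.7 ÷ 8 ≈ 41 GB
```

Those numbers line up with the sizing table later in this post.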
The split that matters: consumer runtimes (Ollama, LM Studio, Jan) optimize for one user on one machine; server runtimes (LocalAI, vLLM) optimize for many concurrent users on shared hardware. Picking the wrong category is the most common mistake — vLLM is a terrible laptop tool, Ollama is a terrible production-multi-tenant tool. Both work, neither fits.
If you’re newer to running models locally, our llama.cpp skill cookbook covers the engine that sits underneath three of the five tools here. The rest of this post assumes you understand that distinction.
Side-by-side matrix
Every cell sourced from the project’s own README/docs/pricing as of 2026-05-08. License rows verified against repo LICENSE files. MCP support rows verified against official documentation only — community wrappers excluded.
| Dimension | Ollama | LM Studio | Jan | LocalAI | vLLM |
|---|---|---|---|---|---|
| Shape | CLI + daemon | Desktop app | Desktop app | Server (Docker) | Server (Python/Docker) |
| License | MIT | Closed binary | AGPLv3 | MIT | Apache 2.0 |
| macOS | Yes (universal) | Yes (universal) | Yes (universal) | Yes (Docker) | No |
| Windows | Yes (native) | Yes (native) | Yes (native) | Yes (Docker/WSL2) | No (WSL2 only) |
| Linux | Yes (native) | Yes (AppImage) | Yes (native) | Yes (Docker) | Yes (native) |
| Apple Silicon (Metal) | Yes | Yes | Yes | Yes (via llama.cpp) | No |
| Nvidia (CUDA) | Yes | Yes | Yes | Yes | Yes (primary) |
| AMD (ROCm) | Yes (Linux) | Yes (Linux) | Yes (Linux) | Yes | Yes |
| OpenAI API | Yes (/v1) | Yes (/v1) | Yes (/v1) | Yes (drop-in) | Yes (/v1) |
| MCP server (in directory) | Yes (/servers/ollama) | No | No | No | No |
| MCP client (calls MCP tools) | No (chat-only) | No | Browser MCP add-on | Yes (native in agent flows) | No |
| Quant formats | GGUF | GGUF | GGUF | GGUF + multiple | AWQ, GPTQ, GGUF, FP16 |
| Model browser | ollama.com library | Built-in HF browser | Built-in HF browser | Manifest YAML | HF model name in CLI |
| Web UI | No (third-party) | Bundled | Bundled | Bundled (basic) | No |
| Best for | Scripting, MCP, Claude Code | Non-CLI desktop user | Privacy-first desktop | OpenAI replacement | Production multi-tenant |
Three takeaways:
- Only Ollama has a first-class MCP server in the directory; the other four are reachable via OpenAI-compatible gateways, but you don’t get a clean install card.
- vLLM is the only one that isn’t macOS-native. Apple Silicon support was a long-standing tracking issue that, as of 2026-05-08, still has no production path; Mac users serving production inference run vLLM in a Linux VM or on a remote GPU box.
- LM Studio is the only closed-source binary, and that’s a real consideration for anyone with an audit-everything compliance posture.
Ollama — install + recipe
What it does best
Ollama wins on developer ergonomics. One install script, one ollama run llama3.1, and the daemon is serving an OpenAI-compatible API on localhost:11434 — no GPU configuration, no Docker, no Python venv. The unique angle for this audience: it is the only runtime in the comparison with a first-class MCP server in our catalog, which means Claude Code can delegate subtasks to a local model through a documented install card rather than a hand-rolled OpenAI gateway. The CLI is scriptable, the model library is curated, and the daemon survives reboots.
Pick this if you...
- Want a local model wired into Claude Code through MCP without writing a custom gateway.
- Live in the terminal — pulling a model is one command, swapping models is one command, scripting a batch job is a few lines of bash.
- Need an OpenAI-compatible /v1/chat/completions endpoint locally without standing up a server stack.
- Want one daemon you can leave running and call from multiple tools (editor plugins, CLI scripts, MCP clients) on the same machine.
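A minimal smoke test of that flow, assuming a stock install on the default port (install script URL and the `/v1` path are from Ollama’s own docs):

```bash
# install (Linux; macOS and Windows use the ollama.com installer)
curl -fsSL https://ollama.com/install.sh | sh

# pull a model and chat in the terminal
ollama pull llama3.1
ollama run llama3.1

# the daemon also speaks OpenAI's chat shape on :11434
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.1", "messages": [{"role": "user", "content": "Say hi"}]}'
```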
Recipe: route Claude Code subtasks to a local Llama via the Ollama MCP
You are running a long Claude Code session on a private codebase and want the local model to handle the cheap, repetitive work — docstring stubs, redactions, lint rationales — without burning Anthropic tokens or letting source leave the laptop. Pull a coding model with ollama pull qwen2.5-coder:7b, install the Ollama MCP server from the card above, then prompt Claude Code: “Use the ollama MCP server. Run qwen2.5-coder:7b on each file in src/utils/ and add a one-line docstring summary. Don’t touch the implementation. Diff at the end.” Claude orchestrates the loop, the local model writes the docstrings, and the only tokens you pay for are the planning ones. Pair with Anthropic’s Claude Code best practices guide for the broader pattern.
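If you want to see the wiring, Claude Code registers MCP servers with `claude mcp add`. The exact launch command comes from the install card; the package name below is a placeholder, not the real one:

```bash
# register the Ollama MCP server with Claude Code
# ("ollama-mcp-server" is a placeholder -- copy the real launch
#  command from the /servers/ollama install card)
claude mcp add ollama -- npx -y ollama-mcp-server

# confirm it's registered before starting the session
claude mcp list
```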
Skip it if...
You are serving a model to multiple concurrent users on a GPU box — Ollama’s single-stream-first design loses to vLLM’s continuous batching once you cross two or three simultaneous sessions. Skip it too if you need a desktop chat UI out of the box (Ollama is daemon-only — use Jan or LM Studio), or if you care about exact quantization variants: Ollama’s default picks are often Q4_0 rather than Q4_K_M, so quality-conscious users pull GGUFs manually from Hugging Face.
LM Studio — what makes it different
What it does best
LM Studio is the most polished onboarding for someone who has never opened a terminal. Browse Hugging Face from inside the app, click a row to download a GGUF, click another tab to chat — the built-in chat window is good enough that most users never reach for a separate UI. On Apple Silicon it ships an MLX backend alongside llama.cpp, and for some workloads MLX is meaningfully faster than llama.cpp’s Metal path; the toggle lives in app settings. The trade-off is that the app itself is a closed binary, even though every engine it bundles is open source. There is no first-party MCP server.
Pick this if you...
- Are onboarding a non-developer — designer, analyst, product manager — to local LLMs and need a one-window experience without any CLI.
- Specifically want the MLX engine on Apple Silicon without bolting it onto llama.cpp by hand.
- Want a Hugging Face model browser inside the app rather than copy-pasting GGUF URLs into a terminal.
- Are fine with a closed-source binary in exchange for the smoothest desktop UX.
Where it shines: the Excel-and-PowerPoint user dipping into local LLMs
A finance analyst hears about local models, wants to try Mistral 7B on their work laptop without filing an IT ticket. They install LM Studio, browse to TheBloke’s GGUFs from inside the app, click the green download button next to a Q4_K_M variant, and click Chat. No homebrew, no curl, no port numbers. The chat window remembers history per session, the model dropdown swaps models in seconds, and an OpenAI-compatible server toggle in the bottom panel exposes the model on port 1234 for whatever VS Code extension they want to point at it. This is the exact path that has made LM Studio the dominant desktop pick on Reddit’s r/LocalLLaMA over the past year.
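Once that server toggle is on, the endpoint behaves like any other OpenAI-compatible base URL. A quick check, assuming LM Studio’s default port 1234:

```bash
# list whatever model is currently loaded
curl http://localhost:1234/v1/models

# chat against it with the standard OpenAI JSON shape
# (LM Studio routes to the loaded model; the "model" field is loose)
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-model", "messages": [{"role": "user", "content": "Summarise Q4_K_M in one line"}]}'
```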
Skip it if...
Your compliance posture requires audit-everything open source — the closed binary is disqualifying, and Jan or Ollama is the right move. Skip it too if you want a scriptable CLI (Ollama wins on automation), or if you need MCP integration today: LM Studio ships no MCP server, no MCP client, and no MCP extension as of May 2026, so wiring it into Claude Code requires a generic OpenAI gateway like LiteLLM.
Source / try it: lmstudio.ai · licence terms
Jan — what makes it different
What it does best
Jan is the AGPLv3 desktop alternative built around an audit-the-source posture. Telemetry is off by default, the app runs entirely offline once models are downloaded, and conversation data lives in a plain folder under your home directory that you can back up, inspect, or wipe. The thing that makes Jan unique inside this comparison is its Browser MCP extension: among the four non-Ollama runtimes, Jan is the only one that can act as an MCP client, calling out to MCP tool servers from inside a Jan conversation. It also ships its own Cortex inference engine alongside llama.cpp.
Pick this if you...
- Need the LM Studio shape but on an open-source binary you can compile from source.
- Have a privacy or compliance posture where the licence matters: legal, healthcare, defence, or regulated finance teams converge here.
- Want a desktop app that can call MCP servers as a client, not just expose itself as an OpenAI endpoint.
- Want project workspaces where each conversation pins to a folder of attached knowledge files instead of one flat chat list.
Where it shines: keeping page content on-device with the Browser MCP extension
A clinical researcher needs to summarise a vendor’s product page without sending the URL to a cloud LLM. They install Jan, pull a Llama 3 8B Q4_K_M from the built-in Hugging Face browser, enable the Browser MCP extension, and ask Jan to “summarise the open tab.” The extension feeds the rendered page content into the local Llama via MCP, the model runs on-device, and the summary appears in the chat window — no data ever leaves the laptop. The same pattern works for internal docs, intranet pages, or any authenticated-only URL the team would not paste into ChatGPT.
Skip it if...
You already work in a terminal-native flow — Ollama’s CLI is faster for scripting and its MCP server presence in the directory is more mature than Jan’s client-side extension. Skip Jan too if your team needs the absolute polish of LM Studio’s desktop UX (Jan is good but lags by a release or two on small details), or if your deployment target is a server rather than a laptop: Cortex can run headless, but LocalAI is the better-shaped tool for that job.
Source / try it: jan.ai · github.com/menloresearch/jan
LocalAI — what makes it different
What it does best
LocalAI is an OpenAI API emulator, not just a chat backend. The pitch is a drop-in replacement REST API: chat completions, embeddings, image generation, audio transcription, text-to-speech, and reranking all served from one endpoint, so a Node or Python codebase built against the OpenAI SDK can swap base URL and keep going. The trick under the hood is multiple backends — llama.cpp for GGUF, an embedded vLLM for AWQ/GPTQ, Diffusers and Whisper.cpp for the non-text modalities. And the MCP angle: LocalAI’s agent runtime can call MCP servers natively when a model emits a tool call, with continuous batching across requests.
Pick this if you...
- Already standardised on the OpenAI Chat Completions API in production code and want to swap the base URL without rewriting client SDKs.
- Need one binary that serves chat, embeddings, image generation, and audio behind the same endpoint.
- Want native MCP tool-calling on the server side rather than wiring an external orchestrator.
- Deploy in Docker or Kubernetes — LocalAI is container-shaped and built for on-prem rather than laptop use.
Where it shines: drop-in replace OpenAI in a Node app on a $40 VPS
A small SaaS team has a Node.js app calling openai.chat.completions.create() a few thousand times a day. The OpenAI bill creeps up. They spin up a $40/month GPU-less VPS, run a single docker run localai/localai, point a config YAML at a Q4_K_M Llama 3 8B GGUF, and change one line in their app — baseURL from api.openai.com/v1 to their-vps:8080/v1. Continuous batching keeps the CPU saturated for the modest QPS, the OpenAI SDK has no idea anything changed, and the bill drops to flat infra cost. Set API_KEY at boot — LocalAI’s auth is opt-in, and exposing it unauthenticated is the canonical footgun.
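A sketch of that swap, assuming the `localai/localai` image named above and LocalAI’s `API_KEY` environment variable; the volume path and model name are illustrative, so check the docs for your version:

```bash
# boot LocalAI with auth enabled (auth is opt-in -- set API_KEY
# before exposing the port to anything beyond localhost)
docker run -d -p 8080:8080 \
  -e API_KEY=change-me \
  -v "$PWD/models:/models" \
  localai/localai

# the app-side change is just the base URL; same OpenAI wire format
curl http://their-vps:8080/v1/chat/completions \
  -H "Authorization: Bearer change-me" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3-8b", "messages": [{"role": "user", "content": "ping"}]}'
```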
Skip it if...
You want a desktop chat experience — LocalAI ships a basic web UI but most users front it with Open WebUI or AnythingLLM, which is more setup than installing Jan or LM Studio. Skip it too if your QPS is high enough that GPU efficiency dominates the budget: vLLM’s PagedAttention will outperform LocalAI’s embedded runtimes on a busy multi-tenant box, and putting vLLM behind LocalAI just to keep the OpenAI shape is doable but adds a hop.
Source / try it: localai.io · github.com/mudler/LocalAI
vLLM — what makes it different
What it does best
vLLM is the production GPU serving engine that came out of Berkeley’s 2023 PagedAttention paper and is now the default open-source stack for high-throughput multi-tenant inference. PagedAttention treats the KV cache like virtual-memory pages so concurrent requests share blocks without copying; continuous batching inserts new requests into the GPU as soon as a slot frees, instead of waiting for static batches to drain. The vLLM team’s own launch post reports up to 24× higher throughput vs Hugging Face Transformers on shared workloads. Tensor and pipeline parallelism are built in.
Pick this if you...
- Serve a model to many concurrent users on a GPU box you control — internal chatbots, customer-facing products, bulk inference sweeps.
- Need to shard a 70B-class model across multiple GPUs on a single host (consumer runtimes do not do this natively).
- Care about P99 latency under load and tokens-per-dollar on rented GPU time.
- Are happy with a Python or Docker server install on Linux + CUDA, and a portable /v1/chat/completions surface for clients.
Where it shines: serving a 70B model to hundreds of concurrent users on one A100 node
An internal AI platform team needs to expose a Llama 3 70B endpoint to the engineering org — call it 1,000 concurrent users behind a Slack bot and an internal IDE plugin. They run the official vLLM Docker image on a single A100 node, shard the model with tensor parallelism across the node’s GPUs (a 70B at FP16 is roughly 140 GB of weights, more than any single card holds), and pin a CUDA version that matches the wheel. Continuous batching keeps the GPUs saturated as requests arrive at random intervals, while Hugging Face TGI on the same hardware would either bottleneck on synchronous batches or burn capacity on warm idle slots. The engineering org gets one OpenAI-compatible base URL; vLLM does the messy scheduling underneath.
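What that deployment looks like in practice, assuming the official `vllm/vllm-openai` image and a 4-GPU node — model name and TP degree are illustrative:

```bash
# serve Llama 3 70B sharded across 4 GPUs, OpenAI-compatible on :8000
docker run -d --gpus all -p 8000:8000 \
  -v "$PWD/hf-cache:/root/.cache/huggingface" \
  vllm/vllm-openai \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4

# clients see a standard OpenAI surface
curl http://localhost:8000/v1/models
```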
Skip it if...
You want to chat with a model on your laptop. vLLM is Linux-only, requires CUDA (or ROCm) wheels pinned to a specific PyTorch and driver combination, and treats Mac hardware as an unsupported target. The install path is a debugging adventure compared to Ollama’s one binary, and there is no native MCP support — vLLM is a serving engine, not an agent framework, so MCP-aware orchestration sits in front of it (Claude Code, LocalAI, or your own client).
Source / try it: docs.vllm.ai · github.com/vllm-project/vllm
Performance & memory
We do not publish a one-shot tokens-per-second benchmark in this post. Quants vary, hardware varies, model versions change weekly, and a single run from a single machine isn’t representative. What we will publish is the sizing math you actually need to make a decision — pulled from the runtimes’ own docs and the standard quantization references.
Disk and RAM, by parameter count at Q4_K_M:
| Model size | Q4_K_M file size (approx) | RAM/VRAM to load | Headroom for context |
|---|---|---|---|
| 3B | ~2.0 GB | ~3 GB | Ample on 8GB |
| 7B / 8B | ~4.5 GB | ~6 GB | OK on 16GB; comfortable on 24GB |
| 13B | ~7.5 GB | ~10 GB | Tight on 16GB; OK on 24GB |
| 30B / 34B | ~18–20 GB | ~24 GB | Needs 32–48GB; doesn't fit consumer GPUs |
| 70B | ~40 GB | ~48 GB | Needs 64–96GB unified RAM or 2× consumer GPUs |
These are approximate figures from the standard llama.cpp quant tables — they vary a few hundred MB by tokenizer and architecture. Treat as guard-rails, not quotes. The deeper truth: KV-cache memory grows linearly with context length, so a 70B model with a 32k-context conversation in flight needs noticeably more VRAM than the bare model size. Run with a smaller context window (--ctx-size in llama.cpp, num_ctx in Ollama) if you’re thrashing.
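The back-of-envelope for that KV growth, using Llama 3 8B’s published architecture (32 layers, 8 KV heads via GQA, head dim 128) and an FP16 cache:

```
per-token KV ≈ 2 (K and V) × layers × kv_heads × head_dim × 2 bytes
             = 2 × 32 × 8 × 128 × 2 B ≈ 128 KB/token
8k context   ≈ 1.0 GB on top of the ~4.5 GB of weights
32k context  ≈ 4.2 GB, which is why long chats thrash a 16 GB machine
```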
Apple Silicon unified memory. The big advantage on Macs is that the same RAM pool serves both CPU and GPU — a 96GB M3 Max happily runs 70B Q4 with room to spare, no PCIe transfer penalty. This is why Apple Silicon punches above its weight on local LLM workloads despite lower raw FLOPS than an Nvidia 5090.
Throughput for serving. If you are serving a model to multiple concurrent users, vLLM’s continuous batching changes the math entirely — the same GPU that runs ~30 t/s single-stream on Ollama may serve ~250 t/s aggregate across 8 concurrent users on vLLM. That figure is workload-dependent; we cite vLLM’s own published benchmarks, not a number we measured.
MCP integration depth
This is the section the MCP.Directory reader actually opened the page for. Honest answer: the MCP story for local runtimes is uneven.
Ollama — first-class MCP server presence
The Ollama MCP server (/servers/ollama) lets Claude or Claude Code call into a running Ollama daemon as an MCP tool. This is the one entry on this page that ships in the directory’s catalog. Ollama itself does not act as an MCP client — it’s a chat backend, not an agent framework.
LocalAI — native MCP client in agent flows
LocalAI’s agent runtime can call MCP servers directly when a model emits a tool call. This is the cleanest way to give an open-weights model MCP-flavoured tool use without writing your own orchestration layer.
Jan — MCP-aware via extension
Jan has an extension model and a Browser MCP extension that lets a Jan conversation call MCP tool servers. Not as battle-tested as Ollama’s server presence or LocalAI’s native flow, but the path is real and the source is auditable.
LM Studio — no native MCP of any kind
LM Studio exposes an OpenAI-compatible server but ships no MCP server, no MCP client, and no MCP extension as of 2026-05-08. Using LM Studio with Claude Code requires bridging via a generic OpenAI gateway like LiteLLM.
vLLM — no native MCP
vLLM is a serving engine, not an agent framework. The right pattern is to put an MCP-aware orchestrator (LocalAI, your own client, or a Claude Code instance) in front of vLLM and treat vLLM as the LLM-as-a-service layer.
The pattern that scales best for a developer who already lives in Claude Code: Anthropic API for the main loop + Ollama via MCP for cheap private subtasks. This is the same orchestration shape we describe in the MCP context bloat fix deep-dive — local models earn their keep on the short-context, repetitive tasks that would otherwise eat cloud tokens.
Common pitfalls
Ollama port 11434 conflict
Ollama defaults to localhost:11434. If you’ve installed it twice (Homebrew formula and the Mac app, common on macOS), one will fail to bind silently. lsof -i :11434 to see what’s holding it. Set OLLAMA_HOST if you need a different port.
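The diagnose-and-remap sequence, using the stock OLLAMA_HOST variable from Ollama’s docs (both the daemon and the CLI read it):

```bash
# who is holding the port?
lsof -i :11434

# run the daemon on a different port...
OLLAMA_HOST=127.0.0.1:11435 ollama serve

# ...and point the CLI (and any other client) at the same address
export OLLAMA_HOST=127.0.0.1:11435
ollama run llama3.1
```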
LM Studio binary is closed source
Even though LM Studio bundles open-source engines, the app itself ships under a proprietary licence. For audit-everything compliance, this is disqualifying — Jan or Ollama instead.
Hugging Face throttling on Jan and LM Studio
Both apps browse HF for models. Anonymous downloads are rate-limited; if you’re pulling many GGUFs for evaluation, log into your HF account inside the app. The Pro/Enterprise HF plans have higher quotas but for individuals, free + logged-in is the sweet spot.
LocalAI auth disabled by default
LocalAI starts with no API key required. If your container is exposed to anything beyond localhost — even your home LAN — set API_KEY at boot. Searches for “LocalAI exposed” on Shodan turn up plenty of unintended public endpoints.
vLLM CUDA-version pinning
vLLM wheels are built against specific PyTorch + CUDA combinations. A driver upgrade on the host can break the existing wheel; an OS upgrade can break the driver. The safest path is the official vLLM Docker image, which pins everything; pip install vllm on a custom Linux box is a debugging adventure.
Community signal
Three voices that capture why developers reach for each. Verbatim with sources.
“Ollama is what got me from 'I should try a local model someday' to 'I run one every day.' The activation energy is gone. One curl, one command, you have a model.”
zerojames · Hacker News
Comment on a 'show HN: my local LLM workflow' thread, capturing the canonical reason Ollama dominates the easy-install niche.
“vLLM achieves up to 24x higher throughput compared to HuggingFace Transformers, without requiring any model architecture changes.”
vLLM team (Berkeley Sky Lab) · Blog
The original vLLM launch post stating the headline throughput claim. Cited because it's the project's own number, not a third-party benchmark.
“100% open source, AGPLv3. Your data stays on your machine. We don't see your prompts. We don't see your models. We don't see anything.”
Jan project (Menlo Research) · Blog
From Jan's own homepage. The privacy-first positioning is on-brand and verifiable in the AGPLv3 source.
Frequently asked questions
What's the easiest local LLM runtime to start with?
Ollama, by a wide margin. One install (a single binary on macOS/Windows/Linux), one command (ollama run llama3.1), and you have a working OpenAI-compatible HTTP API on port 11434. The model catalog is curated, the GGUF download is automatic, and the CLI is scriptable. If you want a GUI instead of a terminal, LM Studio is the second-easiest. Jan is close on macOS/Windows. LocalAI and vLLM are not beginner-friendly — both assume you know your way around containers, GPUs, and an OpenAI-shaped API surface.
Is Ollama the only one with an MCP server?
Among the five, Ollama is the only one with a dedicated MCP server you can install from the directory at /servers/ollama. LocalAI documents native MCP support inside its agent flows (it can call MCP tool servers as part of model orchestration). vLLM, LM Studio, and Jan do not currently ship an official MCP server — though you can wire any of them to Claude or Claude Code by using a generic OpenAI-compatible MCP gateway, since all four expose an OpenAI-shaped /v1/chat/completions endpoint.
Can LM Studio run on Linux?
Yes. LM Studio ships a Linux AppImage in addition to its macOS .dmg and Windows installer. The AppImage is x86_64 only — Linux on ARM (Asahi/Jetson) is unsupported. The macOS build is a universal binary covering both Apple Silicon and Intel; on Apple Silicon it uses Metal acceleration. On Linux it can target Nvidia GPUs via CUDA llama.cpp builds and AMD GPUs via ROCm. The binary itself is closed-source, even though the underlying llama.cpp engine it bundles is not.
Jan vs LM Studio — which is better for privacy?
Jan, on three concrete axes. First, Jan is open-source under AGPLv3 (full source on GitHub), while LM Studio is a closed-source binary that bundles open-source engines. Second, Jan ships with telemetry off by default and runs entirely offline once models are downloaded; LM Studio's licence permits commercial use but the closed binary is harder to audit. Third, Jan's data layout is a plain folder under your home directory you can back up or delete; LM Studio stores chats in its own format. If your threat model includes 'I want to verify nothing is phoning home,' Jan wins. If your threat model is 'I just don't want to pay OpenAI,' both are fine.
When should I use vLLM instead of Ollama?
When you're serving a model to more than one user at a time on a GPU (or pool of GPUs) and you care about throughput. vLLM's PagedAttention (the algorithm from the original Berkeley paper) and continuous batching let it interleave many concurrent requests on the same GPU at far higher tokens-per-second-per-dollar than Ollama. The tradeoffs: vLLM is Linux-only, requires CUDA (or ROCm on AMD), has no built-in model browser, and the install is a Python package (or a Docker image) that you run as a server. It's the right answer for production inference, the wrong answer for 'I want to chat with Llama 3 on my laptop.'
Does LocalAI work with Claude Code?
Yes — but indirectly. LocalAI exposes an OpenAI-compatible API, and Claude Code talks to Anthropic's API by default, so you can't simply 'point Claude Code at LocalAI.' What you can do: run an MCP server (or a generic OpenAI gateway like LiteLLM) that routes specific tool calls or sub-prompts to your LocalAI instance, while keeping Claude Code's main loop on Claude. LocalAI itself can act as an MCP-aware orchestrator on the server side, so it can call MCP tool servers as part of model execution. The pattern is documented in LocalAI's own docs.
Can I use these with claude.ai or chatgpt.com?
No, not directly. claude.ai and chatgpt.com are first-party hosted UIs locked to Anthropic and OpenAI's models respectively — they don't expose a 'choose your backend' setting. To use a local runtime with the same chat-style UI, install one of: LibreChat, Open WebUI, AnythingLLM, or Jan itself (Jan ships its own chat UI). Point that UI at your local Ollama/LM Studio/LocalAI/vLLM endpoint via OpenAI-compatible URL. For agent workflows, Claude Code remains tied to Claude — but you can run claude on Anthropic's API and run a separate local agent (e.g. via Ollama + an MCP gateway) for cheaper sub-tasks.
Best free local LLM setup for a 16GB MacBook?
Ollama plus a 7B-class model in Q4_K_M quantization. The two safe choices: llama3.1:8b (about 4.7GB on disk) or qwen2.5:7b (about 4.4GB). Both leave enough headroom on 16GB unified memory for the system and a browser. Run ollama run llama3.1:8b and you're done. If you want a GUI instead of the terminal, install Jan or LM Studio and download the same Q4_K_M GGUFs. Avoid 13B+ at Q4 on 16GB — once macOS swaps you'll lose more time to disk than the bigger model gains in quality. For coding specifically, qwen2.5-coder:7b is the current consensus pick on r/LocalLLaMA.
Sources
Ollama
- github.com/ollama/ollama — README, MIT licence, install configs
- ollama.com/library — model catalog
- /servers/ollama — Ollama MCP server install card
LM Studio
- lmstudio.ai — homepage, supported OS, MLX engine note
- lmstudio.ai/terms — current licence terms (verify before commercial use)
Jan
- github.com/menloresearch/jan — README, AGPLv3 LICENSE, Cortex engine note
- jan.ai — privacy posture, extension model
LocalAI
- github.com/mudler/LocalAI — README, MIT licence, multi-backend architecture
- localai.io — feature docs, agent + MCP integration notes
vLLM
- github.com/vllm-project/vllm — Apache 2.0 licence, CLI, supported quants
- blog.vllm.ai launch post — PagedAttention, 24× throughput claim
- docs.vllm.ai — production server config, OpenAI compat
Internal links
- /servers/ollama — the Ollama MCP server
- Cursor Tab vs Copilot vs Codeium vs Tabnine vs Cody (2026) — sibling 5-way comparison
- Best language Skills for Claude Code (2026)
- MCP context bloat fix (2026) — when to route subtasks to a local model
- Claude llama.cpp skill cookbook — the engine under three of these runtimes
- /best-mcp-servers — curated roundup
- /servers — browse all 3,000+