Claude llama-cpp skill: 10 local-inference recipes you can ship today
Ten real local-inference recipes — build with Metal/CUDA, run llama-server, quantize to Q4_K_M, stand up an embeddings server, wire a RAG pipeline, force JSON via GBNF grammars, run multi-slot concurrent inference, benchmark with llama-bench, convert HF safetensors to GGUF, and front it all with nginx — each as a single Claude prompt with the exact shell or Python the skill emits.
If you found this looking for the daemon-wrapper version, the companion piece is the ollama-setup skill. Ollama wraps the same llama.cpp binary in a friendlier daemon and CLI. Pick Ollama when the defaults match what you want; pick the llama.cpp skill in this cookbook when they don’t.
Already know what skills are? Skip to the cookbook. First time? Read the explainer then come back. Need the install? It’s on the /skills/llama-cpp page.

On this page · 21 sections
- What this skill does
- The cookbook
- Install + README
- Watch it in action
- 01 · Build llama.cpp from source with Metal or CUDA
- 02 · Run llama-server and hit /completion
- 03 · Quantize an HF model to Q4_K_M
- 04 · Embeddings server with --embedding
- 05 · Local RAG pipeline (chunk → embed → query)
- 06 · Function calling via .gbnf grammar
- 07 · Multi-slot concurrent inference (server slots)
- 08 · Benchmark a model with llama-bench
- 09 · Convert HF safetensors → GGUF
- 10 · Deploy llama-server behind nginx with API auth
- Community signal
- The contrarian take
- Real recipes shipped
- Gotchas
- Pairs well with
- FAQ
- Sources
What this skill actually does
Sixty seconds of context before the cookbook — what the llama.cpp skill is, what Claude returns when you invoke it, and the one thing it does NOT do for you.
“Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware.”
— zechenzhangAGI, the skill author · /skills/llama-cpp
What Claude returns
When triggered, Claude returns shell scripts that build llama.cpp from source (cmake with `-DGGML_METAL=ON`, `-DGGML_CUDA=ON`, `-DGGML_VULKAN=ON`, or `-DGGML_HIPBLAS=ON`), drive `llama-cli`, `llama-server`, `llama-bench`, and `llama-quantize`, and load GGUF models from HuggingFace. The skill picks Q4_K_M / Q5_K_M / Q8_0 quants from the standard family, runs `convert_hf_to_gguf.py` for fresh HF checkpoints, exposes the OpenAI-compatible `/v1/chat/completions` endpoint, and wires GBNF grammar files for structured output and function calling.
What it does NOT do
It does not install the llama.cpp toolchain (cmake, a working C++ compiler, CUDA Toolkit if you target NVIDIA) for you. Install those first, then trigger the skill — it builds, but it doesn't bootstrap your dev environment.
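If you need those prerequisites first, a minimal sketch — assuming macOS with Homebrew or a Debian/Ubuntu box; adjust for your distro and add the CUDA Toolkit separately if you target NVIDIA:
# macOS: compiler toolchain + cmake
xcode-select --install
brew install cmake
# Debian/Ubuntu: compiler toolchain + cmake + git
sudo apt-get update && sudo apt-get install -y build-essential cmake git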
How you trigger it
- Build llama.cpp on this Mac with Metal and run llama-server.
- Quantize Qwen2.5-7B to Q4_K_M and benchmark it.
- Stand up a local embeddings server on port 8081 for my RAG.
Cost when idle
~100 tokens at idle (the skill name + description in the system prompt). Build scripts, server flags, and grammar templates load only when triggered.
The cookbook
Each entry below is a recipe you could ship this week. They run roughly in order of stack depth — the early ones build and serve, the middle ones quantize and embed, the later ones compose llama-server with grammars, parallel slots, and an nginx front door. Every entry pairs with one or two skills or MCP servers you already have on mcp.directory.
Install + README
If the skill isn’t on your machine yet, here’s the one-liner. The full install panel (Codex, Copilot, Antigravity variants) is on the skill page — the same UI is embedded below.
One-line install · by zechenzhangAGI
mkdir -p .claude/skills/llama-cpp && curl -L -o skill.zip "https://mcp.directory/api/skills/download/202" && unzip -o skill.zip -d .claude/skills/llama-cpp && rm skill.zip
Installs to .claude/skills/llama-cpp
Watch it in action
Prompt Engineering’s walkthrough of llama-server — the OpenAI-compatible endpoint, parallel decoding, and the deployment shape use cases 2, 4, and 7 below all rely on. Useful before the cookbook because it anchors the contract before you read the prompts.
01 · Build llama.cpp from source with Metal or CUDA
Clone the upstream repo, build with the right backend for the host (Metal on Apple Silicon, CUDA on NVIDIA, Vulkan on AMD/Intel), and verify the resulting `llama-cli` and `llama-server` binaries.
For: Anyone setting up llama.cpp on a fresh machine — homelab, dev box, edge device.
The prompt
Detect the host (uname -s, nvidia-smi, system_profiler SPDisplaysDataType) and produce a build script that clones https://github.com/ggerganov/llama.cpp, runs the right cmake invocation, and verifies the binaries. On macOS pick Metal. On Linux with NVIDIA pick CUDA. On Linux without NVIDIA pick Vulkan. Print the four binaries that should land in `build/bin/`: llama-cli, llama-server, llama-bench, llama-quantize. Save as `build.sh`.
What build.sh looks like
#!/usr/bin/env bash
set -euo pipefail
git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp
case "$(uname -s)" in
  Darwin) cmake -B build -DGGML_METAL=ON ;;
  Linux)
    if command -v nvidia-smi >/dev/null; then
      cmake -B build -DGGML_CUDA=ON
    else
      cmake -B build -DGGML_VULKAN=ON
    fi ;;
esac
cmake --build build --config Release -j
ls build/bin/llama-{cli,server,bench,quantize}
One-line tweak
Swap the Linux `-DGGML_VULKAN=ON` branch for `-DGGML_HIPBLAS=ON` to target AMD ROCm on workstations with Radeon cards.
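A sketch of that swap in build.sh's Linux branch — flag names have moved around between llama.cpp releases, so check docs/build.md for your checkout:
# Linux + Radeon branch (assumes ROCm is installed and hipcc is on PATH)
cmake -B build -DGGML_HIPBLAS=ON
cmake --build build --config Release -j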
02 · Run llama-server and hit /completion
Boot `llama-server` against a GGUF model, expose port 8080, and verify with both the OpenAI-compatible `/v1/chat/completions` and the native `/completion` endpoints.
For: Developers wiring llama.cpp into an existing OpenAI SDK app without changing client code.
The prompt
Download `Meta-Llama-3-8B-Instruct-Q4_K_M.gguf` via huggingface-cli, then start `llama-server` with -c 4096, -ngl 99 to offload all layers to the GPU when available, --host 0.0.0.0, --port 8080. After it boots, run two curl tests: one against `/v1/chat/completions` (OpenAI shape, with messages[]) and one against the native `/completion` (single prompt string + n_predict). Save as `serve.sh` and `smoke.sh`.
What serve.sh and smoke.sh look like
# serve.sh
huggingface-cli download \
bartowski/Meta-Llama-3-8B-Instruct-GGUF \
Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --local-dir ./models
./build/bin/llama-server -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
-c 4096 -ngl 99 --host 0.0.0.0 --port 8080
# smoke.sh
curl -s http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":"Say hi in 5 words"}],"max_tokens":32}'
curl -s http://localhost:8080/completion -H 'Content-Type: application/json' \
-d '{"prompt":"The capital of France is","n_predict":8}'One-line tweak
Add `-np 4` to enable 4 parallel slots so multiple requests share one running model — turns a single-tenant server into a small multi-tenant one.
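What serve.sh looks like with that tweak applied — remember `-c` is the total context shared across slots, so scale it up if each request still needs 4096 tokens:
./build/bin/llama-server -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  -c 16384 -ngl 99 -np 4 --host 0.0.0.0 --port 8080   # 16384 / 4 slots = 4096 ctx per request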
03 · Quantize an HF model to Q4_K_M
Convert a HuggingFace safetensors checkpoint to f16 GGUF, then quantize down to Q4_K_M — the consensus sweet-spot quantization that fits 8B models in ~5GB.
For: Engineers shipping a fine-tune to laptop-class memory budgets.
The prompt
Take a local HF checkpoint at ./models/qwen2.5-7b-instruct/ and produce a two-step quantization pipeline: (1) `python convert_hf_to_gguf.py` to write `qwen2.5-7b-instruct-f16.gguf`, then (2) `llama-quantize` to write `qwen2.5-7b-instruct-Q4_K_M.gguf`. Print the resulting file sizes side by side so I can confirm the ~4× shrink. Save as `quantize.sh`.
What quantize.sh looks like
#!/usr/bin/env bash
set -euo pipefail
SRC=./models/qwen2.5-7b-instruct
F16=$SRC/qwen2.5-7b-instruct-f16.gguf
Q4=$SRC/qwen2.5-7b-instruct-Q4_K_M.gguf
python ~/llama.cpp/convert_hf_to_gguf.py $SRC --outfile $F16 --outtype f16
~/llama.cpp/build/bin/llama-quantize $F16 $Q4 Q4_K_M
du -h $F16 $Q4
# typical output: 14G f16 / 4.6G Q4_K_M
One-line tweak
Replace the final argument `Q4_K_M` with `Q5_K_M` for the same script to produce a higher-quality quant; or `Q8_0` if you have memory to spare and want near-lossless output.
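If you want to see all three side by side before committing, a small loop over the same f16 works — a sketch reusing the $SRC and $F16 variables from quantize.sh:
for Q in Q4_K_M Q5_K_M Q8_0; do
  ~/llama.cpp/build/bin/llama-quantize $F16 $SRC/qwen2.5-7b-instruct-$Q.gguf $Q
done
du -h $SRC/qwen2.5-7b-instruct-Q*.gguf   # compare sizes, then spot-check quality with llama-cli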
04 · Embeddings server with --embedding
Run a second `llama-server` instance dedicated to embeddings on port 8081 using a small embedding model — turn llama.cpp into a drop-in OpenAI embeddings endpoint.
For: Anyone building local RAG without paying per-token to OpenAI.
The prompt
Download `nomic-embed-text-v1.5.Q5_K_M.gguf` from HuggingFace and start `llama-server` in embedding mode: --embedding, --pooling mean, -c 2048, --port 8081. Then curl `/v1/embeddings` to confirm a 768-dim vector lands. The output JSON should match OpenAI's embeddings shape so existing client SDKs work unchanged. Save as `embed-server.sh`.
What embed-server.sh looks like
huggingface-cli download \
nomic-ai/nomic-embed-text-v1.5-GGUF \
nomic-embed-text-v1.5.Q5_K_M.gguf --local-dir ./models
./build/bin/llama-server -m ./models/nomic-embed-text-v1.5.Q5_K_M.gguf \
  --embedding --pooling mean -c 2048 --port 8081 &
curl -s http://localhost:8081/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input":"the quick brown fox"}' | jq '.data[0].embedding | length'
# 768
One-line tweak
Swap `--pooling mean` for `--pooling cls` if the embedding model card calls for CLS pooling (BGE-style models do) — read the model's README before picking; the wrong pooling silently degrades retrieval.
05 · Local RAG pipeline (chunk → embed → query)
Stitch chunked text → llama.cpp embeddings server → in-memory FAISS index → llama-server completion. End-to-end RAG on one machine, zero cloud calls.
For: Developers building a homelab or air-gapped RAG demo.
The prompt
Write `rag.py` that: (1) reads `./docs/*.md`, splits into 512-char chunks; (2) calls `http://localhost:8081/v1/embeddings` for each chunk and stacks the vectors into a FAISS IndexFlatIP; (3) at query time embeds the question, retrieves top-5 chunks; (4) calls `http://localhost:8080/v1/chat/completions` with a system prompt that includes the retrieved chunks as context. Both servers are llama-server instances from use cases 2 and 4.
What rag.py looks like
import requests, faiss, numpy as np, glob, textwrap
EMBED = "http://localhost:8081/v1/embeddings"
CHAT = "http://localhost:8080/v1/chat/completions"
def embed(text):
    r = requests.post(EMBED, json={"input": text}).json()
    return np.array(r["data"][0]["embedding"], dtype="float32")
chunks = [c for f in glob.glob("docs/*.md")
          for c in textwrap.wrap(open(f).read(), 512)]
index = faiss.IndexFlatIP(768)
index.add(np.stack([embed(c) for c in chunks]))
def ask(q):
    _, ids = index.search(embed(q)[None], 5)
    ctx = "\n\n".join(chunks[i] for i in ids[0])
    return requests.post(CHAT, json={"messages": [
        {"role": "system", "content": f"Use this context:\n{ctx}"},
        {"role": "user", "content": q}]}).json()["choices"][0]["message"]["content"]
print(ask("What did the team decide?"))
One-line tweak
Replace `IndexFlatIP` with `IndexHNSWFlat(768, 32)` to scale past ~50k chunks without paging the whole index through RAM at every query.
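A sketch of that swap inside rag.py — keep the inner-product metric so scores stay comparable with the flat index; M=32 and efSearch=64 are starting points, not tuned values:
index = faiss.IndexHNSWFlat(768, 32, faiss.METRIC_INNER_PRODUCT)   # 32 = HNSW graph degree
index.hnsw.efSearch = 64    # higher = better recall, slower queries
index.add(np.stack([embed(c) for c in chunks]))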
06 · Function calling via .gbnf grammar
Force any GGUF model to emit valid JSON for a tool call by passing a .gbnf grammar at decode time. No fine-tuning, no Pydantic — just constrained sampling.
For: Anyone bolting tool-use onto a model that wasn't fine-tuned for it.
The prompt
Take this Python tool signature: `def get_weather(city: str, unit: 'C'|'F') -> dict`. Convert it to a GBNF grammar that constrains the model's output to a JSON object with exactly those two fields. Then call `llama-cli` with `--grammar-file weather.gbnf` and prompt 'I need the weather in Paris.'. Show the grammar file, the CLI command, and the verifiably-valid JSON it emits. Save as `weather.gbnf` plus `call.sh`.
What weather.gbnf and call.sh look like
# weather.gbnf
root   ::= "{" ws "\"city\"" ws ":" ws string ws "," ws "\"unit\"" ws ":" ws unit ws "}"
unit   ::= "\"C\"" | "\"F\""
string ::= "\"" [^"]+ "\""
ws     ::= [ \t\n]*
# call.sh
./build/bin/llama-cli -m ./models/llama3-8b.Q4_K_M.gguf \
  --grammar-file weather.gbnf -n 80 \
  -p 'I need the weather in Paris.' --no-display-prompt
# {"city":"Paris","unit":"C"}
One-line tweak
Pipe `examples/json_schema_to_grammar.py` over a JSONSchema file to auto-generate the grammar — useful when the function signature is already in OpenAPI.
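A sketch of that pipeline — the script ships in the llama.cpp checkout and prints GBNF to stdout (check its --help if flags differ in your version); `tool_schema.json` is a hypothetical schema file you supply:
python ~/llama.cpp/examples/json_schema_to_grammar.py tool_schema.json > tool.gbnf
./build/bin/llama-cli -m ./models/llama3-8b.Q4_K_M.gguf \
  --grammar-file tool.gbnf -n 80 -p 'Call the weather tool for Paris.'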
07 · Multi-slot concurrent inference (server slots)
Run one `llama-server` with `-np 4` so four concurrent requests share the same loaded model. Saturate a single GPU with parallel users instead of spinning up four servers.
For: Teams sharing one box across multiple Claude Code agents or chat sessions.
The prompt
Start `llama-server` with -np 4 (four parallel slots), -c 16384 (effective per-slot ctx is 16384/4 = 4096), --cont-batching, --port 8080. Then write a Python `bench.py` that fires 8 concurrent /v1/chat/completions requests via asyncio.gather and prints (request_id, first-token latency, total tokens/sec). Use it to confirm the server is actually parallelizing.
What server.sh and bench.py look like
# server.sh
./build/bin/llama-server -m ./models/llama3-8b.Q4_K_M.gguf \
  -c 16384 -np 4 --cont-batching --port 8080
# bench.py
import asyncio, httpx, time
async def hit(client, i):
    t0 = time.time()
    r = await client.post("http://localhost:8080/v1/chat/completions",
                          json={"messages": [{"role": "user", "content": f"Count to 20 ({i})"}]},
                          timeout=120)
    n = r.json()["usage"]["completion_tokens"]
    print(f"req {i}: {n/(time.time()-t0):.1f} tok/s")
async def main():
    async with httpx.AsyncClient() as c:
        await asyncio.gather(*[hit(c, i) for i in range(8)])
asyncio.run(main())
One-line tweak
Bump `-np` to 8 and `-c` to 32768 only if VRAM holds — every extra slot reserves its share of KV cache up front, so undersizing OOMs the GPU mid-request.
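A back-of-envelope way to check that before the OOM — a sketch assuming Llama-3-8B shapes (32 layers, 8 KV heads, head dim 128) and an f16 KV cache:
layers, kv_heads, head_dim, bytes_per = 32, 8, 128, 2
ctx = 32768
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per * ctx   # K + V across the whole context
print(f"KV cache ≈ {kv_bytes / 2**30:.1f} GiB on top of ~4.6 GiB of Q4_K_M weights")
# ≈ 4.0 GiB — fine on a 12 GB card, tight on 8 GB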
08 · Benchmark a model with llama-bench
Run `llama-bench` against a GGUF file at a few prompt-size / generation-size points and emit a markdown table of pp512 / tg128 throughput. The number you actually quote when comparing hardware.
For: Hardware shoppers and ops engineers picking GPU vs CPU vs Mac.
The prompt
Run `llama-bench` against ./models/llama3-8b.Q4_K_M.gguf with `-p 512,1024,2048` for prompt processing and `-n 128,256` for token generation, three repeats each. Capture the markdown table it emits, then post-process it into a single-line summary: 'On <hostname>, llama3-8b Q4_K_M runs PP512 at <X> tok/s and TG128 at <Y> tok/s.' Save as `bench.sh`.
What bench.sh looks like
#!/usr/bin/env bash
set -euo pipefail
HOST=$(hostname -s)
MODEL=./models/llama3-8b.Q4_K_M.gguf
./build/bin/llama-bench -m $MODEL -p 512,1024,2048 -n 128,256 -r 3 \
  -o md > bench-$HOST.md
# t/s is the last data column of the markdown table, whatever columns the backend adds before it
PP=$(awk -F'|' '/pp512/ {print $(NF-1)}' bench-$HOST.md | tr -d ' ' | head -1)
TG=$(awk -F'|' '/tg128/ {print $(NF-1)}' bench-$HOST.md | tr -d ' ' | head -1)
echo "On $HOST, llama3-8b Q4_K_M: PP512=$PP tok/s, TG128=$TG tok/s"
One-line tweak
Add `-fa 1` to enable flash attention and rerun — the delta tells you whether your GPU/build actually supports it before you bake it into production flags.
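The rerun is the same bench.sh with one extra flag; writing a second table makes the comparison a one-line diff — a sketch reusing the $MODEL and $HOST variables above:
./build/bin/llama-bench -m $MODEL -p 512,1024,2048 -n 128,256 -r 3 -fa 1 \
  -o md > bench-$HOST-fa.md
diff bench-$HOST.md bench-$HOST-fa.md   # compare t/s with and without flash attention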
09 · Convert HF safetensors → GGUF
Run `convert_hf_to_gguf.py` against a freshly-downloaded HF model, validate the architecture is supported, and produce an f16 GGUF ready for `llama-quantize`.
For: Anyone bringing a brand-new HF release (Qwen, Mistral, Phi) into llama.cpp on day one.
The prompt
Download `Qwen/Qwen2.5-7B-Instruct` via `huggingface-cli download --local-dir ./models/qwen2.5-7b-instruct`, then run `convert_hf_to_gguf.py ./models/qwen2.5-7b-instruct --outtype f16 --outfile ./models/qwen2.5-7b-instruct-f16.gguf`. If the script errors with 'Architecture not supported', print the exact line `convert_hf_to_gguf_update.py` recommends running and stop. Save as `convert.sh`.
What convert.sh looks like
#!/usr/bin/env bash
set -euo pipefail
REPO=Qwen/Qwen2.5-7B-Instruct
DEST=./models/qwen2.5-7b-instruct
OUT=$DEST-f16.gguf
huggingface-cli download "$REPO" --local-dir "$DEST" \
--exclude '*.bin' '*.pt' # skip pickled weights
if ! python ~/llama.cpp/convert_hf_to_gguf.py "$DEST" \
     --outtype f16 --outfile "$OUT" 2>conv.log; then
  echo "FAIL — most likely architecture not supported."
  echo "Run: python ~/llama.cpp/convert_hf_to_gguf_update.py"
  cat conv.log
  exit 1
fi
ls -lh "$OUT"
One-line tweak
Pass `--outtype bf16` instead of `f16` for newer Hopper-class GPUs that prefer bfloat16 — many recent fine-tunes were trained in bf16 and lose nothing in the conversion.
10 · Deploy llama-server behind nginx with API auth
Front a running `llama-server` with nginx, terminate TLS, require a Bearer token, and rate-limit per-IP. Production-shape deployment for a single-host inference service.
For: Solo operators putting a llama.cpp box on the open internet.
The prompt
Write an nginx site config that proxies https://llm.example.com/ → http://127.0.0.1:8080/, requires `Authorization: Bearer $LLM_TOKEN` (validated against an `auth_request` upstream that checks one env var), enforces `limit_req_zone` at 5 req/s per IP, and forwards SSE streaming correctly (`proxy_buffering off`, `proxy_read_timeout 300s`). Pair it with a 3-line systemd unit that auto-starts llama-server on boot. Save as `nginx.conf` plus `llama-server.service`.
What nginx.conf and llama-server.service look like
# /etc/nginx/sites-available/llama
limit_req_zone $binary_remote_addr zone=llm:10m rate=5r/s;
server {
    listen 443 ssl http2;
    server_name llm.example.com;
    ssl_certificate     /etc/letsencrypt/live/llm.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;
    location / {
        # nginx does not expand shell env vars in config — template the token in at
        # deploy time (envsubst) or switch to auth_request (see the tweak below)
        if ($http_authorization != "Bearer CHANGE_ME_STATIC_TOKEN") { return 401; }
        limit_req zone=llm burst=10 nodelay;
        proxy_pass http://127.0.0.1:8080;
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
# /etc/systemd/system/llama-server.service
[Service]
ExecStart=/home/llm/llama.cpp/build/bin/llama-server -m /srv/models/llama3-8b.Q4_K_M.gguf -c 8192 --port 8080
Restart=on-failure
One-line tweak
Swap the static-token check for an `auth_request` directive against a tiny FastAPI side-car that validates JWTs — keeps the nginx config dumb and rotates secrets without a reload.
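A minimal sketch of that shape — the side-car port, route names, and JWT handling here are illustrative, not part of the skill:
# nginx: delegate the token check to the side-car
location = /_auth {
    internal;
    proxy_pass http://127.0.0.1:9000/verify;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
    proxy_set_header Authorization $http_authorization;
}
location / {
    auth_request /_auth;
    proxy_pass http://127.0.0.1:8080;
    proxy_buffering off;
}
# sidecar.py — tiny FastAPI verifier (run with: python -m uvicorn sidecar:app --port 9000)
from fastapi import FastAPI, Header, HTTPException
import jwt, os
app = FastAPI()
@app.get("/verify")
def verify(authorization: str = Header(default="")):
    token = authorization.removeprefix("Bearer ").strip()
    try:
        jwt.decode(token, os.environ["JWT_SECRET"], algorithms=["HS256"])
    except Exception:
        raise HTTPException(status_code=401)
    return {"ok": True}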
Community signal
Three voices from people running llama.cpp for real work. The first is the migration story (Ollama and LM Studio refugees), the second is the daily-driver write-up, the third is the Apple Silicon Metal-vs-default moment that catches every new user.
“Just grab a .gguf file, point to it, and run. It reminded me why I love tinkering on Linux in the first place: fewer black boxes, more freedom to make things work your way.”
Bhuwan Mishra (It's FOSS) · Blog
Bhuwan after switching from Ollama and LM Studio. The whole post is the canonical 'why bare-metal llama.cpp wins for power users' write-up.
“Switching to llama.cpp strips all that away and gives you direct access, efficiency, and flexibility. Startup is quicker, resource utilization is lower, and you can tune everything to your liking.”
Dhruv Bhutani (XDA Developers) · Blog
Dhruv on what he gained moving off Ollama. Note the framing: the abstractions cost him startup time and tunability, not skill-floor.
“love that, I've been using Mistral 7B on my M1 and I thought it was tolerable but turned out I wasnt utilizing Metal”
yieldcrv · Hacker News
yieldcrv on HN, December 2023. The classic 'enable -ngl on Apple Silicon and watch the tokens-per-second triple' moment.
The contrarian take
Not everyone thinks llama.cpp is the right starting point. The most honest critique on the comparison-blog circuit is from D-Central (LM Studio vs Ollama vs llama.cpp):
“llama.cpp has a steeper learning curve, as the number of flags and options can be overwhelming for beginners… there's no model management — you download GGUF files manually from HuggingFace, manage them in folders yourself.”
D-Central (LM Studio vs Ollama vs llama.cpp) · Blog
From D-Central's LM Studio vs Ollama vs llama.cpp comparison — the 'who is this for' framing every newcomer hits.
Fair concern. llama.cpp is a power-user tool — the flag surface is enormous, and there’s no built-in model registry. That’s exactly why this skill exists. The cookbook above is the model-management layer Ollama and LM Studio bake in: each recipe names the GGUF, picks the quant, sets the right server flags, and ends in a runnable script. You opt out of the abstraction, but Claude carries the muscle memory — the ergonomic gap shrinks to one prompt.
One more comparison worth naming: there’s a community llama-cpp MCP server — llama-cpp-bridge — that wraps the HTTP API as MCP tools, plus the more popular ollama MCP server for the daemon side. Skill-vs-MCP trade-off is the usual one: the skill is ~100 idle tokens, the MCP’s tool schemas load every turn. Pick the MCP only when multiple AI clients need to share one running model — otherwise stick with the skill in this cookbook.
Real recipes shipped with llama.cpp
Concrete examples from public projects. Most don’t use the Claude skill specifically — they’re here to show what production-grade llama.cpp pipelines look like, so you have a target shape in mind when you write the prompt.
- Bhuwan Mishra — Migration story from Ollama/LM Studio to bare llama.cpp on Windows + Linux
- Dhruv Bhutani — XDA Developers walkthrough: switching to llama.cpp for finer control on anemic hardware
- Andreas Kunar — llama.cpp Apple Silicon performance deep-dive (Medium)
- XiongjieDai — GPU-Benchmarks-on-LLM-Inference: NVIDIA vs Apple Silicon comparisons run via llama-bench
- Aidan Cooper — Constrained decoding with GBNF: the Harry Potter JSON tutorial
- Maximilian Winter — llama_cpp_function_calling: GBNF-driven tool use
Gotchas (the four that bite)
Sourced from the llama.cpp issue tracker and the bundled skill source.
-ngl is silently zero on default builds
If you forget -ngl 99 (or some non-zero layer count), llama-server runs entirely on CPU even on a CUDA / Metal box. Token generation looks 'tolerable' until you realize the GPU is idle. Always pass -ngl explicitly; the skill writes 99 by default.
convert_hf_to_gguf.py rejects new architectures
When a brand-new model architecture lands on HuggingFace, the conversion script errors with 'Architecture <X> not supported.' Run convert_hf_to_gguf_update.py from the same repo first — it pulls the new tokenizer/architecture handlers — then retry.
-c is the TOTAL context, not per-slot
With -np 4 and -c 16384, each slot only gets 4096 effective context. Setting -c too low for the workload silently truncates prompts. Calculate per-slot ctx = total / np before you commit to flags; use case 7 above shows the math.
Pooling mode mismatches break embeddings silently
--pooling cls works for BGE-style models; --pooling mean works for nomic-embed; the wrong choice produces vectors that look fine but rank retrievals badly. Always read the embedding model's HF README and pass the pooling it recommends.
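Two quick checks that catch the first and third gotcha before they bite — a sketch; the exact offload log wording varies across llama.cpp versions:
# gotcha 1: confirm layers actually landed on the GPU (expect an 'offloaded N/N layers' line)
./build/bin/llama-cli -m ./models/llama3-8b.Q4_K_M.gguf -ngl 99 -n 1 -p hi 2>&1 | grep -i offload
# gotcha 3: per-slot context arithmetic before you commit to flags
CTX=16384; NP=4; echo "per-slot ctx = $((CTX / NP))"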
Pairs well with
Curated to match the cookbook’s actual integrations: the inference-adjacent skills (ollama-setup, gguf-quantization, embedding-strategies, rag-implementation) plus the MCP servers the longer use cases (2, 5, 6, 10) lean on.
Related skills
Related MCP servers
Two posts that compose well with this cookbook: What are Claude Code skills? covers the underlying mechanism, and What is MCP? covers the protocol the llama-cpp-bridge MCP server speaks — useful when you graduate from skill to shared service.
Frequently asked questions
Why is the llama.cpp skill useful when Ollama already wraps llama.cpp?
Ollama is great when its defaults match what you want. The skill exists for the cases where they don't — custom build flags (Vulkan on AMD, ROCm on Linux, fine-grained -ngl settings), custom GGUF files quantized in-house, parallel server slots tuned for your VRAM, and GBNF grammar files for tool use. The cookbook entries above each pick a moment where llama.cpp's surface area is the right answer and Ollama's CLI hides the knob.
Is Q4_K_M really the consensus sweet-spot quantization?
Yes for most 7-13B chat models on consumer hardware. Q4_K_M cuts disk and memory to roughly a quarter of f16 in exchange for a modest perplexity increase (on the order of 5-15% versus f16) — barely noticeable in subjective chat use. Q5_K_M is the next step up if you have RAM. Q8_0 is near-lossless for cases where every nuance matters. Q2_K and Q3_K_M only make sense when you're genuinely RAM-starved. Use case 3 above shows the Q4_K_M conversion; the same script flips to Q5_K_M or Q8_0 by changing one argument.
Is there a llama.cpp MCP server I should use instead of the skill?
There is one community server — llama-cpp-bridge — that exposes llama.cpp's HTTP API as MCP tools. It's useful when multiple AI clients need to share one running model. For a single Claude session that builds, quantizes, and serves on its own machine, the skill is the lighter option: ~100 tokens at idle versus an MCP whose tool schemas load every turn. The same trade-off shows up for the ollama mcp server (21 impressions/mo on this domain) — pick MCP when sharing, skill when authoring.
Why does llama.cpp matter on Apple Silicon if MLX is faster?
MLX is faster on Apple Silicon for token generation — measurements suggest 20-87% faster on models under 14B. But MLX has no GGUF, no .gbnf grammar files, no llama-bench, and no first-class Vulkan/CUDA path on the rest of the lineup. llama.cpp gives you one toolchain that works the same on a Mac, a Linux box, and a Raspberry Pi 5. Pick MLX when the deployment target is exclusively Apple Silicon and you've already shipped; pick llama.cpp when you want one set of scripts for a fleet.
Does the skill handle GBNF grammar files for function calling?
Yes — use case 6 above. The skill writes the .gbnf grammar from a Python tool signature, calls `llama-cli --grammar-file weather.gbnf`, and verifies the JSON output. For complex schemas, it shells out to `examples/json_schema_to_grammar.py` from the upstream repo. The mechanism is constrained sampling at decode time, so any GGUF model — even one not fine-tuned for tool use — will respect the grammar.
Can I run the same scripts on Windows?
Build step needs adapting (the cookbook examples assume bash). Once built, llama-server, llama-cli, llama-bench, and llama-quantize are platform-neutral and the curl/Python clients run unchanged. The two YouTube alternates above (Code D. Roger's Windows + Linux/macOS install pair) cover the build-step gap. Long-term, WSL2 is the path most contributors recommend on Windows.
Why is llama.cpp getting impressions on Google but not many clicks here?
The bare 'llama.cpp' query brings up the ggerganov/llama.cpp GitHub README and the llama-cpp.com landing page as the top results — those will always win for the brand-term search. This blog targets the long-tail variants where developers want a how-to: 'llama.cpp skill', 'llama cpp skills', 'llama.cpp claude skill', 'llama.cpp gguf', and 'llama.cpp mcp'. Those are the queries the cookbook above is built to rank for.
Sources
Primary
- zechenzhangAGI/AI-research-SKILLs — llama-cpp SKILL.md
- ggerganov/llama.cpp — README and docs/build.md
- llama.cpp grammars/README.md (GBNF guide)
- llama.cpp grammars/json.gbnf (canonical JSON grammar)
- HuggingFace transformers — quantization documentation
Community
- Bhuwan Mishra (It's FOSS) — Blog
- Dhruv Bhutani (XDA Developers) — Blog
- yieldcrv — Hacker News
- johnklos — Hacker News
- discobot — Hacker News
- Simon Willison — Blog
Critical and contrarian
- D-Central — LM Studio vs Ollama vs llama.cpp — Blog
Internal
- /skills/llama-cpp — skill page and install panel
- What are Claude Code skills? — explainer
- What is MCP? — explainer