Claude llama-cpp skill: 10 local-inference recipes you can ship today
Ten real local-inference recipes — build with Metal/CUDA, run llama-server, quantize to Q4_K_M, stand up an embeddings server, wire a RAG pipeline, force JSON via GBNF grammars, run multi-slot concurrent inference, benchmark with llama-bench, convert HF safetensors to GGUF, and front it all with nginx — each as a single Claude prompt with the exact shell or Python the skill emits.
If you found this looking for the daemon-wrapper version, the companion piece is the ollama-setup skill. Ollama wraps the same llama.cpp binary in a friendlier daemon and CLI. Pick Ollama when the defaults match what you want; pick the llama.cpp skill in this cookbook when they don’t.
Already know what skills are? Skip to the cookbook. First time? Read the explainer then come back. Need the install? It’s on the /skills/llama-cpp page.

On this page · 21 sections
- What this skill does
- The cookbook
- Install + README
- Watch it in action
- 01 · Build llama.cpp from source with Metal or CUDA
- 02 · Run llama-server and hit /completion
- 03 · Quantize an HF model to Q4_K_M
- 04 · Embeddings server with --embedding
- 05 · Local RAG pipeline (chunk → embed → query)
- 06 · Function calling via .gbnf grammar
- 07 · Multi-slot concurrent inference (server slots)
- 08 · Benchmark a model with llama-bench
- 09 · Convert HF safetensors → GGUF
- 10 · Deploy llama-server behind nginx with API auth
- Community signal
- The contrarian take
- Real recipes shipped
- Gotchas
- Pairs well with
- FAQ
- Sources
What this skill actually does
Sixty seconds of context before the cookbook — what the llama.cpp skill is, what Claude returns when you invoke it, and the one thing it does NOT do for you.
“Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware.”
— zechenzhangAGI, the skill author · /skills/llama-cpp
What Claude returns
When triggered, Claude returns shell scripts that build llama.cpp from source (cmake with `-DGGML_METAL=ON`, `-DGGML_CUDA=ON`, `-DGGML_VULKAN=ON`, or `-DGGML_HIPBLAS=ON`), drive `llama-cli`, `llama-server`, `llama-bench`, and `llama-quantize`, and load GGUF models from HuggingFace. The skill picks Q4_K_M / Q5_K_M / Q8_0 quants from the standard family, runs `convert_hf_to_gguf.py` for fresh HF checkpoints, exposes the OpenAI-compatible `/v1/chat/completions` endpoint, and wires GBNF grammar files for structured output and function calling.
What it does NOT do
It does not install the llama.cpp toolchain (cmake, a working C++ compiler, CUDA Toolkit if you target NVIDIA) for you. Install those first, then trigger the skill — it builds, but it doesn't bootstrap your dev environment.
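If you need those prerequisites first, a minimal sketch — assuming macOS with Homebrew or a Debian/Ubuntu box; adjust for your distro and add the CUDA Toolkit separately if you target NVIDIA:
# macOS: compiler toolchain + cmake
xcode-select --install
brew install cmake
# Debian/Ubuntu: compiler toolchain + cmake + git
sudo apt-get update && sudo apt-get install -y build-essential cmake git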
How you trigger it
- Build llama.cpp on this Mac with Metal and run llama-server.
- Quantize Qwen2.5-7B to Q4_K_M and benchmark it.
- Stand up a local embeddings server on port 8081 for my RAG.
Cost when idle
~100 tokens at idle (the skill name + description in the system prompt). Build scripts, server flags, and grammar templates load only when triggered.
The cookbook
Each entry below is a recipe you could ship this week. They run roughly in order of stack depth — the early ones build and serve, the middle ones quantize and embed, the later ones compose llama-server with grammars, parallel slots, and an nginx front door. Every entry pairs with one or two skills or MCP servers you already have on mcp.directory.
Install + README
If the skill isn’t on your machine yet, here’s the one-liner. The full install panel (Codex, Copilot, Antigravity variants) is on the skill page — the same UI is embedded below.
One-line install · by zechenzhangAGI
mkdir -p .claude/skills/llama-cpp && curl -L -o skill.zip "https://mcp.directory/api/skills/download/202" && unzip -o skill.zip -d .claude/skills/llama-cpp && rm skill.zip
Installs to .claude/skills/llama-cpp
Watch it in action
Prompt Engineering’s walkthrough of llama-server — the OpenAI-compatible endpoint, parallel decoding, and the deployment shape use cases 2, 4, and 7 below all rely on. Useful before the cookbook because it anchors the contract before you read the prompts.
01 · Build llama.cpp from source with Metal or CUDA
Clone the upstream repo, build with the right backend for the host (Metal on Apple Silicon, CUDA on NVIDIA, Vulkan on AMD/Intel), and verify the resulting `llama-cli` and `llama-server` binaries.
For: Anyone setting up llama.cpp on a fresh machine — homelab, dev box, edge device.
The prompt
Detect the host (uname -s, nvidia-smi, system_profiler SPDisplaysDataType) and produce a build script that clones https://github.com/ggerganov/llama.cpp, runs the right cmake invocation, and verifies the binaries. On macOS pick Metal. On Linux with NVIDIA pick CUDA. On Linux without NVIDIA pick Vulkan. Print the four binaries that should land in `build/bin/`: llama-cli, llama-server, llama-bench, llama-quantize. Save as `build.sh`.
What build.sh looks like
#!/usr/bin/env bash
set -euo pipefail
git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp
case "$(uname -s)" in
  Darwin) cmake -B build -DGGML_METAL=ON ;;
  Linux)
    if command -v nvidia-smi >/dev/null; then
      cmake -B build -DGGML_CUDA=ON
    else
      cmake -B build -DGGML_VULKAN=ON
    fi ;;
esac
cmake --build build --config Release -j
ls build/bin/llama-{cli,server,bench,quantize}
One-line tweak
Swap the Linux `-DGGML_VULKAN=ON` branch for `-DGGML_HIPBLAS=ON` to target AMD ROCm on workstations with Radeon cards.
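A sketch of that swap in build.sh's Linux branch — flag names have moved around between llama.cpp releases, so check docs/build.md for your checkout:
# Linux + Radeon branch (assumes ROCm is installed and hipcc is on PATH)
cmake -B build -DGGML_HIPBLAS=ON
cmake --build build --config Release -j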
02 · Run llama-server and hit /completion
Boot `llama-server` against a GGUF model, expose port 8080, and verify with both the OpenAI-compatible `/v1/chat/completions` and the native `/completion` endpoints.
For: Developers wiring llama.cpp into an existing OpenAI SDK app without changing client code.
The prompt
Download `Meta-Llama-3-8B-Instruct-Q4_K_M.gguf` via huggingface-cli, then start `llama-server` with -c 4096, -ngl 99 to offload all layers to the GPU when available, --host 0.0.0.0, --port 8080. After it boots, run two curl tests: one against `/v1/chat/completions` (OpenAI shape, with messages[]) and one against the native `/completion` (single prompt string + n_predict). Save as `serve.sh` and `smoke.sh`.
What serve.sh and smoke.sh look like
# serve.sh
huggingface-cli download \
bartowski/Meta-Llama-3-8B-Instruct-GGUF \
Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --local-dir ./models
./build/bin/llama-server -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
-c 4096 -ngl 99 --host 0.0.0.0 --port 8080
# smoke.sh
curl -s http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' \
-d '{"messages":[{"role":"user","content":"Say hi in 5 words"}],"max_tokens":32}'
curl -s http://localhost:8080/completion -H 'Content-Type: application/json' \
-d '{"prompt":"The capital of France is","n_predict":8}'One-line tweak
Add `-np 4` to enable 4 parallel slots so multiple requests share one running model — turns a single-tenant server into a small multi-tenant one.
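What serve.sh looks like with that tweak applied — remember `-c` is the total context shared across slots, so scale it up if each request still needs 4096 tokens:
./build/bin/llama-server -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  -c 16384 -ngl 99 -np 4 --host 0.0.0.0 --port 8080   # 16384 / 4 slots = 4096 ctx per request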
03 · Quantize an HF model to Q4_K_M
Convert a HuggingFace safetensors checkpoint to f16 GGUF, then quantize down to Q4_K_M — the consensus sweet-spot quantization that fits 8B models in ~5GB.
For: Engineers shipping a fine-tune to laptop-class memory budgets.
The prompt
Take a local HF checkpoint at ./models/qwen2.5-7b-instruct/ and produce a two-step quantization pipeline: (1) `python convert_hf_to_gguf.py` to write `qwen2.5-7b-instruct-f16.gguf`, then (2) `llama-quantize` to write `qwen2.5-7b-instruct-Q4_K_M.gguf`. Print the resulting file sizes side by side so I can confirm the ~4× shrink. Save as `quantize.sh`.
What quantize.sh looks like
#!/usr/bin/env bash
set -euo pipefail
SRC=./models/qwen2.5-7b-instruct
F16=$SRC/qwen2.5-7b-instruct-f16.gguf
Q4=$SRC/qwen2.5-7b-instruct-Q4_K_M.gguf
python ~/llama.cpp/convert_hf_to_gguf.py $SRC --outfile $F16 --outtype f16
~/llama.cpp/build/bin/llama-quantize $F16 $Q4 Q4_K_M
du -h $F16 $Q4
# typical output: 14G f16 / 4.6G Q4_K_M
One-line tweak
Replace the final argument `Q4_K_M` with `Q5_K_M` for the same script to produce a higher-quality quant; or `Q8_0` if you have memory to spare and want near-lossless output.
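If you want to see all three side by side before committing, a small loop over the same f16 works — a sketch reusing the $SRC and $F16 variables from quantize.sh:
for Q in Q4_K_M Q5_K_M Q8_0; do
  ~/llama.cpp/build/bin/llama-quantize $F16 $SRC/qwen2.5-7b-instruct-$Q.gguf $Q
done
du -h $SRC/qwen2.5-7b-instruct-Q*.gguf   # compare sizes, then spot-check quality with llama-cli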
04 · Embeddings server with --embedding
Run a second `llama-server` instance dedicated to embeddings on port 8081 using a small embedding model — turn llama.cpp into a drop-in OpenAI embeddings endpoint.
For: Anyone building local RAG without paying per-token to OpenAI.
The prompt
Download `nomic-embed-text-v1.5.Q5_K_M.gguf` from HuggingFace and start `llama-server` in embedding mode: --embedding, --pooling mean, -c 2048, --port 8081. Then curl `/v1/embeddings` to confirm a 768-dim vector lands. The output JSON should match OpenAI's embeddings shape so existing client SDKs work unchanged. Save as `embed-server.sh`.
What embed-server.sh looks like
huggingface-cli download \
nomic-ai/nomic-embed-text-v1.5-GGUF \
nomic-embed-text-v1.5.Q5_K_M.gguf --local-dir ./models
./build/bin/llama-server -m ./models/nomic-embed-text-v1.5.Q5_K_M.gguf \
  --embedding --pooling mean -c 2048 --port 8081 &
curl -s http://localhost:8081/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"input":"the quick brown fox"}' | jq '.data[0].embedding | length'
# 768
One-line tweak
Swap `--pooling mean` for `--pooling cls` if the embedding model card calls for CLS pooling (BGE-style models do) — read the model's README before picking; the wrong pooling silently degrades retrieval.
05 · Local RAG pipeline (chunk → embed → query)
Stitch chunked text → llama.cpp embeddings server → in-memory FAISS index → llama-server completion. End-to-end RAG on one machine, zero cloud calls.
For: Developers building a homelab or air-gapped RAG demo.
The prompt
Write `rag.py` that: (1) reads `./docs/*.md`, splits into 512-char chunks; (2) calls `http://localhost:8081/v1/embeddings` for each chunk and stacks the vectors into a FAISS IndexFlatIP; (3) at query time embeds the question, retrieves top-5 chunks; (4) calls `http://localhost:8080/v1/chat/completions` with a system prompt that includes the retrieved chunks as context. Both servers are llama-server instances from use cases 2 and 4.
What rag.py looks like
import requests, faiss, numpy as np, glob, textwrap
EMBED = "http://localhost:8081/v1/embeddings"
CHAT = "http://localhost:8080/v1/chat/completions"
def embed(text):
    r = requests.post(EMBED, json={"input": text}).json()
    return np.array(r["data"][0]["embedding"], dtype="float32")
chunks = [c for f in glob.glob("docs/*.md")
          for c in textwrap.wrap(open(f).read(), 512)]
index = faiss.IndexFlatIP(768)
index.add(np.stack([embed(c) for c in chunks]))
def ask(q):
    _, ids = index.search(embed(q)[None], 5)
    ctx = "\n\n".join(chunks[i] for i in ids[0])
    return requests.post(CHAT, json={"messages": [
        {"role": "system", "content": f"Use this context:\n{ctx}"},
        {"role": "user", "content": q}]}).json()["choices"][0]["message"]["content"]
print(ask("What did the team decide?"))
One-line tweak
Replace `IndexFlatIP` with `IndexHNSWFlat(768, 32)` to scale past ~50k chunks without paging the whole index through RAM at every query.
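A sketch of that swap inside rag.py — keep the inner-product metric so scores stay comparable with the flat index; M=32 and efSearch=64 are starting points, not tuned values:
index = faiss.IndexHNSWFlat(768, 32, faiss.METRIC_INNER_PRODUCT)   # 32 = HNSW graph degree
index.hnsw.efSearch = 64    # higher = better recall, slower queries
index.add(np.stack([embed(c) for c in chunks]))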
06 · Function calling via .gbnf grammar
Force any GGUF model to emit valid JSON for a tool call by passing a .gbnf grammar at decode time. No fine-tuning, no Pydantic — just constrained sampling.
For: Anyone bolting tool-use onto a model that wasn't fine-tuned for it.
The prompt
Take this Python tool signature: `def get_weather(city: str, unit: 'C'|'F') -> dict`. Convert it to a GBNF grammar that constrains the model's output to a JSON object with exactly those two fields. Then call `llama-cli` with `--grammar-file weather.gbnf` and prompt 'I need the weather in Paris.'. Show the grammar file, the CLI command, and the verifiably-valid JSON it emits. Save as `weather.gbnf` plus `call.sh`.
What weather.gbnf and call.sh look like
# weather.gbnf
root   ::= "{" ws "\"city\"" ws ":" ws string ws "," ws "\"unit\"" ws ":" ws unit ws "}"
unit   ::= "\"C\"" | "\"F\""
string ::= "\"" [^"]+ "\""
ws     ::= [ \t\n]*
# call.sh
./build/bin/llama-cli -m ./models/llama3-8b.Q4_K_M.gguf \
  --grammar-file weather.gbnf -n 80 \
  -p 'I need the weather in Paris.' --no-display-prompt
# {"city":"Paris","unit":"C"}
One-line tweak
Pipe `examples/json_schema_to_grammar.py` over a JSONSchema file to auto-generate the grammar — useful when the function signature is already in OpenAPI.
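A sketch of that pipeline — the script ships in the llama.cpp checkout and prints GBNF to stdout (check its --help if flags differ in your version); `tool_schema.json` is a hypothetical schema file you supply:
python ~/llama.cpp/examples/json_schema_to_grammar.py tool_schema.json > tool.gbnf
./build/bin/llama-cli -m ./models/llama3-8b.Q4_K_M.gguf \
  --grammar-file tool.gbnf -n 80 -p 'Call the weather tool for Paris.'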
07 · Multi-slot concurrent inference (server slots)
Run one `llama-server` with `-np 4` so four concurrent requests share the same loaded model. Saturate a single GPU with parallel users instead of spinning up four servers.
For: Teams sharing one box across multiple Claude Code agents or chat sessions.
The prompt
Start `llama-server` with -np 4 (four parallel slots), -c 16384 (effective per-slot ctx is 16384/4 = 4096), --cont-batching, --port 8080. Then write a Python `bench.py` that fires 8 concurrent /v1/chat/completions requests via asyncio.gather and prints (request_id, first-token latency, total tokens/sec). Use it to confirm the server is actually parallelizing.
What server.sh and bench.py look like
# server.sh
./build/bin/llama-server -m ./models/llama3-8b.Q4_K_M.gguf \
  -c 16384 -np 4 --cont-batching --port 8080
# bench.py
import asyncio, httpx, time
async def hit(client, i):
    t0 = time.time()
    r = await client.post("http://localhost:8080/v1/chat/completions",
                          json={"messages": [{"role": "user", "content": f"Count to 20 ({i})"}]},
                          timeout=120)
    n = r.json()["usage"]["completion_tokens"]
    print(f"req {i}: {n/(time.time()-t0):.1f} tok/s")
async def main():
    async with httpx.AsyncClient() as c:
        await asyncio.gather(*[hit(c, i) for i in range(8)])
asyncio.run(main())
One-line tweak
Bump `-np` to 8 and `-c` to 32768 only if VRAM holds — every extra slot reserves its share of KV cache up front, so undersizing OOMs the GPU mid-request.
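A back-of-envelope way to check that before the OOM — a sketch assuming Llama-3-8B shapes (32 layers, 8 KV heads, head dim 128) and an f16 KV cache:
layers, kv_heads, head_dim, bytes_per = 32, 8, 128, 2
ctx = 32768
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per * ctx   # K + V across the whole context
print(f"KV cache ≈ {kv_bytes / 2**30:.1f} GiB on top of ~4.6 GiB of Q4_K_M weights")
# ≈ 4.0 GiB — fine on a 12 GB card, tight on 8 GB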
08 · Benchmark a model with llama-bench
Run `llama-bench` against a GGUF file at a few prompt-size / generation-size points and emit a markdown table of pp512 / tg128 throughput. The number you actually quote when comparing hardware.
For: Hardware shoppers and ops engineers picking GPU vs CPU vs Mac.
The prompt
Run `llama-bench` against ./models/llama3-8b.Q4_K_M.gguf with `-p 512,1024,2048` for prompt processing and `-n 128,256` for token generation, three repeats each. Capture the markdown table it emits, then post-process it into a single-line summary: 'On <hostname>, llama3-8b Q4_K_M runs PP512 at <X> tok/s and TG128 at <Y> tok/s.' Save as `bench.sh`.
What bench.sh looks like
#!/usr/bin/env bash
set -euo pipefail
HOST=$(hostname -s)
MODEL=./models/llama3-8b.Q4_K_M.gguf
./build/bin/llama-bench -m $MODEL -p 512,1024,2048 -n 128,256 -r 3 \
  -o md > bench-$HOST.md
# t/s is the last data column of the markdown table, whatever columns the backend adds before it
PP=$(awk -F'|' '/pp512/ {print $(NF-1)}' bench-$HOST.md | tr -d ' ' | head -1)
TG=$(awk -F'|' '/tg128/ {print $(NF-1)}' bench-$HOST.md | tr -d ' ' | head -1)
echo "On $HOST, llama3-8b Q4_K_M: PP512=$PP tok/s, TG128=$TG tok/s"
One-line tweak
Add `-fa 1` to enable flash attention and rerun — the delta tells you whether your GPU/build actually supports it before you bake it into production flags.
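The rerun is the same bench.sh with one extra flag; writing a second table makes the comparison a one-line diff — a sketch reusing the $MODEL and $HOST variables above:
./build/bin/llama-bench -m $MODEL -p 512,1024,2048 -n 128,256 -r 3 -fa 1 \
  -o md > bench-$HOST-fa.md
diff bench-$HOST.md bench-$HOST-fa.md   # compare t/s with and without flash attention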
09 · Convert HF safetensors → GGUF
Run `convert_hf_to_gguf.py` against a freshly-downloaded HF model, validate the architecture is supported, and produce an f16 GGUF ready for `llama-quantize`.
For: Anyone bringing a brand-new HF release (Qwen, Mistral, Phi) into llama.cpp on day one.
The prompt
Download `Qwen/Qwen2.5-7B-Instruct` via `huggingface-cli download --local-dir ./models/qwen2.5-7b-instruct`, then run `convert_hf_to_gguf.py ./models/qwen2.5-7b-instruct --outtype f16 --outfile ./models/qwen2.5-7b-instruct-f16.gguf`. If the script errors with 'Architecture not supported', print the exact line `convert_hf_to_gguf_update.py` recommends running and stop. Save as `convert.sh`.
What convert.sh looks like
#!/usr/bin/env bash
set -euo pipefail
REPO=Qwen/Qwen2.5-7B-Instruct
DEST=./models/qwen2.5-7b-instruct
OUT=$DEST-f16.gguf
huggingface-cli download "$REPO" --local-dir "$DEST" \
--exclude '*.bin' '*.pt' # skip pickled weights
if ! python ~/llama.cpp/convert_hf_to_gguf.py "$DEST" \
     --outtype f16 --outfile "$OUT" 2>conv.log; then
  echo "FAIL — most likely architecture not supported."
  echo "Run: python ~/llama.cpp/convert_hf_to_gguf_update.py"
  cat conv.log
  exit 1
fi
ls -lh "$OUT"
One-line tweak
Pass `--outtype bf16` instead of `f16` for newer Hopper-class GPUs that prefer bfloat16 — many recent fine-tunes were trained in bf16 and lose nothing in the conversion.
10 · Deploy llama-server behind nginx with API auth
Front a running `llama-server` with nginx, terminate TLS, require a Bearer token, and rate-limit per-IP. Production-shape deployment for a single-host inference service.
For: Solo operators putting a llama.cpp box on the open internet.
The prompt
Write an nginx site config that proxies https://llm.example.com/ → http://127.0.0.1:8080/, requires `Authorization: Bearer $LLM_TOKEN` (validated against an `auth_request` upstream that checks one env var), enforces `limit_req_zone` at 5 req/s per IP, and forwards SSE streaming correctly (`proxy_buffering off`, `proxy_read_timeout 300s`). Pair it with a 3-line systemd unit that auto-starts llama-server on boot. Save as `nginx.conf` plus `llama-server.service`.
What nginx.conf and llama-server.service look like
# /etc/nginx/sites-available/llama
limit_req_zone $binary_remote_addr zone=llm:10m rate=5r/s;
server {
    listen 443 ssl http2;
    server_name llm.example.com;
    ssl_certificate     /etc/letsencrypt/live/llm.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/llm.example.com/privkey.pem;
    location / {
        # nginx does not expand shell env vars in config — template the token in at
        # deploy time (envsubst) or switch to auth_request (see the tweak below)
        if ($http_authorization != "Bearer CHANGE_ME_STATIC_TOKEN") { return 401; }
        limit_req zone=llm burst=10 nodelay;
        proxy_pass http://127.0.0.1:8080;
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
# /etc/systemd/system/llama-server.service
[Service]
ExecStart=/home/llm/llama.cpp/build/bin/llama-server -m /srv/models/llama3-8b.Q4_K_M.gguf -c 8192 --port 8080
Restart=on-failure
One-line tweak
Swap the static-token check for an `auth_request` directive against a tiny FastAPI side-car that validates JWTs — keeps the nginx config dumb and rotates secrets without a reload.
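A minimal sketch of that shape — the side-car port, route names, and JWT handling here are illustrative, not part of the skill:
# nginx: delegate the token check to the side-car
location = /_auth {
    internal;
    proxy_pass http://127.0.0.1:9000/verify;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
    proxy_set_header Authorization $http_authorization;
}
location / {
    auth_request /_auth;
    proxy_pass http://127.0.0.1:8080;
    proxy_buffering off;
}
# sidecar.py — tiny FastAPI verifier (run with: python -m uvicorn sidecar:app --port 9000)
from fastapi import FastAPI, Header, HTTPException
import jwt, os
app = FastAPI()
@app.get("/verify")
def verify(authorization: str = Header(default="")):
    token = authorization.removeprefix("Bearer ").strip()
    try:
        jwt.decode(token, os.environ["JWT_SECRET"], algorithms=["HS256"])
    except Exception:
        raise HTTPException(status_code=401)
    return {"ok": True}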
Community signal
Three voices from people running llama.cpp for real work. The first is the migration story (Ollama and LM Studio refugees), the second is the daily-driver write-up, the third is the Apple Silicon Metal-vs-default moment that catches every new user.
“Just grab a .gguf file, point to it, and run. It reminded me why I love tinkering on Linux in the first place: fewer black boxes, more freedom to make things work your way.”
Bhuwan Mishra (It's FOSS) · Blog
Bhuwan after switching from Ollama and LM Studio. The whole post is the canonical 'why bare-metal llama.cpp wins for power users' write-up.
“Switching to llama.cpp strips all that away and gives you direct access, efficiency, and flexibility. Startup is quicker, resource utilization is lower, and you can tune everything to your liking.”
Dhruv Bhutani (XDA Developers) · Blog
Dhruv on what he gained moving off Ollama. Note the framing: the abstractions cost him startup time and tunability, not skill-floor.
“love that, I've been using Mistral 7B on my M1 and I thought it was tolerable but turned out I wasnt utilizing Metal”
yieldcrv · Hacker News
yieldcrv on HN, December 2023. The classic 'enable -ngl on Apple Silicon and watch the tokens-per-second triple' moment.
The contrarian take
Not everyone thinks llama.cpp is the right starting point. The most honest critique on the comparison-blog circuit is from D-Central (LM Studio vs Ollama vs llama.cpp):
“llama.cpp has a steeper learning curve, as the number of flags and options can be overwhelming for beginners… there's no model management — you download GGUF files manually from HuggingFace, manage them in folders yourself.”
D-Central (LM Studio vs Ollama vs llama.cpp) · Blog
From D-Central's LM Studio vs Ollama vs llama.cpp comparison — the 'who is this for' framing every newcomer hits.
Fair concern. llama.cpp is a power-user tool — the flag surface is enormous, and there’s no built-in model registry. That’s exactly why this skill exists. The cookbook above is the model-management layer Ollama and LM Studio bake in: each recipe names the GGUF, picks the quant, sets the right server flags, and ends in a runnable script. You opt out of the abstraction, but Claude carries the muscle memory — the ergonomic gap shrinks to one prompt.
One more comparison worth naming: there’s a community llama-cpp MCP server — llama-cpp-bridge — that wraps the HTTP API as MCP tools, plus the more popular ollama MCP server for the daemon side. Skill-vs-MCP trade-off is the usual one: the skill is ~100 idle tokens, the MCP’s tool schemas load every turn. Pick the MCP only when multiple AI clients need to share one running model — otherwise stick with the skill in this cookbook.
Real recipes shipped with llama.cpp
Concrete examples from public projects. Most don’t use the Claude skill specifically — they’re here to show what production-grade llama.cpp pipelines look like, so you have a target shape in mind when you write the prompt.
- Bhuwan Mishra — Migration story from Ollama/LM Studio to bare llama.cpp on Windows + Linux
- Dhruv Bhutani — XDA Developers walkthrough: switching to llama.cpp for finer control on anemic hardware
- Andreas Kunar — llama.cpp Apple Silicon performance deep-dive (Medium)
- XiongjieDai — GPU-Benchmarks-on-LLM-Inference: NVIDIA vs Apple Silicon comparisons run via llama-bench
- Aidan Cooper — Constrained decoding with GBNF: the Harry Potter JSON tutorial
- Maximilian Winter — llama_cpp_function_calling: GBNF-driven tool use
Gotchas (the four that bite)
Sourced from the llama.cpp issue tracker and the bundled skill source.
-ngl is silently zero on default builds
If you forget -ngl 99 (or some non-zero layer count), llama-server runs entirely on CPU even on a CUDA / Metal box. Token generation looks 'tolerable' until you realize the GPU is idle. Always pass -ngl explicitly; the skill writes 99 by default.
convert_hf_to_gguf.py rejects new architectures
When a brand-new model architecture lands on HuggingFace, the conversion script errors with 'Architecture <X> not supported.' Run convert_hf_to_gguf_update.py from the same repo first — it pulls the new tokenizer/architecture handlers — then retry.
-c is the TOTAL context, not per-slot
With -np 4 and -c 16384, each slot only gets 4096 effective context. Setting -c too low for the workload silently truncates prompts. Calculate per-slot ctx = total / np before you commit to flags; use case 7 above shows the math.
Pooling mode mismatches break embeddings silently
--pooling cls works for BGE-style models; --pooling mean works for nomic-embed; the wrong choice produces vectors that look fine but rank retrievals badly. Always read the embedding model's HF README and pass the pooling it recommends.
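Two quick checks that catch the first and third gotcha before they bite — a sketch; the exact offload log wording varies across llama.cpp versions:
# gotcha 1: confirm layers actually landed on the GPU (expect an 'offloaded N/N layers' line)
./build/bin/llama-cli -m ./models/llama3-8b.Q4_K_M.gguf -ngl 99 -n 1 -p hi 2>&1 | grep -i offload
# gotcha 3: per-slot context arithmetic before you commit to flags
CTX=16384; NP=4; echo "per-slot ctx = $((CTX / NP))"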
Pairs well with
Curated to match the cookbook’s actual integrations: the inference-adjacent skills (ollama-setup, gguf-quantization, embedding-strategies, rag-implementation) plus the MCP servers the longer use cases (2, 5, 6, 10) lean on.
Related skills
Related MCP servers
Two posts that compose well with this cookbook: What are Claude Code skills? covers the underlying mechanism, and What is MCP? covers the protocol the llama-cpp-bridge MCP server speaks — useful when you graduate from skill to shared service.
Frequently asked questions
Why is the llama.cpp skill useful when Ollama already wraps llama.cpp?
Ollama is great when its defaults match what you want. The skill exists for the cases where they don't — custom build flags (Vulkan on AMD, ROCm on Linux, fine-grained -ngl settings), custom GGUF files quantized in-house, parallel server slots tuned for your VRAM, and GBNF grammar files for tool use. The cookbook entries above each pick a moment where llama.cpp's surface area is the right answer and Ollama's CLI hides the knob.
Is Q4_K_M really the consensus sweet-spot quantization?
Yes for most 7-13B chat models on consumer hardware. Q4_K_M cuts disk and memory to roughly a quarter of f16 in exchange for a modest perplexity increase (on the order of 5-15% versus f16) — barely noticeable in subjective chat use. Q5_K_M is the next step up if you have RAM. Q8_0 is near-lossless for cases where every nuance matters. Q2_K and Q3_K_M only make sense when you're genuinely RAM-starved. Use case 3 above shows the Q4_K_M conversion; the same script flips to Q5_K_M or Q8_0 by changing one argument.
Is there a llama.cpp MCP server I should use instead of the skill?
There is one community server — llama-cpp-bridge — that exposes llama.cpp's HTTP API as MCP tools. It's useful when multiple AI clients need to share one running model. For a single Claude session that builds, quantizes, and serves on its own machine, the skill is the lighter option: ~100 tokens at idle versus an MCP whose tool schemas load every turn. The same trade-off shows up for the ollama mcp server (21 impressions/mo on this domain) — pick MCP when sharing, skill when authoring.
Why does llama.cpp matter on Apple Silicon if MLX is faster?
MLX is faster on Apple Silicon for token generation — measurements suggest 20-87% faster on models under 14B. But MLX has no GGUF, no .gbnf grammar files, no llama-bench, and no first-class Vulkan/CUDA path on the rest of the lineup. llama.cpp gives you one toolchain that works the same on a Mac, a Linux box, and a Raspberry Pi 5. Pick MLX when the deployment target is exclusively Apple Silicon and you've already shipped; pick llama.cpp when you want one set of scripts for a fleet.
Does the skill handle GBNF grammar files for function calling?
Yes — use case 6 above. The skill writes the .gbnf grammar from a Python tool signature, calls `llama-cli --grammar-file weather.gbnf`, and verifies the JSON output. For complex schemas, it shells out to `examples/json_schema_to_grammar.py` from the upstream repo. The mechanism is constrained sampling at decode time, so any GGUF model — even one not fine-tuned for tool use — will respect the grammar.
Can I run the same scripts on Windows?
Build step needs adapting (the cookbook examples assume bash). Once built, llama-server, llama-cli, llama-bench, and llama-quantize are platform-neutral and the curl/Python clients run unchanged. The two YouTube alternates above (Code D. Roger's Windows + Linux/macOS install pair) cover the build-step gap. Long-term, WSL2 is the path most contributors recommend on Windows.
Why is llama.cpp getting impressions on Google but not many clicks here?
The bare 'llama.cpp' query brings up the ggerganov/llama.cpp GitHub README and the llama-cpp.com landing page as the top results — those will always win for the brand-term search. This blog targets the long-tail variants where developers want a how-to: 'llama.cpp skill', 'llama cpp skills', 'llama.cpp claude skill', 'llama.cpp gguf', and 'llama.cpp mcp'. Those are the queries the cookbook above is built to rank for.
Sources
Primary
- zechenzhangAGI/AI-research-SKILLs — llama-cpp SKILL.md
- ggerganov/llama.cpp — README and docs/build.md
- llama.cpp grammars/README.md (GBNF guide)
- llama.cpp grammars/json.gbnf (canonical JSON grammar)
- HuggingFace transformers — quantization documentation
Community
- Bhuwan Mishra (It's FOSS) — Blog
- Dhruv Bhutani (XDA Developers) — Blog
- yieldcrv — Hacker News
- johnklos — Hacker News
- discobot — Hacker News
- Simon Willison — Blog
Critical and contrarian
- D-Central — LM Studio vs Ollama vs llama.cpp — Blog
Internal
- /skills/llama-cpp — skill page and install panel
- What are Claude Code skills? — explainer
- What is MCP? — explainer