llama-cpp
Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
Install
mkdir -p .claude/skills/llama-cpp && curl -L -o skill.zip "https://mcp.directory/api/skills/download/202" && unzip -o skill.zip -d .claude/skills/llama-cpp && rm skill.zipInstalls to .claude/skills/llama-cpp
About this skill
llama.cpp
Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
When to use llama.cpp
Use llama.cpp when:
- Running on CPU-only machines
- Deploying on Apple Silicon (M1/M2/M3/M4)
- Using AMD or Intel GPUs (no CUDA)
- Edge deployment (Raspberry Pi, embedded systems)
- Need simple deployment without Docker/Python
Use TensorRT-LLM instead when:
- Have NVIDIA GPUs (A100/H100)
- Need maximum throughput (100K+ tok/s)
- Running in datacenter with CUDA
Use vLLM instead when:
- Have NVIDIA GPUs
- Need Python-first API
- Want PagedAttention
Quick start
Installation
# macOS/Linux
brew install llama.cpp
# Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# With Metal (Apple Silicon)
make LLAMA_METAL=1
# With CUDA (NVIDIA)
make LLAMA_CUDA=1
# With ROCm (AMD)
make LLAMA_HIP=1
Download model
# Download from HuggingFace (GGUF format)
huggingface-cli download \
TheBloke/Llama-2-7B-Chat-GGUF \
llama-2-7b-chat.Q4_K_M.gguf \
--local-dir models/
# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
Run inference
# Simple chat
./llama-cli \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
-p "Explain quantum computing" \
-n 256 # Max tokens
# Interactive chat
./llama-cli \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
--interactive
Server mode
# Start OpenAI-compatible server
./llama-server \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 32 # Offload 32 layers to GPU
# Client request
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2-7b-chat",
"messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7,
"max_tokens": 100
}'
Quantization formats
GGUF format overview
| Format | Bits | Size (7B) | Speed | Quality | Use Case |
|---|---|---|---|---|---|
| Q4_K_M | 4.5 | 4.1 GB | Fast | Good | Recommended default |
| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
Choosing quantization
# General use (balanced)
Q4_K_M # 4-bit, medium quality
# Maximum speed (more degradation)
Q2_K or Q3_K_M
# Maximum quality (slower)
Q6_K or Q8_0
# Very large models (70B, 405B)
Q3_K_M or Q4_K_S # Lower bits to fit in memory
Hardware acceleration
Apple Silicon (Metal)
# Build with Metal
make LLAMA_METAL=1
# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999 # Offload all layers
# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
NVIDIA GPUs (CUDA)
# Build with CUDA
make LLAMA_CUDA=1
# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers
# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
AMD GPUs (ROCm)
# Build with ROCm
make LLAMA_HIP=1
# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
Common patterns
Batch processing
# Process multiple prompts from file
cat prompts.txt | ./llama-cli \
-m model.gguf \
--batch-size 512 \
-n 100
Constrained generation
# JSON output with grammar
./llama-cli \
-m model.gguf \
-p "Generate a person: " \
--grammar-file grammars/json.gbnf
# Outputs valid JSON only
Context size
# Increase context (default 512)
./llama-cli \
-m model.gguf \
-c 4096 # 4K context window
# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768 # 32K context
Performance benchmarks
CPU performance (Llama 2-7B Q4_K_M)
| CPU | Threads | Speed | Cost |
|---|---|---|---|
| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
GPU acceleration (Llama 2-7B Q4_K_M)
| GPU | Speed | vs CPU | Cost |
|---|---|---|---|
| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
Supported models
LLaMA family:
- Llama 2 (7B, 13B, 70B)
- Llama 3 (8B, 70B, 405B)
- Code Llama
Mistral family:
- Mistral 7B
- Mixtral 8x7B, 8x22B
Other:
- Falcon, BLOOM, GPT-J
- Phi-3, Gemma, Qwen
- LLaVA (vision), Whisper (audio)
Find models: https://huggingface.co/models?library=gguf
References
- Quantization Guide - GGUF formats, conversion, quality comparison
- Server Deployment - API endpoints, Docker, monitoring
- Optimization - Performance tuning, hybrid CPU+GPU
Resources
- GitHub: https://github.com/ggerganov/llama.cpp
- Models: https://huggingface.co/models?library=gguf
- Discord: https://discord.gg/llama-cpp
More by zechenzhangAGI
View all →You might also like
flutter-development
aj-geddes
Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.
drawio-diagrams-enhanced
jgtolentino
Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.
godot
bfollington
This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.
nano-banana-pro
garg-aayush
Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.
ui-ux-pro-max
nextlevelbuilder
"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."
rust-coding-skill
UtakataKyosui
Guides Claude in writing idiomatic, efficient, well-structured Rust code using proper data modeling, traits, impl organization, macros, and build-speed best practices.
Stay ahead of the MCP ecosystem
Get weekly updates on new skills and servers.