coreml
Use when deploying custom ML models on-device, converting PyTorch models, compressing models, implementing LLM inference, or optimizing CoreML performance. Covers model conversion, compression, stateful models, KV-cache, multi-function models, MLTensor.
Install
mkdir -p .claude/skills/coreml && curl -L -o skill.zip "https://mcp.directory/api/skills/download/3649" && unzip -o skill.zip -d .claude/skills/coreml && rm skill.zip
Installs to .claude/skills/coreml
About this skill
CoreML On-Device Machine Learning
Overview
CoreML enables on-device machine learning inference across all Apple platforms. It abstracts hardware details while leveraging Apple Silicon's CPU, GPU, and Neural Engine for high-performance, private, and efficient execution.
Key principle: Start with the simplest approach, then optimize based on profiling. Don't over-engineer compression or caching until you have real performance data.
Decision Tree - CoreML vs Foundation Models
Need on-device ML?
├─ Text generation (LLM)?
│  ├─ Simple prompts, structured output? → Foundation Models (ios-ai skill)
│  └─ Custom model, fine-tuned, specific architecture? → CoreML
├─ Custom trained model?
│  └─ Yes → CoreML
├─ Image/audio/sensor processing?
│  └─ Yes → CoreML
└─ Apple's built-in intelligence?
   └─ Yes → Foundation Models (ios-ai skill)
Red Flags
Use this skill when you see:
- "Convert PyTorch model to CoreML"
- "Model too large for device"
- "Slow inference performance"
- "LLM on-device"
- "KV-cache" or "stateful model"
- "Model compression" or "quantization"
- MLModel, MLTensor, or coremltools in context
Pattern 1 - Basic Model Conversion
The standard PyTorch → CoreML workflow.
import coremltools as ct
import torch

# Trace the model
model.eval()
traced_model = torch.jit.trace(model, example_input)

# Convert to CoreML
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS18
)

# Save
mlmodel.save("MyModel.mlpackage")
Critical: Always set minimum_deployment_target to enable the latest optimizations.
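To sanity-check a conversion, you can run the converted model directly from Python (macOS only) and compare it against the PyTorch output. A minimal sketch; the "input_1" and "output" feature names are placeholders, so check the model spec or the Xcode preview for the real ones:

import numpy as np

# Placeholder feature names -- inspect mlmodel.get_spec() for the actual ones
coreml_out = mlmodel.predict({"input_1": example_input.numpy()})
torch_out = model(example_input).detach().numpy()
print(np.abs(coreml_out["output"] - torch_out).max())  # should be small (fp16 tolerance)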
Pattern 2 - Model Compression (Post-Training)
Three techniques, each with different tradeoffs:
Palettization (Best for Neural Engine)
Clusters weights into lookup tables. Use per-grouped-channel for better accuracy.
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights
)

# 4-bit with grouped channels (iOS 18+)
op_config = OpPalettizerConfig(
    mode="kmeans",
    nbits=4,
    granularity="per_grouped_channel",
    group_size=16
)
config = OptimizationConfig(global_config=op_config)
compressed_model = palettize_weights(model, config)
| Bits | Compression | Accuracy Impact |
|---|---|---|
| 8-bit | 2x | Minimal |
| 6-bit | 2.7x | Low |
| 4-bit | 4x | Moderate (use grouped channels) |
| 2-bit | 8x | High (requires training-time) |
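To verify these ratios on a real model, compare on-disk package sizes before and after compression. A minimal sketch, assuming both models are saved as .mlpackage bundles:

import pathlib

def package_size_mb(path):
    # An .mlpackage is a directory, so sum every file inside it
    return sum(f.stat().st_size for f in pathlib.Path(path).rglob("*") if f.is_file()) / 1e6

model.save("Baseline.mlpackage")
compressed_model.save("Palettized4bit.mlpackage")
print(package_size_mb("Baseline.mlpackage"), "->", package_size_mb("Palettized4bit.mlpackage"))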
Quantization (Best for GPU on Mac)
Linear mapping to INT8/INT4. Use per-block for better accuracy.
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights
)

# INT4 per-block quantization (iOS 18+)
op_config = OpLinearQuantizerConfig(
    mode="linear",
    dtype="int4",
    granularity="per_block",
    block_size=32
)
config = OptimizationConfig(global_config=op_config)
compressed_model = linear_quantize_weights(model, config)
Pruning (Combine with other techniques)
Sets low-magnitude weights to zero for a sparse representation. Can be combined with palettization (see the sketch after the code below).
from coremltools.optimize.coreml import (
    OpMagnitudePrunerConfig,
    OptimizationConfig,
    prune_weights
)

op_config = OpMagnitudePrunerConfig(
    target_sparsity=0.4  # 40% zeros
)
config = OptimizationConfig(global_config=op_config)
sparse_model = prune_weights(model, config)
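Because pruning composes with the other techniques, one option is to palettize the pruned model afterwards. A hedged sketch reusing sparse_model from above; the joint_compression flag is assumed from newer coremltools releases, so drop it if your version rejects it:

from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights
)

# Cluster the remaining nonzero weights of the pruned model
palettize_config = OptimizationConfig(
    global_config=OpPalettizerConfig(mode="kmeans", nbits=4)
)
sparse_palettized = palettize_weights(
    sparse_model, palettize_config, joint_compression=True
)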
Pattern 3 - Training-Time Compression
When post-training compression loses too much accuracy, fine-tune with compression.
from coremltools.optimize.torch.palettization import (
    DKMPalettizerConfig,
    DKMPalettizer
)

# Configure 4-bit palettization
config = DKMPalettizerConfig(global_config={"n_bits": 4})

# Prepare model
palettizer = DKMPalettizer(model, config)
prepared_model = palettizer.prepare()

# Fine-tune (your training loop)
for epoch in range(num_epochs):
    train_epoch(prepared_model, data_loader)
    palettizer.step()

# Finalize
final_model = palettizer.finalize()
Tradeoff: Better accuracy than post-training, but requires training data and time.
Pattern 4 - Calibration-Based Compression (iOS 18+)
Middle ground: uses calibration data without full training.
from coremltools.optimize.torch.pruning import (
    MagnitudePrunerConfig,
    LayerwiseCompressor
)

# Configure
config = MagnitudePrunerConfig(
    target_sparsity=0.4,
    n_samples=128  # Calibration samples
)

# Create pruner
pruner = LayerwiseCompressor(model, config)

# Calibrate
sparse_model = pruner.compress(calibration_data_loader)
Pattern 5 - Stateful Models (KV-Cache for LLMs)
For transformer models, use state to avoid recomputing key/value vectors.
PyTorch Model with State
import torch
import torch.nn as nn

class StatefulLLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Register state buffers (shapes are illustrative)
        self.register_buffer("keyCache", torch.zeros(batch, heads, seq_len, dim))
        self.register_buffer("valueCache", torch.zeros(batch, heads, seq_len, dim))

    def forward(self, input_ids, causal_mask):
        # Update caches in-place during forward
        # ... attention with KV-cache ...
        return logits
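A sketch of what the elided in-place update might look like for a single layer. new_keys, new_values, and position are hypothetical names for the freshly projected K/V tensors and the current write offset:

# Hypothetical in-place cache write for the current step
seq = new_keys.shape[2]
self.keyCache[:, :, position:position + seq, :] = new_keys
self.valueCache[:, :, position:position + seq, :] = new_values

# Attention then reads only the filled prefix of the cache
keys = self.keyCache[:, :, :position + seq, :]
values = self.valueCache[:, :, :position + seq, :]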
Conversion with State
import coremltools as ct

mlmodel = ct.convert(
    traced_model,
    inputs=[
        ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 2048))),
        ct.TensorType(name="causal_mask", shape=(1, 1, ct.RangeDim(1, 2048), ct.RangeDim(1, 2048)))
    ],
    states=[
        ct.StateType(name="keyCache", ...),
        ct.StateType(name="valueCache", ...)
    ],
    minimum_deployment_target=ct.target.iOS18
)
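The elided StateType arguments wrap a tensor type describing each buffer. A sketch with illustrative dimensions (batch 1, 32 heads, 2048 positions, head dim 128); the shapes must match the buffers registered in the PyTorch module:

states=[
    ct.StateType(
        wrapped_type=ct.TensorType(shape=(1, 32, 2048, 128)),
        name="keyCache"
    ),
    ct.StateType(
        wrapped_type=ct.TensorType(shape=(1, 32, 2048, 128)),
        name="valueCache"
    )
]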
Using State at Runtime
// Create state from model
let state = model.makeState()
// Run prediction with state (updated in-place)
let output = try model.prediction(from: input, using: state)
Performance: 1.6x speedup on Mistral-7B (M3 Max) compared to manual KV-cache I/O.
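Putting the pieces together, a hedged sketch of a greedy decode loop. makeInput(token:) and argmaxToken(from:) are hypothetical helpers, since the real feature names depend on your converted model:

// The state carries the KV-cache across calls, so each step feeds only the newest token
let state = model.makeState()
var token = bosToken  // hypothetical start-of-sequence token id

for _ in 0..<maxNewTokens {
    let input = try makeInput(token: token)               // builds an MLFeatureProvider
    let output = try model.prediction(from: input, using: state)
    token = argmaxToken(from: output)                     // reads the logits feature
    if token == eosToken { break }
}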
Pattern 6 - Multi-Function Models (Adapters/LoRA)
Deploy multiple adapters in a single model, sharing base weights.
import coremltools as ct
from coremltools.models import MultiFunctionDescriptor
from coremltools.models.utils import save_multifunction

# Convert individual models
sticker_model = ct.convert(sticker_adapter_model, ...)
storybook_model = ct.convert(storybook_adapter_model, ...)

# Save individually
sticker_model.save("sticker.mlpackage")
storybook_model.save("storybook.mlpackage")

# Merge with shared weights: each entry names a source package, the function
# to pull from it, and the function name exposed by the merged model
desc = MultiFunctionDescriptor()
desc.add_function("sticker.mlpackage", src_function_name="main", target_function_name="sticker")
desc.add_function("storybook.mlpackage", src_function_name="main", target_function_name="storybook")
save_multifunction(desc, "MultiAdapter.mlpackage")
Loading a Specific Function
let config = MLModelConfiguration()
config.functionName = "sticker" // or "storybook"
let model = try MLModel(contentsOf: modelURL, configuration: config)
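The same configuration also works with the async loader, which keeps launch paths off the main thread (modelURL as above):

let config = MLModelConfiguration()
config.functionName = "storybook"

// Async load (iOS 16+/macOS 13+); pairs with the lazy-loading pattern below
let model = try await MLModel.load(contentsOf: modelURL, configuration: config)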
Pattern 7 - MLTensor for Pipeline Stitching (iOS 18+)
Simplifies the glue computation that runs between model executions, such as decoding and post-processing.
import CoreML
// Create tensors
let scores = MLTensor(shape: [1, vocabSize], scalars: logits)
// Operations (executed asynchronously on Apple Silicon)
let topK = scores.topK(k: 10)
let probs = (topK.values / temperature).softmax()
// Sample from distribution
let sampled = probs.multinomial(numSamples: 1)
// Materialize to access data (blocks until complete)
let shapedArray = await sampled.shapedArray(of: Int32.self)
Key insight: MLTensor operations are async. Call shapedArray() to materialize results.
Pattern 8 - Async Prediction for Concurrency
Thread-safe concurrent predictions for throughput.
class ImageProcessor {
    let model: MLModel

    func processImages(_ images: [CGImage]) async throws -> [Output] {
        try await withThrowingTaskGroup(of: Output.self) { group in
            for image in images {
                group.addTask {
                    // Check cancellation before expensive work
                    try Task.checkCancellation()
                    let input = try self.prepareInput(image)
                    // Async prediction - thread safe!
                    return try await self.model.prediction(from: input)
                }
            }
            return try await group.reduce(into: []) { $0.append($1) }
        }
    }
}
Warning: Limit concurrent predictions to avoid memory pressure from multiple input/output buffers.
// Limit concurrency
let semaphore = AsyncSemaphore(value: 2)

for image in images {
    group.addTask {
        await semaphore.wait()
        defer { semaphore.signal() }
        return try await process(image)
    }
}
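AsyncSemaphore is not a standard library type. A minimal sketch of one possible implementation, using a lock-protected permit count and queued continuations:

import Foundation

// Illustrative counting semaphore for Swift concurrency (not production-hardened)
final class AsyncSemaphore: @unchecked Sendable {
    private let lock = NSLock()
    private var permits: Int
    private var waiters: [CheckedContinuation<Void, Never>] = []

    init(value: Int) { permits = value }

    func wait() async {
        await withCheckedContinuation { continuation in
            lock.lock()
            if permits > 0 {
                permits -= 1
                lock.unlock()
                continuation.resume()        // permit available: continue immediately
            } else {
                waiters.append(continuation) // otherwise queue until signal()
                lock.unlock()
            }
        }
    }

    func signal() {
        lock.lock()
        if waiters.isEmpty {
            permits += 1
            lock.unlock()
        } else {
            let next = waiters.removeFirst()
            lock.unlock()
            next.resume()
        }
    }
}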
Anti-Patterns
Don't - Load models on main thread at launch
// BAD - blocks UI
class AppDelegate {
    let model = try! MLModel(contentsOf: url) // Blocks!
}

// GOOD - lazy async loading
class ModelManager {
    private var model: MLModel?

    func getModel() async throws -> MLModel {
        if let model { return model }
        model = try await Task.detached {
            try MLModel(contentsOf: url)
        }.value
        return model!
    }
}
Don't - Reload model for each prediction
// BAD - reloads every time
func predict(_ input: Input) throws -> Output {
    let model = try MLModel(contentsOf: url) // Expensive!
    return try model.prediction(from: input)
}

// GOOD - keep model loaded
class Predictor {
    private let model: MLModel

    func predict(_ input: Input) throws -> Output {
        try model.prediction(from: input)
    }
}
Don't - Compress without profiling first
# BAD - blind compression
compressed = palettize_weights(model, two_bit_config)  # May break accuracy!

# GOOD - profile, then compress iteratively
# 1. Profile the Float16 baseline
# 2. Try 8-bit → check accuracy
# 3. Try 6-bit → check accuracy
# 4. Try 4-bit with grouped channels → check accuracy
# 5. Only drop to 2-bit if you can use training-time compression