axiom-vision-ref
Vision framework API, VNDetectHumanHandPoseRequest, VNDetectHumanBodyPoseRequest, person segmentation, face detection, VNImageRequestHandler, recognized points, joint landmarks, VNRecognizeTextRequest, VNDetectBarcodesRequest, DataScannerViewController, VNDocumentCameraViewController, RecognizeDocumentsRequest
Install
mkdir -p .claude/skills/axiom-vision-ref && curl -L -o skill.zip "https://mcp.directory/api/skills/download/4362" && unzip -o skill.zip -d .claude/skills/axiom-vision-ref && rm skill.zip
Installs to .claude/skills/axiom-vision-ref
About this skill
Vision Framework API Reference
Comprehensive reference for Vision framework computer vision: subject segmentation, hand/body pose detection, person detection, face analysis, text recognition (OCR), barcode detection, and document scanning.
When to Use This Reference
- Implementing subject lifting using VisionKit or Vision
- Detecting hand/body poses for gesture recognition or fitness apps
- Segmenting people from backgrounds or separating multiple individuals
- Face detection and landmarks for AR effects or authentication
- Combining Vision APIs to solve complex computer vision problems
- Looking up specific API signatures and parameter meanings
- Recognizing text in images (OCR) with VNRecognizeTextRequest
- Detecting barcodes and QR codes with VNDetectBarcodesRequest
- Building live scanners with DataScannerViewController
- Scanning documents with VNDocumentCameraViewController
- Extracting structured document data with RecognizeDocumentsRequest (iOS 26+)
Related skills: See axiom-vision for decision trees and patterns, axiom-vision-diag for troubleshooting
Vision Framework Overview
Vision provides computer vision algorithms for still images and video:
Core workflow:
- Create a request (e.g., VNDetectHumanHandPoseRequest())
- Create a handler with the image (VNImageRequestHandler(cgImage: image))
- Perform the request (try handler.perform([request]))
- Access observations from request.results
Coordinate system: Lower-left origin, normalized (0.0-1.0) coordinates
Performance: Run requests on a background queue - Vision is resource-intensive and will block the UI if performed on the main thread
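Both conventions trip people up in practice. A minimal sketch (helper and queue names are illustrative, and `image`/`request` are assumed to exist) that flips Vision's lower-left-origin normalized coordinates into a top-left pixel space and keeps request work off the main thread:

```swift
import Vision

// Convert a normalized Vision point (lower-left origin, 0.0-1.0)
// into pixel coordinates with a top-left origin, e.g. for UIKit drawing.
func topLeftPixelPoint(_ visionPoint: CGPoint, imageWidth: Int, imageHeight: Int) -> CGPoint {
    let p = VNImagePointForNormalizedPoint(visionPoint, imageWidth, imageHeight)
    return CGPoint(x: p.x, y: CGFloat(imageHeight) - p.y) // flip the y-axis
}

// Keep Vision work off the main thread; hop back only for UI updates.
let visionQueue = DispatchQueue(label: "vision.requests", qos: .userInitiated)
visionQueue.async {
    let handler = VNImageRequestHandler(cgImage: image)
    try? handler.perform([request])
    DispatchQueue.main.async {
        // Update UI from request.results here
    }
}
```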
Request Handlers
Vision provides two request handlers for different scenarios.
VNImageRequestHandler
Analyzes a single image. Initialize with the image, perform requests against it, discard.
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request1, request2]) // Multiple requests, one image
Initialize with: CGImage, CIImage, CVPixelBuffer, Data, or URL
Rule: One handler per image. Reusing a handler with a different image is unsupported.
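The other initializers follow the same shape. A brief sketch (the file URL and orientation value are illustrative) showing URL- and pixel-buffer-based handlers, including the orientation parameter that camera pipelines usually need:

```swift
import Vision
import ImageIO

// From a file on disk (Data and CIImage initializers look the same).
let urlHandler = VNImageRequestHandler(url: imageURL, options: [:])

// From a camera frame: pass the capture orientation, or detection
// accuracy suffers on rotated buffers.
let bufferHandler = VNImageRequestHandler(
    cvPixelBuffer: pixelBuffer,
    orientation: .right, // e.g., portrait back camera
    options: [:]
)
```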
VNSequenceRequestHandler
Analyzes a sequence of frames (video, camera feed). Initialize empty, pass each frame to perform(). Maintains inter-frame state for temporal smoothing.
let sequenceHandler = VNSequenceRequestHandler()
// In your camera/video frame callback:
func processFrame(_ pixelBuffer: CVPixelBuffer) throws {
try sequenceHandler.perform([request], on: pixelBuffer)
}
Rule: Create once, reuse across frames. The handler tracks state between calls.
When to Use Which
| Use Case | Handler |
|---|---|
| Single photo or screenshot | VNImageRequestHandler |
| Video stream or camera frames | VNSequenceRequestHandler |
| Temporal smoothing (pose, segmentation) | VNSequenceRequestHandler |
| One-off analysis of a CVPixelBuffer | VNImageRequestHandler |
Requests That Benefit from Sequence Handling
These requests use inter-frame state when run through VNSequenceRequestHandler:
- VNDetectHumanBodyPoseRequest — Smoother joint tracking
- VNDetectHumanHandPoseRequest — Smoother landmark tracking
- VNGeneratePersonSegmentationRequest — Temporally consistent masks
- VNGeneratePersonInstanceMaskRequest — Stable person identity across frames
- VNDetectDocumentSegmentationRequest — Stable document edges
- Any VNStatefulRequest subclass — Designed for sequences
Common Mistake
Creating a new VNImageRequestHandler per video frame discards temporal context. Pose landmarks jitter, segmentation masks flicker, and you lose the smoothing that sequence handling provides.
// Wrong — loses temporal context every frame
func processFrame(_ buffer: CVPixelBuffer) throws {
let handler = VNImageRequestHandler(cvPixelBuffer: buffer)
try handler.perform([poseRequest])
}
// Right — maintains inter-frame state
let sequenceHandler = VNSequenceRequestHandler()
func processFrame(_ buffer: CVPixelBuffer) throws {
try sequenceHandler.perform([poseRequest], on: buffer)
}
Subject Segmentation APIs
VNGenerateForegroundInstanceMaskRequest
Availability: iOS 17+, macOS 14+, tvOS 17+, visionOS 1+
Generates a class-agnostic instance mask of foreground objects (people, pets, buildings, food, shoes, etc.)
Basic Usage
let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
return
}
VNInstanceMaskObservation
allInstances: IndexSet containing all foreground instance indices (excludes background 0)
instanceMask: CVPixelBuffer with UInt8 labels (0 = background, 1+ = instance indices)
instanceAtPoint(_:): Returns instance index at normalized point
let point = CGPoint(x: 0.5, y: 0.5) // Center of image
let instance = observation.instanceAtPoint(point)
if instance == 0 {
print("Background tapped")
} else {
print("Instance \(instance) tapped")
}
Generating Masks
createScaledMask(for:croppedToInstancesContent:)
Parameters:
- for: IndexSet of instances to include
- croppedToInstancesContent: false = output matches input resolution (for compositing); true = tight crop around selected instances
Returns: Single-channel floating-point CVPixelBuffer (soft segmentation mask)
// All instances, full resolution
let mask = try observation.createScaledMask(
for: observation.allInstances,
croppedToInstancesContent: false
)
// Single instance, cropped
let instances = IndexSet(integer: 1)
let croppedMask = try observation.createScaledMask(
for: instances,
croppedToInstancesContent: true
)
Instance Mask Hit Testing
Access raw pixel buffer to map tap coordinates to instance labels:
let instanceMask = observation.instanceMask
CVPixelBufferLockBaseAddress(instanceMask, .readOnly)
defer { CVPixelBufferUnlockBaseAddress(instanceMask, .readOnly) }
let baseAddress = CVPixelBufferGetBaseAddress(instanceMask)
let width = CVPixelBufferGetWidth(instanceMask)
let bytesPerRow = CVPixelBufferGetBytesPerRow(instanceMask)
// Convert normalized tap to pixel coordinates
let pixelPoint = VNImagePointForNormalizedPoint(
CGPoint(x: normalizedX, y: normalizedY),
width, // mask width, not the display image's width
CVPixelBufferGetHeight(instanceMask) // mask height
)
// Calculate byte offset
let offset = Int(pixelPoint.y) * bytesPerRow + Int(pixelPoint.x)
// Read instance label
let label = UnsafeRawPointer(baseAddress!).load(
fromByteOffset: offset,
as: UInt8.self
)
let instances = label == 0 ? observation.allInstances : IndexSet(integer: Int(label))
VisionKit Subject Lifting
ImageAnalysisInteraction (iOS)
Availability: iOS 16+, iPadOS 16+
Adds system-like subject lifting UI to views:
let interaction = ImageAnalysisInteraction()
interaction.preferredInteractionTypes = .imageSubject // Or .automatic
imageView.addInteraction(interaction)
Interaction types:
- .automatic: Subject lifting + Live Text + data detectors
- .imageSubject: Subject lifting only (no interactive text)
ImageAnalysisOverlayView (macOS)
Availability: macOS 13+
let overlayView = ImageAnalysisOverlayView()
overlayView.preferredInteractionTypes = .imageSubject
nsView.addSubview(overlayView)
Programmatic Access
ImageAnalyzer
let analyzer = ImageAnalyzer()
let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])
let analysis = try await analyzer.analyze(image, configuration: configuration)
ImageAnalysisInteraction Subject APIs (iOS 17+)
subjects: Set<Subject> - All subjects in the image (async property)
highlightedSubjects: Set<Subject> - Currently highlighted (user long-pressed)
subject(at:): Async lookup of the subject at a point (returns nil if none)
// Get all subjects
let subjects = await interaction.subjects
// Look up subject at tap
if let subject = await interaction.subject(at: tapPoint) {
// Process subject
}
// Change highlight state
interaction.highlightedSubjects = subjects
Subject Struct
image: UIImage/NSImage - Extracted subject with transparency (async throwing property)
bounds: CGRect - Subject boundaries in image coordinates
// Single subject image
let subjectImage = try await subject.image
// Composite multiple subjects
let compositeImage = try await interaction.image(for: [subject1, subject2])
Out-of-process: VisionKit analysis happens out-of-process (performance benefit, image size limited)
Person Segmentation APIs
VNGeneratePersonSegmentationRequest
Availability: iOS 15+, macOS 12+
Returns single mask containing all people in image:
let request = VNGeneratePersonSegmentationRequest()
// Configure quality level if needed
try handler.perform([request])
guard let observation = request.results?.first as? VNPixelBufferObservation else {
return
}
let personMask = observation.pixelBuffer // CVPixelBuffer
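The quality configuration hinted at above is set before performing the request. A hedged sketch of the knobs involved (the chosen values are illustrative):

```swift
import Vision

let request = VNGeneratePersonSegmentationRequest()
request.qualityLevel = .balanced // .accurate (stills), .balanced, .fast (video)
request.outputPixelFormat = kCVPixelFormatType_OneComponent8 // 8-bit mask; float formats also supported
```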
VNGeneratePersonInstanceMaskRequest
Availability: iOS 17+, macOS 14+
Returns separate masks for up to 4 people:
let request = VNGeneratePersonInstanceMaskRequest()
try handler.perform([request])
guard let observation = request.results?.first as? VNInstanceMaskObservation else {
return
}
// Same InstanceMaskObservation API as foreground instance masks
let allPeople = observation.allInstances // Up to 4 people (1-4)
// Get mask for person 1
let person1Mask = try observation.createScaledMask(
for: IndexSet(integer: 1),
croppedToInstancesContent: false
)
Limitations:
- Segments up to 4 people
- With >4 people: may miss people or combine them (typically background people)
- Use VNDetectFaceRectanglesRequest to count faces if you need to handle crowded scenes
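One way to implement that fallback, assuming a handler already exists for the image (the threshold and fallback strategy are illustrative):

```swift
import Vision

let faceRequest = VNDetectFaceRectanglesRequest()
try handler.perform([faceRequest])
let faceCount = faceRequest.results?.count ?? 0

if faceCount > 4 {
    // Crowded scene: instance masks may drop or merge people,
    // so fall back to a single combined person mask.
    let fallback = VNGeneratePersonSegmentationRequest()
    try handler.perform([fallback])
} else {
    let instanceRequest = VNGeneratePersonInstanceMaskRequest()
    try handler.perform([instanceRequest])
}
```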
Hand Pose Detection
VNDet
Content truncated.