axiom-vision-ref

Vision framework API, VNDetectHumanHandPoseRequest, VNDetectHumanBodyPoseRequest, person segmentation, face detection, VNImageRequestHandler, recognized points, joint landmarks, VNRecognizeTextRequest, VNDetectBarcodesRequest, DataScannerViewController, VNDocumentCameraViewController, RecognizeDocumentsRequest

Install

mkdir -p .claude/skills/axiom-vision-ref && curl -L -o skill.zip "https://mcp.directory/api/skills/download/4362" && unzip -o skill.zip -d .claude/skills/axiom-vision-ref && rm skill.zip

Installs to .claude/skills/axiom-vision-ref

About this skill

Vision Framework API Reference

Comprehensive reference for computer vision with the Vision framework: subject segmentation, hand/body pose detection, person detection, face analysis, text recognition (OCR), barcode detection, and document scanning.

When to Use This Reference

  • Implementing subject lifting using VisionKit or Vision
  • Detecting hand/body poses for gesture recognition or fitness apps
  • Segmenting people from backgrounds or separating multiple individuals
  • Face detection and landmarks for AR effects or authentication
  • Combining Vision APIs to solve complex computer vision problems
  • Looking up specific API signatures and parameter meanings
  • Recognizing text in images (OCR) with VNRecognizeTextRequest
  • Detecting barcodes and QR codes with VNDetectBarcodesRequest
  • Building live scanners with DataScannerViewController
  • Scanning documents with VNDocumentCameraViewController
  • Extracting structured document data with RecognizeDocumentsRequest (iOS 26+)

Related skills: See axiom-vision for decision trees and patterns, axiom-vision-diag for troubleshooting

Vision Framework Overview

Vision provides computer vision algorithms for still images and video.

Core workflow:

  1. Create request (e.g., VNDetectHumanHandPoseRequest())
  2. Create handler with image (VNImageRequestHandler(cgImage: image))
  3. Perform request (try handler.perform([request]))
  4. Access observations from request.results
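The four steps above can be sketched as one function. This is a minimal sketch, assuming the input is already a CGImage. The function names detectHandPoses and indexTipLocations and the 0.3 confidence threshold are hypothetical choices for illustration, not part of Vision.

```swift
import Vision

/// Runs the create / handle / perform / read workflow for hand poses.
func detectHandPoses(in image: CGImage) throws -> [VNHumanHandPoseObservation] {
    // 1. Create the request
    let request = VNDetectHumanHandPoseRequest()
    request.maximumHandCount = 2

    // 2. Create a handler for this one image
    let handler = VNImageRequestHandler(cgImage: image, options: [:])

    // 3. Perform the request (throws on failure)
    try handler.perform([request])

    // 4. Read the observations; nil means nothing was produced
    return request.results ?? []
}

/// Reads one joint from each observation, skipping low-confidence points.
func indexTipLocations(in image: CGImage) throws -> [CGPoint] {
    try detectHandPoses(in: image).compactMap { observation -> CGPoint? in
        let tip = try observation.recognizedPoint(.indexTip)
        // Assumed threshold; tune for your own content
        return tip.confidence > 0.3 ? tip.location : nil
    }
}
```

The returned locations are normalized with a lower-left origin, so they still need converting before you draw them in a view.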

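Coordinate system: Lower-left origin, normalized (0.0-1.0) coordinates. UIKit and SwiftUI use an upper-left origin, so flip the y-axis before drawing results over a view.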
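As an illustration of the coordinate system, the sketch below converts a normalized, lower-left-origin bounding box into a rectangle in an upper-left-origin space such as a UIKit view. The helper name upperLeftRect is hypothetical, and the sketch assumes the image fills the target size exactly (no aspect-fit letterboxing).

```swift
import Foundation

/// Converts a normalized Vision rectangle (lower-left origin) into
/// a rectangle in an upper-left-origin space of the given size.
func upperLeftRect(forNormalized box: CGRect, in size: CGSize) -> CGRect {
    let width = box.size.width * size.width
    let height = box.size.height * size.height
    let x = box.origin.x * size.width
    // Flip vertically: the box's top edge in Vision space is origin.y + height
    let y = (1.0 - box.origin.y - box.size.height) * size.height
    return CGRect(x: x, y: y, width: width, height: height)
}

let rect = upperLeftRect(
    forNormalized: CGRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25),
    in: CGSize(width: 400, height: 200)
)
print(rect.origin.x, rect.origin.y, rect.size.width, rect.size.height)
```

Here the box occupies the band from 0.5 to 0.75 measured from the bottom, so its top edge sits 0.25 of the height down from the top of the view.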

Performance: Run requests on a background queue. Vision is resource intensive and blocks the UI if performed on the main thread.
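One way to keep Vision off the main thread is a dedicated serial queue that hands results back to the main queue. This is a sketch under assumptions: the queue label, the function name detectBodyPoses, and the completion-handler shape are all hypothetical, and it is written for the Swift 5 language mode (strict concurrency checking may ask for Sendable annotations).

```swift
import Vision
import Dispatch

// Hypothetical serial queue dedicated to Vision work
let visionQueue = DispatchQueue(label: "example.vision", qos: .userInitiated)

func detectBodyPoses(
    in image: CGImage,
    completion: @escaping (Result<[VNHumanBodyPoseObservation], Error>) -> Void
) {
    visionQueue.async {
        // perform(_:) is synchronous, so it blocks only this background queue
        let result: Result<[VNHumanBodyPoseObservation], Error> = Result(catching: {
            let request = VNDetectHumanBodyPoseRequest()
            let handler = VNImageRequestHandler(cgImage: image, options: [:])
            try handler.perform([request])
            return request.results ?? []
        })
        // Hop back to the main queue before touching UI
        DispatchQueue.main.async { completion(result) }
    }
}
```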

Request Handlers

Vision provides two request handlers for different scenarios.

VNImageRequestHandler

Analyzes a single image. Initialize it with the image, perform requests against it, then discard it.

let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request1, request2])  // Multiple requests, one image

Initialize with: CGImage, CIImage, CVPixelBuffer, Data, or URL
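Two of the other initializers, sketched for illustration. The function name makeHandler is hypothetical. The orientation parameter tells Vision how the pixels are rotated relative to upright; which value is correct depends on your capture setup, so it is passed in rather than assumed here.

```swift
import Vision
import ImageIO

/// Handler for an image file on disk.
func makeHandler(forFileAt url: URL) -> VNImageRequestHandler {
    VNImageRequestHandler(url: url, options: [:])
}

/// Handler for a pixel buffer whose contents are not upright,
/// for example a frame delivered by a camera.
func makeHandler(
    for pixelBuffer: CVPixelBuffer,
    orientation: CGImagePropertyOrientation
) -> VNImageRequestHandler {
    VNImageRequestHandler(
        cvPixelBuffer: pixelBuffer,
        orientation: orientation,
        options: [:]
    )
}
```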

Rule: One handler per image. Reusing a handler with a different image is unsupported.

VNSequenceRequestHandler

Analyzes a sequence of frames (video, camera feed). Initialize empty, pass each frame to perform(). Maintains inter-frame state for temporal smoothing.

let sequenceHandler = VNSequenceRequestHandler()

// In your camera/video frame callback:
func processFrame(_ pixelBuffer: CVPixelBuffer) throws {
    try sequenceHandler.perform([request], on: pixelBuffer)
}

Rule: Create once, reuse across frames. The handler tracks state between calls.
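In a camera pipeline, the "create once, reuse" rule usually means storing the handler on the object that receives frames. The sketch below assumes an AVCaptureVideoDataOutput has already been configured with this object as its sample buffer delegate. The class name PoseFrameProcessor and the onPoses callback are hypothetical.

```swift
import AVFoundation
import Vision

final class PoseFrameProcessor: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    // Created once and reused for every frame
    private let sequenceHandler = VNSequenceRequestHandler()
    private let poseRequest = VNDetectHumanBodyPoseRequest()

    /// Hypothetical callback; invoked on the capture output's delegate queue.
    var onPoses: (([VNHumanBodyPoseObservation]) -> Void)?

    func captureOutput(
        _ output: AVCaptureOutput,
        didOutput sampleBuffer: CMSampleBuffer,
        from connection: AVCaptureConnection
    ) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        do {
            try sequenceHandler.perform([poseRequest], on: pixelBuffer)
            onPoses?(poseRequest.results ?? [])
        } catch {
            print("Pose request failed: \(error)")
        }
    }
}
```

The delegate queue passed to setSampleBufferDelegate(_:queue:) should be a background queue, which also satisfies the performance guidance for Vision requests.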

When to Use Which

Use Case | Handler
Single photo or screenshot | VNImageRequestHandler
Video stream or camera frames | VNSequenceRequestHandler
Temporal smoothing (pose, segmentation) | VNSequenceRequestHandler
One-off analysis of a CVPixelBuffer | VNImageRequestHandler

Requests That Benefit from Sequence Handling

These requests use inter-frame state when run through VNSequenceRequestHandler:

  • VNDetectHumanBodyPoseRequest — Smoother joint tracking
  • VNDetectHumanHandPoseRequest — Smoother landmark tracking
  • VNGeneratePersonSegmentationRequest — Temporally consistent masks
  • VNGeneratePersonInstanceMaskRequest — Stable person identity across frames
  • VNDetectDocumentSegmentationRequest — Stable document edges
  • Any VNStatefulRequest subclass — Designed for sequences

Common Mistake

Creating a new VNImageRequestHandler per video frame discards temporal context. Pose landmarks jitter, segmentation masks flicker, and you lose the smoothing that sequence handling provides.

// Wrong — loses temporal context every frame
func processFrame(_ buffer: CVPixelBuffer) throws {
    let handler = VNImageRequestHandler(cvPixelBuffer: buffer)
    try handler.perform([poseRequest])
}

// Right — maintains inter-frame state
let sequenceHandler = VNSequenceRequestHandler()
func processFrame(_ buffer: CVPixelBuffer) throws {
    try sequenceHandler.perform([poseRequest], on: buffer)
}

Subject Segmentation APIs

VNGenerateForegroundInstanceMaskRequest

Availability: iOS 17+, macOS 14+, tvOS 17+, visionOS 1+

Generates a class-agnostic instance mask of foreground objects (people, pets, buildings, food, shoes, etc.).

Basic Usage

let request = VNGenerateForegroundInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)

try handler.perform([request])

guard let observation = request.results?.first as? VNInstanceMaskObservation else {
    return
}

InstanceMaskObservation

allInstances: IndexSet containing all foreground instance indices (excludes background 0)

instanceMask: CVPixelBuffer with UInt8 labels (0 = background, 1+ = instance indices)
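To make the label layout concrete, here is a small sketch that works on a plain array of labels instead of a CVPixelBuffer. The function name instanceIndices and the sample data are made up for illustration. It shows how a set of instance indices that excludes background 0 relates to the raw labels.

```swift
import Foundation

/// Collects the non-background labels found in a label array.
func instanceIndices(inLabels labels: [UInt8]) -> IndexSet {
    var indices = IndexSet()
    for label in labels where label != 0 {
        indices.insert(Int(label))  // 0 is background, so it is skipped
    }
    return indices
}

// A made-up 4x3 mask: instance 1 on the right, instance 2 in the middle
let labels: [UInt8] = [
    0, 0, 1, 1,
    0, 2, 2, 1,
    0, 2, 2, 0,
]
let found = instanceIndices(inLabels: labels)
print(found.count)
```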

instanceAtPoint(_:): Returns instance index at normalized point

let point = CGPoint(x: 0.5, y: 0.5)  // Center of image
let instance = observation.instanceAtPoint(point)

if instance == 0 {
    print("Background tapped")
} else {
    print("Instance \(instance) tapped")
}

Generating Masks

createScaledMask(for:croppedToInstancesContent:)

Parameters:

  • for: IndexSet of instances to include
  • croppedToInstancesContent:
    • false = Output matches input resolution (for compositing)
    • true = Tight crop around selected instances

Returns: Single-channel floating-point CVPixelBuffer (soft segmentation mask)

// All instances, full resolution
let mask = try observation.createScaledMask(
    for: observation.allInstances,
    croppedToInstancesContent: false
)

// Single instance, cropped
let instances = IndexSet(integer: 1)
let croppedMask = try observation.createScaledMask(
    for: instances,
    croppedToInstancesContent: true
)
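A soft mask like this is typically used as the mask input of a Core Image blend. The sketch below uses the built-in blendWithMask filter. The function name composite is hypothetical, and it assumes the mask, subject, and background all share the same extent (which is the case for a mask generated at input resolution).

```swift
import CoreImage
import CoreImage.CIFilterBuiltins

/// Keeps `subject` where the mask is white and shows `background` elsewhere.
func composite(
    subject: CIImage,
    mask: CVPixelBuffer,
    over background: CIImage
) -> CIImage? {
    let filter = CIFilter.blendWithMask()
    filter.inputImage = subject
    filter.maskImage = CIImage(cvPixelBuffer: mask)
    filter.backgroundImage = background
    return filter.outputImage
}
```

Because the mask is floating-point rather than binary, edges blend smoothly instead of producing a hard cutout.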

Instance Mask Hit Testing

Access raw pixel buffer to map tap coordinates to instance labels:

let instanceMask = observation.instanceMask

CVPixelBufferLockBaseAddress(instanceMask, .readOnly)
defer { CVPixelBufferUnlockBaseAddress(instanceMask, .readOnly) }

let baseAddress = CVPixelBufferGetBaseAddress(instanceMask)
let width = CVPixelBufferGetWidth(instanceMask)
let height = CVPixelBufferGetHeight(instanceMask)
let bytesPerRow = CVPixelBufferGetBytesPerRow(instanceMask)

// Convert normalized tap to pixel coordinates in the mask (not the source image).
// Buffer rows run top to bottom, so normalizedY is assumed to be measured from the top.
let pixelPoint = VNImagePointForNormalizedPoint(
    CGPoint(x: normalizedX, y: normalizedY),
    width,
    height
)

// Clamp so a tap on the right or bottom edge stays inside the buffer
let column = min(max(Int(pixelPoint.x), 0), width - 1)
let row = min(max(Int(pixelPoint.y), 0), height - 1)

// Calculate byte offset (one byte per pixel; rows may be padded)
let offset = row * bytesPerRow + column

// Read instance label
let label = UnsafeRawPointer(baseAddress!).load(
    fromByteOffset: offset,
    as: UInt8.self
)

// A background tap (label 0) selects every instance
let instances = label == 0 ? observation.allInstances : IndexSet(integer: Int(label))
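The offset arithmetic is easier to check on a plain byte array. The sketch below mirrors the lookup with a made-up 3x2 mask whose rows are padded to 4 bytes, which is why the offset uses bytesPerRow rather than width. The function name instanceLabel and the sample bytes are hypothetical, and y is assumed to be measured from the top row.

```swift
/// Looks up the label under a normalized point in a row-padded byte array.
func instanceLabel(
    atNormalizedX x: Double,
    y: Double,
    in bytes: [UInt8],
    width: Int,
    height: Int,
    bytesPerRow: Int
) -> UInt8 {
    // Clamp so x == 1.0 or y == 1.0 maps to the last column or row
    let column = min(max(Int(x * Double(width)), 0), width - 1)
    let row = min(max(Int(y * Double(height)), 0), height - 1)
    return bytes[row * bytesPerRow + column]
}

// 3 pixels wide, 2 rows, 4 bytes per row (255 marks padding)
let maskBytes: [UInt8] = [
    0, 1, 1, 255,
    2, 2, 0, 255,
]

let top = instanceLabel(atNormalizedX: 0.5, y: 0.25, in: maskBytes, width: 3, height: 2, bytesPerRow: 4)
let bottomLeft = instanceLabel(atNormalizedX: 0.1, y: 0.9, in: maskBytes, width: 3, height: 2, bytesPerRow: 4)
let corner = instanceLabel(atNormalizedX: 1.0, y: 1.0, in: maskBytes, width: 3, height: 2, bytesPerRow: 4)
print(top, bottomLeft, corner)  // prints "1 2 0"
```

Indexing with row * width instead would read the padding byte for any row after the first, which is a common source of wrong labels.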

VisionKit Subject Lifting

ImageAnalysisInteraction (iOS)

Availability: iOS 16+, iPadOS 16+

Adds system-like subject lifting UI to views:

let interaction = ImageAnalysisInteraction()
interaction.preferredInteractionTypes = .imageSubject  // Or .automatic
imageView.addInteraction(interaction)

Interaction types:

  • .automatic: Subject lifting + Live Text + data detectors
  • .imageSubject: Subject lifting only (no interactive text); requires iOS 17+
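The interaction shows nothing until it is given an analysis of the displayed image. The sketch below wires the pieces together in a view controller. The class name SubjectLiftingViewController and the show(_:) method are hypothetical, the configuration mirrors the one used in this reference, and it assumes a deployment target where .imageSubject is available.

```swift
import UIKit
import VisionKit

final class SubjectLiftingViewController: UIViewController {
    private let imageView = UIImageView()
    private let interaction = ImageAnalysisInteraction()
    private let analyzer = ImageAnalyzer()

    override func viewDidLoad() {
        super.viewDidLoad()
        imageView.frame = view.bounds
        imageView.contentMode = .scaleAspectFit
        // UIImageView ignores touches by default, which would disable the interaction
        imageView.isUserInteractionEnabled = true
        view.addSubview(imageView)

        interaction.preferredInteractionTypes = .imageSubject
        imageView.addInteraction(interaction)
    }

    /// Hypothetical entry point: display an image and analyze it.
    func show(_ image: UIImage) {
        imageView.image = image
        guard ImageAnalyzer.isSupported else { return }
        Task {
            do {
                let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])
                interaction.analysis = try await analyzer.analyze(image, configuration: configuration)
            } catch {
                print("Image analysis failed: \(error)")
            }
        }
    }
}
```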

ImageAnalysisOverlayView (macOS)

Availability: macOS 13+

let overlayView = ImageAnalysisOverlayView()
overlayView.preferredInteractionTypes = .imageSubject
nsView.addSubview(overlayView)

Programmatic Access

ImageAnalyzer

let analyzer = ImageAnalyzer()
let configuration = ImageAnalyzer.Configuration([.text, .visualLookUp])

let analysis = try await analyzer.analyze(image, configuration: configuration)

ImageAnalysis

subjects: [Subject] - All subjects in image

highlightedSubjects: Set<Subject> - Currently highlighted (user long-pressed)

subject(at:): Async lookup of subject at normalized point (returns nil if none)

// Get all subjects
let subjects = analysis.subjects

// Look up subject at tap
if let subject = try await analysis.subject(at: tapPoint) {
    // Process subject
}

// Change highlight state
analysis.highlightedSubjects = Set([subjects[0], subjects[1]])

Subject Struct

image: UIImage/NSImage - Extracted subject with transparency

bounds: CGRect - Subject boundaries in image coordinates

// Single subject image
let subjectImage = subject.image

// Composite multiple subjects
let compositeImage = try await analysis.image(for: [subject1, subject2])

Out-of-process: VisionKit runs its analysis in a separate process. This keeps the work out of your app's process, but it limits the size of the images that can be analyzed.

Person Segmentation APIs

VNGeneratePersonSegmentationRequest

Availability: iOS 15+, macOS 12+

Returns a single mask containing all people in the image:

let request = VNGeneratePersonSegmentationRequest()
request.qualityLevel = .balanced  // .fast, .balanced, or .accurate
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

guard let observation = request.results?.first as? VNPixelBufferObservation else {
    return
}

let personMask = observation.pixelBuffer  // CVPixelBuffer
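The person mask may come back at a different resolution than the source image, so it usually needs scaling before compositing. A minimal sketch with Core Image follows. The function name scaledMask is hypothetical, and it assumes both images have extents that start at the origin.

```swift
import CoreImage

/// Scales a mask buffer so its extent matches the given image.
func scaledMask(_ mask: CVPixelBuffer, toMatch image: CIImage) -> CIImage {
    let maskImage = CIImage(cvPixelBuffer: mask)
    let scaleX = image.extent.width / maskImage.extent.width
    let scaleY = image.extent.height / maskImage.extent.height
    return maskImage.transformed(by: CGAffineTransform(scaleX: scaleX, y: scaleY))
}
```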

VNGeneratePersonInstanceMaskRequest

Availability: iOS 17+, macOS 14+

Returns separate masks for up to 4 people:

let request = VNGeneratePersonInstanceMaskRequest()
let handler = VNImageRequestHandler(cgImage: image)
try handler.perform([request])

guard let observation = request.results?.first as? VNInstanceMaskObservation else {
    return
}

// Same InstanceMaskObservation API as foreground instance masks
let allPeople = observation.allInstances  // Up to 4 people (1-4)

// Get mask for person 1
let person1Mask = try observation.createScaledMask(
    for: IndexSet(integer: 1),
    croppedToInstancesContent: false
)

Limitations:

  • Segments up to 4 people
  • With more than 4 people, it may miss some people or merge several into one instance (typically people in the background)
  • Use VNDetectFaceRectanglesRequest to count faces if you need to handle crowded scenes
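The face-count check can be sketched as a pre-flight step that decides which segmentation request to run. The names maxSegmentablePeople, shouldUseInstanceMasks, and countFaces are hypothetical. Counting faces is only a proxy for counting people, since someone facing away from the camera will not be counted.

```swift
import Vision

// The per-person request segments at most this many people
let maxSegmentablePeople = 4

/// Decides whether per-person masks are likely to be reliable.
func shouldUseInstanceMasks(faceCount: Int) -> Bool {
    faceCount <= maxSegmentablePeople
}

/// Counts detected faces as a rough estimate of how many people are present.
func countFaces(in image: CGImage) throws -> Int {
    let request = VNDetectFaceRectanglesRequest()
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])
    return request.results?.count ?? 0
}
```

When the check fails, falling back to VNGeneratePersonSegmentationRequest gives one combined mask for everyone in the frame.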

Hand Pose Detection

VNDet


Content truncated.
