debug-cuda-crash
Tutorial for debugging CUDA crashes using API logging
Install
mkdir -p .claude/skills/debug-cuda-crash && curl -L -o skill.zip "https://mcp.directory/api/skills/download/2297" && unzip -o skill.zip -d .claude/skills/debug-cuda-crash && rm skill.zip
Installs to .claude/skills/debug-cuda-crash
About this skill
Tutorial: Debugging CUDA Crashes with API Logging
This tutorial shows you how to debug CUDA crashes and errors in FlashInfer using the @flashinfer_api logging decorator.
Goal
When your code crashes with CUDA errors (illegal memory access, out-of-bounds, NaN/Inf), use API logging to:
- Capture input tensors BEFORE the crash occurs
- Understand what data caused the problem
- Track tensor shapes, dtypes, and values through your pipeline
- Detect numerical issues (NaN, Inf, wrong shapes)
Why Use API Logging?
Problem: CUDA errors often crash the program, leaving no debugging information.
Solution: FlashInfer's @flashinfer_api decorator logs inputs BEFORE execution, so you can see what caused the crash even after the program terminates.
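The idea behind the decorator can be sketched in a few lines of plain Python. This is a minimal illustration of the log-before-execute pattern, not FlashInfer's actual implementation:

```python
import functools

# Minimal sketch of the log-before-execute pattern (not FlashInfer's
# actual decorator): arguments are recorded BEFORE the wrapped function
# runs, so they survive even if the call crashes the process.
def log_api_call(fn, sink):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        sink.append((fn.__name__, args, kwargs))  # recorded first
        return fn(*args, **kwargs)                # may raise or crash
    return wrapper

log = []

def kernel_launch(x, scale=1.0):  # hypothetical function that crashes
    raise RuntimeError("CUDA error: an illegal memory access was encountered")

kernel_launch = log_api_call(kernel_launch, log)

try:
    kernel_launch([1, 2, 3], scale=0.5)
except RuntimeError:
    pass

print(log[0])  # the inputs that caused the crash were captured
```

Because the arguments hit the sink before the function body runs, the log entry exists even when the call never returns.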
Step 1: Enable API Logging
Basic Logging (Function Names Only)
export FLASHINFER_LOGLEVEL=1 # Log function names
export FLASHINFER_LOGDEST=stdout # Log to console
python my_script.py
Output:
[2025-12-18 10:30:45] FlashInfer API Call: batch_decode_with_padded_kv_cache
Detailed Logging (Inputs/Outputs with Metadata)
export FLASHINFER_LOGLEVEL=3 # Log inputs/outputs with metadata
export FLASHINFER_LOGDEST=debug.log # Save to file
python my_script.py
Output in debug.log:
================================================================================
[2025-12-18 10:30:45] FlashInfer API Logging - System Information
================================================================================
FlashInfer version: 0.6.0
CUDA toolkit version: 12.1
GPU 0: NVIDIA H100 PCIe
Compute capability: 9.0 (SM90)
PyTorch version: 2.1.0
================================================================================
================================================================================
[2025-12-18 10:30:46] FlashInfer API Call: batch_decode_with_padded_kv_cache
--------------------------------------------------------------------------------
Positional input arguments:
arg[0]:
Tensor(
shape=(32, 8, 128)
dtype=torch.bfloat16
device=cuda:0
requires_grad=False
is_contiguous=True
)
Keyword input arguments:
kv_cache=
Tensor(
shape=(1024, 2, 8, 128)
dtype=torch.bfloat16
device=cuda:0
requires_grad=False
is_contiguous=True
)
Full Logging (With Tensor Statistics)
export FLASHINFER_LOGLEVEL=5 # Log with min/max/mean/nan/inf
export FLASHINFER_LOGDEST=debug.log
python my_script.py
Additional output:
Tensor(
shape=(32, 8, 128)
dtype=torch.bfloat16
device=cuda:0
requires_grad=False
is_contiguous=True
min=-3.125000
max=4.250000
mean=0.015625
nan_count=0
inf_count=0
)
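The level-5 statistics are straightforward to reason about. As a pure-Python stand-in (computed over a flat list of floats rather than a real torch tensor), they amount to:

```python
import math

# Pure-Python stand-in for the level-5 statistics; a real tensor's
# values would be flattened into `values` here.
def tensor_stats(values):
    nan_count = sum(1 for v in values if math.isnan(v))
    inf_count = sum(1 for v in values if math.isinf(v))
    finite = [v for v in values if math.isfinite(v)]
    return {
        "min": min(finite),
        "max": max(finite),
        "mean": sum(finite) / len(finite),
        "nan_count": nan_count,
        "inf_count": inf_count,
    }

stats = tensor_stats([-3.125, 4.25, 0.0, float("nan"), float("inf")])
print(stats)
```

A nonzero nan_count or inf_count in the log is the signal to start walking backwards through earlier API calls.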
Step 2: Reproduce the Crash
Example: Shape Mismatch
Your code crashes with:
RuntimeError: CUDA error: an illegal memory access was encountered
Enable logging and run again:
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=crash_log.txt
python my_script.py
The log shows inputs before the crash:
[2025-12-18 10:32:15] FlashInfer API Call: batch_decode_with_padded_kv_cache
Positional input arguments:
arg[0]:
Tensor(
shape=(32, 8, 128) # Query tensor
...
)
Keyword input arguments:
kv_cache=
Tensor(
shape=(1024, 2, 8, 64) # ❌ Wrong! Should be (..., 128) not (..., 64)
...
)
Found the bug: head_dim mismatch (64 vs 128)
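Once you know which invariant was violated, you can add a pre-flight check so the same bug surfaces as a readable Python error instead of a CUDA crash. A hypothetical helper (not a FlashInfer API) for the head_dim case above:

```python
# Hypothetical pre-flight check (not a FlashInfer API): verify that the
# head_dim (last dimension) of the query and KV cache agree before the
# kernel launch.
def check_head_dim(query_shape, kv_cache_shape):
    q_dim, kv_dim = query_shape[-1], kv_cache_shape[-1]
    if q_dim != kv_dim:
        raise ValueError(
            f"head_dim mismatch: query has {q_dim}, kv_cache has {kv_dim}"
        )

check_head_dim((32, 8, 128), (1024, 2, 8, 128))  # OK
try:
    check_head_dim((32, 8, 128), (1024, 2, 8, 64))  # the buggy shapes above
except ValueError as e:
    msg = str(e)
print(msg)
```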
Step 3: Common CUDA Errors and How to Debug
Error 1: Illegal Memory Access
Error Message:
RuntimeError: CUDA error: an illegal memory access was encountered
Enable logging:
export FLASHINFER_LOGLEVEL=3
python my_script.py
What to check in logs:
- ✅ Tensor shapes match expected dimensions
- ✅ All tensors are on CUDA (not CPU)
- ✅ Tensor strides are reasonable
- ✅ is_contiguous=True (if required)
Common causes:
- Wrong tensor dimensions
- CPU tensor passed to GPU kernel
- Incorrect stride patterns
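The checklist above can be mechanized against the logged metadata. A hypothetical validator, where `meta` mirrors the fields the logger prints for each Tensor(...):

```python
# Hypothetical validator mirroring the checklist above; `meta` holds the
# same fields the logger prints for each Tensor(...).
def validate_tensor(meta, expected_shape=None):
    problems = []
    if not str(meta["device"]).startswith("cuda"):
        problems.append("tensor is on CPU, kernel expects CUDA")
    if not meta["is_contiguous"]:
        problems.append("tensor is not contiguous")
    if expected_shape is not None and meta["shape"] != expected_shape:
        problems.append(f"shape {meta['shape']} != expected {expected_shape}")
    return problems

meta = {"shape": (32, 8, 128), "device": "cpu", "is_contiguous": True}
issues = validate_tensor(meta, expected_shape=(32, 8, 128))
print(issues)
```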
Error 2: NaN or Inf Values
Error Message:
RuntimeError: Function ... returned nan or inf
Enable statistics logging:
export FLASHINFER_LOGLEVEL=5 # Level 5 shows nan_count, inf_count
python my_script.py
What to check in logs:
Tensor(
...
min=-1234567.000000 # ❌ Suspiciously large
max=9876543.000000 # ❌ Suspiciously large
mean=nan # ❌ NaN detected
nan_count=128 # ❌ 128 NaN values!
inf_count=0
)
Common causes:
- Division by zero in previous operation
- Numerical overflow/underflow
- Uninitialized memory
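Finding the index of the first non-finite element narrows down which upstream operation produced it. A small sketch, using a zero denominator as the example cause:

```python
import math

# Locating the first non-finite element helps trace which upstream
# operation produced it (here, a division by zero).
def first_bad_index(values):
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return -1

denoms = [2.0, 4.0, 0.0, 8.0]
scaled = [1.0 / d if d != 0.0 else float("inf") for d in denoms]
idx = first_bad_index(scaled)
print(idx)  # element 2 traces back to the zero denominator
```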
Error 3: Out of Memory
Error Message:
RuntimeError: CUDA out of memory
Enable logging:
export FLASHINFER_LOGLEVEL=3
python my_script.py
What to check in logs:
- ✅ Tensor shapes (are they unexpectedly large?)
- ✅ Batch size
- ✅ Sequence length
Example:
Tensor(
shape=(1024, 8192, 128, 128) # ❌ Way too large! Should be (1024, 128, 128)?
...
)
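A back-of-envelope footprint computed from the logged shape makes the problem obvious. A sketch (dtype_bytes=2 covers float16/bfloat16; use 4 for float32):

```python
from math import prod

# Estimate a tensor's memory footprint from the logged shape.
# dtype_bytes=2 covers float16/bfloat16; use 4 for float32.
def tensor_bytes(shape, dtype_bytes=2):
    return prod(shape) * dtype_bytes

suspect = tensor_bytes((1024, 8192, 128, 128))   # the shape logged above
expected = tensor_bytes((1024, 128, 128))
print(suspect // 2**30, "GiB vs", expected // 2**20, "MiB")
```

The suspect shape needs 256 GiB for a single bf16 tensor, which no current GPU holds, while the expected shape needs 32 MiB.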
Error 4: Wrong Dtype
Error Message:
RuntimeError: expected scalar type BFloat16 but found Float16
Enable logging:
export FLASHINFER_LOGLEVEL=3
python my_script.py
What to check in logs:
Tensor(
dtype=torch.float16 # ❌ Should be torch.bfloat16
...
)
Step 4: Multi-Process Debugging
When running with multiple GPUs/processes, use the %i pattern:
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug_rank_%i.txt # %i = process ID
torchrun --nproc_per_node=4 my_script.py
This creates separate logs:
debug_rank_12345.txt (process 12345)
debug_rank_12346.txt (process 12346)
debug_rank_12347.txt (process 12347)
debug_rank_12348.txt (process 12348)
Now you can debug each rank independently.
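If you prefer filenames keyed by rank rather than by PID, one option is to set the env vars programmatically from torchrun's RANK variable. This rank-based naming is a convention of this sketch, not a FlashInfer feature, and the vars must be set before flashinfer is imported:

```python
import os

# Alternative to the %i pattern (a convention of this sketch, not a
# FlashInfer feature): derive the log name from torchrun's RANK env var
# so filenames are stable across runs. Set before importing flashinfer.
rank = os.environ.get("RANK", "0")
os.environ["FLASHINFER_LOGLEVEL"] = "3"
os.environ["FLASHINFER_LOGDEST"] = f"debug_rank_{rank}.txt"
print(os.environ["FLASHINFER_LOGDEST"])
```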
Step 5: Advanced Debugging with compute-sanitizer
For harder bugs, combine API logging with CUDA tools:
Use compute-sanitizer (Memory Checker)
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log
compute-sanitizer --tool memcheck python my_script.py
Output shows exact memory errors:
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
========= at 0x1234 in ScaleKernel<float>
========= by thread (256,0,0) in block (10,0,0)
========= Address 0x7f1234567890 is out of bounds
Check debug.log to see what inputs caused this kernel to fail.
Use cuda-gdb (Debugger)
export FLASHINFER_LOGLEVEL=3
export FLASHINFER_LOGDEST=debug.log
cuda-gdb --args python my_script.py
In gdb:
(cuda-gdb) run
(cuda-gdb) where # Show stack trace when it crashes
Check debug.log for the inputs that led to the crash.
Step 6: Kernel-Level Debugging with printf()
You can use printf() inside CUDA kernels for debugging:
Basic Usage
__global__ void MyKernel(const float* input, float* output, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Print from one thread to avoid spam
  if (threadIdx.x == 0 && blockIdx.x == 0) {
    printf("n=%d, input[0]=%f\n", n, input[0]);
  }
  if (idx < n) {
    output[idx] = input[idx] * 2.0f;
  }
}
Important: Flush printf buffer after kernel:
my_kernel(input, output)
torch.cuda.synchronize() # ← Flushes printf output
⚠️ Warp-Specialized Kernels: Choosing the Right Print Thread
Problem: threadIdx.x == 0 doesn't work for all warps (warp starting at thread 32 won't have thread 0).
Solution: Choose one representative thread per specialization group.
__global__ void WarpSpecializedKernel(...) {
  // Pick one representative thread per specialization group, e.g.:
  //   first thread of each warp:         threadIdx.x % 32 == 0
  //   first thread of each 4-warp group: threadIdx.x % 128 == 0
  const bool is_group_representative = (threadIdx.x % 32 == 0);
  const int group_id = threadIdx.x / 32;
  if (is_group_representative) {
    printf("Group %d processing\n", group_id);
  }
}
Common mistake ❌:
// ❌ Only warp 0 will print!
if (threadIdx.x == 0) {
printf("Warp %d processing\n", threadIdx.x / 32);
}
Quick Reference
| Kernel Type | Print Condition | Notes |
|---|---|---|
| Simple kernel | threadIdx.x == 0 | One thread per block |
| Warp-specialized | One thread per group | Depends on kernel design |
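A quick host-side way to sanity-check a print predicate is to simulate which threadIdx.x values pass it before baking it into the kernel:

```python
# Host-side simulation of which threadIdx.x values pass a per-group
# print predicate of the form (threadIdx.x % group_size == 0).
def representatives(num_threads, group_size):
    return [t for t in range(num_threads) if t % group_size == 0]

reps_per_warp = representatives(128, 32)    # one printer per warp
reps_mistake = representatives(128, 128)    # the threadIdx.x == 0 mistake
print(reps_per_warp)  # [0, 32, 64, 96]
print(reps_mistake)   # [0] — only warp 0's leader prints
```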
Other Kernel Debugging Tools
// Assert for invariants
assert(value >= 0.0f && "Value must be non-negative");
// Compile-time checks
static_assert(BLOCK_SIZE % 32 == 0, "BLOCK_SIZE must be multiple of warp size");
Environment Variables Reference
| Variable | Values | Description |
|---|---|---|
| FLASHINFER_LOGLEVEL | 0 | No logging (default) |
| | 1 | Function names only |
| | 3 | Inputs/outputs with metadata |
| | 5 | + Tensor statistics (min/max/mean/nan/inf) |
| FLASHINFER_LOGDEST | stdout | Log to console (default) |
| | stderr | Log to stderr |
| | <path> | Log to file |
| | log_%i.txt | Multi-process: %i = process ID |
Best Practices
1. Always Start with Level 3
export FLASHINFER_LOGLEVEL=3
Level 3 provides tensor metadata (shape, dtype, device) without overwhelming output.
2. Use Level 5 for Numerical Issues
export FLASHINFER_LOGLEVEL=5
Only use level 5 when debugging NaN/Inf problems (adds statistics).
3. Log to File for Crashes
export FLASHINFER_LOGDEST=crash_log.txt
Console output may be lost when program crashes. File logs persist.
4. Compare Before/After
Enable logging and compare:
- Last successful API call (inputs logged, outputs logged) ✅
- First failed API call (inputs logged, no outputs) ❌ ← This is where it crashed!
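Finding that first incomplete entry can be scripted. A sketch that assumes the "[timestamp] FlashInfer API Call: <name>" line format shown earlier (the exact log layout may differ across versions):

```python
# Sketch of scanning a crash log for the call in flight; assumes the
# "[timestamp] FlashInfer API Call: <name>" line format shown earlier.
def last_call(log_text):
    marker = "FlashInfer API Call:"
    name = None
    for line in log_text.splitlines():
        if marker in line:
            name = line.split(marker, 1)[1].strip()
    return name

sample = (
    "[2025-12-18 10:32:10] FlashInfer API Call: single_prefill_with_kv_cache\n"
    "[2025-12-18 10:32:15] FlashInfer API Call: batch_decode_with_padded_kv_cache\n"
)
print(last_call(sample))  # the call that never logged its outputs
```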
5. Disable Logging in Production
unset FLASHINFER_LOGLEVEL # or export FLASHINFER_LOGLEVEL=0
Logging has zero overhead when disabled (decorator returns original function).
Troubleshooting
No Logs Appearing
Problem: You set FLASHINFER_LOGLEVEL=3 but no logs appear.
Content truncated.