benchmark-kernel

Guide for benchmarking FlashInfer kernels with CUPTI timing

Install

mkdir -p .claude/skills/benchmark-kernel && curl -L -o skill.zip "https://mcp.directory/api/skills/download/5266" && unzip -o skill.zip -d .claude/skills/benchmark-kernel && rm skill.zip

Installs to .claude/skills/benchmark-kernel

About this skill

Tutorial: Benchmarking FlashInfer Kernels

This tutorial shows you how to accurately benchmark FlashInfer kernels.

Goal

Measure the performance of FlashInfer kernels:

Get accurate GPU kernel execution time
Compare multiple backends (FlashAttention2/3, cuDNN, CUTLASS, TensorRT-LLM)
Generate reproducible benchmark results
Save results to CSV for analysis

Timing Methods

FlashInfer supports two timing methods:

CUPTI (Preferred): Hardware-level profiling for most accurate GPU kernel time
- Measures pure GPU compute time without host-device overhead
- Requires cupti-python >= 13.0.0 (CUDA 13+)
CUDA Events (Fallback): Standard CUDA event timing
- Automatically used if CUPTI is not available
- Good accuracy, slight overhead from host synchronization

The framework automatically uses CUPTI if available, otherwise falls back to CUDA events.
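The fallback rule above can be sketched in a few lines. This is only an illustration of the selection logic, not FlashInfer's actual detection code; the function name `pick_timing_method` is hypothetical, and the importable module name `cupti` for the cupti-python package is an assumption.

```python
import importlib.util


def pick_timing_method(prefer_cupti: bool = True, module_name: str = "cupti") -> str:
    """Return "cupti" if the CUPTI bindings can be imported, else "cuda_events".

    module_name is an assumption about how cupti-python exposes itself;
    adjust it if your installation uses a different top-level module.
    """
    if prefer_cupti and importlib.util.find_spec(module_name) is not None:
        return "cupti"
    return "cuda_events"


# Report which method a benchmark would use on this machine.
print(pick_timing_method())
```

Passing `prefer_cupti=False` mirrors forcing CUDA events even when CUPTI is installed.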

Installation

Install CUPTI (Recommended)

For the most accurate benchmarking:

pip install -U cupti-python

Requirements: CUDA 13+ (CUPTI version 13+)

Without CUPTI

If you don't install CUPTI, the framework will:

Print a warning: CUPTI is not installed. Falling back to CUDA events.
Automatically use CUDA events for timing
Still provide good benchmark results

Method 1: Using flashinfer_benchmark.py (Recommended)

Step 1: Choose Your Test Routine

Available routines:

Attention: BatchDecodeWithPagedKVCacheWrapper, BatchPrefillWithPagedKVCacheWrapper, BatchPrefillWithRaggedKVCacheWrapper, BatchMLAPagedAttentionWrapper
GEMM: bmm_fp8, gemm_fp8_nt_groupwise, group_gemm_fp8_nt_groupwise, mm_fp4
MOE: trtllm_fp4_block_scale_moe, trtllm_fp8_block_scale_moe, trtllm_fp8_per_tensor_scale_moe, cutlass_fused_moe

Step 2: Run a Single Benchmark

Example - Benchmark decode attention:

# CUPTI will be used automatically if installed
python benchmarks/flashinfer_benchmark.py \
    --routine BatchDecodeWithPagedKVCacheWrapper \
    --backends fa2 fa2_tc cudnn \
    --page_size 16 \
    --batch_size 32 \
    --s_qo 1 \
    --s_kv 2048 \
    --num_qo_heads 32 \
    --num_kv_heads 8 \
    --head_dim_qk 128 \
    --head_dim_vo 128 \
    --q_dtype bfloat16 \
    --kv_dtype bfloat16 \
    --num_iters 30 \
    --dry_run_iters 5 \
    --refcheck \
    -vv

Example - Benchmark FP8 GEMM:

python benchmarks/flashinfer_benchmark.py \
    --routine bmm_fp8 \
    --backends cudnn cublas cutlass \
    --batch_size 256 \
    --m 1 \
    --n 1024 \
    --k 7168 \
    --input_dtype fp8_e4m3 \
    --mat2_dtype fp8_e4m3 \
    --out_dtype bfloat16 \
    --refcheck \
    -vv \
    --generate_repro_command

Timing behavior:

✅ If CUPTI installed: Uses CUPTI (most accurate)
⚠️ If CUPTI not installed: Automatically falls back to CUDA events with warning
🔧 To force CUDA events: Add --use_cuda_events flag

Step 3: Understand the Output

[INFO] FlashInfer version: 0.6.0
[VVERBOSE] gpu_name = 'NVIDIA_H100_PCIe'
[PERF] fa2            :: median time 0.145 ms; std 0.002 ms; achieved tflops 125.3 TFLOPs/sec; achieved tb_per_sec 1.87 TB/sec
[PERF] fa2_tc         :: median time 0.138 ms; std 0.001 ms; achieved tflops 131.5 TFLOPs/sec; achieved tb_per_sec 1.96 TB/sec
[PERF] cudnn          :: median time 0.142 ms; std 0.001 ms; achieved tflops 127.8 TFLOPs/sec; achieved tb_per_sec 1.91 TB/sec

Key metrics:

median time: Median kernel execution time (lower is better)
std: Standard deviation (lower means more consistent)
achieved tflops: Achieved compute throughput in TFLOPS (higher is better)
achieved tb_per_sec: Achieved memory bandwidth in TB/s (higher is better)
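If you want to compare backends programmatically, the `[PERF]` lines can be parsed with a regular expression. The sketch below assumes the line layout is exactly the one shown in the sample output above; the helper name `parse_perf_lines` is hypothetical, and the pattern may need adjusting for other FlashInfer versions.

```python
import re

# Pattern for one [PERF] line, assuming the layout in the sample output above.
PERF_RE = re.compile(
    r"\[PERF\]\s+(?P<backend>\S+)\s+::\s+median time (?P<median>[\d.]+) ms; "
    r"std (?P<std>[\d.]+) ms; achieved tflops (?P<tflops>[\d.]+) TFLOPs/sec; "
    r"achieved tb_per_sec (?P<tbps>[\d.]+) TB/sec"
)


def parse_perf_lines(text):
    """Map backend name -> metrics dict for every [PERF] line in text."""
    results = {}
    for line in text.splitlines():
        match = PERF_RE.search(line)
        if match:
            results[match["backend"]] = {
                "median_ms": float(match["median"]),
                "std_ms": float(match["std"]),
                "tflops": float(match["tflops"]),
                "tb_per_sec": float(match["tbps"]),
            }
    return results


SAMPLE = """\
[PERF] fa2            :: median time 0.145 ms; std 0.002 ms; achieved tflops 125.3 TFLOPs/sec; achieved tb_per_sec 1.87 TB/sec
[PERF] fa2_tc         :: median time 0.138 ms; std 0.001 ms; achieved tflops 131.5 TFLOPs/sec; achieved tb_per_sec 1.96 TB/sec
[PERF] cudnn          :: median time 0.142 ms; std 0.001 ms; achieved tflops 127.8 TFLOPs/sec; achieved tb_per_sec 1.91 TB/sec
"""

results = parse_perf_lines(SAMPLE)
# The fastest backend is the one with the lowest median time.
fastest = min(results, key=lambda name: results[name]["median_ms"])
print(fastest, results[fastest])
```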

Step 4: Run Batch Benchmarks

Create a test list file my_benchmarks.txt:

--routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 cudnn --page_size 16 --batch_size 32 --s_kv 2048 --num_qo_heads 32 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128
--routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 cudnn --page_size 16 --batch_size 64 --s_kv 4096 --num_qo_heads 32 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128
--routine bmm_fp8 --backends cudnn cutlass --batch_size 256 --m 1 --n 1024 --k 7168 --input_dtype fp8_e4m3 --mat2_dtype fp8_e4m3 --out_dtype bfloat16
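For larger sweeps it can be easier to generate the test list than to write it by hand. The following is a minimal sketch that reuses only the flags shown in the lines above; the helper names `make_testlist` and `write_testlist` are hypothetical.

```python
import itertools

# Fixed parts of each test line, copied from the decode examples above.
BASE = "--routine BatchDecodeWithPagedKVCacheWrapper --backends fa2 cudnn --page_size 16"
HEADS = "--num_qo_heads 32 --num_kv_heads 8 --head_dim_qk 128 --head_dim_vo 128"


def make_testlist(batch_sizes, kv_lens):
    """Return one test-list line per (batch_size, s_kv) combination."""
    return [
        f"{BASE} --batch_size {batch} --s_kv {s_kv} {HEADS}"
        for batch, s_kv in itertools.product(batch_sizes, kv_lens)
    ]


def write_testlist(path, lines):
    """Write the lines to a file that can be passed to --testlist."""
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")


lines = make_testlist([32, 64], [2048, 4096])
for line in lines:
    print(line)
```

Calling `write_testlist("my_benchmarks.txt", lines)` would then produce a file in the same format as the hand-written one.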

Run all tests:

python benchmarks/flashinfer_benchmark.py \
    --testlist my_benchmarks.txt \
    --output_path results.csv \
    --generate_repro_command \
    --refcheck

Results are saved to results.csv with all metrics and reproducer commands.
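Once the CSV exists, the standard library is enough for a first pass at the results. The sketch below picks the fastest backend per case. The column names `case_tag`, `backend`, and `median_time`, and the sample rows, are assumptions for illustration; check the header row of your own results.csv and pass the real names.

```python
import csv
import io


def fastest_per_case(csv_file, case_col="case_tag", backend_col="backend",
                     time_col="median_time"):
    """Return {case: (backend, time)} with the lowest time for each case.

    The default column names are assumptions, not a documented schema.
    """
    best = {}
    for row in csv.DictReader(csv_file):
        case, backend = row[case_col], row[backend_col]
        time_ms = float(row[time_col])
        if case not in best or time_ms < best[case][1]:
            best[case] = (backend, time_ms)
    return best


# Hypothetical sample data standing in for a real results.csv.
SAMPLE_CSV = io.StringIO(
    "case_tag,backend,median_time\n"
    "decode_bs32,fa2,0.145\n"
    "decode_bs32,fa2_tc,0.138\n"
    "decode_bs32,cudnn,0.142\n"
)

best = fastest_per_case(SAMPLE_CSV)
print(best)
```

With a real file, open it with `open("results.csv", newline="")` and pass the file object in place of `SAMPLE_CSV`.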

Step 5: Common Flags

| Flag | Description | Default |
|------|-------------|---------|
| `--num_iters` | Measurement iterations | 30 |
| `--dry_run_iters` | Warmup iterations | 5 |
| `--refcheck` | Verify output correctness | False |
| `--allow_output_mismatch` | Continue on mismatch | False |
| `--use_cuda_events` | Force CUDA events (skip CUPTI) | False |
| `--no_cuda_graph` | Disable CUDA graph | False |
| `-vv` | Very verbose output | - |
| `--generate_repro_command` | Print reproducer command | False |
| `--case_tag` | Tag for CSV output | None |
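When driving the benchmark from another script, it helps to build the argument list from a dictionary of flags. This is a sketch under the assumption that boolean flags are bare switches and all other flags take one value, as in the examples above; `build_command` is a hypothetical helper, not part of FlashInfer.

```python
import shlex


def build_command(routine, backends, script="benchmarks/flashinfer_benchmark.py", **flags):
    """Build an argv list for the benchmark script.

    True adds a bare switch, False or None omits the flag,
    and any other value is passed as "--name value".
    """
    argv = ["python", script, "--routine", routine, "--backends", *backends]
    for name, value in flags.items():
        if value is True:
            argv.append(f"--{name}")
        elif value is not False and value is not None:
            argv += [f"--{name}", str(value)]
    return argv


cmd = build_command(
    "bmm_fp8",
    ["cudnn", "cutlass"],
    batch_size=256, m=1, n=1024, k=7168,
    input_dtype="fp8_e4m3", mat2_dtype="fp8_e4m3", out_dtype="bfloat16",
    num_iters=50, refcheck=True, use_cuda_events=False,
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)`.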

Method 2: Using bench_gpu_time() in Python

For custom benchmarking in your own code:

Step 1: Write Your Benchmark Script

import torch
from flashinfer.testing import bench_gpu_time

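# Setup your kernel
def my_kernel_wrapper(q, k, v):
    # Placeholder so the script runs end to end: replace this call with your own kernel.
    # Inputs are (tokens, heads, head_dim); PyTorch SDPA expects (heads, tokens, head_dim).
    output = torch.nn.functional.scaled_dot_product_attention(
        q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1)
    )
    return output.transpose(0, 1)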

# Create test inputs
device = torch.device("cuda")
q = torch.randn(32, 8, 128, dtype=torch.bfloat16, device=device)
k = torch.randn(2048, 8, 128, dtype=torch.bfloat16, device=device)
v = torch.randn(2048, 8, 128, dtype=torch.bfloat16, device=device)

# Benchmark - CUPTI preferred, CUDA events if CUPTI unavailable
median_time, std_time = bench_gpu_time(
    my_kernel_wrapper,
    args=(q, k, v),
    enable_cupti=True,          # Prefer CUPTI, fallback to CUDA events
    num_iters=30,               # Number of iterations
    dry_run_iters=5,            # Warmup iterations
)

print(f"Kernel time: {median_time:.3f} ms ± {std_time:.3f} ms")

# Calculate FLOPS if you know the operation count
flops = ...  # Your FLOP count
tflops = (flops / 1e12) / (median_time / 1000)
print(f"Achieved: {tflops:.2f} TFLOPS/sec")

Note: If CUPTI is not installed, you'll see a warning and the function will automatically use CUDA events instead.
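As a worked example of the FLOP calculation, a batched matrix multiply performs one multiply and one add per (m, n, k) element, so its FLOP count is 2 * batch * m * n * k. The sketch below applies this to the bmm_fp8 shape used earlier. The helper names are hypothetical and the 0.01 ms time is a made-up figure for the arithmetic, not a measurement.

```python
def bmm_flops(batch: int, m: int, n: int, k: int) -> int:
    """FLOPs for a batched (m x k) @ (k x n) matmul: one multiply + one add per term."""
    return 2 * batch * m * n * k


def achieved_tflops(flops: float, time_ms: float) -> float:
    """Convert a FLOP count and a time in milliseconds to TFLOPS."""
    return (flops / 1e12) / (time_ms / 1000)


# Shape from the bmm_fp8 example: batch_size=256, m=1, n=1024, k=7168.
flops = bmm_flops(256, 1, 1024, 7168)
print(f"FLOPs per call: {flops}")

# Hypothetical median time, used only to illustrate the conversion.
print(f"Achieved: {achieved_tflops(flops, 0.01):.2f} TFLOPS")
```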

Step 2: Run Your Benchmark

python my_benchmark.py

Output with CUPTI:

Kernel time: 0.145 ms ± 0.002 ms
Achieved: 125.3 TFLOPS/sec

Output without CUPTI (automatic fallback):

[WARNING] CUPTI is not installed. Try 'pip install -U cupti-python'. Falling back to CUDA events.
Kernel time: 0.147 ms ± 0.003 ms
Achieved: 124.1 TFLOPS/sec

Step 3: Advanced Options

# Cold L2 cache benchmarking (optional)
median_time, std_time = bench_gpu_time(
    my_kernel,
    args=(x, y),
    enable_cupti=True,          # Will use CUDA events if CUPTI unavailable
    cold_l2_cache=True,         # Flush L2 or rotate buffers automatically
    num_iters=30
)

# Force CUDA events (skip CUPTI even if installed)
median_time, std_time = bench_gpu_time(
    my_kernel,
    args=(x, y),
    enable_cupti=False,         # Explicitly use CUDA events
    num_iters=30
)
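To give some intuition for buffer rotation: one common way to get cold-cache behaviour is to cycle through several copies of the inputs, so that by the time a copy is reused the other copies have pushed it out of L2. The sketch below estimates how many copies that takes. It is a rough model and not necessarily what `bench_gpu_time` does internally; the function name and the 50 MiB L2 size are assumptions.

```python
import math


def num_buffer_copies(bytes_per_call: int, l2_bytes: int) -> int:
    """Copies needed so the *other* copies together cover at least the L2 size.

    Rough model: a copy is assumed evicted once l2_bytes of other data
    has been touched since its last use.
    """
    others_needed = max(1, math.ceil(l2_bytes / bytes_per_call))
    return others_needed + 1


# Input footprint for the q, k, v tensors from Step 1 (bfloat16 = 2 bytes).
q_bytes = 32 * 8 * 128 * 2
kv_bytes = 2 * (2048 * 8 * 128 * 2)  # k and v together
bytes_per_call = q_bytes + kv_bytes

# Assumed L2 size of 50 MiB; query your own GPU for the real value.
copies = num_buffer_copies(bytes_per_call, 50 * 1024 * 1024)
print(bytes_per_call, copies)
```

Small inputs need many copies and large inputs need few, which is why the choice between flushing and rotating is best left to the framework.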

Troubleshooting

CUPTI Warning Message

Warning: CUPTI is not installed. Falling back to CUDA events.

What it means: CUPTI is not available, so timing uses CUDA events instead.

Impact: Timing is less accurate for very fast kernels (5-50 us) because of synchronization overhead; the overhead becomes negligible for longer-running kernels.

Solution (optional): Install CUPTI for best accuracy:

pip install -U cupti-python

If installation fails, check:

CUDA version >= 13
Compatible cupti-python version

You can still run benchmarks without CUPTI - the framework handles this automatically.

Inconsistent Results

Problem: Large standard deviation or varying results

Solutions:

Increase warmup iterations:
```
--dry_run_iters 10
```
Increase measurement iterations:
```
--num_iters 50
```

Use cold L2 cache (in Python):

bench_gpu_time(..., cold_l2_cache=True)

Lock GPU clocks to avoid boost-clock variation (advanced; reset afterwards with sudo nvidia-smi -rgc):
```
sudo nvidia-smi -lgc <base_clock>
```
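To decide whether a run is too noisy, it helps to look at the standard deviation relative to the median rather than in absolute terms. The sketch below flags a run when that ratio exceeds a threshold. The 5% threshold, the function name, and the sample timings are assumptions for illustration.

```python
import statistics


def summarize(times_ms, cv_threshold=0.05):
    """Summarize per-iteration times and flag runs with high relative spread.

    cv is the sample standard deviation divided by the median;
    the 5% default threshold is an arbitrary starting point.
    """
    median = statistics.median(times_ms)
    std = statistics.stdev(times_ms)
    cv = std / median
    return {"median_ms": median, "std_ms": std, "cv": cv, "noisy": cv > cv_threshold}


# Hypothetical per-iteration timings in milliseconds.
stable = summarize([0.145, 0.146, 0.144, 0.145, 0.145])
noisy = summarize([0.145, 0.190, 0.120, 0.160, 0.145])
print(stable["noisy"], noisy["noisy"])
```

A run flagged as noisy is a candidate for more warmup iterations, more measurement iterations, or locked clocks.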

Reference Check Failures

Error: [ERROR] Output mismatch between backends

What it means: Different backends produce different results

Solutions:

Allow mismatch and continue:
```
--allow_output_mismatch
```
Check numerical tolerance: Some backends use different precisions (FP32 vs FP16)

Investigate the difference:

-vv  # Very verbose mode shows tensor statistics
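To see why precision differences call for a tolerance rather than an exact comparison, the sketch below rounds a few values to FP16 with the standard library and compares them against the originals. The tolerance values and helper names are assumptions for illustration; they are not the thresholds used by `--refcheck`.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def allclose(a, b, rtol=1e-3, atol=1e-5):
    """Elementwise |a - b| <= atol + rtol * |b| over two equal-length sequences."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


reference = [0.1, 0.2, 0.3]
half = [to_fp16(v) for v in reference]

print(half)                                        # values differ slightly from the reference
print(allclose(half, reference))                   # within a tolerance suited to FP16
print(allclose(half, reference, rtol=0.0, atol=1e-8))  # fails an FP32-style tight check
```

Two backends that accumulate in different precisions can disagree by a similar margin while both being correct.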

Backend Not Supported

Error: [WARNING] fa3 for routine ... is not supported on compute capability X.X

Solution: Check the backend support matrix in benchmarks/README.md, or remove that backend from the --backends list.

Best Practices

Install CUPTI for best accuracy (but not required):
```
pip install -U cupti-python
```
Use reference checking to verify correctness:
```
--refcheck
```
Use verbose mode to see input shapes and dtypes:
```
-vv
```
Generate reproducer commands for sharing results:
```
--generate_repro_command
```
Run multiple iterations for statistical significance:
```
--num_iters 30 --dry_run_iters 5
```
Save results to CSV for later analysis:
```
--output_path results.csv
```
Compare multiple backends to find the best:
```
--backends fa2 fa3 cudnn cutlass
```

Quick Examples

Decode Attention (H100)

python benchmarks/flashinfer_benchmark.py \
    --routine BatchDecodeWithPagedKVCacheWrapper \
    --backends fa2 fa2_tc cudnn trtllm-gen \
    --page_size 16 --batch_size 128 --s_kv 8192 \
    --num_qo_heads 64 --num_kv_heads 8 \
    --head_dim_qk 128 --head

---

*Content truncated.*
