miles-rl-training
Provides guidance for enterprise-grade RL training using miles, a production-ready fork of slime. Use when training large MoE models with FP8/INT4, needing train-inference alignment, or requiring speculative RL for maximum throughput.
Install
mkdir -p .claude/skills/miles-rl-training && curl -L -o skill.zip "https://mcp.directory/api/skills/download/5201" && unzip -o skill.zip -d .claude/skills/miles-rl-training && rm skill.zip
Installs to .claude/skills/miles-rl-training
About this skill
miles: Enterprise-Grade RL for Large-Scale Model Training
miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.
When to Use miles
Choose miles when you need:
- Training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
- FP8 or INT4 quantization-aware training
- Bit-wise identical train-inference alignment
- Speculative RL for maximum throughput
- Production stability with enterprise support
Consider alternatives when:
- You want the research-grade original → use slime
- You need flexible backend swapping → use verl
- You want PyTorch-native abstractions → use torchforge
Key Features
Low-Precision Training
- Unified FP8: End-to-end FP8 for both inference and training
- INT4 QAT: 1TB models on single-machine VRAM (H200)
- Rollout Routing Replay (R3): Bit-wise expert alignment for MoE
Performance Optimizations
- Speculative RL: 25%+ rollout speedup with online SFT draft models
- Zero-Copy Weight Sync: CUDA IPC zero-copy mapping
- Partial Rollout: Recycle half-finished trajectories
Train-Inference Alignment
- TIS/MIS: Truncated/Masked Importance Sampling for off-policy correction
- Kernel-level optimization: FlashAttention-3, DeepGEMM integration
Installation
# Recommended: Docker
docker pull radixark/miles:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
    -it radixark/miles:latest /bin/bash
# From source
git clone https://github.com/radixark/miles.git
cd miles
pip install -r requirements.txt
pip install -e .
Quick Start
miles inherits slime's configuration system. Basic training:
python train.py \
    --advantage-estimator grpo \
    --model-name qwen3-30b-a3b \
    --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
    --rollout-batch-size 512 \
    --n-samples-per-prompt 8
Workflow 1: Large MoE Training
Use this workflow for training large MoE models like DeepSeek V3 or Qwen3-MoE.
Prerequisites Checklist
- H100/H200 GPUs with FP8 support
- MoE model (DeepSeek V3, Qwen3-MoE)
- Docker environment with miles
Step 1: Environment Setup
# FP8 block scaling (recommended for stability)
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
Step 2: Configure Training
python train.py \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --hf-checkpoint /path/to/deepseek-v3 \
    --advantage-estimator grpo \
    --tensor-model-parallel-size 8 \
    --expert-model-parallel-size 4 \
    --prompt-data /path/to/data.jsonl \
    --num-rollout 3000
Verification Checklist
- Model loads without errors
- Routing decisions are consistent
- No NaN/Inf in loss values
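The last checklist item can be automated with a small guard. The following is a minimal sketch, assuming the loss values are available as plain Python floats (for example, parsed from training logs). `first_non_finite` is a hypothetical helper and not part of miles.

```python
import math
from typing import Iterable, Optional


def first_non_finite(losses: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN/Inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical loss history: the second step diverged.
bad_step = first_non_finite([0.91, float("nan"), 0.75])
if bad_step is not None:
    print(f"non-finite loss at step {bad_step}")
```

Stopping at the first bad step keeps the report focused on where the divergence began rather than on every step it contaminated afterwards.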
Workflow 2: Speculative RL Training
Use this workflow for maximum rollout throughput with EAGLE speculative decoding.
How Speculative RL Works
- Small draft model generates candidate tokens
- Target model verifies in parallel
- Draft model updated via online SFT to track policy
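The draft-and-verify loop described above can be illustrated with a toy greedy version. This is a sketch under simplifying assumptions, not EAGLE or the SGLang implementation: both models are stand-in callables that map a token context to the next token, and verification is written as a loop even though a real engine scores all draft positions in one batched forward pass. `speculative_step` and its parameters are hypothetical names.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_step(
    draft_next: NextToken,
    target_next: NextToken,
    context: List[int],
    num_draft_tokens: int,
) -> Tuple[List[int], int]:
    """Run one greedy draft-and-verify step.

    Returns the tokens emitted this step and how many draft tokens were accepted.
    """
    # 1. The draft model proposes tokens autoregressively.
    proposal: List[int] = []
    ctx = list(context)
    for _ in range(num_draft_tokens):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)

    # 2. The target model checks each proposed position.
    emitted: List[int] = []
    ctx = list(context)
    for token in proposal:
        expected = target_next(ctx)
        if token != expected:
            # First mismatch: keep the target's token and stop.
            emitted.append(expected)
            return emitted, len(emitted) - 1
        emitted.append(token)
        ctx.append(token)

    # 3. Every draft token matched, so the target contributes one extra token.
    emitted.append(target_next(ctx))
    return emitted, num_draft_tokens
```

In this toy model the output is always what the target would have produced on its own; the draft only changes how many tokens are emitted per step. A falling accepted count is the signal that the draft has drifted from the policy, which is what the online SFT update is meant to counter.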
Step 1: Enable Speculative Decoding
miles supports EAGLE speculative decoding via SGLang:
python train.py \
    --actor-num-gpus-per-node 8 \
    --hf-checkpoint /path/to/target-model \
    --sglang-speculative-algorithm EAGLE \
    --sglang-speculative-num-steps 3 \
    --sglang-speculative-eagle-topk 1 \
    --sglang-speculative-num-draft-tokens 4 \
    --sglang-speculative-draft-model-path /path/to/draft-model \
    --advantage-estimator grpo \
    --prompt-data /path/to/data.jsonl
Step 2: Enable Online MTP Training (Optional)
To train the draft model with online SFT during RL training, add:
--mtp-num-layers 1 \
--enable-mtp-training \
--mtp-loss-scaling-factor 0.2
Note: Online MTP training requires a torch dist checkpoint with MTP weights. Add --mtp-num-layers 1 during checkpoint conversion from HuggingFace.
Expected Speedup
- Standard rollout: Baseline
- Speculative RL: 25-40% faster rollout
- With partial rollout: Additional 10-15% throughput
Configuration Reference
miles inherits all slime arguments. See slime API Reference for the complete list.
Cluster Resources (from slime)
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate
Megatron Parallelism (from slime)
--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4 # MoE expert parallelism
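A quick sanity check for these sizes, as a sketch: it assumes the usual Megatron convention that the data-parallel size equals the world size divided by (tensor parallel × pipeline parallel), and it deliberately ignores context-parallel and expert-parallel constraints. `data_parallel_size` is a hypothetical helper, not a miles or slime function.

```python
def data_parallel_size(
    num_nodes: int,
    gpus_per_node: int,
    tensor_parallel: int,
    pipeline_parallel: int,
) -> int:
    """Return the data-parallel size implied by the GPU count and TP/PP sizes."""
    world_size = num_nodes * gpus_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP = {model_parallel}"
        )
    return world_size // model_parallel


# TP=8 with PP=2 needs a multiple of 16 GPUs, e.g. two 8-GPU nodes.
print(data_parallel_size(num_nodes=2, gpus_per_node=8,
                         tensor_parallel=8, pipeline_parallel=2))
```

Under this assumption, the tensor and pipeline sizes listed above would not fit on a single 8-GPU node, so the node count has to be chosen together with the parallelism flags.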
Speculative Decoding (miles-specific)
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path
Online MTP Training (miles-specific)
--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2
Key Features (Conceptual)
The following features are documented in miles, but the specific CLI flags may vary between versions. Consult the miles repository for the latest configuration.
Unified FP8 Pipeline
End-to-end FP8 sampling and training that eliminates quantization-induced discrepancy causing RL collapse in MoE models.
Rollout Routing Replay (R3)
Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.
How R3 Works:
- During SGLang inference, expert routing decisions are recorded
- Routing decisions stored in sample.rollout_routed_experts
- During Megatron training, routing is replayed instead of recomputed
- Ensures identical expert selection between train and inference
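The record-and-replay idea can be shown with a toy top-k router. This is a sketch, not the miles implementation: router logits are plain lists, the small numeric difference between the two logit sets stands in for kernel-level discrepancies between inference and training, and `top_k_experts` and `select_experts` are hypothetical names.

```python
from typing import List, Optional, Sequence


def top_k_experts(router_logits: Sequence[float], k: int) -> List[int]:
    """Return indices of the k largest logits (ties broken by lower index)."""
    order = sorted(range(len(router_logits)), key=lambda e: (-router_logits[e], e))
    return order[:k]


def select_experts(
    router_logits_per_token: Sequence[Sequence[float]],
    k: int,
    replay: Optional[Sequence[Sequence[int]]] = None,
) -> List[List[int]]:
    """Choose experts per token, or reuse recorded choices when replay is given."""
    if replay is not None:
        return [list(experts) for experts in replay]
    return [top_k_experts(logits, k) for logits in router_logits_per_token]


# The same token seen by two numerically different stacks (made-up values).
inference_logits = [[0.10, 0.50, 0.301, 0.300]]
training_logits = [[0.10, 0.50, 0.300, 0.301]]

recorded = select_experts(inference_logits, k=2)    # stored with the sample
recomputed = select_experts(training_logits, k=2)   # may pick another expert
replayed = select_experts(training_logits, k=2, replay=recorded)

print(recorded, recomputed, replayed)
```

A difference in the third decimal place is enough to flip the second expert when routing is recomputed, while the replayed selection matches the recorded one by construction.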
INT4 Quantization-Aware Training
Enables single-machine deployment of 1TB+ models (e.g., on H200).
Memory Savings with INT4:
| Model Size | BF16 VRAM | INT4 VRAM | Reduction |
|---|---|---|---|
| 70B | 140GB | 45GB | 3.1x |
| 235B | 470GB | 150GB | 3.1x |
| 671B | 1.3TB | 420GB | 3.1x |
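The arithmetic behind the table can be checked directly. This sketch assumes 2 bytes per parameter for BF16 and decimal gigabytes, and takes the INT4 figures from the table as given. The helper names are hypothetical.

```python
BYTES_PER_BF16_PARAM = 2


def bf16_weight_gb(params_billions: float) -> float:
    """BF16 weight footprint in GB: 1e9 params * 2 bytes = 2 GB per billion."""
    return params_billions * BYTES_PER_BF16_PARAM


def reduction_factor(params_billions: float, int4_gb: float) -> float:
    """Ratio of the BF16 footprint to the reported INT4 footprint."""
    return bf16_weight_gb(params_billions) / int4_gb


def implied_bits_per_param(params_billions: float, int4_gb: float) -> float:
    """Bits per parameter implied by the reported INT4 footprint."""
    return int4_gb * 8 / params_billions


# INT4 VRAM figures copied from the table above (model size in billions -> GB).
TABLE_INT4_GB = {70: 45, 235: 150, 671: 420}

for params_b, int4_gb in TABLE_INT4_GB.items():
    print(
        f"{params_b}B: BF16 {bf16_weight_gb(params_b):.0f} GB, "
        f"reduction {reduction_factor(params_b, int4_gb):.1f}x, "
        f"{implied_bits_per_param(params_b, int4_gb):.1f} bits/param"
    )
```

The reported footprints work out to roughly 5 bits per parameter rather than exactly 4. That would be consistent with overhead such as quantization scales or layers kept in higher precision, but this reading is an assumption and not something the table states.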
Train-Inference Alignment
miles achieves "exactly 0 KL divergence" between training and inference through:
- Flash Attention 3
- DeepGEMM
- Batch-invariant kernels from Thinking Machines Lab
- torch.compile integration
Sample Data Structure
miles uses the same Sample dataclass as slime with the rollout_routed_experts field for MoE routing replay:
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str | list[dict]
    tokens: list[int]
    response: str
    reward: float | dict
    loss_mask: list[int]
    status: Status  # Status is defined in slime; see its API reference
    metadata: dict
    rollout_log_probs: list[float]
    rollout_routed_experts: list[list[int]]  # MoE routing for R3
See slime API Reference for the complete Sample definition.
Common Issues and Solutions
Issue: FP8 Training Collapse
Symptoms: Loss explodes, NaN values
Solutions:
- Use block scaling: export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
- Reduce learning rate: --lr 5e-7
- Ensure MoE routing is consistent between train/inference
Issue: Speculative Draft Drift
Symptoms: Low acceptance rate over time
Solutions:
- Enable online MTP training to keep draft model aligned
- Reduce speculative steps: --sglang-speculative-num-steps 2
- Use CPU backup: --sglang-enable-draft-weights-cpu-backup
Issue: Train-Inference Mismatch
Symptoms: Policy divergence, reward collapse
Solutions:
- Use TIS for off-policy correction: --use-tis --tis-threshold 0.9
- Verify log probs match between SGLang and Megatron
- Enable R3 for MoE models
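The first two solutions can be sketched on per-token log probabilities. This shows one common formulation of truncated and masked importance sampling and is not necessarily how miles implements them: the ratio is exp(train log prob − rollout log prob), truncation caps it, and masking zeroes tokens whose ratio exceeds the cap. The function names and the `cap` parameter are hypothetical and are not tied to any miles flag.

```python
import math
from typing import List, Sequence


def importance_ratios(
    train_log_probs: Sequence[float], rollout_log_probs: Sequence[float]
) -> List[float]:
    """Per-token ratio pi_train / pi_rollout computed from log probabilities."""
    if len(train_log_probs) != len(rollout_log_probs):
        raise ValueError("log-prob sequences must have the same length")
    return [math.exp(t - r) for t, r in zip(train_log_probs, rollout_log_probs)]


def truncated_weights(ratios: Sequence[float], cap: float) -> List[float]:
    """Truncated IS: clip each ratio from above at cap."""
    return [min(ratio, cap) for ratio in ratios]


def masked_weights(ratios: Sequence[float], cap: float) -> List[float]:
    """Masked IS: drop (zero out) tokens whose ratio exceeds cap."""
    return [ratio if ratio <= cap else 0.0 for ratio in ratios]


def max_abs_log_prob_gap(
    train_log_probs: Sequence[float], rollout_log_probs: Sequence[float]
) -> float:
    """Largest per-token |train - rollout| log-prob difference."""
    return max(abs(t - r) for t, r in zip(train_log_probs, rollout_log_probs))


# Made-up log probs: the second token disagrees between the two engines.
train = [-1.0, -0.5, -2.0]
rollout = [-1.0, -1.5, -2.0]
ratios = importance_ratios(train, rollout)
print(truncated_weights(ratios, cap=2.0))
print(masked_weights(ratios, cap=2.0))
print(max_abs_log_prob_gap(train, rollout))
```

The gap check is a cheap first diagnostic: if the largest per-token difference is already near zero, importance weights stay close to 1 and the correction has little to do.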
Supported Models
| Family | Models | MoE Support |
|---|---|---|
| DeepSeek | R1, V3, V3.2 | Full |
| Qwen | 2, 2.5, 3 (including MoE) | Full |
| Llama | 3, 3.1, 3.3, 4 | Dense only |
| Gemma | 2, 3, 3N | Dense only |
| GLM | 4.5, 4.6, 4.7 | Dense only |
| MiniMax | M2, M2.1 | Full |
Resources
- GitHub: https://github.com/radixark/miles
- Introduction Blog: https://lmsys.org/blog/2025-11-19-miles/
- Slime (upstream): https://github.com/THUDM/slime
- SGLang: https://github.com/sgl-project/sglang