miles-rl-training

Provides guidance for enterprise-grade RL training using miles, a production-ready fork of slime. Use it when you are training large MoE models with FP8/INT4, need train-inference alignment, or want speculative RL for maximum throughput.

Install

mkdir -p .claude/skills/miles-rl-training && curl -L -o skill.zip "https://mcp.directory/api/skills/download/5201" && unzip -o skill.zip -d .claude/skills/miles-rl-training && rm skill.zip

Installs to .claude/skills/miles-rl-training

About this skill

miles: Enterprise-Grade RL for Large-Scale Model Training

miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.

When to Use miles

Choose miles when you need:

Training support for 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
FP8 or INT4 quantization-aware training
Bit-wise identical train-inference alignment
Speculative RL for maximum throughput
Production stability with enterprise support

Consider alternatives when:

You want the research-grade original → use slime
You need flexible backend swapping → use verl
You want PyTorch-native abstractions → use torchforge

Key Features

Low-Precision Training

Unified FP8: End-to-end FP8 for both inference and training
INT4 QAT: fits 1TB-class models in the VRAM of a single machine (e.g., H200)
Rollout Routing Replay (R3): Bit-wise expert alignment for MoE

Performance Optimizations

Speculative RL: 25%+ rollout speedup with online SFT draft models
Zero-Copy Weight Sync: CUDA IPC zero-copy mapping
Partial Rollout: Recycle half-finished trajectories

Train-Inference Alignment

TIS/MIS: Truncated/Masked Importance Sampling for off-policy correction
Kernel-level optimization: FlashAttention-3, DeepGEMM integration
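
As a rough illustration of the two off-policy corrections named above, the sketch below computes per-token importance ratios between the training policy and the rollout policy. It then either truncates the ratios at a cap (TIS) or zeroes out tokens whose ratio exceeds the cap (MIS). The function names, the cap value of 2.0, and the exact masking rule are assumptions made for illustration; they are not taken from the miles code base.

```python
import math
from typing import List


def importance_ratios(train_logp: List[float], rollout_logp: List[float]) -> List[float]:
    """Per-token ratio pi_train(token) / pi_rollout(token), computed from log probs."""
    return [math.exp(t - r) for t, r in zip(train_logp, rollout_logp)]


def tis_weights(train_logp: List[float], rollout_logp: List[float], cap: float = 2.0) -> List[float]:
    """Truncated importance sampling: ratios above `cap` are clipped down to `cap`."""
    return [min(w, cap) for w in importance_ratios(train_logp, rollout_logp)]


def mis_weights(train_logp: List[float], rollout_logp: List[float], cap: float = 2.0) -> List[float]:
    """Masked importance sampling (assumed rule): ratios above `cap` are dropped to zero."""
    return [w if w <= cap else 0.0 for w in importance_ratios(train_logp, rollout_logp)]


# Hypothetical per-token log probs for one three-token response.
train_logp = [-1.0, -0.5, -2.0]
rollout_logp = [-1.0, -1.5, -1.0]

tis = tis_weights(train_logp, rollout_logp)
mis = mis_weights(train_logp, rollout_logp)
```

The second token has a ratio of e (about 2.72), so TIS clips it to the cap while MIS removes its contribution entirely. The other two tokens pass through unchanged under both rules.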

Installation

# Recommended: Docker
docker pull radixark/miles:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it radixark/miles:latest /bin/bash

# From source
git clone https://github.com/radixark/miles.git
cd miles
pip install -r requirements.txt
pip install -e .

Quick Start

miles inherits slime's configuration system. Basic training:

python train.py \
    --advantage-estimator grpo \
    --model-name qwen3-30b-a3b \
    --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
    --rollout-batch-size 512 \
    --n-samples-per-prompt 8
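
To make the `--advantage-estimator grpo` and `--n-samples-per-prompt 8` settings concrete, here is a minimal sketch of the commonly described GRPO advantage: each reward is normalized against the mean and standard deviation of its own prompt group. The function name, the `eps` value, and the use of the population standard deviation are assumptions; the estimator in miles may differ in these details.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one prompt's sampled responses within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary rewards for the 8 responses sampled for one prompt.
group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
advantages = grpo_advantages(group_rewards)

# If every response gets the same reward, the group carries no learning signal.
flat = grpo_advantages([1.0, 1.0, 1.0, 1.0])
```

Assuming `--rollout-batch-size` counts prompts, the Quick Start settings would produce 512 × 8 = 4096 responses per rollout, grouped 8 at a time for this normalization.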

Workflow 1: Large MoE Training

Use this workflow for training large MoE models like DeepSeek V3 or Qwen3-MoE.

Prerequisites Checklist

H100/H200 GPUs with FP8 support
MoE model (DeepSeek V3, Qwen3-MoE)
Docker environment with miles

Step 1: Environment Setup

# FP8 block scaling (recommended for stability)
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1

Step 2: Configure Training

python train.py \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --hf-checkpoint /path/to/deepseek-v3 \
    --advantage-estimator grpo \
    --tensor-model-parallel-size 8 \
    --expert-model-parallel-size 4 \
    --prompt-data /path/to/data.jsonl \
    --num-rollout 3000

Verification Checklist

Model loads without errors
Routing decisions are consistent
No NaN/Inf in loss values
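
The last checklist item can be automated with a small helper such as the one below. It is a sketch that assumes you already have the per-step loss values as plain floats (for example, parsed from your own logs); the helper name is hypothetical and is not a miles utility.

```python
import math
from typing import Iterable, List


def non_finite_steps(losses: Iterable[float]) -> List[int]:
    """Return the indices of steps whose loss is NaN or +/-Inf."""
    return [step for step, loss in enumerate(losses) if not math.isfinite(loss)]


# Hypothetical loss history with two bad steps.
bad_steps = non_finite_steps([0.7, float("nan"), 0.5, float("inf")])
```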

Workflow 2: Speculative RL Training

Use this workflow for maximum rollout throughput with EAGLE speculative decoding.

How Speculative RL Works

A small draft model proposes candidate tokens
The target model verifies the candidates in parallel
The draft model is updated via online SFT so that it keeps tracking the policy
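
The draft-then-verify loop above can be sketched with toy next-token functions. This is a greedy-decoding simplification written for illustration: `speculative_step`, `toy_draft`, and `toy_target` are hypothetical names, and a real engine such as SGLang verifies all draft tokens in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int],
    draft_next: NextToken,
    target_next: NextToken,
    num_draft_tokens: int,
) -> Tuple[List[int], int]:
    """Run one draft/verify round; return (new tokens, number of accepted draft tokens)."""
    # 1) The draft model proposes a short continuation.
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(num_draft_tokens):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target model checks each proposed token against its own greedy choice.
    new_tokens: List[int] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            new_tokens.append(expected)
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(tok)
        ctx.append(tok)

    # Every draft token was accepted; the target contributes one extra token.
    new_tokens.append(target_next(ctx))
    return new_tokens, len(proposal)


def toy_target(ctx: List[int]) -> int:
    """Toy target policy: count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Toy draft policy: agrees with the target except right after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


tokens, accepted = speculative_step([0], toy_draft, toy_target, num_draft_tokens=4)
```

In this toy run the draft proposes `[1, 2, 3, 0]`; the target accepts the first three and replaces the fourth, so one round yields four tokens. The output always matches what the target alone would have produced, which is why a drifting draft model costs throughput rather than correctness.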

Step 1: Enable Speculative Decoding

miles supports EAGLE speculative decoding via SGLang:

python train.py \
    --actor-num-gpus-per-node 8 \
    --hf-checkpoint /path/to/target-model \
    --sglang-speculative-algorithm EAGLE \
    --sglang-speculative-num-steps 3 \
    --sglang-speculative-eagle-topk 1 \
    --sglang-speculative-num-draft-tokens 4 \
    --sglang-speculative-draft-model-path /path/to/draft-model \
    --advantage-estimator grpo \
    --prompt-data /path/to/data.jsonl

Step 2: Enable Online MTP Training (Optional)

For online SFT of draft model during training:

--mtp-num-layers 1 \
--enable-mtp-training \
--mtp-loss-scaling-factor 0.2

Note: Online MTP training requires a torch_dist-format checkpoint that includes MTP weights. Add --mtp-num-layers 1 when converting the checkpoint from HuggingFace.

Expected Speedup

Standard rollout: Baseline
Speculative RL: 25-40% faster rollout
With partial rollout: Additional 10-15% throughput

Configuration Reference

miles inherits all slime arguments. See the slime API Reference for the complete list.

Cluster Resources (from slime)

--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate

Megatron Parallelism (from slime)

--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4    # MoE expert parallelism
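
When combining these flags with the cluster settings, it helps to check that the GPU count is a multiple of tensor-parallel × pipeline-parallel size. The sketch below shows that arithmetic only; the function name is hypothetical, and it deliberately ignores context parallelism and any expert-parallel constraints, which the real launcher may also enforce.

```python
def data_parallel_size(num_nodes: int, gpus_per_node: int, tp: int, pp: int) -> int:
    """Return the number of model replicas that fit, or raise if the layout does not divide evenly."""
    world_size = num_nodes * gpus_per_node
    gpus_per_replica = tp * pp
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"{world_size} GPUs cannot be split into replicas of {gpus_per_replica} (TP={tp} x PP={pp})"
        )
    return world_size // gpus_per_replica


# TP=8 with PP=2 needs 16 GPUs per replica, so two 8-GPU nodes hold exactly one replica.
replicas = data_parallel_size(num_nodes=2, gpus_per_node=8, tp=8, pp=2)
```

With the reference values above (TP=8, PP=2), a single 8-GPU node is not enough: the helper raises for `num_nodes=1`.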

Speculative Decoding (miles-specific)

--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path

Online MTP Training (miles-specific)

--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2

Key Features (Conceptual)

The following features are documented in miles, but the specific CLI flags may vary. Consult the miles repository for the latest configuration.

Unified FP8 Pipeline

End-to-end FP8 sampling and training. Using the same precision on both sides eliminates the quantization-induced discrepancy between rollout and training that can cause RL collapse in MoE models.

Rollout Routing Replay (R3)

Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.

How R3 Works:

During SGLang inference, expert routing decisions are recorded
Routing decisions stored in sample.rollout_routed_experts
During Megatron training, routing is replayed instead of recomputed
Ensures identical expert selection between train and inference
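
The record-and-replay idea can be shown with a toy top-k router. This is a conceptual sketch, not the miles or Megatron implementation: `top_k_experts` and `select_experts` are hypothetical names, and the logits are made-up numbers chosen so that a tiny numeric difference flips the routing.

```python
from typing import List, Optional, Sequence


def top_k_experts(router_logits: Sequence[float], k: int) -> List[int]:
    """Pick the k experts with the highest router logits (ties broken by lower index)."""
    order = sorted(range(len(router_logits)), key=lambda e: (-router_logits[e], e))
    return sorted(order[:k])


def select_experts(
    router_logits: Sequence[float],
    k: int,
    replay: Optional[List[int]] = None,
) -> List[int]:
    """Use the recorded routing when one is supplied; otherwise recompute it."""
    if replay is not None:
        return list(replay)
    return top_k_experts(router_logits, k)


# Router logits for one token, as seen by the inference engine and by the trainer.
inference_logits = [0.40, 0.30, 0.29, 0.01]
training_logits = [0.40, 0.29, 0.30, 0.01]  # small numeric drift between the two stacks

recorded = select_experts(inference_logits, k=2)                     # stored with the sample
recomputed = select_experts(training_logits, k=2)                    # what training would pick alone
replayed = select_experts(training_logits, k=2, replay=recorded)     # what R3 uses instead
```

Here recomputation would send the token to experts `[0, 2]` while inference used `[0, 1]`; replaying the recorded choice removes that mismatch.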

INT4 Quantization-Aware Training

Enables single-machine deployment of 1TB+ models (e.g., on H200).

Memory Savings with INT4:

Model Size	BF16 VRAM	INT4 VRAM	Reduction
70B	140GB	45GB	3.1x
235B	470GB	150GB	3.1x
671B	1.3TB	420GB	3.1x
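
The table's figures can be sanity-checked with simple arithmetic: weight memory is roughly parameter count times bits per parameter. The sketch below reproduces the BF16 column and the reduction ratios. The helper names are hypothetical. Reading the INT4 column as raw 4-bit weights plus overhead (scales, layers kept in higher precision) is an assumption, since the source does not break it down.

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB for a model with the given parameter count."""
    return params_billion * bits_per_param / 8


def reduction(bf16_gb: float, int4_gb: float) -> float:
    """Reduction factor as reported in the table, rounded to one decimal."""
    return round(bf16_gb / int4_gb, 1)


# (BF16 GB, INT4 GB) as listed in the table, keyed by parameters in billions.
table = {70: (140, 45), 235: (470, 150), 671: (1300, 420)}

for params, (bf16_gb, int4_gb) in table.items():
    raw_int4 = weight_gb(params, 4)  # pure 4-bit weights, no overhead
    print(f"{params}B: BF16 {weight_gb(params, 16):.0f} GB, raw INT4 {raw_int4:.1f} GB, "
          f"table INT4 {int4_gb} GB, reduction {reduction(bf16_gb, int4_gb)}x")
```

The listed INT4 sizes sit about 25-30% above the raw 4-bit figure, which is why the reduction comes out near 3.1x rather than 4x.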

Train-Inference Alignment

miles achieves "exactly 0 KL divergence" between training and inference through:

Flash Attention 3
DeepGEMM
Batch-invariant kernels from Thinking Machines Lab
torch.compile integration

Sample Data Structure

miles uses the same Sample dataclass as slime with the rollout_routed_experts field for MoE routing replay:

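from __future__ import annotations  # lets the abridged Status annotation stay unresolved

from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str | list[dict]
    tokens: list[int]
    response: str
    reward: float | dict
    loss_mask: list[int]
    status: Status  # sample-status type defined in slime (not shown here)
    metadata: dict
    rollout_log_probs: list[float]  # per-token log probs from the rollout engine
    rollout_routed_experts: list[list[int]]  # MoE routing for R3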

See the slime API Reference for the complete Sample definition.

Common Issues and Solutions

Issue: FP8 Training Collapse

Symptoms: Loss explodes, NaN values

Solutions:

Use block scaling: export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
Reduce learning rate: --lr 5e-7
Ensure MoE routing is consistent between train/inference

Issue: Speculative Draft Drift

Symptoms: Low acceptance rate over time

Solutions:

Enable online MTP training to keep draft model aligned
Reduce speculative steps: --sglang-speculative-num-steps 2
Use CPU backup: --sglang-enable-draft-weights-cpu-backup

Issue: Train-Inference Mismatch

Symptoms: Policy divergence, reward collapse

Solutions:

Use TIS for off-policy correction: --use-tis --tis-threshold 0.9
Verify log probs match between SGLang and Megatron
Enable R3 for MoE models
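
To verify that log probs match, you can compare the per-token values reported by the rollout engine with those recomputed by the trainer for the same tokens. The sketch below is one way to summarize the gap. The function name and returned keys are hypothetical, and the KL figure uses the k3 estimator, `(ratio - 1) - log(ratio)`, which is a common choice and not necessarily what miles logs.

```python
import math
from typing import Dict, List


def logprob_mismatch(rollout_logp: List[float], train_logp: List[float]) -> Dict[str, float]:
    """Summarize the per-token gap between rollout and training log probs."""
    if len(rollout_logp) != len(train_logp):
        raise ValueError("log-prob sequences must cover the same tokens")
    # d = log pi_train(token) - log pi_rollout(token), for tokens sampled by the rollout policy.
    diffs = [t - r for r, t in zip(rollout_logp, train_logp)]
    n = len(diffs)
    return {
        "mean_abs_diff": sum(abs(d) for d in diffs) / n,
        "max_abs_diff": max(abs(d) for d in diffs),
        # k3 estimate of KL(rollout || train); each term is non-negative.
        "k3_kl": sum(math.exp(d) - 1.0 - d for d in diffs) / n,
    }


# Hypothetical values: the second call has one token that drifted by 0.1 nats.
aligned = logprob_mismatch([-1.0, -2.0], [-1.0, -2.0])
drifted = logprob_mismatch([-1.0, -2.0], [-1.1, -2.0])
```

A perfectly aligned pair gives zeros for all three numbers; any positive `k3_kl` indicates that the two stacks disagree on at least one token.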

Supported Models

Family	Models	MoE Support
DeepSeek	R1, V3, V3.2	Full
Qwen	2, 2.5, 3 (including MoE)	Full
Llama	3, 3.1, 3.3, 4	Dense only
Gemma	2, 3, 3N	Dense only
GLM	4.5, 4.6, 4.7	Dense only
MiniMax	M2, M2.1	Full

Resources

GitHub: https://github.com/radixark/miles
Introduction Blog: https://lmsys.org/blog/2025-11-19-miles/
Slime (upstream): https://github.com/THUDM/slime
SGLang: https://github.com/sgl-project/sglang


Author

davila7
