slime-rl-training

Name: slime-rl-training
Author: davila7

Provides guidance for LLM post-training with RL using slime, a Megatron+SGLang framework. Use when training GLM models, implementing custom data generation workflows, or needing tight Megatron-LM integration for RL scaling.

Install

mkdir -p .claude/skills/slime-rl-training && curl -L -o skill.zip "https://mcp.directory/api/skills/download/7236" && unzip -o skill.zip -d .claude/skills/slime-rl-training && rm skill.zip

Installs to .claude/skills/slime-rl-training

About this skill

slime: LLM Post-Training Framework for RL Scaling

slime is an LLM post-training framework from Tsinghua's THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.

When to Use slime

Choose slime when you need:

Megatron-LM native training with SGLang inference
Custom data generation workflows with flexible data buffers
Training GLM, Qwen3, DeepSeek V3, or Llama 3 models
Research-grade framework with production backing (Z.ai)

Consider alternatives when:

You need enterprise-grade stability features → use miles
You want flexible backend swapping → use verl
You need PyTorch-native abstractions → use torchforge

Key Features

Training: Megatron-LM with tensor (TP), pipeline (PP), data (DP), and sequence (SP) parallelism
Rollout: SGLang-based high-throughput generation with router
Data Buffer: Flexible prompt management and sample storage
Models: GLM-4.x, Qwen3, DeepSeek V3/R1, Llama 3

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                    Data Buffer                          │
│ - Prompt initialization and management                  │
│ - Custom data generation and filtering                  │
│ - Rollout sample storage                                │
└─────────────┬───────────────────────────┬───────────────┘
              │                           │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
│ - Actor model training  │ │ - Response generation       │
│ - Critic (optional)     │ │ - Reward/verifier output    │
│ - Weight sync to rollout│ │ - Multi-turn support        │
└─────────────────────────┘ └─────────────────────────────┘

Installation

# Recommended: Docker
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it slimerl/slime:latest /bin/bash

# Inside container
cd /root/slime && pip install -e . --no-deps

From Source

git clone https://github.com/THUDM/slime.git
cd slime
pip install -r requirements.txt
pip install -e .

Quick Start: GRPO Training

# Source model configuration
source scripts/models/qwen3-4B.sh

# Launch training
python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4 \
    --advantage-estimator grpo \
    --use-kl-loss --kl-loss-coef 0.001 \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --prompt-data /path/to/data.jsonl \
    ${MODEL_ARGS[@]} ${CKPT_ARGS[@]}

Workflow 1: Standard GRPO Training

Use this workflow for training reasoning models with group-relative advantages.
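As a rough illustration of what "group-relative" means, the sketch below normalizes each reward against the other responses sampled for the same prompt. It follows the commonly described GRPO formulation (subtract the group mean, divide by the group standard deviation); the function name and the `eps` term are assumptions for this example, and slime's own estimator may differ in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps (an assumption of this sketch) avoids division by zero when every
    # response in the group received the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt with 8 sampled responses (matching --n-samples-per-prompt 8),
# scored 1.0 when the answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Correct responses receive a positive advantage and incorrect ones a negative advantage, and a group where all rewards are equal contributes no learning signal.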

Prerequisites Checklist

Docker environment or Megatron-LM + SGLang installed
Model checkpoint (HuggingFace or Megatron format)
Training data in JSONL format

Step 1: Prepare Data

data.jsonl format (one JSON object per line):
{"prompt": "What is 2 + 2?", "label": "4"}
{"prompt": "Solve: 3x = 12", "label": "x = 4"}

Or with chat format (pretty-printed here for readability; in the JSONL file each record must occupy a single line):

{
    "prompt": [
        {"role": "system", "content": "You are a math tutor."},
        {"role": "user", "content": "What is 15 + 27?"}
    ],
    "label": "42"
}
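Before launching a long run it can help to check the data file. The following is a minimal sketch that writes records one per line and verifies that each line parses and carries the `prompt` and `label` keys used in the examples above. The helper names `to_jsonl` and `validate_jsonl` are hypothetical and not part of slime.

```python
import json


def to_jsonl(records):
    """Serialize records one per line, as JSONL requires."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


def validate_jsonl(text, input_key="prompt", label_key="label"):
    """Return the parsed records, raising ValueError on a malformed line."""
    records = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)  # JSONDecodeError is a ValueError subclass
        for key in (input_key, label_key):
            if key not in record:
                raise ValueError(f"line {lineno}: missing key {key!r}")
        # A prompt is either plain text or a list of chat messages.
        if not isinstance(record[input_key], (str, list)):
            raise ValueError(f"line {lineno}: prompt must be a string or a list")
        records.append(record)
    return records


records = [
    {"prompt": "What is 2 + 2?", "label": "4"},
    {"prompt": [{"role": "user", "content": "What is 15 + 27?"}], "label": "42"},
]
text = to_jsonl(records)
parsed = validate_jsonl(text)
```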

Step 2: Configure Model

Choose a pre-configured model script:

# List available models
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...

# Source your model
source scripts/models/qwen3-4B.sh

Step 3: Launch Training

python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --use-kl-loss \
    --kl-loss-coef 0.001 \
    --prompt-data /path/to/train.jsonl \
    --input-key prompt \
    --label-key label \
    --apply-chat-template \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --save-interval 100 \
    --eval-interval 50 \
    ${MODEL_ARGS[@]}

Step 4: Monitor Training

Check TensorBoard: tensorboard --logdir outputs/
Verify reward curves are increasing
Monitor GPU utilization across nodes
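One simple way to judge whether a noisy reward curve is trending upward is to fit a straight line to the logged mean rewards and look at the sign of its slope. The sketch below does this with a plain least-squares fit; the reward values are made up for illustration, and how you export them from your logs is left open.

```python
def trend_slope(values):
    """Least-squares slope of values plotted against their index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


# Illustrative (made-up) mean reward per rollout step.
mean_rewards = [0.21, 0.25, 0.24, 0.31, 0.35, 0.38]
slope = trend_slope(mean_rewards)
is_improving = slope > 0
```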

Workflow 2: Asynchronous Training

Use async mode for higher throughput by overlapping rollout and training.

When to Use Async

Large models with long generation times
High GPU idle time in synchronous mode
Sufficient memory for buffering

Launch Async Training

python train_async.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --async-buffer-size 4 \
    --prompt-data /path/to/train.jsonl \
    ${MODEL_ARGS[@]}

Async-Specific Parameters

--async-buffer-size 4        # Number of rollouts to buffer
--update-weights-interval 2  # Sync weights every N rollouts
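The interplay of these two parameters can be pictured as a bounded producer/consumer queue: rollout generation fills the buffer and blocks when it is full, while the trainer drains it and pushes fresh weights every N rollouts. The sketch below models that idea with `asyncio`. It is a conceptual illustration under that assumption, not slime's implementation, and all function names are hypothetical.

```python
import asyncio


async def rollout_worker(queue, num_rollouts):
    """Producer: generate rollouts, blocking while the buffer is full."""
    for rollout_id in range(num_rollouts):
        await asyncio.sleep(0)  # stand-in for generation work
        await queue.put(rollout_id)
    await queue.put(None)  # sentinel: no more rollouts


async def trainer(queue, update_weights_interval, log):
    """Consumer: train on each rollout, syncing weights every N rollouts."""
    consumed = 0
    while True:
        rollout_id = await queue.get()
        if rollout_id is None:
            break
        log.append(("train", rollout_id))
        consumed += 1
        if consumed % update_weights_interval == 0:
            log.append(("sync_weights", rollout_id))


async def run(num_rollouts=6, buffer_size=4, update_weights_interval=2):
    queue = asyncio.Queue(maxsize=buffer_size)  # plays the role of the buffer
    log = []
    await asyncio.gather(
        rollout_worker(queue, num_rollouts),
        trainer(queue, update_weights_interval, log),
    )
    return log


log = asyncio.run(run())
```

A larger buffer lets generation run further ahead of training, at the cost of training on samples produced by older weights.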

Workflow 3: Multi-Turn Agentic Training

Use this workflow for training agents with tool use or multi-step reasoning.

Prerequisites

Custom generate function for multi-turn logic
Tool/environment interface

Step 1: Define Custom Generate Function

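# custom_generate.py
# generate_single, extract_tool_call, execute_tool, and compute_reward are
# placeholders that you must implement for your model and environment.
async def custom_generate(args, samples, evaluation=False):
    """Multi-turn generation with tool calling."""
    for sample in samples:
        # Assumes a chat-format prompt (a list of messages). Copy it so the
        # tool-call turns do not mutate the original prompt.
        conversation = list(sample.prompt)
        response = ""  # stays empty if args.max_turns is 0

        for turn in range(args.max_turns):
            # Generate the next assistant response
            response = await generate_single(conversation)

            # Stop as soon as the model answers without calling a tool
            tool_call = extract_tool_call(response)
            if not tool_call:
                break

            # Run the tool and feed its result back into the conversation
            tool_result = execute_tool(tool_call)
            conversation.append({"role": "assistant", "content": response})
            conversation.append({"role": "tool", "content": tool_result})

        sample.response = response
        sample.reward = compute_reward(sample)

    return samples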
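The function above relies on helpers that this skill does not define. To try the control flow without a model, the following self-contained sketch swaps in stand-ins: a stub "model" that calls a toy calculator once and then answers, a regex-based tool-call parser, and an exact-match reward. The `Sample` dataclass, the `<tool>` tag format, and every helper here are assumptions made for illustration only.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """Hypothetical stand-in for the sample object passed to custom_generate."""
    prompt: list
    label: str
    response: str = ""
    reward: float = 0.0


async def generate_single(conversation):
    """Stub model: request the calculator once, then answer with its result."""
    last = conversation[-1]
    if last["role"] == "tool":
        return f"The answer is {last['content']}."
    return "<tool>15 + 27</tool>"


def extract_tool_call(response) -> Optional[str]:
    """Return the text inside <tool>...</tool>, or None if there is no call."""
    match = re.search(r"<tool>(.*?)</tool>", response)
    return match.group(1) if match else None


def execute_tool(expression):
    """Toy calculator that only handles 'a + b'."""
    left, right = expression.split("+")
    return str(int(left) + int(right))


async def run_episode(sample, max_turns):
    conversation = list(sample.prompt)  # copy so the prompt is not mutated
    response = ""
    for _ in range(max_turns):
        response = await generate_single(conversation)
        tool_call = extract_tool_call(response)
        if tool_call is None:
            break
        conversation.append({"role": "assistant", "content": response})
        conversation.append({"role": "tool", "content": execute_tool(tool_call)})
    sample.response = response
    # Exact-match reward: 1.0 if the label appears in the final response.
    sample.reward = 1.0 if sample.label in response else 0.0
    return sample


sample = Sample(prompt=[{"role": "user", "content": "What is 15 + 27?"}], label="42")
result = asyncio.run(run_episode(sample, max_turns=5))
```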

Step 2: Launch with Custom Function

python train.py \
    --custom-generate-function-path custom_generate.py \
    --max-turns 5 \
    --prompt-data /path/to/agent_data.jsonl \
    ${MODEL_ARGS[@]}

See examples/search-r1/ for a complete multi-turn search example.

Configuration Reference

Three Argument Categories

slime uses three types of arguments:

1. Megatron Arguments (passed directly):

--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
--num-layers 32
--hidden-size 4096

2. SGLang Arguments (prefixed with --sglang-):

--sglang-mem-fraction-static 0.8
--sglang-context-length 8192
--sglang-log-level INFO

3. slime Arguments:

# Resource allocation
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--colocate  # Share GPUs between training/inference

# Data
--prompt-data /path/to/data.jsonl
--input-key prompt
--label-key label

# Training loop
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--global-batch-size 256

# Algorithm
--advantage-estimator grpo  # or: gspo, ppo, reinforce_plus_plus
--use-kl-loss
--kl-loss-coef 0.001
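The `--sglang-` prefix is what distinguishes the second category on the command line. As a small illustration of that naming convention, the sketch below separates prefixed flags (and their values) from everything else. It does not reproduce how slime actually parses arguments, and `split_sglang_flags` is a hypothetical helper.

```python
SGLANG_PREFIX = "--sglang-"


def split_sglang_flags(argv):
    """Separate --sglang-* flags (with their values) from all other tokens."""
    sglang, other = [], []
    target = other
    for token in argv:
        # A token starting with "--" opens a new flag; the values that follow
        # it are routed to the same bucket as the flag itself.
        if token.startswith("--"):
            target = sglang if token.startswith(SGLANG_PREFIX) else other
        target.append(token)
    return sglang, other


argv = [
    "--tensor-model-parallel-size", "2",
    "--sglang-mem-fraction-static", "0.8",
    "--colocate",
    "--sglang-log-level", "INFO",
    "--num-rollout", "3000",
]
sglang_args, other_args = split_sglang_flags(argv)
```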

Key Constraints

rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout

Example: 32 × 8 = 256 × 1
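A quick way to check a configuration against this constraint before launching is to compute the implied number of training steps per rollout and reject values that do not divide evenly. The helper below is a sketch for that purpose; its name is hypothetical and it is not part of slime.

```python
def num_steps_per_rollout(rollout_batch_size, n_samples_per_prompt, global_batch_size):
    """Return the training steps implied per rollout, or raise if inconsistent."""
    samples = rollout_batch_size * n_samples_per_prompt
    steps, remainder = divmod(samples, global_batch_size)
    if steps == 0 or remainder != 0:
        raise ValueError(
            f"{rollout_batch_size} x {n_samples_per_prompt} = {samples} samples "
            f"is not a positive multiple of global_batch_size={global_batch_size}"
        )
    return steps


# The Quick Start values: 32 x 8 = 256 x 1
steps = num_steps_per_rollout(32, 8, 256)
```

Doubling `--rollout-batch-size` to 64 while keeping the other values yields two steps per rollout, since 64 × 8 = 256 × 2.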

Data Buffer System

slime's data buffer enables flexible data management:

Basic Data Source

class RolloutDataSource:
    def get_samples(self, num_samples):
        """Fetch prompts from dataset."""
        return self.dataset.sample(num_samples)

    def add_samples(self, samples):
        """Called after generation (no-op by default)."""
        pass

Buffered Data Source (Off-Policy)

class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self):
        super().__init__()
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """Custom selection logic (prioritized, stratified, etc.)."""
        # select_best is a placeholder for your own selection policy.
        return select_best(buffer, num_samples)
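To make the buffering idea concrete, here is a self-contained toy version of the two classes. It serves buffered samples first, ranked by reward, and tops up the batch with fresh prompts. The constructor arguments, the dict-based samples, the reward-ranking policy, and the overridden `get_samples` are all assumptions made for this sketch; slime's real classes may be structured differently.

```python
import random


class RolloutDataSource:
    def __init__(self, prompts, seed=0):
        self.prompts = list(prompts)
        self.rng = random.Random(seed)

    def get_samples(self, num_samples):
        """Draw fresh prompts that have not been scored yet."""
        chosen = self.rng.sample(self.prompts, num_samples)
        return [{"prompt": p, "reward": None} for p in chosen]

    def add_samples(self, samples):
        """No-op: the basic source does not keep generated samples."""
        pass


class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self, prompts, seed=0):
        super().__init__(prompts, seed)
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """One possible policy: keep the highest-reward samples."""
        ranked = sorted(buffer, key=lambda s: s["reward"], reverse=True)
        return ranked[:num_samples]

    def get_samples(self, num_samples):
        """Serve buffered samples first, then top up with fresh prompts."""
        reused = self.buffer_filter(None, self.buffer, num_samples)
        for sample in reused:
            self.buffer.remove(sample)
        fresh = super().get_samples(num_samples - len(reused))
        return reused + fresh


source = RolloutDataSourceWithBuffer(["p1", "p2", "p3", "p4"])
source.add_samples([
    {"prompt": "a", "reward": 0.2},
    {"prompt": "b", "reward": 0.9},
    {"prompt": "c", "reward": 0.5},
])
batch = source.get_samples(4)
```

Here the batch contains the three buffered samples in descending reward order followed by one fresh prompt, and the buffer is left empty.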

Common Issues and Solutions

Issue: SGLang Engine Crash

Symptoms: Inference engine dies mid-training

Solutions:

# Enable fault tolerance
--use-fault-tolerance

# Increase memory allocation
--sglang-mem-fraction-static 0.85

# Reduce batch size
--rollout-batch-size 16

Issue: Weight Sync Timeout

Symptoms: Training hangs after rollout

Solutions:

# Increase sync interval
--update-weights-interval 5

# Use colocated mode (no network transfer)
--colocate

Issue: OOM During Training

Symptoms: CUDA OOM in backward pass

Solutions:

# Enable gradient checkpointing
--recompute-activations

# Reduce micro-batch size
--micro-batch-size 1

# Enable sequence parallelism
--sequence-parallel

Issue: Slow Data Loading

Symptoms: GPU idle during data fetch

Solutions:

# Increase data workers
--num-data-workers 4

# Use streaming dataset
--streaming-data

Supported Models

Model Family    Configurations
GLM             GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B
Qwen            Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5
DeepSeek        V3, V3.1, R1
Llama           Llama 3 (8B, 70B)
Others          Kimi K2, Moonlight-16B

Each model has pre-configured scripts in scripts/models/.

Advanced Topics

Co-location Mode

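Share the same GPUs between training and inference to reduce the number of GPUs required. Because both workloads then occupy the same devices, lower SGLang's static memory fraction to leave room for training: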

python train.py \
    --colocate \
    --actor-num-gpus-per-node 8 \
    --sglang-mem-fraction-static 0.4 \
    ${MODEL_ARGS[@]}

Custom Reward Model

# custom_rm.py
class CustomRewardModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def compute_reward(self,

---

*Content truncated.*


