distributed-llm-pretraining-torchtitan
Provides PyTorch-native distributed LLM pretraining using torchtitan with 4D parallelism (FSDP2, TP, PP, CP). Use when pretraining Llama 3.1, DeepSeek V3, or custom models at scale from 8 to 512+ GPUs with Float8, torch.compile, and distributed checkpointing.
Install
mkdir -p .claude/skills/distributed-llm-pretraining-torchtitan && curl -L -o skill.zip "https://mcp.directory/api/skills/download/6227" && unzip -o skill.zip -d .claude/skills/distributed-llm-pretraining-torchtitan && rm skill.zipInstalls to .claude/skills/distributed-llm-pretraining-torchtitan
About this skill
TorchTitan - PyTorch Native Distributed LLM Pretraining
Quick start
TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.
Installation:
# From PyPI (stable)
pip install torchtitan
# From source (latest features, requires PyTorch nightly)
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
Download tokenizer:
# Get HF token from https://huggingface.co/settings/tokens
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
Start training on 8 GPUs:
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
Common workflows
Workflow 1: Pretrain Llama 3.1 8B on single node
Copy this checklist:
Single Node Pretraining:
- [ ] Step 1: Download tokenizer
- [ ] Step 2: Configure training
- [ ] Step 3: Launch training
- [ ] Step 4: Monitor and checkpoint
Step 1: Download tokenizer
python scripts/download_hf_assets.py \
--repo_id meta-llama/Llama-3.1-8B \
--assets tokenizer \
--hf_token=YOUR_HF_TOKEN
Step 2: Configure training
Edit or create a TOML config file:
# llama3_8b_custom.toml
[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B training"
[model]
name = "llama3"
flavor = "8B"
hf_assets_path = "./assets/hf/Llama-3.1-8B"
[optimizer]
name = "AdamW"
lr = 3e-4
[lr_scheduler]
warmup_steps = 200
[training]
local_batch_size = 2
seq_len = 8192
max_norm = 1.0
steps = 1000
dataset = "c4"
[parallelism]
data_parallel_shard_degree = -1 # Use all GPUs for FSDP
[activation_checkpoint]
mode = "selective"
selective_ac_option = "op"
[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
Step 3: Launch training
# 8 GPUs on single node
CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh
# Or explicitly with torchrun
torchrun --nproc_per_node=8 \
-m torchtitan.train \
--job.config_file ./llama3_8b_custom.toml
Step 4: Monitor and checkpoint
TensorBoard logs are saved to ./outputs/tb/:
tensorboard --logdir ./outputs/tb
Workflow 2: Multi-node training with SLURM
Multi-Node Training:
- [ ] Step 1: Configure parallelism for scale
- [ ] Step 2: Set up SLURM script
- [ ] Step 3: Submit job
- [ ] Step 4: Resume from checkpoint
Step 1: Configure parallelism for scale
For 70B model on 256 GPUs (32 nodes):
[parallelism]
data_parallel_shard_degree = 32 # FSDP across 32 ranks
tensor_parallel_degree = 8 # TP within node
pipeline_parallel_degree = 1 # No PP for 70B
context_parallel_degree = 1 # Increase for long sequences
Step 2: Set up SLURM script
#!/bin/bash
#SBATCH --job-name=llama70b
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
srun torchrun \
--nnodes=32 \
--nproc_per_node=8 \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
-m torchtitan.train \
--job.config_file ./llama3_70b.toml
Step 3: Submit job
sbatch multinode_trainer.slurm
Step 4: Resume from checkpoint
Training auto-resumes if checkpoint exists in configured folder.
Workflow 3: Enable Float8 training for H100s
Float8 provides 30-50% speedup on H100 GPUs.
Float8 Training:
- [ ] Step 1: Install torchao
- [ ] Step 2: Configure Float8
- [ ] Step 3: Launch with compile
Step 1: Install torchao
USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
Step 2: Configure Float8
Add to your TOML config:
[model]
converters = ["quantize.linear.float8"]
[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output"] # Exclude output layer
[compile]
enable = true
components = ["model", "loss"]
Step 3: Launch with compile
CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
--model.converters="quantize.linear.float8" \
--quantize.linear.float8.enable_fsdp_float8_all_gather \
--compile.enable
Workflow 4: 4D parallelism for 405B models
4D Parallelism (FSDP + TP + PP + CP):
- [ ] Step 1: Create seed checkpoint
- [ ] Step 2: Configure 4D parallelism
- [ ] Step 3: Launch on 512 GPUs
Step 1: Create seed checkpoint
Required for consistent initialization across PP stages:
NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
--checkpoint.enable \
--checkpoint.create_seed_checkpoint \
--parallelism.data_parallel_shard_degree 1 \
--parallelism.tensor_parallel_degree 1 \
--parallelism.pipeline_parallel_degree 1
Step 2: Configure 4D parallelism
[parallelism]
data_parallel_shard_degree = 8 # FSDP
tensor_parallel_degree = 8 # TP within node
pipeline_parallel_degree = 8 # PP across nodes
context_parallel_degree = 1 # CP for long sequences
[training]
local_batch_size = 32
seq_len = 8192
Step 3: Launch on 512 GPUs
# 64 nodes x 8 GPUs = 512 GPUs
srun torchrun --nnodes=64 --nproc_per_node=8 \
-m torchtitan.train \
--job.config_file ./llama3_405b.toml
When to use vs alternatives
Use TorchTitan when:
- Pretraining LLMs from scratch (8B to 405B+)
- Need PyTorch-native solution without third-party dependencies
- Require composable 4D parallelism (FSDP2, TP, PP, CP)
- Training on H100s with Float8 support
- Want interoperable checkpoints with torchtune/HuggingFace
Use alternatives instead:
- Megatron-LM: Maximum performance for NVIDIA-only deployments
- DeepSpeed: Broader ZeRO optimization ecosystem, inference support
- Axolotl/TRL: Fine-tuning rather than pretraining
- LitGPT: Educational, smaller-scale training
Common issues
Issue: Out of memory on large models
Enable activation checkpointing and reduce batch size:
[activation_checkpoint]
mode = "full" # Instead of "selective"
[training]
local_batch_size = 1
Or use gradient accumulation:
[training]
local_batch_size = 1
global_batch_size = 32 # Accumulates gradients
Issue: TP causes high memory with async collectives
Set environment variable:
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
Issue: Float8 training not faster
Float8 only benefits large GEMMs. Filter small layers:
[quantize.linear.float8]
filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
Issue: Checkpoint loading fails after parallelism change
Use DCP's resharding capability:
# Convert sharded checkpoint to single file
python -m torch.distributed.checkpoint.format_utils \
dcp_to_torch checkpoint/step-1000 checkpoint.pt
Issue: Pipeline parallelism initialization
Create seed checkpoint first (see Workflow 4, Step 1).
Supported models
| Model | Sizes | Status |
|---|---|---|
| Llama 3.1 | 8B, 70B, 405B | Production |
| Llama 4 | Various | Experimental |
| DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
| GPT-OSS | 20B, 120B (MoE) | Experimental |
| Qwen 3 | Various | Experimental |
| Flux | Diffusion | Experimental |
Performance benchmarks (H100)
| Model | GPUs | Parallelism | TPS/GPU | Techniques |
|---|---|---|---|---|
| Llama 8B | 8 | FSDP | 5,762 | Baseline |
| Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
| Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
| Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |
Advanced topics
FSDP2 configuration: See references/fsdp.md for detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.
Float8 training: See references/float8.md for tensorwise vs rowwise scaling recipes.
Checkpointing: See references/checkpoint.md for HuggingFace conversion and async checkpointing.
Adding custom models: See references/custom-models.md for TrainSpec protocol.
Resources
- GitHub: https://github.com/pytorch/torchtitan
- Paper: https://arxiv.org/abs/2410.06511
- ICLR 2025: https://iclr.cc/virtual/2025/poster/29620
- PyTorch Forum: https://discuss.pytorch.org/c/distributed/torchtitan/44
More by davila7
View all skills by davila7 →You might also like
flutter-development
aj-geddes
Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.
drawio-diagrams-enhanced
jgtolentino
Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.
godot
bfollington
This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.
ui-ux-pro-max
nextlevelbuilder
"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."
nano-banana-pro
garg-aayush
Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.
fastapi-templates
wshobson
Create production-ready FastAPI projects with async patterns, dependency injection, and comprehensive error handling. Use when building new FastAPI applications or setting up backend API projects.
Related MCP Servers
Browse all serversEasily convert markdown to PDF using Markitdown MCP server. Supports HTTP, STDIO, and SSE for fast converting markdown t
Enhance software testing with Playwright MCP: Fast, reliable browser automation, an innovative alternative to Selenium s
Use Chrome DevTools for web site test speed, debugging, and performance analysis. The essential chrome developer tools f
Claude Context offers semantic code search and indexing with vector embeddings and AST-based code splitting. Natural lan
Enhance productivity with AI-driven Notion automation. Leverage the Notion API for secure, automated workspace managemen
Cipher empowers agents with persistent memory using vector databases and embeddings for seamless context retention and t
Stay ahead of the MCP ecosystem
Get weekly updates on new skills and servers.