training-llms-megatron

0
0
Source

Trains large language models (2B-462B parameters) using NVIDIA Megatron-Core with advanced parallelism strategies. Use when training models >1B parameters, need maximum GPU efficiency (47% MFU on H100), or require tensor/pipeline/sequence/context/expert parallelism. Production-ready framework used for Nemotron, LLaMA, DeepSeek.

Install

mkdir -p .claude/skills/training-llms-megatron && curl -L -o skill.zip "https://mcp.directory/api/skills/download/7702" && unzip -o skill.zip -d .claude/skills/training-llms-megatron && rm skill.zip

Installs to .claude/skills/training-llms-megatron

About this skill

Megatron-Core - Large-Scale LLM Training

Quick start

Megatron-Core trains LLMs from 2B to 462B parameters with up to 47% Model FLOP Utilization on H100 GPUs through advanced parallelism strategies.

Installation:

# Docker (recommended)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:25.04-py3

# Or pip
pip install megatron-core

Simple distributed training:

# Train with 2 GPUs using data parallelism
torchrun --nproc_per_node=2 examples/run_simple_mcore_train_loop.py

# Or LLaMA-3 8B training
./examples/llama/train_llama3_8b_fp8.sh

Common workflows

Workflow 1: Train LLaMA-style model with 3D parallelism

Copy this checklist:

LLaMA Training Setup:
- [ ] Step 1: Choose parallelism configuration
- [ ] Step 2: Configure training hyperparameters
- [ ] Step 3: Launch distributed training
- [ ] Step 4: Monitor performance metrics

Step 1: Choose parallelism configuration

Model size determines parallelism strategy:

Model SizeGPUsTensor ParallelPipeline ParallelData ParallelContext Parallel
7B81181
13B82141
70B644441
405B1288822

Step 2: Configure training hyperparameters

#!/bin/bash
# train_llama_70b.sh

GPUS_PER_NODE=8
NNODES=8  # 64 GPUs total
TP=4      # Tensor parallel
PP=4      # Pipeline parallel
CP=1      # Context parallel

# LLaMA 70B configuration
MODEL_SIZE=70  # Billion parameters
HIDDEN_SIZE=8192
NUM_LAYERS=80
NUM_HEADS=64
SEQ_LENGTH=4096

# Training hyperparameters
MICRO_BATCH=1
GLOBAL_BATCH=1024
LR=3e-4

torchrun \
  --nproc_per_node=$GPUS_PER_NODE \
  --nnodes=$NNODES \
  pretrain_gpt.py \
  --tensor-model-parallel-size $TP \
  --pipeline-model-parallel-size $PP \
  --context-parallel-size $CP \
  --sequence-parallel \
  --num-layers $NUM_LAYERS \
  --hidden-size $HIDDEN_SIZE \
  --num-attention-heads $NUM_HEADS \
  --seq-length $SEQ_LENGTH \
  --max-position-embeddings $SEQ_LENGTH \
  --micro-batch-size $MICRO_BATCH \
  --global-batch-size $GLOBAL_BATCH \
  --lr $LR \
  --train-iters 100000 \
  --lr-decay-style cosine \
  --lr-warmup-iters 2000 \
  --weight-decay 0.1 \
  --clip-grad 1.0 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt

Step 3: Launch distributed training

# Single node (8 GPUs)
bash train_llama_70b.sh

# Multi-node with SLURM
sbatch --nodes=8 --gpus-per-node=8 train_llama_70b.sh

Step 4: Monitor performance metrics

Key metrics to track:

Model FLOP Utilization (MFU): Target >40% on H100
Throughput: Tokens/sec/GPU
Memory usage: <80GB per GPU for 70B model
Loss: Should decrease steadily

Workflow 2: Configure Mixture of Experts (MoE) training

For sparse MoE models like Mixtral.

MoE Training:
- [ ] Step 1: Configure expert parallelism
- [ ] Step 2: Set MoE hyperparameters
- [ ] Step 3: Launch training with EP

Step 1: Configure expert parallelism

# Mixtral 8x7B example
TENSOR_PARALLEL=2
PIPELINE_PARALLEL=1
EXPERT_PARALLEL=4  # Split 8 experts across 4 GPUs
DATA_PARALLEL=4

TOTAL_GPUS=$((TENSOR_PARALLEL * PIPELINE_PARALLEL * EXPERT_PARALLEL * DATA_PARALLEL))
# = 2 * 1 * 4 * 4 = 32 GPUs

Step 2: Set MoE hyperparameters

torchrun \
  --nproc_per_node=8 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 1 \
  --expert-model-parallel-size 4 \
  --num-experts 8 \
  --moe-router-topk 2 \
  --moe-router-load-balancing-type aux_loss \
  --moe-aux-loss-coeff 0.01 \
  --hidden-size 4096 \
  --num-layers 32 \
  --num-attention-heads 32 \
  --seq-length 4096 \
  --max-position-embeddings 4096 \
  --bf16 \
  --use-mcore-models \
  --transformer-impl transformer_engine \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt

Step 3: Launch training with EP

Expert parallelism distributes different experts across GPUs, reducing memory while maintaining capacity.

Memory without EP: 8 experts × 7B = 56GB per GPU
Memory with EP=4: 2 experts × 7B = 14GB per GPU
Savings: 75% memory reduction

Workflow 3: Optimize for maximum throughput

Achieve 47% MFU on H100.

Performance Optimization:
- [ ] Step 1: Enable Flash Attention
- [ ] Step 2: Use FP8 precision (H100)
- [ ] Step 3: Optimize micro-batch size
- [ ] Step 4: Tune parallelism degrees

Step 1: Enable optimizations

--use-mcore-models  # Use Megatron Core models
--transformer-impl transformer_engine  # Use Transformer Engine
--sequence-parallel  # Reduce activation memory (use with TP)

Step 2: Use FP8 precision (H100 only)

--fp8-hybrid  # FP8 mixed precision training
# Transformer Engine handles FP8 automatically

Result: 1.5-2x speedup on H100 vs BF16.

Step 3: Optimize micro-batch size

Find largest micro-batch that fits in memory:

# Start with 1, increase until OOM
for MBS in 1 2 4 8; do
  echo "Testing micro-batch-size=$MBS"
  torchrun ... --micro-batch-size $MBS
done

Typical values:

  • 7B model: 4-8
  • 70B model: 1-2
  • 405B model: 1

Step 4: Tune parallelism degrees

Rules of thumb:

Tensor Parallel: Use ≤8 (limited by NVLink within node)
Pipeline Parallel: Use for >70B models
Context Parallel: Use for sequences >8K tokens
Data Parallel: Fill remaining GPUs

Example 405B on 128 H100s:

TP=8 (1 node)
PP=8 (across nodes)
CP=2 (long sequences)
DP=1
Total = 8 × 8 × 2 × 1 = 128 GPUs

When to use vs alternatives

Use Megatron-Core when:

  • Training models >10B parameters
  • Need maximum efficiency (target >40% MFU)
  • Using NVIDIA GPUs (A100, H100)
  • Production training at scale
  • Want fine-grained parallelism control

Use alternatives instead:

  • PyTorch FSDP: Models <70B, simpler API, PyTorch native
  • DeepSpeed: Easier setup, good for <100B models
  • HuggingFace Accelerate: Prototyping, simpler workflows
  • LitGPT: Educational, single-file implementations

Common issues

Issue: Low GPU utilization (<30% MFU)

Causes:

  1. Micro-batch too small
  2. Too much parallelism overhead
  3. Not using Flash Attention

Fixes:

# Increase micro-batch
--micro-batch-size 4  # Was 1

# Enable optimizations
--use-flash-attn
--sequence-parallel

# Reduce TP if >8
--tensor-model-parallel-size 4  # Was 16

Issue: Out of memory

Reduce memory with:

--tensor-model-parallel-size 2  # Split model across GPUs
--recompute-granularity full  # Gradient checkpointing
--recompute-method block  # Checkpoint transformer blocks
--recompute-num-layers 1  # Checkpoint every layer

Or use CPU/NVMe offloading:

--cpu-optimizer  # Offload optimizer to CPU
--cpu-optimizer-type ADAM  # CPU Adam variant

Issue: Training slower than expected

Check:

  1. Network bottleneck: Ensure InfiniBand/NVLink enabled
  2. Pipeline bubbles: Use interleaved pipeline schedule
    --num-layers-per-virtual-pipeline-stage 2
    
  3. Data loading: Use fast data loader
    --dataloader-type cyclic
    

Issue: Diverging loss

Stabilize training:

--lr-warmup-iters 2000  # Longer warmup
--clip-grad 1.0  # Gradient clipping
--init-method-std 0.006  # Smaller init
--attention-dropout 0.0  # No dropout in attention
--hidden-dropout 0.0  # No dropout in FFN

Advanced topics

Parallelism strategies: See references/parallelism-guide.md for detailed comparison of TP/PP/DP/CP/EP with performance analysis and when to use each.

Performance benchmarks: See references/benchmarks.md for MFU numbers across different model sizes and GPU configurations.

Production configurations: See references/production-examples.md for real-world setups from LLaMA 3 405B, Nemotron-4 340B, and DeepSeek-V3 671B.

Training recipes: See references/training-recipes.md for complete hyperparameter configurations for GPT/LLaMA/Mixtral architectures.

Hardware requirements

  • GPU: NVIDIA Ampere+ (A100, H100, B200)
    • Turing works but slower
    • FP8 requires Hopper/Ada/Blackwell
  • Network: InfiniBand or 400Gb+ Ethernet for multi-node
  • Memory per GPU:
    • 7B model: 40GB+
    • 70B model: 80GB (with TP=4)
    • 405B model: 80GB (with TP=8, PP=8)
  • Storage: Fast NVMe for checkpoints (1TB+ for 70B+ models)

Resources

scroll-experience

davila7

Expert in building immersive scroll-driven experiences - parallax storytelling, scroll animations, interactive narratives, and cinematic web experiences. Like NY Times interactives, Apple product pages, and award-winning web experiences. Makes websites feel like experiences, not just pages. Use when: scroll animation, parallax, scroll storytelling, interactive story, cinematic website.

10968

software-architecture

davila7

Guide for quality focused software architecture. This skill should be used when users want to write code, design architecture, analyze code, in any case that relates to software development.

14749

game-development

davila7

Game development orchestrator. Routes to platform-specific skills based on project needs.

10630

senior-fullstack

davila7

Comprehensive fullstack development skill for building complete web applications with React, Next.js, Node.js, GraphQL, and PostgreSQL. Includes project scaffolding, code quality analysis, architecture patterns, and complete tech stack guidance. Use when building new projects, analyzing code quality, implementing design patterns, or setting up development workflows.

9325

ml-paper-writing

davila7

Write publication-ready ML/AI papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM. Use when drafting papers from research repos, structuring arguments, verifying citations, or preparing camera-ready submissions. Includes LaTeX templates, reviewer guidelines, and citation verification workflows.

7823

senior-security

davila7

Comprehensive security engineering skill for application security, penetration testing, security architecture, and compliance auditing. Includes security assessment tools, threat modeling, crypto implementation, and security automation. Use when designing security architecture, conducting penetration tests, implementing cryptography, or performing security audits.

7921

You might also like

flutter-development

aj-geddes

Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.

9521,094

drawio-diagrams-enhanced

jgtolentino

Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.

846846

ui-ux-pro-max

nextlevelbuilder

"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."

571700

godot

bfollington

This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.

548492

nano-banana-pro

garg-aayush

Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.

673466

fastapi-templates

wshobson

Create production-ready FastAPI projects with async patterns, dependency injection, and comprehensive error handling. Use when building new FastAPI applications or setting up backend API projects.

514280

Stay ahead of the MCP ecosystem

Get weekly updates on new skills and servers.