add-archon-model


Guide for adding a new model to the Archon engine. Use when user wants to add support for a new HuggingFace model architecture in ArchonEngine.

Install

mkdir -p .claude/skills/add-archon-model && curl -L -o skill.zip "https://mcp.directory/api/skills/download/2678" && unzip -o skill.zip -d .claude/skills/add-archon-model && rm skill.zip

Installs to .claude/skills/add-archon-model

About this skill

Add Archon Model

Add support for a new HuggingFace model architecture in the Archon training engine.

When to Use

This skill is triggered when:

  • User asks "how do I add a model to Archon?"
  • User wants to support a new model family (e.g., Llama, Mistral, DeepSeek) in ArchonEngine
  • User mentions adding a new ModelSpec or model type for Archon

Prerequisites

Before starting, ensure:

  • The target model is available on HuggingFace (has config.json with model_type)
  • You know the HuggingFace model ID (e.g., meta-llama/Llama-3-8B)
  • The model uses a standard transformer architecture (decoder-only)

Step-by-Step Guide

Step 1: Analyze the Target Model Architecture

Read the HuggingFace model's source code to extract key architecture information.

Action: Fetch and analyze the model's HuggingFace configuration and modeling files.

  1. Read the model's config.json (via AutoConfig.from_pretrained) to identify:

    • model_type string (this is the key used for registry lookup)
    • All architecture hyperparameters (hidden_size, num_layers, etc.)
    • Any model-specific fields (e.g., qk_norm, attention_bias, MoE fields)
  2. Read the HuggingFace modeling_*.py source to identify:

    • Attention variant: Does it have Q/K norm? Attention bias? Sliding window? Multi-latent attention?
    • FFN variant: SwiGLU (gate_proj + up_proj + down_proj)? GeGLU? Standard MLP?
    • MoE support: Does it have MoE layers? What router type? Shared experts?
    • RoPE variant: Standard RoPE? YaRN? NTK-aware scaling? What is the inv_freq formula?
    • Normalization: RMSNorm or LayerNorm? Pre-norm or post-norm? Elementwise affine?
    • Weight tying: Does tie_word_embeddings appear in config?
    • State dict key names: What are the HF weight key naming conventions?
  3. Summarize findings in a checklist like:

Target model: <name>
HF model_type: "<model_type>" (and variants like "<model_type>_moe" if applicable)
Attention: [standard GQA / with QK norm / with bias / sliding window / ...]
FFN: [SwiGLU / GeGLU / standard MLP / ...]
MoE: [no / yes - num_experts, top_k, shared_experts]
RoPE: [standard / YaRN / NTK-aware / ...]
Norm: [RMSNorm / LayerNorm] with [pre-norm / post-norm]
Weight tying: [yes / no]
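The config-reading part of step 1 can be sketched as a small helper that pulls the checklist fields out of a local config.json. The HF field names below (hidden_size, num_hidden_layers, etc.) are common Llama-family conventions and are assumptions to verify against your target model's actual config:

```python
# Sketch of step 1: extract the checklist fields from a HF config.json.
# The field names are common HF conventions; verify them against the
# actual config of your target model.
import json

def summarize_hf_config(config_path: str) -> dict:
    """Return the step-1 checklist fields from a config.json on disk."""
    with open(config_path) as f:
        cfg = json.load(f)
    return {
        "model_type": cfg["model_type"],  # registry lookup key
        "hidden_size": cfg.get("hidden_size"),
        "num_layers": cfg.get("num_hidden_layers"),
        "num_heads": cfg.get("num_attention_heads"),
        # GQA: fall back to MHA if num_key_value_heads is absent
        "num_kv_heads": cfg.get("num_key_value_heads", cfg.get("num_attention_heads")),
        "rope_theta": cfg.get("rope_theta", 10000.0),
        "tie_word_embeddings": cfg.get("tie_word_embeddings", False),
        # model-specific extras surface here, e.g. qk_norm / MoE fields
        "attention_bias": cfg.get("attention_bias"),
    }
```

For gated or remote models, `AutoConfig.from_pretrained` gives the same fields as attributes instead of dict keys.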

Step 2: Select the Reference Model

Choose the closest existing implementation as a starting point:

| Target characteristics | Reference | Why |
| --- | --- | --- |
| Dense-only, standard GQA, no QK norm | qwen2 | Simplest baseline, pure dense |
| Has QK norm, or has MoE support | qwen3 | Supports QK norm + MoE + shared experts |

Action: Copy the reference model directory as the starting point:

areal/experimental/models/archon/<model>/
  __init__.py
  spec.py
  model/
    args.py
    model.py
    rope.py
    state_dict_adapter.py
  infra/
    parallelize.py

Step 3: Implement args.py

Adapt <Model>ModelArgs to match the target model's HuggingFace config fields.

Key changes from reference:

  1. Update the @dataclass fields to match the target model's hyperparameters:

    • Field names should use Archon conventions (dim, n_layers, n_heads, n_kv_heads, vocab_size, head_dim, hidden_dim, norm_eps, rope_theta, etc.)
    • Default values should match the smallest variant of the target model
    • Add model-specific fields (e.g., attention_bias, qk_norm, sliding_window)
  2. Update from_hf_config() to correctly map HuggingFace config attributes:

    • Use getattr(hf_config, "field_name", default) for optional fields
    • Handle variant-specific fields (e.g., MoE fields only present in MoE variants)
    • The method must return an instance of the model args class

Critical: Verify every field mapping against the HF model's config.json. Incorrect mappings here cause silent errors downstream.

Base class contract (BaseModelArgs):

@dataclass
class <Model>ModelArgs(BaseModelArgs):
    # ... model-specific fields ...

    @classmethod
    def from_hf_config(
        cls,
        hf_config: PretrainedConfig,
        is_critic: bool = False,
        **kwargs,
    ) -> <Model>ModelArgs:
        # Map HF config fields to Archon model args
        ...
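A minimal sketch of what that contract looks like filled in, using a hypothetical `ExampleModelArgs`. The Archon-side field names follow the conventions listed above; the HF attribute names are the usual Llama-family ones and must be verified against the real config:

```python
# Hypothetical from_hf_config sketch. Field names on the Archon side follow
# the conventions in this guide; HF attribute names are assumptions to verify.
from dataclasses import dataclass

@dataclass
class ExampleModelArgs:
    dim: int = 2048
    n_layers: int = 16
    n_heads: int = 16
    n_kv_heads: int = 8
    vocab_size: int = 32000
    norm_eps: float = 1e-6
    rope_theta: float = 10000.0
    attention_bias: bool = False  # model-specific field

    @classmethod
    def from_hf_config(cls, hf_config, is_critic: bool = False, **kwargs) -> "ExampleModelArgs":
        return cls(
            dim=hf_config.hidden_size,
            n_layers=hf_config.num_hidden_layers,
            n_heads=hf_config.num_attention_heads,
            # optional fields: use getattr with a sensible fallback
            n_kv_heads=getattr(hf_config, "num_key_value_heads", hf_config.num_attention_heads),
            vocab_size=hf_config.vocab_size,
            norm_eps=getattr(hf_config, "rms_norm_eps", 1e-6),
            rope_theta=getattr(hf_config, "rope_theta", 10000.0),
            attention_bias=getattr(hf_config, "attention_bias", False),
        )
```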

Step 4: Implement model.py

Adapt the model architecture to match the target model.

Key components to adapt:

  1. Normalization (RMSNorm or similar):

    • Check if elementwise_affine is configurable
    • Check the epsilon default value
    • If the model uses LayerNorm, implement accordingly
  2. Attention module:

    • Q/K/V projection: Check bias presence (nn.Linear(..., bias=True/False))
    • QK norm: Add q_norm/k_norm if the model has them, remove if it doesn't
    • GQA: n_kv_heads < n_heads for grouped-query attention
    • Ulysses SP: Keep the set_cp_group / _sp_enabled pattern from the reference
    • Output projection: Check bias presence
  3. FeedForward module:

    • SwiGLU: w2(silu(w1(x)) * w3(x)) -- most common for modern LLMs
    • Check bias in linear layers
    • For MoE models: MoE module replaces FeedForward on designated layers
  4. TransformerBlock: Pre-norm (most modern LLMs) vs post-norm

    • MoE layer detection via _is_moe_layer() if applicable
  5. Top-level Model (<Model>Model(BaseArchonModel)):

    • tok_embeddings, layers (as ModuleDict), norm, output/score
    • init_weights(): Match initialization scheme from HF
    • init_buffers(): RoPE cache + MoE buffers
    • forward(): Must follow BaseArchonModel signature: (tokens, positions, cu_seqlens, max_seqlen) -> Tensor

Base class contract (BaseArchonModel):

class <Model>Model(BaseArchonModel):
    def forward(self, tokens, positions, cu_seqlens, max_seqlen) -> torch.Tensor: ...
    def init_weights(self) -> None: ...
    def init_buffers(self, buffer_device) -> None: ...
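The SwiGLU dataflow from point 3 is worth pinning down before touching model.py. A numpy illustration of just the math (the real module uses torch `nn.Linear` layers, typically without bias):

```python
# Numpy illustration of the SwiGLU feed-forward from point 3:
#   out = w2(silu(w1(x)) * w3(x))
# This shows only the dataflow, not the real torch module.
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x: np.ndarray, w1: np.ndarray, w2: np.ndarray, w3: np.ndarray) -> np.ndarray:
    """x: (seq, dim); w1, w3: (dim, hidden_dim); w2: (hidden_dim, dim)."""
    return (silu(x @ w1) * (x @ w3)) @ w2
```

Mapping to the HF names in step 6: w1 is gate_proj, w3 is up_proj, w2 is down_proj.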

Step 5: Implement rope.py

Handle the rotary position embedding variant.

Options:

  1. Standard RoPE (same as qwen2/qwen3): Re-export from qwen2:

    from areal.experimental.models.archon.qwen2.model.rope import (
        apply_rotary_emb,
        precompute_rope_cache,
        repeat_kv,
        reshape_for_broadcast,
        rotate_half,
    )
    
  2. Custom RoPE (YaRN, NTK-aware, etc.): Implement custom precompute_rope_cache() and apply_rotary_emb() functions. The key difference is usually in how inv_freq is computed (scaling factors, interpolation, etc.).
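The standard inv_freq formula that those variants modify can be sketched in numpy. YaRN and NTK-aware schemes change exactly this computation (scaling theta or interpolating frequencies):

```python
# Sketch of the standard RoPE frequency computation that custom variants
# (YaRN, NTK-aware scaling) modify. Shown in numpy for clarity only.
import numpy as np

def precompute_inv_freq(head_dim: int, theta: float = 10000.0) -> np.ndarray:
    """inv_freq[i] = theta ** (-2i / head_dim), one entry per rotary pair."""
    return theta ** (-np.arange(0, head_dim, 2, dtype=np.float64) / head_dim)

def precompute_rope_cache(head_dim: int, max_seqlen: int, theta: float = 10000.0) -> np.ndarray:
    """Angle table of shape (max_seqlen, head_dim // 2): position * frequency."""
    inv_freq = precompute_inv_freq(head_dim, theta)
    positions = np.arange(max_seqlen, dtype=np.float64)
    return np.outer(positions, inv_freq)
```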

Step 6: Implement state_dict_adapter.py

Map between HuggingFace and Archon weight key names.

This is the most error-prone step. The adapter must correctly handle:

  1. Key name mapping (from_hf_map dict):

    • Embedding: model.embed_tokens.weight -> tok_embeddings.weight
    • Attention: model.layers.{}.self_attn.q_proj.weight -> layers.{}.attention.wq.weight
    • FFN: model.layers.{}.mlp.gate_proj.weight -> layers.{}.feed_forward.w1.weight
    • Norms: model.layers.{}.input_layernorm.weight -> layers.{}.attention_norm.weight
    • Output: lm_head.weight -> output.weight
    • Skip keys (set to None): rotary_emb.inv_freq (computed at runtime)
    • Model-specific keys: bias terms, QK norm weights, etc.
  2. Reverse mapping (to_hf_map): Auto-generated from from_hf_map

  3. MoE expert weights (if applicable): 3D<->2D conversion for expert weights. Copy the MoE handling from qwen3 if the model has MoE.

  4. Weight tying: Skip output.weight during to_hf() if tie_word_embeddings=True
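The mapping in point 1 can be sketched as a dict keyed by `{}`-templated names plus a small lookup helper. The HF-side names are standard Llama-family keys and the helper is a hypothetical illustration; verify both sides against the real checkpoints:

```python
# Minimal sketch of the from_hf_map convention from point 1, with "{}" as the
# layer-index placeholder. HF-side names are assumptions to verify.
import re

from_hf_map = {
    "model.embed_tokens.weight": "tok_embeddings.weight",
    "model.layers.{}.self_attn.q_proj.weight": "layers.{}.attention.wq.weight",
    "model.layers.{}.mlp.gate_proj.weight": "layers.{}.feed_forward.w1.weight",
    "model.layers.{}.input_layernorm.weight": "layers.{}.attention_norm.weight",
    "lm_head.weight": "output.weight",
    "model.layers.{}.self_attn.rotary_emb.inv_freq": None,  # computed at runtime, skip
}

def translate_key(hf_key: str) -> "str | None":
    """Map one HF key to its Archon name (None means: drop this key)."""
    # Swap a numeric layer index for "{}" to find the template, then
    # substitute the index back into the Archon-side template.
    m = re.search(r"\.(\d+)\.", hf_key)
    template = re.sub(r"\.\d+\.", ".{}.", hf_key, count=1) if m else hf_key
    target = from_hf_map[template]
    return target.format(m.group(1)) if (target and m) else target
```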

Verification approach: After implementation, the adapter should satisfy:

# Roundtrip: archon -> hf -> archon preserves all keys
hf_sd = adapter.to_hf(archon_sd)
roundtrip_sd = adapter.from_hf(hf_sd)
assert set(roundtrip_sd.keys()) == set(archon_sd.keys())

Base class contract (BaseStateDictAdapter):

class <Model>StateDictAdapter(BaseStateDictAdapter):
    def from_hf(self, hf_state_dict) -> dict[str, Any]: ...
    def to_hf(self, archon_state_dict) -> dict[str, Any]: ...
    def convert_single_to_hf(self, name, tensor) -> list[tuple[str, torch.Tensor]]: ...

Step 7: Implement parallelize.py

Define the parallelization strategy for the model.

The parallelize function applies parallelism in this order:

  1. TP (Tensor Parallelism) -- shard attention/FFN across devices
  2. EP (Expert Parallelism) -- for MoE models only
  3. CP (Context Parallelism / Ulysses SP) -- sequence parallelism
  4. AC (Activation Checkpointing) -- memory optimization
  5. torch.compile -- compilation optimization
  6. FSDP (Fully Sharded Data Parallelism) -- data parallelism

Key adaptations by model architecture:

  • Attention with QK norm: wq/wk use use_local_output=False (DTensor output for norm), add SequenceParallel(sequence_dim=2) for q_norm/k_norm
  • Attention without QK norm: wq/wk/wv all use use_local_output=True
  • Attention with bias: Bias terms follow the same parallel plan as their weights
  • MoE layers: Separate TP plan for MoE input/output, router gate, and expert weights. Copy from qwen3's apply_moe_ep_tp() and apply_non_moe_tp()
  • Dense-only models: Simpler plan without MoE handling. Copy from qwen2
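The QK-norm branching above can be illustrated as a per-module plan. Here plain strings stand in for the real torch DTensor styles (ColwiseParallel, RowwiseParallel, SequenceParallel); this is a structural sketch, not the actual parallelize code:

```python
# Illustrative per-module TP plan for a dense attention block, mirroring the
# bullets above. Strings stand in for the real torch DTensor parallel styles.
def attention_tp_plan(has_qk_norm: bool) -> dict:
    plan = {
        # with QK norm, wq/wk keep DTensor outputs so the norms can consume them
        "attention.wq": f"Colwise(use_local_output={not has_qk_norm})",
        "attention.wk": f"Colwise(use_local_output={not has_qk_norm})",
        "attention.wv": "Colwise(use_local_output=True)",
        "attention.wo": "Rowwise",
    }
    if has_qk_norm:
        plan["attention.q_norm"] = "SequenceParallel(sequence_dim=2)"
        plan["attention.k_norm"] = "SequenceParallel(sequence_dim=2)"
    return plan
```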

Function signature (must match ParallelizeFn protocol):

def parallelize_<model>(
    model: nn.Module,
    parallel_dims: ArchonParallelDims,
    param_dtype: torch.dtype = torch.bfloat16,
    reduce_dtype: torch.dtype = torch.float32,
    loss_parallel: bool = True,
    cpu_offload: bool = False,
    reshard_after_forward_policy: str = "default",
    ac_config: ActivationCheckpointConfig | None = None,
    enable_compile: bool = True,
) -> nn.Module:

Step 8: Create spec.py and Register

Assemble the ModelSpec and register it under the model's HF model_type string (the registry lookup key identified in Step 1).


Content truncated.

