add-archon-model


Guide for adding a new model to the Archon engine. Use when user wants to add support for a new HuggingFace model architecture in ArchonEngine.

Install

mkdir -p .claude/skills/add-archon-model && curl -L -o skill.zip "https://mcp.directory/api/skills/download/2678" && unzip -o skill.zip -d .claude/skills/add-archon-model && rm skill.zip

Installs to .claude/skills/add-archon-model

About this skill

Add Archon Model

Add support for a new HuggingFace model architecture in the Archon training engine.

When to Use

This skill is triggered when:

  • User asks "how do I add a model to Archon?"
  • User wants to support a new model family (e.g., Llama, Mistral, DeepSeek) in ArchonEngine
  • User mentions adding a new ModelSpec or model type for Archon

Prerequisites

Before starting, ensure:

  • The target model is available on HuggingFace (has config.json with model_type)
  • You know the HuggingFace model ID (e.g., meta-llama/Llama-3-8B)
  • The model uses a standard transformer architecture (decoder-only)

Step-by-Step Guide

Step 1: Analyze the Target Model Architecture

Read the HuggingFace model's source code to extract key architecture information.

Action: Fetch and analyze the model's HuggingFace configuration and modeling files.

  1. Read the model's config.json (via AutoConfig.from_pretrained) to identify:

    • model_type string (this is the key used for registry lookup)
    • All architecture hyperparameters (hidden_size, num_layers, etc.)
    • Any model-specific fields (e.g., qk_norm, attention_bias, MoE fields)
  2. Read the HuggingFace modeling_*.py source to identify:

    • Attention variant: Does it have Q/K norm? Attention bias? Sliding window? Multi-latent attention?
    • FFN variant: SwiGLU (gate_proj + up_proj + down_proj)? GeGLU? Standard MLP?
    • MoE support: Does it have MoE layers? What router type? Shared experts?
    • RoPE variant: Standard RoPE? YaRN? NTK-aware scaling? What is the inv_freq formula?
    • Normalization: RMSNorm or LayerNorm? Pre-norm or post-norm? Elementwise affine?
    • Weight tying: Does tie_word_embeddings appear in config?
    • State dict key names: What are the HF weight key naming conventions?
  3. Summarize findings in a checklist like:

Target model: <name>
HF model_type: "<model_type>" (and variants like "<model_type>_moe" if applicable)
Attention: [standard GQA / with QK norm / with bias / sliding window / ...]
FFN: [SwiGLU / GeGLU / standard MLP / ...]
MoE: [no / yes - num_experts, top_k, shared_experts]
RoPE: [standard / YaRN / NTK-aware / ...]
Norm: [RMSNorm / LayerNorm] with [pre-norm / post-norm]
Weight tying: [yes / no]
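The config-reading part of step 1 can be sketched as a small helper that pulls the checklist fields out of a local config.json. The HF field names below (hidden_size, num_hidden_layers, etc.) are common Llama-family conventions and are assumptions to verify against your target model's actual config:

```python
# Sketch of step 1: extract the checklist fields from a HF config.json.
# The field names are common HF conventions; verify them against the
# actual config of your target model.
import json

def summarize_hf_config(config_path: str) -> dict:
    """Return the step-1 checklist fields from a config.json on disk."""
    with open(config_path) as f:
        cfg = json.load(f)
    return {
        "model_type": cfg["model_type"],  # registry lookup key
        "hidden_size": cfg.get("hidden_size"),
        "num_layers": cfg.get("num_hidden_layers"),
        "num_heads": cfg.get("num_attention_heads"),
        # GQA: fall back to MHA if num_key_value_heads is absent
        "num_kv_heads": cfg.get("num_key_value_heads", cfg.get("num_attention_heads")),
        "rope_theta": cfg.get("rope_theta", 10000.0),
        "tie_word_embeddings": cfg.get("tie_word_embeddings", False),
        # model-specific extras surface here, e.g. qk_norm / MoE fields
        "attention_bias": cfg.get("attention_bias"),
    }
```

For gated or remote models, `AutoConfig.from_pretrained` gives the same fields as attributes instead of dict keys.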

Step 2: Select the Reference Model

Choose the closest existing implementation as a starting point:

| Target characteristics | Reference | Why |
| --- | --- | --- |
| Dense-only, standard GQA, no QK norm | qwen2 | Simplest baseline, pure dense |
| Has QK norm, or has MoE support | qwen3 | Supports QK norm + MoE + shared experts |

Action: Copy the reference model directory as the starting point:

areal/experimental/models/archon/<model>/
  __init__.py
  spec.py
  model/
    args.py
    model.py
    rope.py
    state_dict_adapter.py
  infra/
    parallelize.py

Step 3: Implement args.py

Adapt <Model>ModelArgs to match the target model's HuggingFace config fields.

Key changes from reference:

  1. Update the @dataclass fields to match the target model's hyperparameters:

    • Field names should use Archon conventions (dim, n_layers, n_heads, n_kv_heads, vocab_size, head_dim, hidden_dim, norm_eps, rope_theta, etc.)
    • Default values should match the smallest variant of the target model
    • Add model-specific fields (e.g., attention_bias, qk_norm, sliding_window)
  2. Update from_hf_config() to correctly map HuggingFace config attributes:

    • Use getattr(hf_config, "field_name", default) for optional fields
    • Handle variant-specific fields (e.g., MoE fields only present in MoE variants)
    • The method must return an instance of the model args class

Critical: Verify every field mapping against the HF model's config.json. Incorrect mappings here cause silent errors downstream.

Base class contract (BaseModelArgs):

@dataclass
class <Model>ModelArgs(BaseModelArgs):
    # ... model-specific fields ...

    @classmethod
    def from_hf_config(
        cls,
        hf_config: PretrainedConfig,
        is_critic: bool = False,
        **kwargs,
    ) -> <Model>ModelArgs:
        # Map HF config fields to Archon model args
        ...
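A minimal sketch of what that contract looks like filled in, using a hypothetical `ExampleModelArgs`. The Archon-side field names follow the conventions listed above; the HF attribute names are the usual Llama-family ones and must be verified against the real config:

```python
# Hypothetical from_hf_config sketch. Field names on the Archon side follow
# the conventions in this guide; HF attribute names are assumptions to verify.
from dataclasses import dataclass

@dataclass
class ExampleModelArgs:
    dim: int = 2048
    n_layers: int = 16
    n_heads: int = 16
    n_kv_heads: int = 8
    vocab_size: int = 32000
    norm_eps: float = 1e-6
    rope_theta: float = 10000.0
    attention_bias: bool = False  # model-specific field

    @classmethod
    def from_hf_config(cls, hf_config, is_critic: bool = False, **kwargs) -> "ExampleModelArgs":
        return cls(
            dim=hf_config.hidden_size,
            n_layers=hf_config.num_hidden_layers,
            n_heads=hf_config.num_attention_heads,
            # optional fields: use getattr with a sensible fallback
            n_kv_heads=getattr(hf_config, "num_key_value_heads", hf_config.num_attention_heads),
            vocab_size=hf_config.vocab_size,
            norm_eps=getattr(hf_config, "rms_norm_eps", 1e-6),
            rope_theta=getattr(hf_config, "rope_theta", 10000.0),
            attention_bias=getattr(hf_config, "attention_bias", False),
        )
```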

Step 4: Implement model.py

Adapt the model architecture to match the target model.

Key components to adapt:

  1. Normalization (RMSNorm or similar):

    • Check if elementwise_affine is configurable
    • Check the epsilon default value
    • If the model uses LayerNorm, implement accordingly
  2. Attention module:

    • Q/K/V projection: Check bias presence (nn.Linear(..., bias=True/False))
    • QK norm: Add q_norm/k_norm if the model has them, remove if it doesn't
    • GQA: n_kv_heads < n_heads for grouped-query attention
    • Ulysses SP: Keep the set_cp_group / _sp_enabled pattern from the reference
    • Output projection: Check bias presence
  3. FeedForward module:

    • SwiGLU: w2(silu(w1(x)) * w3(x)) -- most common for modern LLMs
    • Check bias in linear layers
    • For MoE models: MoE module replaces FeedForward on designated layers
  4. TransformerBlock: Pre-norm (most modern LLMs) vs post-norm

    • MoE layer detection via _is_moe_layer() if applicable
  5. Top-level Model (<Model>Model(BaseArchonModel)):

    • tok_embeddings, layers (as ModuleDict), norm, output/score
    • init_weights(): Match initialization scheme from HF
    • init_buffers(): RoPE cache + MoE buffers
    • forward(): Must follow BaseArchonModel signature: (tokens, positions, cu_seqlens, max_seqlen) -> Tensor

Base class contract (BaseArchonModel):

class <Model>Model(BaseArchonModel):
    def forward(self, tokens, positions, cu_seqlens, max_seqlen) -> torch.Tensor: ...
    def init_weights(self) -> None: ...
    def init_buffers(self, buffer_device) -> None: ...
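The SwiGLU dataflow from point 3 is worth pinning down before touching model.py. A numpy illustration of just the math (the real module uses torch `nn.Linear` layers, typically without bias):

```python
# Numpy illustration of the SwiGLU feed-forward from point 3:
#   out = w2(silu(w1(x)) * w3(x))
# This shows only the dataflow, not the real torch module.
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x: np.ndarray, w1: np.ndarray, w2: np.ndarray, w3: np.ndarray) -> np.ndarray:
    """x: (seq, dim); w1, w3: (dim, hidden_dim); w2: (hidden_dim, dim)."""
    return (silu(x @ w1) * (x @ w3)) @ w2
```

Mapping to the HF names in step 6: w1 is gate_proj, w3 is up_proj, w2 is down_proj.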

Step 5: Implement rope.py

Handle the rotary position embedding variant.

Options:

  1. Standard RoPE (same as qwen2/qwen3): Re-export from qwen2:

    from areal.experimental.models.archon.qwen2.model.rope import (
        apply_rotary_emb,
        precompute_rope_cache,
        repeat_kv,
        reshape_for_broadcast,
        rotate_half,
    )
    
  2. Custom RoPE (YaRN, NTK-aware, etc.): Implement custom precompute_rope_cache() and apply_rotary_emb() functions. The key difference is usually in how inv_freq is computed (scaling factors, interpolation, etc.).
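The standard inv_freq formula that those variants modify can be sketched in numpy. YaRN and NTK-aware schemes change exactly this computation (scaling theta or interpolating frequencies):

```python
# Sketch of the standard RoPE frequency computation that custom variants
# (YaRN, NTK-aware scaling) modify. Shown in numpy for clarity only.
import numpy as np

def precompute_inv_freq(head_dim: int, theta: float = 10000.0) -> np.ndarray:
    """inv_freq[i] = theta ** (-2i / head_dim), one entry per rotary pair."""
    return theta ** (-np.arange(0, head_dim, 2, dtype=np.float64) / head_dim)

def precompute_rope_cache(head_dim: int, max_seqlen: int, theta: float = 10000.0) -> np.ndarray:
    """Angle table of shape (max_seqlen, head_dim // 2): position * frequency."""
    inv_freq = precompute_inv_freq(head_dim, theta)
    positions = np.arange(max_seqlen, dtype=np.float64)
    return np.outer(positions, inv_freq)
```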

Step 6: Implement state_dict_adapter.py

Map between HuggingFace and Archon weight key names.

This is the most error-prone step. The adapter must correctly handle:

  1. Key name mapping (from_hf_map dict):

    • Embedding: model.embed_tokens.weight -> tok_embeddings.weight
    • Attention: model.layers.{}.self_attn.q_proj.weight -> layers.{}.attention.wq.weight
    • FFN: model.layers.{}.mlp.gate_proj.weight -> layers.{}.feed_forward.w1.weight
    • Norms: model.layers.{}.input_layernorm.weight -> layers.{}.attention_norm.weight
    • Output: lm_head.weight -> output.weight
    • Skip keys (set to None): rotary_emb.inv_freq (computed at runtime)
    • Model-specific keys: bias terms, QK norm weights, etc.
  2. Reverse mapping (to_hf_map): Auto-generated from from_hf_map

  3. MoE expert weights (if applicable): 3D<->2D conversion for expert weights. Copy the MoE handling from qwen3 if the model has MoE.

  4. Weight tying: Skip output.weight during to_hf() if tie_word_embeddings=True
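The mapping in point 1 can be sketched as a dict keyed by `{}`-templated names plus a small lookup helper. The HF-side names are standard Llama-family keys and the helper is a hypothetical illustration; verify both sides against the real checkpoints:

```python
# Minimal sketch of the from_hf_map convention from point 1, with "{}" as the
# layer-index placeholder. HF-side names are assumptions to verify.
import re

from_hf_map = {
    "model.embed_tokens.weight": "tok_embeddings.weight",
    "model.layers.{}.self_attn.q_proj.weight": "layers.{}.attention.wq.weight",
    "model.layers.{}.mlp.gate_proj.weight": "layers.{}.feed_forward.w1.weight",
    "model.layers.{}.input_layernorm.weight": "layers.{}.attention_norm.weight",
    "lm_head.weight": "output.weight",
    "model.layers.{}.self_attn.rotary_emb.inv_freq": None,  # computed at runtime, skip
}

def translate_key(hf_key: str) -> "str | None":
    """Map one HF key to its Archon name (None means: drop this key)."""
    # Swap a numeric layer index for "{}" to find the template, then
    # substitute the index back into the Archon-side template.
    m = re.search(r"\.(\d+)\.", hf_key)
    template = re.sub(r"\.\d+\.", ".{}.", hf_key, count=1) if m else hf_key
    target = from_hf_map[template]
    return target.format(m.group(1)) if (target and m) else target
```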

Verification approach: After implementation, the adapter should satisfy:

# Roundtrip: archon -> hf -> archon preserves all keys
hf_sd = adapter.to_hf(archon_sd)
roundtrip_sd = adapter.from_hf(hf_sd)
assert set(roundtrip_sd.keys()) == set(archon_sd.keys())

Base class contract (BaseStateDictAdapter):

class <Model>StateDictAdapter(BaseStateDictAdapter):
    def from_hf(self, hf_state_dict) -> dict[str, Any]: ...
    def to_hf(self, archon_state_dict) -> dict[str, Any]: ...
    def convert_single_to_hf(self, name, tensor) -> list[tuple[str, torch.Tensor]]: ...

Step 7: Implement parallelize.py

Define the parallelization strategy for the model.

The parallelize function applies parallelism in this order:

  1. TP (Tensor Parallelism) -- shard attention/FFN across devices
  2. EP (Expert Parallelism) -- for MoE models only
  3. CP (Context Parallelism / Ulysses SP) -- sequence parallelism
  4. AC (Activation Checkpointing) -- memory optimization
  5. torch.compile -- compilation optimization
  6. FSDP (Fully Sharded Data Parallelism) -- data parallelism

Key adaptations by model architecture:

  • Attention with QK norm: wq/wk use use_local_output=False (DTensor output for norm), add SequenceParallel(sequence_dim=2) for q_norm/k_norm
  • Attention without QK norm: wq/wk/wv all use use_local_output=True
  • Attention with bias: Bias terms follow the same parallel plan as their weights
  • MoE layers: Separate TP plan for MoE input/output, router gate, and expert weights. Copy from qwen3's apply_moe_ep_tp() and apply_non_moe_tp()
  • Dense-only models: Simpler plan without MoE handling. Copy from qwen2
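The QK-norm branching above can be illustrated as a per-module plan. Here plain strings stand in for the real torch DTensor styles (ColwiseParallel, RowwiseParallel, SequenceParallel); this is a structural sketch, not the actual parallelize code:

```python
# Illustrative per-module TP plan for a dense attention block, mirroring the
# bullets above. Strings stand in for the real torch DTensor parallel styles.
def attention_tp_plan(has_qk_norm: bool) -> dict:
    plan = {
        # with QK norm, wq/wk keep DTensor outputs so the norms can consume them
        "attention.wq": f"Colwise(use_local_output={not has_qk_norm})",
        "attention.wk": f"Colwise(use_local_output={not has_qk_norm})",
        "attention.wv": "Colwise(use_local_output=True)",
        "attention.wo": "Rowwise",
    }
    if has_qk_norm:
        plan["attention.q_norm"] = "SequenceParallel(sequence_dim=2)"
        plan["attention.k_norm"] = "SequenceParallel(sequence_dim=2)"
    return plan
```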

Function signature (must match ParallelizeFn protocol):

def parallelize_<model>(
    model: nn.Module,
    parallel_dims: ArchonParallelDims,
    param_dtype: torch.dtype = torch.bfloat16,
    reduce_dtype: torch.dtype = torch.float32,
    loss_parallel: bool = True,
    cpu_offload: bool = False,
    reshard_after_forward_policy: str = "default",
    ac_config: ActivationCheckpointConfig | None = None,
    enable_compile: bool = True,
) -> nn.Module:

Step 8: Create spec.py and Register

Assemble the ModelSpec and register it under the model's HF model_type string (the registry lookup key identified in Step 1).


Content truncated.

