add-archon-model
Guide for adding a new model to the Archon engine. Use when user wants to add support for a new HuggingFace model architecture in ArchonEngine.
Install
mkdir -p .claude/skills/add-archon-model && curl -L -o skill.zip "https://mcp.directory/api/skills/download/2678" && unzip -o skill.zip -d .claude/skills/add-archon-model && rm skill.zip
Installs to .claude/skills/add-archon-model
About this skill
Add Archon Model
Add support for a new HuggingFace model architecture in the Archon training engine.
When to Use
This skill is triggered when:
- User asks "how do I add a model to Archon?"
- User wants to support a new model family (e.g., Llama, Mistral, DeepSeek) in ArchonEngine
- User mentions adding a new `ModelSpec` or model type for Archon
Prerequisites
Before starting, ensure:
- The target model is available on HuggingFace (has `config.json` with `model_type`)
- You know the HuggingFace model ID (e.g., `meta-llama/Llama-3-8B`)
- The model uses a standard transformer architecture (decoder-only)
Step-by-Step Guide
Step 1: Analyze the Target Model Architecture
Read the HuggingFace model's source code to extract key architecture information.
Action: Fetch and analyze the model's HuggingFace configuration and modeling files.
- Read the model's `config.json` (via `AutoConfig.from_pretrained`) to identify:
  - The `model_type` string (this is the key used for registry lookup)
  - All architecture hyperparameters (hidden_size, num_layers, etc.)
  - Any model-specific fields (e.g., `qk_norm`, `attention_bias`, MoE fields)
- Read the HuggingFace `modeling_*.py` source to identify:
  - Attention variant: Does it have Q/K norm? Attention bias? Sliding window? Multi-latent attention?
  - FFN variant: SwiGLU (gate_proj + up_proj + down_proj)? GeGLU? Standard MLP?
  - MoE support: Does it have MoE layers? What router type? Shared experts?
  - RoPE variant: Standard RoPE? YaRN? NTK-aware scaling? What is the `inv_freq` formula?
  - Normalization: RMSNorm or LayerNorm? Pre-norm or post-norm? Elementwise affine?
  - Weight tying: Does `tie_word_embeddings` appear in the config?
  - State dict key names: What are the HF weight key naming conventions?
- Summarize findings in a checklist like:
Target model: <name>
HF model_type: "<model_type>" (and variants like "<model_type>_moe" if applicable)
Attention: [standard GQA / with QK norm / with bias / sliding window / ...]
FFN: [SwiGLU / GeGLU / standard MLP / ...]
MoE: [no / yes - num_experts, top_k, shared_experts]
RoPE: [standard / YaRN / NTK-aware / ...]
Norm: [RMSNorm / LayerNorm] with [pre-norm / post-norm]
Weight tying: [yes / no]
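The checklist above can be filled mechanically from the config file. A minimal stdlib-only sketch is shown below; the JSON content is a hypothetical example, and in practice you would load the real file via `AutoConfig.from_pretrained` or from a local snapshot of the repo:

```python
import json

# Hypothetical config.json content for illustration only; the real file
# comes from the HuggingFace model repo.
raw = """
{
  "model_type": "llama",
  "hidden_size": 4096,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "tie_word_embeddings": false
}
"""
cfg = json.loads(raw)

# model_type is the registry-lookup key; the rest feeds the checklist.
checklist = {
    "model_type": cfg["model_type"],
    "attention": "GQA"
    if cfg.get("num_key_value_heads", cfg["num_attention_heads"])
    < cfg["num_attention_heads"]
    else "MHA",
    "qk_norm": cfg.get("qk_norm", False),   # absent -> assume no QK norm
    "moe": "num_experts" in cfg,            # MoE fields only in MoE variants
    "weight_tying": cfg.get("tie_word_embeddings", False),
}
print(checklist)
```

Using `.get()` with a default mirrors how optional fields should be read: a missing key means the feature is absent, not an error.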
Step 2: Select the Reference Model
Choose the closest existing implementation as a starting point:
| Target characteristics | Reference | Why |
|---|---|---|
| Dense-only, standard GQA, no QK norm | qwen2 | Simplest baseline, pure dense |
| Has QK norm, or has MoE support | qwen3 | Supports QK norm + MoE + shared experts |
Action: Copy the reference model directory as the starting point:
areal/experimental/models/archon/<model>/
__init__.py
spec.py
model/
args.py
model.py
rope.py
state_dict_adapter.py
infra/
parallelize.py
Step 3: Implement args.py
Adapt <Model>ModelArgs to match the target model's HuggingFace config fields.
Key changes from reference:
- Update the `@dataclass` fields to match the target model's hyperparameters:
  - Field names should use Archon conventions (`dim`, `n_layers`, `n_heads`, `n_kv_heads`, `vocab_size`, `head_dim`, `hidden_dim`, `norm_eps`, `rope_theta`, etc.)
  - Default values should match the smallest variant of the target model
  - Add model-specific fields (e.g., `attention_bias`, `qk_norm`, `sliding_window`)
- Update `from_hf_config()` to correctly map HuggingFace config attributes:
  - Use `getattr(hf_config, "field_name", default)` for optional fields
  - Handle variant-specific fields (e.g., MoE fields only present in MoE variants)
  - The method must return an instance of the model args class
Critical: Verify every field mapping against the HF model's config.json; incorrect mappings here cause silent errors downstream.
Base class contract (BaseModelArgs):
@dataclass
class <Model>ModelArgs(BaseModelArgs):
# ... model-specific fields ...
@classmethod
def from_hf_config(
cls,
hf_config: PretrainedConfig,
is_critic: bool = False,
**kwargs,
) -> <Model>ModelArgs:
# Map HF config fields to Archon model args
...
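A concrete sketch of the mapping pattern, under stated assumptions: the field names follow the Archon conventions listed above, but the exact `BaseModelArgs` API is not reproduced here, so this standalone dataclass only illustrates the `from_hf_config()` shape. A `SimpleNamespace` stands in for `transformers.PretrainedConfig`:

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class LlamaModelArgs:  # illustrative; the real class subclasses BaseModelArgs
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: int = 8
    vocab_size: int = 128256
    norm_eps: float = 1e-5
    rope_theta: float = 500000.0
    attention_bias: bool = False  # model-specific field

    @classmethod
    def from_hf_config(cls, hf_config) -> "LlamaModelArgs":
        # getattr with a default keeps optional / variant-specific fields safe.
        return cls(
            dim=hf_config.hidden_size,
            n_layers=hf_config.num_hidden_layers,
            n_heads=hf_config.num_attention_heads,
            n_kv_heads=getattr(
                hf_config, "num_key_value_heads", hf_config.num_attention_heads
            ),
            vocab_size=hf_config.vocab_size,
            norm_eps=getattr(hf_config, "rms_norm_eps", 1e-6),
            rope_theta=getattr(hf_config, "rope_theta", 10000.0),
            attention_bias=getattr(hf_config, "attention_bias", False),
        )


# Stand-in for a PretrainedConfig instance.
hf_config = SimpleNamespace(
    hidden_size=2048, num_hidden_layers=16, num_attention_heads=16,
    num_key_value_heads=4, vocab_size=32000, rms_norm_eps=1e-5,
    rope_theta=10000.0,
)
args = LlamaModelArgs.from_hf_config(hf_config)
print(args.dim, args.n_kv_heads, args.attention_bias)
```

Note how `attention_bias` falls back to `False` when the HF config omits it — this is the pattern that makes the same `from_hf_config()` work across model variants.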
Step 4: Implement model.py
Adapt the model architecture to match the target model.
Key components to adapt:
- Normalization (`RMSNorm` or similar):
  - Check if `elementwise_affine` is configurable
  - Check the epsilon default value
  - If the model uses `LayerNorm`, implement accordingly
- Attention module:
  - Q/K/V projection: Check bias presence (`nn.Linear(..., bias=True/False)`)
  - QK norm: Add `q_norm`/`k_norm` if the model has them; remove them if it doesn't
  - GQA: `n_kv_heads < n_heads` for grouped-query attention
  - Ulysses SP: Keep the `set_cp_group`/`_sp_enabled` pattern from the reference
  - Output projection: Check bias presence
- FeedForward module:
  - SwiGLU: `w2(silu(w1(x)) * w3(x))` -- most common for modern LLMs
  - Check bias in linear layers
  - For MoE models: an `MoE` module replaces `FeedForward` on designated layers
- TransformerBlock: Pre-norm (most modern LLMs) vs post-norm
  - MoE layer detection via `_is_moe_layer()` if applicable
- Top-level Model (`<Model>Model(BaseArchonModel)`):
  - `tok_embeddings`, `layers` (as `ModuleDict`), `norm`, `output`/`score`
  - `init_weights()`: Match initialization scheme from HF
  - `init_buffers()`: RoPE cache + MoE buffers
  - `forward()`: Must follow the `BaseArchonModel` signature: `(tokens, positions, cu_seqlens, max_seqlen) -> Tensor`
Base class contract (BaseArchonModel):
class <Model>Model(BaseArchonModel):
def forward(self, tokens, positions, cu_seqlens, max_seqlen) -> torch.Tensor: ...
def init_weights(self) -> None: ...
def init_buffers(self, buffer_device) -> None: ...
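The `(tokens, positions, cu_seqlens, max_seqlen)` signature implies packed variable-length sequences. Assuming Archon follows the common flash-attention "varlen" convention (an assumption; verify against the reference models), the inputs relate like this pure-Python sketch:

```python
def pack_sequences(sequences: list[list[int]]):
    """Flatten variable-length sequences into one packed token stream."""
    tokens, positions, cu_seqlens = [], [], [0]
    for seq in sequences:
        tokens.extend(seq)
        positions.extend(range(len(seq)))  # per-sequence positions for RoPE
        cu_seqlens.append(cu_seqlens[-1] + len(seq))  # cumulative boundaries
    max_seqlen = max(len(s) for s in sequences)
    return tokens, positions, cu_seqlens, max_seqlen


tokens, positions, cu_seqlens, max_seqlen = pack_sequences([[5, 6, 7], [8, 9]])
print(tokens)      # [5, 6, 7, 8, 9]
print(positions)   # [0, 1, 2, 0, 1]
print(cu_seqlens)  # [0, 3, 5]
print(max_seqlen)  # 3
```

Each sequence restarts its positions at 0, and `cu_seqlens[i]:cu_seqlens[i+1]` slices out sequence `i` from the packed stream.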
Step 5: Implement rope.py
Handle the rotary position embedding variant.
Options:
- Standard RoPE (same as qwen2/qwen3): Re-export from qwen2:
  from areal.experimental.models.archon.qwen2.model.rope import (
      apply_rotary_emb,
      precompute_rope_cache,
      repeat_kv,
      reshape_for_broadcast,
      rotate_half,
  )
- Custom RoPE (YaRN, NTK-aware, etc.): Implement custom `precompute_rope_cache()` and `apply_rotary_emb()` functions. The key difference is usually in how `inv_freq` is computed (scaling factors, interpolation, etc.).
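For orientation, the standard `inv_freq` formula and one simple scaled variant can be sketched in pure Python (the linear-scaling function is illustrative only; YaRN and NTK-aware formulas differ and must be taken from the target model's HF source):

```python
def inv_freq(head_dim: int, theta: float = 10000.0) -> list[float]:
    """Standard RoPE inverse frequencies: theta^(-2i/d) for i in 0..d/2-1."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def inv_freq_linear_scaled(head_dim: int, theta: float = 10000.0,
                           factor: float = 4.0) -> list[float]:
    """Simple position-interpolation variant: divide all frequencies."""
    return [f / factor for f in inv_freq(head_dim, theta)]


freqs = inv_freq(8, theta=10000.0)
print(len(freqs))  # 4 (one frequency per rotated pair)
print(freqs[0])    # 1.0 (i = 0 -> theta^0)
```

Whatever the variant, `precompute_rope_cache()` bakes these frequencies into a cos/sin cache at `init_buffers()` time, which is why `rotary_emb.inv_freq` can be skipped in the state dict adapter.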
Step 6: Implement state_dict_adapter.py
Map between HuggingFace and Archon weight key names.
This is the most error-prone step. The adapter must correctly handle:
- Key name mapping (`from_hf_map` dict):
  - Embedding: `model.embed_tokens.weight` -> `tok_embeddings.weight`
  - Attention: `model.layers.{}.self_attn.q_proj.weight` -> `layers.{}.attention.wq.weight`
  - FFN: `model.layers.{}.mlp.gate_proj.weight` -> `layers.{}.feed_forward.w1.weight`
  - Norms: `model.layers.{}.input_layernorm.weight` -> `layers.{}.attention_norm.weight`
  - Output: `lm_head.weight` -> `output.weight`
  - Skip keys (set to `None`): `rotary_emb.inv_freq` (computed at runtime)
  - Model-specific keys: bias terms, QK norm weights, etc.
- Reverse mapping (`to_hf_map`): Auto-generated from `from_hf_map`
- MoE expert weights (if applicable): 3D <-> 2D conversion for expert weights. Copy the MoE handling from qwen3 if the model has MoE.
- Weight tying: Skip `output.weight` during `to_hf()` if `tie_word_embeddings=True`
Verification approach: After implementation, the adapter should satisfy:
# Roundtrip: archon -> hf -> archon preserves all keys
hf_sd = adapter.to_hf(archon_sd)
roundtrip_sd = adapter.from_hf(hf_sd)
assert set(roundtrip_sd.keys()) == set(archon_sd.keys())
Base class contract (BaseStateDictAdapter):
class <Model>StateDictAdapter(BaseStateDictAdapter):
def from_hf(self, hf_state_dict) -> dict[str, Any]: ...
def to_hf(self, archon_state_dict) -> dict[str, Any]: ...
def convert_single_to_hf(self, name, tensor) -> list[tuple[str, torch.Tensor]]: ...
Step 7: Implement parallelize.py
Define the parallelization strategy for the model.
The parallelize function applies parallelism in this order:
- TP (Tensor Parallelism) -- shard attention/FFN across devices
- EP (Expert Parallelism) -- for MoE models only
- CP (Context Parallelism / Ulysses SP) -- sequence parallelism
- AC (Activation Checkpointing) -- memory optimization
- torch.compile -- compilation optimization
- FSDP (Fully Sharded Data Parallelism) -- data parallelism
Key adaptations by model architecture:
- Attention with QK norm: wq/wk use `use_local_output=False` (DTensor output for norm); add `SequenceParallel(sequence_dim=2)` for q_norm/k_norm
- Attention without QK norm: wq/wk/wv all use `use_local_output=True`
- Attention with bias: Bias terms follow the same parallel plan as their weights
- MoE layers: Separate TP plans for MoE input/output, router gate, and expert weights. Copy from qwen3's `apply_moe_ep_tp()` and `apply_non_moe_tp()`
- Dense-only models: Simpler plan without MoE handling. Copy from qwen2
Function signature (must match ParallelizeFn protocol):
def parallelize_<model>(
model: nn.Module,
parallel_dims: ArchonParallelDims,
param_dtype: torch.dtype = torch.bfloat16,
reduce_dtype: torch.dtype = torch.float32,
loss_parallel: bool = True,
cpu_offload: bool = False,
reshard_after_forward_policy: str = "default",
ac_config: ActivationCheckpointConfig | None = None,
enable_compile: bool = True,
) -> nn.Module:
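The ordering of the six stages matters (e.g., FSDP wraps last). A pure-Python sketch of that control flow is shown below; the real function wires torch DTensor/FSDP APIs, which are out of scope here, so every name is illustrative:

```python
def parallelize_sketch(model, *, tp: int, ep: int, cp: int,
                       use_ac: bool, use_compile: bool, dp_shard: int):
    """Record which parallelism stages would apply, in order."""
    applied = []
    if tp > 1:
        applied.append("TP")        # shard attention/FFN across devices
    if ep > 1:
        applied.append("EP")        # MoE models only
    if cp > 1:
        applied.append("CP")        # Ulysses sequence parallelism
    if use_ac:
        applied.append("AC")        # activation checkpointing
    if use_compile:
        applied.append("compile")   # torch.compile
    if dp_shard > 1:
        applied.append("FSDP")      # data parallelism wraps everything last
    return applied


stages = parallelize_sketch("model", tp=2, ep=1, cp=2,
                            use_ac=True, use_compile=True, dp_shard=4)
print(stages)  # ['TP', 'CP', 'AC', 'compile', 'FSDP']
```

A dense model (ep=1) simply skips the EP stage, which is why the qwen2 reference has the simpler plan.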
Step 8: Create spec.py and Register
Assemble the ModelSpec and register
Content truncated.