terway-troubleshooting
Troubleshoot Terway CNI issues in Kubernetes using Kubernetes events and Terway logs. Use when diagnosing "cni plugin not initialized", Pod create/delete failures, or ENI/IPAM problems in Terway (centralized or non-centralized IPAM).
Install
mkdir -p .claude/skills/terway-troubleshooting && curl -L -o skill.zip "https://mcp.directory/api/skills/download/4962" && unzip -o skill.zip -d .claude/skills/terway-troubleshooting && rm skill.zip
Installs to .claude/skills/terway-troubleshooting
About this skill
Terway Troubleshooting SOP
When to use this Skill
Use this Skill whenever the user:
- Reports "cni plugin not initialized" or similar CNI errors on nodes
- Reports Pod creation or deletion failures in a cluster using Terway as the CNI
- Suspects ENI/IPAM/resource issues related to Terway (centralized or non-centralized)
- Needs to interpret specific Kubernetes events or critical log messages from Terway components.
Always assume the cluster is running Kubernetes and Terway is the CNI plugin.
High-level troubleshooting flow
Follow this order to diagnose Terway issues efficiently:
- Verify Terway Component Health
  - When CNI errors like "plugin not initialized" or "socket not found" occur, first check whether the Terway Pods are Running: kubectl get pods -n kube-system -l app=terway-eniip -o wide
  - If a Pod is not Running, check its events and logs (terway-init and terway containers) using the patterns in Step 1.
- Gather Necessary Context (As needed)
  - If the cause isn't obvious from Pod status, gather cluster and node configuration.
  - You can use the following scripts or run kubectl commands directly:
    - Cluster config: ./scripts/inspect-terway-cluster.sh
    - Node config: ./scripts/inspect-terway-node.sh <node-name>
    - Pod config: ./scripts/inspect-terway-pod.sh <namespace> <pod-name>
- Use Kubernetes Events as the primary signal
  - For any problematic Pod, inspect its Events first: kubectl describe pod <pod> -n <ns>
  - Map Terway-specific event reasons (e.g., AllocIPFailed, CniPodCreateError) to likely causes.
- Inspect Terway IPAM / ENI controllers
  - Depending on IPAM type (crd vs default), check relevant CRDs (PodENI, Node) and their Events.
- Deep Dive into Logs
  - Use logs only when Events are missing or point to an ambiguous failure.
Keep answers structured: first restate what has been checked, then propose next verification steps.
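The first check in the flow above can be sketched as a small shell filter. This is a minimal sketch, not part of the Skill's own scripts: the function names non_running_pods and check_terway_pods are hypothetical, and it assumes kubectl's default table layout, where NAME is column 1 and STATUS is column 3.

```shell
# Sketch: list Terway Pods that are not in the Running state.
# Reads "kubectl get pods --no-headers" output on stdin and prints
# "<pod name><TAB><status>" for every Pod whose STATUS is not Running.
non_running_pods() {
  awk '$3 != "Running" { print $1 "\t" $3 }'
}

# Hypothetical wrapper: the namespace and label selector are the ones
# used in the text above. Not invoked here, because it needs a cluster.
check_terway_pods() {
  kubectl get pods -n kube-system -l app=terway-eniip --no-headers -o wide | non_running_pods
}
```

An empty result from check_terway_pods suggests every Terway Pod reports Running. Any line it prints names a Pod whose events and logs should be checked next.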
Step 1 – Terway and CNI initialization
- If the user reports "cni plugin not initialized" or "dial unix ... eni.socket: no such file or directory":
  - Phenomenon: The CNI plugin cannot communicate with the Terway Daemon.
  - Immediate Action: Check the Terway Pod status: kubectl get pods -n kube-system -l app=terway-eniip -o wide
- Diagnose by Pod Status and Log Patterns:
  - If the Pod is in Init:Error: Inspect terway-init logs (kubectl logs <pod> -n kube-system -c terway-init).
    - "exclusive eni mode changed":
      - Explanation: The label k8s.aliyun.com/exclusive-mode-eni-type was modified on an existing node. Exclusive mode only works for newly created nodes.
      - Fix: Revert the label or recreate the node.
    - "get node ... error":
      - Explanation: Terway cannot reach the Kubernetes API to fetch node labels (network or RBAC issue).
    - "unsupport kernel version, require >=5.10":
      - Explanation: The node's kernel is too old for the configured features.
    - "failed process input":
      - Explanation: /etc/eni/eni_conf (from the eni-config ConfigMap) is missing or contains invalid JSON.
    - "mount failed":
      - Explanation: Failed to mount bpffs on /sys/fs/bpf (privilege or kernel support issue).
    - "Init erdma driver" failure:
      - Explanation: modprobe erdma failed on an ERDMA-enabled node.
  - If the Pod is in CrashLoopBackOff or Error: Inspect main container logs.
    - Daemon (terway) Patterns:
      - "error restart device plugin after kubelet restart": Check permissions/mounts for /var/lib/kubelet/device-plugins.
      - "unable to set feature gates": Invalid flag in --feature-gates.
      - "error create trunk eni": OpenAPI failure during trunk ENI initialization (check Aliyun credentials/quota).
    - Controlplane (terway-controlplane) Patterns:
      - "failed to create controller": Check RBAC permissions or CRD availability.
      - "failed to setup webhooks": TLS certificate or WebhookConfiguration issues.
  - If the Pod is Running but the socket error persists:
    - Verify that the /var/run/eni/ directory is correctly shared between the host and the container via a hostPath volume.
Only after Terway is confirmed running on the node, proceed to Pod create/delete failures and Events.
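The terway-init patterns listed in this step can be matched mechanically. The following is a minimal sketch, assuming the log text contains the pattern strings exactly as quoted above. The function name explain_init_log is hypothetical, and it covers only some of the listed patterns.

```shell
# Sketch: map known terway-init log patterns to a short hint.
# Reads log text on stdin and prints one hint per matching line.
explain_init_log() {
  # "|| [ -n "$line" ]" keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      *"exclusive eni mode changed"*)
        echo "exclusive-mode label changed on an existing node: revert the label or recreate the node" ;;
      *"unsupport kernel version"*)
        echo "node kernel is too old for the configured features" ;;
      *"failed process input"*)
        echo "/etc/eni/eni_conf (from the eni-config ConfigMap) is missing or invalid JSON" ;;
      *"mount failed"*)
        echo "could not mount bpffs on /sys/fs/bpf: check privileges and kernel support" ;;
      *"get node"*"error"*)
        echo "cannot reach the Kubernetes API to fetch node labels: check network and RBAC" ;;
    esac
  done
}
```

A typical invocation would pipe the init container's logs into it, for example kubectl logs <pod> -n kube-system -c terway-init | explain_init_log. No output means none of the covered patterns appeared, not that the logs are clean.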
Step 2 – Always start from Kubernetes Events
For any Pod with network-related failures:
- Inspect Pod Events
  - Instruct the user to run kubectl describe pod <pod> -n <ns> and paste relevant Events.
  - Focus on Terway-related reasons (case-sensitive):
    - AllocIPFailed (Warning, Pod)
    - AllocIPSucceed (Normal, Pod)
    - VirtualModeChanged (Warning, Pod)
    - CniPodCreateError (Warning, Pod)
    - CniPodDeleteError (Warning, Pod)
    - CniCreateENIError (Warning, Pod)
    - CniPodENIDeleteErr (Warning, Pod)
- Interpret common Pod event reasons
  - AllocIPFailed (Warning, Pod)
    - Message starts with "cmdAdd: error alloc ip": Backend communication failure (daemon to controlplane or daemon internal).
    - Message "eth0 config is missing": Backend failed to return configuration for the primary interface.
    - Message contains OpenAPI errors:
      - InvalidVSwitchID.IPNotEnough / QuotaExceeded.PrivateIPAddress: VSwitch IP exhaustion.
      - ErrEniPerInstanceLimitExceeded: Node-level ENI quota reached.
  - AllocIPSucceed (Normal, Pod)
    - Message: "Alloc IP %s took %s".
    - IP allocation succeeded; if the Pod still fails, the issue is likely after IP allocation (datapath setup, routes, iptables, etc.).
  - VirtualModeChanged (Warning, Pod)
    - Message: "IPVLan seems unavailable, use Veth instead".
    - The node's kernel version is likely below 4.19 or lacks required capabilities. IPVLan-based data plane acceleration cannot be enabled, but Pod creation is unaffected. Networking falls back to Veth mode safely.
  - CniPodCreateError (Warning, Pod)
    - From the controlplane Pod controller.
    - Message "error parse pod annotation": k8s.aliyun.com/pod-networks is malformed.
    - Message "podNetworking is empty": k8s.aliyun.com/pod-networking annotation is present but empty.
    - Message "error get podNetworking %s": The referenced PodNetworking CR is missing.
    - Message "can not found available vSwitch for zone %s": No available VSwitch in the current zone matching the selector.
  - CniPodDeleteError (Warning, Pod)
    - Failure in Pod delete cleanup.
  - CniCreateENIError / CniPodENIDeleteErr (Warning, Pod)
    - From the PodENI controller. Message contains specific OpenAPI errors or rollbackErr.
- If no Terway-specific Events are present
  - Confirm that the Pod is scheduled to a node where Terway is running.
  - Then move to node-level and CRD-level Events.
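The AllocIPFailed interpretation above amounts to a small lookup on the event message. Below is a minimal sketch of that lookup. The function names and the short labels it prints are hypothetical, chosen only for illustration. OpenAPI error codes are checked first, on the assumption that a message starting with "cmdAdd: error alloc ip" may also carry one of them.

```shell
# Sketch: classify an AllocIPFailed event message using the patterns
# listed above. Prints a short, hypothetical label for the likely cause.
classify_alloc_ip_failed() {
  local msg="$1"
  case "$msg" in
    *InvalidVSwitchID.IPNotEnough*|*QuotaExceeded.PrivateIPAddress*)
      echo "vswitch-ip-exhausted" ;;
    *ErrEniPerInstanceLimitExceeded*)
      echo "node-eni-quota-reached" ;;
    *"eth0 config is missing"*)
      echo "backend-returned-no-eth0-config" ;;
    "cmdAdd: error alloc ip"*)
      echo "backend-communication-failure" ;;
    *)
      echo "unknown" ;;
  esac
}

# Hypothetical helper: print AllocIPFailed messages for one Pod.
# Not invoked here, because it needs a cluster.
alloc_ip_failed_messages() {  # usage: alloc_ip_failed_messages <ns> <pod>
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2,reason=AllocIPFailed" \
    -o jsonpath='{range .items[*]}{.message}{"\n"}{end}'
}
```

Each message printed by alloc_ip_failed_messages can be passed to classify_alloc_ip_failed. An "unknown" result means the message should be read manually instead of guessed at.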
Step 3 – Node and Node CR Events
Distinguish between:
- Kubernetes Node object (corev1.Node).
- Terway Node CRD (network.alibabacloud.com/v1beta1 Node) used in centralized IPAM.

- On the Kubernetes Node (corev1.Node)
  - Important Terway-related event reasons:
    - AllocIPFailed (Warning, Node)
      - From local IPAM; indicates ENI/IP issues at node level.
    - ConfigError (Warning, Node)
      - From Terway node controllers when eni-config or node capabilities are invalid.
  - Use these to distinguish between misconfiguration vs. resource exhaustion.
- On the Terway Node CRD (centralized IPAM)
  - When centralized IPAM is enabled, a Node CR under network.alibabacloud.com exists.
  - Terway emits events on this CR for ENI lifecycle and pool operations:
    - CreateENIFailed: Message "Failed to create ENI type=%s vsw=%s: %v". Check for OpenAPI errors like InvalidVSwitchID.IPNotEnough.
    - AttachENIFailed: Message "trunk eni id not found" (agent not ready) or "trunk eni is not allowed for eniOnly pod" (scheduling/config mismatch).
    - DeleteENIFailed: Message "Failed to delete ENI %s: %v".
  - Node Conditions on Node CR:
    - SufficientIP: If False, reason is IPResInsufficient, meaning the node pool cannot be filled.
- Link Node events to Pod failures
  - If Pods report AllocIPFailed or CniPodCreateError, check whether the corresponding Node / Node CR shows ENI/IPAM failures.
  - Use that correlation to explain whether the problem is capacity, config, or a bug.
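The split between misconfiguration and node-level ENI/IP problems on the core Node object can be sketched as follows. This is an illustration only: node_events and triage_node_reasons are hypothetical names, and the triage uses only the two event reasons named in this step.

```shell
# Hypothetical helper: print "REASON MESSAGE" for events on a core Node object.
# Not invoked here, because it needs a cluster.
node_events() {  # usage: node_events <node-name>
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$1" \
    -o custom-columns=REASON:.reason,MESSAGE:.message --no-headers
}

# Sketch: read "REASON MESSAGE" lines on stdin and summarize which of the
# two Terway node-level reasons from the text above were seen.
triage_node_reasons() {
  awk '
    $1 == "ConfigError"   { config = 1 }
    $1 == "AllocIPFailed" { alloc = 1 }
    END {
      if (config) print "misconfiguration: check eni-config and node capabilities"
      if (alloc)  print "ENI/IP issue at node level: check for resource exhaustion"
      if (!config && !alloc) print "no Terway node events found"
    }'
}
```

Chaining the two as node_events <node-name> | triage_node_reasons gives a first hint. The summary is deliberately coarse, so the event messages themselves still need to be read.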
Step 4 – Centralized vs non-centralized IPAM behavior
When reasoning about Terway behavior, always clarify which IPAM mode is in use.
- Detect mode from context
  - Centralized IPAM indicators:
    - Presence of Terway controlplane deployment.
    - CRDs like podenis.network.alibabacloud.com, nodes.network.alibabacloud.com, podnetworkings.network.alibabacloud.com.
    - Helm/config flag centralizedIPAM: true or controlplane config with CentralizedIPAM set.
  - Non-centralized/local IPAM indicators:
    - IPAM type in eni-config is default.
    - Node-local IPAM logic in the daemon is responsible for ENI/IP management.
- If centralized IPAM
  - In addition to Pod and Node events, always consider:
    - PodENI CR (per-pod ENI and IP state): events like CreateENIFailed, AttachENIFailed, UpdatePodENIFailed.
    - Node CR: ENI pool and warmup behavior.
    - PodNetworking CR: Events SyncPodNetworkingSucceed/Failed when syncing vswitch lists.
  - For Pod failures:
    - Check Pod Events (Cni* reasons) → PodENI Events → Node CR Events → controlplane logs.
- If non-centralized IPAM
  - Focus on:
    - Node Events (AllocIPFailed, ConfigError).
    - eni-config ConfigMap correctness (vswitches, security groups, ip_stack, trunk/erdma flags, etc.).
    - Terway daemon logs on the affected node.
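One of the centralized IPAM indicators above, the presence of the three CRDs, can be checked with a short filter. This is a minimal sketch under stated assumptions: the function name detect_centralized_crds is hypothetical, the input is expected in the form printed by kubectl get crd -o name, and a match is only an indicator that should be confirmed against the controlplane deployment and config.

```shell
# Sketch: look for the Terway CRDs named in the text above.
# Reads CRD names (one per line) on stdin.
detect_centralized_crds() {
  local names found=0 crd
  names=$(cat)
  for crd in podenis nodes podnetworkings; do
    # Match the bare CRD name or the "<resource type>/<name>" form.
    if printf '%s\n' "$names" | grep -qE "(^|/)${crd}\.network\.alibabacloud\.com$"; then
      echo "found: ${crd}.network.alibabacloud.com"
      found=$((found + 1))
    fi
  done
  if [ "$found" -gt 0 ]; then
    echo "centralized IPAM indicators present (${found}/3 CRDs)"
  else
    echo "no centralized IPAM CRDs found"
  fi
}
```

On a live cluster this would be run as kubectl get crd -o name | detect_centralized_crds. If no CRDs are found, check the IPAM type in eni-config before concluding that the node-local mode is in use.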
Step 5 – Using logs only when Events are insufficient
- **When to move to
Content truncated.