databricks-common-errors
Diagnose and fix common Databricks errors and exceptions. Use when encountering Databricks errors, debugging failed jobs, or troubleshooting cluster and notebook issues. Trigger with phrases like "databricks error", "fix databricks", "databricks not working", "debug databricks", "spark error".
Install
mkdir -p .claude/skills/databricks-common-errors && curl -L -o skill.zip "https://mcp.directory/api/skills/download/3379" && unzip -o skill.zip -d .claude/skills/databricks-common-errors && rm skill.zip
Installs to .claude/skills/databricks-common-errors
About this skill
Databricks Common Errors
Overview
Quick-reference diagnostic guide for the most frequent Databricks errors. Covers cluster failures, Spark OOM, Delta Lake conflicts, permissions, schema mismatches, rate limits, and job run failures with real SDK/SQL solutions.
Prerequisites
- Databricks CLI configured
- Access to cluster/job logs
- databricks-sdk installed for programmatic debugging
Instructions
Step 1: Identify the Error Source
# Get failed run details
databricks runs get --run-id $RUN_ID --output json | jq '{
  state: .state.result_state,
  message: .state.state_message,
  tasks: [.tasks[] | {key: .task_key, state: .state.result_state, error: .state.state_message}]
}'
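The same summary can be produced in Python when jq is not available. This is a minimal sketch, assuming the run JSON has the shape the jq filter above reads; `summarize_run` and the sample payload are hypothetical and exist only for illustration.

```python
import json


def summarize_run(run_json: str) -> dict:
    """Reduce a job-run JSON document to overall state plus per-task state."""
    run = json.loads(run_json)
    state = run.get("state", {})
    tasks = []
    for task in run.get("tasks", []):
        task_state = task.get("state", {})
        tasks.append({
            "key": task.get("task_key"),
            "state": task_state.get("result_state"),
            "error": task_state.get("state_message"),
        })
    return {
        "state": state.get("result_state"),
        "message": state.get("state_message"),
        "tasks": tasks,
    }


# Hypothetical payload shaped like the fields the jq filter reads
sample = json.dumps({
    "state": {"result_state": "FAILED", "state_message": "Task ingest failed"},
    "tasks": [
        {"task_key": "ingest", "state": {"result_state": "FAILED", "state_message": "OOM"}},
        {"task_key": "report", "state": {"result_state": "SUCCESS", "state_message": ""}},
    ],
})
summary = summarize_run(sample)
failed = [t["key"] for t in summary["tasks"] if t["state"] == "FAILED"]
print(failed)
```

Missing keys come back as None rather than raising, so a run with no tasks still yields a usable summary.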
Step 2: Match and Fix
CLUSTER_NOT_READY / INVALID_STATE
ClusterNotReadyException: Cluster 0123-456789-abcde is not in a RUNNING state
Cause: Cluster is starting, terminating, or in error state.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import State

w = WorkspaceClient()
cluster = w.clusters.get(cluster_id="0123-456789-abcde")
if cluster.state in (State.PENDING, State.RESTARTING):
    # Already coming up: block until the cluster reaches RUNNING
    w.clusters.ensure_cluster_is_running("0123-456789-abcde")
elif cluster.state == State.TERMINATED:
    # Stopped: start it and wait for RUNNING
    w.clusters.start_and_wait(cluster_id="0123-456789-abcde")
elif cluster.state == State.ERROR:
    reason = cluster.termination_reason
    print(f"Cluster error: {reason.code}: {reason.parameters}")
    # Common: CLOUD_PROVIDER_LAUNCH_FAILURE, INSTANCE_POOL_CLUSTER_FAILURE
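The branching above can be read as a small decision table. The sketch below restates it as a pure function over state names so it can be checked without a workspace; `next_action` and its return labels are hypothetical, and TERMINATING is handled on the assumption that a cluster must finish terminating before it can be started again.

```python
def next_action(state: str) -> str:
    """Map a cluster state name to the remediation used for that state."""
    if state == "RUNNING":
        return "proceed"
    if state in ("PENDING", "RESTARTING"):
        return "wait"  # already coming up: wait for RUNNING
    if state == "TERMINATING":
        return "wait_then_start"  # assumption: let termination finish first
    if state == "TERMINATED":
        return "start"
    if state == "ERROR":
        return "inspect_termination_reason"
    return "unknown"


for name in ("PENDING", "TERMINATED", "ERROR"):
    print(name, "->", next_action(name))
```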
SPARK_DRIVER_OOM
java.lang.OutOfMemoryError: Java heap space
SparkException: Job aborted due to stage failure
Cause: Driver or executor running out of memory.
# Fix 1: Increase memory via cluster Spark config
spark_conf = {
    "spark.driver.memory": "8g",
    "spark.executor.memory": "8g",
    "spark.sql.shuffle.partitions": "400",  # more, smaller partitions lower per-task memory
}

# Fix 2: Never collect() large datasets
# BAD: all_data = df.collect()
# GOOD: df.write.format("delta").saveAsTable("catalog.schema.results")

# Fix 3: Broadcast small tables instead of shuffling
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_lookup_df), "key")
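A value for spark.sql.shuffle.partitions can be estimated from the amount of data being shuffled. This is a rough sketch, assuming a target of about 128 MiB per partition (a common rule of thumb, not something stated above) and a floor of 200, which is Spark's default for this setting. `suggest_shuffle_partitions` is a hypothetical helper.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def suggest_shuffle_partitions(shuffle_bytes: int,
                               target_partition_bytes: int = 128 * MIB,
                               min_partitions: int = 200) -> int:
    """Return a partition count that keeps each shuffle partition near the target size."""
    if shuffle_bytes < 0:
        raise ValueError("shuffle_bytes must be non-negative")
    needed = math.ceil(shuffle_bytes / target_partition_bytes)
    # Never go below the floor, even for small shuffles
    return max(min_partitions, needed)


print(suggest_shuffle_partitions(100 * GIB))
```

The shuffle size itself can be read from the Spark UI stage details; treat the result as a starting point to tune, not a fixed answer.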
DELTA_CONCURRENT_WRITE
ConcurrentAppendException: Files were added by a concurrent update
ConcurrentDeleteReadException: A concurrent operation modified files
Cause: Multiple jobs writing to the same Delta table simultaneously.
from delta.tables import DeltaTable
import time


def merge_with_retry(spark, source_df, target_table, merge_key, max_retries=3):
    """MERGE with retry for concurrent write conflicts."""
    for attempt in range(max_retries):
        try:
            target = DeltaTable.forName(spark, target_table)
            (target.alias("t")
                .merge(source_df.alias("s"), f"t.{merge_key} = s.{merge_key}")
                .whenMatchedUpdateAll()
                .whenNotMatchedInsertAll()
                .execute())
            return
        except Exception as e:
            # Retry only conflict errors, with exponential backoff; re-raise everything else
            if "Concurrent" in str(e) and attempt < max_retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise
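When several writers hit the same conflict at once, identical backoff delays can make them collide again. The sketch below shows the same retry idea with random jitter added, written against a generic callable so it runs without Spark. `retry_on_conflict`, its parameters, and the `flaky_merge` stand-in are hypothetical, and matching on the word "Concurrent" is an assumption carried over from the snippet above.

```python
import random
import time


def retry_on_conflict(operation, max_retries=3, base_delay=1.0,
                      sleep=time.sleep, rng=random.random):
    """Run operation(), retrying conflict errors with jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as exc:
            # Assumption: conflict errors carry "Concurrent" in their class name or message
            retryable = "Concurrent" in type(exc).__name__ or "Concurrent" in str(exc)
            if not retryable or attempt == max_retries - 1:
                raise
            # Scale the exponential delay by a factor in [0.5, 1.0) to spread writers out
            sleep(base_delay * (2 ** attempt) * (0.5 + rng() / 2))


# Stand-in operation that conflicts twice, then succeeds
calls = {"n": 0}


def flaky_merge():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("ConcurrentAppendException: Files were added by a concurrent update")
    return "merged"


delays = []
result = retry_on_conflict(flaky_merge, sleep=delays.append, rng=lambda: 1.0)
print(result, delays)
```

Passing `sleep` and `rng` as parameters keeps the helper testable; in real use the defaults apply and the MERGE call would be wrapped in a lambda.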
PERMISSION_DENIED
PERMISSION_DENIED: User does not have SELECT on TABLE catalog.schema.table
PermissionDeniedException: User does not have permission MANAGE on cluster
Cause: Missing Unity Catalog grants or workspace permissions.
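-- Fix Unity Catalog permissions (run as the object owner or another principal allowed to grant)
-- Unity Catalog names the container privileges USE CATALOG and USE SCHEMA
GRANT USE CATALOG ON CATALOG analytics TO `data-team`;
GRANT USE SCHEMA ON SCHEMA analytics.silver TO `data-team`;
GRANT SELECT ON TABLE analytics.silver.orders TO `data-team`;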
-- Check current grants
SHOW GRANTS ON TABLE analytics.silver.orders;
# Fix workspace object permissions
databricks permissions update jobs --job-id 123 --json '{
  "access_control_list": [{
    "user_name": "user@company.com",
    "permission_level": "CAN_MANAGE_RUN"
  }]
}'
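Reading a Unity Catalog table needs a grant at each level of the three-part name: USE CATALOG on the catalog, USE SCHEMA on the schema, and SELECT on the table. The sketch below generates that set of statements from a table name; `grants_for_select` is a hypothetical helper, and it assumes plain identifiers that need no escaping.

```python
def grants_for_select(table_fqn: str, principal: str) -> list[str]:
    """Build the GRANT statements a principal needs to SELECT from one table."""
    parts = table_fqn.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {table_fqn!r}")
    catalog, schema, _table = parts
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`;",
        f"GRANT SELECT ON TABLE {table_fqn} TO `{principal}`;",
    ]


statements = grants_for_select("analytics.silver.orders", "data-team")
for stmt in statements:
    print(stmt)
```

A missing grant at any one level produces the same PERMISSION_DENIED symptom, so it helps to check all three together.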
INVALID_PARAMETER_VALUE
InvalidParameterValue: Instance type xyz not supported in region us-east-1
Invalid spark_version: 13.x.x-scala2.12
Cause: Wrong cluster config for the workspace region.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List valid node types for this workspace (ten smallest by memory)
for nt in sorted(w.clusters.list_node_types().node_types, key=lambda x: x.memory_mb)[:10]:
    print(f"{nt.node_type_id}: {nt.memory_mb}MB, {nt.num_cores} cores")

# List valid Spark versions (LTS only)
for v in w.clusters.spark_versions().versions:
    if "LTS" in v.name:
        print(f"{v.key}: {v.name}")
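Once the valid node types are known, choosing one is a filter-and-minimize step. The sketch below works on plain dictionaries so it runs without a workspace; `pick_node_type` and the `example.*` node type IDs are hypothetical, and the field names mirror the attributes printed above.

```python
def pick_node_type(node_types, min_memory_mb, min_cores=1):
    """Return the ID of the smallest node type meeting the memory and core minimums."""
    candidates = [
        nt for nt in node_types
        if nt["memory_mb"] >= min_memory_mb and nt["num_cores"] >= min_cores
    ]
    if not candidates:
        return None
    # Smallest by memory, then by cores, to avoid over-provisioning
    best = min(candidates, key=lambda nt: (nt["memory_mb"], nt["num_cores"]))
    return best["node_type_id"]


# Hypothetical catalog; real IDs come from list_node_types()
sample_types = [
    {"node_type_id": "example.small", "memory_mb": 8192, "num_cores": 2},
    {"node_type_id": "example.medium", "memory_mb": 16384, "num_cores": 4},
    {"node_type_id": "example.large", "memory_mb": 32768, "num_cores": 8},
]
print(pick_node_type(sample_types, min_memory_mb=12000))
```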
SCHEMA_MISMATCH
AnalysisException: A schema mismatch detected when writing to the Delta table
Cause: Source schema doesn't match target table.
from pyspark.sql.functions import col

# Option 1: Enable schema evolution
df.write.format("delta").option("mergeSchema", "true").mode("append").saveAsTable("target")

# Option 2: Identify differences
source_cols = set(df.columns)
target_cols = set(spark.table("target").columns)
print(f"Missing in source: {target_cols - source_cols}")
print(f"Extra in source: {source_cols - target_cols}")

# Option 3: Cast to match target schema
target_schema = spark.table("target").schema
for field in target_schema:
    if field.name in df.columns:
        df = df.withColumn(field.name, col(field.name).cast(field.dataType))
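Comparing column names alone misses columns that exist on both sides with different types. The sketch below extends the comparison to types, using plain name-to-type dictionaries so it runs without Spark; `diff_schemas` and the sample schemas are hypothetical. On a real DataFrame, `dict(df.dtypes)` yields a mapping of this shape.

```python
def diff_schemas(source: dict, target: dict) -> dict:
    """Compare two {column: type} mappings and report every kind of mismatch."""
    src, tgt = set(source), set(target)
    return {
        "missing_in_source": sorted(tgt - src),
        "extra_in_source": sorted(src - tgt),
        # Present on both sides but with a different type
        "type_mismatch": sorted(c for c in src & tgt if source[c] != target[c]),
    }


# Hypothetical schemas for illustration
source_schema = {"id": "bigint", "amount": "string", "note": "string"}
target_schema = {"id": "bigint", "amount": "double", "created_at": "timestamp"}
report = diff_schemas(source_schema, target_schema)
print(report)
```

Columns listed under type_mismatch are the ones to cast; columns under extra_in_source are the ones schema evolution would add.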
JOB_RUN_FAILED
RunState: FAILED — Run terminated with error
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
run = w.jobs.get_run(run_id=12345)
print(f"State: {run.state.life_cycle_state}")
print(f"Result: {run.state.result_state}")
print(f"Message: {run.state.state_message}")

# Check each task and print the error for the failed ones
for task in run.tasks:
    if task.state.result_state and task.state.result_state.value == "FAILED":
        output = w.jobs.get_run_output(task.run_id)
        print(f"Task '{task.task_key}' failed: {output.error}")
        if output.error_trace:
            print(f"Traceback:\n{output.error_trace[:500]}")
HTTP 429 — RATE_LIMIT_EXCEEDED
See databricks-rate-limits skill for full retry patterns.
from databricks.sdk.errors import TooManyRequests
import time


def call_with_backoff(operation, max_retries=5):
    for attempt in range(max_retries):
        try:
            return operation()
        except TooManyRequests as e:
            # Prefer the server-provided delay; fall back to exponential backoff
            wait = e.retry_after_secs or (2 ** attempt)
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
    raise RuntimeError("Max retries exceeded")
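Unbounded exponential backoff grows quickly: attempt 10 would wait over 17 minutes. The sketch below isolates the delay calculation and adds an upper bound to the exponential part while still honoring a server-provided delay as given. `backoff_delay` and the 60-second cap are hypothetical choices, not part of the SDK.

```python
def backoff_delay(attempt: int, retry_after=None, base: float = 1.0, cap: float = 60.0) -> float:
    """Return seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None and retry_after > 0:
        # The server said how long to wait: use its value unchanged
        return float(retry_after)
    # Otherwise double the delay each attempt, but never exceed the cap
    return min(base * (2 ** attempt), cap)


schedule = [backoff_delay(n) for n in range(8)]
print(schedule)
```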
Output
- Error identified and categorized
- Fix applied from matching error pattern
- Resolution verified
Error Handling
| Error Code | HTTP | Category | Quick Fix |
|---|---|---|---|
| CLUSTER_NOT_READY | - | Compute | ensure_cluster_is_running() |
| OutOfMemoryError | - | Spark | Increase memory, avoid .collect() |
| ConcurrentAppendException | - | Delta | MERGE with retry, serialize writes |
| PERMISSION_DENIED | 403 | Auth | GRANT in Unity Catalog |
| INVALID_PARAMETER_VALUE | 400 | Config | Check list_node_types() |
| AnalysisException | - | Schema | mergeSchema=true |
| FAILED run state | - | Job | Check get_run_output() for traceback |
| Too Many Requests | 429 | Rate Limit | Exponential backoff with Retry-After |
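The table can also serve as a lookup when triaging log output. The sketch below matches an error message against substrings for each row and returns the category and quick fix; `classify_error`, `ERROR_RULES`, and the chosen substrings are hypothetical, and the matching is deliberately simple (case-insensitive, first match wins).

```python
# (substrings to look for, category, quick fix), in the table's row order
ERROR_RULES = [
    (("cluster_not_ready", "clusternotready", "invalid_state"), "Compute", "ensure_cluster_is_running()"),
    (("outofmemoryerror",), "Spark", "Increase memory, avoid .collect()"),
    (("concurrentappend", "concurrentdelete"), "Delta", "MERGE with retry, serialize writes"),
    (("permission_denied", "permissiondenied"), "Auth", "GRANT in Unity Catalog"),
    (("invalid_parameter_value", "invalidparametervalue"), "Config", "Check list_node_types()"),
    (("schema mismatch",), "Schema", "mergeSchema=true"),
    (("too many requests", "rate_limit_exceeded"), "Rate Limit", "Exponential backoff with Retry-After"),
]


def classify_error(message: str):
    """Return (category, quick_fix) for a known error message, else None."""
    lowered = message.lower()
    for needles, category, fix in ERROR_RULES:
        if any(needle in lowered for needle in needles):
            return category, fix
    return None


print(classify_error("java.lang.OutOfMemoryError: Java heap space"))
```

An unmatched message returns None, which is the cue to move on to the escalation steps.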
Examples
Quick Diagnostic Commands
databricks clusters get --cluster-id $CID | jq '{state, termination_reason}'
databricks runs list --job-id $JID --limit 5 | jq '.runs[] | {run_id, state: .state.result_state}'
databricks permissions get jobs --job-id $JID
Escalation Path
- Check Databricks Status
- Collect evidence with databricks-debug-bundle
- Search Community Forum
- Contact support with workspace ID and request ID from error response
Next Steps
For comprehensive debugging, see databricks-debug-bundle.