databricks-reference-architecture
Implement Databricks reference architecture with best-practice project layout. Use when designing new Databricks projects, reviewing architecture, or establishing standards for Databricks applications. Trigger with phrases like "databricks architecture", "databricks best practices", "databricks project structure", "how to organize databricks", "databricks layout".
Install
mkdir -p .claude/skills/databricks-reference-architecture && curl -L -o skill.zip "https://mcp.directory/api/skills/download/6847" && unzip -o skill.zip -d .claude/skills/databricks-reference-architecture && rm skill.zipInstalls to .claude/skills/databricks-reference-architecture
About this skill
Databricks Reference Architecture
Overview
Production-ready lakehouse architecture with Unity Catalog, Delta Lake, and the medallion pattern. Covers workspace organization, three-level namespace governance, compute strategy, CI/CD with Asset Bundles, and project structure for team collaboration.
Prerequisites
- Databricks workspace with Unity Catalog enabled
- Understanding of medallion architecture (bronze/silver/gold)
- Databricks CLI configured
- Terraform or Asset Bundles for infrastructure
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ UNITY CATALOG │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌───────────┐ │
│ │ Bronze │ │ Silver │ │ Gold │ │ ML Models │ │
│ │ Catalog │─▶│ Catalog │─▶│ Catalog │ │ (MLflow) │ │
│ │ (raw) │ │ (clean) │ │ (curated) │ │ │ │
│ └────────────┘ └────────────┘ └────────────┘ └───────────┘ │
│ ▲ │ │
│ ┌────────────┐ ┌────────────────┐ │
│ │ Auto Loader│ │ Model Serving │ │
│ │ Ingestion │ │ Endpoints │ │
│ └────────────┘ └────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│ Compute: Job Clusters │ SQL Warehouses │ Instance Pools │
├─────────────────────────────────────────────────────────────────┤
│ Security: Row Filters │ Column Masks │ Secret Scopes │ SCIM │
├─────────────────────────────────────────────────────────────────┤
│ CI/CD: Asset Bundles │ GitHub Actions │ dev/staging/prod │
└─────────────────────────────────────────────────────────────────┘
Project Structure
databricks-platform/
├── src/
│ ├── ingestion/
│ │ ├── bronze_raw_events.py # Auto Loader streaming
│ │ ├── bronze_api_data.py # REST API batch ingestion
│ │ └── bronze_file_uploads.py # Manual file uploads
│ ├── transformation/
│ │ ├── silver_clean_events.py # Cleansing + dedup
│ │ ├── silver_schema_enforce.py # Schema validation
│ │ └── silver_scd2.py # Slowly changing dimensions
│ ├── aggregation/
│ │ ├── gold_daily_metrics.py # Business KPIs
│ │ ├── gold_user_features.py # ML feature engineering
│ │ └── gold_reporting.py # BI-ready views
│ └── ml/
│ ├── training/
│ │ └── train_churn_model.py
│ └── inference/
│ └── batch_scoring.py
├── tests/
│ ├── conftest.py # Spark fixtures
│ ├── unit/ # Local Spark tests
│ └── integration/ # Databricks Connect tests
├── resources/
│ ├── etl_jobs.yml # ETL job definitions
│ ├── ml_jobs.yml # ML pipeline definitions
│ └── maintenance.yml # OPTIMIZE/VACUUM schedules
├── databricks.yml # Asset Bundle root config
├── pyproject.toml
└── requirements.txt
Instructions
Step 1: Unity Catalog Hierarchy
-- One catalog per environment (or shared with schema isolation)
CREATE CATALOG IF NOT EXISTS dev_catalog;
CREATE CATALOG IF NOT EXISTS prod_catalog;
-- Medallion schemas per catalog
CREATE SCHEMA IF NOT EXISTS prod_catalog.bronze;
CREATE SCHEMA IF NOT EXISTS prod_catalog.silver;
CREATE SCHEMA IF NOT EXISTS prod_catalog.gold;
CREATE SCHEMA IF NOT EXISTS prod_catalog.ml_features;
CREATE SCHEMA IF NOT EXISTS prod_catalog.ml_models;
-- Permissions: engineers write bronze/silver, analysts read gold
GRANT USAGE ON CATALOG prod_catalog TO `data-engineers`;
GRANT CREATE, MODIFY, SELECT ON SCHEMA prod_catalog.bronze TO `data-engineers`;
GRANT CREATE, MODIFY, SELECT ON SCHEMA prod_catalog.silver TO `data-engineers`;
GRANT SELECT ON SCHEMA prod_catalog.gold TO `data-engineers`;
GRANT USAGE ON CATALOG prod_catalog TO `data-analysts`;
GRANT SELECT ON SCHEMA prod_catalog.gold TO `data-analysts`;
Step 2: Asset Bundle Configuration
# databricks.yml
bundle:
name: data-platform
workspace:
host: ${DATABRICKS_HOST}
include:
- resources/*.yml
variables:
catalog:
default: dev_catalog
alert_email:
default: dev@company.com
targets:
dev:
default: true
mode: development
workspace:
root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/dev
staging:
variables:
catalog: staging_catalog
prod:
mode: production
variables:
catalog: prod_catalog
alert_email: oncall@company.com
workspace:
root_path: /Shared/.bundle/${bundle.name}/prod
Step 3: Compute Strategy
# resources/etl_jobs.yml
resources:
jobs:
daily_etl:
name: "daily-etl-${bundle.target}"
schedule:
quartz_cron_expression: "0 0 6 * * ?"
timezone_id: "UTC"
max_concurrent_runs: 1
tasks:
- task_key: bronze
notebook_task:
notebook_path: src/ingestion/bronze_raw_events.py
job_cluster_key: etl
- task_key: silver
depends_on: [{task_key: bronze}]
notebook_task:
notebook_path: src/transformation/silver_clean_events.py
job_cluster_key: etl
- task_key: gold
depends_on: [{task_key: silver}]
notebook_task:
notebook_path: src/aggregation/gold_daily_metrics.py
job_cluster_key: etl
job_clusters:
- job_cluster_key: etl
new_cluster:
spark_version: "14.3.x-scala2.12"
node_type_id: "i3.xlarge"
autoscale:
min_workers: 1
max_workers: 4
aws_attributes:
availability: SPOT_WITH_FALLBACK
first_on_demand: 1
spark_conf:
spark.databricks.delta.optimizeWrite.enabled: "true"
spark.databricks.delta.autoCompact.enabled: "true"
Step 4: Medallion Pipeline Pattern
# src/ingestion/bronze_raw_events.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, input_file_name
spark = SparkSession.builder.getOrCreate()
# Bronze: Auto Loader for incremental file ingestion
raw = (
spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/checkpoints/bronze/events/schema")
.option("cloudFiles.inferColumnTypes", "true")
.load("s3://data-lake/raw/events/")
.withColumn("_ingested_at", current_timestamp())
.withColumn("_source_file", input_file_name())
)
(raw.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/checkpoints/bronze/events/data")
.toTable("prod_catalog.bronze.raw_events"))
Step 5: Table Maintenance Schedule
# resources/maintenance.yml
resources:
jobs:
weekly_optimize:
name: "maintenance-optimize-${bundle.target}"
schedule:
quartz_cron_expression: "0 0 2 ? * SUN"
timezone_id: "UTC"
tasks:
- task_key: optimize_tables
notebook_task:
notebook_path: src/maintenance/optimize_tables.py
new_cluster:
spark_version: "14.3.x-scala2.12"
node_type_id: "m5.xlarge"
num_workers: 1
# src/maintenance/optimize_tables.py
tables_to_optimize = [
("prod_catalog.silver.orders", ["order_date", "region"]),
("prod_catalog.silver.events", ["event_date"]),
("prod_catalog.gold.daily_metrics", []),
]
for table, z_cols in tables_to_optimize:
if z_cols:
spark.sql(f"OPTIMIZE {table} ZORDER BY ({', '.join(z_cols)})")
else:
spark.sql(f"OPTIMIZE {table}")
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")
print(f"Maintained: {table}")
Output
- Unity Catalog hierarchy with env-isolated catalogs and medallion schemas
- Asset Bundle with dev/staging/prod targets and variable overrides
- Medallion pipeline (Auto Loader > MERGE > aggregations)
- RBAC grants separating engineer write from analyst read-only
- Table maintenance schedule (weekly OPTIMIZE + VACUUM)
Error Handling
| Issue | Cause | Solution |
|---|---|---|
| Schema evolution failure | New source columns | Auto Loader handles with schemaEvolutionMode |
| Permission denied on schema | Missing USAGE on parent catalog | GRANT USAGE ON CATALOG first |
| Concurrent write conflict | Multiple jobs writing same table | max_concurrent_runs: 1 in job config |
| Cluster timeout | Long-running tasks | Set timeout_seconds per task |
Examples
Validate Data Flow
SELECT 'bronze' AS layer, COUNT(*) AS rows FROM prod_catalog.bronze.raw_events
UNION ALL SELECT 'silver', COUNT(*) FROM prod_catalog.silver.events
UNION ALL SELECT 'gold', COUNT(*) FROM prod_catalog.gold.daily_metrics;
Resources
More by jeremylongshore
View all skills by jeremylongshore →You might also like
flutter-development
aj-geddes
Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.
drawio-diagrams-enhanced
jgtolentino
Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.
ui-ux-pro-max
nextlevelbuilder
"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."
godot
bfollington
This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.
nano-banana-pro
garg-aayush
Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.
fastapi-templates
wshobson
Create production-ready FastAPI projects with async patterns, dependency injection, and comprehensive error handling. Use when building new FastAPI applications or setting up backend API projects.
Related MCP Servers
Browse all serversGitHub Chat lets you query, analyze, and explore GitHub repositories with AI-powered insights, understanding codebases f
Nekzus Utility Server offers modular TypeScript tools for datetime, cards, and schema conversion with stdio transport co
Break down complex problems with Sequential Thinking, a structured tool and step by step math solver for dynamic, reflec
Build persistent semantic networks for enterprise & engineering data management. Enable data persistence and memory acro
Boost your AI code assistant with Context7: inject real-time API documentation from OpenAPI specification sources into y
Boost productivity with Task Master: an AI-powered tool for project management and agile development workflows, integrat
Stay ahead of the MCP ecosystem
Get weekly updates on new skills and servers.