Claude DuckDB Skill Guide (2026)

data-analysis

Hand the result frame to Polars when you need plots or Python-side post-processing.

filesystem

Let the agent discover which CSVs exist in the directory before querying.

Join a CSV against a Parquet file

Join two files of different formats in one query - a CSV export against a Parquet dimension table - without loading either into a database first.

ForData engineers stitching a hand-exported CSV against a columnar dataset from the warehouse dump.

The prompt

Use the write-script-duckdb skill. Join ./events.csv (event_id, user_id, ts, action) against ./users.parquet (user_id, plan, signup_country) on user_id. Return action counts per plan tier for the last 30 days. Use read_csv() and read_parquet() in the same FROM clause. Files stay on disk; no CREATE TABLE.

What slides.md looks like

SELECT
    u.plan,
    e.action,
    count(*) AS n
FROM read_csv('events.csv') AS e
JOIN read_parquet('users.parquet') AS u
    ON e.user_id = u.user_id
WHERE e.ts >= now() - INTERVAL 30 DAY
GROUP BY u.plan, e.action
ORDER BY u.plan, n DESC;

One-line tweak

Swap the JOIN for a LEFT JOIN and add WHERE u.user_id IS NULL to find events from users missing in the dimension file - a one-line orphan check.

Pairs with

filesystem

Resolve the two file paths and confirm both exist before the join runs.

data-analysis

Polars for follow-up reshaping once the joined frame is in hand.

Read and flatten nested JSON

Parse a JSON or newline-delimited JSON file, extract nested fields, and unnest an array column into rows - all in SQL.

ForEngineers wrangling API dumps and log exports that arrive as JSON, not tidy tables.

The prompt

Use the write-script-duckdb skill. I have ./api_dump.json where each record has { id, customer: { name, tier }, items: [ { sku, qty } ] }. Write DuckDB SQL that returns one row per item with id, customer name, tier, sku, qty. Use read_json(), the -> / ->> JSON operators, and UNNEST for the items array.

What slides.md looks like

SELECT
    j.id,
    j.customer->>'name' AS customer_name,
    j.customer->>'tier' AS tier,
    item->>'sku'        AS sku,
    (item->>'qty')::INT AS qty
FROM read_json('api_dump.json') AS j,
     UNNEST(j.items) AS t(item);

-- For newline-delimited logs use read_ndjson('logs.ndjson')
-- or read_json(..., format = 'newline_delimited')

One-line tweak

If the file is one giant JSON array per line that auto-detection mis-reads, force it with read_json('api_dump.json', format = 'auto') or point at read_ndjson() for line-delimited logs.

Pairs with

data-analysis

Polars handles deeply nested structs if a single UNNEST isn't enough.

github

Pull the raw JSON export straight from a repo or gist before parsing it.

Query a remote Parquet over HTTP

Run SQL against a Parquet file sitting on a URL or object store without downloading it first - DuckDB streams only the columns and row groups it needs.

ForAnyone exploring a public dataset, a shared S3 bucket, or a Hugging Face Parquet without wanting a 4 GB local copy.

The prompt

Use the write-script-duckdb skill. Query this remote Parquet over HTTPS: https://example.com/data/nyc_taxi_2026.parquet. Return average trip distance and total fare by payment_type, for trips over $50. Read it straight from the URL with read_parquet() - the httpfs extension auto-loads on first remote read, so no manual INSTALL is needed.

What slides.md looks like

-- httpfs auto-loads on the first https:// read
SELECT
    payment_type,
    count(*)               AS trips,
    round(avg(trip_distance), 2) AS avg_miles,
    round(sum(fare_amount), 2)   AS total_fares
FROM read_parquet('https://example.com/data/nyc_taxi_2026.parquet')
WHERE fare_amount > 50
GROUP BY payment_type
ORDER BY total_fares DESC;

-- Glob a folder of remote files and keep the source:
-- SELECT *, filename FROM read_parquet('s3://bucket/y=2026/*.parquet')

One-line tweak

Add filename = true (or SELECT *, filename) so each row carries which partition file it came from - invaluable when globbing a dated folder.

Pairs with

github

Resolve the raw URL of a Parquet committed to a repo's release assets.

duckdb

If multiple AI clients should hit the same remote dataset, the DuckDB MCP server runs the engine as a shared service.

Window functions for running totals & ranks

Compute a 7-day moving average, a running total, and per-group ranks in one pass using DuckDB's window functions and QUALIFY.

ForAnalysts who keep reaching for Excel's drag-fill. Window functions do it correctly and reproducibly.

The prompt

Use the write-script-duckdb skill. From ./daily_revenue.csv (date, region, revenue), compute per region: the 7-day moving average of revenue, the running cumulative total, and rank each day within its region by revenue. Use AVG/SUM OVER with frame clauses and row_number() OVER. Use QUALIFY to also return only each region's single best day.

What slides.md looks like

SELECT
    date, region, revenue,
    round(avg(revenue) OVER w7, 2) AS revenue_7d_avg,
    sum(revenue)       OVER wrun   AS revenue_running,
    row_number()       OVER wrank  AS day_rank
FROM 'daily_revenue.csv'
WINDOW
    w7   AS (PARTITION BY region ORDER BY date
             ROWS BETWEEN 6 PRECEDING AND CURRENT ROW),
    wrun AS (PARTITION BY region ORDER BY date),
    wrank AS (PARTITION BY region ORDER BY revenue DESC)
QUALIFY day_rank = 1;   -- best day per region only

One-line tweak

Drop the QUALIFY line to keep every day with its rolling metrics; keep it to collapse to the one peak day per region. Same query, two deliverables.

Pairs with

matplotlib

Plot the moving-average column as a line chart once the SQL returns the frame.

sql

Lean on the SQL-standards skill to keep window definitions readable across the team.

PIVOT and UNPIVOT a table

Reshape long data to wide (a cross-tab) and back again using DuckDB's PIVOT / UNPIVOT statements - no hand-written CASE-WHEN columns.

ForAnyone building a quarter-by-region matrix for a deck, or normalizing a wide spreadsheet into long form for analysis.

The prompt

Use the write-script-duckdb skill. From ./sales.csv (country, year, population) build a cross-tab: one row per country, one column per year, cells = sum(population). Use DuckDB's PIVOT statement. Then show me the inverse with UNPIVOT to fold a wide year-columns table back into (country, year, population) long form.

What slides.md looks like

-- Wide cross-tab: years become columns
PIVOT (FROM 'sales.csv')
ON year
USING sum(population)
GROUP BY country;

-- Inverse: fold year columns back to long form
UNPIVOT wide_sales
ON "2024", "2025", "2026"
INTO
    NAME  year
    VALUE population;

One-line tweak

PIVOT discovers the distinct ON values automatically. To freeze a fixed set of columns (e.g. only 2024-2026), use the SQL-standard form: PIVOT sales ON year IN (2024, 2025, 2026) USING sum(population).

Pairs with

xlsx

Write the pivoted cross-tab straight into a formatted spreadsheet tab.

matplotlib

Render the cross-tab as a heatmap once it's in wide form.

Export query results to Parquet

Run a transformation and write the result to a compressed, columnar Parquet file - the format every downstream warehouse and Polars/pandas job reads fastest.

ForData engineers producing a cleaned, typed artifact for a pipeline instead of re-deriving it from raw CSV every run.

The prompt

Use the write-script-duckdb skill. Read ./raw_events.csv, drop rows where user_id IS NULL, cast ts to TIMESTAMP, keep only the columns user_id, ts, action, and write the cleaned result to ./clean/events.parquet using COPY ... TO with FORMAT parquet. Then also show the PARTITION_BY variant that writes one folder per action.

What slides.md looks like

-- Single cleaned Parquet file
COPY (
    SELECT user_id, ts::TIMESTAMP AS ts, action
    FROM 'raw_events.csv'
    WHERE user_id IS NOT NULL
) TO 'clean/events.parquet' (FORMAT parquet);

-- Hive-partitioned: one sub-folder per action value
COPY (SELECT * FROM 'raw_events.csv')
TO 'clean/events' (FORMAT parquet, PARTITION_BY (action));

One-line tweak

Add COMPRESSION 'zstd' inside the options for a smaller file than the default Snappy, at a small write-time cost. Worth it for cold artifacts you read often.

Pairs with

data-analysis

The Parquet output drops straight into a Polars scan_parquet for the next stage.

filesystem

Create the ./clean output directory before COPY writes to it.

Query a SQLite database in place

Attach an existing SQLite file and run analytical SQL across its tables - join them, aggregate them, without an ETL step into another store.

ForDevelopers whose app data lives in a SQLite file (mobile apps, local tools, Django/Rails dev DBs) who want fast analytics on it.

The prompt

Use the write-script-duckdb skill. I have ./app.sqlite with tables users(id, country) and sessions(user_id, started_at, duration_s). Attach it as a SQLite database and return median session duration and session count per country. Use INSTALL sqlite; LOAD sqlite; ATTACH ... (TYPE sqlite); then query the attached tables by qualified name.

What slides.md looks like

INSTALL sqlite;   -- once per environment
LOAD sqlite;
ATTACH 'app.sqlite' AS app (TYPE sqlite);

SELECT
    u.country,
    count(*)                        AS sessions,
    median(s.duration_s)            AS median_seconds
FROM app.sessions AS s
JOIN app.users    AS u ON u.id = s.user_id
GROUP BY u.country
ORDER BY sessions DESC;

One-line tweak

DuckDB ships ATTACH readers for SQLite, Postgres, and MySQL. Swap (TYPE sqlite) for (TYPE postgres) plus a connection string to run the same analytical SQL against a live Postgres without touching its write path.

Pairs with

postgresql

When the source is Postgres instead, the PostgreSQL MCP server inspects schemas before you ATTACH.

filesystem

Locate the .sqlite file across the project before attaching.

COPY a CSV into a typed table

Define an explicit, typed schema and bulk-load a CSV into it with COPY - the path you want when types matter more than zero-setup convenience.

ForEngineers who need a column to be DECIMAL not DOUBLE, or a date parsed with a specific format, before any downstream query touches it.

The prompt

Use the write-script-duckdb skill. Create a typed DuckDB table 'transactions' (txn_id BIGINT, amount DECIMAL(12,2), currency VARCHAR, txn_date DATE) and bulk-load ./txns.csv into it with COPY ... FROM. The CSV has a header row and a non-default date format YYYY-MM-DD. After loading, return the row count and the sum by currency.

What slides.md looks like

CREATE TABLE transactions (
    txn_id   BIGINT,
    amount   DECIMAL(12,2),
    currency VARCHAR,
    txn_date DATE
);

COPY transactions FROM 'txns.csv'
    (HEADER true, DATEFORMAT '%Y-%m-%d');

SELECT currency, count(*) AS n, sum(amount) AS total
FROM transactions
GROUP BY currency
ORDER BY total DESC;

One-line tweak

Add (HEADER true, IGNORE_ERRORS true) to skip malformed rows instead of failing the whole load, then query the rejects table DuckDB records to see what it dropped.

Pairs with

sql

The SQL-standards skill keeps the DDL and load script house-style across the team.

filesystem

Confirm the CSV path and header layout before defining the schema.

Read an Excel sheet directly

Query an .xlsx workbook the same way you query a CSV - pick a sheet, treat row one as headers, and run SQL with no manual Save-As-CSV dance.

ForAnyone handed a spreadsheet by finance, ops, or a vendor who needs answers out of it without opening Excel.

The prompt

Use the write-script-duckdb skill. From ./budget.xlsx, read the sheet named 'FY26' with the first row as headers, and return total budget and total actual by department, plus the variance. Use read_xlsx() with the sheet and header options - DuckDB infers the Excel extension automatically, no manual INSTALL.

What slides.md looks like

SELECT
    department,
    sum(budget)            AS budget,
    sum(actual)            AS actual,
    sum(actual) - sum(budget) AS variance
FROM read_xlsx('budget.xlsx', sheet = 'FY26', header = true)
GROUP BY department
ORDER BY variance;

-- .xlsx only; .xls is not supported by the reader.

One-line tweak

Wrap the read in a COPY (...) TO 'budget_fy26.parquet' (FORMAT parquet) to snapshot the spreadsheet into a stable Parquet the rest of the pipeline can trust.

Pairs with

xlsx

When you need to write a formatted workbook back out, not just read one in.

pdf

Extract a table from a PDF report into rows first, then analyze it the DuckDB way.

Community signal

Three voices from people using DuckDB for real work. The first is the time-to-value story; the second is the daily-default shift; the third names exactly who it clicks for.

“Naively I started by looking at redshift, Athena/glue, etc. I spun my wheels for a week and barely made progress... So I decided to give DuckDB a shot and I got the whole project done in under a day.”

moribvndvs · Hacker News

An engineer tasked with giving a non-technical team arbitrary queries over a daily Parquet drop - a week of fragile cloud PoCs versus a day on DuckDB.

“Over the past few years, I've found myself using DuckDB more and more for data processing, to the point where I now use it almost exclusively”

Robin Linacre · Blog

Data scientist Robin Linacre on why DuckDB became his default tool for data processing - the conviction shift the cookbook is built around.

“I'm sure the use of duckdb may seem weird for normal developers, but for data people it really is game-changing, especially for data scientists or business analysts.”

rubenvanwyk · Hacker News

A commenter naming exactly who DuckDB clicks for - analysts and data scientists - even when it looks unusual to app developers.

The contrarian take

DuckDB earns real skepticism when people try to make it something it isn’t. The sharpest version, from sradman on Hacker News:

“I would not adopt DuckDB as an OLTP replacement for either SQLite or PostgreSQL without a strong OLAP component and some serious performance/load/integrity testing”

sradman · Hacker News

From a Hacker News thread on DuckDB's positioning.