dask

6views

8installs

Parallel/distributed computing. Scale pandas/NumPy beyond memory, parallel DataFrames/Arrays, multi-file processing, task graphs, for larger-than-RAM datasets and parallel workflows.

Install

mkdir -p .claude/skills/dask && curl -L -o skill.zip "https://mcp.directory/api/skills/download/2240" && unzip -o skill.zip -d .claude/skills/dask && rm skill.zip

Installs to .claude/skills/dask

About this skill

Dask

Overview

Dask is a Python library for parallel and distributed computing that enables three critical capabilities:

Larger-than-memory execution on single machines for data exceeding available RAM
Parallel processing for improved computational speed across multiple cores
Distributed computation supporting terabyte-scale datasets across multiple machines

Dask scales from laptops (processing ~100 GiB) to clusters (processing ~100 TiB) while maintaining familiar Python APIs.

When to Use This Skill

This skill should be used when:

Process datasets that exceed available RAM
Scale pandas or NumPy operations to larger datasets
Parallelize computations for performance improvements
Process multiple files efficiently (CSVs, Parquet, JSON, text logs)
Build custom parallel workflows with task dependencies
Distribute workloads across multiple cores or machines

Core Capabilities

Dask provides five main components, each suited to different use cases:

1. DataFrames - Parallel Pandas Operations

Purpose: Scale pandas operations to larger datasets through parallel processing.

When to Use:

Tabular data exceeds available RAM
Need to process multiple CSV/Parquet files together
Pandas operations are slow and need parallelization
Scaling from pandas prototype to production

Reference Documentation: For comprehensive guidance on Dask DataFrames, refer to references/dataframes.md which includes:

Reading data (single files, multiple files, glob patterns)
Common operations (filtering, groupby, joins, aggregations)
Custom operations with map_partitions
Performance optimization tips
Common patterns (ETL, time series, multi-file processing)

Quick Example:

import dask.dataframe as dd

# Read multiple files as single DataFrame
ddf = dd.read_csv('data/2024-*.csv')

# Operations are lazy until compute()
filtered = ddf[ddf['value'] > 100]
result = filtered.groupby('category').mean().compute()

Key Points:

Operations are lazy (build task graph) until .compute() called
Use map_partitions for efficient custom operations
Convert to DataFrame early when working with structured data from other sources

2. Arrays - Parallel NumPy Operations

Purpose: Extend NumPy capabilities to datasets larger than memory using blocked algorithms.

When to Use:

Arrays exceed available RAM
NumPy operations need parallelization
Working with scientific datasets (HDF5, Zarr, NetCDF)
Need parallel linear algebra or array operations

Reference Documentation: For comprehensive guidance on Dask Arrays, refer to references/arrays.md which includes:

Creating arrays (from NumPy, random, from disk)
Chunking strategies and optimization
Common operations (arithmetic, reductions, linear algebra)
Custom operations with map_blocks
Integration with HDF5, Zarr, and XArray

Quick Example:

import dask.array as da

# Create large array with chunks
x = da.random.random((100000, 100000), chunks=(10000, 10000))

# Operations are lazy
y = x + 100
z = y.mean(axis=0)

# Compute result
result = z.compute()

Key Points:

Chunk size is critical (aim for ~100 MB per chunk)
Operations work on chunks in parallel
Rechunk data when needed for efficient operations
Use map_blocks for operations not available in Dask

3. Bags - Parallel Processing of Unstructured Data

Purpose: Process unstructured or semi-structured data (text, JSON, logs) with functional operations.

When to Use:

Processing text files, logs, or JSON records
Data cleaning and ETL before structured analysis
Working with Python objects that don't fit array/dataframe formats
Need memory-efficient streaming processing

Reference Documentation: For comprehensive guidance on Dask Bags, refer to references/bags.md which includes:

Reading text and JSON files
Functional operations (map, filter, fold, groupby)
Converting to DataFrames
Common patterns (log analysis, JSON processing, text processing)
Performance considerations

Quick Example:

import dask.bag as db
import json

# Read and parse JSON files
bag = db.read_text('logs/*.json').map(json.loads)

# Filter and transform
valid = bag.filter(lambda x: x['status'] == 'valid')
processed = valid.map(lambda x: {'id': x['id'], 'value': x['value']})

# Convert to DataFrame for analysis
ddf = processed.to_dataframe()

Key Points:

Use for initial data cleaning, then convert to DataFrame/Array
Use foldby instead of groupby for better performance
Operations are streaming and memory-efficient
Convert to structured formats (DataFrame) for complex operations

4. Futures - Task-Based Parallelization

Purpose: Build custom parallel workflows with fine-grained control over task execution and dependencies.

When to Use:

Building dynamic, evolving workflows
Need immediate task execution (not lazy)
Computations depend on runtime conditions
Implementing custom parallel algorithms
Need stateful computations

Reference Documentation: For comprehensive guidance on Dask Futures, refer to references/futures.md which includes:

Setting up distributed client
Submitting tasks and working with futures
Task dependencies and data movement
Advanced coordination (queues, locks, events, actors)
Common patterns (parameter sweeps, dynamic tasks, iterative algorithms)

Quick Example:

from dask.distributed import Client

client = Client()  # Create local cluster

# Submit tasks (executes immediately)
def process(x):
    return x ** 2

futures = client.map(process, range(100))

# Gather results
results = client.gather(futures)

client.close()

Key Points:

Requires distributed client (even for single machine)
Tasks execute immediately when submitted
Pre-scatter large data to avoid repeated transfers
~1ms overhead per task (not suitable for millions of tiny tasks)
Use actors for stateful workflows

5. Schedulers - Execution Backends

Purpose: Control how and where Dask tasks execute (threads, processes, distributed).

When to Choose Scheduler:

Threads (default): NumPy/Pandas operations, GIL-releasing libraries, shared memory benefit
Processes: Pure Python code, text processing, GIL-bound operations
Synchronous: Debugging with pdb, profiling, understanding errors
Distributed: Need dashboard, multi-machine clusters, advanced features

Reference Documentation: For comprehensive guidance on Dask Schedulers, refer to references/schedulers.md which includes:

Detailed scheduler descriptions and characteristics
Configuration methods (global, context manager, per-compute)
Performance considerations and overhead
Common patterns and troubleshooting
Thread configuration for optimal performance

Quick Example:

import dask
import dask.dataframe as dd

# Use threads for DataFrame (default, good for numeric)
ddf = dd.read_csv('data.csv')
result1 = ddf.mean().compute()  # Uses threads

# Use processes for Python-heavy work
import dask.bag as db
bag = db.read_text('logs/*.txt')
result2 = bag.map(python_function).compute(scheduler='processes')

# Use synchronous for debugging
dask.config.set(scheduler='synchronous')
result3 = problematic_computation.compute()  # Can use pdb

# Use distributed for monitoring and scaling
from dask.distributed import Client
client = Client()
result4 = computation.compute()  # Uses distributed with dashboard

Key Points:

Threads: Lowest overhead (~10 µs/task), best for numeric work
Processes: Avoids GIL (~10 ms/task), best for Python work
Distributed: Monitoring dashboard (~1 ms/task), scales to clusters
Can switch schedulers per computation or globally

Best Practices

For comprehensive performance optimization guidance, memory management strategies, and common pitfalls to avoid, refer to references/best-practices.md. Key principles include:

Start with Simpler Solutions

Before using Dask, explore:

Better algorithms
Efficient file formats (Parquet instead of CSV)
Compiled code (Numba, Cython)
Data sampling

Critical Performance Rules

1. Don't Load Data Locally Then Hand to Dask

# Wrong: Loads all data in memory first
import pandas as pd
df = pd.read_csv('large.csv')
ddf = dd.from_pandas(df, npartitions=10)

# Correct: Let Dask handle loading
import dask.dataframe as dd
ddf = dd.read_csv('large.csv')

2. Avoid Repeated compute() Calls

# Wrong: Each compute is separate
for item in items:
    result = dask_computation(item).compute()

# Correct: Single compute for all
computations = [dask_computation(item) for item in items]
results = dask.compute(*computations)

3. Don't Build Excessively Large Task Graphs

Increase chunk sizes if millions of tasks
Use map_partitions/map_blocks to fuse operations
Check task graph size: len(ddf.__dask_graph__())

4. Choose Appropriate Chunk Sizes

Target: ~100 MB per chunk (or 10 chunks per core in worker memory)
Too large: Memory overflow
Too small: Scheduling overhead

5. Use the Dashboard

from dask.distributed import Client
client = Client()
print(client.dashboard_link)  # Monitor performance, identify bottlenecks

Common Workflow Patterns

ETL Pipeline

import dask.dataframe as dd

# Extract: Read data
ddf = dd.read_csv('raw_data/*.csv')

# Transform: Clean and process
ddf = ddf[ddf['status'] == 'valid']
ddf['amount'] = ddf['amount'].astype('float64')
ddf = ddf.dropna(subset=['important_col'])

# Load: Aggregate and save
summary = ddf.groupby('category').agg({'amount': ['sum', 'mean']})
summary.to_parquet('output/summary.parquet')

Unstructured to Structured Pipeline

import dask.bag as db
import json

# Start with Bag for unstructured data
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'valid')

# Convert t

---

*Content truncated.*

More by davila7

View all skills by davila7 →

software-architecture

davila7

Guide for quality focused software architecture. This skill should be used when users want to write code, design architecture, analyze code, in any case that relates to software development.

527190

Implements Manus-style file-based planning for complex tasks. Creates task_plan.md, findings.md, and progress.md. Use when starting complex multi-step tasks, research projects, or any task requiring >5 tool calls.

84107

scroll-experience

davila7

Expert in building immersive scroll-driven experiences - parallax storytelling, scroll animations, interactive narratives, and cinematic web experiences. Like NY Times interactives, Apple product pages, and award-winning web experiences. Makes websites feel like experiences, not just pages. Use when: scroll animation, parallax, scroll storytelling, interactive story, cinematic website.

13087

humanizer

davila7

Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases. Credits: Original skill by @blader - https://github.com/blader/humanizer

11557

game-development

davila7

Game development orchestrator. Routes to platform-specific skills based on project needs.

15249

telegram-bot-builder

davila7

Expert in building Telegram bots that solve real problems - from simple automation to complex AI-powered bots. Covers bot architecture, the Telegram Bot API, user experience, monetization strategies, and scaling bots to thousands of users. Use when: telegram bot, bot api, telegram automation, chat bot telegram, tg bot.

10349

flutter-development

aj-geddes

Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.

1,6821,428

ui-ux-pro-max

nextlevelbuilder

"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."

1,2591,319

drawio-diagrams-enhanced

jgtolentino

Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.

1,5271,144

godot

bfollington

This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.

1,349807

nano-banana-pro

garg-aayush

Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.

1,261727

pdf-to-markdown

aliceisjustplaying

Convert entire PDF documents to clean, structured Markdown for full context loading. Use this skill when the user wants to extract ALL text from a PDF into context (not grep/search), when discussing or analyzing PDF content in full, when the user mentions "load the whole PDF", "bring the PDF into context", "read the entire PDF", or when partial extraction/grepping would miss important context. This is the preferred method for PDF text extraction over page-by-page or grep approaches.

1,466674

Related MCP Servers

Browse all servers

Mobile Next

Mobile Next offers fast, seamless mobile automation for iOS and Android. Automate apps, extract data, and simplify mobil

3,75219 tools

Gemini CLI

Integrate with Gemini CLI for large-scale file analysis, secure code execution, and advanced context control using Googl

2,0396 tools

OpenAI o3 Search

Leverage OpenAI o3 Search for advanced web results, outperforming Bing AI and other engines with unrivaled AI search cap

2860 tools

OpenAI WebSearch

OpenAI WebSearch enables real-time AI search using Bing AI by Microsoft for up-to-date web info and configurable search

860 tools

Recraft AI

Recraft AI is an ai image generator for creating, editing, and upscaling raster or vector images with advanced artificia

460 tools

AWS Athena

AWS Athena connector lets you run SQL on AWS data lakes via Athena in AWS, empowering rapid, large-scale business intell

400 tools

Install

mkdir -p .claude/skills/dask && curl -L -o skill.zip "https://mcp.directory/api/skills/download/2240" && unzip -o skill.zip -d .claude/skills/dask && rm skill.zip

Installs to .claude/skills/dask

Stats

Views

Installs

Author

davila7

7 skills published

Links

Source Code

dask

Install

About this skill

Dask

Overview

When to Use This Skill

Core Capabilities

1. DataFrames - Parallel Pandas Operations

2. Arrays - Parallel NumPy Operations

3. Bags - Parallel Processing of Unstructured Data

4. Futures - Task-Based Parallelization

5. Schedulers - Execution Backends

Best Practices

Start with Simpler Solutions

Critical Performance Rules

Common Workflow Patterns

ETL Pipeline

Unstructured to Structured Pipeline

More by davila7

software-architecture

planning-with-files

scroll-experience

humanizer

game-development

telegram-bot-builder

You might also like

flutter-development

ui-ux-pro-max

drawio-diagrams-enhanced

godot

nano-banana-pro

pdf-to-markdown

Related MCP Servers