dask
Parallel/distributed computing. Scale pandas/NumPy beyond memory, parallel DataFrames/Arrays, multi-file processing, task graphs, for larger-than-RAM datasets and parallel workflows.
Install
mkdir -p .claude/skills/dask && curl -L -o skill.zip "https://mcp.directory/api/skills/download/2240" && unzip -o skill.zip -d .claude/skills/dask && rm skill.zipInstalls to .claude/skills/dask
About this skill
Dask
Overview
Dask is a Python library for parallel and distributed computing that enables three critical capabilities:
- Larger-than-memory execution on single machines for data exceeding available RAM
- Parallel processing for improved computational speed across multiple cores
- Distributed computation supporting terabyte-scale datasets across multiple machines
Dask scales from laptops (processing ~100 GiB) to clusters (processing ~100 TiB) while maintaining familiar Python APIs.
When to Use This Skill
This skill should be used when:
- Process datasets that exceed available RAM
- Scale pandas or NumPy operations to larger datasets
- Parallelize computations for performance improvements
- Process multiple files efficiently (CSVs, Parquet, JSON, text logs)
- Build custom parallel workflows with task dependencies
- Distribute workloads across multiple cores or machines
Core Capabilities
Dask provides five main components, each suited to different use cases:
1. DataFrames - Parallel Pandas Operations
Purpose: Scale pandas operations to larger datasets through parallel processing.
When to Use:
- Tabular data exceeds available RAM
- Need to process multiple CSV/Parquet files together
- Pandas operations are slow and need parallelization
- Scaling from pandas prototype to production
Reference Documentation: For comprehensive guidance on Dask DataFrames, refer to references/dataframes.md which includes:
- Reading data (single files, multiple files, glob patterns)
- Common operations (filtering, groupby, joins, aggregations)
- Custom operations with
map_partitions - Performance optimization tips
- Common patterns (ETL, time series, multi-file processing)
Quick Example:
import dask.dataframe as dd
# Read multiple files as single DataFrame
ddf = dd.read_csv('data/2024-*.csv')
# Operations are lazy until compute()
filtered = ddf[ddf['value'] > 100]
result = filtered.groupby('category').mean().compute()
Key Points:
- Operations are lazy (build task graph) until
.compute()called - Use
map_partitionsfor efficient custom operations - Convert to DataFrame early when working with structured data from other sources
2. Arrays - Parallel NumPy Operations
Purpose: Extend NumPy capabilities to datasets larger than memory using blocked algorithms.
When to Use:
- Arrays exceed available RAM
- NumPy operations need parallelization
- Working with scientific datasets (HDF5, Zarr, NetCDF)
- Need parallel linear algebra or array operations
Reference Documentation: For comprehensive guidance on Dask Arrays, refer to references/arrays.md which includes:
- Creating arrays (from NumPy, random, from disk)
- Chunking strategies and optimization
- Common operations (arithmetic, reductions, linear algebra)
- Custom operations with
map_blocks - Integration with HDF5, Zarr, and XArray
Quick Example:
import dask.array as da
# Create large array with chunks
x = da.random.random((100000, 100000), chunks=(10000, 10000))
# Operations are lazy
y = x + 100
z = y.mean(axis=0)
# Compute result
result = z.compute()
Key Points:
- Chunk size is critical (aim for ~100 MB per chunk)
- Operations work on chunks in parallel
- Rechunk data when needed for efficient operations
- Use
map_blocksfor operations not available in Dask
3. Bags - Parallel Processing of Unstructured Data
Purpose: Process unstructured or semi-structured data (text, JSON, logs) with functional operations.
When to Use:
- Processing text files, logs, or JSON records
- Data cleaning and ETL before structured analysis
- Working with Python objects that don't fit array/dataframe formats
- Need memory-efficient streaming processing
Reference Documentation: For comprehensive guidance on Dask Bags, refer to references/bags.md which includes:
- Reading text and JSON files
- Functional operations (map, filter, fold, groupby)
- Converting to DataFrames
- Common patterns (log analysis, JSON processing, text processing)
- Performance considerations
Quick Example:
import dask.bag as db
import json
# Read and parse JSON files
bag = db.read_text('logs/*.json').map(json.loads)
# Filter and transform
valid = bag.filter(lambda x: x['status'] == 'valid')
processed = valid.map(lambda x: {'id': x['id'], 'value': x['value']})
# Convert to DataFrame for analysis
ddf = processed.to_dataframe()
Key Points:
- Use for initial data cleaning, then convert to DataFrame/Array
- Use
foldbyinstead ofgroupbyfor better performance - Operations are streaming and memory-efficient
- Convert to structured formats (DataFrame) for complex operations
4. Futures - Task-Based Parallelization
Purpose: Build custom parallel workflows with fine-grained control over task execution and dependencies.
When to Use:
- Building dynamic, evolving workflows
- Need immediate task execution (not lazy)
- Computations depend on runtime conditions
- Implementing custom parallel algorithms
- Need stateful computations
Reference Documentation: For comprehensive guidance on Dask Futures, refer to references/futures.md which includes:
- Setting up distributed client
- Submitting tasks and working with futures
- Task dependencies and data movement
- Advanced coordination (queues, locks, events, actors)
- Common patterns (parameter sweeps, dynamic tasks, iterative algorithms)
Quick Example:
from dask.distributed import Client
client = Client() # Create local cluster
# Submit tasks (executes immediately)
def process(x):
return x ** 2
futures = client.map(process, range(100))
# Gather results
results = client.gather(futures)
client.close()
Key Points:
- Requires distributed client (even for single machine)
- Tasks execute immediately when submitted
- Pre-scatter large data to avoid repeated transfers
- ~1ms overhead per task (not suitable for millions of tiny tasks)
- Use actors for stateful workflows
5. Schedulers - Execution Backends
Purpose: Control how and where Dask tasks execute (threads, processes, distributed).
When to Choose Scheduler:
- Threads (default): NumPy/Pandas operations, GIL-releasing libraries, shared memory benefit
- Processes: Pure Python code, text processing, GIL-bound operations
- Synchronous: Debugging with pdb, profiling, understanding errors
- Distributed: Need dashboard, multi-machine clusters, advanced features
Reference Documentation: For comprehensive guidance on Dask Schedulers, refer to references/schedulers.md which includes:
- Detailed scheduler descriptions and characteristics
- Configuration methods (global, context manager, per-compute)
- Performance considerations and overhead
- Common patterns and troubleshooting
- Thread configuration for optimal performance
Quick Example:
import dask
import dask.dataframe as dd
# Use threads for DataFrame (default, good for numeric)
ddf = dd.read_csv('data.csv')
result1 = ddf.mean().compute() # Uses threads
# Use processes for Python-heavy work
import dask.bag as db
bag = db.read_text('logs/*.txt')
result2 = bag.map(python_function).compute(scheduler='processes')
# Use synchronous for debugging
dask.config.set(scheduler='synchronous')
result3 = problematic_computation.compute() # Can use pdb
# Use distributed for monitoring and scaling
from dask.distributed import Client
client = Client()
result4 = computation.compute() # Uses distributed with dashboard
Key Points:
- Threads: Lowest overhead (~10 µs/task), best for numeric work
- Processes: Avoids GIL (~10 ms/task), best for Python work
- Distributed: Monitoring dashboard (~1 ms/task), scales to clusters
- Can switch schedulers per computation or globally
Best Practices
For comprehensive performance optimization guidance, memory management strategies, and common pitfalls to avoid, refer to references/best-practices.md. Key principles include:
Start with Simpler Solutions
Before using Dask, explore:
- Better algorithms
- Efficient file formats (Parquet instead of CSV)
- Compiled code (Numba, Cython)
- Data sampling
Critical Performance Rules
1. Don't Load Data Locally Then Hand to Dask
# Wrong: Loads all data in memory first
import pandas as pd
df = pd.read_csv('large.csv')
ddf = dd.from_pandas(df, npartitions=10)
# Correct: Let Dask handle loading
import dask.dataframe as dd
ddf = dd.read_csv('large.csv')
2. Avoid Repeated compute() Calls
# Wrong: Each compute is separate
for item in items:
result = dask_computation(item).compute()
# Correct: Single compute for all
computations = [dask_computation(item) for item in items]
results = dask.compute(*computations)
3. Don't Build Excessively Large Task Graphs
- Increase chunk sizes if millions of tasks
- Use
map_partitions/map_blocksto fuse operations - Check task graph size:
len(ddf.__dask_graph__())
4. Choose Appropriate Chunk Sizes
- Target: ~100 MB per chunk (or 10 chunks per core in worker memory)
- Too large: Memory overflow
- Too small: Scheduling overhead
5. Use the Dashboard
from dask.distributed import Client
client = Client()
print(client.dashboard_link) # Monitor performance, identify bottlenecks
Common Workflow Patterns
ETL Pipeline
import dask.dataframe as dd
# Extract: Read data
ddf = dd.read_csv('raw_data/*.csv')
# Transform: Clean and process
ddf = ddf[ddf['status'] == 'valid']
ddf['amount'] = ddf['amount'].astype('float64')
ddf = ddf.dropna(subset=['important_col'])
# Load: Aggregate and save
summary = ddf.groupby('category').agg({'amount': ['sum', 'mean']})
summary.to_parquet('output/summary.parquet')
Unstructured to Structured Pipeline
import dask.bag as db
import json
# Start with Bag for unstructured data
bag = db.read_text('logs/*.json').map(json.loads)
bag = bag.filter(lambda x: x['status'] == 'valid')
# Convert t
---
*Content truncated.*
More by davila7
View all skills by davila7 →You might also like
flutter-development
aj-geddes
Build beautiful cross-platform mobile apps with Flutter and Dart. Covers widgets, state management with Provider/BLoC, navigation, API integration, and material design.
drawio-diagrams-enhanced
jgtolentino
Create professional draw.io (diagrams.net) diagrams in XML format (.drawio files) with integrated PMP/PMBOK methodologies, extensive visual asset libraries, and industry-standard professional templates. Use this skill when users ask to create flowcharts, swimlane diagrams, cross-functional flowcharts, org charts, network diagrams, UML diagrams, BPMN, project management diagrams (WBS, Gantt, PERT, RACI), risk matrices, stakeholder maps, or any other visual diagram in draw.io format. This skill includes access to custom shape libraries for icons, clipart, and professional symbols.
ui-ux-pro-max
nextlevelbuilder
"UI/UX design intelligence. 50 styles, 21 palettes, 50 font pairings, 20 charts, 8 stacks (React, Next.js, Vue, Svelte, SwiftUI, React Native, Flutter, Tailwind). Actions: plan, build, create, design, implement, review, fix, improve, optimize, enhance, refactor, check UI/UX code. Projects: website, landing page, dashboard, admin panel, e-commerce, SaaS, portfolio, blog, mobile app, .html, .tsx, .vue, .svelte. Elements: button, modal, navbar, sidebar, card, table, form, chart. Styles: glassmorphism, claymorphism, minimalism, brutalism, neumorphism, bento grid, dark mode, responsive, skeuomorphism, flat design. Topics: color palette, accessibility, animation, layout, typography, font pairing, spacing, hover, shadow, gradient."
godot
bfollington
This skill should be used when working on Godot Engine projects. It provides specialized knowledge of Godot's file formats (.gd, .tscn, .tres), architecture patterns (component-based, signal-driven, resource-based), common pitfalls, validation tools, code templates, and CLI workflows. The `godot` command is available for running the game, validating scripts, importing resources, and exporting builds. Use this skill for tasks involving Godot game development, debugging scene/resource files, implementing game systems, or creating new Godot components.
nano-banana-pro
garg-aayush
Generate and edit images using Google's Nano Banana Pro (Gemini 3 Pro Image) API. Use when the user asks to generate, create, edit, modify, change, alter, or update images. Also use when user references an existing image file and asks to modify it in any way (e.g., "modify this image", "change the background", "replace X with Y"). Supports both text-to-image generation and image-to-image editing with configurable resolution (1K default, 2K, or 4K for high resolution). DO NOT read the image file first - use this skill directly with the --input-image parameter.
fastapi-templates
wshobson
Create production-ready FastAPI projects with async patterns, dependency injection, and comprehensive error handling. Use when building new FastAPI applications or setting up backend API projects.
Related MCP Servers
Browse all serversMobile Next offers fast, seamless mobile automation for iOS and Android. Automate apps, extract data, and simplify mobil
Integrate with Gemini CLI for large-scale file analysis, secure code execution, and advanced context control using Googl
Leverage OpenAI o3 Search for advanced web results, outperforming Bing AI and other engines with unrivaled AI search cap
OpenAI WebSearch enables real-time AI search using Bing AI by Microsoft for up-to-date web info and configurable search
Recraft AI is an ai image generator for creating, editing, and upscaling raster or vector images with advanced artificia
AWS Athena connector lets you run SQL on AWS data lakes via Athena in AWS, empowering rapid, large-scale business intell
Stay ahead of the MCP ecosystem
Get weekly updates on new skills and servers.