whatsapp-voice-talk
Real-time WhatsApp voice message processing. Transcribe voice notes to text via Whisper, detect intent, execute handlers, and send responses. Use when building conversational voice interfaces for WhatsApp. Supports English and Hindi, customizable intents (weather, status, commands), automatic language detection, and streaming responses via TTS.
Install
```bash
mkdir -p .claude/skills/whatsapp-voice-talk && curl -L -o skill.zip "https://mcp.directory/api/skills/download/8118" && unzip -o skill.zip -d .claude/skills/whatsapp-voice-talk && rm skill.zip
```
Installs to `.claude/skills/whatsapp-voice-talk`.
About this skill
WhatsApp Voice Talk
Turn WhatsApp voice messages into real-time conversations. This skill provides a complete pipeline: voice → transcription → intent detection → response generation → text-to-speech.
Perfect for:
- Voice assistants on WhatsApp
- Hands-free command interfaces
- Multi-lingual chatbots
- IoT voice control (drones, smart home, etc.)
Quick Start
1. Install Dependencies
```bash
pip install openai-whisper soundfile numpy
```
2. Process a Voice Message
```javascript
const { processVoiceNote } = require('./scripts/voice-processor');
const fs = require('fs');

// Wrap in an async function: top-level await is not valid in CommonJS.
(async () => {
  // Read a voice message (OGG, WAV, MP3, etc.)
  const buffer = fs.readFileSync('voice-message.ogg');

  // Process it
  const result = await processVoiceNote(buffer);
  console.log(result);
  // {
  //   status: 'success',
  //   response: "Current weather in Delhi is 19°C, haze. Humidity is 56%.",
  //   transcript: "What's the weather today?",
  //   intent: 'weather',
  //   language: 'en',
  //   timestamp: 1769860205186
  // }
})();
```
3. Run Auto-Listener
For automatic processing of incoming WhatsApp voice messages:
```bash
node scripts/voice-listener-daemon.js
```
This polls `~/.clawdbot/media/inbound/` every 5 seconds and processes new voice files.
How It Works
```
Incoming voice message
        ↓
Transcribe (local Whisper model)
        ↓
"What's the weather?"
        ↓
Detect language & intent
        ↓
Match against INTENTS
        ↓
Execute handler
        ↓
Generate response
        ↓
Convert to speech (TTS)
        ↓
Send back via WhatsApp
```
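The stages above compose into one async pipeline. A sketch under the assumption that each stage is injectable (every function name here is illustrative, not the skill's real API):

```javascript
// Illustrative pipeline: each dependency is a stub for the real stage.
async function handleVoiceMessage(audioBuffer, deps) {
  const transcript = await deps.transcribe(audioBuffer); // Whisper
  const language = deps.detectLanguage(transcript);      // 'en' | 'hi'
  const intent = deps.matchIntent(transcript);           // keyword match
  const response = await deps.handlers[intent](language); // execute handler
  await deps.sendReply(response);                        // TTS + WhatsApp send
  return { transcript, language, intent, response };
}
```

Because the stages are plain functions, each one can be swapped or unit-tested with stubs.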
Key Features
✅ Zero Setup Complexity - No FFmpeg, no complex dependencies. Uses soundfile + Whisper.
✅ Multi-Language - Automatic English/Hindi detection. Extend easily.
✅ Intent-Driven - Define custom intents with keywords and handlers.
✅ Real-Time Processing - 5-10 seconds per message (after first model load).
✅ Customizable - Add weather, status, commands, or anything else.
✅ Production Ready - Built from real usage in Clawdbot.
Common Use Cases
Weather Bot
```javascript
// User says: "What's the weather in Delhi?"
// Response: "Current weather in Delhi is 19°C..."
// (Built-in intent, just enable it)
```
Smart Home Control
```javascript
// User says: "Turn on the lights"
// Handler: sends a signal to the smart home API
// Response: "Lights turned on"
```
Task Manager
```javascript
// User says: "Add milk to shopping list"
// Handler: adds the item to a database
// Response: "Added milk to your list"
```
Status Checker
```javascript
// User says: "Is the system running?"
// Handler: checks system status
// Response: "All systems online"
```
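Each use case is just an intent plus a handler. A hypothetical smart-home handler following the handler shape this skill uses (the endpoint call is commented out and entirely illustrative):

```javascript
// Hypothetical smart-home handler: maps the "lights" intent to an API call.
const handlers = {
  async handleLights(language = 'en') {
    // await fetch('http://192.168.1.10/api/lights/on', { method: 'POST' }); // assumed endpoint
    return {
      status: 'success',
      response: language === 'en' ? 'Lights turned on' : 'लाइट चालू कर दी गई',
    };
  },
};
```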
Customization
Add a Custom Intent
Edit `voice-processor.js`:
- Add an entry to the `INTENTS` map:
```javascript
const INTENTS = {
  shopping: {
    keywords: ['shopping', 'list', 'buy', 'खरीद'],
    handler: 'handleShopping'
  }
};
```
- Add the matching handler:
```javascript
const handlers = {
  async handleShopping(language = 'en') {
    return {
      status: 'success',
      response: language === 'en'
        ? "What would you like to add to your shopping list?"
        : "आप अपनी शॉपिंग लिस्ट में क्या जोड़ना चाहते हैं?"
    };
  }
};
```
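Under the hood, dispatch is keyword matching over the transcript. A minimal matcher consistent with the `INTENTS` shape above (the shipped matcher in `voice-processor.js` may score or prioritize matches differently):

```javascript
// Return the first intent whose keyword list matches the transcript.
function matchIntent(transcript, intents) {
  const text = transcript.toLowerCase();
  for (const [name, { keywords }] of Object.entries(intents)) {
    if (keywords.some((k) => text.includes(k.toLowerCase()))) return name;
  }
  return 'unknown';
}
```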
Support More Languages
- Update `detectLanguage()` with your language's Unicode range:
```javascript
const urduChars = /[\u0600-\u06FF]/g; // Add this
```
- Add the language code to handler returns:
```javascript
return language === 'ur' ? 'Urdu response' : 'English response';
```
- Set the language in `transcribe.py`:
```python
result = model.transcribe(data, language="ur")
```
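Putting those steps together, a script-range detector might look like this (the Unicode blocks are standard Devanagari and Arabic-script ranges; the function shape is an assumption about `detectLanguage()`):

```javascript
// Guess the language by counting characters from script-specific Unicode blocks.
function detectLanguage(text) {
  const hindi = (text.match(/[\u0900-\u097F]/g) || []).length; // Devanagari
  const urdu = (text.match(/[\u0600-\u06FF]/g) || []).length;  // Arabic script
  if (hindi > 0 && hindi >= urdu) return 'hi';
  if (urdu > 0) return 'ur';
  return 'en'; // default when no script-specific characters are found
}
```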
Change Transcription Model
In `transcribe.py`:
```python
model = whisper.load_model("tiny")    # fastest, 39MB
model = whisper.load_model("base")    # default, 140MB
model = whisper.load_model("small")   # better accuracy, 466MB
model = whisper.load_model("medium")  # best of these, 1.5GB
```
Architecture
Scripts:
- `transcribe.py` - Whisper transcription (Python)
- `voice-processor.js` - core logic (intent parsing, handlers)
- `voice-listener-daemon.js` - auto-listener watching for new messages
References:
- `SETUP.md` - installation and configuration
- `API.md` - detailed function documentation
Integration with Clawdbot
If running as a Clawdbot skill, hook into message events:
```javascript
// In your Clawdbot handler
const { processVoiceNote } = require('skills/whatsapp-voice-talk/scripts/voice-processor');

message.on('voice', async (audioBuffer) => {
  const result = await processVoiceNote(audioBuffer, message.from);

  // Send the response back as text
  await message.reply(result.response);

  // Or send as voice (requires TTS)
  await sendVoiceMessage(result.response);
});
```
Performance
- First run: ~30 seconds (downloads Whisper model, ~140MB)
- Typical: 5-10 seconds per message
- Memory: ~1.5GB (base model)
- Languages: English, Hindi (easily extended)
Supported Audio Formats
OGG (Opus), WAV, FLAC, MP3, CAF, AIFF, and more via libsndfile.
WhatsApp encodes voice notes as Opus in an OGG container by default, so they work out of the box.
Troubleshooting
"No module named 'whisper'"
```bash
pip install openai-whisper
```
"No module named 'soundfile'"
```bash
pip install soundfile
```
Voice messages not processing?
- Check `clawdbot status` (is it running?)
- Check `~/.clawdbot/media/inbound/` (are files arriving?)
- Run the daemon manually to see logs: `node scripts/voice-listener-daemon.js`
Slow transcription?
Use a smaller model: `whisper.load_model("base")` or `"tiny"`.
Further Reading
- Setup Guide: see `references/SETUP.md` for detailed installation and configuration
- API Reference: see `references/API.md` for function signatures and examples
- Examples: check `scripts/` for working code
License
MIT - Use freely, customize, contribute back!
Built for real-world use in Clawdbot. Battle-tested with multiple languages and use cases.