Updated June 2026 · ~14 min read · Intermediate

MiniMax MCP Server: Complete Guide

The MiniMax MCP server gives an AI agent four generative-media superpowers — text-to-speech, voice cloning, image generation, and video generation — through one Model Context Protocol connection. This guide covers what it does, the eight tools, the smallest working install, the API-key-and-region gotcha that trips up most first-timers, and an honest read on cost before you wire it into a loop.

On this page · 17 sections
  1. TL;DR + what you need
  2. What it actually does
  3. Why it exists
  4. The named pieces
  5. Smallest end-to-end install
  6. The eight tools
  7. What we got wrong
  8. Wrong vs right patterns
  9. Common mistakes
  10. Cost notes
  11. Who it's for
  12. Community signal
  13. The verdict
  14. The bigger picture
  15. FAQ
  16. Glossary
  17. Sources

TL;DR + what you actually need

The MiniMax MCP server is the official bridge between an MCP client (Claude Desktop, Cursor, Windsurf, Claude Code) and MiniMax’s media-generation APIs. Install it once and your agent can speak, clone a voice, draw an image, and render a video — all from natural-language prompts. Three things you need before it works:

  • A MiniMax API key, created in the platform user center, set as MINIMAX_API_KEY.
  • The matching API host, set as MINIMAX_API_HOST. This is the part people get wrong — the host must match the region your key was issued in (global vs. mainland China). More on that below.
  • A runtime: uvx for the Python build (minimax-mcp) or npx for the JavaScript build (minimax-mcp-js). Pick one.

The fastest one-liner, for the Python build:

uvx minimax-mcp -y

That runs the server, but it needs the env vars set by your MCP client to do anything. The install panel further down emits the exact config block for your client.

What it actually does

Most AI agents can write text and call APIs but can’t produce media. Ask Claude to “narrate this script in a calm voice” or “make a six-second clip of rain on a window” and, without tools, it can only describe what it would do. The MiniMax MCP server closes that gap by exposing MiniMax’s generative models as callable tools.

Concretely, once it’s registered the agent can: synthesize speech from text in a chosen voice, clone a voice from an audio sample, design a brand-new voice from a written description, generate an image from a prompt, generate a video from a prompt, and compose music from a prompt and lyrics. The model picks the tool, fills in the arguments, and hands you back a file or a URL.

The official server ships in two flavors that wrap the same APIs: a Python implementation (MiniMax-AI/MiniMax-MCP) and a JavaScript one (MiniMax-AI/MiniMax-MCP-JS). Both are open source and maintained by MiniMax. We catalog the JS build at the MiniMax Multimodal server page.

Why it exists

Before MCP, putting MiniMax media into an agent meant writing glue code: HTTP calls to each endpoint, polling loops for the asynchronous video job, file downloads, and error handling — per agent, per project. Every team rebuilt the same plumbing. MCP — a standard wire format that lets any LLM client talk to any tool server — turns that plumbing into a config block. You declare the server once; every MCP-aware client gets the tools for free. The MiniMax MCP server exists to be that declared-once layer for MiniMax’s media stack.

Mental model: the named pieces

Four pieces decide whether your setup works. Learn the names and the rest of this guide reads faster.

MINIMAX_API_KEY

Your credential. Created in the MiniMax platform user center. It is region-bound — a key from the global platform will not authenticate against the mainland-China host, and vice versa.

MINIMAX_API_HOST

The API endpoint. The README lists two: https://api.minimax.io for the global platform and https://api.minimaxi.com for mainland China. Match it to your key’s region.

MINIMAX_MCP_BASE_PATH

The local directory where generated files land when you use local resource mode. Defaults to the desktop. Point it at a real, writable folder.

Resource mode (url / local)

url (default) returns links to hosted files; local downloads them to your base path. Set via MINIMAX_API_RESOURCE_MODE (Python) or MINIMAX_RESOURCE_MODE (JS).

The takeaway: the key and the host are a matched pair. Get the pairing right and everything else is detail. Get it wrong and you get an authentication error that looks like a bad key when the key is fine.
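
A small pre-flight check can make the pairing explicit before the client ever launches the server. The sketch below is illustrative helper code, not part of the MiniMax server: the function name check_pairing and the region labels "global" and "mainland" are assumptions made for this example. Only the two host URLs and the env var names come from the README as quoted above.

```python
# Hypothetical pre-flight check; not part of minimax-mcp itself.
# The region labels are an assumption for illustration; the hosts are the
# two listed in the README.
KNOWN_HOSTS = {
    "global": "https://api.minimax.io",
    "mainland": "https://api.minimaxi.com",
}


def check_pairing(env: dict, key_region: str) -> list:
    """Return a list of problems; an empty list means the env looks consistent."""
    problems = []
    if not env.get("MINIMAX_API_KEY"):
        problems.append("MINIMAX_API_KEY is missing or empty")

    expected = KNOWN_HOSTS.get(key_region)
    if expected is None:
        problems.append(f"unknown key region: {key_region!r}")
    elif env.get("MINIMAX_API_HOST", "").rstrip("/") != expected:
        problems.append(
            f"MINIMAX_API_HOST should be {expected} for a {key_region} key"
        )
    return problems
```

Calling check_pairing(env, "global") on the env block from your config returns an empty list when the host matches, and a readable hint when it does not. That is more useful than the "invalid api key" message the API itself produces.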

Smallest end-to-end install

Here is the shortest path from zero to a working tool. The install panel below pulls the exact config for your client straight from our catalog, so it stays in sync as MiniMax updates its setup.

One-line install · MiniMax Multimodal

If you prefer to hand-write the config, here is the canonical Claude Desktop block for the Python build (drop it into claude_desktop_config.json):

{
  "mcpServers": {
    "MiniMax": {
      "command": "uvx",
      "args": ["minimax-mcp", "-y"],
      "env": {
        "MINIMAX_API_KEY": "insert-your-api-key-here",
        "MINIMAX_MCP_BASE_PATH": "/Users/you/Desktop",
        "MINIMAX_API_HOST": "https://api.minimax.io",
        "MINIMAX_API_RESOURCE_MODE": "url"
      }
    }
  }
}

For the JavaScript build, swap the command to npx with args ["-y", "minimax-mcp-js"] and use MINIMAX_RESOURCE_MODE for the mode variable. Restart the client. Ask: “List the available MiniMax voices, then read ‘hello world’ in the first one.” If the agent calls a MiniMax tool and returns an audio file or link, you’re live. Browse every client and its config path at the server directory.
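
If you maintain configs for both builds, the differences described above are small enough to generate. This is a minimal sketch, assuming the command, args, and variable names given in this guide; build_config is a hypothetical helper name, and the output should be checked against your build's README before use.

```python
import json


def build_config(build: str, api_key: str, host: str, mode: str = "url") -> dict:
    """Build an mcpServers block for the Python or JavaScript build."""
    if build == "python":
        command, args = "uvx", ["minimax-mcp", "-y"]
        mode_var = "MINIMAX_API_RESOURCE_MODE"
    elif build == "js":
        command, args = "npx", ["-y", "minimax-mcp-js"]
        mode_var = "MINIMAX_RESOURCE_MODE"
    else:
        raise ValueError(f"build must be 'python' or 'js', got {build!r}")

    return {
        "mcpServers": {
            "MiniMax": {
                "command": command,
                "args": args,
                "env": {
                    "MINIMAX_API_KEY": api_key,
                    "MINIMAX_API_HOST": host,
                    mode_var: mode,  # the variable name differs per build
                },
            }
        }
    }


if __name__ == "__main__":
    cfg = build_config("js", "insert-your-api-key-here", "https://api.minimax.io")
    print(json.dumps(cfg, indent=2))
```

The printed JSON can be pasted into claude_desktop_config.json. Add MINIMAX_MCP_BASE_PATH to the env block by hand if you switch to local mode.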

The eight tools, walked through

The Python build exposes eight tools. The JavaScript build is nearly identical but trades list_voices for play_audio. Names below are the Python set; verify against your build’s README, since this is the seam where the two diverge.

text_to_audio

Convert text to speech with a chosen voice, model, speed, and emotion.

list_voices

Return the catalog of available voices so the agent can pick one. (JS build: not present.)

voice_clone

Clone a voice from a provided audio sample. Use only voices you own or have permission to use.

voice_design

Generate a brand-new voice from a text description plus preview text.

text_to_image

Generate an image from a text prompt.

generate_video

Kick off a video-generation job from a prompt. Asynchronous — returns a task to poll.

query_video_generation

Poll a video task by id and fetch the finished clip when it's ready.

music_generation

Compose a music track from a prompt and lyrics.

The pairing worth internalizing is generate_video → query_video_generation. Video is not synchronous: the first call returns a task id, and the agent has to poll the second tool until the render finishes. A model that calls generate_video and then declares the video “done” without polling will hand you nothing. Good clients chain the two automatically; thin ones need a nudge in the prompt.
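
The start-then-poll shape can be sketched as a bounded loop. Everything here is an assumption for illustration: query stands in for whatever invokes query_video_generation in your client, and the status strings "success" and "failed" are placeholders, not the documented response format. Check your build's README for the real fields.

```python
import time
from typing import Callable


def wait_for_video(
    query: Callable[[str], dict],
    task_id: str,
    max_polls: int = 30,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a video task until it finishes, fails, or the poll cap is hit."""
    for _ in range(max_polls):
        result = query(task_id)
        status = result.get("status")  # hypothetical field name
        if status == "success":
            return result
        if status == "failed":
            raise RuntimeError(f"video task {task_id} failed: {result}")
        sleep(delay_s)  # still rendering; wait before asking again
    raise TimeoutError(f"video task {task_id} not finished after {max_polls} polls")
```

The poll cap matters as much as the loop: without it, a stuck job keeps an agent waiting indefinitely. Passing sleep as a parameter keeps the function easy to test without real delays.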

What we got wrong

Two assumptions cost us time. First: we assumed the key was the only credential that mattered. We pasted a valid global key, left MINIMAX_API_HOST at a stale default a tutorial had used, and got “invalid api key” on every call. The key was perfect. The host was for the other region. The error message points at the key, but the fix is the host — match the host to the region the key was minted in.

Second: we treated video like text-to-speech and expected the clip back in one call. generate_video only starts the job. We saw a successful tool response, assumed the render existed, and couldn’t find a file anywhere. The render was still cooking on MiniMax’s side; the agent needed to poll query_video_generation with the returned task id. Asynchronous-by-default is the right design for a slow model — we just hadn’t read closely enough to expect it.

Wrong vs right patterns

❌ Wrong: same key, default host, hope it works

Pasting a key from one region and leaving the host unset (or set to the other region’s default). Authentication fails with a misleading “invalid api key” and you burn an hour regenerating a key that was never the problem.

✅ Right: set the host to match the key’s region

Global key → https://api.minimax.io. Mainland key → https://api.minimaxi.com. Set both in the same env block so the pairing is obvious to the next person reading the config.

❌ Wrong: looping video generation inside an automated agent

Pointing an unsupervised agent at generate_video in a retry loop. Each attempt spends credits whether or not you keep the output. A flaky prompt can drain a balance fast.

✅ Right: gate paid calls behind a human or a cap

Keep video and music generation in interactive sessions, or put a hard call-count cap around them. Use cheap TTS and image calls freely; treat video as a deliberate, reviewed action.
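
A hard call-count cap can be as small as a counter consulted before each paid call. The class below is a hypothetical sketch, not a feature of the MiniMax server: CallBudget and its methods are names invented for this example, and where you hook it in depends on your agent framework.

```python
class CallBudget:
    """Per-tool call cap. Tools without a configured limit are uncapped."""

    def __init__(self, limits: dict):
        self.limits = dict(limits)
        self.used = {}

    def allow(self, tool: str) -> bool:
        """Record one call and return True, or return False if the cap is hit."""
        limit = self.limits.get(tool)
        if limit is None:
            return True  # cheap tools such as text_to_audio run freely
        count = self.used.get(tool, 0)
        if count >= limit:
            return False
        self.used[tool] = count + 1
        return True


# Example: at most three video renders and two music tracks per session.
budget = CallBudget({"generate_video": 3, "music_generation": 2})
```

An agent wrapper would call budget.allow("generate_video") before dispatching the tool and stop, or ask a human, when it returns False.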

Common mistakes

“invalid api key” on a valid key

Root cause: the host doesn’t match the key’s region. Fix: set MINIMAX_API_HOST to https://api.minimax.io for a global key or https://api.minimaxi.com for a mainland key.

spawn uvx ENOENT

Root cause: the client can’t find uvx on its PATH. Fix: run which uvx and put the absolute path in the command field instead of the bare name.
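
The same lookup can be scripted. The sketch below uses Python's standard shutil.which to resolve the absolute path; resolve_command is a hypothetical helper name. Run it from a shell where uvx works, since a GUI-launched client may see a different PATH than your terminal does.

```python
import shutil


def resolve_command(name: str) -> str:
    """Return the absolute path of an executable, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(
            f"{name} is not on PATH; install it or use its absolute path"
        )
    return path


if __name__ == "__main__":
    try:
        # Paste the printed path into the "command" field of the config.
        print(resolve_command("uvx"))
    except FileNotFoundError as err:
        print(err)
```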

Generated files vanish

Root cause: local resource mode with no valid base path. Fix: set MINIMAX_MCP_BASE_PATH to a real writable folder, or switch to url mode and keep the returned links.
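
A quick way to confirm a base path is usable is to try creating a file in it. This is a minimal sketch using only the standard library; is_writable_dir is a hypothetical helper, and it checks the path from the account running the script, which should be the same account that launches the MCP client.

```python
import os
import tempfile


def is_writable_dir(path: str) -> bool:
    """True if path is an existing directory that a file can be created in."""
    expanded = os.path.expanduser(path)  # allow "~/Desktop" style paths
    if not os.path.isdir(expanded):
        return False
    try:
        # Create and immediately discard a temp file inside the directory.
        with tempfile.TemporaryFile(dir=expanded):
            pass
    except OSError:
        return False
    return True
```

Run it against the value you plan to put in MINIMAX_MCP_BASE_PATH before switching to local mode.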

Video “succeeds” but there’s no clip

Root cause: the agent stopped after generate_video without polling. Fix: instruct it to call query_video_generation with the task id until the job reports complete.

Cost notes

The MCP server is free; the generation it triggers is not. Every tool call that hits a MiniMax model spends API credits against your balance, and the cost is not uniform across tools. Text-to-speech and image calls are cheap. Video (the Hailuo family of models) is the expensive one, billed per clip by resolution and duration. MiniMax positions its pricing as low for the category, and reviewers broadly agree it undercuts rivals, but “cheap per call” and “cheap in an agent loop” are different statements.

The contrarian read, voiced across pricing analyses and reviews, is that the credit model forces rationing: you watch a monthly balance, and prices have crept upward as newer frontier models landed. That matters most for video. An agent told to “keep trying until the clip looks right” can spend a meaningful share of a monthly allowance on rejected renders. The practical rule: meter video and music behind a human or a cap; let TTS and images run freely.
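
The gap between "cheap per call" and "cheap in a loop" is simple arithmetic. The numbers below are placeholders chosen for illustration, not MiniMax prices; substitute the per-clip cost and allowance from your own account.

```python
def worst_case_spend(credits_per_clip: int, attempts: int) -> int:
    """Credits consumed if every attempt renders, kept or not."""
    return credits_per_clip * attempts


def max_attempts(budget_credits: int, credits_per_clip: int) -> int:
    """How many renders fit in a budget; a natural value for a call cap."""
    return budget_credits // credits_per_clip


# Placeholder figures, NOT real pricing: a 1000-credit monthly allowance
# and a clip that costs 45 credits.
MONTHLY_CREDITS = 1000
CLIP_COST = 45

spent = worst_case_spend(CLIP_COST, attempts=8)   # 8 rejected renders
cap = max_attempts(MONTHLY_CREDITS, CLIP_COST)    # renders the allowance covers
```

With these placeholder values, eight retries of a single prompt consume 360 of the 1000 credits, and the whole allowance covers 22 renders.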

Who it’s for — and who it isn’t

Use it if you:

  • Want one agent to produce several media types — voice, image, video, music — without stitching together multiple servers.
  • Are building a content or media pipeline where MiniMax’s Hailuo video or multilingual TTS is already your model of choice.
  • Prefer a single official, open-source server over hand-rolled API glue.

Look elsewhere if you:

  • Only need top-tier English voice work — a voice-specialist like ElevenLabs is the sharper tool for that one job.
  • Run cost-sensitive automated loops and can’t supervise paid video calls.
  • Are in a region where account, key, and host region don’t line up cleanly — the host gotcha will recur.

Community signal

MiniMax announced the official MCP server in mid-April 2025, framed as letting users invoke video, image, speech, and voice-cloning from a single text interface, compatible with Claude Desktop, Cursor, Windsurf, and OpenAI Agents. The first-party announcement lives at minimax.io/news/minimax-mcp. (We looked for a clean, verifiable maintainer tweet to embed and couldn’t confirm one to the standard this site holds, so we cite the announcement directly rather than embed an unverified post.)

Setup-guide coverage across third-party sites converges on one theme: the “invalid api key” error is the dominant first-run problem, and it’s almost always a host-region mismatch rather than a bad key. The contrarian thread runs through the pricing write-ups — Hailuo video is competitively priced per clip, but the credit system and upward price drift make heavy, unsupervised video use a real budget line, not a rounding error.

On the model side, MiniMax shipped Hailuo 2.3 in late October 2025 and upgraded its video agent to a full media agent, a signal that the generative-media stack behind this MCP server is actively moving, not frozen.

The Verdict

Our Take

The MiniMax MCP server is the cleanest way to give one agent speech, voice cloning, image, and video in a single connection — install it if you want multimodal media output without juggling servers. Use it if MiniMax is already your generation stack and you can meter video spend. Skip it if you only need best-in-class English TTS (a voice specialist wins there) or if you can’t supervise paid video calls in an automated loop. Set the host to match your key’s region before anything else.

The bigger picture

The MiniMax MCP server is one entry in a fast-filling shelf of media-generation MCP servers — voice from ElevenLabs, images from several providers, video from a growing field. The pattern is the same across all of them: wrap a paid generative API in the MCP interface so any agent can reach it. What makes MiniMax notable is breadth — most rivals do one modality well; MiniMax bundles four behind one server. If you’re new to the protocol itself, start with our explainer on what MCP is, then browse the AI and machine learning category for the rest of the generative-media servers. As agents move from text into media production, the server that covers the most modalities with the least setup has a real edge — and right now that’s a short list MiniMax sits near the top of.

Frequently Asked Questions

What is the MiniMax MCP server?

It is the official Model Context Protocol server from MiniMax that lets an AI agent call MiniMax's generative-media APIs — text-to-speech, voice cloning, image generation, and video generation — as tools. You register it once in Claude Desktop, Cursor, or Windsurf, then ask the agent to produce media in plain language.

What can the MiniMax MCP generate?

Audio (text-to-speech in many voices), cloned voices from a sample, designed voices from a text description, images from a prompt, video from a prompt, and music from a prompt plus lyrics. The Python build exposes eight tools; the JavaScript build swaps the voice-listing tool for an audio-playback tool.

Is the MiniMax MCP free?

The MCP server software is free and open source. The media generation behind it is not — every text-to-speech, image, or video call spends MiniMax API credits against your platform balance. Video is the most expensive call. Budget for usage-based cost before you wire it into an automated agent loop.

How do I set up the MiniMax MCP API key?

Create a key in the MiniMax platform user center, then set MINIMAX_API_KEY in the server's env block. You must also set MINIMAX_API_HOST to the host that matches the region your key was issued in — global keys and mainland-China keys use different hosts. A mismatch returns an 'invalid api key' error even when the key is correct.

MiniMax MCP vs ElevenLabs MCP — which should I use?

ElevenLabs is voice-only and the stronger pick if all you need is best-in-class English TTS and voice cloning. MiniMax is multimodal — one server covers speech, voice cloning, image, and video — so it wins when you want an agent to produce several media types without juggling multiple servers and keys.

Which MCP clients support the MiniMax server?

Any MCP client that runs a stdio server: Claude Desktop, Cursor, Windsurf, Claude Code, and OpenAI Agents are all confirmed by the maintainer. The Python build runs via uvx; the JavaScript build runs via npx. Both speak the same MCP tool interface, so the client doesn't care which one you pick.

Where do the generated files go?

It depends on the resource mode. In 'url' mode (the default) the tools return links to MiniMax-hosted files. In 'local' mode the server downloads each result into the directory you set with MINIMAX_MCP_BASE_PATH. Set that path to a real, writable folder before generating, or local mode has nowhere to save.

Glossary

MCP
Model Context Protocol — a standard wire format that lets any LLM client call any tool server.
MCP server
A program that exposes tools (here: media generation) over MCP to a client.
stdio transport
The mode where the client launches the server as a subprocess and talks over stdin/stdout.
Text-to-speech (TTS)
Converting written text into spoken audio in a chosen voice.
Voice cloning
Reproducing a specific voice from an audio sample, for use in new speech.
Voice design
Generating a new synthetic voice from a written description rather than a sample.
Hailuo
MiniMax’s video-generation model family, called by generate_video.
Resource mode
Whether tools return hosted URLs (url) or download files locally (local).
API host
The base endpoint the server calls; must match the region your key was issued in.
uvx
A runner from the uv Python toolchain that executes a package without a manual install step.
API credits
The usage-based currency MiniMax bills generation against; video costs the most.
