Minimax Speech-02-HD

Minimax Speech-02-HD offers advanced text-to-audio capabilities with emotional expression and multilingual support.

Platform: Replicate

Text-to-Speech, Emotional Voice Synthesis, Multilingual TTS, Voice Cloning

95.6k runs

0.10 per thousand input tokens

Commercial

🚀Function Overview

A high-fidelity text-to-speech model that generates natural audio output with precise control over vocal characteristics like emotion, language, pitch, and speed for professional applications.

Key Features

300+ pre-built voices across demographics
10-second voice cloning with 99% vocal similarity
Multilingual support for 30+ languages with native accents
Precise emotion control (auto-detect or manual)
Adjustable pitch, speed, volume, and pause durations
Professional-quality audio output optimized for voiceovers

Use Cases

• Audiobook narration
• Podcast and video voiceovers
• Multilingual IVR systems
• Video game character voices
• Accessibility tools for text-to-speech
• Personalized voice cloning applications

⚙️Input Parameters

text

string

Text to convert to speech. Each character counts as 1 token. Maximum 5000 characters. Insert <#x#> between words to add a pause, where x is the pause duration in seconds (0.01 to 99.99).
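As a minimal sketch of the pause syntax and the length limit, the helpers below build a pause marker and join text fragments with it. The function names are hypothetical, and two points are assumptions rather than documented behavior: that two-decimal formatting such as <#0.50#> is accepted, and that marker characters count toward the 5000-character limit.

```python
MAX_CHARS = 5000  # limit stated in the parameter description


def pause(seconds: float) -> str:
    """Return a pause marker such as <#0.50#> for the given duration."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    # Assumption: two decimal places match the documented 0.01-99.99 range.
    return f"<#{seconds:.2f}#>"


def join_with_pauses(parts: list[str], seconds: float) -> str:
    """Join text fragments with the same pause marker between each pair."""
    text = pause(seconds).join(parts)
    # Assumption: marker characters count toward the character limit.
    if len(text) > MAX_CHARS:
        raise ValueError(f"text is {len(text)} characters; the limit is {MAX_CHARS}")
    return text


text = join_with_pauses(["Hello there.", "Welcome back."], 0.5)
print(text)  # Hello there.<#0.50#>Welcome back.
```

Checking the length locally avoids sending a request that the model would presumably reject.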

voice_id

string

Desired voice ID. Use a voice ID you have trained, or one of the following system voice IDs: Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman, Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl, Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl
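A small sketch for telling the listed system voices apart from custom ones. The helper name is hypothetical, and the comparison is case-sensitive only because the IDs are copied exactly as listed; whether the service itself treats voice IDs as case-sensitive is not stated.

```python
# System voice IDs copied from the voice_id parameter description.
SYSTEM_VOICE_IDS = frozenset({
    "Wise_Woman", "Friendly_Person", "Inspirational_girl", "Deep_Voice_Man",
    "Calm_Woman", "Casual_Guy", "Lively_Girl", "Patient_Man", "Young_Knight",
    "Determined_Man", "Lovely_Girl", "Decent_Boy", "Imposing_Manner",
    "Elegant_Man", "Abbess", "Sweet_Girl_2", "Exuberant_Girl",
})


def is_system_voice(voice_id: str) -> bool:
    """Return True if voice_id is one of the listed system voices."""
    return voice_id in SYSTEM_VOICE_IDS


print(is_system_voice("Friendly_Person"))  # True
print(is_system_voice("my_trained_voice"))  # False
```

An ID that is not in this set is not necessarily invalid; it may be a voice you have trained.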

speed

number

Speech speed

volume

number

Speech volume

pitch

integer

Speech pitch

emotion

string

Speech emotion

english_normalization

boolean

Enable English text normalization for better number reading (slightly increases latency)

sample_rate

integer

Sample rate for the generated speech

bitrate

integer

Bitrate for the generated speech

channel

string

Number of audio channels, given as a string (for example, "mono")

language_boost

string

Enhance recognition of specific languages and dialects

💡Usage Examples

Example 1

Input Parameters

{
  "text": "Speech-02-series is a Text-to-Audio and voice cloning technology that offers voice synthesis, emotional expression, and multilingual capabilities.\n\nThe HD version is optimized for high-fidelity applications like voiceovers and audiobooks. While the turbo one is designed for real-time applications with low latency.\n\nWhen using this model on Replicate, each character represents 1 token.",
  "pitch": 0,
  "speed": 1,
  "volume": 1,
  "bitrate": 128000,
  "channel": "mono",
  "emotion": "happy",
  "voice_id": "Friendly_Person",
  "sample_rate": 32000,
  "language_boost": "English",
  "english_normalization": true
}

Output Results

https://replicate.delivery/xezq/V5fclDfiEXq1GUvPTIC6zc4CWhYvZagKvPgkHlR9YldH3toUA/tmpdgbymb15.mp3
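A request like Example 1 could be sent with the Replicate Python client roughly as follows. This is a sketch under assumptions: the model slug "minimax/speech-02-hd" is not stated in this listing, the helper names are hypothetical, and the return type of replicate.run depends on the client version (a URL string or a file-like object). The call only runs when REPLICATE_API_TOKEN is set.

```python
import os


def build_input(text: str, voice_id: str = "Friendly_Person") -> dict:
    """Assemble an input payload using the parameter values from Example 1."""
    return {
        "text": text,
        "voice_id": voice_id,
        "emotion": "happy",
        "pitch": 0,
        "speed": 1,
        "volume": 1,
        "bitrate": 128000,
        "channel": "mono",
        "sample_rate": 32000,
        "language_boost": "English",
        "english_normalization": True,
    }


def synthesize(text: str):
    """Run the model on Replicate and return whatever the client returns."""
    import replicate  # third-party package: pip install replicate

    # Assumption: the model is published under this slug on Replicate.
    return replicate.run("minimax/speech-02-hd", input=build_input(text))


if __name__ == "__main__" and os.environ.get("REPLICATE_API_TOKEN"):
    print(synthesize("Hello from Speech-02-HD."))
```

Keeping the payload builder separate from the network call makes it easy to inspect or adjust parameters before spending tokens.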

Technical Specifications

Hardware Type
Run Count: 95.6k
Commercial Use: Supported
Pricing: 0.10 per thousand input tokens
Platform: Replicate
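
Because each character counts as one token, the cost of a request can be estimated from the text length alone. The sketch below assumes billing scales linearly with character count, with no minimum charge or rounding; the listing does not state the currency unit, so the result is in whatever unit the listed price uses. The function name is hypothetical.

```python
PRICE_PER_1K_TOKENS = 0.10  # listed price; currency unit not stated in the listing


def estimate_cost(text: str) -> float:
    """Estimate the price of one request, assuming 1 character = 1 token."""
    tokens = len(text)
    # Assumption: linear billing with no minimum charge or rounding.
    return tokens / 1000 * PRICE_PER_1K_TOKENS


# A maximum-length input of 5000 characters:
print(round(estimate_cost("a" * 5000), 2))  # 0.5
```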

Related Keywords

Text-to-Audio, Emotional Expression, Multilingual Capabilities, Voice Cloning, Audiobook Narration, Podcast Voiceovers, IVR Systems, Accessibility Tools

Related Models

Cog Orpheus 3B

Spanish and English Text to Speech model from Canopy Labs (3b-es_it-ft-research_release)

Spark TTS

A model for text-to-speech generation with voice cloning and adjustable vocal parameters.

Dia 1.6B

Dia 1.6B by Nari Labs generates realistic dialogue audio from text, including non-verbal cues and voice cloning.