Minimax Speech-02-HD
Minimax Speech-02-HD offers advanced text-to-audio capabilities with emotional expression and multilingual support. Try it now and see the results!
🚀 Function Overview
A high-fidelity text-to-speech model that generates natural audio output with precise control over vocal characteristics like emotion, language, pitch, and speed for professional applications.
Key Features
- 300+ pre-built voices across demographics
- 10-second voice cloning with 99% vocal similarity
- Multilingual support for 30+ languages with native accents
- Precise emotion control (auto-detect or manual)
- Adjustable pitch, speed, volume, and pause durations
- Professional-quality audio output optimized for voiceovers
Use Cases
- Audiobook narration
- Podcast and video voiceovers
- Multilingual IVR systems
- Video game character voices
- Accessibility tools for text-to-speech
- Personalized voice cloning applications
⚙️ Input Parameters
text (string)
Text to convert to speech. Each character counts as 1 token, up to a maximum of 5000 characters. Insert <#x#> between words to add a pause of x seconds (0.01-99.99).
voice_id (string)
Desired voice ID. Use a voice ID you have trained, or one of the following system voice IDs: Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman, Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl, Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl
speed (number)
Speech speed
volume (number)
Speech volume
pitch (integer)
Speech pitch
emotion (string)
Speech emotion
english_normalization (boolean)
Enable English text normalization for better number reading (slightly increases latency)
sample_rate (integer)
Sample rate for the generated speech
bitrate (integer)
Bitrate for the generated speech
channel (string)
Number of audio channels
language_boost (string)
Enhance recognition of specific languages and dialects
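The pause syntax and length limit described for the text parameter can be sketched in Python as follows. This is a minimal illustration, not part of the model's API: the helper names pause and join_with_pauses are hypothetical, the two-decimal formatting of the pause value is a choice made here, and counting the pause markers toward the 5000-character limit is a conservative assumption that the page does not confirm.

```python
MAX_CHARS = 5000  # maximum text length stated in the parameter description


def pause(seconds: float) -> str:
    """Return a pause marker of the form <#x#> for the given duration."""
    # The documented range for pause durations is 0.01 to 99.99 seconds.
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    return f"<#{seconds:.2f}#>"


def join_with_pauses(parts: list[str], seconds: float) -> str:
    """Join text fragments, placing the same pause marker between each pair."""
    text = pause(seconds).join(parts)
    # Assumption: markers count toward the limit, so check the full string.
    if len(text) > MAX_CHARS:
        raise ValueError(f"text is {len(text)} characters, limit is {MAX_CHARS}")
    return text


if __name__ == "__main__":
    print(join_with_pauses(["Welcome back.", "Let's begin."], 0.5))
```

The resulting string can be passed as the text input. Whether whitespace is needed around the marker is not stated on this page, so the sketch adds none.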
💡 Usage Examples
Example 1
Input Parameters
{ "text": "Speech-02-series is a Text-to-Audio and voice cloning technology that offers voice synthesis, emotional expression, and multilingual capabilities.\n\nThe HD version is optimized for high-fidelity applications like voiceovers and audiobooks. While the turbo one is designed for real-time applications with low latency.\n\nWhen using this model on Replicate, each character represents 1 token.", "pitch": 0, "speed": 1, "volume": 1, "bitrate": 128000, "channel": "mono", "emotion": "happy", "voice_id": "Friendly_Person", "sample_rate": 32000, "language_boost": "English", "english_normalization": true }
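A request like Example 1 could be assembled and sent from Python roughly as shown below. Treat this as a sketch under assumptions: the model identifier "minimax/speech-02-hd" is assumed rather than taken from this page, build_input and synthesize are hypothetical helper names, and the call relies on the third-party replicate package with a REPLICATE_API_TOKEN set in the environment. The default values simply mirror Example 1.

```python
from typing import Any

# Assumption: this identifier is not stated on the page; check the model's
# Replicate listing before relying on it.
MODEL_ID = "minimax/speech-02-hd"


def build_input(text: str, voice_id: str = "Friendly_Person",
                emotion: str = "happy") -> dict[str, Any]:
    """Build an input payload using the same settings as Example 1."""
    return {
        "text": text,
        "voice_id": voice_id,
        "emotion": emotion,
        "pitch": 0,
        "speed": 1,
        "volume": 1,
        "bitrate": 128000,
        "channel": "mono",
        "sample_rate": 32000,
        "language_boost": "English",
        "english_normalization": True,
    }


def synthesize(payload: dict[str, Any]) -> Any:
    """Send the payload to Replicate and return whatever the client returns."""
    # Imported lazily so the payload helper works without the package installed.
    import replicate

    return replicate.run(MODEL_ID, input=payload)
```

Calling synthesize(build_input("Hello from Speech-02-HD.")) would start a paid prediction, so it is left out of the block itself.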
Technical Specifications
- Hardware Type
- Run Count: 95.6k
- Commercial Use: Supported
- Pricing: 0.10 per thousand input tokens
- Platform: Replicate
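Because each character counts as one token and pricing is listed as 0.10 per thousand input tokens, the cost of a request can be estimated from the text length alone. The sketch below assumes billing is strictly proportional with no rounding or minimum charge, which this page does not confirm. The currency is also not stated here, so the result is a bare number, and estimate_cost is a hypothetical helper name.

```python
PRICE_PER_1000_TOKENS = 0.10  # from the pricing entry; currency not stated


def estimate_cost(text: str) -> float:
    """Estimate the price of synthesizing text, at one token per character."""
    tokens = len(text)
    return tokens / 1000 * PRICE_PER_1000_TOKENS


if __name__ == "__main__":
    # A request at the 5000-character maximum.
    print(round(estimate_cost("a" * 5000), 2))
```

Under these assumptions, a maximum-length request of 5000 characters comes to 5 x 0.10 = 0.50.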
Related Models
Cog Orpheus 3B
Spanish and English Text to Speech model from Canopy Labs (3b-es_it-ft-research_release)
Spark TTS
A model for text-to-speech generation with voice cloning and adjustable vocal parameters.
Dia 1.6B
Dia 1.6B by Nari Labs generates realistic dialogue audio from text, including non-verbal cues and voice cloning.