Minimax Speech-02 Turbo
Minimax Speech-02 Turbo offers real-time voice synthesis and voice cloning.
🚀 Function Overview
A real-time text-to-audio model with voice cloning, emotional expression controls, and multilingual support.
Key Features
- Voice cloning with 99% similarity using custom or pre-built voices
- Emotion control via auto-detection or manual settings
- Support for 30+ languages including regional accents
- Granular audio controls (pitch, speed, volume)
- Low-latency processing for real-time applications
- English text normalization for improved number reading
Use Cases
- Real-time voice synthesis for interactive applications
- Audiobook and podcast narration
- Multilingual customer service chatbots
- Personalized voice cloning solutions
- Emotion-aware voice interfaces
- Accessibility tools for text-to-speech conversion
⚙️ Input Parameters
text
(string) Text to convert to speech. Each character counts as 1 token. Maximum 5000 characters. Insert <#x#> between words to add a pause of x seconds (0.01-99.99).
voice_id
(string) Desired voice ID. Use a voice ID you have trained (https://replicate.com/minimax/voice-cloning) or one of the following system voice IDs: Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman, Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl, Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl
speed
(number) Speech speed
volume
(number) Speech volume
pitch
(integer) Speech pitch
emotion
(string) Speech emotion
english_normalization
(boolean) Enable English text normalization for better number reading (slightly increases latency)
sample_rate
(integer) Sample rate for the generated speech
bitrate
(integer) Bitrate for the generated speech
channel
(string) Number of audio channels
language_boost
(string) Enhance recognition of specific languages and dialects
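The limits stated for the text parameter (a 5000-character cap, one token per character, and <#x#> pause markers between 0.01 and 99.99 seconds) can be checked locally before a request is sent. The following is a minimal sketch under stated assumptions: the helper names pause and validate_text are hypothetical and not part of any Minimax or Replicate API, and the sketch counts pause markers toward the character limit, which the listing does not confirm.

```python
import re

MAX_CHARS = 5000                     # limit stated for the `text` parameter
PAUSE_MIN, PAUSE_MAX = 0.01, 99.99   # allowed pause range in seconds

# Matches pause markers such as <#0.5#> or <#2#>.
PAUSE_RE = re.compile(r"<#(\d+(?:\.\d+)?)#>")


def pause(seconds: float) -> str:
    """Return a pause marker for the given duration (hypothetical helper)."""
    if not PAUSE_MIN <= seconds <= PAUSE_MAX:
        raise ValueError(
            f"pause must be within {PAUSE_MIN}-{PAUSE_MAX} s, got {seconds}"
        )
    return f"<#{seconds:.2f}#>"


def validate_text(text: str) -> int:
    """Check length and pause markers; return the character (token) count.

    Assumption: markers count toward the limit like any other characters.
    """
    if len(text) > MAX_CHARS:
        raise ValueError(f"text has {len(text)} characters, limit is {MAX_CHARS}")
    for match in PAUSE_RE.finditer(text):
        value = float(match.group(1))
        if not PAUSE_MIN <= value <= PAUSE_MAX:
            raise ValueError(f"pause {match.group(0)} is outside the allowed range")
    return len(text)


text = f"Hello{pause(0.5)}world"
token_count = validate_text(text)
```

Rejecting bad input on the client side avoids paying latency for a request that the service would likely refuse anyway.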
💡 Usage Examples
Example 1
Input Parameters
{
  "text": "Speech-02-series is a Text-to-Audio and voice cloning technology that offers voice synthesis, emotional expression, and multilingual capabilities.\n\nThe HD version is optimized for high-fidelity applications like voiceovers and audiobooks. While the turbo one is designed for real-time applications with low latency.\n\nWhen using this model on Replicate, each character represents 1 token.",
  "pitch": 0,
  "speed": 1,
  "volume": 1,
  "bitrate": 128000,
  "channel": "mono",
  "emotion": "angry",
  "voice_id": "Deep_Voice_Man",
  "sample_rate": 32000,
  "language_boost": "English",
  "english_normalization": true
}
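An input like the one in Example 1 could be sent through the Replicate Python client. This is a sketch, not a definitive integration: the model reference "minimax/speech-02-turbo" is an assumption inferred from the voice-cloning URL in the parameter list, build_input and synthesize are hypothetical helper names, and the shape of the returned value depends on the installed client version.

```python
import os


def build_input(text: str, voice_id: str = "Deep_Voice_Man", **overrides) -> dict:
    """Build an input payload whose defaults mirror Example 1."""
    payload = {
        "text": text,
        "voice_id": voice_id,
        "pitch": 0,
        "speed": 1,
        "volume": 1,
        "bitrate": 128000,
        "channel": "mono",
        "emotion": "angry",
        "sample_rate": 32000,
        "language_boost": "English",
        "english_normalization": True,
    }
    payload.update(overrides)  # caller-supplied values win over the defaults
    return payload


def synthesize(text: str, model: str = "minimax/speech-02-turbo"):
    """Run the model via the Replicate client.

    Assumptions: the `replicate` package is installed, REPLICATE_API_TOKEN is
    set, and the model reference above is correct.
    """
    import replicate  # imported lazily so build_input works without the package

    return replicate.run(model, input=build_input(text))


if __name__ == "__main__" and os.environ.get("REPLICATE_API_TOKEN"):
    # Only attempts a network call when an API token is present.
    print(synthesize("Hello from Speech-02 Turbo."))
```

Keeping the payload builder separate from the network call makes it possible to inspect or test the input without spending tokens.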
Technical Specifications
- Hardware Type: not specified
- Run Count: 57.6k
- Commercial Use: Supported
- Pricing: 0.06 per thousand input tokens
- Platform: Replicate
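Since each character counts as one token and the listed price is 0.06 per thousand input tokens, the cost of a request can be estimated from the text length alone. The sketch below assumes linear pricing with no rounding or minimum charge; the listing does not state the currency, so the result is left unitless, and estimate_cost is a hypothetical helper name.

```python
PRICE_PER_1K_TOKENS = 0.06  # from the listing; the currency is not stated there


def estimate_cost(text: str) -> float:
    """Estimate request cost, treating each character as one token.

    Assumption: billing is linear in token count, with no rounding or minimum.
    """
    return len(text) / 1000 * PRICE_PER_1K_TOKENS


# A maximum-length request (5000 characters) comes to roughly 0.30.
cost = estimate_cost("a" * 5000)
```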
Related Models
Chatterbox TTS
Chatterbox is a state-of-the-art zero-shot TTS model.
NVIDIA PDF to Podcast
Transform PDFs into AI podcasts for engaging on-the-go audio content.
Piper Persian Text-to-Speech
A fast, local neural text to speech system that sounds great and is optimized for the Raspberry Pi 4. Piper is used in a variety of projects.