Minimax Speech-02 Turbo

Minimax Speech-02 Turbo offers real-time voice synthesis and voice cloning.

Platform: Replicate

Text-to-Speech, Voice Cloning, Multilingual TTS, Real-Time Audio

57.6k runs

0.06 per thousand input tokens

Commercial

🚀Function Overview

A real-time text-to-audio model with voice cloning, emotional expression controls, and multilingual support.

Key Features

•Voice cloning with 99% similarity using custom or pre-built voices
•Emotion control via auto-detection or manual settings
•Support for 30+ languages including regional accents
•Granular audio controls (pitch, speed, volume)
•Low-latency processing for real-time applications
•English text normalization for improved number reading

Use Cases

•Real-time voice synthesis for interactive applications
•Audiobook and podcast narration
•Multilingual customer service chatbots
•Personalized voice cloning solutions
•Emotion-aware voice interfaces
•Accessibility tools for text-to-speech conversion

⚙️Input Parameters

text

string

Text to convert to speech. Each character counts as 1 token, up to a maximum of 5000 characters. Insert <#x#> between words to add a pause, where x is the pause duration in seconds (0.01-99.99).
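A minimal sketch of preparing the text value before a request: building pause markers and checking the character limit. The helper names (pause, join_with_pauses, check_text) are hypothetical and not part of any official client. The sketch assumes a two-decimal duration is accepted inside the marker, that a space on each side of the marker is harmless, and that marker characters count toward the 5000-character limit; none of these is confirmed by the listing.

```python
MAX_CHARS = 5000                     # limit stated for the `text` parameter
MIN_PAUSE, MAX_PAUSE = 0.01, 99.99   # allowed pause range, in seconds


def pause(seconds: float) -> str:
    """Return a pause marker such as <#0.50#> for the given duration."""
    if not MIN_PAUSE <= seconds <= MAX_PAUSE:
        raise ValueError(
            f"pause must be between {MIN_PAUSE} and {MAX_PAUSE} seconds"
        )
    # Two decimals mirrors the granularity of the documented range (assumption).
    return f"<#{seconds:.2f}#>"


def join_with_pauses(parts: list[str], seconds: float) -> str:
    """Join text fragments with the same pause marker between each pair."""
    # Surrounding spaces are an assumption; the listing only says "between words".
    return f" {pause(seconds)} ".join(parts)


def check_text(text: str) -> int:
    """Validate the length and return the token count (1 token per character)."""
    if len(text) > MAX_CHARS:
        raise ValueError(f"text has {len(text)} characters; maximum is {MAX_CHARS}")
    return len(text)
```

For instance, join_with_pauses(["Hello", "world"], 1) produces a string with a one-second pause marker between the two words, which can then be passed through check_text before sending.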

voice_id

string

Desired voice ID. Use a voice ID you have trained (https://replicate.com/minimax/voice-cloning), or one of the following system voice IDs: Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman, Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl, Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl

speed

number

Speech speed

volume

number

Speech volume

pitch

integer

Speech pitch

emotion

string

Speech emotion

english_normalization

boolean

Enable English text normalization for better number reading (slightly increases latency)

sample_rate

integer

Sample rate for the generated speech

bitrate

integer

Bitrate for the generated speech

channel

string

Number of audio channels (for example, "mono")

language_boost

string

Enhance recognition of specific languages and dialects

💡Usage Examples

Example 1

Input Parameters

{
  "text": "Speech-02-series is a Text-to-Audio and voice cloning technology that offers voice synthesis, emotional expression, and multilingual capabilities.\n\nThe HD version is optimized for high-fidelity applications like voiceovers and audiobooks. While the turbo one is designed for real-time applications with low latency.\n\nWhen using this model on Replicate, each character represents 1 token.",
  "pitch": 0,
  "speed": 1,
  "volume": 1,
  "bitrate": 128000,
  "channel": "mono",
  "emotion": "angry",
  "voice_id": "Deep_Voice_Man",
  "sample_rate": 32000,
  "language_boost": "English",
  "english_normalization": true
}

Output Results

https://replicate.delivery/xezq/SnPxXgl26yaAApm29BJpcHRl5PyxHAxpDt97TP59rPiFeWUKA/tmp517d49p_.mp3
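The example above could be submitted with the Replicate Python client roughly as follows. This is a sketch under stated assumptions: the model slug "minimax/speech-02-turbo" is inferred from the listing and should be confirmed on Replicate, build_input and synthesize are hypothetical helper names, and the default values are simply copied from Example 1 rather than taken from the model's documented defaults. The call requires the third-party replicate package and a REPLICATE_API_TOKEN environment variable.

```python
MODEL = "minimax/speech-02-turbo"  # assumed slug; confirm on Replicate


def build_input(text: str, voice_id: str = "Deep_Voice_Man", **overrides) -> dict:
    """Assemble the input payload, starting from the values used in Example 1."""
    payload = {
        "text": text,
        "voice_id": voice_id,
        "pitch": 0,
        "speed": 1,
        "volume": 1,
        "bitrate": 128000,
        "channel": "mono",
        "sample_rate": 32000,
        "language_boost": "English",
        "english_normalization": True,
    }
    payload.update(overrides)  # e.g. emotion="angry"
    return payload


def synthesize(text: str, **overrides):
    """Run the model and return whatever the client yields for the audio file."""
    import replicate  # third-party: pip install replicate

    return replicate.run(MODEL, input=build_input(text, **overrides))


# Usage (needs network access and REPLICATE_API_TOKEN):
# output = synthesize("Each character represents 1 token.", emotion="angry")
# print(output)
```

Keeping payload construction separate from the network call makes it possible to inspect or log the parameters without spending tokens. The exact type of the returned object depends on the client version, so treat it as an opaque handle or URL to the generated audio.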

Technical Specifications

Hardware Type
Run Count: 57.6k
Commercial Use: Supported
Pricing: 0.06 per thousand input tokens
Platform: Replicate
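Because each character counts as one token, the cost of a request can be estimated from the text length alone. The sketch below assumes the listed rate of 0.06 per thousand input tokens applies linearly with no minimum charge; the listing does not state a currency, so the result is in the same unspecified unit. The function name estimate_cost is hypothetical.

```python
from decimal import Decimal

PRICE_PER_1K_TOKENS = Decimal("0.06")  # from the listing; currency not stated


def estimate_cost(text: str) -> Decimal:
    """Estimate the price of one request, assuming 1 token per character."""
    tokens = len(text)
    return PRICE_PER_1K_TOKENS * tokens / 1000
```

Under these assumptions, a request at the 5000-character maximum would cost 0.30.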

Related Keywords

Real-time voice synthesis, Voice cloning, Emotional expression, Multilingual support, Low-latency processing, Audiobook and podcast narration, Customer service chatbots, Personalized voice solutions

Related Models

Chatterbox TTS

Chatterbox is a state-of-the-art zero-shot TTS model.

NVIDIA PDF to Podcast

Transform PDFs into AI podcasts for engaging on-the-go audio content.

Piper Persian Text-to-Speech

A fast, local neural text to speech system that sounds great and is optimized for the Raspberry Pi 4. Piper is used in a variety of projects.