GetLLMs

Minimax Speech-02-HD

Minimax Speech-02-HD offers advanced text-to-audio capabilities with emotional expression and multilingual support.

Platform: Replicate
Text-to-Speech, Emotional Voice Synthesis, Multilingual TTS, Voice Cloning
95.6k runs
0.10 per thousand input tokens
Commercial

🚀Function Overview

A high-fidelity text-to-speech model that generates natural audio output with precise control over vocal characteristics like emotion, language, pitch, and speed for professional applications.

Key Features

  • 300+ pre-built voices across demographics
  • 10-second voice cloning with 99% vocal similarity
  • Multilingual support for 30+ languages with native accents
  • Precise emotion control (auto-detect or manual)
  • Adjustable pitch, speed, volume, and pause durations
  • Professional-quality audio output optimized for voiceovers

Use Cases

  • Audiobook narration
  • Podcast and video voiceovers
  • Multilingual IVR systems
  • Video game character voices
  • Accessibility tools for text-to-speech
  • Personalized voice cloning applications

⚙️Input Parameters

text

string

Text to convert to speech. Each character counts as 1 token, up to a maximum of 5000 characters. To insert a pause, place <#x#> between words, where x is the pause duration in seconds (0.01-99.99).

voice_id

string

Voice ID to use. Provide the ID of a voice you have trained, or one of the following system voice IDs: Wise_Woman, Friendly_Person, Inspirational_girl, Deep_Voice_Man, Calm_Woman, Casual_Guy, Lively_Girl, Patient_Man, Young_Knight, Determined_Man, Lovely_Girl, Decent_Boy, Imposing_Manner, Elegant_Man, Abbess, Sweet_Girl_2, Exuberant_Girl.

speed

number

Speech speed

volume

number

Speech volume

pitch

integer

Speech pitch

emotion

string

Speech emotion

english_normalization

boolean

Enable English text normalization for better number reading (slightly increases latency)

sample_rate

integer

Sample rate for the generated speech

bitrate

integer

Bitrate for the generated speech

channel

string

Number of audio channels, given as a string (the usage example passes "mono")

language_boost

string

Enhance recognition of specific languages and dialects
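
The text and voice_id rules above can be checked before a request is sent. The following is a minimal sketch, not part of the model's API: the helper names (pause_marker, join_with_pauses, is_system_voice) are hypothetical, and it assumes that pause markers count toward the 5000-character limit, that a marker may sit directly between two words without surrounding spaces, and that voice IDs are case-sensitive.

```python
MAX_CHARS = 5000  # limit stated in the `text` parameter description

# System voice IDs copied from the `voice_id` parameter description.
SYSTEM_VOICE_IDS = frozenset({
    "Wise_Woman", "Friendly_Person", "Inspirational_girl", "Deep_Voice_Man",
    "Calm_Woman", "Casual_Guy", "Lively_Girl", "Patient_Man", "Young_Knight",
    "Determined_Man", "Lovely_Girl", "Decent_Boy", "Imposing_Manner",
    "Elegant_Man", "Abbess", "Sweet_Girl_2", "Exuberant_Girl",
})


def pause_marker(seconds: float) -> str:
    """Return a pause marker such as <#0.5#> for a duration in seconds."""
    if not 0.01 <= seconds <= 99.99:
        raise ValueError("pause must be between 0.01 and 99.99 seconds")
    # Round to two decimals and drop trailing zeros (formatting is an assumption).
    return f"<#{round(seconds, 2):g}#>"


def join_with_pauses(parts: list[str], seconds: float) -> str:
    """Join text fragments with the same pause marker between each pair."""
    text = pause_marker(seconds).join(part.strip() for part in parts)
    # Assumption: the marker characters count toward the length limit.
    if len(text) > MAX_CHARS:
        raise ValueError(f"text is {len(text)} characters; maximum is {MAX_CHARS}")
    return text


def is_system_voice(voice_id: str) -> bool:
    """True if voice_id is one of the listed system voices (case-sensitive)."""
    return voice_id in SYSTEM_VOICE_IDS
```

A voice ID that fails is_system_voice is not necessarily invalid, since it may refer to a voice you have trained yourself.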

💡Usage Examples

Example 1

Input Parameters

{
  "text": "Speech-02-series is a Text-to-Audio and voice cloning technology that offers voice synthesis, emotional expression, and multilingual capabilities.\n\nThe HD version is optimized for high-fidelity applications like voiceovers and audiobooks. While the turbo one is designed for real-time applications with low latency.\n\nWhen using this model on Replicate, each character represents 1 token.",
  "pitch": 0,
  "speed": 1,
  "volume": 1,
  "bitrate": 128000,
  "channel": "mono",
  "emotion": "happy",
  "voice_id": "Friendly_Person",
  "sample_rate": 32000,
  "language_boost": "English",
  "english_normalization": true
}

Output Results

https://replicate.delivery/xezq/V5fclDfiEXq1GUvPTIC6zc4CWhYvZagKvPgkHlR9YldH3toUA/tmpdgbymb15.mp3
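
The same request could be sent from Python. This is a sketch under stated assumptions: it uses the third-party replicate client (installed separately, with REPLICATE_API_TOKEN set in the environment), and the model identifier "minimax/speech-02-hd" is an assumption that should be checked against the Replicate model page. build_input and synthesize are hypothetical helper names.

```python
def build_input(text: str, voice_id: str = "Friendly_Person",
                emotion: str = "happy") -> dict:
    """Build an input payload mirroring the parameters of Example 1."""
    return {
        "text": text,
        "pitch": 0,
        "speed": 1,
        "volume": 1,
        "bitrate": 128000,
        "channel": "mono",
        "emotion": emotion,
        "voice_id": voice_id,
        "sample_rate": 32000,
        "language_boost": "English",
        "english_normalization": True,
    }


def synthesize(text: str):
    """Run the model on Replicate and return the client's output object."""
    # Imported here so the payload builder works without the client installed.
    import replicate

    # Assumption: model identifier; confirm it on the Replicate model page.
    return replicate.run("minimax/speech-02-hd", input=build_input(text))
```

The shape of the value returned by replicate.run depends on the client version, so the sketch returns it unchanged instead of assuming a URL string.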


Technical Specifications

Hardware Type: (not listed)
Run Count: 95.6k
Commercial Use: Supported
Pricing: 0.10 per thousand input tokens
Platform: Replicate
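
Because every character of the input text counts as 1 token, the listed price translates directly into a per-request estimate. The sketch below is illustrative only: estimate_cost is a hypothetical helper, the listing does not state a currency, and it assumes Python's len() counts characters the same way the platform does.

```python
from decimal import Decimal

# Price from the listing; the currency is not stated there.
PRICE_PER_1K_TOKENS = Decimal("0.10")


def estimate_cost(text: str) -> Decimal:
    """Estimate the cost of one request, at 1 token per character."""
    tokens = len(text)  # assumption: len() matches the platform's character count
    return PRICE_PER_1K_TOKENS * tokens / Decimal(1000)
```

For example, a text at the 5000-character maximum comes to 0.10 × 5 = 0.50 in the listing's unit.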

Related Keywords

Text-to-Audio, Emotional Expression, Multilingual Capabilities, Voice Cloning, Audiobook Narration, Podcast Voiceovers, IVR Systems, Accessibility Tools