Voice Synthesis Models

Find 14 advanced text-to-speech and voice generation models.

Chatterbox TTS

thomcle/chatterbox-tts

Experience Chatterbox TTS for state-of-the-art zero-shot voice cloning and expressive speech synthesis. Perfect for AI agents and video voiceovers.

Zero-shot TTS, Voice Cloning, Emotion Control, +1 more

672

NVIDIA PDF to Podcast

nvidia/pdf-to-podcast

Transform your PDFs into engaging AI podcasts with NVIDIA PDF to Podcast. Convert documents to audio, customize voices, and set durations for rich, on-the-go content.

PDF to Podcast, Document-to-Audio, AI Content Conversion, +1 more

239

CPU

Piper Persian Text-to-Speech

mosnfar/piper_persian

Get Piper Persian Text-to-Speech for efficient, local speech synthesis. Optimized for Raspberry Pi, it converts Persian text to audio, ideal for voice assistants and accessibility.

Persian TTS, Text-to-Speech, Raspberry Pi Optimization, +1 more

CPU

Minimax Voice Cloning

minimax/voice-cloning

Commercial

Minimax Voice Cloning allows you to create custom AI voices from short audio samples, perfect for personalized TTS applications and generating unique audio content.

Voice Cloning, Audio Training, TTS Integration

4.1k

3 per output

Minimax Speech-02 Turbo

minimax/speech-02-turbo

Commercial

Experience Minimax Speech-02 Turbo for real-time voice synthesis and voice cloning. Unlock emotional expression and multilingual support for dynamic audio applications.

Text-to-Speech, Voice Cloning, Multilingual TTS, +1 more

57.6k

0.06 per thousand input tokens

Minimax Speech-02-HD

minimax/speech-02-hd

Commercial

Discover Minimax Speech-02-HD: advanced text-to-audio with emotional expression and multilingual support for high-fidelity voiceovers and audiobooks.

Text-to-Speech, Emotional Voice Synthesis, Multilingual TTS, +1 more

95.6k

0.10 per thousand input tokens

Kimi Audio 7B Instruct

zsxkib/kimi-audio-7b-instruct

Commercial

Explore Kimi Audio 7B Instruct for universal audio processing, speech transcription, and emotion recognition. Unlock advanced audio AI capabilities now.

Universal Audio Processing, Speech Transcription, Emotion Recognition, +1 more

216

NVIDIA L40S

PrunaAI Dia 1.6B

prunaai/dia-1.6b

PrunaAI Dia 1.6B revolutionizes expressive voice generation with multi-speaker dialogue and non-verbal cues. Create dynamic, natural-sounding audio for diverse applications.

Text-to-Speech, Dialogue Generation, Audio Synthesis

1.7k

A100 (80GB)

Dia 1.6B

zsxkib/dia

Commercial

Unlock realistic dialogue audio with Dia 1.6B. Generate multi-speaker conversations, non-verbal cues, and even clone voices for your projects.

Dialogue Synthesis, Voice Cloning, Text-to-Speech, +1 more

5.9k

L40S

DeepAudio-V1 Model

acappemin/deepaudio-v1

DeepAudio-V1 Model excels in Video-to-Speech and Video-to-Audio generation, transforming video inputs into synchronized speech and audio. Experience seamless multimedia content creation.

Video-to-Speech, Video-to-Audio, End-to-End Generation

L40S

Cog Orpheus 3B

gianpaj/cog-orpheus-3b-0.1-ft

Experience Cog Orpheus 3B, a multilingual text-to-speech model offering voice cloning and emotional speech. Perfect for real-time applications and expressive audio.

Multilingual TTS, Voice Cloning, Emotion Control

L40S

F5-TTS Vietnamese Text-to-Speech

tuannha/f5-tts-vi

F5-TTS Vietnamese converts text to speech with reference-audio voice cloning and adjustable speed. See required inputs, sample settings, and practical use cases before testing.

Vietnamese TTS, Zero-Shot Voice Cloning, Speech Synthesis

Spark TTS

jichengdu/spark-tts

Spark TTS offers advanced text-to-speech generation with voice cloning for personalized audio and custom voice creation with adjustable parameters. Explore its features!

Text-to-Speech Generation, Voice Cloning, Speech Parameter Control

209

L40S

MOSS-TTS v1.5

OpenMOSS-Team/MOSS-TTS-v1.5

Commercial

Review MOSS-TTS v1.5 model details: OpenMOSS model ID, voice cloning, long-form TTS, multilingual synthesis, Pinyin/IPA control, pause control, and local runtime caveats.

Text-to-Speech, Voice Cloning, Long-Form Speech, +1 more

Hugging Face / Transformers / local GPU or compatible OpenMOSS inference backends

Open weights; no first-party hosted token price was verified in the collected sources. Runtime cost depends on the local or self-hosted backend.

Related Categories

Image Generation

Discover 282 AI models for content creation and text generation.

Image Editing

Explore 39 professional AI models for photo editing, enhancement, and manipulation.

Language Models

Browse 27 large language models for conversation and reasoning.

Video Generation

Access 22 AI models for video creation and editing.
