Dia 1.6B

Dia 1.6B is a model for generating realistic dialogue audio from text, with support for multiple speakers, non-verbal cues, and optional voice cloning.

Platform: Replicate

Dialogue Synthesis, Voice Cloning, Text-to-Speech, Non-Verbal Audio Generation

5.9k runs

L40S

Commercial

🚀 Function Overview

Generates realistic dialogue audio from text with speaker tags and non-verbal cues, optionally cloning voices from an audio prompt.

Key Features

Realistic multi-speaker dialogue generation
Non-verbal sound production (e.g., laughs, whispers)
Voice cloning from audio samples
Adjustable audio length, speed, and randomness
Seed-based reproducible outputs

Use Cases

• Audiobook/podcast dialogue generation
• Video game character voice creation
• E-learning/presentation voiceovers
• Voice prototype development
• Accessibility tools for conversations

⚙️ Input Parameters

text

string

Input text for dialogue generation. Use the tags [S1] and [S2] to mark different speakers, and put non-verbal cues in parentheses, e.g., (laughs), (whispers).
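
A script in this format can be assembled programmatically. The following is a minimal sketch in Python; the helper name `build_dialogue_text` and the (speaker, line) tuple layout are illustrative assumptions, not part of the model's API.

```python
def build_dialogue_text(turns):
    """Join (speaker_number, line) pairs into a tagged script.

    Hypothetical helper: speaker 1 becomes "[S1]", speaker 2 becomes "[S2]".
    Non-verbal cues such as "(laughs)" are written inline in the line text.
    """
    lines = []
    for speaker, line in turns:
        if speaker not in (1, 2):
            raise ValueError("this sketch only handles speakers 1 and 2")
        lines.append(f"[S{speaker}] {line.strip()}")
    # One turn per line, matching the newline-separated example input.
    return "\n".join(lines)


script = build_dialogue_text([
    (1, "Did you hear the news? (laughs)"),
    (2, "(whispers) Tell me everything."),
])
print(script)
```

The resulting string can be passed as the `text` input.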

audio_prompt

string

Optional audio file (.wav/.mp3/.flac) for voice cloning. The model will attempt to mimic this voice style.

max_new_tokens

integer

Controls the length of the generated audio; higher values produce longer audio (86 tokens ≈ 1 second of audio).
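
The token-to-duration ratio above makes it easy to estimate a suitable value. This is a small sketch of the arithmetic only; the function names are hypothetical, and the result is best read as an approximate upper bound, since the ratio is stated as approximate.

```python
TOKENS_PER_SECOND = 86  # approximate ratio stated in the parameter description


def tokens_to_seconds(max_new_tokens):
    """Approximate audio length in seconds for a given token budget."""
    return max_new_tokens / TOKENS_PER_SECOND


def seconds_to_tokens(seconds):
    """Approximate token budget needed for a target length in seconds."""
    return round(seconds * TOKENS_PER_SECOND)


print(f"{tokens_to_seconds(3072):.1f} s")  # budget used in Example 1
print(seconds_to_tokens(10))               # budget for roughly 10 seconds
```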

max_audio_prompt_seconds

integer

Maximum duration in seconds for the input voice cloning audio prompt. Only used when an audio prompt is provided. Longer voice samples will be truncated to this length.

cfg_scale

number

Controls how closely the audio follows your text. Higher values (3-5) follow text more strictly; lower values may sound more natural but deviate more.

temperature

number

Controls randomness in generation. Higher values (1.3-2.0) increase variety; lower values make output more consistent. Set to 0 for deterministic (greedy) generation.
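
For intuition, temperature in sampling-based generators is commonly applied by dividing the model's scores (logits) by the temperature before converting them to probabilities. The sketch below illustrates that general technique on made-up scores; it is an assumption that Dia follows the usual definition, and `apply_temperature` is a hypothetical name, not part of the model's API.

```python
import math


def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature == 0:
        # Greedy case: all probability mass goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # illustrative scores, not real model output
for t in (0, 0.7, 1.3, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
```

Higher temperatures spread probability more evenly across candidates, which is why output becomes more varied.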

top_p

number

Controls diversity of word choice. Higher values include more unusual options. Most users shouldn't need to adjust this parameter.
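
In the usual definition of top-p (nucleus) sampling, only the most likely candidates whose combined probability reaches `top_p` are kept. The sketch below shows that general idea on made-up probabilities; it assumes the standard definition applies here, and `top_p_filter` is a hypothetical helper.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise the surviving probabilities so they sum to 1.
    return {i: probs[i] / mass for i in kept}


probs = [0.5, 0.3, 0.15, 0.05]  # illustrative values, not real model output
print(sorted(top_p_filter(probs, 0.9)))  # indices kept with a high top_p
print(sorted(top_p_filter(probs, 0.6)))  # indices kept with a lower top_p
```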

cfg_filter_top_k

integer

Technical parameter for filtering audio generation tokens. Higher values allow more diverse sounds; lower values create more consistent audio.

speed_factor

number

Adjusts playback speed of the generated audio. Values below 1.0 slow down the audio; 1.0 is original speed.
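
As a rough guide, a uniform change in playback speed scales the duration by the inverse of the factor. The sketch below assumes that relationship holds for `speed_factor`; the function name is hypothetical.

```python
def adjusted_duration(seconds, speed_factor):
    """Estimate output length after a uniform playback-speed change."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return seconds / speed_factor


# A 30-second clip at the speed_factor used in Example 1.
print(f"{adjusted_duration(30.0, 0.94):.2f} s")
```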

seed

integer

Random seed for reproducible results. Use the same seed value to get the same output for identical inputs. Leave blank for random results each time.

💡 Usage Examples

Example 1

Input Parameters

{
  "text": "[S1] It's on Replicate!!! Oh fire! Oh my goodness! What's the procedure? What to we do people? The Dia text-to-speech model just dropped on Replicate!!\n[S2] Oh my god! Okay.. it's happening. Everybody stay calm!\n[S1] What's the procedure...\n[S2] Everybody stay fricking calm!!!... Everybody fudging calm down!!!!!\n[S1] Yes! Yes! Let's try it out at https://replicate.com/zsxkib/dia (laughs)\n[S2] (whispers) try it now (whispers)",
  "top_p": 0.95,
  "cfg_scale": 4,
  "temperature": 1.3,
  "speed_factor": 0.94,
  "max_new_tokens": 3072,
  "cfg_filter_top_k": 35
}

Output Results

https://replicate.delivery/xezq/RJ9IgzHZbxYwD9tgMTty7mWSxwSCHN3ParGeuvnWOl75v0SKA/output.wav
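
A request like the one above can also be sent from Python. The following is a hedged sketch using the `replicate` client package: the helper names `build_input` and `run_dia` are hypothetical, the default values are copied from Example 1, and the model reference `zsxkib/dia` is taken from the URL in the example text. Replicate may require a `:<version>` suffix on that reference, which is not given here.

```python
def build_input(text, audio_prompt=None, seed=None, **overrides):
    """Assemble an input payload using the parameter names listed above."""
    payload = {
        "text": text,
        "top_p": 0.95,
        "cfg_scale": 4,
        "temperature": 1.3,
        "speed_factor": 0.94,
        "max_new_tokens": 3072,
        "cfg_filter_top_k": 35,
    }
    payload.update(overrides)
    # Optional inputs are only included when the caller supplies them.
    if audio_prompt is not None:
        payload["audio_prompt"] = audio_prompt
    if seed is not None:
        payload["seed"] = seed
    return payload


def run_dia(payload, model_ref="zsxkib/dia"):
    # Imported lazily so build_input works without the client installed.
    import replicate  # pip install replicate; needs REPLICATE_API_TOKEN set

    # Assumption: model_ref may need a ":<version>" suffix for this model.
    return replicate.run(model_ref, input=payload)


payload = build_input("[S1] Hello there. [S2] (laughs) Hi!", seed=42)
print(payload)
# output = run_dia(payload)  # uncomment to call the hosted model
```

Passing the same `seed` with identical inputs is the documented way to reproduce an output.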

Technical Specifications

Hardware Type: L40S
Run Count: 5.9k
Commercial Use: Supported
Platform: Replicate

Related Keywords

Realistic Dialogue Audio Generation, Voice Cloning, Non-Verbal Cues, Multi-speaker Dialogue, Audiobook Generation, Video Game Voices, E-learning Voiceovers

Related Models

Minimax Speech-02-HD

A Text-to-Audio (T2A) model that offers voice synthesis, emotional expression, and multilingual capabilities. Optimized for high-fidelity applications such as voiceovers and audiobooks.

PrunaAI Dia 1.6B

A model for generating expressive voice audio from dialogue scripts.

Cog Orpheus 3B

Spanish and English Text to Speech model from Canopy Labs (3b-es_it-ft-research_release)