Dia 1.6B
Dia 1.6B is a text-to-speech model that generates realistic dialogue audio. This page summarizes what it does, the inputs it accepts, and an example request.
🚀 Function Overview
Generates realistic dialogue audio from text with speaker tags and non-verbal cues, optionally cloning voices from an audio prompt.
Key Features
- Realistic multi-speaker dialogue generation
- Non-verbal sound production (e.g., laughs, whispers)
- Voice cloning from audio samples
- Adjustable audio length, speed, and randomness
- Seed-based reproducible outputs
Use Cases
- Audiobook/podcast dialogue generation
- Video game character voice creation
- E-learning/presentation voiceovers
- Voice prototype development
- Accessibility tools for conversations
⚙️ Input Parameters
text
string: Input text for dialogue generation. Use [S1], [S2] to mark different speakers, and put non-verbal cues in parentheses, e.g., (laughs), (whispers).
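The tagging convention above can be assembled programmatically. The following is a minimal sketch, assuming one turn per line as in the example request on this page; `build_script` is a hypothetical helper for illustration, not part of the model's API.

```python
from typing import Iterable, Tuple


def build_script(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker_number, line) pairs into a tagged script, one turn per line.

    Hypothetical helper: the [S1]/[S2] tags and the parenthesised cues follow
    the convention described for the `text` parameter.
    """
    lines = []
    for speaker, line in turns:
        if speaker < 1:
            raise ValueError("speaker numbers start at 1")
        lines.append(f"[S{speaker}] {line.strip()}")
    return "\n".join(lines)


script = build_script([
    (1, "Did you hear the news? (laughs)"),
    (2, "(whispers) Not yet, tell me."),
])
print(script)
```

The resulting string can be passed as the `text` input.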
audio_prompt
string: Optional audio file (.wav/.mp3/.flac) for voice cloning. The model will attempt to mimic this voice style.
max_new_tokens
integer: Controls the length of generated audio. Higher values create longer audio (86 tokens ≈ 1 second of audio).
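The 86 tokens per second figure makes it possible to estimate a token budget for a target duration. This is a rough sketch built only on that approximate rate; the helper names are hypothetical, and the result is best read as an upper bound on length rather than an exact duration.

```python
import math

TOKENS_PER_SECOND = 86  # approximate rate quoted in the parameter description


def tokens_for_seconds(seconds: float) -> int:
    """Smallest token budget that covers the requested duration."""
    return math.ceil(seconds * TOKENS_PER_SECOND)


def seconds_for_tokens(tokens: int) -> float:
    """Approximate audio length allowed by a given token budget."""
    return tokens / TOKENS_PER_SECOND


print(tokens_for_seconds(10))              # 860
print(round(seconds_for_tokens(3072), 1))  # 35.7
```

By this estimate, the `max_new_tokens` value of 3072 used in the example request allows roughly 36 seconds of audio.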
max_audio_prompt_seconds
integer: Maximum duration in seconds for the input voice cloning audio prompt. Only used when an audio prompt is provided. Longer voice samples will be truncated to this length.
cfg_scale
number: Controls how closely the audio follows your text. Higher values (3-5) follow text more strictly; lower values may sound more natural but deviate more.
temperature
number: Controls randomness in generation. Higher values (1.3-2.0) increase variety; lower values make output more consistent. Set to 0 for deterministic (greedy) generation.
top_p
number: Nucleus sampling threshold that controls the diversity of sampled tokens. Higher values include more unusual options. Most users shouldn't need to adjust this parameter.
cfg_filter_top_k
integer: Technical parameter for filtering audio generation tokens. Higher values allow more diverse sounds; lower values create more consistent audio.
speed_factor
number: Adjusts playback speed of the generated audio. Values below 1.0 slow down the audio; 1.0 is original speed.
seed
integer: Random seed for reproducible results. Use the same seed value to get the same output for identical inputs. Leave blank for random results each time.
💡 Usage Examples
Example 1
Input Parameters
{ "text": "[S1] It's on Replicate!!! Oh fire! Oh my goodness! What's the procedure? What to we do people? The Dia text-to-speech model just dropped on Replicate!!\n[S2] Oh my god! Okay.. it's happening. Everybody stay calm!\n[S1] What's the procedure...\n[S2] Everybody stay fricking calm!!!... Everybody fudging calm down!!!!!\n[S1] Yes! Yes! Let's try it out at https://replicate.com/zsxkib/dia (laughs)\n[S2] (whispers) try it now (whispers)", "top_p": 0.95, "cfg_scale": 4, "temperature": 1.3, "speed_factor": 0.94, "max_new_tokens": 3072, "cfg_filter_top_k": 35 }
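A request like the one above could be assembled and checked in Python before sending. This is a minimal sketch under several assumptions: the `replicate` Python client is installed and `REPLICATE_API_TOKEN` is set; the model reference `zsxkib/dia` is taken from the URL in the example text, and a pinned `owner/name:version` reference may be required instead; `build_input` and `generate` are hypothetical helpers. The settings are the example's values, not necessarily the model's defaults.

```python
import json

# Assumption: reference derived from the URL in the example text.
MODEL_REF = "zsxkib/dia"

# Parameter names listed under Input Parameters on this page.
ALLOWED_KEYS = {
    "text", "audio_prompt", "max_new_tokens", "max_audio_prompt_seconds",
    "cfg_scale", "temperature", "top_p", "cfg_filter_top_k",
    "speed_factor", "seed",
}

# Settings copied from Example 1 (example values, not confirmed defaults).
EXAMPLE_SETTINGS = {
    "top_p": 0.95,
    "cfg_scale": 4,
    "temperature": 1.3,
    "speed_factor": 0.94,
    "max_new_tokens": 3072,
    "cfg_filter_top_k": 35,
}


def build_input(text: str, **overrides) -> dict:
    """Merge the example settings with overrides, rejecting unknown names."""
    unknown = set(overrides) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"text": text, **EXAMPLE_SETTINGS, **overrides}


def generate(payload: dict):
    """Send the request; imported lazily so the sketch loads without the client."""
    import replicate  # third-party package, assumed installed
    return replicate.run(MODEL_REF, input=payload)


payload = build_input("[S1] Hello there.\n[S2] (laughs) Hi!", seed=42)
print(json.dumps(payload, indent=2))
# output = generate(payload)  # uncomment to send the request
```

Passing a fixed `seed`, as here, should make repeated runs with identical inputs reproducible, per the `seed` parameter description.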
Technical Specifications
- Hardware Type: L40S
- Run Count: 5.9k
- Commercial Use: Supported
- Platform: Replicate
Related Models
Minimax Speech-02-HD
A text-to-audio (T2A) model that offers voice synthesis, emotional expression, and multilingual capabilities. Optimized for high-fidelity applications like voiceovers and audiobooks.
PrunaAI Dia 1.6B
A model for generating expressive voice audio from dialogue scripts.
Cog Orpheus 3B
Spanish and English Text to Speech model from Canopy Labs (3b-es_it-ft-research_release)