MisoTTS
MisoTTS is an 8B-parameter text-to-speech model from Miso Labs for expressive speech and dialogue generation. Miso Labs says the weights are open source on Hugging Face, the model can condition on text and optional audio context, and API access is coming soon.
MisoTTS merits its own page because it is a fresh voice-model release with primary-source technical detail, open weights, and early community attention around expressive speech generation. It also gives GetLLMs a current speech-model explainer without implying that the API and catalog fields are stable enough for a `/models` directory record.
Miso Labs announced MisoTTS on June 3, 2026, as an 8B-parameter model for emotive speech and dialogue generation. The official blog describes a 7.7B-parameter backbone plus a 300M-parameter decoder, residual vector quantization with 32 audio codebooks, text and optional audio-context conditioning, current half-duplex limitations, open-source weights on Hugging Face, and API access coming soon. The Hugging Face model card confirms the MisoLabs/MisoTTS repository, the text-to-speech task, the model summary, and the license field, and states that the model is not deployed by an Inference Provider. Hacker News is used only as evidence of freshness and demand.
- Track Miso Labs speech-model release facts and availability.
- Compare expressive TTS models that use text plus audio context.
- Understand RVQ-style audio tokenization in a reader-friendly way.
- Decide whether to wait for hosted API access before production evaluation.
The official release describes MisoTTS as an 8B-parameter speech model for expressive speech and dialogue generation. It generates from text and optional audio context, uses residual vector quantization, and is built around a 7.7B-parameter backbone with a 300M-parameter decoder.
- Task: text-to-speech and expressive speech generation.
- Architecture: a large temporal backbone plus a smaller depth decoder for audio codebooks.
- Availability: open weights on Hugging Face; API access is described as coming soon.
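The backbone-plus-decoder split can be pictured as a two-level loop. The sketch below is a toy illustration, not Miso Labs code: it assumes the common pattern in which the backbone advances once per audio frame and the depth decoder then fills in one index per codebook for that frame, which the release does not spell out. The functions `backbone_step` and `depth_decoder_step` are hypothetical stand-ins, and `CODEBOOK_SIZE` is an assumed value because the source gives no codebook size. Only the parameter counts and the 32 codebooks come from the release.

```python
import random

NUM_CODEBOOKS = 32       # from the release: 32 audio codebooks
CODEBOOK_SIZE = 1024     # assumption: entries per codebook are not stated in the source

BACKBONE_PARAMS = 7.7e9  # from the release: 7.7B-parameter backbone
DECODER_PARAMS = 0.3e9   # from the release: 300M-parameter decoder


def backbone_step(history, rng):
    """Hypothetical stand-in for the temporal backbone (one call per audio frame)."""
    # A real backbone would attend over text and earlier frames. This toy
    # version returns a pseudo-random number so the control flow can run.
    return rng.random() + 0.001 * len(history)


def depth_decoder_step(hidden, earlier_codes, rng):
    """Hypothetical stand-in for the depth decoder (one call per codebook)."""
    # A real decoder would condition on the hidden state and on the codes
    # already chosen for this frame. This toy version only picks a valid index.
    return rng.randrange(CODEBOOK_SIZE)


def generate_frames(num_frames, seed=0):
    """Return num_frames frames, each a list of NUM_CODEBOOKS codebook indices."""
    rng = random.Random(seed)
    frames = []
    for _ in range(num_frames):
        hidden = backbone_step(frames, rng)      # outer loop: time
        codes = []
        for _ in range(NUM_CODEBOOKS):           # inner loop: codebook depth
            codes.append(depth_decoder_step(hidden, codes, rng))
        frames.append(codes)
    return frames


def total_params_billions():
    """Backbone plus decoder, in billions of parameters."""
    return (BACKBONE_PARAMS + DECODER_PARAMS) / 1e9
```

Calling `generate_frames(3)` yields three frames of 32 indices each, and `total_params_billions()` shows how the 7.7B and 300M figures add up to the headline 8B.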
The Hugging Face card is useful for model identity, task, setup, and architecture, but it also says the model is not deployed by an Inference Provider. Until a stable public API, provider ID, pricing, limits, and hosted availability are verified, this page should remain an entity explainer rather than a structured `/models` catalog entry.
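That gating rule can be written down as a small checklist. The sketch below is illustrative only: the `HostedAvailability` fields and the two page-type labels are hypothetical names chosen for this example, not a GetLLMs schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostedAvailability:
    """Hypothetical checklist of facts to verify before a catalog entry."""
    stable_public_api: bool = False
    provider_id: Optional[str] = None
    pricing_verified: bool = False
    limits_verified: bool = False
    hosted_availability_verified: bool = False


def missing_requirements(status: HostedAvailability) -> List[str]:
    """List the checklist items that are still unverified."""
    checks = {
        "stable public API": status.stable_public_api,
        "provider ID": bool(status.provider_id),
        "pricing": status.pricing_verified,
        "limits": status.limits_verified,
        "hosted availability": status.hosted_availability_verified,
    }
    return [name for name, ok in checks.items() if not ok]


def page_type(status: HostedAvailability) -> str:
    """Stay an explainer until every item on the checklist is verified."""
    return "catalog-entry" if not missing_requirements(status) else "entity-explainer"
```

With the default (nothing verified), `page_type(HostedAvailability())` returns `"entity-explainer"`, which matches the current state of MisoTTS as described here.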
Miso Labs says the current model handles individual turns and half-duplex audio, but does not yet solve turn-taking or full-duplex conversation. Treat quality and expressiveness claims as vendor-provided until independent evaluations and provider listings mature.
MisoTTS FAQ
Page-level questions for MisoTTS.
What is MisoTTS?
MisoTTS is an 8B-parameter text-to-speech model from Miso Labs for expressive speech and dialogue generation. It can generate speech from text and optional audio context, with open weights available on Hugging Face.
Can I use MisoTTS through a hosted API?
Not yet. Miso Labs says API access is coming soon, and the Hugging Face model card says the model is not deployed by an Inference Provider, so GetLLMs does not list it as a stable catalog record. Production users should recheck hosted availability before planning around it.
Why does MisoTTS use residual vector quantization?
Miso Labs uses residual vector quantization to represent audio with multiple codebook indices instead of one very large flat vocabulary. The goal is to cover more speech variation while keeping the model architecture practical.
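For readers who want to see the idea in code, here is a minimal toy sketch of residual vector quantization in plain Python. It is not MisoTTS's tokenizer: the codebooks are random rather than learned, the sizes are made up, and every function name is hypothetical. Each level quantizes whatever error the previous levels left behind, so a vector ends up described by one small index per codebook.

```python
import random


def make_codebooks(num_codebooks, codebook_size, dim, seed=0):
    """Build random toy codebooks (real systems learn these from audio)."""
    rng = random.Random(seed)
    books = []
    for level in range(num_codebooks):
        scale = 0.5 ** level  # toy choice: later levels hold smaller corrections
        entries = [[0.0] * dim]  # toy choice: a zero entry, so a level can never add error
        for _ in range(codebook_size - 1):
            entries.append([rng.uniform(-1.0, 1.0) * scale for _ in range(dim)])
        books.append(entries)
    return books


def sq_error(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rvq_encode(vec, codebooks):
    """Return one index per codebook, each quantizing the remaining residual."""
    residual = list(vec)
    indices = []
    for book in codebooks:
        best = min(range(len(book)), key=lambda i: sq_error(book[i], residual))
        indices.append(best)
        residual = [r - e for r, e in zip(residual, book[best])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct a vector by summing the chosen entry from each codebook."""
    out = [0.0] * len(codebooks[0][0])
    for i, book in zip(indices, codebooks):
        out = [o + e for o, e in zip(out, book[i])]
    return out


def flat_vocabulary_size(num_codebooks, codebook_size):
    """Number of distinct index combinations the codebooks can express."""
    return codebook_size ** num_codebooks
```

A short usage example: `books = make_codebooks(4, 16, 4)` followed by `codes = rvq_encode([0.3, -0.7, 0.1, 0.9], books)` gives four indices, and `rvq_decode(codes, books)` returns the approximation. The vocabulary argument shows up in `flat_vocabulary_size`: with 32 codebooks and an assumed 1,024 entries each (the source does not state the codebook size), the model stores only 32 × 1,024 = 32,768 entries, yet the index combinations number 1,024 to the 32nd power, far more than any single flat vocabulary could hold.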